(no title)
mynameisjody|3 months ago
Then it goes on, "After a couple of vague commands (“build it out more, make it better”) I got a 14 page paper." What I hear is "I got 14 pages of words." But is it a good paper, one that another PhD would consider good? Is it even coherent?
When I see the code these systems generate within a complex system, I think: okay, that's kinda close, but this is wrong, and this is a security problem, and so on. But since I don't have a PhD in these subjects, am I supposed to think, "Well, of course the 14 pages on a topic I'm not an expert in are good"?
It just doesn't add up... The things I understand look good at first but aren't shippable, yet the things I don't understand must be great?
stavros|3 months ago
My current preference is Codex 5.1 (Sonnet 4.5 as a close second, though it got really dumb today for "some reason"). It's been good enough that I've shipped multiple projects with it without a problem (e.g. https://pine.town, which I made without writing any code myself).
apwell23|3 months ago
how many prompts did it take you to make this?
how did you make sure that each new prompt didn't break some previous functionality?
did you have a precise vision for it when you started or did you just go with whatever was being given to you?
Lerc|3 months ago
You could trust the expert analysis of people in that field. You might hit personal ideologies or outliers, but asking several people seems to surface a degree of consensus.
You could also try a variety of tasks that do something complex but produce results that are easy to test.
When I started trying chatbots for coding, one of my test prompts was
That was about the level at which some models would succeed and some would fail. Recently I found
It produced a nice demo with a slider for the parameters. After a few refinements (a hierarchical scaling version), I got it to produce the same interface as a module I had written myself, and it worked as a drop-in replacement. These things are fairly easy to check, because if it is performant and visually correct then it's about good enough to go.
It's also worth noting that as they attempt more and more ambitious tasks, they are quite probably testing around the limit of capability. There is both marketing and science in this area. When they say it can do X, that might not mean it can do it every time, only that it has done it at least once.
taurath|3 months ago
That's the problem: the experts all promise things that can't easily be replicated. The promises the experts make don't match the model. The same request might succeed or might fail, and it might fail in such a way that subsequent prompts may or may not recover from it.
adamors|3 months ago
Couple that with the tendency to please the user at all costs, and it ends up lying to you, but you won't ever realise unless you double-check.
JumpCrisscross|3 months ago
Why aren't foundational model companies training separate enterprise and consumer models from the get go?
apendleton|3 months ago
> The idea was good, as were many elements of the execution, but there were also problems: some of its statistical methods needed more work, some of its approaches were not optimal, some of its theorizing went too far given the evidence, and so on. Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.
jrumbut|3 months ago
Now the tightrope is a whole application or a 14-page paper, and the short pieces of code and prose are professional quality more often than not. That's some serious progress.
brightball|3 months ago
Definitely planning to use it more at work. The integrations across Google Workspace are excellent.
seidleroni|3 months ago
"So is this a PhD-level intelligence? In some ways, yes, if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student. The idea was good, as were many elements of the execution, but there were also problems..."
lm28469|3 months ago
The sane conclusion would be to invest in education, not to dump hundreds of billions into LLMs, but ok.
PostOnce|3 months ago
What use is an LLM in an illiterate society?
leeoniya|3 months ago
https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect
eckesicle|3 months ago
It’s like the Gell-Mann amnesia effect applied to AI. :)
https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect
ammbauer|3 months ago
Isn't the point of doing a master's thesis that you do the math and research yourself, so that you learn and understand the math and research?
pojzon|3 months ago
Without knowing how to use this "PROBABILISTIC" slot machine to get better results, you are only wasting the energy those GPUs need to run and answer questions.
The majority of people use LLMs incorrectly.
The majority of people selling LLMs as a panacea for everything are lying.
But we need hype or the bubble will burst, taking the whole market with it, so shush me.
Glemkloksdjf|3 months ago
[deleted]