top | item 46039383


mynameisjody | 3 months ago

Every time I see an article like this, it's always missing the key question: but is it any good, is it correct? They always show you the part that is impressive: "it walked the tricky tightrope of figuring out what might be an interesting topic and how to execute it with the data it had - one of the hardest things to teach."

Then it goes on, "After a couple of vague commands (“build it out more, make it better”) I got a 14 page paper." I hear..."I got 14 pages of words". But is it a good paper, that another PhD would think is good? Is it even coherent?

When I see the code these systems generate within a complex system, I think okay, well that's kinda close, but this is wrong and this is a security problem, etc etc. But because I'm not a PhD in these subjects, am I supposed to think, "Well of course the 14 pages on a topic I'm not an expert in are good"?

It just doesn't add up... Things I understand, it looks good at first, but isn't shippable. Things I don't understand must be great?


stavros|3 months ago

It's gotten more and more shippable, especially with the latest generation (Codex 5.1, Sonnet 4.5, now Opus 4.5). My metric is "wtfs per line", and it's been decreasing rapidly.

My current preference is Codex 5.1 (Sonnet 4.5 a close second, though it got really dumb today for "some reason"). It's been good to the point where I've shipped multiple projects with it without a problem (e.g. https://pine.town, which I made without writing any code).

yread|3 months ago

I feel it sometimes tries to be overly correct, like using BigInts when working with offsets in big files in JavaScript. My files are big, but not 53-bits-of-mantissa big, and no file APIs work with BigInts. This was from Gemini 3 Thinking, btw.

apwell23|3 months ago

> https://pine.town

how many prompts did it take you to make this?

how did you make sure that each new prompt didn't break some previous functionality?

did you have a precise vision for it when you started or did you just go with whatever was being given to you?

Madmallard|3 months ago

It's not really any different in my experience

tempestn|3 months ago

Have you tried Gemini 3 yet? I haven't done any coding with it, but on other tasks I've been impressed compared to GPT-5 and Sonnet 4.5.

gtirloni|3 months ago

Maybe the wtfs per line are decreasing because these models aren't saying anything interesting or original.

Lerc|3 months ago

I guess you have a couple of options.

You could trust the expert analysis of people in that field. You may hit personal ideologies or outliers, but asking several people tends to surface a degree of consensus.

You could try varied tasks that do something complex but produce results that are easy to test.

When I started trying chatbots for coding, one of my test prompts was

    Create a JavaScript function edgeDetect(image) that takes an ImageData object and returns a new ImageData object with all direction Sobel edge detection.  
That was about the level where some models would succeed and some would fail.
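For reference, here is a minimal sketch of the kind of answer that prompt tests for: grayscale conversion, the two 3x3 Sobel kernels, and a combined gradient magnitude. The plain-object return is an assumption so the sketch runs outside the DOM; a browser answer would return `new ImageData(out, w, h)`.

```javascript
// Sobel edge detection over an ImageData-like {width, height, data}
// object, where data is RGBA in a Uint8ClampedArray.
function edgeDetect(image) {
  const { width: w, height: h, data } = image;

  // Convert RGBA to grayscale luma, one float per pixel.
  const gray = new Float32Array(w * h);
  for (let i = 0; i < w * h; i++) {
    gray[i] = 0.299 * data[i * 4] + 0.587 * data[i * 4 + 1] + 0.114 * data[i * 4 + 2];
  }

  // Clamp-to-edge sampler so the kernels work at the borders.
  const at = (x, y) =>
    gray[Math.min(h - 1, Math.max(0, y)) * w + Math.min(w - 1, Math.max(0, x))];

  const out = new Uint8ClampedArray(w * h * 4);
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      // Horizontal and vertical 3x3 Sobel kernels.
      const gx = -at(x - 1, y - 1) - 2 * at(x - 1, y) - at(x - 1, y + 1)
               +  at(x + 1, y - 1) + 2 * at(x + 1, y) + at(x + 1, y + 1);
      const gy = -at(x - 1, y - 1) - 2 * at(x, y - 1) - at(x + 1, y - 1)
               +  at(x - 1, y + 1) + 2 * at(x, y + 1) + at(x + 1, y + 1);
      // Gradient magnitude, clamped to the 0..255 byte range.
      const mag = Math.min(255, Math.hypot(gx, gy));
      const o = (y * w + x) * 4;
      out[o] = out[o + 1] = out[o + 2] = mag;
      out[o + 3] = 255;
    }
  }
  // In a browser: return new ImageData(out, w, h);
  return { width: w, height: h, data: out };
}
```

The nice property of a prompt like this is exactly the one described above: you don't need to read the code closely, because feeding it a flat image should give black and feeding it a hard edge should light up the boundary.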

Recently I found

    Can you create a webgl glow blur shader that takes a 2d canvas as a texture and renders it onscreen with webgl boosting the brightness so that #ffffff is extremely bright white and glowing,
It produced a nice demo with sliders for the parameters. After a few refinements (a hierarchical-scaling version), I got it to produce the same interface as a module I had written myself, and it worked as a drop-in replacement.

These things are fairly easy to check because if it is performant and visually correct then it's about good enough to go.

It's also worth noting that as they attempt more and more ambitious tasks, they are quite probably testing around the limit of capability. There is both marketing and science in this area. When they say they can do X, it might not mean it can do it every time, but it has done it at least once.

taurath|3 months ago

> You could trust the expert analysis of people in that field

That’s the problem: the experts all promise things that can’t be easily replicated, and what they promise doesn’t match the model’s behavior. The same request might succeed or might fail, and might fail in such a way that subsequent prompts may or may not recover.

adamors|3 months ago

> Things I don't understand must be great?

Couple it with the tendency to please the user by all means, and it ends up lying to you, but you won’t ever realise unless you double-check.

JumpCrisscross|3 months ago

> Couple it with the tendency to please the user by all means

Why aren't foundational model companies training separate enterprise and consumer models from the get go?

apendleton|3 months ago

I think they get to that a couple of paragraphs later:

> The idea was good, as were many elements of the execution, but there were also problems: some of its statistical methods needed more work, some of its approaches were not optimal, some of its theorizing went too far given the evidence, and so on. Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.

jrumbut|3 months ago

Well, that's why people still have jobs. But I appreciate the point of the post: the neat demo used to be a coherent paragraph or a silly poem. The silly poems were all kind of similar and not very funny, and the paragraphs were a good start, but I wouldn't use them for anything important.

Now the tightrope is a whole application or a 14-page paper, and the short pieces of code and prose are professional quality more often than not. That's some serious progress.

monooso|3 months ago

The author goes into the strengths and weaknesses of the paper later in the article.

brightball|3 months ago

I keep trying out different models. Gemini 3 is pretty good. It’s not quite as good at one-shotting answers as Grok, but overall it’s very solid.

Definitely planning to use it more at work. The integrations across Google Workspace are excellent.

seidleroni|3 months ago

The author actually discusses the results of the paper. He's not some rando but a Wharton professor, and when he compares the results to a grad student's, it is with some authority.

"So is this a PhD-level intelligence? In some ways, yes, if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student. The idea was good, as were many elements of the execution, but there were also problems..."

Herring|3 months ago

I think the point is we’re getting there. These models are growing up real fast. Remember 54% of US adults read at or below the equivalent of a sixth-grade level.

lm28469|3 months ago

> Remember 54% of US adults read at or below the equivalent of a sixth-grade level.

The sane conclusion would be to invest in education, not to dump hundreds of billions into LLMs, but ok.

PostOnce|3 months ago

A question for the not-too-distant future:

What use is an LLM in an illiterate society?

visarga|3 months ago

You don't use it that way. You use it to help you build and run experiments, to discuss your findings, and in the end to write up your discoveries. You provide the content, and actual experiments provide the signal.

ManlyBread|3 months ago

Like clockwork. Each time someone criticizes any aspect of any LLM there's always someone to tell that person they're using the LLM wrong. Perhaps it's time to stop blaming the user?

secondbreakfast|3 months ago

Loads of AI chatter is the Murray Gell-Mann Amnesia Effect on steroids.

tsss|3 months ago

For what it's worth, I have been using Gemini 2.5/3 extensively for my master's thesis and it has been a tremendous help. It's done a lot of math for me that I couldn't have done on my own (without days of research), suggested many good approaches to problems that weren't on my mind, and helped me explore ideas quickly. When I ask it to generate entire chapters, they're never up to my standard, but that's mostly an issue of style. It seems to me that LLMs are good when you don't know exactly what you want or don't care too much about the details. Asking one to generate a presentation is an utter crapshoot, even if you merely ask for bullet points without formatting.

ammbauer|3 months ago

> It's done a lot of math for me that I couldn't have done on my own (without days of research),

Isn't the point of doing the master's thesis that you do the math and research, so that you learn and understand the math and research?

pojzon|3 months ago

Truth is, you still need a human to review all of it, fix it where needed, guide it when it hallucinates, and write correct instructions and prompts.

Without knowing how to use this probabilistic slot machine to get better results, you are only wasting the energy those GPUs need to run and answer questions.

The majority of people use LLMs incorrectly.

The majority of people selling LLMs as a panacea for everything are lying.

But we need hype or the bubble will burst, taking the whole market with it, so shush me.