top | item 47007613


btown | 16 days ago

It bears repeating that modern LLMs are incredibly capable, and relentless, at solving problems that have a verification test suite. It seems like this problem did (at least for some finite subset of n)!

This result, by itself, does not generalize to open-ended problems, though, whether in business or in research in general. Discovering the specification to build is often the majority of the battle. LLMs aren't bad at this, per se, but they're nowhere near as reliably groundbreaking as they are on verifiable problems.


utopiah | 15 days ago

> modern LLMs are incredibly capable, and relentless, at solving problems that have a verification test suite.

Feels like it's a bit of what I tried to express a few weeks ago (https://news.ycombinator.com/item?id=46791642), namely that we are just pouring computational resources into verifiable problems and then claiming, astonishingly, that it sometimes works. Sure, LLMs have a slight bias, in that they rely on statistics, so it's not purely brute force, but the approach is pretty much the same: throw stuff at the wall, see what sticks; once something finally does, report it as grandiose and claim it's "intelligent".

virgildotcodes | 15 days ago

> throw stuff at the wall, see what sticks; once something finally does, report it as grandiose and claim it's "intelligent".

What do we think humans are doing? I think it's not unfair to say our minds are constantly trying to assemble the pieces available to them in various ways, whether we're actively thinking about a problem or it's happening in the background as we go about our day.

Every once in a while the pieces fit together in an interesting way and it feels like inspiration.

The techniques we've learned likely influence the strategies we attempt, but beyond all this, what else could there be but brute force when it comes to "novel" insights?

If it’s just a matter of following a predefined formula, it’s not intelligence.

If it’s a matter of assembling these formulas and strategies in an interesting way, again what else do we have but brute force?

piombisallow | 15 days ago

That's also what most grad students are doing. Even in the unlikely case they completely stop improving, it's still a massive deal.

QuercusMax | 16 days ago

Yes, this is where I just cannot imagine completely AI-driven software development of anything novel and complicated without extensive human input. I'm currently working in a space where none of our data models are particularly complex, but the trick is all in defining the rules for how things should work.

Our actual software implementation is usually pretty simple; often, writing up the design spec takes significantly longer than building the software, because the software isn't the hard part: the requirements are. I suspect the folks who are terrible at describing their problems are going to need help from experts who sit somewhere between SWE, product manager, and interaction designer.

D-Machine | 15 days ago

Even more generally than verification: just being tied to a loss function that represents something we actually care about, e.g. compiler and test errors, Lean verification in Aristotle, basic physics energy configurations in AlphaFold, or win conditions in RL, as in AlphaGo.

RLHF is an attempt to push LLMs pre-trained with a dopey reconstruction loss toward something we actually care about: imagine if we could find a pre-training criterion that actually cared about truth and/or plausibility in the first place!