(no title)
20k|1 month ago
Without exception, every technical question I've ever asked an LLM that I know the answer to has been substantially wrong in some fashion. This makes it just... absolutely useless for research. In some cases I've spotted it straight-up plagiarising from the original sources, with random capitalisation giving it away
The issue is that once you get even slightly into a niche, they fall apart because the training data just doesn't exist. But they don't say "sorry, there's insufficient training data to give you an answer"; they just make shit up and state it with total confidence
simonw|1 month ago
I've been tracking advances in AI assisted search here - https://simonwillison.net/tags/ai-assisted-search/ - in particular:
- https://simonwillison.net/2025/Apr/21/ai-assisted-search/ - April is when they started getting good, with o3 and the various deep research tools
- https://simonwillison.net/2025/Sep/6/research-goblin/ - GPT-5 got excellent. This post includes several detailed examples, including "Starbucks in the UK don’t sell cake pops! Do a deep investigative dive".
- https://simonwillison.net/2025/Sep/7/ai-mode/ - AI mode from Google
locknitpicker|1 month ago
I disagree. You might have seen some improvements in the results, but all LLMs still hallucinate quite hard on simple queries where you prompt them to cite their sources. You'll see ChatGPT insist that the source of its assertions is a 404 link that it claims is working.
20k|1 month ago
I asked ChatGPT's thinking mode if the ADM formalism is strictly equivalent to general relativity, and it made several strongly incorrect statements
This is my favourite:
>3. Boundary terms matter
>To be fully equivalent:
>One must add the correct Gibbons–Hawking–York boundary term
>And handle asymptotic conditions carefully (e.g. ADM energy)
>Otherwise, the variational principle is not well-defined.
Which is borderline gibberish
>The theory still has 2 propagating DOF per spacetime point
This is pretty good too
>(lapse and shift act as Lagrange multipliers, not dynamical fields).
This is also, as far as I'm aware, just wrong, as the gauge conditions are nonphysical. In practice, lapse and shift are generally treated as dynamical fields
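(For readers who haven't met the terminology: the standard ADM 3+1 split of the line element is the following. This is a textbook sketch added for context, not part of ChatGPT's output.)

```latex
% ADM 3+1 decomposition of the spacetime metric:
% N is the lapse, N^i the shift, \gamma_{ij} the spatial 3-metric.
ds^2 = -N^2\,dt^2 + \gamma_{ij}\,\bigl(dx^i + N^i\,dt\bigr)\bigl(dx^j + N^j\,dt\bigr)
```

The action contains no time derivatives of N or N^i, which is the sense in which they enter the canonical analysis as multipliers; in an actual numerical evolution, though, they are typically evolved alongside the metric via gauge conditions, which is the sense in which they are treated as dynamical fields.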
Its full answer reads like someone with minimal understanding of physics trying to bullshit you. Then I asked it if the BSSN formalism is strictly equivalent to the ADM formalism (it isn't, because it isn't covariant)
This answer is actually more wrong, surprisingly
>Yes — classically, the BSSN formalism is equivalent to ADM, but only under specific conditions. In practice, it is a reparameterization plus gauge fixing and constraint handling, not a new theory. The equivalence is more delicate than ADM ↔ GR.
The ONE thing that doesn't change in the BSSN formalism is the gauge conditions
>Rewriting the evolution equations, adding terms proportional to constraints.
This is also pretty inadequate
>Precise equivalence statement
>BSSN is strictly equivalent to ADM at the classical level if:
...
>Gauge choices are compatible
>(e.g. lapse and shift not over-constraining the system)
This is complete gibberish
It also states:
>No extra degrees of freedom are introduced
I don't think chatgpt knows what a degree of freedom is
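(For the record, the standard counting that gives GR its two degrees of freedom per spatial point goes like this; again a context note, not ChatGPT's output.)

```latex
% Canonical counting: 6 components of \gamma_{ij} plus 6 conjugate
% momenta \pi^{ij} give a 12-dimensional phase space per point.
% Each of the 4 first-class constraints (H, H_i) removes 2 dimensions:
\tfrac{1}{2}\bigl(12 - 2 \times 4\bigr) = 2
```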
>Why the equivalence is more subtle than ADM ↔ GR
>1. BSSN is not a canonical transformation
>Unlike ADM ↔ GR:
>BSSN is not manifestly Hamiltonian
>The Poisson structure is not preserved automatically
>One must reconstruct ADM variables to see equivalence
This is all absolute bollocks. "Manifestly Hamiltonian" is literally gibberish. Neither of these formalisms has a "Poisson structure", whatever that means, and sure, yes, you can reconstruct the ADM variables from the BSSN variables, whoopee
>When equivalence can fail
>Discretized (numerical) system -> Equivalence only approximate
Nobody explain to ChatGPT that the ADM formalism is also a discretisable system of PDEs!
>BSSN and ADM describe the same classical solutions of Einstein’s equations, but BSSN reshapes the phase space and constraint handling to make the evolution well-behaved, sacrificing manifest Hamiltonian structure off-shell.
We're starting to hit timecube levels of nonsense
It also gets the original question completely wrong: the BSSN formalism isn't covariant or coordinate-free. There's an alternative BSSN-like formalism called cBSSN (covariant BSSN), which is similar to CCZ4 and Z4cc (both covariant). Covariance is an important property that the regular BSSN formalism lacks, and it's one of the ways you can identify on mathematical grounds that BSSN is not strictly equivalent to the ADM formalism. In the ADM formalism you can express your equations in polar coordinates, but if you make that transformation in the BSSN formalism, it's no longer the same theory
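(One standard way to see the non-covariance, sketched here as context: BSSN conformally decomposes the ADM 3-metric with a unit-determinant condition.)

```latex
% BSSN conformal split of the ADM 3-metric:
\gamma_{ij} = e^{4\phi}\,\tilde\gamma_{ij}, \qquad \det\tilde\gamma_{ij} = 1
% The unit-determinant condition is not tensorial: the determinant
% depends on the coordinate chart, so transforming to e.g. polar
% coordinates changes the split between \phi and \tilde\gamma_{ij}.
```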
This has actually gotten significantly worse since the last time I asked ChatGPT about this kind of thing; it's more confidently incorrect now
pxc|1 month ago
The other problem that I tend to hit is a tradeoff between wrongness and slowness. The fastest variants of the SOTA models are so frequently and so severely wrong that I don't find them useful for search. But the bigger, slower ones that spend more time "thinking" take so long to yield their (admittedly better) results that it's often faster for me to just do some web searching myself.
They tend to be more useful the first time I'm approaching a subject, or before I've familiarized myself with the documentation of some API or language or whatever. After I've taken some time to orient myself (even by just following the links they've given me a few times), it becomes faster for me to just search by myself.
sandworm101|1 month ago
I googled for "helium 3" yesterday. Google's AI answer said that helium 3 is "primarily sourced from the moon", as if we were actively mining it there already.
elzbardico|1 month ago
Instead of "how cheese X is usually made", try "search the web and give me a summary of the ways cheese X is made"
yunohn|1 month ago
The entire situation of web search for LLMs is a mess. None of the existing providers return good or usable results, and Google refuses to provide general access to theirs. As a result, all LLMs (except maybe Gemini) are severely gimped until someone solves this.
I seriously believe that the only real new breakthrough for LLM research can be achieved by a clean, trustworthy, comprehensive search index. Maybe someone will build that? Otherwise we’re stuck with subpar results indefinitely.
josecodea|1 month ago
It's funny for me to read this. They don't exhibit "confidence". You're just getting the most accurate text the model can produce. Of course the training data doesn't contain "I don't know" as answers to questions; that would be really bad training data! If you're getting "attitude", it's because you're triggering some kind of dialogue-esque data with your prompts (or the system prompt is doing that).
Expecting the LLM to say "sorry I don't know" would be like expecting google search to return "we found some pages but deemed them wrong, so we won't show you any".
samuell|1 month ago
I have been impressed by its results.
I think this stems more from the initial search phase than from raw LLM processing power, but to me the approach seems to work really well.