I've done experiments, and basically what I found is that LLMs are extremely sensitive to... language. Well, duh, but let me explain a bit. They will give answers of different quality/accuracy depending on the system prompt order, language use, length, how detailed the examples are, etc. Basically every variable you can think of either improves the output or degrades it. And it makes sense once you really grok that LLMs "reason and think" in tokens. They have no internal world representation; tokens are the raw layer on which they operate. For example, if you ask a bilingual human what their favorite color is, the answer will be that color regardless of the language used to ask the question. For an LLM, the answer might change depending on the language used, because the response is conditioned by the statistical distribution of tokens in the training data.

Anyway, I don't want to make a long post here. The good news is that once you have found the best way to ask questions of your model, you can consistently get accurate responses; the trick is to find the best way to communicate with that particular LLM. That's why I am hard at work on an auto-calibration system that runs through a barrage of candidate system prompts and other hyperparameters to find the best ones for a specific LLM. The process can be fully automated; you just need to set it all up.
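The core of such a calibration loop is simple enough to sketch. Everything here is a hypothetical stand-in, not a real API: `ask` is whatever client call you use, and the scorer is a toy exact-match check.

```python
import itertools

def score(llm_answer: str, expected: str) -> float:
    """Toy exact-match scorer; swap in whatever metric you trust."""
    return 1.0 if llm_answer.strip().lower() == expected.strip().lower() else 0.0

def calibrate(ask, system_prompts, temperatures, eval_set):
    """Grid-search (system prompt, temperature) combos against a labelled
    eval set, returning the best combo and its accuracy.

    `ask(system_prompt, temperature, question)` is a stand-in for the
    actual model call; it must return the model's answer as a string.
    """
    best_combo, best_acc = None, -1.0
    for sp, temp in itertools.product(system_prompts, temperatures):
        acc = sum(score(ask(sp, temp, q), a) for q, a in eval_set) / len(eval_set)
        if acc > best_acc:
            best_combo, best_acc = (sp, temp), acc
    return best_combo, best_acc
```

In practice you'd also want to hold out part of the eval set, since a prompt tuned on a handful of questions can overfit to exactly the sensitivity being described here.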
I somewhat agree, but I think that the language example is not a good one. As Anthropic have demonstrated[0], LLMs do have "conceptual neurons" that generalise an abstract concept which can later be translated to other languages.
The issue is that those concepts are encoded in intermediate layers during training, absorbing biases present in training data. It may produce a world model good enough to know that "green" and "verde" are different names for the same thing, but not robust enough to discard ordering bias or wording bias. Humans suffer from that too, albeit arguably less.
I found an absolutely fascinating analysis on precisely this topic by an AI researcher who's also a writer: https://archive.ph/jgam4
LLMs can generate convincing editorial letters that give a real sense of having deeply read the work. The problem is that they're extremely sensitive, as you've noticed, to prompting as well as order bias. Present it with two nearly identical versions of the same text, and it will usually choose based on order. And social proof type biases to which we'd hope for machines to be immune can actually trigger 40+ point swings on a 100-point scale.
If you don't mind technical details and occasional swagger, his work is really interesting.
This doesn't match Anthropic's research on the subject:
> Claude sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal “language of thought.” We show this by translating simple sentences into multiple languages and tracing the overlap in how Claude processes them.
One can see this very easily in image generation models.

The "Elephant" it generates is a lot different from the "Haathi" (Hindi/Urdu). The same goes for other concepts that have a 1-to-1 translation but produce different results.
> For example if you ask a bilingual human what their favorite color is, the answer will be that color regardless of what language they used to answer that question.
It's a very interesting question. Has someone measured it? Bonus points for using a concealed setup so the subjects don't realize you care about colors.

Anyway, I don't expect anything interesting with colors, but it may be interesting with food (desserts in particular, I'd guess).

Imagine you live in England; one of your parents is from France and you go there every year to visit your grandparents, and your other parent is from Germany and you go there every year to visit your grandparents. What is your favorite dessert? I'd guess that when you're speaking one language you're more connected to the memories of the holidays and grandparents there, and you may choose differently.
Doesn't this assume one truth or one final answer to all questions? What if there are many?
What if asking one way means you are likely to have your search satisfied by one truth, but asking another way means you are best served by another wisdom?
EDIT: and the structure of language/thought can't know solid truth from more ambiguous wisdom. The same linguistic structures must encode and traverse both. So there will be false positives and false negatives, I suppose? I dunno, I'm shooting from the hip here :)
I thought embeddings were the internal representation? Do reasoning and thinking get expanded back out into tokens and fed back in as the next prompt, or does the model internally churn on chains of embeddings?
Humans aren't good at validation either. We need tools, experiments, labs. Unproven ideas are a dime a dozen. Remember the hoopla about room temperature superconductivity? The real source of validation is external consequences.
I think there's more nuance, and the way I read the article is more "beware of these shortcomings" than "aren't good". LLM-based evaluation can be good. Several models have by now been trained with previous-gen models used for filtering data and validating RLHF data (pairwise or even more advanced). Llama 3 is a good example of this.
My take from this article is that there are plenty of gotchas along the way, and you need to be very careful in how you structure your data, and how you test your pipelines, and how you make sure your tests are keeping up with new models. But, like it or not, LLM based evaluation is here to stay. So explorations into this space are good, IMO.
> Positional preferences, order effects, prompt sensitivity undermine AI judgments
If you read between the lines, that says there's no actual "judgement" going on. If there were a strong logical underpinning to the output, minor differences in input, like the phrasing (but not the factual content) of a prompt, wouldn't make the quality of the output unpredictable.
You could say the same about human "judgement" then.
Humans display biases very similar to that of LLMs. This is not a coincidence. LLMs are trained on human-generated data. They attempt to replicate human reasoning - bias and all.
There are decisions where "strong logical underpinning" is strong enough to completely drown out the bias and the noise. And then there are decisions that aren't nearly as clear-cut - allowing the bias and the noise to creep into the outcomes. This is true for human and LLM "judgement" both.
Yes and no. People also exhibit these biases, but because degree matters, and because we have no other choice, we still trust them most of the time. That's to say, bias isn't always completely invalidating. I wrote a slightly satirical piece, "People are just as bad as my LLMs", here: https://wilsoniumite.com/2025/03/10/people-are-just-as-bad-a...
Some other known distributional biases include self-preference bias (e.g. GPT-4o prefers GPT-4o generations over Claude generations) and structured-output/JSON-mode bias [1]. Interestingly, some models also have a more positive or negative skew than others. This library [2] provides some methods for calibrating/stabilizing them.
LLMs make impressive graders-of-convenience, but their judgments swing wildly with prompt phrasing and option order. Treat them like noisy crowd-raters: randomize inputs, ensemble outputs, and keep a human in the loop whenever single-digit accuracy points matter.
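That "randomize inputs, ensemble outputs" treatment can be sketched for a pairwise judge. Here `judge` is a hypothetical stand-in that returns "first" or "second"; the wrapper randomizes presentation order and majority-votes, so pure position bias washes out as noise.

```python
import random
from collections import Counter

def debiased_pairwise_judge(judge, text_a, text_b, n_trials=10, seed=0):
    """Run a pairwise judge with randomized presentation order and
    majority-vote the results, mapping positions back to A/B labels."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_trials):
        if rng.random() < 0.5:
            # A shown first
            winner = judge(text_a, text_b)
            votes["A" if winner == "first" else "B"] += 1
        else:
            # B shown first
            winner = judge(text_b, text_a)
            votes["A" if winner == "second" else "B"] += 1
    return votes.most_common(1)[0][0]
```

A judge that always picks whatever is shown first will split its votes roughly 50/50 here, which is itself a useful signal that it has no real preference.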
"and keep a human in the loop whenever single-digit accuracy points matter"
So are we supposed to give up on accuracy now? At least with humans (assuming good actors) I can assume an effort toward accuracy and correctness, and I can build trust based on some resume and on former interactions.

With LLMs, this is more like a coin flip with each prompt. And since the models are updated constantly, it's hard to build that sort of resume. In the meantime, people, especially casual users, might just trust outputs because it's convenient. A single-digit error is harder to find. The cost of validating outputs increases as LLMs get more accurate, and casual users tend to skip validation because "it's a computer and computers are correct".

I fear an overall decrease in quality wherever LLMs are included, with any productivity gains eaten by that.
I see "panels of judges" mentioned once, but what is the weakness of that approach, other than requiring more resources?

Worst case, you end up with some multimodal distribution in which two opinions are tied, which seems increasingly unlikely as the panel size grows. It could maybe happen in a case with exactly two outcomes (yes/no), but I'd be surprised if such a panel landed on a perfectly uniform distribution in its judgments/opinions (50% yes, 50% no).
One method to get a better estimate is to extract the token log-probabilities of "YES" and "NO" from the final logits of the LLM and take a weighted sum [1] [2]. If the LLM is calibrated for your task, a genuinely borderline sample should have roughly a ~50% chance of sampling YES (1) and ~50% chance of NO (0), yielding a score of 0.5.
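Concretely, that weighted sum looks something like this. It's a sketch assuming an API that exposes log-probabilities for the candidate final tokens; the function name and input shape are illustrative.

```python
import math

def yes_probability(token_logprobs: dict[str, float]) -> float:
    """Renormalize over just {"YES", "NO"} and return the weighted sum
    1 * P(YES) + 0 * P(NO), i.e. a soft pass score in [0, 1]."""
    p_yes = math.exp(token_logprobs["YES"])
    p_no = math.exp(token_logprobs["NO"])
    return p_yes / (p_yes + p_no)
```

Renormalizing matters because the raw probability mass may be spread over other tokens ("Yes", "yes", whitespace variants), which you'd want to fold into the two canonical labels before taking the sum.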
But generally you wouldn't use a binary outcome when samples can genuinely be 50/50 pass/fail. Better to use a discrete scale of 1..3 or 1..5 and specify exactly what makes a sample a 2/5 vs. a 4/5, for example.
You are correct to question the weaknesses of a panel. This class of methods depends on diversity through high-temperature sampling, which can lead to spurious YES/NO responses that don't generalize well and are effectively noise.
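For the discrete-scale variant, the same log-probability trick generalizes to an expected score. This is a sketch assuming the rubric scores are emitted as single digit tokens ("1".."5"); names are illustrative.

```python
import math

def expected_scale_score(score_logprobs: dict[str, float]) -> float:
    """Expected value of a discrete rubric score (e.g. 1..5),
    renormalized over the digit tokens present in `score_logprobs`."""
    weights = {int(tok): math.exp(lp) for tok, lp in score_logprobs.items()}
    total = sum(weights.values())
    return sum(score * w for score, w in weights.items()) / total
```

This gives a continuous estimate from a single forward pass, rather than relying on high-temperature sampling across a panel.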
Good, but really none of this should be surprising, given that an LLM is a giant text statistic that generates text based on that statistic. Quirks of the statistic will show up as quirks of the output.
When you think about it like that, it doesn't really make sense to assume they have some magical correctness properties. In some sense, they don't classify, they imitate what classification looks like.
> In some sense, they don't classify, they imitate what classification looks like.
I thought I've seen it all when people decided to consider AI a marketing term and started going off about how current mainstream AI products aren't """"real AI"""", but this is next level.
LLMs are good at discovery, since they know a lot and can retrieve that knowledge from a query where simpler (e.g. regex-based) search engines with the same knowledge couldn't. For example, an LLM given a case may discover an obscure law, or notice a pattern in past court cases that establishes precedent. So they can be helpful to a real judge.

Of course, the judge must check that the law or precedent isn't hallucinated and that it applies to the case in the way the LLM claims. They should also prompt other LLMs and use their own knowledge in case the cited law/precedent contradicts others.
There's a similar argument for scientists, mathematicians, doctors, investors, and other fields. LLMs are good at discovery but must be checked.
I would add that "hallucinations" aren't even the only failure mode an LLM can have: it can partially or completely miss what it's supposed to find in the discovery process and lead you to believe there just isn't anything worth pursuing down that particular avenue.
It’s a statistical database of corpuses, not a logic engine.
Stop treating LLMs like they are capable of logic, reasoning or judgement. They are not, they never will be.
The extent to which they can recall and remix human words to replicate the intent behind those words is an incredible facsimile of thought. It's nothing short of a mathematical masterpiece. But it's not intelligence.
If it were communicating its results in any less human an interface than a conversational one, I truly feel that most people would not be so easily fooled into believing it was capable of logic.
This doesn’t mean that a facsimile of logic like this has no use. Of course it does, we have seen many uses - some incredible, some dangerous and some pointless - but it is important to always know that there is no thought happening behind the facade. Only a replication of statistically similar communication of thought that may or may not actually apply to your prompt.
Also related: in my observations with tool calling, the order of your arguments or fields can actually have a positive or negative effect on performance. You really have to be very careful when constructing your contexts. It doesn't help that all these frameworks and protocols hide these things from you.
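One way to surface this hidden variable is to benchmark each field ordering explicitly. A sketch (the schema shape follows JSON-Schema-style tool definitions; the harness itself is hypothetical):

```python
import itertools
import json

def field_order_variants(tool_schema: dict) -> list[str]:
    """Serialize every permutation of a tool schema's top-level
    properties, so each ordering can be benchmarked as a separate
    context variant. Frameworks often re-serialize and hide this."""
    props = tool_schema["properties"]
    variants = []
    for perm in itertools.permutations(props):
        reordered = {**tool_schema, "properties": {k: props[k] for k in perm}}
        variants.append(json.dumps(reordered))
    return variants
```

Even with a handful of fields the permutation count grows fast (n!), so in practice you'd sample orderings rather than enumerate them all.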
Incidentally DeepSeek will give very interesting results if you ask it for a tutorial on prompt engineering - be sure to ask it how to effectively use 'attention anchors' to create 'well-structured prompts', and why rambling disorganized prompts are usually, but not always, detrimental, depending on whether you want 'associative leaps' or not.
P.S. I find this intro very useful:
> "Task: evaluate the structure of the following prompt in terms of attention anchors and likelihood of it generating a well-structured response. Do not actually reply to the prompt, all we need is an analysis of the structure. Prompt begins:"
I listen to online debates, especially political ones on various platforms, and man, the AI slop that people slap around at each other is beyond horrendous. I would not want an LLM having the final say on something critical. I want the opposite: an LLM should identify things that need follow-up review by a qualified person. A person should still confirm the things that "pass", but they can then prioritize what to validate first.
Comments like this break the site guidelines, and not just a little. Can you please review https://news.ycombinator.com/newsguidelines.html and take the intended spirit of this site more to heart? Note these:
"Please don't fulminate."
"Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative."
"Please don't sneer, including at the rest of the community."
"When disagreeing, please reply to the argument instead of calling names. 'That is idiotic; 1 + 1 is 2, not 3' can be shortened to '1 + 1 is 2, not 3.'"
There's plenty of LLM skepticism on HN and that's fine, but like all comments here, it needs to be thoughtful.
[0] https://transformer-circuits.pub/2025/attribution-graphs/bio...
https://www.anthropic.com/research/tracing-thoughts-language...
Think about it, if they were good at evaluation, you could remove all humans in the loop and have recursively self improving AGI.
Nice to see an article that makes a more concrete case.
[1]: https://verdict.haizelabs.com/docs/cookbook/distributional-b... [2]: https://github.com/haizelabs/verdict
[1]: https://arxiv.org/abs/2303.16634 [2]: https://verdict.haizelabs.com/docs/concept/extractor/#token-...
I prefer to call it “prompt guessing”, it's like some modern variant of alchemy.
https://www.cip.org/funding-partnerships