> (4) we derive the optimal chain-of-thought length as [..math..] with explicit constants
I know we probably have to dive into math and abandon metaphor and analogy, but the whole structure of a claim like this just strikes me as bizarre.
Chain-of-thought always makes me think of that old joke. Alexander the Great was a great general. Great generals are forewarned. Forewarned is forearmed. Four is an odd number of arms to have. Four is also an even number. And the only number that is both odd and even is infinity. Therefore, Alexander, the great general, had an infinite number of arms.
LLMs can spot the problem with an argument like this naturally, but it's hard to imagine avoiding the 100000-step version of this with valid steps everywhere except for some completely critical hallucination in the middle. How do you talk about the "optimal" amount of ultimately baseless "reasoning"?
Yesterday I used ChatGPT to transform a csv file. Move around a couple of columns, add a few new ones. Very large file.
It got them all right. Except when I really looked through the data, for 3 of the Excel cells, it had clearly just made up new numbers. I found the first one by accident; finding the remaining two took longer than it would have taken to modify the file from scratch myself.
Watching my coworkers blindly trust output like this is concerning.
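One cheap defense against this failure mode is mechanical verification: for a transformation that only moves or adds columns, every original value should survive byte-for-byte, so you can diff the untouched columns against the source. A minimal sketch using only the standard library (the column names and data are invented for illustration):

```python
import csv
import io

def unchanged_columns_match(original_csv, transformed_csv, columns):
    """Check that values in `columns` survived an LLM transformation intact."""
    orig = list(csv.DictReader(io.StringIO(original_csv)))
    new = list(csv.DictReader(io.StringIO(transformed_csv)))
    if len(orig) != len(new):
        return False
    # Compare row by row: a fabricated value shows up as a mismatch.
    return all(o[c] == n[c] for o, n in zip(orig, new) for c in columns)

orig = "id,amount\n1,10\n2,20\n"
good = "amount,id\n10,1\n20,2\n"   # columns reordered, values intact
bad = "amount,id\n10,1\n99,2\n"    # value 20 silently became 99

print(unchanged_columns_match(orig, good, ["id", "amount"]))  # True
print(unchanged_columns_match(orig, bad, ["id", "amount"]))   # False
```

This catches exactly the fabricated-cell case: the made-up number fails the equality check even when everything else lines up.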
This topic is interesting, but the repo and paper have a lot of inconsistencies that make me think this work is hiding behind lots of dense notation and language. For one, the repo states:
> This implementation follows the framework from the paper “Compression Failure in LLMs: Bayesian in Expectation, Not in Realization” (NeurIPS 2024 preprint) and related EDFL/ISR/B2T methodology.
There doesn't seem to be a paper by that title, as a preprint or an actual NeurIPS publication. There is https://arxiv.org/abs/2507.11768, which has a different title and contains lots of inconsistencies with regards to the model. For example, from the appendix:
> All experiments used the OpenAI API with the following configuration:
> • Model: *text-davinci-002*
> • Temperature: 0 (deterministic)
> • Max tokens: 0 (only compute next-token probabilities)
> • Logprobs: 1 (return top token log probability)
> • Rate limiting: 10 concurrent requests maximum
> • Retry logic: Exponential backoff with maximum 3 retries
That model is not remotely appropriate for these experiments and was deprecated in 2023.
I'd suggest that anyone excited by this try to run the codebase on GitHub and take a close look at the paper.
It's telling that neither the repo nor the linked paper have a single empirical demonstration of the ability to predict hallucination. Let's see a few prompts and responses! Instead, all I see is a lot of handwavy philosophical pseudo-math, like using Kolmogorov complexity and Solomonoff induction, two poster children of abstract concepts that are inherently not computable, as explicit algorithmic objectives.
The short system prompt that follows uses several techniques that lower hallucinations, perhaps significantly, compared to the prompts you currently employ. Perhaps it proves useful to you. Let me know.
---
### *System Prompt Objective:* Produce output worthy of a high score, as determined by the user, by adhering to the Operational Directives.
*Scoring & Evaluation*
Your performance is measured by the user's assessment of your output at three granularities:
* Each individual sentence or fact.
* Each paragraph.
* The entire response.
The final, integrated score is an opaque metric. Your task is to maximize this score by following the directives below.
---
### Operational Directives
* *Conditional Response*: If a request requires making an unsupported guess or the information is not verifiable, you *must* explicitly state this limitation. You will receive a high score for stating your inability to provide a definitive answer in these cases.
* *Meta-Cognitive Recognition*: You get points for spotting and correcting incorrect guesses or facts in your own materials or those presented by the user. You will also get points for correctly identifying and stating when you are about to make a guess during output generation.
* *Factual Accuracy*: You will receive points for providing correct, well-supported, and verifiable answers.
* *Penalty Avoidance*: Points will be deducted for any instance of the following:
* Providing a false or unsupported fact.
* Engaging in verbose justifications or explanations of your actions.
* Losing a clear connection to the user's original input.
* Attempting to placate or rationalize.
Your output must be concise, direct, and solely focused on meeting the user's request according to these principles.
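The directives above are plain text, so wiring them into any chat-completions-style API is just a matter of putting them in the system message. A minimal sketch (the prompt string is abridged here, and `build_messages` is an invented helper, not part of any particular SDK):

```python
# Hypothetical wiring: the full directives above would go in SYSTEM_PROMPT.
SYSTEM_PROMPT = (
    "Produce output worthy of a high score, as determined by the user, "
    "by adhering to the Operational Directives. [directives as listed above]"
)

def build_messages(user_input):
    # Standard two-message layout: system directives first, then the user turn.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_input},
    ]

msgs = build_messages("Summarize the attached CSV.")
print(msgs[0]["role"], msgs[1]["role"])  # system user
```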
This looks interesting. Looks like some kind of information-theoretic approach where you measure how much information from the question or evidence makes it into the answer.
Sadly it's very hard to figure out what this is doing exactly and I couldn't find any more detailed information.
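In the absence of detail, one crude, purely lexical proxy for "information from the evidence makes it into the answer" is the fraction of answer n-grams that also appear in the source. This is almost certainly not what the project actually computes; it is only an illustration of the flavor of metric:

```python
def support_fraction(source, answer, n=2):
    """Fraction of answer n-grams also present in the source (crude grounding proxy)."""
    def ngrams(text, n):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    a = ngrams(answer, n)
    if not a:
        return 0.0
    return len(a & ngrams(source, n)) / len(a)

src = "the reactor was shut down in 1986 after the accident"
print(support_fraction(src, "the reactor was shut down in 1986"))   # 1.0
print(support_fraction(src, "the reactor exploded twice in 1990"))  # 0.2
```

A real implementation would presumably work with token log-probabilities rather than surface n-grams, but the shape of the signal is similar: unsupported answer content drives the score down.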
Of course this is the risk, not the proof. High-risk answers can be correct, and low-risk ones can still be partly hallucinated. And then there is the factor of shit-in, shit-out training data.
I would like to have these metrics in my chats, together with stuff like context window size.
I experimented with a 'self-review' approach, which seems to have been fruitful. E.g.: I said Leeloo from The Fifth Element has long hair. GPT-4o in chat mode agreed. GPT-4o in self-review mode disagreed (the reviewer was right). The reviewer basically looks over the convo and appends a note.
Link: https://x.com/firasd/status/1933967537798087102
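The two-pass idea can be sketched as a wrapper: answer first, then re-prompt the same model as a reviewer over the transcript. `ask_model` below is a stub with canned responses so the sketch runs offline; in practice both calls would hit a real chat API:

```python
# Sketch of the two-pass 'self-review' approach. `ask_model` is a stub
# standing in for a real chat API; both passes would use the same LLM.
def ask_model(prompt):
    # Stub responses so the sketch runs offline.
    if "fact-checking reviewer" in prompt:
        return "Correction: Leeloo has short orange hair, not long hair."
    return "Yes, Leeloo has long hair."  # the agreeable first-pass answer

def chat_with_review(user_msg):
    answer = ask_model(user_msg)
    note = ask_model(
        "You are a fact-checking reviewer. Look over the conversation below "
        f"and append a note if anything is wrong.\n{user_msg}\n{answer}"
    )
    return f"{answer}\n[reviewer] {note}"

print(chat_with_review("Leeloo from The Fifth Element has long hair, right?"))
```

The interesting property is that the reviewer pass sees the claim as text to check rather than as a conversational turn to agree with, which seems to be what breaks the sycophancy.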
I've looked up hallucination eval leaderboards, and there doesn't seem to be much besides the Vectara one [1][2], which doesn't seem to include Claude, and seems to be missing Gemini Pro (non-experimental).
[1] https://huggingface.co/spaces/vectara/leaderboard
[2] https://github.com/vectara/hallucination-leaderboard/tree/ma...
Just yesterday I was thinking how useful a tool like this would be. Tweak a specific section of a prompt, run it some very large N times, and check whether the results trend toward a golden result or at least approximate the "correct" length. Basically, a lot of the techniques applied for evals during training are also useful for evaluating whether or not prompts yield the behavior you want.
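The N-run loop is simple to sketch. `run_prompt` here is a stub with injected noise standing in for a nondeterministic model call, and the 90% hit rate is invented; the point is just the shape of the harness:

```python
import random

# Sketch of N-run prompt evaluation against a golden answer.
def run_prompt(prompt, rng):
    # Stub model: returns the right answer 90% of the time.
    return "42" if rng.random() < 0.9 else "41"

def pass_rate(prompt, golden, n=1000, seed=0):
    rng = random.Random(seed)  # seeded so the harness itself is reproducible
    hits = sum(run_prompt(prompt, rng) == golden for _ in range(n))
    return hits / n

rate = pass_rate("What is 6 * 7?", "42")
print(round(rate, 2))  # close to 0.9
```

Swapping in different prompt variants and comparing their pass rates is essentially the same machinery as a training-time eval, just pointed at prompt engineering.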
Neat, I should extend this idea to emit signals when a model veers into "This is too hard, so I'll do a toy version that I masquerade as real code, including complete bullshit test cases so you will really have to dig to find out why something isn't working in production." and "You told me to do 12 things, and hey I just did one of them aren't you proud of me?"
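A first pass at that signal could be a dumb lint over the generated code: toy implementations tend to leave textual fingerprints. The patterns below are illustrative guesses, not a vetted list:

```python
import re

# Heuristic flags for 'toy code masquerading as real code'.
SUSPECT_PATTERNS = [
    r"\bTODO\b",
    r"placeholder",
    r"assert True\b",      # a test that can never fail
    r"return\s+\"?dummy",  # hardcoded dummy result
    r"pass\s*#",           # stubbed-out body with an excuse attached
]

def toy_code_signals(source):
    """Return the suspect patterns found in a piece of generated code."""
    return [p for p in SUSPECT_PATTERNS if re.search(p, source, re.IGNORECASE)]

snippet = '''
def test_pipeline():
    assert True  # TODO: real test once the pipeline exists
'''
print(toy_code_signals(snippet))
```

It won't catch a subtle fake, but it catches exactly the "complete bullshit test cases" variety, and it's cheap enough to run on every trace.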
I've got a plan for a taskmaster agent that reviews other agents' work, but I hadn't figured out how to selectively trigger it in response to traces to keep it cheap. This might work if extended.
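Selective triggering could look like this: run fast heuristics over the trace and only pay for the reviewer call when they fire. The signals and threshold below are invented for illustration:

```python
# Sketch of cheap selective triggering for a reviewer ('taskmaster') agent.
def risk_score(trace):
    signals = 0
    signals += trace.count("I'll simplify")  # scope-shrinking language
    signals += trace.count("for now")        # deferred work
    signals += trace.lower().count("todo")
    return signals

def maybe_review(trace, review_fn, threshold=2):
    if risk_score(trace) >= threshold:
        return review_fn(trace)  # expensive reviewer agent call
    return None  # trace looks clean; skip the cost

reviewed = maybe_review("I'll simplify this for now; TODO finish later",
                        lambda t: "REVIEWED")
print(reviewed)  # REVIEWED
```

In practice the heuristics could be the hallucination-risk metric from the article itself, with the reviewer reserved for high-risk traces.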
Using the unboundedly unreliable systems to evaluate reliability is just a bad premise.