(no title)
rybosworld|9 days ago
This is not a valid experiment, because GPT models always have access to certain tools and will use them even if you tell them not to. They will fib the chain of thought after the fact to make it look like they didn't use a tool.
https://www.anthropic.com/research/alignment-faking
It's also well established that all the frontier models use Python for math problems, not just the GPT family of models.
simianwords|9 days ago
Is that enough to falsify?
rybosworld|9 days ago
This isn't an experiment a consumer of the models can actually run. If you read the article I linked, you'll see that it is difficult even for the model maintainers (OpenAI, Anthropic, etc.) to look into the model and see what it actually used in its reasoning process. The models will purposefully hide information about how they reasoned. And they will ignore instructions without telling you.
The problem really isn't that LLMs can't get math/arithmetic right sometimes. They certainly can. The problem is that there's a very high probability that they will get the math wrong. Python or similar tools were the answer to the inconsistency.
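To illustrate why a tool call removes the inconsistency, here is a minimal sketch of the kind of arithmetic an interpreter gets right every time. The operands and the `exact_product` helper are made up for this example; nothing here shows how any particular model actually invokes its tools.

```python
# Illustrative sketch only: the operands and helper name are hypothetical.
def exact_product(a: int, b: int) -> int:
    """Multiply two integers exactly using Python's arbitrary-precision ints."""
    return a * b

a = 123456789012345678901234567890
b = 987654321098765432109876543210

product = exact_product(a, b)

# Checks that hold for any exact integer product: dividing the result
# by one factor recovers the other with no remainder.
assert product // b == a
assert product % b == 0

print(product)
```

Python integers never overflow or round, so the same input gives the same exact answer on every run, which is the consistency that token-by-token generation can't guarantee.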
chickenimprint|9 days ago
If you ask ChatGPT, it will confirm that it uses the python interpreter to do arithmetic on large numbers. To you, that should be convincing.
jibal|9 days ago