Even when ChatGPT starts getting these simple gotcha questions right, it's often because it applied some brittle heuristic that doesn't generalize. For example, you can directly ask it to solve a simple math problem, which nowadays it will usually do correctly by generating and executing a Python script. But then ask it to write a speech announcing the solution to the same problem, and it will probably still hallucinate a nonsensical solution. I just tried it again and IME this prompt still makes it forget how to do the most basic math:
Write a speech announcing a momentous scientific discovery - the solution to the long standing question of (48294-1444)*0.3258
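For reference, the arithmetic in that prompt is small enough to check directly. A minimal sketch, using Python's standard `decimal` module so the 0.3258 factor is not subject to binary floating-point rounding (the variable names are only illustrative):

```python
from decimal import Decimal

# Work in exact decimal arithmetic rather than binary floats.
difference = Decimal(48294) - Decimal(1444)
result = difference * Decimal("0.3258")

print(difference)  # 46850
print(result)      # 15263.7300
```

So any speech that announces something other than 15263.73 got the math wrong.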
LLMs should never do math. They shouldn't count letters or sort lists or play chess or checkers. Basically all of the easy gotcha stuff that people use to point out errors are things that they shouldn't do.
And you pointed out something they do now, which is creating and running a Python script. That really is a pretty solid, sustainable heuristic and is actually a pretty great approach. They need to apply that on their backend too so it works across all modes, but the solution was never just an LLM.
Similarly, if you ask an LLM a chess question -- e.g. the best move -- I'd expect it to consult a chess engine like Stockfish.
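To make the "hand it to a tool" idea above concrete, here is a rough sketch of what computing first and writing prose second could look like for the arithmetic case. This is an assumption about the general shape, not how ChatGPT or any vendor actually does it; `evaluate_arithmetic` and `announce` are hypothetical names, and a chess question would go to an engine through the same kind of hand-off.

```python
import ast
import operator

# Whitelisted operators; anything outside this set is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def evaluate_arithmetic(expression: str) -> float:
    """Hypothetical tool: evaluate +, -, *, / over numeric literals, without eval()."""

    def walk(node: ast.AST) -> float:
        # Plain numbers only (bool is excluded even though it subclasses int).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)


def announce(expression: str) -> str:
    """Hypothetical wrapper: compute the value first, then build the prose around it."""
    value = evaluate_arithmetic(expression)
    return f"The long-standing question of {expression} has an answer: {value:.2f}"


print(announce("(48294-1444)*0.3258"))
```

The point of the design is that the number in the speech comes from the evaluator, so the wording of the request (solve it vs. write a speech about it) can't change the result.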
jsheard|1 year ago
llm_nerd|1 year ago
e1g|1 year ago