top | item 41531086

andrew_eu | 1 year ago

Entirely possible. I haven't tested it systematically or quantitatively, but it's been a recurring, easy "demo" case I've used with each release since 3.5-turbo.

The super-verbose chain of reasoning that o1 produces seems well suited to logic puzzles, so I expected it to handle them reasonably well. As with many other LLM topics, though, the framing of the evaluation (or the templating of the prompt) can impact the results enormously.
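For what it's worth, a toy sketch of what "templating of the prompt" can mean in practice. The puzzle and the template wording here are made up for illustration; the point is just that the same question can reach the model wrapped in very different framings:

```python
# Hypothetical example: one logic puzzle, two prompt templates.
# All text below (puzzle, role framing, answer-format instruction)
# is invented for illustration.

PUZZLE = "Alice is older than Bob. Bob is older than Carol. Who is youngest?"

def bare_template(puzzle: str) -> str:
    # Minimal framing: just the question, nothing else.
    return puzzle

def scaffolded_template(puzzle: str) -> str:
    # Heavier framing: a role, a step-by-step instruction,
    # and a required answer format.
    return (
        "You are a careful logician. Solve the puzzle below.\n"
        "Think step by step, then give the final answer on its own line.\n\n"
        f"Puzzle: {puzzle}\n\nAnswer:"
    )

if __name__ == "__main__":
    for make_prompt in (bare_template, scaffolded_template):
        print("---", make_prompt.__name__, "---")
        print(make_prompt(PUZZLE))
```

An "evaluation" that scores only the bare template and one that scores only the scaffolded one are measuring noticeably different things, even though the underlying puzzle is identical.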
