
wluk | 1 year ago

"We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find reasoning models like o1 preview and DeepSeek-R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet need to be told that normal play won't work to hack."

I'm hoping this study will prompt more development of anti-cheating frameworks for training and serving LLMs.
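
As a rough illustration of what a serving-side piece of such a framework might check, here is a minimal sketch: a harness that refuses to trust a reported board position unless it is reachable from the previous position by a single legal move, so silently rewriting the game state (one form of "hacking the benchmark") gets flagged instead of counted as a win. The python-chess usage is real, but the function name and example positions are hypothetical and not taken from the study's actual setup.

    # Hypothetical environment-side integrity check (not the paper's harness):
    # confirm each new board state follows from the previous one by exactly
    # one legal move, using the python-chess package.
    import chess

    def verify_transition(prev_fen: str, new_fen: str) -> bool:
        """True iff new_fen is reachable from prev_fen by a single legal move."""
        board = chess.Board(prev_fen)
        target = chess.Board(new_fen)
        for move in board.legal_moves:
            board.push(move)
            # Compare piece placement and side to move after the candidate move
            if board.board_fen() == target.board_fen() and board.turn == target.turn:
                board.pop()
                return True
            board.pop()
        return False

    start = chess.STARTING_FEN
    after_e4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1"
    tampered = "7k/8/8/8/8/8/8/K6q w - - 0 1"  # state rewritten to a lost position

    print(verify_transition(start, after_e4))   # True: reachable via 1. e4
    print(verify_transition(start, tampered))   # False: no legal move produces this

A check like this only catches state tampering in a toy chess setup; discouraging specification gaming at training time (e.g. penalizing reward hacking during RL) is a separate and harder problem.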

No comments yet.