Evaluating LLM Reasoning Through Live Computer Games

24 points| snyhlxde | 1 year ago |lmgame.org

14 comments

meru_2025|1 year ago

A dynamic human-in-the-loop evaluation benchmark is great for preventing data contamination and test saturation. Worth my time to read.

ginda307|1 year ago

Just played the game and it seems pretty fun - especially when you see the LLM speaking nonsense haha

snyhlxde|1 year ago

It's funny that reasoning models sometimes speak nonsense and perform worse than well-aligned models like claude-3.5-sonnet in multi-turn games like Akinator. I think it's a current weak point of long-CoT RL compared with instruction-following alignment. Maybe we need to find a way to address both? Would be interesting to see some results.

PY007|1 year ago

Static evaluation --> chatbot arena --> game arena. Seems to be promising.

Yuxuan_Zhang13|1 year ago

I played the game and found hard mode to be an exciting challenge—it's incredibly fun, and the AI is so clever it even guessed my intentions in the taboo game!

zhisbug|1 year ago

This is pretty clever and seems to have high potential, but it still relies on humans. What if someday no human can outsmart AI?

snyhlxde|1 year ago

When superintelligence comes, it would be very interesting to see multi-party gameplay among AIs too. What role humans play in this story is unclear. Maybe humans can't directly engage in the games either, as they are too naive and will be immediately identified and exploited by AI :)

mino1234uiui27|1 year ago

That is an interesting perspective that I hadn't previously considered: using games to evaluate LLMs.

wlsaidhi|1 year ago

Is it possible to set up an MLLM pipeline to play other Roblox games and use that as another evaluation?

snyhlxde|1 year ago

I think it's totally possible. Multimodal reasoning eval would be fun to consider too

snyhlxde|1 year ago

Challenge yourself against the latest reasoning LLMs and check out our latest leaderboard!

leemack|1 year ago

Cool, actually not boring and hard to play, good gamification