Hierarchical Reasoning Model – 1k training samples, SoTA reasoning vs CoT

26 points | dreamer7 | 7 months ago | github.com

6 comments

dreamer7|7 months ago

To a casual observer, this seems like a big deal. Can knowledgeable folks comment on this work?

AIPedant|7 months ago

I am still reading the paper, but it is worth noting that this is not an LLM! It is closer to something like AlphaGo, trained only on ARC, Sudoku and mazes. I am skeptical that you could add a bunch of science facts and programming examples without degrading the performance on ARC / etc - frankly it’s completely unclear to me how you would make this architecture into a chatbot, period, but I haven’t thought about it very much.

Comparing the maze/Sudoku results to LLMs rather than to maze/Sudoku-specific AIs strikes me as blatantly dishonest. "1k Sudoku training examples" is also dishonest: they generate about a million of them with permutations: https://news.ycombinator.com/item?id=44701264 (see also https://github.com/sapientinc/HRM/blob/main/dataset/build_su...).

And they seem to have deleted the Sudoku training data! Or maybe they made it private. It used to be here: https://github.com/imone and according to the Git history[1] they moved it here: https://github.com/sapientinc but I cannot find it. It might be an innocent mistake, but I suspect they got called out for lying about "1000 samples" and are covering their tracks.

[1] https://github.com/sapientinc/HRM/commit/171e2fcde636bcb7e6c...
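For readers unfamiliar with how "1k samples" can become ~a million: Sudoku has validity-preserving symmetries (relabeling the digits, shuffling rows within a horizontal band, shuffling the bands themselves, and the same for columns), so each puzzle can be expanded into billions of equivalent variants. This is a hypothetical sketch of that kind of augmentation, not the repo's actual build script (which lives at the truncated build_su... path above):

```python
import random

def augment(grid, rng=random):
    """Return a new valid grid derived from `grid` (9x9 lists of ints,
    0 = blank) using only symmetries that preserve Sudoku validity."""
    g = [row[:] for row in grid]

    # 1. Relabel digits 1-9 with a random permutation (0 stays blank).
    relabel = [0] + rng.sample(range(1, 10), 9)
    g = [[relabel[v] for v in row] for row in g]

    def shuffle_rows(g):
        # Shuffle the 3 rows inside each horizontal band, then shuffle
        # the bands; both keep every row/column/box constraint intact.
        for b in range(0, 9, 3):
            g[b:b + 3] = [g[b + i] for i in rng.sample(range(3), 3)]
        bands = rng.sample(range(3), 3)
        return [g[3 * b + i] for b in bands for i in range(3)]

    g = shuffle_rows(g)
    # 2. Columns: transpose, reuse the row shuffles, transpose back
    #    (the transpose of a valid grid is itself valid).
    g = [list(col) for col in zip(*g)]
    g = shuffle_rows(g)
    return [list(row) for row in zip(*g)]
```

These symmetries alone give roughly 9! x 6^8 ≈ 6x10^11 variants per puzzle, which is why "1000 samples" understates the effective training set if augmentation is applied.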

munro|7 months ago

Link to the paper: https://arxiv.org/pdf/2506.21734

Still reading, but the benchmarks for ARC-AGI-1, ARC-AGI-2, Sudoku-Extreme (9x9), and Maze-Hard (30x30) look impressive.

tough|7 months ago

On GitHub someone reproduced the work, but the paper lacks total GPU hours, and the reproduced benchmark results were 10-20% lower (per a GitHub issue).