top | item 47043756

(no title)

JB_5000 | 13 days ago

Interesting benchmark, but worth noting the methodology: skills are generated before the task, with no feedback loop. In practice, useful skills tend to emerge from doing — you attempt, observe what failed, then codify what worked. Generate → execute → observe → refine. The paper tests cold generation, which is a different (and less realistic) setup.

discuss

order

No comments yet.