item 47157527

jedberg | 4 days ago

WOPR used reinforcement learning, and could learn from its simulated mistakes. LLMs can't do that without some sort of RL harness. :)
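The kind of loop the comment alludes to can be sketched in a few lines — a hypothetical tabular self-play learner for tic-tac-toe (the game WOPR plays against itself in WarGames), improving only from the outcomes of its own simulated games. This is an illustrative sketch, not anyone's actual system; names and hyperparameters are made up.

```python
import random
from collections import defaultdict

# Hypothetical sketch: tabular value learning for tic-tac-toe via self-play.
# The agent plays games against itself and updates move values from the
# final result — i.e., it learns from its simulated mistakes.

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that side has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def moves(board):
    """Indices of empty squares."""
    return [i for i, cell in enumerate(board) if cell == ' ']

# (board, move) -> estimated value from the perspective of the player to move.
Q = defaultdict(float)

def choose(board, eps):
    """Epsilon-greedy move selection over the learned values."""
    legal = moves(board)
    if random.random() < eps:
        return random.choice(legal)
    return max(legal, key=lambda m: Q[(board, m)])

def train(episodes=20000, alpha=0.5, eps=0.2):
    """Self-play training: play a full game, then push the final reward
    back onto every (state, move) pair that led to it."""
    for _ in range(episodes):
        board, player = ' ' * 9, 'X'
        history = []  # (state, move, mover) for every ply of this game
        while True:
            m = choose(board, eps)
            history.append((board, m, player))
            board = board[:m] + player + board[m + 1:]
            w = winner(board)
            if w or not moves(board):
                # Reward from X's perspective: +1 win, -1 loss, 0 draw.
                r = 1.0 if w == 'X' else (-1.0 if w == 'O' else 0.0)
                for s, mv, p in history:
                    target = r if p == 'X' else -r  # mover's perspective
                    Q[(s, mv)] += alpha * (target - Q[(s, mv)])
                break
            player = 'O' if player == 'X' else 'X'
```

After `train()`, `choose(board, eps=0.0)` plays greedily from the learned table — the "harness" here is just the simulator plus the reward signal, which is exactly the piece a bare LLM forward pass doesn't have.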

No comments yet.