jedberg | 4 days ago
WOPR used reinforcement learning, and could learn from its simulated mistakes. LLMs can't do that without some sort of RL harness. :)