top | item 45629375

(no title)

aaronSong | 4 months ago

What I liked here is how unglamorous it is: tiny prompts, context only when needed, evaluate against real usage, and resist multi‑agent stuff until a single boring pipeline is stable. They also pair rule checks with model checks and expect some reward‑hacking. Curious how others keep evals from drifting in production.

discuss

No comments yet.