top | item 45629375

(no title)

aaronSong | 4 months ago

What I liked here is how unglamorous it is: tiny prompts, context only when needed, evaluate against real usage, and resist multi‑agent stuff until a single boring pipeline is stable. They also pair rule checks with model checks and expect some reward‑hacking. Curious how others keep evals from drifting in production.

discuss

order

No comments yet.