anorwell | 7 months ago
1. Custom scaffolding (system prompt and tools) using Qwen3-32B achieved 13.75% on Terminal-Bench. No training was involved.
2. The author has built an RL system, but it has not been used for anything due to cost limitations.
So there's actually no result related to training here. It is well known that the scaffolding used can have a large impact on benchmark outcomes (the Terminal-Bench leaderboard also demonstrates this [1]).
esafak | 7 months ago
1. Tooling for training a terminal agent.
2. An agent that was _not_ trained with this tooling but was instead prompt-engineered. I could not find the author's discussion on this point.