prash2488 | 3 months ago
Observations:
1. Model-task matching matters. Codex's default code-specialized model struggled with writing analysis. Switching to GPT-5 improved output quality 4x.
2. Autonomy settings affect completion. Gemini with limited autonomy produced incomplete work; it kept pausing for approvals mid-task.
3. All three claimed "done." Output varied from 198 to 2,555 lines. Never trust completion claims without verification.
4. Deep reading beat clever shortcuts. Codex took an API-first approach (RSS, JSON endpoints). Valid methodology, but missed nuances that Claude caught by reading posts directly.
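The gap in observation 4 comes down to RSS feeds typically carrying truncated summaries rather than full post bodies. A minimal sketch of that gap, using a hypothetical feed item (the post title, URL, and text are illustrative, not from the experiment):

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS item: the <description> holds a truncated summary,
# not the full post body.
rss = """<rss version="2.0"><channel><item>
  <title>Why we rewrote the scheduler</title>
  <link>https://example.com/posts/scheduler</link>
  <description>We rewrote the scheduler because latency was...</description>
</item></channel></rss>"""

item = ET.fromstring(rss).find("./channel/item")
summary = item.findtext("description")

# What reading the post's page directly would return (illustrative text).
full_post = (
    "We rewrote the scheduler because latency was unpredictable under "
    "load. The subtle part: the old design starved low-priority tasks "
    "only when the queue mixed I/O-bound and CPU-bound work."
)

# An API-first agent sees only the summary; the nuance in the second
# sentence is invisible unless the agent reads the post itself.
print(len(summary), len(full_post))
```

So both approaches are "valid methodology," but only direct reading surfaces detail the feed never contained in the first place.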
Claude won at 9.5/10, but the more interesting finding was how much configuration affected the other two agents' scores.
Full analysis and methodology are in the linked post.