I did this while trying to figure out what to use in our own tool. The task was to analyze around 12,000 screenshots and find recurring manual workflows worth automating.
Results:
- Claude Sonnet 4.6: 8/10, $0.53/run — wins on quality
- Kimi K2.5: 7/10, $0.09/run — 6x cheaper, now my production pick
- GPT-5.2: 6/10, $0.41/run — missed the most obvious patterns, odd
- DeepSeek V3.2: 0/10 — gave me a garbled XML...
Models that flagged a one-time DKIM setup as a "recurring automation candidate" were penalized.
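A quick sanity check on the "6x cheaper" figure, using only the scores and per-run costs listed above. The dollars-per-point column is an illustrative metric of my own framing here, not something the scoring used, and the helper names are made up for the sketch.

```python
# Scores (out of 10) and per-run costs copied from the results list.
# DeepSeek V3.2 is left out because no cost was reported for it.
results = {
    "Claude Sonnet 4.6": (8, 0.53),
    "Kimi K2.5": (7, 0.09),
    "GPT-5.2": (6, 0.41),
}


def cost_ratio(a: str, b: str) -> float:
    """How many times more expensive model a is than model b, per run."""
    return results[a][1] / results[b][1]


def dollars_per_point(name: str) -> float:
    """Illustrative metric (an assumption, not part of the original scoring)."""
    score, cost = results[name]
    return cost / score


for name in results:
    print(f"{name}: ${dollars_per_point(name):.3f} per quality point")

ratio = cost_ratio("Claude Sonnet 4.6", "Kimi K2.5")
print(f"Sonnet costs {ratio:.1f}x what Kimi does per run")
```

The ratio comes out just under 6, so "6x cheaper" is a fair rounding for a one-point drop in score.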
Happy to share more if folks find this interesting.
jzapletal|5 days ago