stared | 4 days ago
Finally, it matches my experience, and it is actually good (as good as the best models at localization, with a still-impressive 0% false positive rate): https://quesma.com/benchmarks/binaryaudit/
Will rerun it on GPT-5.3-Codex shortly, now that the API is out (though the effort setting does not work correctly yet, and at "medium" it is very low).