stared | 4 days ago

I reran it for GPT-5.2-Codex, at both high and xhigh effort.

The results finally match my experience, and it is actually good (on par with the best models at localization, with a still-impressive 0% false positive rate): https://quesma.com/benchmarks/binaryaudit/

I will rerun it on GPT-5.3-Codex shortly, now that the API is out (though the effort setting does not work correctly yet, and at "medium" it is very low).