I highly doubt some of those results. GPT-5.2 (and 5.2-Codex) is incredible for cybersecurity and CTFs, and 5.3-Codex (not on the API yet) even more so. There is absolutely no way it's below DeepSeek or Haiku. Seems like a harness issue, or they tested those models at none/low reasoning effort?
jakozaur|7 days ago
The code is open-source; you can run it yourself using Harbor Framework:
git clone git@github.com:QuesmaOrg/BinaryAudit.git
export OPENROUTER_API_KEY=...
harbor run --path tasks --task-name lighttpd-* --agent terminus-2 --model openrouter/anthropic/claude-opus-4.6 --model openrouter/google/gemini-3-pro-preview --model openrouter/openai/gpt-5.2 --n-attempts 3
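For a quick sanity check before the full sweep, a narrowed variant of the same command (same flags as above, but a single model and one attempt; the model choice here is just an example) is useful to confirm the harness and API key are set up correctly:

```shell
# Sketch: same invocation as above, narrowed to one model and one attempt.
# Assumes Harbor is installed and OPENROUTER_API_KEY is exported.
harbor run --path tasks --task-name lighttpd-* \
  --agent terminus-2 \
  --model openrouter/openai/gpt-5.2 \
  --n-attempts 1
```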
Please open a PR if you find something interesting, though note that our domain experts spend a fair amount of time looking at trajectories.
Tiberium|7 days ago
stared|4 days ago
Finally, it matches my experience, and it is actually good (as good as the best models at localization, with a still-impressive 0% false-positive rate): https://quesma.com/benchmarks/binaryaudit/
Will rerun it on GPT-5.3-Codex shortly, now that the API is out (though the reasoning effort setting does not work correctly yet, and "medium" behaves like very low).
stared|7 days ago
At the same time, different tasks can behave differently, and not all the things that work best end-to-end are the same as the ones that are good for a typical, interactive workflow.
We used the Terminus 2 agent, as it is the default in Harbor (https://harborframework.com/) and we wanted to stay unbiased. Other agents would very likely change the results.