Author here.
So far, the tasks in this benchmark are based on a handful of open-source projects (such as curl, jq, and GNU Coreutils).
Even with those "simple" projects we managed to make the tasks difficult: Claude Opus 4.1 was the only model to correctly cross-compile curl for arm64 as a statically linked binary [1].
In the future we'd like to add tasks based on projects like FFmpeg or Chromium, which should be much more difficult.
[1] https://www.compilebench.com/curl-ssl-arm64-static/