(no title)
fotcorn | 1 month ago
I started out by letting it write a naive C version without intrinsics, and validated it against the PyTorch version.
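For readers who have not seen one, a naive scalar depthwise convolution might look roughly like the sketch below. This is not the code from the experiment: the function name, the channels-first layout, stride 1, and the zero "same" padding are all assumptions made for illustration. Like PyTorch's `conv2d`, it does not flip the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Naive depthwise 2D convolution, stride 1, zero padding of k/2 ("same").
 * Assumed layout (hypothetical, not from the original post):
 *   in, out: [channels][height][width]
 *   w:       [channels][k][k]
 * Each channel is filtered independently with its own k x k kernel. */
void depthwise_conv2d_naive(const float *in, const float *w, float *out,
                            int channels, int height, int width, int k)
{
    assert(k > 0 && k % 2 == 1); /* odd kernel so the padding is symmetric */
    const int pad = k / 2;

    for (int c = 0; c < channels; c++) {
        const float *in_c = in + (size_t)c * height * width;
        const float *w_c = w + (size_t)c * k * k;
        float *out_c = out + (size_t)c * height * width;

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                float acc = 0.0f;
                for (int ky = 0; ky < k; ky++) {
                    for (int kx = 0; kx < k; kx++) {
                        const int iy = y + ky - pad;
                        const int ix = x + kx - pad;
                        /* Taps that fall outside the image contribute zero. */
                        if (iy < 0 || iy >= height || ix < 0 || ix >= width)
                            continue;
                        acc += in_c[iy * width + ix] * w_c[ky * k + kx];
                    }
                }
                out_c[y * width + x] = acc;
            }
        }
    }
}
```

A scalar version like this is slow, but it is easy to check by hand and gives every later SIMD variant something to be compared against.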
Then I asked it (and two other models, Gemini 3.0 and GPT 5.1) to come up with some ideas on how to make it faster using SIMD vector instructions and write those down as markdown files.
Finally, I started the agent loop by giving Cursor those three markdown files, the naive C code and some more information on how to compile the code, and also an SSH command where it can upload the program and test it.
It then tested a few different variants, ran them on the target (RISC-V SBC, OrangePI RV2) to check whether they improved runtime, and then continued from there. It did this 10 times, until it arrived at the final version.
The final code is very readable, and faster than any other library or compiler that I have found so far. I think the clear guardrails (output has to match exactly the reference output from PyTorch, performance must be better than before) make this work very well.
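The two guardrails can be expressed as a small acceptance check. The sketch below is only an illustration: the function names are hypothetical, "match exactly" is interpreted here as bit-identical floats, and the timings are assumed to be measured elsewhere (for example on the target board) and passed in as seconds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the index of the first element whose bits differ from the
 * reference, or n if the two buffers are bit-identical. */
size_t first_mismatch(const float *candidate, const float *reference, size_t n)
{
    assert(candidate != NULL && reference != NULL);
    for (size_t i = 0; i < n; i++) {
        if (memcmp(&candidate[i], &reference[i], sizeof(float)) != 0)
            return i;
    }
    return n;
}

/* A candidate is accepted only if it is correct AND strictly faster
 * than the best version so far. Returns 1 to accept, 0 to reject. */
int accept_candidate(const float *candidate, const float *reference, size_t n,
                     double candidate_seconds, double best_seconds)
{
    if (first_mismatch(candidate, reference, n) != n)
        return 0; /* wrong output: reject regardless of speed */
    return candidate_seconds < best_seconds;
}
```

Returning the mismatch index rather than a plain yes/no gives the agent a concrete location to debug. Note that a bit-exact comparison is strict: a variant that reorders floating-point additions can fail it even when it is numerically close.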
sifar|1 month ago
IIRC, depthwise is memory bound, so the bar might be lower. Perhaps you can try something with higher compute intensity, like a matrix multiply. I have observed that it trips up on the columnar accesses for SIMD.
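To make the columnar access problem concrete: in a row-major matrix multiply, the textbook i-j-p loop order walks down a column of B with a stride of n elements, which vector loads handle poorly. One common workaround, sketched below with a hypothetical function name and assuming row-major storage, is to reorder the loops so the innermost loop runs along rows of B and C.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B for row-major matrices: A is m x k, B is k x n, C is m x n.
 * Loop order i-p-j keeps the innermost loop walking B and C along rows
 * (unit stride), instead of walking down a column of B with stride n. */
void matmul_row_major(const float *a, const float *b, float *c,
                      int m, int k, int n)
{
    assert(m > 0 && k > 0 && n > 0);
    for (int i = 0; i < m; i++) {
        float *c_row = c + (size_t)i * n;
        for (int j = 0; j < n; j++)
            c_row[j] = 0.0f;
        for (int p = 0; p < k; p++) {
            const float a_ip = a[(size_t)i * k + p]; /* scalar, reused across the row */
            const float *b_row = b + (size_t)p * n;
            for (int j = 0; j < n; j++)
                c_row[j] += a_ip * b_row[j];
        }
    }
}
```

The inner loop is now a scalar-times-vector accumulate over contiguous memory, which is the shape that maps most directly onto SIMD instructions. Whether an agent finds this on its own is a separate question.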
fotcorn|1 month ago
It is also important to have a few test cases the agent can quickly check against. It will often generate wrong code, but if that is easily detectable, the agent can fix it and continue quickly.
camel-cdr|1 month ago