Does anyone know more about the benchmark? 60% accuracy gets a drumroll? How would Claude do? How would a human do? I tried the previous version and was not impressed. I went back to Claude, which is very hard to beat and versatile with context enrichment.