Yes, especially as models are known to prefer outputs from models in the same family. I suspect this leaderboard would change dramatically with a different model as the judge.
There may be different or better solutions for almost all of these kinds of tasks. I wouldn’t be surprised if the optimal answer to some of them were to refuse or defer the ask: refactor first, then solve it properly.
jacquesm|6 months ago
mirekrusin|6 months ago
spiderfarmer|6 months ago