top | item 44834139

(no title)

ImageXav | 6 months ago

Yes, especially as models are known to have a preference towards outputs of models in the same family. I suspect this leaderboard would change dramatically with different models as the judge.

discuss

jacquesm|6 months ago

I don't care about either method. The ground truth should be what a human would do, not what a model does.

mirekrusin|6 months ago

There may be different/better solutions for almost all those kind of tasks. I wouldn’t be surprised if optimal answer to some of them would be refusal/defer ask, refactor first, then solve it properly.

spiderfarmer|6 months ago

They are different models already but yes, I already let ChatGPT judge Claude's work for the same reason.