

lukasb | 4 months ago

What do you use for evaluation? gemini-2.5-pro is at the top of MMLU and has been the best for me, but I'm always looking for something better.


vunderba|4 months ago

Recently I've found myself getting the evaluation simultaneously from OpenAI gpt-5, Gemini 2.5 Pro, and Qwen3 VL to give it a kind of "voting system". Purely anecdotal, but I do find that Gemini is the most consistent of the three.
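
A rough sketch of what such a voting step could look like, assuming each judge has already returned a score normalized to [0, 1]. The function name `aggregate_scores`, the `max_spread` threshold, and the sample scores are all made up for illustration; the actual API calls to each model are left out.

```python
from statistics import median


def aggregate_scores(scores: dict[str, float], max_spread: float = 0.3) -> dict:
    """Combine per-judge scores (each in [0, 1]) into a single verdict.

    The median is used so that one outlier judge cannot drag the result.
    `max_spread` is a hypothetical threshold for flagging disagreement.
    """
    if not scores:
        raise ValueError("need at least one judge score")
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "score": median(values),
        "spread": spread,
        "disagreement": spread > max_spread,
    }


# Illustrative scores only, not real model output.
result = aggregate_scores({"gpt-5": 0.87, "gemini-2.5-pro": 1.0, "qwen3-vl": 0.8})
print(result["score"], result["disagreement"])
```

With three judges the median is effectively a majority vote on a continuous scale, and the spread gives a cheap signal for which items are worth a manual look.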

motbus3|4 months ago

I am running a similar experiment, but so far just changing the seed on OpenAI seems to give similar results. If that holds up, it concerns me how sensitive the evaluation could be.

dangoodmanUT|4 months ago

I found the opposite. GPT-5 is better at judging along a true gradient of scores, while Gemini loves to pick 100%, 20%, 10%, 5%, or 0%. Like, you never get an 87% score.
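
One way to check this on your own data is to count how many distinct values a judge actually uses. A minimal sketch, assuming you have collected a list of scores in [0, 1] from one judge; `score_granularity` and the sample list are hypothetical.

```python
from collections import Counter


def score_granularity(scores: list[float]) -> dict:
    """Summarize how coarse a judge's scoring is.

    Scores are rounded to two decimals (whole percent) before counting.
    """
    if not scores:
        raise ValueError("need at least one score")
    counts = Counter(round(s, 2) for s in scores)
    top_value, top_count = counts.most_common(1)[0]
    return {
        "distinct": len(counts),
        "most_common": top_value,
        "most_common_share": top_count / len(scores),
    }


# Made-up scores shaped like the bucketed behavior described above.
coarse = [1.0, 0.2, 0.1, 1.0, 0.0, 0.05, 1.0, 0.2]
stats = score_granularity(coarse)
print(stats)
```

A judge that only ever lands on a handful of values across hundreds of items is effectively grading on a bucketed scale, whatever range the prompt asked for.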

lukasb|4 months ago

Interesting, I'll give voting a shot, thanks.