nestorD | 1 month ago
In short: start with a dataset of question-and-answer pairs, where each question has been answered by two different LLMs. Ask the model you want to evaluate to choose the better answer in each pair. Then measure how consistently it selects winners: does it reliably favor some models across questions, or does it behave close to randomly? This consistency is a strong proxy for the model's intelligence.
It is not subject to dataset leaks, it lets you measure intelligence in many fields where you might not have golden answers, and it converges pretty fast, making it really cheap to measure.
vintermann | 1 month ago
It seems to me many models - maybe by design - have a recognizable style which would be much easier to detect than evaluating the factual quality of answers.
nestorD | 1 month ago
But, in practice, when asked to pick the best answer, the model sees a single question/answers pair and focuses on determining which answer it thinks is best.