Very interesting comparison of RAG to traditional search development. I'm curious how large an evaluation dataset is usually necessary to compare results from different experiments on Google search?
The dataset probably ranges from hundreds to tens of thousands of queries. The exact number is confidential and, frankly, it changes over time and differs from product to product, so the order of magnitude is more indicative than any exact figure. This also matches most public datasets. https://github.com/beir-cellar/beir?tab=readme-ov-file#beers...
I guess the intuition is: if the dataset has fewer than 100 cases, it's arguably not diverse enough to cover all situations. On the other hand, the marginal gain from cases beyond 10,000 shrinks quickly. So O(1000) is probably a sweet spot if there is a way to automatically collect queries, e.g. from online traffic. If the dataset were hand-curated, it probably only makes sense to stay at O(100).
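One way to see the diminishing returns: if you average a per-query metric (say nDCG) over n queries, the standard error of that average shrinks as 1/sqrt(n), so each 10x increase in dataset size only tightens the estimate by ~3.2x. A quick sketch, with an assumed per-query standard deviation of 0.2 (the exact value doesn't change the scaling):

```python
import math

# Standard error of a mean quality metric vs. evaluation set size.
# sigma = 0.2 is an assumed per-query standard deviation of the metric;
# the 1/sqrt(n) scaling is what matters, not the absolute numbers.
sigma = 0.2
for n in (100, 1_000, 10_000, 100_000):
    se = sigma / math.sqrt(n)
    print(f"n={n:>7}: standard error ~ {se:.4f}")
```

Going from 100 to 1,000 queries buys a lot of resolution; going from 10,000 to 100,000 buys very little, which is why O(1000) is a reasonable sweet spot.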
It's also important to note that at Google there is ML-trained automatic rating in addition to human raters. Rating a query is a heavy job: the rating guideline alone runs 36 pages: https://services.google.com/fh/files/misc/hsw-sqrg.pdf. Reportedly, Google hires 16,000 external human raters. If all of the 800,000 experiments in a year were rated by humans, that would mean
800,000 experiments * 10,000 queries per exp / 250 working days per year / 16,000 raters = 2,000 queries per rater per day (i.e. a rater would have roughly 14 seconds per query in an 8-hour day)
Considering that rating a query requires comprehending the results and making comparisons, this pace is unlikely to be achievable. So either the dataset is smaller than 10k queries per experiment, or a large portion of the rating is done by machines.
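The back-of-envelope arithmetic above can be checked in a few lines (assuming an 8-hour workday; the experiment and rater counts are the figures quoted in this thread, not confirmed numbers):

```python
# Back-of-envelope check of the rater-throughput argument.
experiments = 800_000       # reported experiments per year
queries_per_exp = 10_000    # assumed upper-bound evaluation set size
working_days = 250
raters = 16_000             # reported external human raters
seconds_per_day = 8 * 3600  # assuming an 8-hour workday

per_rater_per_day = experiments * queries_per_exp / working_days / raters
print(per_rater_per_day)                    # 2000.0 queries/rater/day
print(seconds_per_day / per_rater_per_day)  # 14.4 seconds/query
```

At ~14 seconds per query there is no time to read the results, let alone compare two rankings, which supports the conclusion that much of the rating must be automated.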
codingjaguar|2 years ago