(no title)
tartakovsky | 2 years ago
And in general if you have an LLM that performs really well on one d_i then who cares. The goal in LLM evaluation is to find a good performing LLM overall.
Finally, it feels that your Abstract and other snippets sound like an LLM wrote them.
Good luck.
doubtfuluser|2 years ago
If instead you can reduce the number of samples to look at and automatically find „clusters“ and their performance, you get a win. It won’t be the „average performance number“, but it will give you (hopefully) understanding which things work how well in the LLm.
The main drawback in this (as far as I can say after this short glimpse at it) is the embedding itself. Only if the distance in the embedding space really correlates with performance, this will work great. However we know from adversarial attacks, that already small changes in the embedding space can result in vastly different results