Recently I've been working on making LLM evaluations fast by using Bayesian optimization to select a sensible subset.
Bayesian optimization is a good fit because it balances exploration and exploitation when querying an expensive black box (here, the LLM).
I would love to hear your thoughts and suggestions on this!
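For readers unfamiliar with the setup, here is a minimal, self-contained sketch of the idea. Everything here is illustrative, not the bocoel API: embed every query once (cheap), then let a Gaussian process decide which queries the expensive LLM actually answers.

```python
# Illustrative sketch: GP-driven subset selection for LLM evaluation.
# All names (select_subset, llm_score) are hypothetical, not bocoel's API.
import numpy as np

def rbf(a, b, scale=1.0):
    """RBF kernel between two sets of embedding vectors."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * scale**2))

def select_subset(embeddings, llm_score, budget, noise=1e-6):
    """Greedily pick `budget` points by maximum GP posterior variance
    (pure exploration; an entropy-based acquisition would slot in here)."""
    rng = np.random.default_rng(0)
    chosen = [int(rng.integers(len(embeddings)))]  # random first point
    scores = [llm_score(chosen[-1])]               # the expensive call
    while len(chosen) < budget:
        X = embeddings[chosen]
        K = rbf(X, X) + noise * np.eye(len(chosen))
        k_star = rbf(embeddings, X)
        # GP posterior variance at every candidate (prior variance is 1.0)
        var = 1.0 - np.einsum("ij,jk,ik->i", k_star, np.linalg.inv(K), k_star)
        var[chosen] = -np.inf                      # never re-pick a point
        nxt = int(np.argmax(var))
        chosen.append(nxt)
        scores.append(llm_score(nxt))
    return chosen, scores
```

With, say, 10,000 queries and a budget of 200, the LLM is only called 200 times; the rest of the work happens in embedding space.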
This is a cool idea -- is this an inner-loop process (i.e. after each LLM evaluation, the output is considered to choose the next sample) or a pre-loop process (get a subset of samples before tests are run)?
What is your goal? If d1, d2, d3, etc. are the datasets over which you're trying to optimize, then the goal is to find some best-performing d_i. In that case you're not evaluating; you're optimizing. Your acquisition function even says so: https://rentruewang.github.io/bocoel/research/
And in general, if you have an LLM that performs really well on one d_i, then who cares? The goal in LLM evaluation is to find a good-performing LLM overall.
Finally, your Abstract and other snippets read as if an LLM wrote them.
I disagree that the goal of "evaluation is to find a good performing LLM overall". The goal of evaluation is to understand the performance of an LLM (on average). This approach is actually more about finding "areas" where the LLM behaves well and where it does not (via the Gaussian process approximation). This is indeed an important problem to look at. Often you just run an LLM evaluation on thousands of samples, some of them similar, and you don't learn anything new from the sample "what time is it, please" over "what time is it".
If instead you can reduce the number of samples to look at and automatically find "clusters" and their performance, you get a win. It won't be the "average performance number", but it will (hopefully) give you an understanding of which things work how well in the LLM.
The main drawback (as far as I can tell after this short glimpse at it) is the embedding itself. This only works well if distance in the embedding space really correlates with performance. However, we know from adversarial attacks that already small changes in embedding space can result in vastly different results.
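The concern above is empirically checkable: on a labeled sample, measure whether pairs of queries that are close in embedding space also have similar per-sample scores. A hypothetical sketch (function name and inputs are illustrative; `perf` would come from real per-sample eval scores):

```python
# Sanity check for the GP smoothness assumption: does embedding distance
# track difference in LLM performance? Illustrative code, not bocoel's.
import numpy as np

def distance_performance_correlation(embeddings, perf):
    """Pearson correlation between pairwise embedding distances and
    pairwise absolute performance differences."""
    n = len(perf)
    iu = np.triu_indices(n, k=1)           # each unordered pair once
    emb_dist = np.linalg.norm(
        embeddings[:, None, :] - embeddings[None, :, :], axis=-1)[iu]
    perf_diff = np.abs(perf[:, None] - perf[None, :])[iu]
    return float(np.corrcoef(emb_dist, perf_diff)[0, 1])
```

A correlation near zero would mean the GP surrogate has little to work with, which is exactly the failure mode the comment warns about.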
Same. "Evaluate" and "corpus" need to be defined. I don't think OP intended this as clickbait, but without clarification it sounds like they're claiming 10x faster inference, which I'm pretty sure it isn't.
The "eval" phase is done after a model is trained to assess its performance on whatever tasks you wanted it to do. I think this is basically saying, "don't evaluate on the entire corpus, find a smart subset."
Hi, OP here. You evaluate LLMs on corpora to measure their performance, right? Bayesian optimization is used here to select points (in the latent space) and tell the LLM where to evaluate next. To be precise, entropy search is used (coupled with some latent-space reduction techniques like an N-sphere representation and embedding whitening). Hope that makes sense!
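Of the latent-space tricks mentioned above, embedding whitening is the easiest to sketch: decorrelate the embedding dimensions and scale them to unit variance, so an isotropic GP kernel fits the space better. This is generic PCA whitening (illustrative only; bocoel's actual transform may differ):

```python
# Generic PCA whitening of row-vector embeddings: zero mean, ~identity
# covariance afterwards. Illustrative, not necessarily bocoel's version.
import numpy as np

def whiten(embeddings, eps=1e-8):
    """PCA-whiten: rotate onto principal axes, rescale to unit variance."""
    centered = embeddings - embeddings.mean(axis=0)
    cov = centered.T @ centered / (len(embeddings) - 1)
    vals, vecs = np.linalg.eigh(cov)       # eigendecomposition of covariance
    return centered @ vecs / np.sqrt(vals + eps)
```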
This, exactly - what is meant by evaluate in this context? Is this more efficient inference using approximation, so you can create novel generations, or is it some test of model attributes?
What the OP is doing here is completely opaque to the rest of us.
"Evaluation" has a pretty standard meaning in the LLM community the same way that "unit test" does in software. Evaluations are suites of challenges presented to an LLM to evaluate how well it does as a form of bench-marking.
Nobody would chime in on an article on "faster unit testing in software with..." and complain that it's not clear because "is it a history unit? a science unit? what kind of tests are those students taking!?", so I find it odd that on HN people often complain about something similar for a very popular niche in this community.
If you're interested in LLMs, the term "evaluation" should be very familiar, and if you're not interested in LLMs then this post likely isn't for you.
Hi, OP here, sorry for the late reply. I am not actually "evaluating", but rather using the "side effects" of Bayesian optimization, which allow zooming in/out on regions of the latent space. Since embedders are so fast compared to LLMs, this saves time by sparing the LLM from evaluating similar queries. Hope that makes sense!
I looked through the github.io documentation and skimmed the code and research article draft. Correct me if I am wrong. What I think you are doing (at a high level) is: you create a corpus of QA tasks, embeddings, and similarity metrics. Then you somehow use NLP scoring and Bayesian optimization to find a subset of the corpus that best matches a particular evaluation task. Then you can just evaluate the LLM on this subset rather than the entire corpus, which is much faster.
I agree with the other comments. You need to do a much better job of motivating and contextualizing the research problem, as well as explaining your method in specific, precise language in the README and other documentation (preferably in the README). You should make it clear that you are using GLUE and Big-Bench for evaluation (as well as any other evaluation benchmarks you are using). You should also be explicit about which LLMs and embeddings you have tested and which datasets you trained and evaluated on, and you should add graphs and tables showing your method's speed and evaluation performance compared to the SOTA.
I like the reference/overview section that shows the diagram (I think you should put it in the README to make it more visible to first-time viewers). However, the descriptions of the classes are cryptic. For example, the Score class says "Evaluate the target with respect to the references." I had no idea what that meant, and I had to google some of the class names to get an idea of what Score was trying to do. That's true for pretty much all the classes. You also need to explain what factory classes are and how they differ from the model classes, e.g. why does the bocoel.models.adaptors class require a score and a corpus (from the overview), but factories.adaptor requires "GLUE", lm, and choices (looking at the code from examples/getting_started/__main__.py)? That said, I do like the fact that you have an example (although I haven't tried running it).
Thanks for the feedback! The reason the "code" part is more complete than the "research" part is that I originally planned for this to be just a hobby project, and only much later decided to try to make it serious research work.
Not trying to make excuses, though. Your points are very valid and I will take them into account!
OP here, I came up with this cool idea because I was chatting with a friend about how to make LLM evaluations fast (which is so painfully slow on large datasets) and realized that somehow no one has tried it. So I decided to give it a go!
I designed two modes in the project: exploration mode and exploitation mode.
Exploration mode uses entropy search to explore the latent space (used for evaluating the LLM on the selected corpus), and exploitation mode is used to figure out how well or how badly the model performs in different regions of the selected corpus.
For accurate evaluations, exploration is used. However, I'm also working on a visualization tool so that users can see how well the model performs in each region (courtesy of the Gaussian process models built into Bayesian optimization), and that is where exploitation mode comes in handy.
Sorry for the slightly messy explanation. Hope it clarifies things!
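Both modes can be read as different picks from the same GP posterior over the latent space (posterior mean = predicted score, posterior variance = uncertainty). A toy sketch of the distinction, with hypothetical arrays rather than the bocoel API (exploitation here targets the worst-performing region; other exploitation criteria are possible):

```python
# Toy illustration of explore vs. exploit given a GP posterior.
# `mean` and `var` are hypothetical per-candidate posterior statistics.
import numpy as np

def next_point(mean, var, mode):
    if mode == "explore":    # entropy-search-like: probe the most uncertain point
        return int(np.argmax(var))
    if mode == "exploit":    # zoom in on the worst-performing region
        return int(np.argmin(mean))
    raise ValueError(f"unknown mode: {mode}")
```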
Hi, OP here. I would say not really, because the goals are different. Although both use retrieval techniques, RAG wants to augment your query with factual information, whereas here we retrieve in order to evaluate on as few queries as possible (with performance guaranteed by Bayesian optimization).
eximius|2 years ago
renchuw|2 years ago
enonimal|2 years ago
ReD_CoDE|2 years ago
renchuw|2 years ago
tartakovsky|2 years ago
Good luck.
doubtfuluser|2 years ago
skyde|2 years ago
I know what an LLM is and I know very well what Bayesian optimization is. But I don't understand what this library is trying to do.
I am guessing it's trying to test the model's ability to generate correct and relevant responses to a given input.
But who is the judge ?
causal|2 years ago
deckar01|2 years ago
https://rentruewang.github.io/bocoel/research/
ragona|2 years ago
renchuw|2 years ago
azinman2|2 years ago
observationist|2 years ago
PheonixPharts|2 years ago
renchuw|2 years ago
unknown|2 years ago
[deleted]
endernac|2 years ago
renchuw|2 years ago
renchuw|2 years ago
pama|2 years ago
renchuw|2 years ago
As long as you have all the random seeds fixed, I think reproduction should be straightforward.
abhgh|2 years ago
renchuw|2 years ago
marclave|2 years ago
Big kudos for this! Wonderfully excited to see this on HN, and we will be using it.
anentropic|2 years ago
renchuw|2 years ago