Recently I've been working on making LLM evaluations fast by using Bayesian optimization to select a sensible subset.
Bayesian optimization is a good fit because it balances exploration and exploitation when querying an expensive black box (here, the LLM).
I would love to hear your thoughts and suggestions on this!
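For readers unfamiliar with the setup, here is a minimal, self-contained sketch of the idea. Everything here is illustrative, not the bocoel API: embed every query once (cheap), then let a Gaussian process decide which queries the expensive LLM actually answers.

```python
# Illustrative sketch: GP-driven subset selection for LLM evaluation.
# All names (select_subset, llm_score) are hypothetical, not bocoel's API.
import numpy as np

def rbf(a, b, scale=1.0):
    """RBF kernel between two sets of embedding vectors."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * scale**2))

def select_subset(embeddings, llm_score, budget, noise=1e-6):
    """Greedily pick `budget` points by maximum GP posterior variance
    (pure exploration; an entropy-based acquisition would slot in here)."""
    rng = np.random.default_rng(0)
    chosen = [int(rng.integers(len(embeddings)))]  # random first point
    scores = [llm_score(chosen[-1])]               # the expensive call
    while len(chosen) < budget:
        X = embeddings[chosen]
        K = rbf(X, X) + noise * np.eye(len(chosen))
        k_star = rbf(embeddings, X)
        # GP posterior variance at every candidate (prior variance is 1.0)
        var = 1.0 - np.einsum("ij,jk,ik->i", k_star, np.linalg.inv(K), k_star)
        var[chosen] = -np.inf                      # never re-pick a point
        nxt = int(np.argmax(var))
        chosen.append(nxt)
        scores.append(llm_score(nxt))
    return chosen, scores
```

With, say, 10,000 queries and a budget of 200, the LLM is only called 200 times; the rest of the work happens in embedding space.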
This is a cool idea -- is this an inner-loop process (i.e. after each LLM evaluation, the output is considered to choose the next sample) or a pre-loop process (get a subset of samples before tests are run)?
What is your goal? If d1, d2, d3, etc. are the datasets over which you're trying to optimize, then the goal is to find some best-performing d_i. In that case you're not evaluating; you're optimizing. Your acquisition function even says so: https://rentruewang.github.io/bocoel/research/
And in general, if you have an LLM that performs really well on one d_i, then who cares? The goal in LLM evaluation is to find a good-performing LLM overall.
Finally, your Abstract and other snippets read as if an LLM wrote them.
I disagree that the goal of "evaluation is to find a good performing LLM overall". The goal of evaluation is to understand the performance of an LLM (on average). This approach is actually more about finding "areas" where the LLM behaves well and where it does not (via the Gaussian process approximation). This is indeed an important problem to look at. Often you just run an LLM evaluation on thousands of samples, some of them similar, and you don't learn anything new from the sample "what time is it, please" over "what time is it".
If instead you can reduce the number of samples to look at and automatically find "clusters" and their performance, you get a win. It won't be the "average performance number", but it will (hopefully) give you an understanding of which things work how well in the LLM.
The main drawback (as far as I can tell after this short glimpse at it) is the embedding itself. This only works well if distance in the embedding space really correlates with performance. However, we know from adversarial attacks that already small changes in embedding space can result in vastly different results.
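The concern above is empirically checkable: on a labeled sample, measure whether pairs of queries that are close in embedding space also have similar per-sample scores. A hypothetical sketch (function name and inputs are illustrative; `perf` would come from real per-sample eval scores):

```python
# Sanity check for the GP smoothness assumption: does embedding distance
# track difference in LLM performance? Illustrative code, not bocoel's.
import numpy as np

def distance_performance_correlation(embeddings, perf):
    """Pearson correlation between pairwise embedding distances and
    pairwise absolute performance differences."""
    n = len(perf)
    iu = np.triu_indices(n, k=1)           # each unordered pair once
    emb_dist = np.linalg.norm(
        embeddings[:, None, :] - embeddings[None, :, :], axis=-1)[iu]
    perf_diff = np.abs(perf[:, None] - perf[None, :])[iu]
    return float(np.corrcoef(emb_dist, perf_diff)[0, 1])
```

A correlation near zero would mean the GP surrogate has little to work with, which is exactly the failure mode the comment warns about.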
Same. "Evaluate" and "corpus" need to be defined. I don't think OP intended this as clickbait, but without clarification it sounds like they're claiming 10x faster inference, which I'm pretty sure it isn't.
The "eval" phase is done after a model is trained to assess its performance on whatever tasks you wanted it to do. I think this is basically saying, "don't evaluate on the entire corpus, find a smart subset."
Hi, OP here. You evaluate LLMs on corpora to measure their performance, right? Bayesian optimization is used here to select points (in the latent space) and tell the LLM where to evaluate next. To be precise, entropy search is used (coupled with some latent-space reduction techniques like an N-sphere representation and embedding whitening). Hope that makes sense!
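Of the latent-space tricks mentioned above, embedding whitening is the easiest to sketch: decorrelate the embedding dimensions and scale them to unit variance, so an isotropic GP kernel fits the space better. This is generic PCA whitening (illustrative only; bocoel's actual transform may differ):

```python
# Generic PCA whitening of row-vector embeddings: zero mean, ~identity
# covariance afterwards. Illustrative, not necessarily bocoel's version.
import numpy as np

def whiten(embeddings, eps=1e-8):
    """PCA-whiten: rotate onto principal axes, rescale to unit variance."""
    centered = embeddings - embeddings.mean(axis=0)
    cov = centered.T @ centered / (len(embeddings) - 1)
    vals, vecs = np.linalg.eigh(cov)       # eigendecomposition of covariance
    return centered @ vecs / np.sqrt(vals + eps)
```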
This, exactly - what is meant by evaluate in this context? Is this more efficient inference using approximation, so you can create novel generations, or is it some test of model attributes?
What the OP is doing here is completely opaque to the rest of us.
"Evaluation" has a pretty standard meaning in the LLM community the same way that "unit test" does in software. Evaluations are suites of challenges presented to an LLM to evaluate how well it does as a form of bench-marking.
Nobody would chime in on an article on "faster unit testing in software with..." and complain that it's not clear because "is it a history unit? a science unit? what kind of tests are those students taking!?", so I find it odd that on HN people often complain about something similar for a very popular niche in this community.
If you're interested in LLMs, the term "evaluation" should be very familiar, and if you're not interested in LLMs then this post likely isn't for you.
Hi, OP here, sorry for the late reply. I am not actually "evaluating", but rather using the "side effects" of Bayesian optimization, which allow zooming in/out on regions of the latent space. Since embedders are so fast compared to LLMs, this saves time by sparing the LLM from evaluating similar queries. Hope that makes sense!
I looked through the github.io documentation and skimmed the code and research article draft. Correct me if I am wrong. What I think you are doing (at a high level) is: you create a corpus of QA tasks, embeddings, and similarity metrics. Then you somehow use NLP scoring and Bayesian optimization to find a subset of the corpus that best matches a particular evaluation task. Then you can just evaluate the LLM on this subset rather than the entire corpus, which is much faster.
I agree with the other comments. You need to do a much better job of motivating and contextualizing the research problem, as well as explaining your method in specific, precise language in the README and other documentation (preferably in the README). You should make it clear that you are using GLUE and Big-Bench for evaluation (as well as any other evaluation benchmarks you are using). You should also be explicit about which LLMs and embeddings you have tested and which datasets you trained and evaluated on, and you should add graphs and tables showing your method's speed and evaluation performance compared to the SOTA.
I like the reference/overview section that shows the diagram (I think you should put it in the README to make it more visible to first-time viewers). However, the descriptions of the classes are cryptic. For example, the Score class says "Evaluate the target with respect to the references." I had no idea what that meant, and I had to google some of the class names to get an idea of what Score was trying to do. That's true for pretty much all the classes. You also need to explain what factory classes are and how they differ from the model classes, e.g. why does the bocoel.models.adaptors class require a score and a corpus (from the overview), but factories.adaptor requires "GLUE", lm, and choices (looking at the code from examples/getting_started/__main__.py)? That said, I do like the fact that you have an example (although I haven't tried running it).
Thanks for the feedback! The reason the "code" part is more complete than the "research" part is that I originally planned for this to be just a hobby project, and only much later decided to try to make it serious research work.
Not trying to make excuses, though. Your points are very valid and I will take them into account!
OP here, I came up with this cool idea because I was chatting with a friend about how to make LLM evaluations fast (which is so painfully slow on large datasets) and realized that somehow no one has tried it. So I decided to give it a go!
I designed two modes in the project: exploration mode and exploitation mode.
Exploration mode uses entropy search to explore the latent space (used for evaluating the LLM on the selected corpus), and exploitation mode is used to figure out how well or how badly the model performs in different regions of the selected corpus.
For accurate evaluations, exploration is used. However, I'm also working on a visualization tool so that users can see how well the model performs in each region (courtesy of the Gaussian process models built into Bayesian optimization), and that is where exploitation mode comes in handy.
Sorry for the slightly messy explanation. Hope it clarifies things!
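Both modes can be read as different picks from the same GP posterior over the latent space (posterior mean = predicted score, posterior variance = uncertainty). A toy sketch of the distinction, with hypothetical arrays rather than the bocoel API (exploitation here targets the worst-performing region; other exploitation criteria are possible):

```python
# Toy illustration of explore vs. exploit given a GP posterior.
# `mean` and `var` are hypothetical per-candidate posterior statistics.
import numpy as np

def next_point(mean, var, mode):
    if mode == "explore":    # entropy-search-like: probe the most uncertain point
        return int(np.argmax(var))
    if mode == "exploit":    # zoom in on the worst-performing region
        return int(np.argmin(mean))
    raise ValueError(f"unknown mode: {mode}")
```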
Hi, OP here. I would say not really, because the goals are different. Although both use retrieval techniques, RAG wants to augment your query with factual information, whereas here we retrieve in order to evaluate on as few queries as possible (with performance guaranteed by Bayesian optimization).
eximius|2 years ago
renchuw|2 years ago
enonimal|2 years ago
ReD_CoDE|2 years ago
renchuw|2 years ago
tartakovsky|2 years ago
Good luck.
doubtfuluser|2 years ago
skyde|2 years ago
I know what an LLM is and I know very well what Bayesian optimization is. But I don't understand what this library is trying to do.
I am guessing it's trying to test the model's ability to generate correct and relevant responses to a given input.
But who is the judge ?
causal|2 years ago
deckar01|2 years ago
https://rentruewang.github.io/bocoel/research/
ragona|2 years ago
renchuw|2 years ago
azinman2|2 years ago
observationist|2 years ago
PheonixPharts|2 years ago
renchuw|2 years ago
unknown|2 years ago
[deleted]
endernac|2 years ago
renchuw|2 years ago
renchuw|2 years ago
pama|2 years ago
renchuw|2 years ago
As long as you have all the random seeds fixed, I think reproduction should be straightforward.
abhgh|2 years ago
renchuw|2 years ago
marclave|2 years ago
Big kudos for this! Wonderfully excited to see this on HN, and we will be using it.
anentropic|2 years ago
renchuw|2 years ago