This is a cool idea -- is this an inner-loop process (i.e. after each LLM evaluation, the output is considered to choose the next sample) or a pre-loop process (get a subset of samples before tests are run)?
AFAICT, this is a more advanced way of using embeddings, which can encode the "vibes" similarity (not an official term) of prompts, to determine where you get the most "bang for your buck" in terms of testing.
For instance, if there are three conversations that you can use to test if your AI is working correctly:
(1) HUMAN: "Please say hello"
AI: "Hello!"
(2) HUMAN: "Please say goodbye"
AI: "Goodbye!"
(3) HUMAN: "What is 2 + 2?"
AI: "4!"
Let's say you can only pick two conversations to evaluate how good your AI is. Would you pick 1 & 2? Probably not, since they exercise nearly the same behavior. You'd pick 1 & 3, or 2 & 3.
Because embeddings let us measure how similar in vibes two things are, we have a tool for automatically searching our dataset for items with very different vibes, so each evaluation run is more likely to return new information about how well the model is doing.
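To make that concrete, here's a rough sketch of how I imagine the selection could work. This is not the OP's method, just greedy farthest-point selection over cosine distance. The 2-D vectors in `toy_embeddings` are made up by hand to stand in for real embeddings of the three conversations above, and `pick_diverse` is a hypothetical helper name.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; near 0 means 'same vibes'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def pick_diverse(embeddings, k):
    """Greedily pick k indices whose embeddings are far apart (hypothetical helper)."""
    n = len(embeddings)
    if k <= 0:
        return []
    if k >= n:
        return list(range(n))
    # Seed with the most distant pair, then keep adding the item that is
    # farthest from its nearest already-chosen neighbour.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    seed = max(pairs, key=lambda p: cosine_distance(embeddings[p[0]], embeddings[p[1]]))
    chosen = list(seed)[:k]
    while len(chosen) < k:
        remaining = [i for i in range(n) if i not in chosen]
        nxt = max(
            remaining,
            key=lambda i: min(cosine_distance(embeddings[i], embeddings[c]) for c in chosen),
        )
        chosen.append(nxt)
    return sorted(chosen)

# Made-up vectors, NOT real model output: (1) and (2) point the same way, (3) does not.
toy_embeddings = [
    [0.9, 0.1],  # (1) "Please say hello"
    [0.8, 0.2],  # (2) "Please say goodbye"
    [0.1, 0.9],  # (3) "What is 2 + 2?"
]

print(pick_diverse(toy_embeddings, 2))
```

With these toy vectors the pick is indices 0 and 2, i.e. conversations 1 & 3, and never the 1 & 2 pair. Real embeddings would come from whatever embedding model you use; the selection logic stays the same.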
My question to the OP was mostly about whether this "vibe-differentiated dataset" is constructed prior to the evaluation run, or populated gradually, based on each individual test case result.
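In case the distinction isn't clear, here's a minimal sketch of the two control flows I have in mind. Everything here is hypothetical (`pre_loop`, `inner_loop`, and the `select` / `choose_next` / `evaluate` callables are names I made up), and I don't know which one the OP actually does.

```python
def pre_loop(samples, select, evaluate, k):
    """Pick the whole subset up front, then run the tests."""
    subset = select(samples, k)  # chosen once, before any evaluation
    return [(s, evaluate(s)) for s in subset]

def inner_loop(samples, choose_next, evaluate, k):
    """Pick one sample at a time, looking at earlier results first."""
    remaining = list(samples)
    results = []
    while remaining and len(results) < k:
        s = choose_next(remaining, results)  # may react to earlier outcomes
        remaining.remove(s)
        results.append((s, evaluate(s)))
    return results
```

In the pre-loop version `select` never sees a test result; in the inner-loop version `choose_next` receives every `(sample, outcome)` pair so far, so it could, say, steer toward or away from regions where the model just failed.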
ReD_CoDE|2 years ago
enonimal|2 years ago
so anyway it's just vibes man
renchuw|2 years ago