(no title)
heavyarms | 1 year ago
The issue is not the technology. When it comes to natural language (LLM responses that are sentences, prose, etc.), there is no actual standard by which you can even judge the output. There is no gold standard for natural language; otherwise language would be boring. There is also no simple method for determining truth... philosophers have been discussing this for thousands of years, and after all that effort we now know that... ¯\_(ツ)_/¯... and also, Earth is Flat and Birds Are Not Real.
Take, for example, the first sentence of my comment: "Whenever I see one of these posts, I click just to see if the proposed solution to testing the output of an LLM is to use the output of an LLM... and in almost all cases it is." This is absolutely true, in my own head, as my selective memory is choosing to remember that one time I clicked on a similar post on HN. But beyond the simple question of whether it is true or not, even an army of human fact checkers and literature majors could probably not come up with a definitive and logical analysis of the quality and veracity of my prose. Is it even a grammatically correct sentence structure... with the run-on ellipses and whatnot... ??? Is it meant to be funny? Or snarky? Who knows ¯\_(ツ)_/¯ WTF is that random pile of punctuation marks in the middle of that sentence... does the LLM even have a token for that?
senko|1 year ago
You're not debating philosophy with the LLM, you're just asking it whether the answer matches (semantically) the expected one.
I usually test LLM output quality with the following prompt (simplified):
"An AI assistant was tasked with {task}. The relevant information for their task was {context}. Their answer is {answer}. The correct answer should be something like {ground truth}. Is their answer correct?"
Then you can spice it up with chain of thought, asking it to judge against preferred criteria/dimensions and output a score, etc... you can go as wild as you'd like. But even this simple approach tends to work really well.
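In code, a minimal sketch of that judge prompt wired up as a function (this assumes the OpenAI Python client; the model name and the judge() helper are illustrative, not a specific framework's API):

    from openai import OpenAI

    client = OpenAI()

    JUDGE_PROMPT = (
        "An AI assistant was tasked with: {task}\n"
        "The relevant information for their task was: {context}\n"
        "Their answer is: {answer}\n"
        "The correct answer should be something like: {ground_truth}\n"
        "Is their answer correct? Reply with exactly YES or NO."
    )

    def judge(task: str, context: str, answer: str, ground_truth: str) -> bool:
        # Ask a second model whether the answer semantically matches the ground truth.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # any reasonably capable judge model
            temperature=0,
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    task=task,
                    context=context,
                    answer=answer,
                    ground_truth=ground_truth,
                ),
            }],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")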
> turtles all the way down.
Saying "LLM testing LLM" is bad is like saying "computer testing computer" is bad. Yet automated tests have value. And just as unit tests will not prove your program is bug free, LLM evals won't guarantee 100% correctness. But they're an incredibly useful tool.
In my experience working on pretty complex multi-agent multi-step systems, trying to get those to work without an eval framework in place is like playing whack-a-mole, only way less fun.
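Concretely, a judge like the one above drops straight into an ordinary test suite, so the eval framework can be as boring as pytest. A rough sketch, where the cases and run_agent() are made-up placeholders and judge() is the helper sketched earlier:

    import pytest

    # Hypothetical golden set of (task, context, expected answer) triples, curated by hand.
    EVAL_CASES = [
        ("Summarize the support ticket",
         "Customer reports that login fails on Safari 17.",
         "The customer cannot log in when using Safari 17."),
    ]

    def run_agent(task: str, context: str) -> str:
        # Placeholder for whatever agent/pipeline is actually under test.
        raise NotImplementedError

    @pytest.mark.parametrize("task,context,expected", EVAL_CASES)
    def test_agent_answer(task, context, expected):
        answer = run_agent(task, context)
        assert judge(task, context, answer, expected)  # judge() from the sketch above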
maeil|1 year ago
If you have a ground truth, what was the purpose of asking the AI assistant for an answer in the first place?
hansonkd|1 year ago
I wouldn't doubt that if each layer on top of an LLM added some additional check on an unreliable process, you could eventually make something reliable out of something unreliable.
drpossum|1 year ago
Your suggestion of evaluating accuracy at the layer level necessarily implies there's some method of quantifiably detecting hallucinations. That isn't necessarily possible given the particular attention models involved, or even mathematically possible under the constraint of "infer this from finite text, with no ability for independent verification".
sitkack|1 year ago
https://www.cs.unm.edu/~saia/talks/reliable-from-unreliable....
chrismorgan|1 year ago
arccy|1 year ago
jusssi|1 year ago
It just knows more than you. Google says:
katakana letter tu (U+30C4) - ツ