I worked on the model with our research team. It was recently featured in the NYT (https://www.nytimes.com/2023/11/06/technology/chatbots-hallu...). Post here to ask me anything. We are also looking for collaborators to help us maintain this model and make it the best it can be. Let us know if you'd like to help.
gcr|2 years ago
Do you have a whitepaper describing how you trained this hallucination detection model?
Is each row of the leaderboard the mean of the Vectara model's judgment of the 831 (article,summary) pairs, or was there any human rating involved? With so few pairs, it seems feasible that human ratings should be able to quantify how much hallucination is actually occurring.
simonhughes22|2 years ago
Given the number of models involved, we have over 9k rows currently. Judging this task is quite time-consuming: you need to read a whole document and check it against a summary of several sentences, and some of the documents are a 1-3 minute read. We wanted to automate the process and make it as objective as possible (even humans can miss hallucinations or disagree on an annotation). We also wanted people to be able to replicate the work, none of which is possible with a human rater. Others have attempted this with human raters, but on a much smaller scale, e.g. see Anyscale's post - https://www.anyscale.com/blog/llama-2-is-about-as-factually-... (but note that is under 1k examples).
We did some human validation, and the model aligns well with human judgments, though not in perfect agreement; it is a model, after all. And again, humans don't agree with each other 100% of the time on this task either.
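To make the aggregation concrete: as described above, each leaderboard row is the mean of the judge model's verdicts over the (article, summary) pairs for that model. A minimal sketch of that computation, with hypothetical field names and made-up toy data (the real pipeline and schema are not shown in this thread):

```python
from statistics import mean

# Hypothetical records: the judge's verdict for one (article, summary) pair.
# 1.0 = summary judged factually consistent, 0.0 = hallucination detected.
judgments = [
    {"model": "model-a", "consistent": 1.0},
    {"model": "model-a", "consistent": 0.0},
    {"model": "model-a", "consistent": 1.0},
    {"model": "model-b", "consistent": 1.0},
    {"model": "model-b", "consistent": 1.0},
]

def leaderboard(rows):
    """Mean factual-consistency rate per model (one leaderboard row each)."""
    by_model = {}
    for r in rows:
        by_model.setdefault(r["model"], []).append(r["consistent"])
    return {m: mean(v) for m, v in by_model.items()}

print(leaderboard(judgments))
```

With ~831 pairs per model and many models, this is how the row count grows into the thousands even though each row is just a simple mean.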