
Show HN: Klarity – Open-source tool to analyze uncertainty/entropy in LLM output

132 points | mrciffa | 1 year ago | github.com

We've open-sourced Klarity - a tool for analyzing uncertainty and decision-making in LLM token generation. It provides structured insights into how models choose tokens and where they show uncertainty.

What Klarity does:

- Real-time analysis of model uncertainty during generation
- Dual analysis combining log probabilities and semantic understanding
- Structured JSON output with actionable insights
- Fully self-hostable with customizable analysis models

The tool analyzes each step of text generation and returns structured JSON:

- uncertainty_points: array of {step, entropy, options[], type}
- high_confidence: array of {step, probability, token, context}
- risk_areas: array of {type, steps[], motivation}
- suggestions: array of {issue, improvement}
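
For illustration, output in that shape might look like the following (all values and `type` labels here are invented, not actual Klarity output):

```json
{
  "uncertainty_points": [
    {"step": 12, "entropy": 2.8, "options": ["could", "might", "will"], "type": "semantic_branch"}
  ],
  "high_confidence": [
    {"step": 3, "probability": 0.97, "token": "Paris", "context": "The capital of France is"}
  ],
  "risk_areas": [
    {"type": "hallucination_risk", "steps": [12, 13], "motivation": "several near-equiprobable continuations"}
  ],
  "suggestions": [
    {"issue": "ambiguous prompt", "improvement": "specify the expected answer format"}
  ]
}
```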

Currently supports Hugging Face Transformers (more frameworks coming). We tested extensively with Qwen2.5 (0.5B-7B) models, but it should work with most HF LLMs.

Installation is simple: `pip install git+https://github.com/klara-research/klarity.git`

We are building open-source interpretability/explainability tools to visualize & analyze attention maps, saliency maps, etc., and we want to understand your pain points with LLM behaviors. What insights would actually help you debug these black-box systems?

Links:

- Repo: https://github.com/klara-research/klarity
- Our website: https://klaralabs.com

26 comments


deoxykev|1 year ago

The fundamental challenge of using log probabilities to measure LLM certainty is the mismatch between how language models process information and how semantic meaning actually works. Current models analyze text token by token: fragments that don't necessarily align with complete words, let alone complex concepts or ideas.

This creates a gap between the mechanical measurement of certainty and true understanding, much like mistaking the map for the territory or confusing the finger pointing at the moon with the moon itself.

I've done some work before in this space, trying to come up with different useful measures from the logprobs, such as measuring shannon entropy over a sliding window, or even bzip compression ratio as a proxy for information density. But I didn't find anything semantically useful or reliable to exploit.
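
A minimal sketch of the sliding-window entropy measurement described above (an illustrative reconstruction, not the commenter's actual code; `window` size and the per-step probability distributions are assumed inputs):

```python
import numpy as np

def shannon_entropy(probs):
    """Shannon entropy (in bits) of one next-token probability distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                       # 0 * log(0) is defined as 0
    return float(-np.sum(p * np.log2(p)))

def sliding_window_entropy(step_distributions, window=16):
    """Mean per-step entropy over a sliding window of generation steps."""
    ent = [shannon_entropy(p) for p in step_distributions]
    return [sum(ent[i:i + window]) / window
            for i in range(len(ent) - window + 1)]
```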

The best approach I found was just multiple choice questions: "Does X entail Y? Please output [A] True or [B] False." Then measure the logprobs of the next token, which should be `[A` (90%) or `[B` (10%). Then we might make a statement like: the LLM thinks there is a 90% probability that X entails Y.
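
In Hugging Face terms, that measurement is one forward pass plus a softmax over the two option tokens. A sketch (model name and prompt are placeholders, and it assumes "A" and "B" each map to a single token after the opening bracket):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"   # any HF causal LM; Qwen2.5 is what the OP tested
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = ('Does "all cats are animals" entail "some animals are cats"? '
          'Please output [A] True or [B] False.\nAnswer: [')
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0, -1]          # next-token logits

a_id = tok.encode("A", add_special_tokens=False)[0]
b_id = tok.encode("B", add_special_tokens=False)[0]
p = torch.softmax(logits[[a_id, b_id]], dim=0)  # renormalize over just the two options
print(f"P(True) = {p[0]:.2f}, P(False) = {p[1]:.2f}")
```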

activatedgeek|1 year ago

That has been my understanding too. More generally, a verifier at the end certainly helps.

In our paper [1], we find that asking a follow-up question like "Is the answer correct?" and taking the normalized probability of the "Yes" or "No" token (or more generally any such token trained for) seems to be the best bet so far for getting well-calibrated probabilities out of the model.
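
Mechanically this is the same two-token softmax as in the parent comment, only with a self-verification prompt. A sketch (the Q/A pair, prompt wording, and model are made up and do not follow the paper's exact protocol):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

question, answer = "What is 17 * 3?", "51"   # hypothetical Q/A pair
prompt = (f"Q: {question}\nProposed answer: {answer}\n"
          "Is the proposed answer correct? Reply Yes or No.\nReply:")
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0, -1]

# Assumes " Yes"/" No" tokenize with a distinct first token.
yes_id = tok.encode(" Yes", add_special_tokens=False)[0]
no_id = tok.encode(" No", add_special_tokens=False)[0]
p_yes = torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()
print(f"Self-reported P(correct) = {p_yes:.2f}")
```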

In general, the log-probability of tokens is not a good indicator of anything other than satisfying the pre-training loss function of predicting the "next token" (it is likely very well-calibrated on that task, though). Semantics of language are a much less tamable object, especially when we don't quite have a good way to estimate a normalizing constant, because every answer can be paraphrased in many ways and still be correct. The volume of correct answers in the generation space of a language model is just too small.

There is work that shows one way to approximate the normalizing constant via SMC [2], but I believe we are more likely to benefit from having a verifier at train-time than any other approach.

And there are stop-gap solutions to make log probabilities more reliable by only computing them on "relevant" tokens, e.g. only final numerical answer tokens for a math problem [3]. But this approach kind of side-steps the problem of actually trying to find relevant tokens. Perhaps something more in the spirit of System 2 attention which selects meaningful tokens for the generated output would be more promising [4].
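
A sketch of that "relevant tokens only" stop-gap: score just the final answer span instead of the whole generation (illustrative only; the span match here is a naive search from the end, not the method of [3]):

```python
import math

def answer_confidence(token_ids, token_logprobs, answer_ids):
    """Geometric-mean probability over only the final-answer tokens.

    token_logprobs[i] is the log-prob the model assigned to token_ids[i];
    answer_ids are the token ids of the final answer (e.g. "42").
    """
    n = len(answer_ids)
    for start in range(len(token_ids) - n, -1, -1):   # search from the end
        if token_ids[start:start + n] == answer_ids:
            span = token_logprobs[start:start + n]
            return math.exp(sum(span) / n)
    return None  # answer span not found in the generation
```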

[1]: https://arxiv.org/abs/2406.08391
[2]: https://arxiv.org/abs/2404.17546
[3]: https://arxiv.org/abs/2402.10200
[4]: https://arxiv.org/abs/2311.11829

mrciffa|1 year ago

Unfortunately, LLMs are a gigantic monster to understand. We were considering the same sliding-window approach, and we will try to keep the library updated with better and more reliable approaches based on new research papers and our internal tests.

canjobear|1 year ago

Do you have any writeups of this work?

siliconc0w|1 year ago

It seems like it would be easy to upgrade existing benchmarks to include uncertainty as a dimension. Then if a model is less certain it could maybe spend more time reasoning or route to a bigger model.
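
A sketch of that routing idea (the threshold value and all names are made up; the entropy function is assumed to come from a tool like Klarity):

```python
def route(prompt, small_llm, large_llm, entropy_fn, threshold=1.5):
    """Answer with the small model unless its uncertainty is too high.

    entropy_fn(prompt) -> mean next-token entropy (bits) of the small
    model's draft answer; threshold is a hypothetical cutoff you would
    tune on a benchmark that includes uncertainty as a dimension.
    """
    if entropy_fn(prompt) > threshold:
        return large_llm(prompt)   # escalate: spend more compute
    return small_llm(prompt)       # cheap path for confident cases
```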

mrciffa|1 year ago

Exactly! Uncertainty is critical for correctly evaluating LLM performance, and we don't need reasoning models to spend thousands of tokens on simple questions.

Folcon|1 year ago

Hey, you say in the README.md:

MIT License. See LICENSE for more information.

But the LICENSE is Apache-2.0 license.

Which is it?

mrciffa|1 year ago

Apache-2.0 is the correct one.

itssimon|1 year ago

I ended up playing with the background animation on your website for 10 minutes; it was fun.

andreakl|1 year ago

Very interesting approach! What models are you currently considering for integration?

mrciffa|1 year ago

We want to integrate reasoning models next, because we see a lot of value in better understanding CoT behaviour (DeepSeek R1 & co).

KTibow|1 year ago

Why does the example code use a base model to generate the analysis input?

mrciffa|1 year ago

In the example I'm using the instruction-tuned version of Qwen2.5-7B to generate the insights.

kurisufag|1 year ago

This seems neat, but you really need to use commit messages other than "update code". It makes it harder to get a bearing on the codebase.

mrciffa|1 year ago

Oh damn, you are right. It's my first open-source project and I didn't think about it.

thomastjeffery|1 year ago

On your website, "Learn More" links to a meeting invite? That's... a decision...

I think most people clicking that button would be better served by scrolling down, but that's not made very obvious.