
Show HN: Klarity – OS tool to debug LLM reasoning patterns with entropy analysis

3 points | mrciffa | 1 year ago | github.com

After struggling to understand why our reasoning models would sometimes produce flawless reasoning and other times go completely off track, we updated Klarity to give instant insight into reasoning uncertainty, plus concrete suggestions for dataset and prompt optimization. Just point it at your model to save testing time.

Key new features:

- Identify where your model's reasoning goes off track with step-by-step entropy analysis
- Get actionable scores for coherence and confidence at each reasoning step
- Training data insights: identify which reasoning data lead to high-quality outputs
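To make the entropy idea concrete: per-step uncertainty is typically derived from the Shannon entropy of the model's next-token distributions. This is a minimal sketch of that calculation, not Klarity's actual implementation:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution.
    Low entropy = the model is confident; high entropy = it is uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked distribution (confident step) vs. a flat one (uncertain step).
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(token_entropy(confident))  # close to 0
print(token_entropy(uncertain))  # log(4) ≈ 1.386, the maximum for 4 tokens
```

Averaging these values over the tokens of one reasoning step gives a per-step entropy score like the ones reported below.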

Structured JSON output with step-by-step analysis:

- steps: array of {step_number, content, entropy_score, semantic_score, top_tokens[]}
- quality_metrics: array of {step, coherence, relevance, confidence}
- reasoning_insights: array of {step, type, pattern, suggestions[]}
- training_targets: array of {aspect, current_issue, improvement}

Example use cases:

- Debug why your model's reasoning fails on edge cases
- Identify which types of reasoning steps contribute to better outcomes
- Optimize your RL datasets by focusing on high-quality reasoning patterns

Currently supports Hugging Face transformers and the Together AI API; we tested the library with the DeepSeek R1 distilled series (Qwen-1.5B, Qwen-7B, etc.).

Installation: `pip install git+https://github.com/klara-research/klarity.git`

We are building open-source interpretability/explainability tools to debug generative model behaviors. What insights would actually help you debug these black-box systems?

Links:

- Repo: https://github.com/klara-research/klarity
- Our website: https://klaralabs.com
- Discord: https://discord.gg/wCnTRzBE

4 comments


Andreagobbo|1 year ago

I'm curious: how does Klarity handle cases where reasoning errors stem not from poor training data but from inherent limitations in the model architecture or prompt design? Are there specific suggestions for addressing those types of issues, or is the focus mainly on dataset optimization?

mrciffa|1 year ago

We currently give broad suggestions via an insight model that can be chosen during setup. We will work on making the suggestion prompt/code more granular in new releases.

andreakl|1 year ago

How does Klarity scale with more complex models or larger datasets? Does it maintain the same level of insight and actionable suggestions as the model grows in size and complexity? Great release, btw.

mrciffa|1 year ago

It should work with any type of model. That said, longer chains of thought are harder for the evaluation model to analyse, because there are many more reasoning steps to identify and separate. The quality of the outcome depends a lot on the model chosen to give you insights; we tested with Llama3-70B and it worked smoothly most of the time.