(no title)
dheerkt | 1 year ago
The key idea is to route each query dynamically based on its semantic similarity to previously verified question–answer pairs in a cache:
- Strong matches (≥80% similarity): Responses are directly served from the cache.
- Partial matches (60–80% similarity): Verified answers are used as few-shot examples to guide the LLM.
- No matches (<60% similarity): The query is processed by the LLM as usual.
This not only minimizes hallucinations but also reduces costs and improves response times.
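For anyone curious how the routing itself might look, here's a minimal sketch of the three-way decision described above. The thresholds come from the post; `embed()`, `call_llm()`, and the in-memory cache are hypothetical stand-ins, not the actual code from the linked notebook.

```python
# Sketch of similarity-based routing: strong match -> cached answer,
# partial match -> few-shot prompt, no match -> plain LLM call.
import numpy as np

STRONG, PARTIAL = 0.80, 0.60  # thresholds from the post

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call; swap in your embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your model/provider."""
    return f"LLM answer for: {prompt}"

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Cache entries: (query, query_embedding, verified_answer)
cache: list[tuple[str, np.ndarray, str]] = []

def answer(query: str) -> str:
    q_emb = embed(query)
    if cache:
        best = max(cache, key=lambda item: cosine(q_emb, item[1]))
        score = cosine(q_emb, best[1])
        if score >= STRONG:
            # Strong match: serve the verified answer straight from the cache.
            return best[2]
        if score >= PARTIAL:
            # Partial match: reuse the cached Q/A as a few-shot example.
            prompt = (f"Example:\nQ: {best[0]}\nA: {best[2]}\n\n"
                      f"Q: {query}\nA:")
            return call_llm(prompt)
    # No match: fall through to a normal LLM call.
    return call_llm(query)
```

In practice the cache lookup would be a vector store rather than a linear scan, but the branching logic stays the same.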
Here's a Jupyter notebook walkthrough if anyone's interested in diving deeper: https://github.com/aws-samples/Reducing-Hallucinations-in-LL...
Would love to hear your thoughts—anyone else working on similar techniques or approaches? Thanks.