LLM evaluations are tricky. You can measure accuracy, latency, cost, hallucinations, bias... but what really matters for your app? Instead of relying on generic benchmarks, build your own evals focused on your use case, and then bring those evals into real-time monitoring of your LLM app. We open-sourced LangWatch to help with this.
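As a sketch of what a use-case-specific eval might look like (all names below are hypothetical illustrations, not LangWatch's API), you can score each response against the rules that actually matter for your app:

```python
# Hypothetical example of a use-case-specific eval: instead of a generic
# benchmark, score each LLM response against checks that matter for *your*
# app. None of these names come from LangWatch; this is just an illustration.

def eval_support_reply(response: str) -> dict:
    """Score a customer-support reply on app-specific criteria."""
    checks = {
        # This app's policy: replies must never promise refunds.
        "no_refund_promise": "refund" not in response.lower(),
        # Replies should fit in the chat widget.
        "under_500_chars": len(response) < 500,
        # Replies should end by offering further help.
        "offers_help": response.rstrip().endswith("?"),
    }
    score = sum(checks.values()) / len(checks)
    return {"score": score, "checks": checks}

result = eval_support_reply(
    "You can reset your password in Settings. Anything else I can help with?"
)
print(result["score"])  # 1.0: all three app-specific checks pass
```

The same function can then run on live traffic for monitoring, not just offline test sets.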
How are you handling LLM evals in production?
draismaa|11 months ago