Lienetic | 1 year ago
We use human evaluation, but that is naturally far from scalable, which has been especially problematic when working on more complicated workflows/chains where changes can have a cascading effect. I've been encouraging a lot of dev experimentation on my team, but I'd like a more consistent eval approach so we can evaluate and discuss changes with more grounded results. If all of these metrics are low-confidence, they become counterproductive, since people easily fall into the trap of optimizing the metric.
nirga | 1 year ago
We've written a couple of blog posts about some of them: https://www.traceloop.com/blog
swyx | 1 year ago