top | item 45669252

(no title)

payneio | 4 months ago

I get it. I've been through cycles of this over the past three years, too. Used a lot of various tools, had a lot of disappointment, wasted a lot of time and money.

But this is the kinda the whole point of my post...

In our system, we added fact checking itself, comparing different approaches, summarizing and effectively utilizing the "wisdom of the crowd" (and it's success over time).

And it made it work massively better for even non-trivial applications.

discuss

Madmallard|4 months ago

You're going to have to put quotes around "fact checking" if you're using LLMs to do it.

"comparing different approaches, summarizing and effectively utilizing the "wisdom of the crowd" (and it's success over time)"

I fail to see how this is defensible as well.

payneio|4 months ago

Compiling and evaluating output are types of fact checking. We've done more extensive automated evaluations of "groundedness" by extra ting factual statements and seeing whether or not they are based on input data or hallucinated. There are many techniques that work well.

For comparisons, you can ask the model to eval on various axis e.g. reliability, maintainability, cyclometeic complexity, API consistency, whatever, and they generally do fine.

We run multi-trial evals with multiple inputs across multiple semantic and deterministic metrics to create statistical scores we use for comparisons... basically creating benchmark suites by hand or generated. This also does well for guiding development.

payneio|4 months ago

And by "wisdom of the croud", I'm referring to sharing what works well and what doesn't and building good approaches into the frameworks... encoding human expertise. We do it all the time.