potatolicious|4 months ago
"We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.
But in the realm of LLM-enabled use cases, rigorous evaluations are also expensive. You'd need to recruit dozens, perhaps even hundreds, of developers, observe them extensively, and rate the results.
So rather than actually trying to measure efficacy, we just get blog posts with cherry-picked examples of "LLM does something cool". Everything is anecdata.
This is also the biggest barrier to actual LLM adoption for many, many applications. The gap between "it does something REALLY IMPRESSIVE 40% of the time and shits the bed otherwise" and "production system" is a yawning chasm.
marcosdumay|4 months ago
It applies to using LLMs too. I guess the biggest difference here is that LLMs are backed by a handful of companies with enough money that running a test like this would be trivial for them. So the fact that they aren't doing it also says a lot.
oblio|4 months ago
> "We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.
Heh, I'd rephrase the first part to:
> What you're getting at is the heart of the problem with software development though, isn't it?
b_e_n_t_o_n|4 months ago
> Trial participants saved an average of 56 minutes a working day when using AICAs
That feels accurate to me, but again I'm just going on vibes :P
potatolicious|4 months ago
You absolutely can quantify the results of a chaotic black box, in the same way you can quantify the bias of a loaded die without examining its molecular structure.
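To make the loaded-die analogy concrete, here's a minimal sketch in Python. The die's weights are invented for illustration (a real "black box" would be, say, an LLM behind an API), and the point is that sampling alone recovers the bias, with a confidence interval, without inspecting the die's internals:

```python
import random
from collections import Counter

random.seed(0)  # reproducible for the sketch

# Hypothetical loaded die: face 6 is weighted 5x (assumption for illustration).
def roll_loaded_die():
    return random.choices([1, 2, 3, 4, 5, 6],
                          weights=[1, 1, 1, 1, 1, 5])[0]

# Quantify the bias purely from observed outputs -- no "molecular structure" needed.
n = 10_000
counts = Counter(roll_loaded_die() for _ in range(n))
for face in range(1, 7):
    p_hat = counts[face] / n
    # Normal-approximation 95% confidence interval for this face's probability.
    margin = 1.96 * (p_hat * (1 - p_hat) / n) ** 0.5
    print(f"face {face}: p ~ {p_hat:.3f} +/- {margin:.3f} (fair would be 0.167)")
```

The same shape works for an LLM eval: replace the die roll with "run prompt, score pass/fail" and you get a measured success rate instead of a vibe.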