top | item 42133702


alexkirwan | 1 year ago

Not cynical at all. I think you're highlighting a real problem in the industry, and certainly something we've seen: teams, for a number of reasons (optics, marketing, hype/vibes, experimentation, pressure to adopt AI), use LLMs without proper consideration. That's actually the opposite of what we're advocating for.

The whole point of proper testing is to determine whether an LLM is suitable for your specific task, and then to keep testing and measuring to optimise for the outcome you want. The post is more about testing LLMs at scale, and the use cases we refer to assume a system design took place where the use of an LLM was deemed necessary for the task. Teams absolutely should have the option to determine that an LLM is not "fit for purpose". Reva actually helps with this: good testing and validation during the experimentation stage often reveals when a simpler solution works better. But "pressure" can come in many forms, and I have empathy for teams in unhealthy environments where saying no is not part of the culture.

We're working with teams that have real use cases, and we've seen a real problem with how teams are testing their use of LLMs. It's hard, especially at scale! We built infrastructure that lets you test against your own real historic data, so you can measure actual performance improvements (or regressions) against the business outcome, rather than "yep, looks good!"
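To make that concrete, here is a minimal sketch of what testing against historic data can look like. Everything here is illustrative, not Reva's actual API: `call_llm` is a stub for whatever model call the pipeline makes, the routing task and the baseline number are made up.

```python
# Regression-test an LLM-backed step against historic, already-resolved
# cases with known-correct outcomes, instead of eyeballing outputs.

def call_llm(ticket_text: str) -> str:
    """Hypothetical stub for the real model call: route a support ticket."""
    return "billing" if "invoice" in ticket_text.lower() else "support"

# Historic examples where the correct business outcome is already known.
historic_cases = [
    {"input": "Where is my invoice for March?", "expected": "billing"},
    {"input": "The app crashes on login", "expected": "support"},
    {"input": "Please resend invoice #123", "expected": "billing"},
]

def evaluate(cases) -> float:
    """Score against the business outcome, not 'yep, looks good!'."""
    hits = sum(1 for c in cases if call_llm(c["input"]) == c["expected"])
    return hits / len(cases)

# Accuracy of the previous prompt/model version (illustrative number).
BASELINE_ACCURACY = 0.66

accuracy = evaluate(historic_cases)
print(f"accuracy={accuracy:.2f}")
assert accuracy >= BASELINE_ACCURACY, "regression against previous version"
```

The useful part is the baseline comparison: any prompt or model change gets a number you can compare against the last known-good version before it ships.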
