deepsquirrelnet | 1 month ago
Stop prompt engineering, put down the crayons. Statistical model outputs need to be evaluated.
deepsquirrelnet | 1 month ago
It’s shocking to me how often it happens. Aside from just the necessity to be able to prove something works, there are so many other benefits.
Cost and model commoditization are part of it, like you point out. There's also the potential for degraded performance because off-the-shelf benchmarks aren't generalizing the way you expect. Add to that an inability to migrate to newer models as they come out, potentially leaving performance on the table. There are something like 95 serverless models in Bedrock now, and as soon as you can evaluate them on your task they immediately become a commodity.
But fundamentally you can’t even justify any time spent on prompt engineering if you don’t have a framework to evaluate changes.
Evaluation has been a critical practice in machine learning for years. IMO it's no less imperative when building with LLMs.
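To make the point concrete, here's a minimal sketch of what a task-specific eval harness could look like. The model functions, dataset, and exact-match scorer are all hypothetical stand-ins; in practice the models would wrap API calls (e.g. to Bedrock) and the scorer would match your task.

```python
def exact_match(prediction: str, expected: str) -> float:
    """Score 1.0 if the normalized output matches, else 0.0."""
    return float(prediction.strip().lower() == expected.strip().lower())


def evaluate(model, dataset, scorer=exact_match) -> float:
    """Run a model over (input, expected) pairs and return the mean score."""
    scores = [scorer(model(prompt), expected) for prompt, expected in dataset]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for two candidate models.
def model_a(prompt: str) -> str:
    # Only "knows" one answer.
    return "Paris" if "france" in prompt.lower() else "unknown"


def model_b(prompt: str) -> str:
    answers = {"france": "Paris", "spain": "Madrid"}
    for key, value in answers.items():
        if key in prompt.lower():
            return value
    return "unknown"


dataset = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Spain?", "Madrid"),
]

for name, model in [("model_a", model_a), ("model_b", model_b)]:
    print(f"{name}: {evaluate(model, dataset):.2f}")
```

Once this loop exists, swapping in a new model or a new prompt variant is one line, and every change (prompt tweak, model migration) gets a number attached to it instead of a vibe.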