top | item 44057382

(no title)

keithwhor | 9 months ago

I am building a company in this space, so can hopefully give some insight [0].

The issue right now is that both (1) function calling and (2) codegen just aren't really very good. The hype train far exceeds capabilities. Giving great demos like fetching some Stripe customers, generating an email or getting the weather work flawlessly. But anything more sophisticated goes off the rails very quickly. It's difficult to get models to reliably call functions with the right parameters, to set up multi-step workflows and more.

Add codegen into the mix and it's hairier. You need a deployment and testing apparatus to make sure the code actually works... and then what is it doing? Does it need secret keys to make web requests to other services? Should we rely on functions for those?

The price / performance curve is a consideration, too. Good models are slow and expensive. Which means their utility has to be higher in order to charge a customer to pay for the costs, but they also take a lot longer to respond to requests which reduces perception of value. Codegen is even slower in this case. So there's a lot of alpha in finding the right "mixture of models" that can plan and execute functions quickly and accurately.

For example, OpenAI's GPT-4.1-nano is the fastest function calling model on the market. But it routinely tries to execute the same function twice in parallel. So if you combine it with another fast model, like Gemini Flash, you can reduce error rates - e.g. 4.1-nano does planning, Flash executes. But this is non-obvious to anybody building these systems until they've tried and failed countless times.

I hope to see capabilities improve and costs and latency trend downwards, but what you're suggesting isn't quite feasible yet. That said I (and many others) are interested in making it happen!

[0] https://instant.bot

discuss

deadbabe|9 months ago

Well in the mean time we could just have the LLM shoot Jira tickets at human developers to build out new tools it requires ASAP? And until it’s done have a placeholder message returned to the client? Could be a good way to keep developers working constantly. And eventually when the tech is good you replace the human devs with LLMs.