
Hard stuff when building products with LLMs

249 points | mavelikara | 2 years ago | honeycomb.io

109 comments

[+] muglug|2 years ago|reply
I predict there will be another six months of these sorts of articles, accompanied by a raft of LLM-powered features that aren't nearly as transformative as the people currently hyping AI are telling us to expect.

The engineers I know whose job it is to implement LLM features are much more skeptical about the near future than the engineers who are interested in the topic but lack hands-on experience.

[+] evrydayhustling|2 years ago|reply
The main thing LLMs can do is make products accessible/useful to a wider range of users - either by parsing their intents or by translating outputs to their needs.

This might result in a sort of transformation that engineers and power users aren't geared to appreciate. You might look at a natural-language log query and say, "that would actually slow me down". But if it suddenly makes Honeycomb useful to stakeholders who couldn't use it before, it could lead to use cases that aren't on the radar right now.

[+] alaskamiller|2 years ago|reply
The last chatbot wave (when FB opened up the Messenger API, when Microsoft slung Skype as a plausible bot platform, and when Slack rebranded their app store) fizzled out after 18 months.

All to figure out the singular most important thing: chat interfaces are the worst.

[+] deet|2 years ago|reply
I suspect you're right for how people are using and deploying LLMs now: hacking all kinds of functionality out of a text-completion model that, although it encodes a ton of data and some reasoning, is fundamentally still a text-completion model, and, when deployed via today's commercial APIs without fine-tuning, isn't flexible beyond what prompt engineering, chaining, etc. make possible.

But I think we've only scratched the surface as to what LLMs fine-tuned on specific tasks, especially for abstract reasoning over narrow domains, could do.

These applications possibly won't look anything like the chat interfaces that people are getting excited about now, and fine-tuning is not as accessible as prompt engineering. But there's a whole lot more to explore.

[+] tbalsam|2 years ago|reply
Am an ML engineer, was around long before LLMs. Definitely am skeptical. I think I know one dimension where we're missing performance (a number of people do): steerability/controllability without losing performance. It's just something that's quite hard to do, tbh.

Those who figure out how to do that well will have quite a legacy under their belts, and money if they're a profit-making company and handle it well.

It's not a question of whether it can be done; it's not hard to lay out the math and prove that pretty trivially if you know where to look. Actually doing it, in a way where the inductive bias translates appropriately to the solution at hand, is the hard part.

[+] herval|2 years ago|reply
> The engineers I know whose job it is to implement LLM features are much more skeptical about the near future than the engineers who are interested in the topic but lack hands-on experience.

Isn't that always the case?

[+] dinvlad|2 years ago|reply
This will be entertaining (if dangerous) to watch, as people hopefully become disillusioned with the overselling and overhyping of this not-new-tech-but-wrapped-in-marketing-bs wave of ‘AI’. But, history shows we rarely learn from past mistakes.

I also hope there will be a whistleblower within OpenAI (and others like it) who exposes its internal practices and all of the hypocrisy surrounding it. And, as they say, the fish usually rots from the head.

[+] coffeebeqn|2 years ago|reply
I tried building with LLMs, but they have the basic problem of being totally wrong 20-50% of the time. That's very meaningful for most business cases. They do fine when accuracy isn't important, but that's fairly rare outside of writing throwaway content.
[+] typpo|2 years ago|reply
This is a great summary of why productionizing LLMs is hard. I'm working on a couple LLM products, including one that's in production for >10 million users.

The lack of formal tooling for prompt engineering drives me bonkers, and it compounds the problems outlined in the article around correctness and chaining.

Then there are the hot takes on Twitter from people claiming prompt engineering will soon be obsolete, or people selling blind prompts without any quality metrics. It's surprisingly hard to get LLMs to do _exactly_ what you want.

I'm building an open-source framework for systematically measuring prompt quality [0], inspired by best practices for traditional engineering systems.

0. https://github.com/typpo/promptfoo
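In the spirit of what the comment describes (and not promptfoo's actual API), a systematic prompt-eval loop can be sketched like this; the harness, variant names, and stubbed model are all illustrative:

```python
def eval_prompts(variants, cases, llm):
    """Run each prompt variant over a set of test cases and score it
    by how many expectations pass. `llm` is any callable prompt -> text,
    stubbed below so the sketch runs offline."""
    results = {}
    for name, template in variants.items():
        passed = 0
        for case in cases:
            out = llm(template.format(**case["vars"]))
            if case["expect"] in out:
                passed += 1
        results[name] = passed / len(cases)
    return results

# Stub model so the harness runs without network access.
fake_llm = lambda prompt: "Paris" if "France" in prompt else "unsure"

variants = {"v1": "Capital of {country}?", "v2": "Name a city in {country}."}
cases = [{"vars": {"country": "France"}, "expect": "Paris"}]
print(eval_prompts(variants, cases, fake_llm))
```

The point is simply that prompt quality becomes a pass rate you can track across variants, rather than a vibe.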

[+] chaxor|2 years ago|reply
If you have a task that requires the kind of precision suggested by "_exactly_", then a full LLM is probably not the answer anyway. Try distilling step by step, especially if the goal is to generate a DSL or some restricted language. It can be helpful to give the model a different set of tokens for decoding, such that the only possible output is something like 'ATTCGGTCCCGGG' given a question asking it to predict a DNA sequence.
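The restricted-decoding idea above can be sketched by masking the model's logits so only allowed tokens can win. This toy (vocabulary, scores, and helper names are all invented for illustration, with no real model involved) shows the mechanism:

```python
import math

def mask_logits(logits, allowed_ids):
    """Return a copy of `logits` with every position outside
    `allowed_ids` forced to -inf, so argmax/softmax can only
    ever pick an allowed token."""
    return [x if i in allowed_ids else -math.inf
            for i, x in enumerate(logits)]

# Toy vocabulary: only the DNA letters are legal outputs.
vocab = ["A", "T", "C", "G", "the", "and", "hello"]
allowed = {i for i, tok in enumerate(vocab) if tok in "ATCG"}

logits = [0.1, 0.3, 0.2, 0.5, 2.0, 1.5, 1.0]  # model "prefers" the word "the"
masked = mask_logits(logits, allowed)
best = vocab[masked.index(max(masked))]
print(best)  # "G": the highest-scoring *allowed* token
```

In a real decoder you would apply this mask at every step, so the output is guaranteed to stay inside the restricted language.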
[+] anotherpaulg|2 years ago|reply
This looks really useful.

Any thoughts on managing costs? I've been developing against gpt-4, and it runs up charges quickly. I've been thinking I will need to be careful about adding live api calls in any sort of testing situations.

Wondering if your tool has any features to help avoid/minimize wasted api usage?

[+] darkteflon|2 years ago|reply
This looks excellent, thank you - really nails the UI. Going to use this this week.
[+] jmccarthy|2 years ago|reply
Very nice, thank you! Will give it a try.
[+] fswd|2 years ago|reply
I could add a couple of things from my own experience. Storing prompts in a database seemed like a good idea, but in practice it ended up being a disaster. Storing the prompt in a Python/TypeScript file, up front at the top, works well. Using the OpenAI playground with its ability to export a prompt works well; something in Gradio running in VS Code with debugging enabled works even better. Few-shot prompting with refinements works really well. LangChain did not work well for any of my cases; I might go as boldly as saying that using LangChain is bad practice.
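The "prompt at the top of the source file, with few-shot examples" pattern might look like this minimal Python sketch (the prompt text and function names are made up for illustration):

```python
# prompts.py -- the prompt lives at the top of the file, under version
# control, right next to the code that uses it.
SUMMARIZE_PROMPT = """\
You are a support-ticket summarizer.

Example:
Ticket: "App crashes when I upload a photo over 10MB."
Summary: Crash on large photo upload.

Ticket: "{ticket}"
Summary:"""

def build_prompt(ticket: str) -> str:
    """Fill the user's ticket into the few-shot template."""
    return SUMMARIZE_PROMPT.format(ticket=ticket)

print(build_prompt("Login button does nothing on Safari."))
```

Because the prompt is a plain constant, diffs, reviews, and rollbacks all work exactly like they do for code.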
[+] phillipcarter|2 years ago|reply
It's delightfully hacky, but we actually have our prompt (that we parameterize later) stored in a feature flag right now, with a few variations! I actually can't believe we shipped with that, but hey, it works? Each variation is pulled from a specific version in a separate repo where we iterate on the prompt.

We're going to likely settle on just storing whatever version of the prompt is considered "stable" as a source file, but for now this isn't actively hurting us, as far as we can tell, and there's a lot of prompt engineering left to do.

[+] ukuina|2 years ago|reply
So much "Yes!" for LangChain being bad practice. An unnecessarily bloated abstraction over what should be simple API calls.
[+] Dwood023|2 years ago|reply
Could you explain how storing prompts was a disaster?
[+] darkteflon|2 years ago|reply
I’m keeping all our prompts in a json file (along with some helpful metadata for us humans).

No idea if I’m doing it right.
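A prompts-plus-metadata JSON file of the kind described might look something like this (structure and field names are purely illustrative):

```python
import json

# In practice this string would live in its own prompts.json file.
PROMPTS_JSON = """
{
  "summarize_ticket": {
    "template": "Summarize this ticket: {ticket}",
    "owner": "support-team",
    "last_reviewed": "2023-05-01"
  }
}
"""

prompts = json.loads(PROMPTS_JSON)
filled = prompts["summarize_ticket"]["template"].format(ticket="App crashes")
print(filled)  # Summarize this ticket: App crashes
```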

[+] binarymax|2 years ago|reply
The first problem, context window size, is going to bite a lot of people.

The gotcha is that it’s a search problem. The article mentions embeddings and dot product, but that’s the most basic and naive search you can do. Search is information retrieval, and it’s a huge problem space.

You need a proper retriever that you tune for relevance. You should use a search engine for this, and have multiple features and do reranking.

That's the only way to crack the context window problem. But the good news is that once you do, things get much, much better! You can then apply your search/retriever skills to all kinds of other problems, because search is really the backbone of all AI.
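A toy version of the idea of combining multiple features and reranking, rather than trusting a raw dot product: the 0.5/0.5 weights and the two-dimensional stand-in embeddings are invented here; a real retriever would tune these against relevance judgments.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def keyword_score(query, doc):
    """Crude lexical feature: fraction of query words present in the doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def rerank(query, query_vec, docs):
    """docs: list of (text, embedding). Blend a lexical feature with
    embedding similarity, then sort best-first."""
    scored = [
        (0.5 * keyword_score(query, text) + 0.5 * cosine(query_vec, vec), text)
        for text, vec in docs
    ]
    return [text for score, text in sorted(scored, reverse=True)]

docs = [("slow query latency spike", [1.0, 0.0]),
        ("release notes for v2", [0.9, 0.1])]
top = rerank("why is my query slow", [1.0, 0.0], docs)[0]
print(top)  # "slow query latency spike"
```

Even this crude blend can reorder results that pure embedding similarity would rank almost identically, which is the comment's point: retrieval is a feature-engineering problem, not a single dot product.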

[+] darkteflon|2 years ago|reply
Such a great comment. The nice thing about this is that if search was a key part of your product pre-LLM, you likely already have something useful in place that requires very little adaptation.
[+] kristjansson|2 years ago|reply
Exactly. LLMs are incredible at information processing, and ok-to-terrible at information retrieval. All LLM applications that rely on accurate information are either infeasible or kick the entire can to the retrieval component.
[+] phillipcarter|2 years ago|reply
Yeah, we're definitely learning this. It's actually promising how well a very simple cosine similarity pass on data before sending it to an LLM can do [0]. But as we're learning, each further step towards accuracy is bigger and bigger, and there don't appear to be any turnkey solutions you can pay for right now.

[0]: https://twitter.com/_cartermp/status/1657037648400117760

[+] azinman2|2 years ago|reply
Context window size will eventually be solved, likely with its own trade-offs.
[+] ZephyrBlu|2 years ago|reply
I would argue a more appropriate title would be something about integrating LLMs into complex products.

A lot of the problems are much more easily solved when you're working on a new product from scratch.

Also, I think LLMs work much better with structured data when you use them as a selector instead of a generator.

Asking an LLM to generate a structured schema is a bad idea. Ask it to pick from a set of pre-defined schemas instead, for example.

You're not using LLMs for their schema generating ability, you're using them for their intelligence and creativity. Don't make them do things they're not good at.
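The selector-not-generator idea can be sketched like this; the schema names and the stubbed model are invented for illustration, not anyone's actual templates:

```python
SCHEMAS = {
    "heatmap": {"calc": "HEATMAP(duration_ms)"},
    "count_by_status": {"calc": "COUNT", "group_by": "status_code"},
    "p99_latency": {"calc": "P99(duration_ms)"},
}

def choose_schema(user_question, llm):
    """Ask the model to *select* one key from SCHEMAS rather than
    generate a schema from scratch, then validate its answer.
    `llm` is any callable prompt -> text (stubbed below)."""
    prompt = (
        "Pick exactly one of these query templates for the question.\n"
        f"Options: {', '.join(SCHEMAS)}\n"
        f"Question: {user_question}\n"
        "Answer with the option name only."
    )
    answer = llm(prompt).strip().lower()
    if answer not in SCHEMAS:
        raise ValueError(f"model picked an unknown schema: {answer!r}")
    return SCHEMAS[answer]

# Stub model so the sketch runs offline.
fake_llm = lambda prompt: "p99_latency"
print(choose_schema("what's our worst-case latency?", fake_llm))
```

The validation step is the payoff: a selector's output space is finite, so a bad answer is trivially detectable, whereas a freely generated schema can be subtly wrong in unbounded ways.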

[+] selalipop|2 years ago|reply
That's how I built notionsmith.ai

I don't go directly to an LLM asking for structured data, or even a final answer, so you can type literally anything into the entry field and get a useful result.

People are trying to treat them as conversational, but I'd say for most products it'll be rare to ever want more than one response for a given system prompt, and instead you'll want to build a final answer procedurally.

[+] phillipcarter|2 years ago|reply
Something we're looking to experiment with is asking the LLM to produce pieces of things that we then construct a query from, rather than ask it to also assemble it. The hypothesis is that it's more likely to produce things we can "work with" that are also "interesting" or "useful" to users.

FWIW we have about a ~7% failure rate (meaning it fails to produce a valid, runnable query) after some work done to correct what we consider correctable outputs. Not terrible, but we think the above idea could help with that.
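One way the "produce pieces, assemble the query ourselves" idea could look, including the kind of correctable-output normalization the comment mentions (the field and op names are illustrative, not Honeycomb's real grammar):

```python
import json

VALID_FIELDS = {"duration_ms", "status_code", "service.name"}
VALID_OPS = {"COUNT", "AVG", "P99"}

def assemble_query(llm_output: str) -> str:
    """Parse JSON *pieces* from the model, normalize the correctable
    bits (case, stray whitespace), validate each piece, and only then
    build the final query string ourselves."""
    pieces = json.loads(llm_output)
    op = pieces["op"].strip().upper()        # correctable: case/whitespace
    field = pieces["field"].strip().lower()  # correctable: case/whitespace
    if op not in VALID_OPS:
        raise ValueError(f"unknown op {op!r}")
    if field not in VALID_FIELDS:
        raise ValueError(f"unknown field {field!r}")
    return f"{op}({field})"

print(assemble_query('{"op": " p99 ", "field": "Duration_ms"}'))
# P99(duration_ms)
```

Because the model only supplies small, independently checkable fragments, a failure points at exactly which piece was bad instead of an entire unrunnable query.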

[+] joelm|2 years ago|reply
Latency has been the biggest challenge for me.

They cite "two to 15+ seconds" in this blog post for responses. Via the OpenAI API I've been seeing more like 45-60 seconds for responses (using GPT-3.5-turbo or GPT-4 in chat mode). Note, this is using ~3500 tokens total.

I've had to extensively adapt to that latency in the UI of our product. Maybe I should start showing funny messages while the user is waiting (like I've seen porkbun do when you pay for domain names).
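One common way to adapt a UI to this kind of latency is to stream partial output so the user sees progress immediately. This sketch fakes the streaming API with a local generator so it runs offline; a real integration would consume deltas from the provider's streaming endpoint instead:

```python
def fake_stream(answer, chunk=6):
    """Stand-in for a streaming completion API: yields the answer a few
    characters at a time instead of blocking until it is finished."""
    for i in range(0, len(answer), chunk):
        yield answer[i:i + chunk]

def render(stream):
    """Accumulate deltas as they arrive; a 45-second response feels
    alive if something appears within the first second."""
    shown = ""
    for delta in stream:
        shown += delta
        # A terminal/web UI would repaint `shown` here on each delta.
    return shown

print(render(fake_stream("Here is your query result...")))
```

Streaming doesn't reduce total latency, but it moves time-to-first-token from tens of seconds to roughly one, which is usually what users actually perceive.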

[+] phillipcarter|2 years ago|reply
Was this in the past week? We had much worse latency this past week compared to the rest (in addition to model-unavailability errors), which we attributed to the Microsoft Build conference. One of our customers that uses it a lot is always at the token limit, and their average latency was ~5 seconds, but it was closer to 10 seconds last week.

...also why we can't wait for other vendors to get SOC I/II clearance, and I guess eventually fine-tuning our own model, so we're not stuck with situations like this.

[+] simonw|2 years ago|reply
I think this may be the best thing I've read about real-world prompt engineering - there's SO MUCH hard earned knowledge in here.

The description of how they're handling the threat of prompt injection was particularly smart.

[+] softfalcon|2 years ago|reply
We had a hack-a-thon at my company around using AI tooling with respect to our products. The topics mentioned in this article are real and come up quickly when trying to make a real product that interfaces with an AI-API.

This was so true that an obvious chunk of teams in the hack-a-thon didn't even bother doing anything more than a fancy version of asking ChatGPT "where should I go for dinner in Brooklyn?", or straight up failed to deliver even a concept of a product.

Asking a clear question and harvesting accurate results from AI prompts is far more difficult than you might think it would be.

[+] devjab|2 years ago|reply
This seems more like a list of everything everybody is talking about while skipping the “how to make a profit” hard stuff.
[+] beachy|2 years ago|reply
The article has done its job if you come away with "honeycomb query" ringing in your ears.
[+] JimtheCoder|2 years ago|reply
"while skipping the “how to make a profit” hard stuff."

Some things never change...

[+] fogx|2 years ago|reply
> especially if those inputs are extremely vague (how on earth should we interpret slow?!)

Isn't this exactly the (theoretical) strength of a chatbot: asking follow-up questions to remove uncertainty?

[+] andy99|2 years ago|reply
I'd call all of these things specific cases of the general problems we've faced with using neural networks for years. There's a big gap between demo and product. On one hand, OpenAI has built a great product; on the other hand, it's not yet clear whether downstream users will be able to do the same.

http://marble.onl/posts/into-the-great-wide-open-ai.html

[+] commandlinefan|2 years ago|reply
It’s hard to build a product using humans - I don’t know why anybody would think using AI would make that part any easier.
[+] bluepoint|2 years ago|reply
Can someone kindly explain what is meant by the use of embeddings? More specifically, how are those embeddings calculated?
[+] fnordpiglet|2 years ago|reply
TL;DR new technologies are full of sharp edges and no blueprint for success. Engineering is still hard.
[+] hobs|2 years ago|reply
Engineering will always be hard, but I think a lot of this current AI hype cycle doesn't even have a product; it's just "well, that's cool, so I want that."
[+] laratied|2 years ago|reply
It is also the mindset. Imagine going on the internet for the first time and writing it off because you can't see what "products" can be built.