I predict there will be another six months of these sorts of articles, accompanied by a raft of LLM-powered features that aren't nearly as transformative as the people currently hyping AI are telling us to expect.
The engineers I know whose job it is to implement LLM features are much more skeptical about the near future than the engineers who are interested in the topic but lack hands-on experience.
The main thing LLMs can do is make products accessible/useful to a wider range of users - either by parsing their intents or by translating outputs to their needs.
This might result in a sort of transformation that engineers and power users aren't geared to appreciate. You might look at a natural language log query and say, "that would actually slow me down". But if it makes Honeycomb suddenly useful to stakeholders who couldn't before, it could lead to use cases not on the radar right now.
The last chatbot wave (when FB opened up the Messenger API, when Microsoft slung Skype as a plausible bot platform, and when Slack rebranded their app store) fizzled out after 18 months.
All to figure out the singular most important thing: chat interfaces are the worst.
I suspect you're right about how people are using and deploying LLMs now: hacking all kinds of functionality out of a text-completion model that, although it encodes a ton of data and some reasoning, is fundamentally still a text-completion model. Deployed via commercial APIs as they are today, without fine-tuning, they're not flexible beyond what prompt engineering, chaining, etc. make possible.
But I think we've only scratched the surface as to what LLMs fine-tuned on specific tasks, especially for abstract reasoning over narrow domains, could do.
These applications possibly won't look anything like the chat interfaces that people are getting excited about now, and fine-tuning is not as accessible as prompt engineering. But there's a whole lot more to explore.
I'm an ML engineer and was around long before LLMs. I'm definitely skeptical, and I think I know one dimension where we're missing performance (a number of people do): steerability/controllability without losing performance. It's just something that's quite hard to do, honestly.
Those who figure out how to do that well will have quite a legacy under their belts, and money if they're a profit-making company and handle it well.
It's not a question of whether it can be done; it's not hard to lay out the math and prove that pretty trivially if you know where to look. Actually doing it, in a way where the inductive bias translates appropriately to the solution at hand, is the hard part.
> The engineers I know whose job it is to implement LLM features are much more skeptical about the near future than the engineers who are interested in the topic but lack hands-on experience.
This will be entertaining (if dangerous) to watch, as people hopefully become disillusioned with the overselling and overhyping of this not-new-tech-but-wrapped-in-marketing-bs wave of ‘AI’. But, history shows we rarely learn from past mistakes.
I also hope there will be some whistleblower within OpenAI (and others like it) who exposes its internal practices and all of the hypocrisy surrounding it. And, usually the fish rots from the head, as they say.
I tried building with LLMs, but there's the basic problem that they're totally wrong 20-50% of the time. That's very meaningful for most business cases. They do fine when accuracy isn't important, but that's fairly rare outside of writing throwaway content.
This is a great summary of why productionizing LLMs is hard. I'm working on a couple LLM products, including one that's in production for >10 million users.
The lack of formal tooling for prompt engineering drives me bonkers, and it compounds the problems outlined in the article around correctness and chaining.
Then there are the hot takes on Twitter from people claiming prompt engineering will soon be obsolete, or people selling blind prompts without any quality metrics. It's surprisingly hard to get LLMs to do _exactly_ what you want.
I'm building an open-source framework for systematically measuring prompt quality [0], inspired by best practices for traditional engineering systems.
[0]: https://github.com/typpo/promptfoo
If you have a task that requires something "exact", then a full LLM is probably not the answer anyway. Try distilling step by step, especially if the goal is to generate a DSL or some restricted language. It can be helpful to restrict the set of tokens available to the model during decoding, such that the only possible outcome is something like 'ATTCGGTCCCGGG' given a question asking it to predict a DNA sequence.
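To make the constrained-decoding idea concrete, here's a toy sketch in pure Python with an invented seven-token vocabulary; a real implementation would mask logits inside the model's decoding loop rather than post-hoc like this:

```python
import math

# Invented toy vocabulary; a real tokenizer has tens of thousands of tokens.
VOCAB = ["A", "T", "C", "G", "hello", "the", "<eos>"]
ALLOWED = {"A", "T", "C", "G", "<eos>"}  # only DNA bases (plus end-of-sequence)

def constrained_argmax(logits):
    """Pick the highest-scoring token, considering only the allowed set."""
    best_tok, best_score = None, -math.inf
    for tok, score in zip(VOCAB, logits):
        if tok in ALLOWED and score > best_score:
            best_tok, best_score = tok, score
    return best_tok

# Even though the raw model prefers "hello" (score 2.0), the constraint
# forces a DNA base.
logits = [0.1, 0.3, 0.2, 0.05, 2.0, 1.5, 0.0]
print(constrained_argmax(logits))  # "T"
```

Applied at every decoding step, this guarantees the output is drawn from the restricted language by construction, instead of hoping the prompt keeps the model on track.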
Any thoughts on managing costs? I've been developing against GPT-4, and it runs up charges quickly. I've been thinking I will need to be careful about adding live API calls in any sort of testing situation.
Wondering if your tool has any features to help avoid/minimize wasted API usage?
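One common mitigation, independent of any particular tool, is caching completions during development so repeated test runs don't pay for the same prompt twice. A minimal sketch, with `call_model` as an invented stand-in rather than a real client:

```python
import hashlib
import json

_cache = {}

def cached_completion(model, prompt, call_model):
    """Return a cached completion, only calling the API on a cache miss."""
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(model, prompt)
    return _cache[key]

# A fake model lets us verify the API is hit exactly once per unique prompt.
calls = []
def fake_model(model, prompt):
    calls.append(prompt)
    return f"echo: {prompt}"

print(cached_completion("gpt-4", "hi", fake_model))  # echo: hi
print(cached_completion("gpt-4", "hi", fake_model))  # served from cache
print(len(calls))  # 1
```

A disk-backed version of the same idea also makes test runs deterministic, which helps when comparing prompt variants.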
I could add a couple of things from my own experience. Storing prompts in a database seemed like a good idea, but in practice it ended up being a disaster. Storing the prompt in a Python/TypeScript file, up front at the top, works well. Using the OpenAI Playground with its ability to export a prompt works well; something in Gradio running in VS Code with debugging mode works even better. Few-shot prompting with refinements works really well. LangChain did not work well for any of my cases; I might go so far as to say that using LangChain is bad practice.
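For what it's worth, the "prompt at the top of the source file" pattern can be as simple as this sketch (the prompt text and message shape here are invented for illustration):

```python
# The prompt lives at the top of the file, versioned alongside the code.
SYSTEM_PROMPT = """You are a query assistant.
Given a user request, respond with a single runnable query."""

def build_messages(user_input):
    """Assemble a chat-completion message list around the fixed prompt."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_input},
    ]

msgs = build_messages("show me slow requests")
print(msgs[0]["role"])  # system
```

The payoff is that prompt changes show up in code review and `git blame`, which a database row never gives you.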
It's delightfully hacky, but we actually have our prompt (that we parameterize later) stored in a feature flag right now, with a few variations! I actually can't believe we shipped with that, but hey, it works? Each variation is pulled from a specific version in a separate repo where we iterate on the prompt.
We're going to likely settle on just storing whatever version of the prompt is considered "stable" as a source file, but for now this isn't actively hurting us, as far as we can tell, and there's a lot of prompt engineering left to do.
The first problem, context window size, is going to bite a lot of people.
The gotcha is that it’s a search problem. The article mentions embeddings and dot product, but that’s the most basic and naive search you can do. Search is information retrieval, and it’s a huge problem space.
You need a proper retriever that you tune for relevance. You should use a search engine for this, and have multiple features and do reranking.
That's the only way to crack the context window problem. But the good news is that once you do, things get much, much better! You can then apply your search/retriever skills to all kinds of other problems, because search is really the backbone of all AI.
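A toy illustration of the retrieve-then-rerank shape described above, with both scorers as deliberately simplistic stand-ins (a real first stage might be BM25, and a real reranker would combine many more features):

```python
def recall(query, docs, n=10):
    """First stage: cheap keyword-overlap recall over the whole corpus."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.lower().split())), d) for d in docs]
    return [d for s, d in sorted(scored, reverse=True)[:n] if s > 0]

def rerank(query, candidates):
    """Second stage: combine features (here: overlap plus a brevity bonus)."""
    terms = set(query.lower().split())
    def score(d):
        words = d.lower().split()
        return len(terms & set(words)) - 0.01 * len(words)
    return sorted(candidates, key=score, reverse=True)

docs = [
    "slow query latency",
    "latency spikes in checkout service last week",
    "billing export",
]
print(rerank("checkout latency", recall("checkout latency", docs)))
```

The point is the two-stage structure: a fast, high-recall pass narrows the corpus, and a more expensive pass decides what actually fits in the context window.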
Such a great comment. The nice thing about this is that if search was a key part of your product pre-LLM, you likely already have something useful in place that requires very little adaptation.
Exactly. LLMs are incredible at information processing, and ok-to-terrible at information retrieval. All LLM applications that rely on accurate information are either infeasible or kick the entire can to the retrieval component.
Yeah, we're definitely learning this. It's actually promising how well a very simple cosine similarity pass on data before sending it to an LLM can do [0]. But as we're learning, each further step towards accuracy is bigger and bigger, and there don't appear to be any turnkey solutions you can pay for right now.
[0]: https://twitter.com/_cartermp/status/1657037648400117760
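As a concrete reference point, the "simple cosine similarity pass" can be sketched like this, with toy two-dimensional vectors standing in for real embeddings from an embedding model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, docs, k=2):
    """docs: list of (doc_id, embedding) pairs; keep the k most similar."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy corpus: names and vectors are invented for illustration.
docs = [
    ("latency_doc", [1.0, 0.0]),
    ("billing_doc", [0.0, 1.0]),
    ("errors_doc", [0.7, 0.7]),
]
print(top_k([1.0, 0.1], docs))  # ['latency_doc', 'errors_doc']
```

Only the top-k survivors get spent against the context window, which is the whole trick.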
I would argue a more appropriate title would be something about integrating LLMs into complex products.
A lot of the problems are much more easily solved when you're working on a new product from scratch.
Also, I think LLMs work much better with structured data when you use them as a selector instead of a generator.
Asking an LLM to generate a structured schema is a bad idea. Ask it to pick from a set of pre-defined schemas instead, for example.
You're not using LLMs for their schema generating ability, you're using them for their intelligence and creativity. Don't make them do things they're not good at.
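A sketch of the selector pattern: ask the model to name one of a fixed set of schemas, then validate its free-text answer in ordinary code (the schema names here are invented):

```python
ALLOWED_SCHEMAS = {"heatmap", "time_series", "table"}

def parse_selection(model_output):
    """Map the model's free-text answer onto one of the allowed schemas."""
    cleaned = model_output.strip().lower()
    for schema in ALLOWED_SCHEMAS:
        if schema in cleaned:
            return schema
    return "table"  # safe default when the model goes off-script

print(parse_selection("I'd use a time_series chart here."))  # time_series
print(parse_selection("something else entirely"))  # table
```

Because the output space is closed, a bad model answer degrades to a default instead of producing an invalid schema.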
I don't go directly to an LLM asking for structured data, or even a final answer, so you can type literally anything into the entry field and get a useful result.
People are trying to treat them as conversational, but I'd say for most products it'll be rare to ever want more than one response for a given system prompt, and instead you'll want to build a final answer procedurally.
Something we're looking to experiment with is asking the LLM to produce pieces of things that we then construct a query from, rather than ask it to also assemble it. The hypothesis is that it's more likely to produce things we can "work with" that are also "interesting" or "useful" to users.
FWIW we have about a ~7% failure rate (meaning it fails to produce a valid, runnable query) after some work done to correct what we consider correctable outputs. Not terrible, but we think the above idea could help with that.
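One way the "produce pieces, assemble in code" idea might look, with invented field names and query shape; a malformed piece becomes a correctable failure instead of an unrunnable query:

```python
def assemble_query(pieces):
    """Build a runnable query from model-produced pieces, validating each."""
    where = pieces.get("where")
    group_by = pieces.get("group_by")
    if group_by not in {"service", "endpoint", None}:
        return None  # unknown grouping: correctable failure, not a bad query
    query = {"calculations": [{"op": "COUNT"}]}
    if where:
        query["filters"] = [where]
    if group_by:
        query["breakdowns"] = [group_by]
    return query

print(assemble_query({"where": "duration_ms > 500", "group_by": "service"}))
```

The model only has to get small, independently checkable pieces right, and the code guarantees whatever it assembles is syntactically valid.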
They cite "two to 15+ seconds" in this blog post for responses. Via the OpenAI API I've been seeing more like 45-60 seconds for responses (using GPT-3.5-turbo or GPT-4 in chat mode). Note, this is using ~3500 tokens total.
I've had to extensively adapt to that latency in the UI of our product. Maybe I should start showing funny messages while the user is waiting (like I've seen porkbun do when you pay for domain names).
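A sketch of one way to adapt a UI to that latency (not the product's actual code): run the slow call against a deadline and surface a waiting message when it's exceeded:

```python
import concurrent.futures
import time

def with_deadline(fn, seconds, fallback):
    """Run fn with a time budget; return fallback if it isn't done in time."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=seconds)
        except concurrent.futures.TimeoutError:
            return fallback

def slow_call():
    time.sleep(0.2)  # stand-in for a multi-second LLM API call
    return "model answer"

print(with_deadline(slow_call, 1.0, "still thinking..."))  # model answer
print(with_deadline(slow_call, 0.05, "still thinking..."))  # still thinking...
```

In a real UI you'd keep polling the future and stream the answer in when it lands, rather than discarding it, but the deadline structure is the same.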
Was this in the past week? We had much worse latency this past week compared to the rest (in addition to model unavailability errors), which we attributed to the Microsoft Build conference. One of our customers that uses it a lot is always at the token limit, and their average latency was ~5 seconds, but that was closer to 10 seconds last week.
...also why we can't wait for other vendors to get SOC I/II clearance, and I guess eventually fine-tuning our own model, so we're not stuck with situations like this.
We had a hack-a-thon at my company around using AI tooling with respect to our products. The topics mentioned in this article are real and come up quickly when trying to make a real product that interfaces with an AI-API.
This was so true that there was an obvious chunk of teams in the hack-a-thon who didn’t even bother doing anything more than a fancy version of asking ChatGPT “where should I go for dinner in Brooklyn?” or straight up failed to even deliver a concept of a product.
Asking a clear question and harvesting accurate results from AI prompts is far more difficult than you might think it would be.
I'd call all of these things specific cases of the general problems we've faced with using neural networks for years. There's a big gap between demo and product. On one hand, OpenAI has built a great product; on the other hand, it's not yet clear if downstream users will be able to do the same.
Engineering will always be hard, but I think a lot of this current AI hype cycle doesn't even have a product - it's just "well, that's cool, so I want that."
herval: Isn't that always the case?
NicoJuicy:
Photoshop: https://the-decoder.com/adobe-photoshop-can-now-modify-image...
Robotics: https://youtu.be/UuKAp9a6wMs
Home automation: (couldn't find a good video, but it's with Home Assistant)
darkteflon: No idea if I'm doing it right.
simonw: The description of how they're handling the threat of prompt injection was particularly smart.
JimtheCoder: Some things never change...
fogx: Isn't this exactly the (theoretical) strength of a chatbot - asking follow-up questions to remove uncertainty?
andy99: http://marble.onl/posts/into-the-great-wide-open-ai.html