Summarization is much more expensive than vector DBs. Assume you have 1M tokens of context. You could run it all through GPT-4 and summarize the information, but it would cost around $60 (at current prices) and take tens of minutes of GPU time for the inference.
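The cost figure above can be sanity-checked with back-of-envelope arithmetic. The price per 1K input tokens below is an assumption reverse-engineered from the comment's $60 figure, not an official quote:

```python
# Back-of-envelope cost of summarizing a 1M-token corpus with GPT-4.
# PRICE_PER_1K_INPUT is an assumed figure consistent with the $60 claim;
# actual pricing varies by model and date.
TOKENS = 1_000_000
PRICE_PER_1K_INPUT = 0.06  # USD per 1K input tokens (assumption)

cost = TOKENS / 1_000 * PRICE_PER_1K_INPUT
print(f"~${cost:.0f} to push the full corpus through the model once")
```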
Disclaimer: I work for a16z on the infra team, so consider me biased.
If you look through the comments here, folks are mostly referring to keeping, for example, a chat history. No one is doing 1M words of chat. A common pattern is to summarize a chat history and pass that in the prompt.
As for a corpus of documents (which is what you are presumably talking about), there are a couple of problems with what you are saying:

First, you are implying that the content is always new. That's not true for many of the cases folks are talking about solving (like technical support or customer support), so summarizing the corpus is a one-time cost. You might re-run it periodically for updates.
Second, there is an assumption that basic semantic search is the best way to find the most relevant content in a set of documents. That was questionable even before LLMs existed, but with LLMs you are basically betting that a cosine-similarity search over your vectors beats what an LLM can do with a simple table of contents and the question "where should I search?" I haven't seen anyone do a detailed study, but the implicit assumption that semantic search is the best approach for text could easily be a bad one.
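For concreteness, the cosine-similarity search being questioned amounts to something like the sketch below. It assumes you already have embedding vectors for the query and the document chunks (the embedding model itself is out of scope here):

```python
# Minimal sketch of embedding-based retrieval: rank document chunks by
# cosine similarity to the query vector and return the top-k indices.
# How the vectors are produced (the embedding model) is assumed, not shown.
import math

def cosine(a, b):
    # Cosine similarity of two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    # Indices of the k chunks most similar to the query.
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The alternative the comment proposes would replace `top_k` with an LLM call that reads a table of contents and picks where to look.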
Third, it assumes the quantity of data to search through is astronomically large and/or growing faster than the near-certain decreases in inference cost and increases in input-token limits. That will be true for some subset of use cases, but probably not many, and where it is true they'll do something more sophisticated than embeddings and embedding search. They'll probably fine-tune the underlying model on an ongoing basis.
Regardless - the post you guys wrote seems... like a stretch as a definition of what this really is. And, at least on the surface, vector databases appear to be commodity infra. Pinecone might be growing fast now, but how do they ever make much money above their costs? But you guys seem smart, so maybe there is something there?
Chat history may work; it depends on how long it is and on the business model.
I don't quite understand how general summarization would work. If you use an LLM simply to summarize in order to feed the result into a prompt, the summarization needs to be specific to the query, i.e. "summarize what this text says about topic X." You can't summarize long text in a generic way without losing information. Or do I misunderstand the comment?
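The query-specific summarization described above boils down to embedding the question in the summarization prompt. A minimal sketch (the template wording is illustrative, not any particular product's prompt):

```python
# Hypothetical prompt builder for query-specific summarization: the topic
# is baked into the instruction so the model knows what detail to keep.
def query_specific_summary_prompt(text: str, topic: str) -> str:
    return (f"Summarize what the following text says about {topic}. "
            f"Preserve all details relevant to that topic.\n\n{text}")
```

A generic "summarize this" prompt has no such topic and so cannot know which details are safe to drop, which is the information-loss problem the comment raises.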
If you have a perfect table of contents (or better, an index by topic) you may not need semantic search. But in the typical use case we are seeing, you have unstructured data without an index (e.g. tech-support knowledge-base entries, company reports, emails). For that, semantic search works quite well.
On sizes, the observation is that the data people want to search over (e.g. your email, a wiki, JIRA, a knowledge base) is far larger than the context length. You are correct that we assume inference cost and speed won't improve sufficiently quickly in the near future. The why is a longer topic, but in a nutshell GPU speed increases ~2.5x per generation, and other than overtraining relative to Chinchilla we don't see immediate model gains. But that is speculative; we don't know what's in store.
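The scale gap described above is easy to make concrete. The corpus sizes below are illustrative assumptions, not measurements; the context window is GPT-4's base 8K window from the era of this thread:

```python
# Rough ratio of typical corpus sizes to an 8K-token context window.
# All corpus figures are illustrative assumptions for the examples
# named in the comment, not measured data.
CONTEXT_TOKENS = 8_192  # GPT-4 base context window at the time

CORPUS_TOKENS = {
    "personal email archive": 50_000_000,
    "company wiki": 10_000_000,
    "support knowledge base": 5_000_000,
}

for name, tokens in CORPUS_TOKENS.items():
    print(f"{name}: ~{tokens / CONTEXT_TOKENS:,.0f}x the context window")
```

Even the smallest example is hundreds of context windows wide, which is the gap retrieval is meant to bridge.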
To some degree we are just reacting to user adoption in the market. We don't build these systems, but if we see enough of them we eventually recognize the pattern. And while I am optimistic, we could be wrong. AI is a major revolution and we are all students.