
Giving GPT “Infinite” Knowledge

121 points | sudoapps | 2 years ago | sudoapps.substack.com

86 comments

[+] furyofantares|2 years ago|reply
Embeddings-based search is a nice improvement on search, but it's still search. Relative to ChatGPT answering on its training data, I find embeddings-based search to be severely lacking. The right comparison is to traditional search, where it becomes favorable.

It has the same advantages search has over ChatGPT (being able to cite sources, being quite unlikely to hallucinate) and it has some of the advantages ChatGPT has over search (not needing exact query) - but in my experience it's not really in the new category of information discovery that ChatGPT introduced us to.

Maybe with more context I'll change my tune, but it's very much at the whim of the context retrieval finding everything you need to answer the query. That's easy for stuff that search is already good at, and so provides a better interface for search. But it's hard for stuff that search isn't good at, because, well: it's search.

[+] sudoapps|2 years ago|reply
Agreed, GPT answering based on its own training data has been the best experience by far (aside from hallucinations), and comparing against that is difficult. Embeddings might not even be the long-term solution. I think it's still too early to know for certain, but models are already getting better at interpretation with less overall training data, so there are bound to be some new ideas.
[+] b33j0r|2 years ago|reply
Many points stated well. Agree. Now, I’m not certain of this, but I’m starting to get an intuition that duct-taping databases to an agent isn’t going to be the answer (I still kinda feel like hundreds of agents might be).

But these optimizations are applications of technology stacks we already know about. Sometimes, this era of AI research reminds me of all the whacky contraptions from the era before building airplanes became an engineering discipline.

I would likely have tried building a backyard ornithopter powered by mining explosives, if I had been alive during that period of experimentation.

Prediction: the best interfaces for this will be the ones we use for everything else as humans. I am trying to approach it more like that, and less like APIs and “document vs relational vs vector storage”.

[+] fzliu|2 years ago|reply
Encoder-decoder (attention) architectures still have a tough time with long-range dependencies, so even with longer context lengths, you'll still need a retrieval solution.

I agree that there's probably a better solution than pure embedding-based or mixed embedding/keyword search, but the "better" solution will still be based around semantics... aka embeddings.
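As a toy illustration of retrieval "based around semantics": nearest-neighbor search over embedding vectors usually reduces to cosine similarity. A minimal sketch in pure Python (the 3-d vectors below are made-up stand-ins for real model embeddings):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    # Rank documents by cosine similarity to the query embedding.
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy 3-d "embeddings" standing in for real model outputs.
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.0],
    "doc_c": [0.7, 0.3, 0.0],
}
print(top_k([1.0, 0.0, 0.0], docs))  # ['doc_a', 'doc_c']
```

Vector databases do the same ranking, just with approximate-nearest-neighbor indexes instead of a full scan.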

[+] mlyle|2 years ago|reply
> It has the same advantages search has over ChatGPT (being able to cite sources, being quite unlikely to hallucinate) and it has some of the advantages ChatGPT has over search (not needing exact query) - but in my experience it's not really in the new category of information discovery that ChatGPT introduced us to.

I think the two could be paired up effectively. Context windows are getting bigger, but are still limited in the amount of information ChatGPT can sift through. This in turn limits the utility of current plugin-based approaches.

Letting ChatGPT ask for relevant information, and sift through it based on its internal knowledge, seems valuable. If nothing else, it allows "learning" from recent development and effectively would augment its reasoning capability by having more information in working memory.

[+] stavros|2 years ago|reply
Is there any way to fine-tune GPT to make documentation a part of its training set, so you won't need embeddings? OpenAI lets you fine-tune GPT-3, but I don't know how well that works.
[+] ftxbro|2 years ago|reply
> "Once these models achieve a high level of comprehension, training larger models with more data may not offer significant improvements (not to be mistaken with reinforcement learning through human feedback). Instead, providing LLMs with real-time, relevant data for interpretation and understanding can make them more valuable."

To me this viewpoint looks totally alien. Imagine you have been training this model to predict the next token. At first it can barely interleave vowels and consonants. Then it can start making words, then whole sentences. Then it starts unlocking every cognitive ability one by one. It begins to pass nearly every human test and certification exam and psychological test of theory of mind.

Now imagine thinking at this point "training larger models with more data may not offer significant improvements" and deciding that's why you stop scaling it. That makes absolutely no sense to me unless 1) you have no imagination or 2) you want to stop because you are scared to make superhuman intelligence or 3) you are lying to throw off competitors or regulators or other people.

[+] spacephysics|2 years ago|reply
I don’t think we’re close to super human intelligence in the colloquial sense.

ChatGPT scrapes all the information given, then predicts the next token. It has no ability to understand what is truthful or correct. It’s as good as the data being fed to it.

To me, this is a step closer to AGI but we’re still far off. There’s a difference between “what’s statistically likely to be the next word” vs “despite this being the most likely next word, it’s actually wrong and here’s why”

If we say, “well, we’ll tell chatgpt what the correct sources of information are” that’s no better really. It’s not reasoning, it’s just a neutered data set.

I imagine they need to add something like GPT-4's live internet access, or something else, to get the next meaningful bump.

I don’t recall who said it, but in a similar thread a researcher in the field said we have squeezed far more juice than expected out of these transformer models. It's not that no new progress can be made in this direction, but it seems like we’re approaching diminishing returns.

I believe the next step that’s close is to have these run on less and less horsepower. If we can have these models run on a phone locally, oh boy that’s gonna be something

[+] muskmusk|2 years ago|reply
I agree with your general premise, but I think you left a couple of points off your list at the end:

it is obscenely expensive to keep training + there are other more low hanging fruit + you expect hardware to get better over time.

I don't think Altman is trying to fool anyone. Even if he were it wouldn't work. The competition is not that stupid and he knows that :)

It's just that hardware tends to get better at a rate that resembles Moore's law so in 18 months the cost of training a 100 mill dollar model is 50 mill dollar. You certainly can just throw money at the problem, but it's expensive and there are other options that are just as effective for now. Why spend money on things that are half as valuable in 18 months when you can spend money on things that don't devalue as fast like producing more/better data?

All that being said you can bet your ass there will be a gpt5 :)

[+] tyre|2 years ago|reply
It's possible that training with more data has diminishing gains. For example, we know that current LLMs have a problem with hallucination, so maybe a more valuable next area of research/development is to fix that.

Or work on consistency within a scope. For example, it can't write a novel because it doesn't have object consistency. A character will be 15 years old then 28 years old three sentences later.

Or allow it database/API access so it can interpolate canonical information into its responses.

None of these have to do with scale of data (as far as I understand.) All of them are, in my opinion, higher ROI areas for development for LLM => AGI.

[+] HarHarVeryFunny|2 years ago|reply
These LLMs are trained to model humans - they are going to be penalized, not rewarded, if they generate outputs that disagree with the training data, whether due to being too dumb OR too smart.

Best you can hope for is that they combine the expertise of all authors in the training data, which would be very impressive, but more top-tier human than super-human. However, achieving this level of performance may well be beyond what a transformer of any size can do. It may take a better architecture.

I suspect that there is also probably a dumbing-down effect by training the model on material from people who themselves are on a spectrum of different abilities. Simply put the model is being rewarded when trained for being correct as often as possible (i.e on average), so if it saw the same subject matter in the training set 10 times, once by an expert and 10x by mid-wits, then it's going to be rewarded for mid-wit performance.

[+] sudoapps|2 years ago|reply
This wasn't meant to say that all training would stop. I think, to some extent, the model won't need additional recent data (that is already similar in structure to what it has) to better understand language and interpret the next set of characters. I could be completely wrong, but I still think techniques like transformers, RLHF, and of course others will still exist and evolve to eventually get to some higher intelligence level.
[+] nomel|2 years ago|reply
This assumes that current neural networks topologies can "solve" intelligence. "Gains" could be a problem of missing subsystems, rather than missing data.

For a squishy example of a known conscious system, if you scoop out certain small, relatively fixed, regions of our brains, you can make consciousness, memory, and learning mostly cease. This suggests it's partly due to special subsystems, rather than total connection count.

[+] vidarh|2 years ago|reply
I think it's more a question of diminishing return and the cost of scaling it up, which is getting to a point where looking for ways of maximizing the impact of what is there makes sense. I'm sure we'll see models trained on more data, but maybe after efficiency improvements makes it cheaper both to train and run large models.
[+] joshspankit|2 years ago|reply
My takeaway from his statements is that if you sum up all of human knowledge then add every unique bit of knowledge that humans could uncover in the next 20 years, there’s a plateau and that plateau is probably lower than our dreams of what LLMs can do.
[+] woah|2 years ago|reply
Maybe it gets twice as good each time you spend 10x more training it. In this case, you might indeed hit a wall at some point.
[+] Der_Einzige|2 years ago|reply
I get annoyed by articles like this. Yes, it's cool to educate readers who aren't aware of embeddings/embeddings stores/vectorDB technologies that this is possible.

What these articles don't touch on is what to do once you've got the most relevant documents. Do you use the whole document as context directly? Do you summarize the documents first using the LLM (adding the risk of hallucination at that step)? What about that trick where you shrink a whole document of context down to the embedding space of a single token (which is how ChatGPT remembers previous conversations)? Doing that would be useful but still lossy.

What about simply asking the LLM to craft its own search prompt to the DB given the user input, rather than returning articles that semantically match the query the closest? This would also make hybrid search (keyword or BM25 + embeddings) more viable in the context of combining it with an LLM.

Figuring out which of these choices to make, along with an awful lot more choices I'm likely not even thinking about right now, is what will separate the useful from the useless LLM + extractive knowledge systems.
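One concrete way to combine a keyword ranking with an embedding ranking, without having to calibrate their incompatible score scales, is reciprocal rank fusion (a standard fusion heuristic, not something from the article). A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Combine multiple ranked lists (e.g. BM25 and embedding search)
    # into one score per document: the sum of 1 / (k + rank).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from two retrievers over the same corpus.
bm25_ranking = ["doc_b", "doc_a", "doc_d"]
embedding_ranking = ["doc_a", "doc_c", "doc_b"]
print(reciprocal_rank_fusion([bm25_ranking, embedding_ranking]))
# ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```

Documents that appear near the top of both lists win; the constant k damps the influence of any single retriever's top hit.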

[+] EForEndeavour|2 years ago|reply
> What about that trick where you shrink a whole document of context down to the embedding space of a single token (which is how ChatGPT is remembering the previous conversations)

This is news to me. Where could I read about this trick?

[+] sudoapps|2 years ago|reply
The article is definitely still high-level and meant to provide enough understanding of what capabilities exist today. Some of what you are mentioning goes deeper into how you take these learnings/tools and come up with any number of solutions to fit the problem you are solving for.

> "Do you use the whole document as context directly? Do you summarize the documents first using the LLM (now the risk of hallucination in this step is added)?"

In my opinion the best approach is to take a large document and break it down into chunks before storing as embeddings and only querying back the relevant passages (chunks).
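A minimal sketch of that chunking step (the window sizes are illustrative; production systems often split on sentence or section boundaries rather than raw word counts):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    # Split a document into overlapping word windows so passages that
    # straddle a boundary still appear whole in some chunk.
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 500-word document yields 3 chunks of <= 200 words, overlapping by 50.
doc = " ".join(f"word{i}" for i in range(500))
print(len(chunk_text(doc)))  # 3
```

Each chunk is then embedded and stored; at query time only the top-scoring chunks, not whole documents, are stuffed into the prompt.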

> "What about that trick where you shrink a whole document of context down to the embedding space of a single token (which is how ChatGPT is remembering the previous conversations)"

Not sure I follow here but seems interesting if possible, do you have any references?

> "What about simply asking the LLM to craft its own search prompt to the DB given the user input, rather than returning articles that semantically match the query the closest? This would also make hybird search (keyword or bm25 + embeddings) more viable in the context of combining it with an LLM"

This is definitely doable but just adds to the overall processing/latency (if that is a concern).

[+] gaogao|2 years ago|reply
> What about simply asking the LLM to craft its own search prompt to the DB given the user input, rather than returning articles that semantically match the query the closest?

I played with that approach in this post - https://friend.computer/jekyll/update/2023/04/30/wikidata-ll.... "Craft a query" is nice as it gives you a very declarative intermediate state for debugging.

[+] orasis|2 years ago|reply
One caveat about embedding-based retrieval is that there is no guarantee that the embedded documents will look like the query.

One trick is to have a LLM hallucinate a document based on the query, and then embed that hallucinated document. Unfortunately this increases the latency since it incurs another round trip to the LLM.
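A sketch of that trick with the LLM and embedder abstracted into injectable functions (the stubs below are toys standing in for real model calls; a production version would call an actual LLM in `generate` and an embedding model in `embed`):

```python
def hyde_retrieve(query, documents, generate, embed, similarity):
    # HyDE-style sketch: embed an LLM-hallucinated answer instead of
    # the raw query, then return the real document closest to it.
    hypothetical_doc = generate(query)
    query_vec = embed(hypothetical_doc)
    return max(documents, key=lambda doc: similarity(query_vec, embed(doc)))

# Toy stand-ins: word-set "embeddings" compared by Jaccard overlap.
generate = lambda q: q + " pairs retrieval with generation"
embed = lambda text: set(text.lower().split())
jaccard = lambda a, b: len(a & b) / len(a | b)

docs = ["retrieval augmentation pairs search with generation",
        "bananas are a good source of potassium"]
print(hyde_retrieve("how does RAG work", docs, generate, embed, jaccard))
# retrieval augmentation pairs search with generation
```

The hallucinated document doesn't need to be correct; it only needs to be written in the same register as the documents, which is what makes it a better retrieval key than the short query.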

[+] taberiand|2 years ago|reply
Is that something easily handed off to a faster/cheaper LLM? I'm imagining something like running the main process through GPT-4 and handing off the hallucinations to GPT-3.5 Turbo.

If you could spot the need for it while streaming a response you could possibly even have it ready ahead of time

[+] d4rkp4ttern|2 years ago|reply
Some people packaged this rather intuitive idea, named it HyDE (Hypothetical Document Embeddings), and wrote a paper about it —

https://arxiv.org/abs/2212.10496

Summary —

HyDE is a new method for creating effective zero-shot dense retrieval systems that generates hypothetical documents based on queries and encodes them using an unsupervised contrastively learned encoder to identify relevant documents. It outperforms state-of-the-art unsupervised dense retrievers and performs strongly compared to fine-tuned retrievers across various tasks and languages.

[+] wasabi991011|2 years ago|reply
>One caveat about about embedding based retrieval is that there is no guarantee that the embedded documents will look like the query.

Aleph Alpha provides an asymmetric embedding model which I believe is an attempt to resolve this issue (haven't looked into it much, just saw the entry in langchain's documentation)

[+] rco8786|2 years ago|reply
> One trick is to have a LLM hallucinate a document based on the query

I'm not following why you would want to do this? At that point, just asking the LLM without any additional context would/should produce the same (inaccurate) results.

[+] redskyluan|2 years ago|reply
I have an opposite way of doing this: I tried generating questions based on doc chunks and embedding the questions. It works perfectly!
[+] Beltiras|2 years ago|reply
I'm working on something where I need to basically add on the order of 150,000 tokens into the knowledge base of an LLM. Finding out slowly I need to delve into training a whole ass LLM to do it. Sigh.
[+] akvadrako|2 years ago|reply
Can't you use fine-tuning for this?

Another option is to ask GPT to compress your tokens into a shorter prompt for itself.

[+] chartpath|2 years ago|reply
Search query expansion: https://en.wikipedia.org/wiki/Query_expansion

We've done this in NLP and search forever. I guess even SQL query planners and other things that automatically rewrite queries might count.

It's just that now the parameters seem squishier with a prompt interface. It's almost like we need some kind of symbolic structure again.
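The classical version of that "squishy" rewriting is easy to state precisely. A naive sketch with a hand-written synonym table (real systems mine expansions from corpora or, as discussed above, ask an LLM to propose them):

```python
# Naive query expansion: augment the user's terms with known synonyms
# so a keyword index can match documents that use different wording.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "fix": ["repair", "patch"],
}

def expand_query(query):
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("fix car engine"))
# ['fix', 'car', 'engine', 'repair', 'patch', 'automobile', 'vehicle']
```

Embeddings do this implicitly in vector space; the prompt-based variant does it with even squishier parameters, which is the point being made.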

[+] sudoapps|2 years ago|reply
If you are wondering what the latest is on giving LLMs access to large amounts of data, I think this article is a good start. This seems like a space where there will be a ton of innovation, so I'm interested to learn what else is coming.
[+] jeffchuber|2 years ago|reply
hi everyone, this is jeff from Chroma (mentioned in the article) - happy to answer any questions.
[+] hartator|2 years ago|reply
Is Chroma already trained, or only trained on the supplied documents?

I can try to make a Ruby client.

[+] pbhjpbhj|2 years ago|reply
>There is an important part of this prompt that is partially cut off from the image:

>> “If you don't know the answer, just say that you don't know, don't try to make up an answer”

//

It seems silly to make this part of the prompt rather than a separate parameter, surely we could design the response to be close to factual. Then run a checker to ascertain a score for the factuality of the output?

[+] sudoapps|2 years ago|reply
A lot of what prompting has turned into seems silly to me too, but it has shown to be effective (at least with GPT-4).
[+] nico|2 years ago|reply
Can we build a model based purely on search?

The model searches until it finds an answer, including distance and resolution

Search is performed by a DB, the query then sub-queries LLMs on a tree of embeddings

Each coordinate of an embedding vector is a pair of coordinate and LLM

Like a dynamic dictionary, in which the definition for the word is an LLM trained on the word

Indexes become shortcuts to meanings that we can choose based on case and context

Does this exist already?

[+] fzliu|2 years ago|reply
Not sure what you mean by dynamic dictionary, but the embedding tree you mention is already freely available in Milvus via the Annoy index.
[+] m3kw9|2 years ago|reply
This is like asking gpt to summarize what it found on Google, this is basically what bing does when you try to find stuff like hotels and other recent subjects. Not the revolution we are all expecting
[+] A_D_E_P_T|2 years ago|reply
"Infinite" is a technical term with a highly specific meaning.

In this case, it can't possibly be approached. It certainly can't be attained.

Borges' Library of Babel, which represents all possible combinations of letters that can fit into a 410-page book (40 lines of 80 characters per page, over a 25-symbol alphabet), only contains some 25^1312000 books. And the overwhelming majority of its books are full of gibberish. The amount of "knowledge" that an LLM can learn or describe is VERY strictly bounded and strictly finite. (This is perhaps its defining characteristic.)
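A quick back-of-the-envelope check of that bound (the exponent 1,312,000 is just Borges' page geometry multiplied out; this arithmetic is mine, not the commenter's):

```python
import math

# Each Library of Babel book: 410 pages x 40 lines x 80 characters,
# drawn from a 25-symbol alphabet.
chars_per_book = 410 * 40 * 80
assert chars_per_book == 1_312_000

# Number of decimal digits in 25**1312000, via logarithms
# (computing the full integer is feasible but unnecessary).
digits = math.floor(chars_per_book * math.log10(25)) + 1
print(digits)  # 1834098 — about 1.8 million digits
```

An unimaginably large number, yet still finite, which is exactly the point.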

I know this is pedantic, but I am a philosopher of mathematics and this is a matter that's rather important to me.

[+] hartator|2 years ago|reply
> I know this is pedantic, but I am a philosopher of mathematics and this is a matter that's rather important to me.

I don’t think this is pedantic. Words carry a specific meaning or what’s the point of words otherwise.

[+] flukeshott|2 years ago|reply
I wonder how effectively compressed LLMs are going to become...