From a confidentiality viewpoint, isn't this equivalent to uploading my docs to ChatPDF? (Genuine question, I'm not well versed in LLMs.) After all, I'm uploading the embeddings to OpenAI or a similar service provider. If their chatbot can then answer my questions about my docs, wouldn't the informational content also allow it to answer any question asked by anybody other than me? E.g., "what's the document about?", "what's the PII of principal X?", etc.
Bear in mind that this approach is very limited. Any query requiring global awareness of the context will likely be disappointing. For example, "how was the experimental design different compared to how we've previously done this sort of work?", asked over a corpus of design-of-experiments specifications. Since the target of this query doesn't have any simple correspondence to the words used in the query, any sort of semantic embedding at a paragraph or sentence level just won't be useful.
Most don't train/fine-tune on your data; they stick it into a vector database and perform similarity search. The method is called Retrieval Augmented Generation.
This is the high level algorithm:
1) Sentence segmentation/text splitting. The data is indexed in separate chunks so the user can look up the specific information they want.
2) The split sentences/text chunks are run through a cheap LLM/specialized model, usually not state of the art but powerful/big enough to separate and associate the individual concepts in the latent space. Current models include davinci-003 and SentenceTransformers. Embedding generation is usually the first step in an NLP deep learning model, so it's relatively cheap/lightweight. Essentially you take the first layer or two of a neural network and multiply the weight matrix against the input. This is the simplest type of embedding (see the original word2vec algorithm). Transformer embeddings are a bit different but functionally they operate similarly.
3) The generated embedding vectors represent the input data in latent space, i.e. as abstract representations. The most famous example in modern natural language processing is possibly King - Man + Woman = Queen.
4) The vector embeddings are stored somewhere, usually a database, but you can dump them in an Excel file too if you want. You want to put them in a dictionary structure that maps each individual embedding -> original sentence chunk.
5) The user creates a query, which is passed to the embeddings generator model, and another vector embedding is generated (the query can be anything, so long as it's natural language based, since that's what the embeddings generator LLM was trained on). This query can also be created by an upstream LLM; the specifics do not matter so long as the sentence is mostly well-formed.
6) We obtain an answer by performing a similarity search (nearest neighbor in the vector latent space, using cosine/Euclidean distance/whatever metric). There are many approaches to do this: you can use kNN, you can use basic linear algebra and do n comparisons against all other vectors, or you can use a graph data structure (the currently preferred method for the fastest libraries).
7) You find the closest vector and the text chunk/sentences it represents. You take these sentences and the original query (the raw natural language text, not the generated embedding) and feed everything into a new LLM prompt. The LLM in this step is usually a state of the art chat model like GPT-4 or Llama 2, not the cheap model used for indexing and generating vectors. You pass a prompt like this:
Answer the following query: {original query text} with the given context: {text chunks, sentences}.
And that's it. Retrieval augmented generation has a fancy name, and langchain's code feels opaque as hell, like it was written by enterprise Java people, but the underlying algorithm is less than 30 lines of code with the standard ML and linear algebra libraries.
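For anyone who wants to see the steps above as code, here is a minimal sketch in Python with NumPy. It is not what langchain or any particular product does: `embed` is a toy hashed bag-of-words stand-in for a real embedding model (an assumption made so the example runs offline), the function names and the sample board-game text are made up for illustration, and the final prompt is only printed rather than sent to a chat model.

```python
import hashlib

import numpy as np

DIM = 1024  # size of the toy embedding vectors (arbitrary choice for this sketch)


def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        token = word.strip(".,;:!?\"'()")
        if token:
            # md5 keeps the word -> slot mapping stable across runs
            idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
            vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def split_into_chunks(document: str) -> list[str]:
    """Step 1: naive text splitting on blank lines."""
    return [c.strip() for c in document.split("\n\n") if c.strip()]


def build_index(chunks: list[str]) -> np.ndarray:
    """Steps 2-4: embed every chunk; row i of the matrix belongs to chunks[i]."""
    return np.stack([embed(c) for c in chunks])


def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 2) -> list[str]:
    """Steps 5-6: embed the query, then brute-force cosine similarity."""
    scores = index @ embed(query)  # vectors are unit length, so dot product = cosine
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Step 7: combine the raw query text with the retrieved chunks."""
    context = "\n".join(context_chunks)
    return f"Answer the following query: {query}\nwith the given context:\n{context}"


document = """Settlers of Catan: each player starts with two settlements and two roads.

In Carcassonne, a completed city scores two points per tile.

Chess: the bishop moves diagonally any number of squares."""

chunks = split_into_chunks(document)
index = build_index(chunks)
query = "How does the bishop move in chess?"
prompt = build_prompt(query, retrieve(query, chunks, index, k=1))
print(prompt)  # in a real setup this string would be sent to the chat model
```

Swapping `embed` for a call to a real embedding model and sending `prompt` to a chat model would give the full pipeline. The brute-force `index @ embed(query)` comparison corresponds to the "n comparisons against all other vectors" option from step 6.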
Re: step 2, I think you mean text-embedding-ada-002, OpenAI's current embeddings model, which replaces all 15 (or is it 16?) first-generation embeddings models.
With respect to open source embeddings models, instructor-xl is the state of the art currently, as effective as text-embedding-ada-002.
That said, instructor-xl has a context length of 512 tokens, while text-embedding-ada-002 has a context length of 8192 tokens, which is markedly more convenient.
Last but not least, parent's comment re: langchain is spot on. It's simple and straightforward to write these few lines of code yourself.
I explored privateGPT today and gave it a source of documents in txt format. It worked well for some queries but for some it gave responses which were not part of the source documents. Any tips on fixing this? There is an open issue for this problem - https://github.com/imartinez/privateGPT/issues/517
I'm building a service called https://useftn.com which allows people to do exactly that. You just upload your dataset and we'll fine-tune Llama 2-7B for you on that data. The model works with Hugging Face and all of the mainstream NLP frameworks.
It's in its really early stages right now (mainly just looking to learn/help people), so if you have your data in a specific format I'll be happy to code something up to make it work with your data.
How do you fine-tune LLaMA2-7b on a dataset with an A10 with 24 GB?
I thought you need to load the model 4 times for training, so with 16-bit weights this would be 7 * 10^9 * 2 bytes * 4 = 56 GB GPU RAM.
Have you found a way to train with 4 bit weights?
And would a single textbook be enough text for a meaningful finetuning?
I am worried that a model that is trained on billions of documents will not adapt strongly enough to a comparatively small document.
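The memory estimate in this question can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: reading the factor of 4 as weights plus gradients plus two optimizer buffers is an assumption, activations and framework overhead are ignored, and the function name is made up.

```python
def training_memory_gb(n_params: float, bytes_per_value: float, copies: int) -> float:
    """Rough GPU memory estimate: parameters x bytes per value x number of copies."""
    return n_params * bytes_per_value * copies / 1e9


N_PARAMS = 7e9  # LLaMA2-7b

# 16-bit values, 4 copies (assumed: weights, gradients, two optimizer buffers)
full_16bit = training_memory_gb(N_PARAMS, bytes_per_value=2, copies=4)

# Weights alone stored at 4 bits (half a byte) per parameter
weights_4bit = training_memory_gb(N_PARAMS, bytes_per_value=0.5, copies=1)

print(f"16-bit full training: {full_16bit:.1f} GB")  # 56.0 GB
print(f"4-bit weights only:   {weights_4bit:.1f} GB")  # 3.5 GB
```

Under these assumptions the full 16-bit setup does not fit in 24 GB, while 4-bit weights on their own would leave room for other training state.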
So RAG (retrieval augmented generation) is all the rage, but there are problems with it that make it appear almost conceptually flawed - to the point where results are flat-out poor?
There's the paper about non-uniform attention ("Lost in the Middle: How Language Models Use Long Contexts"), and some other paper mentioned that LLMs may get distracted by irrelevant retrieved content as soon as the ratio of relevant to irrelevant content becomes small.
So, what's the current best practice to actually embed your content within the model?
It isn't very hard. The easy approach (and probably the best alternative, unless you are big enough to justify training your own LLM every time your documents change) is to use vector search to find the most relevant parts of your documents (I use the OpenAI embedder and pgvector for Postgres). Then you feed that text to an LLM (could be GPT-4 or Llama) and ask it to answer the question using the text you provide.
That isn't what the person really wants to do though. I am building an application to use AI to query sets of documents in the legal space right now, for example, and this method has severe limitations. The LLMs themselves need fine-tuning for the purpose and bigger context windows, not to mention the bottleneck of the initial vector search itself.
If you want something easy just to try it out give https://xata.io/chatgpt a try. Load your docs into the db, and start asking questions. Xata has a pretty good free tier and you could likely get something running in about an hour.
That will at least get you to having some results quickly. I've found chatgpt is really more about the data you feed it, than anything else.
https://flowch.ai (our project) does this and is currently free to use. It doesn't train a custom model but it gets good results.
A simple way to do this is to upload your files (PDFs, Word docs, virtually any type is supported), then generate reports using prompts based on those uploaded files. You can go from uploading to results in around a minute.
I've had good results uploading a bunch of documents, then running the same prompt on each of the documents with a few clicks using the "Flow Reports" feature.
We're working on lots of stuff on top of this, like scheduled reports (daily summaries / analysis / newsletters) and automated web scraping and data upload.
Here's an example using a BBC News article that I just uploaded to FlowChai (prompt: summarize in 200 words):
This thread is fantastic with a lot of great resources!
I have a related question but I don't want to start a new Ask HN for it. I have been using Machato app on Mac (https://news.ycombinator.com/item?id=35471091) for the last few months. This is a very nice app but with limited functionality. For example, it doesn't allow uploading PDF documents and asking questions on them as described in so many responses in this thread. I tried to search for a Mac app with this capability but my search came up empty. All the results I got are for web apps. Has anyone come across such an app? Paid/free doesn't matter.
I think most people don’t know this (I didn’t until recently) but you can upload documents to ChatGPT if you’re a paying member:
https://chat.openai.com
Choose GPT4 and Code Interpreter (you have to turn it on in your Settings).
Do you specifically need training, or would being able to reference your documents be good enough? You can have a look at projects such as langchain where they use embeddings in order to provide the LLM the relevant documents upon a user's query, which the LLM can then read and respond with
I am not OP, but I am facing issues with the common solution of "similarity search your documents --> pass the top ~5 chunks along with the query to the LLM"
[+] [-] gavinray|2 years ago|reply
You give it a directory containing documents and ask it to build an index and vector data embeddings over the documents
Then you can use this index with models like ChatGPT
Tutorial here shows the end to end process
https://gpt-index.readthedocs.io/en/latest/getting_started/s...
[+] [-] fernirello|2 years ago|reply
[+] [-] whynotmaybe|2 years ago|reply
Especially for French and Spanish.
[+] [-] usgroup|2 years ago|reply
1. Semantically index your documents.
2. Given a prompt, extract relevant paragraphs from your own documents.
3. Frame a context for the prompt from extracted paragraphs.
4. Ask ChatGPT to answer the prompt, mindful of the context.
That way ChatGPT can be used out-of-the-box.
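Steps 3 and 4 of this list could look roughly like the Python sketch below. The function names, the character budget, and the exact prompt wording are all assumptions for illustration; the paragraphs are expected to arrive already ranked by whatever search was used in step 2.

```python
def frame_context(paragraphs: list[str], max_chars: int = 1500) -> str:
    """Step 3: join ranked paragraphs, numbered, until the character budget is used up."""
    selected, used = [], 0
    for i, paragraph in enumerate(paragraphs, start=1):
        entry = f"[{i}] {paragraph.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before the context outgrows the budget
        selected.append(entry)
        used += len(entry) + 1  # +1 for the newline separator
    return "\n".join(selected)


def frame_prompt(question: str, paragraphs: list[str], max_chars: int = 1500) -> str:
    """Step 4: wrap the context and the question into one prompt string."""
    context = frame_context(paragraphs, max_chars)
    return (
        "Use only the context below to answer. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


paragraphs = [
    "The bishop moves diagonally.",
    "Castling involves the king and a rook.",
]
print(frame_prompt("How does the bishop move?", paragraphs))
```

The budget is counted in characters only to keep the sketch dependency-free; a real setup would more likely count tokens for the target model.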
[+] [-] usgroup|2 years ago|reply
[+] [-] KRAKRISMOTT|2 years ago|reply
[+] [-] npsomaratna|2 years ago|reply
[+] [-] replwoacause|2 years ago|reply
[+] [-] soultrees|2 years ago|reply
[+] [-] thaw13579|2 years ago|reply
[+] [-] onlypositive|2 years ago|reply
List of tools to bookmark: https://github.com/awesome-chatgpt/awesome-chatgpt
[+] [-] Obald|2 years ago|reply
docker run -itd --gpus all -p ${PORT}:5111 --name llm-local-wizardlm-7b obald/llm-launcher:0.0.2
Just open localhost:port in the browser, upload docs, then ask questions in the GUI.
Really nice for easy lookup of rules in board games and such, as it provides the relevant text from the docs in addition to the query answer.
https://gitlab.com/PeterHedman/llm-local
[+] [-] akbarnama|2 years ago|reply
[+] [-] Nevin1901|2 years ago|reply
(Also Disclaimer: I own this service)
[+] [-] kirdiekirdie|2 years ago|reply
[+] [-] potamic|2 years ago|reply
[+] [-] ndr_|2 years ago|reply
[+] [-] victorbjorklund|2 years ago|reply
[+] [-] catlover76|2 years ago|reply
[+] [-] azmodeus|2 years ago|reply
[+] [-] snide|2 years ago|reply
(disclosure: I work at Xata)
[+] [-] llmllmllm|2 years ago|reply
Here's an example using a BBC News article that I just uploaded to FlowChai (prompt: summarize in 200 words):
https://flowch.ai/shared/3c6d6ead-3ebc-4190-a143-ffeee81945a...
[+] [-] tikkun|2 years ago|reply
There are indeed quite a few startups that do this.
Note that these are all 'retrieval-augmented generation' tools rather than fine-tuning tools.
[+] [-] malshe|2 years ago|reply
[+] [-] sharonzhou|2 years ago|reply
https://colab.research.google.com/drive/1QMeGzR9FnhNJJFmcHtm...
https://lamini-ai.github.io/
from llama import QuestionAnswerModel
model = QuestionAnswerModel()
model.load_question_answer_from_csv("data.csv")
model.train() # returns id to run inference & playground interface
[+] [-] pud|2 years ago|reply
Choose GPT4 and Code Interpreter (you have to turn it on in your Settings).
Then click the “plus” icon in the chat box.
Don’t upload anything sensitive.
[+] [-] ssddanbrown|2 years ago|reply
[1] https://github.com/danswer-ai/danswer [2] https://news.ycombinator.com/item?id=36667374
[+] [-] theblazehen|2 years ago|reply
[+] [-] catlover76|2 years ago|reply
Is there a better way?
[+] [-] potamic|2 years ago|reply
[+] [-] jrpt|2 years ago|reply
[+] [-] arihantparsoya|2 years ago|reply
[+] [-] akvadrako|2 years ago|reply
[+] [-] kekeblom|2 years ago|reply
[+] [-] knbrlo|2 years ago|reply
[+] [-] ianpurton|2 years ago|reply
Anyway, for the UI you could look at chainlit; for the API, some of the models are already getting wrapped up in an OpenAI-compatible REST interface.
See https://github.com/go-skynet/LocalAI
[+] [-] alexandr1us|2 years ago|reply
[+] [-] null4bl3|2 years ago|reply
Not a recommended approach