Granted I'm not coming from the python world, but I have tried many of these projects, and very few of them install out of the box. They usually end with some incompatibility, and files scattered all over the place, leading to future nightmares.
```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentry-sdk 1.22.2 requires urllib3<2.0.0, but you have urllib3 2.0.2 which is incompatible
```
Just for fun, here's the result of `python -m pip install -r ./requirements.txt` for tortoise-tts:
…many many lines
```
raise ValueError("%r is not a directory" % (package_path,))
ValueError: 'build/py3k/scipy' is not a directory
Converting to Python3 via 2to3...
…
/tmp/pip-install-hkb_4lh7/scipy_088b20410aca4f0cbcddeac86ac7b7b1/build/py3k/scipy/signal/fir_filter_design.py
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
```
I'm not asking for support, just saying that if people really want to make something 'easy' they'd use Docker. I gather there are better Python package managers, but that seems to be a bit of a mess too.
Someone is thinking "this is part of learning the language," but I think it's just bad design.
You don't need Docker, you just need a virtual env for each random thing you try, instead of letting them all conflict with each other. Maybe some day pip will add a switch to create one automatically, but until then:

```shell
python3 -m venv venv
. venv/bin/activate
```

before you try something random.
Also, `python` is usually Python 2.7. If it is, I advise removing it from your system unless you have a strong reason to keep it.
Yep, I just tried to install a Python-based project and there was a conflict between Pyenv's and Homebrew's versions of pip... despite having used Homebrew to install Pyenv. I ended up just getting rid of Pyenv altogether... but now Python may be in some screwed-up state on my system.
It's too bad the ecosystem seems to be so messy, because Python seems like the best language for general utilities.
Are you familiar with virtual environments? It's the standard Python technique for isolating dependencies across projects. [Most projects mention this in the setup / quickstart section of their docs.]
You should not be seeing these dependency conflict issues if you install each project in its own virtual environment.
If you just want tools to be easily installed, you can use pipx (`pipx install my-package`), which will manage the virtual environment automatically.
Making a full-blown Docker image for it is overkill 99% of the time. Virtual environments serve the same purpose while being much faster and lighter weight.
This is the primary reason I'm averse to languages and ecosystems that rely on package managers. I have never had a good experience with them; these things are just constantly breaking. Stack/cabal, cargo, pip, npm/yarn, gem. They scatter files across my filesystem, with extremely brittle configs that shatter the ecosystem into a billion pieces at seemingly random intervals. The problem is exacerbated by these package managers often being more complex than the compiler/interpreter itself. Luarocks is probably the least problematic, and that's mostly because it hosts really simple and self-contained software.
Say what you will about the old school way of manually building and copying shit around, at least when something breaks I don't have to spend a couple hours keelhauling a bloated toolchain in a debugger for mutiny.
I would say it's more an artifact of historical tech debt that is hard to change now without breaking everyone. As another commenter pointed out, you want to use a venv - I use pipenv as a tool to automate this but there are others as well (poetry is probably better but pipenv seems to work for me).
Self-hosted + self-trained LLMs are probably the future for enterprise.
While consumers are happy to get their data mined to avoid paying, businesses are the opposite: willing to pay a lot to avoid feeding data to MSFT/GOOG/META.
They may give assurances on data protection (even here, GitHub Copilot's TOS has sketchy language around saving derived data), but they can't get around the fundamental problem that their products need user interactions to work well.
So it seems with BigTechLLM there’s inherent tension between product competitiveness and data privacy, which makes them incompatible with enterprise.
Biz ideas along these lines:
- Help enterprises set up, train, maintain own customized LLMs
- Security, compliance, monitoring tools
- Help AI startups get compliant with enterprise security
- Fine tuning service
In the book “To Sleep in a Sea of Stars” there's a concept of a “ship mind” that is local to each spacecraft. It's smarter than “pseudo AI” and can have real conversations, answer complex questions, and even tell jokes.
I can see a self-hosted LLM being akin to a company's ship-mind. Anyone can ask questions, order analyses, etc., so long as you are a member of the company. No two LLMs will be exactly the same, and that's ok.
I suspect the major cloud providers will also each offer their own “enterprise friendly” LLM services (Azure already offers a version of OpenAI’s API). If they have the right data guarantees, that’ll probably be sufficient for companies that are already using their IaaS offerings.
> willing to pay a lot to avoid feeding data to MSFT/GOOG/META.
Right now, you can't pay a lot and get a local LLM with similar performance to GPT-4.
Anything you can run on-site isn't really even close in terms of performance.
The ability to fine-tune to your workplace's terminology and document set is certainly a benefit, but for many use cases that doesn't outweigh the performance difference.
```
Use the following pieces of context to answer the question at the end. If you don't
know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Helpful Answer:
```
- [GitHub - e-johnstonn/BriefGPT: Locally hosted tool that connects documents to LLMs for summarization and querying, with a simple GUI.](https://github.com/e-johnstonn/BriefGPT)
- [GitHub - go-skynet/LocalAI: Self-hosted, community-driven, local OpenAI-compatible API. Drop-in replacement for OpenAI running LLMs on consumer-grade hardware. No GPU required. LocalAI is a RESTful API to run ggml compatible models: llama.cpp, alpaca.cpp, gpt4all.cpp, rwkv.cpp, whisper.cpp, vicuna, koala, gpt4all-j, cerebras and many others!](https://github.com/go-skynet/LocalAI)
- [GitHub - paulpierre/RasaGPT: RasaGPT is the first headless LLM chatbot platform built on top of Rasa and Langchain. Built w/ Rasa, FastAPI, Langchain, LlamaIndex, SQLModel, pgvector, ngrok, telegram](https://github.com/paulpierre/RasaGPT)
- [GitHub - imartinez/privateGPT: Interact privately with your documents using the power of GPT, 100% privately, no data leaks](https://github.com/imartinez/privateGPT)
- [GitHub - reworkd/AgentGPT: Assemble, configure, and deploy autonomous AI Agents in your browser.](https://github.com/reworkd/AgentGPT)
- [GitHub - deepset-ai/haystack: Haystack is an open source NLP framework to interact with your data using Transformer models and LLMs (GPT-4, ChatGPT and alike). Haystack offers production-ready tools to quickly build complex question answering, semantic search, text generation applications, and more.](https://github.com/deepset-ai/haystack)
- [PocketLLM « ThirdAi](https://www.thirdai.com/pocketllm/)
- [GitHub - imClumsyPanda/langchain-ChatGLM: langchain-ChatGLM, local knowledge based ChatGLM with langchain (ChatGLM Q&A over a local knowledge base)](https://github.com/imClumsyPanda/langchain-ChatGLM)
Got this working locally. It badly needs GPU support (I have a 3090, so come on!); there is some workaround, but I expect it will come pretty soon. This video was a useful walkthrough, especially on using a different model and upping the CPU threads: https://www.youtube.com/watch?v=A3F5riM5BNE
I want to have the memory part of langchain down: vector store + local database + client to chat with an LLM (the gpt4all model can be swapped for the OpenAI API just by switching the base URL): https://github.com/aldarisbm/memory
It's still got a ways to go; if someone wants to help, let me know :)
Sorry for my ignorance, but "memory" refers to the process of using embeddings for QA, right?
The process is roughly this:

Ingestion:
- Compute embeddings for your documents (from text to an array of numbers)
- Store the documents, along with their embeddings, in a vector DB

Query time:
- Compute the embedding for the query
- Find the documents closest to the query by vector distance in the vector DB
- Construct a prompt in the format:

```
Answer the question using this context:
{DOCUMENTS RETRIEVED}
Question: {question}
Answer:
```
Is that correct? Now, my question is: can the models be swapped easily, or does that require a complete recalculation of the embeddings (and a new ingestion)?
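The ingestion/query pipeline described above can be sketched end to end. This is a toy sketch: the bag-of-words "embedding" and hand-rolled cosine similarity are stand-ins for a real embedding model and a vector DB's nearest-neighbour search, and the documents and query are invented for illustration:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words term counts. A real pipeline would
    call an embedding model here; this stand-in keeps the sketch runnable."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingestion: embed each document and store (vector, text) pairs.
# (A vector DB would handle this storage and the similarity search.)
docs = [
    "The invoice system runs nightly at 2am.",
    "Password resets are handled by the helpdesk portal.",
]
index = [(embed(d), d) for d in docs]

# Query time: embed the query, retrieve the closest document,
# then build the prompt from the retrieved context.
query = "When does the invoice system run?"
qvec = embed(query)
retrieved = max(index, key=lambda pair: cosine(qvec, pair[0]))[1]

prompt = f"""Answer the question using this context:
{retrieved}
Question: {query}
Answer:"""
print(retrieved)  # → The invoice system runs nightly at 2am.
```

Note that because the stored vectors come from a specific embedding model, the index is tied to that model; this is why swapping models generally means re-embedding the documents.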
Working on something similar that uses keyterm extraction for traversal of topics and fragments, without using Langchain. It's not designed to be private, however: https://github.com/FeatureBaseDB/DocGPT/tree/main
Well, maybe it works on Obsidian vaults for note-taking, heh, but with llama models' 2k input token range it'd get a tenth of the way in before starting to drop context. Likely useless without something like a 100k-context model.
Hi, very interesting. What are the memory/disk requirements to run it? Would 16GB of RAM be enough? I suggest adding these requirements to the README.
Well, I'm not sure which models specifically work, but it runs on llama.cpp, which would mean LLaMA-derivative ones. Here's a little table of quantized CPU (GGML) versions and the RAM they require, as a general rule of thumb:

| Name | Quant method | Bits | Size | RAM required | Use case |
|------|--------------|------|------|--------------|----------|
| WizardLM-7B.GGML.q4_0.bin | q4_0 | 4-bit | 4.2 GB | 6 GB | 4-bit. |
| WizardLM-7B.GGML.q4_1.bin | q4_1 | 4-bit | 4.63 GB | 6 GB | Higher accuracy than q4_0 but not as high as q5_0; quicker inference than q5 models. |
| WizardLM-7B.GGML.q5_0.bin | q5_0 | 5-bit | 4.63 GB | 7 GB | Higher accuracy, higher resource usage, slower inference. |
| WizardLM-7B.GGML.q5_1.bin | q5_1 | 5-bit | 5.0 GB | 7 GB | Even higher accuracy, higher resource usage, slower inference. |
| WizardLM-7B.GGML.q8_0.bin | q8_0 | 8-bit | 8 GB | 10 GB | Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use. |
| wizard-vicuna-13B.ggmlv3.q4_0.bin | q4_0 | 4-bit | 8.14 GB | 10.5 GB | 4-bit. |
| wizard-vicuna-13B.ggmlv3.q4_1.bin | q4_1 | 4-bit | 8.95 GB | 11.0 GB | Higher accuracy than q4_0 but not as high as q5_0; quicker inference than q5 models. |
| wizard-vicuna-13B.ggmlv3.q5_0.bin | q5_0 | 5-bit | 8.95 GB | 11.0 GB | Higher accuracy, higher resource usage, slower inference. |
| wizard-vicuna-13B.ggmlv3.q5_1.bin | q5_1 | 5-bit | 9.76 GB | 12.25 GB | Even higher accuracy, higher resource usage, slower inference. |
| wizard-vicuna-13B.ggmlv3.q8_0.bin | q8_0 | 8-bit | 16 GB | 18 GB | Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use. |
| VicUnlocked-30B-LoRA.ggmlv3.q4_0.bin | q4_0 | 4-bit | 20.3 GB | 23 GB | 4-bit. |
| VicUnlocked-30B-LoRA.ggmlv3.q4_1.bin | q4_1 | 4-bit | 24.4 GB | 27 GB | Higher accuracy than q4_0 but not as high as q5_0; quicker inference than q5 models. |
| VicUnlocked-30B-LoRA.ggmlv3.q5_0.bin | q5_0 | 5-bit | 22.4 GB | 25 GB | Higher accuracy, higher resource usage, slower inference. |
| VicUnlocked-30B-LoRA.ggmlv3.q5_1.bin | q5_1 | 5-bit | 24.4 GB | 27 GB | Even higher accuracy, higher resource usage, slower inference. |
| VicUnlocked-30B-LoRA.ggmlv3.q8_0.bin | q8_0 | 8-bit | 36.6 GB | 39 GB | Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use. |

Copied from some of The-Bloke's model descriptions on Hugging Face. With 16 GB you can run practically all 7B and 13B versions. With shared GPU+CPU inference, one can also offload some layers onto a GPU (not sure if that makes the initial RAM requirement smaller), but you do need CUDA, of course.
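As a rough sanity check on the numbers above: file size is approximately parameter count × effective bits per weight (a little above the nominal bit width, since GGML quantization stores per-block scale factors), plus some working memory on top. The ~1 extra bit of block overhead and ~2 GB of runtime overhead in this sketch are assumptions, not measured values:

```python
def estimate_ram_gb(params_billion: float, nominal_bits: float,
                    block_overhead_bits: float = 1.0,
                    runtime_gb: float = 2.0) -> float:
    """Very rough RAM estimate for a GGML-quantized model.

    block_overhead_bits and runtime_gb are assumed fudge factors;
    real usage depends on context size and backend details.
    """
    effective_bits = nominal_bits + block_overhead_bits
    file_gb = params_billion * effective_bits / 8  # approximate model file size
    return file_gb + runtime_gb

# A 13B model at 4-bit comes out around 10 GB with these assumptions,
# in the same ballpark as the ~11 GB figure in the table.
print(estimate_ram_gb(13, 4))
```

This is only a rule of thumb; the table's measured figures are the better guide.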
Would someone do me the kindness of explaining (a little more) how this works?
It looks like you can ask a question and the model will use its combined knowledge of all your documents to figure out the answer. It looks like it isn't fine-tuned or trained on all the documents, is that right? How is each document turned into an embedding, and then how does the model figure out which documents to consult to answer the question?
When you split a document into chunks, doesn't some crucial information get cut in half? In that case, you'd probably lose that information from the context if it was immediately followed by irrelevant information that reduces the cosine similarity. Is there a "smarter" way to feed documents as context to LLMs?
Don't know if there is a smarter way, but these libraries usually offer an overlap parameter that allows you to repeat the last N characters of a chunk in the first N of the next chunk.
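That overlap idea is simple to picture with a minimal character-based splitter. A sketch under stated assumptions: `chunk_size` and `overlap` are illustrative parameter names, not any particular library's API (real splitters usually work on tokens or separators rather than raw characters):

```python
def split_with_overlap(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks where each chunk repeats the last `overlap`
    characters of the previous one, so a fact that straddles a chunk
    boundary survives intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Adjacent chunks share a 2-character overlap:
print(split_with_overlap("abcdefghij", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The trade-off is that overlap inflates the total amount of text stored and embedded, so N is usually kept to a small fraction of the chunk size.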
Projects like this for use with your own document datasets are invaluable, but everything I've tried so far hallucinates, so it's not practical. What's the state of the art for LLMs without hallucination at the moment?
Like many others, I’m also building my own platform to accomplish this. What I’ve learned is the document preparation is key in getting the LLM to answer correctly. The text splitting portion is a crucial step here. Picking the correct splitter and parameters for your use case is important. At first I was getting incorrect or made up answers. Setting up a proper prompt template and text splitting parameters fixed the issue for the most part and now I have 99% success.
Also, the local model used makes a big difference. Right now wizard-mega and manticore are the best ones to use. I run the 16b ggml versions on an M2 Pro and it takes about 30 seconds to “warm up” and produce some quality responses.
Not exactly sure if this would qualify as an LLM in the GPT4 sense. But for no hallucination this seems good:
https://www.thirdai.com/pocketllm/
Full disclosure: I know the founder, but I'm not really associated with the company in any way.
This is a shortcut/workaround for transforming the private docs into a prompt:answer dataset and fine-tuning, right?
What would be the difference in user experience or information retrieval performance between the two?
My impression is it saves work on the dataset transformation and compute for fine tuning, so it must be less performant. Is there a reason to prefer the strategy here other than ease of setup?
For some reason, downloading the model they suggest keeps failing. I tried to download it in Firefox and Edge. I'm using Windows, if that matters. Anyone else seeing similar issues?
Is there a benchmark for retrieval from multiple ft documents? I tried the LangchainQA with Pinecone and wasn't impressed with the search result when using it on my Zotero library.
Chroma doesn't seem to be a real DB, it's rather a wrapper around tools like hnswlib, DuckDB or Clickhouse. Qdrant is way more mature - it has its own HNSW implementation with some tweaks to incorporate filtering directly during the vector search phase, supports horizontal and vertical scaling, as well as provides its own managed cloud offering.
In general, Qdrant is a real DB, not a library and that's a huge difference.
Python is very fragile to deploy and run on your own machine.
https://fractalverse.net/explore-to-sleep-in-a-sea-of-stars/...
In this case it appears to be using RetrievalQA from LangChain, which I think is this prompt here: https://github.com/hwchase17/langchain/blob/v0.0.176/langcha...
"ggml_new_tensor_impl: not enough space in the context's memory pool (needed 18296202768, available 18217606000)"
This project could help me create a personal AI that answers any questions about my life, finances, or knowledge...
Why? Why can't I define any directory (my existing Obsidian vault, for example) as the source directory?