Show HN: Chat with your data using LangChain, Pinecone, and Airbyte
220 points | mtricot | 2 years ago | airbyte.com
A few of our team members at Airbyte (and Joe, who killed it!) recently built an internal support chat bot, using Airbyte, LangChain, Pinecone, and OpenAI, that answers the questions we run into when developing a new connector on Airbyte.
As we prototyped it, we realized it could be applied to many other use cases and sources of data, so we created a tutorial that other community members can leverage [http://airbyte.com/tutorials/chat-with-your-data-using-opena...] and a GitHub repo to run it [https://github.com/airbytehq/tutorial-connector-dev-bot]
The tutorial shows:
- How to extract unstructured data from a variety of sources using Airbyte Open Source
- How to load the data into a vector database (here Pinecone), preparing it for LLM usage along the way
- How to integrate the vector database with ChatGPT to ask questions about your proprietary data
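As a rough illustration of the "preparing the data for LLM usage" step: documents are typically split into overlapping chunks before being embedded, so each piece fits the embedding model's input and retains some surrounding context. A minimal sketch (the chunk size and overlap values here are illustrative, not the tutorial's actual settings):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks for embedding.

    Each chunk is at most `chunk_size` characters, and consecutive
    chunks share `overlap` characters so context isn't cut mid-thought.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk would then be embedded and upserted into the vector store; real pipelines usually split on token counts or sentence boundaries rather than raw characters.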
I hope some of it is useful, and would love your feedback!
jmorgan|2 years ago
https://python.langchain.com/docs/integrations/llms/ollama
This can be a great option if you'd like to keep your data local instead of submitting it to a cloud LLM, with the added benefit of saving costs if you're submitting many questions in a row (e.g. in batches).
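For reference, Ollama serves models through a local REST API (on localhost:11434 by default), so prompts never leave your machine. A minimal sketch of building a request body for its /api/generate endpoint; the model name is just an example and assumes you've already pulled it locally:

```python
def build_ollama_request(prompt, model="llama2"):
    """Build the JSON body for Ollama's /api/generate endpoint.

    `stream=False` asks the server to return one complete response
    instead of streaming tokens.
    """
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_ollama_request("Why is the sky blue?")

# Actually calling the server requires a running Ollama instance:
# import json, urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```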
mtricot|2 years ago
hnhg|2 years ago
zarazas|2 years ago
bestcoder69|2 years ago
bnchrch|2 years ago
So much of what's written on LLMs lately is blog spam for SEO, but this is actually information-dense and practical. Definitely bookmarking this for tonight.
Also really happy to see a bonus section on pulling in data from third-party websites. I think this is where LLMs get really interesting. Not only is data much easier to query with these new models, it's also orders of magnitude easier to ingest from traditionally malformatted sources.
BoorishBears|2 years ago
hubraumhugo|2 years ago
bnchrch|2 years ago
My original comment was intended as my own personal opinion, as someone who reads/writes tutorials for fun.
But I owe the readers an apology because I did not add any disclosure, and honestly I should have.
mtricot|2 years ago
wanderingmind|2 years ago
ramesh31|2 years ago
mtricot|2 years ago
- Airbyte has two self-hosted options: OSS & Enterprise
- Langchain: OSS
- OpenAI: you can host an OSS model if you want to
- Pinecone: there are OSS/self-hosted alternatives
_pdp_|2 years ago
mtricot|2 years ago
The great thing about plugging this whole stack together is that the data stays refreshed as more issues/connectors get created.
rahimnathwani|2 years ago
sandGorgon|2 years ago
Second, a LOT of enterprises want to use non-OpenAI embedding models (MiniLM, GTE, BGE). Will you support that? E.g. in Edgechains we natively support BGE and MiniLM. Would you be able to support that?
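Whichever model produces the embeddings (OpenAI's API, MiniLM, BGE, ...), retrieval ultimately reduces to comparing vectors under a similarity metric, most commonly cosine similarity. A minimal sketch of that comparison, independent of any particular embedding provider:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors: 1.0 for
    identical direction, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

This is why swapping embedding models is mostly a question of the destination supporting the model's output dimensionality, not of changing the retrieval logic.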
amanivan|2 years ago
anupsurendran|2 years ago
They still have work to do on different connectors (e.g. PDF), but the realtime simple document pipeline is what helps a lot.
gz5|2 years ago
When using a local vector db, what is the security model between my data and Airbyte? For example, do I need to permit Airbyte IPs into my environment, and is there a VPN-type option for private connectivity?
mtricot|2 years ago
Airbyte comes in 3 flavors: OSS, Cloud, Enterprise.
For OSS & Enterprise, data doesn't leave your infra since Airbyte runs in your infrastructure. For Cloud, you would have to allowlist some of our IPs so we can access your local db.
r_thambapillai|2 years ago
mtricot|2 years ago
If you have data with PII:
One option would be to use Airbyte to bring the data into files/a local db rather than directly into the vector store, add an extra step that strips all PII from the data, and then configure Airbyte to move the clean files/records to the vector store.
The option that jmorgan mentions is relevant here: using a "self-hosted" model.
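A minimal sketch of what such a PII-stripping step might look like before records are forwarded to the vector store. The regexes here are illustrative only; real redaction should rely on a vetted library or an organizational policy, not two patterns:

```python
import re

# Illustrative patterns only -- they catch common email/phone shapes,
# not all PII. Real pipelines need a proper redaction tool.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def strip_pii(record: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    record = EMAIL.sub("[EMAIL]", record)
    record = PHONE.sub("[PHONE]", record)
    return record
```

Running this over each extracted record before the load step means the vector store only ever sees the redacted text.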
frankfrank13|2 years ago
unknown|2 years ago
[deleted]
swyx|2 years ago
What was the thinking behind choosing to support a "Vector Database (powered by LangChain)" destination instead of supporting Pinecone, Chroma, et al. directly, as you do with other destinations? When is direct integration the right approach, and when is it better to have a (possibly brittle, but faster time-to-market) integration of an integration?
mtricot|2 years ago
rschwabco|2 years ago
amelius|2 years ago
mtricot|2 years ago
kingforaday|2 years ago
johndhi|2 years ago
mtricot|2 years ago
zby|2 years ago
BoorishBears|2 years ago
LangChain doesn't make sense for a ton of reasons, but the top few are the code quality being horrid, the scope being ill defined, and the fact that most of the tasks it does are better done with a prompt that was designed for your exact use case.
electrondood|2 years ago
everythingmeta|2 years ago
Any plans to write a tutorial for fine-tuning local models?
mtricot|2 years ago
croes|2 years ago
mtricot|2 years ago