Show HN: Chat with your data using LangChain, Pinecone, and Airbyte
220 points | mtricot | 2 years ago | airbyte.com
A few of our team members at Airbyte (and Joe, who killed it!) recently built an internal support chat bot, using Airbyte, LangChain, Pinecone, and OpenAI, that answers the questions we run into when developing a new connector on Airbyte.
As we prototyped it, we realized it could be applied to many other use cases and sources of data, so we created a tutorial that other community members can leverage [http://airbyte.com/tutorials/chat-with-your-data-using-opena...] and a GitHub repo to run it [https://github.com/airbytehq/tutorial-connector-dev-bot]
The tutorial shows:
- How to extract unstructured data from a variety of sources using Airbyte Open Source
- How to load the data into a vector database (here Pinecone), preparing it for LLM usage along the way
- How to integrate the vector database with ChatGPT to ask questions about your proprietary data
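As a rough illustration of the "preparing the data for LLM usage" step: documents are typically split into overlapping chunks before being embedded, so each piece fits the embedding model's input and retains some surrounding context. A minimal sketch (the chunk size and overlap values here are illustrative, not the tutorial's actual settings):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks for embedding.

    Each chunk is at most `chunk_size` characters, and consecutive
    chunks share `overlap` characters so context isn't cut mid-thought.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk would then be embedded and upserted into the vector store; real pipelines usually split on token counts or sentence boundaries rather than raw characters.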
I hope some of it is useful, and would love your feedback!
jmorgan|2 years ago
https://python.langchain.com/docs/integrations/llms/ollama
This can be a great option if you'd like to keep your data local instead of submitting it to a cloud LLM, with the added benefit of saving costs if you're submitting many questions in a row (e.g. in batches).
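For reference, Ollama serves models through a local REST API (on localhost:11434 by default), so prompts never leave your machine. A minimal sketch of building a request body for its /api/generate endpoint; the model name is just an example and assumes you've already pulled it locally:

```python
def build_ollama_request(prompt, model="llama2"):
    """Build the JSON body for Ollama's /api/generate endpoint.

    `stream=False` asks the server to return one complete response
    instead of streaming tokens.
    """
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_ollama_request("Why is the sky blue?")

# Actually calling the server requires a running Ollama instance:
# import json, urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```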
mtricot|2 years ago
hnhg|2 years ago
zarazas|2 years ago
bestcoder69|2 years ago
bnchrch|2 years ago
So much of what's written on LLMs lately is blog spam for SEO, but this is actually information-dense and practical. Definitely bookmarking this for tonight.
Also really happy to see a bonus section on pulling in data from third-party websites. I think this is where LLMs get really interesting. Not only is data much easier to query with these new models, it's also orders of magnitude easier to ingest from traditionally malformatted sources.
BoorishBears|2 years ago
hubraumhugo|2 years ago
bnchrch|2 years ago
My original comment was intended as my own personal opinion, as someone who reads/writes tutorials for fun.
But I owe the readers an apology because I did not add any disclosure, and honestly I should have.
mtricot|2 years ago
wanderingmind|2 years ago
ramesh31|2 years ago
mtricot|2 years ago
- Airbyte has two self-hosted options: OSS & Enterprise
- Langchain: OSS
- OpenAI: you can host an OSS model if you want to
- Pinecone: there are OSS/self-hosted alternatives
_pdp_|2 years ago
mtricot|2 years ago
The great thing about plugging this whole stack together is that the data stays refreshed as more issues/connectors get created.
rahimnathwani|2 years ago
sandGorgon|2 years ago
Second, a LOT of enterprises want to use non-OpenAI embedding models (MiniLM, GTE, BGE). Will you support that? E.g. in Edgechains we natively support BGE and MiniLM. Would you be able to support that?
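Whichever model produces the embeddings (OpenAI's API, MiniLM, BGE, ...), retrieval ultimately reduces to comparing vectors under a similarity metric, most commonly cosine similarity. A minimal sketch of that comparison, independent of any particular embedding provider:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors: 1.0 for
    identical direction, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

This is why swapping embedding models is mostly a question of the destination supporting the model's output dimensionality, not of changing the retrieval logic.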
amanivan|2 years ago
anupsurendran|2 years ago
They still have work to do on different connectors (e.g. PDF), but the realtime simple document pipeline is what helps a lot.
gz5|2 years ago
When using a local vector db, what is the security model between my data and Airbyte? For example, do I need to permit Airbyte IPs into my environment, and is there a VPN-type option for private connectivity?
mtricot|2 years ago
Airbyte comes in 3 flavors: OSS, Cloud, Enterprise.
For OSS & Enterprise, data doesn't leave your infra since Airbyte runs in your infrastructure. For Cloud, you would have to allowlist some of our IPs so we can access your local db.
r_thambapillai|2 years ago
mtricot|2 years ago
If you have data with PII:
One option would be to use Airbyte to bring the data into files/a local db rather than directly into the vector store, add an extra step that strips all PII from the data, and then configure Airbyte to move the clean files/records to the vector store.
The option that jmorgan mentions is relevant here: using a "self-hosted" model.
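A minimal sketch of what such a PII-stripping step might look like before records are forwarded to the vector store. The regexes here are illustrative only; real redaction should rely on a vetted library or an organizational policy, not two patterns:

```python
import re

# Illustrative patterns only -- they catch common email/phone shapes,
# not all PII. Real pipelines need a proper redaction tool.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def strip_pii(record: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    record = EMAIL.sub("[EMAIL]", record)
    record = PHONE.sub("[PHONE]", record)
    return record
```

Running this over each extracted record before the load step means the vector store only ever sees the redacted text.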
frankfrank13|2 years ago
unknown|2 years ago
[deleted]
swyx|2 years ago
What was the thinking behind choosing to support a "Vector Database (powered by LangChain)" destination instead of supporting Pinecone, Chroma, et al. directly, as you do with other destinations? When is direct integration the right approach, and when is it better to have a (possibly brittle, but faster time-to-market) integration of an integration?
mtricot|2 years ago
rschwabco|2 years ago
amelius|2 years ago
mtricot|2 years ago
kingforaday|2 years ago
johndhi|2 years ago
mtricot|2 years ago
zby|2 years ago
BoorishBears|2 years ago
LangChain doesn't make sense for a ton of reasons, but the top few are the code quality being horrid, the scope being ill defined, and the fact that most of the tasks it does are better done with a prompt that was designed for your exact use case.
electrondood|2 years ago
everythingmeta|2 years ago
Any plans to write a tutorial for fine-tuning local models?
mtricot|2 years ago
croes|2 years ago
mtricot|2 years ago