top | item 38075196

Show HN: Playground for comparing embedding models on Wikipedia+book retrieval

5 points | davidtsong | 2 years ago | embeds.ai

Introducing embeds.ai: an embedding playground to compare how embedding models perform on a real-world use case (retrieval-augmented generation for Wikipedia articles + Elad Gil's High Growth Handbook)

A few weeks ago, Shreyan and I were looking for an embedding model to use for RAG. We eventually came across the MTEB leaderboard, but we struggled to understand the benchmark scores.

We wanted a tool to test various embedding models with example queries on real-world datasets. After unsuccessfully looking for such a “playground”, we decided to just build one ourselves!

We embedded HuggingFace’s Simple Wikipedia dataset using @OpenAI, @Cohere, and 2 open-source models via @Baseten. We then stored the embeddings in @Supabase using pgvector. Finally, we built a web app using NextJS and deployed it on @Vercel.
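The pipeline above (embed documents, store vectors, retrieve by similarity) can be sketched end to end. This is a minimal, self-contained illustration, not the project's actual code: `embed` here is a stand-in for a real embedding API call (OpenAI, Cohere, etc.), and the in-memory list stands in for the pgvector table in Supabase.

```python
import math

def embed(text: str, dim: int = 8) -> list[float]:
    # Stand-in for a real embedding API call (e.g. OpenAI or Cohere):
    # hashes character trigrams into a fixed-size, L2-normalized vector.
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# "Store": each row pairs a document with its embedding,
# much like a row in a pgvector table.
docs = [
    "Paris is the capital of France.",
    "The mitochondria is the powerhouse of the cell.",
    "France is a country in Western Europe.",
]
store = [(d, embed(d)) for d in docs]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Top-k retrieval: rank all stored documents by similarity to the query.
    q = embed(query)
    ranked = sorted(store, key=lambda row: cosine(q, row[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

In the real app the `sorted` call is replaced by a pgvector nearest-neighbor query, which avoids scanning every row.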

Now we’re hosting the playground for anyone to use for free, as well as open-sourcing our work so people can try evaluating other models, datasets, or indexes.

Learn more in our full blog post here: https://shreyanjain.substack.com/p/announcing-embedding-batt...

And the repo is here: https://github.com/EGCap/playground

If you have other suggestions / pain points from working with embedding models, vector DBs, or RAG, or if you would like to collaborate on any of the above or unrelated projects, please reach out! @shreyanj98 @davidtsong on Twitter

11 comments


varunshenoy|2 years ago

Awesome job guys, and thank you for creating it. Curious if you guys have any insights on long-term memory and if there are better ways to do retrieval apart from top-k.

Seems weird that every RAG app uses top-k, especially since you might pull in information irrelevant to the context (e.g. if you were asking for the names of the authors of a paper, you probably only want the top-1 embedding).

davidtsong|2 years ago

Definitely, top-k is a very naive way to do RAG. I think people have experimented with using a cross-encoder-like approach or even letting the LLM choose the sources. We will experiment with more approaches like this :)
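The retrieve-then-rerank idea mentioned here can be sketched in a few lines. This is a hedged illustration, not the playground's code: `cross_encoder_score` is a stand-in for a real cross-encoder, which would run the query and document through one model jointly and output a relevance score; here it is just token overlap.

```python
def cross_encoder_score(query: str, doc: str) -> float:
    # Stand-in for a real cross-encoder relevance model:
    # fraction of query tokens that also appear in the document.
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / (len(q_tokens) or 1)

def rerank(query: str, candidates: list[str], final_k: int = 1) -> list[str]:
    # Retrieve-then-rerank: a cheap vector search proposes candidates
    # (the top-k step), then a more expensive scorer reorders them and
    # keeps only the best few, addressing the "top-1 vs top-k" concern.
    scored = sorted(candidates,
                    key=lambda d: cross_encoder_score(query, d),
                    reverse=True)
    return scored[:final_k]
```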

clueless_stats|2 years ago

Looks useful - will be cool to see the results as more models and datasets are added!

sr33j|2 years ago

very cool work! if you used diff models to embed the docs, did they give you diff sized vectors? did this cause any problems in db storage or calculating vector distances?

tigs_|2 years ago

nice tool! curious - what was your instruction prompt for instructor-large? did that change based on the document type at all?

shreyanj|2 years ago

We used a really simple prompt: "Represent the document for retrieval: <doc>". We did not get around to experimenting with it or changing it based on the document type; that's a great idea for future extension!
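The per-document-type idea could be wired up with a small lookup. A minimal sketch: only the "default" instruction below is the one actually used in the playground; the others are illustrative guesses, and the function name is hypothetical.

```python
# Only "default" is the instruction quoted above; the other
# entries are made-up examples of per-type instructions.
INSTRUCTIONS = {
    "default": "Represent the document for retrieval:",
    "wikipedia": "Represent the Wikipedia article for retrieval:",
    "book": "Represent the book passage for retrieval:",
}

def to_instructor_input(doc: str, doc_type: str = "default") -> list[str]:
    # Instructor-style models take [instruction, text] pairs as input.
    instruction = INSTRUCTIONS.get(doc_type, INSTRUCTIONS["default"])
    return [instruction, doc]
```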

ankitd33|2 years ago

woah cool! What was the rationale for supabase vs vector db?

davidtsong|2 years ago

Supabase has pgvector which makes it pretty easy to get started :)!
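For context, the pgvector workflow boils down to a few SQL statements. A minimal sketch (table name is illustrative; 1536 is the dimension of OpenAI's ada-002 embeddings, and other models would need their own dimension):

```sql
-- Enable the extension (available out of the box on Supabase).
create extension if not exists vector;

create table documents (
  id bigserial primary key,
  content text,
  embedding vector(1536)
);

-- Nearest neighbors by cosine distance (the <=> operator);
-- the query vector literal is elided here.
select content
from documents
order by embedding <=> '[...]'::vector
limit 5;
```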