spacetime_cmplx | 2 years ago

Unless you have several hundred million documents, just write a simple encoder that serializes the embedding vectors to a flat binary file.

Writing code from scratch to process and search 200k unstructured documents -- parsing, cleaning, chunking, calling the OpenAI embedding API, serialization code, linear search with cosine similarity, and the actual time to debug, test, and run all of it -- took me less than 3 hours in Go. The flat binary representation of all the vectors is under 500 MB. I even went ahead and made it mmap-friendly for the fun of it, even though I could have read it all into memory.

Even the dumb linear search I wrote takes just 20-30ms per query on my MacBook for the 200k documents. The search results are fantastic.

dnadler | 2 years ago

Less than 3 hours is impressive, but it took me less than 10 minutes to do the same with Chroma.

qualudeheart | 2 years ago

Could you share the code with us?

spacetime_cmplx | 2 years ago

Sure! https://pastebin.com/xm7D1c30

I didn't bother cleaning it up, so it's just a code dump, but it's fairly straightforward. Not included are the Python script that parses and cleans the raw documents into JSON files (used in `summarize` to output results), the code that reads those files and fetches the embeddings from OpenAI for use in `newEmbeddingJSON`, and a bunch of one-off parallelization shell scripts that I didn't save.

To use it, I call `newDBFromJSON` on a directory of JSON embedding vectors and serialize the binary representation. That takes a few minutes, mostly because parsing JSON is slow, but I only needed to do it once. When I want the top 10 documents most similar to document X, I call `search` with that document's embedding vector. Alternatively, for semantic search with natural language, I call the OpenAI API to get the embedding vector for the query and call `search` with that vector. It's pretty fast thanks to Go concurrency maxing out my CPU, and the search results are super accurate thanks to OpenAI's embeddings.

It's nowhere close to production-ready (it's littered with panics), but it was good enough for me.

Hope this helps!

Edit: oh and don't use float64 (OpenAI's vectors are float16)