spacetime_cmplx | 2 years ago
Writing code from scratch to process and search 200k unstructured documents -- parsing, cleaning, chunking, the OpenAI embedding API, serialization code, linear search with cosine similarity, plus the time to debug, test, and run all of this -- took me less than 3 hours in Go. The flat binary representation of all the vectors is under 500 MB. I even made it mmap-friendly for the fun of it, even though I could have read it all into memory.
Even the dumb linear search I wrote takes just 20-30ms per query on my Macbook for the 200k documents. The search results are fantastic.
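A "dumb" linear scan really is only a few lines: score the query against every stored vector and sort. A minimal Go sketch of the approach (names are hypothetical, not the commenter's code):

```go
package main

import (
	"math"
	"sort"
)

// cosine returns the cosine similarity between two equal-length vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

type hit struct {
	Index int
	Score float64
}

// linearSearch scores the query against every vector and returns the
// top-k hits. This is O(n*dim) per query -- brute force, but at a few
// hundred thousand vectors it completes in tens of milliseconds.
func linearSearch(query []float32, vecs [][]float32, k int) []hit {
	hits := make([]hit, len(vecs))
	for i, v := range vecs {
		hits[i] = hit{i, cosine(query, v)}
	}
	sort.Slice(hits, func(i, j int) bool { return hits[i].Score > hits[j].Score })
	if k > len(hits) {
		k = len(hits)
	}
	return hits[:k]
}
```

For top-k with small k, a bounded heap would beat the full sort, but at 200k vectors the sort is not the bottleneck -- the dot products are.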
spacetime_cmplx | 2 years ago
I didn't bother cleaning it up, so it's just a code dump, but it's fairly straightforward. Not included are a Python script that parses and cleans the raw documents into JSON files (used in `summarize` to output results), the code that reads those files and gets the embeddings from OpenAI for use in `newEmbeddingJSON`, and a bunch of random parallelization shell scripts that I didn't save.
To use it, I call `newDBFromJSON` on a directory of JSON embedding vectors and serialize the binary representation. This takes a few minutes, mostly because parsing JSON is slow, but I only needed to do it once. When I want the top 10 documents most similar to document X, I call `search` with that document's embedding vector. For semantic search with natural language, I instead call the OpenAI API to get an embedding vector for the query and pass that to `search`. It's pretty fast thanks to Go concurrency maxing out my CPU, and the search results are super accurate thanks to OpenAI's embeddings.
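The "Go concurrency maxing out my CPU" part is probably the standard pattern of splitting the scan across one goroutine per core. A self-contained sketch of that idea, simplified to return only the single best match (my guess at the approach -- the real code would keep a top-k result set per worker):

```go
package main

import (
	"math"
	"runtime"
	"sync"
)

// cosine returns the cosine similarity between two equal-length vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// parallelBest splits the vector set into one contiguous chunk per CPU,
// scans each chunk in its own goroutine, then merges the per-worker
// winners. Returns the index of the most similar vector, or -1 if empty.
func parallelBest(query []float32, vecs [][]float32) int {
	workers := runtime.NumCPU()
	type result struct {
		idx   int
		score float64
	}
	results := make([]result, workers)
	chunk := (len(vecs) + workers - 1) / workers

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			best := result{-1, -2} // cosine similarity is always >= -1
			lo, hi := w*chunk, (w+1)*chunk
			if hi > len(vecs) {
				hi = len(vecs)
			}
			for i := lo; i < hi; i++ {
				if s := cosine(query, vecs[i]); s > best.score {
					best = result{i, s}
				}
			}
			results[w] = best // each worker writes only its own slot: no locking needed
		}(w)
	}
	wg.Wait()

	best := results[0]
	for _, r := range results[1:] {
		if r.idx >= 0 && r.score > best.score {
			best = r
		}
	}
	return best.idx
}
```

Contiguous chunks keep each goroutine's reads sequential, which plays nicely with the flat binary layout whether it's mmap'd or read into memory.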
It's nowhere close to production-ready (it's littered with panics), but it was good enough for me.
Hope this helps!
Edit: oh and don't use float64 (OpenAI's vectors are float16)