(no title)

qualudeheart | 2 years ago

Could you share the code with us?

spacetime_cmplx|2 years ago

Sure! https://pastebin.com/xm7D1c30

I didn't bother cleaning it up, so it's just a code dump, but it's fairly straightforward. Not included are a Python script that parses and cleans the raw documents into JSON files (used in `summarize` to output results), the code that reads these files and gets the embeddings from OpenAI for use in `newEmbeddingJSON`, and a bunch of random parallelization shell scripts that I didn't save.

To use it, I call `newDBFromJSON` from a directory of JSON embedding vectors and serialize the binary representation. This takes a few minutes, mostly because parsing JSON is slow, but I only needed to do it once. When I need to search for the top 10 documents most similar to document X, I call `search` with the embedding vector for that doc. Alternatively, if I need to do semantic search with natural language, I call the OpenAI API to get the embedding vector for the query and call `search` with that vector. It's pretty fast thanks to Go concurrency maxing out my CPU, and the search results are very accurate thanks to OpenAI's embeddings.
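
For anyone who doesn't want to open the pastebin, the search step described above could look roughly like the sketch below. This is not the pastebin code: `Doc`, `Hit`, `cosine`, and `Search` are hypothetical names, and it assumes a brute-force cosine-similarity scan with one goroutine per CPU core over vectors that are already loaded into memory.

```go
package main

import (
	"math"
	"runtime"
	"sort"
	"sync"
)

// Doc pairs a document ID with its embedding vector (hypothetical type).
type Doc struct {
	ID  string
	Vec []float32
}

// Hit is one search result: a document ID and its similarity to the query.
type Hit struct {
	ID    string
	Score float32
}

// cosine returns the cosine similarity of a and b, which must have equal length.
// Vectors are stored as float32; the sums are accumulated in float64.
func cosine(a, b []float32) float32 {
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return float32(dot / (math.Sqrt(na) * math.Sqrt(nb)))
}

// Search scores every document against query, splitting the work into one
// chunk per CPU core, and returns the k highest-scoring hits.
func Search(docs []Doc, query []float32, k int) []Hit {
	hits := make([]Hit, len(docs))
	workers := runtime.NumCPU()
	chunk := (len(docs) + workers - 1) / workers

	var wg sync.WaitGroup
	for start := 0; start < len(docs); start += chunk {
		end := start + chunk
		if end > len(docs) {
			end = len(docs)
		}
		wg.Add(1)
		// Each goroutine writes only to its own index range, so no lock is needed.
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				hits[i] = Hit{ID: docs[i].ID, Score: cosine(docs[i].Vec, query)}
			}
		}(start, end)
	}
	wg.Wait()

	// Sort by descending similarity and keep the top k.
	sort.Slice(hits, func(i, j int) bool { return hits[i].Score > hits[j].Score })
	if k < 0 {
		k = 0
	}
	if k > len(hits) {
		k = len(hits)
	}
	return hits[:k]
}
```

Calling `Search(docs, queryVec, 10)` would then return the ten closest documents. Sorting every score is the simple option; a bounded heap would avoid the full sort, but for a one-off tool the scan itself usually dominates.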

It's nowhere close to production-ready (it's littered with panics), but it was good enough for me.

Hope this helps!

Edit: oh and don't use float64 (OpenAI's vectors are float16)