If you have a dataset of 10M+ real-world OpenAI-dimensional vectors, please share it; I'll use it in the next benchmarks. Random datasets are misleading for vector search benchmarks because all ANN engines exploit the internal distribution of real datasets to cope with the curse of dimensionality, and random data lacks that structure. So I never use random datasets for ANN index benchmarking. Using simplified lower-dimensional vectors (e.g. 128 instead of 1536) also changes performance trends.
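A minimal sketch of why random data misleads, assuming i.i.d. Gaussian vectors normalized to the unit sphere (not any particular benchmark's data): in high dimensions, pairwise distances between random points concentrate, so the nearest and farthest neighbors of a query become nearly equidistant and an ANN index has little exploitable structure. Real embeddings cluster on a lower-dimensional manifold instead.

```python
# Illustrative only: distance concentration for random unit vectors.
# The farthest/nearest distance ratio from a query shrinks toward 1
# as dimensionality grows, which is what makes random data an easy
# but unrepresentative target for ANN indexes.
import numpy as np

rng = np.random.default_rng(0)

def contrast(dim, n=2000):
    """Ratio of farthest to nearest distance from one query point."""
    x = rng.standard_normal((n, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # project to unit sphere
    q = x[0]
    d = np.linalg.norm(x[1:] - q, axis=1)
    return d.max() / d.min()

for dim in (2, 128, 1536):
    print(dim, round(contrast(dim), 2))
```

Running this, the ratio is large at dim=2 and collapses toward 1 by dim=1536, which is the regime the OpenAI embeddings above live in.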
moab|2 years ago
They're not OpenAI embeddings, but they are realistic, and much larger in number of vectors.
I think many production systems at non-OpenAI companies use embeddings with fewer than 1536 dimensions, so it makes sense to include non-OpenAI embeddings in your benchmarking as well.