jbellis|2 years ago
Nice to see people care about index construction time.
I'm the lead author of JVector, which scales linearly to at least 32 cores and may be the only graph-based vector index designed around nonblocking data structures (as opposed to using locks for thread safety): https://github.com/jbellis/jvector/
JVector looks to be about 2x as fast at indexing as Lantern, ingesting the SIFT1M dataset in under 25s on a 32-core AWS box (m6i.16xl), compared to 50s for Lantern in the article.
(JVector is based on DiskANN, not HNSW, but the configuration parameters are similar -- both are configured with graph degree and search width.)
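To make those two shared knobs concrete, here is a toy pure-Python sketch of graph-based ANN search: a naive brute-force kNN graph (real DiskANN/HNSW builds are incremental and much smarter) plus a greedy beam search where `degree` is the graph degree and `width` is the search beam ("ef" in HNSW, "L" in DiskANN). This is an illustration of the idea, not JVector's actual code; all names are mine.

```python
import heapq

def build_knn_graph(vectors, degree):
    # Naive O(n^2) graph: connect each node to its `degree` nearest
    # neighbors by squared L2 distance. Real libraries build this
    # incrementally and prune edges for diversity.
    graph = []
    for i, v in enumerate(vectors):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(v, u)), j)
            for j, u in enumerate(vectors) if j != i
        )
        graph.append([j for _, j in dists[:degree]])
    return graph

def beam_search(vectors, graph, query, k, width):
    # `width` is the search beam: a larger beam raises recall at the
    # cost of more distance computations.
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(vectors[i], query))
    start = 0
    d0 = dist(start)
    visited = {start}
    frontier = [(d0, start)]   # min-heap of candidates to expand
    best = [(-d0, start)]      # max-heap (negated) of the best `width` seen
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) >= width and d > -best[0][0]:
            break  # nearest unexpanded candidate is worse than our worst kept
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                dn = dist(nb)
                heapq.heappush(frontier, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > width:
                    heapq.heappop(best)  # drop the worst kept candidate
    return sorted((-d, i) for d, i in best)[:k]

# 1-D chain of points, so the kNN graph is a connected path.
vectors = [[float(i)] for i in range(20)]
graph = build_knn_graph(vectors, degree=2)
print(beam_search(vectors, graph, [13.2], k=1, width=4))  # nearest is node 13
```

The same trade-off drives both libraries' tuning: degree controls index size and build time, beam width controls the query-time recall/latency balance.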
VoVAllen|2 years ago
This reads like a marketing piece, not an honest technical blog post.
unknown|2 years ago
[deleted]
nerfborpit|2 years ago
I agree that USearch is fast, but it feels pretty dishonest to take credit for someone else's work. Maybe at least honestly profile what's going on with USearch vs pgvector (and with which settings for pgvector?), and write something interesting about it?
The last time I tried Lantern, it'd segfault when I tried to do anything non-trivial with it, and it was incredibly unsafe with how it handled memory. Hopefully that's at least fixed, but Lantern has so many red flags.
ashvardanian|2 years ago
Not sure if it's fair to compare USearch and pgvector. One is an efficient indexing structure, the other is more like a pure database plugin. Not that they can't be used in a similar fashion.
If you are looking for pure indexing benchmarks, you might be interested in USearch vs FAISS's HNSW implementation [1]. We ran them ourselves (as have a couple of other tech companies), so take them with a grain of salt; they might be biased.
As for Lantern vs pgvector: impressed by the result! A lot of people would benefit from having fast vector search compatible with Postgres. The way to go!
It wasn't a trivial integration by any means, and the Lantern team was very active, suggesting patches to the upstream version to ease integration with other databases. Some of those are tricky and have yet to be merged [2]. So stay tuned for USearch v3 - lots of new features coming :)
[1]: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search...
[2]: https://github.com/unum-cloud/usearch/pull/171/files
diqi|2 years ago
Hi, sorry that you didn't have a good experience with Lantern before. We first posted on HN about 3 months ago - things should be better now; please let us know if you have any issues.
nerfborpit|2 years ago
Using ivfflat is much faster for bulk index creation than Lantern. There are a lot of trade-offs depending on each specific use case, but it seems like a pretty massive thing to leave out.

```
postgres=# CREATE INDEX ON sift USING ivfflat (v vector_l2_ops) WITH (lists=1000);
CREATE INDEX
Time: 65697.411 ms (01:05.697)
```
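Worth noting that ivfflat's recall hinges on `lists` at build time and `probes` at query time. pgvector's README suggests starting from lists = rows/1000 (up to ~1M rows) or sqrt(rows) beyond that, and probing sqrt(lists) clusters. A small helper computing those starting points (the heuristic is pgvector's documented guidance; the function name is mine):

```python
import math

def ivfflat_starting_params(rows):
    # Heuristic from pgvector's README: lists = rows / 1000 for up to
    # ~1M rows, sqrt(rows) beyond that; probe sqrt(lists) clusters at
    # query time as a starting point, then tune for recall vs speed.
    if rows <= 1_000_000:
        lists = max(1, rows // 1000)
    else:
        lists = round(math.sqrt(rows))
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes

# For the 1M-vector SIFT dataset in the snippet above:
print(ivfflat_starting_params(1_000_000))  # (1000, 32) -> matches lists=1000
```

So the `lists=1000` above matches the recommended default for SIFT1M; with `probes` left at its default of 1, though, recall would be much lower than an HNSW index tuned for the same build time.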
lettergram|2 years ago
As someone who just indexed 6M documents with pgvector, I can say it's a massive time sink - on the order of days, even with a 32-core, 64 GB RDS instance.
cyanydeez|2 years ago
jn2clark|2 years ago
That sounds much longer than it should be. I am not sure of your exact use case, but I would encourage you to check out Marqo (https://github.com/marqo-ai/marqo - disclaimer, I am a co-founder). All inference and orchestration is included (no API calls), and many open-source or fine-tuned models can be used.
mattashii|2 years ago
How does performance scale (vs pgvector) when you have an index and start loading data in parallel? And how does this scale vs the to-be-released pgvector 0.5.2?
mattashii|2 years ago
> https://github.com/lanterndata/lantern/blob/040f24253e5a2651...
> Operator <-> can only be used inside of an index
Isn't the use of the distance operator in scan+sort critical for generating the expected/correct result that's needed for validating the recall of an ANN-only index?
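The question above is exactly how recall is usually validated: compare the ANN index's top-k against an exact scan-and-sort over the same data. A minimal sketch of the brute-force side and the recall metric (the ANN results would come from the index under test; all names here are mine):

```python
def exact_top_k(vectors, query, k):
    # Ground truth: full scan + sort by squared L2 distance, i.e. what
    # ORDER BY v <-> q would do without any index.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(v, query)), i)
        for i, v in enumerate(vectors)
    )
    return [i for _, i in dists[:k]]

def recall_at_k(ann_ids, exact_ids):
    # Fraction of the true top-k that the ANN index actually returned.
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)

vectors = [[0.0], [1.0], [2.0], [10.0]]
truth = exact_top_k(vectors, [0.4], k=2)   # [0, 1]
print(recall_at_k([0, 2], truth))          # 0.5: found node 0, missed node 1
```

If the distance operator only works inside an index scan, producing that ground-truth ordering in SQL becomes awkward, which is presumably the concern being raised.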
diqi|2 years ago
We think our approach will still significantly outperform pgvector because it does less on your production database.
We generate the index remotely, on a compute-optimized machine, and only use your production database to copy the index in.
Parallel pgvector would have to use your production database resources to run the compute-intensive HNSW index creation workload.
levkk|2 years ago
diqi|2 years ago
Yes, it is WAL-protected. The advantage of external indexing is that the HNSW graph is constructed externally on multiple cores instead of on a single core inside the Postgres process. But eventually the graph is parsed and processed inside Postgres, with all the necessary WAL logging for blocks.
justinclift|2 years ago
TuringNYC|2 years ago
diqi|2 years ago
netcraft|2 years ago
This extension is licensed under the Business Source License[0], which makes it incompatible with most DBaaS offerings. The BSL is a closed-source license: a good choice for Lantern, but unusable for everyone else.
Still, very impressive
tristan957|2 years ago
Some Postgres offerings allow you to bring your own extensions to workaround limitations of these restrictive licenses, for instance Neon[1], where I work. I tried to look at the AWS docs for you, but couldn't find anything about that. I did find Trusted Language Extensions[2], but that seems to be more about writing your own extension. Couldn't find a way to upload arbitrary extensions.
I will add that you could use logical replication[3] to mirror data from your primary database into a Lantern-hosted database (or host your own database with the Lantern extension). This obviously has a couple downsides, but thought I would mention it.
[0]: https://github.com/lanterndata/lantern/commit/dda7f064ca80af...
[1]: https://neon.tech/docs/extensions/pg-extensions#custom-built...
[2]: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Postg...
[3]: https://www.postgresql.org/docs/current/logical-replication....
themanmaran|2 years ago
Likely as an extension eventually. I know RDS has a variety of Postgres extensions you can use. pgvector is supported [1], so Lantern could likely get support as well.
[1] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Postg...
lee101101|2 years ago
[deleted]