Having worked with Simon, I can say he knows his sh*t. We talked a lot about what the ideal search stack would look like when we worked together at Shopify on search (him more infra, me more ML+relevance). I discussed how I just want a thing in the cloud to provide my retrieval arms, let me express ranking in a fluent, "py-data"-first way, and get out of my way.
My ideal is that turbopuffer ultimately is like a Polars dataframe where all my ranking is expressed in my search API. I could just lazily express some lexical or embedding similarity, then boost by various attributes (recency, popularity, etc.) to get a first pass (again, all just with dataframe math). Then compute features for a reranking model I run on my side - dataframe math - and it "just works" - runs all this as some kind of query execution DAG - and stays out of my way.
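To make the two-stage idea above concrete, here is a minimal plain-Python sketch: a first pass that boosts a precomputed similarity by recency and popularity, then a rerank with a model supplied by the caller. This is not Polars and not turbopuffer's API; the `Doc` fields, the boost formula, and the 30-day half-life are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    similarity: float  # lexical or embedding similarity, assumed precomputed
    age_days: float
    popularity: float


def first_pass_score(doc, half_life_days=30.0):
    # Hypothetical boost: recency decays by half every `half_life_days`,
    # popularity is dampened with log1p so large counts do not dominate.
    recency = 0.5 ** (doc.age_days / half_life_days)
    return doc.similarity * (1.0 + recency) * (1.0 + math.log1p(doc.popularity))


def first_pass(docs, k):
    # Cheap retrieval-side ranking: keep the top k by boosted score.
    return sorted(docs, key=first_pass_score, reverse=True)[:k]


def rerank(candidates, model):
    # `model` is any callable Doc -> float that the caller runs on their side.
    return sorted(candidates, key=model, reverse=True)
```

In a dataframe-style API each of these steps would be a lazy expression, and the engine would decide how to execute the resulting plan.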
+1, had the fortune to work with him at a previous startup and meet up in person. Our convo very much broadened my perspective on engineering as a career and a craft, always excited to see what he's working on. Good luck Simon!
Unrelated to the core topic, I really enjoy the aesthetic of their website. Another similar one is from Fixie.ai (also, interestingly, one of their customers).
This was my first thought too, after reading through their blog. This feels like a no-frills website made by an engineer, who makes things that just work.
The documentation is great, I really appreciate them putting the roadmap front and centre.
Yes, I like the turboxyz123 animation and its contrast with the minimalist website (reminds me of a zen garden with a single rock). I think people forget nowadays, in their haste to add the latest and greatest React animation, that too much noise is a thing.
$200/TB/month is for raw RAM, not RAM that's presented to you behind a usable API that's distributed and operated by someone else, freeing up your time.
It's not particularly useful to compare the cost of a raw, unorganized information medium on a single node to that of a highly organized information platform. It's like saying "this CPU chip is expensive, just look at the price of this sand".
> In 2022, production-grade vector databases were relying on in-memory storage
This is irking me. pgvector has existed since before that, doesn't require in-memory storage, and can definitely handle vector search for 100M+ documents in a decently performant manner. Did they have a particular requirement somewhere?
Have you tried it? pgvector performance falls off a cliff once you can't cache in RAM. Vector search isn't like "normal" workloads that follow a nice Pareto distribution.
- it does not do vector search. It can rank docs using BM25, but usually people just want to sort by timestamp.
- it does not use an SSD cache. Quickwit reads directly from object storage.
- it is append-only (you can't modify documents)
- it scales really well and typically shines on the 1TB .. 100PB range
- it has an Elasticsearch-compatible API.
Is there a good general purpose solution where I can store a large read only database in s3 or something and do lookups directly on it?
DuckDB can open Parquet files over HTTP and query them, but I found it triggers a lot of small requests, reading from a bunch of places in the files. I mean a lot.
I mostly need key/value lookups and could potentially store each key in a separate object in S3, but for a couple hundred million objects... It would be a lot more manageable to have a single file and maybe a cacheable index.
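One way to picture the "single file plus cacheable index" idea is the sketch below: values are packed back to back into one blob, and a small index maps each key to its byte offset and length, so a lookup costs one ranged read. This is an illustration only, not a specific product. The function names are hypothetical, and `read_range` slices an in-memory `bytes` object as a stand-in for a ranged GET against object storage.

```python
import io
import json


def build_blob(records):
    """Pack values back to back; return (blob_bytes, index).

    index maps key -> (offset, length) and is small enough to cache.
    """
    blob = io.BytesIO()
    index = {}
    for key, value in records.items():
        payload = json.dumps(value).encode("utf-8")
        index[key] = (blob.tell(), len(payload))
        blob.write(payload)
    return blob.getvalue(), index


def read_range(blob, offset, length):
    # Stand-in for one ranged request (HTTP "Range: bytes=start-end").
    return blob[offset:offset + length]


def lookup(blob, index, key):
    # One index probe (local, cached) plus one ranged read (remote).
    entry = index.get(key)
    if entry is None:
        return None
    offset, length = entry
    return json.loads(read_range(blob, offset, length))
```

With a couple hundred million keys the index itself would need a compact on-disk layout (for example sorted keys with binary search), which this sketch leaves out.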
> trigger a lot of small requests reading bunch of places from the files. I mean a lot.
That’s… the whole point. That’s how Parquet files are supposed to be used. They’re an improvement over CSV or JSON because clients can read small subsets of them efficiently!
For comparison, I’ve tried a few other client products that don’t use Parquet files properly and just read the whole file every time, no matter how trivial the query is.
Is it feasible to try to build this kind of approach (hot SSD cache nodes sitting in front of object storage) with prior open-source art (Lucene)? Or are the search indexes themselves also proprietary in this solution?
Having witnessed some very large Elasticsearch production deployments, being able to throw everything into S3 would be incredible. The applicability here isn't only for vector search.
Elasticsearch and OpenSearch already support S3 backed indices. See features like https://opensearch.org/docs/latest/tuning-your-cluster/avail... The files in S3 are plain old Lucene segment files (just wrapped in OpenSearch snapshots which provide a way to track metadata around those files).
If you don't need vector search and have a very large Elasticsearch deployment, have a look at Quickwit: it's a search engine on object storage, it's OSS, and it works for append-only datasets (like logs, traces, ...).
Yeah, thinking about this more, I now understand ClickHouse to be more of an operational warehouse similar to Materialize, Pinot, Druid, etc., if I understand correctly? So bunching it with BigQuery/Snowflake/Trino/Databricks... wasn't the right category (although operational warehouses certainly can have a ton of overlap).
I left that category out for simplicity (plenty of others that didn't make it into the taxonomy, e.g. queues, nosql, time-series, graph, embedded, ..)
This looks super interesting. I'm not that familiar with vector databases. I thought they were mostly something used for RAG and other AI-related stuff.
Seems like a topic I need to delve into a bit more.
Slightly relevant - do people really want article recommendations? I don’t think I’ve ever read an article and wanted a recommendation. Even with this one - I sort of read it and that’s it; no feeling of wanting recommendations.
Am I alone in this?
In any case this seems like a pretty interesting approach. Reminds me of Warpstream which does something similar with S3 to replace Kafka.
Those are some woefully disappointing and incorrect metrics you’ve got for ClickHouse there (read and write latency are both sub-second, and the storage medium would be “Memory + Replicated SSDs”), but I understand what you’re going for and why you categorized it where you did.
softwaredoug|1 year ago
bkitano19|1 year ago
snthpy|1 year ago
You mean like a fluent API like `data.transform().filter()...` , that sort of thing?
cmcollier|1 year ago
k2so|1 year ago
xarope|1 year ago
itunpredictable|1 year ago
swyx|1 year ago
5-|1 year ago
nsguy|1 year ago
nh2|1 year ago
It doesn't have to be that way.
At Hetzner I pay $200/TB/month for RAM. That's 18x cheaper.
Sometimes you can reach the goal faster with less complexity by removing the part with the 20x markup.
AYBABTME|1 year ago
formerly_proven|1 year ago
> $3600.00/TB/month (incumbents)
> $70.00/TB/month (turbopuffer)
That's still 3x cheaper than your number and it's a SaaS API, not just a piece of rented hardware.
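The multipliers in this exchange can be checked from the three per-TB monthly prices quoted above ($3600 for incumbents, $200 for rented RAM at Hetzner, $70 for turbopuffer). The dictionary keys and the `ratio` helper below are just labels for this sketch.

```python
# Prices in USD per TB per month, as quoted in the comments above.
PRICES = {
    "incumbents": 3600.0,
    "hetzner_ram": 200.0,
    "turbopuffer": 70.0,
}


def ratio(expensive, cheap):
    """How many times more `expensive` costs than `cheap`."""
    return PRICES[expensive] / PRICES[cheap]


if __name__ == "__main__":
    print(f"incumbents vs Hetzner RAM: {ratio('incumbents', 'hetzner_ram'):.1f}x")
    print(f"Hetzner RAM vs turbopuffer: {ratio('hetzner_ram', 'turbopuffer'):.2f}x")
    print(f"incumbents vs turbopuffer: {ratio('incumbents', 'turbopuffer'):.1f}x")
```

So "18x cheaper" is exact for incumbents versus rented RAM, while "3x cheaper" rounds up a ratio of roughly 2.9.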
TechDebtDevin|1 year ago
omneity|1 year ago
jbellis|1 year ago
bigbones|1 year ago
pushrax|1 year ago
fulmicoton|1 year ago
eknkc|1 year ago
jiggawatts|1 year ago
tionis|1 year ago
Simon Willison wrote about it: https://simonwillison.net/2022/Aug/10/sqlite-http/
cdchn|1 year ago
I think this is pretty much what AWS Athena is.
imiric|1 year ago
canadiantim|1 year ago
solatic|1 year ago
rohitnair|1 year ago
francoismassot|1 year ago
Repo: https://github.com/quickwit-oss/quickwit
zX41ZdbW|1 year ago
Logging, real-time analytics, and RAG are also suitable for ClickHouse.
Sirupsen|1 year ago
drodgers|1 year ago
cdchn|1 year ago
arnorhs|1 year ago
endisneigh|1 year ago
CyberDildonics|1 year ago
yawnxyz|1 year ago
vidar|1 year ago
yamumsahoe|1 year ago
hipadev23|1 year ago