temuze|3 years ago
Basically, it worked like this:
- All of our data lived in compressed SQLite DBs on S3.
- Upon receiving a query, Postgres would route it through a custom foreign data wrapper (FDW) we built.
- This FDW would forward the query to a web service.
- This web service would start one lambda per SQLite file. Each lambda would fetch the file, query it, and return the result to the web service.
- This web service would re-issue lambdas as needed and return the results to the FDW.
- Postgres (hosted on a memory-optimized EC2 instance) would aggregate.
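The fan-out step can be sketched as follows (all names here are invented; in the real system each call would be a Lambda invocation that fetches one SQLite file from S3, while an in-memory DB stands in for each file below):

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def query_shard(shard_rows, sql):
    # Stand-in for one lambda: load one "file" worth of rows and query it.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts INTEGER, value REAL)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", shard_rows)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

def fan_out(shards, sql, max_workers=32):
    # One "lambda" per shard, mirroring one lambda per SQLite file.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda s: query_shard(s, sql), shards))
    # The web service hands the partial results back to the FDW;
    # Postgres then performs the final aggregation.
    return [row for part in partials for row in part]

shards = [[(1, 10.0), (2, 20.0)], [(3, 30.0)]]
partials = fan_out(shards, "SELECT SUM(value) FROM events")
total = sum(v for (v,) in partials)  # the Postgres-side aggregation step
```

The key property is that each shard query is independent, so the fan-out is embarrassingly parallel and the only serial work is the final aggregation.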
It was straight magic. Separated compute + storage with basically zero cost and better performance than Redshift and Vertica. All of our data was time-series data, so it was extraordinarily easy to partition.
It was also considerably cheaper than Athena. On Athena, our queries would cost us ~$5/TB (a price that hasn't changed today!), so most queries were easily >$100, and we were running thousands of queries per hour.
I still think, to this day, that the inevitable open-source solution for DWs might look like this: insert your data as SQLite or DuckDB into a bucket, pop in a Postgres extension, create a FDW, and `terraform apply` the lambdas + API gateway. It'll be harder for non-time-series data, but you could probably build something that partitions on other keys.
ignoramous|3 years ago
- instead of S3, we now use R2.
- instead of Postgres+Sqlite3, we use DuckDB+CSV/Parquet.
- instead of Lambda, we use AWS AppRunner (considering moving it to Fly.io or Workers).
It has worked gloriously for a variety of analytical workloads, even if it's slower than it would have been on Clickhouse/Timescale/Redshift/Elasticsearch.
petethepig|3 years ago
* Assuming you can parse faster than you can read from S3 (true for most workloads?), read throughput is your bottleneck.
* Set a target query time, e.g. 1 s. At ~90 MB/s of S3 read throughput per reader, that means each object on S3 has to be 90 MB or smaller for queries to finish in 1 s.
* Partition your data so that each object on S3 stays under 90 MB.
* Forgot to mention: you can also do parallel reads from S3. Depending on your data format / parsing speed, that might be worth looking into as well.
This is somewhat of a simplified guide (e.g. for some workloads merging data takes time, and we're not including that here), but it should be good enough to start with.
[0] - https://bryson3gps.wordpress.com/2021/04/01/a-quick-look-at-...
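The sizing rule above reduces to a one-liner. The ~90 MB/s per-reader throughput is an assumed figure, back-derived from the 1 s / 90 MB numbers in the comment:

```python
# Back-of-envelope partition sizing (assumed numbers, per the guide above).
S3_READ_MBPS = 90       # assumed per-reader S3 read throughput
TARGET_SECONDS = 1.0    # target query time

def max_object_mb(parallel_reads=1, read_mbps=S3_READ_MBPS,
                  target_s=TARGET_SECONDS):
    # With N parallel range reads per object, the size cap scales by N.
    return parallel_reads * read_mbps * target_s
```

So a single reader caps objects at `max_object_mb()` = 90 MB, and four parallel range reads would raise the cap to 360 MB.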
simonw|3 years ago
I've been thinking about building systems that store SQLite in S3 and pull them to a lambda for querying, but I'm nervous about how feasible it is based on database file size and how long it would take to perform the fetch.
I honestly hadn't thought about compressing them, but that would obviously be a big win.
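The lambda-side step being described might look like this (a minimal sketch using stdlib gzip; `fetch_bytes` is a stand-in for an S3 GetObject call):

```python
import gzip
import os
import sqlite3
import tempfile

def query_compressed_db(fetch_bytes, sql):
    # Fetch the gzip-compressed SQLite file and decompress it.
    raw = gzip.decompress(fetch_bytes())
    # SQLite wants a real file, so spill the decompressed DB to local disk
    # (in a lambda, that would be /tmp).
    fd, path = tempfile.mkstemp(suffix=".db")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        conn = sqlite3.connect(path)
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.close()
    finally:
        os.remove(path)
```

Since SQLite files compress well, the win is that the bytes transferred from S3 shrink while the decompress step stays CPU-cheap relative to the network fetch.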
dfinninger|3 years ago
- SQLite -> Parquet (for columnar instead of row storage)
- Lambda -> Worker Tasks
- FDW -> Connector
- Postgres Aggregation -> Worker Stage
We run it in Kubernetes (EKS) with auto-scaling, so that works sort of like lambda.
temuze|3 years ago
We had to balance between making the files too big (which would be slow) and making them too small (too many lambdas to start).
I _think_ they were around ~10 GB each, but I might be off by an order of magnitude.
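The tradeoff can be put in rough numbers (all figures below are assumed for illustration, not from the original setup):

```python
# Shard-size tradeoff, back of the envelope: bigger files mean fewer
# lambdas but longer per-lambda scans; smaller files mean more lambdas
# and proportionally more startup overhead.
def lambdas_needed(dataset_gb, shard_gb):
    return -(-dataset_gb // shard_gb)  # ceiling division

def per_lambda_seconds(shard_gb, scan_gbps=0.5, startup_s=0.5):
    # Assumed: 0.5 GB/s scan rate and 0.5 s startup per lambda.
    return startup_s + shard_gb / scan_gbps
```

Under these assumptions, 1 TB in 10 GB files needs 100 lambdas at ~20.5 s each, while 1 GB files need 1000 lambdas at ~2.5 s each, so startup overhead grows from ~2% to ~20% of each lambda's runtime.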