memco|1 year ago
> Using smallpond and 3FS depends largely on your data size and infrastructure:
> Under 10TB: smallpond is likely unnecessary unless you have very specific distributed computing needs. A single-node DuckDB instance or simpler storage solutions will be simpler and possibly more performant.
> 10TB to 1PB: smallpond begins to shine. You'd set up a cluster with several nodes, leveraging 3FS or another fast storage backend to achieve rapid parallel processing.
> Over 1PB (Petabyte-Scale): smallpond and 3FS were explicitly designed to handle massive datasets. At this scale, you'd need to deploy a larger cluster with substantial infrastructure investments.
Makes it very easy to determine if this would be useful for me and how much work I would expect to do to use it.
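The quoted tiers can be written down as a small lookup. This is only a sketch of the rule of thumb above: `suggest_setup` is a hypothetical helper (not part of smallpond or 3FS), and it assumes 1 PB = 1000 TB.

```python
def suggest_setup(size_tb: float) -> str:
    """Map a dataset size in TB to the tier from the quoted guidance.

    Hypothetical helper for illustration; assumes 1 PB = 1000 TB.
    """
    if size_tb < 0:
        raise ValueError("size_tb must be non-negative")
    if size_tb < 10:
        # Under 10TB: smallpond is likely unnecessary.
        return "single-node DuckDB"
    if size_tb <= 1000:
        # 10TB to 1PB: a cluster with several nodes and fast storage.
        return "smallpond cluster"
    # Over 1PB: a larger cluster with substantial infrastructure.
    return "large smallpond + 3FS cluster"


print(suggest_setup(2))     # single-node DuckDB
print(suggest_setup(250))   # smallpond cluster
print(suggest_setup(5000))  # large smallpond + 3FS cluster
```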
dartos|1 year ago
IMO it's pretty obvious, surface-level information, with some prose on each bullet.
xixixao|1 year ago
(because obviousness is subjective and depends on the knowledge, experience, and context of the audience)
genewitch|1 year ago
go on...
like people talking about 1 Gbit iSCSI, and no one thought to say that 120 MB/s, which is technically slower than ATA/133 (which came out twenty years ago), might be the bottleneck. Obviously 10 Gbit will be "as fast as a local drive"!
Yes, exactly right! This means you need to buy additional hardware, such as network cards[0], and possibly GBICs and fiber optics.
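For anyone who wants to check the arithmetic behind the 1 Gbit complaint, here is a rough sketch. It uses raw line rate only (decimal units, 8 bits per byte) and ignores protocol overhead, which is why it gives 125 MB/s rather than the roughly 120 MB/s seen in practice. `link_mb_per_s` is a made-up helper name.

```python
def link_mb_per_s(gbit_per_s: float) -> float:
    """Raw line rate in MB/s for a link speed in Gbit/s (no protocol overhead)."""
    return gbit_per_s * 1000 / 8


ATA133_MB_PER_S = 133  # nominal ATA/133 interface rate

one_gbit = link_mb_per_s(1)
ten_gbit = link_mb_per_s(10)

print(f"1 Gbit/s  = {one_gbit} MB/s, slower than ATA/133: {one_gbit < ATA133_MB_PER_S}")
print(f"10 Gbit/s = {ten_gbit} MB/s, faster than ATA/133: {ten_gbit > ATA133_MB_PER_S}")
```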
fs111|1 year ago
mritchie712|1 year ago
jimmyl02|1 year ago
mritchie712|1 year ago
one benefit for me personally: you should be able to move from local dev to cloud more easily.
benrutter|1 year ago
I guess it comes down to how well written the distributed workflows are. There's a lot to get wrong, but in theory it should be able to achieve very impressive numbers.
My reasoning is based on Dask, which uses pandas under the hood and is capable of better benchmark results than Spark. I think this is partly down to some good optimisations, but also simply that pandas is faster than Spark's row-based model. DuckDB is more than 10x faster than pandas on some benchmarks, so you can see where this is going...
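To make the row-based versus column-based distinction concrete, here is a toy sketch in plain Python. It is not how Spark, pandas, or DuckDB are implemented. It only shows that an aggregate over one field has to walk every record in a row layout, while a column layout can read a single contiguous list. The data and function names are invented for the example.

```python
# Row-oriented: one record (dict) per row.
rows = [{"id": i, "price": float(i), "qty": i % 5} for i in range(1000)]

# Column-oriented: one list per field, holding the same data.
columns = {
    "id": [r["id"] for r in rows],
    "price": [r["price"] for r in rows],
    "qty": [r["qty"] for r in rows],
}


def total_price_rows(records):
    """Sum 'price' by visiting every record and picking out one field."""
    return sum(r["price"] for r in records)


def total_price_columns(cols):
    """Sum 'price' by reading only the one column that is needed."""
    return sum(cols["price"])


# Both layouts hold the same data, so the totals agree.
print(total_price_rows(rows) == total_price_columns(columns))
```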