memco | 1 year ago

Love this straightforward analysis of use cases:

> Using smallpond and 3FS depends largely on your data size and infrastructure:

> Under 10TB: smallpond is likely unnecessary unless you have very specific distributed computing needs. A single-node DuckDB instance or simpler storage solutions will be simpler and possibly more performant.

> 10TB to 1PB: smallpond begins to shine. You'd set up a cluster with several nodes, leveraging 3FS or another fast storage backend to achieve rapid parallel processing.

> Over 1PB (Petabyte-Scale): smallpond and 3FS were explicitly designed to handle massive datasets. At this scale, you'd need to deploy a larger cluster with substantial infrastructure investments.

Makes it very easy to determine if this would be useful for me and how much work I would expect to do to use it.
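For what it's worth, the quoted thresholds boil down to a simple lookup. A rough sketch in Python, assuming decimal units (1 TB = 10^12 bytes); the function name `recommend_setup` and the returned labels are made up for illustration and are not part of smallpond or 3FS:

```python
# Hypothetical helper: maps a dataset size to the tier described in the quote.
# Assumes decimal units; the labels are illustrative, not an official API.
TB = 10**12
PB = 10**15


def recommend_setup(dataset_bytes: int) -> str:
    """Return a rough recommendation based on the quoted size tiers."""
    if dataset_bytes < 10 * TB:
        # Under 10TB: a single node is likely simpler and possibly faster.
        return "single-node DuckDB"
    if dataset_bytes <= 1 * PB:
        # 10TB to 1PB: a cluster of several nodes with fast storage.
        return "smallpond cluster with 3FS or another fast storage backend"
    # Over 1PB: a larger cluster and substantial infrastructure.
    return "large smallpond + 3FS cluster"


if __name__ == "__main__":
    for size in (2 * TB, 50 * TB, 3 * PB):
        print(f"{size / TB:>8.0f} TB -> {recommend_setup(size)}")
```

The exact cut-offs are only as precise as the article's wording, so treat the boundaries as fuzzy.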

dartos|1 year ago

I very much felt like that entire portion of the article was AI-generated, actually.

IMO it's pretty obvious, surface-level information with some prose on each bullet.

xixixao|1 year ago

Saying something is “obvious” without specifying an audience is meaningless.

(because obviousness is subjective and depends on the knowledge, experience, and context of the audience)

genewitch|1 year ago

with some "no s, sherlock" on the ">1PB will require additional infra."

go on...

like people talking about 1 Gbit iSCSI, and no one thought to say that 120 MB/s, which is technically slower than ATA/133 which came out twenty years ago, might be the bottleneck. Obviously 10 Gbit will be "as fast as a local drive"!

Yes, exactly right! This means you need to buy additional hardware, like network cards[0], and possibly GBICs and fiber optics.
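The arithmetic behind that comparison, as a quick sketch. It assumes decimal units and ignores protocol overhead, which is why raw 1 Gbit/s comes out at 125 MB/s rather than the roughly 120 MB/s seen in practice; the helper name `gbit_to_mb_per_s` is hypothetical:

```python
# Compare raw network link throughput with the ATA/133 interface ceiling.
# Assumption: decimal units, no protocol overhead (real iSCSI will be lower).
ATA_133_MB_S = 133.0  # maximum transfer rate of the ATA/133 interface


def gbit_to_mb_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gbit/s to MB/s (8 bits per byte)."""
    return gbit_per_s * 1000 / 8


if __name__ == "__main__":
    for link in (1, 10):
        mb_s = gbit_to_mb_per_s(link)
        verdict = "slower" if mb_s < ATA_133_MB_S else "faster"
        print(f"{link} Gbit/s = {mb_s:.0f} MB/s, {verdict} than ATA/133")
```

Whether a 10 Gbit link actually matches a local drive depends on the drive, so the comparison with ATA/133 is only a lower bar.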

fs111|1 year ago

The authors are Chinese, so they may simply have used AI to make it sound right in English.

mritchie712|1 year ago

some of it was AI-generated, but I made sure everything was accurate. I'd normally rewrite everything, but I wrote this quickly before I had to leave the house. Didn't think it'd be on the front page!

jimmyl02|1 year ago

I wonder at which scale Spark fits into this picture and what the tradeoffs / benefits would be.

mritchie712|1 year ago

spark is certainly the incumbent for this sort of thing.

one benefit for me personally: you should be able to move from local dev to cloud more easily.

benrutter|1 year ago

Yeah I reeeaaally want to see benchmarks! Single-node DuckDB is absolutely insane (as in fast) performance-wise, especially compared to something like Spark. There's been a lot of speed-focused work in the project and I don't know of any faster data processing (I'm not counting traditional SQL, since a lot of the speed benefits there come from indexing etc. and essentially doing additional work ahead of time).

I guess it comes down to how well written the distributed workflows are; there's a lot to get wrong, but in theory it should be able to achieve very impressive numbers.

My reasoning behind this is Dask, which uses pandas under the hood and is capable of better benchmarks than Spark. I think this is partly down to some good optimisations, but also simply that pandas is faster than Spark's row-based model. DuckDB is, on some benchmarks, more than 10x faster than pandas, so you can see where this is going...