
jrbancel | 10 years ago

Is 100 billion (on the order of a few TB) Big Data?

In my experience, CPU is rarely the big issue when dealing with a lot of data (I am talking about tens of PB per day). IO is the main problem and designing systems that move the least amount of data is the real challenge.

gaius | 10 years ago

No, it is not, not even close. At a job nearly 10 years ago, we had 50 TB in plain old-fashioned Oracle, and we knew people with 200 TB in theirs (you would be surprised who, if I told you). A few TB these days you could crunch quite easily on a high-end desktop or a single midrange server, using entirely conventional techniques.

tobz | 10 years ago

(Toby from Localytics here)

Yep, this is a great point. Data locality/reducing IO is huge, but the way things actually play out for us is that when data isn't segmented/partitioned properly, it chews up CPU and memory. That's a lot of why the post was geared around CPU usage: concurrency in Vertica can be a little tricky, and stabilizing compute across the cluster has paid more dividends than any storage or network subsystem tweaks we've made.

We're not at the PB/day mark, though, so there's definitely classes of problems we are blissfully ignorant on. :)

hvidgaard | 10 years ago

The major takeaway I had from my courses in data-intensive applications was that IO is all that matters. It is the limiting factor to such an extent that you don't really care about algorithmic efficiency with regard to CPU or memory.

You analyse algorithms in terms of IO accesses, and specifically the access pattern. If you cannot structure the algorithm as a sequential scan, you're in for a bad time.
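The point about access patterns can be illustrated with a toy sketch (mine, not from the thread): the same summation done as one sequential sweep versus in a shuffled order. The scan benefits from prefetching and the page cache; the shuffled order turns most steps into cache-miss candidates, which is exactly what IO-model analysis penalizes.

```python
# Toy sketch: sequential scan vs shuffled access over the same data.
# Both compute the same sum; only the access pattern differs.
import random
import time

N = 2_000_000
data = list(range(N))

def scan_sum(xs):
    # Sequential pass: one predictable, cache-friendly sweep.
    total = 0
    for x in xs:
        total += x
    return total

def shuffled_sum(xs, order):
    # Identical work, but visited in a random order: each lookup
    # can land anywhere, defeating prefetching and locality.
    total = 0
    for i in order:
        total += xs[i]
    return total

order = list(range(N))
random.shuffle(order)

t0 = time.perf_counter()
s1 = scan_sum(data)
t1 = time.perf_counter()
s2 = shuffled_sum(data, order)
t2 = time.perf_counter()

assert s1 == s2  # same answer, different cost
print(f"scan:     {t1 - t0:.3f}s")
print(f"shuffled: {t2 - t1:.3f}s")
```

In-memory the gap is modest in pure Python; against disk or network, where a "cache miss" is a seek or an RPC, the same pattern difference is orders of magnitude.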

bdarfler | 10 years ago

There is always a balance here between CPU and IO. For a long time, databases and big data platforms were pretty terrible with IO. However, as the computer engineering community has had time to work on these problems, we have gotten considerably better at storing data in sorted, compressed columnar formats and at exploiting data locality via segmentation and partitioning. As a result, most well-constructed big data products are CPU bound at this point. For instance, check out the NSDI '15 paper on Spark performance that found it was CPU bound. Vertica is also generally CPU bound.
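A minimal sketch (assumed field names and encoding, not from any real system) of why columnar formats cut IO: summing one field of a row-oriented file touches every byte of every record, while a column-oriented layout reads only the bytes of the one column the query needs.

```python
# Sketch: row-oriented vs column-oriented layout for an analytic query.
# Hypothetical 3-field records: (id: int64, value: float64, tag: int64).
import struct

rows = [(i, i * 2.0, i % 7) for i in range(1000)]

# Row-oriented encoding: whole records stored contiguously (24 bytes each).
row_blob = b"".join(struct.pack("<qdq", *r) for r in rows)

# Column-oriented encoding: just the "value" column, stored contiguously.
col_blob = b"".join(struct.pack("<d", r[1]) for r in rows)

# Summing "value" from the row blob means stepping over all three fields
# of every record to pick out the 8 bytes we care about.
row_sum = sum(struct.unpack_from("<d", row_blob, i * 24 + 8)[0]
              for i in range(len(rows)))

# From the column blob we scan exactly the bytes the query needs.
col_sum = sum(struct.unpack("<%dd" % len(rows), col_blob))

assert row_sum == col_sum
print(f"row layout reads {len(row_blob)} bytes; "
      f"column layout reads {len(col_blob)} bytes")
```

Here the column read is a third the size before compression; sorted columns also compress far better (run-length, delta encoding), which is where much of the real-world IO win comes from.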

https://www.usenix.org/conference/nsdi15/technical-sessions/...

msellout | 10 years ago

Specifically network traffic, rather than disk read/write.