jrbancel | 10 years ago
In my experience, CPU is rarely the big issue when dealing with a lot of data (I am talking about tens of PB per day). IO is the main problem and designing systems that move the least amount of data is the real challenge.
gaius | 10 years ago
tobz | 10 years ago
Yep, this is a great point. Data locality and reducing IO are huge, but in practice, when data isn't segmented/partitioned properly, it ends up chewing through CPU and memory. That's a big part of why the post focused on CPU usage: concurrency in Vertica can be a little tricky, and stabilizing compute across the cluster has paid more dividends than any storage or network subsystem tweak we've made.
We're not at the PB/day mark, though, so there are definitely classes of problems we're blissfully ignorant of. :)
hvidgaard | 10 years ago
You analyse algorithms in terms of IO accesses, and specifically the access pattern. If you cannot structure the algorithm as a sequential scan, you're in for a bad time.
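A minimal sketch of the point above (illustrative, not from the thread; record layout and file handling are made up for the example): both functions compute the same aggregate over a file of int64 records, but one does a single sequential scan with O(1) state while the other issues one seek per record, which is the pathological access pattern on real storage.

```python
# Sketch: sequential scan vs. per-record random access over the same data.
# The record format (one little-endian int64) and file layout are assumptions
# for illustration only.
import os
import random
import struct
import tempfile

RECORD = struct.Struct("<q")  # one little-endian int64 per record

def write_records(path, values):
    """Write each value as a fixed-size binary record."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))

def scan_sum(path):
    """Sequential scan: one pass over the file, constant memory, IO-friendly."""
    total = 0
    with open(path, "rb") as f:
        # Read in large chunks that are a multiple of the record size.
        while chunk := f.read(RECORD.size * 4096):
            for (v,) in RECORD.iter_unpack(chunk):
                total += v
    return total

def random_sum(path, n):
    """Same answer, but one seek per record: the access pattern to avoid."""
    total = 0
    order = list(range(n))
    random.shuffle(order)
    with open(path, "rb") as f:
        for i in order:
            f.seek(i * RECORD.size)
            (v,) = RECORD.unpack(f.read(RECORD.size))
            total += v
    return total

if __name__ == "__main__":
    values = list(range(10_000))
    tmp = tempfile.NamedTemporaryFile(delete=False)
    path = tmp.name
    tmp.close()
    write_records(path, values)
    assert scan_sum(path) == random_sum(path, len(values)) == sum(values)
    os.remove(path)
```

On a dataset that fits in page cache the two run similarly; on data much larger than memory, the seek-per-record version degrades to the disk's random IOPS while the scan runs at sequential bandwidth, which is the gap the comment is pointing at.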
bdarfler | 10 years ago
https://www.usenix.org/conference/nsdi15/technical-sessions/...
msellout | 10 years ago