Cool analysis. I wonder if you could show something like a LOESS curve fitted across all the articles' time series? Or, if they're all roughly linear descents, could you show the distribution of slopes - do some articles descend faster than others? Why?
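To make the slope idea concrete, here's a minimal sketch. It assumes each article's trajectory is available as (minutes since submission, rank) pairs; the two trajectories below are invented purely for illustration, not taken from the actual dataset.

```python
# Sketch of the slope-distribution idea: fit a least-squares line to each
# article's (minutes, rank) trajectory and compare slopes (ranks/minute).
# The trajectories below are made-up examples.

def slope(points):
    """Ordinary least-squares slope of rank vs. time."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_r = sum(r for _, r in points) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den

trajectories = {
    "fast_faller": [(0, 1), (30, 4), (60, 7), (90, 10)],  # drops 1 rank per 10 min
    "slow_faller": [(0, 1), (30, 2), (60, 3), (90, 4)],   # drops 1 rank per 30 min
}

slopes = {name: slope(pts) for name, pts in trajectories.items()}
for name, s in sorted(slopes.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f} ranks/minute")
```

With real data you'd histogram the `slopes` values; a unimodal pile suggests one descent regime, while a long tail would point at articles that fall unusually fast.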
And then, a bone to pick:
Need a beefy RDBMS for 15mm rows? Maybe if you want to store the whole denormalized table in memory, but if you're just indexing a small field (or even partial-indexing a larger field) you should have no problem. The table will just spill to disk and page in as necessary, and since the workload is mostly appends, writes stay cheap too. Plus, you could normalize the data: store the (large) article title once in an Articles table with an id (a hash of the title, say) and then store just the ranks in a Ranks table, for less overall storage than the NoSQL database (thus needing a less-beefy machine).
Nothing against modern Not-only-SQL solutions or document stores, but don't discount RDBMS. Schemas aren't so scary or unwieldy that you should never use them.
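A minimal sketch of the normalized layout described above, using SQLite for illustration; the table and column names (`articles`, `ranks`, `observed_at`) are assumptions for the example, not anything from the original post.

```python
# Normalized layout: the large title is stored once in `articles`;
# each rank observation in `ranks` is just three integers.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (
        id    INTEGER PRIMARY KEY,    -- or a hash of the title
        title TEXT NOT NULL UNIQUE
    );
    CREATE TABLE ranks (
        article_id  INTEGER NOT NULL REFERENCES articles(id),
        observed_at INTEGER NOT NULL, -- unix timestamp of the sample
        rank        INTEGER NOT NULL
    );
    CREATE INDEX idx_ranks_article ON ranks(article_id, observed_at);
""")

conn.execute("INSERT INTO articles (title) VALUES (?)",
             ("A very long article title that is stored only once",))
article_id = conn.execute("SELECT id FROM articles").fetchone()[0]

# Five rank samples a minute apart, each row only a few bytes wide.
conn.executemany(
    "INSERT INTO ranks VALUES (?, ?, ?)",
    [(article_id, 1336000000 + i * 60, 1 + i) for i in range(5)],
)

n_ranks = conn.execute("SELECT COUNT(*) FROM ranks").fetchone()[0]
n_articles = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(n_articles, n_ranks)
```

The index on `(article_id, observed_at)` is the "small field" being indexed; the title itself never appears in the append-heavy table.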
>Need a beefy RDBMS for 15mm rows? Maybe if you want to store the whole denormalized table in memory, but if you're just indexing a small field (or even partial-indexing a larger field) you should have no problem.
Good point. Honestly, I don't have much experience using row-based RDBMSs for analytics (my background is mostly in finance, where folks use expensive proprietary columnar databases, and in Hadoop). Any good resources on testing the limits of MySQL/PostgreSQL for analytics?
He works for Treasure Data. This post, while providing some information, is most likely a shill for their NoSQL platform.
If not, I genuinely hope the rest of the NoSQL crowd isn't so incredibly ignorant of what an RDBMS is capable of, nor possessed of such a strong aversion to what would be a very straightforward normalized schema.
[+] [-] gojomo|12 years ago|reply
http://ycombinator.com/newsnews.html#12may11
...was tweaked a bit: now starting at #6 rather than #4; now descending one position every 8 minutes rather than every 15 minutes.
(I'm surprised they don't stay for a full day; that would seem a reasonable way to reach anyone who visits at least once daily.)
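A quick back-of-the-envelope sketch of that schedule, assuming the drop is perfectly regular and that an item stops being visible after position #30 (the front-page size here is my assumption):

```python
# Position over time under the tweaked scheme: start at #6,
# drop one slot every 8 minutes. The 30-item page size is assumed.
START_POS = 6
MINUTES_PER_SLOT = 8
FRONT_PAGE_SIZE = 30

def position(minutes_elapsed):
    return START_POS + minutes_elapsed // MINUTES_PER_SLOT

# Positions #6 through #30 inclusive, 8 minutes each.
minutes_visible = (FRONT_PAGE_SIZE - START_POS + 1) * MINUTES_PER_SLOT
print(position(0), position(60))   # position at launch and one hour in
print(minutes_visible / 60)        # total hours on the front page
```

Under these assumptions an item is gone in under four hours, which is why staying a full day would be a much bigger change than the tweak described.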
[+] [-] NKCSS|12 years ago|reply
[+] [-] zopf|12 years ago|reply
Anyway, thanks for an informative post!
[+] [-] kiyoto|12 years ago|reply
[+] [-] meritt|12 years ago|reply
[+] [-] unknown|12 years ago|reply
[deleted]
[+] [-] jacquesm|12 years ago|reply
[+] [-] pearjuice|12 years ago|reply
[+] [-] brickmort|12 years ago|reply