kyt | 13 years ago
He forgot 5) Write it in C/C++. 150M records is not that large, and using Hadoop, which is generally suited to I/O-bound problems, seems like overkill. A lot of these problems can be avoided by simply dropping down to a lower-level language. For example, I was able to write a C implementation of a matrix factorization algorithm (100M records) that ran on my laptop in ~5 minutes. The same algorithm took over 24 hours to run on a Mahout/Hadoop cluster (it also cost about $30 to run on AWS EMR).
absherwin | 13 years ago
They also won't even be seriously considered at most large companies, because the typical person in that role doesn't have the skills, and a single person using a different solution makes it difficult to hand off the work.