top | item 5116377


kyt | 13 years ago

He forgot 5) Write it in C/C++. 150M records is not that large, and using Hadoop, which is generally suited to I/O-bound problems, seems like overkill. A lot of these problems can be avoided by simply dropping down to a lower-level language. For example, I was able to write a C implementation of a matrix factorization algorithm (100M records) that ran on my laptop in ~5 minutes. The same algorithm took over 24 hours to run on a Mahout/Hadoop cluster (it also cost about $30 to run on AWS EMR).


absherwin | 13 years ago

C/C++ generally shouldn't be needed: the computationally intensive parts are already optimized. What makes a given system slow is either poorly optimized numerical code (less common) or a pile of needless, repetitive work, because the system doesn't make it easy to separate the different components of the modeling.

They also won't even be seriously considered at most large companies, because the typical person in that role doesn't have the skills, and a single person using a different solution makes it difficult to hand work off.