item 6000871

Ask HN: Fast, In-Memory, Distributed data analysis and machine learning?

5 points | henrythe9th | 12 years ago

We're looking to implement a new data pipeline architecture at work. The primary goal is speed (the data is small enough to fit entirely in memory, sharded across multiple machines if needed). The primary bottleneck is feature extraction, transformation, and iteration, which are both CPU- and read/write-intensive. Model building is not too slow, so there's no need to distribute training/testing as of yet.

I've heard good things about Spark/Shark and Storm. Does anyone have any experiences or recommendations? Maybe we don't even need a super-sophisticated system, and a Riak/Redis K-V store cluster would do?

Thanks in advance

13 comments


karterk|12 years ago

Hard to offer suggestions without knowing the rough size of the data - depending on how much money you're willing to cough up, even 1 TB is in "can fit in memory" territory.

Having said that, Spark is really great for running iterative algorithms and will definitely fit what you have described. I suggest staying away from building it on your own using Riak/Redis (at least until you have ruled out Spark), as you will run into lots of operational issues like handling failures, resource allocation, retries, etc.

henrythe9th|12 years ago

Thanks for your input. We're talking roughly 5GB of data. Data growth should be linear over the next 6 months. Money is not a big concern. Speed of iteration is key.

We frequently run different processing algorithms over the entire stored dataset (the stored data doesn't change) and update the calculated features each time. Not sure if this helps narrow things down. Thanks
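The workload described here (re-running a feature-extraction pass over a static, in-memory dataset) can be sketched in plain Python. Everything below - the record layout, `extract_features`, and `recompute_all` - is an invented illustration of the pattern, not the actual pipeline; in practice the dataset would be sharded across machines and each pass parallelized.

```python
# Hypothetical static dataset: ~5GB in the real system, tiny here.
RECORDS = [{"id": i, "value": float(i)} for i in range(1000)]

def extract_features(record):
    # Stand-in for a CPU-intensive feature-extraction algorithm.
    v = record["value"]
    return {"id": record["id"], "squared": v * v, "is_even": record["id"] % 2 == 0}

def recompute_all(records):
    # Each time the algorithm changes, re-run it over the whole
    # (unchanging) dataset and return a fresh feature table.
    return [extract_features(r) for r in records]

features = recompute_all(RECORDS)
```

The key property is that the source data is immutable, so each iteration is a pure function of the cached dataset - which is exactly the access pattern frameworks like Spark optimize for with in-memory caching.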

agibsonccc|12 years ago

I can vouch for Storm, if only for the fact that it's pretty easy to set up (especially compared to Hadoop). Being able to leverage ZooKeeper for coordination gives you some extra capabilities as well. With that being said, just watch how you build your bolts/spouts. There are lots of ways you can send data into the system, but in general, Storm's documentation has been superb to work with.

I built a mini library for myself that auto-constructs topologies from a set of named dependencies to handle bolt/spout wiring. Aside from that, the builder interface is really nice if your data pipeline doesn't change.

There's good support for testing with a local cluster as well.
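To make the spout/bolt model concrete, here is a framework-free Python sketch of the pattern, including the kind of named-dependency wiring the mini library above describes. All the class and function names are invented for illustration; real Storm topologies are built with `TopologyBuilder` on the JVM.

```python
class Spout:
    """Emits tuples into the topology (here: from an in-memory list)."""
    def __init__(self, items):
        self.items = items

    def emit(self):
        yield from self.items

class Bolt:
    """Transforms each tuple it receives."""
    def __init__(self, fn):
        self.fn = fn

    def process(self, item):
        return self.fn(item)

def run_topology(spout, bolts_by_name, dependencies):
    """Order bolts by their named dependencies (a toy stand-in for
    auto-constructing a topology) and push every spout tuple through
    the resulting chain."""
    order = []

    def visit(name):
        if name in order:
            return
        for dep in dependencies.get(name, []):
            visit(dep)
        order.append(name)

    for name in bolts_by_name:
        visit(name)

    results = []
    for item in spout.emit():
        for name in order:
            item = bolts_by_name[name].process(item)
        results.append(item)
    return results
```

For example, a "parse" bolt can be declared as a dependency of a "score" bolt, and the wiring falls out of the dependency map rather than being hand-ordered - the same convenience the mini library provides on top of Storm's builder.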

henrythe9th|12 years ago

Thanks for your suggestion. Do you have any specific reading you'd recommend on building bolts/spouts for sending data into the system?

Thanks

x0x0|12 years ago

you should check out http://0xdata.com/; it's built from the ground up on a custom distributed key-value store (DKV) to do in-memory ML. Reasons to check it out:

1 - it's open source https://github.com/0xdata/h2o

2 - it ingests data from HDFS, S3, and CSV

3 - I've built systems like what you're discussing twice; the ML algorithms are often easier to write than expected, while the data management (moving data, sending updates, etc.), which initially seems easier, turns out to be much harder. 0xdata handles this for you.

4 - under active development

5 - it runs cleanly on your dev box with 1 or many nodes for development; deploying is as simple as uploading a jar to a cluster and putting a single file on each node naming the peers in the cluster

5a - there are scripts to walk you through doing this

disclosure: I work on it as of very recently =P

nihar|12 years ago

Have you looked at Oracle Coherence? It's pretty lightweight and has clustering features as well.

henrythe9th|12 years ago

Thanks for the suggestion. It looks very interesting, but I couldn't find much information about it outside of Oracle's own site.

How's the community and use cases for Coherence?

Thanks