Ask HN: Fast, In-Memory, Distributed data analysis and machine learning?
5 points | henrythe9th | 12 years ago
I've heard good things about Spark/Shark and Storm. Does anyone have any experiences or recommendations? Maybe we don't even need a super sophisticated system and a Riak/Redis K-V store cluster would do?
Thanks in advance
karterk|12 years ago
Spark is really great for running iterative algorithms and will definitely fit what you have described. I suggest staying away from building it on your own with Riak/Redis (at least until you have ruled out Spark), as you will run into lots of operational issues: handling failures, resource allocation, retries, etc.
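Spark's fit for iterative algorithms comes from keeping the working set cached in memory between passes (`rdd.cache()`), so each iteration rescans fast in-memory data instead of re-reading from disk. A minimal single-machine sketch of that access pattern, in plain Python with no Spark dependency (all names here are illustrative, not Spark's API):

```python
# Toy sketch of the iterative-algorithm pattern Spark targets: load the
# dataset once, keep it in memory, and make many full passes over it.
# (In Spark the "cache once" step would be rdd.cache(); this is plain Python.)

def load_dataset():
    # Stand-in for an expensive one-time read from HDFS/S3.
    return [(x, 2.0 * x + 1.0) for x in range(100)]

def fit_line(iterations=200, lr=0.0001):
    data = load_dataset()        # "cached" once, reused every iteration
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(iterations):  # each pass scans the in-memory dataset
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit_line()  # w converges toward the true slope, 2.0
```

Doing the same loop over Riak/Redis would mean a network round trip per record per pass, which is exactly the operational cost Spark's in-memory model avoids.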
henrythe9th|12 years ago
We frequently run different processing algorithms over the entire stored dataset (the stored data doesn't change) and update the calculated features each time. Not sure if this helps narrow things down. Thanks
agibsonccc|12 years ago
I built a mini library for myself to auto construct the topologies based on a set of named dependencies to handle bolt/spout wiring. Aside from that, the builder interface for it is really nice if your data pipeline doesn't change.
There's good support for testing with a local cluster as well.
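The "named dependencies" idea above can be sketched very simply: each component declares which named components feed it, and the builder derives a valid wiring order. A toy illustration in Python, not the library described here and not Storm's actual `TopologyBuilder` API (the spout/bolt names are made up):

```python
# Toy sketch of auto-constructing a topology from named dependencies:
# each node maps to the names of its upstream nodes, and a topological
# sort yields a wiring order in which every upstream comes first.
from graphlib import TopologicalSorter

def build_topology(nodes):
    """nodes: {name: [names of upstream dependencies]} -> wiring order."""
    ts = TopologicalSorter(nodes)
    return list(ts.static_order())  # upstreams precede downstreams

# A spout feeding two bolts that join into a third (names are illustrative).
topology = {
    "kafka_spout": [],
    "parse_bolt": ["kafka_spout"],
    "count_bolt": ["parse_bolt"],
    "store_bolt": ["parse_bolt", "count_bolt"],
}
order = build_topology(topology)
```

In real Storm code you would walk `order` and issue the corresponding `setSpout`/`setBolt` calls, which is the wiring boilerplate such a mini library removes.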
henrythe9th|12 years ago
Thanks
x0x0|12 years ago
1 - it's open source: https://github.com/0xdata/h2o
2 - it ingests data from HDFS, S3, and CSV
3 - I've built systems like what you're describing twice; the ML algorithms are often easier to write than expected, while the data management (moving data, sending updates, etc.), which initially seems easier, is much harder. 0xdata handles this for you.
4 - it's under active development
5 - it runs cleanly on your dev box with one or many nodes for development; deploying is as simple as uploading a jar to a cluster and putting a single file on each node naming the peers in the cluster
5a - there are scripts to walk you through doing this
disclosure: I work on it as of very recently =P
nihar|12 years ago
henrythe9th|12 years ago
How's the community and use cases for Coherence?
Thanks