A parallel recommendation engine in Julia

[+] pathsjs|9 years ago|reply

The experiments were conducted by invoking spark with flags --master local[1]

This means running on a single core. That is not really a fair comparison with the multithreaded or multiprocess Julia version.
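For reference, Spark's documented local master URLs include `local` (one worker thread), `local[K]` (K worker threads), and `local[*]` (as many worker threads as logical cores). Here is a minimal sketch of how those strings map to thread counts. The helper name `local_threads` is hypothetical and not part of Spark, and it only handles these three forms:

```python
import os
import re

def local_threads(master: str) -> int:
    """Return the worker-thread count implied by a Spark local master URL.

    Hypothetical helper for illustration only; this is not a Spark API.
    """
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a recognized local master URL: {master!r}")
    value = match.group(1)
    if value == "*":
        # Spark uses the number of logical cores on the machine here.
        return os.cpu_count() or 1
    return int(value)
```

So `local_threads("local[1]")` is 1, which is the single-core setup the quoted flags describe.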

[+] ViralBShah|9 years ago|reply

I am pretty sure that is a typo in the blog post, since Spark's performance improves as more cores are used, as does Julia's, and both show similar scaling characteristics.

The performance plot would probably be more readable if the y-axis were on a log scale.

The typo is also now fixed.

[+] shipman05|9 years ago|reply

Why test distributed computing technologies on a single core? Am I missing something?

[+] optimali|9 years ago|reply

The experiments were conducted on a 30 core Intel Xeon machine with 132 GB memory and 2 hyperthreads per core

[+] minimaxir|9 years ago|reply

The Spark comparison, given the April 2016 posting of the article, was likely done with Spark 1.6. Spark 2.0, released in July, added significant performance improvements (https://docs.cloud.databricks.com/docs/latest/sample_applica...), so the performance gap may well be different nowadays.

[+] ViralBShah|9 years ago|reply

Quite possible, and it would be interesting to see how this stacks up today. I was just glad to see that Julia's parallel computing could, out of the box, give results comparable to Spark, with the ALS algorithm written entirely in Julia and without crazy optimized code.
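For readers unfamiliar with ALS (alternating least squares): it factorizes the ratings matrix by fixing the item factors and solving a regularized least-squares problem for each user, then swapping roles and repeating. Below is a toy rank-1 sketch in pure Python. It is not the RecSys.jl or Spark code, and the names `als_rank1`, `ratings`, and `lam` are made up for illustration:

```python
def als_rank1(ratings, n_users, n_items, lam=0.1, iters=20):
    """Toy rank-1 ALS: approximate r[u, i] by user[u] * item[i].

    `ratings` maps (user, item) -> observed rating; unobserved pairs are ignored.
    `lam` is the ridge (L2) regularization strength.
    """
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(iters):
        # Fix the item factors; each user's factor is a 1-D ridge regression
        # with a closed-form solution.
        for u in range(n_users):
            num = sum(r * item[i] for (uu, i), r in ratings.items() if uu == u)
            den = sum(item[i] ** 2 for (uu, i), _ in ratings.items() if uu == u)
            user[u] = num / (den + lam)
        # Fix the user factors and solve for each item the same way.
        for i in range(n_items):
            num = sum(r * user[u] for (u, ii), r in ratings.items() if ii == i)
            den = sum(user[u] ** 2 for (u, ii), _ in ratings.items() if ii == i)
            item[i] = num / (den + lam)
    return user, item
```

The per-user and per-item solves inside each half-step are independent of one another, which is what makes the algorithm a natural fit for parallel execution.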

[+] jlrubin|9 years ago|reply

Since Viral seems to be responding here...

What's going on with multithreading support? A while back I tried to build a pure-Julia, MapReduce-like engine with a distributed file system, but it was hard to get off the ground because of poor multithreading support.

For the uninitiated, Julia has two types of concurrency built in: Tasks, which are coroutines on the same thread, and Clusters, which are "separate machines".
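As a loose analogy in Python rather than Julia (an assumption for illustration; the Julia APIs differ), coroutines interleave on one OS thread, so they give concurrency without parallelism. The sketch below records the thread id that each coroutine runs on. The names `worker`, `main`, and `run_demo` are hypothetical:

```python
import asyncio
import threading

async def worker(name, seen):
    # Each coroutine records which OS thread it is running on.
    seen[name] = threading.get_ident()
    # Yield control so the other coroutines get a turn.
    await asyncio.sleep(0)

async def main():
    seen = {}
    await asyncio.gather(*(worker(f"task{i}", seen) for i in range(3)))
    return seen

def run_demo():
    """Run three coroutines and return the thread id each one observed."""
    return asyncio.run(main())
```

All three entries returned by `run_demo()` hold the same thread id, which is why CPU-bound work needs separate processes or real threads to use more than one core.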

[+] ViralBShah|9 years ago|reply

The multi-threading in Julia is really new and limited. The plan is first to get the whole codebase to be thread-safe and provide some simple parallelism models and then figure out what a good composable multi-threading model could be.

For now, since the GC effectively runs only in one thread, you get good speedup with multi-threading if you avoid allocation and thus GC in the parallel code sections. In some cases this is possible, but in many cases it is unnatural. Of course, all this is under heavy development.

To build a Julia MapReduce engine on a distributed filesystem, Julia's multi-processing should be pretty good, though. For the simple problems we attempted with packages like Elly.jl, that has been our experience.

[+] yarapavan|9 years ago|reply

Impressive results! Congrats Julia team.

[+] mikestaszel|9 years ago|reply

Does anyone have a link to the code that was used for the comparison?

[+] yarapavan|9 years ago|reply

https://github.com/abhijithch/RecSys.jl should have it

[+] StreamBright|9 years ago|reply

This was written on 22 Apr 2016; it is still a pretty good read.

[+] coldtea|9 years ago|reply

Still? Because being merely 6 months old is supposed to date a programming article?

[+] kynights|9 years ago|reply

[deleted]
