top | item 8875096


hnriot | 11 years ago

Actually this is not true. What you're not noticing is that the use of parallel() is akin to Spark: these streams are basically just map functions, and if you can put the closure onto multiple cores/machines you get much better performance without any additional programmer intelligence.
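A minimal sketch of the contrast being described: the same mapping closure runs on one thread sequentially, or gets split across cores when the stream is parallel (assumes Java 9+ for `List.of`; the squaring workload is just a stand-in):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ParallelSketch {
    public static void main(String[] args) {
        List<Integer> nums = List.of(1, 2, 3, 4, 5, 6, 7, 8);

        // Sequential: the map closure runs on the calling thread.
        List<Integer> seq = nums.stream()
                .map(n -> n * n)
                .collect(Collectors.toList());

        // Parallel: the same closure may be split across cores via
        // the common ForkJoinPool -- no change to the closure itself.
        List<Integer> par = nums.parallelStream()
                .map(n -> n * n)
                .collect(Collectors.toList());

        System.out.println(seq.equals(par)); // ordering is preserved by collect
    }
}
```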

If you think that API is complicated then I don't think programming is for you; this is a very ordinary and usual construct in programming.

jfager | 11 years ago

In the cases where your application can benefit from parallelizing simple operations over a large data set stored in a collection, `parallel()` is fine.
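The favorable case looks something like this: a CPU-bound operation over a large in-memory data set, where the work per element dwarfs the splitting overhead (the range size and the squaring workload here are illustrative, not a benchmark):

```java
import java.util.stream.IntStream;

public class CpuBoundSum {
    public static void main(String[] args) {
        // CPU-bound work over a large in-memory range: the case
        // where parallel() tends to pay off, since the source
        // splits cheaply and evenly across cores.
        long sum = IntStream.rangeClosed(1, 1_000_000)
                .parallel()
                .mapToLong(i -> (long) i * i)
                .sum();
        System.out.println(sum);
    }
}
```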

It's even fine in the case where you're pulling data from a file or other low-latency sequential data source, assuming that the cost of filling a spliterator buffer is less than your cost of processing.
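A sketch of that sequential-source case, using `Files.lines` (the temp file and the length filter are stand-ins for a real file and real per-line work; whether parallel() wins depends entirely on how expensive that work is relative to buffering lines into spliterator chunks):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

public class FileParallel {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, List.of("alpha", "beta", "gamma", "delta"));
        try (Stream<String> lines = Files.lines(tmp)) {
            // parallel() here only pays off if per-line processing
            // outweighs the cost of filling the spliterator buffers.
            long count = lines.parallel()
                    .filter(l -> l.length() > 4) // stand-in for real work
                    .count();
            System.out.println(count);
        } finally {
            Files.delete(tmp);
        }
    }
}
```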

But there's a list of gotchas, all more dangerous than the "magic make it faster" button that .parallel() implies:

- For the sequential data source case, if the cost of filling the spliterator buffers is higher than the cost of processing, you're just wasting a ton of overhead trying to use parallel.

- You have to be aware that by default all uses of parallel() run on the same threadpool, which makes it a potential timebomb if someone uses it in the context of, say, a webserver where multiple requests might all individually process streams. This also means blocking operations during stream processing are very dangerous.

- Mutating an external variable goes from being fine for a sequential stream to a race condition for a parallel one.

- You can't hand out Streams that you intend to be executed sequentially, b/c your callers can just call parallel() whenever they want.
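The external-mutation gotcha from the list above can be sketched concretely: adding to a plain `ArrayList` from a parallel stream is a race (it can silently drop elements or throw), while letting the stream accumulate through a collector is safe:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class RaceSketch {
    public static void main(String[] args) {
        // Hazard: mutating an external, non-thread-safe collection
        // from a parallel stream. Symptoms vary run to run: lost
        // elements or an ArrayIndexOutOfBoundsException.
        List<Integer> unsafe = new ArrayList<>();
        try {
            IntStream.range(0, 100_000).parallel().forEach(unsafe::add);
        } catch (Exception e) {
            // the race surfaced as an exception this run
        }

        // Fix: let the stream accumulate via a collector instead,
        // which is safe for both sequential and parallel streams.
        List<Integer> safe = IntStream.range(0, 100_000)
                .parallel()
                .boxed()
                .collect(Collectors.toList());
        System.out.println(safe.size()); // always 100000
    }
}
```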

And, yes, all of these considerations make the API more complicated than one operating over plain old iterators.
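For the shared-threadpool timebomb in particular, a commonly cited workaround is to run the terminal operation inside a dedicated ForkJoinPool so the common pool isn't tied up (note this relies on an undocumented implementation detail of how ForkJoinTask schedules work, not on any guaranteed Stream API behavior):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class DedicatedPool {
    public static void main(String[] args) throws Exception {
        // A private pool isolates this stream's work from every
        // other parallel() user in the JVM.
        ForkJoinPool pool = new ForkJoinPool(4);
        try {
            long sum = pool.submit(() ->
                    IntStream.rangeClosed(1, 1_000)
                            .parallel()
                            .asLongStream()
                            .sum()
            ).get();
            System.out.println(sum); // 500500
        } finally {
            pool.shutdown();
        }
    }
}
```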