runT1ME | 4 years ago

I really think this is where pure FP shines.

If you look at the architecture of something like Apache Beam, while you describe your computations in a language like Java or Python, you're really using a DSL for creating an immutable DAG that processes chunks of immutable data, persisting each chunk (being loose with terms here) for durability, 'modifying' it in the immutable sense and then passing it on to the next edge and so on.

In a single standalone system, many argue that a purely immutable conceptual approach has pros and cons. In the Big Data world, I've never heard anyone argue for an imperative approach. Immutability is necessary, and as soon as you need immutability you want all the other goodies that facilitate it in a clean way like monads, lenses, and the like.

chrisjc|4 years ago

Intriguing, but I think I got a little lost in the details, and I'm really dying to understand your perspective.

    1) creating an immutable DAG that processes chunks of immutable data, 
    2) persisting each chunk (being loose with terms here) for durability, 
    3) 'modifying' it in the immutable sense 
    4) and then passing it on to the next edge and so on
Isn't #4 the persisting part, not #2? Maybe I'm confused by what you mean by "persist"? Render the instructions (mutate the data)?

And when you say "'modifying' it in the immutable sense", does this mean that the original data stays untouched, but "instructions" on how to modify the data get "pinned" to the data?

Then every other subsequent step in the DAG is just modifying the "instructions"?

If I understand you correctly, your point about how this is where FP shines is really illustrated in #4. You end up passing instructions around instead of the data. But how FP instructions can be modified and modified again really baffles me. The closest I can get to understanding it in practical terms is by using something like Clojure and thinking about how Clojure code is just data, and you can modify or build upon Clojure code the same way you modify data. I struggle to extend this to another language like Scala or JavaScript.

runT1ME|4 years ago

Let's say your immutable DAG has four different 'transformations' (think either a map or a fold on the data). Beam or Spark will take a chunk of your data, partition it somehow (so you get parallelism), and do a transform on the data.

Now if you're doing a bunch of computations across tens or hundreds of machines, you don't want to fail the whole thing because one hypervisor crashes or there's a JVM segfault or whatnot. So each step is usually saved to disk via a checkpoint and then moved along to the next transformation in batches, so if the subsequent transformation is where it dies, you have a safe spot to restart that particular computation from.

In addition, your (chunk of) data might need to be shipped to another machine to be transformed, so you're not so much 'updating' data as creating new data that captures the changes in a space-efficient way and shuffling it along the various steps.
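To make the checkpoint-and-restart idea concrete, here is a minimal single-machine sketch in Python. It is not how Beam or Spark implement it; `run_pipeline`, the `step_N.json` file layout, and the three toy transforms are all made up for illustration. Each transform is a pure function from one immutable tuple to a new one, and each step's output is written to disk so that a rerun can pick up after the last completed step.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Sequence, Tuple

Chunk = Tuple[int, ...]                 # an immutable chunk of data
Transform = Callable[[Chunk], Chunk]    # a pure function: chunk in, new chunk out


def run_pipeline(chunk: Chunk, transforms: Sequence[Transform],
                 checkpoint_dir: Path) -> Chunk:
    """Apply each transform in order, saving every step's output to disk.

    Hypothetical layout: step N's output lives in checkpoint_dir/step_N.json.
    """
    current = chunk
    for step, transform in enumerate(transforms):
        checkpoint = checkpoint_dir / f"step_{step}.json"
        if checkpoint.exists():
            # A previous run already finished this step: reuse its output.
            current = tuple(json.loads(checkpoint.read_text()))
            continue
        current = transform(current)    # builds a new tuple; the input is untouched
        checkpoint.write_text(json.dumps(current))
    return current


def double(chunk: Chunk) -> Chunk:
    """A 'map' style transform."""
    return tuple(x * 2 for x in chunk)


def keep_large(chunk: Chunk) -> Chunk:
    """A 'filter' style transform."""
    return tuple(x for x in chunk if x > 4)


def total(chunk: Chunk) -> Chunk:
    """A 'fold' style transform that reduces the chunk to one value."""
    return (sum(chunk),)
```

If a later transform raises, calling `run_pipeline` again with the same directory skips the steps whose checkpoint files already exist. A real system would also have to handle partitioning, partial writes, and shipping chunks between machines, none of which this sketch attempts.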

Scarbutt|4 years ago

Besides performance, what other cons exist for immutable data structures in a single standalone system?

runT1ME|4 years ago

It's awkward to work with deeply nested structures. Think about having a map of customer objects which have a list of addresses and you need to update everyone's first address to be capitalized for some reason. You'd really want a fully fleshed out lens library and maybe even macros for auto generating them on your concrete data structures to make it easy to 'update' (via a persistent structure) without having to deal with all the cruft of doing it by hand.
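For a sense of the by-hand cruft, here is a rough Python sketch of that exact update using frozen dataclasses. The `Customer` and `Address` types and both function names are invented for the example, and reading "capitalized" as upper-casing the street is an assumption. A lens library would let you name the path (each customer, first address, street) once instead of rebuilding every layer manually.

```python
from dataclasses import dataclass, replace
from types import MappingProxyType
from typing import Mapping, Tuple


@dataclass(frozen=True)
class Address:
    street: str
    city: str


@dataclass(frozen=True)
class Customer:
    name: str
    addresses: Tuple[Address, ...]


def capitalize_first_address(customer: Customer) -> Customer:
    """Return a new Customer whose first address has an upper-cased street."""
    if not customer.addresses:
        return customer
    first, *rest = customer.addresses
    new_first = replace(first, street=first.street.upper())
    # Only the changed path is rebuilt; the remaining addresses are reused as-is.
    return replace(customer, addresses=(new_first, *rest))


def capitalize_all(customers: Mapping[str, Customer]) -> Mapping[str, Customer]:
    """Return a new read-only mapping; the input mapping is left untouched."""
    return MappingProxyType(
        {key: capitalize_first_address(c) for key, c in customers.items()}
    )
```

Three levels of nesting already means three layers of copy-and-replace, and every extra level adds another. That is the part lenses (or generated accessors) are meant to hide.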

melony|4 years ago

Memory usage. The fancier data structures (most of which were pioneered by Okasaki) in newer FP frameworks and languages are often very memory hungry.

Guvante|4 years ago

Sharing immutable data is hard. With mutable data you throw a lock on it and mutate in place. How do you share mutations when you can't change anything?

To be clear it is certainly a solvable problem (as this post shows) but it can make things difficult.
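One common way to square that circle, sketched here in Python and loosely in the spirit of Clojure's atoms, is to keep the values immutable and share a single mutable reference to the current one. This is an illustration only, not what the post describes: the `Atom` class and its `swap` method are hypothetical names, and the lock guards just the pointer swap rather than the data itself.

```python
import threading
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Atom(Generic[T]):
    """A single mutable reference to an immutable value."""

    def __init__(self, value: T) -> None:
        self._value = value
        self._lock = threading.Lock()

    def get(self) -> T:
        # Readers get a snapshot that no writer can ever change.
        return self._value

    def swap(self, update: Callable[[T], T]) -> T:
        """Replace the current value with update(current), retrying on conflict."""
        while True:
            seen = self._value
            proposed = update(seen)      # computed outside the lock
            with self._lock:
                # Commit only if nobody replaced the value in the meantime.
                if self._value is seen:
                    self._value = proposed
                    return proposed
```

Writers call something like `atom.swap(lambda t: t + (item,))`, and anyone holding an older snapshot keeps seeing exactly what they read. The cost is that `update` may run more than once under contention, so it has to be free of side effects.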