If you look at the architecture of something like Apache Beam, while you describe your computations in a language like Java or Python, you're really using a DSL for creating an immutable DAG that processes chunks of immutable data, persisting each chunk (being loose with terms here) for durability, 'modifying' it in the immutable sense, and then passing it on to the next edge, and so on.
In a single standalone system, people will argue over the pros and cons of a purely immutable approach. In the Big Data world, I've never heard anyone argue for an imperative approach. Immutability is necessary, and as soon as you need immutability you want all the other goodies that facilitate it in a clean way: monads, lenses, and the like.
Intriguing, but I think I lost you a little in the details, and I'm really dying to understand your perspective.
1) creating an immutable DAG that processes chunks of immutable data,
2) persisting each chunk (being loose with terms here) for durability,
3) 'modifying' it in the immutable sense
4) and then passing it on to the next edge and so on
Isn't #4 the persisting part, not #2? Maybe I'm confused by what you mean by "persist"? Render the instructions (mutate the data)?
And when you say "'modifying' it in the immutable sense", does this mean that the original data stays untouched, but "instructions" on how to modify the data get "pinned" to the data?
Then every other subsequent step in the DAG is just modifying the "instructions"?
If I understand you correctly, your point about how this is where FP shines is really illustrated in #4: you end up passing instructions around instead of the data. But how FP instructions can be modified, and modified again, really baffles me. The closest I can get to understanding it in practical terms is thinking about Clojure, where code is just data, so you can modify or build upon it the way you modify data. I struggle to extend this to another language like Scala or JavaScript.
Let's say your immutable DAG has four different 'transformations' (think either a map or a fold on the data). Beam or Spark will take a chunk of your data, partition it somehow (so you get parallelism), and do a transform on the data.
Now if you're doing a bunch of computations across tens or hundreds of machines, you don't want to fail the whole thing because one hypervisor crashes or there's a JVM segfault or whatnot. So each step is usually saved to disk via a checkpoint and then moved along to the next transformation in batches; that way, if the subsequent transformation is where it dies, you have a safe spot to restart that particular computation from.
In addition, your (chunk of) data might need to be shipped to another machine to be transformed, so you're not so much 'updating' data as you are creating a new, space-efficient representation of the changes and shuffling those along the various steps.
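As a toy sketch of the shape of such a pipeline -- plain Clojure, no checkpointing or shuffling, all names made up for illustration:

```clojure
;; a toy four-step 'DAG' over partitioned chunks of data;
;; each transform returns new data and leaves its input untouched
(defn run-pipeline [data]
  (->> (partition-all 1000 data)        ; chunk the input for parallelism
       (pmap (fn [chunk]
               (->> chunk
                    (map inc)           ; transform 1: a map
                    (filter even?)      ; transform 2: a filter
                    (map #(* % %))      ; transform 3: another map
                    (reduce +))))       ; transform 4: a fold per chunk
       (reduce +)))                     ; combine the per-chunk results

(run-pipeline (range 100000))
```

A real engine would persist each intermediate chunk between steps; the point is just that each edge of the DAG consumes one immutable value and emits another.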
It's awkward to work with deeply nested structures. Think about having a map of customer objects, each of which has a list of addresses, and you need to update everyone's first address to be capitalized for some reason. You'd really want a fully fleshed out lens library, and maybe even macros for auto-generating lenses for your concrete data structures, to make it easy to 'update' (via a persistent structure) without having to deal with all the cruft of doing it by hand.
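For a rough idea of what a hand-rolled (macro-free) lens might look like in Clojure -- this is just a sketch, with made-up names, not any particular library's API:

```clojure
(require '[clojure.string :as str])

;; a minimal lens: a getter/setter pair, plus composition
(defn lens [getter setter] {:get getter :set setter})
(defn view [l s]   ((:get l) s))
(defn put  [l s v] ((:set l) s v))
(defn over [l f s] (put l s (f (view l s))))

(defn comp-lens [outer inner]
  (lens (fn [s]   (view inner (view outer s)))
        (fn [s v] (put outer s (put inner (view outer s) v)))))

;; lenses down to a customer's first address
(def addresses     (lens :addresses (fn [c v] (assoc c :addresses v))))
(def first-elem    (lens first      (fn [xs v] (assoc xs 0 v))))
(def first-address (comp-lens addresses first-elem))

(over first-address str/upper-case
      {:name "Ada" :addresses ["12 main st" "po box 9"]})
;; => {:name "Ada", :addresses ["12 MAIN ST" "po box 9"]}
```

The "fully fleshed out" version would generate those per-field lenses for you via macros, which is roughly what lens libraries in Haskell or Scala do.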
Our approach is to write a function which pattern matches on some `object` (map) and does the processing. We find it less fragile than specifying an explicit path to an element, and it can also work in a polymorphic fashion. On the other hand, there is a risk of a false positive (modifying an address that you shouldn't), but you can mitigate that risk with additional checks (in the case of a customer, you can check for an additional set of fields that are specific to that object/map).
I agree that you want specialised mini-DSLs for querying and updating highly nested structures (... which you should probably avoid creating anyway), but for such a simple task you could just do this (Clojure example):
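Something along these lines, assuming `customers` is a map of id -> customer and each customer has an `:addresses` vector (names are illustrative):

```clojure
(require '[clojure.string :as str])

;; capitalize everyone's first address with a plain path-based update;
;; the original map is untouched and the result shares structure with it
(defn capitalize-first-addresses [customers]
  (into {}
        (map (fn [[id customer]]
               [id (update-in customer [:addresses 0] str/capitalize)]))
        customers))
```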
In the JavaScript world, there is ImmutableJS, which uses this style of updates. However, there is a better approach, which is what ImmerJS uses: you write a function in which you mutate the target data structure, wrap it in a `produce` call, and the library will pick up on those mutations and create a new copy of the data that is structurally shared with the original.
That Clojure code doesn't quite capture the essence of ImmerJS. The whole point of ImmerJS is that JS has nice, built-in syntax for in-place mutation, and we can reuse that syntax to generate updates to an immutable data structure, so long as the mutation syntax is scoped such that everything outside an individual mutation block stays immutable. That it is implemented with JS proxies is something of an implementation detail (it could, e.g., be implemented with linear or affine types, or something like Haskell's ST type).
In this sense it's closer to Clojure's transients, if Clojure came prepackaged with special mutation syntax (notably assignment via =) and if transient-ness could be propagated transitively to all nested structures in a data structure.
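For comparison, a small sketch of the transient pattern -- mutate locally for speed, freeze back to a persistent value before sharing:

```clojure
;; transient/persistent!: locally mutable accumulation,
;; frozen into an ordinary persistent vector at the end
(defn squares [n]
  (persistent!
    (reduce (fn [acc i] (conj! acc (* i i)))
            (transient [])
            (range n))))

(squares 5)
;; => [0 1 4 9 16]
```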
Nothing wrong with using incredibly un-fancy data structures, like arrays, in an FP way. You just need to make sure you don't do operations that modify individual elements, but instead work in large batches. So no for loops inserting elements one by one, but map, filter, flatmap...
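A quick sketch of that batch style in Clojure (`mapcat` playing the role of flatmap):

```clojure
;; whole-collection batch operations instead of
;; inserting/updating elements one at a time
(->> (range 10)
     (map #(* % %))           ; map
     (filter odd?)            ; filter
     (mapcat (fn [x] [x x]))  ; flatmap
     (into []))
;; => [1 1 9 9 25 25 49 49 81 81]
```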
Most good FP data structures will be tree-like, but with a low branching factor and chunky leaves. Of course, that does not save you from heavy memory consumption if you use a language/platform where even primitives are boxed, like the JVM. But that is a completely separate topic.
Sharing mutable data without a mutex (which suffers from unbounded contention) is hard. Approaches that work include updating persistent data structures and sharing the new copy, or sharing diff objects over a lock-free queue.
Sharing the copy is not what's solved by lenses. In a certain sense lenses are just a way to try to do away with the boilerplate of updating nested immutable data structures. It is very easy to write this boilerplate (in the sense that it is straightforward and hard to mess up). It's just mind-numbingly tedious.
Although I don't know what you mean by growing lenses.
Mutex + inner mutation is no easier than CAS (the usual solution for concurrent writing of immutable data structures, a la Java's AtomicReference) or STM (another popular one), and in my opinion it's significantly harder as soon as you have multiple mutexes.
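Clojure's atoms are a handy illustration of the CAS approach -- an AtomicReference under the hood, with `swap!` applying a pure function and retrying on contention (the scenario names here are made up):

```clojure
;; an atom holds an immutable value; swap! computes a new value
;; with a pure function and installs it via compare-and-swap
(def counts (atom {}))

(defn record-hit! [k]
  (swap! counts update k (fnil inc 0)))

;; safe under concurrent writers: each swap! is atomic
(dorun (pmap record-hit! (repeat 1000 :home)))
@counts
;; => {:home 1000}
```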
CAS is a massive pain in the ass for complex changes. Lenses are a way to point to part of an object, which can in theory make it less painful, since you can redo a change without necessarily redoing the work to set up the change. 'Growing' just refers to growing a codebase that uses them by adding more of them, is all.
I am not against FP; I think the point of "it is good if you can fix performance" is very accurate. I just think it is also important to acknowledge that the paradigm can be more complex in situations where it is supposed to help, leading to a mixed bag.
Similar to distributed databases. Your database can now be phenomenally powerful but you can't do a stored proc anymore without losing that power.
It can be very effective and totally worth it, but it isn't necessarily a pure win.
What are you CASing? If the object is too large you will have contention to the point that you might as well be single threaded. If the object is too small you now have to CAS multiple things which is far from trivial.
CAS is a primitive, it isn't complex itself but it can be complex to work with once you are talking non trivial work. Just like locks. A global lock is dumb simple but a real locking system can be as complex as you will let it.
> If the object is too large you will have contention to the point that you might as well be single threaded. If the object is too small you now have to CAS multiple things which is far from trivial.
Right, but both of those things are true for locks too, right? CAS seems no harder than locks.
Retry logic is hidden for locks while in your face for CAS.
Additionally, while multiple locks are annoying, they are way easier than multiple CAS operations. "Have a global order for locks" is the hard-but-solvable problem for multiple locks. For CAS, if you need to CAS two dependent things you... I don't know, it depends.
> Retry logic is hidden for locks while in your face for CAS.
It's hidden in both cases (usually CAS instructions are hidden behind an `update` interface, as in Java's Atomic* family, rather than used directly -- in the same way that a lock is generally an interface over a TAS instruction/spin lock plus an upgrade to a wait queue).
> For CAS, if you need to CAS two dependent things you... I don't know, it depends.
You use nested atomic references. Just like locks it's not a great way of doing things, and an analogous problem to ordering locks rears its head (by virtue of nesting you cannot mess up ordering in the strict sense, but you can accidentally "cross the boundaries" of two atomics in an update that goes against the nesting order), but it's doable in the same way as locks.
The usual CAS-like but better approach is STM (which is where immutability really shines).
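Clojure's refs are probably the most familiar example of that: coordinated updates over immutable values, with conflicting transactions retried automatically. A minimal sketch:

```clojure
;; STM: dosync coordinates updates to multiple refs;
;; conflicting transactions are retried by the runtime
(def from (ref 100))
(def to   (ref 0))

(defn transfer! [amount]
  (dosync
    (alter from - amount)
    (alter to   + amount)))

(transfer! 30)
[@from @to]
;; => [70 30]
```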
I'm still not seeing how CAS is any harder than locks.
IMO the hard part of immutable structures is mutation (solved by https://lib.rs/crates/im), the hard part of diffing is writing one "command" subclass per type of operation (which can be more or less manageable), and the hard part of mutexes is remembering to lock the mutex or rwlock in every single access (Rust fixes this), avoiding priority inversions (Rust doesn't help), and avoiding deadlocks (Rust doesn't help).
Under the hood it's effectively a single CAS instruction that loops on failure (which only occurs under contention, but then you have waiting with locks too).
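Roughly, in Clojure terms (this `my-swap!` is a from-scratch sketch of what the built-in `swap!` does):

```clojure
;; read the current value, compute the new one with a pure function,
;; and compare-and-set!; if another thread won the race, try again
(defn my-swap! [a f & args]
  (loop []
    (let [old-val @a
          new-val (apply f old-val args)]
      (if (compare-and-set! a old-val new-val)
        new-val
        (recur)))))
```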
I really hope Unison succeeds. It looks like a really interesting take on FP / distributed systems. The fact that Paul Chiusano and Runar Bjarnason are at the helm is exciting, especially to Scala / FP devs like me who know them as the writers of the legendary "red book". Just based on the rare quality of that book, I'm super interested in other FP projects that they're involved in.
If you mean the Manning book "Functional Programming in Scala": I don't even use Scala, yet I feel I got so much out of that book. It's the best functional programming book I've read.
I would recommend the book to anyone who wants to learn FP, regardless of whether you will be using Scala.
That's the one. It's great for any FP learner for sure. A great example of how to write a book on a deep topic in an engaging and beautifully structured way.
Tangential to the thread, but do you have any other favorite Scala resources you would recommend? I've had only glancing exposure to it, and will soon be taking a role where it's the primary language.
I do indeed. The "red book" aka Functional Programming in Scala I wouldn't recommend to a beginner. It teaches FP at a fairly deep level to the extent of implementing the abstractions yourself. Take a look at that later perhaps. Don't get me wrong, it is utterly brilliant but it wouldn't be the place to start.
I would recommend Essential Scala for the basics. Scala with Cats if you're going to be working with the Typelevel stack (a popular set of FP libraries). Lots of other resources but those two alone will take you far.
1. The posts mention Spark, but were you (the Unison team) also inspired by dataflow / stream processing engines (e.g. Flink, Beam, Timely Dataflow etc.) or any other technologies?
2. Relatedly, are there any plans to add a stream processing API? Or perhaps that would belong in a 3rd-party library?
3. Broadly, where do you think (in your opinion) Unison fits in the ecosystem of tooling? Is Unison specialized at solving big data problems that require distributed execution e.g. something I pull out of my toolbox for a problem I'd otherwise solve with Spark? Or is it a truly general-purpose programming language?
4. With the Tree example, you show the "wrong" way and then the "right" way -- one problem I've experienced in the past is how you, as a new user, know the right way. For example, I wrap my entire program in IO and then .unsafeRunSync it -- technically following the rules, but getting nothing out of doing that. Have you thought about how the APIs could guide the user to the idiomatic solution? Perhaps via types?
By the way, I really loved the FP in Scala book, it was my first step into FP -- thank you and Runar for writing it
Hi there! I can't speak to questions 1-3, but with regards to question 4, some of the examples we generated were born from early mistakes I had stumbled into when writing the `Tree` functions, so I have some thoughts. It would be interesting to codify the behavior of evaluating the `Tree` in more types, but looking back, the thing that would have helped me most in avoiding some of the missteps might have been a richer test interpreter, where I could have "mocked" a network of, say, 3 nodes and distributed the `Tree` across them, emitting information about where the data was being moved, and where the various functions were being applied. We had actually briefly thought about writing such an interpreter for the purpose of this article and it's still something I'd like to do at some point.
So what we were going for with the core API is that it should be expressive enough for all kinds of distributed programming tasks, not just Spark-like or batch computing stuff.
We’re planning on doing several of these articles to show different applications. I think some distributed stream processing library would be super cool and there are lots of sources of inspiration for that like you mention. It would be a more interesting translation than the Spark-like stuff though. Likely using the channels ability for message passing.
> By the way, I really loved the FP in Scala book, it was my first step into FP -- thank you and Runar for writing it
> the thing that would have helped me most in avoiding some of the missteps might have been a richer test interpreter, where I could have "mocked" a network
This is a really intriguing idea. Debugging distributed systems can be a huge headache, and I can see a lot of value in having something like that.
> We’re planning on doing several of these articles to show different applications.
Looking forward to future articles and definitely will be following this project!
Could you point to some resources on how the underlying (distributed) runtime is architected and how it works? Also, is it primarily designed to support 'bringing the code to the data', or can it also support 'bringing the data to the code' a la Snowflake?
Hi there, the other article author chiming in here! You can perform side effects in Unison via abilities (our name for algebraic effects) - they're described here: https://www.unisonweb.org/docs/abilities/
Some basic IO functionality is supported by the `base` library (the standard lib) which contains functions like `openFile` which are effectful functions expressed in terms of the `IO` ability.
I think there were a lot of great ideas presented in these small examples.
1. Explicitly marking non-local _data_ vs. in-memory. A very cool idea -- languages and libraries I've used all seem to try to make this transparent (unsuccessfully, as indicated by the random serialization errors you get)
2. Making the noise around serialization and RPC transparent to the user
3. Bring the code to the data, not the data to the code
4. Immutability, FP, and I like the syntax -- no coincidence in its similarities to Haskell and Scala, I'm sure :)
These are all pain points in a lot of existing "big data systems" I've used, where they either solve the problem half-baked or don't address it at all. I'm excited to see where this project goes!
Your use of the word transparent confused me a little. It seems you are using it in the sense that details that are being made transparent are being hidden from the user.
But if you make something transparent to the user, it could also mean that you are directly exposing that thing to the user. (Right? I could be wrong about this.)
Just confused me and thought it might be useful to you to know that. Not trying to be an asshole.
I dislike the term "immutable data structure", because it has two widely used conflicting meanings:
1. Read-only data structures, which are used by everyone in every language all the time.
2. Persistent data structures, which make the underlying immutable structures kind of mutable by creating new versions and reusing parts from the old versions of the structure.
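Clojure's collections are the canonical example of (2) -- 'modifying' returns a new version that shares structure with the old one:

```clojure
;; assoc returns a new map; the old version is untouched,
;; and unchanged parts are shared between versions
(def m1 {:a {:big "payload"} :b 1})
(def m2 (assoc m1 :b 2))

m1                            ;; => {:a {:big "payload"}, :b 1}
m2                            ;; => {:a {:big "payload"}, :b 2}
(identical? (:a m1) (:a m2))  ;; => true -- the :a subtree is shared
```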
Aren't those the same things? I.e., isn't (2) just read-only data structures with functions to create new read-only data structures based on them (e.g. immutable linked lists that you can filter, concat, etc.)?
Or are you talking about something more specific, like mutable data structures that provide immutable snapshots of the backing data?
In computing, a persistent data structure (or non-ephemeral data structure) is a data structure that always preserves the previous version of itself when it is modified.
I usually distinguish "readonly" values (references which block mutation, but you can mutate via another reference) and "immutable" values (what OP calls persistent, where modification may copy-on-write but you're certain the original reference won't mutate).