This is a false dichotomy. You'll run into all sorts of pain and frustration as soon as Spark touches your codebase, no matter what kind of Scala you're writing.
I've used almost every big data processing system out there and Spark causes no more pain than any other. Less than most, actually. It's also a data processing system, not a library, so if you're integrating it into existing code (rather than writing code for it and running that code on top of it), I'd argue you're doing it wrong.
One big reason, maybe the biggest, to write Spark jobs in Scala (and not move to PySpark) is code reuse between different components. My team maintains a fairly sizeable library that is shared between our Spark jobs and several web services. Spark becomes a library dependency as soon as you do anything remotely complex with it, and it can easily creep into everything if you aren't careful.
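A minimal sbt sketch of the kind of layout that keeps the creep in check (module names and version numbers here are made up for illustration, not our actual build): the shared library knows nothing about Spark, and Spark itself is a `Provided` dependency of the job module only.

```scala
// build.sbt (sketch) -- hypothetical module names
lazy val core = project
  .settings(
    scalaVersion := "2.12.18"
    // shared domain code: no Spark dependency at all
  )

lazy val sparkJobs = project
  .dependsOn(core)
  .settings(
    scalaVersion := "2.12.18",
    // Provided: the cluster supplies Spark at runtime, so it never leaks
    // into anything else that depends on core
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % Provided
  )

lazy val webService = project
  .dependsOn(core) // reuses the domain code without pulling in Spark
```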
Decoupling modules isn't always obvious in a codebase that's grown organically for 6-7 years now (long before I joined), and the cohabitation with Spark inevitably causes some pain. A couple of examples:
- Any library that depends on Jackson is likely to cause binary compatibility issues because of the ancient versions shipped with Spark. Guava can be a problem too. Soon enough you'll need to shade a bunch of libraries in your Spark assembly (a rough sketch of the shading setup follows this list).
- We have a custom sparse matrix implementation that fits our domain well; it was completely broken by the new collections in Scala 2.13. That makes cross-publishing complicated if I don't want to be stuck on Scala 2.12 because of Spark (see the cross-building sketch below).
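To make the first point concrete, here's roughly what the shading ends up looking like with sbt-assembly. The plugin version and the renamed package prefixes are illustrative; the exact rule set depends on what your job actually pulls in.

```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// build.sbt -- shade our own Jackson/Guava so they don't clash with the
// older versions already on Spark's classpath; only the fat jar produced
// by `assembly` is affected
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "shaded.jackson.@1").inAll,
  ShadeRule.rename("com.google.common.**"     -> "shaded.guava.@1").inAll
)
```

And for the second point, a minimal sketch of how the cross-publishing can be wired up in sbt, assuming the standard version-specific source directories (again, module names and version numbers are made up):

```scala
// build.sbt -- the shared library cross-publishes, the Spark modules stay on 2.12
lazy val matrix = project
  .settings(
    scalaVersion := "2.12.18",
    crossScalaVersions := Seq("2.12.18", "2.13.12"),
    // sbt compiles src/main/scala-2.12 or scala-2.13 per Scala version,
    // which is where the collection-specific parts of the code live;
    // scala-collection-compat smooths over the rest
    libraryDependencies +=
      "org.scala-lang.modules" %% "scala-collection-compat" % "2.11.0"
  )

lazy val sparkJobs = project
  .dependsOn(matrix)
  .settings(scalaVersion := "2.12.18") // pinned by the Spark distribution we run on
```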
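None of this is rocket science, but it's build plumbing you only have to maintain because Spark sits in the dependency graph.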
A lot of the hardcore Scala community hates Spark, which sort of summarizes the situation.