One big reason, maybe the biggest, to write Spark jobs in Scala (and not move to PySpark) is code reuse between different components. My team maintains a fairly sizeable library that is common to our Spark jobs and several web services. Spark becomes a library dependency if you do anything remotely complex with it, and it can easily creep in everywhere if you aren't careful.
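To keep that creep in check, the shared code lives in modules that never see Spark, and only the Spark jobs pull Spark in, scoped as Provided. A minimal build.sbt sketch (module names and version numbers are placeholders, not our actual build):

```scala
// build.sbt sketch -- module names and versions are made up for illustration.
lazy val common = (project in file("common"))
  .settings(
    // Shared between web services and Spark jobs, so no Spark dependency allowed here.
    libraryDependencies += "com.typesafe.play" %% "play-json" % "2.9.4"
  )

lazy val sparkJobs = (project in file("spark-jobs"))
  .dependsOn(common)
  .settings(
    // Provided keeps Spark out of the published artifact / fat jar;
    // the cluster supplies it at runtime.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % Provided
  )
```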
Decoupling modules isn't always obvious in a codebase that's grown organically for 6-7 years now (since long before I joined), and the cohabitation with Spark inevitably causes some pain. A few examples:
- Any library that depends on Jackson is likely to hit binary compatibility issues because of the ancient versions shipped with Spark. Guava can be a problem too. Soon enough you'll need to shade a bunch of libraries in your Spark assembly (see the shading sketch after this list).
- We have a custom sparse matrix implementation that fits our domain well; it was completely broken by the new collections in Scala 2.13. That makes cross-publishing complicated if I don't want to be stuck on Scala 2.12 because of Spark (see the cross-building sketch below).
- Play-json codecs aren't serializable (ironically enough, Circe, the "hardcore Scala" library, is), which bites as soon as one gets captured in a Spark closure (see the workaround sketch below).
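For the Jackson/Guava point, this is roughly what the shading ends up looking like with sbt-assembly (a sketch; the renamed package prefixes and the exact libraries you shade will vary):

```scala
// build.sbt sketch with sbt-assembly -- shaded package names are just examples.
assembly / assemblyShadeRules := Seq(
  // Rename Guava packages inside the fat jar so our newer version doesn't
  // clash with the one Spark already has on the classpath.
  ShadeRule.rename("com.google.common.**" -> "shaded.guava.@1").inAll,
  // Same trick for Jackson when a library needs a newer version than Spark ships.
  ShadeRule.rename("com.fasterxml.jackson.**" -> "shaded.jackson.@1").inAll
)
```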
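For the Scala 2.13 point, cross-building the shared library looks something like this (versions are placeholders): scala-collection-compat covers most of the collection API gap, and whatever it can't is split into version-specific source directories.

```scala
// build.sbt sketch -- Scala and library versions are placeholders.
ThisBuild / crossScalaVersions := Seq("2.12.18", "2.13.12")

lazy val matrix = (project in file("matrix"))
  .settings(
    // scala-collection-compat papers over most 2.12/2.13 collection API
    // differences so one source tree can cross-compile; anything that
    // can't be unified lives in src/main/scala-2.12 and src/main/scala-2.13.
    libraryDependencies +=
      "org.scala-lang.modules" %% "scala-collection-compat" % "2.11.0"
  )
```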
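And for the play-json point, the usual workaround when a non-serializable codec would get dragged into a Spark closure is a @transient lazy val, so the codec is rebuilt on each executor instead of being shipped with the task. A sketch with a made-up Event type:

```scala
import org.apache.spark.rdd.RDD
import play.api.libs.json.{Format, Json}

// Hypothetical payload type, purely for illustration.
case class Event(id: String, value: Double)

class EventParser extends Serializable {
  // A plain `val` here would be serialized along with the closure below and
  // fail with NotSerializableException, because play-json Formats aren't
  // java.io.Serializable. @transient lazy val skips the field during
  // serialization and rebuilds it on each executor on first use.
  @transient private lazy val eventFormat: Format[Event] = Json.format[Event]

  def parse(lines: RDD[String]): RDD[Event] =
    lines.map(line => Json.parse(line).as[Event](eventFormat))
}
```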