How and when does repartitioning in Spark help improve performance?

What is a partition → In Spark, an RDD is a data structure that holds data. Generally the data is too large to fit on a single node, so it is split and placed (partitioned) across various nodes. In short, a partition is an atomic chunk of data stored on a given node in a cluster, and an RDD is a collection of those partitions.
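As a quick illustration (a minimal spark-shell sketch; the data and partition count here are made up), you can inspect how an RDD is split across partitions:

```scala
// In spark-shell, the SparkContext `sc` is already available.
// Split 10 numbers into 4 partitions and inspect the layout.
val rdd = sc.parallelize(1 to 10, 4)
println(rdd.getNumPartitions)   // 4
// glom() turns each partition into an array, so we can see its contents.
rdd.glom().collect().foreach(p => println(p.mkString(",")))
```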
So how many partitions should the data be chopped into → Having too few partitions causes low concurrency, data skew, and poor resource utilization; on the other hand, too many partitions cause task scheduling to take more time than the actual execution.
So what is repartition → It is a transformation in Spark that changes the number of partitions and rebalances the data across them. It can be used to either increase or decrease the number of partitions, and it always shuffles all the data over the network, so it is a fairly expensive operation.
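A minimal sketch of repartition in both directions (assuming an active SparkContext `sc`, e.g. in spark-shell; the sizes are illustrative):

```scala
// Start with 2 partitions.
val rdd  = sc.parallelize(1 to 100, 2)
// Both calls trigger a full shuffle of all 100 elements.
val more = rdd.repartition(8)   // scale out: 2 -> 8 partitions
val less = rdd.repartition(1)   // collapse:  2 -> 1 partition
println(more.getNumPartitions)  // 8
println(less.getNumPartitions)  // 1
```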
Spark also has an optimized variant of repartition() called coalesce(), which avoids a full shuffle by merging existing partitions, but (without shuffling) it can only be used to decrease the number of partitions.
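The difference can be seen in a short sketch (again assuming `sc` is available; partition counts are illustrative):

```scala
val rdd = sc.parallelize(1 to 100, 8)
// coalesce merges existing partitions locally where possible: no full shuffle.
val fewer = rdd.coalesce(2)
println(fewer.getNumPartitions)  // 2
// Asking coalesce for MORE partitions without a shuffle is silently ignored,
// so the count stays at 8 unless shuffle = true is passed -- which is
// effectively what repartition() does internally.
println(rdd.coalesce(16).getNumPartitions)                  // 8
println(rdd.coalesce(16, shuffle = true).getNumPartitions)  // 16
```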
So if repartition is an expensive operation, in which scenarios does it actually help to improve performance?
Consider a real-world example (instead of sample code like sc.parallelize(List("this is","an
…