
How and when does repartitioning in Spark help to improve performance?

Aditya
3 min read · Apr 16, 2020

What is a partition → In Spark, an RDD is a data structure that holds data. The data is generally too large to fit on a single node, so it must be split and placed (partitioned) across various nodes. In short, a partition is an atomic chunk of data stored on a given node in a cluster, and an RDD is a collection of those partitions.
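To make this concrete, here is a minimal sketch (assuming a Spark shell or application where a SparkContext `sc` is already available) that creates an RDD with an explicit number of partitions and inspects how records are distributed across them:

```scala
// Create an RDD from a local collection, explicitly split into 8 partitions.
val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

println(rdd.getNumPartitions) // 8

// Count the records held by each partition to see the distribution.
rdd.mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
   .collect()
   .foreach { case (idx, n) => println(s"partition $idx holds $n records") }
```

Each partition is processed by one task, so this distribution directly determines how much parallelism a job can achieve.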

So how many partitions should the data be chopped into → Too few partitions cause low concurrency, data skew, and poor resource utilization; on the other hand, too many partitions cause task scheduling to take more time than the actual execution.

So what is repartition → It is a transformation in Spark that changes the number of partitions and rebalances the data across them. It can be used to either increase or decrease the number of partitions, and it always shuffles all the data over the network, so it is a fairly expensive operation.
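As a hedged sketch of the idea (again assuming a SparkContext `sc`), an RDD created with too few partitions can be rebalanced with `repartition`, at the cost of a full shuffle:

```scala
// Only 2 partitions: at most 2 tasks can run in parallel on this data.
val skewed = sc.parallelize(1 to 1000000, numSlices = 2)

// repartition(16) triggers a full shuffle and redistributes the data
// roughly evenly across 16 partitions, improving parallelism downstream.
val balanced = skewed.repartition(16)

println(balanced.getNumPartitions) // 16
```

The shuffle cost is paid once; whether it is worth it depends on how much work the downstream stages do on the rebalanced data.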

Spark also has an optimized version of repartition() called coalesce(), which minimizes data movement by merging existing partitions instead of shuffling everything, and can only be used to decrease the number of partitions.
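A minimal sketch of coalesce (assuming a SparkContext `sc`): it collapses many partitions into fewer ones on the same nodes where possible, avoiding a full shuffle:

```scala
// 200 small partitions, e.g. the output of a wide aggregation.
val wide = sc.parallelize(1 to 1000000, numSlices = 200)

// coalesce(10) merges neighbouring partitions without a full shuffle,
// which is cheaper than repartition(10) but may leave partitions uneven.
val narrow = wide.coalesce(10)

println(narrow.getNumPartitions) // 10
```

This is typically used before writing output, to avoid producing hundreds of tiny files.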

So in which scenarios does repartition actually help to improve performance, if it is such an expensive operation?

Consider a real-world example (instead of sample code like `sc.parallelize(List("this is","an`

Written by Aditya

Principal data engineer → Distributed Threat hunting security platform | aws certified solutions architect | gssp-java | Chicago-IL
