How and when does repartitioning in Spark help improve performance?

What is a partition → In Spark, an RDD is a data structure that holds data. The data is usually too large to fit on a single node, so it is split (partitioned) and placed across multiple nodes. In short, a partition is an atomic chunk of data stored on a given node in the cluster, and an RDD is a collection of those partitions.
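To make the idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's implementation: the helper `partition_records`, the node names, and the modulo placement rule are all assumptions made for illustration.

```python
def partition_records(records, num_partitions):
    """Toy model: split (key, value) records into chunks by key modulo.

    This is NOT Spark's partitioner; it only illustrates how one dataset
    can be divided into independent chunks.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # Integer keys are placed by key % num_partitions (an assumed rule).
        partitions[key % num_partitions].append((key, value))
    return partitions


# A small dataset standing in for data too large for one node.
records = [(i, f"row-{i}") for i in range(10)]

# Split it into 3 partitions.
partitions = partition_records(records, 3)

# Hypothetical node names: each node holds one partition.
nodes = ["node-a", "node-b", "node-c"]
placement = dict(zip(nodes, partitions))

for node, chunk in placement.items():
    print(node, [key for key, _ in chunk])
```

Taken together, the three chunks play the role of the RDD, and each chunk is one partition. In PySpark itself, `rdd.getNumPartitions()` reports how many partitions an RDD has, and `rdd.repartition(n)` redistributes the data into `n` partitions.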

Aditya

Principal data engineer → Distributed threat-hunting security platform | AWS Certified Solutions Architect | GSSP-Java | Chicago, IL