In which scenarios do we need to use mapPartitions or foreachPartition in Spark? (A simple question that can gauge your knowledge of Spark programming)
As a data engineer developing Spark jobs, you will encounter situations where the code running on executors (e.g., inside a map or foreach) needs to perform operations such as:
- Call a service to enrich each row of the dataset.
- Store or retrieve data from an external store (e.g., a database or Redis).
- Perform stateful operations or aggregations specific to the partition that is executing as a task on a given executor.
These are expensive operations with heavy initialization. For example, to make a service call we need to create an HTTP or gRPC client, make the call for each row in the dataset, process the result, and close the connection. The same applies to database operations: the cost of opening and closing a connection for every element in the dataset is huge.
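To make the cost concrete, here is a minimal sketch of the per-row anti-pattern. The `EnrichmentClient` below is a hypothetical stand-in for a real HTTP/gRPC client, and the `SparkSession` setup is only there to keep the example self-contained.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical client for illustration; stands in for a real HTTP/gRPC
// client whose construction and teardown are expensive.
class EnrichmentClient {
  def enrich(s: String): String = s.toUpperCase // stand-in for a remote call
  def close(): Unit = ()
}

val spark = SparkSession.builder.appName("partition-demo").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"))

// Anti-pattern: a client is created and torn down for EVERY element.
val enrichedSlow = rdd.map { row =>
  val client = new EnrichmentClient() // heavy init, once per row
  try client.enrich(row)
  finally client.close()
}
```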
So what if, instead of operating on each row (element) of the dataset/RDD, we operate at the partition level (the logical unit that executes in parallel on an executor)? Then, for the thousands of rows in a given partition, we pay heavy costs like the above only once. Enter mapPartitions and foreachPartition.
“mapPartitions” → A narrow transformation that achieves partition-wise processing: it receives each data partition as a whole (an iterator over its rows). Because it is a transformation, the code we write inside it will not be executed until we call some action such as count or collect.
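As a sketch of partition-wise processing (reusing the hypothetical `EnrichmentClient` and `rdd` from the snippet above), mapPartitions lets us pay the setup cost once per partition. One subtlety: the iterator it returns is consumed lazily, so we materialize the results before closing the client.

```scala
// One client per partition instead of one per element.
val enrichedFast = rdd.mapPartitions { rows =>
  val client = new EnrichmentClient() // heavy init, once per partition
  // Materialize before closing: returning the lazy iterator directly
  // would use the client after close() has been called.
  val result = rows.map(client.enrich).toList
  client.close()
  result.iterator
}

// Nothing above executes until an action is called:
println(enrichedFast.count())
```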
“foreachPartition” → Sometimes we don't need to perform a transformation like map; we just need to loop through the dataset and perform operations such as service lookups. In that case we can use foreachPartition.
Unlike mapPartitions, foreachPartition is an action, so it is executed as soon as it is called, whereas mapPartitions is a lazy operation that waits for an action to be called on the transformed data it returns.
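A minimal sketch of the same pattern with foreachPartition, where `DbConnection` is a hypothetical stand-in for a real database/Redis client (again reusing `rdd` from above). Since foreachPartition is an action with no return value, this runs as soon as it is called:

```scala
// Hypothetical sink; stands in for a real database/Redis connection.
class DbConnection {
  def write(s: String): Unit = println(s"writing $s") // stand-in for an insert
  def close(): Unit = ()
}

// Action: executes immediately, one connection per partition.
rdd.foreachPartition { rows =>
  val conn = new DbConnection()
  try rows.foreach(conn.write)
  finally conn.close()
}
```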
Used judiciously in the right scenarios, like the ones above, foreachPartition and mapPartitions can improve the performance and efficiency of the underlying Spark job many times over.