In which scenarios do you need to use mapPartitions or foreachPartition in Spark? (A simple question that can gauge your knowledge of Spark programming)
As a data engineer developing Spark jobs, you will encounter situations where the Spark code running on executors (for example, inside map or foreach) needs to perform the operations below:
- Call a service to enrich each row of a given dataset.
- Store or retrieve something from a data store (such as a database or Redis), etc.
- Perform stateful operations or aggregations specific to the partition that is executing as a task on a given executor.
These are expensive operations with heavy initialization. For example, to make a service call we need to create an HTTP client or a gRPC client, make the call for each row in the dataset, process the result, and close the connection. The same applies to database operations: the cost of opening and closing a connection for every element in the dataset is huge.
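To make that cost concrete, the naive per-row version of the service-call case might look like the sketch below. This is a sketch under assumptions: `EnrichClient`, its constructor URL, and the `enrich` method are hypothetical names standing in for whatever HTTP or gRPC client you actually use.

```scala
// Anti-pattern sketch: the client is created and torn down once PER ROW.
// EnrichClient, the URL, and enrich(...) are hypothetical placeholders.
val enriched = rdd.map { row =>
  val client = new EnrichClient("http://enrich-service:8080") // heavy init, per row
  try client.enrich(row)                                      // one call per row
  finally client.close()                                      // teardown, per row
}
```

With millions of rows, the connection setup and teardown can easily dominate the actual work.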
So what if, instead of operating on each row (element) of the dataset/RDD, we operate at the partition level (the logical unit that executes on an executor in parallel)? Then, for the thousands of rows in a given partition, we perform heavy operations like the above only once. Enter mapPartitions and foreachPartition.
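For the write-to-a-data-store case, foreachPartition lets each task open one connection, reuse it for every row in the partition, and close it once. A hedged sketch, assuming a hypothetical `Database.connect` helper and `insert` method in place of your real JDBC or Redis client:

```scala
// Partition-level writes: one connection per partition, not per row.
// Database.connect, the JDBC URL, and insert(...) are hypothetical placeholders.
rdd.foreachPartition { rows =>
  val conn = Database.connect("jdbc:postgresql://db:5432/app") // once per partition
  try rows.foreach(row => conn.insert(row))                    // connection reused for every row
  finally conn.close()                                         // closed once per partition
}
```

Because foreachPartition is an action with no return value, it fits side-effecting work like writes; for transformations that produce a new dataset, mapPartitions is the counterpart.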
“mapPartitions” → The only narrow transformation that achieves partition-wise processing, meaning it processes each data partition as a whole; the…
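The enrichment example rewritten with mapPartitions looks like the sketch below, with the client built once per partition instead of once per row (again, `EnrichClient` and `enrich` are hypothetical stand-ins). One subtlety worth noting: the iterator a partition hands you is lazy, so a simple safe pattern is to materialize the partition's results before closing the client.

```scala
// Partition-level enrichment: heavy client init happens once per partition.
// EnrichClient, the URL, and enrich(...) are hypothetical placeholders.
val enriched = rdd.mapPartitions { rows =>
  val client = new EnrichClient("http://enrich-service:8080") // once per partition
  try rows.map(row => client.enrich(row)).toList.iterator     // materialize so close() is safe
  finally client.close()                                      // closed once per partition
}
```

Materializing with `toList` trades memory for safety; if a partition is too large to hold in memory, you would instead close the client when the iterator is exhausted.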