When a Spark application invokes an action, such as collect() or take() on a DataFrame or Dataset, that action creates a job. Below is the cheat sheet to remember; a short code sketch follows it.
One job results in one or more stages
One stage results in one or more tasks
One task operates on one partition
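The minimal Scala sketch below (the app name, partition count, and values are assumptions for illustration, not from the original text) ties the cheat sheet to code: the groupBy is a wide transformation that splits the job into two stages, the first stage runs one task per partition, and no job exists at all until the collect() action runs.

```scala
import org.apache.spark.sql.SparkSession

object JobStageTaskDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("job-stage-task-demo")   // hypothetical app name
      .master("local[4]")               // 4 local cores -> up to 4 tasks run in parallel
      .getOrCreate()
    import spark.implicits._

    // 8 partitions, so the first stage runs 8 tasks
    val df = spark.range(1000).toDF("n").repartition(8)

    // groupBy is a wide transformation: it forces a shuffle,
    // so this job gets a second stage after the shuffle boundary
    val counts = df.groupBy($"n" % 10).count()

    counts.collect()   // the action: only here does Spark actually create a job
    // Jobs, stages, and per-stage tasks are visible in the Spark UI (port 4040)
    spark.stop()
  }
}
```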
So what do executors run? → An executor (from an interview or conceptual perspective) is one JVM running on a physical node (each physical node can host one or more executors). One executor can run one or more tasks.
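As a rough sketch of how executor sizing is expressed (the app name and the numbers are assumptions), the standard Spark configuration keys control how many executor JVMs are launched and how many tasks each can run concurrently:

```scala
import org.apache.spark.sql.SparkSession

object ExecutorSizingDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical sizing: 4 executor JVMs with 3 cores each means up to
    // 12 tasks can run concurrently across the cluster.
    val spark = SparkSession.builder()
      .appName("executor-sizing-demo")           // assumed app name
      .config("spark.executor.instances", "4")   // executor JVMs, one or more per node
      .config("spark.executor.cores", "3")       // tasks one executor runs in parallel
      .config("spark.executor.memory", "4g")     // heap for each executor JVM
      .getOrCreate()

    println(spark.sparkContext.getConf.get("spark.executor.cores"))
    spark.stop()
  }
}
```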
A general day-to-day example to understand the above scenario:
Let's say our Spark job's objective is to go to the bank, withdraw some cash, and pay bills.
Stage 1 → Driving from home to the bank is one stage.
- Starting the car is one task
- Starting the GPS is another task
- Driving on the road is another task
Stage 2 → Going into the bank is another stage
- Getting out of the car is one task
- Walking to the bank is another task
- Going to the teller window is another task
Stage 3 → Withdrawing cash from the teller is another stage
- Giving the debit card to the teller is one task
- Requesting the withdrawal amount is another task
- Collecting the money is another task
So in the above example, stage 3 depends on stage 2, and stage 2 depends on stage 1. In Spark, this kind of dependency generally comes from a shuffle (wide transformation); each intermediate stage that writes shuffle output is called a ShuffleMapStage.
Say after withdrawing the money we use it to pay off some bills; that can be considered the final stage, which Spark calls the ResultStage. A minimal code sketch of both stage types follows.
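To connect the analogy back to code, here is a hedged Scala sketch (identifiers and values are made up for illustration): the reduceByKey is a wide transformation whose shuffle boundary ends a ShuffleMapStage, and the collect() action makes the last stage in the chain the ResultStage.

```scala
import org.apache.spark.sql.SparkSession

object StageChainDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stage-chain-demo")   // hypothetical app name
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    val totals = sc.parallelize(1 to 100, numSlices = 4)
      .map(n => (n % 5, n))     // narrow transformation: stays in the same stage
      .reduceByKey(_ + _)       // wide transformation: shuffle boundary ends a ShuffleMapStage
      .map { case (k, sum) => (k, sum * 2) }

    // toDebugString prints the lineage; indentation changes mark shuffle
    // boundaries, i.e. where one ShuffleMapStage feeds the next stage.
    println(totals.toDebugString)

    totals.collect()            // the action: the last stage in the chain is the ResultStage
    spark.stop()
  }
}
```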