Spark job vs stage vs task in simple terms (with cheat sheet)

Aditya
2 min read · Sep 20, 2022

When a Spark application invokes an action, such as collect() or take(), on a DataFrame or Dataset, the action creates a job. Below is a cheat sheet to remember.

One job results in one or more stages

One stage results in one or more tasks

One task operates on one partition
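To make the cheat sheet concrete, here is a minimal toy model in plain Python. It is not Spark code: `plan_job`, `Stage`, and `WIDE_OPS` are hypothetical names invented for this sketch, and the rule "a wide operation starts a new stage" is a simplification of how Spark cuts stages at shuffle boundaries. The partition counts are assumed inputs.

```python
from dataclasses import dataclass

# Toy model only: none of these names are Spark APIs.
# Assumed set of wide (shuffle-producing) operations for this sketch.
WIDE_OPS = {"groupBy", "join", "repartition", "orderBy"}


@dataclass
class Stage:
    stage_id: int
    operations: list
    num_tasks: int  # one task per partition the stage processes


def plan_job(operations, input_partitions, shuffle_partitions):
    """Split one job's chain of operations into stages at each wide operation."""
    groups, current = [], []
    for op in operations:
        if op in WIDE_OPS and current:
            groups.append(current)  # a shuffle closes the current stage
            current = []
        current.append(op)
    groups.append(current)

    stages = []
    for i, ops in enumerate(groups):
        # Simplification: the first stage reads the input partitions,
        # every later stage reads shuffle output.
        tasks = input_partitions if i == 0 else shuffle_partitions
        stages.append(Stage(i, ops, tasks))
    return stages


job = plan_job(["filter", "select", "groupBy", "count"],
               input_partitions=8, shuffle_partitions=4)
for stage in job:
    print(stage.stage_id, stage.operations, stage.num_tasks, "tasks")
```

In this example the single job splits into two stages: the first runs 8 tasks (one per input partition) and the second runs 4 tasks (one per shuffle partition).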

So what do executors run? → An executor (from an interview or conceptual perspective) is one JVM process on one physical node; each physical node can host one or more executors. One executor can run one or more tasks.
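As a rough back-of-the-envelope sketch of "one executor can run one or more tasks": each running task occupies a slot, and the number of slots depends on the cores given to each executor. The function below is a hypothetical helper, not a Spark API. It assumes `cores_per_executor` corresponds to the `spark.executor.cores` setting and `cpus_per_task` to `spark.task.cpus` (which defaults to 1), and it ignores scheduling details such as data locality and stragglers.

```python
import math


def task_slots_and_waves(num_tasks, num_executors, cores_per_executor, cpus_per_task=1):
    """Estimate how many tasks can run at once and how many rounds a stage needs."""
    slots_per_executor = cores_per_executor // cpus_per_task
    total_slots = num_executors * slots_per_executor
    if total_slots <= 0:
        raise ValueError("no task slots available")
    # If there are more tasks than slots, the stage runs in several rounds.
    waves = math.ceil(num_tasks / total_slots)
    return total_slots, waves


slots, waves = task_slots_and_waves(num_tasks=200, num_executors=5, cores_per_executor=4)
print(slots, waves)
```

With 5 executors of 4 cores each there are 20 slots, so a stage with 200 tasks needs about 10 rounds to finish.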

General day-to-day example to understand the above scenario

Let’s say the objective of our Spark job is to go to the bank, withdraw some cash, and pay bills.

Stage 1 → Going in a car from home to the bank is one stage.

  • Starting the car is one task
  • Starting the GPS is another task
  • Driving on the road is another task

Stage 2 → Going into the bank is another stage.

  • Getting out of the car is one task
  • Walking to the bank is another task
  • Going to the teller window is another task

Stage 3 → Withdrawing cash from the teller is the third stage.

  • Giving the debit card to the teller is one task
  • Requesting the withdrawal amount is another task
  • Collecting the money is another task

So in the above example, stage 3 depends on stage 2, and stage 2 depends on stage 1. In Spark, we generally see this kind of dependency at a shuffle, that is, after a wide transformation; a stage whose output is shuffled to feed a later stage is called a ShuffleMapStage.

Say that, after withdrawing the money, we use it to pay off some bills; we can consider that the final stage (called a ResultStage).
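The stage ordering can also be sketched as a toy word count in plain Python. This is only an illustration under stated assumptions: threads stand in for executors, and `map_task`, `reduce_task`, and `run_job` are hypothetical names, not Spark APIs. One difference from the bank analogy is worth noting: in Spark, the tasks of a stage all run the same code, each on its own partition, and they can run in parallel; it is the stages that wait on each other.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_task(partition, num_reducers):
    """ShuffleMapStage task: count words in one partition, bucketed per reducer."""
    buckets = [Counter() for _ in range(num_reducers)]
    for word in partition:
        buckets[hash(word) % num_reducers][word] += 1
    return buckets


def reduce_task(bucket_slices):
    """ResultStage task: merge one bucket's slices from every map task."""
    total = Counter()
    for piece in bucket_slices:
        total.update(piece)
    return total


def run_job(partitions, num_reducers=2):
    with ThreadPoolExecutor() as pool:
        # Stage 0: one task per input partition; list() waits for all of them.
        map_outputs = list(pool.map(lambda p: map_task(p, num_reducers), partitions))
        # "Shuffle": regroup the map outputs so each reducer gets its own bucket.
        shuffled = [[out[r] for out in map_outputs] for r in range(num_reducers)]
        # Stage 1 starts only after stage 0 has finished: one task per bucket.
        results = list(pool.map(reduce_task, shuffled))
    final = Counter()
    for partial in results:
        final.update(partial)
    return dict(final)


print(run_job([["a", "b"], ["a", "c"], ["b", "a"]]))
```

Here the first stage plays the role of a ShuffleMapStage (its output is regrouped for the next stage), and the second stage plays the role of the ResultStage that produces the answer returned to the caller.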
