What is compaction in big data applications (Hudi, Hive, Spark, Kafka, etc.)? [Important concept for big data engineer interviews]

Aditya
3 min read · Jun 19, 2022

Compaction → The process of consolidating many small files into fewer large files and cleaning up the original small files.

Generally, compaction jobs run in the background, and most big data processing applications support both manual and automatic compaction.
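As a rough illustration of a manual compaction pass, here is a minimal PySpark sketch; the paths, file format, and target file count are assumptions made up for this example, not anything prescribed above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-compaction").getOrCreate()

src = "/lake/raw/orders"             # hypothetical folder full of small Parquet files
dst = "/lake/raw/orders_compacted"   # hypothetical destination for the consolidated copy

df = spark.read.parquet(src)

# Consolidate into a handful of larger files; in practice the target count is
# usually derived from (total input size / desired file size).
num_output_files = 8
df.coalesce(num_output_files).write.mode("overwrite").parquet(dst)

# Once the compacted copy is validated, the original small files can be
# deleted -- that is the "clean up" half of compaction.
```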

What is the issue with small files → Having many small files in a data lake is a nightmare because downstream distributed computing applications spend time opening, reading, and closing every small file and its metadata, which in turn slows computation.
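To see whether a dataset suffers from this, one quick check is to list the files and their sizes. A minimal sketch, assuming the data sits on a local or mounted filesystem path (object stores would need their own listing API) and a made-up directory name:

```python
from pathlib import Path

data_dir = Path("/lake/raw/orders")          # hypothetical dataset location
files = list(data_dir.rglob("*.parquet"))
sizes_mb = [p.stat().st_size / 1_000_000 for p in files]

avg_mb = sum(sizes_mb) / len(sizes_mb) if sizes_mb else 0.0
print(f"{len(files)} files, average size {avg_mb:.1f} MB")

# Thousands of kilobyte-sized files here is exactly the symptom compaction
# fixes: each one costs the engine an open/read/close plus a metadata lookup.
```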

How large should a file be → It depends on the processing technology at hand. For a Spark job, for example, it depends on the number of cores per executor, so that each file can be fed to one core.
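One simple way to apply that idea in PySpark is to write roughly one output file per available core. A minimal sketch with hypothetical dataset paths; the one-file-per-core target is just the heuristic described above, not a universal rule:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("size-files-to-cores").getOrCreate()

df = spark.read.parquet("/data/events")      # hypothetical input dataset

# defaultParallelism is typically (number of executors x cores per executor),
# so writing that many files gives each core one file to read next time.
total_cores = spark.sparkContext.defaultParallelism
(df.repartition(total_cores)
   .write.mode("overwrite")
   .parquet("/data/events_compacted"))       # hypothetical output path
```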

Some popular distributed processing technologies that use compaction:

Hive: Hive creates a set of delta files for each transaction that alters a table or partition and stores them in a separate delta directory. By default, Hive automatically compacts delta and base files at regular intervals, and compaction can also be triggered manually (see the sketch after the list below).

Two types of compaction:

  • Minor → Rewrites a set of delta files to a single delta file for a bucket.
  • Major → Rewrites one or more delta files and the base file as a new base file for a bucket.
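A minimal sketch of triggering each type by hand on a Hive ACID table, assuming a reachable HiveServer2 endpoint and the pyhive client; the host, table, and partition names are made up for illustration:

```python
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000)
cur = conn.cursor()

# Minor compaction: merge a bucket's delta files into a single delta file.
cur.execute("ALTER TABLE sales PARTITION (ds='2022-06-19') COMPACT 'minor'")

# Major compaction: rewrite the delta files and the base file into a new base file.
cur.execute("ALTER TABLE sales PARTITION (ds='2022-06-19') COMPACT 'major'")

cur.close()
conn.close()
```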
