What is compaction in big data applications (Hudi, Hive, Spark, Kafka, etc.)? [Important concept for big data engineer interviews]

Aditya · Jun 19, 2022

Compaction → the process of consolidating many small files into one or more large files and cleaning up the smaller files afterwards.

Compaction jobs generally run in the background, and most big data processing applications support both manual and automatic compaction.

What is the issue with small files → Having many small files in a data lake is a nightmare, because downstream distributed computing applications spend time opening, reading, and closing every small file and its metadata, which in turn slows down computation.

How large should a file be → It depends on the processing technology at hand. For a Spark job, for example, it depends on the number of executor cores, so that each file can be fed to one core.
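
As an illustration, here is a minimal PySpark sketch of a compaction job that rewrites a directory of small Parquet files as one file per available core; the input and output paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Read the directory that contains many small Parquet files (hypothetical path).
df = spark.read.parquet("s3://my-bucket/events/")

# Aim for one output file per available core so each core processes exactly one file.
target_files = spark.sparkContext.defaultParallelism

# Write the consolidated copy to a new location (hypothetical path); writing to a
# separate path avoids clobbering the source while it is still being read.
(df.repartition(target_files)
   .write
   .mode("overwrite")
   .parquet("s3://my-bucket/events-compacted/"))
```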

Some popular distributed processing technologies that use compaction:

Hive: Hive creates a set of delta files for each transaction that alters a table or partition and stores them in a separate delta directory. By default, Hive automatically compacts delta and base files at regular intervals; compaction can also be triggered manually, as in the sketch after the list below.

Two types of compaction:

  • Minor → Rewrites a set of delta files to a single delta file for a bucket.
  • Major → Rewrites one or more delta files and the base file as a new base file for a bucket.
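
A minimal sketch of triggering both compaction types manually. It assumes a transactional (ACID) table and uses the PyHive client to reach HiveServer2; the host, table, and partition names are hypothetical.

```python
from pyhive import hive

# Connect to HiveServer2 (hypothetical host and user).
conn = hive.connect(host="hive-server", port=10000, username="etl")
cursor = conn.cursor()

# Minor compaction: merge the delta files of one partition into a single delta file.
cursor.execute("ALTER TABLE sales PARTITION (ds='2022-06-19') COMPACT 'minor'")

# Major compaction: rewrite the delta files plus the base file as a new base file.
cursor.execute("ALTER TABLE sales PARTITION (ds='2022-06-19') COMPACT 'major'")

# Inspect the compaction queue and history.
cursor.execute("SHOW COMPACTIONS")
print(cursor.fetchall())
```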

Kafka: Kafka does "log compaction". It does not merge small files into bigger ones; it works in a different fashion, as explained below, but it is still called compaction because the process ends up producing consolidated segment files (the physical representation of partition data, typically found under /var/lib/kafka/data).
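
To make this concrete, here is a sketch of creating a topic with log compaction enabled, using the kafka-python admin client (the client library, broker address, topic name, and tuning values are assumptions for illustration).

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Connect to a broker (hypothetical address).
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

admin.create_topics([
    NewTopic(
        name="user-profile-updates",             # hypothetical topic name
        num_partitions=3,
        replication_factor=1,
        topic_configs={
            "cleanup.policy": "compact",         # enable log compaction for this topic
            "min.cleanable.dirty.ratio": "0.5",  # fraction of dirty log before cleaning kicks in
            "segment.ms": "600000",              # roll segments so older ones become cleanable
        },
    )
])
```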

A Kafka message contains a key and a value. The key is used to select the partition (by hashing the key), so if we send multiple messages with the same key, all of them always end up in the same partition.

So if we write two values with the same key at different times, like "k1:v1" and "k1:v2", a background thread (the log cleaner) will eventually retain only the latest value, i.e. "k1:v2".
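
A small sketch of that scenario with the kafka-python producer (broker address and topic name are the same hypothetical ones as above).

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Both records carry the key "k1", so they hash to the same partition.
producer.send("user-profile-updates", key=b"k1", value=b"v1")
producer.send("user-profile-updates", key=b"k1", value=b"v2")
producer.flush()

# Once the log cleaner has run over the older segments, only the latest
# record for "k1" (value "v2") remains in the compacted log.
```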

Because of that, log compaction happens within a single Kafka broker and does not require coordination with other nodes of the Kafka cluster.
