Multi-Version Concurrency Control (MVCC) based design (hot question in distributed systems interviews) - What is concurrency → The ability of a program to do multiple things at once. - May 17, 2023
Step-by-step guide to exposing Spark JMX metrics and funneling them to Datadog - Please read my previous article… - Nov 4, 2022
How JMX metrics from Spark applications help to configure driver/executor memory correctly… - What are JMX metrics → Java Management Extensions (JMX) is a specification for monitoring and managing Java applications. - Oct 31, 2022
Spark job vs stage vs task in simple terms (with cheat sheet) - When a Spark application invokes an action, such as collect() or take(), on your DataFrame or Dataset, the action creates a job. Below is… - Sep 20, 2022
What is compaction in big data applications (Hudi, Hive, Spark, Kafka, etc.) - Compaction → The process of consolidating small files into larger file(s) and cleaning up the smaller files. - Jun 19, 2022
In which scenarios to use mapPartitions or foreachPartition in Spark (Simple question that… - As a data engineer, while developing Spark jobs and performing operations, you will encounter a situation where your Spark code that is… - May 6, 2022
How attackers use the Log4j vulnerability (CVE-2021-44228) to access applications and how to quickly… - A vulnerability (CVE-2021-44228) in Apache Log4j, a widely used logging package for Java, has been found (first reported to Apache on… - Dec 17, 2021
Concurrency vs Parallelism in simple terms (Important question in system design interviews) - One of the most important concepts in programming languages (like Go, Java, etc.) or distributed computing is the difference between… - Dec 16, 2021
Structured vs Semi-structured vs Unstructured data - What is data → Data is a representation of some aspect of the real world. We can classify data as structured or unstructured or… - Nov 8, 2021
My biggest issue with AWS MSK (resulting in overcharging) - AWS MSK (Managed Streaming for Kafka) is a fully managed service that enables you to build and run applications that use Apache Kafka to… - Aug 26, 2021