Jun 19 · Member-only
What is compaction in big data applications (Hudi, Hive, Spark, Kafka, etc.)? [Important concept for big data engineer interviews]
Compaction → The process of consolidating small files into larger file(s) and cleaning up the smaller files. Generally, compaction jobs run in the background, and most big data processing applications support both manual and automatic compaction. What is the issue with small files → Having so…
Compaction · 3 min read

May 6 · Member-only
In which scenarios do you need to use mapPartitions or foreachPartition in Spark? (A simple question that can gauge your knowledge of Spark programming)
As a data engineer developing Spark jobs, you will encounter situations where the code running on executors (for example, inside map or foreach) needs to perform operations such as the following: Call a service to enrich the rows of the given dataset. Store or retrieve something from…
Spark · 2 min read

Dec 17, 2021 · Member-only
How attackers use the Log4j vulnerability (CVE-2021-44228) to access applications and how to quickly patch it in production/pilot systems
A vulnerability (CVE-2021-44228) has been found in Apache Log4j, a widely used logging package for Java. It was first reported to Apache on November 24, 2021, and was patched in Log4j version 2.15.0 on December 9, 2021. To learn more about logging frameworks/wrappers in Java, refer to What is the difference between…
Log4j Vulnerability · 2 min read

Dec 16, 2021 · Member-only
Concurrency vs Parallelism in simple terms (Important question in system design interviews)
One of the most important concepts in programming languages (like Go, Java, etc.) and in distributed computing is the difference between concurrency and parallelism. Both are used to speed up a computation, each in its own way, so let’s dive into them. Let’s say we need to prepare pasta with sauce (made of tomato…
Concurrency · 2 min read

Nov 8, 2021 · Member-only
Structured vs Semi-structured vs Unstructured data
What is data → Data is a representation of some aspect of the real world. We can classify data as structured, semi-structured, or unstructured based on how it is organized. Customers mostly choose structured, semi-structured, or unstructured not based on their data, but on the applications…
Structured Data · 2 min read

Aug 26, 2021 · Member-only
My biggest issue with AWS MSK (resulting in overcharging)
AWS MSK (Managed Streaming for Apache Kafka) is a fully managed service that enables you to build and run applications that use Apache Kafka to process streaming data. I have built Kafka clusters on on-premises servers as well as on EC2 instances, and AWS MSK really helped to relieve so many manul…
AWS MSK · 1 min read

Aug 20, 2021 · Member-only
Why is it good design practice to send AWS S3 object notifications to SNS instead of to SQS or Lambda?
Amazon Simple Storage Service (Amazon S3) is a cost-efficient and highly scalable object store, for persistent or temporary data, that most organizations consider for storing regular or big data. Before Nov 2014, there was no notification system to transmit events whenever objects were created, deleted, etc. To detect those events…
AWS S3 · 2 min read

Aug 19, 2021 · Member-only
Git vs GitLab vs GitHub vs BitBucket (Stash)
If you have never worked with “ClearCase”, you are a truly fortunate soul. “ClearCase” is one of the more complex repository management tools. I always used to face issues with rebase and other operations when committing code, along with errors I was not able to resolve, and sometimes I…
Git vs GitLab · 2 min read

Aug 19, 2021 · Member-only
Spark Streaming vs Spark Structured Streaming
To process continuous streams of data from sources such as HDFS directories, TCP sockets, Kafka, Flume, Twitter, etc., Spark has two methodologies. Spark Streaming in general works on something called a micro-batch. …
Spark Streaming · 2 min read

Apr 10, 2021 · Member-only
Kafka best practices edition → How to design a Kafka message key and why it is important in determining application performance?
What is a Kafka message: A record or unit of data within Kafka. Each message has a key and a value, and optionally headers. The key is commonly used for data about the message, and the value is the body of the message. Message Key → Can be null or contain…
Kafka Message Key Design · 3 min read