AWS MSK (Amazon Managed Streaming for Apache Kafka) is a fully managed service that enables you to build and run applications that use Apache Kafka to process streaming data.
I have built Kafka clusters on on-premise servers as well as on EC2 instances, and AWS MSK relieved many manual steps in cluster setup while simplifying monitoring (CloudWatch) and maintainability (rolling software/hardware upgrades).
But one problem with AWS MSK, as of Aug 2021, is that once we scale up the EBS storage volume per broker, whether manually or via auto-scaling, we cannot scale it back down (say, due to reduced workload) and…
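The supported scale-up direction can be sketched with the MSK UpdateBrokerStorage API. The cluster ARN, version string, and target size below are hypothetical placeholders, and the request is only printed, never sent:

```python
import json

# Hypothetical identifiers, for illustration only.
cluster_arn = "arn:aws:kafka:us-east-1:123456789012:cluster/demo/abcd1234"
current_version = "K3AEGXETSR30VB"

# Request shape for the MSK UpdateBrokerStorage API: VolumeSizeGB can
# only be raised, never lowered -- the limitation discussed above.
# "All" targets every broker in the cluster.
request = {
    "ClusterArn": cluster_arn,
    "CurrentVersion": current_version,
    "TargetBrokerEBSVolumeInfo": [
        {"KafkaBrokerNodeId": "All", "VolumeSizeGB": 1100}
    ],
}

# With boto3 this would be sent as:
#   boto3.client("kafka").update_broker_storage(**request)
print(json.dumps(request, indent=2))
```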
Amazon Simple Storage Service (Amazon S3) is a cost-efficient and highly scalable object store, for persistent or temporary data, that most organizations consider when storing regular or big data.
Before Nov 2014, whenever objects were created, deleted, etc., there was no notification system to transmit those events. To detect them, we needed either to have every process that uploaded to S3 subsequently trigger back-end code in some way, or to poll the S3 bucket to see whether objects had been added or deleted. This added extra coupling to the system.
Once S3 event notifications were introduced in Nov 2014, whenever S3 generates…
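As a rough sketch of what this looks like, the notification configuration below (with hypothetical bucket and queue ARNs) is the kind of document you attach to a bucket so that object events are pushed to SQS instead of being polled for:

```python
import json

# Hypothetical queue ARN, for illustration only.
notification_config = {
    "QueueConfigurations": [
        {
            "QueueArn": "arn:aws:sqs:us-east-1:123456789012:my-events-queue",
            # Push both create and delete events -- the two cases that
            # previously required polling the bucket.
            "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
        }
    ]
}

# With boto3 this would be applied as:
#   boto3.client("s3").put_bucket_notification_configuration(
#       Bucket="my-bucket", NotificationConfiguration=notification_config)
print(json.dumps(notification_config, indent=2))
```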
If you have never worked on ClearCase, then you are a truly fortunate soul. ClearCase is one of the more complex repository-management tools: I constantly hit issues with rebase and other operations while committing code, errors I could not resolve, and I sometimes lost all my local changes because, unlike Git, ClearCase does not let you commit changes locally.
When I switched to Git as my repository-management software, life became predictable and easy. That is mainly due to the architectural difference between ClearCase and Git.
ClearCase is centralized while Git is…
To process continuous streams of data from sources like HDFS directories, TCP sockets, Kafka, Flume, Twitter, etc., Spark has two methodologies.
Spark streaming in general works on something called a micro-batch. The stream pipeline is registered with some operations, and Spark polls the source after every batch duration (defined in the application).
Spark Streaming uses the DStream API and Spark Structured Streaming uses the Dataset/DataFrame API — short answer :-) .
Micro-batch processing is the practice of collecting data in small groups (“batches”) for the purpose of taking action on (processing) that data. Contrast this with traditional “batch processing,” which…
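To make the contrast concrete, here is a minimal pure-Python sketch of micro-batching: records are grouped into small fixed-size batches as they arrive, rather than accumulating into one large batch:

```python
def micro_batches(stream, batch_size):
    """Group an incoming stream into small fixed-size batches, the way a
    micro-batch engine processes records once per batch interval."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial, batch
        yield batch

batches = list(micro_batches(range(7), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```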
What is a Kafka message: a record or unit of data within Kafka. Each message has a key and a value, and optionally headers. The key is commonly used for data about the message, and the value is the body of the message.
Message key → can be null, or contain some value that says something about the data, like a user/email id or a hash of the message, etc.
Message value → the actual data that needs to be sent to Kafka.
Why the key is important →
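A minimal sketch of why the key matters for partitioning: records with the same non-null key are routed to the same partition, which preserves per-key ordering. Real Kafka clients use a murmur2 hash; zlib.crc32 below is only a stand-in to keep the example self-contained and deterministic:

```python
import zlib

NUM_PARTITIONS = 6  # partitions in the hypothetical topic

def partition_for(key: bytes) -> int:
    """Map a message key to a partition, mimicking key-based routing
    in a Kafka producer's default partitioner."""
    return zlib.crc32(key) % NUM_PARTITIONS

# Every message keyed by the same user id lands on the same partition,
# so that user's events are consumed in order:
p1 = partition_for(b"user-42")
p2 = partition_for(b"user-42")
assert p1 == p2
print("user-42 ->", p1)
```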
If we need to stream data from Kafka, perform some transformations, and persist the results to AWS S3, Azure Blob Storage, Google Cloud Storage, Kafka, or HDFS, we have a wide range of options to choose from: Spark Structured Streaming, Kafka Streams, Storm, Flink, Akka Streams.
Akka Streams has been gaining traction in the streaming world over the past couple of years, mainly since early 2019. It has carved out its own use cases and scenarios, becoming an alternative to famous competitors like Spark, Storm, Kafka Streams, Flink, etc.
Technology should be chosen based on the use case at hand…
Background on the S3 consistency issue → Spark computations involve jobs divided into stages, in turn divided into tasks, which use rename operations while committing intermediate data to storage systems like S3 or HDFS.
If the underlying system is POSIX compliant, actions like file rename are atomic; even though HDFS is not fully POSIX compliant, its rename operation is atomic.
But for S3, a rename involves a copy to a new object and a delete of the old one, so it is not an atomic operation. …
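The difference can be sketched in a few lines: a POSIX-style rename is one atomic step, while the S3-style rename (modeled here with a plain dict as a toy object store) is a copy followed by a delete, with a window in between where both or neither object can exist:

```python
import os
import tempfile

# POSIX rename: a single atomic metadata operation.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "part-0000.tmp")
    dst = os.path.join(d, "part-0000")
    with open(src, "w") as f:
        f.write("task output")
    os.replace(src, dst)  # atomic on POSIX filesystems
    assert os.path.exists(dst) and not os.path.exists(src)

# S3-style "rename": copy-then-delete on a toy dict object store.
# A failure between the two steps leaves either both objects or
# neither, which is why the operation is not atomic.
store = {"bucket/part-0000.tmp": b"task output"}
store["bucket/part-0000"] = store["bucket/part-0000.tmp"]  # step 1: copy
del store["bucket/part-0000.tmp"]                          # step 2: delete
print(sorted(store))  # ['bucket/part-0000']
```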
Please refer https://sprinkle-twinkles.medium.com/docker-container-vs-docker-image-8e35a416509b to understand the difference between image and container.
Docker client and daemon → Docker uses a client-server architecture. The Docker client talks to the Docker daemon (dockerd), which does the heavy lifting of building (docker build) and running (docker run). The Docker daemon is responsible for the state of your containers and images, and facilitates any interaction with “the outside world.”
The Docker client merely translates commands into API calls that are sent to the Docker daemon. This allows the use of a local or remote Docker daemon. …
Docker image → A Docker image is an immutable file that contains the source code, libraries, dependencies, tools, and other files needed for an application to run (like “my image” from the instructions below).
FROM ubuntu:20.04
COPY *.properties /app/properties/
COPY *.jar /app/jars/
RUN make /app
CMD python /app/app.py
This image is usually built by executing Docker instructions (like those above), which add layers on top of an existing image or OS distribution (like Ubuntu).
Think of it as an onion, with layers (instructions like COPY, CMD, FROM) on top of an existing image or OS distribution (like Ubuntu).
So an image is just a template (a set of instructions), and cannot run on…
Lambda architecture identifies itself with big data; it is not to be confused with AWS Lambda, which is just a function (or piece of code) invoked on an event from a source (like S3, SQS, etc.), or with lambda expressions in Java.
Big data has been the most celebrated term of the last decade, thanks to the gigantic explosion of data and the need to make sense of that data to drive business.
What is batch processing → It is a processing methodology where data is allowed to accumulate for a specific period (like 30, 60, or 180 minutes). …
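As a toy sketch of the accumulation idea, the snippet below buckets timestamped events into fixed 30-minute windows before they are processed; the event data is made up for illustration:

```python
from collections import defaultdict

WINDOW_MIN = 30  # batch window length in minutes

# Made-up events: (arrival minute, payload).
events = [(minute, f"evt{i}") for i, minute in enumerate([1, 12, 29, 31, 58, 61])]

# Accumulate: each event goes into the window covering its timestamp;
# processing only happens once a whole window has been collected.
windows = defaultdict(list)
for minute, payload in events:
    windows[minute // WINDOW_MIN].append(payload)

for idx in sorted(windows):
    print(f"window {idx}: {windows[idx]}")
```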