Spark has two methodologies for processing continuous streams of data from sources such as HDFS directories, TCP sockets, Kafka, Flume, Twitter, etc.
Spark streaming in general works on something called a micro-batch. A stream pipeline is registered with some operations, and Spark polls the source after every batch duration (defined in the application).
Short answer: Spark Streaming uses the DStream API, while Spark Structured Streaming uses the Dataset/DataFrame API :-)
Micro-batch processing is the practice of collecting data in small groups (“batches”) for the purpose of taking action on (processing) that data. Contrast this with traditional “batch processing,” which often implies taking action on a large group of data. Typical intervals for Spark streaming jobs are around 2 to 20 minutes per micro-batch, depending on how close to live the data needs to be processed.
Spark Streaming → Spark Streaming is a separate library in Spark built on the DStream API. A DStream represents a continuous stream of data as a sequence of RDDs. The DStream data structure divides the data received from the streaming source (say Kafka or Flume) into chunks of RDDs, processes them, and after processing sends the results to the destination.
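A minimal sketch of a DStream pipeline, assuming a TCP socket source on localhost:9999 and a 10-second batch duration (both are illustrative choices, not something prescribed by the article):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    // The batch duration (10 seconds here) is fixed when the StreamingContext
    // is created; Spark polls the source once per interval.
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamWordCount")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Each batch of lines arrives as one RDD inside the DStream
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()      // the "destination" here is simply the console

    ssc.start()         // start receiving and processing micro-batches
    ssc.awaitTermination()
  }
}
```

Every transformation above (flatMap, map, reduceByKey) is applied per micro-batch to the underlying RDDs, which is exactly the "sequence of RDDs" view described above.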
Spark Structured Streaming → Structured Streaming uses the DataFrame/Dataset APIs to perform streaming operations. In a DataFrame/Dataset, data is organized into named columns, like a table in a relational database. This helps to give structure to distributed streaming…
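For comparison, here is a minimal sketch of the same word count written against Structured Streaming; the socket source, host/port, and console sink are again illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

object StructuredWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("StructuredWordCount")
      .getOrCreate()
    import spark.implicits._

    // The stream is exposed as an unbounded DataFrame with a single "value" column
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Ordinary DataFrame/Dataset operations work on the streaming DataFrame
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Write the running counts to the console sink
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

Because the stream is just a table with named columns, the same groupBy/count you would write for a static DataFrame works unchanged on the streaming one.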