Good news from AWS for projects writing/reading to S3 using Spark Streaming with EMR (S3 is now strongly consistent!! No need to use EMRFS)
Background on the S3 consistency issue → Spark computations consist of jobs that are divided into stages, which are in turn divided into tasks; these tasks rely on rename operations when committing intermediate data to storage systems like S3 or HDFS.
If the underlying storage system is POSIX compliant, operations like file rename are atomic; although HDFS is not fully POSIX compliant, its rename operation is still atomic.
On S3, however, a rename involves copying the data to a new object and then deleting the old one, so it is not atomic. Because of this, Spark applications sometimes experience intermittent failures (file-not-found exceptions) when reading files in the window between the copy and the delete, compounded by S3's eventual consistency.
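As a rough illustration (the bucket names, paths, and schema below are made up), a job like the following commits its output through Hadoop's default file committer, which first writes to a temporary directory and then renames the result into place; on S3 each rename becomes a copy plus a delete, and that is exactly the window in which a concurrent reader of the same prefix can hit a file-not-found error.

```python
from pyspark.sql import SparkSession

# Hypothetical bucket/prefixes used only for illustration.
INPUT_PATH = "s3://my-example-bucket/raw-events/"
OUTPUT_PATH = "s3://my-example-bucket/daily-aggregates/"

spark = SparkSession.builder.appName("s3-rename-illustration").getOrCreate()

events = spark.read.json(INPUT_PATH)
daily = events.groupBy("event_date").count()

# The default FileOutputCommitter writes to a _temporary/ subdirectory and
# then "renames" task/job output into OUTPUT_PATH. On S3 each rename is a
# copy followed by a delete, so a reader listing OUTPUT_PATH mid-commit can
# observe partial or missing files.
daily.write.mode("overwrite").parquet(OUTPUT_PATH)
```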
In a nutshell, after a call to an S3 API function such as PUT that stores or modifies data, there’s a small time window where the data has been accepted and durably stored, but not yet visible to all GET or LIST requests.
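To make that window concrete, here is a minimal sketch using boto3 (the bucket and key names are hypothetical): a PUT followed immediately by a LIST of the same prefix was not guaranteed to include the new key under the old eventually consistent model, whereas with strong consistency the listing now always reflects the write.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"      # hypothetical bucket
key = "checkpoints/part-00000"    # hypothetical key

# Durably store the object.
s3.put_object(Bucket=bucket, Key=key, Body=b"some intermediate data")

# Under the old eventually consistent model, this listing could briefly
# omit the key that was just written; with strong consistency it always
# includes it.
response = s3.list_objects_v2(Bucket=bucket, Prefix="checkpoints/")
keys = [obj["Key"] for obj in response.get("Contents", [])]
print(key in keys)
```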
So how do we fix this issue?
EMRFS → To work around the rename and eventual consistency issues, Spark applications can use EMRFS (while creating the EMR cluster, we can specify the EMR File System). EMRFS tackles the problem by backing S3 with a DynamoDB table: whenever a Spark application performs a write/create/delete/rename against S3, metadata about that operation is placed in…
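For reference, the consistent view described here is switched on through the emrfs-site configuration classification when the cluster is created; the sketch below (instance types, roles, and release label are placeholders) shows roughly how that could be passed with boto3.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # hypothetical region

# "emrfs-site" with fs.s3.consistent=true enables the DynamoDB-backed
# consistent view; every other value here is a placeholder.
response = emr.run_job_flow(
    Name="example-streaming-cluster",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceCount": 3,
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=[
        {
            "Classification": "emrfs-site",
            "Properties": {"fs.s3.consistent": "true"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```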