Options to submit a Spark job from a remote (edge) node to an AWS EMR cluster
Spark jobs can be submitted to an EMR cluster through a service like Apache Livy, or through custom scheduler code written in Java/Python (or driven by cron) that wraps spark-submit, depending on the language and requirements.
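As an illustration of the Livy route, here is a minimal sketch of submitting a batch job through Livy's REST API from Python. The Livy host, jar path, class name, and arguments are placeholders, not values from this article; Livy listens on port 8998 by default when it is installed on EMR.

```python
import requests

# Hypothetical Livy endpoint on the EMR master node (port 8998 is Livy's default).
LIVY_URL = "http://emr-master.example.com:8998"

# Describe the Spark application; file, className, and args are placeholders.
payload = {
    "file": "s3://my-bucket/jars/my-spark-app.jar",  # application jar (assumed path)
    "className": "com.example.MySparkJob",           # main class (assumed)
    "args": ["--date", "2020-01-01"],
    "conf": {"spark.submit.deployMode": "cluster"},
}

# Submit the batch; Livy responds with the new batch's id and initial state.
resp = requests.post(f"{LIVY_URL}/batches", json=payload,
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
batch = resp.json()
print("Submitted batch", batch["id"], "state:", batch["state"])

# Poll the batch state to track the job after submission.
state = requests.get(f"{LIVY_URL}/batches/{batch['id']}/state").json()
print("Current state:", state["state"])
```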
If you are using a remote node (an EC2 instance or an on-premises edge node) to schedule Spark jobs for submission to a remote EMR cluster, AWS has already published an article with detailed steps.
But even after following the steps in that AWS documentation (allowing traffic between the remote node and the EMR nodes, copying the Hadoop and Spark configuration, installing the Hadoop client, Spark core, etc.), we may still run into several exceptions like the ones below.
- ClassNotFoundException errors for missing EMRFS, Kinesis, emr-goodies, etc. jars (under the /usr/share/aws/emr folders), depending on how the Spark job is configured.
- Permission issues (some of which we can fix, like changing /mnt/ permissions, but other errors are too cryptic).
- A Spark app submitted from a custom scheduler or from the command line using a spark-submit wrapper (like the sketch after this list) never reaches the EMR cluster, with no error in the logs either. So it's like submitting into a black hole.
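For reference, the kind of spark-submit wrapper mentioned above can be as simple as a Python subprocess call; the class name, jar location, and application arguments below are assumptions for illustration. Capturing stdout/stderr at least leaves a trace in our own logs when the submission silently goes nowhere:

```python
import subprocess

# Hypothetical values; adjust to your cluster and application.
cmd = [
    "spark-submit",
    "--master", "yarn",
    "--deploy-mode", "cluster",
    "--class", "com.example.MySparkJob",        # assumed main class
    "s3://my-bucket/jars/my-spark-app.jar",     # assumed jar location
    "--date", "2020-01-01",                     # application arguments (assumed)
]

# Capture output so a silent failure is visible in our scheduler logs
# instead of vanishing into the "black hole" described above.
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
print(result.stderr)
if result.returncode != 0:
    raise RuntimeError(f"spark-submit exited with {result.returncode}")
```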
So what is the solution?
- Create an AMI image from the EMR cluster and launch the remote (edge) nodes from that AMI.
- The remote (edge) node is now armed with all the configs (Spark, Hadoop, the jars under /usr/share/aws, etc.), so submitting a Spark job from the remote node is equivalent to submitting it from an EMR node (see the sketch after this list).
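A rough sketch of that flow with boto3 is below. The instance id, region, instance type, key pair, and subnet are all placeholders, and whether to reboot the EMR node while imaging is a judgment call: NoReboot=True avoids disturbing the running cluster at the cost of a potentially less consistent snapshot.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

# 1. Create an AMI from a running EMR node (instance id is a placeholder).
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",   # an EMR node in the cluster (assumed)
    Name="emr-edge-node-ami",
    NoReboot=True,                      # avoid restarting the EMR node
)
ami_id = image["ImageId"]

# Wait until the AMI is available before launching from it.
ec2.get_waiter("image_available").wait(ImageIds=[ami_id])

# 2. Launch the edge node from that AMI; it carries the same Spark/Hadoop
#    configs and the jars under /usr/share/aws as the cluster nodes.
ec2.run_instances(
    ImageId=ami_id,
    InstanceType="m5.xlarge",               # assumed instance type
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",                  # assumed key pair
    SubnetId="subnet-0123456789abcdef0",    # subnet with access to the cluster (assumed)
)
```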
If the spark-submit jobs are simple enough, with only minor dependencies, we may not need the AMI image solution above, and just following the AWS article may be sufficient.