Spark Streaming, Kinesis, and EMR: Overcoming Pain Points

Spark is a distributed data processing framework designed for large-scale batch and streaming workloads. Over the past few months we’ve been exploring Spark Streaming on Amazon’s Elastic MapReduce (EMR) as an alternative to existing custom AWS-based solutions for large-scale processing of Kinesis streams. This blog post goes into detail on some of the challenges of running Spark Streaming + Kinesis workloads on EMR.

Specifically, it will touch on:
* IAM permissions for S3
* Passing flags for spark-submit and setting environment variables in CloudFormation
* Accessing Spark and YARN UIs
* Configuring executors with Kinesis in mind

Read it on Medium
