AWSFlow: From zero to Amazon EMR jobs and Lambda functions with Python

Posted on April 06, 2019 in devops • 1 min read

After lots of cleanup and refactoring, the AWSFlow project goes public! It lets you programmatically define workloads for AWS Elastic MapReduce (EMR) clusters and Lambda functions in Python, with a concise methodology aimed at fast prototyping.

The most interesting design choice is that the awsflow package itself gets deployed everywhere: in a local container, on all EMR cluster nodes, and in the context of Lambda functions. This makes it incredibly easy to:

  • Add Python-defined command line tools on all cluster nodes
  • Implement EMR bootstrap actions and steps in Python with Fabric
  • Define Python Lambda functions with dependencies on other packages, including awsflow
  • Trigger the creation of parametrised EMR clusters to run Spark jobs from Lambda functions (see the sketch after this list)
  • Manage the lifecycle of EMR clusters and Lambda functions in a simplified way
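awsflow's own API is best picked up from the project documentation, but the pattern behind the fourth bullet is easy to show with plain boto3, which awsflow builds on. Here is a minimal sketch of a hypothetical Lambda handler that launches a parametrised EMR cluster with a single Spark step; the bucket, script path, and cluster parameters are placeholders, not values from this post:

```python
import boto3

# Hypothetical Lambda handler: launch a parametrised EMR cluster
# that runs one Spark job and terminates when the step finishes.
def lambda_handler(event, context):
    emr = boto3.client("emr")

    # Parameters arrive via the triggering event (e.g. a CloudWatch rule).
    instance_count = event.get("instance_count", 3)
    job_uri = event.get("job_uri", "s3://my-bucket/jobs/job.py")  # placeholder

    response = emr.run_job_flow(
        Name="awsflow-demo",
        ReleaseLabel="emr-5.20.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": instance_count,
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate after the step
        },
        Steps=[
            {
                "Name": "spark-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", job_uri],
                },
            }
        ],
        JobFlowRole="EMR_EC2_DefaultRole",  # AWS default EMR roles
        ServiceRole="EMR_DefaultRole",
    )
    return {"cluster_id": response["JobFlowId"]}
```

Because the cluster parameters come from the invoking event, the same function can launch differently sized clusters for different jobs; this is the kind of boilerplate awsflow hides behind its own lifecycle calls.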

Initially used only in my data science team, it is now a cornerstone of our macro analytics pipelines at Telefónica NEXT and is used internally by several teams in both Berlin and Munich. Should its simplicity and compactness become too limiting, we might transition to more complete, battle-tested solutions such as Terraform or CloudFormation.

We use AWSFlow every day to create clusters, run Jupyter and Zeppelin notebooks persisted on S3, schedule the creation of clusters, and manage the execution of Spark jobs triggered by CloudWatch events.
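The scheduling part boils down to a CloudWatch Events rule that invokes a cluster-launching Lambda on a cron expression. Again a minimal boto3 sketch with placeholder names (the function name, ARN, and schedule are hypothetical), shown only to illustrate the setup awsflow automates:

```python
import boto3

# Hypothetical setup: fire the cluster-launching Lambda every night at 02:00 UTC.
events = boto3.client("events")
lambda_client = boto3.client("lambda")

lambda_arn = "arn:aws:lambda:eu-west-1:123456789012:function:launch-cluster"  # placeholder

rule = events.put_rule(
    Name="nightly-cluster",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

# Allow CloudWatch Events to invoke the function, then wire it up as the target.
lambda_client.add_permission(
    FunctionName="launch-cluster",
    StatementId="nightly-cluster-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
events.put_targets(
    Rule="nightly-cluster",
    Targets=[{"Id": "1", "Arn": lambda_arn,
              "Input": '{"instance_count": 5}'}],  # parametrises the cluster
)
```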

The official documentation and code are available at https://github.com/elehcimd/awsflow. Enjoy! And don’t forget to star the project on GitHub if you like it. Thank you!

[Thread on Hacker News]