chrisabbott / amazon-emr-cli

This project is a fork of awslabs/amazon-emr-cli.


A command-line interface for packaging, deploying, and running your EMR Serverless Spark jobs

License: Apache License 2.0


EMR CLI

We're all working on data pipelines every day, but wouldn't it be nice to just hit a button and have your code automatically deployed to staging or test accounts? I thought so, too. That's why I created the EMR CLI (emr), which can help you package and deploy your EMR jobs so you don't have to.

The EMR CLI supports a wide variety of configuration options to adapt to your data pipeline, not the other way around.

  1. Packaging - Ensure a consistent approach to packaging your production Spark jobs.
  2. Deployment - Easily deploy your Spark jobs across multiple EMR environments or deployment frameworks like EC2, EKS, and Serverless.
  3. CI/CD - Easily test each iteration of your code without resorting to messy shell scripts. :)

The initial use cases are:

  1. Consistent packaging for PySpark projects.
  2. Use in CI/CD pipelines for packaging, deployment of artifacts, and integration testing.

Warning: This tool is still under active development, so commands may change until a stable 1.0 release is made.

Quick Start

You can use the EMR CLI to take a project from nothing to running in EMR Serverless in 2 steps.

First, let's install the emr command.

python3 -m pip install -U emr-cli

Note: This tutorial assumes you have already set up EMR Serverless and have an EMR Serverless application, job role, and S3 bucket you can use. You can also use the emr bootstrap command.

  1. Create a sample project
emr init scratch

📔 Tip: Use --project-type poetry to create a Poetry project!

You should now have a sample PySpark project in your scratch directory.

scratch
├── Dockerfile
├── entrypoint.py
├── jobs
│   └── extreme_weather.py
└── pyproject.toml

1 directory, 4 files
  2. Now deploy and run on an EMR Serverless application!
emr run \
    --entry-point entrypoint.py \
    --application-id ${APPLICATION_ID} \
    --job-role ${JOB_ROLE_ARN} \
    --s3-code-uri  s3://${S3_BUCKET}/tmp/emr-cli-demo/ \
    --build \
    --wait

This command performs the following actions:

  • Packages your project dependencies into a python virtual environment
  • Uploads the Spark entrypoint and packaged dependencies to S3
  • Starts an EMR Serverless job
  • Waits for the job to run to a successful completion!
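Conceptually, the deploy-and-run step ends in an EMR Serverless StartJobRun call. The sketch below builds (but does not send) a request of that shape. The bucket, application ID, role ARN, and the exact --conf wiring are illustrative assumptions, not the CLI's actual internals:

```python
def build_job_run_request(application_id, job_role_arn, s3_code_uri):
    """Assemble a StartJobRun-shaped request dict. Illustrative sketch only;
    the real EMR CLI may wire things differently. No AWS call is made."""
    entry_point = f"{s3_code_uri}entrypoint.py"
    venv_archive = f"{s3_code_uri}pyspark_deps.tar.gz"
    # Unpack the venv archive as ./environment and point PySpark's driver
    # and executors at the Python interpreter inside it.
    spark_params = (
        f"--conf spark.archives={venv_archive}#environment "
        "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python "
        "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python "
        "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
    )
    return {
        "applicationId": application_id,
        "executionRoleArn": job_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": spark_params,
            }
        },
    }

request = build_job_run_request(
    "00f1234567890",                           # hypothetical application ID
    "arn:aws:iam::123456789012:role/emr-job",  # hypothetical job role ARN
    "s3://my-bucket/tmp/emr-cli-demo/",
)
```

With --wait, the CLI then polls the resulting job run until it reaches a terminal state.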

And you're done. Feel free to modify the project to experiment with different things. You can simply re-run the command above to re-package and re-deploy your job.

PySpark code

In many organizations, PySpark is the primary language for writing Spark jobs. But Python projects can be structured in a variety of ways – a single .py file, a requirements.txt, a setup.py file, or even a Poetry configuration. The EMR CLI aims to bundle your PySpark code the same way regardless of which system you use.
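One way to think about this: the project layout can be inferred from which files are present. A hypothetical detection heuristic (not the CLI's actual code) might look like:

```python
from pathlib import Path
import tempfile

def detect_project_type(project_dir: str) -> str:
    """Guess a Python project's packaging style from its files.
    Hypothetical heuristic -- the real EMR CLI's detection may differ."""
    d = Path(project_dir)
    if (d / "pyproject.toml").exists():
        return "poetry"        # or any PEP 517 build backend
    if (d / "setup.py").exists():
        return "setuptools"
    if (d / "requirements.txt").exists():
        return "requirements"
    return "single-file"       # bare .py files, no dependency manifest

# Example: a directory containing only a script is treated as single-file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "main.py").touch()
    print(detect_project_type(tmp))  # single-file
```

Each style then maps to a packaging strategy that produces the same kind of deployable artifact.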

Spark Scala code (coming)

While Spark Scala or Java code is more standard from a packaging perspective, it's still useful to be able to easily deploy and run your jobs across multiple EMR environments.

Spark SQL (coming)

Want to just write some .sql files and have them deployed? No problem.

Sample Commands

  • Create a new PySpark project (other frameworks TBD)
emr init project-dir
  • Package your project into a virtual environment archive
emr package --entry-point main.py

The EMR CLI auto-detects the project type and will change the packaging method appropriately.

If you have additional .py files, those will be included in the archive.
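As a sketch of what "include the additional .py files" can mean in practice (the helper name and behavior here are hypothetical, not the CLI's code):

```python
from pathlib import Path
import tempfile

def collect_extra_modules(project_dir, entry_point="entrypoint.py"):
    """Gather every .py file besides the entry point, relative to the
    project root. Illustrative sketch only -- not the CLI's actual logic."""
    root = Path(project_dir)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*.py")
        if p.name != entry_point
    )

# Example layout mirroring the `emr init scratch` project above.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "jobs").mkdir()
    (Path(tmp) / "entrypoint.py").touch()
    (Path(tmp) / "jobs" / "extreme_weather.py").touch()
    print(collect_extra_modules(tmp))  # ['jobs/extreme_weather.py']
```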

  • Deploy an existing package artifact to S3.
emr deploy --entry-point main.py --s3-code-uri s3://<BUCKET>/code/
  • Deploy a PySpark package to S3 and trigger an EMR Serverless job
emr run --entry-point main.py \
    --s3-code-uri s3://<BUCKET>/code/ \
    --application-id <EMR_SERVERLESS_APP> \
    --job-role <JOB_ROLE_ARN>
  • Build, deploy, and run an EMR Serverless job and wait for it to finish.
emr run --entry-point main.py \
    --s3-code-uri s3://<BUCKET>/code/ \
    --application-id <EMR_SERVERLESS_APP> \
    --job-role <JOB_ROLE_ARN> \
    --build \
    --wait

Note: If the job fails, the command will exit with an error code.

In the future, you'll also be able to do the following:

  • Utilize the same code against an EMR on EC2 cluster
emr run --cluster-id j-8675309
  • Or an EMR on EKS virtual cluster.
emr run --virtual-cluster-id 654abacdefgh1uziuyackhrs1

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

amazon-emr-cli's People

Contributors

dacort, amazon-auto
