
aws-samples / emr-serverless-samples


Example code for running Spark and Hive jobs on EMR Serverless.

Home Page: https://aws.amazon.com/emr/serverless/

License: MIT No Attribution

Topics: aws, emr, serverless, analytics, spark, hive, cdk, cloudformation, emr-serverless, iceberg


EMR Serverless Samples

This repository contains example code for getting started with EMR Serverless and using it with Apache Spark and Apache Hive.

In addition, it provides Container Images for both the Spark History Server and Tez UI in order to debug your jobs.

For full details about using EMR Serverless, please see the EMR Serverless documentation.

Pre-Requisites

These demos assume you are using an Administrator-level role in your AWS account.

  1. Amazon EMR Serverless is now Generally Available! Check out the console to Get Started with EMR Serverless.

  2. Create an Amazon S3 bucket in region where you want to use EMR Serverless (we'll assume us-east-1).

aws s3 mb s3://BUCKET-NAME --region us-east-1
  3. Create an EMR Serverless execution role (replacing BUCKET-NAME with the one you created above)

This role provides both S3 access for specific buckets as well as read and write access to the Glue Data Catalog.

aws iam create-role --role-name emr-serverless-job-role --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "emr-serverless.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }'

aws iam put-role-policy --role-name emr-serverless-job-role --policy-name S3Access --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadFromOutputAndInputBuckets",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::noaa-gsod-pds",
                "arn:aws:s3:::noaa-gsod-pds/*",
                "arn:aws:s3:::BUCKET-NAME",
                "arn:aws:s3:::BUCKET-NAME/*"
            ]
        },
        {
            "Sid": "WriteToOutputDataBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET-NAME/*"
            ]
        }
    ]
}'

aws iam put-role-policy --role-name emr-serverless-job-role --policy-name GlueAccess --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "GlueCreateAndReadDataCatalog",
        "Effect": "Allow",
        "Action": [
            "glue:GetDatabase",
            "glue:GetDataBases",
            "glue:CreateTable",
            "glue:GetTable",
            "glue:GetTables",
            "glue:GetPartition",
            "glue:GetPartitions",
            "glue:CreatePartition",
            "glue:BatchCreatePartition",
            "glue:GetUserDefinedFunctions"
        ],
        "Resource": ["*"]
      }
    ]
  }'

Now you're ready to go! Check out the examples below.

Examples

SDK Usage

You can call EMR Serverless APIs using standard AWS SDKs. The examples below show how to do this.
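For reference, here is a minimal boto3 sketch of submitting a Spark job and checking its state; the application ID, role ARN, bucket, and script path are placeholders, not values from this repository.

import boto3

# Hedged sketch: submit a Spark job to an existing EMR Serverless application
# and check its state. All identifiers below are placeholders.
client = boto3.client("emr-serverless", region_name="us-east-1")

response = client.start_job_run(
    applicationId="YOUR-APPLICATION-ID",
    executionRoleArn="arn:aws:iam::111122223333:role/emr-serverless-job-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://BUCKET-NAME/code/pyspark/your_script.py",
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": "s3://BUCKET-NAME/logs/"}
        }
    },
)

job_run_id = response["jobRunId"]
job_run = client.get_job_run(applicationId="YOUR-APPLICATION-ID", jobRunId=job_run_id)
print(job_run_id, job_run["jobRun"]["state"])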

Utilities

The following UIs are available in the EMR Serverless console, but you can still use them locally if you wish.

  • Spark UI - Use this Dockerfile to run the Spark History Server in a container.

  • Tez UI - Use this Dockerfile to run the Tez UI and Application Timeline Server in a container.

Other Resources

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Contributors

amazon-auto, conorfl, dacort, dependabot[bot], guptashailesh92, karthick86mca, kiyoung-aws, kkhanmm, melodyyangaws, meniluca, mokkhan, pellizzon, sariabod, vgkowski


emr-serverless-samples's Issues

EMR Serverless: Adding Option to Boto3 for Glue Catalog

AWS has announced the AWS EMR CLI:

https://aws.amazon.com/blogs/big-data/build-deploy-and-run-spark-jobs-on-amazon-emr-with-the-open-source-emr-cli-tool/

 

I have tried it and the CLI works great; it simplifies submitting jobs.

However, could you tell us how to enable the Glue Hive metastore when submitting a job via the CLI or Boto3?
I have looked at the documentation and I don't see an argument for enabling the Glue Catalog option in Boto3.


Here is a sample of how we are submitting jobs with the EMR CLI:

emr run \
    --entry-point entrypoint.py \
    --application-id  \
    --job-role <arn> \
    --s3-code-uri s3:///emr_scripts/ \
    --spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer" \
    --build \
    --wait
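One approach worth trying (a hedged sketch, not a confirmed answer): EMR documents a Hive metastore client factory setting for the Glue Data Catalog, which can be passed as a Spark conf through sparkSubmitParameters when submitting with boto3. The application ID, role ARN, and S3 path below are placeholders.

import boto3

client = boto3.client("emr-serverless")

# Hedged sketch: pass the Glue Data Catalog client factory as a Spark conf.
client.start_job_run(
    applicationId="YOUR-APPLICATION-ID",
    executionRoleArn="arn:aws:iam::111122223333:role/your-job-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://YOUR-BUCKET/emr_scripts/entrypoint.py",
            "sparkSubmitParameters": (
                "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar "
                "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer "
                "--conf spark.hadoop.hive.metastore.client.factory.class="
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            ),
        }
    },
)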

 

Created a GitHub issue: awslabs/amazon-emr-cli#18

If you could kindly get back to us on this issue, that would be great 😃

EMR Serverless plugin in conflict with Airflow 2.2.2 constraints file

Hi all,

I'm trying to use the latest release of the serverless plugin on MWAA with Airflow version 2.2.2: https://github.com/aws-samples/emr-serverless-samples/releases/tag/v1.0.1

The install is in conflict with the airflow v2.2.2 constraints file found here: https://raw.githubusercontent.com/apache/airflow/constraints-2.2.2/constraints-3.7.txt

Steps to reproduce

Requirements.txt contents:

--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.2.2/constraints-3.7.txt"
emr_serverless @ https://github.com/aws-samples/emr-serverless-samples/releases/download/v1.0.1/mwaa_plugin.zip

Run:

pip3 install -r requirements.txt

Output:

The conflict is caused by:
    emr-serverless 1.0.1 depends on boto3>=1.23.9 and ~=1.23
    The user requested (constraint) boto3==1.18.65

How to execute a Hive SQL file with parameters

I want to execute an HQL file with parameters.
My local shell script adds "-hivevar date_param=string:$date_param", for example like this:
"""
#!/bin/bash
date_param=date "+%Y%m%d"

aws emr-serverless start-job-run
--application-id XXXXX
--execution-role-arn arn:aws:iam::xxxxx:role/xxxxxxxx
--job-driver '{
"hive": {
"initQueryFile": "s3://mm-emr/emr-serverless-hive/sql/createTable.sql",
"query": "s3://mm-emr/emr-serverless-hive/sql/insert3.sql",
"parameters": "--hiveconf hive.exec.scratchdir=s3://mm-emr/emr-serverless-hive/hive/scratch --hiveconf hive.metastore.warehouse.dir=s3://aiways-emr/emr-serverless-hive/hive/warehouse
-hivevar date_param=string:$date_param"
}
}'
"""
I uploaded the SQL files to the S3 path and ran the application.
I get this error: An error occurred (ValidationException) when calling the StartJobRun operation: Flag '-hivevar' is not supported

How should I do this?
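A hedged workaround sketch (unconfirmed): since the validation only rejects the '-hivevar' flag, one option to try is passing the value with --hiveconf instead and referencing it as ${hiveconf:date_param} inside the .sql files. Shown here with boto3; the application ID and role ARN are placeholders, and the S3 paths are the ones from the script above.

import boto3
from datetime import datetime

date_param = datetime.now().strftime("%Y%m%d")
client = boto3.client("emr-serverless")

# Hedged sketch: pass date_param via --hiveconf and reference it as
# ${hiveconf:date_param} in createTable.sql / insert3.sql.
client.start_job_run(
    applicationId="YOUR-APPLICATION-ID",
    executionRoleArn="arn:aws:iam::111122223333:role/your-job-role",
    jobDriver={
        "hive": {
            "initQueryFile": "s3://mm-emr/emr-serverless-hive/sql/createTable.sql",
            "query": "s3://mm-emr/emr-serverless-hive/sql/insert3.sql",
            "parameters": (
                "--hiveconf hive.exec.scratchdir=s3://mm-emr/emr-serverless-hive/hive/scratch "
                "--hiveconf hive.metastore.warehouse.dir=s3://aiways-emr/emr-serverless-hive/hive/warehouse "
                f"--hiveconf date_param={date_param}"
            ),
        }
    },
)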

The suggested way of using Python libraries with EMR Serverless does not work

As detailed here and here, we should be able to install and use a Python venv:

--conf spark.archives=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment 
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python 
--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python 

but that doesn't seem to work. The application fails with this error:

Unpacking an archive s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment from /tmp/spark-02908b0e-9b64-469d-xxx-xxxxxxxx/pyspark_venv.tar.gz to /home/hadoop/./environment
Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:105)
	at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1003)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1092)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1101)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	... 14 more
22/07/20 05:36:18 INFO ShutdownHookManager: Shutdown hook called
22/07/20 05:36:18 INFO ShutdownHookManager: Deleting directory /tmp/spark-02908b0e-9b64-469d-b094-edee291a2426

MWAA 2.2.2 constraints file

How are you able to install the emr_serverless package on MWAA 2.2.2 using the official constraints file?

The emr_serverless operator depends on boto3>=1.23.9.

While MWAA 2.2.2 has the following constraints:
boto3==1.18.65
boto==2.49.0
botocore==1.21.65

EmrServerlessStartJobOperator does not raise airflow exception

Hi! I am testing the operator and I found that Airflow marks a task as SUCCESS even though the EMR job state is FAILED.
I think the problem is in the EmrServerlessHook; this condition:

if state in failure_states:
    raise AirflowException(
        f"{object_type.title()} reached failure state {state}."
    )

should be at the end, inside the while loop.
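For illustration, a minimal self-contained sketch of the polling behavior being suggested (names and defaults are illustrative, not the hook's actual code): the failure check sits inside the loop so a FAILED job raises instead of the task succeeding.

import time

def wait_for_state(get_state, desired_states, failure_states,
                   object_type="job", countdown=25 * 60, check_interval_seconds=60):
    # Poll until a desired state is reached, raising as soon as a
    # failure state is observed inside the loop.
    while countdown >= 0:
        state = get_state()
        if state in desired_states:
            return state
        if state in failure_states:
            raise RuntimeError(f"{object_type.title()} reached failure state {state}.")
        countdown -= check_interval_seconds
        time.sleep(check_interval_seconds)
    raise RuntimeError(f"{object_type.title()} did not reach {desired_states} before timing out.")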

Thanks!

EMR serverless "java.lang.ClassNotFoundException"

We have a job that uses the MSSQL driver. I am currently supplying the config below as part of "spark properties", but I am getting the error below.

"--conf spark.archives=/artifacts/pyspark/pyspark_ge.tar.gz#environment --conf spark.submit.pyFiles=/package-1.0.0-py3.8.egg --jars /jar/spark-mssql-connector_2.12-1.1.0.jar --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"

Error:

": java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver"

Getting "No module named 'airflow.compat'"

Hi Team,

While trying to test the EMR Serverless operator using the GitHub link in the requirements file, I am getting the following error:

"Broken DAG: [/usr/local/airflow/dags/EMR_Serverless.py] Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/emr_serverless/operators/emr.py", line 22, in
from emr_serverless.hooks.emr import EmrServerlessHook
File "/usr/local/lib/python3.7/site-packages/emr_serverless/hooks/emr.py", line 23, in
from airflow.compat.functools import cached_property
ModuleNotFoundError: No module named 'airflow.compat' "

Versions used in Docker container:

  • Airflow = 2.1.0
  • boto3 = 1.23.9
  • apache-airflow-providers-amazon==3.0.0
  • Python 3.7 Slim-buster

It would be great if someone could suggest workarounds.

Thanks

Parameter countdown cannot be passed to EmrServerlessStartJobOperator

Hello, I am trying to increase the default countdown parameter in the EmrServerlessStartJobOperator, but I cannot do that because the waiter is not receiving it:

            self.hook.waiter(
                get_state_callable=self.hook.conn.get_job_run,
                get_state_args={
                    "applicationId": self.application_id,
                    "jobRunId": response["jobRunId"],
                },
                parse_response=["jobRun", "state"],
                desired_state=EmrServerlessJobSensor.SUCCESS_STATES,
                failure_states=EmrServerlessJobSensor.FAILURE_STATES,
                object_type="job",
                action="run",
            )
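For clarity, this is roughly the change being asked for: forwarding a user-supplied value into the waiter call. The countdown argument is the one mentioned above; waiter_countdown as an operator attribute is hypothetical and shown only to illustrate the request, not as the operator's actual code.

            self.hook.waiter(
                get_state_callable=self.hook.conn.get_job_run,
                get_state_args={
                    "applicationId": self.application_id,
                    "jobRunId": response["jobRunId"],
                },
                parse_response=["jobRun", "state"],
                desired_state=EmrServerlessJobSensor.SUCCESS_STATES,
                failure_states=EmrServerlessJobSensor.FAILURE_STATES,
                object_type="job",
                action="run",
                countdown=self.waiter_countdown,  # hypothetical pass-through
            )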

[pyspark-dependencies] - Dockerfile does not automatically move the tar.gz file to the local folder

Hi,
Trying to execute the first two commands listed in the README file:
docker build --output . .; aws s3 cp pyspark_ge.tar.gz s3://${S3_BUCKET}/artifacts/pyspark/

I noticed that the file pyspark_ge.tar.gz was not written locally. I had to run a container from the image built by the previous command and then use the docker cp command.

I was wondering if it was just my problem. If not, I volunteer to update the documentation with the two additional commands I had to run.
Thank you very much.

'template_fields' in 'EmrServerlessDeleteApplicationOperator' should be a tuple, not a string

It's me again.
I noticed that in the Airflow operators for EMR Serverless you have this line (https://github.com/aws-samples/emr-serverless-samples/blob/main/airflow/emr_serverless/operators/emr.py#L239):

    template_fields: Sequence[str] = "application_id"

The problem with it is that Python iterates the string 'application_id' as its individual characters ['a', 'p', 'p', 'l', ...], and therefore when Airflow/MWAA tries to use that operator it fails with the error message:

AttributeError: 'EmrServerlessDeleteApplicationOperator' object has no attribute 'a'

I found a good explanation in this SO question: https://stackoverflow.com/questions/56845602/python-airflow-error-attributeerror-xsensor-object-has-no-attribute-l
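A quick illustration of the difference the trailing comma makes: iterating a plain string yields its individual characters, while a one-element tuple yields the field name.

# Illustrative only: shows why Airflow ends up looking for an attribute 'a'.
print(list("application_id"))    # ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '_', 'i', 'd']
print(list(("application_id",))) # ['application_id']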

Based on that I changed that line to:

    template_fields: Sequence[str] = ("application_id",)

and after that change, the operator 'EmrServerlessDeleteApplicationOperator' works.

Franco

Would it be possible to add 'config' to the list of template fields for EmrServerlessStartJobOperator?

We have a use case where we would like to start job runs in EMR Serverless where the job name is derived from the output of a previous task.
Since the job name is passed via the 'config' argument to the EmrServerlessStartJobOperator, this request is to add 'config' to the list of 'template_fields' in that operator (https://github.com/aws-samples/emr-serverless-samples/blob/main/airflow/emr_serverless/operators/emr.py#L144-L149), if at all possible.

Thanks in advance,
Franco Venturi

Custom Python versions >= 3.10 fail on EMR Studio/Jupyter due to a badly patched version of Livy

This repository contains a great example of using a more recent python interpreter on EMR serverless.

Using that example I am able to use a custom Python 3.11 venv with preinstalled modules. This works fine for spark-submit jobs. In interactive mode, namely EMR Studio, I can also use my custom venv. However, the deployed version of Jupyter, in particular IPython, has compatibility issues with newer versions of Python. Some commands fail with:

An error was encountered:
required field "type_ignores" missing from Module
Traceback (most recent call last):
  File "/tmp/6833554925722006797", line 226, in execute
    code = compile(mod, '<stdin>', 'exec')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: required field "type_ignores" missing from Module

I see 2 possible solutions to this:

  • upgrade Jupyter/IPython - however, I don't know if that is part of the Docker image for emr-serverless/spark/emr-7.0.0, and if so, whether it can be upgraded with a custom image -> it turned out to be Livy, which is similar to but not the same as IPython

  • run our own Jupyter server that connects to EMR Serverless

But maybe I am missing something? Is there another way? Am I misinterpreting the stack trace?

It is possible to develop locally, of course; however, the data/computation should happen in AWS.

ModuleNotFoundError when running sample code

Hi,

I have run the example from
https://github.com/aws-samples/emr-serverless-samples/tree/main/examples/pyspark/dependencies

aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://'${S3_BUCKET}'/code/pyspark/ge_profile.py",
            "entryPointArguments": ["s3://'${S3_BUCKET}'/tmp/ge-profile"],
            "sparkSubmitParameters": "--conf spark.archives=s3://'${S3_BUCKET}'/artifacts/pyspark/pyspark_ge.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://'${S3_BUCKET}'/logs/"
            }
        }
    }'

And I keep getting:

Traceback (most recent call last):
File "/tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0/ge_profile.py", line 4, in
import great_expectations as ge
ModuleNotFoundError: No module named 'great_expectations'

I can confirm that the module is included in pyspark_ge.tar.gz

Thanks for the help

Eric

Hive Example Fails when using JSON Data

I am using the Hive example as a template, but with JSON data instead. Using the setup below, I receive an error every time. Is there a different setup I should be using? Also, is there a good example for Hive on EMR Serverless using JSON data that should work?

Having an additional Hive example based on JSON in this repository would be helpful, since the existing one uses CSV-formatted data.

Details

Given an S3 bucket that contains files with a format such as

{"Id":"123","Name":"my-name","Type":"some-type"}

and the initialization script here
and the query script here

When I run a job in EMR serverless using these inputs
Then I receive an error message stating

Job failed, please check complete logs in configured logging destination. ExitCode: 2. Last few exceptions: Caused by: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.data.JsonSerDe not found Caused by: java.lang.RuntimeException: Map operator initialization failed ], TaskAttempt 2 failed, info=[Error: Error while running task ( failure ) : attempt_1665435121822_0001_1_00_000000_2:java.lang.RuntimeException: java.lang.RuntimeException: Map operator initialization failed Caused by: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.data.JsonSerDe not found Caused by: java.lang.RuntimeException: Map operator initialization failed...

Add support for "config" to be a templated field in the EmrServerlessCreateApplicationOperator

Hi, I'd like config in the EmrServerlessCreateApplicationOperator to be a templated field so that I can use Jinja templates in it. Would it be possible to incorporate this into the next release? I have tried overriding the operator and it works just fine:

class CustomEmrServerlessCreateApplicationOperator(EmrServerlessCreateApplicationOperator):
    template_fields: Sequence[str] = ("config",)

Thanks!
