
aws-samples / emr-serverless-samples


Example code for running Spark and Hive jobs on EMR Serverless.

Home Page: https://aws.amazon.com/emr/serverless/

License: MIT No Attribution

Topics: aws, emr, serverless, analytics, spark, hive, cdk, cloudformation, emr-serverless, iceberg


EMR Serverless Samples

This repository contains example code for getting started with EMR Serverless and using it with Apache Spark and Apache Hive.

In addition, it provides Container Images for both the Spark History Server and Tez UI in order to debug your jobs.

For full details about using EMR Serverless, please see the EMR Serverless documentation.

Pre-Requisites

These demos assume you are using an Administrator-level role in your AWS account.

  1. Amazon EMR Serverless is now Generally Available! Check out the console to Get Started with EMR Serverless.

  2. Create an Amazon S3 bucket in region where you want to use EMR Serverless (we'll assume us-east-1).

aws s3 mb s3://BUCKET-NAME --region us-east-1
  3. Create an EMR Serverless execution role (replacing BUCKET-NAME with the one you created above)

This role provides both S3 access for specific buckets as well as read and write access to the Glue Data Catalog.

aws iam create-role --role-name emr-serverless-job-role --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "emr-serverless.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }'

aws iam put-role-policy --role-name emr-serverless-job-role --policy-name S3Access --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadFromOutputAndInputBuckets",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::noaa-gsod-pds",
                "arn:aws:s3:::noaa-gsod-pds/*",
                "arn:aws:s3:::BUCKET-NAME",
                "arn:aws:s3:::BUCKET-NAME/*"
            ]
        },
        {
            "Sid": "WriteToOutputDataBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET-NAME/*"
            ]
        }
    ]
}'

aws iam put-role-policy --role-name emr-serverless-job-role --policy-name GlueAccess --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "GlueCreateAndReadDataCatalog",
        "Effect": "Allow",
        "Action": [
            "glue:GetDatabase",
            "glue:GetDataBases",
            "glue:CreateTable",
            "glue:GetTable",
            "glue:GetTables",
            "glue:GetPartition",
            "glue:GetPartitions",
            "glue:CreatePartition",
            "glue:BatchCreatePartition",
            "glue:GetUserDefinedFunctions"
        ],
        "Resource": ["*"]
      }
    ]
  }'

Now you're ready to go! Check out the examples below.

Examples

SDK Usage

You can call EMR Serverless APIs using standard AWS SDKs. The examples below show how to do this.
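For reference, here is a minimal boto3 sketch of submitting a Spark job and checking its state; the application ID, role ARN, bucket, and script path are placeholders, not values from this repository.

import boto3

# Hedged sketch: submit a Spark job to an existing EMR Serverless application
# and check its state. All identifiers below are placeholders.
client = boto3.client("emr-serverless", region_name="us-east-1")

response = client.start_job_run(
    applicationId="YOUR-APPLICATION-ID",
    executionRoleArn="arn:aws:iam::111122223333:role/emr-serverless-job-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://BUCKET-NAME/code/pyspark/your_script.py",
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": "s3://BUCKET-NAME/logs/"}
        }
    },
)

job_run_id = response["jobRunId"]
job_run = client.get_job_run(applicationId="YOUR-APPLICATION-ID", jobRunId=job_run_id)
print(job_run_id, job_run["jobRun"]["state"])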

Utilities

The following UIs are available in the EMR Serverless console, but you can still use them locally if you wish.

  • Spark UI - Use this Dockerfile to run the Spark History Server in a container.

  • Tez UI - Use this Dockerfile to run the Tez UI and Application Timeline Server in a container.

Other Resources

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Contributors

amazon-auto, conorfl, dacort, dependabot[bot], guptashailesh92, karthick86mca, kiyoung-aws, kkhanmm, melodyyangaws, meniluca, mokkhan, pellizzon, sariabod, vgkowski


emr-serverless-samples's Issues

EMR Serverless: Adding Option to Boto3 for Glue Catalog

AWS has announced the AWS EMR CLI:

https://aws.amazon.com/blogs/big-data/build-deploy-and-run-spark-jobs-on-amazon-emr-with-the-open-source-emr-cli-tool/

 

I have tried it and the CLI works great; it simplifies submitting jobs.

However, could you tell us how to enable the Glue Hive metastore when submitting a job via the CLI or Boto3?
I have looked at the documentation and I don't see an argument for enabling the Glue Catalog option in Boto3.


Here is a sample of how we are submitting jobs with the EMR CLI:

emr run \
    --entry-point entrypoint.py \
    --application-id  \
    --job-role <arn> \
    --s3-code-uri s3:///emr_scripts/ \
    --spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer" \
    --build \
    --wait
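One approach worth trying (a hedged sketch, not a confirmed answer): EMR documents a Hive metastore client factory setting for the Glue Data Catalog, which can be passed as a Spark conf through sparkSubmitParameters when submitting with boto3. The application ID, role ARN, and S3 path below are placeholders.

import boto3

client = boto3.client("emr-serverless")

# Hedged sketch: pass the Glue Data Catalog client factory as a Spark conf.
client.start_job_run(
    applicationId="YOUR-APPLICATION-ID",
    executionRoleArn="arn:aws:iam::111122223333:role/your-job-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://YOUR-BUCKET/emr_scripts/entrypoint.py",
            "sparkSubmitParameters": (
                "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar "
                "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer "
                "--conf spark.hadoop.hive.metastore.client.factory.class="
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            ),
        }
    },
)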

 

Created a GitHub issue: awslabs/amazon-emr-cli#18

If you could kindly get back to us on this issue, that would be great 😃

EMR Serverless plugin in conflict with Airflow 2.2.2 constraints file

Hi all,

I'm trying to use the latest release of the serverless plugin on MWAA with Airflow version 2.2.2: https://github.com/aws-samples/emr-serverless-samples/releases/tag/v1.0.1

The install is in conflict with the airflow v2.2.2 constraints file found here: https://raw.githubusercontent.com/apache/airflow/constraints-2.2.2/constraints-3.7.txt

Steps to reproduce

Requirements.txt contents:

--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.2.2/constraints-3.7.txt"
emr_serverless @ https://github.com/aws-samples/emr-serverless-samples/releases/download/v1.0.1/mwaa_plugin.zip

Run:

pip3 install -r requirements.txt

Output:

The conflict is caused by:
    emr-serverless 1.0.1 depends on boto3>=1.23.9 and ~=1.23
    The user requested (constraint) boto3==1.18.65

How to execute a Hive SQL file with parameters

I want to execute an HQL file with parameters.
My local shell script adds "-hivevar date_param=string:$date_param", for example like this:
"""
#!/bin/bash
date_param=date "+%Y%m%d"

aws emr-serverless start-job-run
--application-id XXXXX
--execution-role-arn arn:aws:iam::xxxxx:role/xxxxxxxx
--job-driver '{
"hive": {
"initQueryFile": "s3://mm-emr/emr-serverless-hive/sql/createTable.sql",
"query": "s3://mm-emr/emr-serverless-hive/sql/insert3.sql",
"parameters": "--hiveconf hive.exec.scratchdir=s3://mm-emr/emr-serverless-hive/hive/scratch --hiveconf hive.metastore.warehouse.dir=s3://aiways-emr/emr-serverless-hive/hive/warehouse
-hivevar date_param=string:$date_param"
}
}'
"""
I uploaded the SQL files to the S3 path and ran the application.
I get this error: An error occurred (ValidationException) when calling the StartJobRun operation: Flag '-hivevar' is not supported

How should I do this?
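A hedged workaround sketch (unconfirmed): since the validation only rejects the '-hivevar' flag, one option to try is passing the value with --hiveconf instead and referencing it as ${hiveconf:date_param} inside the .sql files. Shown here with boto3; the application ID and role ARN are placeholders, and the S3 paths are the ones from the script above.

import boto3
from datetime import datetime

date_param = datetime.now().strftime("%Y%m%d")
client = boto3.client("emr-serverless")

# Hedged sketch: pass date_param via --hiveconf and reference it as
# ${hiveconf:date_param} in createTable.sql / insert3.sql.
client.start_job_run(
    applicationId="YOUR-APPLICATION-ID",
    executionRoleArn="arn:aws:iam::111122223333:role/your-job-role",
    jobDriver={
        "hive": {
            "initQueryFile": "s3://mm-emr/emr-serverless-hive/sql/createTable.sql",
            "query": "s3://mm-emr/emr-serverless-hive/sql/insert3.sql",
            "parameters": (
                "--hiveconf hive.exec.scratchdir=s3://mm-emr/emr-serverless-hive/hive/scratch "
                "--hiveconf hive.metastore.warehouse.dir=s3://aiways-emr/emr-serverless-hive/hive/warehouse "
                f"--hiveconf date_param={date_param}"
            ),
        }
    },
)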

The suggested way of using Python libraries with EMR Serverless does not work

As detailed here and here, we should be able to install and use a Python venv:

--conf spark.archives=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment 
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python 
--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python 

but that doesn't seem to work. The application fails with this error:

Unpacking an archive s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment from /tmp/spark-02908b0e-9b64-469d-xxx-xxxxxxxx/pyspark_venv.tar.gz to /home/hadoop/./environment
Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:105)
	at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1003)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1092)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1101)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	... 14 more
22/07/20 05:36:18 INFO ShutdownHookManager: Shutdown hook called
22/07/20 05:36:18 INFO ShutdownHookManager: Deleting directory /tmp/spark-02908b0e-9b64-469d-b094-edee291a2426

MWAA 2.2.2 constraints file

How are you able to install the emr_serverless package on MWAA 2.2.2 using the official constraints file?

The emr_serverless operator depends on boto3>=1.23.9.

While MWAA 2.2.2 has the following constraints:
boto3==1.18.65
boto==2.49.0
botocore==1.21.65

EmrServerlessStartJobOperator does not raise airflow exception

Hi! I am testing the operator and I found that Airflow marks a task as SUCCESS even though the EMR job state is FAILED.
I think the problem is in the EmrServerlessHook; this condition:

if state in failure_states:
    raise AirflowException(
        f"{object_type.title()} reached failure state {state}."
    )

should be at the end, inside the while loop.
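For illustration, a minimal self-contained sketch of the polling behavior being suggested (names and defaults are illustrative, not the hook's actual code): the failure check sits inside the loop so a FAILED job raises instead of the task succeeding.

import time

def wait_for_state(get_state, desired_states, failure_states,
                   object_type="job", countdown=25 * 60, check_interval_seconds=60):
    # Poll until a desired state is reached, raising as soon as a
    # failure state is observed inside the loop.
    while countdown >= 0:
        state = get_state()
        if state in desired_states:
            return state
        if state in failure_states:
            raise RuntimeError(f"{object_type.title()} reached failure state {state}.")
        countdown -= check_interval_seconds
        time.sleep(check_interval_seconds)
    raise RuntimeError(f"{object_type.title()} did not reach {desired_states} before timing out.")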

Thanks!

EMR serverless "java.lang.ClassNotFoundException"

We have a job that uses the MSSQL driver. I am currently supplying the config below as part of "spark properties", but I am getting the error below.

"--conf spark.archives=/artifacts/pyspark/pyspark_ge.tar.gz#environment --conf spark.submit.pyFiles=/package-1.0.0-py3.8.egg --jars /jar/spark-mssql-connector_2.12-1.1.0.jar --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"

Error:

": java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver"

Getting "No module named 'airflow.compat'"

Hi Team,

While trying to test the EMR Serverless operator using the GitHub link in the requirements file, I am getting the following error:

"Broken DAG: [/usr/local/airflow/dags/EMR_Serverless.py] Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/emr_serverless/operators/emr.py", line 22, in
from emr_serverless.hooks.emr import EmrServerlessHook
File "/usr/local/lib/python3.7/site-packages/emr_serverless/hooks/emr.py", line 23, in
from airflow.compat.functools import cached_property
ModuleNotFoundError: No module named 'airflow.compat' "

Versions used in Docker container:

  • Airflow = 2.1.0
  • boto3 = 1.23.9
  • apache-airflow-providers-amazon==3.0.0
  • Python 3.7 Slim-buster

It would be great if someone could suggest workarounds.

Thanks

Parameter countdown cannot be passed to EmrServerlessStartJobOperator

Hello, I am trying to increase the default countdown parameter in the EmrServerlessStartJobOperator, but I cannot do that because the waiter is not receiving it:

            self.hook.waiter(
                get_state_callable=self.hook.conn.get_job_run,
                get_state_args={
                    "applicationId": self.application_id,
                    "jobRunId": response["jobRunId"],
                },
                parse_response=["jobRun", "state"],
                desired_state=EmrServerlessJobSensor.SUCCESS_STATES,
                failure_states=EmrServerlessJobSensor.FAILURE_STATES,
                object_type="job",
                action="run",
            )
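For clarity, this is roughly the change being asked for: forwarding a user-supplied value into the waiter call. The countdown argument is the one mentioned above; waiter_countdown as an operator attribute is hypothetical and shown only to illustrate the request, not as the operator's actual code.

            self.hook.waiter(
                get_state_callable=self.hook.conn.get_job_run,
                get_state_args={
                    "applicationId": self.application_id,
                    "jobRunId": response["jobRunId"],
                },
                parse_response=["jobRun", "state"],
                desired_state=EmrServerlessJobSensor.SUCCESS_STATES,
                failure_states=EmrServerlessJobSensor.FAILURE_STATES,
                object_type="job",
                action="run",
                countdown=self.waiter_countdown,  # hypothetical pass-through
            )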

[pyspark-dependencies] - Dockerfile does not automatically move the tar.gz file to the local folder

Hi,
Trying to execute the first two commands listed in the README file:
docker build --output . .; aws s3 cp pyspark_ge.tar.gz s3://${S3_BUCKET}/artifacts/pyspark/

I noticed that the file pyspark_ge.tar.gz was not written locally. I had to run a container from the image built by the previous command and then use the docker cp command.

I was wondering if it was just my problem. If not, I volunteer to update the documentation with the two additional commands I had to run.
Thank you very much.

'template_fields' in 'EmrServerlessDeleteApplicationOperator' should be a tuple, not a string

It's me again.
I noticed that in the Airflow operators for EMR Serverless you have this line (https://github.com/aws-samples/emr-serverless-samples/blob/main/airflow/emr_serverless/operators/emr.py#L239):

    template_fields: Sequence[str] = "application_id"

The problem with it is that Python iterates the string 'application_id' as its individual characters ['a', 'p', 'p', 'l', ...], and therefore when Airflow/MWAA tries to use that operator it fails with the error message:

AttributeError: 'EmrServerlessDeleteApplicationOperator' object has no attribute 'a'

I found a good explanation in this SO question: https://stackoverflow.com/questions/56845602/python-airflow-error-attributeerror-xsensor-object-has-no-attribute-l
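A quick illustration of the difference the trailing comma makes: iterating a plain string yields its individual characters, while a one-element tuple yields the field name.

# Illustrative only: shows why Airflow ends up looking for an attribute 'a'.
print(list("application_id"))    # ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '_', 'i', 'd']
print(list(("application_id",))) # ['application_id']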

Based on that I changed that line to:

    template_fields: Sequence[str] = ("application_id",)

and after that change, the operator 'EmrServerlessDeleteApplicationOperator' works.

Franco

Would it be possible to add 'config' to the list of template fields for EmrServerlessStartJobOperator?

We have a use case where we would like to start job runs in EMR Serverless where the job name is derived from the output of a previous task.
Since the job name is passed via the 'config' argument to the EmrServerlessStartJobOperator, this request is to add 'config' to the list of 'template_fields' in that operator (https://github.com/aws-samples/emr-serverless-samples/blob/main/airflow/emr_serverless/operators/emr.py#L144-L149), if at all possible.

Thanks in advance,
Franco Venturi

Custom Python versions >= 3.10 fail on EMR Studio/Jupyter due to a badly patched version of Livy

This repository contains a great example of using a more recent python interpreter on EMR serverless.

Using that example I am able to use a custom Python 3.11 venv with preinstalled modules. This works fine for spark-submit jobs. In interactive mode, namely EMR Studio, I can also use my custom venv. However, the deployed version of Jupyter, in particular IPython, has compatibility issues with newer versions of Python. Some commands fail with:

An error was encountered:
required field "type_ignores" missing from Module
Traceback (most recent call last):
  File "/tmp/6833554925722006797", line 226, in execute
    code = compile(mod, '<stdin>', 'exec')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: required field "type_ignores" missing from Module

I see 2 possible solutions to this:

  • upgrade Jupyter/IPython - however, I don't know if that is part of the Docker image for emr-serverless/spark/emr-7.0.0, and if so, whether it can be upgraded with a custom image -> it turned out to be Livy, which is similar to but not the same as IPython

  • run our own Jupyter server that connects to EMR Serverless

But maybe I am missing something? Is there another way? Am I misinterpreting the stack trace?

It is possible to develop locally, of course; however, the data/computation should happen in AWS.

ModuleNotFoundError when running sample code

Hi,

I have run the example from
https://github.com/aws-samples/emr-serverless-samples/tree/main/examples/pyspark/dependencies

aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://'${S3_BUCKET}'/code/pyspark/ge_profile.py",
            "entryPointArguments": ["s3://'${S3_BUCKET}'/tmp/ge-profile"],
            "sparkSubmitParameters": "--conf spark.archives=s3://'${S3_BUCKET}'/artifacts/pyspark/pyspark_ge.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://'${S3_BUCKET}'/logs/"
            }
        }
    }'

And I keep getting:

Traceback (most recent call last):
File "/tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0/ge_profile.py", line 4, in
import great_expectations as ge
ModuleNotFoundError: No module named 'great_expectations'

I can confirm that the module is included in pyspark_ge.tar.gz

Thanks for the help

Eric

Hive Example Fails when using JSON Data

I am using the Hive example as a template, but with JSON data instead. Using the setup below, I receive an error every time. Is there a different setup I should be using? Also, is there a good example for Hive on EMR Serverless using JSON data that should work?

Having an additional Hive example based on JSON in this repository would be helpful, since the existing one uses CSV-formatted data.

Details

Given an S3 bucket that contains files with a format such as

{"Id":"123","Name":"my-name","Type":"some-type"}

and the initialization script here
and the query script here

When I run a job in EMR serverless using these inputs
Then I receive an error message stating

Job failed, please check complete logs in configured logging destination. ExitCode: 2. Last few exceptions: Caused by: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.data.JsonSerDe not found Caused by: java.lang.RuntimeException: Map operator initialization failed ], TaskAttempt 2 failed, info=[Error: Error while running task ( failure ) : attempt_1665435121822_0001_1_00_000000_2:java.lang.RuntimeException: java.lang.RuntimeException: Map operator initialization failed Caused by: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.data.JsonSerDe not found Caused by: java.lang.RuntimeException: Map operator initialization failed...

Add support for "config" to be a templated field in the EmrServerlessCreateApplicationOperator

Hi, I'd like config in the EmrServerlessCreateApplicationOperator to be a templated field so that I can use Jinja templates in it. Would it be possible to incorporate this into the next release? I have tried overriding the operator and it works just fine:

class CustomEmrServerlessCreateApplicationOperator(EmrServerlessCreateApplicationOperator):
    template_fields: Sequence[str] = ("config",)

Thanks!
