awslabs / aws-emr-launch

License: Apache License 2.0

Python 99.95% Shell 0.05%

aws-emr-launch's Introduction

EMR Launch

An AWS Professional Services open source initiative

The intent of the EMR Launch library is to simplify the development experience for Builders defining, deploying, managing, and using EMR Clusters by:

defining reusable Security, Resource, and Launch Configurations enabling developers to Define Once and Reuse
separating the definition of Cluster Security Configurations and Cluster Resource Configurations into reusable and shareable Constructs
providing a suite of Tools to simplify the construction of Orchestration Pipelines using Step Functions and EMR Clusters

Concepts (and Constructs)

This library utilizes the AWS CDK for deployment and management of resources. It is recommended that users familiarize themselves with the CDK's basic concepts and usage.

EMR Profile

An EMR Profile (emr_profile) is a reusable definition of the security profile used by an EMR Cluster. This includes:

Service Role: an IAM Role used by the EMR Service to manage the Cluster
Instance Role: an IAM Role used by the EC2 Instances in an EMR Cluster
AutoScaling Role: an IAM Role used to autoscale and resize an EMR Cluster
Service Group: a Security Group granting the EMR Service basic access to the EC2 Instances in the Cluster. This is required to deploy Instances into a Private Subnet.
Master Group: the Security Group assigned to the EMR Master Instance
Workers Group: the Security Group assigned to the EMR Worker Instances (Core and Task nodes)
Security Configuration: the Security Configuration used by the Cluster
Kerberos Attributes: the attributes required to enable Kerberos authentication

Each emr_profile requires a unique profile_name. This name and the namespace uniquely identify a profile. The namespace is a logical grouping of profiles and has a default value of "default".

Deploying an emr_profile creates these resources and stores the profile definition and metadata in the Parameter Store. The Profile can either be used immediately in the Stack when it is defined, or reused in other Stacks by loading the Profile definition by profile_name and namespace.
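
As a minimal sketch of how a name and a namespace can jointly identify a stored definition, the helper below builds a Parameter Store style key. The `/emr_launch/...` path layout and the `parameter_name` function are assumptions made for illustration only; they are not taken from the library.

```python
def parameter_name(kind: str, name: str, namespace: str = "default") -> str:
    """Build a hypothetical Parameter Store key from kind, namespace, and name.

    The path layout is an assumption for illustration, not the library's format.
    """
    for part in (kind, namespace, name):
        # Reject empty components and embedded separators so that two
        # different (namespace, name) pairs can never map to the same key.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return f"/emr_launch/{kind}/{namespace}/{name}"


# The namespace falls back to "default" when it is not given.
print(parameter_name("emr_profiles", "sample-profile"))
print(parameter_name("emr_profiles", "sample-profile", namespace="analytics"))
```

Because the namespace is part of the key, the same profile_name can exist in two namespaces without colliding.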

Cluster Configuration

A Cluster Configuration (cluster_configuration) is a reusable definition of the physical resources in an EMR Cluster. This includes:

EMR Release Label: the EMR release version (e.g. emr-5.28.0)
Applications: the Applications to install on the Cluster (e.g. Hadoop, Hive, Spark)
Bootstrap Actions: the Bootstrap Actions to execute on each node after Applications have been installed
Configurations: configuration parameters to set for the various Applications installed
Step Concurrency Level: the number of concurrent Steps the Cluster is configured to run
Instances: the configuration of the Master, Core, and Task nodes in the Cluster (e.g. Master Instance Type, Core Instance Type, Core Instance Count, etc.)

Like the emr_profile, each cluster_configuration requires a unique configuration_name. This name and the namespace uniquely identify a configuration.

Deploying a cluster_configuration stores the configuration definition and metadata in the Parameter Store. The Configuration can either be used immediately in the Stack when it is defined, or reused in other Stacks by loading the Configuration definition by configuration_name and namespace.

EMR Launch Function

An EMR Launch Function (emr_launch_function) is an AWS Step Functions State Machine that launches an EMR Cluster. The Launch Function is defined with an emr_profile, cluster_configuration, cluster_name, and tags. When the function is executed it creates an EMR Cluster with the given name, tags, security profile, and physical resources then synchronously monitors the cluster for successful start.

To be clear, deploying an emr_launch_function does not create an EMR Cluster; it only creates the State Machine. The cluster is created when the State Machine is executed.
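
A minimal sketch of that second step, executing the deployed State Machine, is shown below. It assumes a boto3 Step Functions client is passed in; the `start_launch` helper and the shape of `launch_input` are hypothetical, since the input accepted by a launch function is not described here.

```python
import json
from typing import Any, Dict, Optional


def start_launch(
    sfn_client: Any,
    state_machine_arn: str,
    launch_input: Optional[Dict[str, Any]] = None,
) -> str:
    """Start one execution of a launch State Machine and return its ARN.

    sfn_client is expected to behave like boto3.client("stepfunctions").
    launch_input is a hypothetical payload; an empty object is sent if omitted.
    """
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(launch_input or {}),
    )
    # The cluster is only created once this execution runs.
    return response["executionArn"]
```

With real credentials this could be called as `start_launch(boto3.client("stepfunctions"), arn)`, where `arn` is the ARN of the State Machine created at deployment time.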

The emr_launch_function is a mechanism for easily combining the reusable emr_profile and cluster_configuration.

Like the emr_profile and cluster_configuration, each emr_launch_function requires a unique launch_function_name. This name and the namespace uniquely identify the launch function.

Chains and Tasks

Chains and Tasks are preconfigured components that simplify the use of AWS Step Function State Machines as orchestrators of data processing pipelines. These components allow the developer to easily build complex, serverless pipelines using EMR Clusters (both Transient and Persistent), Lambdas, and nested State Machines.
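
To make the idea of such a pipeline concrete, the sketch below builds a raw Amazon States Language definition for a transient cluster: launch through a nested State Machine, run one Step, then terminate. This is plain Step Functions JSON, not the library's Chains and Tasks API. The `build_pipeline_definition` helper, the state names, and the assumption that the launch State Machine returns a `ClusterId` in its output are all hypothetical.

```python
import json
from typing import Any, Dict, List


def build_pipeline_definition(
    launch_state_machine_arn: str, jar: str, args: List[str]
) -> Dict[str, Any]:
    """Return an Amazon States Language definition for a transient-cluster pipeline."""
    # Assumption: the nested launch execution outputs an object with "ClusterId".
    cluster_id_path = "$.Launch.Output.ClusterId"
    return {
        "Comment": "Illustrative pipeline: launch, run one step, terminate",
        "StartAt": "LaunchCluster",
        "States": {
            "LaunchCluster": {
                "Type": "Task",
                # Run the nested launch State Machine and wait for it to finish.
                "Resource": "arn:aws:states:::states:startExecution.sync:2",
                "Parameters": {
                    "StateMachineArn": launch_state_machine_arn,
                    "Input": {},
                },
                "ResultPath": "$.Launch",
                "Next": "RunStep",
            },
            "RunStep": {
                "Type": "Task",
                # Submit an EMR Step and wait for it to complete.
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": cluster_id_path,
                    "Step": {
                        "Name": "ExampleStep",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {"Jar": jar, "Args": args},
                    },
                },
                "ResultPath": "$.Step",
                "Next": "TerminateCluster",
            },
            "TerminateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
                "Parameters": {"ClusterId.$": cluster_id_path},
                "End": True,
            },
        },
    }


definition = build_pipeline_definition(
    "arn:aws:states:us-east-1:123456789012:stateMachine:example-launch",
    "command-runner.jar",
    ["spark-submit", "s3://example-bucket/job.py"],
)
print(json.dumps(definition, indent=2))
```

Writing this by hand for every pipeline is the repetition that preconfigured components are meant to remove.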

Security

Care is taken to ensure that emr_launch_functions and emr_profiles can't be used to create clusters with elevated or unintended privileges.

IAM policies can be used to restrict the Users and Roles that can create EMR Clusters by granting states:StartExecution to specific State Machine ARNs.
By storing the metadata and configuration of emr_profiles, cluster_configurations, and emr_launch_functions in the Systems Manager Parameter Store, IAM Policies can be used to grant or restrict Read/Write access to these definitions.
- Access can be managed for ALL metadata and configurations, specific namespaces, or individual ARNs.
Each emr_launch_function uses a specific AWS Lambda function to load and combine its specific emr_profile and cluster_configuration. The IAM Policy associated with this Lambda allows it to read only these specific ARNs from the Parameter Store.
Each emr_launch_function is granted iam:PassRole to the specific EMR Roles defined in the emr_profile assigned to the launch function. Attempting to change the Roles used by directly modifying the metadata of the emr_profile in the Parameter Store will result in a cluster launch failure.
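
The first two points above can be sketched as IAM policy documents built in Python. The helper names are hypothetical, and the Parameter Store path prefix passed to `parameter_read_policy` is an assumption about where definitions might be stored; only the IAM actions and ARN formats are standard.

```python
import json
from typing import Any, Dict, Iterable


def start_execution_policy(state_machine_arns: Iterable[str]) -> Dict[str, Any]:
    """Allow starting executions of specific State Machines only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "states:StartExecution",
                "Resource": list(state_machine_arns),
            }
        ],
    }


def parameter_read_policy(region: str, account_id: str, path_prefix: str) -> Dict[str, Any]:
    """Allow reading parameters under one path prefix (e.g. one namespace).

    path_prefix must start with "/"; its layout is an assumption for illustration.
    """
    if not path_prefix.startswith("/"):
        raise ValueError("path_prefix must start with '/'")
    resource = f"arn:aws:ssm:{region}:{account_id}:parameter{path_prefix}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "ssm:GetParameter", "Resource": resource}
        ],
    }


print(json.dumps(
    start_execution_policy(
        ["arn:aws:states:us-east-1:123456789012:stateMachine:example-launch"]
    ),
    indent=2,
))
```

Scoping `Resource` to individual ARNs or to a namespace prefix is what lets access be granted at the different levels described above.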

Usage

This library acts as a plugin to the AWS CDK providing additional L2 Constructs. To avoid circular references with CDK dependencies this package will not install CDK and Boto3. These should be installed manually from one of the requirements.txt files (depending on the version of aws-emr-launch).

It is recommended that a Python3 venv be used for all CDK builds and deployments.

To get up and running quickly:

Prerequisites

The AWS CDK v2.x utilizes containers to automate some tasks. EMR Launch uses and deploys a CDK PythonLayerVersion; this Construct uses a container to create the bundle for the Lambda Layer. As such, a Docker runtime is required to deploy.
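
Since a missing Docker runtime only surfaces as an error during synthesis or deployment, a small preflight check can fail earlier with a clearer message. The sketch below is one possible check; `docker_available` is a hypothetical helper, and it assumes that a non-zero exit code from `docker info` means the daemon is unreachable.

```python
import shutil
import subprocess
from typing import Callable


def docker_available(
    which: Callable = shutil.which,
    run: Callable = subprocess.run,
) -> bool:
    """Return True if the docker CLI is on PATH and `docker info` succeeds."""
    if which("docker") is None:
        # No docker executable found on PATH.
        return False
    try:
        result = run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The executable could not be started at all.
        return False
    # Assumption: a non-zero exit code means the daemon cannot be reached.
    return result.returncode == 0


if __name__ == "__main__":
    if not docker_available():
        raise SystemExit("A running Docker runtime is required to bundle the Lambda Layer.")
```

The `which` and `run` parameters exist only so the check can be exercised without Docker installed.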

Deployment

Install the CDK CLI
```
npm install -g aws-cdk
```
Use your mechanism of choice to create and activate a Python3 venv:
```
python3 -m venv .env
source .env/bin/activate
```
Install the CDK and Boto3 minimum requirements:
```
pip install -r requirements-2.x.txt
```
Install aws-emr-launch package:
```
pip install aws-emr-launch
```

Development

Follow Steps 1 - 3 above to configure an environment and install requirements

After activating your venv:

Install development requirements:
```
pip install -r requirements-dev.txt
```
Install the library locally:
```
pip install -e .
```

Managing Layer Packages

To add, update, or remove Layer packages, edit aws_emr_launch/lambda_sources/layers/emr_config_utils/requirements.txt.

Testing

To run the test suite (from within the venv):

pytest

After running tests

View test coverage reports by opening htmlcov/index.html in your web browser.

To write a test

start a file named test_[the module you want to test].py
import the module you want to test at the top of the file
write test case functions that match either test_* or *_test

For more information, refer to the pytest documentation.
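
The steps above can be sketched as a single test file. The file name `test_naming.py` and the `qualified_name` function are hypothetical; in a real test the function would be imported from the module under test rather than defined inline.

```python
# test_naming.py (hypothetical file name following the test_[module].py pattern)

# In a real test this would be an import of the module under test;
# the function is defined inline here only to keep the sketch self-contained.
def qualified_name(namespace: str, name: str) -> str:
    """Join a namespace and a name into one identifier."""
    return f"{namespace}/{name}"


# pytest collects functions whose names match test_* (or *_test).
def test_qualified_name_joins_namespace_and_name():
    assert qualified_name("default", "sample-profile") == "default/sample-profile"


def test_qualified_name_differs_across_namespaces():
    assert qualified_name("a", "profile") != qualified_name("b", "profile")
```

Running `pytest` from within the venv would discover and run both functions.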

Contributing

See CONTRIBUTING for more information.

License

This project is licensed under the terms of the Apache 2.0 license. See LICENSE. Included AWS Lambda functions are licensed under the MIT-0 license. See LICENSE-LAMBDA.

aws-emr-launch's Issues

[BUG] - RunJobFlowServiceRole missing permissions to create AWSServiceRoleForEMRCleanup role

Describe the bug
The package does not work in a brand new AWS account (or an account that has never used EMR before), because RunJobFlowServiceRole created is missing permissions to create AWSServiceRoleForEMRCleanup role.
Reference: https://docs.aws.amazon.com/emr/latest/ManagementGuide/using-service-linked-roles.html

To Reproduce
Steps to reproduce the behavior:

Get a new AWS account or use an AWS account where AWSServiceRoleForEMRCleanup role doesn't exist yet (has never used EMR before)
Create a CDK app that will launch an EMR cluster using this package
Launching the cluster will fail with the error message: "Terminated with errors: Service-linked role 'AWSServiceRoleForEMRCleanup' for EMR is required. Please create this role directly or add permission to create it in your IAM entity."
PS: The screen shot is attached.

Expected behavior
AWSServiceRoleForEMRCleanup role should be automatically created if RunJobFlowServiceRole has sufficient permissions.
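
One possible shape for the missing permission is sketched below as an IAM policy statement built in Python. It is modelled on the service-linked role documentation referenced above and should be checked against it before use; the `emr_cleanup_slr_statement` helper is hypothetical and is not part of this package.

```python
import json
from typing import Any, Dict


def emr_cleanup_slr_statement() -> Dict[str, Any]:
    """Statement allowing creation of the EMR cleanup service-linked role.

    Illustrative only: verify the resource pattern and condition against
    the AWS documentation for EMR service-linked roles.
    """
    return {
        "Effect": "Allow",
        "Action": "iam:CreateServiceLinkedRole",
        "Resource": (
            "arn:aws:iam::*:role/aws-service-role/"
            "elasticmapreduce.amazonaws.com*/AWSServiceRoleForEMRCleanup*"
        ),
        # Restrict role creation to the EMR service principal.
        "Condition": {
            "StringLike": {"iam:AWSServiceName": "elasticmapreduce.amazonaws.com"}
        },
    }


print(json.dumps(emr_cleanup_slr_statement(), indent=2))
```

An alternative that needs no change to the Role is to create the service-linked role once per account before the first launch.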

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

OS: N/A
Browser N/A
Version N/A

Additional context

[BUG] - PythonLayerVersion gives error jsii.errors.JSIIError: spawnSync docker ENOENT

Describe the bug
The emr_launch_function deployment failed. The PythonLayerVersion throws error message:
` jsii.errors.JavaScriptError:
Error: spawnSync docker ENOENT
at Object.spawnSync (node:internal/child_process:1086:20)
at Object.spawnSync (node:child_process:812:24)
at dockerExec (/private/var/folders/yc/g40_3hkd7ql019jlcc89hjs00000gs/T/jsii-kernel-78Rggx/node_modules/aws-cdk-lib/core/lib/bundling.js:1:4968)
at Function.fromBuild (/private/var/folders/yc/g40_3hkd7ql019jlcc89hjs00000gs/T/jsii-kernel-78Rggx/node_modules/aws-cdk-lib/core/lib/bundling.js:1:3553)
at new Bundling (/private/var/folders/yc/g40_3hkd7ql019jlcc89hjs00000gs/T/jsii-kernel-78Rggx/node_modules/@aws-cdk/aws-lambda-python-alpha/lib/bundling.js:28:93)
at Function.bundle (/private/var/folders/yc/g40_3hkd7ql019jlcc89hjs00000gs/T/jsii-kernel-78Rggx/node_modules/@aws-cdk/aws-lambda-python-alpha/lib/bundling.js:43:50)
at new PythonLayerVersion (/private/var/folders/yc/g40_3hkd7ql019jlcc89hjs00000gs/T/jsii-kernel-78Rggx/node_modules/@aws-cdk/aws-lambda-python-alpha/lib/layer.js:43:39)
at /private/var/folders/yc/g40_3hkd7ql019jlcc89hjs00000gs/T/tmpxltap7ay/lib/program.js:8171:58
at Kernel._wrapSandboxCode (/private/var/folders/yc/g40_3hkd7ql019jlcc89hjs00000gs/T/tmpxltap7ay/lib/program.js:8592:24)
at Kernel._create (/private/var/folders/yc/g40_3hkd7ql019jlcc89hjs00000gs/T/tmpxltap7ay/lib/program.js:8171:34)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/codes/aws-emr-launch-test/examples/emr_launch_function/app.py", line 29, in <module>
launch_function = emr_launch_function.EMRLaunchFunction(
File "/codes/aws-emr-launch-test/.venv/lib/python3.9/site-packages/jsii/_runtime.py", line 86, in __call__
inst = super().__call__(*args, **kwargs)
File "/codes/aws-emr-launch-test/.venv/lib/python3.9/site-packages/aws_emr_launch/constructs/step_functions/emr_launch_function.py", line 109, in __init__
load_cluster_configuration = emr_tasks.LoadClusterConfigurationBuilder.build(
File "/codes/aws-emr-launch-test/.venv/lib/python3.9/site-packages/aws_emr_launch/constructs/step_functions/emr_tasks.py", line 447, in build
load_cluster_configuration_lambda = emr_lambdas.LoadClusterConfigurationBuilder.build(
File "/codes/aws-emr-launch-test/.venv/lib/python3.9/site-packages/aws_emr_launch/constructs/lambdas/emr_lambdas.py", line 57, in build
layer = EMRConfigUtilsLayerBuilder.get_or_build(scope)
File "/codes/aws-emr-launch-test/.venv/lib/python3.9/site-packages/aws_emr_launch/constructs/lambdas/emr_lambdas.py", line 287, in get_or_build
layer = PythonLayerVersion(
File "/codes/aws-emr-launch-test/.venv/lib/python3.9/site-packages/jsii/_runtime.py", line 86, in __call__
inst = super().__call__(*args, **kwargs)
File "/codes/aws-emr-launch-test/.venv/lib/python3.9/site-packages/aws_cdk/aws_lambda_python_alpha/__init__.py", line 1347, in __init__
jsii.create(self.__class__, self, [scope, id, props])
File "/codes/aws-emr-launch-test/.venv/lib/python3.9/site-packages/jsii/_kernel/__init__.py", line 290, in create
response = self.provider.create(
File "/codes/aws-emr-launch-test/.venv/lib/python3.9/site-packages/jsii/_kernel/providers/process.py", line 344, in create
return self._process.send(request, CreateResponse)
File "/codes/aws-emr-launch-test/.venv/lib/python3.9/site-packages/jsii/_kernel/providers/process.py", line 326, in send
raise JSIIError(resp.error) from JavaScriptError(resp.stack)
jsii.errors.JSIIError: spawnSync docker ENOENT

Subprocess exited with error 1
`

To Reproduce
Steps to reproduce the behavior:

Follow all instructions in README.md - Usage, using requirements-2.x.txt and requirements-lambda-layer.txt
Go to the './examples' directory
Run 'deployall.sh'
The first four stacks deploy successfully, but the next stage, 'emr_launch_function', fails with the above error messages

Expected behavior
Successful deployment using deployall.sh script in the examples directory.

Screenshots
N/A.

Desktop:

OS: macOS Darwin 21.4.0 Darwin Kernel Version 21.4.0: root:xnu-8020.101.4~15/RELEASE_X86_64 x86_64
CDK Version 2: 2.27.0 (build 8e89048)

Additional context
N/A.

Error deploying emr_launch_function/

Following the launch examples raises an error when deploying emr_launch_function:

jsii.errors.JavaScriptError:
Error: ENOENT: no such file or directory, stat '/home/ec2-user/environment/aws-emr-launch/aws_emr_launch/lambda_sources/layers/emr_config_utils'

[FEATURE] - Update to CDK v2.x

CDK v1.x is entering maintenance mode soon. Minor changes to the emr-launch library are required to support CDK v2.x.

Add support for Managed Scaling

Enable creating EMR clusters with Managed Scaling.

Issue in response for describe-instance-types in java-sdk-ec2-1.11.820

I'm trying to get the response from describe-instance-types using the Java SDK, and the response contains an extra comma (,).
Parsing this response as JSON causes a MalformedJsonException.

Here is a snippet for the replication: (Java SDK version-> com.amazonaws:aws-java-sdk-ec2:1.11.820):

import .... 
...
AmazonEC2Async ec2;
ec2 = AmazonEC2AsyncClientBuilder.standard().withCredentials(new DefaultAWSCredentialsProviderChain()).build();

// Please note, I've tried this for instance type r5.xlarge and r5.2xlarge.. Similar issue was encountered
DescribeInstanceTypesRequest describeInstanceTypesRequest = new DescribeInstanceTypesRequest().withInstanceTypes("r3.xlarge");
DescribeInstanceTypesResult describeInstanceTypesResult = ec2.describeInstanceTypes(describeInstanceTypesRequest);

JsonObject jsonObject = (new JsonParser()).parse(describeInstanceTypesResult.toString()).getAsJsonObject();

Output:
Response::
{InstanceTypes: [{InstanceType: r3.xlarge,CurrentGeneration: false,FreeTierEligible: false,SupportedUsageClasses: [on-demand, spot],SupportedRootDeviceTypes: [ebs, instance-store],SupportedVirtualizationTypes: [hvm],BareMetal: false,Hypervisor: xen,ProcessorInfo: {SupportedArchitectures: [x86_64],SustainedClockSpeedInGhz: 2.5},VCpuInfo: {DefaultVCpus: 4,DefaultCores: 2,DefaultThreadsPerCore: 2,ValidCores: [1, 2],ValidThreadsPerCore: [1, 2]},MemoryInfo: {SizeInMiB: 31232},InstanceStorageSupported: true,InstanceStorageInfo: {TotalSizeInGB: 80,Disks: [{SizeInGB: 80,Count: 1,Type: ssd}]},EbsInfo: {EbsOptimizedSupport: supported,EncryptionSupport: supported,EbsOptimizedInfo: {BaselineBandwidthInMbps: 500,BaselineThroughputInMBps: 62.5,BaselineIops: 4000,MaximumBandwidthInMbps: 500,MaximumThroughputInMBps: 62.5,MaximumIops: 4000},NvmeSupport: unsupported},NetworkInfo: {NetworkPerformance: Moderate,MaximumNetworkInterfaces: 4,Ipv4AddressesPerInterface: 15,Ipv6AddressesPerInterface: 15,Ipv6Supported: true,EnaSupport: unsupported,EfaSupported: false},PlacementGroupInfo: {SupportedStrategies: [cluster, partition, spread]},HibernationSupported: true,BurstablePerformanceSupported: false,DedicatedHostsSupported: true,AutoRecoverySupported: true}],}

com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: Expected name at line 1 column 1256 path $.InstanceTypes

Please note that the comma (,) at the end of the response is orphaned, which causes the MalformedJsonException.
I've used the same approach to get the information with Python and the AWS CLI, and both responses were well-formed JSON.

python snippet:

$ python3
>>> import json
>>> import boto3
>>> client = boto3.client('ec2')
>>> response = client.describe_instance_types(InstanceTypes=['r3.xlarge'])
>>> test_json = json.dumps(response)
>>> print(test_json)
{"InstanceTypes": [{"InstanceType": "r3.xlarge", "CurrentGeneration": false, "FreeTierEligible": false, "SupportedUsageClasses": ["on-demand", "spot"], "SupportedRootDeviceTypes": ["ebs", "instance-store"], "BareMetal": false, "Hypervisor": "xen", "ProcessorInfo": {"SupportedArchitectures": ["x86_64"], "SustainedClockSpeedInGhz": 2.5}, "VCpuInfo": {"DefaultVCpus": 4, "DefaultCores": 2, "DefaultThreadsPerCore": 2, "ValidCores": [1, 2], "ValidThreadsPerCore": [1, 2]}, "MemoryInfo": {"SizeInMiB": 31232}, "InstanceStorageSupported": true, "InstanceStorageInfo": {"TotalSizeInGB": 80, "Disks": [{"SizeInGB": 80, "Count": 1, "Type": "ssd"}]}, "EbsInfo": {"EbsOptimizedSupport": "supported", "EncryptionSupport": "supported"}, "NetworkInfo": {"NetworkPerformance": "Moderate", "MaximumNetworkInterfaces": 4, "Ipv4AddressesPerInterface": 15, "Ipv6AddressesPerInterface": 15, "Ipv6Supported": true, "EnaSupport": "unsupported"}, "PlacementGroupInfo": {"SupportedStrategies": ["cluster", "partition", "spread"]}, "HibernationSupported": true, "BurstablePerformanceSupported": false, "DedicatedHostsSupported": true, "AutoRecoverySupported": true}], "ResponseMetadata": {"RequestId": "e4c4e015-7072-4e92-b546-b17b34e8979b", "HTTPStatusCode": 200, "HTTPHeaders": {"x-amzn-requestid": "e4c4e015-7072-4e92-b546-b17b34e8979b", "content-type": "text/xml;charset=UTF-8", "transfer-encoding": "chunked", "vary": "accept-encoding", "date": "Tue, 14 Jul 2020 09:56:04 GMT", "server": "AmazonEC2"}, "RetryAttempts": 0}}

AWS-CLI :
Command: aws ec2 describe-instance-types --instance-types r3.xlarge
The response was well-formed JSON.
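
A plausible reading of the Java output above, offered as an assumption rather than a confirmed diagnosis, is that toString() on the SDK result object produces a debug representation rather than JSON: keys and string values are unquoted, in addition to the trailing comma. The short Python sketch below, using abbreviated sample strings, shows that a JSON parser rejects such text for either reason. The `is_valid_json` helper is hypothetical.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Abbreviated stand-ins for the outputs shown above.
debug_style = "{InstanceTypes: [{InstanceType: r3.xlarge}],}"
trailing_comma_only = '{"InstanceTypes": [{"InstanceType": "r3.xlarge"}],}'
well_formed = '{"InstanceTypes": [{"InstanceType": "r3.xlarge"}]}'

for sample in (debug_style, trailing_comma_only, well_formed):
    print(is_valid_json(sample), sample)
```

If that reading is right, removing the comma alone would not make the Java string parseable, and serializing the result object with a JSON library would be the more reliable route.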

This is a pretty serious issue. I still have to check whether it occurs in newer versions.
Will this be fixed in the same version? I use this SDK version intensively.

[FEATURE] - Docker runtime requirement preventing automated deployment via CodeBuild/CodePipeline

Is your feature request related to a problem? Please describe.
The new version, which uses the CDK v2 PythonLayerVersion, requires a Docker runtime. This prevents me from automating the deployment of EMR resources via a deployment pipeline such as CodePipeline and CodeBuild. CodeBuild fails with the error:
"...Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
jsii.errors.JavaScriptError:
Error: docker exited with status 1..."

Describe the solution you'd like
Perhaps, using the manual bundling method used in the 1.x version.

Describe alternatives you've considered
None

Additional context
None

[QUERY] - EMR stuck at bootstrapping

Describe the bug
This may not be a bug because the root cause is not yet identified. This is more of a question in case this is a known problem/symptom.

I used the exact same package version and code successfully a few months ago (Jan 2023), but I have been encountering this problem in the last few weeks. The EMR launch workflow fails: the bootstrapping phase runs for about an hour before the cluster eventually fails with an internal error. The bootstrapping script itself has been ruled out; the failure most likely occurs in the step after bootstrapping. I tried different EMR versions (initially 6.6.0, then 6.8.0-6.9.0) and a different region, with the same result.
Has anyone else encountered similar behaviour?
Any suggestions on where or what to investigate?

To Reproduce
Steps to reproduce the behavior:
Deploy and launch the EMR with the workflow. Error message: "Status": {"State": "TERMINATING", "StateChangeReason": {"Message": "An internal error occurred.""

Expected behavior
Successful EMR launch.

Screenshots