aws / aws-emr-best-practices

A best practices guide for using AWS EMR. The guide will cover best practices on the topics of cost, performance, security, operational excellence, reliability and application specific best practices across Spark, Hive, Hudi, Hbase and more.


aws-emr-best-practices's Issues

Create Section in Spot Best Practices for Using AWS FIS Spot Interruptions Experiments

One can configure AWS FIS Spot Interruption experiments to interrupt up to 5 EC2 instances at once. The experiment should be tied to an EMR tag key:value pair.

This is powerful because it allows us to test EMR workflows on cheaper Spot Instances and to try various combinations of CORE+TASK nodes and settings. This improves confidence when promoting jobs with SLAs to production.

For instance, Hive/Tez jobs do not have the same built-in resiliency as Spark: they are not aware of the 2-minute interruption notice. During a large Spot interruption we may encounter more than 3 failed task attempts, causing the entire job to fail. The FIS Spot interruption experiment was handy for discovering the minimum value of tez.am.task.max.failed.attempts needed to outlast a large Spot interruption event.
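As a rough sketch, such an experiment template could be created with boto3 along these lines (the role ARN, tag key/value, and selection mode below are placeholders, not values from this experiment):

```python
# Hedged sketch only: creates an FIS experiment template that sends Spot
# interruption notices to instances carrying a chosen EMR cluster tag.
# Role ARN, tag key/value, and selection mode are placeholders.
import boto3

fis = boto3.client("fis")

template = fis.create_experiment_template(
    description="Interrupt Spot task nodes on the EMR cluster under test",
    roleArn="arn:aws:iam::111122223333:role/fis-spot-interruption-role",  # placeholder
    stopConditions=[{"source": "none"}],
    targets={
        "SpotTaskNodes": {
            "resourceType": "aws:ec2:spot-instance",
            # Match the tag propagated from the EMR cluster to its EC2 instances
            "resourceTags": {"fis-target": "emr-spot-test"},  # hypothetical tag
            "selectionMode": "PERCENT(100)",  # or COUNT(n) for a partial interruption
        }
    },
    actions={
        "interruptSpot": {
            "actionId": "aws:ec2:send-spot-instance-interruptions",
            # Targeted instances receive the standard 2-minute interruption notice
            "parameters": {"durationBeforeInterruption": "PT2M"},
            "targets": {"SpotInstances": "SpotTaskNodes"},
        }
    },
)
print(template["experimentTemplate"]["id"])
```

Once the tag is on the cluster's instances, the experiment can be kicked off with fis.start_experiment(experimentTemplateId=...).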

Experiment Setup:

  • Large Tez job: ~5k tasks, 1hr+ duration
  • tez.am.task.max.failed.attempts = 20
  • 10 R-family core nodes, On-Demand
  • 10 R-family task nodes, Spot
  • 66 simulated task interruptions over 1 hour
  • A few task interruptions of 100% of task nodes

Experiment Results

  • Job succeeded
  • 3,000 failed tasks
  • 5,000 succeeded tasks
  • A maximum of 7 failed Tez attempts

Conclusion

Setting tez.am.task.max.failed.attempts to 20 should be enough to outlast a very large interruption event.
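As a minimal sketch (assuming the standard EMR tez-site classification; the value 20 comes from the experiment above), the setting can be applied at cluster creation:

```python
# Minimal sketch, assuming the standard EMR "tez-site" classification.
# Pass this list as the Configurations parameter of boto3's run_job_flow,
# or as `aws emr create-cluster --configurations`.
TEZ_SPOT_RESILIENCE = [
    {
        "Classification": "tez-site",
        "Properties": {
            # Tolerate up to 20 failed attempts per task before failing the DAG
            "tez.am.task.max.failed.attempts": "20",
        },
    }
]
```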

(WIP) Update Spark Optimization Practices for latest releases

A few notable changes:

Managed Scaling

  • As of EMR 6.11.0/Hadoop 3.3.3, EMR scale-down is no longer Spark shuffle- or cache-aware with default settings
    • Set yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-applications = true (see the sketch below)
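A minimal sketch of applying that property, assuming the standard EMR yarn-site classification:

```python
# Minimal sketch, assuming the standard EMR "yarn-site" classification.
# Pass as the Configurations parameter at cluster creation.
YARN_SCALE_DOWN_CONFIG = [
    {
        "Classification": "yarn-site",
        "Properties": {
            # Make the decommissioning watcher wait for running applications
            "yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-applications": "true",
        },
    }
]
```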

Spot Instances

  • Spot Instances + Spark Caching + Intermediate Tables
    • A lost RDD block results in the entire cache being recomputed, which can be expensive. Without caching, lost Spark data can be recomputed incrementally, so it may be better not to cache.
    • Consider using only On-Demand Instances if caching is necessary.
    • Consider storing intermediate tables in HDFS or S3 instead; weigh the cost of reading/writing/storing intermediate data against recomputing it.
    • Caching causes Spark evaluation at that point in time, potentially losing improvements from the SQL optimizer that occur when the DAG is analyzed in its entirety.

AQE

  • Since EMR 6.6/Spark 3.2, default settings force AQE partition coalescing into legacy behavior, "to avoid performance regression when enabling adaptive query execution"

    • Enable the intended coalescing behavior by setting spark.sql.adaptive.coalescePartitions.parallelismFirst = false
    • With that set, the query plan should show "AQEShuffleRead coalesced" when it is working
  • Optimizing AQE

    • Set spark.sql.adaptive.coalescePartitions.initialPartitionNum to a large number, such as 10x what you might set spark.sql.shuffle.partitions to. This gives AQE initial partitions small enough to coalesce toward the advisoryPartitionSizeInBytes target.
    • Set spark.sql.adaptive.advisoryPartitionSizeInBytes by analyzing the resulting task memory pressure on the executors; consider increasing the value if memory is underutilized. (A configuration sketch follows this section.)
    • Optimization Example
      • Environment Setup
        • Instance Choice: r6.4xlarge
        • Core Units: 64 units
        • Task Units: 500 units
        • Spark Executor Memory: 32GB
        • Spark Executor Cores: 5
        • spark.sql.adaptive.coalescePartitions.initialPartitionNum: 100,000
        • Dataset: S3 - 523GB - 2,700 files - Orc+Snappy - 4,584,646,650 rows
        • Spark Query performs wide joins
    • Spark Shuffle = 10,000 and AQE Disabled
      • TODO
    • AQE Enabled and advisoryPartitionSizeInBytes=64MB
      • TODO
    • AQE Enabled and advisoryPartitionSizeInBytes=256MB
      • TODO
    • AQE Enabled and advisoryPartitionSizeInBytes=512MB
      • TODO
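As referenced above, a PySpark sketch of the AQE settings discussed in this issue; the partition counts and the 256 MB advisory size are illustrative values, not results from the TODO benchmarks:

```python
# Illustrative sketch of the AQE settings discussed above; the values are
# examples, not benchmarked recommendations.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-coalesce-example")
    .config("spark.sql.adaptive.enabled", "true")
    # Coalesce toward the advisory size instead of the legacy parallelism-first behavior
    .config("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")
    # Start with many small partitions so AQE has room to coalesce
    .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "100000")
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "256MB")
    .getOrCreate()
)

# After running a wide join, look for "AQEShuffleRead coalesced" in the plan:
# df.explain(mode="formatted")
```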

Live Docs not matching main branch

Hey Folks,

I was taking a read through the EMR best practices over at https://aws.github.io/aws-emr-best-practices/features/managed_scaling/best_practices/ and noticed a typo, "YARNMeoryAvailablePercentage". I came to submit a PR but noticed it had already been fixed.

Commit where changes were made: d3730bc

Yet the changes don't seem to be live; perhaps CI/CD is broken? I don't think it's a cache problem (I checked in an incognito window just in case), given the age of the commit.

Let me know if I can be of any use. All the best!

BP 2.4 has references for ELB best practices

The following statement and diagram look related to ELB, not EMR.

"With Instance Groups, you must explicitly set the subnet at provisioning time. You can still spread clusters across your AZs by picking through round robin or at random."

YARN customizations and their significance

In YARN there are a few properties governing memory allocation for the cluster that impact the overall memory given to Spark executors, regardless of the configuration provided to Spark.
E.g.: yarn.scheduler.maximum-allocation-mb, yarn.nodemanager.resource.memory-mb

We want to understand the significance and impact of these properties on Spark applications and how to fine-tune them for a given job.
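As a rough illustration of how these properties interact (the numbers below are examples only): a Spark executor container (executor memory plus overhead) must fit within yarn.scheduler.maximum-allocation-mb, which in turn cannot exceed yarn.nodemanager.resource.memory-mb.

```python
# Rough illustration of how the YARN properties above cap Spark executor
# sizing; numbers are examples, not recommendations.
EXECUTOR_MEMORY_MB = 32 * 1024                           # spark.executor.memory
OVERHEAD_MB = max(int(EXECUTOR_MEMORY_MB * 0.10), 384)   # default spark.executor.memoryOverhead

container_mb = EXECUTOR_MEMORY_MB + OVERHEAD_MB

# YARN will not schedule containers larger than these limits:
YARN_MAX_ALLOCATION_MB = 57344       # yarn.scheduler.maximum-allocation-mb (example)
YARN_NODEMANAGER_MEMORY_MB = 245760  # yarn.nodemanager.resource.memory-mb (example)

assert container_mb <= YARN_MAX_ALLOCATION_MB, \
    "executor container exceeds yarn.scheduler.maximum-allocation-mb"
assert YARN_MAX_ALLOCATION_MB <= YARN_NODEMANAGER_MEMORY_MB
print(f"executor container size: {container_mb} MB")
```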

Create Section on Maximizing HDFS Read/Write Throughput Cost Performance

There isn't currently a good guide that I could find anywhere for maximizing HDFS read/write throughput.

Example: 10TB Dataset that is copied to Local HDFS for local processing

Optimizing the instance selection for Cost

  • Many of the NVMe-backed instances are bottlenecked by network bandwidth for large data transfers, and their On-Demand costs are higher than equivalently sized gp2 EBS volumes. The C6gn family offers an ideal balance of network and EBS bandwidth for large datasets. Ganglia makes finding the bottlenecks easier, but it does not capture EBS bandwidth per volume, so you'll have to use the EBS metrics in CloudWatch.

Instance | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand $/hr
c6gn.2xlarge | Up to 25 | Up to 9.5 | $0.3456
c6gn.4xlarge | 25 | 9.5 | $0.6912
c6gn.8xlarge | 50 | 19 | $1.3824

  • Notice that moving from a 2xlarge to a 4xlarge doubles the cost but not the network bandwidth. Settling for the c6gn.2xlarge nets the best network-bandwidth-to-cost ratio.

Optimizing EBS Volume Count for Instance Types

The gp2 volume type carries the following note:
"The throughput limit is between 128 MiB/s and 250 MiB/s, depending on the volume size. Volumes smaller than or equal to 170 GiB deliver a maximum throughput of 128 MiB/s. Volumes larger than 170 GiB but smaller than 334 GiB deliver a maximum throughput of 250 MiB/s if burst credits are available. Volumes larger than or equal to 334 GiB deliver 250 MiB/s regardless of burst credits"

  • To maximize EC2 EBS throughput, the number of volumes is EC2 EBS bandwidth divided by per-volume gp2 throughput.

  • 9.5 Gbps ≈ 1,188 MB/s, and 1,188 / 250 ≈ 5, so ~5 gp2 volumes

  • So a cost-performant HDFS core node would be a c6gn.2xlarge with 5 gp2 volumes (see the sketch below).
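A back-of-the-envelope sketch of that calculation (9.5 Gbps is the published c6gn.2xlarge EBS bandwidth; 250 MiB/s is the gp2 per-volume ceiling quoted above):

```python
# Back-of-the-envelope sketch of the volume-count formula above.
EBS_BANDWIDTH_GBPS = 9.5          # c6gn.2xlarge EBS bandwidth (gigabits/s)
GP2_MAX_THROUGHPUT_MIB_S = 250    # per-volume gp2 ceiling (>= 334 GiB volumes)

ebs_bandwidth_mib_s = EBS_BANDWIDTH_GBPS * 1000**3 / 8 / 1024**2  # ~1132 MiB/s
volumes = ebs_bandwidth_mib_s / GP2_MAX_THROUGHPUT_MIB_S

print(f"{ebs_bandwidth_mib_s:.0f} MiB/s of instance EBS bandwidth -> ~{volumes:.1f} gp2 volumes")
# -> ~4.5, so round up to 5 volumes to saturate the instance's EBS bandwidth
```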

Create entry in BP 5.1.4 - Optimal Split Size

Tuning split sizes can greatly improve performance when reading from S3; local HDFS gets some benefit too.

ORC Specific Issue

  • PrestoDB/Trino, used by Athena, has a setting where, if the ORC stripe size is <8 MB, the entire data file is scanned.

  • The default ORC stripe size is 64 MB. After compression this can result in a <8 MB stripe. You can verify this in Athena by observing the Data Scanned value and noticing that these ORC files always result in a full data scan.

  • By increasing the stripe size, Athena will scan only the necessary columns in the ORC files, reducing cost and improving speed.

ORC+Parquet+(Any splittable file type)+S3

  • For very large tables we can see a substantial improvement in query response on S3-backed tables.

  • Take, for example, a table containing 10 TB of ORC+ZLIB compressed files. Assume the compressed stripe size is 6 MB and the stripes are perfectly distributed across the files; this gives (10 TB × 1024 × 1024) / 6 MB ≈ 1,747,627 ORC stripes. If this were a Spark job, the driver would spend minutes loading all of these splits into memory during planning, increasing cost and runtime. This can be verified by observing the driver time spent on S3 file threads with jstack.

  • Relevant settings
    • orc.stripe.size
    • parquet.block.size
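A hedged PySpark sketch of raising the ORC stripe size on write so that post-compression stripes stay above the ~8 MB threshold mentioned above; the 256 MB value and the S3 paths are illustrative assumptions, not recommendations from this issue:

```python
# Hedged sketch: write ORC with a larger stripe size. The 256 MB value and
# the S3 paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-stripe-size-example").getOrCreate()

# The ORC writer picks this up from the Hadoop configuration
spark.sparkContext._jsc.hadoopConfiguration().set("orc.stripe.size", str(256 * 1024 * 1024))

df = spark.read.orc("s3://my-bucket/input/")              # hypothetical input path
df.write.mode("overwrite").orc("s3://my-bucket/output/")  # hypothetical output path
```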

Create entry in BP 4.2.6 for Small Spark Task Size

Adjusting the Spark task size to something that can finish in less than 2 minutes reduces the impact of Spot loss, especially when no shuffle or cached data is involved. In a simulated test with FIS interrupting all EMR task nodes 3 times over, we saw just 3 failures out of 100k tasks with a ~10-second task size.
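An illustrative sketch of two common knobs for shrinking task size so each task finishes well inside the 2-minute interruption notice; the target values below are example settings, not the ones used in the test above:

```python
# Illustrative sketch only: example knobs for smaller Spark tasks.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("small-task-size-example")
    # Smaller input splits -> more, shorter scan tasks
    .config("spark.sql.files.maxPartitionBytes", "64MB")
    # More shuffle partitions -> shorter shuffle tasks
    .config("spark.sql.shuffle.partitions", "4000")
    .getOrCreate()
)
```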
