melodyyangaws / hive-emr-on-eks

This project forked from aws-samples/hive-emr-on-eks

License: MIT No Attribution

RDS as Hive metastore for EMR on EKS

This project is developed in Python CDKv2. It includes a few Spark examples that create external Hive tables on top of a sample dataset stored in S3. These jobs run with EMR on EKS.

The infrastructure deployment includes the following:

  • A new S3 bucket to store sample data and job code
  • An EKS cluster in a new VPC across 2 AZs
  • An RDS Aurora database (MySQL engine) in the same VPC
  • A small EMR on EC2 cluster in the same VPC
    • 1 master & 1 core node (m5.xlarge)
    • use master node to query the remote hive metastore database
  • An EMR virtual cluster in the same VPC
    • registered to emr namespace in EKS
    • EMR on EKS configuration is done
    • Connect to RDS and initialize metastore schema via schematool
  • A standalone Hive metastore service (HMS) in EKS
    • Helm Chart hive-metastore-chart is provided.
    • run in the same emr namespace
    • thrift server is provided for client connections
    • doesn't initialize/upgrade metastore schemas via schematool

Spark Examples

Key Artifacts

Deploy Infrastructure

The provisioning takes about 30 minutes to complete. Two ways to deploy:

  1. AWS CloudFormation template (CFN)
  2. AWS Cloud Development Kit (AWS CDK)


NOTE: The HMS Helm chart requires k8s >= 1.23, i.e. the EKS version must be 1.23+.
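
The version requirement above can be checked with a few lines of Python before deploying the chart. This is a minimal sketch, not part of this repo: the function name `meets_minimum_k8s_version` is hypothetical, and it assumes the version is available as a string such as `1.23` or a longer `v1.24...` string copied from your cluster.

```python
def meets_minimum_k8s_version(version, minimum=(1, 23)):
    """Return True if a Kubernetes version string is at least `minimum`."""
    parts = version.strip().lstrip("v").split(".")
    major = int(parts[0])
    # Keep only the leading digits of the minor part, so "23+" parses as 23.
    minor_digits = ""
    for ch in parts[1]:
        if not ch.isdigit():
            break
        minor_digits += ch
    return (major, int(minor_digits)) >= minimum


if __name__ == "__main__":
    # Sample version strings are illustrative only.
    for candidate in ("1.22", "1.23", "v1.24.7-example"):
        print(candidate, meets_minimum_k8s_version(candidate))
```

Comparing `(major, minor)` tuples avoids the string-comparison trap where `"1.9"` would sort after `"1.23"`.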

Install the following tools:

  1. AWS CLI. Configure the CLI by running aws configure.
  2. kubectl & jq

Alternatively, use AWS CloudShell for a quick start; it already includes all the necessary software.

CloudFormation Deployment

Region                   Launch Template
-----------------------  ---------------
US East (N. Virginia)    Deploy to AWS
  • To launch in a different AWS Region, check out the following customization section, or use the CDK deployment option.


You can customize the solution, for example deploy to a different AWS region:

export BUCKET_NAME_PREFIX=<my-bucket-name> # bucket where customized code will reside
export AWS_REGION=<your-region> # region used by the upload commands below
export SOLUTION_NAME=hive-emr-on-eks
export VERSION=v2.0.0 # version number for the customized code


# OPTIONAL: create the bucket where customized code will reside

# Upload deployment assets to the S3 bucket
aws s3 cp ./deployment/global-s3-assets/ s3://$BUCKET_NAME_PREFIX-$AWS_REGION/$SOLUTION_NAME/$VERSION/ --recursive --acl bucket-owner-full-control
aws s3 cp ./deployment/regional-s3-assets/ s3://$BUCKET_NAME_PREFIX-$AWS_REGION/$SOLUTION_NAME/$VERSION/ --recursive --acl bucket-owner-full-control

echo -e "\nIn web browser, paste the URL to launch the CFN template:$AWS_REGION#/stacks/quickcreate?stackName=HiveEMRonEKS&templateURL=https://$BUCKET_NAME_PREFIX-$$SOLUTION_NAME/$VERSION/HiveEMRonEKS.template\n"

CDK Deployment

Alternatively, deploy the infrastructure via CDK. The following tools must be installed first, as one-off tasks:

  1. Python 3.6+
  2. Nodejs 10.3.0+
  3. CDK toolkit
  4. Run the CDK bootstrap after the 'pip install' requirement step as below.
python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
# bootstrap the target account and region once, then deploy
cdk bootstrap
cdk deploy


Make sure AWS CLI, kubectl and jq are installed.

One-off setup:

  1. Set environment variables in .bash_profile and connect to EKS cluster.
curl | bash

source ~/.bash_profile

You can use Cloud9 or CloudShell if you don't want to install anything on your computer or change your bash_profile.

  2. [OPTIONAL] Build the HMS docker image and, if needed, replace the hive metastore docker image name in hive-metastore-chart/values.yaml with the new one:
cd docker
export DOCKERHUB_USERNAME=<your_dockerhub_name_OR_ECR_URL>
docker build -t $DOCKERHUB_USERNAME/hive-metastore:3.0.0 .
docker push $DOCKERHUB_USERNAME/hive-metastore:3.0.0
  3. Copy sample data to your S3 bucket:
aws s3 cp s3://amazon-reviews-pds/parquet/product_category=Toys/ s3://$S3BUCKET/app_code/data/toy --recursive

1.1 Connect Hive metastore via JDBC

import sys
from pyspark.sql import SparkSession
# the builder chain must end with getOrCreate() to return a session
spark = SparkSession \
    .builder \
    .config("spark.sql.warehouse.dir", sys.argv[1]+"/warehouse/" ) \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("SHOW DATABASES").show()
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`amazonreview`( `marketplace` string,`customer_id`string,`review_id` string,`product_id` string,`product_parent` string,`product_title` string,`star_rating` integer,`helpful_votes` integer,`total_votes` integer,`vine` string,`verified_purchase` string,`review_headline` string,`review_body` string,`review_date` date,`year` integer) STORED AS PARQUET LOCATION '"+sys.argv[1]+"/app_code/data/toy/'")
spark.sql("SELECT count(*) FROM demo.amazonreview").show()

1.2 Submit job to EMR on EKS

Run the script:

curl | bash


aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name spark-hive-via-jdbc \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
      "entryPoint": "s3://'$S3BUCKET'/app_code/job/",
      "sparkSubmitParameters": "--conf spark.jars.packages=mysql:mysql-connector-java:8.0.28 --conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2"}}' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.cj.jdbc.Driver",
          "spark.hadoop.javax.jdo.option.ConnectionUserName": "'$USER_NAME'",
          "spark.hadoop.javax.jdo.option.ConnectionPassword": "'$PASSWORD'",
          "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://'$HOST_NAME':3306/'$DB_NAME'?createDatabaseIfNotExist=true"
        }
      }
    ],
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'

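Hand-written JSON inside a shell-quoted string is easy to break with a missing brace or comma. As an alternative, the `--configuration-overrides` payload for the JDBC job above can be generated with Python's `json` module. This is a minimal sketch and not part of this repo: the helper names are hypothetical and the host, database, user, password, and bucket values in the usage part are placeholders.

```python
import json


def jdbc_metastore_properties(host, db_name, user, password):
    """Build the spark-defaults properties for a JDBC metastore connection."""
    prefix = "spark.hadoop.javax.jdo.option."
    return {
        prefix + "ConnectionDriverName": "com.mysql.cj.jdbc.Driver",
        prefix + "ConnectionUserName": user,
        prefix + "ConnectionPassword": password,
        prefix + "ConnectionURL": (
            f"jdbc:mysql://{host}:3306/{db_name}?createDatabaseIfNotExist=true"
        ),
    }


def build_configuration_overrides(properties, log_uri):
    """Return the --configuration-overrides payload as a JSON string."""
    overrides = {
        "applicationConfiguration": [
            {"classification": "spark-defaults", "properties": dict(properties)}
        ],
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": log_uri}
        },
    }
    return json.dumps(overrides)


if __name__ == "__main__":
    # Placeholder values; substitute your own environment settings.
    props = jdbc_metastore_properties(
        "example-host", "example-db", "example-user", "example-password"
    )
    print(build_configuration_overrides(
        props, "s3://example-bucket/elasticmapreduce/emr-containers"
    ))
```

The printed string can be passed as the value of `--configuration-overrides`. Because `json.dumps` produces the structure, the brackets and commas are always balanced.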

2.1 Connect Hive metastore via thrift service hosted on EMR on EC2

from os import environ
import sys
from pyspark.sql import SparkSession
# sys.argv[2] is the DNS name of the EMR master node that hosts the thrift service
spark = SparkSession \
    .builder \
    .config("spark.sql.warehouse.dir", sys.argv[1]+"/warehouse/" ) \
    .config("hive.metastore.uris","thrift://"+sys.argv[2]+":9083") \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("SHOW DATABASES").show()
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`amazonreview2`( `marketplace` string,`customer_id`string,`review_id` string,`product_id` string,`product_parent` string,`product_title` string,`star_rating` integer,`helpful_votes` integer,`total_votes` integer,`vine` string,`verified_purchase` string,`review_headline` string,`review_body` string,`review_date` date,`year` integer) STORED AS PARQUET LOCATION '"+sys.argv[1]+"/app_code/data/toy/'")
spark.sql("SELECT count(*) FROM demo.amazonreview2").show()

2.2 Submit job to EMR on EKS

Run the script:

curl | bash


export EMR_MASTER_DNS_NAME=$(aws ec2 describe-instances --filter Name=tag:project,Values=HiveEMRonEKS Name=tag:aws:elasticmapreduce:instance-group-role,Values=MASTER --query Reservations[].Instances[].PrivateDnsName --output text | xargs) 

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name spark-hive-via-thrift \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
      "entryPoint": "s3://'$S3BUCKET'/app_code/job/",
      "sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2"}}' \
--configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'


3.1 Connect Hive metastore via thrift service hosted on EKS

from os import environ
import sys
from pyspark.sql import SparkSession

# the metastore host is read from the environment of the driver pod
spark = SparkSession \
    .builder \
    .config("spark.sql.warehouse.dir", sys.argv[1]+"/warehouse/" ) \
    .config("hive.metastore.uris","thrift://"+environ['HIVE_METASTORE_SERVICE_HOST']+":9083") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("DROP TABLE IF EXISTS demo.amazonreview3")
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`amazonreview3`( `marketplace` string,`customer_id`string,`review_id` string,`product_id` string,`product_parent` string,`product_title` string,`star_rating` integer,`helpful_votes` integer,`total_votes` integer,`vine` string,`verified_purchase` string,`review_headline` string,`review_body` string,`review_date` date,`year` integer) STORED AS PARQUET LOCATION '"+sys.argv[1]+"/app_code/data/toy/'")
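
The job above reads the metastore host from `HIVE_METASTORE_SERVICE_HOST`. Kubernetes injects variables of this form into pods for Services that already exist in the same namespace, so this name presumably corresponds to a Service called `hive-metastore` in the emr namespace (an assumption based on the thrift address used elsewhere in this README). A bare `environ[...]` lookup raises a `KeyError` if the Service was created after the pod started. The sketch below, with a hypothetical helper name, fails with a clearer message instead.

```python
import os


def metastore_uri(env=None, port=9083):
    """Build the thrift URI of the Hive metastore from the pod environment."""
    env = os.environ if env is None else env
    host = env.get("HIVE_METASTORE_SERVICE_HOST")
    if not host:
        raise RuntimeError(
            "HIVE_METASTORE_SERVICE_HOST is not set; check that the metastore "
            "Service exists in the same namespace as the Spark driver pod"
        )
    return f"thrift://{host}:{port}"


if __name__ == "__main__":
    # The IP address below is a made-up example value.
    print(metastore_uri({"HIVE_METASTORE_SERVICE_HOST": "10.100.0.12"}))
```

The returned string can be passed to `.config("hive.metastore.uris", ...)` in place of the inline concatenation.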

3.2 Submit job to EMR on EKS

Run the script:

curl | bash


aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name spark-hive-via-thrift \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
      "entryPoint": "s3://'$S3BUCKET'/app_code/job/",
      "sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2"}}' \
--configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'


4.1 Run the thrift service as a sidecar in Spark Driver's pod

Prerequisite. NOTE: This repo's CFN/CDK template installs the following by default.

# does it exist?
kubectl get pod -n kube-system

If the kubernetes-external-secrets controller doesn't exist in your EKS cluster, replace the placeholders YOUR_REGION and YOUR_IAM_ROLE_ARN_TO_GET_SECRETS_FROM_SM in the command, then run the installation. Refer to the IAM permissions used by CDK to create your IAM role.

helm repo add external-secrets
helm install external-secret external-secrets/kubernetes-external-secrets -n kube-system  --set AWS_REGION=YOUR_REGION --set securityContext.fsGroup=65534 --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"='YOUR_IAM_ROLE_ARN_TO_GET_SECRETS_FROM_SM' --debug
    1. Two sidecar config maps should exist in EKS. They point to the metastore-site.xml and core-site.xml templates used to configure the standalone HMS. The sidecar termination script is copied from the EMR documentation to work around the well-known sidecar lifecycle issue in Kubernetes.
kubectl get configmap sidecar-hms-conf-templates sidecar-terminate-script -n emr

If they don't exist, run the command to create the configs:

# get remote metastore RDS secret name
secret_name=$(aws secretsmanager list-secrets --query 'SecretList[?starts_with(Name,`RDSAuroraSecret`) == `true`].Name' --output text)
# download the config and apply to EKS
curl | sed 's/{SECRET_MANAGER_NAME}/'$secret_name'/g' | kubectl apply -f -
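
The two shell steps above select the secret whose name starts with `RDSAuroraSecret` and substitute it for the `{SECRET_MANAGER_NAME}` placeholder in the config template. The same logic is sketched below in Python. The function names, sample secret names, and sample template text are hypothetical. Unlike the shell pipeline, this sketch stops with an error when zero or several secrets match, rather than substituting an empty or combined value.

```python
def pick_secret_name(secret_names, prefix="RDSAuroraSecret"):
    """Return the single secret name that starts with `prefix`."""
    matches = [name for name in secret_names if name.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one secret starting with {prefix!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


def render_config(template_text, secret_name, placeholder="{SECRET_MANAGER_NAME}"):
    """Replace every occurrence of the placeholder with the secret name."""
    if placeholder not in template_text:
        raise ValueError(f"placeholder {placeholder} not found in template")
    return template_text.replace(placeholder, secret_name)


if __name__ == "__main__":
    # Made-up secret names and template line, for illustration only.
    names = ["other-secret", "RDSAuroraSecret-example"]
    template = "key: {SECRET_MANAGER_NAME}\n"
    print(render_config(template, pick_secret_name(names)), end="")
```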

import sys
from pyspark.sql import SparkSession

# no metastore URI is set here; it is supplied at job submission
spark = SparkSession \
    .builder \
    .config("spark.sql.warehouse.dir", sys.argv[1]+"/warehouse/" ) \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("DROP TABLE IF EXISTS demo.amazonreview4")
spark.sql("CREATE EXTERNAL TABLE `demo`.`amazonreview4`( `marketplace` string,`customer_id`string,`review_id` string,`product_id` string,`product_parent` string,`product_title` string,`star_rating` integer,`helpful_votes` integer,`total_votes` integer,`vine` string,`verified_purchase` string,`review_headline` string,`review_body` string,`review_date` date,`year` integer) STORED AS PARQUET LOCATION '"+sys.argv[1]+"/app_code/data/toy/'")

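# read the SQL script from S3, one row per line
sql_scripts=spark.read.text(sys.argv[1]+"/app_code/job/set-of-hive-queries.sql").collect()
cmd_str=' '.join([x[0] for x in sql_scripts]).split(';')
# run each non-empty statement in order
for query in cmd_str:
    if (query.strip() != ""):
        spark.sql(query).show()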
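
The job above joins the script's lines with spaces and splits the result on semicolons. That approach has a pitfall: once lines are joined, a full-line `--` comment ends up on the same line as the statement that follows it and comments that statement out. The sketch below shows the same splitting logic without Spark, and adds a step that drops full-line comments first. The function name is hypothetical and the comment handling is a suggested variation, not this repo's code. Semicolons inside string literals are still not handled.

```python
def split_sql_statements(lines):
    """Join script lines and split them into individual SQL statements."""
    kept = []
    for line in lines:
        stripped = line.strip()
        # Drop blank lines and full-line "--" comments before joining; once
        # lines are joined with spaces, a "--" comment would otherwise
        # swallow the statement that follows it.
        if not stripped or stripped.startswith("--"):
            continue
        kept.append(stripped)
    statements = " ".join(kept).split(";")
    # Remove the empty piece left after the final semicolon.
    return [s.strip() for s in statements if s.strip()]


if __name__ == "__main__":
    script = [
        "CREATE DATABASE hiveonspark;",
        "USE hiveonspark;",
        "",
        "--create hive managed table",
        "SELECT * FROM testtable",
        "WHERE key=238;",
    ]
    for statement in split_sql_statements(script):
        print(statement)
```

Each returned statement could then be passed to `spark.sql(...)` in a loop, as in the job script.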

4.2 Submit job to EMR on EKS

Assign the sidecar pod template to Spark Driver. Run the script:

curl | bash


# test HMS sidecar on EKS
aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name sidecar-hms \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
      "entryPoint": "s3://'$S3BUCKET'/app_code/job/",
      "sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2"}}' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.kubernetes.driver.podTemplateFile": "s3://'$S3BUCKET'/app_code/job/sidecar_hms_pod_template.yaml",
          "spark.hive.metastore.uris": "thrift://localhost:9083"
        }
      }
    ],
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'


5. Hudi + Remote Hive metastore integration

Note: the latest Hudi-spark3-bundle jar is needed to support the HMS hive sync mode. The jar will be included from EMR 6.5+.

Run the submission script:

curl | bash


aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name hudi-test1 \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
      "entryPoint": "s3://'$S3BUCKET'/app_code/job/",
      "sparkSubmitParameters": "--jars --conf spark.executor.cores=1 --conf spark.executor.instances=2"}}' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
          "spark.sql.hive.convertMetastoreParquet": "false",
          "spark.hive.metastore.uris": "thrift://localhost:9083",
          "spark.kubernetes.driver.podTemplateFile": "s3://'$S3BUCKET'/app_code/job/sidecar_hms_pod_template.yaml"
        }
      }
    ],
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'


6. Hudi + Glue Catalog Integration

Note: make sure the database default exists in your Glue catalog.

Run the submission script:

curl | bash


aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name hudi-test1 \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
      "entryPoint": "s3://'$S3BUCKET'/app_code/job/",
      "sparkSubmitParameters": "--jars --conf spark.executor.cores=1 --conf spark.executor.instances=2"}}' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
          "spark.sql.hive.convertMetastoreParquet": "false",
          "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
      }
    ],
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'


7. Run Hive SQL with EMR on EKS

We can run a multi-line Hive SQL script using the Spark execution engine. From EMR 6.7 onwards, EMR on EKS can run Spark SQL with a .sql file as the entrypoint script in the StartJobRun API. Make sure your AWS CLI version is 2.7.31+ or 1.25.70+.

See the full version of the sample Hive SQL script. Code snippet:

CREATE DATABASE hiveonspark;
USE hiveonspark;

--create hive managed table
CREATE TABLE IF NOT EXISTS testtable (`key` INT, `value` STRING) using hive;
LOAD DATA LOCAL INPATH '/usr/lib/spark/examples/src/main/resources/kv1.txt' OVERWRITE INTO TABLE testtable;
SELECT * FROM testtable WHERE key=238;

Run the submission script:

curl | bash

OR run the following:

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name sparksql-test \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.8.0-latest \
--job-driver '{
  "sparkSqlJobDriver": {
      "entryPoint": "s3://'$S3BUCKET'/app_code/job/set-of-hive-queries.sql",
      "sparkSqlParameters": "-hivevar S3Bucket='$S3BUCKET' -hivevar Key_ID=238"}}' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.sql.warehouse.dir": "s3://'$S3BUCKET'/warehouse/",
          "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
      }
    ],
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'

In the spark-defaults config, we use the Glue catalog as the Hive metastore for a serverless design, so the table can be queried in Athena. Alternatively, we can replace that config with a standalone HMS setting, "spark.hive.metastore.uris": "thrift://hive-metastore:9083", which points to the HMS running as a k8s pod in the emr namespace. That HMS is backed by the remote RDS Hive metastore database created by this project.

NOTE: to directly submit Hive scripts to EMR on EKS, replace the following 2 attributes in the job submission script:

  • change from sparkSubmitJobDriver to sparkSqlJobDriver
  • change from sparkSubmitParameters to sparkSqlParameters
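
The two renames in the note above can be applied to a `--job-driver` payload programmatically. The sketch below is only an illustration with a hypothetical function name and placeholder values. It copies the entry point and parameters unchanged, so the entry point is assumed to already be a .sql file.

```python
import json


def to_sql_job_driver(job_driver):
    """Convert a sparkSubmitJobDriver payload into a sparkSqlJobDriver one."""
    submit = job_driver["sparkSubmitJobDriver"]
    sql_driver = {"entryPoint": submit["entryPoint"]}
    # sparkSubmitParameters becomes sparkSqlParameters; the value is kept as-is.
    if "sparkSubmitParameters" in submit:
        sql_driver["sparkSqlParameters"] = submit["sparkSubmitParameters"]
    return {"sparkSqlJobDriver": sql_driver}


if __name__ == "__main__":
    # Placeholder bucket name and parameters.
    original = {
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://example-bucket/app_code/job/set-of-hive-queries.sql",
            "sparkSubmitParameters": "--conf spark.executor.cores=2",
        }
    }
    print(json.dumps(to_sql_job_driver(original), indent=2))
```

The function builds a new dictionary rather than editing the input, so the original payload stays available for regular spark-submit jobs.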


Verify the job is running in EKS

kubectl get po -n emr
kubectl logs -n emr -c spark-kubernetes-driver <YOUR-DRIVER-POD-NAME>

You will see the count result in the driver log: Total records on S3:

| 4981601|

Validate HMS and hive tables on EMR master node

  1. Hive metastore login info:
echo -e "\n host: $HOST_NAME\n DB: $DB_NAME\n password: $PASSWORD\n username: $USER_NAME\n"
  2. Find the EMR master node EC2 instance:
aws ec2 describe-instances --filter Name=tag:project,Values=$stack_name Name=tag:aws:elasticmapreduce:instance-group-role,Values=MASTER --query Reservations[].Instances[].InstanceId
  3. Go to the EC2 console and connect to the instance via Session Manager, without an SSH key.
  4. Check the remote hive metastore in mysqlDB:
mysql -u admin -P 3306 -p -h <YOUR_HOST_NAME>
Enter password:<YOUR_PASSWORD>

# Query in the metastore
MySQL[(none)]> Use HiveEMRonEKS;
MySQL[HiveEMRonEKS]> select * from DBS;
MySQL[HiveEMRonEKS]> select * from TBLS;
  5. Query Hive tables:
sudo su
hive> use demo;
hive> select count(*) from amazonreview2;
Launching Job 1 out of 1
Time taken: 23.742 seconds, Fetched: 1 row(s)

Get logs from S3


Useful commands

  • kubectl get pod -n emr: list running Spark jobs
  • kubectl delete pod --all -n emr: delete all Spark jobs
  • kubectl logs -n emr -c spark-kubernetes-driver YOUR-DRIVER-POD-NAME: job logs in real time
  • kubectl get node: check EKS compute capacity types and AZ distribution

Clean up

Run the clean-up script with:

curl | bash

Go to the CloudFormation console, manually delete the remaining resources if needed.
