
aws-samples / emr-bootstrap-actions


This repository holds the Amazon Elastic MapReduce sample bootstrap actions.

License: Other

Shell 45.87% Ruby 45.23% Scala 8.90%

emr-bootstrap-actions's Introduction

EMR bootstrap actions

Warning: This repository is being updated and modernized; please bear with us.

A bootstrap action is a shell script stored in Amazon S3 that Amazon EMR executes on every node of your cluster after boot and prior to application provisioning. Bootstrap actions execute as the hadoop user by default; commands can be executed with root privileges if you use sudo.

From the AWS CLI EMR create-cluster command you can reference a bootstrap action as follows:

--bootstrap-actions Name=action-name,Path=s3://myawsbucket/FileName,Args=arg1,arg2
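For context, a complete cluster launch using a bootstrap action might look like the following. This is only a minimal sketch: the bucket, script name, and key name are placeholders, not files that ship with this repository.

# hypothetical example: launch a small cluster that runs one custom bootstrap action on every node
aws emr create-cluster \
    --name "bootstrap-action-example" \
    --ami-version 3.8.0 \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=my-key \
    --bootstrap-actions Name=MySetup,Path=s3://myawsbucket/my-setup.sh,Args=[arg1,arg2]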

For more information about EMR bootstrap actions, see the Amazon EMR Developer Guide.

The code samples in this repository are meant to illustrate how to set up popular applications on Amazon EMR using bootstrap actions. They are not meant to be run in production, and all users should carefully inspect the code samples before running them.

Use at your own risk.

emr-bootstrap-actions's People

Contributors

adamv, alzarei, andrewbulin, antswart, chayel, christopherbozeman, dacort, danosipov, deepbeliefnet, ejono, erikselin, hvivani, hys9958, ianmeyers, kobuskc, mandusm, migtor, mombergm, mr0re1, norvellj, pankit, pashields, pmogren, russell-datascience, schmidb, stmcpherson, tnachen, uprush

emr-bootstrap-actions's Issues

Spark on YARN not working : Examples

Hi,

I am trying to run a Spark on YARN program, provided by Spark in the examples directory, that uses Amazon Kinesis, on an EMR cluster.
I've set up the credentials:
export AWS_ACCESS_KEY_ID=<ACCESS_KEY_ID>
export AWS_SECRET_KEY=<ACCESS_SECRET_KEY>

A) This is the Kinesis Word Count Producer, which ran successfully:

run-example org.apache.spark.examples.streaming.KinesisWordCountProducerASL mySparkStream https://kinesis.us-east-1.amazonaws.com 1 5

Sample Logs :
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Putting records onto stream mySparkStream and endpoint https://kinesis.us-east-1.amazonaws.com at a rate of 1 records per second and 5 words per record
Sent 1 records
Sent 1 records
Sent 1 records
Sent 1 records
Sent 1 records
Totals
(0,6)
(1,2)
(2,3)
(3,2)
(4,2)
(5,4)
(6,1)
(7,1)
(8,1)
(9,3)

B) This is the normal consumer using Spark Streaming, which is also working:

run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASL mySparkStream https://kinesis.us-east-1.amazonaws.com

15/03/25 11:52:24 INFO storage.BlockManagerMaster: Updated info of block broadcast_30_piece0
15/03/25 11:52:24 INFO storage.BlockManager: Removing block broadcast_30
15/03/25 11:52:24 INFO storage.MemoryStore: Block broadcast_30 of size 2296 dropped from memory (free 278229292)
15/03/25 11:52:24 INFO storage.BlockManagerInfo: Added broadcast_48_piece0 in memory on localhost:52341 (size: 1447.0 B, free: 265.4 MB)
15/03/25 11:52:24 INFO spark.ContextCleaner: Cleaned broadcast 30
15/03/25 11:52:24 INFO storage.BlockManagerMaster: Updated info of block broadcast_48_piece0
15/03/25 11:52:24 INFO spark.ContextCleaner: Cleaned shuffle 14
15/03/25 11:52:24 INFO spark.ContextCleaner: Cleaned shuffle 13
15/03/25 11:52:24 INFO spark.SparkContext: Created broadcast 48 from broadcast at DAGScheduler.scala:839
15/03/25 11:52:24 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 92 (ShuffledRDD[92] at reduceByKey at JavaKinesisWordCountASL.java:159)
15/03/25 11:52:24 INFO scheduler.TaskSchedulerImpl: Adding task set 92.0 with 1 tasks
15/03/25 11:52:24 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 92.0 (TID 50, localhost, PROCESS_LOCAL, 1133 bytes)
15/03/25 11:52:24 INFO executor.Executor: Running task 0.0 in stage 92.0 (TID 50)
15/03/25 11:52:24 INFO storage.ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
15/03/25 11:52:24 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/03/25 11:52:24 INFO executor.Executor: Finished task 0.0 in stage 92.0 (TID 50). 1068 bytes result sent to driver
15/03/25 11:52:24 INFO scheduler.DAGScheduler: Stage 92 (print at JavaKinesisWordCountASL.java:173) finished in 0.020 s

15/03/25 11:52:24 INFO scheduler.DAGScheduler: Job 46 finished: print at JavaKinesisWordCountASL.java:173, took 0.042294 s

Time: 1427284344000 ms

(4,1)
(0,2)
(2,1)
(7,1)
(5,3)
(9,2)

C) And this is the YARN-based program, which is not working:

run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASLYARN mySparkStream https://kinesis.us-east-1.amazonaws.com

Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/25 11:52:45 INFO spark.SparkContext: Running Spark version 1.3.0
15/03/25 11:52:45 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar').
This is deprecated in Spark 1.0+.

Please instead use:

  • ./spark-submit with --driver-class-path to augment the driver classpath
  • spark.executor.extraClassPath to augment the executor classpath

15/03/25 11:52:45 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' to '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar' as a work-around.
15/03/25 11:52:45 WARN spark.SparkConf: Setting 'spark.driver.extraClassPath' to '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/
:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar' as a work-around.
15/03/25 11:52:46 INFO spark.SecurityManager: Changing view acls to: hadoop
15/03/25 11:52:46 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/03/25 11:52:46 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/03/25 11:52:47 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/03/25 11:52:48 INFO Remoting: Starting remoting
15/03/25 11:52:48 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:59504]
15/03/25 11:52:48 INFO util.Utils: Successfully started service 'sparkDriver' on port 59504.
15/03/25 11:52:48 INFO spark.SparkEnv: Registering MapOutputTracker
15/03/25 11:52:48 INFO spark.SparkEnv: Registering BlockManagerMaster
15/03/25 11:52:48 INFO storage.DiskBlockManager: Created local directory at /mnt/spark/spark-120befbc-6dae-4751-b41f-dbf7b3d97616/blockmgr-d339d180-36f5-465f-bda3-cecccb23b1d3
15/03/25 11:52:48 INFO storage.MemoryStore: MemoryStore started with capacity 265.4 MB
15/03/25 11:52:48 INFO spark.HttpFileServer: HTTP File server directory is /mnt/spark/spark-85e88478-3dad-4fcf-a43a-efd15166bef3/httpd-6115870a-0d90-44df-aa7c-a6bd1a47e107
15/03/25 11:52:48 INFO spark.HttpServer: Starting HTTP Server
15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/25 11:52:49 INFO server.AbstractConnector: Started [email protected]:44879
15/03/25 11:52:49 INFO util.Utils: Successfully started service 'HTTP file server' on port 44879.
15/03/25 11:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/25 11:52:49 INFO server.AbstractConnector: Started [email protected]:4040
15/03/25 11:52:49 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/03/25 11:52:49 INFO ui.SparkUI: Started SparkUI at http://ip-10-80-175-92.ec2.internal:4040
15/03/25 11:52:50 INFO spark.SparkContext: Added JAR file:/home/hadoop/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar at http://10.80.175.92:44879/jars/spark-examples-1.3.0-hadoop2.4.0.jar with timestamp 1427284370358
15/03/25 11:52:50 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler
15/03/25 11:52:51 ERROR cluster.YarnClusterSchedulerBackend: Application ID is not set.
15/03/25 11:52:51 INFO netty.NettyBlockTransferService: Server created on 49982
15/03/25 11:52:51 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/03/25 11:52:51 INFO storage.BlockManagerMasterActor: Registering block manager ip-10-80-175-92.ec2.internal:49982 with 265.4 MB RAM, BlockManagerId(, ip-10-80-175-92.ec2.internal, 49982)
15/03/25 11:52:51 INFO storage.BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.deploy.yarn.ApplicationMaster$.sparkContextInitialized(ApplicationMaster.scala:581)
at org.apache.spark.scheduler.cluster.YarnClusterScheduler.postStartHook(YarnClusterScheduler.scala:32)
at org.apache.spark.SparkContext.(SparkContext.scala:541)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:642)
at org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:75)
at org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:132)
at org.apache.spark.examples.streaming.JavaKinesisWordCountASLYARN.main(JavaKinesisWordCountASLYARN.java:127)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

spark-submit fails via script-runner.jar (custom Spark build)

Hi

I would like to run a custom Spark build (1.3.1). What I did:

  1. Built Spark 1.3.1 from source, following your example page (Scala 2.10).
  2. Used a custom config.file and launched with -c.

Funny enough, running ./bin/spark-submit directly from the command line works fine.

However, when going through Amazon's script-runner.jar (i.e.: hadoop jar /path/to/script-runner.jar /home/hadoop/spark/bin/spark-submit ... etc), I receive the following error:

Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object;
at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:295)
at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:288)
at org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:87)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:105)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'

Is this something you have seen before?

Thanks,
Eric

Presto BA can't process --binary option since presto-cli binary is also needed in the support.elasticmapreduce bucket

Summary

Presto BA can't properly process the --binary option to install a user-compiled Presto binary, because the script tries to download presto-cli-executable.jar from s3://support.elasticmapreduce/bootstrap-actions/presto/user-compiled/ when the --binary option is specified.

Environment

  • AMI 3.5.0
  • Hive latest
  • EMR cluster is in public subnet
  • EMR cluster has default EMR_DefaultRole
  • EC2 instances have default EMR_EC2_DefaultRole

Command

aws emr create-cluster --ami-version 3.5.0 \
    --name "(AMI 3.5.0 Hive + Presto)" \
    --service-role EMR_DefaultRole \
    --tags Name=production-emr environment=production \
    --ec2-attributes KeyName=[key-name],InstanceProfile=EMR_EC2_DefaultRole,SubnetId=[public-subnet] \
    --applications Name=hive \
    --instance-groups \
        InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.xlarge\
        InstanceGroupType=CORE,InstanceCount=2,InstanceType=r3.xlarge\
    --bootstrap-action \
        Name="install Presto",Path="s3://mybucket/libs/install-presto.rb",Args="[-p,8989,-m,10240,-n,10240,-b,s3://mybucket/libs/presto-server-0.100.tar.gz]",\
        Name="Install Hive Site Configuration",Path="s3://elasticmapreduce/libs/hive/hive-script",Args=["--base-path","s3://elasticmapreduce/libs/hive","--install-hive-site","--hive-site=s3://mybucket/libs/hive-site.xml"] \
    --log-uri "s3://mybucket/logs/" \
    --no-auto-terminate \
    --visible-to-all-users

Fix?

I think a previous version of the BA had an option to specify the CLI binary path; if you are OK with adding that option, I will send a PR for this. Any ideas?

https://github.com/awslabs/emr-bootstrap-actions/blob/master/presto/latest/install-presto#L349-L360
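Until such an option exists, one possible stop-gap (a sketch only, not a tested fix, and the mybucket paths are placeholders) is a thin wrapper bootstrap action that rewrites the hard-coded CLI jar location to a presto-cli jar you host yourself before running the installer:

#!/bin/bash
# Hypothetical wrapper around install-presto.rb: point the hard-coded
# presto-cli-executable.jar location at a jar in your own bucket, then run the installer.
set -e
aws s3 cp s3://mybucket/libs/install-presto.rb /tmp/install-presto.rb
sed -i 's|s3://support.elasticmapreduce/bootstrap-actions/presto/user-compiled/presto-cli-executable.jar|s3://mybucket/libs/presto-cli-executable.jar|' /tmp/install-presto.rb
chmod +x /tmp/install-presto.rb
/tmp/install-presto.rb -p 8989 -m 10240 -n 10240 -b s3://mybucket/libs/presto-server-0.100.tar.gz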

Execution Log

2015-04-26T05:13:05.102Z INFO Fetching file 's3://mybucket/libs/install-presto.rb'
2015-04-26T05:13:06.071Z INFO startExec '/mnt/var/lib/bootstrap-actions/1/install-presto.rb -p 8989 -m 10240 -n 10240 -b s3://mybucket/libs/presto-server-0.100.tar.gz '
2015-04-26T05:13:06.076Z INFO Environment:
  TERM=linux
  HADOOP_PREFIX=/home/hadoop
  CONSOLETYPE=serial
  JAVA_HOME=/usr/java/latest
  PIG_CONF_DIR=/home/hadoop/pig/conf
  HBASE_HOME=/home/hadoop/hbase
  HADOOP_YARN_HOME=/home/hadoop
  HIVE_HOME=/home/hadoop/hive
  YARN_HOME=/home/hadoop
  MAIL=/var/spool/mail/hadoop
  IMPALA_CONF_DIR=/home/hadoop/impala/conf
  PWD=/
  HOSTNAME=ip-10-0-0-236.ap-northeast-1.compute.internal
  LESS_TERMCAP_mb=[01;31m
  LESS_TERMCAP_me=[0m
  NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
  LESS_TERMCAP_md=[01;38;5;208m
  AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
  HISTSIZE=1000
  HADOOP_COMMON_HOME=/home/hadoop
  PATH=/home/hadoop/pig/bin:/usr/local/cuda/bin:/usr/java/latest/bin:/home/hadoop/bin:/home/hadoop/mahout/bin:/home/hadoop/hive/bin:/home/hadoop/hbase/bin:/home/hadoop/impala/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin:/home/hadoop/cascading/tools/multitool-20140224/bin:/home/hadoop/cascading/tools/load-20140223/bin:/home/hadoop/cascading/tools/lingual-client/bin:/home/hadoop/cascading/driven/bin
  HIVE_CONF_DIR=/home/hadoop/hive/conf
  AWS_DEFAULT_REGION=ap-northeast-1
  HADOOP_CONF_DIR=/home/hadoop/conf
  IMPALA_HOME=/home/hadoop/impala
  SHLVL=5
  LANGSH_SOURCED=1
  XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
  AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
  UPSTART_JOB=rc
  HADOOP_HOME_WARN_SUPPRESS=true
  EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
  AWS_RDS_HOME=/opt/aws/apitools/rds
  PIG_CLASSPATH=/home/hadoop/pig/lib
  LESS_TERMCAP_se=[0m
  MAHOUT_CONF_DIR=/home/hadoop/mahout/conf
  LOGNAME=hadoop
  UPSTART_INSTANCE=
  HBASE_CONF_DIR=/home/hadoop/hbase/conf
  YARN_CONF_DIR=/home/hadoop/conf
  AWS_PATH=/opt/aws
  _=/usr/java/latest/bin/java
  HADOOP_HOME=/home/hadoop
  runlevel=3
  LD_LIBRARY_PATH=/home/hadoop/lib/native:/usr/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib:
  UPSTART_EVENTS=runlevel
  MAHOUT_LOG_DIR=/mnt/var/log/apps
  previous=N
  EC2_HOME=/opt/aws/apitools/ec2
  PIG_HOME=/home/hadoop/pig
  LESS_TERMCAP_ue=[0m
  AWS_ELB_HOME=/opt/aws/apitools/elb
  RUNLEVEL=3
  USER=hadoop
  RUBYOPT=rubygems
  PREVLEVEL=N
  HADOOP_HDFS_HOME=/home/hadoop
  HOME=/home/hadoop
  HISTCONTROL=ignoredups
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  MAHOUT_HOME=/home/hadoop/mahout
  LANG=en_US.UTF-8
  LESS_TERMCAP_us=[04;38;5;111m
  HADOOP_MAPRED_HOME=/home/hadoop
2015-04-26T05:13:06.076Z INFO redirectOutput to /mnt/var/log/bootstrap-actions/1/stdout
2015-04-26T05:13:06.077Z INFO redirectError to /mnt/var/log/bootstrap-actions/1/stderr
2015-04-26T05:13:06.077Z INFO Working dir /mnt/var/lib/bootstrap-actions/1
2015-04-26T05:13:06.078Z INFO ProcessRunner started child process : /mnt/var/lib/bootstrap-actions/1/install-presto...
2015-04-26T05:13:06.078Z INFO Synchronously wait child process to complete : /mnt/var/lib/bootstrap-actions/1/install-presto...
2015-04-26T05:14:08.084Z INFO Process still running
2015-04-26T05:14:39.230Z INFO waitProcessCompletion ended with exit code 1 : /mnt/var/lib/bootstrap-actions/1/install-presto...
2015-04-26T05:14:39.230Z ERROR Execution failed with code '1'

Error Log

Warning: RPMDB altered outside of yum.
15/04/26 05:13:27 INFO fs.HadoopConfigurationAWSCredentialsProvider: Couldn't extract aws credentials from either uri s3://mybucket/libs/presto-server-0.100.tar.gz or hadoop configuration.
15/04/26 05:13:27 INFO fs.EmrFileSystem: Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
15/04/26 05:13:29 INFO s3n.S3NativeFileSystem: Opening 's3://mybucket/libs/presto-server-0.100.tar.gz' for reading
15/04/26 05:14:09 INFO fs.HadoopConfigurationAWSCredentialsProvider: Couldn't extract aws credentials from either uri s3://support.elasticmapreduce/bootstrap-actions/presto/hadoop-lzo-0.4.19.jar or hadoop configuration.
15/04/26 05:14:09 INFO fs.EmrFileSystem: Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
15/04/26 05:14:11 INFO s3n.S3NativeFileSystem: Opening 's3://support.elasticmapreduce/bootstrap-actions/presto/hadoop-lzo-0.4.19.jar' for reading
15/04/26 05:14:26 INFO fs.HadoopConfigurationAWSCredentialsProvider: Couldn't extract aws credentials from either uri s3://support.elasticmapreduce/bootstrap-actions/presto/user-compiled/presto-cli-executable.jar or hadoop configuration.
15/04/26 05:14:26 INFO fs.EmrFileSystem: Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
get: `s3://support.elasticmapreduce/bootstrap-actions/presto/user-compiled/presto-cli-executable.jar': No such file or directory
/mnt/var/lib/bootstrap-actions/1/install-presto.rb:88:in `run': Command failed: /home/hadoop/bin/hdfs dfs -get s3://support.elasticmapreduce/bootstrap-actions/presto/user-compiled/presto-cli-executable.jar /tmp/presto-cli-executable.jar (RuntimeError)
    from /mnt/var/lib/bootstrap-actions/1/install-presto.rb:350

install-spark 1.1.1f results in 1.1.1e

install-spark does not install 1.1.1f when 1.1.1f is requested.

Look at lines 240-242

elif [ "$REQUESTED_VERSION" == "1.1.1.f" ]
then
        wget -O install-spark-script $SPARK_111e
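For comparison, the branch would presumably need to reference a 1.1.1f download variable instead (a sketch only; $SPARK_111f is assumed to exist alongside $SPARK_111e in the script):

elif [ "$REQUESTED_VERSION" == "1.1.1.f" ]
then
        # assumed fix: fetch the 1.1.1f installer rather than the 1.1.1e one
        wget -O install-spark-script $SPARK_111f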

EMR Hbase 0.94 and Drill 0.8

Hi Team,

Drill 0.8 supports HBase 0.98 and not HBase 0.94 (the default provided by AWS). So if we want to install Drill 0.8, then we need a bootstrap script to install HBase 0.98.

Thank you
Ankur

Bump version on presto bootstrap

Presto following 0.78 had some optimizations for S3 access. Current version is 0.107 (as of posting).

As an aside: they have since moved to requiring JDK 1.8.

Spark 1.2 Configuration Incorrect

Hi,

I used the following settings to create a Spark 1.2 cluster, yet when I start the Spark shell, it is configured as a single node with 200 MB of memory instead of a 5-node cluster with 28 GB of memory per node.

Steps

aws emr create-cluster --name MyCluster --ami-version 3.3.1 --instance-groups InstanceGroupType=MASTER,InstanceType=r3.xlarge,InstanceCount=1,BidPrice=0.32 InstanceGroupType=CORE,BidPrice=0.32,InstanceType=r3.xlarge,InstanceCount=5 --ec2-attributes KeyName=MyKey --applications Name=Hive Name=Impala Name=Hue --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark --steps Name=SparkHistoryServer,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=s3://support.elasticmapreduce/spark/start-history-server

/home/hadoop/spark/bin/spark-shell

I've also tried
MASTER=yarn-client /home/hadoop/spark/bin/spark-shell
and
SPARK_MEM=28g /home/hadoop/spark/bin/spark-shell

My test script imports about 20 GB of data from S3 and runs some basic queries. If I run it on a non-EMR cluster (created with the Spark EC2 scripts) I can do a simple count in < 20 seconds and cache all the data in memory. Running the same script with these EMR Spark scripts and the same EC2 instance types takes 10x longer.

It appears as if the scripts are not running across all nodes (or are only running on the master). Is there more configuration I need to do?
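For what it's worth, other reports in this repository pass -x to install-spark (which drives the maximize-spark-default-config script) and start the shell against YARN explicitly. A sketch of that variation, reusing the paths from the command above:

# as in the original command, but with Args=[-x] on the install-spark bootstrap action
aws emr create-cluster --name MyCluster --ami-version 3.3.1 \
    --instance-groups InstanceGroupType=MASTER,InstanceType=r3.xlarge,InstanceCount=1,BidPrice=0.32 InstanceGroupType=CORE,BidPrice=0.32,InstanceType=r3.xlarge,InstanceCount=5 \
    --ec2-attributes KeyName=MyKey \
    --applications Name=Hive Name=Impala Name=Hue \
    --bootstrap-actions Name=Spark,Path=s3://support.elasticmapreduce/spark/install-spark,Args=[-x] \
    --steps Name=SparkHistoryServer,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=s3://support.elasticmapreduce/spark/start-history-server

# then, on the master node, point the shell at YARN rather than local mode:
MASTER=yarn-client /home/hadoop/spark/bin/spark-shell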

how to run an application with multiple python files

AMI version 3.8.0
Spark version 1.3.1

My Spark program has multiple Python files, and I use the --py-files option to load them.
It works well on a custom Spark cluster (without EMR), but on EMR it fails to import the Python files.

For example, I have two Python files (test1.py, test2.py).
*test1.py looks like:
import test2   # ==> here the EMR step fails
I want to know how to run an application with multiple Python files.

*This is my EMR CLI command:
aws emr create-cluster --name Spark --ami-version 3.8.0
--instance-type=m3.xlarge --instance-count 2
--use-default-roles --ec2-attributes KeyName=podotree-landvibe
--log-uri s3://landvibe-emr/spark/log
--bootstrap-actions Name=Spark,Path=s3://support.elasticmapreduce/spark/install-spark,Args=[-x]
Name=install_es,Path=s3://landvibe-emr/spark/bootstrap-actions/install_python_elasticsearch.sh
--steps Name=SparkPi,Jar=s3://ap-northeast-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,s3://landvibe-emr/spark/code/test1.py,--py-files,s3://landvibe-emr/spark/code/test2.py]
--auto-terminate
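One thing worth checking (an observation, not a confirmed fix): spark-submit expects options such as --py-files to come before the primary application file, and in the step above --py-files appears after test1.py, so it is passed to the application as an ordinary argument. A reordered sketch of the same command:

aws emr create-cluster --name Spark --ami-version 3.8.0 \
  --instance-type=m3.xlarge --instance-count 2 \
  --use-default-roles --ec2-attributes KeyName=podotree-landvibe \
  --log-uri s3://landvibe-emr/spark/log \
  --bootstrap-actions Name=Spark,Path=s3://support.elasticmapreduce/spark/install-spark,Args=[-x] Name=install_es,Path=s3://landvibe-emr/spark/bootstrap-actions/install_python_elasticsearch.sh \
  --steps Name=SparkPi,Jar=s3://ap-northeast-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--py-files,s3://landvibe-emr/spark/code/test2.py,s3://landvibe-emr/spark/code/test1.py] \
  --auto-terminate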

Ganglia stops running with 1.3.0.d on ami 3.6.0 (but not 3.5.0)

I tried the most recent 1.3.0.d spark version with ami 3.6.0 and Ganglia gave the exception below.
Has anyone else seen that? The test cluster is j-29YCKSAETGNO6

Note the division by zero marked in the logs below.

The cluster was created with args:

aws emr create-cluster \
... \
  --ami-version 3.6.0 \
  --no-termination-protected \
  --visible-to-all-users \
  --enable-debugging \
  --applications Name=Ganglia \
  --bootstrap-actions \
    Name=InstallSpark,Args="-v,1.3.0.d,-x,-g",Path=s3://support.elasticmapreduce/spark/install-spark \
    Name="Configure HBase for Ganglia",Path=s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia \
    Name="Aggregate logs",Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,\
Args=[\
"-y","yarn.log-aggregation-enable=true",\
"-y","yarn.log-aggregation.retain-seconds=43200",\
"-y","yarn.log-aggregation.retain-check-interval-seconds=1800",\
"-c","hadoop.http.staticuser.user=hadoop"\
] \
  --steps \
    Name=SparkHistoryServer,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=s3://support.elasticmapreduce/spark/start-history-server

Log file snippet:

[Mon Mar 30 06:00:13 2015] [error] [client ***] PHP Warning:  gettimeofday(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'UTC' for 'UTC/0.0/no DST' instead in /var/www/html/ganglia/ganglia.php on line 395, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:13 2015] [error] [client ***] PHP Warning:  gettimeofday(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'UTC' for 'UTC/0.0/no DST' instead in /var/www/html/ganglia/ganglia.php on line 412, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:13 2015] [error] [client ***] PHP Warning:  date(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'UTC' for 'UTC/0.0/no DST' instead in /var/www/html/ganglia/header.php on line 69, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:13 2015] [error] [client ***] PHP Warning:  date(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'UTC' for 'UTC/0.0/no DST' instead in /var/www/html/ganglia/header.php on line 77, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:13 2015] [error] [client ***] PHP Warning:  date(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'UTC' for 'UTC/0.0/no DST' instead in /var/www/html/ganglia/cluster_view.php on line 43, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:13 2015] [error] [client ***] PHP Notice:  Undefined variable: host_load in /var/www/html/ganglia/cluster_view.php on line 413, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:13 2015] [error] [client ***] PHP Notice:  Undefined variable: host_load in /var/www/html/ganglia/cluster_view.php on line 420, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:13 2015] [error] [client ***] PHP Warning:  Invalid argument supplied for foreach() in /var/www/html/ganglia/cluster_view.php on line 420, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:13 2015] [error] [client ***] PHP Notice:  Undefined variable: matrix_array in /var/www/html/ganglia/cluster_view.php on line 432, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:13 2015] [error] [client ***] PHP Warning:  join(): Invalid arguments passed in /var/www/html/ganglia/cluster_view.php on line 432, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
===>
[Mon Mar 30 06:00:13 2015] [error] [client ***] PHP Warning:  Division by zero in /var/www/html/ganglia/cluster_view.php on line 439, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:16 2015] [error] [client ***] PHP Warning:  gettimeofday(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'UTC' for 'UTC/0.0/no DST' instead in /var/www/html/ganglia/ganglia.php on line 395, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:16 2015] [error] [client ***] PHP Warning:  gettimeofday(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'UTC' for 'UTC/0.0/no DST' instead in /var/www/html/ganglia/ganglia.php on line 412, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:16 2015] [error] [client ***] PHP Warning:  date(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'UTC' for 'UTC/0.0/no DST' instead in /var/www/html/ganglia/header.php on line 69, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:16 2015] [error] [client ***] PHP Warning:  date(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'UTC' for 'UTC/0.0/no DST' instead in /var/www/html/ganglia/header.php on line 77, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:16 2015] [error] [client ***] PHP Warning:  date(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'UTC' for 'UTC/0.0/no DST' instead in /var/www/html/ganglia/cluster_view.php on line 43, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:16 2015] [error] [client ***] PHP Notice:  Undefined variable: host_load in /var/www/html/ganglia/cluster_view.php on line 413, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:16 2015] [error] [client ***] PHP Notice:  Undefined variable: host_load in /var/www/html/ganglia/cluster_view.php on line 420, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:16 2015] [error] [client ***] PHP Warning:  Invalid argument supplied for foreach() in /var/www/html/ganglia/cluster_view.php on line 420, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:16 2015] [error] [client ***] PHP Notice:  Undefined variable: matrix_array in /var/www/html/ganglia/cluster_view.php on line 432, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[Mon Mar 30 06:00:16 2015] [error] [client ***] PHP Warning:  join(): Invalid arguments passed in /var/www/html/ganglia/cluster_view.php on line 432, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
===>
[Mon Mar 30 06:00:16 2015] [error] [client ***] PHP Warning:  Division by zero in /var/www/html/ganglia/cluster_view.php on line 439, referer: http://xxx.compute-1.amazonaws.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=AMZN-EMR&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4**

Ganglia unresponsive after many resize operations?

I don't know if it was all of the resizing operations or not, but Ganglia in a dev cluster became slower and slower until it was unresponsive at some point. This may be because it retained perhaps > 150 "lost nodes" that were the result of resizing up and down.

Has anyone else seen this? I tried blindly restarting gmond and gmetad but that didn't heal it.

For the record, the cluster is running Spark 1.2.1 on AMI 3.3.2.

EMR Step: Submit Spark JAR from S3

It would be very useful to have a spark-submit script that can submit JARs from S3 to the cluster and that can be executed as an EMR step (i.e. via the script runner).

Inspiration:
s3://support.elasticmapreduce/.../emr-spark-submit
--class CLASS --master yarn-client --num-executors 10 s3://bucket/path/to/my.jar [args]

That would make the deployment and execution of spark applications as easy as running Hadoop jars.
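A minimal sketch of what such a wrapper could look like (the script name and interface are hypothetical; it splits the arguments around the first s3:// path, copies the jar locally, and hands everything to spark-submit):

#!/bin/bash
# emr-spark-submit (sketch): spark-submit options, then an s3:// jar path, then app arguments, e.g.
#   emr-spark-submit --class CLASS --master yarn-client --num-executors 10 s3://bucket/path/to/my.jar arg1 arg2
# Assumes none of the spark-submit option values themselves start with s3://.
set -e
SPARK_OPTS=()
while [ $# -gt 0 ] && [[ "$1" != s3://* ]]; do
  SPARK_OPTS+=("$1")
  shift
done
JAR_S3_PATH="$1"
shift
LOCAL_JAR="/home/hadoop/$(basename "$JAR_S3_PATH")"
hadoop fs -get "$JAR_S3_PATH" "$LOCAL_JAR"
exec /home/hadoop/spark/bin/spark-submit "${SPARK_OPTS[@]}" "$LOCAL_JAR" "$@"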

Cannot use native BLAS

When running Spark I get the following warnings:

15/06/15 11:17:36 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
15/06/15 11:17:36 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
15/06/15 11:17:36 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
15/06/15 11:17:36 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK

I saw that the workers don't have libgfortran, that might be the cause.

My sbt build file includes

libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2"

Anyone else getting this warning?
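Since the missing libgfortran on the workers is the suspected cause, a bootstrap action along these lines could be one way to test that theory (a sketch; the exact package name may vary by AMI version, and this is not a confirmed fix):

#!/bin/bash
# Hypothetical bootstrap action: install a Fortran runtime on every node so that
# netlib-java can load the native BLAS/LAPACK implementations.
set -e
sudo yum install -y libgfortran || sudo yum install -y gcc-gfortran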

Document if and how EMR TASK nodes can be used by Spark

Task nodes in EMR don't hold persistent data but can add compute power. Can Spark take advantage of those? If you run "hdfs dfsadmin -report" it won't enumerate the TASK nodes.

Similarly does Spark scale up with added CORE nodes?

How to run instances with Spark 1.3.1 built on Hadoop 2.6 ?

Hi,

I tried to run a bootstrap action on EMR in order to install Spark 1.3.1 built on Hadoop 2.6 (and not 2.4).

  1. Is it possible?

  2. If so: my conf file specifies a repository where you can find spark-1.3.1-bin-hadoop2.6.tgz, and I'm using the setting "-v 2.6.0" (see details here: https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md).

My conf file:

default default python s3://support.elasticmapreduce/spark/install-spark-script.py s3://<my_repository>/spark-1.3.1-bin-hadoop2.6.tgz s3://support.elasticmapreduce/spark/maximize-spark-default-config

Nonetheless, it's still not working...

Thanks for your help.

Regards

Thomas

spark saveAsTextFile makes two results

AMI version 3.8.0
Spark version 1.3.1

My Python code is very simple. It should produce one result folder at s3n://landvibe-emr/spark/result/simple_test_(number).
But there are two output folders, like below:
s3n://landvibe-emr/spark/result/simple_test_253103
s3n://landvibe-emr/spark/result/simple_test_897893
And the status of the step is failed.
When I run this code without EMR, it generates one result folder.

*my cli command
aws emr create-cluster --name Spark --ami-version 3.8.0 --instance-type=m3.xlarge --instance-count 2 --use-default-roles --ec2-attributes KeyName=podotree-landvibe --log-uri s3://landvibe-emr/spark/log --bootstrap-actions Name=Spark,Path=s3://support.elasticmapreduce/spark/install-spark,Args=[-x] --steps Name=SparkPi,Jar=s3://ap-northeast-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,s3://landvibe-emr/spark/code/simple_test.py] --auto-terminate

*simple_test.py
from pyspark import SparkContext
import random
sc = SparkContext("local", "testapp")
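# note: hardcoding "local" as the master here may conflict with the --deploy-mode cluster
# setting used in the step above (an observation about this example, not a confirmed cause)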
lines = sc.textFile("s3n://landvibe-emr/input/2015-06-29-00/t_open_log", sc.defaultParallelism * 3)
randValue = random.randrange(100,1000000)
lines.saveAsTextFile("s3n://landvibe-emr/spark/result/simple_test_"+str(randValue))
sc.stop()

EMR 3.3.1 with spark 1.1.1e error in cluster mode

I have been successful with YARN client mode using a 3.3.1 AMI and spark 1.1.1e script.

But when I run it in cluster mode I get the error message below.

There are old fixed errors that sound suspiciously similar as in:

I tried the removal of .sparkStaging as suggested here but that made no difference:

It reliably has this error every time.

The command line looks like this:

/home/hadoop/spark/bin/spark-submit \
  --verbose \
  --deploy-mode cluster \
  --master yarn-cluster \
  --driver-memory 4G \
  --executor-memory 5G \
  --num-executors  5 \
  --jars /home/hadoop/test/test-assembly-0.1.0-SNAPSHOT.jar \
  --class test... \
  /home/hadoop/test/test-assembly-0.1.0-SNAPSHOT.jar \
  args

The variation that just uses client mode works fine as in:

/home/hadoop/spark/bin/spark-submit \
  --verbose \
  --deploy-mode client \
  --master yarn-client \
  --driver-memory 4G \
  --executor-memory 5G \
  --num-executors  5 \
  --jars /home/hadoop/test/test-assembly-0.1.0-SNAPSHOT.jar \
  --class test... \
  /home/hadoop/test/test-assembly-0.1.0-SNAPSHOT.jar \
  args

Error message:

2015-01-16 00:53:49,401 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) - Application report from ResourceManager:
  application identifier: application_1421213387393_0069
  appId: 69
  clientToAMToken: null
  appDiagnostics: Application application_1421213387393_0069 failed 2 times due to AM Container for appattempt_1421213387393_0069_000002 exited with exitCode: -1000 due to: Resource hdfs://xxxxxxxxxxxx:9000/user/hadoop/.sparkStaging/application_1421213387393_0069/xxxxx-assembly-0.1.0-SNAPSHOT.jar changed on src filesystem (expected 1421369607450, was 1421369608135). Failing this attempt.. Failing the application.
  appMasterHost: N/A
  appQueue: default
  appMasterRpcPort: -1
  appStartTime: 1421369608345
  yarnAppState: FAILED
  distributedFinalState: FAILED
  appTrackingUrl: xxxxxxxxxxxxxxx.ec2.internal:9026/cluster/app/application_1421213387393_0069
  appUser: hadoop

This is using the 3.3.1 AMI and spark 1.1.1e script.
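One variation that may be worth trying (an assumption based on the error message, not a verified fix for this report): drop the --jars option, since it points at the same assembly jar that is already being submitted as the application jar, which could cause the file to be staged into .sparkStaging twice with different timestamps:

/home/hadoop/spark/bin/spark-submit \
  --verbose \
  --deploy-mode cluster \
  --master yarn-cluster \
  --driver-memory 4G \
  --executor-memory 5G \
  --num-executors  5 \
  --class test... \
  /home/hadoop/test/test-assembly-0.1.0-SNAPSHOT.jar \
  args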

install-spark with -x for 1.2.0.a does not set spark.default.parallelism correctly

When I run s3://support.elasticmapreduce/spark/install-spark with -x, -v, 1.2.0.a, -g,

My /home/hadoop/spark/conf/spark-defaults.conf looks like:

spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///spark-logs
spark.executor.extraJavaOptions         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
spark.metrics.conf      /home/hadoop/spark/conf/ganglia.metrics.properties
spark.executor.instances
spark.executor.cores
spark.executor.memory   10112M
spark.default.parallelism       *

The last line contains an invalid value (spark.default.parallelism *). When I try to run a Spark job, I get:

Exception in thread "main" java.lang.NumberFormatException: For input string: "*"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:569)
        at java.lang.Integer.parseInt(Integer.java:615)
        at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
        at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
        at org.apache.spark.SparkConf$$anonfun$getInt$2.apply(SparkConf.scala:184)
        at org.apache.spark.SparkConf$$anonfun$getInt$2.apply(SparkConf.scala:184)
        at scala.Option.map(Option.scala:145)
        at org.apache.spark.SparkConf.getInt(SparkConf.scala:184)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.defaultParallelism(CoarseGrainedSchedulerBackend.scala:271)
        at org.apache.spark.scheduler.TaskSchedulerImpl.defaultParallelism(TaskSchedulerImpl.scala:402)
        at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1455)
        at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1462)
        at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:539)
        at org.apache.spark.api.java.JavaSparkContext.textFile(JavaSparkContext.scala:184)

It looks like the issue is in maximize-spark-default-config. When I download it to my master and run it manually, there are some issues:

+ VCOREREFERENCE=http://support.elasticmapreduce.s3.amazonaws.com/spark/vcorereference.tsv
+ CONFIGURESPARK=http://support.elasticmapreduce.s3.amazonaws.com/spark/configure-spark.bash
+ echo 'Configuring Spark default configuration to the max memory and vcore setting given configured number of cores nodes at cluster creation'
Configuring Spark default configuration to the max memory and vcore setting given configured number of cores nodes at cluster creation
+ /usr/share/aws/emr/scripts/configure-hadoop -y yarn.scheduler.minimum-allocation-mb=256
Processing default file /home/hadoop/conf/yarn-site.xml with overwrite yarn.scheduler.minimum-allocation-mb=256
'yarn.scheduler.minimum-allocation-mb': new value '256' overwriting '256'
Saved /home/hadoop/conf/yarn-site.xml with overwrites. Original saved to /home/hadoop/conf/yarn-site.xml.old
++ grep /mnt/var/lib/info/job-flow.json -e CORE -A 4
++ grep -e requestedInstanceCount
++ cut -d: -f2
++ sed 's/\s\+//g'
+ NUM_CORE_NODES=
+ '[' -lt 2 ']'
./f: line 27: [: -lt: unary operator expected
++ grep /mnt/var/lib/info/job-flow.json -e CORE -A 4
++ grep -e instanceType
++ cut '-d"' -f4
++ sed 's/\s\+//g'
+ CORE_INSTANCE_TYPE=
+ '[' == '' ']'
./f: line 35: [: ==: unary operator expected
+ wget http://support.elasticmapreduce.s3.amazonaws.com/spark/vcorereference.tsv
--2015-01-12 10:58:39--  http://support.elasticmapreduce.s3.amazonaws.com/spark/vcorereference.tsv
Resolving support.elasticmapreduce.s3.amazonaws.com (support.elasticmapreduce.s3.amazonaws.com)... 54.231.32.217
Connecting to support.elasticmapreduce.s3.amazonaws.com (support.elasticmapreduce.s3.amazonaws.com)|54.231.32.217|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 470 [text/tab-separated-values]
Saving to: 'vcorereference.tsv.3'

vcorereference.tsv.3                                            100%[========================================================================================================================================================>]     470  --.-KB/s   in 0s

2015-01-12 10:58:40 (57.0 MB/s) - 'vcorereference.tsv.3' saved [470/470]

++ grep vcorereference.tsv -e
++ cut -f2
grep: option requires an argument -- 'e'
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
+ NUM_VCORES=
++ grep /home/hadoop/conf/yarn-site.xml -e 'yarn\.scheduler\.maximum-allocation-mb'
++ sed 's/.*<value>\(.*\).*<\/value>.*/\1/g'
+ MAX_YARN_MEMORY=11520
++ expr 11520 - 1024 - 384
+ EXEC_MEMORY=10112
+ EXEC_MEMORY+=M
+ echo 'num vcores '
num vcores
+ echo 'num core nodes '
num core nodes
++ expr '*'
+ PARALLEL='*'
+ echo 'parallel *'
parallel *
+ wget http://support.elasticmapreduce.s3.amazonaws.com/spark/configure-spark.bash
--2015-01-12 10:58:40--  http://support.elasticmapreduce.s3.amazonaws.com/spark/configure-spark.bash
Resolving support.elasticmapreduce.s3.amazonaws.com (support.elasticmapreduce.s3.amazonaws.com)... 54.231.32.217
Connecting to support.elasticmapreduce.s3.amazonaws.com (support.elasticmapreduce.s3.amazonaws.com)|54.231.32.217|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 584 [binary/octet-stream]
Saving to: 'configure-spark.bash.3'

configure-spark.bash.3                                          100%[========================================================================================================================================================>]     584  --.-KB/s   in 0s

2015-01-12 10:58:40 (69.6 MB/s) - 'configure-spark.bash.3' saved [584/584]

+ exit 0

(btw, I commented out the call to configure-spark.bash at the end).

It looks like:

  • NUM_CORE_NODES is set to empty string (expected?)
  • syntax error on '[' -lt 2 ']'
    line 27: [: -lt: unary operator expected
  • another on + '[' == '' ']'
    line 35: [: ==: unary operator expected
  • NUM_VCORES is an empty string (expected?)
  • PARALLEL is set to '*' - this is what is causing my current issue

I can hack around this for now, but it'd be great to see a fix for this pushed out (or a clarification if I'm doing something wrong).
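For illustration, the failing comparisons in the trace above could be made robust to an empty result with quoting and a default value, roughly like this (a sketch written against the trace, not the actual maximize-spark-default-config script):

# guard against the CORE lookup in job-flow.json returning nothing
NUM_CORE_NODES=$(grep /mnt/var/lib/info/job-flow.json -e CORE -A 4 \
  | grep -e requestedInstanceCount | cut -d: -f2 | sed 's/\s\+//g')
NUM_CORE_NODES=${NUM_CORE_NODES:-0}
if [ "$NUM_CORE_NODES" -lt 2 ]; then
  echo "could not determine the CORE node count (got '$NUM_CORE_NODES'); leaving spark.default.parallelism unset"
fi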

Update config file to install Spark version 1.4.0

Can you please update the install config file to install Spark version 1.4.0? There are several bug fixes related to deployment on EMR in this release. For instance, I hope that https://issues.apache.org/jira/browse/SPARK-2669 will fix the problems I'm facing when trying to use the HiveContext within a Spark Streaming app deployed on EMR in yarn-cluster mode (Spark v1.3.1e, -h option, AMI 3.7.0), namely that it can't connect to the Hive metastore. First of all, although I'm using the -h option, the default classpath doesn't include the datanucleus-* jars (they are missing in the Spark UI environment section), and this causes ClassNotFound errors once the driver tries to use the HiveContext. After manually adding the classpath with --driver-class-path /home/hadoop/spark/classpath/hive/* as a spark-submit argument, the needed jars seem to be available in the environment, but I get:

15/06/11 17:04:53 WARN conf.HiveConf: hive-default.xml not found on CLASSPATH
15/06/11 17:04:54 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/06/11 17:04:54 INFO metastore.ObjectStore: ObjectStore, initialize called
15/06/11 17:04:54 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/06/11 17:04:54 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/06/11 17:04:54 ERROR Datastore.Schema: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true, username = hive. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Access denied for user 'hive'@'localhost' (using password: YES)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1084)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4232)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4164)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:926)
        ...

Note: I also added --driver-java-options -Dscala.usejavacp=true and --files /home/hadoop/hive/conf/hive-site.xml

Thanks,
erond
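For reference, the workarounds described above amount to a submission along these lines (reconstructed from the description; the application class and jar are placeholders):

/home/hadoop/spark/bin/spark-submit \
  --master yarn-cluster \
  --driver-class-path '/home/hadoop/spark/classpath/hive/*' \
  --driver-java-options -Dscala.usejavacp=true \
  --files /home/hadoop/hive/conf/hive-site.xml \
  --class com.example.MyStreamingApp \
  /home/hadoop/my-streaming-app.jar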

AMI 3.3.2 still uses Hadoop 2.4.0 instead of security-bug-fixed 2.4.1

I'm curious why you stay at Hadoop 2.4.0 when 2.4.1 contains only bug fixes, including a security fix.

See: http://hadoop.apache.org/releases.html

30 June, 2014: Release 2.4.1 available
Apache Hadoop 2.4.1 is a bug-fix release for the stable 2.4.x line.

There is also a security bug fix in this minor release.

CVE-2014-0229: Add privilege checks to HDFS admin sub-commands refreshNamenodes, deleteBlockPool and shutdownDatanode.
Users are encouraged to immediately move to 2.4.1.

Please see the Hadoop 2.4.1 Release Notes for details.

http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/releasenotes.html

Presto query fails when querying on S3 external table

Hi,

Basic Info

  • AMI Version: 3.3.1
  • Hive Version: 0.13.1
  • Presto Version: 0.78
  • All EC2 instances are in VPC public subnet (default gateway is InternetGateway)
  • Using EMR_EC2_DefaultRole/EMR_DefaultRole
  • Start EMR by boto==2.34.0

start command line

aws emr create-cluster --ami-version 3.3.1 \
    --name " (AMI 3.3.1 Hive + Presto) $( date '+%Y%m%d%H%M' )" \
    --service-role EMR_DefaultRole \
    --tags Name=production-emr environment=production \
    --ec2-attributes KeyName=[KEY],InstanceProfile=EMR_EC2_DefaultRole,SubnetId=[SUBNET] \
    --applications file://./app-hive-hue.json \
    --instance-groups file://./large-instance-setup.json \
    --bootstrap-actions file://./bootstrap-presto.json \
    --log-uri 's3://[BUCKET]/jobflow_logs/' \
    --no-auto-terminate \
    --visible-to-all-users
  • app-hive.json
[
  {
    "Name": "HIVE"
  }
]
  • large-instance-setup.json
[
  {
     "Name": "emr-master-production",
     "InstanceGroupType": "MASTER",
     "InstanceCount": 1,
     "InstanceType": "r3.xlarge"
  },
  {
     "Name": "emr-core-production",
     "InstanceGroupType": "CORE",
     "InstanceCount": 2 ,
     "InstanceType": "r3.xlarge"
  }
]
  • bootstrap-presto.json
[
  {
    "Name": "Install/Setup Presto",
    "Path": "s3://development-env-log/libs/install-presto.rb",
  }
]

How it fails

After creating the pv_detail table (an external table whose data is on S3) from Hive:

[hadoop@ip-10-0-0-236 ~]$ ./presto --catalog=hive
presto:default> show tables;
       Table
--------------------
 pv_detail
 sample_07
 sample_08
(3 rows)

Query 20150107_064546_00008_ch8y3, FINISHED, 2 nodes
Splits: 2 total, 2 done (100.00%)
0:00 [3 rows, 191B] [62 rows/s, 3.85KB/s]

presto:default> select * from pv_detail;

Query 20150107_064555_00009_ch8y3, FAILED, 1 node
Splits: 1 total, 0 done (0.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20150107_064555_00009_ch8y3 failed: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
presto:default>

When it works

When I use the sample commands from the README, it works fine.
https://github.com/awslabs/emr-bootstrap-actions/tree/master/presto

Other S3 access from master node

Since running aws s3 ls s3://path/to/pv_detail/ with the AWS CLI on the master node works just fine, the EC2 instances must have the proper rights to access the S3 objects.

Unable to run spark-sql queries with some UDFs

When using spark-sql 1.2.x (that's all I've tested), most queries that I've tried seem to fail. Here's an example:

spark-sql> select concat("foo", "bar") from sample_08;
2015-02-26 01:36:07,754 INFO  [main] parse.ParseDriver (ParseDriver.java:parse(185)) - Parsing command: select concat("foo", "bar") from sample_08
2015-02-26 01:36:07,755 INFO  [main] parse.ParseDriver (ParseDriver.java:parse(206)) - Parse Completed
2015-02-26 01:36:07,758 INFO  [main] metastore.HiveMetaStore (HiveMetaStore.java:logInfo(624)) - 0: get_table : db=default tbl=sample_08
2015-02-26 01:36:07,758 INFO  [main] HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(306)) - ugi=hadoop   ip=unknown-ip-addr  cmd=get_table : db=default tbl=sample_08
2015-02-26 01:36:07,775 ERROR [main] thriftserver.SparkSQLDriver (Logging.scala:logError(96)) - Failed in [select concat("foo", "bar") from sample_08]
java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.exec.Utilities.deserializeObjectByKryo(com.esotericsoftware.kryo.Kryo, java.io.InputStream, java.lang.Class)
    at java.lang.Class.getDeclaredMethod(Class.java:2009)
    at org.apache.spark.sql.hive.HiveFunctionWrapper.<init>(Shim13.scala:67)
    at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:59)
    at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:258)
    at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:41)
    at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:41)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:41)
    at org.apache.spark.sql.hive.HiveContext$$anon$2.lookupFunction(HiveContext.scala:258)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:220)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:218)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:71)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:85)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:84)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:89)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:60)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10.applyOrElse(Analyzer.scala:218)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10.applyOrElse(Analyzer.scala:216)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$.apply(Analyzer.scala:216)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$.apply(Analyzer.scala:215)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
    at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
    at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
    at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
    at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:371)
    at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:430)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Time taken: 0.08 seconds

Update config file to install Spark version 1.2.2

Can you please update the install config file to install Spark version 1.2.2? Since this is a maintenance/bugfix release, I think it should be offered as the latest available version in the Spark 1.2.x line.

Thanks,
erond

Apache Drill - 404

Hello,

When I execute the setup_drill, I get the following error on my amazon EMR:

curl: (22) The requested URL returned error: 404 Not Found
/mnt/var/lib/bootstrap-actions/4/setup_drill:23:in `run': Command failed: sudo curl -L --silent --show-error --fail --connect-timeout 60 --max-time 720 --retry 5 -O http://beta.elasticmapreduce.s3.amazonaws.com/bootstrap-actions/zookeeper-standalone/zookeeper-3.4.5.tar.gz (RuntimeError)
    from /mnt/var/lib/bootstrap-actions/4/setup_drill:28:in `sudo'
    from /mnt/var/lib/bootstrap-actions/4/setup_drill:166:in `installZookeeper'
    from /mnt/var/lib/bootstrap-actions/4/setup_drill:218:in `installDrill'
    from /mnt/var/lib/bootstrap-actions/4/setup_drill:244

Regards,

Eric
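
The failing URL can be checked directly, and if it really does return 404 one possible workaround is to mirror the tarball into a bucket you control and point your copy of setup_drill at it. This is only a sketch: the Apache archive URL pattern and the bucket name below are assumptions, not values taken from the script.

    # Reproduce the failing download (URL copied verbatim from the error above)
    curl -I http://beta.elasticmapreduce.s3.amazonaws.com/bootstrap-actions/zookeeper-standalone/zookeeper-3.4.5.tar.gz

    # Possible workaround: fetch the same ZooKeeper release elsewhere, mirror it to
    # your own bucket, then edit the zookeeper URL inside your copy of setup_drill.
    # Archive URL pattern and bucket name are illustrative assumptions.
    wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
    aws s3 cp zookeeper-3.4.5.tar.gz s3://my-example-bucket/bootstrap/zookeeper-3.4.5.tar.gz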

Spark built with avro-mapred for hadoop1 instead of hadoop2?

Running Spark on AMI 3.x (Hadoop 2.x), I get the following exception when trying to read any Avro files. According to https://issues.apache.org/jira/browse/SPARK-3039 this should have been fixed in 1.2.0, so I suspect the 1.2.0.a build didn't include this fix?

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
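
While the assembly ships a hadoop1-flavoured avro-mapred, one commonly suggested workaround is to put a hadoop2-classified avro-mapred jar ahead of the assembly's copy on the executor classpath. A minimal sketch follows; the jar path, Avro version, main class and application jar are all illustrative assumptions, and the exact user-classpath-first property name differs between Spark releases (spark.files.userClassPathFirst in 1.2, renamed later), so verify against the version you run.

    # Paths, version and class name below are assumptions for illustration only.
    /home/hadoop/spark/bin/spark-submit \
      --conf spark.files.userClassPathFirst=true \
      --jars /home/hadoop/lib/avro-mapred-1.7.7-hadoop2.jar \
      --class com.example.MyAvroJob \
      /home/hadoop/my-avro-job.jar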

Does current Spark version work with AWS Roles for hadoop s3 access?

I think this line suggests that the current Spark version will work with AWS roles: "S3 will be accessible according to the policy of the associated role" (https://github.com/awslabs/emr-bootstrap-actions/blame/master/spark/examples/spark-submit-via-step.md#L18).

Our current code works with S3 by providing the access keys explicitly, but before monkeying around with converting to roles I just wanted to confirm that it should work.

Here is a snippet of code used to configure a test:

    val sparkConf = new SparkConf().setAppName("appname")
    val sc = new SparkContext(sparkConf)
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", config.getString("AWS_ACCESS_KEY"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", config.getString("AWS_SECRET_KEY"))
    val hdfs = HDFS(sc.hadoopConfiguration)

Note that the underlying Hadoop implementation only claims this is implemented as of Hadoop 2.6 (https://issues.apache.org/jira/browse/HADOOP-10400), whereas EMR 3.7 is on Hadoop 2.4.0.

Thanks.
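
For what it's worth, a quick way to sanity-check role-based access before touching any code is to try EMRFS from the shell with no keys configured anywhere. If the listing below works, the two hadoopConfiguration.set calls in the snippet above can simply be dropped and the instance profile's policy takes over. The bucket and prefix are placeholders.

    # Run on the master node of a cluster launched with an EC2 instance profile
    # that grants S3 read access. No fs.s3n.* keys are set anywhere; EMRFS should
    # pick up the role credentials on its own. Bucket/prefix are hypothetical.
    hadoop fs -ls s3://my-example-bucket/some/prefix/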

Trouble with YARN log aggregation setup

I set up YARN log aggregation as described here: https://spark.apache.org/docs/1.1.1/running-on-yarn.html
by using a boostrap step in aws emr create-cluster like this:

Name="Aggregate logs",Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,\
Args=[\
"-y","yarn.log-aggregation-enable=true",\
"-y","yarn.log-aggregation.retain-seconds=36000",\
"-y","yarn.log-aggregation.retain-check-interval-seconds=3600",\
"-y","yarn.nodemanager.remote-app-log-dir=s3://somewriteablebucketofmine",\
] \

Nothing gets stored in the bucket and if I try to invoke yarn to display the logs, I get the following:

$ yarn logs -applicationId application_1421712553870_0003
15/01/20 00:18:40 INFO client.RMProxy: Connecting to ResourceManager at /10.168.80.68:9022
Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: s3
        at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:154)
        at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:242)
        at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:333)
        at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:330)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:330)
        at org.apache.hadoop.fs.FileContext.getFSofPath(FileContext.java:322)
        at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:85)
        at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1388)
        at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:112)
        at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137)
        at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199)

This sounds suspiciously like the issue described here: http://stackoverflow.com/a/26491972/541202

I tried switching to hdfs as in the following, but nothing showed up in the hdfs://tmp/logs location:

Name="Aggregate logs",Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,\
Args=[\
"-y","yarn.log-aggregation-enable=true",\
"-y","yarn.log-aggregation.retain-seconds=36000",\
"-y","yarn.log-aggregation.retain-check-interval-seconds=3600",\
"-y","yarn.nodemanager.remote-app-log-dir=hdfs://tmp/logs",\
] \

Suggestions?
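
Two things stand out, offered as guesses rather than a confirmed fix. First, the yarn logs CLI resolves the aggregation directory through an AbstractFileSystem, which exists for hdfs:// but not for EMR's s3:// scheme; that matches the UnsupportedFileSystemException above. Second, hdfs://tmp/logs is parsed with "tmp" as the authority (host) rather than as a path on the default filesystem; the usual spelling is hdfs:///tmp/logs (or an explicit namenode authority). A sketch of the HDFS variant with that one change:

    Name="Aggregate logs",Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,\
    Args=[\
    "-y","yarn.log-aggregation-enable=true",\
    "-y","yarn.log-aggregation.retain-seconds=36000",\
    "-y","yarn.log-aggregation.retain-check-interval-seconds=3600",\
    "-y","yarn.nodemanager.remote-app-log-dir=hdfs:///tmp/logs",\
    ]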

How is Spark for EMR constructed? Is there a repo for it and can we build it?

I have a test case that:

  • runs correctly locally when Spark is built using sbt
  • fails locally when Spark is built using mvn
  • runs correctly on the Amazon 1.2.0.a and 1.2.1.a builds

Obviously the assembly created by maven is different from the one created by sbt. For example, there is the ability to shade jar files with Maven that isn't possible in sbt and that is used by parquet-column (and perhaps other modules).

Are the Amazon versions built using maven or sbt?

Do you make a repo available for how those spark builds are created?

I would like to see how those are built to figure out what is going on.

For double bonus points, I would like to be able to roll my own spark installs to try out in EMR.

ACCESS DENIED s3://support.elasticmapreduce

Traceback (most recent call last):
  File "install-spark-script", line 144, in <module>
    download_and_uncompress_files()
  File "install-spark-script", line 37, in download_and_uncompress_files
    subprocess.check_call(["/home/hadoop/bin/hdfs","dfs","-get",scala_url, tmp_dir])
  File "/usr/lib64/python2.6/subprocess.py", line 505, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/hadoop/bin/hdfs', 'dfs', '-get', 's3://support.elasticmapreduce/spark//scala/scala-2.10.3.tgz', '/tmp']' returned non-zero exit status 255
[hadoop@ip-172-31-0-255 spark]$ /home/hadoop/bin/hdfs dfs -get s3://support.elasticmapreduce/spark//scala/scala-2.10.3.tgz /tmp
14/12/16 06:59:47 INFO guice.EmrFSBaseModule: Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as FileSystem implementation.
14/12/16 06:59:49 INFO fs.EmrFileSystem: Using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
-get: Fatal internal error
com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 996CDE1921142A24), S3 Extended Request ID: OvULPLcVlT5o0qirHfxzfJkL/g5VfcGJX2vmRwF83RFCXYysUXpcfc+vGCfL3kpr

Downloading s3://support.elasticmapreduce/spark//scala/scala-2.10.3.tgz results in Access Denied.
It seems the whole s3://support.elasticmapreduce bucket is forbidden.
I'm trying to change the link to download from somewhere else.
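
If the 403 on the support bucket persists, one possible way around it is to stage the Scala tarball somewhere your cluster can read and adjust the scala_url used by the script. This is only a sketch of the idea: the scala-lang.org archive URL pattern and the bucket name below are assumptions, not paths taken from install-spark-script.

    # Fetch the same Scala release from the upstream archive (URL pattern assumed)
    wget http://www.scala-lang.org/files/archive/scala-2.10.3.tgz
    # Mirror it to a bucket your cluster role can read, then point the script's
    # scala_url at this location before running the bootstrap action.
    aws s3 cp scala-2.10.3.tgz s3://my-example-bucket/spark/scala/scala-2.10.3.tgz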

Spark 1.4.1 release candidates

Would love to have the option to launch with Spark 1.4.1 RCs! I looked through the S3 buckets and noticed there are only 1.4.0 builds at the moment.

EMR Presto example fails on US West 2

Per the example in the current documents ("support.elasticmapreduce"), launching EMR Presto fails on us-west-2:

Hive fails to install:
Arguments: s3://us-west-2.elasticmapreduce/libs/hive/hive-script --install-hive --base-path s3://us-west-2.elasticmapreduce/libs/hive --hive-versions latest

Multiple SLF4J logger implementations in EMR implementations of Spark

There are conflicting implementations of SLF4J loggers in the spark assembly and the hadoop assemblies.

In the example output below, I am not setting SPARK_CLASSPATH myself:

sparkVersion="-v,1.2.1.a," ; amiVersion="3.3.2" ;

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark Command: /usr/java/latest/bin/java -cp /home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar::/home/hadoop/spark/conf:/home/hadoop/spark/lib/spark-assembly-1.2.1-hadoop2.4.0.jar:/home/hadoop/spark/lib/datanucleus-core-3.2.10.jar:/home/hadoop/spark/lib/datanucleus-rdbms-3.2.9.jar:/home/hadoop/spark/lib/datanucleus-api-jdo-3.2.6.jar:/home/hadoop/conf:/home/hadoop/conf -XX:MaxPermSize=128m -DHBASE_ZKQUORUM=master.hbase.... <more env vars>
========================================

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/.versions/2.4.0-amzn-1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/.versions/spark-1.2.1.a/lib/spark-assembly-1.2.1-hadoop2.4.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/03/11 20:29:19 WARN spark.SparkConf: 
SPARK_CLASSPATH was detected (set to '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath
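
Following the deprecation hint, the same entries that the install scripts presumably put into SPARK_CLASSPATH can be passed per job instead. A hedged sketch, with the classpath taken from the warning above and a placeholder class name and application jar:

    EMR_SPARK_CP="/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar"
    /home/hadoop/spark/bin/spark-submit \
      --driver-class-path "$EMR_SPARK_CP" \
      --conf spark.executor.extraClassPath="$EMR_SPARK_CP" \
      --class com.example.MyApp \
      /home/hadoop/my-app.jar

The duplicate SLF4J binding warning itself is generally benign; SLF4J simply uses the first binding it finds on the classpath.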

Unable to start spark-submit python file located in s3

I'm attempting to use spark submit to start a python file located in s3. I'm trying something like:

/home/hadoop/spark/bin/spark-submit --master yarn-cluster --deploy-mode cluster s3://bucket/folder/folder/spark.py

I get some indication that it might be working:

15/05/26 23:08:35 INFO yarn.Client: Uploading resource s3://bucket/folder/folder/spark.py -> hdfs://10.196.88.42:9000/user/hadoop/.sparkStaging/application_1432676762331_0011/spark.py

But ultimately the script does not write its output to s3 as it should. It does work correctly if I submit via a local spark.py file. It also works correctly if I take the file and upload it to hdfs and start the job via:

/home/hadoop/spark/bin/spark-submit --master yarn-cluster --deploy-mode cluster hdfs://10.196.88.42:9000/user/hadoop/spark.py

I can successfully reach the s3 file from my cluster using "aws s3 cp s3://bucket/folder/folder/spark.py .", so the role permissions seem to be correct.

Although your documentation indicates this works with jar files, I suspect it may be broken for pyspark.
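
Until that is resolved, a workaround based on what already works in the report above is to stage the script off S3 first and submit the local (or HDFS) copy. A minimal sketch, reusing the paths from this issue:

    # Stage the script locally, then submit it the way that is known to work.
    aws s3 cp s3://bucket/folder/folder/spark.py /home/hadoop/spark.py
    /home/hadoop/spark/bin/spark-submit --master yarn-cluster /home/hadoop/spark.py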

Unable to set grantees for output files in S3 using spark-submit

I'm trying to use spark-submit to run a python spark job and pass in the fs.s3.canned.acl as a --conf option like this:

hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
/home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master yarn-cluster \
--conf "spark.driver.extraJavaOptions -Dfs.s3.canned.acl=BucketOwnerFullControl" \ 
hdfs:///user/hadoop/spark.py

The job runs fine and the files appear in the correct s3 bucket but when I look at the permissions of the files, there aren't any grantees on it.

I've also tried in the spark.py script adding:
conf = SparkConf().set('spark.driver.extraJavaOptions', '-Dfs.s3.canned.acl=BucketOwnerFullControl')
sc = SparkContext(conf=conf)

But the grantees on the files still do not get set.
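
One thing worth double-checking, offered as a guess: spark-submit's --conf expects a single key=value token, so "spark.driver.extraJavaOptions -Dfs.s3.canned.acl=..." (with a space instead of =) may not be applied at all. Also, in yarn-cluster mode the S3 writes happen on the executors, so the executor JVMs may need the property too. A sketch with both changes:

    hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
      /home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master yarn-cluster \
      --conf "spark.driver.extraJavaOptions=-Dfs.s3.canned.acl=BucketOwnerFullControl" \
      --conf "spark.executor.extraJavaOptions=-Dfs.s3.canned.acl=BucketOwnerFullControl" \
      hdfs:///user/hadoop/spark.py

Whether EMRFS actually reads fs.s3.canned.acl from a JVM system property is a separate question; setting it cluster-wide in core-site via a configure-hadoop bootstrap action may be the more reliable route.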

EMRFS using scopt for Scala 2.11 causing conflict with Spark running on 2.10

I was having issues running my Spark program compiled for Spark 1.2.1 and Scala 2.10.

It turns out, on EMR Spark adds to its classpath the following jar: /home/hadoop/spark/classpath/emrfs/scopt_2.11-3.2.0.jar

If you're like me and not using 2.11, this is a cause of issues. To fix it, I had to pass the scopt jar via the -u parameter to install-spark so that it ends up in /home/hadoop/spark/classpath/user-provided, which gets placed ahead of this other dependency.

I don't know much about EMRFS, but it seems like an issue given that it doesn't happen with Spark standalone.
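
For anyone hitting the same conflict, here is a sketch of the workaround described above as a bootstrap action for aws emr create-cluster. The -v and -u options are the install-spark parameters already mentioned in these issues; the bucket holding the Scala 2.10 scopt jar is a placeholder.

    # The user-provided jar ends up in /home/hadoop/spark/classpath/user-provided,
    # ahead of the bundled emrfs jars. Bucket/prefix are hypothetical.
    --bootstrap-actions Name=Spark,Path=s3://support.elasticmapreduce/spark/install-spark,\
    Args=[-v,1.2.1.a,-u,s3://my-example-bucket/spark-jars/]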

Decryption problem running presto queries with AWS client side encryption using KMS

Hi,

I have used your latest script, which successfully installs Presto server (version 0.99) and Java 8 on an Amazon EMR instance. My data files are located in an S3 bucket and were encrypted client-side with a customer-managed KMS key. When I create a Hive table that references those encrypted data files in S3, Hive can successfully decrypt the records and display them in the console. However, when viewing the same external table from the Presto command line interface, the data is displayed in its encrypted form. I have looked at the link given in
https://prestodb.io/docs/current/release/release-0.57.html and added those properties to my hive.properties file, which now looks like this:

hive.s3.connect-timeout=2m
hive.s3.max-backoff-time=10m
hive.s3.max-error-retries=50
hive.metastore-refresh-interval=1m
hive.s3.max-connections=500
hive.s3.max-client-retries=50
connector.name=hive-hadoop2
hive.s3.socket-timeout=2m
hive.s3.aws-access-key=AKIAJ*******NQ
hive.s3.aws-secret-key=UMo**_46t3M/ILIO_*****5pHj
hive.metastore.uri=thrift://localhost:9083
hive.metastore-cache-ttl=20m
hive.s3.staging-directory=/mnt/tmp/
hive.s3.use-instance-credentials=true

Any help on how to decrypt the files using the Presto CLI will be much appreciated.
