
sbt-lighter's Introduction

sbt-lighter


SBT plugin for Spark on AWS EMR.

Getting started

  1. Add sbt-lighter to project/plugins.sbt
addSbtPlugin("net.pishen" % "sbt-lighter" % "1.2.0")
  2. Set the sbt version for your project in project/build.properties (sbt 1.0 or newer is required):
sbt.version=1.1.6
  3. Prepare your build.sbt
name := "sbt-lighter-demo"

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1" % "provided"
)

sparkAwsRegion := "ap-northeast-1"

//Since we use cluster mode, we need a bucket to store our application's jar.
sparkS3JarFolder := "s3://my-emr-bucket/my-emr-folder/"

//(optional) Set the subnet id if you want to run Spark in VPC.
sparkSubnetId := Some("subnet-********")

//(optional) Additional security groups that will be attached to Master and Core instances.
sparkSecurityGroupIds := Seq("sg-********")

//(optional) Total number of instances, including master node. The default value is 1.
sparkInstanceCount := 2
  4. Write your application at src/main/scala/mypackage/Main.scala
package mypackage

import org.apache.spark._

object Main {
  def main(args: Array[String]): Unit = {
    //set up the SparkContext
    val sc = new SparkContext(new SparkConf())
    //your algorithm: estimate Pi with a Monte Carlo simulation
    val n = 10000000
    val count = sc.parallelize(1 to n).map { i =>
      val x = scala.math.random
      val y = scala.math.random
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
  }
}
  5. Submit your Spark application
> sparkSubmit arg0 arg1 ...

Note that a cluster with the same name as your project's name will be created by this command. This cluster will terminate itself automatically if there are no further steps (beyond the one you just submitted) waiting in the queue. (You can submit multiple steps into the queue by running sparkSubmit multiple times.)
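For example, running sparkSubmit twice in a row queues two steps on the same auto-terminating cluster (the arguments here are placeholders):

> sparkSubmit arg0
> sparkSubmit arg1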

If you want a keep-alive cluster, run sparkCreateCluster before sparkSubmit, and remember to terminate it with sparkTerminateCluster when you are done:

> sparkCreateCluster
> sparkSubmit arg0 arg1 ...
== Wait for your job to finish ==
> sparkTerminateCluster

The id of the created cluster will be stored in a file named .cluster_id. This file will be used to find the cluster when running other commands.
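If several environments or copies of the project need to track their own clusters, you can point the sparkClusterIdFile setting (listed under "Other available settings" below) somewhere else. A minimal sketch; the alternative file name is only an illustration:

//Illustrative alternative name for the cluster id file.
sparkClusterIdFile := baseDirectory.value / ".cluster_id_testing"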

Other available settings

//Your cluster's name. Default value is copied from your project's `name` setting.
sparkClusterName := "your-new-cluster-name"

sparkClusterIdFile := file(".cluster_id")

sparkEmrRelease := "emr-5.17.0"

sparkEmrServiceRole := "EMR_DefaultRole"

//EMR applications that will be installed, default value is Seq("Spark")
sparkEmrApplications := Seq("Spark", "Zeppelin")

sparkVisibleToAllUsers := true

//EC2 instance type of EMR Master node, default is m4.large
//Note that this is *not* the master node of Spark
sparkMasterType := "m4.large"

//EC2 instance type of EMR Core nodes, default is m4.large
sparkCoreType := "m4.large"

//EBS (disk) size of EMR Master node, default is 32GB
sparkMasterEbsSize := Some(32)

//EBS (disk) size of EMR Core nodes, default is 32GB
sparkCoreEbsSize := Some(32)

//Spot instance bid price of Master node, default is None.
sparkMasterPrice := Some(0.1)

//Spot instance bid price of Core nodes, default is None.
sparkCorePrice := Some(0.1)

sparkInstanceRole := "EMR_EC2_DefaultRole"

//EC2 keypair, default is None.
sparkInstanceKeyName := Some("your-keypair")

//EMR logging folder, default is None.
sparkS3LogUri := Some("s3://my-emr-bucket/my-emr-log-folder/")

//Configurations passed as --conf options when running spark-submit, default is an empty Map.
sparkSubmitConfs := Map("spark.executor.memory" -> "10G", "spark.executor.instances" -> "2")

//List of EMR bootstrap scripts and their parameters, if any, default is Seq.empty.
sparkEmrBootstrap := Seq(BootstrapAction("my-bootstrap", "s3://my-production-bucket/bootstrap.sh", "--full"))

Other available commands

> show sparkClusterId

> sparkListClusters

> sparkTerminateCluster

> sparkSubmitMain mypackage.Main arg0 arg1 ...

If you accidentally delete your .cluster_id file, you can bind the cluster back using:

> sparkBindCluster j-*************

Use EmrConfig to configure the applications

EMR provides a JSON syntax to configure the applications on the cluster, including Spark. Here we provide a helper class called EmrConfig, which lets you set up the configuration in an easier way.

For example, to maximize the memory allocation for each Spark job, one can use the following JSON config:

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]

Instead of using this JSON config, one can add the following setting in build.sbt to achieve the same effect:

import sbtlighter.EmrConfig

sparkEmrConfigs := Seq(
  EmrConfig("spark").withProperties("maximizeResourceAllocation" -> "true")
)

If you already have a JSON config, there's a parsing function EmrConfig.parseJson(jsonString: String) which converts the JSON array into a List[EmrConfig] (a sketch of its use follows the S3 example below). If your JSON is located on S3, you can also parse the file directly (note that this will read the file from S3 as soon as sbt loads your build):

import sbtlighter.EmrConfig

sparkEmrConfigs := EmrConfig
  .parseJsonFromS3("s3://your-bucket/your-config.json")(sparkS3Client.value)
  .right
  .get
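If you keep the JSON config in the project itself, a minimal sketch of the plain-string variant could look like this. The local file name is only an illustration, and the .right.get call assumes parseJson returns an Either in the same way parseJsonFromS3 does above:

import sbtlighter.EmrConfig

sparkEmrConfigs := {
  //Read a JSON config checked into the project (hypothetical file name).
  val jsonString = IO.read(baseDirectory.value / "emr-config.json")
  //Assumed to return an Either like parseJsonFromS3, hence the .right.get.
  EmrConfig.parseJson(jsonString).right.get
}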

Modify the configurations of underlying AWS objects

There are two settings called sparkJobFlowInstancesConfig and sparkRunJobFlowRequest, which correspond to JobFlowInstancesConfig and RunJobFlowRequest in the AWS Java SDK. Some default values are already configured in these settings, but you can modify them for your own purposes, for example:

To set the master and slave security groups separately (this requires leaving sparkSecurityGroupIds unset in step 3 of Getting started):

sparkJobFlowInstancesConfig := sparkJobFlowInstancesConfig.value
  .withAdditionalMasterSecurityGroups("sg-aaaaaaaa")
  .withAdditionalSlaveSecurityGroups("sg-bbbbbbbb")

To set the EMR auto-scaling role:

sparkRunJobFlowRequest := sparkRunJobFlowRequest.value.withAutoScalingRole("EMR_AutoScaling_DefaultRole")

To set the tags on cluster resources:

import com.amazonaws.services.elasticmapreduce.model.Tag

sparkRunJobFlowRequest := sparkRunJobFlowRequest.value.withTags(new Tag("Name", "my-cluster-name"))

To add some initial steps at cluster creation:

import com.amazonaws.services.elasticmapreduce.model._
import scala.collection.JavaConverters._

sparkRunJobFlowRequest := sparkRunJobFlowRequest.value
  .withSteps(
    new StepConfig()
      .withActionOnFailure(ActionOnFailure.CANCEL_AND_WAIT)
      .withName("Install components")
      .withHadoopJarStep(
        new HadoopJarStepConfig()
          .withJar("s3://path/to/jar")
          .withArgs(Seq("arg1", "arg2").asJava)
      )
  )

To add server-side encryption and metadata to the uploaded jar file:

import com.amazonaws.services.s3.model.ObjectMetadata

sparkS3PutObjectDecorator := { req =>
  val metadata = new ObjectMetadata()
  metadata.setSSEAlgorithm(ObjectMetadata.AES_256_SERVER_SIDE_ENCRYPTION)
  req.withMetadata(metadata)
}

Use SBT's config to provide multiple setting combinations

If you have multiple environments (e.g. different subnets, different AWS regions, etc.) for your Spark project, you can use SBT's config to provide multiple setting combinations:

import sbtlighter._

//Since we don't use the global scope now, we can disable it.
LighterPlugin.disable

//And setup your configurations
lazy val Testing = config("testing")
lazy val Production = config("production")

inConfig(Testing)(LighterPlugin.baseSettings ++ Seq(
  sparkAwsRegion := "ap-northeast-1",
  sparkSubnetId := Some("subnet-aaaaaaaa"),
  sparkSecurityGroupIds := Seq("sg-aaaaaaaa"),
  sparkInstanceCount := 1,
  sparkS3JarFolder := "s3://my-testing-bucket/my-emr-folder/"
))

inConfig(Production)(LighterPlugin.baseSettings ++ Seq(
  sparkAwsRegion := "us-west-2",
  sparkSubnetId := Some("subnet-bbbbbbbb"),
  sparkSecurityGroupIds := Seq("sg-bbbbbbbb"),
  sparkInstanceCount := 20,
  sparkS3JarFolder := "s3://my-production-bucket/my-emr-folder/",
  sparkS3LogUri := Some("s3://aws-logs-************-us-west-2/elasticmapreduce/"),
  sparkCorePrice := Some(0.39),
  sparkEmrConfigs := Seq(EmrConfig("spark").withProperties("maximizeResourceAllocation" -> "true"))
))

Then, in sbt, activate a different config with the <config>:<task/setting> syntax:

> testing:sparkSubmit

> production:sparkSubmit

Keep SBT monitoring the cluster status until it completes

There's a special command called

> sparkMonitor

which will poll the cluster's status until it terminates or exceeds the time limit.

The time limit can be defined by:

import scala.concurrent.duration._

sparkTimeoutDuration := 90.minutes

(the default value of sparkTimeoutDuration is 90 minutes)

This command will fall into one of the following three behaviors:

  1. If the cluster ran for a duration longer than sparkTimeoutDuration, terminate the cluster and throw an exception.
  2. If the cluster terminated within sparkTimeoutDuration but had some failed steps, throw an exception.
  3. If the cluster terminated without any failed step, return Unit (exit code == 0).

This command would be useful if you want to trigger some notifications. For example, a bash command like this

$ sbt 'sparkSubmit arg0 arg1' sparkMonitor

will exit with an error if the job fails or runs too long (don't enter the sbt console here; just append the task names after sbt as shown above). You can then put this command into a cron job for scheduled computation, and let cron notify you when something goes wrong.
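For instance, a crontab entry along the following lines (the schedule and project path are placeholders, and you would still need your own mail or alerting setup for the notification itself) runs the job every night:

# Hypothetical crontab entry: run the Spark job at 02:00 every night.
0 2 * * * cd /path/to/your-project && sbt 'sparkSubmit arg0 arg1' sparkMonitor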


sbt-lighter's Issues

Would be nice to expose some decorators in settings

For example, I need to add withEc2KeyName so that I can get onto the cluster. It would be nice to have a generic way of adding additional config attributes via some decorators similar to:

extraJobFlowInstancesConfig := {
  (jobFlow: JobFlowInstancesConfig) => jobFlow.withEc2KeyName("MyKeyName")
}

maven

I looked on the maven repo and can't find this but I saw the Spark Deployer. How do you want people to use this?

bootstrap action

Is it possible to set up a bootstrap action with a script through sbt-lighter?

Cannot see cluster listed on AWS elasticmapreduce

I'm sorry if this is obvious. After running sparkCreateCluster I can see a box up on the EC2 dashboard, sparkListClusters reports the cluster is there, and I can submit jobs...

but if I go to the /elasticmapreduce dashboard, the cluster is not there... Do I have the wrong expectation here? It would be nice to have some more documentation on how this looks on AWS.

Cluster never appears.

I am trying to spin up a long running cluster.

I get this:

[info] Your new cluster's id is j-V9DQSH3QKAI2, you may check its status on AWS console.
[success] Total time: 2 s, completed Oct 4, 2017 11:20:52 AM
[info] Found cluster j-V9DQSH3QKAI2, start monitoring.
....
[info] Cluster terminated without error.
[success] Total time: 17 s, completed Oct 4, 2017 11:21:09 AM 

But the cluster is nowhere to be found in the aws console. It seems like it's not spinning up at all.

Can you point me in the right direction for debugging this?

help debugging

Sorry to keep asking questions...
I tried to run my job and it crashed. I'm not sure how to debug it with this error report:

last *:sparkMonitor
[info] Found cluster j-37KJST2B19MM3, start monitoring.
java.lang.RuntimeException: Cluster terminated with abnormal step.
at scala.sys.package$.error(package.scala:27)
at sbtemrspark.EmrSparkPlugin$$anonfun$baseSettings$26.checkStatus$1(EmrSparkPlugin.scala:247)
at sbtemrspark.EmrSparkPlugin$$anonfun$baseSettings$26.apply(EmrSparkPlugin.scala:257)
at sbtemrspark.EmrSparkPlugin$$anonfun$baseSettings$26.apply(EmrSparkPlugin.scala:222)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Unable to create a cluster with on-demand instances

Executing the sparkSubmitJob command without sparkInstanceBidPrice to set up on-demand instances, I got the following error:

> sparkSubmitJob
...
[trace] Stack trace suppressed: run last *:sparkSubmitJob for the full output.
[error] (*:sparkSubmitJob) com.amazonaws.services.elasticmapreduce.model.AmazonElasticMapReduceException: An instance group is missing the bid price. (Service: AmazonElasticMapReduce; Status Code: 400; Error Code: ValidationException; Request ID: 74f9963c-1fa8-11e7-adba-1112ab79b27b)
[error] Total time: 20 s, completed 2017/04/13 2:50:04

local and EMR

Is it possible to have both a local Spark version and an EMR version in the same sbt file?

Trying Mill & not sure why I am not getting

Hi:

Trying to use your wonderful tool with Mill.

  import mill.modules.Assembly
  import coursier.maven.MavenRepository

  //object cor_poc extends ScalaModule {
  object ai_io extends ScalaModule {
    def name = "ai_io_grab_data"
    def scalaVersion = "2.12.10"
    def sparkInstanceCount = 5
    def sparkAwsRegion = "us-east-1"
    def sparkS3JarFolder = "s3://hn-sandbox/emr_jarlys/"

    def repositories = super.repositories ++ Seq(
      MavenRepository("http://dl.bintray.com/spark-packages/maven"),
      MavenRepository("https://mvnrepository.com/artifact/org.apache.spark/spark-yarn"),
      MavenRepository("https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws"),
      MavenRepository("https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk"),
      MavenRepository("https://mvnrepository.com/artifact")
      // net.pishen/sbt-lighter")
    )


(snip)...

    def ivyDeps = Agg(
      ivy"org.apache.spark::spark-core:3.0.1",
      ivy"org.apache.spark::spark-sql:3.0.1",
      ivy"org.apache.spark::spark-yarn:3.0.1",
      ivy"io.delta::delta-core:0.7.0",
      ivy"org.apache.hadoop:hadoop-aws:2.7.7",
     ivy"com.amazonaws:aws-java-sdk:1.7.4",
     ivy"net.pishen:sbt-lighter:1.2.0")

I suspect
MavenRepository("https://mvnrepository.com/artifact")
or
MavenRepository("https://mvnrepository.com/artifact/net.pishen/sbt-lighter")
are not the right places to point to. Can you tell me where I should point to?
BTW: I also tried ivy"net.pishen::sbt-lighter:1.2.0".
I also saw that publishMavenStyle := false in your sbt file. Darn~

Kind regards,

1 targets failed
ai_io.resolvedIvyDeps
Resolution failed for 1 modules:
--------------------------------------------
  net.pishen:sbt-lighter:1.2.0
        not found: /home/syoon/.ivy2/local/net.pishen/sbt-lighter/1.2.0/ivys/ivy.xml
        not found: https://repo1.maven.org/maven2/net/pishen/sbt-lighter/1.2.0/sbt-lighter-1.2.0.pom
        not found: http://dl.bintray.com/spark-packages/maven/net/pishen/sbt-lighter/1.2.0/sbt-lighter-1.2.0.pom
        not found: https://mvnrepository.com/artifact/org.apache.spark/spark-yarn/net/pishen/sbt-lighter/1.2.0/sbt-lighter-1.2.0.pom
        not found: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/net/pishen/sbt-lighter/1.2.0/sbt-lighter-1.2.0.pom
        not found: https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/net/pishen/sbt-lighter/1.2.0/sbt-lighter-1.2.0.pom
        not found: https://mvnrepository.com/artifact/net/pishen/sbt-lighter/1.2.0/sbt-lighter-1.2.0.pom

Add ability to specify a custom log4j.properties files using spark-submit

By default, EMR logs everything at INFO level, which generates massive amounts of logs. One way to counter this is to specify log4j properties in the EMR config. However, this file can only be specified once, during cluster launch, which makes it a pain when testing / debugging jobs.

Another way is to specify a log4j.properties file like this: spark-submit --files path/to/log4j.properties.

Is it possible to add this to the plugin? Would be greatly appreciated.
More info in this stackoverflow thread: https://stackoverflow.com/a/42523811

Skipping sbt in sbt-emr-spark?

This is an interesting plugin, but I wonder if the functionality it provides would not be even more interesting outside of sbt? Then it would very likely be much easier to reuse, e.g. in a CI/CD context.

Maven and Scala 2.11

Thanks for the efforts. Unfortunately, I cannot install the plugin; I guess you should cross-compile it for more versions of sbt 0.13 and also for Scala 2.11.

another stupid question

Hi :),
Sorry to bug you again. Let me know if you can help.
So I'm running a Spark job with your plugin. It runs fine on EMR with an 8-machine cluster for a couple of hours, then all of a sudden it terminates with this strange error:

SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
{"ts":"2017-10-11T21:57:09.275+00:00","msg":"Error initializing SparkContext.","logger":"org.apache.spark.SparkContext","level":"ERROR","stack_trace":"org.apache.spark.SparkException: A master URL must be set in your configuration\n\tat org.apache.spark.SparkContext.(SparkContext.scala:379) ~[spark-core_2.11-2.1.0.jar:2.1.0]\n\tat RunExtractors$.main(RunExtractors.scala:28) [classes/:na]\n\tat RunExtractors.main(RunExtractors.scala) [classes/:na]\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_66]\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_66]\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_66]\n\tat java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_66]\n\tat sbt.Run.invokeMain(Run.scala:67) [run-0.13.13.jar:0.13.13]\n\tat sbt.Run.run0(Run.scala:61) [run-0.13.13.jar:0.13.13]\n\tat sbt.Run.sbt$Run$$execute$1(Run.scala:51) [run-0.13.13.jar:0.13.13]\n\tat sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:55) [run-0.13.13.jar:0.13.13]\n\tat sbt.Run$$anonfun$run$1.apply(Run.scala:55) [run-0.13.13.jar:0.13.13]\n\tat sbt.Run$$anonfun$run$1.apply(Run.scala:55) [run-0.13.13.jar:0.13.13]\n\tat sbt.Logger$$anon$4.apply(Logger.scala:84) [logging-0.13.13.jar:0.13.13]\n\tat sbt.TrapExit$App.run(TrapExit.scala:248) [run-0.13.13.jar:0.13.13]\n\tat java.lang.Thread.run(Thread.java:745) [na:1.8.0_66]\n","HOSTNAME":"Administrators-MacBook-Pro-2.local"}
[error] (run-main-0) org.apache.spark.SparkException: A master URL must be set in your configuration
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.(SparkContext.scala:379)
at RunExtractors$.main(RunExtractors.scala:28)
at RunExtractors.main(RunExtractors.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)

any idea?
