high-performance-spark-examples

Examples for High Performance Spark

We are in the process of updating this for Spark 3.5+ and the 2nd edition of our book!

Building

Most of the examples can be built with sbt; the C and Fortran components depend on gcc, g77, and cmake.

Tests

The full test suite depends on having the C and Fortran components built, as well as a local R installation available.

The most "accurate" way of seeing how we run the tests is to look at the .github workflows.

History Server

The history server can be a great way to figure out what's going on.

By default the history server reads from /tmp/spark-events, so you'll need to create that directory if it doesn't already exist:

mkdir -p /tmp/spark-events

The scripts for running the examples generally run with the event log enabled.

You can set SPARK_EVENTLOG=true before running the Scala tests to get event logs for the history server too!

e.g.

SPARK_EVENTLOG=true sbt test

If you want to run just a specific test you can use sbt's testOnly task.

Then, to view the results, launch the history server with ${SPARK_HOME}/sbin/start-history-server.sh and open your local history server in a browser.
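The steps above can be sketched as follows (assuming SPARK_HOME points at a local Spark installation; 18080 is the history server's default port):

```shell
# create the default event log directory if it doesn't exist
mkdir -p /tmp/spark-events

# start the history server from the Spark installation, if one is configured
if [ -n "${SPARK_HOME:-}" ]; then
  "${SPARK_HOME}/sbin/start-history-server.sh"
fi

# then open http://localhost:18080 in a browser to inspect completed runs
```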

high-performance-spark-examples's People

Contributors

holdenk, jiminhsieh, maddatascience, mahmoudhanafy, rachelwarren


high-performance-spark-examples's Issues

port to SBT 1.*

It's difficult to work in IntelliJ with sbt 0.13 (it takes a long time to "dump project structure from sbt"). Plugins like sbt-spark-package also seem not to work with sbt 1.x.

Project failing to build

It could be my own unfamiliarity with Scala, but I'm currently unable to build this project as-is using SBT. I'm getting the stack trace pasted below from IntelliJ (I get something similar when running 'sbt compile' from the CLI).

SBT 'high-performance-spark-examples' project refresh failed
    Error: Error while importing SBT project:
...
[info] Resolving org.scala-sbt#apply-macro;0.13.13 ...
[info] Resolving org.spire-math#json4s-support_2.10;0.6.0 ...
[info] Resolving org.codehaus.plexus#plexus-component-annotations;1.5.5 ...
[info] Resolving javax.annotation#jsr250-api;1.0 ...
[info] Resolving com.thoughtworks.paranamer#paranamer;2.6 ...
[info] Resolving com.typesafe#config;1.2.0 ...
[info] Resolving org.scala-sbt#test-agent;0.13.13 ...
[info] Resolving org.scala-sbt#classfile;0.13.13 ...
[info] Resolving org.scala-sbt#completion;0.13.13 ...
[info] Resolving org.scala-sbt#test-interface;1.0 ...
[info] Resolving com.jcraft#jsch;0.1.50 ...
[info] Resolving org.scala-lang#scala-compiler;2.10.6 ...
[info] Resolving org.scala-sbt#interface;0.13.13 ...
[info] Resolving javax.inject#javax.inject;1 ...
[info] Resolving org.scala-sbt#logging;0.13.13 ...
[trace] Stack trace suppressed: run 'last *:ssExtractDependencies' for the full output.
[trace] Stack trace suppressed: run 'last *:update' for the full output.
[error] (*:ssExtractDependencies) sbt.ResolveException: download failed: org.mortbay.jetty#jetty;6.1.26!jetty.zip
[error] (*:update) sbt.ResolveException: download failed: org.mortbay.jetty#jetty;6.1.26!jetty.zip
[error] Total time: 19 s, completed Jul 10, 2017 10:47:35 AM
See complete log in file:/Users/adam/Library/Logs/IdeaIC2017.1/sbt.last.log

The feedback on a code bug at /goldilocks/GoldilocksSecondarySort.scala

Where does the pattern head :: rest in the following function come from?

def groupSorted[K, S, V](it: Iterator[((K, S), V)]): Iterator[(K, List[(S, V)])] = {
  val res = List[(K, ArrayBuffer[(S, V)])]()
  it.foldLeft(res)(
    (list, next) => list match {
      case Nil =>
        val ((firstKey, secondKey), value) = next
        List((firstKey, ArrayBuffer((secondKey, value))))
      case head :: rest =>
        val (curKey, valueBuf) = head
        val ((firstKey, secondKey), value) = next
        if (!firstKey.equals(curKey)) {
          (firstKey, ArrayBuffer((secondKey, value))) :: list
        } else {
          valueBuf.append((secondKey, value))
          list
        }
    }
  ).map { case (key, buf) => (key, buf.toList) }.iterator
}
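The head :: rest in that match doesn't come from anywhere in the input; it's Scala's cons extractor pattern on List. Matching a non-empty list against head :: rest binds head to the first element and rest to the remainder. A minimal self-contained illustration:

```scala
// `::` acts as an extractor in patterns: a non-empty List matches
// `head :: rest`, binding the first element and the tail.
def describe(list: List[Int]): String = list match {
  case Nil          => "empty"
  case head :: rest => s"head=$head, rest has ${rest.size} elements"
}

val emptyCase    = describe(Nil)
val nonEmptyCase = describe(List(1, 2, 3))
```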

QuantileOnlyArtisanalTest test("Secondary Sort") error

val r = SecondarySort.groupByKeyAndSortBySecondaryKey(data, 3)
val rSorted = r.collect().sortWith(lt = (a, b) => a._1.toDouble > b._1.toDouble)
assert(r.collect().zipWithIndex.forall {
  case ((key, list), index) => rSorted(index)._1.equals(key)
})

Actually, r is not ordered, so it's not correct to compare r element-wise with rSorted.
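One way to make such an assertion meaningful is to sort both sides with the same ordering before comparing keys. The sketch below uses a plain Scala array as a hypothetical stand-in for r.collect() (the values are made up for illustration):

```scala
// Hypothetical stand-in for r.collect(): keys arrive in arbitrary order.
val collected = Array(("2.0", List(5)), ("1.0", List(3, 4)), ("3.0", List(6)))

// Sort BOTH sides with the same ordering before comparing element-wise.
val lhs = collected.sortBy(_._1.toDouble)
val rhs = collected.sortWith((a, b) => a._1.toDouble < b._1.toDouble)

val keysMatch = lhs.zip(rhs).forall { case ((k1, _), (k2, _)) => k1 == k2 }
```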

example 6-8 in SecondarySort.scala error

object CoPartitioningLessons corresponds to the book's Examples 6-8 and 6-9. Both functions use two different Partitioners to show co-location and co-partitioning.

Failed to find a default value for inputCol

There seems to be a problem with https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/src/main/scala/com/high-performance-spark-examples/ml/CustomPipeline.scala#L125. For example, if we try testing it thus

    val indexer = new SimpleIndexer()
    indexer.setInputCol("inputColumn")
    indexer.setOutputCol("categoryIndex")
    val model = indexer.fit(ds)
    val predicted = model.transform(ds)

the indexer has inputCol set, but the model (that is, the object returned by indexer.fit) does not. So when we try model.transform, it complains that it "Failed to find a default value for inputCol".

Following https://stackoverflow.com/questions/40847625/spark-custom-estimator-access-to-paramt, @conorbmurphy and I were able to "solve" the problem by hard coding the column names within the class with

setDefault(inputCol, "inputColumn")
setDefault(outputCol, "categoryIndex")

but that's hardly the right solution. What should happen is that the paramMap is copied into the SimpleIndexerModel when it's generated in the fit method, but we can't figure out how to do that since paramMap is protected.
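In Spark ML, the Params trait exposes a protected copyValues helper that an estimator's fit method can call on the model it constructs, which is the usual way the estimator's params reach the model. The snippet below is a dependency-free sketch of that idea (the class and field names are invented for illustration, not Spark's API):

```scala
// Dependency-free sketch: the estimator copies its param values onto
// the model it produces in fit(), mirroring what copyValues does.
class HasCols(var inputCol: Option[String] = None,
              var outputCol: Option[String] = None)

class SimpleModel extends HasCols {
  def transform(row: Map[String, String]): Map[String, String] = {
    val in = inputCol.getOrElse(sys.error("Failed to find a default value for inputCol"))
    row + (outputCol.getOrElse("out") -> row(in).toUpperCase)
  }
}

class SimpleEstimator extends HasCols {
  def fit(): SimpleModel = {
    val model = new SimpleModel
    // the crucial step: copy the estimator's param values onto the model
    model.inputCol = inputCol
    model.outputCol = outputCol
    model
  }
}

val est = new SimpleEstimator
est.inputCol = Some("word")
est.outputCol = Some("upper")
val transformed = est.fit().transform(Map("word" -> "spark"))
```

Without the copy step in fit(), transform would fail exactly as described in the issue, since the model would have no inputCol.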
