high-performance-spark-examples

Examples for High Performance Spark

We are in the process of updating this for Spark 3.5+ and the 2nd edition of our book!

Building

Most of the examples can be built with sbt; the C and Fortran components depend on gcc, g77, and cmake.

Tests

The full test suite depends on having the C and Fortran components built, as well as a local R installation available.

The most "accurate" way of seeing how we run the tests is to look at the .github workflows.

History Server

The history server can be a great way to figure out what's going on.

By default the history server reads from /tmp/spark-events, so you'll need to create that directory if it doesn't already exist:

mkdir -p /tmp/spark-events

The scripts for running the examples generally run with the event log enabled.

You can set SPARK_EVENTLOG=true before running the Scala tests to get event logs for the history server too!

e.g.

SPARK_EVENTLOG=true sbt test

If you want to run just a specific test you can use sbt's testOnly task.

Then, to view the results, launch the history server with ${SPARK_HOME}/sbin/start-history-server.sh and open your local history server in a browser.
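The steps above can be sketched as follows (assuming SPARK_HOME points at a local Spark installation; 18080 is the history server's default port):

```shell
# create the default event log directory if it doesn't exist
mkdir -p /tmp/spark-events

# start the history server from the Spark installation, if one is configured
if [ -n "${SPARK_HOME:-}" ]; then
  "${SPARK_HOME}/sbin/start-history-server.sh"
fi

# then open http://localhost:18080 in a browser to inspect completed runs
```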

high-performance-spark-examples's People

Contributors

holdenk, jiminhsieh, maddatascience, mahmoudhanafy, rachelwarren


high-performance-spark-examples's Issues

port to SBT 1.*

It's difficult to work in IntelliJ with sbt 0.13 (it takes a long time to "dump project structure from sbt"). Plugins like sbt-spark-package also seem not to work with sbt 1.x.

Project failing to build

It could be my own unfamiliarity with Scala, but I'm currently unable to build this project as-is using SBT. I'm getting the stack trace pasted below from IntelliJ (I get something similar when running 'sbt compile' from the CLI).

SBT 'high-performance-spark-examples' project refresh failed
    Error: Error while importing SBT project:
...
[info] Resolving org.scala-sbt#apply-macro;0.13.13 ...
[info] Resolving org.spire-math#json4s-support_2.10;0.6.0 ...
[info] Resolving org.codehaus.plexus#plexus-component-annotations;1.5.5 ...
[info] Resolving javax.annotation#jsr250-api;1.0 ...
[info] Resolving com.thoughtworks.paranamer#paranamer;2.6 ...
[info] Resolving com.typesafe#config;1.2.0 ...
[info] Resolving org.scala-sbt#test-agent;0.13.13 ...
[info] Resolving org.scala-sbt#classfile;0.13.13 ...
[info] Resolving org.scala-sbt#completion;0.13.13 ...
[info] Resolving org.scala-sbt#test-interface;1.0 ...
[info] Resolving com.jcraft#jsch;0.1.50 ...
[info] Resolving org.scala-lang#scala-compiler;2.10.6 ...
[info] Resolving org.scala-sbt#interface;0.13.13 ...
[info] Resolving javax.inject#javax.inject;1 ...
[info] Resolving org.scala-sbt#logging;0.13.13 ...
[trace] Stack trace suppressed: run 'last *:ssExtractDependencies' for the full output.
[trace] Stack trace suppressed: run 'last *:update' for the full output.
[error] (*:ssExtractDependencies) sbt.ResolveException: download failed: org.mortbay.jetty#jetty;6.1.26!jetty.zip
[error] (*:update) sbt.ResolveException: download failed: org.mortbay.jetty#jetty;6.1.26!jetty.zip
[error] Total time: 19 s, completed Jul 10, 2017 10:47:35 AM
See complete log in file:/Users/adam/Library/Logs/IdeaIC2017.1/sbt.last.log

The feedback on a code bug at /goldilocks/GoldilocksSecondarySort.scala

Where does the pattern head :: rest in the following function come from?

def groupSorted[K, S, V](it: Iterator[((K, S), V)]): Iterator[(K, List[(S, V)])] = {
  val res = List[(K, ArrayBuffer[(S, V)])]()
  it.foldLeft(res)(
    (list, next) => list match {
      case Nil =>
        val ((firstKey, secondKey), value) = next
        List((firstKey, ArrayBuffer((secondKey, value))))
      case head :: rest =>
        val (curKey, valueBuf) = head
        val ((firstKey, secondKey), value) = next
        if (!firstKey.equals(curKey)) {
          (firstKey, ArrayBuffer((secondKey, value))) :: list
        } else {
          valueBuf.append((secondKey, value))
          list
        }
    }
  ).map { case (key, buf) => (key, buf.toList) }.iterator
}
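The head :: rest in that match doesn't come from anywhere in the input; it's Scala's cons extractor pattern on List. Matching a non-empty list against head :: rest binds head to the first element and rest to the remainder. A minimal self-contained illustration:

```scala
// `::` acts as an extractor in patterns: a non-empty List matches
// `head :: rest`, binding the first element and the tail.
def describe(list: List[Int]): String = list match {
  case Nil          => "empty"
  case head :: rest => s"head=$head, rest has ${rest.size} elements"
}

val emptyCase    = describe(Nil)
val nonEmptyCase = describe(List(1, 2, 3))
```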

QuantileOnlyArtisanalTest test("Secondary Sort") error

val r = SecondarySort.groupByKeyAndSortBySecondaryKey(data, 3)
val rSorted = r.collect().sortWith(lt = (a, b) => a._1.toDouble > b._1.toDouble)
assert(r.collect().zipWithIndex.forall {
  case ((key, list), index) => rSorted(index)._1.equals(key)
})

Actually, r is not ordered, so it's not correct to compare r element-wise with rSorted.
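One way to make such an assertion meaningful is to sort both sides with the same ordering before comparing keys. The sketch below uses a plain Scala array as a hypothetical stand-in for r.collect() (the values are made up for illustration):

```scala
// Hypothetical stand-in for r.collect(): keys arrive in arbitrary order.
val collected = Array(("2.0", List(5)), ("1.0", List(3, 4)), ("3.0", List(6)))

// Sort BOTH sides with the same ordering before comparing element-wise.
val lhs = collected.sortBy(_._1.toDouble)
val rhs = collected.sortWith((a, b) => a._1.toDouble < b._1.toDouble)

val keysMatch = lhs.zip(rhs).forall { case ((k1, _), (k2, _)) => k1 == k2 }
```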

example 6-8 in SecondarySort.scala error

object CoPartitioningLessons corresponds to the book's Examples 6-8 and 6-9. Both functions use two different Partitioners to show co-location and co-partitioning.

Failed to find a default value for inputCol

There seems to be a problem with https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/src/main/scala/com/high-performance-spark-examples/ml/CustomPipeline.scala#L125. For example, if we try testing it thus

    val indexer = new SimpleIndexer()
    indexer.setInputCol("inputColumn")
    indexer.setOutputCol("categoryIndex")
    val model = indexer.fit(ds)
    val predicted = model.transform(ds)

the indexer has inputCol set, but the model (that is, the object returned by indexer.fit) does not. So when we try model.transform, it complains that it "Failed to find a default value for inputCol".

Following https://stackoverflow.com/questions/40847625/spark-custom-estimator-access-to-paramt, @conorbmurphy and I were able to "solve" the problem by hard coding the column names within the class with

setDefault(inputCol, "inputColumn")
setDefault(outputCol, "categoryIndex")

but that's hardly the right solution. What should happen is that the paramMap is copied into the SimpleIndexerModel when it's generated in the fit method, but we can't figure out how to do that since paramMap is protected.
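In Spark ML, the Params trait exposes a protected copyValues helper that an estimator's fit method can call on the model it constructs, which is the usual way the estimator's params reach the model. The snippet below is a dependency-free sketch of that idea (the class and field names are invented for illustration, not Spark's API):

```scala
// Dependency-free sketch: the estimator copies its param values onto
// the model it produces in fit(), mirroring what copyValues does.
class HasCols(var inputCol: Option[String] = None,
              var outputCol: Option[String] = None)

class SimpleModel extends HasCols {
  def transform(row: Map[String, String]): Map[String, String] = {
    val in = inputCol.getOrElse(sys.error("Failed to find a default value for inputCol"))
    row + (outputCol.getOrElse("out") -> row(in).toUpperCase)
  }
}

class SimpleEstimator extends HasCols {
  def fit(): SimpleModel = {
    val model = new SimpleModel
    // the crucial step: copy the estimator's param values onto the model
    model.inputCol = inputCol
    model.outputCol = outputCol
    model
  }
}

val est = new SimpleEstimator
est.inputCol = Some("word")
est.outputCol = Some("upper")
val transformed = est.fit().transform(Map("word" -> "spark"))
```

Without the copy step in fit(), transform would fail exactly as described in the issue, since the model would have no inputCol.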
