
sap-archive / hanavora-extensions

36 stars, 24 watchers, 18 forks, 2.23 MB

Spark extensions for business contexts

License: Apache License 2.0

Languages: Scala 98.70%, Java 0.95%, HTML 0.34%
Topics: open-source

hanavora-extensions's Introduction

Important Notice

We have decided to stop maintaining this public GitHub repository.

HANA Vora Spark extensions

These are extensions to Apache Spark developed for SAP HANA Vora. They can be used with any supported Spark version, even without Vora. Note that some features may perform significantly better when the HANA Vora data source is used.

First Steps

Prerequisites

Minimal requirements for the Spark extensions:

  1. Java SE 7 (or later) installed
  2. Maven installed (mvn -version should work)
  3. Spark 1.6.1 installed, with SPARK_HOME set to the installation directory (note that the only supported Spark versions are 1.6.0 and 1.6.1; see the quick check below)
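
A quick way to sanity-check the Java version and SPARK_HOME from any Scala REPL (a sketch; the exact values depend on your setup):

// Expect 1.7 or later here
println(System.getProperty("java.version"))

// SPARK_HOME must point at a Spark 1.6.0 or 1.6.1 installation
println(sys.env.getOrElse("SPARK_HOME", "SPARK_HOME is not set!"))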

Building

Build the distribution package with Maven:

mvn clean package

You can also skip the tests by setting the maven.test.skip property on the command line:

mvn clean package -Dmaven.test.skip=true

Then extract the package to its target directory:

export SAP_SPARK_HOME=$HOME/sap-spark-extensions # choose your install dir
mkdir -p $SAP_SPARK_HOME
tar xzpf ./dist/target/spark-sap-extensions-*-dist.tar.gz -C $SAP_SPARK_HOME

Starting an Extended Spark Shell

From the command line, execute:

$SAP_SPARK_HOME/bin/start-spark-shell.sh

Using the Extensions

While the Spark shell starts up with a SparkContext and an SQLContext predefined, you need to instantiate a SapSQLContext to make use of the extensions.

import org.apache.spark.sql._

val voraContext = new SapSQLContext(sc)

// now we can start with some SQL queries
voraContext.sql("SHOW TABLES").show()

Further Documentation

Package Documentation

TODO

Inline Documentation

You can also start at the rootdoc and work your way through the code from there.

Troubleshooting

Wrong Spark Version
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.optimizer.Optimizer: method <init>()V not found
  at org.apache.spark.sql.extension.ExtendableOptimizer.<init>(ExtendableOptimizer.scala:13)
  at org.apache.spark.sql.hive.ExtendableHiveContext.optimizer$lzycompute(ExtendableHiveContext.scala:93)
  at org.apache.spark.sql.hive.ExtendableHiveContext.optimizer(ExtendableHiveContext.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:43)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:43)
  ...

This happens when Spark version 1.6.2 is used, due to a change in the Optimizer interface. Because of this compatibility issue, we only support versions 1.6.0 and 1.6.1.
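
To confirm which version your shell is actually running, print it from the predefined SparkContext:

// Should print 1.6.0 or 1.6.1; 1.6.2 triggers the error above
println(sc.version)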

hanavora-extensions's People

Contributors

ajnavarro, alpkom, haty, martin-weidner, martinhartig, olafschmidt, opuertas, pc-jedi, smola, stephankessler, wamdo


hanavora-extensions's Issues

Unable to execute start-spark-shell.sh from Mac

After getting a distribution built (see #4), I couldn't get start-spark-shell.sh working on my Mac, because the -f option of readlink is not available on macOS by default.

[bin] ./start-spark-shell.sh 
readlink: illegal option -- f
usage: readlink [-n] [file ...]
readlink: illegal option -- f
usage: readlink [-n] [file ...]
readlink: illegal option -- f
usage: readlink [-n] [file ...]
[ERROR] too many jar files in 
[ERROR] please check your installation

Installing greadlink (via brew install coreutils) moved things forward a bit, but I then ran into another error:

[bin] ./start-spark-shell.sh 
[ERROR] too many jar files in /Users/ramnivas/dev/vora/dist/lib
[ERROR] please check your installation

I verified that there is only one file in the lib directory.

For now, I am able to work around this by executing:

$ spark-shell.sh --jars <path to spark-sap-parent-1.3.69-assembly.jar or core-1.3.69.jar>

which seems to be effectively what start-spark-shell.sh does anyway.

SapSQLContext fails with DataFrame created through Cassandra

If I create a DataFrame through SapSQLContext and run a SQL query that performs a join, it fails with a serialization exception. However, identical code works fine with the plain SQLContext. Here is minimized code to reproduce it (I am using a self-join to keep things simple):

bin/spark-shell --conf spark.cassandra.connection.host=127.0.0.1 \
                --packages datastax:spark-cassandra-connector:1.6.1-s_2.10 \
                --jars ~/.m2/repository/com/sap/spark/core/1.3.79/core-1.3.79.jar
import org.apache.spark.sql._

val sqlc = new SapSQLContext(sc)

val df = sqlc.read.format("org.apache.spark.sql.cassandra").
         options(Map("table" -> "employee", "keyspace" -> "organization")).load()
df.show()

df.registerTempTable("employee")

val bugDf = sqlc.sql("select * from employee report, employee manager")
bugDf.show()

This results in a long stack trace, but the gist of it is the following line:

java.io.NotSerializableException for org.apache.spark.sql.cassandra.CassandraSourceRelation 

If you replace the line

val sqlc = new SapSQLContext(sc)

with

val sqlc = new SQLContext(sc)

it all works fine. It also works fine if I don't have any joins.

Here is the Cassandra CQL that you may find helpful in reproducing the problem.

CREATE KEYSPACE organization WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE organization.employee (id INT PRIMARY KEY, name TEXT, role TEXT, salary FLOAT, reports_to INT);

INSERT INTO organization.employee (id, name, role, salary) values (1, 'A', 'CEO', 100);

Issue while running start-spark-shell.sh

Hello,

The Spark shell starts correctly when I execute the script start-spark-shell.sh.
But once I run the command sqlContext.sql("show tables").show, I get the error

java.lang.RuntimeException: Stream '/jars/*.jar' was not found.

Can someone help, please?

Kindly note, I just compiled the project and ran the .sh file in dist.

Still private

Are you planning to upload the sources and make the repository public soon?

Issue with append data command

Hello,

I tried executing the following command

%vora APPEND TABLE EMPLOYEE OPTIONS (paths "hdfs://localhost:9000/tmp/SHA_Employee_append.csv", eagerLoad "true")

& I got the following error

Error: com.sap.spark.hana.HANARelation cannot be cast to org.apache.spark.sql.sources.AppendRelation

Kindly note, basic commands like reading from and writing to HANA still work fine for me.
Can somebody help?

Thanks & regards,
Subhobrata

Failed to find data source: com.sap.spark.vora

After working through #4 and #5, I am running into another problem.

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> val sqlc = new SapSQLContext(sc)
sqlc: org.apache.spark.sql.SapSQLContext = org.apache.spark.sql.SapSQLContext@221ec884

scala>sqlc.sql("""CREATE TABLE people (name string, age int) USING com.sap.spark.vora OPTIONS (files "path-to/test.json") """)
java.lang.ClassNotFoundException: Failed to find data source: com.sap.spark.vora. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
    at org.apache.spark.sql.DefaultDatasourceResolver.lookup(DatasourceResolver.scala:74)
...

Note that I am starting the spark-shell as described in the workaround for #5.

Here are the versions I am using:
Spark: 1.6.1 (with hadoop 2.6)
Scala: 2.10.5
Java: 1.8.0_51
