
sap-archive / hanavora-extensions

36 stars, 24 watchers, 18 forks, 2.23 MB

Spark extensions for business contexts

License: Apache License 2.0

Languages: Scala 98.70%, Java 0.95%, HTML 0.34%
Topics: open-source

hanavora-extensions's Introduction

Important Notice

We have decided to stop maintaining this public GitHub repository.

HANA Vora Spark extensions

These are extensions to Apache Spark developed for SAP HANA Vora. They can be used with any supported Spark version, even without Vora. Note that some features may perform significantly better when the HANA Vora data source is used.

First Steps

Prerequisites

Minimal requirements for the Spark extensions:

  1. Java SE 7 (or later) installed
  2. Maven installed (mvn -version should work)
  3. Spark 1.6.1 installed, with SPARK_HOME set to the installation directory (note that the only supported Spark versions are 1.6.0 and 1.6.1; see the quick check below)
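
A quick way to sanity-check the Java version and SPARK_HOME from any Scala REPL (a sketch; the exact values depend on your setup):

// Expect 1.7 or later here
println(System.getProperty("java.version"))

// SPARK_HOME must point at a Spark 1.6.0 or 1.6.1 installation
println(sys.env.getOrElse("SPARK_HOME", "SPARK_HOME is not set!"))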

Building

Build the distribution package with Maven:

mvn clean package

You can also skip the tests by setting the maven.test.skip property on the command line:

mvn clean package -Dmaven.test.skip=true

Then extract the package to its target directory:

export SAP_SPARK_HOME=$HOME/sap-spark-extensions # choose your install dir
mkdir -p $SAP_SPARK_HOME
tar xzpf ./dist/target/spark-sap-extensions-*-dist.tar.gz -C $SAP_SPARK_HOME

Starting an Extended Spark Shell

From the command line, execute:

$SAP_SPARK_HOME/bin/start-spark-shell.sh

Using the Extensions

While the Spark shell starts up with a SparkContext and an SQLContext predefined, you need to instantiate a SapSQLContext to make use of the extensions.

import org.apache.spark.sql._

val voraContext = new SapSQLContext(sc)

// now we can start with some SQL queries
voraContext.sql("SHOW TABLES").show()

Further Documentation

Package Documentation

TODO

Inline Documentation

You can also start at the rootdoc and work your way through the code from there.

Troubleshooting

Wrong Spark Version
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.optimizer.Optimizer: method <init>()V not found
  at org.apache.spark.sql.extension.ExtendableOptimizer.<init>(ExtendableOptimizer.scala:13)
  at org.apache.spark.sql.hive.ExtendableHiveContext.optimizer$lzycompute(ExtendableHiveContext.scala:93)
  at org.apache.spark.sql.hive.ExtendableHiveContext.optimizer(ExtendableHiveContext.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:43)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:43)
  ...

This happens when Spark version 1.6.2 is used, due to a change in the Optimizer interface. Because of this compatibility issue, we only support versions 1.6.0 and 1.6.1.
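
To confirm which version your shell is actually running, print it from the predefined SparkContext:

// Should print 1.6.0 or 1.6.1; 1.6.2 triggers the error above
println(sc.version)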

hanavora-extensions's People

Contributors

ajnavarro, alpkom, haty, martin-weidner, martinhartig, olafschmidt, opuertas, pc-jedi, smola, stephankessler, wamdo


hanavora-extensions's Issues

Unable to execute start-spark-shell.sh from Mac

After getting a distribution built (see #4), I couldn't get start-spark-shell.sh working on my Mac, because the -f option of readlink is not available on macOS by default.

[bin] ./start-spark-shell.sh 
readlink: illegal option -- f
usage: readlink [-n] [file ...]
readlink: illegal option -- f
usage: readlink [-n] [file ...]
readlink: illegal option -- f
usage: readlink [-n] [file ...]
[ERROR] too many jar files in 
[ERROR] please check your installation

Installing greadlink (via brew install coreutils) moved things forward a bit, but I then ran into another error:

[bin] ./start-spark-shell.sh 
[ERROR] too many jar files in /Users/ramnivas/dev/vora/dist/lib
[ERROR] please check your installation

I verified that there is only one file in the lib directory.

For now, I am able to work around this by executing:

$ spark-shell.sh --jars <path to spark-sap-parent-1.3.69-assembly.jar or core-1.3.69.jar>

which seems to be effectively what start-spark-shell.sh does anyway.

SapSQLContext fails with DataFrame created through Cassandra

If I create a DataFrame through SapSQLContext and run a SQL query that performs a join, it fails with a serialization exception. However, identical code works fine with the plain SQLContext. Here is minimized code to reproduce it (I am using a self-join to keep things simple):

bin/spark-shell --conf spark.cassandra.connection.host=127.0.0.1 \
                --packages datastax:spark-cassandra-connector:1.6.1-s_2.10 \
                --jars ~/.m2/repository/com/sap/spark/core/1.3.79/core-1.3.79.jar
import org.apache.spark.sql._

val sqlc = new SapSQLContext(sc)

val df = sqlc.read.format("org.apache.spark.sql.cassandra").
         options(Map("table" -> "employee", "keyspace" -> "organization")).load()
df.show()

df.registerTempTable("employee")

val bugDf = sqlc.sql("select * from employee report, employee manager")
bugDf.show()

This results in a long stack trace, but the gist of it is the following line:

java.io.NotSerializableException for org.apache.spark.sql.cassandra.CassandraSourceRelation 

If you replace the line

val sqlc = new SapSQLContext(sc)

with

val sqlc = new SQLContext(sc)

it all works fine. It also works fine if I don't have any joins.

Here is the Cassandra CQL that you may find helpful in reproducing the problem.

CREATE KEYSPACE organization WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE organization.employee (id INT PRIMARY KEY, name TEXT, role TEXT, salary FLOAT, reports_to INT);

INSERT INTO organization.employee (id, name, role, salary) values (1, 'A', 'CEO', 100);

Issue while running start-spark-shell.sh

Hello,

The Spark shell starts correctly when I execute the script start-spark-shell.sh.
But once I run the command sqlContext.sql("show tables").show, I get the error

java.lang.RuntimeException: Stream '/jars/*.jar' was not found.

Can someone help, please?

Kindly note, I just compiled the project and ran the .sh file in dist.

Still private

Are you planning to upload the sources and make the repository public soon?

Issue with append data command

Hello,

I tried executing the following command

%vora APPEND TABLE EMPLOYEE OPTIONS (paths "hdfs://localhost:9000/tmp/SHA_Employee_append.csv", eagerLoad "true")

& I got the following error

Error: com.sap.spark.hana.HANARelation cannot be cast to org.apache.spark.sql.sources.AppendRelation

Kindly note, basic commands like reading from and writing to HANA still work fine for me.
Can somebody help?

Thanks & regards,
Subhobrata

Failed to find data source: com.sap.spark.vora

After working through #4 and #5, I am running into another problem.

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> val sqlc = new SapSQLContext(sc)
sqlc: org.apache.spark.sql.SapSQLContext = org.apache.spark.sql.SapSQLContext@221ec884

scala>sqlc.sql("""CREATE TABLE people (name string, age int) USING com.sap.spark.vora OPTIONS (files "path-to/test.json") """)
java.lang.ClassNotFoundException: Failed to find data source: com.sap.spark.vora. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
    at org.apache.spark.sql.DefaultDatasourceResolver.lookup(DatasourceResolver.scala:74)
...

Note that I am starting the spark-shell as described in the workaround for #5.

Here are the versions I am using:
Spark: 1.6.1 (with hadoop 2.6)
Scala: 2.10.5
Java: 1.8.0_51
