jgit-spark-connector's Introduction

jgit-spark-connector

jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.

It is written in Scala and built on top of Apache Spark to enable rapid construction of custom analysis pipelines and the processing of large numbers of Git repositories stored in HDFS in the Siva file format. It is accessible through both the Scala and Python Spark APIs, and capable of running on large-scale distributed clusters.

Deprecated

jgit-spark-connector has been deprecated in favor of gitbase-spark-connector, and there will be no further development of this tool.

Quick-start

First, download Apache Spark somewhere on your machine:

$ cd /tmp && wget "https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz" -O spark-2.2.1-bin-hadoop2.7.tgz

The Apache Software Foundation suggests a suitable mirror to download Spark from. If you want to review the mirrors and pick the best option for your case, you can do so here.

Then you must extract Spark from the downloaded tar file:

$ tar -C ~/ -xvzf spark-2.2.1-bin-hadoop2.7.tgz

Binaries and scripts to run Spark are located in spark-2.2.1-bin-hadoop2.7/bin, so you should set PATH and SPARK_HOME to point to this directory. It's advisable to add this to your shell profile:

$ export SPARK_HOME=$HOME/spark-2.2.1-bin-hadoop2.7
$ export PATH=$PATH:$SPARK_HOME/bin

Look up the latest jgit-spark-connector version and use it in place of [version] in the commands below:

$ spark-shell --packages "tech.sourced:jgit-spark-connector:[version]"

# or

$ pyspark --packages "tech.sourced:jgit-spark-connector:[version]"

Run the bblfsh daemon. You can start it easily in a container by following its quick start guide.

If you run jgit-spark-connector in a UNIX-like environment, you should set the LANG variable properly:

export LANG="en_US.UTF-8"

The rationale behind this is that UNIX file systems don't store the encoding of file names; they are just plain bytes, so the Java filesystem API looks at the LANG environment variable to decide which encoding to apply.

If the LANG variable is not set to a UTF-8 encoding, or is not set at all (which results in encoding being handled with the C locale), you may get an exception during jgit-spark-connector execution similar to java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters.

Pre-requisites

Python pre-requisites:

  • Python >= 3.4.x (jgit-spark-connector is tested with Python 3.4, 3.5 and 3.6 and these are the supported versions, though it might still work with earlier ones)
  • libxml2-dev installed
  • python3-dev installed
  • g++ installed

Examples of jgit-spark-connector usage

jgit-spark-connector is available on Maven Central. To add it to your project as a dependency:

For projects managed by Maven, add the following to your pom.xml:

<dependency>
    <groupId>tech.sourced</groupId>
    <artifactId>jgit-spark-connector</artifactId>
    <version>[version]</version>
</dependency>

For sbt-managed projects, add the dependency:

libraryDependencies += "tech.sourced" % "jgit-spark-connector" % "[version]"

In both cases, replace [version] with the latest jgit-spark-connector version.

Usage in applications as a dependency

The default jar published is a fatjar containing all the dependencies required by jgit-spark-connector. It's meant to be used directly as a jar, or through --packages when used with Spark.

If you want to use it as a dependency in an application that builds its own fatjar, you need to follow these steps to use what we call the "slim" jar:

With maven:

<dependency>
    <groupId>tech.sourced</groupId>
    <artifactId>jgit-spark-connector</artifactId>
    <version>[version]</version>
    <classifier>slim</classifier>
</dependency>

Or (for sbt):

libraryDependencies += "tech.sourced" % "jgit-spark-connector" % "[version]" % Compile classifier "slim"

If you run into problems with io.netty.versions.properties when building an assembly with sbt, you can add the following merge strategy to solve it:

assemblyMergeStrategy in assembly := {
  case "META-INF/io.netty.versions.properties" => MergeStrategy.last
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

pyspark

Local mode

Installing the Python wrappers is necessary to use jgit-spark-connector from pyspark:

$ pip install sourced-jgit-spark-connector

Then you should provide jgit-spark-connector's Maven coordinates to the pyspark shell:

$ $SPARK_HOME/bin/pyspark --packages "tech.sourced:jgit-spark-connector:[version]"

Replace [version] with the latest jgit-spark-connector version.

Cluster mode

Install jgit-spark-connector wrappers as in local mode:

$ pip install -e sourced-jgit-spark-connector

Then you should package and zip the Python wrappers to provide them to pyspark. This is required to distribute the code among the nodes of the cluster.

$ zip -r ./sourced-jgit-spark-connector.zip <path-to-installed-package>
$ $SPARK_HOME/bin/pyspark <same-args-as-local-plus> --py-files ./sourced-jgit-spark-connector.zip

pyspark API usage

Run pyspark as explained above to start using jgit-spark-connector, replacing [version] with the latest jgit-spark-connector version:

$ $SPARK_HOME/bin/pyspark --packages "tech.sourced:jgit-spark-connector:[version]"
Welcome to

   spark version 2.2.1

Using Python version 3.6.2 (default, Jul 20 2017 03:52:27)
SparkSession available as 'spark'.
>>> from sourced.engine import Engine
>>> engine = Engine(spark, '/path/to/siva/files', 'siva')
>>> engine.repositories.filter('id = "github.com/mingrammer/funmath.git"').references.filter("name = 'refs/heads/HEAD'").show()
+--------------------+---------------+--------------------+
|       repository_id|           name|                hash|
+--------------------+---------------+--------------------+
|github.com/mingra...|refs/heads/HEAD|290440b64a73f5c7e...|
+--------------------+---------------+--------------------+

Scala API usage

You must provide jgit-spark-connector as a dependency in the following way, replacing [version] with the latest jgit-spark-connector version:

$ spark-shell --packages "tech.sourced:jgit-spark-connector:[version]"

To start using jgit-spark-connector from the shell you must import everything inside the tech.sourced.engine package (or, if you prefer, just import Engine and EngineDataFrame classes):

scala> import tech.sourced.engine._
import tech.sourced.engine._

Now, you need to create an instance of Engine and give it the spark session and the path of the directory containing the siva files:

scala> val engine = Engine(spark, "/path/to/siva-files", "siva")

Then, you will be able to perform queries over the repositories:

scala> engine.getRepositories.filter('id === "github.com/mawag/faq-xiyoulinux").
     | getReferences.filter('name === "refs/heads/HEAD").
     | getAllReferenceCommits.filter('message.contains("Initial")).
     | select('repository_id, 'hash, 'message).
     | show

     +--------------------+--------------------+--------------+
     |       repository_id|                hash|       message|
     +--------------------+--------------------+--------------+
     |github.com/mawag/...|fff7062de8474d10a...|Initial commit|
     +--------------------+--------------------+--------------+

Supported repository formats

As you might have seen, you need to provide the repository format you will be reading when you create the Engine instance. Although the documentation always uses the siva format, there are more repository formats available.

These are all the supported formats at the moment:

  • siva: rooted repositories packed in a single .siva file.
  • standard: regular git repositories with a .git folder. Each in a folder of their own under the given repository path.
  • bare: git bare repositories. Each in a folder of their own under the given repository path.
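
For instance, to process regular local clones instead of siva files, you would pass the standard format when creating the Engine. This is a minimal sketch; the path is a placeholder:

import tech.sourced.engine._

// Each repository lives in its own folder (containing a .git directory) under this path.
val engine = Engine(spark, "/path/to/cloned-repositories", "standard")
engine.getRepositories.show()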

Processing local repositories with the jgit-spark-connector

There are some design decisions that may surprise you when processing local repositories instead of siva files. This is the list of things you should take into account when doing so:

  • All local branches will belong to a repository whose id is file://$REPOSITORY_PATH. So, if you clone https://github.com/foo/bar.git at /home/foo/bar, you will see two repositories, file:///home/foo/bar and github.com/foo/bar, even if you only have one (see the sketch after this list).
  • Remote branches are transformed from refs/remote/$REMOTE_NAME/$BRANCH_NAME to refs/heads/$BRANCH_NAME as they will only belong to the repository id of their corresponding remote. So refs/remote/origin/HEAD becomes refs/heads/HEAD.
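
A minimal sketch illustrating the first point, assuming a single clone of github.com/foo/bar under /home/foo (paths are placeholders):

import tech.sourced.engine._

// Reading a local clone in "standard" format: the same clone shows up twice,
// once under its local file:// id and once under the id of its remote.
val engine = Engine(spark, "/home/foo", "standard")
engine.getRepositories.select('id).show(false)
// file:///home/foo/bar
// github.com/foo/bar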

Playing around with jgit-spark-connector on Jupyter

You can launch our Docker container, which contains some example notebooks, just by running:

docker run --name jgit-spark-connector-jupyter --rm -it -p 8080:8080 -v $(pwd)/path/to/siva-files:/repositories --link bblfshd:bblfshd srcd/jgit-spark-connector-jupyter

You must have some siva files locally to mount them on the container, replacing the path $(pwd)/path/to/siva-files. You can get some siva files from the project here.

You should have a bblfsh daemon container running to link to the Jupyter container (see Pre-requisites).

When the jgit-spark-connector-jupyter container starts, it will show you a URL that you can open in your browser.

Using jgit-spark-connector directly from Python

If you are using jgit-spark-connector directly from Python and are unable to modify PYTHON_SUBMIT_ARGS, you can copy the jgit-spark-connector jar into pyspark's jars directory to make it available there.

cp jgit-spark-connector.jar "$(python -c 'import pyspark; print(pyspark.__path__[0])')/jars"

Then you can use it as follows:

import sys

pyspark_path = "/path/to/pyspark/python"
sys.path.append(pyspark_path)

from pyspark.sql import SparkSession
from sourced.engine import Engine

siva_folder = "/path/to/siva-files"
spark = SparkSession.builder.appName("test").master("local[*]").getOrCreate()
engine = Engine(spark, siva_folder, 'siva')

Development

Build fatjar

Building the fatjar is needed to build the Docker image that contains the Jupyter server, or to test changes in spark-shell by passing the jar with the --jars flag:

$ make build

It leaves the fatjar in target/scala-2.11/jgit-spark-connector-uber.jar.

Build and run docker to get a Jupyter server

To build an image with the latest build of the project:

$ make docker-build

Notebooks under the examples folder will be included in the image.

To run a container with the Jupyter server:

$ make docker-run

Before running the Jupyter container you must run a bblfsh daemon:

$ make docker-bblfsh

If it's the first time you run the bblfsh daemon, you must install the drivers:

$ make docker-bblfsh-install-drivers

To see the installed drivers:

$ make docker-bblfsh-list-drivers

To remove the generated development Jupyter image:

$ make docker-clean

Run tests

jgit-spark-connector uses bblfsh, so you need an instance of a bblfsh server running:

$ make docker-bblfsh

To run tests:

$ make test

To run tests for the Python wrapper:

$ cd python
$ make test

Windows support

There is no Windows support in enry-java or bblfsh's client-scala right now, so the language detection and UAST features are not available on Windows.

Code of Conduct

See CODE_OF_CONDUCT.md

License

Apache License Version 2.0, see LICENSE

jgit-spark-connector's People

Contributors

abeaumont, ajnavarro, bzz, carlosms, dpordomingo, egorbu, erizocosmico, fossabot, jfontan, mcarmonaa, mcuadros, r0maink, smacker, smola, turtlemonvh, vmarkovtsev, zurk


jgit-spark-connector's Issues

Handle releases correctly

If we create a new tag defining a version on the spark-api repository, we should:

  • Execute all the tests
  • Generate an uber-jar using sbt assembly
  • Generate a new container version including the new uber-jar
  • Upload assembly jar to spark packages
  • Upload python code to pypi

[DS] Service to provide repositories by folder string

There should be only one repository instance per JVM. To do this we will need a singleton that provides a repository given its path as a key (a minimal sketch follows the list below). It should:

  • Copy all the files from HDFS to the local fs once.
  • Be able to check whether it is a bare repository or not and create the repository instance correctly.
  • Generate the repository index by checking the config file URLs.
  • Provide a close method that removes all local temporary files and frees the repository instance.
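
A minimal sketch of such a provider, assuming jgit as the underlying Git implementation (the HDFS copy, bare-repository detection and index generation are only hinted at in comments):

import java.io.File
import scala.collection.concurrent.TrieMap
import org.eclipse.jgit.lib.Repository
import org.eclipse.jgit.storage.file.FileRepositoryBuilder

object RepositoryProvider {
  // One repository instance per path, shared across the whole JVM.
  private val repositories = TrieMap.empty[String, Repository]

  def get(path: String): Repository =
    repositories.getOrElseUpdate(path, {
      // Real implementation: copy the files from HDFS to the local FS once,
      // detect whether the repository is bare, and build the index from the
      // config file URLs before returning the instance.
      new FileRepositoryBuilder().setGitDir(new File(path)).build()
    })

  def close(path: String): Unit = {
    // Real implementation: also remove all local temporary files.
    repositories.remove(path).foreach(_.close())
  }
}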

[API] Prepare "repository index" and incorporate it in API

Right now, in order to get the URLs of the repositories, or their languages, we need to traverse all .siva files and extract the original repo URLs from the config.

This issue is about speeding this up by pre-computing a dataset of (original repo url, init hash, languages).

It has 2 parts:

  • prepare a dataset, using Spark API
  • incorporate it into the Spark API so it is used for filtering, instead of traversing .siva files

[DS] Array[Filter] to list of properties to filter

Having a list of org.apache.spark.sql.sources.Filter, we should transform all of them into two lists per column: include and exclude params (a sketch follows the lists below).

For example, if we want to get the repository_id filters, we should be able to do something like:

// Pseudocode for the proposed helper: maps each column to its (include, exclude) values.
def processFilters(filters: Array[Filter]): Map[String, (Seq[Any], Seq[Any])]

val processed = processFilters(filters)
val repoFilters = processed.getOrElse("repository_id", (Seq(), Seq()))

We should only take into account these filters:

  • And
  • Or
  • [Not]Equals
  • In

We should ignore, for now:

  • StringStartsWith
  • StringEndsWith
  • StringContains
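
A minimal sketch of processFilters along those lines, treating And/Or branches alike and silently skipping the unsupported filters (an illustration of the grouping, not a final design):

import org.apache.spark.sql.sources._

def processFilters(filters: Array[Filter]): Map[String, (Seq[Any], Seq[Any])] = {
  // Flatten each supported filter into (column, value, isExclusion) triples.
  def flatten(f: Filter): Seq[(String, Any, Boolean)] = f match {
    case EqualTo(attr, value)      => Seq((attr, value, false))
    case Not(EqualTo(attr, value)) => Seq((attr, value, true))
    case In(attr, values)          => values.map(v => (attr, v, false)).toSeq
    case And(left, right)          => flatten(left) ++ flatten(right)
    case Or(left, right)           => flatten(left) ++ flatten(right)
    case _                         => Seq.empty // StringStartsWith & friends are ignored for now
  }

  filters.flatMap(flatten).groupBy(_._1).map { case (attr, triples) =>
    val (excluded, included) = triples.partition(_._3)
    attr -> (included.map(_._2).toSeq, excluded.map(_._2).toSeq)
  }
}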

[API] Query UAST using xpath

Depends on #7 and on the status of making libuast available in the Scala client, but we want to have an API to make XPath queries against the UASTs.

Cleanup un-packed .siva files

After copying from HDFS and unpacking the .siva files at the start of the job, the local FS on the workers has to be cleaned up when the job ends.

The current implementation does not take care of it.

There are two ways to do that:

  • CompletionIterator / InterruptibleIterator preferable, see example
  • TaskContext listener

This issue is about picking the simplest one that fits the Spark API use cases and implementing it (a sketch of the TaskContext-listener option follows).
The unpacked .siva file optimization needs to be taken into account.
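
A minimal sketch of the TaskContext-listener option, assuming unpackedDir is wherever the worker extracted the .siva file and using Apache commons-io for the recursive delete (names are placeholders, not the actual implementation):

import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.spark.TaskContext

def cleanupOnTaskEnd(unpackedDir: File): Unit = {
  // Register a listener so the unpacked directory is removed when the task
  // that used it finishes, whether it succeeded or failed.
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener { (_: TaskContext) =>
      FileUtils.deleteQuietly(unpackedDir)
    }
  }
}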

[DOC] Installations instructions

Document an easy method of installation:

  • not a container release (demo convenience container would be nice to have though)
  • packaging in 3 lines (download Spark, fetch deps, run spark-shell)

Docs should consist of:

  • very brief installation/usage in README
  • linking to markdown-like gitBook that @erizocosmico knows best about
  • make sure public API has ScalaDoc

[DS] Generate repositories partitions from HDFS blocks

Repositories will be in a specific folder. Example:

repositories/
├── repo1/
│   └── ...
├── repo2/
│   └── ...
└── repo3/
    └── ...

We should get all the files of a repository, get the block information, and aggregate repositories by the datanodes holding most of each repository's blocks.
With this information we need to create a new class called RepositoryPartition that extends the trait org.apache.spark.Partition and includes a list of repository folders.

These partitions will be sent to each relation to create RDD partitions correctly, depending on locality.
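
A minimal sketch of the proposed partition type (field names are illustrative):

import org.apache.spark.Partition

// One partition groups the repository folders whose HDFS blocks live mostly
// on the same datanode, so tasks can be scheduled with data locality.
case class RepositoryPartition(index: Int, repositoryFolders: Seq[String]) extends Partition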

[DS] hadoop configuration on RDD

Hadoop's org.apache.hadoop.conf.Configuration is not serializable. To be able to send this info to all the RDD partitions, we need to broadcast it. Spark uses org.apache.spark.util.SerializableConfiguration, but it is package private. It can be used in our code by creating a wrapper that makes it public:

package org.apache.spark

import org.apache.hadoop.conf.Configuration
import org.apache.spark.util.SerializableConfiguration

class SCWrapper(conf: Configuration) extends SerializableConfiguration(conf) {}
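
Usage would then look roughly like this (a sketch; sc is assumed to be the SparkContext and hadoopConf the Configuration to ship):

// Broadcast the wrapped configuration from the driver...
val confBroadcast = sc.broadcast(new SCWrapper(hadoopConf))

// ...and unwrap it inside tasks running on the executors.
val hadoopConfOnExecutor: Configuration = confBroadcast.value.value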

Prepare a Docker image \w Jupyter and Spark API

This would allow:

  • something, that everyone can run, to get the latest state of Spark API
  • consists of a docker image \w hostPath mounted siva files
  • has 1 notebook \w example of PySpark using Spark API

[DS] Mock Datasource

We need a first Datasource approach to check the viability of the API.
This Datasource should return:

  • Repositories source relation
  • References source relation
  • Commits source relation
  • Files source relation

Add integration tests

Check common use-case queries in the integration tests to avoid reintroducing errors over time.

[API] Create a new method to get files directly from repositories df

We need a new method that would get files from repositories using several filters to improve performance.

The first step is to add two new columns to files relation:

  • repository_id
  • reference_name

Then, we should create a new API method, something like:

def getFiles(repositoriesIds: Seq[String], referenceNames: Seq[String], commitHashes: Seq[String]): DataFrame
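
A hypothetical call to the proposed method, assuming an Engine instance as in the usage examples above (parameter values are just examples):

val files = engine.getFiles(
  repositoriesIds = Seq("github.com/foo/bar"),
  referenceNames = Seq("refs/heads/HEAD"),
  commitHashes = Seq()
)
files.select("repository_id", "reference_name", "path").show()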

Race condition, tmp siva files deleted too soon

Executing the Python tests, I noticed that they randomly fail because the tmp siva files cannot be found, so this looks like a race condition somewhere (it could very well be a Py4J/pyspark thing, so if you think this is not related to our code I can investigate further to see what's going on on the Python side).

Stacktrace of the crash:

17/09/19 15:24:30 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /tmp/blockmgr-86a0b5f1-6b06-4e0f-8cb3-a7994b0c6f2e/29/temp_shuffle_2fe14279-9444-431b-9392-4bf715b284d9
java.io.FileNotFoundException: /tmp/blockmgr-86a0b5f1-6b06-4e0f-8cb3-a7994b0c6f2e/29/temp_shuffle_2fe14279-9444-431b-9392-4bf715b284d9 (No such file or directory)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at org.apache.spark.storage.DiskBlockObjectWriter$$anonfun$revertPartialWritesAndClose$2.apply$mcV$sp(DiskBlockObjectWriter.scala:215)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1346)
	at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:212)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:237)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
17/09/19 15:24:30 ERROR BypassMergeSortShuffleWriter: Error while deleting file /tmp/blockmgr-86a0b5f1-6b06-4e0f-8cb3-a7994b0c6f2e/29/temp_shuffle_2fe14279-9444-431b-9392-4bf715b284d9
(The same DiskBlockObjectWriter / BypassMergeSortShuffleWriter error pair repeats for several other temp_shuffle files; the traces are identical apart from the file names.)

Rename head_ref to head, master_ref to master

I understand that this has something to do with the Scala API, but Python users should not suffer because of a name collision in a different language. references.head_ref is a duplication.

[API] Add convenience wrappers for Python

Right now, there are a number of steps the user needs to take to make the Spark API work in PySpark:

  • add jar
  • configure DataSource
  • select/join (see example notebook)

We want to have a convenience API in PySpark that does this for us, delegating via py4j to the appropriate methods in Scala.

So this issue is twofold:

  • make sure that Scala API matches the proposed one
  • make PySpark wrappers for it

Error reporting: check existence of the path to siva files

If the example from our README https://github.com/src-d/spark-api#pyspark-api-usage is tried literally

from sourced.spark import API as SparkAPI
from pyspark.sql import SparkSession
 
spark = SparkSession.builder.appName("test").master("local[*]").getOrCreate()
api = SparkAPI(spark, '/path/to/siva/files')
api.repositories.filter("id = 'github.com/mawag/faq-xiyoulinux'").references.filter("name = 'refs/heads/HEAD'").show()

with a non-existent path to the .siva files, it will result in:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-1-7f67465f882f> in <module>()
      4 spark = SparkSession.builder.appName("test").master("local[*]").getOrCreate()
      5 api = SparkAPI(spark, '/path/to/siva/files')
----> 6 api.repositories.filter("id = 'github.com/mawag/faq-xiyoulinux'").references.filter("name = 'refs/heads/HEAD'").show()

/usr/local/spark/python/pyspark/sql/dataframe.py in show(self, n, truncate)
    334         """
    335         if isinstance(truncate, bool) and truncate:
--> 336             print(self._jdf.showString(n, 20))
    337         else:
    338             print(self._jdf.showString(n, int(truncate)))

/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o39.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(repository_id#11, 200)
+- *Filter (((isnotnull(name#12) && (name#12 = refs/heads/HEAD)) && isnotnull(repository_id#11)) && (repository_id#11 = github.com/mawag/faq-xiyoulinux))
   +- *Scan GitRelation(org.apache.spark.sql.SQLContext@4faae818,references,/path/to/siva/files,/tmp) [repository_id#11,name#12,hash#13] PushedFilters: [IsNotNull(name), EqualTo(name,refs/heads/HEAD), IsNotNull(repository_id), EqualTo(repository_id,..., ReadSchema: struct<repository_id:string,name:string,hash:string>

which is not a very clear error message.

This can be fixed by implementing a proper check and error reporting in the Scala part when the given path does not exist.
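
A minimal sketch of such a check on the Scala side, assuming the path is resolved through the Hadoop FileSystem API (names are illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}

def checkRepositoriesPath(fs: FileSystem, path: String): Unit = {
  // Fail fast with a readable message instead of a Py4J-wrapped planner error.
  if (!fs.exists(new Path(path))) {
    throw new IllegalArgumentException(s"The provided repositories path does not exist: $path")
  }
}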

Push down JOIN conditions if possible to the GitDatasource

Currently, if we have a query like this one:

api.getRepositories().where("id = 'github.com/foo/bar'").getHEAD.getFiles.extractUASTs().select("name", "path", "uast")

The optimized plan will be (simplified):

Project
+-Join[commit_hash = hash]
   :   +- GitRelation[FILES]
   +- Join[repository_id = id]
       :   +- Filter[name="refs/heads/HEAD",repository_id="github.com/foo/bar"]
       :       +- GitRelation[REFERENCES]
       +- Filter[id="github.com/foo/bar"]
           +- GitRelation[REPOSITORIES]

As we can see, no filter is pushed down to the files relation, so we have to read all the files over all the revisions of all the repositories in this case.
We should implement a rule that transforms the optimized plan above into:

Project
+- Filter[commit_hash = hash, repository_id = id, repository_id = "github.com/foo/bar"]
    +-GitRelation[FILES]

Then, the unresolved conditions (repository_id = id in this case) should be handled by the iterator.

Make release process more automatic

Right now, to release a new version we need to:

  • Change the project version on build.sbt
  • Push changes to master
  • Create tag pointing to that commit
  • Push to master moving the version to the next -SNAPSHOT one

Use some plugin to automate this process.

Improvement proposal

Hi,

I think we can simplify the default usage of Engine a bit: let's make it possible to use pure Python instead of pyspark.
There are at least a few ways this could be done:

  1. Some function that will find pyspark : https://stackoverflow.com/questions/23256536/importing-pyspark-in-python-shell
  2. or add pyspark to PYTHONPATH during installation: export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
    so it will be possible to import pyspark and so on.

I think the ideal situation for a data scientist would be:

from sourced.engine import Engine
engine = Engine(siva_folder, **parameters_for_SparkSession)

and the pyspark session will be initialized in the background.

What do you think?
@bzz , @ajnavarro, @erizocosmico, @mcarmonaa

Overall API proposal

Create an overall API proposal. Initially this is expected to:

  • Load repositories from rooted repository storage (borges-style)
  • Load repositories from URL list
  • Extract files
  • Detect language
  • Parse files into UAST with Babelfish

Spark: investigate a suspicious warning

On running over ./examples/siva-files

make clean build
SPARK_HOME="" ./spark/bin/spark-shell --jars ./target/scala-2.11/engine-uber.jar
val engine = Engine(spark, "./examples/siva-files")

engine.getRepositories.getHEAD.getFiles.classifyLanguages.where('lang === "Python").extractUASTs.queryUAST("//*[@roleIdentifier]", "uast", "result").extractTokens("result", "tokens").select('path, 'lang, 'uast, 'tokens).show

results in a suspicious warning:

17/10/18 18:25:01 WARN Executor: Managed memory leak detected; size = 4456448 bytes, TID = 1034

Guava version at bblfsh/scala-client

On a clean Ubuntu 16.04 env, using

bin/spark-shell --packages tech.sourced:engine:0.1.2 --repositories "https://jitpack.io"

(--repositories is needed for siva-java, which is not included in uber-jar 0.1.2) and the example \w UAST extraction, I got a Guava version mismatch at runtime:

17/10/26 13:59:37 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, localhost, executor driver): java.lang.NoSuchMethodError: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
        at io.grpc.internal.ClientCallImpl.<init>(ClientCallImpl.java:104)
        at io.grpc.internal.ManagedChannelImpl$RealChannel.newCall(ManagedChannelImpl.java:554)
        at io.grpc.internal.ManagedChannelImpl.newCall(ManagedChannelImpl.java:533)
        at gopkg.in.bblfsh.sdk.v1.protocol.generated.ProtocolServiceGrpc$ProtocolServiceBlockingStub.parse(ProtocolServiceGrpc.scala:50)
        at org.bblfsh.client.BblfshClient.parse(BblfshClient.scala:29)
        at tech.sourced.engine.udf.ExtractUASTsUDF$.extractUsingBblfsh(ExtractUASTsUDF.scala:110)
        at tech.sourced.engine.udf.ExtractUASTsUDF$.extractUAST(ExtractUASTsUDF.scala:92)
        at tech.sourced.engine.udf.ExtractUASTsUDF$.extractUASTsWithLang(ExtractUASTsUDF.scala:69)

[DS] Model iterators

We need a way to transform a model (repository, reference, commit, blob) into an Iterator[Row].
It should be an abstract class with all the generic code shared by all the models, and specific code to generate the rows for each model.

This iterator should be able to process more than one repository in a lazy way.
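
A minimal sketch of that shape, assuming jgit's Repository as the source object (class and method names are illustrative):

import org.apache.spark.sql.Row
import org.eclipse.jgit.lib.Repository

// Generic, lazy iteration over the models of many repositories; subclasses only
// describe how to enumerate one repository and how to map a model to a Row.
abstract class ModelIterator[T](repositories: Iterator[Repository]) extends Iterator[Row] {
  protected def modelsOf(repo: Repository): Iterator[T]
  protected def toRow(model: T): Row

  private var current: Iterator[T] = Iterator.empty

  override def hasNext: Boolean = {
    while (!current.hasNext && repositories.hasNext)
      current = modelsOf(repositories.next())
    current.hasNext
  }

  override def next(): Row = toRow(current.next())
}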
