jgit-spark-connector's Introduction

jgit-spark-connector

jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.

It is written in Scala and built on top of Apache Spark to enable rapid construction of custom analysis pipelines and the processing of large numbers of Git repositories stored in HDFS in the Siva file format. It is accessible through both the Scala and Python Spark APIs, and capable of running on large-scale distributed clusters.

Deprecated

jgit-spark-connector has been deprecated in favor of gitbase-spark-connector, and there will be no further development of this tool.

Quick-start

First, download Apache Spark somewhere on your machine:

$ cd /tmp && wget "https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz" -O spark-2.2.1-bin-hadoop2.7.tgz

The Apache Software Foundation suggests a suitable mirror to download Spark from. If you want to review the mirrors and pick the best option for your case, you can do so here.

Then you must extract Spark from the downloaded tar file:

$ tar -C ~/ -xvzf spark-2.2.1-bin-hadoop2.7.tgz

Binaries and scripts to run Spark are located in spark-2.2.1-bin-hadoop2.7/bin, so you should set PATH and SPARK_HOME to point to this directory. It's advisable to add this to your shell profile:

$ export SPARK_HOME=$HOME/spark-2.2.1-bin-hadoop2.7
$ export PATH=$PATH:$SPARK_HOME/bin

Look up the latest jgit-spark-connector version and use it in place of [version] in the commands below:

$ spark-shell --packages "tech.sourced:jgit-spark-connector:[version]"

# or

$ pyspark --packages "tech.sourced:jgit-spark-connector:[version]"

Run the bblfsh daemon. You can start it easily in a container by following its quick start guide.

If you run jgit-spark-connector in a UNIX-like environment, you should set the LANG variable properly:

export LANG="en_US.UTF-8"

The rationale behind this is that UNIX file systems don't store the encoding of file names; they are just plain bytes, so the Java filesystem API looks at the LANG environment variable to decide which encoding to apply.

If the LANG variable is not set to a UTF-8 encoding, or is not set at all (which results in encoding being handled with the C locale), you may get an exception during jgit-spark-connector execution similar to java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters.

Pre-requisites

Python pre-requisites:

  • Python >= 3.4.x (jgit-spark-connector is tested with Python 3.4, 3.5 and 3.6 and these are the supported versions, though it might still work with earlier ones)
  • libxml2-dev installed
  • python3-dev installed
  • g++ installed

Examples of jgit-spark-connector usage

jgit-spark-connector is available on Maven Central. To add it to your project as a dependency:

For projects managed by Maven, add the following to your pom.xml:

<dependency>
    <groupId>tech.sourced</groupId>
    <artifactId>jgit-spark-connector</artifactId>
    <version>[version]</version>
</dependency>

For sbt-managed projects, add the dependency:

libraryDependencies += "tech.sourced" % "jgit-spark-connector" % "[version]"

In both cases, replace [version] with the latest jgit-spark-connector version.

Usage in applications as a dependency

The default jar published is a fatjar containing all the dependencies required by jgit-spark-connector. It's meant to be used directly as a jar, or through --packages when used with Spark.

If you want to use it as a dependency in an application that builds its own fatjar, you need to follow these steps to use what we call the "slim" jar:

With maven:

<dependency>
    <groupId>tech.sourced</groupId>
    <artifactId>jgit-spark-connector</artifactId>
    <version>[version]</version>
    <classifier>slim</classifier>
</dependency>

Or (for sbt):

libraryDependencies += "tech.sourced" % "jgit-spark-connector" % "[version]" % Compile classifier "slim"

If you run into problems with io.netty.versions.properties when building an assembly with sbt, you can add the following merge strategy to solve it:

assemblyMergeStrategy in assembly := {
  case "META-INF/io.netty.versions.properties" => MergeStrategy.last
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

pyspark

Local mode

Installing the Python wrappers is necessary to use jgit-spark-connector from pyspark:

$ pip install sourced-jgit-spark-connector

Then you should provide jgit-spark-connector's Maven coordinates to the pyspark shell:

$ $SPARK_HOME/bin/pyspark --packages "tech.sourced:jgit-spark-connector:[version]"

Replace [version] with the latest jgit-spark-connector version.

Cluster mode

Install jgit-spark-connector wrappers as in local mode:

$ pip install -e sourced-jgit-spark-connector

Then you should package and zip the Python wrappers to provide them to pyspark. This is required to distribute the code among the nodes of the cluster.

$ zip -r ./sourced-jgit-spark-connector.zip <path-to-installed-package>
$ $SPARK_HOME/bin/pyspark <same-args-as-local-plus> --py-files ./sourced-jgit-spark-connector.zip

pyspark API usage

Run pyspark as explained above to start using jgit-spark-connector, replacing [version] with the latest jgit-spark-connector version:

$ $SPARK_HOME/bin/pyspark --packages "tech.sourced:jgit-spark-connector:[version]"
Welcome to

   spark version 2.2.1

Using Python version 3.6.2 (default, Jul 20 2017 03:52:27)
SparkSession available as 'spark'.
>>> from sourced.engine import Engine
>>> engine = Engine(spark, '/path/to/siva/files', 'siva')
>>> engine.repositories.filter('id = "github.com/mingrammer/funmath.git"').references.filter("name = 'refs/heads/HEAD'").show()
+--------------------+---------------+--------------------+
|       repository_id|           name|                hash|
+--------------------+---------------+--------------------+
|github.com/mingra...|refs/heads/HEAD|290440b64a73f5c7e...|
+--------------------+---------------+--------------------+

Scala API usage

You must provide jgit-spark-connector as a dependency in the following way, replacing [version] with the latest jgit-spark-connector version:

$ spark-shell --packages "tech.sourced:jgit-spark-connector:[version]"

To start using jgit-spark-connector from the shell you must import everything inside the tech.sourced.engine package (or, if you prefer, just import Engine and EngineDataFrame classes):

scala> import tech.sourced.engine._
import tech.sourced.engine._

Now, you need to create an instance of Engine and give it the spark session and the path of the directory containing the siva files:

scala> val engine = Engine(spark, "/path/to/siva-files", "siva")

Then, you will be able to perform queries over the repositories:

scala> engine.getRepositories.filter('id === "github.com/mawag/faq-xiyoulinux").
     | getReferences.filter('name === "refs/heads/HEAD").
     | getAllReferenceCommits.filter('message.contains("Initial")).
     | select('repository_id, 'hash, 'message).
     | show

     +--------------------+--------------------+--------------+
     |       repository_id|                hash|       message|
     +--------------------+--------------------+--------------+
     |github.com/mawag/...|fff7062de8474d10a...|Initial commit|
     +--------------------+--------------------+--------------+

Supported repository formats

As you might have seen, you need to provide the repository format you will be reading when you create the Engine instance. Although the documentation always uses the siva format, there are more repository formats available.

These are all the supported formats at the moment:

  • siva: rooted repositories packed in a single .siva file.
  • standard: regular git repositories with a .git folder. Each in a folder of their own under the given repository path.
  • bare: git bare repositories. Each in a folder of their own under the given repository path.
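
For instance, to process regular local clones instead of siva files, you would pass the standard format when creating the Engine. This is a minimal sketch; the path is a placeholder:

import tech.sourced.engine._

// Each repository lives in its own folder (containing a .git directory) under this path.
val engine = Engine(spark, "/path/to/cloned-repositories", "standard")
engine.getRepositories.show()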

Processing local repositories with the jgit-spark-connector

There are some design decisions that may surprise you when processing local repositories instead of siva files. This is the list of things you should take into account when doing so:

  • All local branches will belong to a repository whose id is file://$REPOSITORY_PATH. So, if you clone https://github.com/foo/bar.git at /home/foo/bar, you will see two repositories, file:///home/foo/bar and github.com/foo/bar, even if you only have one (see the sketch after this list).
  • Remote branches are transformed from refs/remote/$REMOTE_NAME/$BRANCH_NAME to refs/heads/$BRANCH_NAME as they will only belong to the repository id of their corresponding remote. So refs/remote/origin/HEAD becomes refs/heads/HEAD.
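
A minimal sketch illustrating the first point, assuming a single clone of github.com/foo/bar under /home/foo (paths are placeholders):

import tech.sourced.engine._

// Reading a local clone in "standard" format: the same clone shows up twice,
// once under its local file:// id and once under the id of its remote.
val engine = Engine(spark, "/home/foo", "standard")
engine.getRepositories.select('id).show(false)
// file:///home/foo/bar
// github.com/foo/bar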

Playing around with jgit-spark-connector on Jupyter

You can launch our Docker container, which contains some example notebooks, just by running:

docker run --name jgit-spark-connector-jupyter --rm -it -p 8080:8080 -v $(pwd)/path/to/siva-files:/repositories --link bblfshd:bblfshd srcd/jgit-spark-connector-jupyter

You must have some siva files locally to mount them on the container, replacing the path $(pwd)/path/to/siva-files. You can get some siva files from the project here.

You should have a bblfsh daemon container running to link to the Jupyter container (see Pre-requisites).

When the jgit-spark-connector-jupyter container starts, it will show you a URL that you can open in your browser.

Using jgit-spark-connector directly from Python

If you are using jgit-spark-connector directly from Python and are unable to modify PYTHON_SUBMIT_ARGS, you can copy the jgit-spark-connector jar into pyspark's jars directory to make it available there.

cp jgit-spark-connector.jar "$(python -c 'import pyspark; print(pyspark.__path__[0])')/jars"

Then you can use it as follows:

import sys

pyspark_path = "/path/to/pyspark/python"
sys.path.append(pyspark_path)

from pyspark.sql import SparkSession
from sourced.engine import Engine

siva_folder = "/path/to/siva-files"
spark = SparkSession.builder.appName("test").master("local[*]").getOrCreate()
engine = Engine(spark, siva_folder, 'siva')

Development

Build fatjar

Building the fatjar is needed to build the Docker image that contains the Jupyter server, or to test changes in spark-shell by passing the jar with the --jars flag:

$ make build

It leaves the fatjar in target/scala-2.11/jgit-spark-connector-uber.jar.

Build and run docker to get a Jupyter server

To build an image with the latest build of the project:

$ make docker-build

Notebooks under the examples folder will be included in the image.

To run a container with the Jupyter server:

$ make docker-run

Before running the Jupyter container you must run a bblfsh daemon:

$ make docker-bblfsh

If it's the first time you run the bblfsh daemon, you must install the drivers:

$ make docker-bblfsh-install-drivers

To see the installed drivers:

$ make docker-bblfsh-list-drivers

To remove the generated development Jupyter image:

$ make docker-clean

Run tests

jgit-spark-connector uses bblfsh, so you need an instance of a bblfsh server running:

$ make docker-bblfsh

To run tests:

$ make test

To run tests for the Python wrapper:

$ cd python
$ make test

Windows support

There is no Windows support in enry-java or bblfsh's client-scala right now, so the language detection and UAST features are not available on Windows.

Code of Conduct

See CODE_OF_CONDUCT.md

License

Apache License Version 2.0, see LICENSE

jgit-spark-connector's People

Contributors

abeaumont, ajnavarro, bzz, carlosms, dpordomingo, egorbu, erizocosmico, fossabot, jfontan, mcarmonaa, mcuadros, r0maink, smacker, smola, turtlemonvh, vmarkovtsev, zurk


jgit-spark-connector's Issues

Handle releases correctly

If we create a new tag defining a version on the spark-api repository, we should:

  • Execute all the tests
  • Generate an uber-jar using sbt assembly
  • Generate a new container version including the new uber-jar
  • Upload assembly jar to spark packages
  • Upload python code to pypi

[DS] Service to provide repositories by folder string

There should be only one repository instance per JVM. To do this we will need a singleton that provides a repository given its path as a key (a minimal sketch follows the list below). It should:

  • Copy all the files from HDFS to the local fs once.
  • Be able to check whether it is a bare repository or not and create the repository instance correctly.
  • Generate the repository index by checking the config file URLs.
  • Provide a close method that removes all local temporary files and frees the repository instance.
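
A minimal sketch of such a provider, assuming jgit as the underlying Git implementation (the HDFS copy, bare-repository detection and index generation are only hinted at in comments):

import java.io.File
import scala.collection.concurrent.TrieMap
import org.eclipse.jgit.lib.Repository
import org.eclipse.jgit.storage.file.FileRepositoryBuilder

object RepositoryProvider {
  // One repository instance per path, shared across the whole JVM.
  private val repositories = TrieMap.empty[String, Repository]

  def get(path: String): Repository =
    repositories.getOrElseUpdate(path, {
      // Real implementation: copy the files from HDFS to the local FS once,
      // detect whether the repository is bare, and build the index from the
      // config file URLs before returning the instance.
      new FileRepositoryBuilder().setGitDir(new File(path)).build()
    })

  def close(path: String): Unit = {
    // Real implementation: also remove all local temporary files.
    repositories.remove(path).foreach(_.close())
  }
}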

[API] Prepare "repository index" and incorporate it in API

Right now, in order to get the URLs of the repositories, or their languages, we need to traverse all .siva files and extract the original repo URLs from the config.

This issue is about speeding this up by pre-computing a dataset of (original repo url, init hash, languages).

It has 2 parts:

  • prepare a dataset, using Spark API
  • incorporate it into the Spark API so it is used for filtering, instead of traversing .siva files

[DS] Array[Filter] to list of properties to filter

Having a list of org.apache.spark.sql.sources.Filter, we should transform all of them into two lists per column: include and exclude params (a sketch follows the lists below).

For example, if we want to get the repository_id filters, we should be able to do something like:

// Pseudocode for the proposed helper: maps each column to its (include, exclude) values.
def processFilters(filters: Array[Filter]): Map[String, (Seq[Any], Seq[Any])]

val processed = processFilters(filters)
val repoFilters = processed.getOrElse("repository_id", (Seq(), Seq()))

We should only take into account these filters:

  • And
  • Or
  • [Not]Equals
  • In

We should ignore, for now:

  • StringStartsWith
  • StringEndsWith
  • StringContains
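
A minimal sketch of processFilters along those lines, treating And/Or branches alike and silently skipping the unsupported filters (an illustration of the grouping, not a final design):

import org.apache.spark.sql.sources._

def processFilters(filters: Array[Filter]): Map[String, (Seq[Any], Seq[Any])] = {
  // Flatten each supported filter into (column, value, isExclusion) triples.
  def flatten(f: Filter): Seq[(String, Any, Boolean)] = f match {
    case EqualTo(attr, value)      => Seq((attr, value, false))
    case Not(EqualTo(attr, value)) => Seq((attr, value, true))
    case In(attr, values)          => values.map(v => (attr, v, false)).toSeq
    case And(left, right)          => flatten(left) ++ flatten(right)
    case Or(left, right)           => flatten(left) ++ flatten(right)
    case _                         => Seq.empty // StringStartsWith & friends are ignored for now
  }

  filters.flatMap(flatten).groupBy(_._1).map { case (attr, triples) =>
    val (excluded, included) = triples.partition(_._3)
    attr -> (included.map(_._2).toSeq, excluded.map(_._2).toSeq)
  }
}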

[API] Query UAST using xpath

Depends on #7 and on the status of making libuast available in the Scala client, but we want to have an API to make XPath queries against the UASTs.

Cleanup un-packed .siva files

After copying from HDFS and unpacking the .siva files at the start of the job, the local FS on the workers has to be cleaned up when the job ends.

The current implementation does not take care of it.

There are two ways to do that:

  • CompletionIterator / InterruptibleIterator preferable, see example
  • TaskContext listener

This issue is about picking the simplest one that fits the Spark API use cases and implementing it (a sketch of the TaskContext-listener option follows).
The unpacked .siva file optimization needs to be taken into account.
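
A minimal sketch of the TaskContext-listener option, assuming unpackedDir is wherever the worker extracted the .siva file and using Apache commons-io for the recursive delete (names are placeholders, not the actual implementation):

import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.spark.TaskContext

def cleanupOnTaskEnd(unpackedDir: File): Unit = {
  // Register a listener so the unpacked directory is removed when the task
  // that used it finishes, whether it succeeded or failed.
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener { (_: TaskContext) =>
      FileUtils.deleteQuietly(unpackedDir)
    }
  }
}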

[DOC] Installations instructions

Document an easy method of installation:

  • not a container release (demo convenience container would be nice to have though)
  • packaging in 3 lines (download Spark, fetch deps, run spark-shell)

Docs should consist of:

  • very brief installation/usage in README
  • linking to markdown-like gitBook that @erizocosmico knows best about
  • make sure public API has ScalaDoc

[DS] Generate repositories partitions from HDFS blocks

Repositories will be in a specific folder. Example:

repositories/
├── repo1/
│   └── ...
├── repo2/
│   └── ...
└── repo3/
    └── ...

We should get all the files of a repository, get the block information, and aggregate repositories by the datanodes holding most of each repository's blocks.
With this information we need to create a new class called RepositoryPartition that extends the trait org.apache.spark.Partition and includes a list of repository folders.

These partitions will be sent to each relation to create RDD partitions correctly, depending on locality.
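
A minimal sketch of the proposed partition type (field names are illustrative):

import org.apache.spark.Partition

// One partition groups the repository folders whose HDFS blocks live mostly
// on the same datanode, so tasks can be scheduled with data locality.
case class RepositoryPartition(index: Int, repositoryFolders: Seq[String]) extends Partition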

[DS] hadoop configuration on RDD

Hadoop's org.apache.hadoop.conf.Configuration is not serializable. To be able to send this info to all the RDD partitions, we need to broadcast it. Spark uses org.apache.spark.util.SerializableConfiguration, but it is package private. It can be used in our code by creating a wrapper that makes it public:

package org.apache.spark

import org.apache.hadoop.conf.Configuration
import org.apache.spark.util.SerializableConfiguration

class SCWrapper(conf: Configuration) extends SerializableConfiguration(conf) {}
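
Usage would then look roughly like this (a sketch; sc is assumed to be the SparkContext and hadoopConf the Configuration to ship):

// Broadcast the wrapped configuration from the driver...
val confBroadcast = sc.broadcast(new SCWrapper(hadoopConf))

// ...and unwrap it inside tasks running on the executors.
val hadoopConfOnExecutor: Configuration = confBroadcast.value.value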

Prepare a Docker image \w Jupyter and Spark API

This would allow:

  • something, that everyone can run, to get the latest state of Spark API
  • consists of a docker image \w hostPath mounted siva files
  • has 1 notebook \w example of PySpark using Spark API

[DS] Mock Datasource

We need a first Datasource approach to check the viability of the API.
This Datasource should return:

  • Repositories source relation
  • References source relation
  • Commits source relation
  • Files source relation

Add integration tests

Check common use-case queries in the integration tests to avoid reintroducing errors over time.

[API] Create a new method to get files directly from repositories df

We need a new method that would get files from repositories using several filters to improve performance.

The first step is to add two new columns to files relation:

  • repository_id
  • reference_name

Then, we should create a new API method, something like:

def getFiles(repositoriesIds: Seq[String], referenceNames: Seq[String], commitHashes: Seq[String]): DataFrame
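
A hypothetical call to the proposed method, assuming an Engine instance as in the usage examples above (parameter values are just examples):

val files = engine.getFiles(
  repositoriesIds = Seq("github.com/foo/bar"),
  referenceNames = Seq("refs/heads/HEAD"),
  commitHashes = Seq()
)
files.select("repository_id", "reference_name", "path").show()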

Race condition, tmp siva files deleted too soon

Executing the Python tests, I noticed that they randomly fail because the tmp siva files cannot be found, so this looks like a race condition somewhere (it could very well be a Py4J/pyspark thing, so if you think this is not related to our code I can investigate further to see what's going on on the Python side).

Stacktrace of the crash:

17/09/19 15:24:30 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /tmp/blockmgr-86a0b5f1-6b06-4e0f-8cb3-a7994b0c6f2e/29/temp_shuffle_2fe14279-9444-431b-9392-4bf715b284d9
java.io.FileNotFoundException: /tmp/blockmgr-86a0b5f1-6b06-4e0f-8cb3-a7994b0c6f2e/29/temp_shuffle_2fe14279-9444-431b-9392-4bf715b284d9 (No such file or directory)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at org.apache.spark.storage.DiskBlockObjectWriter$$anonfun$revertPartialWritesAndClose$2.apply$mcV$sp(DiskBlockObjectWriter.scala:215)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1346)
	at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:212)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:237)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
17/09/19 15:24:30 ERROR BypassMergeSortShuffleWriter: Error while deleting file /tmp/blockmgr-86a0b5f1-6b06-4e0f-8cb3-a7994b0c6f2e/29/temp_shuffle_2fe14279-9444-431b-9392-4bf715b284d9
(The same DiskBlockObjectWriter / BypassMergeSortShuffleWriter error pair repeats for several other temp_shuffle files; the traces are identical apart from the file names.)

Rename head_ref to head, master_ref to master

I understand that this has something to do with the Scala API, but Python users should not suffer because of a name collision in a different language. references.head_ref is a duplication.

[API] Add convenience wrappers for Python

Right now, there are a number of steps the user needs to take to make the Spark API work in PySpark:

  • add jar
  • configure DataSource
  • select/join (see example notebook)

We want to have a convenience API in PySpark that does this for us, delegating via py4j to the appropriate methods in Scala.

So this issue is twofold:

  • make sure that Scala API matches the proposed one
  • make PySpark wrappers for it

Error reporting: check existence of the path to siva files

If the example from our README https://github.com/src-d/spark-api#pyspark-api-usage is tried literally

from sourced.spark import API as SparkAPI
from pyspark.sql import SparkSession
 
spark = SparkSession.builder.appName("test").master("local[*]").getOrCreate()
api = SparkAPI(spark, '/path/to/siva/files')
api.repositories.filter("id = 'github.com/mawag/faq-xiyoulinux'").references.filter("name = 'refs/heads/HEAD'").show()

with a non-existent path to the .siva files, it will result in:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-1-7f67465f882f> in <module>()
      4 spark = SparkSession.builder.appName("test").master("local[*]").getOrCreate()
      5 api = SparkAPI(spark, '/path/to/siva/files')
----> 6 api.repositories.filter("id = 'github.com/mawag/faq-xiyoulinux'").references.filter("name = 'refs/heads/HEAD'").show()

/usr/local/spark/python/pyspark/sql/dataframe.py in show(self, n, truncate)
    334         """
    335         if isinstance(truncate, bool) and truncate:
--> 336             print(self._jdf.showString(n, 20))
    337         else:
    338             print(self._jdf.showString(n, int(truncate)))

/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o39.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(repository_id#11, 200)
+- *Filter (((isnotnull(name#12) && (name#12 = refs/heads/HEAD)) && isnotnull(repository_id#11)) && (repository_id#11 = github.com/mawag/faq-xiyoulinux))
   +- *Scan GitRelation(org.apache.spark.sql.SQLContext@4faae818,references,/path/to/siva/files,/tmp) [repository_id#11,name#12,hash#13] PushedFilters: [IsNotNull(name), EqualTo(name,refs/heads/HEAD), IsNotNull(repository_id), EqualTo(repository_id,..., ReadSchema: struct<repository_id:string,name:string,hash:string>

which is not a very clear error message.

This can be fixed by implementing a proper check and error reporting in the Scala part when the given path does not exist.
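
A minimal sketch of such a check on the Scala side, assuming the path is resolved through the Hadoop FileSystem API (names are illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}

def checkRepositoriesPath(fs: FileSystem, path: String): Unit = {
  // Fail fast with a readable message instead of a Py4J-wrapped planner error.
  if (!fs.exists(new Path(path))) {
    throw new IllegalArgumentException(s"The provided repositories path does not exist: $path")
  }
}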

Push down JOIN conditions if possible to the GitDatasource

Currently, if we have a query like this one:

api.getRepositories().where("id = 'github.com/foo/bar'").getHEAD.getFiles.extractUASTs().select("name", "path", "uast")

The optimized plan will be (simplified):

Project
+-Join[commit_hash = hash]
   :   +- GitRelation[FILES]
   +- Join[repository_id = id]
       :   +- Filter[name="refs/heads/HEAD",repository_id="github.com/foo/bar"]
       :       +- GitRelation[REFERENCES]
       +- Filter[id="github.com/foo/bar"]
           +- GitRelation[REPOSITORIES]

As we can see, no filter is pushed down to the files relation, so we have to read all the files over all the revisions of all the repositories in this case.
We should implement a rule that transforms the optimized plan above into:

Project
+- Filter[commit_hash = hash, repository_id = id, repository_id = "github.com/foo/bar"]
    +-GitRelation[FILES]

Then, the unresolved conditions (repository_id = id in this case) should be handled by the iterator.

Make release process more automatic

Right now, to release a new version we need to:

  • Change the project version on build.sbt
  • Push changes to master
  • Create tag pointing to that commit
  • Push to master moving the version to the next -SNAPSHOT one

Use some plugin to automate this process.

Improvement proposal

Hi,

I think we can simplify the default usage of Engine a bit: let's make it possible to use pure Python instead of pyspark.
There are at least a few ways this could be done:

  1. Some function that will find pyspark : https://stackoverflow.com/questions/23256536/importing-pyspark-in-python-shell
  2. or add pyspark to PYTHONPATH during installation: export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
    so it will be possible to import pyspark and so on.

I think the ideal situation for a data scientist would be:

from sourced.engine import Engine
engine = Engine(siva_folder, **parameters_for_SparkSession)

and the pyspark session will be initialized in the background.

What do you think?
@bzz , @ajnavarro, @erizocosmico, @mcarmonaa

Overall API proposal

Create an overall API proposal. Initially this is expected to:

  • Load repositories from rooted repository storage (borges-style)
  • Load repositories from URL list
  • Extract files
  • Detect language
  • Parse files into UAST with Babelfish

Spark: investigate a suspicious warning

On running over ./examples/siva-files

make clean build
SPARK_HOME="" ./spark/bin/spark-shell --jars ./target/scala-2.11/engine-uber.jar
val engine = Engine(spark, "./examples/siva-files")

engine.getRepositories.getHEAD.getFiles.classifyLanguages.where('lang === "Python").extractUASTs.queryUAST("//*[@roleIdentifier]", "uast", "result").extractTokens("result", "tokens").select('path, 'lang, 'uast, 'tokens).show

results in a suspicious warning:

17/10/18 18:25:01 WARN Executor: Managed memory leak detected; size = 4456448 bytes, TID = 1034

Guava version at bblfsh/scala-client

On a clean Ubuntu 16.04 env, using

bin/spark-shell --packages tech.sourced:engine:0.1.2 --repositories "https://jitpack.io"

(--repositories is needed for siva-java, which is not included in uber-jar 0.1.2) and the example \w UAST extraction, I got a Guava version mismatch at runtime:

17/10/26 13:59:37 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, localhost, executor driver): java.lang.NoSuchMethodError: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
        at io.grpc.internal.ClientCallImpl.<init>(ClientCallImpl.java:104)
        at io.grpc.internal.ManagedChannelImpl$RealChannel.newCall(ManagedChannelImpl.java:554)
        at io.grpc.internal.ManagedChannelImpl.newCall(ManagedChannelImpl.java:533)
        at gopkg.in.bblfsh.sdk.v1.protocol.generated.ProtocolServiceGrpc$ProtocolServiceBlockingStub.parse(ProtocolServiceGrpc.scala:50)
        at org.bblfsh.client.BblfshClient.parse(BblfshClient.scala:29)
        at tech.sourced.engine.udf.ExtractUASTsUDF$.extractUsingBblfsh(ExtractUASTsUDF.scala:110)
        at tech.sourced.engine.udf.ExtractUASTsUDF$.extractUAST(ExtractUASTsUDF.scala:92)
        at tech.sourced.engine.udf.ExtractUASTsUDF$.extractUASTsWithLang(ExtractUASTsUDF.scala:69)

[DS] Model iterators

We need a way to transform a model (repository, reference, commit, blob) into an Iterator[Row].
It should be an abstract class with all the generic code shared by all the models, and specific code to generate the rows for each model.

This iterator should be able to process more than one repository in a lazy way.
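
A minimal sketch of that shape, assuming jgit's Repository as the source object (class and method names are illustrative):

import org.apache.spark.sql.Row
import org.eclipse.jgit.lib.Repository

// Generic, lazy iteration over the models of many repositories; subclasses only
// describe how to enumerate one repository and how to map a model to a Row.
abstract class ModelIterator[T](repositories: Iterator[Repository]) extends Iterator[Row] {
  protected def modelsOf(repo: Repository): Iterator[T]
  protected def toRow(model: T): Row

  private var current: Iterator[T] = Iterator.empty

  override def hasNext: Boolean = {
    while (!current.hasNext && repositories.hasNext)
      current = modelsOf(repositories.next())
    current.hasNext
  }

  override def next(): Row = toRow(current.next())
}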
