
Spline agent for Apache Spark

Home Page: https://absaoss.github.io/spline/

License: Apache License 2.0


spline-spark-agent's Introduction

Spline Agent for Apache Spark™

The Spline Agent for Apache Spark™ is a complementary module to the Spline project that captures runtime lineage information from Apache Spark jobs.

The agent is a Scala library that is embedded into the Spark driver, listening to Spark events and capturing logical execution plans. The collected metadata is then handed over to the lineage dispatcher, from where it can either be sent to the Spline server (e.g. via REST API or Kafka) or used in another way, depending on the selected dispatcher type (see Lineage Dispatchers).

The agent can be used with or without a Spline server, depending on your use case. See References.


Versioning

The Spline Spark Agent follows the Semantic Versioning principles. The Public API is defined as a set of entry-point classes (SparkLineageInitializer, SplineSparkSessionWrapper), extension APIs (Plugin API, filters, dispatchers), configuration properties and a set of supported Spark versions. In other words, the Spline Spark Agent Public API in terms of SemVer covers all entities and abstractions that are designed to be used or extended by client applications.

The version number does not directly reflect the relation of the Agent to the Spline Producer API (the Spline server). Both the Spline server and the agent are designed to be as mutually compatible as possible, assuming long-term operation and a possibly significant gap between the server and agent release dates. This requirement is dictated by the nature of the agent, which could be embedded into some Spark jobs and only rarely, if ever, updated, without risking that it stops working after an eventual Spline server update. Likewise, it should be possible to update the agent at any time (e.g. to fix a bug, or to support a newer Spark version or a feature that an earlier agent version didn't support) without requiring a Spline server upgrade.

Although not strictly required by the above, to minimize user astonishment we increment the major version component whenever compatibility between too-distant agent and server versions is dropped.

Spark / Scala version compatibility matrix

                     Scala 2.11                       Scala 2.12
Spark 2.2            Yes (no SQL; no codeless init)   -
Spark 2.3            Yes (no Delta support)           -
Spark 2.4            Yes                              Yes
Spark 3.0 or newer   -                                Yes

Usage

Selecting artifact

There are two main agent artifacts:

  • agent-core is a Java library that you can use with any compatible Spark version. Use this one if you want to include the Spline agent in your custom Spark application and manage all transitive dependencies yourself.

  • spark-spline-agent-bundle is a fat jar designed to be embedded into the Spark driver, either by manually copying it to Spark's jars directory, or by using the --jars or --packages argument with the spark-submit, spark-shell or pyspark commands. This artifact is self-sufficient and is the one most users should use.

Because the bundle is pre-built with all necessary dependencies, it is important to select the bundle version that matches the minor Spark and Scala versions of your target Spark installation.

spark-A.B-spline-agent-bundle_X.Y.jar

where A.B are the first two Spark version numbers and X.Y are the first two Scala version numbers. For example, if you have Spark 2.4.4 pre-built with Scala 2.12.10, select the following agent bundle:

spark-2.4-spline-agent-bundle_2.12.jar

AWS Glue note: the dependency org.yaml:snakeyaml:1.33 is missing in the Glue flavour of Spark. Please add this dependency to the classpath.
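
For example, one way to do that (an illustrative sketch; the bundle version shown is only an example) is to append the snakeyaml coordinate to the --packages list:

spark-submit \
  --packages za.co.absa.spline.agent.spark:spark-3.3-spline-agent-bundle_2.12:<VERSION>,org.yaml:snakeyaml:1.33 \
  ...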

Initialization

The Spline agent is essentially a Spark query listener that needs to be registered in a Spark session before it can be used. Depending on whether you are using it as a library in your custom Spark application or as a standalone bundle, you can choose one of the following initialization approaches.

Codeless Initialization

This is the most convenient approach and covers the majority of use cases. Simply include the Spline listener in the spark.sql.queryExecutionListeners config property (see Static SQL Configuration).

Example:

pyspark \
  --packages za.co.absa.spline.agent.spark:spark-2.4-spline-agent-bundle_2.12:<VERSION> \
  --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
  --conf "spark.spline.lineageDispatcher.http.producer.url=http://localhost:9090/producer"

The same approach works for spark-submit and spark-shell commands.

Note: all Spline properties set via Spark conf must be prefixed with spark. in order to be visible to the Spline agent.
See the Configuration section for details.
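
For illustration, Spline's own properties (such as spline.mode or spline.lineageDispatcher) would be passed on the command line like this (the values are just examples):

spark-shell \
  --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
  --conf "spark.spline.mode=ENABLED" \
  --conf "spark.spline.lineageDispatcher=console"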

Programmatic Initialization

Note: starting from Spline 0.6, most agent components can be configured or even replaced in a declarative manner, using either the Configuration or the Plugin API. So normally there should be no need to use programmatic initialization. We recommend using Codeless Initialization instead.

But if for some reason Codeless Initialization doesn't fit your needs, or you want to customize the Spark agent further, you can use the programmatic initialization method.

// given a Spark session ...
val sparkSession: SparkSession = ???

// ... enable data lineage tracking with Spline
import za.co.absa.spline.harvester.SparkLineageInitializer._
sparkSession.enableLineageTracking()

// ... then run some Dataset computations as usual.
// The lineage will be captured and sent to the configured Spline Producer endpoint.

or in Java syntax:

import za.co.absa.spline.harvester.SparkLineageInitializer;
// ...
SparkLineageInitializer.enableLineageTracking(session);

The enableLineageTracking() method accepts an optional AgentConfig object that can be used to customize Spline behavior. This is an alternative way to configure Spline; the other is the property-based configuration.

The instance of AgentConfig can be created by using a builder or one of the factory methods.

// from a sequence of key-value pairs 
val config = AgentConfig.from(???: Iterable[(String, Any)])

// from a Common Configuration
val config = AgentConfig.from(???: org.apache.commons.configuration.Configuration)

// using a builder
val config = AgentConfig.builder()
  // call some builder methods here...
  .build()

sparkSession.enableLineageTracking(config)

Note: the AgentConfig object doesn't override the standard configuration stack. Instead, it serves as an additional configuration source, with its precedence set between the spline.yaml and spline.default.yaml files (see below).
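
For illustration, a minimal sketch combining the Map-based factory method with enableLineageTracking (the property value is only an example; the AgentConfig import package may differ between agent versions):

import za.co.absa.spline.harvester.SparkLineageInitializer._
import za.co.absa.spline.agent.AgentConfig  // adjust the package to your agent version if needed

val config = AgentConfig.from(Map(
  "spline.lineageDispatcher" -> "console"   // example: just print the captured lineage to the console
))

sparkSession.enableLineageTracking(config)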

Configuration

The agent looks for configuration in the following sources (listed in order of precedence):

  • Hadoop configuration (core-site.xml)
  • Spark configuration
  • JVM system properties
  • spline.properties file on classpath
  • spline.yaml file on classpath
  • AgentConfig object
  • spline.default.yaml file on classpath

The file spline.default.yaml contains default values for all Spline properties along with additional documentation. It's a good idea to look in the file to see what properties are available.

The order of precedence might look counter-intuitive, as one would expect an explicitly provided config (an AgentConfig instance) to override ones defined in the outer scope. However, prioritizing global config over local config makes it easier to manage Spline settings centrally on clusters, while still leaving room for customization by job developers.

For example, a company could require lineage metadata from jobs executed on a particular cluster to be sanitized, enhanced with some metrics and credentials, and stored in a certain metadata store (a database, file, Spline server etc.). The Spline configuration needs to be set globally and applied to all Spark jobs automatically. However, some jobs might contain hardcoded properties that the developers used locally or on a testing environment and forgot to remove before submitting the jobs into production. In such a situation we want the cluster settings to take precedence over the job settings. Assuming that hardcoded settings would most likely be defined in the AgentConfig object, a property file or JVM properties, on the cluster we could define them in the Spark config or Hadoop config.

If a property is defined in multiple sources, the first occurrence wins, except that the spline.lineageDispatcher and spline.postProcessingFilter properties are composed instead. E.g. if the LineageDispatcher is set to Kafka in one config source and to Http in another, they are implicitly wrapped by a composite dispatcher, so both are called in the order corresponding to the config source precedence. See CompositeLineageDispatcher and CompositePostProcessingFilter.

Every config property is resolved independently. So, for instance, if a DataSourcePasswordReplacingFilter is used, some of its properties might be taken from one config source and others from another, according to the conflict resolution rules described above. This allows administrators to tweak settings of individual Spline components (filters, dispatchers or plugins) without having to redefine and override the whole configuration for a given component.
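
To make the composition rule concrete, here is a hypothetical setup (the URL is a placeholder): the cluster-wide Spark config names the HTTP dispatcher, while a job-local spline.properties names the console one; the agent wraps both in a composite dispatcher, so lineage is sent to the HTTP endpoint and also written to the console.

# cluster-wide Spark config (higher precedence)
spark.spline.lineageDispatcher=http
spark.spline.lineageDispatcher.http.producer.url=http://spline-server:9090/producer

# job-local spline.properties on the classpath (lower precedence)
spline.lineageDispatcher=console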

Properties

spline.mode

  • ENABLED [default]

    Spline will try to initialize itself, but if it fails it switches to DISABLED mode allowing the Spark application to proceed normally without Lineage tracking.

  • DISABLED

    Lineage tracking is completely disabled and Spline is unhooked from Spark.

spline.lineageDispatcher

The logical name of the root lineage dispatcher. See Lineage Dispatchers chapter.

spline.postProcessingFilter

The logical name of the root post-processing filter. See Post Processing Filters chapter.

Lineage Dispatchers

The LineageDispatcher trait is responsible for sending out the captured lineage information. By default, the HttpLineageDispatcher is used, which sends the lineage data to a Spline REST endpoint (see Spline Producer API).

Available dispatchers:

  • HttpLineageDispatcher - sends lineage to an HTTP endpoint
  • KafkaLineageDispatcher - sends lineage to a Kafka topic
  • ConsoleLineageDispatcher - writes lineage to the console
  • LoggingLineageDispatcher - logs lineage using the Spark logger
  • FallbackLineageDispatcher - sends lineage to a fallback dispatcher if the primary one fails
  • CompositeLineageDispatcher - combines multiple dispatchers so lineage can be sent to multiple endpoints

Each dispatcher can have different configuration parameters. To keep the configurations clearly separated, each dispatcher has its own namespace in which all its parameters are defined. The following Kafka example illustrates this.

Defining dispatcher

spline.lineageDispatcher=kafka

Once you have defined the dispatcher, all of its other parameters are prefixed with the namespace spline.lineageDispatcher.{{dispatcher-name}}.. In this case the prefix is spline.lineageDispatcher.kafka..

To find out which parameters you can use, look into spline.default.yaml. For Kafka, at least these two properties must be defined:

spline.lineageDispatcher.kafka.topic=foo
spline.lineageDispatcher.kafka.producer.bootstrap.servers=localhost:9092

Using the Http Dispatcher

This dispatcher is used by default. The only mandatory configuration is the URL of the Producer API REST endpoint (spline.lineageDispatcher.http.producer.url). Additionally, timeouts, apiVersion and multiple custom headers can be set.

spline.lineageDispatcher.http.producer.url=
spline.lineageDispatcher.http.timeout.connection=2000
spline.lineageDispatcher.http.timeout.read=120000
spline.lineageDispatcher.http.apiVersion=LATEST
spline.lineageDispatcher.http.header.X-CUSTOM-HEADER=custom-header-value

If the producer requires token-based authentication for requests, the following details must be included in the configuration:

spline.lineageDispatcher.http.authentication.type=OAUTH
spline.lineageDispatcher.http.authentication.grantType=client_credentials
spline.lineageDispatcher.http.authentication.clientId=<client_id>
spline.lineageDispatcher.http.authentication.clientSecret=<secret>
spline.lineageDispatcher.http.authentication.scope=<scope>
spline.lineageDispatcher.http.authentication.tokenUrl=<token_url>

Example: Azure HTTP trigger template API key header can be set like this:

spline.lineageDispatcher.http.header.X-FUNCTIONS-KEY=USER_API_KEY

Example: AWS Rest API key header can be set like this:

spline.lineageDispatcher.http.header.X-API-Key=USER_API_KEY

Using the Fallback Dispatcher

The FallbackDispatcher is a proxy dispatcher that sends lineage to the primary dispatcher first, and then if there is an error it calls the fallback one.

In the following example the HttpLineageDispatcher is used as the primary dispatcher, and the ConsoleLineageDispatcher as the fallback.

spline.lineageDispatcher=fallback
spline.lineageDispatcher.fallback.primaryDispatcher=http
spline.lineageDispatcher.fallback.fallbackDispatcher=console

Using the Composite Dispatcher

The CompositeDispatcher is a proxy dispatcher that forwards lineage data to multiple dispatchers.

For example, if you want the lineage data to be sent to an HTTP endpoint and to be logged to the console at the same time you can do the following:

spline.lineageDispatcher=composite
spline.lineageDispatcher.composite.dispatchers=http,console

By default, if some dispatchers in the list fail, the others are still attempted. If you want an error in any dispatcher to be treated as fatal and propagated to the main process, set the failOnErrors property to true:

spline.lineageDispatcher.composite.failOnErrors=true

Creating your own dispatcher

You can also create your own dispatcher. It must implement the LineageDispatcher trait and have a constructor with a single parameter of type org.apache.commons.configuration.Configuration. To use it, define a name and a class, plus any other parameters you need. For example:

spline.lineageDispatcher=my-dispatcher
spline.lineageDispatcher.my-dispatcher.className=org.example.spline.MyDispatcherImpl
spline.lineageDispatcher.my-dispatcher.prop1=value1
spline.lineageDispatcher.my-dispatcher.prop2=value2
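
As a rough sketch of what such a class could look like (the producer model package and the exact LineageDispatcher method signatures vary between agent versions, so treat the imports and overrides below as assumptions and verify them against your agent's sources):

package org.example.spline

import org.apache.commons.configuration.Configuration
import za.co.absa.spline.harvester.dispatcher.LineageDispatcher
// assumption: the model package and class names may differ in your agent version
import za.co.absa.spline.producer.model.{ExecutionEvent, ExecutionPlan}

class MyDispatcherImpl(conf: Configuration) extends LineageDispatcher {
  // reads spline.lineageDispatcher.my-dispatcher.prop1 from the resolved configuration
  private val prop1: String = conf.getString("prop1")

  // assumed signatures: one send() per lineage entity type
  override def send(plan: ExecutionPlan): Unit = println(s"[$prop1] plan: $plan")
  override def send(event: ExecutionEvent): Unit = println(s"[$prop1] event: $event")
}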

Combining dispatchers (complex example)

If needed, you can combine multiple dispatchers into a single one using CompositeLineageDispatcher and FallbackLineageDispatcher in any combination.

In the following example the lineage is first sent to the HTTP endpoint "http://10.20.111.222/lineage-primary"; if that fails it is redirected to the "http://10.20.111.222/lineage-secondary" endpoint, and if that one fails as well, the lineage is logged to the ERROR logs and to the console at the same time.

spline.lineageDispatcher.http1.className=za.co.absa.spline.harvester.dispatcher.HttpLineageDispatcher
spline.lineageDispatcher.http1.producer.url=http://10.20.111.222/lineage-primary

spline.lineageDispatcher.http2.className=za.co.absa.spline.harvester.dispatcher.HttpLineageDispatcher
spline.lineageDispatcher.http2.producer.url=http://10.20.111.222/lineage-secondary

spline.lineageDispatcher.errorLogs.className=za.co.absa.spline.harvester.dispatcher.LoggingLineageDispatcher
spline.lineageDispatcher.errorLogs.level=ERROR

spline.lineageDispatcher.disp1.className=za.co.absa.spline.harvester.dispatcher.FallbackLineageDispatcher
spline.lineageDispatcher.disp1.primaryDispatcher=http1
spline.lineageDispatcher.disp1.fallbackDispatcher=disp2

spline.lineageDispatcher.disp2.className=za.co.absa.spline.harvester.dispatcher.FallbackLineageDispatcher
spline.lineageDispatcher.disp2.primaryDispatcher=http2
spline.lineageDispatcher.disp2.fallbackDispatcher=disp3

spline.lineageDispatcher.disp3.className=za.co.absa.spline.harvester.dispatcher.CompositeLineageDispatcher
spline.lineageDispatcher.disp3.dispatchers=errorLogs,console

spline.lineageDispatcher=disp1

Post Processing Filters

Filters can be used to enrich the lineage with your own custom data or to remove unwanted data like passwords. All filters are applied after the Spark plan is converted to Spline DTOs, but before the dispatcher is called.

Filters are registered and configured in a similar way to LineageDispatchers. A custom filter class must implement the za.co.absa.spline.harvester.postprocessing.PostProcessingFilter trait and declare a constructor with a single parameter of type org.apache.commons.configuration.Configuration. Then register and configure it like this:

spline.postProcessingFilter=my-filter
spline.postProcessingFilter.my-filter.className=my.awesome.CustomFilter
spline.postProcessingFilter.my-filter.prop1=value1
spline.postProcessingFilter.my-filter.prop2=value2

Use the pre-registered CompositePostProcessingFilter to chain multiple filters:

spline.postProcessingFilter=composite
spline.postProcessingFilter.composite.filters=myFilter1,myFilter2

(see spline.default.yaml for details and examples)

Using MetadataCollectingFilter

MetadataCollectingFilter provides a way to add additional data to the lineage produced by the Spline agent.

Data can be added to the following lineage entities: executionPlan, executionEvent, operation, read and write.

Each entity has a dedicated map named extra that can store arbitrary additional user data.

executionPlan and executionEvent have an additional map called labels. Labels are intended for identification and filtering on the server.

Example usage:

spline.postProcessingFilter=userExtraMeta
spline.postProcessingFilter.userExtraMeta.rules=file:///path/to/json-with-rules.json

json-with-rules.json could look like this:

{
    "executionPlan": {
        "extra": {
            "my-extra-1": 42,
            "my-extra-2": [ "aaa", "bbb", "ccc" ]
        },
        "labels": {
            "my-label": "my-value"
        }
    },
    "write": {
        "extra": {
            "foo": "extra-value"
        }
    }
}

The spline.postProcessingFilter.userExtraMeta.rules property can be either a URL pointing to a JSON file or a JSON string. Rule definitions can be quite long, and providing the string directly may require a lot of escaping, so using a file is recommended.

Example of escaping the rules string in a Scala String:

.config("spline.postProcessingFilter.userExtraMeta.rules", "{\"executionPlan\":{\"extra\":{\"qux\":42\\,\"tags\":[\"aaa\"\\,\"bbb\"\\,\"ccc\"]}}}")
  • " needs to be escaped because it would end the string
  • , needs to be escaped because when passing configuration via Java properties the comma is used as a separator under the hood and must be explicitly escaped.

Example of escaping the rules string as VM option:

-Dspline.postProcessingFilter.userExtraMeta.rules={\"executionPlan\":{\"extra\":{\"qux\":42\,\"tags\":[\"aaa\"\,\"bbb\"\,\"ccc\"]}}}

A convenient way to provide the rules JSON without the need for escaping is to specify the property in a YAML config file. An example of this can be seen in the Spline examples YAML config.
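
A hypothetical sketch of how that could look in a spline.yaml on the classpath (the exact key layout should be verified against spline.default.yaml and the Spline examples):

# hypothetical layout -- verify the key structure accepted by your agent version
spline:
  postProcessingFilter: userExtraMeta
  postProcessingFilter.userExtraMeta.rules: >
    {"executionPlan": {"extra": {"qux": 42, "tags": ["aaa", "bbb", "ccc"]}}}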

There is also an option to read environment variables using $env, JVM properties using $jvm, and to execute JavaScript using $js. See the following example:

{
    "executionPlan": {
        "extra": {
            "my-extra-1": 42,
            "my-extra-2": [ "aaa", "bbb", "ccc" ],
            "bar": { "$env": "BAR_HOME" },
            "baz": { "$jvm": "some.jvm.prop" },
            "daz": { "$js": "session.conf().get('k')" },
            "appName": { "$js":"session.sparkContext().appName()" }
       }
    }
}

For the JavaScript evaluation, the following variables are available by default:

variable          Scala type
session           org.apache.spark.sql.SparkSession
logicalPlan       org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
executedPlanOpt   Option[org.apache.spark.sql.execution.SparkPlan]

Using those objects it should be possible to extract almost any relevant information from Spark.

The rules can be conditional, meaning the specified params will be added only when some condition is met. See the following example:

{
    "executionEvent[@.timestamp > 65]": {
        "extra": { "tux": 1 }
    },
    "executionEvent[@.extra['foo'] == 'a' && @.extra['bar'] == 'x']": {
        "extra": { "bux": 2 }
    },
    "executionEvent[@.extra['foo'] == 'a' && [email protected]['bar']]": {
        "extra": { "dux": 3 }
    },
    "executionEvent[@.extra['baz'][2] >= 3]": {
        "extra": { "mux": 4 }
    },
    "executionEvent[@.extra['baz'][2] < 3]": {
        "extra": { "fux": 5 }
    },
    "executionEvent[session.sparkContext.conf['spark.ui.enabled'] == 'false']": {
      "extra": { "tux": 1 }
    }
}

The condition is enclosed in [] after the entity name. Here @ refers to the currently processed entity, in this case executionEvent. The [] inside the condition statement can also be used to access maps and sequences. Logical and comparison operators are available.

session and the other variables available for JS expressions can be used here as well.

For more usage examples, please see the MetadataCollectingFilterSpec test class.

Spark features coverage

Dataset operations are fully supported.

RDD transformations aren't supported due to Spark internal architecture specifics, but they might be supported semi-automatically in future Spline versions (see #33).

The SQL dialect is mostly supported.

DDL operations are not supported, except for CREATE TABLE ... AS SELECT ..., which is supported.

Note: by default, lineage is only captured on persistent (write) actions. To capture in-memory actions like collect(), show() etc., the corresponding plugin needs to be activated by setting the following configuration property:

spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled=true

(See spline.default.yaml for more information)
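
For orientation, a minimal sketch of a job whose write lineage would be captured out of the box (the paths and column name are made up):

// any persistent write action (save, insertInto, saveAsTable, ...) triggers lineage capture
val df = spark.read.option("header", "true").csv("/data/input.csv")   // hypothetical input path

df.filter(df("amount") > 0)
  .write
  .mode("overwrite")
  .parquet("/data/output")   // the lineage event is emitted when this write completes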

The following data formats and providers are supported out of the box:

  • Avro
  • Cassandra
  • COBOL
  • Delta
  • ElasticSearch
  • Excel
  • HDFS
  • Hive
  • JDBC
  • Kafka
  • MongoDB
  • XML

Although Spark, being an extensible piece of software, can support much more, it doesn't provide a universal API that Spline can use to capture reads from and writes to everything that Spark supports. Support for most data sources and formats has to be added to Spline one by one. Fortunately, starting with Spline 0.5.4, the auto-discoverable Plugin API was introduced to make this process easier.

Below is a breakdown of the read/write commands we have come across. Some commands are implemented, others have yet to be implemented, and finally there are some that bear no lineage information and hence are ignored.

All commands inherit from org.apache.spark.sql.catalyst.plans.logical.Command.

You can see how to produce unimplemented commands in za.co.absa.spline.harvester.SparkUnimplementedCommandsSpec.

Implemented

  • CreateDataSourceTableAsSelectCommand (org.apache.spark.sql.execution.command)
  • CreateHiveTableAsSelectCommand (org.apache.spark.sql.hive.execution)
  • CreateTableCommand (org.apache.spark.sql.execution.command)
  • DropTableCommand (org.apache.spark.sql.execution.command)
  • InsertIntoDataSourceDirCommand (org.apache.spark.sql.execution.command)
  • InsertIntoHadoopFsRelationCommand (org.apache.spark.sql.execution.datasources)
  • InsertIntoHiveDirCommand (org.apache.spark.sql.hive.execution)
  • InsertIntoHiveTable (org.apache.spark.sql.hive.execution)
  • SaveIntoDataSourceCommand (org.apache.spark.sql.execution.datasources)

To be implemented

  • AlterTableAddColumnsCommand (org.apache.spark.sql.execution.command)
  • AlterTableChangeColumnCommand (org.apache.spark.sql.execution.command)
  • AlterTableRenameCommand (org.apache.spark.sql.execution.command)
  • AlterTableSetLocationCommand (org.apache.spark.sql.execution.command)
  • CreateDataSourceTableCommand (org.apache.spark.sql.execution.command)
  • CreateDatabaseCommand (org.apache.spark.sql.execution.command)
  • CreateTableLikeCommand (org.apache.spark.sql.execution.command)
  • DropDatabaseCommand (org.apache.spark.sql.execution.command)
  • LoadDataCommand (org.apache.spark.sql.execution.command)
  • TruncateTableCommand (org.apache.spark.sql.execution.command)

When one of these commands occurs, Spline will let you know by logging a warning.

Ignored

  • AddFileCommand (org.apache.spark.sql.execution.command)
  • AddJarCommand (org.apache.spark.sql.execution.command)
  • AlterDatabasePropertiesCommand (org.apache.spark.sql.execution.command)
  • AlterTableAddPartitionCommand (org.apache.spark.sql.execution.command)
  • AlterTableDropPartitionCommand (org.apache.spark.sql.execution.command)
  • AlterTableRecoverPartitionsCommand (org.apache.spark.sql.execution.command)
  • AlterTableRenamePartitionCommand (org.apache.spark.sql.execution.command)
  • AlterTableSerDePropertiesCommand (org.apache.spark.sql.execution.command)
  • AlterTableSetPropertiesCommand (org.apache.spark.sql.execution.command)
  • AlterTableUnsetPropertiesCommand (org.apache.spark.sql.execution.command)
  • AlterViewAsCommand (org.apache.spark.sql.execution.command)
  • AnalyzeColumnCommand (org.apache.spark.sql.execution.command)
  • AnalyzePartitionCommand (org.apache.spark.sql.execution.command)
  • AnalyzeTableCommand (org.apache.spark.sql.execution.command)
  • CacheTableCommand (org.apache.spark.sql.execution.command)
  • ClearCacheCommand (org.apache.spark.sql.execution.command)
  • CreateFunctionCommand (org.apache.spark.sql.execution.command)
  • CreateTempViewUsing (org.apache.spark.sql.execution.datasources)
  • CreateViewCommand (org.apache.spark.sql.execution.command)
  • DescribeColumnCommand (org.apache.spark.sql.execution.command)
  • DescribeDatabaseCommand (org.apache.spark.sql.execution.command)
  • DescribeFunctionCommand (org.apache.spark.sql.execution.command)
  • DescribeTableCommand (org.apache.spark.sql.execution.command)
  • DropFunctionCommand (org.apache.spark.sql.execution.command)
  • ExplainCommand (org.apache.spark.sql.execution.command)
  • InsertIntoDataSourceCommand (org.apache.spark.sql.execution.datasources) *
  • ListFilesCommand (org.apache.spark.sql.execution.command)
  • ListJarsCommand (org.apache.spark.sql.execution.command)
  • RefreshResource (org.apache.spark.sql.execution.datasources)
  • RefreshTable (org.apache.spark.sql.execution.datasources)
  • ResetCommand$ (org.apache.spark.sql.execution.command)
  • SetCommand (org.apache.spark.sql.execution.command)
  • SetDatabaseCommand (org.apache.spark.sql.execution.command)
  • ShowColumnsCommand (org.apache.spark.sql.execution.command)
  • ShowCreateTableCommand (org.apache.spark.sql.execution.command)
  • ShowDatabasesCommand (org.apache.spark.sql.execution.command)
  • ShowFunctionsCommand (org.apache.spark.sql.execution.command)
  • ShowPartitionsCommand (org.apache.spark.sql.execution.command)
  • ShowTablePropertiesCommand (org.apache.spark.sql.execution.command)
  • ShowTablesCommand (org.apache.spark.sql.execution.command)
  • StreamingExplainCommand (org.apache.spark.sql.execution.command)
  • UncacheTableCommand (org.apache.spark.sql.execution.command)

Developer documentation

Plugin API

Using the Plugin API you can capture lineage from a 3rd-party data source provider. Spline discovers plugins automatically by scanning the classpath, so no special steps are required to register or configure a plugin. All you need to do is create a class that extends the za.co.absa.spline.harvester.plugin.Plugin marker trait mixed with one or more *Processing traits, depending on your intention.

There are three general processing traits:

  • DataSourceFormatNameResolving - returns the name of the data provider/format in use.
  • ReadNodeProcessing - detects a read command and gathers meta information.
  • WriteNodeProcessing - detects a write command and gathers meta information.

There are also two additional traits that handle common cases of reading and writing:

  • BaseRelationProcessing - similar to ReadNodeProcessing, but instead of capturing all logical plan nodes it reacts only to LogicalRelation (see LogicalRelationPlugin)
  • RelationProviderProcessing - similar to WriteNodeProcessing, but it only captures SaveIntoDataSourceCommand (see SaveIntoDataSourceCommandPlugin)

The best way to illustrate how plugins work is to look at a real working example, e.g. za.co.absa.spline.harvester.plugin.embedded.JDBCPlugin

The most common simplified pattern looks like this:

package my.spline.plugin

import javax.annotation.Priority
import za.co.absa.spline.harvester.builder._
import za.co.absa.spline.harvester.plugin.Plugin._
import za.co.absa.spline.harvester.plugin._

@Priority(Precedence.User) // not required, but can be used to control your plugin precedence in the plugin chain. Default value is `User`.  
class FooBarPlugin
  extends Plugin
    with BaseRelationProcessing
    with RelationProviderProcessing {

  override def baseRelationProcessor: PartialFunction[(BaseRelation, LogicalRelation), ReadNodeInfo] = {
    case (FooBarRelation(a, b, c, d), lr) if /*more conditions*/ =>
      val dataFormat: Option[AnyRef] = ??? // data format being read (will be resolved by the `DataSourceFormatResolver` later)
      val dataSourceURI: String = ??? // a unique URI for the data source
      val params: Map[String, Any] = ??? // additional parameters characterizing the read-command. E.g. (connection protocol, access mode, driver options etc)

      (SourceIdentifier(dataFormat, dataSourceURI), params)
  }

  override def relationProviderProcessor: PartialFunction[(AnyRef, SaveIntoDataSourceCommand), WriteNodeInfo] = {
    case (provider, cmd) if provider == "foobar" || provider.isInstanceOf[FooBarProvider] =>
      val dataFormat: Option[AnyRef] = ??? // data format being written (will be resolved by the `DataSourceFormatResolver` later)
      val dataSourceURI: String = ??? // a unique URI for the data source
      val writeMode: SaveMode = ??? // was it Append or Overwrite?
      val query: LogicalPlan = ??? // the logical plan to get the rest of the lineage from
      val params: Map[String, Any] = ??? // additional parameters characterizing the write-command

      (SourceIdentifier(dataFormat, dataSourceURI), writeMode, query, params)
  }
}

Note: to avoid unintentionally shadowing other plugins (including future ones), make sure the pattern-matching criteria are as selective as possible for your plugin's needs.

A plugin class is expected to have only a single constructor. The constructor can have no arguments, or accept one or more parameters of the following types (the values will be autowired; see the sketch after the list):

  • SparkSession
  • PathQualifier
  • PluginRegistry
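
For example, a hypothetical plugin that needs the active SparkSession could declare it as a constructor parameter (processing traits omitted for brevity):

package my.spline.plugin

import org.apache.spark.sql.SparkSession
import za.co.absa.spline.harvester.plugin.Plugin

// the SparkSession instance is autowired by Spline when the plugin is instantiated
class SessionAwarePlugin(session: SparkSession) extends Plugin {
  // mix in ReadNodeProcessing / WriteNodeProcessing etc. and use `session` as needed
}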

Compile your plugin and drop it into the Spline/Spark classpath. Spline will pick it up automatically.

Building for different Scala and Spark versions

Note: The project requires Java version 1.8 (strictly) and Apache Maven for building.

Check the build environment:

mvn --version

Verify that Maven is configured to run on Java 1.8. For example:

Apache Maven 3.6.3 (Red Hat 3.6.3-8)
Maven home: /usr/share/maven
Java version: 1.8.0_302, vendor: Red Hat, Inc., runtime: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.302.b08-2.fc34.x86_64/jre

There are several Maven profiles that make it easy to build the project with different versions of Spark and Scala.

  • Scala profiles: scala-2.11, scala-2.12 (default)
  • Spark profiles: spark-2.2, spark-2.3, spark-2.4 (default), spark-3.0, spark-3.1, spark-3.2, spark-3.3

For example, to build an agent for Spark 2.4 and Scala 2.11:

# Change Scala version in pom.xml.
mvn scala-cross-build:change-version -Pscala-2.11

# now you can build for Scala 2.11
mvn clean install -Pscala-2.11,spark-2.4
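
Similarly, to build for a newer Spark line, e.g. Spark 3.3 with the default Scala 2.12:

mvn clean install -Pscala-2.12,spark-3.3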

Build docker image

The agent docker image is mainly used to run example jobs and pre-fill the database with the sample lineage data.

(Spline docker images are available on the DockerHub repo - https://hub.docker.com/u/absaoss)

mvn install -Ddocker -Ddockerfile.repositoryUrl=my

See How to build Spline Docker images for details.

How to measure code coverage

./mvn verify -Dcode-coverage

If a module contains measurable data, the code coverage report will be generated at the following path:

{local-path}\spline-spark-agent\{module}\target\site\jacoco

References and examples

Although the primary goal of the Spline agent is to be used in combination with the Spline server, it is flexible enough to be used in isolation or integrated with other data lineage tracking solutions, including custom ones.



Copyright 2019 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


spline-spark-agent's Issues

Cannot re-attempt Spline initialization after an error

In the Spark shell, when I forgot to provide spline.producer.url, Spline init fails due to the missing configuration property, as expected. But after that happens I have to close and reopen the Spark shell to re-attempt Spline init. Otherwise it says that Spline has already been initialized, even though it actually isn't.

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_232)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import za.co.absa.spline.harvester.SparkLineageInitializer
import za.co.absa.spline.harvester.SparkLineageInitializer

scala> SparkLineageInitializer.enableLineageTracking(spark)
19/12/16 16:41:51 ERROR SparkLineageInitializer$: Spline initialization failed! Spark Lineage tracking is DISABLED.
java.lang.IllegalArgumentException: requirement failed: Missing configuration property spline.producer.url
	at scala.Predef$.require(Predef.scala:224)
	at za.co.absa.spline.common.ConfigurationImplicits$ConfigurationRequiredWrapper.za$co$absa$spline$common$ConfigurationImplicits$ConfigurationRequiredWrapper$$getRequired(ConfigurationImplicits.scala:107)
	at za.co.absa.spline.common.ConfigurationImplicits$ConfigurationRequiredWrapper$$anonfun$getRequiredString$3.apply(ConfigurationImplicits.scala:40)
	at za.co.absa.spline.common.ConfigurationImplicits$ConfigurationRequiredWrapper$$anonfun$getRequiredString$3.apply(ConfigurationImplicits.scala:40)
	at za.co.absa.spline.harvester.dispatcher.HttpLineageDispatcher$.apply(HttpLineageDispatcher.scala:87)
	at za.co.absa.spline.harvester.conf.DefaultSplineConfigurer.lineageDispatcher$lzycompute(DefaultSplineConfigurer.scala:75)
	at za.co.absa.spline.harvester.conf.DefaultSplineConfigurer.lineageDispatcher(DefaultSplineConfigurer.scala:73)
	at za.co.absa.spline.harvester.conf.DefaultSplineConfigurer.queryExecutionEventHandler(DefaultSplineConfigurer.scala:90)
	at za.co.absa.spline.harvester.SparkLineageInitializer$SparkSessionWrapper.createEventHandler(SparkLineageInitializer.scala:110)
	at za.co.absa.spline.harvester.SparkLineageInitializer$SparkSessionWrapper.enableLineageTracking(SparkLineageInitializer.scala:83)
	at za.co.absa.spline.harvester.SparkLineageInitializer$.enableLineageTracking(SparkLineageInitializer.scala:39)
	at $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:25)
	at $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:30)
	at $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:32)
	at $line15.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:34)
	at $line15.$read$$iw$$iw$$iw$$iw.<init>(<console>:36)
	at $line15.$read$$iw$$iw$$iw.<init>(<console>:38)
	at $line15.$read$$iw$$iw.<init>(<console>:40)
	at $line15.$read$$iw.<init>(<console>:42)
	at $line15.$read.<init>(<console>:44)
	at $line15.$read$.<init>(<console>:48)
	at $line15.$read$.<clinit>(<console>)
	at $line15.$eval$.$print$lzycompute(<console>:7)
	at $line15.$eval$.$print(<console>:6)
	at $line15.$eval.$print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:793)
	at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1054)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:645)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:644)
	at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
	at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:644)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:576)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:572)
	at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:819)
	at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:691)
	at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:404)
	at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:425)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:285)
	at org.apache.spark.repl.SparkILoop.runClosure(SparkILoop.scala:159)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:182)
	at org.apache.spark.repl.Main$.doMain(Main.scala:78)
	at org.apache.spark.repl.Main$.main(Main.scala:58)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@7a02d760

scala> SparkLineageInitializer.enableLineageTracking(spark)
19/12/16 16:41:55 WARN SparkLineageInitializer$: Spline lineage tracking is already initialized!
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@7a02d760

scala> 

Cobol connector support (Cobrix)

https://github.com/AbsaOSS/cobrix

java.lang.RuntimeException: Relation is not supported: za.co.absa.cobrix.spark.cobol.source.CobolRelation@30ce78e3
        at scala.sys.package$.error(package.scala:27)
        at za.co.absa.spline.harvester.builder.read.ReadCommandExtractor$$anonfun$asReadCommand$1.applyOrElse(ReadCommandExtractor.scala:110)
        at za.co.absa.spline.harvester.builder.read.ReadCommandExtractor$$anonfun$asReadCommand$1.applyOrElse(ReadCommandExtractor.scala:49)
        at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223)
        at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219)
        at scala.PartialFunction$.condOpt(PartialFunction.scala:286)
        at za.co.absa.spline.harvester.builder.read.ReadCommandExtractor.asReadCommand(ReadCommandExtractor.scala:49)
        at za.co.absa.spline.harvester.LineageHarvester.za$co$absa$spline$harvester$LineageHarvester$$createOperationBuilder(LineageHarvester.scala:144)
        at za.co.absa.spline.harvester.LineageHarvester$$anonfun$6.apply(LineageHarvester.scala:121)
        at za.co.absa.spline.harvester.LineageHarvester$$anonfun$6.apply(LineageHarvester.scala:121)
        at scala.Option.getOrElse(Option.scala:121)
        at za.co.absa.spline.harvester.LineageHarvester.traverseAndCollect$1(LineageHarvester.scala:121)
        at za.co.absa.spline.harvester.LineageHarvester.za$co$absa$spline$harvester$LineageHarvester$$createOperationBuildersRecursively(LineageHarvester.scala:140)
        at za.co.absa.spline.harvester.LineageHarvester$$anonfun$harvest$1.apply(LineageHarvester.scala:72)
        at za.co.absa.spline.harvester.LineageHarvester$$anonfun$harvest$1.apply(LineageHarvester.scala:70)
        at scala.Option.flatMap(Option.scala:171)
        at za.co.absa.spline.harvester.LineageHarvester.harvest(LineageHarvester.scala:70)
        at za.co.absa.spline.harvester.QueryExecutionEventHandler.onSuccess(QueryExecutionEventHandler.scala:40)
        at za.co.absa.spline.harvester.listener.SplineQueryExecutionListener$$anonfun$onSuccess$1.apply(SplineQueryExecutionListener.scala:37)
        at za.co.absa.spline.harvester.listener.SplineQueryExecutionListener$$anonfun$onSuccess$1.apply(SplineQueryExecutionListener.scala:37)
        at scala.Option.foreach(Option.scala:257)
        at za.co.absa.spline.harvester.listener.SplineQueryExecutionListener.onSuccess(SplineQueryExecutionListener.scala:37)
        at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:124)
        at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:123)
        at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:145)

Lineage not being reported from production Hadoop+Hive+Spark

I verified that lineage is reported from end-to-end tests (local spark context)

However, when run on a Hadoop cluster with Hive and Spark 2.4.5, no lineage, warning, or error is reported.

configuration:

spark.sql.queryExecutionListeners="za.co.absa.spline.harvester.listener.SplineQueryExecutionListener"
spark.spline.producer.url="http://my-domian-name:9090/producer"
spark.spline.timeout.read=30000
spark.spline.timeout.connection=30000
spark.spline.iwd_strategy.default.on_missing_metrics="CAPTURE_LINEAGE"

build.sbt:

"za.co.absa.spline.agent.spark" %% "agent-core" % "0.5.2" excludeAll(
  ExclusionRule(organization = "org.apache.spark")
)

logs from job:

2020-05-26 16:22:35.793 | Spline successfully initialized. Spark Lineage tracking is ENABLED.
2020-05-26 16:22:35.728 | spline.timeout.connection is set to:'30000 milliseconds' ms
2020-05-26 16:22:35.728 | spline.timeout.read is set to:'30000 milliseconds' ms
2020-05-26 16:22:35.727 | spline.producer.url is set to:'http://my-domian-name:9090/producer'
2020-05-26 16:22:35.709 | Instantiating LineageDispatcher for class name: za.co.absa.spline.harvester.dispatcher.HttpLineageDispatcher
2020-05-26 16:22:35.706 | Instantiating IgnoredWriteDetectionStrategy for class name: za.co.absa.spline.harvester.write_detection.DefaultIgnoredWriteDetectionStrategy
2020-05-26 16:22:35.703 | Spline mode: BEST_EFFORT
2020-05-26 16:22:35.703 | Spline version: null
2020-05-26 16:22:35.685 | Initializing Spline agent...

The same code run from a local Spark context sends lineage successfully and it's visible in the Spline UI.

Any idea how to debug the problem? I see no logs which could be helpful; I set the logging configuration to DEBUG.

Codeless init problem

Hi,

I was trying to set up codeless init with Spark 2.4 using the following Spark conf option:

sparkConf.set("spark.sql.queryExecutionListeners", "za.co.absa.spline.harvester.QueryExecutionEventHandler")

It throws the following exception:

java.lang.IllegalArgumentException: requirement failed: za.co.absa.spline.harvester.QueryExecutio
nEventHandler is not a subclass of org.apache.spark.sql.util.QueryExecutionListener.

I tried to extend the QueryExecutionListener class, then it throws the following:

za.co.absa.spline.harvester.QueryExecutionEventHandler did not have a zero-argument constructor or a single-argument constructor that accepts SparkConf. Note: if the class is defined inside of another Scala class, then its constructors may accept an implicit parameter that references the enclosing class; in this case, you must define the class as a top-level class in order to prevent this extra parameter from breaking Spark's ability to find a valid constructor.

I'm probably trying to do it the wrong way; how do I initialize the agent with just Spark conf?

UnsupportedOperationException: dataType

20/04/02 20:58:29 INFO FileFormatWriter: Finished processing stats for write job 55b3e202-a211-4af3-a5ad-4b8dbbe0a945.
20/04/02 20:58:29 INFO SparkLineageInitializer$: Spline v0.4.2 is initializing...
20/04/02 20:58:29 INFO SparkLineageInitializer$: Spline successfully initialized. Spark Lineage tracking is ENABLED.
20/04/02 20:58:30 WARN ExecutionListenerManager: Error executing query execution listener
java.lang.UnsupportedOperationException: dataType
	at org.apache.spark.sql.catalyst.expressions.WindowSpecDefinition.dataType(windowExpressions.scala:54)
	at za.co.absa.spline.harvester.converter.ExpressionConverter.getDataType(ExpressionConverter.scala:77)
	at za.co.absa.spline.harvester.converter.ExpressionConverter.convert(ExpressionConverter.scala:71)
	at za.co.absa.spline.harvester.converter.ExpressionConverter$$anonfun$convert$4.apply(ExpressionConverter.scala:72)
	at za.co.absa.spline.harvester.converter.ExpressionConverter$$anonfun$convert$4.apply(ExpressionConverter.scala:72)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:296)
	at za.co.absa.spline.harvester.converter.ExpressionConverter.convert(ExpressionConverter.scala:72)
	at za.co.absa.spline.harvester.converter.ExpressionConverter.convert(ExpressionConverter.scala:41)
	at za.co.absa.spline.harvester.converter.OperationParamsConverter$$anonfun$1$$anonfun$apply$1.applyOrElse(OperationParamsConverter.scala:42)
	at za.co.absa.spline.harvester.converter.OperationParamsConverter$$anonfun$1$$anonfun$apply$1.applyOrElse(OperationParamsConverter.scala:35)
	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
	at za.co.absa.spline.harvester.converter.ValueDecomposer$$anonfun$za$co$absa$spline$harvester$converter$ValueDecomposer$$recursion$1.apply(ValueDecomposer.scala:39)
	at za.co.absa.spline.harvester.converter.ValueDecomposer$$anonfun$za$co$absa$spline$harvester$converter$ValueDecomposer$$recursion$1.apply(ValueDecomposer.scala:39)
	at za.co.absa.spline.harvester.converter.ValueDecomposer$$anonfun$2$$anonfun$apply$1$$anonfun$applyOrElse$4.apply(ValueDecomposer.scala:51)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at za.co.absa.spline.harvester.converter.ValueDecomposer$$anonfun$2$$anonfun$apply$1.applyOrElse(ValueDecomposer.scala:51)
	at za.co.absa.spline.harvester.converter.ValueDecomposer$$anonfun$2$$anonfun$apply$1.applyOrElse(ValueDecomposer.scala:44)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at za.co.absa.spline.harvester.converter.OperationParamsConverter$$anonfun$1$$anonfun$apply$1.applyOrElse(OperationParamsConverter.scala:35)
	at za.co.absa.spline.harvester.converter.OperationParamsConverter$$anonfun$1$$anonfun$apply$1.applyOrElse(OperationParamsConverter.scala:35)
	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
	at za.co.absa.spline.harvester.converter.ValueDecomposer$$anonfun$za$co$absa$spline$harvester$converter$ValueDecomposer$$recursion$1.apply(ValueDecomposer.scala:39)
	at za.co.absa.spline.harvester.converter.ValueDecomposer$$anonfun$za$co$absa$spline$harvester$converter$ValueDecomposer$$recursion$1.apply(ValueDecomposer.scala:39)
	at za.co.absa.spline.harvester.converter.OperationParamsConverter$$anonfun$convert$4.apply(OperationParamsConverter.scala:58)
	at za.co.absa.spline.harvester.converter.OperationParamsConverter$$anonfun$convert$4.apply(OperationParamsConverter.scala:54)
	at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:683)
	at scala.collection.immutable.Map$Map4.foreach(Map.scala:188)
	at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:682)
	at za.co.absa.spline.harvester.converter.OperationParamsConverter.convert(OperationParamsConverter.scala:54)
	at za.co.absa.spline.harvester.builder.GenericNodeBuilder.build(GenericNodeBuilder.scala:34)
	at za.co.absa.spline.harvester.builder.GenericNodeBuilder.build(GenericNodeBuilder.scala:24)
	at za.co.absa.spline.harvester.LineageHarvester$$anonfun$harvest$1$$anonfun$4.apply(LineageHarvester.scala:76)
	at za.co.absa.spline.harvester.LineageHarvester$$anonfun$harvest$1$$anonfun$4.apply(LineageHarvester.scala:76)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:296)
	at za.co.absa.spline.harvester.LineageHarvester$$anonfun$harvest$1.apply(LineageHarvester.scala:76)
	at za.co.absa.spline.harvester.LineageHarvester$$anonfun$harvest$1.apply(LineageHarvester.scala:69)
	at scala.Option.flatMap(Option.scala:171)
	at za.co.absa.spline.harvester.LineageHarvester.harvest(LineageHarvester.scala:69)
	at za.co.absa.spline.harvester.QueryExecutionEventHandler.onSuccess(QueryExecutionEventHandler.scala:41)
	at za.co.absa.spline.harvester.listener.SplineQueryExecutionListener$$anonfun$onSuccess$1.apply(SplineQueryExecutionListener.scala:37)
	at za.co.absa.spline.harvester.listener.SplineQueryExecutionListener$$anonfun$onSuccess$1.apply(SplineQueryExecutionListener.scala:37)
	at scala.Option.foreach(Option.scala:257)
	at za.co.absa.spline.harvester.listener.SplineQueryExecutionListener.onSuccess(SplineQueryExecutionListener.scala:37)
	at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:124)
	at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:123)
	at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:145)
	at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:143)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
	at org.apache.spark.sql.util.ExecutionListenerManager.org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling(QueryExecutionListener.scala:143)
	at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply$mcV$sp(QueryExecutionListener.scala:123)
	at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:123)
	at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:123)
	at org.apache.spark.sql.util.ExecutionListenerManager.readLock(QueryExecutionListener.scala:156)
	at org.apache.spark.sql.util.ExecutionListenerManager.onSuccess(QueryExecutionListener.scala:122)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:678)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
20/04/02 20:58:30 INFO SparkContext: Invoking stop() from shutdown hook

Spline is not initialized properly!

Spline is not initialized properly!

Using Python version 3.7.4 (default, Aug 9 2019 18:34:13)
SparkSession available as 'spark'.

>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> from pyspark.sql import SparkSession
>>> import pandas as pd
>>>
>>> SC = SparkContext
>>> SC.setSystemProperty('spline.mode','REQUIRED')
>>> SC.setSystemProperty('spline.producer.url', 'http://localhost:8080/producer')
>>> SC._jvm.za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\SPARK\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "C:\SPARK\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\SPARK\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking.
: za.co.absa.spline.harvester.exception.SplineNotInitializedException: Spline is not initialized properly!
        at za.co.absa.spline.harvester.dispatcher.HttpLineageDispatcher.ensureProducerReady(HttpLineageDispatcher.scala:69)
        at za.co.absa.spline.harvester.SparkLineageInitializer$SparkSessionWrapper.enableLineageTracking(SparkLineageInitializer.scala:80)
        at za.co.absa.spline.harvester.SparkLineageInitializer$.enableLineageTracking(SparkLineageInitializer.scala:39)
        at za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking(SparkLineageInitializer.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Unknown Source)

Spline does not display schema/database for Hive tables

Hive supports multiple schemas/databases, for example my_db.my_table and different_team_db.my_table.

Spline correctly recognizes the URI for both, but the UI only displays my_table. The database/schema my_db is crucial, and it would benefit everyone to also display the database/schema in the lineage - or perhaps just the full table name, e.g. my_db.my_table?
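
For context, a minimal Scala sketch of the situation described (database, table and column names are illustrative): the same table name is written into two different Hive databases, so showing only my_table is ambiguous.

// illustrative names; the same table name exists in two different Hive databases
import spark.implicits._
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

spark.sql("CREATE DATABASE IF NOT EXISTS my_db")
spark.sql("CREATE DATABASE IF NOT EXISTS different_team_db")

df.write.mode("overwrite").saveAsTable("my_db.my_table")
df.write.mode("overwrite").saveAsTable("different_team_db.my_table")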

RDD lineage support

Currently there is no single easy solution for providing lineage for RDDs, but there are several ways to provide at least some of it or to define it manually. This ticket will group all of them.

Subtasks:

  • partial RDD support (Write - LogicalPlan, Read - RDD) - #498
  • RDD Read support
  • RDD manual lineage metadata enrichment

Original message:

I am using the latest Spline version. When I enable Spline, I see an incomplete lineage: my code reads a file from parquet and creates a temporary view, performs some transformations, and creates a new view which is then finally written to parquet. However, I am not able to see the lineage for my transformations.

(screenshot: the captured lineage graph)

Are there any constraints, such as all the intermediate RDDs having to be in memory, or any specific Spark config beyond what is mentioned in the documentation that needs to be enabled?
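
For reference, a minimal Scala sketch of the flow described above (paths and column names are illustrative), useful for reproducing the report:

// read parquet, build temp views, transform via SQL, write parquet
val src = spark.read.parquet("/data/in/events.parquet")
src.createOrReplaceTempView("events")

val aggregated = spark.sql("SELECT user_id, count(*) AS cnt FROM events GROUP BY user_id")
aggregated.createOrReplaceTempView("events_agg")

spark.table("events_agg").write.mode("overwrite").parquet("/data/out/events_agg.parquet")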

Spline libraries not working when importing the class in Databricks

Hi. As instructed, I downloaded the Spline libraries given below and uploaded them to all Databricks clusters:
spline-bundle-2_4-0.3.9.JAR

za.co.absa.spline:spline-core:0.3.6
za.co.absa.spline:spline-core-spark-adapter-2.4:0.3.6
za.co.absa.spline:spline-persistence-mongo:0.3.6

admin-0.4.0-sources.jar
rest-gateway-0.4.0-sources.jar
client-web-0.4.0-sources.jar

and then entered the following in a notebook:

import za.co.absa.spline.core.SparkLineageInitializer._
spark.enableLineageTracking()

#System.setProperty("spline.mode","BEST_EFFORT")
##System.setProperty("spline.persistence.factory", "za.co.absa.spline.persistence.mongo.MongoPersistenceFactory")
#System.setProperty("spline.mongodb.url", "mongodb://:@:<27701>")
#System.setProperty("spline.mongodb.name", "")

This throws an error. I am not sure what I missed here. I have not executed the Mongo system property statements, since the import itself is not working.

Please help.
Thanks
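
For comparison, a minimal Scala initialization sketch for a recent agent bundle attached to the cluster (the producer URL is illustrative); the spline-core / spline-persistence-mongo artifacts and the Mongo properties belong to the old 0.3 line and are not needed with the agent bundle:

import za.co.absa.spline.harvester.SparkLineageInitializer._

// configuration keys as used elsewhere in this document
System.setProperty("spline.mode", "BEST_EFFORT")
System.setProperty("spline.producer.url", "http://localhost:8080/producer")

spark.enableLineageTracking()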

Spark Streaming Support

Is it planned to add an integration with Spark Streaming? It could be useful to be able to capture lineage for both batch and streaming data.

Azure Cosmos DB write operation support

Background

Spline is not able to track lineage while writing to Azure Cosmos DB

Feature

Azure Cosmos DB write operation support

Example

Sample Scala code of a write operation to Cosmos DB for which lineage cannot be tracked:

import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.CosmosDBSpark
import com.microsoft.azure.cosmosdb.spark.config.Config
import org.apache.spark.sql.SaveMode

val customers_dataset_config_map = Map(
  "Endpoint" -> "<cosmos-db-endpoint>",
  "Masterkey" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-cosmos-db-masterkey>"),
  "Database" -> "<cosmos-db-database>",
  "Collection" -> "<cosmos-db-container>",
  "WritingBatchSize" -> "100")

val customers_dataset_config = Config(customers_dataset_config_map)

// Write the output to Azure Cosmos DB
customers_dataset.write.mode(SaveMode.Overwrite).cosmosDB(customers_dataset_config)

spark-shell / pyspark --packages not working

[wajda@alex-xps spline]$ /opt/spark-24/bin/spark-shell --packages za.co.absa.spline:spark-agent-bundle-2.4:0.4.0  --conf "spark.spline.producer.url=http://localhost:8080/producer"
Ivy Default Cache set to: /home/wajda/.ivy2/cache
The jars for the packages stored in: /home/wajda/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-2.4.4-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
za.co.absa.spline#spark-agent-bundle-2.4 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5fca6813-067f-4213-8b51-ed08f182c463;1.0
	confs: [default]
	found za.co.absa.spline#spark-agent-bundle-2.4;0.4.0 in central
	found za.co.absa.spline#spark-agent;0.4.0 in central
	found org.scala-lang#scala-compiler;2.11.12 in local-m2-cache
	found org.scala-lang#scala-reflect;2.11.12 in local-m2-cache
	found org.scala-lang.modules#scala-xml_2.11;1.0.5 in local-m2-cache
	found org.scala-lang.modules#scala-parser-combinators_2.11;1.1.0 in local-m2-cache
	found za.co.absa.spline#commons;0.4.0 in central
	found commons-configuration#commons-configuration;1.6 in local-m2-cache
	found commons-collections#commons-collections;3.2.2 in local-m2-cache
	found commons-lang#commons-lang;2.6 in local-m2-cache
	found commons-logging#commons-logging;1.1.3 in local-m2-cache
	found commons-digester#commons-digester;1.8 in local-m2-cache
	found commons-beanutils#commons-beanutils;1.9.3 in local-m2-cache
	found commons-beanutils#commons-beanutils-core;1.8.0 in local-m2-cache
	found commons-io#commons-io;2.4 in local-m2-cache
	found org.apache.commons#commons-lang3;3.5 in local-m2-cache
	found org.slf4s#slf4s-api_2.11;1.7.25 in local-m2-cache
	found org.slf4j#slf4j-api;1.7.16 in local-m2-cache
	found za.co.absa.spline#producer-model;0.4.0 in central
	found com.databricks#spark-xml_2.11;0.4.1 in local-m2-cache
	found org.scalaz#scalaz-core_2.11;7.2.27 in local-m2-cache
	found org.scalaj#scalaj-http_2.11;2.4.1 in local-m2-cache
:: resolution report :: resolve 843ms :: artifacts dl 17ms
	:: modules in use:
	com.databricks#spark-xml_2.11;0.4.1 from local-m2-cache in [default]
	commons-beanutils#commons-beanutils;1.9.3 from local-m2-cache in [default]
	commons-beanutils#commons-beanutils-core;1.8.0 from local-m2-cache in [default]
	commons-collections#commons-collections;3.2.2 from local-m2-cache in [default]
	commons-configuration#commons-configuration;1.6 from local-m2-cache in [default]
	commons-digester#commons-digester;1.8 from local-m2-cache in [default]
	commons-io#commons-io;2.4 from local-m2-cache in [default]
	commons-lang#commons-lang;2.6 from local-m2-cache in [default]
	commons-logging#commons-logging;1.1.3 from local-m2-cache in [default]
	org.apache.commons#commons-lang3;3.5 from local-m2-cache in [default]
	org.scala-lang#scala-compiler;2.11.12 from local-m2-cache in [default]
	org.scala-lang#scala-reflect;2.11.12 from local-m2-cache in [default]
	org.scala-lang.modules#scala-parser-combinators_2.11;1.1.0 from local-m2-cache in [default]
	org.scala-lang.modules#scala-xml_2.11;1.0.5 from local-m2-cache in [default]
	org.scalaj#scalaj-http_2.11;2.4.1 from local-m2-cache in [default]
	org.scalaz#scalaz-core_2.11;7.2.27 from local-m2-cache in [default]
	org.slf4j#slf4j-api;1.7.16 from local-m2-cache in [default]
	org.slf4s#slf4s-api_2.11;1.7.25 from local-m2-cache in [default]
	za.co.absa.spline#commons;0.4.0 from central in [default]
	za.co.absa.spline#producer-model;0.4.0 from central in [default]
	za.co.absa.spline#spark-agent;0.4.0 from central in [default]
	za.co.absa.spline#spark-agent-bundle-2.4;0.4.0 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   23  |   0   |   0   |   0   ||   22  |   0   |
	---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
		module not found: org.json4s#json4s-ext_2.11;${json4s.version}

	==== local-m2-cache: tried

	  file:/home/wajda/.m2/repository/org/json4s/json4s-ext_2.11/${json4s.version}/json4s-ext_2.11-${json4s.version}.pom

	  -- artifact org.json4s#json4s-ext_2.11;${json4s.version}!json4s-ext_2.11.jar:

	  file:/home/wajda/.m2/repository/org/json4s/json4s-ext_2.11/${json4s.version}/json4s-ext_2.11-${json4s.version}.jar

	==== local-ivy-cache: tried

	  /home/wajda/.ivy2/local/org.json4s/json4s-ext_2.11/${json4s.version}/ivys/ivy.xml

	  -- artifact org.json4s#json4s-ext_2.11;${json4s.version}!json4s-ext_2.11.jar:

	  /home/wajda/.ivy2/local/org.json4s/json4s-ext_2.11/${json4s.version}/jars/json4s-ext_2.11.jar

	==== central: tried

	  https://repo1.maven.org/maven2/org/json4s/json4s-ext_2.11/${json4s.version}/json4s-ext_2.11-${json4s.version}.pom

	  -- artifact org.json4s#json4s-ext_2.11;${json4s.version}!json4s-ext_2.11.jar:

	  https://repo1.maven.org/maven2/org/json4s/json4s-ext_2.11/${json4s.version}/json4s-ext_2.11-${json4s.version}.jar

	==== spark-packages: tried

	  https://dl.bintray.com/spark-packages/maven/org/json4s/json4s-ext_2.11/${json4s.version}/json4s-ext_2.11-${json4s.version}.pom

	  -- artifact org.json4s#json4s-ext_2.11;${json4s.version}!json4s-ext_2.11.jar:

	  https://dl.bintray.com/spark-packages/maven/org/json4s/json4s-ext_2.11/${json4s.version}/json4s-ext_2.11-${json4s.version}.jar

		::::::::::::::::::::::::::::::::::::::::::::::

		::          UNRESOLVED DEPENDENCIES         ::

		::::::::::::::::::::::::::::::::::::::::::::::

		:: org.json4s#json4s-ext_2.11;${json4s.version}: not found

		::::::::::::::::::::::::::::::::::::::::::::::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.json4s#json4s-ext_2.11;${json4s.version}: not found]
	at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1302)
	at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:304)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[wajda@alex-xps spline]$ 

Capture user identity?

Background [Optional]

One aspect of 'clarity', auditing & lineage is understanding 'who' ran the query or transformation. Many compliance regimes require knowing who accessed what and when. Further, what Spline does is capture the 'how' of the transformation (or query).

Question

Does Spline capture any form of user identity? Is it feasible to capture the identity, and to do so as transparently as possible (e.g. in flight-recorder mode), on a multi-tenant cluster (Hadoop or Databricks)? Thank you!
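
For context, a hedged sketch of where user identity is already visible to the job itself on a Hadoop-based cluster; whether the agent can capture it transparently in flight-recorder mode is the open question here:

import org.apache.hadoop.security.UserGroupInformation

// identity as resolved by Hadoop security (e.g. the Kerberos principal's short name)
val hadoopUser = UserGroupInformation.getCurrentUser.getShortUserName

// identity of the local JVM process owner
val osUser = System.getProperty("user.name")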

Spline Spark Agent dependency setup

Installed PySpark 2.4 on Windows 10 and executed the Spline Spark agent dependency, which is throwing an error.
Copied spark-agent-bundle-2.4-0.4.1.jar into the C:\SPARK\spark-2.4.4-bin-hadoop2.7\jars folder.
Executed pyspark --packages "za.co.absa.spline.agent.spark:agent-core:0.4.1"
which throws the error below:

C:\Users\HA2050>pyspark --packages "za.co.absa.spline.agent.spark:agent-core:0.4.1"
Python 3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32

Warning:
This Python interpreter is in a conda environment, but the environment has
not been activated.  Libraries may fail to load.  To activate this environment
please see https://conda.io/activation

Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: C:\Users\HA2050\.ivy2\cache
The jars for the packages stored in: C:\Users\HA2050\.ivy2\jars
:: loading settings :: url = jar:file:/C:/SPARK/spark-2.4.4-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
za.co.absa.spline.agent.spark#agent-core added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-58947001-02f7-41eb-b23b-ab38058cc447;1.0
        confs: [default]
:: resolution report :: resolve 1725ms :: artifacts dl 0ms
        :: modules in use:
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
        ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
                module not found: za.co.absa.spline.agent.spark#agent-core;0.4.1

        ==== local-m2-cache: tried

          file:/C:/Users/HA2050/.m2/repository/za/co/absa/spline/agent/spark/agent-core/0.4.1/agent-core-0.4.1.pom

          -- artifact za.co.absa.spline.agent.spark#agent-core;0.4.1!agent-core.jar:

          file:/C:/Users/HA2050/.m2/repository/za/co/absa/spline/agent/spark/agent-core/0.4.1/agent-core-0.4.1.jar

        ==== local-ivy-cache: tried

          C:\Users\HA2050\.ivy2\local\za.co.absa.spline.agent.spark\agent-core\0.4.1\ivys\ivy.xml

          -- artifact za.co.absa.spline.agent.spark#agent-core;0.4.1!agent-core.jar:

          C:\Users\HA2050\.ivy2\local\za.co.absa.spline.agent.spark\agent-core\0.4.1\jars\agent-core.jar

        ==== central: tried

          https://repo1.maven.org/maven2/za/co/absa/spline/agent/spark/agent-core/0.4.1/agent-core-0.4.1.pom

          -- artifact za.co.absa.spline.agent.spark#agent-core;0.4.1!agent-core.jar:

          https://repo1.maven.org/maven2/za/co/absa/spline/agent/spark/agent-core/0.4.1/agent-core-0.4.1.jar

        ==== spark-packages: tried

          https://dl.bintray.com/spark-packages/maven/za/co/absa/spline/agent/spark/agent-core/0.4.1/agent-core-0.4.1.pom

          -- artifact za.co.absa.spline.agent.spark#agent-core;0.4.1!agent-core.jar:

          https://dl.bintray.com/spark-packages/maven/za/co/absa/spline/agent/spark/agent-core/0.4.1/agent-core-0.4.1.jar

                ::::::::::::::::::::::::::::::::::::::::::::::

                ::          UNRESOLVED DEPENDENCIES         ::

                ::::::::::::::::::::::::::::::::::::::::::::::

                :: za.co.absa.spline.agent.spark#agent-core;0.4.1: not found

                ::::::::::::::::::::::::::::::::::::::::::::::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: za.co.absa.spline.agent.spark#agent-core;0.4.1: not found]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1302)
        at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:304)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
  File "C:\SPARK\spark-2.4.4-bin-hadoop2.7\bin\..\python\pyspark\shell.py", line 38, in <module>
    SparkContext._ensure_initialized()
  File "C:\SPARK\spark-2.4.4-bin-hadoop2.7\python\pyspark\context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "C:\SPARK\spark-2.4.4-bin-hadoop2.7\python\pyspark\java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "C:\SPARK\spark-2.4.4-bin-hadoop2.7\python\pyspark\java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

Should Spline capture Hive Warehouse Connector activity

Background

The old HiveContext etc. classes have been deprecated since Spark 2.0, in favour of the new Hive Warehouse Connector, which seems to be necessary to interact with LLAP.

Question

Should Spline be capturing lineage operations performed through the HWC? At the moment it doesn't seem to.

I put together a minimal example: a PySpark job that just reads a CSV file into a dataframe and then saves it both as a JSON file and into a Hive table. The Spline lineage only shows the data being written to the JSON file.

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession
import pyspark.sql.types as T


spark = SparkSession.builder.appName("SAC-Test").enableHiveSupport().getOrCreate()

sc = spark.sparkContext

sc._jvm.za.co.absa.spline.harvester \
    .SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)


hive = HiveWarehouseSession.session(spark).build()

hive.setDatabase("sactest")

schema = T.StructType()
schema.add(T.StructField("col1", T.IntegerType(), False))
schema.add(T.StructField("col2", T.StringType(), False))
schema.add(T.StructField("col3", T.DateType(), False))

df = spark.read.csv("/tmp/test_data.csv", schema, header=True, quote="'")

df.write.format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "sac_test").save()

df.write.json("/tmp/spline_test.json", "overwrite")

which is launched by

hdfs dfs -put -f test_data.csv /tmp

HWC_JAR=local:/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar 
HWC_PY=local:/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip
SPLINE_JAR=/home/mario/spline/spark-agent-bundle-2.3-0.4.1.jar
SPLINE_URL=http://$HOSTNAME:8079/spline-rest-gateway/producer

spark-submit --master yarn \
             --deploy-mode client \
             --jars $HWC_JAR,$SPLINE_JAR \
             --driver-java-options -Dspline.producer.url=$SPLINE_URL \
             --py-files $HWC_PY \
             load_test_data.py

The lineage it captures is

(screenshot: the captured lineage, showing only the JSON write)

java.lang.NumberFormatException: Not a version: 9

Originally posted by @Qurashetufail in AbsaOSS/spline#588 (comment)

I tried the code with Oracle today. The code is now throwing a new error.

   [java] Exception in thread "main" java.lang.ExceptionInInitializerError
     [java]     at za.co.absa.spline.harvester.json.ShortTypeHintForSpline03ModelSupport$class.formats(ShortTypeHintForSpline03ModelSupport.scala:28)
     [java]     at za.co.absa.spline.harvester.json.HarvesterJsonSerDe$.za$co$absa$spline$common$json$format$NoEmptyValuesSupport$$super$formats(HarvesterJsonSerDe.scala:22)
     [java]     at za.co.absa.spline.common.json.format.NoEmptyValuesSupport$class.formats(NoEmptyValuesSupport.scala:24)
     [java]     at za.co.absa.spline.harvester.json.HarvesterJsonSerDe$.za$co$absa$spline$common$json$format$JavaTypesSupport$$super$formats(HarvesterJsonSerDe.scala:22)
     [java]     at za.co.absa.spline.common.json.format.JavaTypesSupport$class.formats(JavaTypesSupport.scala:23)
     [java]     at za.co.absa.spline.harvester.json.HarvesterJsonSerDe$.formats(HarvesterJsonSerDe.scala:22)
     [java]     at za.co.absa.spline.common.json.AbstractJsonSerDe$class.$init$(AbstractJsonSerDe.scala:30)
     [java]     at za.co.absa.spline.harvester.json.HarvesterJsonSerDe$.<init>(HarvesterJsonSerDe.scala:22)
     [java]     at za.co.absa.spline.harvester.json.HarvesterJsonSerDe$.<clinit>(HarvesterJsonSerDe.scala)
     [java]     at za.co.absa.spline.harvester.dispatcher.HttpLineageDispatcher.send(HttpLineageDispatcher.scala:41)
     [java]     at za.co.absa.spline.harvester.QueryExecutionEventHandler$$anonfun$onSuccess$1.apply(QueryExecutionEventHandler.scala:45)
     [java]     at za.co.absa.spline.harvester.QueryExecutionEventHandler$$anonfun$onSuccess$1.apply(QueryExecutionEventHandler.scala:43)
     [java]     at scala.Option.foreach(Option.scala:257)
     [java]     at za.co.absa.spline.harvester.QueryExecutionEventHandler.onSuccess(QueryExecutionEventHandler.scala:43)
     [java]     at za.co.absa.spline.harvester.listener.SplineQueryExecutionListener$$anonfun$onSuccess$1.apply(SplineQueryExecutionListener.scala:37)
     [java]     at za.co.absa.spline.harvester.listener.SplineQueryExecutionListener$$anonfun$onSuccess$1.apply(SplineQueryExecutionListener.scala:37)
     [java]     at scala.Option.foreach(Option.scala:257)
     [java]     at za.co.absa.spline.harvester.listener.SplineQueryExecutionListener.onSuccess(SplineQueryExecutionListener.scala:37)
     [java]     at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:114)
     [java]     at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:113)
     [java]     at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:135)
     [java]     at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:133)
     [java]     at scala.collection.immutable.List.foreach(List.scala:392)
     [java]     at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
     [java]     at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
     [java]     at org.apache.spark.sql.util.ExecutionListenerManager.org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling(QueryExecutionListener.scala:133)
     [java]     at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply$mcV$sp(QueryExecutionListener.scala:113)
     [java]     at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:113)
     [java]     at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:113)
     [java]     at org.apache.spark.sql.util.ExecutionListenerManager.readLock(QueryExecutionListener.scala:146)
     [java]     at org.apache.spark.sql.util.ExecutionListenerManager.onSuccess(QueryExecutionListener.scala:112)
     [java]     at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:611)
     [java]     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
     [java]     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
     [java]     at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:508)
     [java]     at za.co.absa.spline.example.batch.CumminsJob1DB$.delayedEndpoint$za$co$absa$spline$example$batch$CumminsJob1DB$1(CumminsJob1DB.scala:250)
     [java]     at za.co.absa.spline.example.batch.CumminsJob1DB$delayedInit$body.apply(CumminsJob1DB.scala:19)
     [java]     at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
     [java]     at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
     [java]     at scala.App$$anonfun$main$1.apply(App.scala:76)
     [java]     at scala.App$$anonfun$main$1.apply(App.scala:76)
     [java]     at scala.collection.immutable.List.foreach(List.scala:392)
     [java]     at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
     [java]     at scala.App$class.main(App.scala:76)
     [java]     at za.co.absa.spline.example.SparkApp.main(SparkApp.scala:27)
     [java]     at za.co.absa.spline.example.batch.CumminsJob1DB.main(CumminsJob1DB.scala)
     [java] Caused by: java.lang.NumberFormatException: Not a version: 9
     [java]     at scala.util.PropertiesTrait$class.parts$1(Properties.scala:184)
     [java]     at scala.util.PropertiesTrait$class.isJavaAtLeast(Properties.scala:187)
     [java]     at scala.util.Properties$.isJavaAtLeast(Properties.scala:17)
     [java]     at scala.tools.util.PathResolverBase$Calculated$.javaBootClasspath(PathResolver.scala:276)
     [java]     at scala.tools.util.PathResolverBase$Calculated$.basis(PathResolver.scala:283)
     [java]     at scala.tools.util.PathResolverBase$Calculated$.containers$lzycompute(PathResolver.scala:293)
     [java]     at scala.tools.util.PathResolverBase$Calculated$.containers(PathResolver.scala:293)
     [java]     at scala.tools.util.PathResolverBase.containers(PathResolver.scala:309)
     [java]     at scala.tools.util.PathResolver.computeResult(PathResolver.scala:341)
     [java]     at scala.tools.util.PathResolver.computeResult(PathResolver.scala:332)
     [java]     at scala.tools.util.PathResolverBase.result(PathResolver.scala:314)
     [java]     at scala.tools.nsc.backend.JavaPlatform$class.classPath(JavaPlatform.scala:28)
     [java]     at scala.tools.nsc.Global$GlobalPlatform.classPath(Global.scala:115)
     [java]     at scala.tools.nsc.Global.scala$tools$nsc$Global$$recursiveClassPath(Global.scala:131)
     [java]     at scala.tools.nsc.Global.classPath(Global.scala:128)
     [java]     at scala.tools.nsc.backend.jvm.BTypesFromSymbols.<init>(BTypesFromSymbols.scala:39)
     [java]     at scala.tools.nsc.backend.jvm.BCodeIdiomatic.<init>(BCodeIdiomatic.scala:24)
     [java]     at scala.tools.nsc.backend.jvm.BCodeHelpers.<init>(BCodeHelpers.scala:23)
     [java]     at scala.tools.nsc.backend.jvm.BCodeSkelBuilder.<init>(BCodeSkelBuilder.scala:25)
     [java]     at scala.tools.nsc.backend.jvm.BCodeBodyBuilder.<init>(BCodeBodyBuilder.scala:25)
     [java]     at scala.tools.nsc.backend.jvm.BCodeSyncAndTry.<init>(BCodeSyncAndTry.scala:21)
     [java]     at scala.tools.nsc.backend.jvm.GenBCode.<init>(GenBCode.scala:47)
     [java]     at scala.tools.nsc.Global$genBCode$.<init>(Global.scala:675)
     [java]     at scala.tools.nsc.Global.genBCode$lzycompute(Global.scala:671)
     [java]     at scala.tools.nsc.Global.genBCode(Global.scala:671)
     [java]     at scala.tools.nsc.backend.jvm.GenASM$JPlainBuilder.serialVUID(GenASM.scala:1240)
     [java]     at scala.tools.nsc.backend.jvm.GenASM$JPlainBuilder.genClass(GenASM.scala:1329)
     [java]     at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.emitFor$1(GenASM.scala:198)
     [java]     at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.run(GenASM.scala:204)
     [java]     at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1528)
     [java]     at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1513)
     [java]     at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.wrapInPackageAndCompile(ToolBoxFactory.scala:197)
     [java]     at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.compile(ToolBoxFactory.scala:252)
     [java]     at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$$anonfun$compile$2.apply(ToolBoxFactory.scala:429)
     [java]     at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$$anonfun$compile$2.apply(ToolBoxFactory.scala:422)
     [java]     at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$withCompilerApi$.liftedTree2$1(ToolBoxFactory.scala:355)
     [java]     at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$withCompilerApi$.apply(ToolBoxFactory.scala:355)
     [java]     at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl.compile(ToolBoxFactory.scala:422)
     [java]     at za.co.absa.spline.common.ReflectionUtils$.compile(ReflectionUtils.scala:45)
     [java]     at za.co.absa.spline.harvester.json.ShortTypeHintForSpline03ModelSupport$.<init>(ShortTypeHintForSpline03ModelSupport.scala:39)
     [java]     at za.co.absa.spline.harvester.json.ShortTypeHintForSpline03ModelSupport$.<clinit>(ShortTypeHintForSpline03ModelSupport.scala)
     [java]     ... 46 more
     [java] 20/02/17 23:01:58 INFO SparkContext: Invoking stop() from shutdown hook
     [java] 20/02/17 23:01:58 INFO SparkUI: Stopped Spark web UI at http://169.254.103.167:4040
     [java] 20/02/17 23:01:58 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
     [java] 20/02/17 23:01:59 INFO MemoryStore: MemoryStore cleared
     [java] 20/02/17 23:01:59 INFO BlockManager: BlockManager stopped
     [java] 20/02/17 23:01:59 INFO BlockManagerMaster: BlockManagerMaster stopped
     [java] 20/02/17 23:01:59 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
     [java] 20/02/17 23:01:59 INFO SparkContext: Successfully stopped SparkContext
     [java] 20/02/17 23:01:59 INFO ShutdownHookManager: Shutdown hook called
     [java] 20/02/17 23:01:59 INFO ShutdownHookManager: Deleting directory 

Is this a compatibility issue between Scala and Spark, or is it a Spline error? Please let me know if you need any further details from my side.
The Java version I am using is shown below:

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

Thanks,
Tufail

Spark AppId missing on the new lineage schema

Hi team, I was doing some analysis on the lineage file produced by the spline-spark-agent 0.5.0, and I noticed that the Spark appID attribute is not part of it as it was in Spline 0.3. Are there any plans to reintroduce this attribute in the future?

User defined operation grouping

The developer of a Spark job should be able to demarcate logical groups of transformations that would be visualized in the Spline UI as named sub-graphs or clusters, collapsed by default, with the possibility to expand and drill down.

Scala 2.12 support

From Spark 2.4.1 the releases have moved to Scala 2.12. Do you have a release for Scala 2.12?

Window functions support in attribute lineage

Window functions are not 100% supported in Spline 0.5.x. The current data model has some issues that make that support problematic. The new data model in 0.6 should address this, and attribute lineage for window functions should then be supported.

This is related to #28

API for attaching user metadata to the execution plan and event

Feature

Allow passing custom key-value pairs from a Spark job, to be sent along with the lineage data either in the executionPlan or the executionEvent. This would be a powerful feature, allowing users to attach their own metadata to the lineage. I am not sure if this feature already exists, as I can see a property called extraInfo: Map[String, Any] = Map.empty in ExecutionPlan which looks like it may be intended for this purpose.

Background

The current immediate requirement is to have JobId and RunId passed as part of lineage data.

JobId: essentially just a unique name for the notebook that runs as a job. On Azure Databricks the applicationName and applicationId are autogenerated; these are cluster-specific and not "job"-specific.

RunId: unique for a single run of a job. If there are two write operations in the job, then two executionPlans are generated. There is no way I can see to tell whether the two executionPlans come from the same job running once (meaning there are two writes) or from the job running twice (a single write each time).
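
As a purely conceptual sketch of the requested capability (the spline.extra.* keys below are hypothetical, not an existing agent configuration), a job could stash its JobId and RunId where such an API might pick them up and copy them into extraInfo:

// hypothetical keys, shown only to illustrate the requested feature
spark.conf.set("spline.extra.jobId", "customer-ingest-notebook")
spark.conf.set("spline.extra.runId", java.util.UUID.randomUUID().toString)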

Lineage Calculation not triggered when using Spark Connector for SQL Server BulkCopy API

When using the Spark SQL Server connector (https://github.com/Azure/azure-sqldb-spark), there are two ways to produce the final output:

One way the Spark SQL Server connector API works is by using the usual df.write-style operation, which triggers lineage via the "save" action (see the first example below).

However, when calling df.bulkCopyToSqlDB (also shown below), no lineage is triggered, because the bulkCopyToSqlDB method produces no action at all and doesn't go through the usual .jdbc chain. Is there a way to force Spline to produce lineage for the bulkCopyToSqlDB "terminal" operation?

// connector imports (per the azure-sqldb-spark documentation)
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
import org.apache.spark.sql.SaveMode

val df = ... // read from wherever

//Using the Spark Sql Server connector
val writeConfig = Config(Map(
  "url"               -> "databaseservername",
  "databaseName"      -> "catalogname",
  "dbTable"           -> "tablename",
  "user"              -> "user",
  "password"          -> "password"
))

//This is the Spark Sql Server connector API using the usual write technique.
df.write.mode(SaveMode.Append).sqlDB(writeConfig) //Lineage gets triggered

//Spark Sql Connector also allows for a custom bulk copy that doesn't go through .jdbc
df.bulkCopyToSqlDB(writeConfig) //No action and as a result no lineage

Thanks,
Harish.

Capture failed executions

If a job fails, we should still send an execution plan and the error to the server.
Failed events should be filtered out in the query when building a lineage across jobs.

Upgrade to Producer API v1.1

Add Content-Type header:
Content-Type: application/vnd.absa.spline.producer.v1.1+json to all POST requests

Replace producer.api.json with the fresh one

Update the Harvester to produce lineage data in the new format, including attribute dependencies, schemas and expressions.
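
A minimal sketch of the header requirement above, using the scalaj-http client that already appears among the agent's dependencies (the endpoint path and payload are illustrative):

import scalaj.http.Http

val planJson = """{ "id": "..." }"""  // placeholder for a serialized execution plan
val response = Http("http://localhost:8080/producer/execution-plans")
  .postData(planJson)
  .header("Content-Type", "application/vnd.absa.spline.producer.v1.1+json")
  .asString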

Spline fails to report lineage when plan contains Scala 2.12 lambda

Checked on spline agent 0.5.1 with Scala 2.12, Spark 2.4.5

The code uses Scala DataFrame/Dataset transformations including lambda expressions, and it fails with the following error:

WARN ExecutionListenerManager: Error executing query execution listener
java.lang.AssertionError: assertion failed: no symbol could be loaded from class a.b.c.MyClass$$Lambda$4843/566774261 in package c with name MyClass$$Lambda$4843/566774261 and classloader sun.misc.Launcher$AppClassLoader@18b4aac2
	at scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:184)
	at scala.reflect.runtime.JavaMirrors$JavaMirror.classToScala1(JavaMirrors.scala:1061)
	at scala.reflect.runtime.JavaMirrors$JavaMirror.$anonfun$classToScala$1(JavaMirrors.scala:1019)
	at scala.reflect.runtime.JavaMirrors$JavaMirror.$anonfun$toScala$1(JavaMirrors.scala:130)
	at scala.reflect.runtime.TwoWayCaches$TwoWayCache.$anonfun$toScala$1(TwoWayCaches.scala:50)
	at scala.reflect.runtime.TwoWayCaches$TwoWayCache.toScala(TwoWayCaches.scala:46)
	at scala.reflect.runtime.JavaMirrors$JavaMirror.toScala(JavaMirrors.scala:128)
	at scala.reflect.runtime.JavaMirrors$JavaMirror.classToScala(JavaMirrors.scala:1019)
	at scala.reflect.runtime.JavaMirrors$JavaMirror.classSymbol(JavaMirrors.scala:231)
	at scala.reflect.runtime.JavaMirrors$JavaMirror.classSymbol(JavaMirrors.scala:68)
	at za.co.absa.commons.reflect.ReflectionUtils$ModuleClassSymbolExtractor$.unapply(ReflectionUtils.scala:33)
	at za.co.absa.spline.harvester.converter.ValueDecomposer$$anonfun$$nestedInanonfun$handler$2$1.applyOrElse(ValueDecomposer.scala:52)
	at za.co.absa.spline.harvester.converter.ValueDecomposer$$anonfun$$nestedInanonfun$handler$2$1.applyOrElse(ValueDecomposer.scala:44)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
	at za.co.absa.spline.harvester.converter.OperationParamsConverter$$anonfun$$nestedInanonfun$renderer$1$1.applyOrElse(OperationParamsConverter.scala:35)
	at za.co.absa.spline.harvester.converter.OperationParamsConverter$$anonfun$$nestedInanonfun$renderer$1$1.applyOrElse(OperationParamsConverter.scala:35)
	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:172)
	at za.co.absa.spline.harvester.converter.ValueDecomposer.$anonfun$recursion$1(ValueDecomposer.scala:39)
	at za.co.absa.spline.harvester.converter.OperationParamsConverter.$anonfun$convert$6(OperationParamsConverter.scala:59)
	at scala.collection.TraversableLike$WithFilter.$anonfun$map$2(TraversableLike.scala:827)
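
For reference, a minimal sketch of the kind of lambda-carrying transformation described above (the output path is illustrative); on Scala 2.12 the lambda is compiled to a synthetic ...$$Lambda$... class, which is what the value decomposer in the trace fails to resolve:

import spark.implicits._

val ds = Seq(1, 2, 3).toDS()
ds.map(_ * 2).write.mode("overwrite").parquet("/tmp/lambda_test")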

Upgrade to Producer API ver.1.1

The agent should support both Producer API v1 and the newer v1.1.

Add Content-Type header:
Content-Type: application/vnd.absa.spline.producer.v1.1+json to all POST requests

Replace producer.api.json with the fresh one

Update the Harvester to produce lineage data in the new format, including attribute dependencies, schemas and expressions.

Add a fallback option in case the Producer is down and the write doesn't go through

If, for reasons unknown, Spline is not able to log the lineage to ArangoDB, I would love to have a fallback option of writing the data to local disk or HDFS, which I would then be able to insert into ArangoDB manually (see the conceptual sketch after the list below).

Why?

I might be allowed to run the job only once, or once every X hours/days/weeks, which means my data about that job run is lost.

Additional requests:

  • I would not like to lose the default configuration, which works almost flawlessly out of the box
  • the path to the fallback file should be configurable by the end user
  • ideally it would be a pre-prepared ArangoDB script that I could just run
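
A purely conceptual sketch of the requested behaviour (hypothetical helper, not an existing Spline class): try the producer first and, on failure, append the payload to a configurable local file for later manual import.

import java.nio.file.{Files, Paths, StandardOpenOption}
import scala.util.control.NonFatal

// hypothetical helper illustrating the requested fallback; not part of the agent
def sendWithFallback(payloadJson: String, post: String => Unit, fallbackPath: String): Unit =
  try post(payloadJson)
  catch {
    case NonFatal(_) =>
      Files.write(Paths.get(fallbackPath), (payloadJson + "\n").getBytes("UTF-8"),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND)
  }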

Add template resource files to runnable modules

To make it clear what can be configured for the runnable modules of the project, it might be useful to include template resource files, e.g. application.conf.template or application.properties.template.
These files would list what can be set in the module - keys with good example values and explanatory comments where appropriate.

Support for Databricks Delta format

Background

The Databricks Delta format enables ACID transactions on data. Currently the "delta" format is not supported by Spline.

Feature

If support for the "delta" file format could be added, it would help a lot.

Ability to replace pattern in application name for lineage

In our setup we have jobs with generated names - usually dates, for example my-app-2020-05-01

For lineage purposes it would be better to report it as my-app-YYYY-MM-DD

@wajda, have you thought about such a feature? Do you think it would be useful for others? I could try to add an implementation for it, but let me know first.
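
For illustration, a minimal sketch of the kind of normalization proposed (the regex only covers the date-suffix case from the example above):

// normalize a generated application name to a stable name for lineage purposes
val appName = "my-app-2020-05-01"
val normalized = appName.replaceAll("""\d{4}-\d{2}-\d{2}""", "YYYY-MM-DD")
// normalized == "my-app-YYYY-MM-DD"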

Security issue: JDBC URI contains plaintext password & username

Proposal

The Spline Agent should apply URI filtering for, at the very least, passwords. Captured lineage data are sent insecurely over plain HTTP to the Producer, which can be located outside the secured network, so anyone can listen in and capture the password.

I believe this issue should be addressed on both the Agent and the UI side. The best solution is not to send the credentials at all, or to mask them securely (e.g. with asterisks).

Example

Current state:
DataSource URI: jdbc:sqlserver://sample.database.windows.net:1433;database=sample;user=sample;password=password;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30:

Future state:
DataSource URI: jdbc:sqlserver://sample.database.windows.net:1433;database=sample;user=;password=;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30:
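
For illustration, a minimal sketch of the kind of masking proposed, applied to a semicolon-delimited JDBC URI (the regex is a simplification):

val uri = "jdbc:sqlserver://sample.database.windows.net:1433;database=sample;user=sample;password=password;encrypt=true"
val masked = uri.replaceAll("(?i)(user|password)=[^;]*", "$1=*****")
// masked: jdbc:sqlserver://sample.database.windows.net:1433;database=sample;user=*****;password=*****;encrypt=true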

Version

Applies to version 0.5.1 of the Agent and the UI.
