
hortonworks-spark / spark-atlas-connector


A Spark Atlas connector to track data lineage in Apache Atlas

License: Apache License 2.0

Scala 98.88% Shell 1.12%
apache-spark apache-atlas

spark-atlas-connector's Introduction


Spark Atlas Connector

A connector to track Spark SQL/DataFrame transformations and push metadata changes to Apache Atlas.

This connector supports tracking:

  1. SQL DDLs like "CREATE/DROP/ALTER DATABASE", "CREATE/DROP/ALTER TABLE".
  2. SQL DMLs like "CREATE TABLE tbl AS SELECT", "INSERT INTO...", "LOAD DATA [LOCAL] INPATH", "INSERT OVERWRITE [LOCAL] DIRECTORY" and so on.
  3. DataFrame transformations which have inputs and outputs.
  4. Machine learning pipelines.

This connector also correlates with other systems like Hive and HDFS to track the life cycle of data in Atlas.
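
For illustration, operations like the following are the kind of SQL/DataFrame actions SAC picks up (table names and paths here are only examples, not from the project docs):

scala> spark.sql("CREATE TABLE source_tbl (id INT, name STRING)")
scala> spark.sql("CREATE TABLE target_tbl AS SELECT id, name FROM source_tbl")
scala> spark.read.csv("/tmp/input.csv").write.saveAsTable("csv_copy_tbl")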

How To Build

To use this connector, you will need a recent version of Spark (2.3 or later), because most of the features it relies on only exist in Spark 2.3.0+.

To build this project, please execute:

mvn package -DskipTests

mvn package will assemble all the required dependencies and package them into an uber jar.

Create Atlas models

NOTE: the steps below are only necessary for versions prior to Apache Atlas 2.1.0; Apache Atlas 2.1.0 will include the models.

SAC leverages the official Spark models in Apache Atlas, but as of Apache Atlas 2.0.0 the model file is not yet included. Until Apache Atlas publishes a release that includes the model, SAC ships the JSON model file so it can easily be applied to the Atlas server.

Please copy 1100-spark_model.json to the <ATLAS_HOME>/models/1000-Hadoop directory and restart the Atlas server for the change to take effect.

How To Use

To use it, you will need to make this jar accessible to the Spark Driver and configure the following:

spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker
spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker
spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker

For example, when you're using spark-shell, you can start Spark like this:

bin/spark-shell --jars spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar \
--conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
--conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
--conf spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker

Also make sure the Atlas configuration file atlas-application.properties is on the Driver's classpath, for example by putting it into <SPARK_HOME>/conf.

If you're using cluster mode, please also ship this conf file to the remote Driver using --files atlas-application.properties.

Spark Atlas Connector supports two types of Atlas clients, "kafka" and "rest". You can configure which type of client to use by setting atlas.client.type to either kafka or rest. The default value is kafka, which provides a stable and secure way of publishing changes. Atlas has an embedded Kafka instance so you can try this out in a test environment, but you are encouraged to use an external Kafka cluster in production. If you don't have a Kafka cluster in production, you may want to set the client to rest.
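
For example, switching to the REST client is typically just a matter of setting something like the following in atlas-application.properties (the address below is a placeholder, not a value shipped with this project):

atlas.client.type=rest
atlas.rest.address=http://atlas-host:21000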

To Use it in Secure Environment

Atlas currently secures only the Kafka client API, so when you're using this connector in a secure environment, please switch to the Kafka client API by configuring atlas.client.type=kafka in atlas-application.properties.

Please also add the configurations below to your atlas-application.properties.

atlas.jaas.KafkaClient.loginModuleControlFlag=required
atlas.jaas.KafkaClient.loginModuleName=com.sun.security.auth.module.Krb5LoginModule
atlas.jaas.KafkaClient.option.keyTab=./a.keytab
atlas.jaas.KafkaClient.option.principal=<user>@<REALM>
atlas.jaas.KafkaClient.option.serviceName=kafka
atlas.jaas.KafkaClient.option.storeKey=true
atlas.jaas.KafkaClient.option.useKeyTab=true

Please make sure the keytab (a.keytab) is accessible from the Spark Driver.

When running on a cluster, you will also need to distribute this keytab. Below is an example command for cluster mode.

 ./bin/spark-submit --class <class_name> \
  --jars spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar \
  --conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
  --conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
  --conf spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker \
  --master yarn-cluster \
  --principal <user>@<REALM> \
  --keytab ./spark.headless.keytab \
  --files atlas-application.properties,a.keytab \
  <application-jar>

When the Spark application starts, SAC will transparently track the execution plans of submitted SQL/DataFrame transformations, parse the plans and create the related entities in Atlas.

Spark models vs Hive models

SAC classifies table-related entities with two different kinds of models: Spark and Hive.

We decided to skip sending create events for Hive tables managed by HMS to avoid duplicating the events coming from the Atlas hook for Hive. For Hive entities, Atlas relies on the Atlas hook for Hive as the source of truth.

SAC assumes table entities are being created on the Hive side and simply refers to these entities via object id if both conditions below are true:

  • SparkSession.builder.enableHiveSupport is set
  • The value of "hive.metastore.uris" is set to a non-empty value

In all other cases, SAC will create table-related entities as Spark models.
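
As a rough sketch (the metastore URI below is a placeholder), a session configured like this satisfies both conditions above, so SAC would refer to existing Hive table entities instead of creating Spark ones:

import org.apache.spark.sql.SparkSession

// Hypothetical setup: Hive support enabled and a non-empty hive.metastore.uris,
// so table entities are assumed to be managed on the Hive side.
val spark = SparkSession.builder()
  .appName("sac-hive-model-example")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()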

One exception is HWC: for an HWC source and/or sink, SAC will not create table-related entities and will always refer to Hive table entities via object id.

Known Limitations (Design decision)

SAC only supports the SQL/DataFrame API (in other words, SAC doesn't support RDDs).

SAC relies on the query listener to retrieve the query and examine its impacts.

All "inputs" and "outputs" in multiple queries are accumulated into single "spark_process" entity when there're multple queries running in single Spark session.

"spark_process" maps to an "applicationId" in Spark. This is helpful as it allows admin to track all changes that occurred as part of an application. But it also causes lineage/relationship graph in "spark_process" to be complicated and less meaningful.

We've filed #261 to investigate changing the unit of the "spark_process" entity to a single query. This doesn't mean we will change it soon; it will be addressed only if we see clear benefits in changing it.
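
For reference, the applicationId mentioned above is the value Spark itself exposes on the context:

scala> spark.sparkContext.applicationId   // the id a "spark_process" entity corresponds to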

Only part of the inputs are tracked in a streaming query.

This is a deliberate design trade-off: the Kafka source supports subscribing with a "pattern", and SAC cannot enumerate all matching existing topics, or even all possible topics (and even if it could, it wouldn't make sense to).

"executed plan" provides actual topics which each (micro) batch reads and processes, and as a result, only inputs which participate in (micro) batch are included as "inputs" in "spark_process" entity.

If your query runs long enough to ingest data from all topics, all of them will eventually appear in the "spark_process" entity.
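
As a minimal sketch (broker address and topic pattern are placeholders), a pattern-subscribing streaming query looks like this; only the topics actually read by a given (micro) batch show up as inputs:

// Hypothetical streaming query subscribing by pattern.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-host:9092")
  .option("subscribePattern", "events-.*")
  .load()

stream.writeStream
  .format("console")
  .start()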

SAC doesn't support tracking changes on columns (Spark models).

We are investigating how to add support for column entities. The main issue we face is how to keep changes consistent when multiple Spark applications modify the same table/column.

This doesn't apply to Hive models, where the central remote HMS takes care of DDLs and the Hive Atlas hook takes care of updates.

SAC doesn't track dropping tables (Spark models).

"drop table" event from Spark only provides db and table name, which is NOT sufficient to create qualifierName - especially we separate two types of tables - spark and hive.

SAC depends on reading the Spark catalog to get table information, but by the time SAC notices the drop, Spark has already dropped the table, so that approach does not work.

We are investigating how to change Spark to provide the necessary information via the listener, perhaps as a snapshot of the information taken before the deletion happens.

ML entities/events may not be tracked properly.

We are concentrating on making the basic features stable; ML features are not part of that scope for now. We will revisit them once we are confident that most issues with the basic features have been resolved.

For reference, there are two patches for tracking ML events: a custom patch which can be applied to Spark 2.3/2.4, and a patch which has been adopted into Apache Spark and will be available in Spark 3.0. SAC currently follows the custom patch, which is effectively deprecated by the new one. We may need to revisit ML features again with Spark 3.0.

License

Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0.

spark-atlas-connector's People

Contributors

arunmahadevan, bolkedebruin, bongani, dongjoon-hyun, heartsavior, hyukjinkwon, jerryshao, merlintang, walkertr, weiqingy, yanboliang


spark-atlas-connector's Issues

Failed to get Kafka topic

18/04/24 23:01:22 WARN SparkExecutionPlanTracker: Caught exception during parsing the query: java.util.NoSuchElementException: None.get

Failed to initialize Atlas client

While running my basic word-count Spark job with spark-atlas-connector, I am getting the following error:

18/07/25 11:57:52 ERROR ClientResponse: A message body reader for Java class org.apache.atlas.model.typedef.AtlasTypesDef, and Java type class org.apache.atlas.model.typedef.AtlasTypesDef, and MIME media type application/json;charset=UTF-8 was not found    
18/07/25 11:57:52 ERROR ClientResponse: The registered message body readers compatible with the MIME media type are:    
*/* ->    
  com.sun.jersey.core.impl.provider.entity.FormProvider    
  com.sun.jersey.core.impl.provider.entity.StringProvider    
  com.sun.jersey.core.impl.provider.entity.ByteArrayProvider    
  com.sun.jersey.core.impl.provider.entity.FileProvider    
  com.sun.jersey.core.impl.provider.entity.InputStreamProvider    
  com.sun.jersey.core.impl.provider.entity.DataSourceProvider    
  com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider$General    
  com.sun.jersey.core.impl.provider.entity.ReaderProvider    
  com.sun.jersey.core.impl.provider.entity.DocumentProvider    
  com.sun.jersey.core.impl.provider.entity.SourceProvider$StreamSourceReader    
  com.sun.jersey.core.impl.provider.entity.SourceProvider$SAXSourceReader    
  com.sun.jersey.core.impl.provider.entity.SourceProvider$DOMSourceReader    
  com.sun.jersey.core.impl.provider.entity.XMLRootElementProvider$General    
  com.sun.jersey.core.impl.provider.entity.XMLListElementProvider$General    
  com.sun.jersey.core.impl.provider.entity.XMLRootObjectProvider$General    
  com.sun.jersey.core.impl.provider.entity.EntityHolderReader    

18/07/25 11:57:52 ERROR SparkAtlasEventTracker: Fail to initialize Atlas client, stop this listener    
org.apache.atlas.AtlasServiceException: Metadata service API GET : api/atlas/v2/types/typedefs/ failed    
        at org.apache.atlas.AtlasBaseClient.callAPIWithResource(AtlasBaseClient.java:325)    
        at org.apache.atlas.AtlasBaseClient.callAPIWithResource(AtlasBaseClient.java:287)    
        at org.apache.atlas.AtlasBaseClient.callAPI(AtlasBaseClient.java:469)    
        at org.apache.atlas.AtlasClientV2.getAllTypeDefs(AtlasClientV2.java:131)    
        at com.hortonworks.spark.atlas.RestAtlasClient.getAtlasTypeDefs(RestAtlasClient.scala:58)    
        at com.hortonworks.spark.atlas.types.SparkAtlasModel$$anonfun$checkAndGroupTypes$1.apply(SparkAtlasModel.scala:107)    
        at com.hortonworks.spark.atlas.types.SparkAtlasModel$$anonfun$checkAndGroupTypes$1.apply(SparkAtlasModel.scala:104)    
        at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)    
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)    
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)    
        at com.hortonworks.spark.atlas.types.SparkAtlasModel$.checkAndGroupTypes(SparkAtlasModel.scala:104)    
        at com.hortonworks.spark.atlas.types.SparkAtlasModel$.checkAndCreateTypes(SparkAtlasModel.scala:71)    
        at com.hortonworks.spark.atlas.SparkAtlasEventTracker.initializeSparkModel(SparkAtlasEventTracker.scala:108)    
        at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:48)    
        at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:39)    
        at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:43)    
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)    
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)    
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)    
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)    
        at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2743)    
        at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2732)    
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)    
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)    
        at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)    
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)    
        at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)    
        at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2732)    
        at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$$lessinit$greater$1.apply(QueryExecutionListener.scala:83)    
        at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$$lessinit$greater$1.apply(QueryExecutionListener.scala:82)    
        at scala.Option.foreach(Option.scala:257)    
        at org.apache.spark.sql.util.ExecutionListenerManager.<init>(QueryExecutionListener.scala:82)    
        at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$listenerManager$2.apply(BaseSessionStateBuilder.scala:270)    
        at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$listenerManager$2.apply(BaseSessionStateBuilder.scala:270)    
        at scala.Option.getOrElse(Option.scala:121)    
        at org.apache.spark.sql.internal.BaseSessionStateBuilder.listenerManager(BaseSessionStateBuilder.scala:269)    
        at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:297)    
        at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1070)    
        at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:141)    
        at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:140)    
        at scala.Option.getOrElse(Option.scala:121)    
        at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:140)    
        at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:137)    
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)    
        at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)    
        at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:470)    
        at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)    
        at org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:228)    
        at com.oi.spline.main.SparkAtlasConnector$.main(SparkAtlasConnector.scala:20)    
        at com.oi.spline.main.SparkAtlasConnector.main(SparkAtlasConnector.scala)    
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)    
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)    
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)    
        at java.lang.reflect.Method.invoke(Method.java:498)    
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)    
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:906)    
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)    
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)    
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)    
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)    
Caused by: com.sun.jersey.api.client.ClientHandlerException: A message body reader for Java class org.apache.atlas.model.typedef.AtlasTypesDef, and Java type class org.apache.atlas.model.typedef.AtlasTypesDef, and MIME media type application/json;charset=UTF-8 was not found    
        at com.sun.jersey.api.client.ClientResponse.getEntity(ClientResponse.java:630)    
        at com.sun.jersey.api.client.ClientResponse.getEntity(ClientResponse.java:604)    
        at org.apache.atlas.AtlasBaseClient.callAPIWithResource(AtlasBaseClient.java:321)    
        ... 59 more    
18/07/25 11:57:53 INFO FileOutputCommitter: File Output Committer Algorithm version is 1

Mine is a Kerberized cluster. The Atlas version is 0.8.2 and the Spark version is 2.3.0.
I have followed all the steps specified for a Kerberized environment.
Any help will be highly appreciated @jerryshao, @weiqingy

df.write.saveAsTable is not creating table entity in Atlas

  1. Read a CSV file from HDFS:
df = spark.read.csv("/tmp/googleplaystore.csv")
  2. Write this dataframe to a Spark table:
df.write.saveAsTable("app_details")

This creates the 'app_details' table in Spark, but it does not create the corresponding entity in Atlas.

Update the Spark Process Atlas record name

When users do not specify the Spark application name, we need to record meaningful information for the job; therefore, update the spark-shell name to Sparkjob + applicationID.

Spark Catalog PreEvent should be handled correctly

Currently, SAC collects some information at PreEvents such as DropDatabasePreEvent. However, the database described in DropDatabasePreEvent might not exist, in which case the main event DropDatabaseEvent will never arrive because the operation fails. This PR fixes some PreEvent logic and explicitly handles the others as no-ops.

scala> sql("drop database sparkdb2")
2018-11-12 14:16:29 WARN  SparkCatalogEventProcessor:48 - Caught exception during parsing event
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'sparkdb2' not found;
...
	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anon$1.run(AbstractEventProcessor.scala:39)

Use Spark Catalog

Hive Metastore provides multiple catalogs. SAC assumes that it has its own spark catalog, which consists of fully accessible non-transactional tables. The spark catalog is a new one, created by the following method:

https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool

$ schematool \
  -dbType mysql \
  -createCatalog spark \
  -catalogDescription 'Default catalog, for Spark' \
  -catalogLocation hdfs://cluster:8020/apps/spark/warehouse

Question: spark initialization fails with java.lang.ClassNotFoundException: org.apache.atlas.ApplicationProperties

Hi,

I have built the spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar as mentioned in the readme doc.

When I try to start spark-shell with the --conf spark.extraListeners=com.hortonworks.spark.atlas.sql.SparkCatalogEventTracker property, Spark throws the following error:

Caused by: java.lang.ClassNotFoundException: org.apache.atlas.ApplicationProperties

I've even tried adding the Atlas 0.8.2 jar explicitly while invoking spark-shell.

Please find below the command I'm executing:
spark-shell --jars ~/hadoop/spark/jars/spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar, ~/hadoop/spark/jars/atlas-distro-0.8.2.jar --conf spark.extraListeners=com.hortonworks.spark.atlas.sql.SparkCatalogEventTracker --files ~/hadoop/spark/conf/atlas-application.properties

Can you please help me understand what I'm missing?

Thanks!


SAC should not fail on repeated creation/deletion of db/tables

scala> sql("create table t(a int)")
scala> sql("drop table t")
scala> sql("create table t(a int)")
scala> sql("drop table t")
2018-11-12 14:55:30 WARN  SparkCatalogEventProcessor:48 - Caught exception during parsing event
org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 't' not found in database 'default';
	at org.apache.spark.sql.hive.client.HiveClient$$anonfun$getTable$1.apply(HiveClient.scala:81)
...
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:138)
...
	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anon$1.run(AbstractEventProcessor.scala:39)

Can't drop Atlas metadata

After creating Spark tables in a PySpark process and then dropping a table from PySpark, the Atlas metadata is still present in Atlas. For example:

PySparkShell local-1531135715403 |   |   | spark_process

/apps/hive/warehouse/lbk_tabla01 |   |   | hdfs_path

Altering table details from Spark is not reflected in Atlas

Altering table details from Spark is not reflected in Atlas.

Test steps:

  1. Create table test1:

create table test1(col1 int)

Check in Atlas that table entity test1 is created with column col1.

  2. Alter the table and add a new column col2:

spark.sql("alter table test1 add COLUMNS (col2 int)")

  3. After step 2, check Atlas: for table entity test1, the column details are not updated.

Expectation: this needs to be updated in the properties, relationships and schema.

java.lang.NoClassDefFoundError: org/apache/atlas/ApplicationProperties

I am getting the following error when running the spark-shell:

$  spark-shell --jars spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar --conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker --conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
java.lang.NoClassDefFoundError: org/apache/atlas/ApplicationProperties
  at com.hortonworks.spark.atlas.AtlasClientConf.configuration$lzycompute(AtlasClientConf.scala:27)
  at com.hortonworks.spark.atlas.AtlasClientConf.configuration(AtlasClientConf.scala:27)
  at com.hortonworks.spark.atlas.AtlasClientConf.get(AtlasClientConf.scala:52)
  at com.hortonworks.spark.atlas.AtlasClient$.atlasClient(AtlasClient.scala:88)
  at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:39)
  at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:43)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2743)
  at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2732)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
  at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2732)
  at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2360)
  at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2359)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:2359)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:554)
  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:103)
  ... 55 elided
Caused by: java.lang.ClassNotFoundException: org.apache.atlas.ApplicationProperties
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 84 more
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
<console>:14: error: not found: value spark
       import spark.sql
              ^
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0.2.6.5.0-292
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

DataFrame exception during parsing event error

Hi, trying to make sense of this error:

sc.parallelize(0 to 100).toDF.registerTempTable("foo")

warning: there was one deprecation warning; re-run with -deprecation for details
2018-10-12 15:39:22 WARN SparkExecutionPlanProcessor:48 - Caught exception during parsing event
java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.AnalysisBarrier cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Project
at com.hortonworks.spark.atlas.sql.CommandsHarvester$CreateViewHarvester$.harvest(CommandsHarvester.scala:271)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:63)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:48)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:48)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:35)
at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:67)
at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:66)
at scala.Option.foreach(Option.scala:257)
at com.hortonworks.spark.atlas.AbstractEventProcessor.eventProcess(AbstractEventProcessor.scala:66)
at com.hortonworks.spark.atlas.AbstractEventProcessor$$anon$1.run(AbstractEventProcessor.scala:39)

Find a way to differentiate Kafka topics in different clusters

Since SAC tracks Kafka topics for sources and sinks, there's a chance for Spark app(s) to access Kafka topics from various clusters. Because SAC always creates the qualified name as topic@clusterName, where clusterName just comes from the Atlas client properties, SAC basically can't differentiate them.

We need to find a way to differentiate them.

invalid relationshipDef: avro_schema_associatedEntities: end type 1: DataSet, end type 2: spark_table

We receive this message when creating a table in Spark. This is with Atlas 1.0 and Spark 2.3.1.

18/09/20 14:32:53 WARN RestAtlasClient: Failed to create entities
org.apache.atlas.AtlasServiceException: Metadata service API org.apache.atlas.AtlasClientV2$API_V2@4539545f failed with status 400 (Bad Request) Response Body ({"errorCode":"ATLAS-400-00-036","errorMessage":"invalid relationshipDef: avro_schema_associatedEntities: end type 1: DataSet, end type 2: spark_table"})
	at org.apache.atlas.AtlasBaseClient.callAPIWithResource(AtlasBaseClient.java:395)
	at org.apache.atlas.AtlasBaseClient.callAPIWithResource(AtlasBaseClient.java:323)
	at org.apache.atlas.AtlasBaseClient.callAPI(AtlasBaseClient.java:211)
	at org.apache.atlas.AtlasClientV2.createEntities(AtlasClientV2.java:305)
	at com.hortonworks.spark.atlas.RestAtlasClient.doCreateEntities(RestAtlasClient.scala:68)
	at com.hortonworks.spark.atlas.AtlasClient$class.createEntities(AtlasClient.scala:42)
	at com.hortonworks.spark.atlas.RestAtlasClient.createEntities(RestAtlasClient.scala:31)
	at com.hortonworks.spark.atlas.sql.SparkCatalogEventProcessor.process(SparkCatalogEventProcessor.scala:64)
	at com.hortonworks.spark.atlas.sql.SparkCatalogEventProcessor.process(SparkCatalogEventProcessor.scala:30)
	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:67)
	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:66)
	at scala.Option.foreach(Option.scala:257)
	at com.hortonworks.spark.atlas.AbstractEventProcessor.eventProcess(AbstractEventProcessor.scala:66)
	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anon$1.run(AbstractEventProcessor.scala:39)

We don't know how to solve this. Any idea?

Leverage KafkaStreamWriterFactory to avoid reflection while getting destination topic

Currently KafkaHarvester requires a specific fix to spark-sql-kafka because it extracts topic information from KafkaStreamWriter, whose original class doesn't have a topic field (it is only a constructor parameter and KafkaStreamWriter is not a case class), and because it leverages Java reflection to extract the topic.

KafkaStreamWriterFactory is a case class (even in the Spark master branch) and has topic in its primary constructor, so we can apply pattern matching to extract the topic. One downside of this approach is additional overhead, but in this case it is just an object creation, so I guess we can ignore that; we can even revisit this and cache them if the overhead matters.
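
A rough sketch of the idea (the case class below is a simplified local stand-in, not the real Spark class, whose exact signature may differ):

// Simplified stand-in for Spark's KafkaStreamWriterFactory; fields are illustrative.
case class KafkaStreamWriterFactory(topic: Option[String], producerParams: Map[String, String])

// Pattern matching replaces Java reflection for reading the destination topic.
def extractTopic(writerFactory: Any): Option[String] = writerFactory match {
  case KafkaStreamWriterFactory(topic, _) => topic
  case _ => None
}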

java.lang.NoSuchMethodError: org.apache.atlas.type.AtlasTypeUtil.createClassTypeDef(

The following error occurs while using Atlas 1.0:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.atlas.type.AtlasTypeUtil.createClassTypeDef(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/util/Set;[Lorg/apache/atlas/model/typedef/AtlasStructDef$AtlasAttributeDef;)Lorg/apache/atlas/model/typedef/AtlasEntityDef;
at com.hortonworks.spark.atlas.types.metadata$.(metadata.scala:43)
at com.hortonworks.spark.atlas.types.metadata$.(metadata.scala)
at com.hortonworks.spark.atlas.types.SparkAtlasModel$.(SparkAtlasModel.scala:42)
at com.hortonworks.spark.atlas.types.SparkAtlasModel$.(SparkAtlasModel.scala)
at com.hortonworks.spark.atlas.SparkAtlasEventTracker.initializeSparkModel(SparkAtlasEventTracker.scala:108)
at com.hortonworks.spark.atlas.SparkAtlasEventTracker.(SparkAtlasEventTracker.scala:48)
at com.hortonworks.spark.atlas.SparkAtlasEventTracker.(SparkAtlasEventTracker.scala:39)
at com.hortonworks.spark.atlas.SparkAtlasEventTracker.(SparkAtlasEventTracker.scala:43)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2743)
at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2732)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2732)
at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2360)
at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2359)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:2359)
at org.apache.spark.SparkContext.(SparkContext.scala:554)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:933)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:924)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:924)
at com.oi.hermes.impl.SplineDemo$.main(SplineDemo.scala:18)
at com.oi.hermes.impl.SplineDemo.main(SplineDemo.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:906)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Run SAC in pyspark, and got "Cannot find active or default SparkSession in the current context"

I ran an SQL query in PySpark and got the following exception:

>>> spark.sql("create table tmp_case_q (key INT, value STRING)")
18/03/05 20:20:20 WARN SparkCatalogEventTracker:  Caught exception during parsing catalog event
java.lang.IllegalStateException: Cannot find active or default SparkSession in the current context
	at com.hortonworks.spark.atlas.utils.SparkUtils$.sparkSession(SparkUtils.scala:35)
	at com.hortonworks.spark.atlas.utils.SparkUtils$.getExternalCatalog(SparkUtils.scala:87)
	at com.hortonworks.spark.atlas.sql.SparkCatalogEventTracker$$anonfun$eventProcess$2.apply(SparkCatalogEventTracker.scala:120)
	at com.hortonworks.spark.atlas.sql.SparkCatalogEventTracker$$anonfun$eventProcess$2.apply(SparkCatalogEventTracker.scala:96)
	at scala.Option.foreach(Option.scala:257)
	at com.hortonworks.spark.atlas.sql.SparkCatalogEventTracker.eventProcess(SparkCatalogEventTracker.scala:96)
	at com.hortonworks.spark.atlas.sql.AbstractService$$anon$1.run(AbstractService.scala:24)
DataFrame[]

The exception is from here. @jerryshao Could you please help to look at this issue? Thanks.

cc: @dongjoon-hyun
