
hortonworks-spark / spark-atlas-connector


A Spark Atlas connector to track data lineage in Apache Atlas

License: Apache License 2.0

Scala 98.88% Shell 1.12%
apache-spark apache-atlas

spark-atlas-connector's Introduction


Spark Atlas Connector

A connector to track Spark SQL/DataFrame transformations and push metadata changes to Apache Atlas.

This connector supports tracking:

  1. SQL DDLs like "CREATE/DROP/ALTER DATABASE", "CREATE/DROP/ALTER TABLE".
  2. SQL DMLs like "CREATE TABLE tbl AS SELECT", "INSERT INTO...", "LOAD DATA [LOCAL] INPATH", "INSERT OVERWRITE [LOCAL] DIRECTORY" and so on.
  3. DataFrame transformations which have inputs and outputs.
  4. Machine learning pipelines.

This connector also correlates with other systems like Hive and HDFS to track the life cycle of data in Atlas.
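
For illustration, operations like the following are the kind of SQL/DataFrame actions SAC picks up (table names and paths here are only examples, not from the project docs):

scala> spark.sql("CREATE TABLE source_tbl (id INT, name STRING)")
scala> spark.sql("CREATE TABLE target_tbl AS SELECT id, name FROM source_tbl")
scala> spark.read.csv("/tmp/input.csv").write.saveAsTable("csv_copy_tbl")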

How To Build

To use this connector, you will need a recent version of Spark (2.3 or later), because most of the features it relies on only exist in Spark 2.3.0+.

To build this project, please execute:

mvn package -DskipTests

mvn package will assemble all the required dependencies and package them into an uber jar.

Create Atlas models

NOTE: the steps below are only necessary for versions prior to Apache Atlas 2.1.0; Apache Atlas 2.1.0 will include the models.

SAC leverages the official Spark models in Apache Atlas, but as of Apache Atlas 2.0.0 the model file is not yet included. Until Apache Atlas publishes a release that includes the model, SAC ships the JSON model file so it can easily be applied to the Atlas server.

Please copy 1100-spark_model.json to the <ATLAS_HOME>/models/1000-Hadoop directory and restart the Atlas server for the change to take effect.

How To Use

To use it, you will need to make this jar accessible to the Spark Driver and configure the following:

spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker
spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker
spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker

For example, when you're using spark-shell, you can start Spark like this:

bin/spark-shell --jars spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar \
--conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
--conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
--conf spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker

Also make sure the Atlas configuration file atlas-application.properties is on the Driver's classpath, for example by putting it into <SPARK_HOME>/conf.

If you're using cluster mode, please also ship this conf file to the remote Driver using --files atlas-application.properties.

Spark Atlas Connector supports two types of Atlas clients, "kafka" and "rest". You can configure which type of client to use by setting atlas.client.type to either kafka or rest. The default value is kafka, which provides a stable and secure way of publishing changes. Atlas has an embedded Kafka instance so you can try this out in a test environment, but you are encouraged to use an external Kafka cluster in production. If you don't have a Kafka cluster in production, you may want to set the client to rest.
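
For example, switching to the REST client is typically just a matter of setting something like the following in atlas-application.properties (the address below is a placeholder, not a value shipped with this project):

atlas.client.type=rest
atlas.rest.address=http://atlas-host:21000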

To Use it in Secure Environment

Atlas currently secures only the Kafka client API, so when you're using this connector in a secure environment, please switch to the Kafka client API by configuring atlas.client.type=kafka in atlas-application.properties.

Please also add the configurations below to your atlas-application.properties.

atlas.jaas.KafkaClient.loginModuleControlFlag=required
atlas.jaas.KafkaClient.loginModuleName=com.sun.security.auth.module.Krb5LoginModule
atlas.jaas.KafkaClient.option.keyTab=./a.keytab
atlas.jaas.KafkaClient.option.principal=<user>@<REALM>
atlas.jaas.KafkaClient.option.serviceName=kafka
atlas.jaas.KafkaClient.option.storeKey=true
atlas.jaas.KafkaClient.option.useKeyTab=true

Please make sure the keytab (a.keytab) is accessible from the Spark Driver.

When running on a cluster, you will also need to distribute this keytab. Below is an example command for cluster mode.

 ./bin/spark-submit --class <class_name> \
  --jars spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar \
  --conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
  --conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
  --conf spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker \
  --master yarn-cluster \
  --principal <user>@<REALM> \
  --keytab ./spark.headless.keytab \
  --files atlas-application.properties,a.keytab \
  <application-jar>

When the Spark application starts, SAC will transparently track the execution plans of submitted SQL/DataFrame transformations, parse the plans and create the related entities in Atlas.

Spark models vs Hive models

SAC classifies table-related entities with two different kinds of models: Spark and Hive.

We decided to skip sending create events for Hive tables managed by HMS to avoid duplicating the events coming from the Atlas hook for Hive. For Hive entities, Atlas relies on the Atlas hook for Hive as the source of truth.

SAC assumes table entities are being created on the Hive side and simply refers to these entities via object id if both conditions below are true:

  • SparkSession.builder.enableHiveSupport is set
  • The value of "hive.metastore.uris" is set to a non-empty value

In all other cases, SAC will create table-related entities as Spark models.
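
As a rough sketch (the metastore URI below is a placeholder), a session configured like this satisfies both conditions above, so SAC would refer to existing Hive table entities instead of creating Spark ones:

import org.apache.spark.sql.SparkSession

// Hypothetical setup: Hive support enabled and a non-empty hive.metastore.uris,
// so table entities are assumed to be managed on the Hive side.
val spark = SparkSession.builder()
  .appName("sac-hive-model-example")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()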

One exception is HWC: for an HWC source and/or sink, SAC will not create table-related entities and will always refer to Hive table entities via object id.

Known Limitations (Design decision)

SAC only supports the SQL/DataFrame API (in other words, SAC doesn't support RDDs).

SAC relies on the query listener to retrieve the query and examine its impacts.

All "inputs" and "outputs" in multiple queries are accumulated into single "spark_process" entity when there're multple queries running in single Spark session.

"spark_process" maps to an "applicationId" in Spark. This is helpful as it allows admin to track all changes that occurred as part of an application. But it also causes lineage/relationship graph in "spark_process" to be complicated and less meaningful.

We've filed #261 to investigate changing the unit of the "spark_process" entity to a single query. This doesn't mean we will change it soon; it will be addressed only if we see clear benefits in changing it.
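
For reference, the applicationId mentioned above is the value Spark itself exposes on the context:

scala> spark.sparkContext.applicationId   // the id a "spark_process" entity corresponds to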

Only part of the inputs are tracked in a streaming query.

This is a deliberate design trade-off: the Kafka source supports subscribing with a "pattern", and SAC cannot enumerate all matching existing topics, or even all possible topics (and even if it could, it wouldn't make sense to).

"executed plan" provides actual topics which each (micro) batch reads and processes, and as a result, only inputs which participate in (micro) batch are included as "inputs" in "spark_process" entity.

If your query runs long enough to ingest data from all topics, all of them will eventually appear in the "spark_process" entity.
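
As a minimal sketch (broker address and topic pattern are placeholders), a pattern-subscribing streaming query looks like this; only the topics actually read by a given (micro) batch show up as inputs:

// Hypothetical streaming query subscribing by pattern.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-host:9092")
  .option("subscribePattern", "events-.*")
  .load()

stream.writeStream
  .format("console")
  .start()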

SAC doesn't support tracking changes on columns (Spark models).

We are investigating how to add support for column entities. The main issue we face is how to keep changes consistent when multiple Spark applications modify the same table/column.

This doesn't apply to Hive models, where the central remote HMS takes care of DDLs and the Hive Atlas hook takes care of updates.

SAC doesn't track dropping tables (Spark models).

"drop table" event from Spark only provides db and table name, which is NOT sufficient to create qualifierName - especially we separate two types of tables - spark and hive.

SAC depends on reading the Spark catalog to get table information, but by the time SAC notices the drop, Spark has already dropped the table, so that approach does not work.

We are investigating how to change Spark to provide the necessary information via the listener, perhaps as a snapshot of the information taken before the deletion happens.

ML entities/events may not be tracked properly.

We are concentrating on making the basic features stable; ML features are not part of that scope for now. We will revisit them once we are confident that most issues with the basic features have been resolved.

For reference, there are two patches for tracking ML events: a custom patch which can be applied to Spark 2.3/2.4, and a patch which has been adopted into Apache Spark and will be available in Spark 3.0. SAC currently follows the custom patch, which is effectively deprecated by the new one. We may need to revisit ML features again with Spark 3.0.

License

Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0.

spark-atlas-connector's People

Contributors

arunmahadevan, bolkedebruin, bongani, dongjoon-hyun, heartsavior, hyukjinkwon, jerryshao, merlintang, walkertr, weiqingy, yanboliang


spark-atlas-connector's Issues

Failed to get Kafka topic

18/04/24 23:01:22 WARN SparkExecutionPlanTracker: Caught exception during parsing the query: java.util.NoSuchElementException: None.get

Failed to initialize Atlas client

While running my basic word-count Spark job with spark-atlas-connector, I am getting the following error:

18/07/25 11:57:52 ERROR ClientResponse: A message body reader for Java class org.apache.atlas.model.typedef.AtlasTypesDef, and Java type class org.apache.atlas.model.typedef.AtlasTypesDef, and MIME media type application/json;charset=UTF-8 was not found    
18/07/25 11:57:52 ERROR ClientResponse: The registered message body readers compatible with the MIME media type are:    
*/* ->    
  com.sun.jersey.core.impl.provider.entity.FormProvider    
  com.sun.jersey.core.impl.provider.entity.StringProvider    
  com.sun.jersey.core.impl.provider.entity.ByteArrayProvider    
  com.sun.jersey.core.impl.provider.entity.FileProvider    
  com.sun.jersey.core.impl.provider.entity.InputStreamProvider    
  com.sun.jersey.core.impl.provider.entity.DataSourceProvider    
  com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider$General    
  com.sun.jersey.core.impl.provider.entity.ReaderProvider    
  com.sun.jersey.core.impl.provider.entity.DocumentProvider    
  com.sun.jersey.core.impl.provider.entity.SourceProvider$StreamSourceReader    
  com.sun.jersey.core.impl.provider.entity.SourceProvider$SAXSourceReader    
  com.sun.jersey.core.impl.provider.entity.SourceProvider$DOMSourceReader    
  com.sun.jersey.core.impl.provider.entity.XMLRootElementProvider$General    
  com.sun.jersey.core.impl.provider.entity.XMLListElementProvider$General    
  com.sun.jersey.core.impl.provider.entity.XMLRootObjectProvider$General    
  com.sun.jersey.core.impl.provider.entity.EntityHolderReader    

18/07/25 11:57:52 ERROR SparkAtlasEventTracker: Fail to initialize Atlas client, stop this listener    
org.apache.atlas.AtlasServiceException: Metadata service API GET : api/atlas/v2/types/typedefs/ failed    
        at org.apache.atlas.AtlasBaseClient.callAPIWithResource(AtlasBaseClient.java:325)    
        at org.apache.atlas.AtlasBaseClient.callAPIWithResource(AtlasBaseClient.java:287)    
        at org.apache.atlas.AtlasBaseClient.callAPI(AtlasBaseClient.java:469)    
        at org.apache.atlas.AtlasClientV2.getAllTypeDefs(AtlasClientV2.java:131)    
        at com.hortonworks.spark.atlas.RestAtlasClient.getAtlasTypeDefs(RestAtlasClient.scala:58)    
        at com.hortonworks.spark.atlas.types.SparkAtlasModel$$anonfun$checkAndGroupTypes$1.apply(SparkAtlasModel.scala:107)    
        at com.hortonworks.spark.atlas.types.SparkAtlasModel$$anonfun$checkAndGroupTypes$1.apply(SparkAtlasModel.scala:104)    
        at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)    
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)    
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)    
        at com.hortonworks.spark.atlas.types.SparkAtlasModel$.checkAndGroupTypes(SparkAtlasModel.scala:104)    
        at com.hortonworks.spark.atlas.types.SparkAtlasModel$.checkAndCreateTypes(SparkAtlasModel.scala:71)    
        at com.hortonworks.spark.atlas.SparkAtlasEventTracker.initializeSparkModel(SparkAtlasEventTracker.scala:108)    
        at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:48)    
        at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:39)    
        at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:43)    
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)    
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)    
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)    
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)    
        at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2743)    
        at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2732)    
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)    
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)    
        at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)    
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)    
        at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)    
        at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2732)    
        at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$$lessinit$greater$1.apply(QueryExecutionListener.scala:83)    
        at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$$lessinit$greater$1.apply(QueryExecutionListener.scala:82)    
        at scala.Option.foreach(Option.scala:257)    
        at org.apache.spark.sql.util.ExecutionListenerManager.<init>(QueryExecutionListener.scala:82)    
        at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$listenerManager$2.apply(BaseSessionStateBuilder.scala:270)    
        at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$listenerManager$2.apply(BaseSessionStateBuilder.scala:270)    
        at scala.Option.getOrElse(Option.scala:121)    
        at org.apache.spark.sql.internal.BaseSessionStateBuilder.listenerManager(BaseSessionStateBuilder.scala:269)    
        at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:297)    
        at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1070)    
        at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:141)    
        at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:140)    
        at scala.Option.getOrElse(Option.scala:121)    
        at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:140)    
        at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:137)    
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)    
        at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)    
        at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:470)    
        at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)    
        at org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:228)    
        at com.oi.spline.main.SparkAtlasConnector$.main(SparkAtlasConnector.scala:20)    
        at com.oi.spline.main.SparkAtlasConnector.main(SparkAtlasConnector.scala)    
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)    
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)    
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)    
        at java.lang.reflect.Method.invoke(Method.java:498)    
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)    
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:906)    
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)    
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)    
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)    
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)    
Caused by: com.sun.jersey.api.client.ClientHandlerException: A message body reader for Java class org.apache.atlas.model.typedef.AtlasTypesDef, and Java type class org.apache.atlas.model.typedef.AtlasTypesDef, and MIME media type application/json;charset=UTF-8 was not found    
        at com.sun.jersey.api.client.ClientResponse.getEntity(ClientResponse.java:630)    
        at com.sun.jersey.api.client.ClientResponse.getEntity(ClientResponse.java:604)    
        at org.apache.atlas.AtlasBaseClient.callAPIWithResource(AtlasBaseClient.java:321)    
        ... 59 more    
18/07/25 11:57:53 INFO FileOutputCommitter: File Output Committer Algorithm version is 1

Mine is a Kerberized cluster. The Atlas version is 0.8.2 and the Spark version is 2.3.0.
I have followed all the steps specified for a Kerberized environment.
Any help will be highly appreciated @jerryshao, @weiqingy

df.write.saveAsTable is not creating table entity in Atlas

  1. Read a CSV file from HDFS:
df = spark.read.csv("/tmp/googleplaystore.csv")
  2. Write this dataframe to a Spark table:
df.write.saveAsTable("app_details")

This creates the 'app_details' table in Spark, but it does not create the corresponding entity in Atlas.

Update the Spark Process Atlas record name

When users do not specify the Spark application name, we need to record meaningful information for the job; therefore, update the spark-shell name to Sparkjob + applicationID.

Spark Catalog PreEvent should be handled correctly

Currently, SAC collects some information at PreEvents such as DropDatabasePreEvent. However, the database described in DropDatabasePreEvent might not exist, in which case the main event DropDatabaseEvent will never arrive because the operation fails. This PR fixes some PreEvent logic and explicitly handles the others as no-ops.

scala> sql("drop database sparkdb2")
2018-11-12 14:16:29 WARN  SparkCatalogEventProcessor:48 - Caught exception during parsing event
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'sparkdb2' not found;
...
	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anon$1.run(AbstractEventProcessor.scala:39)

Use Spark Catalog

Hive Metastore provides multiple catalogs. SAC assumes that it has its own spark catalog, which consists of fully accessible non-transactional tables. The spark catalog is a new one, created by the following method:

https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool

$ schematool \
  -dbType mysql \
  -createCatalog spark \
  -catalogDescription 'Default catalog, for Spark' \
  -catalogLocation hdfs://cluster:8020/apps/spark/warehouse

Question: spark initialization fails with java.lang.ClassNotFoundException: org.apache.atlas.ApplicationProperties

Hi,

I have built the spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar as mentioned in the readme doc.

When I try to start spark-shell with the --conf spark.extraListeners=com.hortonworks.spark.atlas.sql.SparkCatalogEventTracker property, Spark throws the following error:

Caused by: java.lang.ClassNotFoundException: org.apache.atlas.ApplicationProperties

I've even tried adding the Atlas 0.8.2 jar explicitly while invoking spark-shell.

Please find below the command I'm executing:
spark-shell --jars ~/hadoop/spark/jars/spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar, ~/hadoop/spark/jars/atlas-distro-0.8.2.jar --conf spark.extraListeners=com.hortonworks.spark.atlas.sql.SparkCatalogEventTracker --files ~/hadoop/spark/conf/atlas-application.properties

Can you please help me understand what I'm missing?

Thanks!


SAC should not fail on repeated creation/deletion of db/tables

scala> sql("create table t(a int)")
scala> sql("drop table t")
scala> sql("create table t(a int)")
scala> sql("drop table t")
2018-11-12 14:55:30 WARN  SparkCatalogEventProcessor:48 - Caught exception during parsing event
org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 't' not found in database 'default';
	at org.apache.spark.sql.hive.client.HiveClient$$anonfun$getTable$1.apply(HiveClient.scala:81)
...
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:138)
...
	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anon$1.run(AbstractEventProcessor.scala:39)

Can't drop Atlas metadata

After creating Spark tables in a PySpark process and then dropping a table from PySpark, the Atlas metadata is still present in Atlas. For example:

PySparkShell local-1531135715403 |   |   | spark_process

/apps/hive/warehouse/lbk_tabla01 |   |   | hdfs_path

Altering table details from Spark is not reflected in Atlas

Altering table details from Spark is not reflected in Atlas.

Test steps:

  1. Create table test1:

create table test1(col1 int)

Check in Atlas that table entity test1 is created with column col1.

  2. Alter the table and add a new column col2:

spark.sql("alter table test1 add COLUMNS (col2 int)")

  3. After step 2, check Atlas: for table entity test1, the column details are not updated.

Expectation: this needs to be updated in the properties, relationships and schema.

java.lang.NoClassDefFoundError: org/apache/atlas/ApplicationProperties

I am getting the following error when running the spark-shell:

$  spark-shell --jars spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar --conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker --conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
java.lang.NoClassDefFoundError: org/apache/atlas/ApplicationProperties
  at com.hortonworks.spark.atlas.AtlasClientConf.configuration$lzycompute(AtlasClientConf.scala:27)
  at com.hortonworks.spark.atlas.AtlasClientConf.configuration(AtlasClientConf.scala:27)
  at com.hortonworks.spark.atlas.AtlasClientConf.get(AtlasClientConf.scala:52)
  at com.hortonworks.spark.atlas.AtlasClient$.atlasClient(AtlasClient.scala:88)
  at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:39)
  at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:43)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2743)
  at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2732)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
  at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2732)
  at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2360)
  at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2359)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:2359)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:554)
  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:103)
  ... 55 elided
Caused by: java.lang.ClassNotFoundException: org.apache.atlas.ApplicationProperties
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 84 more
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
<console>:14: error: not found: value spark
       import spark.sql
              ^
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0.2.6.5.0-292
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

DataFrame exception during parsing event error

Hi, trying to make sense of this error:

sc.parallelize(0 to 100).toDF.registerTempTable("foo")

warning: there was one deprecation warning; re-run with -deprecation for details
2018-10-12 15:39:22 WARN SparkExecutionPlanProcessor:48 - Caught exception during parsing event
java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.AnalysisBarrier cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Project
at com.hortonworks.spark.atlas.sql.CommandsHarvester$CreateViewHarvester$.harvest(CommandsHarvester.scala:271)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:63)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:48)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:48)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:35)
at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:67)
at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:66)
at scala.Option.foreach(Option.scala:257)
at com.hortonworks.spark.atlas.AbstractEventProcessor.eventProcess(AbstractEventProcessor.scala:66)
at com.hortonworks.spark.atlas.AbstractEventProcessor$$anon$1.run(AbstractEventProcessor.scala:39)

Find a way to differentiate Kafka topics in different clusters

Since SAC tracks Kafka topics for sources and sinks, there's a chance for Spark app(s) to access Kafka topics from various clusters. Because SAC always creates the qualified name as topic@clusterName, where clusterName just comes from the Atlas client properties, SAC basically can't differentiate them.

We need to find a way to differentiate them.

invalid relationshipDef: avro_schema_associatedEntities: end type 1: DataSet, end type 2: spark_table

We receive this message when creating a table in Spark. This is with Atlas 1.0 and Spark 2.3.1.

18/09/20 14:32:53 WARN RestAtlasClient: Failed to create entities
org.apache.atlas.AtlasServiceException: Metadata service API org.apache.atlas.AtlasClientV2$API_V2@4539545f failed with status 400 (Bad Request) Response Body ({"errorCode":"ATLAS-400-00-036","errorMessage":"invalid relationshipDef: avro_schema_associatedEntities: end type 1: DataSet, end type 2: spark_table"})
	at org.apache.atlas.AtlasBaseClient.callAPIWithResource(AtlasBaseClient.java:395)
	at org.apache.atlas.AtlasBaseClient.callAPIWithResource(AtlasBaseClient.java:323)
	at org.apache.atlas.AtlasBaseClient.callAPI(AtlasBaseClient.java:211)
	at org.apache.atlas.AtlasClientV2.createEntities(AtlasClientV2.java:305)
	at com.hortonworks.spark.atlas.RestAtlasClient.doCreateEntities(RestAtlasClient.scala:68)
	at com.hortonworks.spark.atlas.AtlasClient$class.createEntities(AtlasClient.scala:42)
	at com.hortonworks.spark.atlas.RestAtlasClient.createEntities(RestAtlasClient.scala:31)
	at com.hortonworks.spark.atlas.sql.SparkCatalogEventProcessor.process(SparkCatalogEventProcessor.scala:64)
	at com.hortonworks.spark.atlas.sql.SparkCatalogEventProcessor.process(SparkCatalogEventProcessor.scala:30)
	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:67)
	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:66)
	at scala.Option.foreach(Option.scala:257)
	at com.hortonworks.spark.atlas.AbstractEventProcessor.eventProcess(AbstractEventProcessor.scala:66)
	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anon$1.run(AbstractEventProcessor.scala:39)

We don't know how to solve this. Any idea?

Leverage KafkaStreamWriterFactory to avoid reflection while getting destination topic

Currently KafkaHarvester requires a specific fix to spark-sql-kafka because it extracts topic information from KafkaStreamWriter, whose original class doesn't have a topic field (it is only a constructor parameter and KafkaStreamWriter is not a case class), and because it leverages Java reflection to extract the topic.

KafkaStreamWriterFactory is a case class (even in the Spark master branch) and has topic in its primary constructor, so we can apply pattern matching to extract the topic. One downside of this approach is additional overhead, but in this case it is just an object creation, so I guess we can ignore that; we can even revisit this and cache them if the overhead matters.
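
A rough sketch of the idea (the case class below is a simplified local stand-in, not the real Spark class, whose exact signature may differ):

// Simplified stand-in for Spark's KafkaStreamWriterFactory; fields are illustrative.
case class KafkaStreamWriterFactory(topic: Option[String], producerParams: Map[String, String])

// Pattern matching replaces Java reflection for reading the destination topic.
def extractTopic(writerFactory: Any): Option[String] = writerFactory match {
  case KafkaStreamWriterFactory(topic, _) => topic
  case _ => None
}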

java.lang.NoSuchMethodError: org.apache.atlas.type.AtlasTypeUtil.createClassTypeDef(

The following error occurs while using Atlas 1.0:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.atlas.type.AtlasTypeUtil.createClassTypeDef(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/util/Set;[Lorg/apache/atlas/model/typedef/AtlasStructDef$AtlasAttributeDef;)Lorg/apache/atlas/model/typedef/AtlasEntityDef;
at com.hortonworks.spark.atlas.types.metadata$.(metadata.scala:43)
at com.hortonworks.spark.atlas.types.metadata$.(metadata.scala)
at com.hortonworks.spark.atlas.types.SparkAtlasModel$.(SparkAtlasModel.scala:42)
at com.hortonworks.spark.atlas.types.SparkAtlasModel$.(SparkAtlasModel.scala)
at com.hortonworks.spark.atlas.SparkAtlasEventTracker.initializeSparkModel(SparkAtlasEventTracker.scala:108)
at com.hortonworks.spark.atlas.SparkAtlasEventTracker.(SparkAtlasEventTracker.scala:48)
at com.hortonworks.spark.atlas.SparkAtlasEventTracker.(SparkAtlasEventTracker.scala:39)
at com.hortonworks.spark.atlas.SparkAtlasEventTracker.(SparkAtlasEventTracker.scala:43)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2743)
at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2732)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2732)
at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2360)
at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2359)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:2359)
at org.apache.spark.SparkContext.(SparkContext.scala:554)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:933)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:924)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:924)
at com.oi.hermes.impl.SplineDemo$.main(SplineDemo.scala:18)
at com.oi.hermes.impl.SplineDemo.main(SplineDemo.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:906)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Run SAC in pyspark, and got "Cannot find active or default SparkSession in the current context"

I ran an SQL query in PySpark and got the following exception:

>>> spark.sql("create table tmp_case_q (key INT, value STRING)")
18/03/05 20:20:20 WARN SparkCatalogEventTracker:  Caught exception during parsing catalog event
java.lang.IllegalStateException: Cannot find active or default SparkSession in the current context
	at com.hortonworks.spark.atlas.utils.SparkUtils$.sparkSession(SparkUtils.scala:35)
	at com.hortonworks.spark.atlas.utils.SparkUtils$.getExternalCatalog(SparkUtils.scala:87)
	at com.hortonworks.spark.atlas.sql.SparkCatalogEventTracker$$anonfun$eventProcess$2.apply(SparkCatalogEventTracker.scala:120)
	at com.hortonworks.spark.atlas.sql.SparkCatalogEventTracker$$anonfun$eventProcess$2.apply(SparkCatalogEventTracker.scala:96)
	at scala.Option.foreach(Option.scala:257)
	at com.hortonworks.spark.atlas.sql.SparkCatalogEventTracker.eventProcess(SparkCatalogEventTracker.scala:96)
	at com.hortonworks.spark.atlas.sql.AbstractService$$anon$1.run(AbstractService.scala:24)
DataFrame[]

The exception is from here. @jerryshao Could you please help to look at this issue? Thanks.

cc: @dongjoon-hyun
