microsoft / purview-adb-lineage-solution-accelerator

A connector to ingest Azure Databricks lineage into Microsoft Purview

License: MIT License

Scala 12.16% Shell 6.22% C# 70.61% Python 8.01% Java 0.77% Bicep 2.24%
azure-databricks governance lineage microsoft-purview

purview-adb-lineage-solution-accelerator's Introduction

page_type: sample
languages: csharp
products: microsoft-purview, azure-databricks

EAE_Header.png lineage.png

Microsoft Solutions / Early Access Engineering

Azure Databricks to Purview Lineage Connector

This solution accelerator, together with the OpenLineage project, provides a connector that will transfer lineage metadata from Spark operations in Azure Databricks to Microsoft Purview, allowing you to see a table-level lineage graph as demonstrated above.

Note: In addition to this solution accelerator, Microsoft Purview is building native models for Azure Databricks (e.g., notebooks, jobs, job tasks) to integrate with Catalog experiences. With native models in Microsoft Purview for Azure Databricks, customers will get enriched lineage experiences such as detailed transformations. If you use this solution accelerator in a Microsoft Purview account before the native models are released, those enriched experiences are not backward compatible. Please reach out to your Microsoft account representative with timeline questions about the upcoming model enrichment for Azure Databricks in Microsoft Purview.


Overview

Gathering lineage data is performed in the following steps:

high-level-architecture.png

  1. Azure Databricks clusters are configured to initialize the OpenLineage Spark listener with an endpoint to which it sends data.
  2. Spark operations emit data in the standard OpenLineage format to the endpoint configured on the cluster.
  3. The endpoint is provided by an Azure Function app, which filters the incoming data and passes it to an Azure Event Hub.
  4. A second Function app captures the events and transforms the data into a format compatible with Apache Atlas and Microsoft Purview.
  5. The lineage data is synchronized with existing Purview metadata and uploaded to Purview via the standard Apache Atlas APIs.
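Steps 3 and 4 can be sketched as follows (hypothetical, simplified Python; the actual filter runs as a C# Azure Function in this repository, and the exact filtering rules here are an illustrative assumption):

```python
import json

def filter_openlineage_event(raw_body: str):
    """Drop events the connector has no use for; return the rest (sketch).

    The eventType value and the inputs/outputs fields appear in real
    OpenLineage payloads (see the logs later on this page); the specific
    rules below are assumptions for illustration only.
    """
    event = json.loads(raw_body)
    if event.get("eventType") not in ("START", "COMPLETE"):
        return None  # e.g. ignore intermediate events
    if not event.get("inputs") and not event.get("outputs"):
        return None  # nothing to build lineage from
    return event  # the real function would forward this to Event Hubs
```

Events that survive the filter would then be transformed into Atlas entities by the second Function app.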

Features

  • Supports table level lineage from Spark Notebooks and jobs for the following data sources:
    • Azure SQL
    • Azure Synapse Analytics (as input)
    • Azure Data Lake Gen 2
    • Azure Blob Storage
    • Delta Lake (Merge command not supported)
    • Azure Data Explorer
    • Azure Data Factory orchestration
    • Hive Tables (in default metastore)
    • MySQL
    • PostgreSQL
  • Supports Spark 3.0, 3.1, 3.2, and 3.3 (Interactive and Job clusters) / Spark 2.x (Job clusters)
    • Databricks Runtimes between 9.1 and 11.3 LTS are currently supported
  • Can be configured per cluster or for all clusters as a global configuration
  • Supports column-level lineage for ABFSS, WASBS, and default-metastore Hive tables (see Limitations for more detail)
  • Once configured, does not require any code changes to notebooks or jobs
  • Can add new source support through configuration


Prerequisites

Installing this connector requires the following:

  1. Azure subscription-level role assignments for both Contributor and User Access Administrator.
  2. Azure Service Principal with client ID and secret - How to create Service Principal.

Getting Started

There are two deployment options for this solution accelerator:

  • Demo deployment: no additional prerequisites are necessary, as the demo environment is set up for you, including Azure Databricks, Purview, ADLS, and example data sources and notebooks.

  • Connector-only deployment: Azure Databricks, your data sources, and Microsoft Purview are assumed to be already set up and running.

Using the Connector

Ensure both the Azure Function app and Azure Databricks cluster are running.

  1. Open your Databricks workspace to run a Spark job or notebook which results in data being transferred from one location to another. For the demo deployment, browse to the Workspace > Shared > abfss-in-abfss-out-olsample notebook, and click "Run all".

  2. Once complete, open your Purview workspace and click the "Browse assets" button near the center of the page.

  3. Click on the "By source type" tab.
    You should see at least one item listed under the heading "Azure Databricks". In addition, there may be a Purview Custom Connector section under the Custom source types heading.

    browse_assets.png

  4. Click on the "Databricks" section, then click on the link to the Azure Databricks workspace in which the sample notebook was run. Then select the notebook you ran (if you are running Databricks Jobs, you can also select the job and drill into its related tasks).

    • After running a Databricks Notebook on an Interactive Cluster, you will see lineage directly in the Notebook asset under the Lineage tab.
    • After running a Databricks Job on a Job Cluster, you will see lineage in the Notebook Task asset. To navigate from a Notebook to a Notebook Task, select the Properties tab and choose the Notebook Task from the Related Assets section. Please note that Databricks Jobs lineage requires additional setup outside of the demo deployment.

    databricks_task_related.png

  5. Click through to the lineage view to see the lineage graph.

    lineage_view.png

    Note: If you are viewing the Databricks Process shortly after it was created, the lineage tab can take some time to appear. If you do not see the lineage tab, wait a few minutes and then refresh the browser.

    Lineage Note: The screenshot above shows lineage to an Azure Data Lake Gen 2 folder. You must have scanned your Data Lake before running a notebook for the connector to match it to a Microsoft Purview built-in type such as folders or resource sets.

Troubleshooting

If you have any issues, please start with the Troubleshooting Doc and note the limitations, which affect what sort of lineage can be collected. If the problem persists, please raise an issue on GitHub.

When filing a new issue, please include the associated log message(s) from Azure Functions. This allows the core team to reproduce the issue within our test environment and develop a solution.

Limitations

The solution accelerator has some limitations which affect what sort of lineage can be collected.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Data Collection

The software may collect information about you and your use of the software and send it to Microsoft. Microsoft may use this information to provide services and improve our products and services. You may turn off the telemetry as described in the repository. There are also some features in the software that may enable you and Microsoft to collect data from users of your applications. If you use these features, you must comply with applicable law, including providing appropriate notices to users of your applications together with a copy of Microsoft’s privacy statement. Our privacy statement is located at https://go.microsoft.com/fwlink/?LinkID=824704. You can learn more about data collection and use in the help documentation and our privacy statement. Your use of the software operates as your consent to these practices.

purview-adb-lineage-solution-accelerator's People

Contributors

gianlucadardia, hmoazam, isantillan1, marktayl1, marktayl2, mattsavarino, microsoftopensource, mreistadvipps, rodrigomonteiro-gbb, travishilbert, wjohnson


purview-adb-lineage-solution-accelerator's Issues

Databricks Version 10.3 and above are not working

Describe the bug
When using Databricks Runtime 10.3 or above with the Advanced Options applied, the notebook does not start.

To Reproduce
Enable a cluster with Spark 3.2.1 and add Advanced Options like the ones below:

spark.openlineage.version 1
spark.openlineage.namespace #<DB_CLUSTER_ID>
spark.openlineage.host https://<FUNCTION_APP_NAME>.azurewebsites.net
spark.openlineage.url.param.code <FUNCTION_APP_DEFAULT_HOST_KEY>

Expected behavior
The notebook should run, but with the configuration above it does not.

Screenshots
image

Desktop (please complete the following information):

  • OpenLineage Version: [OpenLineage-Spark 0.8.2 jar]
  • Databricks Runtime Version: [e.g. 6.4, 9.1, 10.1]
  • Cluster Type: [Job]
  • Cluster Mode: [Standard]
  • Using Credential Passthrough: [e.g. Yes, No]

Additional context
Every cluster below Spark 3.2.1 is working fine

Purview quota reached

Dear,

We were testing the connector. In our setup, a Purview account is already in place in the resource group. However, the script produced an error that we have reached the tenant quota for Purview accounts. If I read it correctly, the script is not supposed to add a new Purview resource when one already exists?

Many thanks,
Petar

Purview ADLS Gen1 mount to URL resolve mismatch

Describe the bug

We are trying out the latest version of the ADB lineage solution, and we use mounts in our notebook. However, the ADLS URL resolved in Purview by this solution does not match the actual URL.

URL in Purview from the solution: adl://accountname.azuredatalakestore.net/lob/ontology/path1/path2/path3

Expected actual ADLS URL: adl://accountname.azuredatalakestore.net/datalake-dev/lob/ontology/path1/path2/path3

Mount in Databricks :

Mount point: /mnt/datalake/
Source: adl://accountname.azuredatalakestore.net/datalake-dev/

Path used in the notebook (via the mount): /mnt/datalake/lob/ontology/path1/path2/path3
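A workaround sketch (plain Python; a hypothetical helper, not part of the solution) for translating a mounted path back to its full source URL, given a mount table such as the one `dbutils.fs.mounts()` reports inside Databricks:

```python
def resolve_mount(path: str, mounts: dict) -> str:
    """Translate a /mnt/... path to the underlying storage URL.

    `mounts` maps mount points to their sources, e.g. built in Databricks
    with {m.mountPoint: m.source for m in dbutils.fs.mounts()}.
    Longest mount point wins, so nested mounts resolve correctly.
    """
    for mount_point, source in sorted(mounts.items(), key=lambda kv: -len(kv[0])):
        if path.startswith(mount_point):
            return source.rstrip("/") + "/" + path[len(mount_point):].lstrip("/")
    return path  # not under any mount; return unchanged

# Values from the report above:
mounts = {"/mnt/datalake/": "adl://accountname.azuredatalakestore.net/datalake-dev/"}
resolved = resolve_mount("/mnt/datalake/lob/ontology/path1/path2/path3", mounts)
# resolved == "adl://accountname.azuredatalakestore.net/datalake-dev/lob/ontology/path1/path2/path3"
```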

How do incremental changes to a notebook work with the solution?

I am not sure if this is a limitation of the solution or a bug.

In the given sample notebook, exampleInputA and exampleInputB produce the output exampleoutput1. If I replace exampleInputB with a new input, exampleInputC, to produce the same output, I get three inputs to the notebook in the lineage map where there should be only two (A and C). So my question: does the solution only append, and never delete, lineage mappings?

val exampleC = (
  spark.read.format("csv")
    .schema(exampleCSchema)
    .option("header", true)
    .load(adlsRootPath + "/examples/data/csv/exampleInputC/exampleInputC.csv")
)
val outputDf = exampleA.join(exampleC, exampleA("id") === exampleC("id"), "inner").drop(exampleC("id"))
outputDf.repartition(1).write.mode("overwrite").format("csv").save(adlsRootPath + "/examples/data/csv/exampleOutput/")

Mapping in Purview (I was expecting exampleInputB to be removed automatically):

image

Fix test runner to report success correctly

In tests\integration\runner.py the overall success check is not correct: it compares the wrong variable against len(expected). It is currently:

total_successes = searchable_success + process_success
print(f"Summary: {total_successes:0>2}/{len(expected):0>2}")
print(success == len(expected), end="")

But should be

total_successes = searchable_success + process_success
print(f"Summary: {total_successes:0>2}/{len(expected):0>2}")
print(total_successes== len(expected), end="")
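As a quick sanity check, the corrected summary logic behaves as expected in a standalone sketch (the counter values here are made-up stand-ins):

```python
searchable_success = 3
process_success = 2
expected = ["a", "b", "c", "d", "e"]  # five expected assets (made-up stand-ins)

total_successes = searchable_success + process_success
print(f"Summary: {total_successes:0>2}/{len(expected):0>2}")  # Summary: 05/05
print(total_successes == len(expected))                       # True
```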

What is DB_CLUSTER_ID?

I am not sure if there is a forum for asking questions related to this Lineage Accelerator. In step 4 of this deployment guide, I do not understand where to find DB_CLUSTER_ID in spark.openlineage.namespace <ADB-WORKSPACE-ID>#<DB_CLUSTER_ID>.

Searching online, I see AWS has something called a DB cluster identifier, but I do not see anything named DB_CLUSTER_ID in Azure Databricks. Question: is this just the Cluster ID?
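For what it's worth, the cluster ID can be read from the Spark configuration inside a Databricks notebook via the Databricks-set property `spark.databricks.clusterUsageTags.clusterId` (an assumption from general Databricks usage, not from this repository). The sketch below fakes `spark.conf` so it runs off-cluster; the sample ID is taken from a log elsewhere on this page:

```python
class FakeConf:
    """Stands in for spark.conf so this sketch runs outside Databricks."""
    def __init__(self, values):
        self._values = values

    def get(self, key):
        return self._values[key]

# In a real notebook you would use spark.conf instead of FakeConf:
#   cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
conf = FakeConf({"spark.databricks.clusterUsageTags.clusterId": "0525-210434-9gzshgxv"})
cluster_id = conf.get("spark.databricks.clusterUsageTags.clusterId")
print(cluster_id)  # the value to substitute for <DB_CLUSTER_ID>
```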

source type Azure Blob Storage and Azure Data Lake Storage Gen2 not found after initial run

Describe the bug
source type Azure Blob Storage and Azure Data Lake Storage Gen2 not found after initial run

To Reproduce
Steps to reproduce the behavior:

  1. run through Purview-ADB-Lineage-Solution-Accelerator
  2. check Data Catalog, view 'By source Type'

Expected behavior
Expecting to see the image below as per the instructions:
image

Missing Azure blob Storage and Azure Data Lake Storage Gen 2
image

Logs

  1. Please include any Spark code being run that generates this error
  2. Please include a gist to the OpenLineageIn and PurviewOut logs
  3. See how to stream Azure Function Logs

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. Windows, Mac]
  • OpenLineage Version: [e.g. name of jar]
  • Databricks Runtime Version: [e.g. 6.4, 9.1, 10.1]
  • Cluster Type: [e.g. Job, Interactive]
  • Cluster Mode: [e.g. Standard, High Concurrency, Single]
  • Using Credential Passthrough: [e.g. Yes, No]

Additional context
Add any other context about the problem here.

ADF executing notebooks is getting error

Describe the bug
I have deployed the demo environment and then created an ADF pipeline to execute the notebook. The pipeline reports a success status, but the lineage does not show up in Purview.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the demo environment
  2. Deploy Azure Data Factory
  3. Create a Linked Service using Access Token generated in the workspace for authentication
  4. Choose a new cluster
  5. Add cluster setting as follow:
    image
  6. I needed to change the workspace in spark.openlineage.namespace, because the value created by the deployment (adbpurviewol1) wasn't right

Expected behavior
I am expecting the lineage to be pushed to Purview.

Logs
PurviewOut
2022-08-10T14:27:55.269 [Error] AdbClient-GetSingleAdbJobAsync: error, message: Response status code does not indicate success: 403 (Forbidden).
Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcException : Result: AdbClient-GetSingleAdbJobAsync: error, message: Response status code does not indicate success: 403 (Forbidden).
Exception: System.Net.Http.HttpRequestException: Response status code does not indicate success: 403 (Forbidden).
at System.Net.Http.HttpResponseMessage.EnsureSuccessStatusCode()
at Function.Domain.Providers.AdbClientProvider.GetSingleAdbJobAsync(Int64 runId, String adbWorkspaceUrl) in D:\a\Purview-ADB-Lineage-Solution-Accelerator\Purview-ADB-Lineage-Solution-Accelerator\function-app\adb-to-purview\src\Function.Domain\Providers\AdbClientProvider.cs:line 127
Stack: at System.Net.Http.HttpResponseMessage.EnsureSuccessStatusCode()
at Function.Domain.Providers.AdbClientProvider.GetSingleAdbJobAsync(Int64 runId, String adbWorkspaceUrl) in D:\a\Purview-ADB-Lineage-Solution-Accelerator\Purview-ADB-Lineage-Solution-Accelerator\function-app\adb-to-purview\src\Function.Domain\Providers\AdbClientProvider.cs:line 127

Desktop (please complete the following information):

  • OpenLineage Version: The same available in the github from this week
  • Databricks Runtime Version: 9.1
  • Cluster Type: Cluster created during the ADF execution
  • Cluster Mode: Standard
  • Using Credential Passthrough: No

Checking if inputs and outputs are equal

The method InOutEqual in Function.Domain > Helpers > OlProcessing > ValidateOlEvent.cs compares only the names (rather than names and namespaces).

Should be updated to return Enumerable.SequenceEqual(nms, nms2) && Enumerable.SequenceEqual(nmspc, nmspc2);
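The effect of the suggested fix can be illustrated with an off-language sketch (Python here; the real method is C# in ValidateOlEvent.cs): two dataset lists with matching names but different namespaces must not compare as equal.

```python
def in_out_equal(inputs, outputs):
    """Compare input/output datasets by names AND namespaces,
    mirroring the suggested two-SequenceEqual fix."""
    names = [d["name"] for d in inputs]
    names2 = [d["name"] for d in outputs]
    nmspc = [d["namespace"] for d in inputs]
    nmspc2 = [d["namespace"] for d in outputs]
    return names == names2 and nmspc == nmspc2

# Same dataset name, different storage containers (hypothetical values):
# a name-only comparison would wrongly report these as equal.
result = in_out_equal(
    [{"name": "postal_code", "namespace": "abfss://a@acct.dfs.core.windows.net"}],
    [{"name": "postal_code", "namespace": "abfss://b@acct.dfs.core.windows.net"}],
)
print(result)  # False
```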

No lineage is shown in purview although no errors in Azure functions

Hi,

I have installed the connector following the Connector Only Deployment. I execute a notebook and see that the Azure Functions are being invoked without errors; however, no lineage is shown.

OpenLineageIn Log:

Connected!
2022-07-05T18:01:56 Welcome, you are now connected to log-streaming service. The default timeout is 2 hours. Change the timeout with the App Setting SCM_LOGSTREAM_TIMEOUT (in seconds).
2022-07-05T18:02:11.102 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=84fef0ee-48db-4060-b0d6-278a4bc69bc6)
2022-07-05T18:02:11.108 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=84fef0ee-48db-4060-b0d6-278a4bc69bc6, Duration=7ms)
2022-07-05T18:02:11.310 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=98f106ce-692b-43f4-a528-12a2eed22cb3)
2022-07-05T18:02:11.313 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=98f106ce-692b-43f4-a528-12a2eed22cb3, Duration=4ms)
2022-07-05T18:02:11.432 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=3c12eda6-a58d-44e4-8a82-28787555c0f7)
2022-07-05T18:02:11.436 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=3c12eda6-a58d-44e4-8a82-28787555c0f7, Duration=4ms)
2022-07-05T18:02:11.584 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=b7dd3dea-2f88-4f9d-bc8c-910cdc5a00ef)
2022-07-05T18:02:11.588 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=b7dd3dea-2f88-4f9d-bc8c-910cdc5a00ef, Duration=4ms)
2022-07-05T18:02:11.599 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=11af742b-2c36-418f-90eb-53adedd9f91b)
2022-07-05T18:02:11.602 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=11af742b-2c36-418f-90eb-53adedd9f91b, Duration=3ms)
2022-07-05T18:02:11.614 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=fb67f20b-31b7-4907-b9db-1016534913e3)
2022-07-05T18:02:11.617 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=fb67f20b-31b7-4907-b9db-1016534913e3, Duration=3ms)
2022-07-05T18:02:11.949 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=588cb0c5-8c70-4576-b4c6-3525f7e63cf6)
2022-07-05T18:02:11.952 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=588cb0c5-8c70-4576-b4c6-3525f7e63cf6, Duration=4ms)
2022-07-05T18:02:12.018 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=072df589-62fb-47d9-ab7a-61706fa257b0)
2022-07-05T18:02:12.022 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=072df589-62fb-47d9-ab7a-61706fa257b0, Duration=5ms)
2022-07-05T18:02:12.270 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=b92c4170-2d52-4837-8f62-4b7593f42f63)
2022-07-05T18:02:12.274 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=b92c4170-2d52-4837-8f62-4b7593f42f63, Duration=5ms)
2022-07-05T18:02:12.342 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=ad30a40b-180c-4f51-9bf5-bda9820a85ff)
2022-07-05T18:02:12.346 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=ad30a40b-180c-4f51-9bf5-bda9820a85ff, Duration=4ms)
2022-07-05T18:02:12.462 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=08286356-d481-4443-a11f-4ae253e9b44c)
2022-07-05T18:02:12.468 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=08286356-d481-4443-a11f-4ae253e9b44c, Duration=7ms)
2022-07-05T18:02:12.506 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=ae054b84-a720-4251-ae9c-b21f230ac2ef)
2022-07-05T18:02:12.511 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=ae054b84-a720-4251-ae9c-b21f230ac2ef, Duration=6ms)
2022-07-05T18:02:14.144 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=4c7af4ff-1754-4ad8-9e5c-01cdc9e98e5d)
2022-07-05T18:02:14.202 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=4c7af4ff-1754-4ad8-9e5c-01cdc9e98e5d, Duration=86ms)
2022-07-05T18:02:14.478 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=76638f08-8a6c-435f-afe4-9c8273b2fe70)
2022-07-05T18:02:14.482 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=76638f08-8a6c-435f-afe4-9c8273b2fe70, Duration=5ms)
2022-07-05T18:02:14.545 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=47134761-ecb3-4bb5-acac-c391e442f0fb)
2022-07-05T18:02:14.549 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=47134761-ecb3-4bb5-acac-c391e442f0fb, Duration=4ms)
2022-07-05T18:02:14.739 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=03faca00-be54-4f04-bd56-bcaa39d50311)
2022-07-05T18:02:14.744 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=03faca00-be54-4f04-bd56-bcaa39d50311, Duration=6ms)
2022-07-05T18:02:14.840 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=899abbfd-06d8-4bf3-9173-d3564599fe42)
2022-07-05T18:02:14.843 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=899abbfd-06d8-4bf3-9173-d3564599fe42, Duration=4ms)
2022-07-05T18:02:15.524 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=cf333bea-d7cc-4d40-ad1f-e424c8c0dd86)
2022-07-05T18:02:15.527 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=cf333bea-d7cc-4d40-ad1f-e424c8c0dd86, Duration=4ms)
2022-07-05T18:02:15.922 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=9b2f5235-00d3-4f40-806c-5895a3a31a80)
2022-07-05T18:02:15.945 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=9b2f5235-00d3-4f40-806c-5895a3a31a80, Duration=59ms)
2022-07-05T18:02:16.288 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=f1de5058-cf12-44a4-a0b4-467f28dfb946)
2022-07-05T18:02:16.430 [Information] OpenLineageIn:{"eventType":"COMPLETE","eventTime":"2022-07-05T18:02:16.193Z","run":{"runId":"3d31bfdb-2444-431d-9daf-b1f8e5cfbdb6","facets":{"spark_version":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","spark-version":"3.1.2","openlineage-spark-version":"0.8.2"},"spark.logicalPlan":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ReplaceTableAsSelect","num-children":1,"catalog":null,"tableName":null,"partitioning":[],"query":0,"properties":null,"writeOptions":null,"orCreate":true},{"class":"org.apache.spark.sql.catalyst.plans.logical.Aggregate","num-children":1,"groupingExpressions":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"code_postal","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":941,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"lat","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":942,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"longt","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":943,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}]],"aggregateExpressions":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"code_postal","dataTy
pe":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":941,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"lat","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":942,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"longt","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":943,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}]],"child":0},{"class":"org.apache.spark.sql.execution.datasources.LogicalRelation","num-children":0,"relation":null,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"code_postal","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":941,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"lat","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":942,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"longt","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":943,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}]],"isStreaming":false}]}}},"job":{"namespace":"adb-8545493853080656.16#0525-210434-9gzshgxv","name":"databricks_shell.atomic_replace_table_as_select",
"facets":{}},"inputs":[{"namespace":"abfss://reference@[REDACTED].dfs.core.windows.net","name":"/postal_code_geocoder","facets":{"dataSource":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet","name":"abfss://reference@[REDACTED].dfs.core.windows.net","uri":"abfss://reference@[REDACTED].dfs.core.windows.net"},"schema":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet","fields":[{"name":"code_postal","type":"string"},{"name":"lat","type":"double"},{"name":"longt","type":"double"}]}},"inputFacets":{}}],"outputs":[{"namespace":"abfss://reference@[REDACTED].dfs.core.windows.net","name":"postal_code_qc_geocoder","facets":{"dataSource":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet","name":"abfss://reference@[REDACTED].dfs.core.windows.net","uri":"abfss://reference@[REDACTED].dfs.core.windows.net"},"schema":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet","fields":[{"name":"code_postal","type":"string"},{"name":"lat","type":"double"},{"name":"longt","type":"double"}]},"lifecycleStateChange":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet","lifecycleStateChange":"OVERWRITE"},"tableProvider":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/Ope
nLineage.json#/$defs/DatasetFacet","provider":"delta","format":"parquet"}},"outputFacets":{"outputStatistics":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":0,"size":0}}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"}
2022-07-05T18:02:16.430 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=f1de5058-cf12-44a4-a0b4-467f28dfb946, Duration=143ms)
2022-07-05T18:02:16.848 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=3d1e1046-ef87-403d-9967-6a8b3e9ab9ce)
2022-07-05T18:02:16.854 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=3d1e1046-ef87-403d-9967-6a8b3e9ab9ce, Duration=8ms)
2022-07-05T18:02:16.937 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=3433fe22-0a07-4bc4-b383-52b249662a7a)
2022-07-05T18:02:16.941 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=3433fe22-0a07-4bc4-b383-52b249662a7a, Duration=6ms)
2022-07-05T18:02:17.049 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=7b51f71d-4b46-41ca-978b-191faac20ede)
2022-07-05T18:02:17.062 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=7b51f71d-4b46-41ca-978b-191faac20ede, Duration=15ms)
2022-07-05T18:02:17.112 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=e4c9c5a8-e716-4e34-9fb6-474ff1f2fff5)
2022-07-05T18:02:17.116 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=e4c9c5a8-e716-4e34-9fb6-474ff1f2fff5, Duration=7ms)
2022-07-05T18:02:17.762 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=085a1edf-bff5-4c07-9219-c0c6ad1f5e1b)
2022-07-05T18:02:17.765 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=085a1edf-bff5-4c07-9219-c0c6ad1f5e1b, Duration=4ms)
2022-07-05T18:02:18.074 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=4cf7649b-f8d6-4d44-8f5b-1ad818576565)
2022-07-05T18:02:18.083 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=4cf7649b-f8d6-4d44-8f5b-1ad818576565, Duration=24ms)
2022-07-05T18:02:18.218 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=7c532305-484c-4e66-b7e5-7145c5f34466)
2022-07-05T18:02:18.228 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=7c532305-484c-4e66-b7e5-7145c5f34466, Duration=17ms)
2022-07-05T18:02:18.284 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=e26dc708-bb73-4b21-9fb8-93e888d22910)
2022-07-05T18:02:18.296 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=e26dc708-bb73-4b21-9fb8-93e888d22910, Duration=19ms)
2022-07-05T18:02:18.455 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=c8770f1c-fcfb-41fa-a89e-8e0b518e8e4b)
2022-07-05T18:02:18.464 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=c8770f1c-fcfb-41fa-a89e-8e0b518e8e4b, Duration=15ms)
2022-07-05T18:02:18.511 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=586466b5-0935-43e1-aab0-7e22ecf97e50)
2022-07-05T18:02:18.523 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=586466b5-0935-43e1-aab0-7e22ecf97e50, Duration=19ms)
2022-07-05T18:02:19.283 [Information] Executing 'Functions.OpenLineageIn' (Reason='This function was programmatically called via the host APIs.', Id=1aea2f93-54da-4e3a-9d67-09031e1b7731)
2022-07-05T18:02:19.385 [Information] OpenLineageIn:{"eventType":"COMPLETE","eventTime":"2022-07-05T18:02:19.179Z","run":{"runId":"3d31bfdb-2444-431d-9daf-b1f8e5cfbdb6","facets":{"spark_version":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","spark-version":"3.1.2","openlineage-spark-version":"0.8.2"},"spark.logicalPlan":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ReplaceTableAsSelect","num-children":1,"catalog":null,"tableName":null,"partitioning":[],"query":0,"properties":null,"writeOptions":null,"orCreate":true},{"class":"org.apache.spark.sql.catalyst.plans.logical.Aggregate","num-children":1,"groupingExpressions":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"code_postal","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":941,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"lat","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":942,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"longt","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":943,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}]],"aggregateExpressions":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"code_postal","dataTy
pe":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":941,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"lat","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":942,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"longt","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":943,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}]],"child":0},{"class":"org.apache.spark.sql.execution.datasources.LogicalRelation","num-children":0,"relation":null,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"code_postal","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":941,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"lat","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":942,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"longt","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":943,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}]],"isStreaming":false}]}}},"job":{"namespace":"adb-8545493853080656.16#0525-210434-9gzshgxv","name":"databricks_shell.atomic_replace_table_as_select",
"facets":{}},"inputs":[{"namespace":"abfss://reference@[REDACTED].dfs.core.windows.net","name":"/postal_code_geocoder","facets":{"dataSource":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet","name":"abfss://reference@[REDACTED].dfs.core.windows.net","uri":"abfss://reference@[REDACTED].dfs.core.windows.net"},"schema":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet","fields":[{"name":"code_postal","type":"string"},{"name":"lat","type":"double"},{"name":"longt","type":"double"}]}},"inputFacets":{}}],"outputs":[{"namespace":"abfss://reference@[REDACTED].dfs.core.windows.net","name":"postal_code_qc_geocoder","facets":{"dataSource":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet","name":"abfss://reference@[REDACTED].dfs.core.windows.net","uri":"abfss://reference@[REDACTED].dfs.core.windows.net"},"schema":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet","fields":[{"name":"code_postal","type":"string"},{"name":"lat","type":"double"},{"name":"longt","type":"double"}]},"lifecycleStateChange":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet","lifecycleStateChange":"OVERWRITE"},"tableProvider":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/Ope
nLineage.json#/$defs/DatasetFacet","provider":"delta","format":"parquet"}},"outputFacets":{}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"}
2022-07-05T18:02:19.386 [Information] Executed 'Functions.OpenLineageIn' (Succeeded, Id=1aea2f93-54da-4e3a-9d67-09031e1b7731, Duration=104ms)
2022-07-05T18:03:56 No new trace in the past 1 min(s).

PurviewOut Logs:

Connected!
2022-07-05T18:02:16.430 [Information] Executing 'Functions.PurviewOut' (Reason='(null)', Id=2217584a-2174-4ce5-a044-0097b04bd4aa)
2022-07-05T18:02:16.431 [Information] Trigger Details: PartionId: 0, Offset: 104144, EnqueueTimeUtc: 2022-07-05T18:02:16.3890000Z, SequenceNumber: 16, Count: 1
2022-07-05T18:02:16.437 [Information] Token to access Purview was generated
2022-07-05T18:02:16.437 [Information] Got Purview Client!
2022-07-05T18:02:16.438 [Information] Enter PurviewOut
2022-07-05T18:02:16.516 [Warning] Start event was missing, retrying to consolodate message. Retry count: 1
2022-07-05T18:02:17.548 [Warning] Start event was missing, retrying to consolodate message. Retry count: 2
2022-07-05T18:02:18.569 [Warning] Start event was missing, retrying to consolodate message. Retry count: 3
2022-07-05T18:02:19.580 [Warning] Start event was missing, retrying to consolodate message. Retry count: 4
2022-07-05T18:02:20.587 [Warning] Start event was missing, retrying to consolodate message. Retry count: 5
2022-07-05T18:02:20.587 [Information] Start event or no context found - eventData: {"eventType":"COMPLETE","eventTime":"2022-07-05T18:02:16.193Z","run":{"runId":"3d31bfdb-2444-431d-9daf-b1f8e5cfbdb6","facets":{"spark_version":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","spark-version":"3.1.2","openlineage-spark-version":"0.8.2"},"spark.logicalPlan":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ReplaceTableAsSelect","num-children":1,"catalog":null,"tableName":null,"partitioning":[],"query":0,"properties":null,"writeOptions":null,"orCreate":true},{"class":"org.apache.spark.sql.catalyst.plans.logical.Aggregate","num-children":1,"groupingExpressions":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"code_postal","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":941,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"lat","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":942,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"longt","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":943,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}]],"aggregateExpressions":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children"
:0,"name":"code_postal","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":941,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"lat","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":942,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"longt","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":943,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}]],"child":0},{"class":"org.apache.spark.sql.execution.datasources.LogicalRelation","num-children":0,"relation":null,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"code_postal","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":941,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"lat","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":942,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"longt","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":943,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}]],"isStreaming":false}]}}},"job":{"namespace":"adb-8545493853080656.16#0525-210434-9gzshgxv","name":"databricks_shell.a
tomic_replace_table_as_select","facets":{}},"inputs":[{"namespace":"abfss://reference@[REDACTED].dfs.core.windows.net","name":"/postal_code_geocoder","facets":{"dataSource":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet","name":"abfss://reference@[REDACTED].dfs.core.windows.net","uri":"abfss://reference@[REDACTED].dfs.core.windows.net"},"schema":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet","fields":[{"name":"code_postal","type":"string"},{"name":"lat","type":"double"},{"name":"longt","type":"double"}]}},"inputFacets":{}}],"outputs":[{"namespace":"abfss://reference@[REDACTED].dfs.core.windows.net","name":"postal_code_qc_geocoder","facets":{"dataSource":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet","name":"abfss://reference@[REDACTED].dfs.core.windows.net","uri":"abfss://reference@[REDACTED].dfs.core.windows.net"},"schema":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet","fields":[{"name":"code_postal","type":"string"},{"name":"lat","type":"double"},{"name":"longt","type":"double"}]},"lifecycleStateChange":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet","lifecycleStateChange":"OVERWRITE"},"tableProvider":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https:
//openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/DatasetFacet","provider":"delta","format":"parquet"}},"outputFacets":{"outputStatistics":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":0,"size":0}}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"}
2022-07-05T18:02:20.588 [Information] Executed 'Functions.PurviewOut' (Succeeded, Id=2217584a-2174-4ce5-a044-0097b04bd4aa, Duration=4157ms)
2022-07-05T18:02:20.645 [Information] Executing 'Functions.PurviewOut' (Reason='(null)', Id=bf2a018e-b0f9-4334-9bc6-a15d1e9214b7)
2022-07-05T18:02:20.645 [Information] Trigger Details: PartionId: 0, Offset: 110792, EnqueueTimeUtc: 2022-07-05T18:02:19.3730000Z, SequenceNumber: 17, Count: 1
2022-07-05T18:02:20.650 [Information] Token to access Purview was generated
2022-07-05T18:02:20.650 [Information] Got Purview Client!
2022-07-05T18:02:20.650 [Information] Enter PurviewOut
2022-07-05T18:02:20.665 [Warning] Start event was missing, retrying to consolodate message. Retry count: 1
2022-07-05T18:02:21.690 [Warning] Start event was missing, retrying to consolodate message. Retry count: 2
2022-07-05T18:02:22.704 [Warning] Start event was missing, retrying to consolodate message. Retry count: 3
2022-07-05T18:02:23.714 [Warning] Start event was missing, retrying to consolodate message. Retry count: 4
2022-07-05T18:02:24.734 [Warning] Start event was missing, retrying to consolodate message. Retry count: 5
2022-07-05T18:02:24.734 [Information] Start event or no context found - eventData: {"eventType":"COMPLETE","eventTime":"2022-07-05T18:02:19.179Z","run":{"runId":"3d31bfdb-2444-431d-9daf-b1f8e5cfbdb6","facets":{"spark_version":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","spark-version":"3.1.2","openlineage-spark-version":"0.8.2"},"spark.logicalPlan":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ReplaceTableAsSelect","num-children":1,"catalog":null,"tableName":null,"partitioning":[],"query":0,"properties":null,"writeOptions":null,"orCreate":true},{"class":"org.apache.spark.sql.catalyst.plans.logical.Aggregate","num-children":1,"groupingExpressions":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"code_postal","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":941,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"lat","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":942,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"longt","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":943,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}]],"aggregateExpressions":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children"
:0,"name":"code_postal","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":941,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"lat","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":942,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"longt","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":943,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}]],"child":0},{"class":"org.apache.spark.sql.execution.datasources.LogicalRelation","num-children":0,"relation":null,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"code_postal","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":941,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"lat","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":942,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"longt","dataType":"double","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":943,"jvmId":"5935d995-1ace-4dbd-9b10-0a62adfc0d38"},"qualifier":[]}]],"isStreaming":false}]}}},"job":{"namespace":"adb-8545493853080656.16#0525-210434-9gzshgxv","name":"databricks_shell.a
tomic_replace_table_as_select","facets":{}},"inputs":[{"namespace":"abfss://reference@[REDACTED].dfs.core.windows.net","name":"/postal_code_geocoder","facets":{"dataSource":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet","name":"abfss://reference@[REDACTED].dfs.core.windows.net","uri":"abfss://reference@[REDACTED].dfs.core.windows.net"},"schema":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet","fields":[{"name":"code_postal","type":"string"},{"name":"lat","type":"double"},{"name":"longt","type":"double"}]}},"inputFacets":{}}],"outputs":[{"namespace":"abfss://reference@[REDACTED].dfs.core.windows.net","name":"postal_code_qc_geocoder","facets":{"dataSource":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet","name":"abfss://reference@[REDACTED].dfs.core.windows.net","uri":"abfss://reference@[REDACTED].dfs.core.windows.net"},"schema":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet","fields":[{"name":"code_postal","type":"string"},{"name":"lat","type":"double"},{"name":"longt","type":"double"}]},"lifecycleStateChange":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet","lifecycleStateChange":"OVERWRITE"},"tableProvider":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","_schemaURL":"https:
//openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/DatasetFacet","provider":"delta","format":"parquet"}},"outputFacets":{}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"}
2022-07-05T18:02:24.734 [Information] Executed 'Functions.PurviewOut' (Succeeded, Id=bf2a018e-b0f9-4334-9bc6-a15d1e9214b7, Duration=4089ms)

I have also turned on log capture for Purview, and I don't see any incoming calls from the Azure Function.
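For reference, the "Start event was missing, retrying" behavior visible in the PurviewOut logs above can be sketched as follows. This is a simplified illustration with hypothetical names, not the function's actual code; the real function consumes events from Event Hubs and sleeps between retries.

```python
# Pair a COMPLETE event with its START event by runId, with a bounded
# retry budget, mirroring the retry warnings in the logs above.
def consolidate(complete_event, start_events, max_retries=5):
    """Return (start, complete); start is None when no START event arrived."""
    run_id = complete_event["run"]["runId"]
    for _ in range(max_retries):
        start = start_events.get(run_id)
        if start is not None:
            return start, complete_event
        # The real function waits about a second here before re-checking.
    return None, complete_event
```

When the START event never arrives (as in the logs above), the COMPLETE event is processed on its own after the retry budget is exhausted.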

Databricks "openlineage-demo" cluster cannot start

Hello,
After deploying the demo using "deploy-demo.md", the Databricks cluster could not start due to the following error:

Summary

Message
Cluster terminated. Reason: Secret resolution error

Help
Internal error resolving secrets.

Error message: Failed to fetch secrets referred to in Spark Conf

JSON

{ "reason": { "code": "SECRET_RESOLUTION_ERROR", "type": "SERVICE_FAULT", "parameters": { "databricks_error_message": "Failed to fetch secrets referred to in Spark Conf" } } }

I checked the secrets, and the Key Vault secret value matched the service principal's secret (the service principal has the Data Curator role on the Purview account).
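For anyone hitting the same error: Databricks resolves secrets referenced in the Spark config via the `{{secrets/<scope>/<key>}}` syntax, and a SECRET_RESOLUTION_ERROR typically means the referenced scope or key does not exist in the workspace, or the reference itself is malformed. A minimal sketch of validating that syntax; the scope and key names in the test are made-up examples, not the accelerator's actual names.

```python
import re

# Matches the Databricks secret reference syntax {{secrets/<scope>/<key>}}.
SECRET_REF = re.compile(r"^\{\{secrets/([^/{}\s]+)/([^/{}\s]+)\}\}$")

def parse_secret_ref(value: str):
    """Return (scope, key) when value is a well-formed secret reference."""
    match = SECRET_REF.match(value)
    return (match.group(1), match.group(2)) if match else None
```

If the reference parses but resolution still fails, check that the secret scope exists in the workspace and that the key name matches the Key Vault secret exactly.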

Picture link in readme is broken

Describe the bug
I just learned about this project from an internal talk and opened it. The picture link in the README is broken.

To Reproduce
Steps to reproduce the behavior:

  1. e.g. data sources and destination being used
  2. e.g. code snippet to generate the error

Expected behavior
A clear and concise description of what you expected to happen.

Logs

  1. Please include any Spark code being run that generates this error
  2. Please include a gist to the OpenLineageIn and PurviewOut logs
  3. See how to stream Azure Function Logs

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. Windows, Mac]
  • OpenLineage Version: [e.g. name of jar]
  • Databricks Runtime Version: [e.g. 6.4, 9.1, 10.1]
  • Cluster Type: [e.g. Job, Interactive]
  • Cluster Mode: [e.g. Standard, High Concurrency, Single]
  • Using Credential Passthrough: [e.g. Yes, No]

Additional context
Add any other context about the problem here.

Swapping Placeholder for Scanned Asset May Fail for Resource Sets and folders with special characters

Describe the bug
Given a placeholder asset of https://foo.dfs.core.windows.net/bar/baz.parquet, it may not match the resource set returned by the scan, https://foo.dfs.core.windows.net/bar/baz.parquet/{SparkPartitions}.

To Reproduce
Steps to reproduce the behavior:

  1. Execute a job with spark.write.parquet("https://foo.dfs.core.windows.net/bar/baz.parquet")

Expected behavior
It should match https://foo.dfs.core.windows.net/bar/baz.parquet/{SparkPartitions}.
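A suffix-tolerant comparison would handle this case. A minimal sketch, assuming the placeholder and the scanned resource set differ only by a trailing pattern segment such as "/{SparkPartitions}" or "/{N}"; the function names are hypothetical, not the accelerator's actual code.

```python
import re

# A trailing resource-set segment like "/{SparkPartitions}" or "/{N}".
RESOURCE_SET_SUFFIX = re.compile(r"/\{[^/}]+\}$")

def normalize(qualified_name: str) -> str:
    """Strip one trailing resource-set segment, if present."""
    return RESOURCE_SET_SUFFIX.sub("", qualified_name.rstrip("/"))

def matches(placeholder: str, scanned: str) -> bool:
    """Compare a placeholder asset name with a scanned resource-set name."""
    return normalize(placeholder) == normalize(scanned)
```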

Logs
N/A

Screenshots
N/A

Desktop (please complete the following information):

  • OS: N/A
  • OpenLineage Version: 13.1
  • Databricks Runtime Version: 10.4
  • Cluster Type: Job and Interactive
  • Cluster Mode: Standard
  • Using Credential Passthrough: ?

Additional context
Add any other context about the problem here.

Databricks lineage with mounts not getting captured for 11.3 LTS version

Describe the bug
Databricks lineage involving mount points is not captured on the 11.3 LTS runtime.

The same code with the same configuration works fine on the 9.1 LTS runtime.

To Reproduce

  1. Followed the steps in the connector-only deployment.
  2. Set up an interactive cluster on the 11.3 LTS runtime.
  3. Lineage is not captured when using mount paths on ADLS Gen1 (/mnt/mount_name).
  4. Lineage is captured fine when the direct ADLS Gen1 URI is used instead of the mount (adl://path).
  5. Lineage is captured properly on the 9.1 LTS runtime.

Attaching logs from the Function app:
dbfs_out_file.txt
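To illustrate what the connector has to do for mounted paths: translate the /mnt path back to the underlying storage URI before emitting lineage. A minimal sketch using the example paths from this issue; in a real notebook the mount table would come from dbutils.fs.mounts(), but here it is hard-coded.

```python
# Hard-coded stand-in for dbutils.fs.mounts(); values from this issue.
MOUNTS = {
    "/mnt/mount_name": "adl://path",
}

def resolve_mount(path: str) -> str:
    """Rewrite a mounted path to its direct storage URI, if it is mounted."""
    # Longest mount point first, so nested mounts resolve correctly.
    for mount_point, source in sorted(
        MOUNTS.items(), key=lambda kv: len(kv[0]), reverse=True
    ):
        if path == mount_point or path.startswith(mount_point + "/"):
            return source + path[len(mount_point):]
    return path  # not under a mount; assume it is already a direct URI
```

When this translation does not happen (as appears to be the case on 11.3 LTS), the lineage event carries an unmapped dbfs path and no asset match occurs.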

Databricks notebook task not found

Describe the bug
Databricks notebook task not found

To Reproduce
Steps to reproduce the behavior:

  1. Run through the Purview-ADB-Lineage-Solution-Accelerator deployment.
  2. In the Databricks objects, look for Databricks notebook tasks.

Expected behavior
Expecting 67 objects loaded:
image
Only finding 3:
image

Expecting Databricks notebook tasks:
image

Unable to find:
image
image

Logs

  1. Please include any Spark code being run that generates this error
  2. Please include a gist to the OpenLineageIn and PurviewOut logs
  3. See how to stream Azure Function Logs

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. Windows, Mac]
  • OpenLineage Version: [e.g. name of jar]
  • Databricks Runtime Version: [e.g. 6.4, 9.1, 10.1]
  • Cluster Type: [e.g. Job, Interactive]
  • Cluster Mode: [e.g. Standard, High Concurrency, Single]
  • Using Credential Passthrough: [e.g. Yes, No]

Additional context
Add any other context about the problem here.

Authentication token expiration is not checked if retrieved from cache

The expiration of the authentication token is not checked when it is retrieved from the cache, so an expired token can be returned to the calling function.
None of the calling functions currently check the token's expiration either.
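A minimal sketch of an expiry-aware cache that would close this gap. The names and the refresh-skew value are assumptions, not the accelerator's actual implementation.

```python
import time

class TokenCache:
    """Cache a token and refresh it before it expires.

    `fetch` is a callable returning (token, expires_on) where expires_on
    is an epoch timestamp.
    """

    def __init__(self, fetch, skew_seconds=300):
        self._fetch = fetch
        self._skew = skew_seconds  # refresh slightly before real expiry
        self._token = None
        self._expires_on = 0.0

    def get(self, now=None):
        now = time.time() if now is None else now
        # Refresh when the cached token is missing or about to expire,
        # instead of blindly returning whatever is cached.
        if self._token is None or now >= self._expires_on - self._skew:
            self._token, self._expires_on = self._fetch()
        return self._token
```

The skew window matters because a token that is valid when retrieved can expire mid-request; refreshing a few minutes early avoids that race.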

The source Databricks is not registered, Integration Runtime is not created, Scan is not created

Carefully following your instructions, we successfully deployed the demo version of the accelerator with no errors. Azure Databricks, Purview, ADLS, the Azure Function app (along with its two in and out functions), the example data sources, and one notebook were all installed. However, the following items were not installed:

  1. Databricks as a source was not registered in the Purview account.
  2. Integration Runtime under Purview Portal-->Data Map-->Source Management is empty.
  3. Purview Portal-->Management-->Security and Access-->Credential-->Manage Key Vault Connection is empty, and hence there is no credential created.

After I manually registered the Databricks as a source in Purview, I tried to manually create a scan but, as shown below, Integration Runtime and Credential are required.

Questions: a) Are steps 1-3 above not required for this connector, or should the installation process have completed them? b) Should we complete the above three steps manually?

image

Remarks: I saw your single notebook abfss-in-abfss-out-olsample with 4 cells in Databricks. That notebook ran successfully, but no lineage was created (I assume because the above 3 steps were missing and hence no scan was running). Can you please clarify what we may be missing? I went through your troubleshooting doc but found nothing that appears to be missing from the installation.

OpenLineageIn shows neither errors nor successes; the same is true for PurviewOut:
image

Incorrect entity search results interpretation while searching already existing items in Purview

Describe the bug
Incorrect entity search results interpretation while searching already existing items in Purview

To Reproduce
Steps to reproduce the behavior:

  1. Load data from ADLS Gen2 (in my case, https://datahub.io/core/airport-codes/) using an augmented Databricks cluster set up according to the instructions, with a Python/Scala notebook.
  2. Transform the data, compute several aggregates, and write the aggregates out to several similarly named Delta Lake tables, for example "airport-per-country", "airport-per-type", and "airport-per-region".
  3. Look at Purview: instead of the notebook's 3 outputs, it shows only one, named after the first table written to ADLS Gen2.

Expected behavior
3 outputs instead of 1.

Logs
The code in function-app/adb-to-purview/src/Function.Domain/Helpers/PurviewCustomType.cs was modified to log the queries sent to Purview:

    var uid = Guid.NewGuid().ToString("N");
    _logger.LogInformation($"QUERY {uid} SEARCH {filter["filter"]}");
    List<QueryValeuModel> results = await this._client.Query_entities(filter["filter"]!);
    _logger.LogInformation($"QUERY {uid} RESULT {JsonConvert.SerializeObject(results, Formatting.None)}");

The output of this logging is:

2022-07-19T09:51:29.204 [Information] QUERY d7fd7d9195394014aac86cb64593b086 SEARCH {"and": [{"attributeName": "qualifiedName","operator": "contains","attributeValue": "https:%22%7D,%7B%22or%22: [{"attributeName": "qualifiedName","operator": "contains","attributeValue": "letarget.blob.core.windows.net"},{"attributeName": "qualifiedName","operator": "contains","attributeValue": "letarget.dfs.core.windows.net"}]},{"attributeName": "qualifiedName","operator": "contains","attributeValue": "containerout"},{"attributeName": "qualifiedName","operator": "contains","attributeValue": "delta-out"},{"attributeName": "qualifiedName","operator": "contains","attributeValue": "airport-per-type"}]}

2022-07-19T09:51:29.331 [Information] QUERY d7fd7d9195394014aac86cb64593b086 RESULT [{"owner":null,"qualifiedName":https://letarget.blob.core.windows.net/containerout/delta-out/airport-per-country,"entityType":"purview_custom_connector_generic_entity_with_columns","name":"airport-per-country","description":"Data Assets airport-per-country","term":null,"id":"cc4d36c7-865e-4f24-8817-fcf665e97138","label":null,"classification":null,"collectionId":null,"assetType":["Purview Custom Connector"],"@search.highlights":{"qualifiedName":["<em>https</em>://<em>letarget.blob.core.windows.net</em>/<em>containerout</em>/<em>delta</em>-<em>out</em>/<em>airport</em>-<em>per</em>-country"],"name":["<em>airport</em>-<em>per</em>-country"],"description":["Data Assets <em>airport</em>-<em>per</em>-country"]},"@search.score":3.2986124}]

As you can see, Purview does not return an exact match here but a suggestion with a different name ("airport-per-country" instead of the requested "airport-per-type"), yet the code treats it as an exact match.
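A defensive post-filter on the search results would avoid this. The sketch below is illustrative only (it is not code from the accelerator): it compares the returned qualifiedName against the expected path segments instead of trusting the search ranking.

```python
def is_exact_match(result: dict, required_segments: list[str]) -> bool:
    """Accept a Purview search result only if every expected path segment
    appears verbatim in its qualifiedName (case-insensitive)."""
    qualified_name = (result.get("qualifiedName") or "").lower()
    return all(seg.lower() in qualified_name for seg in required_segments)

# The search above asked for "airport-per-type" but the top result was
# "airport-per-country"; a segment check rejects that suggestion:
suggestion = {
    "qualifiedName": "https://letarget.blob.core.windows.net/containerout/delta-out/airport-per-country"
}
print(is_exact_match(suggestion, ["containerout", "delta-out", "airport-per-type"]))  # False
```

Applying such a check between the search call and the "found existing entity" branch would make the function fall through to creating a new entity instead of reusing the wrong one.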

Desktop (please complete the following information):

  • OS: Azure
  • OpenLineage Version: 0.10
  • Databricks Runtime Version: 9.1LTS
  • Cluster Type: Job
  • Cluster Mode: Single, Standard
  • Using Credential Passthrough: No

Purview Lineage metadata reporting Issue

I configured this connector using the accompanying script and performed the following steps:

  1. Created a python script in ADB notebook
  2. Mounted Azure Blob Storage (Gen2) using dbfs over a mountpoint
  3. Invoked Microsoft Graph API and saved the output as csv
  4. Copied the csv to Blob Storage mountpoint
  5. Read the csv back from the mountpoint to a spark dataframe
  6. Did some ETL operations like removed rows with null values (using spark drop function), renamed columns (using withColumnRenamed) and stored the results in a new dataframe
  7. Using a jdbc connector, copied the dataframe (from step 6) to a Synapse SQL datapool
  8. Triggered a stored procedure in Synapse from the ADB notebook

Now, when I go to the Purview portal and browse the ADB notebook asset, I see only the following reported in the lineage visualization:

[screenshot of the lineage visualization omitted]
None of the ETL operations from step 6 are shown. I understand that stored procedures are not reported, but I was expecting at least the ETL operations to show up. Am I doing something wrong?

ADB cluster JSON is malformed

While deploying the "demo deployment", the JSON for the ADB cluster creation is getting malformed for some reason, and the cluster is not being created by the script.

[screenshot omitted]

Here is the error message in the CLI: [screenshot omitted]

Where is this key located: spark.openlineage.samplestorageaccount

The working connector deployment completed successfully. But when I run the sample notebook abfss-in-abfss-out-olsample I get the following error on the code shown below:

NoSuchElementException: spark.openlineage.samplestorageaccount

Code:

import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

val storageServiceName = spark.conf.get("spark.openlineage.samplestorageaccount")
val storageContainerName = spark.conf.get("spark.openlineage.samplestoragecontainer")
val adlsRootPath = "wasbs://" + storageContainerName + "@" + storageServiceName + ".blob.core.windows.net"

val storageKey = dbutils.secrets.get("purview-to-adb-kv", "storageAccessKey")

spark.conf.set("fs.azure.account.key." + storageServiceName + ".blob.core.windows.net", storageKey)

Question: Is the abfss-in-abfss-out-olsample notebook valid for the connector-only deployment as well? If so, where in the deployment can we find the key spark.openlineage.samplestorageaccount in Databricks?

Remarks:

  1. The Spark config section of the cluster's advanced options looks like the following. Its content is based on item 4 of the section Install OpenLineage on Your Databricks Cluster, and as seen below, the key spark.openlineage.samplestorageaccount is not present.
  2. I have verified that the function app's managed identity is in the Key Vault's access policy.
  3. During the execution of the sample notebook, both the Azure Function app and Azure Databricks cluster are running.
  4. OpenLineageIn and PurviewOut functions have no errors.
  5. In the Azure Function Configuration tab, all the Key Vault Referenced app settings have a green checkmark.

Spark config

spark.openlineage.version v1 
spark.openlineage.namespace https://adb-8514321158310101.1#0203-044343-1ehjmfc2
spark.openlineage.host https://https://functionappdbrvxka.azurewebsites.net
spark.openlineage.url.param.code "8JWhjPJE-v63S1az4IftDoekwn75CHS4j8ZblrshstxiAzFusbGibA=="
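The sample-storage keys appear to be set only by the demo deployment for its sample notebook, which would explain the NoSuchElementException under a connector-only deployment. As a sanity check before running a notebook, the config block above can be validated with a small sketch like the one below (a hypothetical helper, not part of the accelerator; the required-key list is inferred from this report, and the doubled-scheme check targets the `https://https://` visible in the pasted host value):

```python
# Keys inferred from the docs/sample notebook in this report (assumption):
REQUIRED_KEYS = [
    "spark.openlineage.version",
    "spark.openlineage.namespace",
    "spark.openlineage.host",
    "spark.openlineage.url.param.code",
    # only needed by the demo deployment's sample notebook:
    "spark.openlineage.samplestorageaccount",
    "spark.openlineage.samplestoragecontainer",
]

def validate_openlineage_conf(conf: dict) -> list[str]:
    """Report missing keys and an accidentally doubled URL scheme."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in conf]
    host = conf.get("spark.openlineage.host", "")
    if host.count("https://") > 1:
        problems.append("host contains a doubled scheme (https://https://...)")
    return problems

conf = {
    "spark.openlineage.version": "v1",
    "spark.openlineage.host": "https://https://functionappdbrvxka.azurewebsites.net",
}
for problem in validate_openlineage_conf(conf):
    print(problem)
```

Note that the pasted config also shows the doubled `https://https://` in spark.openlineage.host, which is worth fixing independently of the missing sample-storage key.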

Lineage not detected using Unity Catalog enabled Cluster

Hi there,

In our setup, we've been testing the connector with great success on a fairly basic all-purpose cluster in No Isolation Shared access mode.
Since we also aim to use Unity Catalog, we have tried the connector with a Unity Catalog-enabled cluster. Given that init scripts cannot be edited on UC clusters in Shared access mode, we have defined the init script at the global level.
We've verified that this global init script still makes the lineage connector work for the non-UC cluster, but we are not able to make it work for the UC cluster.

It seems no messages are reaching the Azure Functions, even though the clusters have identical setups apart from the Shared vs. No Isolation Shared property that enables Unity Catalog in our workspace.

Any idea how to make sure that we can track Lineage using the Unity Catalog enabled cluster?
Thanks!
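One way to narrow this down is to check whether the function endpoint is reachable at all, independently of the cluster, by posting a test event to it directly (e.g. with curl). A minimal sketch of building that URL from the Spark config values, assuming the OpenLineage HTTP transport's default /api/v1/lineage path (names here are illustrative):

```python
def lineage_endpoint(host: str, code: str) -> str:
    """URL the OpenLineage Spark listener posts events to, assuming the
    default /api/v1/lineage path; 'code' is the function app's access key
    (the spark.openlineage.url.param.code value)."""
    return f"{host.rstrip('/')}/api/v1/lineage?code={code}"

# POST a minimal OpenLineage event here with curl to confirm reachability:
print(lineage_endpoint("https://functionapp.azurewebsites.net", "<access-key>"))
# → https://functionapp.azurewebsites.net/api/v1/lineage?code=<access-key>
```

If a manual POST reaches the function but the UC cluster produces nothing, the problem is on the listener side (e.g. the global init script not actually installing the jar on Shared access mode clusters) rather than in the function app.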

Desktop (please complete the following information):

  • OS: Windows
  • OpenLineage Version: 0.17.0
  • Databricks Runtime Version: 11.3 LTS
  • Cluster Type: Interactive - Standard DS3_v2
  • Cluster Mode: Unrestricted / Shared Access Mode (UC Enabled)
  • Using Credential Passthrough: No

Purview function out returns - Forbidden

Describe the bug
Purview function out returns - Forbidden
To Reproduce
Steps to reproduce the behavior:

  1. Setup connector-only deployment
  2. Run notebook that produces lineage

Expected behavior
Lineage shows up on the Purview portal; instead, the function logs Forbidden errors and nothing appears.
Logs
2022-06-28 15:50:11.071
Executing 'Functions.PurviewOut' (Reason='(null)', Id=a35694d7-3354-4584-99c1-b941a6db96bc)
Information
2022-06-28 15:50:11.071
Trigger Details: PartionId: 0, Offset: 1635232, EnqueueTimeUtc: 2022-06-28T15:50:11.0590000Z, SequenceNumber: 233, Count: 1
Information
2022-06-28 15:50:11.076
Token to access Purview was generated
Information
2022-06-28 15:50:11.077
Got Purview Client!
Information
2022-06-28 15:50:11.077
Enter PurviewOut
Information
2022-06-28 15:50:11.153
OlToPurviewParsingService-GetPurviewFromOlEvent
Error
2022-06-28 15:50:11.154
PurviewOut: {"entities": [{"typeName":"spark_application","guid":"-1","attributes":{"name":"order_data_merge","appType":"notebook","qualifiedName":"notebook://users/[email protected]/order_data_merge"}},{"typeName":"spark_process","guid":"-2","attributes":{"name":"/Users/[email protected]/Order_data_merge/mnt/cur/md_supradata_insights/purview/lineage/open_lineage_test/merged_order_data","qualifiedName":"sparkprocess://https://dtdeveusadls.dfs.core.windows.net/int/md_supradata_insights/purview/lineage/open_lineage_test/order_one_data_delta:https://dtdeveusadls.dfs.core.windows.net/cur/md_supradata_insights/purview/lineage/open_lineage_test/merged_order_data","columnMapping":"","executionId":"0","currUser":"","sparkPlanDescription":"{"_producer":"[https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark](https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark/)","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand","num-children":0,"query":[{"class":"org.apache.spark.sql.catalyst.plans.logical.Union","num-children":2,"children":[0,1]},{"class":"org.apache.spark.sql.execution.datasources.LogicalRelation","num-children":0,"relation":null,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_id","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3142,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_date","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3143,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeRefere
nce","num-children":0,"name":"product_details","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3144,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"quantity","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3145,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_total","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3146,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}]],"isStreaming":false},{"class":"org.apache.spark.sql.execution.datasources.LogicalRelation","num-children":0,"relation":null,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_id","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3152,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_date","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3153,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"product_description","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3154,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.express
ions.AttributeReference","num-children":0,"name":"quantity","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3155,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_total","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3156,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}]],"isStreaming":false}],"dataSource":null,"options":null,"mode":null}]}","inputs":[{"typeName":"azure_datalake_gen2_path","uniqueAttributes":{"qualifiedName":"https://dtdeveusadls.dfs.core.windows.net/int/md_supradata_insights/purview/lineage/open_lineage_test/order_one_data_delta"}},{"typeName":"azure_datalake_gen2_path","uniqueAttributes":{"qualifiedName":"https://dtdeveusadls.dfs.core.windows.net/int/md_supradata_insights/purview/lineage/open_lineage_test/order_two_data_delta"}}],"outputs":[{"typeName":"azure_datalake_gen2_path","uniqueAttributes":{"qualifiedName":"https://dtdeveusadls.dfs.core.windows.net/cur/md_supradata_insights/purview/lineage/open_lineage_test/merged_order_data"}}]},"relationshipAttributes":{"application":{"qualifiedName":"notebook://users/[email protected]/order_data_merge","guid":"-1"}}}]}
Information
2022-06-28 15:50:11.154
Calling SendToPurview
Information
2022-06-28 15:50:11.154
Not found Attribute columnMapping on { "typeName": "spark_application", "guid": "-1", "attributes": { "name": "order_data_merge", "appType": "notebook", "qualifiedName": "notebook://users/[email protected]/order_data_merge" } } i is not a Process Entity!
Information
2022-06-28 15:50:11.154
New Entity Initialized in the process with a passed Purview Client: Nome:order_data_merge - qualified_name:notebook://users/[email protected]/order_data_merge - Guid:-1000
Information
2022-06-28 15:50:11.375
New Entity Initialized in the process with a passed Purview Client: Nome:merged_order_data - qualified_name:https://dtdeveusadls.dfs.core.windows.net/cur/md_supradata_insights/purview/lineage/open_lineage_test/merged_order_data - Guid:-1002
Information
2022-06-28 15:50:11.417
Error Loading to Purview: Return Code: Forbidden - Reason:Forbidden
Error
2022-06-28 15:50:11.418
outputs Entity: https://dtdeveusadls.dfs.core.windows.net/cur/md_supradata_insights/purview/lineage/open_lineage_test/merged_order_data Type: azure_datalake_gen2_path, Not found, Creating Dummy Entity
Information
2022-06-28 15:50:11.418
Entities to load: { "entities": [ { "typeName": "purview_custom_connector_generic_entity_with_columns", "guid": -1000, "attributes": { "name": "order_one_data_delta", "qualifiedName": "https://dtdeveusadls.dfs.core.windows.net/int/md_supradata_insights/purview/lineage/open_lineage_test/order_one_data_delta", "data_type": "azure_datalake_gen2_path", "description": "Data Assets order_one_data_delta" }, "relationshipAttributes": {} }, { "typeName": "purview_custom_connector_generic_entity_with_columns", "guid": -1001, "attributes": { "name": "order_two_data_delta", "qualifiedName": "https://dtdeveusadls.dfs.core.windows.net/int/md_supradata_insights/purview/lineage/open_lineage_test/order_two_data_delta", "data_type": "azure_datalake_gen2_path", "description": "Data Assets order_two_data_delta" }, "relationshipAttributes": {} }, { "typeName": "purview_custom_connector_generic_entity_with_columns", "guid": -1002, "attributes": { "name": "merged_order_data", "qualifiedName": "https://dtdeveusadls.dfs.core.windows.net/cur/md_supradata_insights/purview/lineage/open_lineage_test/merged_order_data", "data_type": "azure_datalake_gen2_path", "description": "Data Assets merged_order_data" }, "relationshipAttributes": {} } ] }
Information
2022-06-28 15:50:11.418
Sending this payload to Purview: {"entities": [ { "typeName": "purview_custom_connector_generic_entity_with_columns", "guid": -1000, "attributes": { "name": "order_one_data_delta", "qualifiedName": "https://dtdeveusadls.dfs.core.windows.net/int/md_supradata_insights/purview/lineage/open_lineage_test/order_one_data_delta", "data_type": "azure_datalake_gen2_path", "description": "Data Assets order_one_data_delta" }, "relationshipAttributes": {} }, { "typeName": "purview_custom_connector_generic_entity_with_columns", "guid": -1001, "attributes": { "name": "order_two_data_delta", "qualifiedName": "https://dtdeveusadls.dfs.core.windows.net/int/md_supradata_insights/purview/lineage/open_lineage_test/order_two_data_delta", "data_type": "azure_datalake_gen2_path", "description": "Data Assets order_two_data_delta" }, "relationshipAttributes": {} }, { "typeName": "purview_custom_connector_generic_entity_with_columns", "guid": -1002, "attributes": { "name": "merged_order_data", "qualifiedName": "https://dtdeveusadls.dfs.core.windows.net/cur/md_supradata_insights/purview/lineage/open_lineage_test/merged_order_data", "data_type": "azure_datalake_gen2_path", "description": "Data Assets merged_order_data" }, "relationshipAttributes": {} } ]}
Information
2022-06-28 15:50:11.458
5c6c45d9-f3ba-437c-b1af-37370288541f
Error
2022-06-28 15:50:11.494
Error Loading to Purview: Return Code: Forbidden - Reason:Forbidden
Error
2022-06-28 15:50:11.494
Processes to load: { "entities": [ { "typeName": "spark_application", "guid": -1000, "attributes": { "name": "order_data_merge", "qualifiedName": "notebook://users/[email protected]/order_data_merge", "data_type": "spark_application", "description": "Data Assets order_data_merge" }, "relationshipAttributes": {} }, { "typeName": "spark_process", "guid": -1001, "attributes": { "name": "/Users/[email protected]/Order_data_merge/mnt/cur/md_supradata_insights/purview/lineage/open_lineage_test/merged_order_data", "qualifiedName": "sparkprocess://https://dtdeveusadls.dfs.core.windows.net/int/md_supradata_insights/purview/lineage/open_lineage_test/order_one_data_delta:https://dtdeveusadls.dfs.core.windows.net/cur/md_supradata_insights/purview/lineage/open_lineage_test/merged_order_data", "columnMapping": "", "executionId": "0", "currUser": "", "sparkPlanDescription": "{"_producer":"[https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark](https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark/)","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand","num-children":0,"query":[{"class":"org.apache.spark.sql.catalyst.plans.logical.Union","num-children":2,"children":[0,1]},{"class":"org.apache.spark.sql.execution.datasources.LogicalRelation","num-children":0,"relation":null,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_id","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3142,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_date","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3143,"
jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"product_details","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3144,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"quantity","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3145,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_total","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3146,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}]],"isStreaming":false},{"class":"org.apache.spark.sql.execution.datasources.LogicalRelation","num-children":0,"relation":null,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_id","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3152,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_date","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3153,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"product_description","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions
.ExprId","id":3154,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"quantity","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3155,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_total","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3156,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}]],"isStreaming":false}],"dataSource":null,"options":null,"mode":null}]}", "inputs": [ { "typeName": "purview_custom_connector_generic_entity_with_columns", "uniqueAttributes": { "qualifiedName": "https://dtdeveusadls.dfs.core.windows.net/int/md_supradata_insights/purview/lineage/open_lineage_test/order_one_data_delta" } }, { "typeName": "purview_custom_connector_generic_entity_with_columns", "uniqueAttributes": { "qualifiedName": "https://dtdeveusadls.dfs.core.windows.net/int/md_supradata_insights/purview/lineage/open_lineage_test/order_two_data_delta" } } ], "outputs": [ { "typeName": "purview_custom_connector_generic_entity_with_columns", "uniqueAttributes": { "qualifiedName": "https://dtdeveusadls.dfs.core.windows.net/cur/md_supradata_insights/purview/lineage/open_lineage_test/merged_order_data" } } ] }, "relationshipAttributes": { "application": { "qualifiedName": "notebook://users/[email protected]/order_data_merge", "guid": -1000 } } } ] }
Information
2022-06-28 15:50:11.495
Sending this payload to Purview: {"entities": [ { "typeName": "spark_application", "guid": -1000, "attributes": { "name": "order_data_merge", "qualifiedName": "notebook://users/[email protected]/order_data_merge", "data_type": "spark_application", "description": "Data Assets order_data_merge" }, "relationshipAttributes": {} }, { "typeName": "spark_process", "guid": -1001, "attributes": { "name": "/Users/[email protected]/Order_data_merge/mnt/cur/md_supradata_insights/purview/lineage/open_lineage_test/merged_order_data", "qualifiedName": "sparkprocess://https://dtdeveusadls.dfs.core.windows.net/int/md_supradata_insights/purview/lineage/open_lineage_test/order_one_data_delta:https://dtdeveusadls.dfs.core.windows.net/cur/md_supradata_insights/purview/lineage/open_lineage_test/merged_order_data", "columnMapping": "", "executionId": "0", "currUser": "", "sparkPlanDescription": "{"_producer":"[https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark](https://github.com/OpenLineage/OpenLineage/tree/0.8.2/integration/spark/)","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand","num-children":0,"query":[{"class":"org.apache.spark.sql.catalyst.plans.logical.Union","num-children":2,"children":[0,1]},{"class":"org.apache.spark.sql.execution.datasources.LogicalRelation","num-children":0,"relation":null,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_id","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3142,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_date","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId
","id":3143,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"product_details","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3144,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"quantity","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3145,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_total","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3146,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}]],"isStreaming":false},{"class":"org.apache.spark.sql.execution.datasources.LogicalRelation","num-children":0,"relation":null,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_id","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3152,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_date","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3153,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"product_description","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalys
t.expressions.ExprId","id":3154,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"quantity","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3155,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"order_total","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":3156,"jvmId":"46c5a018-e302-415c-b122-71819d519700"},"qualifier":[]}]],"isStreaming":false}],"dataSource":null,"options":null,"mode":null}]}", "inputs": [ { "typeName": "purview_custom_connector_generic_entity_with_columns", "uniqueAttributes": { "qualifiedName": "https://dtdeveusadls.dfs.core.windows.net/int/md_supradata_insights/purview/lineage/open_lineage_test/order_one_data_delta" } }, { "typeName": "purview_custom_connector_generic_entity_with_columns", "uniqueAttributes": { "qualifiedName": "https://dtdeveusadls.dfs.core.windows.net/int/md_supradata_insights/purview/lineage/open_lineage_test/order_two_data_delta" } } ], "outputs": [ { "typeName": "purview_custom_connector_generic_entity_with_columns", "uniqueAttributes": { "qualifiedName": "https://dtdeveusadls.dfs.core.windows.net/cur/md_supradata_insights/purview/lineage/open_lineage_test/merged_order_data" } } ] }, "relationshipAttributes": { "application": { "qualifiedName": "notebook://users/[email protected]/order_data_merge", "guid": -1000 } } } ]}
Information
2022-06-28 15:50:11.564
92710849-1f88-4934-9b9e-f8e2ea3f0baf
Error
2022-06-28 15:50:11.600
Error Loading to Purview: Return Code: Forbidden - Reason:Forbidden
Error
2022-06-28 15:50:11.659
Error Loading to Purview: Return Code: Forbidden - Reason:Forbidden
Error
2022-06-28 15:50:11.697
Error Loading to Purview: Return Code: Forbidden - Reason:Forbidden
Error
2022-06-28 15:50:11.736
Error Loading to Purview: Return Code: Forbidden - Reason:Forbidden
Error
2022-06-28 15:50:11.772
Error Loading to Purview: Return Code: Forbidden - Reason:Forbidden
Error
2022-06-28 15:50:11.772
Executed 'Functions.PurviewOut' (Succeeded, Id=a35694d7-3354-4584-99c1-b941a6db96bc, Duration=701ms)
Information


Desktop (please complete the following information):

  • OS: [e.g. Windows, Mac]
  • OpenLineage Version: 0.8.2
  • Databricks Runtime Version: 7.3
  • Cluster Type: [e.g. Job, Interactive]
  • Cluster Mode: [e.g. Standard, High Concurrency, Single]
  • Using Credential Passthrough: [e.g. Yes, No]

Additional context
Purview has Private endpoint
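Given the private endpoint noted above, a Forbidden from Purview usually points at either a missing role assignment or a network block. A hypothetical troubleshooting helper (not part of the accelerator) mapping the statuses seen in these logs to the usual first things to check:

```python
# Hypothetical helper: map HTTP statuses returned by Purview's Atlas
# endpoint to the causes usually worth checking first.
PURVIEW_HINTS = {
    401: "Unauthorized: token acquisition failed; check the function app's "
         "client id/secret or managed identity configuration.",
    403: "Forbidden: check that the function app's identity has the Data "
         "Curator role on the Purview (root) collection, and that the "
         "function can reach Purview through its private endpoint.",
    404: "Not found: check the Purview account name / Atlas endpoint URL.",
}

def purview_hint(status: int) -> str:
    return PURVIEW_HINTS.get(status, f"unexpected status {status}")

print(purview_hint(403))
```

With a private endpoint in place, the function app also needs network line-of-sight to it (VNet integration and private DNS resolution), since a public-network block can surface as the same Forbidden.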

Deployment Error

I followed the instructions in the video https://youtu.be/pLF0iykhruY; however, when executing the script openlineage-deployment.sh, I get the following error. Has anyone else faced this issue? I changed resource groups, locations, etc., but nothing seems to help.

{
  "code": "DeploymentFailed",
  "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.",
  "details": [
    {
      "message": "Encountered an error (ServiceUnavailable) from host runtime."
    }
  ]
}

I'm also getting other errors while deploying. What's confusing is that this script does not deploy any Databricks service at all. More error logs below.

2022-08-14 11:42:31 [INFO] start deployment in purview-eval
2022-08-14 11:42:31 [INFO] start deploying all openlineage required resources
including: FunctionApp, EventHub, StorageAccount, etc.
**ERROR: {"status":"Failed","error":{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.","details":[{"code":"BadRequest","message":"{\r\n  \"Code\": \"BadRequest\",\r\n  \"Message\": \"Encountered an error (ServiceUnavailable) from host runtime.\",\r\n  \"Target\": null,\r\n  \"Details\": [\r\n    {\r\n      \"Message\": \"Encountered an error (ServiceUnavailable) from host runtime.\"\r\n    },\r\n    {\r\n      \"Code\": \"BadRequest\"\r\n    },\r\n    {\r\n      \"ErrorEntity\": {\r\n        \"Code\": \"BadRequest\",\r\n        \"Message\": \"Encountered an error (ServiceUnavailable) from host runtime.\"\r\n      }\r\n    }\r\n  ],\r\n  \"Innererror\": null\r\n}"}]}}**
2022-08-14 11:44:21 [INFO] deploying all openlineage required resources FINISHED
**ERROR: argument --account-name/-n: expected one argument**

Examples from AI knowledge base:
az storage account keys list --resource-group MyResourceGroup --account-name MyStorageAccount
List the access keys for a storage account.

az storage account keys list --resource-group MyResourceGroup --account-name MyStorageAccount --expand-key-type kerb
List the access keys and Kerberos keys (if active directory enabled) for a storage account.

az storage account create --name mystorageaccount --resource-group MyResourceGroup --location westus --sku Standard_LRS
Create a storage account 'mystorageaccount' in resource group 'MyResourceGroup' in the West US region with locally redundant storage.

https://docs.microsoft.com/en-US/cli/azure/storage/account/keys#az_storage_account_keys_list
Read more about the command in reference docs
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6341k  100 6341k    0     0  2614k      0  0:00:02  0:00:02 --:--:-- 2614k
argument --account-name: expected one argument

Examples from AI knowledge base:
az storage container create --name MyStorageContainer
Create a storage container in a storage account.

az storage container create --name MyStorageContainer --public-access blob
Create a storage container in a storage account and allow public read access for blobs.

az storage account keys list --resource-group MyResourceGroup --account-name MyStorageAccount
List the access keys for a storage account.

https://docs.microsoft.com/en-US/cli/azure/storage/container#az_storage_container_create
Read more about the command in reference docs
**ERROR: argument --account-name: expected one argument**

Examples from AI knowledge base:
az storage blob upload --account-name mystorageaccount --account-key 0000-0000 --container-name mycontainer --file /path/to/file --name myblob
Upload a file to a storage blob. (autogenerated)

az storage blob upload --file /path/to/file --container-name MyContainer --name MyBlob
Upload to a blob.

https://docs.microsoft.com/en-US/cli/azure/storage/blob#az_storage_blob_upload
Read more about the command in reference docs
**ERROR: argument --account-name: expected one argument**

2022-08-14 11:44:28 [INFO] start deploying databricks workspace
2022-08-14 11:44:29 [INFO] start deployinig new databricks workspace
ERROR: argument --location/-l: expected one argument

Examples from AI knowledge base:
az databricks workspace create --resource-group MyResourceGroup --name MyWorkspace --location westus --sku standard
Create a workspace

az databricks workspace create --resource-group MyResourceGroup --name MyWorkspace --location eastus2euap --sku premium --prepare-encryption
Create a workspace with managed identity for storage account

https://aka.ms/cli_ref
Read more about the command in reference docs

Same event sent 4 times by OpenLineage

Describe the bug
OpenLineage is sending the same event 4 times.
To Reproduce
Steps to reproduce the behavior:

  1. Just execute any notebook

Expected behavior
Since this would save execution time and API calls, repeated events should be filtered out.

Screenshots
image
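The filtering suggested above could be sketched as follows. This is a minimal hypothetical sketch, not the accelerator's actual implementation; it assumes events can be keyed on the OpenLineage run id and event type, which are standard fields of an OpenLineage event.

```python
# Hypothetical duplicate filter for incoming OpenLineage events.
# Keys an event on (runId, eventType); repeats of the same pair are dropped.
seen = set()

def is_duplicate(event: dict) -> bool:
    key = (event["run"]["runId"], event["eventType"])
    if key in seen:
        return True
    seen.add(key)
    return False
```

In practice the set would need eviction (e.g. a TTL cache) so it does not grow without bound across long-lived function instances.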

ListenMessages Authorization Rule Resulting in MessagingGatewayConflict

Describe the bug
When running the connector-only deployment, it gives an Event Hub MessagingGatewayConflict error for the ListenMessages authorization rule.

To Reproduce
Steps to reproduce the behavior:

  1. Run the connector-only deployment from the release 2.1 branch

Expected behavior
It should deploy the connector completely.

Logs

{
    "status": "Failed",
    "error": {
        "code": "DeploymentFailed",
        "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.",
        "details": [
            {
                "code": "Conflict",
                "message": {
                    "status": "failed",
                    "error": {
                        "code": "ResourceDeploymentFailure",
                        "message": "The resource operation completed with terminal provisioning state 'failed'.",
                        "details": [
                            {
                                "code": "Failed",
                                "message": "ARM-MSDeploy Deploy Failed: 'System.Threading.ThreadAbortException: Thread was being aborted.
   at System.Collections.Generic.SortedList`2..ctor()
   at Microsoft.Web.Deployment.DeploymentBaseContext..ctor(DeploymentBaseOptions baseOptions, DeploymentObject sourceObject)
   at Microsoft.Web.Deployment.DeploymentManager.CreateObjectPrivate(DeploymentProviderContext providerContext, DeploymentBaseOptions baseOptions, DeploymentObject sourceObject, String serverVersion)
   at Microsoft.Web.Deployment.DeploymentManager.CreateObject(DeploymentProviderOptions providerOptions, DeploymentBaseOptions baseOptions)
   at Microsoft.Web.Deployment.DeploymentManager.CreateObject(String provider, String path, DeploymentBaseOptions baseOptions)
   at Microsoft.Web.Deployment.DeploymentManager.CreateObject(DeploymentWellKnownProvider provider, String path, DeploymentBaseOptions baseOptions)
   at Microsoft.Web.Deployment.WebApi.AppGalleryPackage.Deploy(String deploymentSite, String siteSlotId, Boolean doNotDelete)
   at Microsoft.Web.Deployment.WebApi.DeploymentController.&lt;DownloadAndDeployPackage&gt;d__25.MoveNext()'"
                            }
                        ]
                    }
                }
            },
            {
                "code": "Conflict",
                "message": {
                    "error": {
                        "code": "MessagingGatewayConflict",
                        "message": "<Error><Code>409</Code><Detail> TrackingId:abac1930-22e0-41b2-9af2-a06fbeaabae9_G23, SystemTracker:eventhubnsgh1142dhbd.servicebus.windows.net:$tenants/eventhubnsgh1142dhbd, Timestamp:2022-11-26T17:48:05</Detail></Error>"
                    }
                }
            }
        ]
    }
}

Desktop (please complete the following information):

  • OS: Cloudshell with Bash
  • OpenLineage Version: N/A
  • Databricks Runtime Version: N/A
  • Cluster Type: N/A
  • Cluster Mode: N/A
  • Using Credential Passthrough: N/A

Additional context
N/A

Revisit Synapse as Output

Describe the feature
It should be possible to use Synapse as an output.

Detailed Example
When writing to a Synapse table through the Databricks connector, we should see just the qualified name for the table, not the temp storage account.

Issues that this feature solves
N/A

Suggested Implementation
Re-run with OL 0.18 and DBR 11.3; the behavior appears to be different.

Additional context
N/A

Checklist for the deployment of Purview ADB Lineage Solution Accelerator

Using this tutorial, I installed this Lineage Solution Accelerator as a working connector. There were no errors. Shown below is the list of items installed in the resource group used for this installation.

  1. For each of these items, what needs to be checked to verify the success of the installation?
  2. Do we have a sample notebook that can be used to verify if the lineage was created? My notebook did not create a lineage.
  3. In purview when you register a source for scanning, it asks for an Integration Runtime. For registering a storage account, it gives you a choice for selecting default integration runtime as
    image
    But when registering Databricks as a source, the above dropdown is empty. Do we have to first install an Integration Runtime?

image

Mount Points Should Replace on Longest String

Describe the bug
When two mount points share a prefix, the accelerator replaces the shorter mount point (/mnt/x) instead of the longest match (/mnt/x/y), leaving part of the mount path in the entity's qualified name.

To Reproduce
Steps to reproduce the behavior:

  1. Define a mount point of /mnt/x
  2. Define a second mount point of /mnt/x/y
  3. Create a spark job that reads or writes to /mnt/x/y
  4. Observe that the qualified name of the entity contains y due to the accelerator matching to /mnt/x.

Expected behavior
We should not see /y in the qualified name; instead we should see the string that /mnt/x/y resolves to.
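The expected behavior amounts to longest-prefix matching over the mount table. A hypothetical Python sketch (the accelerator itself is C#; mount names and targets here are made up):

```python
def resolve_mount(path: str, mounts: dict) -> str:
    """Replace a /mnt/... prefix with its storage target,
    always preferring the longest matching mount point."""
    candidates = (m for m in mounts if path == m or path.startswith(m + "/"))
    best = max(candidates, key=len, default=None)
    if best is None:
        return path  # not under any mount point
    return mounts[best] + path[len(best):]
```

With mounts {"/mnt/x": ..., "/mnt/x/y": ...}, a path under /mnt/x/y resolves against /mnt/x/y rather than /mnt/x, which is the fix this issue asks for.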

Logs
N/A

Screenshots
N/A.

Desktop (please complete the following information):

  • OS: N/A
  • OpenLineage Version: 0.14.1
  • Databricks Runtime Version: 10.4
  • Cluster Type: Job
  • Cluster Mode: Standard
  • Using Credential Passthrough: No

Additional context
N/A

Inconsistent lineage between Databricks Notebooks and ADLS assets

Describe the bug
For a PoC, we are testing the Lineage Connector and are aiming to create lineage between Databricks Notebooks and external tables stored in ADLS, basically similar to the lineage graph that's shown in the readme (See image below).
image

Our goal is to make this connection between ADLS -> Notebook -> ADLS, but we are not able to get this consistently. In most cases, the lineage graph will be as shown in the screenshot below, where the output is the dummy entity. In one attempt, we created the table, then ran the ADLS scan so the Resource Set could be found, and then ran the script again, which transformed the dummy entity into the ADLS resource set. Unfortunately, after trying several more times, we have not been able to reproduce that result. Below is a simple example of what we're trying to achieve.

image

We do see that there's an open PR related to this (#69), but it has apparently not been updated for a couple of months, so we're wondering what the progress is on this.

Alternatively, if you know how to achieve our goal in a different way, we are happy to hear it.

To Reproduce
Steps to reproduce the behavior:

  1. Create a .csv file with just some random integers as keys and store this in ADLS.
  2. Run below code in Notebook
    %sql
    CREATE TABLE IF NOT EXISTS customer_keys (key bigint)
    LOCATION 'abfss://<...>@<..>.dfs.core.windows.net/customer_keys'

    new_keys = spark.read.format('csv').option('header', 'true').load('abfss://<...>@<..>.dfs.core.windows.net/new_keys.csv')
    new_keys.createOrReplaceTempView('new_keys')

    %sql
    INSERT INTO customer_keys SELECT * FROM new_keys
3. Run ADLS scan in Purview so the customer_keys asset is found as ADLS asset
4. Run the code above again

Expected behavior

Expected is that the dummy entity is replaced by the ADLS asset, as we have achieved once, but unfortunately have not been able to reproduce.

Desktop (please complete the following information):

  • OS: Windows
  • OpenLineage Version: 0.17.0
  • Databricks Runtime Version: 11.3 LTS (Spark 3.3) -> Lineage works!
  • Cluster Type: Interactive
  • Cluster Mode: Shared DS3_v2
  • Using Credential Passthrough: No

Fix test expectation for hive_mnt-in-hive_mnt-out-insert

Currently the test expects databricks://<WORKSPACE_ID>.azuredatabricks.net/jobs/<JOB_ID>/tasks/databricks://<WORKSPACE_ID>.azuredatabricks.net/jobs/<JOB_ID>/tasks/hive_mnt-in-hive_mnt-out-insert/processes/B5EA0788D2DFDD6724C9638A23C72530->C45E275909E82D362F516CB3DF62F01E (note the duplicated prefix).

But should be databricks://<WORKSPACE_ID>.azuredatabricks.net/jobs/<JOB_ID>/tasks/hive_mnt-in-hive_mnt-out-insert/processes/B5EA0788D2DFDD6724C9638A23C72530->C45E275909E82D362F516CB3DF62F01E
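The corrected expectation amounts to building the task prefix exactly once. A hypothetical sketch (the function and parameter names here are made up, not the accelerator's actual C# code):

```python
def task_process_qualified_name(workspace_id, job_id, task_name, in_hash, out_hash):
    """Build the qualified name for a job task's process without
    prepending the databricks://... prefix twice."""
    prefix = f"databricks://{workspace_id}.azuredatabricks.net/jobs/{job_id}/tasks"
    return f"{prefix}/{task_name}/processes/{in_hash}->{out_hash}"
```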

Support Column Mapping for ABFSS and WASBS

Describe the feature
With an upgrade to OpenLineage 0.18, we can take advantage of the column mapping feature made available for HadoopFS assets.

Suggested Implementation

Currently, PurviewOut calls two separate methods (in this order) that are relevant:

  1. OlToPurviewParsingService.GetPurviewFromOlEvent
  2. PurviewIngestion.SendToPurview

OlToPurviewParsingService.GetPurviewFromOlEvent

The stack looks like:

  • OlToPurviewParsingService.GetPurviewFromOlEvent
  • DatabricksToPurviewParser.GetDatabricksProcess
  • DatabricksToPurviewParser.GetProcAttributes
  • ColParser.GetColIdentifiers - This is where column mapping is being decided

PurviewIngestion.SendToPurview

The stack looks like:

  • PurviewIngestion.SendToPurview
  • One of:
    • PurviewIngestion.Validate_Process_Entities
    • PurviewIngestion.Validate_Entities
    • PurviewIngestion.SetOutputInput - called by Validate_Process_Entities
  • All of which call PurvewCustomType.QueryInPurview with an empty string argument.
  • PurviewCustomType.SelectReturnEntity helps to choose which entity is actually used in the process.

The Fix

SelectReturnEntity needs to inform GetColIdentifiers which entity is actually being used, instead of the original one provided by OpenLineage.
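A hypothetical illustration of that flow in Python pseudocode (the accelerator itself is C#, and these names only loosely mirror the methods listed above):

```python
def build_column_mapping(datasets, query_in_purview, map_columns):
    """Resolve each dataset to the entity Purview will actually use
    (the SelectReturnEntity step), then build the column mapping from
    the resolved entities rather than the raw OpenLineage ones."""
    resolved = [query_in_purview(d) or d for d in datasets]
    return map_columns(resolved)
```

The point of the sketch is the ordering: column identifiers are computed after entity resolution, so the mapping refers to the entity Purview selected, not the placeholder from the event.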

microsoft-purview is not a valid taxonomy tag.

Please remove the 'microsoft-purview' value from the products. It is not in the taxonomy and is blocking all samples from being published. I have removed it from your live publish, but the next time you update this sample it will be added back and cause all builds to fail. Email me - smmark - if you have questions. Thank you!

Use Service Principal Certificates instead of Secret Key

Describe the bug
Hi Team,
Our customer's security policy doesn't allow the use of secret keys, and they would like us to use Service Principal certificates instead of the Service Principal secret key.

To Reproduce
Steps to reproduce the behavior:

  1. e.g. use the Service Principal certificate location to install the solution accelerator.

Expected behavior
The Solution Accelerator uses the Service Principal Certificate during the installation and internal communication between the Azure Function and Purview.

Logs

  1. Please include any Spark code being run that generates this error
  2. Please include a gist of the OpenLineageIn and PurviewOut logs
  3. See how to stream Azure Function Logs

Screenshots
To use Azure Service Principal certificates for installation and internal app communication.
image
Desktop (please complete the following information):

  • OS: [e.g. Windows, Mac]
  • OpenLineage Version: [e.g. name of jar]
  • Databricks Runtime Version: [e.g. 6.4, 9.1, 10.1]
  • Cluster Type: [e.g. Job, Interactive]
  • Cluster Mode: [e.g. Standard, High Concurrency, Single]
  • Using Credential Passthrough: [e.g. Yes, No]

Additional context
Add any other context about the problem here.

Data read outside of mount points is not registered

First of all, thank you for the amazing tool provided, it is really helping our team.

Describe the bug
After using the connector-only deployment, our team noticed that notebooks dealing with data outside mount points were not being saved to the Purview account.

To Reproduce
Follows a sample notebook that causes this behavior:

df = spark.createDataFrame(data=[[1], [2]])
df.write.mode("overwrite").parquet("/mnt/lineage-test/df1/")
df = spark.read.parquet("/mnt/lineage-test/df1")
df.write.mode("overwrite").saveAsTable("test")
df2 = table("test")
df2.write.mode("overwrite").parquet("mnt/lineage-test/df2/")

Expected behavior

A Databricks notebook lineage with the parquet writes "df2" and "df1", the "df1" parquet read, and the "test" table read from DBFS.
In a real scenario, we would like to see data writes when APIs are called inside a Databricks notebook, and an internal source when data comes from outside a mount point.

Logs
The full Azure Functions logs can be found in this gist. What got my attention were the errors "invalid relationshipDef: process_dataset_outputs: end type 1: DataSet, end type 2: databricks_process" and "Referenced entity AtlasObjectId ... is not found", found on lines 116 and 200, respectively.

Desktop (please complete the following information):

  • OS: Windows 10
  • OpenLineage Version: openlineage-spark-0.13.0.jar
  • Databricks Runtime Version: 10.4
  • Cluster Type: Interactive
  • Cluster Mode: Single
  • Using Credential Passthrough: Yes

Databricks cluster failure with error as "Init script failure"

Describe the bug
The Databricks cluster fails with the error "Cluster terminated. Reason: Init script failure".

To Reproduce
Steps to reproduce the behavior:
Deploy Cluster with 9.1 LTS Runtime version

Spark Config
spark.openlineage.url.param.code <FUNCTION_APP_DEFAULT_HOST_KEY>
spark.openlineage.namespace <DATABRICKS_URI_NO>#default
spark.openlineage.version 1
spark.openlineage.host https://<FUNCTION_APP_DEFAULT_HOST_KEY>.azurewebsites.net

Init Script

#!/bin/bash
STAGE_DIR="/dbfs/databricks/openlineage"
cp -f $STAGE_DIR/openlineage-spark-*.jar /mnt/driver-daemon/jars || { echo "Error copying Spark Listener library file"; exit 1;}
cat << 'EOF' > /databricks/driver/conf/openlineage-spark-driver-defaults.conf
[driver] {
"spark.extraListeners" = "io.openlineage.spark.agent.OpenLineageSparkListener"
}
EOF

Expected behavior
The cluster should start successfully with the mentioned init script.

Logs

  1. Please include any Spark code being run that generates this error
  2. Please include a gist of the OpenLineageIn and PurviewOut logs
  3. See how to stream Azure Function Logs

Screenshots
image

Desktop (please complete the following information):

  • OpenLineage Version: [openlineage-spark-0.9.0.jar]
  • Databricks Runtime Version: [9.1]
  • Cluster Mode: Standard

Additional context
Without the init script, the Databricks cluster starts successfully.

Create a python version of the function code

This is not a bug but a request. Is it possible to create a Python version of the current functions? In our org we only use Python, so it would really help us extend this accelerator.
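Framework details aside, the core of an OpenLineageIn-style function could be outlined in plain Python as below. This is a hypothetical sketch of the request, not a port of the actual C# code: it parses the posted OpenLineage event, drops events that carry no lineage, and returns the messages that would be forwarded to Event Hub.

```python
import json

def handle_openlineage_post(body: str):
    """Hypothetical outline: parse an OpenLineage event from an HTTP body,
    keep it only if it has inputs or outputs, and return
    (http_status, messages_to_forward_to_event_hub)."""
    event = json.loads(body)
    if not (event.get("inputs") or event.get("outputs")):
        return 200, []  # accept but do not forward no-op events
    return 200, [json.dumps(event)]
```

Wrapping this in an actual Azure Functions Python HTTP trigger with an Event Hub output binding would give the Python equivalent being requested.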

Azure Function (OpenLineageIn) is not getting invoked.

After deployment, I executed the steps below; however, I am unable to see the Azure Function being invoked.

  1. Deployed as per the connector-only deployment.
    Azure Functions and Event Hub got created.
  2. Ran a sample notebook in Azure Databricks.
  3. Issue: unable to see the Function App being invoked, and not seeing any calls in the Event Hub or Function App.

Assets not displayed in Purview

Describe the bug
I installed the demo version of the ADB solution accelerator. After running the notebook, I can see a successful invocation of the PurviewOut function with valid OpenLineage JSON, but for some reason I am not getting the assets in Purview. I got the assets once, for the very first run after installation; I deleted them to do some testing and tried rerunning the notebook multiple times, but the assets are not showing up again.

Steps Taken

  1. Confirmed that the Azure Functions configuration has green ticks next to the Key Vault references
  2. Tried installing the custom types on Purview using PowerShell and got a message that the types are already installed

Function App Log

union traces
| union exceptions
| where timestamp > ago(30d)
| where operation_Id == 'cedb60b36ff487b012d019b07fdfc6d0'
| where customDimensions['InvocationId'] == '56dc42ff-be09-4353-a0c3-f4113bc496d8'
| order by timestamp asc
| project
    timestamp,
    message = iff(message != '', message, iff(innermostMessage != '', innermostMessage, customDimensions['prop__{OriginalFormat}'])),
    logLevel = customDimensions['LogLevel']

screenshot

Support for SecureString

Please modify the ARM template to support secureString so the client secret is not written to the logs.
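For illustration, the request corresponds to marking the parameter as secure in the deployment template. The parameter name below is hypothetical; this is a sketch, not the accelerator's actual template. In Bicep:

```bicep
@secure()
param clientSecret string
```

In a JSON ARM template this is the same as declaring the parameter with `"type": "secureString"`, which keeps its value out of deployment logs and deployment history.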
