albandrod / purview-adb-lineage-solution-accelerator

This project forked from microsoft/purview-adb-lineage-solution-accelerator

A connector to ingest Azure Databricks lineage into Microsoft Purview

License: MIT License

Shell 6.54% Python 4.87% Java 0.93% Scala 11.34% C# 76.32%

purview-adb-lineage-solution-accelerator's Introduction

EAE_Header.png lineage.png

Microsoft Solutions / Early Access Engineering

Azure Databricks to Purview Lineage Connector

This solution accelerator, together with the OpenLineage project, provides a connector that will transfer lineage metadata from Spark operations in Azure Databricks to Microsoft Purview, allowing you to see a table-level lineage graph as demonstrated above.

Note: In addition to this solution accelerator, Microsoft Purview is creating native models for Azure Databricks (e.g., notebooks, jobs, and job tasks) to integrate with Catalog experiences. With native models in Microsoft Purview for Azure Databricks, customers will get enriched lineage experiences such as detailed transformations. If you choose to use this solution accelerator in a Microsoft Purview account before the native models are released, these enriched experiences are not backward compatible. Please reach out to your Microsoft account representative for timeline-related questions on the upcoming model enrichment for Azure Databricks in Microsoft Purview.

Overview

Gathering lineage data is performed in the following steps:

high-level-architecture.png

  1. Azure Databricks clusters are configured to initialize the OpenLineage Spark Listener with an endpoint to receive data.
  2. Spark operations will output data in a standard OpenLineage format to the endpoint configured in the cluster.
  3. The endpoint is provided by an Azure Function app that filters incoming data and passes it to an Azure Event Hub.
  4. Events are captured by a second Function app that transforms the data into a format compatible with Atlas and Purview.
  5. Lineage data is synchronized with existing Purview metadata and uploaded to Purview using standard Apache Atlas APIs.
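
Steps 3 and 4 above can be sketched roughly as follows. This is a minimal illustration in Python, not the accelerator's actual Function code: the function names, the filtering rule, and the sample event are assumptions made for this example. The event fields (`eventType`, `job`, `inputs`, `outputs`) follow the OpenLineage event shape, and `Process` and `DataSet` are Apache Atlas base types; the connector itself may use its own custom types and richer filtering.

```python
from typing import Any, Dict, Optional


def should_forward(event: Dict[str, Any]) -> bool:
    """Step 3 (assumed rule): keep only events that describe data movement."""
    if event.get("eventType") not in {"START", "COMPLETE"}:
        return False
    return bool(event.get("inputs")) and bool(event.get("outputs"))


def dataset_ref(dataset: Dict[str, str]) -> Dict[str, Any]:
    """Reference an existing catalog asset by its qualified name."""
    qualified_name = (
        f"{dataset['namespace'].rstrip('/')}/{dataset['name'].lstrip('/')}"
    )
    return {
        "typeName": "DataSet",
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def to_atlas_process(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Step 4: turn a forwarded event into an Atlas-style process entity."""
    if not should_forward(event):
        return None
    job = event["job"]
    return {
        "typeName": "Process",
        "attributes": {
            "qualifiedName": f"{job['namespace']}/{job['name']}",
            "name": job["name"],
            "inputs": [dataset_ref(d) for d in event["inputs"]],
            "outputs": [dataset_ref(d) for d in event["outputs"]],
        },
    }


# Hypothetical event, shaped like an OpenLineage run event.
sample_event = {
    "eventType": "COMPLETE",
    "eventTime": "2022-01-01T00:00:00Z",
    "run": {"runId": "00000000-0000-0000-0000-000000000000"},
    "job": {"namespace": "adb-demo", "name": "notebook.copy_task"},
    "inputs": [
        {"namespace": "abfss://raw@demo.dfs.core.windows.net", "name": "/in/data"}
    ],
    "outputs": [
        {"namespace": "abfss://curated@demo.dfs.core.windows.net", "name": "/out/data"}
    ],
}

if __name__ == "__main__":
    print(to_atlas_process(sample_event))
```

Events without both inputs and outputs carry no lineage edge, which is why this sketch drops them before any transformation work is done.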

Features

  • Supports table-level lineage from Spark notebooks and jobs for the following data sources:
    • Azure SQL
    • Azure Synapse Analytics
    • Azure Data Lake Gen 2
    • Azure Blob Storage
    • Delta Lake
  • Supports Spark 3.0, 3.1, and 3.2 (Interactive and Job clusters) / Spark 2.x (Job clusters)
    • Databricks Runtimes between 6.4 and 10.4 are currently supported
  • Can be configured per cluster or for all clusters as a global configuration
  • Once configured, does not require any code changes to notebooks or jobs
  • Can add new source support through configuration
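
To illustrate what adding source support through configuration could look like, here is a minimal Python sketch of rule-driven mapping from a Spark path to a catalog qualified name. The rule format (`MAPPING_RULES`), the function name, and the target name templates are assumptions for this example; the accelerator's real mapping configuration may be structured differently.

```python
import re
from typing import Dict, List, Optional

# Hypothetical rules: each one pairs a regex over the Spark path with a
# template for the qualified name used to find the asset in the catalog.
MAPPING_RULES: List[Dict[str, str]] = [
    {
        "source": "adls-gen2",
        "pattern": r"^abfss://(?P<container>[^@]+)@(?P<account>[^.]+)"
                   r"\.dfs\.core\.windows\.net/(?P<path>.*)$",
        "template": "https://{account}.dfs.core.windows.net/{container}/{path}",
    },
    {
        "source": "blob",
        "pattern": r"^wasbs://(?P<container>[^@]+)@(?P<account>[^.]+)"
                   r"\.blob\.core\.windows\.net/(?P<path>.*)$",
        "template": "https://{account}.blob.core.windows.net/{container}/{path}",
    },
]


def to_qualified_name(
    uri: str, rules: List[Dict[str, str]] = MAPPING_RULES
) -> Optional[str]:
    """Return the qualified name for the first matching rule, else None."""
    for rule in rules:
        match = re.match(rule["pattern"], uri)
        if match:
            return rule["template"].format(**match.groupdict())
    return None


if __name__ == "__main__":
    print(to_qualified_name("abfss://raw@demo.dfs.core.windows.net/in/data"))
```

With this design, supporting another source means appending one more rule to the list rather than changing the matching code.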

Prerequisites

Installing this connector requires the following:

  1. Azure subscription-level role assignments for both Contributor and User Access Administrator.
  2. Azure Service Principal with client ID and secret - How to create Service Principal.

Getting Started

There are two deployment options for this solution accelerator:

  • Demo Deployment

    No additional prerequisites are necessary, as the demo environment will be set up for you, including Azure Databricks, Purview, ADLS, and example data sources and notebooks.

  • Connector Only Deployment

    If installed as a working connector, Azure Databricks, data sources, and Microsoft Purview are assumed to be set up and running.
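
For a connector-only deployment, the existing cluster has to be pointed at the Function app endpoint. The Python sketch below shows one way to assemble such a Spark configuration. The listener class `io.openlineage.spark.agent.OpenLineageSparkListener` comes from the OpenLineage Spark integration; the `spark.openlineage.*` property names follow older OpenLineage releases and may differ in the version you install, and the host, namespace, and secret scope values are placeholders invented for this example.

```python
from typing import Dict


def openlineage_spark_conf(
    function_host: str, namespace: str, function_key_ref: str
) -> Dict[str, str]:
    """Build Spark properties that enable the OpenLineage listener.

    Property names are assumptions based on older OpenLineage releases;
    check the deployment docs for the version in use.
    """
    return {
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.host": function_host,
        "spark.openlineage.namespace": namespace,
        "spark.openlineage.url.param.code": function_key_ref,
    }


def render(conf: Dict[str, str]) -> str:
    """Render properties as 'key value' lines for the cluster Spark config box."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    conf = openlineage_spark_conf(
        function_host="https://example-function.azurewebsites.net",  # placeholder
        namespace="adb-demo#cluster-01",                             # placeholder
        function_key_ref="{{secrets/purview-scope/ol-function-key}}",  # placeholder secret reference
    )
    print(render(conf))
```

Referencing the Function key through a Databricks secret keeps it out of the plain-text cluster configuration.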

Using the Connector

Ensure both the Azure Function app and Azure Databricks cluster are running.

  1. Open your Databricks workspace to run a Spark job or notebook which results in data being transferred from one location to another. For the demo deployment, browse to the Workspace > Shared > abfss-in-abfss-out-olsample notebook, and click "Run all".

  2. Once complete, open your Purview workspace and click the "Browse assets" button near the center of the page.

  3. Click on the "By source type" tab.
    You should see several items listed under the heading "Custom source types". There will be a Databricks section and possibly a Purview Custom Connector section under this heading.

    browse_assets.png

  4. Click on the "Databricks" section, then click on the "Databricks Notebook" tile that corresponds to the notebook you ran. In the Properties or Related tabs, select one of the "Notebook Tasks", each of which represents a task in a Databricks job. From the "Databricks Notebook Task", you may see the lineage of one or more of the Spark actions in the notebook. The task may have a number of "Databricks Processes" linked under it, which represent the data lineage. To see these, open the Properties or Related tabs.

    databricks_task_related.png

  5. From the Related view, click on the processes icon, then click on one of the links representing the associated process objects.

  6. Click on the Properties tab to view the properties associated with the process. Note that the full Spark plan is included.

    spark_plan.png

  7. Click on the Lineage tab to see the lineage graph.

    lineage_view.png

    Note: If you are viewing the Databricks Process shortly after it was created, sometimes the lineage tab takes some time to display. If you do not see the lineage tab, wait a few minutes and then refresh the browser.

Troubleshooting

When filing a new issue, please include associated log message(s) from Azure Functions. This will allow the core team to debug within our test environment to validate the issue and develop a solution.

If you have any issues, please start with the Troubleshooting Doc and note the limitations which affect what sort of lineage can be collected. If the problem persists, please raise an Issue on GitHub.

Limitations

The solution accelerator has some limitations which affect what sort of lineage can be collected.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Data Collection

The software may collect information about you and your use of the software and send it to Microsoft. Microsoft may use this information to provide services and improve our products and services. You may turn off the telemetry as described in the repository. There are also some features in the software that may enable you and Microsoft to collect data from users of your applications. If you use these features, you must comply with applicable law, including providing appropriate notices to users of your applications together with a copy of Microsoft’s privacy statement. Our privacy statement is located at https://go.microsoft.com/fwlink/?LinkID=824704. You can learn more about data collection and use in the help documentation and our privacy statement. Your use of the software operates as your consent to these practices.

purview-adb-lineage-solution-accelerator's People

Contributors

isantillan1, marktayl1, marktayl2, mattsavarino, microsoftopensource, rodrigomonteiro-gbb, travishilbert, wjohnson
