
wjohnson / pyapacheatlas

A Python package to help work with the Apache Atlas REST APIs

Home Page: https://wjohnson.github.io/pyapacheatlas-docs/latest/

License: MIT License

Languages: Python 100.00%

pyapacheatlas's Introduction

PyApacheAtlas: A Python SDK for Azure Purview and Apache Atlas

PyApacheAtlas Logo

PyApacheAtlas lets you work with the Azure Purview and Apache Atlas APIs in a Pythonic way. It supports bulk loading, custom lineage, custom type definitions, and more from an SDK and Excel templates / integration.

The package supports programmatic interaction and an Excel template for low-code uploads.

Using Excel to Accelerate Metadata Uploads

  • Bulk upload entities.
    • Upload entities / assets for built-in or custom types.
    • Supports adding glossary terms to entities.
    • Supports adding classifications to entities.
    • Supports creating relationships between entities (e.g. columns of a table).
  • Creating custom lineage between existing entities.
  • Defining Purview Column Mappings / Column Lineage.
  • Bulk upload custom type definitions.
  • Bulk upload of classification definitions (Purview Classification Rules not supported).

Using the Pythonic SDK for Purview and Atlas

The PyApacheAtlas package itself supports those operations and more for the advanced user:

  • Programmatically create Entities, Types (Entity, Relationship, etc.).
  • Perform partial updates of an entity (for non-complex attributes like strings or integers).
  • Extracting entities by guid or qualified name.
  • Creating custom lineage with Process and Entity types.
  • Working with the glossary.
    • Uploading terms.
    • Downloading individual or all terms.
  • Working with classifications.
    • Classify one entity with multiple classifications.
    • Classify multiple entities with a single classification.
    • Remove classification ("declassify") from an entity.
  • Working with relationships.
    • Able to create arbitrary relationships between entities.
    • e.g. associating a given column with a table.
  • Deleting types (by name) or entities (by guid).
  • Performing "What-If" analysis to check if...
    • Your entities are valid types.
    • Your entities are missing required attributes.
    • Your entities are using undefined attributes.
  • Azure Purview's Search: query, autocomplete, suggest, browse.
  • Authentication to Azure Purview using azure-identity and Service Principal.
  • Authentication to Apache Atlas using basic authentication of username and password.

Quickstart

Install from PyPI

python -m pip install pyapacheatlas

Using Azure-Identity and the Azure CLI to Connect to Purview

For connecting to Azure Purview, it's often more convenient to install the azure-identity package, which supports Managed Identity, Environment Credential, and Azure CLI credentials.

If you want to use your Azure CLI credential rather than a service principal, install azure-identity by running pip install azure-identity and then run the code below.

from azure.identity import AzureCliCredential
from pyapacheatlas.core import PurviewClient

cred = AzureCliCredential()

# Create a client to connect to your service.
client = PurviewClient(
    account_name = "Your-Purview-Account-Name",
    authentication = cred
)

Create a Purview Client Connection Using Service Principal

If you don't want to install any additional packages, you should use the built-in ServicePrincipalAuthentication class.

from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient

auth = ServicePrincipalAuthentication(
    tenant_id = "", 
    client_id = "", 
    client_secret = ""
)

# Create a client to connect to your service.
client = PurviewClient(
    account_name = "Your-Purview-Account-Name",
    authentication = auth
)

Create Entities "By Hand"

You can also create your own entities by hand with the helper AtlasEntity class.

from pyapacheatlas.core import AtlasEntity

# Get All Type Defs
all_type_defs = client.get_all_typedefs()

# Get Specific Entities
list_of_entities = client.get_entity(guid=["abc-123-def","ghi-456-jkl"])

# Create a new entity
ae = AtlasEntity(
    name = "my table", 
    typeName = "demo_table", 
    qualified_name = "somedb.schema.mytable",
    guid = -1000
)

# Upload that entity with the client
upload_results = client.upload_entities( [ae] )

Create Entities from Excel

Read from a standardized Excel template that supports...

  • Bulk uploading entities into your data catalog.
  • Creating custom table and column level lineage.
  • Creating custom type definitions for datasets.
  • Creating custom lineage between existing assets / entities in your data catalog.
  • Creating custom classifications (Purview Classification Rules are not supported yet).

See end-to-end samples for each scenario in the excel samples folder.

Learn more about the Excel features and configuration in the wiki.
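
For the BulkEntities scenario specifically, the round trip is short. A minimal sketch, assuming openpyxl is installed and reusing the client from the snippets above (the file name is illustrative):

from pyapacheatlas.readers import ExcelConfiguration, ExcelReader

# Scaffold a blank template workbook to fill in.
ExcelReader.make_template("./demo_template.xlsx")

# Parse the BulkEntities tab and upload with the client created earlier.
reader = ExcelReader(ExcelConfiguration())
entities = reader.parse_bulk_entities("./demo_template.xlsx")
upload_results = client.upload_entities(entities)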


pyapacheatlas's People

Contributors

abdale, amberz, amiket23, analyticjeremy, bsherwin, fpvmorais, henrischulte-ms, hophanms, iagofranco, jomit, kawofong, mdrakiburrahman, slyons, sonnyhcl, vincentk, w0lveri9, wjohnson, xiaoyongzhu, xiaoyongzhumsft


pyapacheatlas's Issues

Release 0.3.0 Features

New Classes

EntityTypeDef and ClassificationTypeDef: AttributeDefs

  • Support AttributeDefs objects alone or in a list, Dicts alone or in a list
  • Use @properties to get and set attribute defs in whole
  • Use addAttributeDef(*args) to append attribute defs (NOT Fluent Interface style)
  • Settable in whole via .attributeDefs = [] or in init with a list

EntityTypeDef: RelationshipAttributeDef

  • Support RelationshipAttributeDefs objects alone or in a list, Dicts alone or in a list
  • Use @properties to get and set relationship attribute defs in whole
  • Use addAttributeDef(*args) to append attribute defs (NOT Fluent Interface style)
  • Settable in whole via .relationshipAttributeDefs = [] or in init with a list

AtlasProcess: Inputs and Outputs

  • Support AtlasEntity objects alone or in a list, dicts alone or in a list
  • Use @properties to get and set input/output in whole
  • Use .addInput(*args) to append inputs (NOT Fluent Interface style).
  • Use .addOutput(*args) to append outputs (NOT Fluent Interface style).
  • Drop get_ / set_ methods.
  • CONSIDERING: Intelligently determining if it's using a valid guid vs plans to be referenced by qual name, type, negative guid.

AtlasEntity: name and qualified_name

  • Move to using @properties
  • Drop get_ / set_ methods.

AtlasEntity: RelationshipAttributes

  • Use @properties to get and set relationship attributes in whole
  • Use addRelationshipAttribute(**kwargs) to add / update attributes
  • When accepting another AtlasEntity or dict, convert to json with minimum=True
  • Not creating a separate class for AtlasRelationshipAttribute since it's just a dict.
  • CONSIDERING: Intelligently determining if it's using a valid guid vs plans to be referenced by qual name, type, negative guid.

Cleanups

  • Defined TypeDefs have corrected ways of getting TypeCategory.

Support searching across catalog for glossary terms in asset name

Enabling a rough "discovery" of possible candidates for assets that need to be tagged with a glossary term.

Should be possible through the advanced search and parsing through the glossary terms.

  • Get all glossary terms
  • Possibly "massage" those terms into different shapes (abbreviations, substr)
  • Wildcard query against all assets
  • List out any possible matches for people to review
  • Phase two would be to assign the AtlasGlossaryTerm relationship to the asset from some intermediate file that was written out to beforehand.
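
A rough sketch of the first and third bullets, assuming the glossary JSON exposes a "terms" list with displayText headers and that search hits carry a name field (both worth verifying against your endpoint):

# Rough sketch: surface assets whose names contain a glossary term.
glossary = client.get_glossary()
term_names = [t["displayText"] for t in glossary.get("terms", [])]

candidates = {}
for term in term_names:
    # Wildcard query against all assets; matches are for human review.
    for hit in client.search_entities(term + "*"):
        candidates.setdefault(term, []).append(hit.get("name"))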

Make force_update smarter

In order to make working with uploads easier, the client.upload_typedefs force_update parameter should be smarter.

Currently, it simply does a POST request if False and a PUT request if True. However, a PUT request for a type def that does not exist will break the entire upload, and a POST request for a type def that already exists breaks the upload as well.

A better solution is to look up each type def by name and category (entity, relationship, classification) and see if it exists. If it exists, then use the PUT request (see the sketch below).

However, there may be dependencies between types and there may be issues in updating a type that will conflict against the existing entities.

Need to test:

  • What's the impact of PUT with new type defs
  • What's the impact of POST with existing type defs
  • What's the impact of a PUT with breaking changes to existing type defs
  • Can we determine the dependency between type defs?
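
A minimal sketch of the lookup, assuming get_typedef raises (e.g., an HTTPError) when nothing matches:

from pyapacheatlas.core.typedef import TypeCategory

def typedef_exists(client, name, category=TypeCategory.ENTITY):
    # Probe the catalog for an existing def by name and category;
    # assumes get_typedef raises when the def is absent.
    try:
        client.get_typedef(category, name=name)
        return True
    except Exception:
        return False

# PUT when typedef_exists(...) is True, POST otherwise.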

ColumnMappings Takes in Qualified Name and not just Name for dataset mapping

This is the appropriate syntax and needs to be fixed in the column lineage reader.

column_mapping = [
    {
        "ColumnMapping": [
            {"Source": "AddressType", "Sink": "address"},
            {"Source": "CustomerId", "Sink": "cust_id"}
        ],
        "DatasetMapping": {
            "Source": custAddr.qualifiedName, "Sink": customer.qualifiedName
        }
    },
    {
        "ColumnMapping": [
            {"Source": "total_emp", "Sink": "cust_id"},
            {"Source": "description", "Sink": "username"}
        ],
        "DatasetMapping": {
            "Source": sample.qualifiedName, "Sink": customer.qualifiedName
        }
    }
]

Search raises StopIteration at end of paging

PEP 479 makes raising StopIteration inside a generator bad behavior; in Python 3.7+ it is converted into a RuntimeError.

We need to replace this StopIteration and simply return when the inner function completes in AtlasClient.search_entities.
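
The fix is mechanical; a generic sketch of the paging loop (fetch_page is a stand-in for the inner search call):

def search_pages(fetch_page):
    offset = 0
    while True:
        page = fetch_page(offset)
        if not page:
            return  # PEP 479: return, never raise StopIteration
        yield from page
        offset += len(page)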

Sent relationships are not visible with excel_bulk_entities_upload.py

Congratulations and many thanks for this great tool!
The samples provided are very useful, but I cannot figure out how the custom attributes and relationships are passed to the Atlas API.
For instance, the script samples/excel/excel_bulk_entities_upload.py produces an excel BulkEntities sheet with two additional columns: "[Relationship] table" and "type".
The corresponding information is visible in the dict output by excel_reader.parse_bulk_entities(), but I cannot find it in the result of client.upload_entities() that also gets printed on the console (see below). How are the attributes "[Relationship] table" and "type" passed to Apache Atlas in this case?
I would really need to understand that to grasp exactly what kind of related objects I can pass to the catalog API with pyapacheatlas.

runfile('C:/Users/FBEDECARRA/Desktop/Tests Apache Atlas/sample_bulk_upload.py', wdir='C:/Users/FBEDECARRA/Desktop/Tests Apache Atlas')
{
  "mutatedEntities": {
    "CREATE": [
      {
        "typeName": "DataSet",
        "attributes": {
          "qualifiedName": "pyapacheatlas://dataset",
          "name": "exampledataset"
        },
        "guid": "f24c4f22-c5e3-4776-a630-41e533b47099",
        "status": "ACTIVE",
        "displayText": "exampledataset",
        "classificationNames": [],
        "classifications": [],
        "meaningNames": [],
        "meanings": [],
        "isIncomplete": false,
        "labels": []
      },
      {
        "typeName": "hive_table",
        "attributes": {
          "createTime": 0,
          "qualifiedName": "pyapacheatlas://hivetable01",
          "name": "hivetable01"
        },
        "guid": "46efb945-281d-497b-8334-92c668fb8d5b",
        "status": "ACTIVE",
        "displayText": "hivetable01",
        "classificationNames": [],
        "classifications": [],
        "meaningNames": [],
        "meanings": [],
        "isIncomplete": false,
        "labels": []
      },
      {
        "typeName": "hive_column",
        "attributes": {
          "qualifiedName": "pyapacheatlas://hivetable01#colA",
          "name": "columnA"
        },
        "guid": "195d4775-69f0-48fe-b63c-88c0e30066fa",
        "status": "ACTIVE",
        "displayText": "columnA",
        "classificationNames": [],
        "classifications": [],
        "meaningNames": [],
        "meanings": [],
        "isIncomplete": false,
        "labels": []
      },
      {
        "typeName": "hive_column",
        "attributes": {
          "qualifiedName": "pyapacheatlas://hivetable01#colB",
          "name": "columnB"
        },
        "guid": "f43b8f63-63da-4c82-b5f5-2b09c0418e67",
        "status": "ACTIVE",
        "displayText": "columnB",
        "classificationNames": [],
        "classifications": [],
        "meaningNames": [],
        "meanings": [],
        "isIncomplete": false,
        "labels": []
      },
      {
        "typeName": "hive_column",
        "attributes": {
          "qualifiedName": "pyapacheatlas://hivetable01#colC",
          "name": "columnC"
        },
        "guid": "f1650ead-6b7e-4dce-aa2b-03ddb18ebca3",
        "status": "ACTIVE",
        "displayText": "columnC",
        "classificationNames": [],
        "classifications": [],
        "meaningNames": [],
        "meanings": [],
        "isIncomplete": false,
        "labels": []
      }
    ]
  },
  "guidAssignments": {
    "-1005": "f1650ead-6b7e-4dce-aa2b-03ddb18ebca3",
    "-1004": "f43b8f63-63da-4c82-b5f5-2b09c0418e67",
    "-1001": "f24c4f22-c5e3-4776-a630-41e533b47099",
    "-1003": "195d4775-69f0-48fe-b63c-88c0e30066fa",
    "-1002": "46efb945-281d-497b-8334-92c668fb8d5b"
  }
}
Completed bulk upload successfully!
Search for hivetable01 to see your results.

Documentation seems incorrect about "Search (only for Azure Purview advanced search)"

As far as I understand, Purview offers a very limited API for searches when compared to the original Apache Atlas. One example: there is no v2/search/basic in Purview, but there is in Atlas.

In the light of this information, did you mean this instead in the README.md?

Search (the only search available for Azure Purview advanced search)

And as a side question, do you know if the original Atlas API is still accessible somehow?

Excel Worksheet Should Support Glossary Term Uploads

Right now, the Excel files are only smart enough to include classifications (which might need to be made into an optional field).

By including glossary terms, this would support bulk updates to entities that can't be currently done in Purview.

Implementation should look at adding a meanings special header that supports multiple semi-colon delimited terms that get mapped as relationship attributes.
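
The splitting itself is simple; a sketch under the assumption that terms are referenced by the Atlas "term@Glossary" qualified name convention (the meanings header and relationship shape here are proposals from this issue, not shipped behavior):

def parse_meanings(cell_value):
    # "term one;term two" -> minimal AtlasGlossaryTerm references.
    terms = [t.strip() for t in (cell_value or "").split(";") if t.strip()]
    return [
        {"typeName": "AtlasGlossaryTerm",
         "uniqueAttributes": {"qualifiedName": t + "@Glossary"}}
        for t in terms
    ]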

Guid generation question

Hi,

I'm trying to build a function that takes two entities and creates a process between those two entities using the AtlasProcess function. My problem is that I need to create a new guid and be sure that the guid is not already assigned to one of my asset in Purview. Is there a function that creates a new guid, knowing the already existing ones on Purview?

Thank you,

Edoardo

get_entity error when trying to retrieve database object

Hello,

I am trying to retrieve data about the Synapse (azure_sql_dw) instance I have set up in our Purview catalog. However, when I try to get that information via get_entity, it returns an error.

Here is the code --

azure_sql_dw = client.get_entity(guid='c39966cb-fbe1-4394-9b44-1d3bbafeb38e')

Here is the error --

HTTPError: 500 Server Error: Internal Server Error for url: https://XXXXXXXXXXXX.catalog.purview.azure.com/api/atlas/v2/entity/bulk?guid=c39966cb-fbe1-4394-9b44-1d3bbafeb38e

Any idea on what could be causing this? Calls I make to table or column objects work fine.

Thanks,
Zack

Make it easier to get started with Purview

There should be a Purview Client that accepts an account_name attribute and fills in the endpoint_url for you.

The PurviewClient should also warn when...

  • Using this package's search feature (only implements the Purview search)
  • Using classifications and the propagation feature (not supported in Purview)

excel_custom_table_column_lineage question

Hi,

I am trying to figure out how to add column specific lineage. I have run the excel_custom_table_column_lineage sample but am not seeing any lineage in the interface. The demo tables and columns are uploaded, but I do not see a lineage tab. Are there any changes I need to make to the sample code besides entering the authentication information?

The excel_update_lineage_upload sample works fine for me but this only shows table lineage.

Thank you,
Zack

Contacts and Owner: Phase1 Support Object ID in Excel Sheet

Allow experts and owners to be imported by putting the object IDs into the Excel sheet. This is enough to get the ball rolling. It's the easiest solution, and it gives a path for users who are desperate for a solution. It also separates the basic API and import parsing problem from the more complicated "Graph authentication" problem.

Create Entities without lineage bulk upload

Create an excel reader function that supports upload of entities without needing column or table level lineage.

Currently, the Columns and Tables tabs expect you to be creating sources and targets.

A new tab should be added to the template to support BulkEntities and the Columns and Tables tabs should be renamed to ColumnsLineage and TablesLineage as defaults.

BulkEntities should be able to automatically take column headers as the attributes. If a cell is empty, it will not add that attribute to the entity.
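
The header-to-attribute mapping could look like this sketch, assuming AtlasEntity accepts an attributes keyword (the reserved column names follow the existing template; the helper itself is illustrative):

from pyapacheatlas.core import AtlasEntity

def row_to_entity(row, guid):
    # Reserved columns fill the entity header; all other non-empty
    # cells become plain attributes.
    reserved = {"typeName", "name", "qualifiedName"}
    return AtlasEntity(
        name=row["name"],
        typeName=row["typeName"],
        qualified_name=row["qualifiedName"],
        guid=guid,
        attributes={
            k: v for k, v in row.items()
            if k not in reserved and v not in (None, "")
        },
    )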

Bulk column UPDATE

Hello,

Would your samples/excel/excel_bulk_entities_upload.py work for updating existing columns? I am trying to find a way to bulk update columns that have already been scanned in via the GUI. We want to add additional information to the columns, mainly descriptions and glossary links.

I am trying to test it by updating a single column (adding a description to it). Below is what I have in the spreadsheet.

typeName: mssql_column
name: my_column
qualifiedName: mssql://XXXXXXXXXXX:XXXXXXX/MSSQLSERVER/XXXXX/XXXXX/my_table#my_column
classifications: (empty)
[Relationship] table: pyapacheatlas://my_table
type: smallint
description: testing

Running this gives me the following error --

KeyError: 'The entity pyapacheatlas://my_table should be listed before mssql://XXXXXXXXXXX:XXXXXXX/MSSQLSERVER/XXXXX/XXXXX/my_table#my_column.'

I am not sure how to interpret this. Any help is greatly appreciated. Thank you.

Contacts and Owner: Phase2 Support Interactive Auth for Graph Lookups

Change the package so that it looks at the experts and owners input. If the values look like guids, then proceed as before. If they look like email addresses, force the user to login interactively and then the package will use the Graph API to translate the email addresses to guids on the user's behalf.

This should only occur in the PurviewClient and only applies to Entities upload and Glossary Term uploads. This is already handled in the terms/import csv route developed in #77 .

The `is_purview` attribute is not set correctly.

When the client is created with PurviewClient(), the is_purview client attribute is incorrectly set to False.
This causes search_entities() to throw RuntimeWarning: You're using a Purview only feature on a non-purview endpoint:

from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient

auth = ServicePrincipalAuthentication(
    tenant_id = "...", 
    client_id = "...", 
    client_secret = "..."
)

client = PurviewClient(
    account_name = "my-purview-account-name",
    authentication = auth
)

print('client.is_purview:', client.is_purview)
# >> False

for i in client.search_entities('totemove'):
    print(i)
# >> ...python3.9/site-packages/pyapacheatlas/core/util.py:18: 
# >> RuntimeWarning: You're using a Purview only feature on a non-purview endpoint.
# >> warnings.warn(

The output is ok, despite this warning.

Workaround: Set client.is_purview = True after client creation.

Add CLI Support

A CLI would help with using PyApacheAtlas as part of a tool chain and handle simple, reoccurring tasks such as:

  • Upload (type def | entity | relationship | term) json to your data catalog.
  • Validate your upload prior to submission with the What If / Validator
  • Create scaffolding json
  • Create template file for excel
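
A hypothetical argparse scaffold for those subcommands (none of these commands exist in the package today):

import argparse

parser = argparse.ArgumentParser(prog="pyapacheatlas")
sub = parser.add_subparsers(dest="command")

upload = sub.add_parser("upload", help="Upload json to your data catalog")
upload.add_argument("kind", choices=["typedef", "entity", "relationship", "term"])
upload.add_argument("path")

validate = sub.add_parser("validate", help="Run What-If validation before submission")
validate.add_argument("path")

sub.add_parser("template", help="Create the template file for excel")

args = parser.parse_args()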

Lineage UI Question

Hi,

I've run the excel_custom_table_column_lineage.py sample and it works fine. Below is a picture of the lineage tab from the perspective of DestTable01.

lineage_good

However when I try the same exact type of lineage setup using some of my MSSQL tables, I get the following --

lineage_bad

Three of the four some_adf_job entities are of type MS SQL Column Lineage. I don't want those showing. I only want the process entity showing like how it is in the demo. Also in the demo you can search for the columns on the left, but I can't do that here.

Any idea on what I could be doing wrong? I uploaded the missing MSSQL typedefs using the column_lineage_scaffold template beforehand.

OSError: [Errno 22] Invalid argument

Hi,
I am trying to run the code and create a sample entity, but I am getting the following error. I have checked the credentials and everything seems fine.

Traceback (most recent call last):
File "c:\Users\shkh\Purview.py", line 105, in
batch=[output01, input01, process]
File "C:\Users\shkh\AppData\Roaming\Python\Python36\site-packages\pyapacheatlas\core\client.py", line 927, in upload_entities
headers=self.authentication.get_authentication_headers()
File "C:\Users\shkh\AppData\Roaming\Python\Python36\site-packages\pyapacheatlas\auth\serviceprincipal.py", line 58, in get_authentication_headers
self._set_access_token()
File "C:\Users\shkh\AppData\Roaming\Python\Python36\site-packages\pyapacheatlas\auth\serviceprincipal.py", line 48, in _set_access_token
self.expiration = datetime.fromtimestamp(int(authJson["expires_in"]))
OSError: [Errno 22] Invalid argument

to_json should be smarter when guid is not provided

There are two sorts of headers that work!

The currently supported version looks like this:

{
  "guid":-1,
  "typeName": "",
  "qualifiedName": ""
}

However, if you don't provide a guid, to_json(minimum=True) should instead specify:

{
    "typeName": "type",
    "uniqueAttributes": {
        "qualifiedName": "qualified name"
    }
}

This could help avoid having to upload the entity as part of the batch.
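
A sketch of the proposed branch (attribute access here is illustrative of how AtlasEntity stores qualifiedName):

def to_minimum_json(entity):
    # Fall back to uniqueAttributes when no guid was assigned.
    qualified_name = entity.attributes.get("qualifiedName")
    if entity.guid is not None:
        return {
            "guid": entity.guid,
            "typeName": entity.typeName,
            "qualifiedName": qualified_name,
        }
    return {
        "typeName": entity.typeName,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }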

Create a Reader Abstract Class to standardize multiple readers

Others may want to implement readers for different formats.

For example, you may want to create a JSON reader or a DelimitedFile reader that implements the same standard methods to parse the results.

This will result in merging:

  • ExcelConfiguration
  • readers.excel functions
  • scaffolding.templates.excel
  • scaffolding.core.*
  • scaffolding.util maybe?

This would be a breaking change for the samples.
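
A sketch of what the abstract base might look like (method names mirror the existing excel reader; the JSON subclass is illustrative):

import json
from abc import ABC, abstractmethod

class Reader(ABC):
    @abstractmethod
    def parse_bulk_entities(self, source):
        """Return a batch ready for AtlasClient.upload_entities."""

    @abstractmethod
    def parse_lineages(self, source):
        """Return table / column lineage entities."""

class JsonReader(Reader):
    def parse_bulk_entities(self, source):
        with open(source) as f:
            return json.load(f)

    def parse_lineages(self, source):
        raise NotImplementedError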

AtlasProcess should accept AtlasEntity as Inputs and Outputs

As major methods like AtlasClient.upload_entities take on the role of converting objects into json, so should the AtlasProcess.

Three areas require changes:

  • __init__ should handle the inputs and outputs attributes.
  • set_outputs ...
  • set_inputs ...

In each case, it should allow an AtlasEntity and execute the to_json(minimum=True) method for you.
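
The coercion shared by all three spots could be one helper, sketched here:

def coerce_to_min_json(items):
    # Accept AtlasEntity objects or dicts, alone or in a list, and
    # normalize everything to minimum-json dicts.
    if not isinstance(items, list):
        items = [items]
    return [
        i.to_json(minimum=True) if hasattr(i, "to_json") else i
        for i in items
    ]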

dependencyType defaults to simple but expression should default to null

When you create a column lineage entity, you have a dependencyType attribute that is either SIMPLE or EXPRESSION. If you have an EXPRESSION value then you would also see an expression attribute. That expression attribute would contain the code used to create that field.

If you go to re-run the parse_lineages method with existing entities (based on type and qualified name) and remove the transformation value for a given column lineage, you end up with a SIMPLE dependencyType but still have a value in the expression attribute.

Instead, the default for expression should be set to null. However, this may break other scenarios where we want to omit null values. There may have to be a compromise of an empty string value instead or an NA value?

Enable download of all entities for backup / restore

This is accomplished through the search API and requires paging through the results.

The goal would be to extract every entity and enable users to essentially "back up" their data catalog but also potentially re-locating their data catalog by uploading the results of this extraction.

Need to consider the upload process as well. Assuming you have to replace the guids when pushing to the new catalog since entity upload requires a negative number as guid.
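
A rough sketch of the extraction half, assuming a "*" wildcard query is accepted and that search hits expose the guid under an id field (both assumptions to verify):

import json

guids = [hit["id"] for hit in client.search_entities("*")]

backup = []
chunk = 100
for i in range(0, len(guids), chunk):
    # get_entity accepts a list of guids and returns {"entities": [...]}.
    response = client.get_entity(guid=guids[i:i + chunk])
    backup.extend(response.get("entities", []))

with open("catalog_backup.json", "w") as f:
    json.dump(backup, f, indent=2)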

Create a generic Entity Type Def based on an excel sheet template from user

Given an excel spreadsheet with column headers, generate an entity based on the column headers as attributes.

The goal would be to quickly generate the type and have it be hand edited to modify the results.

  • Use the Excel Configuration to specify the sheet?
  • All fields are optional
  • All fields are strings
  • Support for multiple tabs to be different entity types?

Stretch goal should be to allow for entities (the rows of the spreadsheet) to be created for that entity type.
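
A sketch with openpyxl, assuming every header becomes an optional string attribute (the DataSet super type is one possible choice, not a requirement):

from openpyxl import load_workbook
from pyapacheatlas.core.typedef import AtlasAttributeDef, EntityTypeDef

def typedef_from_sheet(xlsx_path, sheet_name, type_name):
    # The first row of the sheet becomes the attribute names.
    ws = load_workbook(xlsx_path)[sheet_name]
    headers = [c.value for c in next(ws.iter_rows(max_row=1)) if c.value]
    return EntityTypeDef(
        name=type_name,
        superTypes=["DataSet"],
        attributeDefs=[
            AtlasAttributeDef(name=h, typeName="string", isOptional=True)
            for h in headers
        ],
    )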

Disable classification propagation when uploading entities

Currently, pyapacheatlas uploads entity classifications with the propagation attribute activated.
This is not convenient for all use cases. For instance, one might like to add a classification such as "manual_import" to differentiate, when browsing the catalog, the entities imported with pyapacheatlas from those populated automatically. Currently, when uploading related entities with this classification, one ends up with a series of "propagated classifications" stating "manual_import manual_import manual_import manual_import..." as many times as there are relationships (which can be >10 in my case).
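
If the classification object exposed the flag, disabling propagation might look like this sketch (verify the constructor argument name against your package version):

from pyapacheatlas.core import AtlasClassification

# One classification with propagation explicitly disabled.
manual_import = AtlasClassification("manual_import", propagate=False)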

Contacts and Owner: Phase3 Support Service Principal Graph Lookup

Add a switch to the import process (or the PurviewClient's authentication?) so that the user can signal "My SP has admin-granted permissions to call the Graph". In that case, the package will know it doesn't have to ask for interactive login. It can use the SP to call the Graph straight away. This would enable a scenario where the package can be used in a fully automated environment.

Support Classification REST Endpoints

Support the following REST Endpoints with AtlasClient methods to round out the supported features

/v2/entity/bulk/classification (POST)
/v2/entity/guid/{guid}/classifications (GET | POST | PUT)
/v2/entity/guid/{guid}/classification/{classificationName} (DELETE | GET)
/v2/types/classificationdef/guid/{guid} (GET) (Already supported)
/v2/types/classificationdef/name/{name} (GET) (Already supported)
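
Until such methods exist, the bulk endpoint can be hit directly; a sketch reusing endpoint_url and the auth object from the earlier snippets (the payload shape follows the Atlas bulk classification API and is worth verifying):

import requests

payload = {
    "classification": {"typeName": "PII", "propagate": False},
    "entityGuids": ["guid-1", "guid-2"],
}
resp = requests.post(
    endpoint_url + "/v2/entity/bulk/classification",
    json=payload,
    headers=auth.get_authentication_headers(),
)
resp.raise_for_status()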

Applying classifications + glossary terms to columns via excel

Hi again. Is it possible to add a classification and/or glossary term to columns using the excel_bulk_entities_upload method? I see the sample has a classifications column. I have tried populating this with an existing classification and it runs without error but nothing shows up in the interface for the column. Other fields like description/data_type update fine. Thanks.

Distinguish between Relationship Attributes and Entity Attributes

After completing #29 and merging #32, there is a potential need to connect relationship attributes to an uploaded entity. For example, you might upload several tables and columns. However, those columns would be unattached entities and have no relationships.

There needs to be something like (Relationship) attributeX in the BulkEntities tab or Target (Relationship) attributeY in the Lineages tabs.

We need support for Owner and Experts

Hi, I have a customer that wants to be able to manage the owner and expert with the API, and also to assign them during the creation of the custom ones.

AtlasClient.upload_typedefs should accept wider variety of def parameters

upload_typedefs currently accepts a typedef parameter that can take in different values.

I think it would be better if it had arguments for the required keys: "classificationDefs", "entityDefs","enumDefs", "relationshipDefs", "structDefs". That way you don't have to construct the dict yourself.

The arguments should accept a list of either AtlasTypeDefs (converting them into dicts) or plain dicts.
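
Under that proposal, a call site might look like this sketch (my_entity_def and my_relationship_def are hypothetical; this is not the current signature):

client.upload_typedefs(
    entityDefs=[my_entity_def],            # AtlasTypeDefs or dicts
    relationshipDefs=[my_relationship_def],
    force_update=True,
)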

Support Data Catalog Glossary Term Template Upload

The Azure Data Catalog provides a CSV glossary term upload with the following fields. The goal of this issue would be to develop a similar offering via the excel template and replicate the features.

Columns of CSV / Excel File:

  • Name
  • Status: ENUM (Approved, Draft)
  • Definition: String
  • Acronym: String
  • Resources: DisplayName:URL
  • Related Terms: Needs to look up existing terms and create or associate.
  • Synonyms: Needs to look up existing terms and create or associate.
  • Stewards: Needs Graph API support
  • Experts: Needs Graph API support
  • Dynamic attribute of pattern: [Attribute][termTemplateName]extraAttributeName

The dynamic attribute should be attached to an attributes property

{
  "attributes": {
    "termTemplateName": {
      "extraAttributeName": ""
    }
  }
}

Support LineageREST for Purview Features

Knock out the LineageREST section!

Purview ONLY Support
GET /atlas/v2/lineage/{guid}/next/
GET /atlas/v2/lineage/{guid}

  • I will not support the Atlas way of calling this API at this time.

Purview Limitation
GET /v2/lineage/uniqueAttribute/type/{typeName}

  • I will not support this endpoint as it is not present in Purview currently.
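
A sketch of calling the supported endpoint directly, reusing endpoint_url and auth from the earlier snippets (the depth/direction query parameters follow the Atlas lineage API; the guid is hypothetical):

import requests

guid = "some-entity-guid"
lineage = requests.get(
    endpoint_url + "/atlas/v2/lineage/" + guid,
    params={"depth": 3, "direction": "BOTH"},
    headers=auth.get_authentication_headers(),
).json()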

TypeError in databricks_catalog_dataframe.py

Hello,

I'm seeing the following error when running databricks_catalog_dataframe.py in Databricks:

TypeError: 'EntityTypeDef' object is not subscriptable
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<command-1278818675318490> in <module>
     78    "relationshipDefs":[spark_column_to_df_relationship]
     79   }, 
---> 80   force_update=True)
     81 print(typedef_results)
     82 

/databricks/python/lib/python3.7/site-packages/pyapacheatlas/core/client.py in upload_typedefs(self, typedefs, force_update, **kwargs)
    840                 new_types[cat] = []
    841                 for t in typelist:
--> 842                     if t["name"] in types_from_client[cat]:
    843                         existing_types[cat].append(t)
    844                     else:

TypeError: 'EntityTypeDef' object is not subscriptable

AtlasClient.upload_entities should handle AtlasEntity

Currently, upload_entities only supports a dictionary or a list of dictionaries. It should handle a single AtlasEntity or a list of AtlasEntities. If the batch is a dictionary of {"entities": [...]}, then assume they are passing in a list of dicts already since they know the format.
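
A sketch of the normalization at the top of upload_entities:

def prepare_batch(batch):
    # Accept a single AtlasEntity, a list of entities or dicts, or an
    # already-wrapped {"entities": [...]} payload.
    if isinstance(batch, dict) and "entities" in batch:
        return batch
    if not isinstance(batch, list):
        batch = [batch]
    return {"entities": [
        e.to_json() if hasattr(e, "to_json") else e for e in batch
    ]}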
