datacatalog-connectors-bi's Introduction

datacatalog-connectors-bi

This repository contains sample code with integration between Data Catalog and BI data sources.

Disclaimer: This is not an officially supported Google product.

License: Apache License 2.0

Breaking Changes in v0.5.0

The package names were renamed. If you are still using an older version, use the branch: release-v0.0.0

Project structure

Each subfolder contains a Python package. Please check components' README files for details.

The following components are available in this repo:

| Component | Description | Folder | Language |
| --- | --- | --- | --- |
| google-datacatalog-looker-connector | Sample code for Looker data source. | google-datacatalog-looker-connector | Python |
| google-datacatalog-qlik-connector | Sample code for Qlik Sense data source. | google-datacatalog-qlik-connector | Python |
| google-datacatalog-sisense-connector | Sample code for Sisense data source. | google-datacatalog-sisense-connector | Python |
| google-datacatalog-tableau-connector | Sample code for Tableau data source. | google-datacatalog-tableau-connector | Python |


datacatalog-connectors-bi's Issues

[FEATURE] Publish connectors to PyPI

What would you like to be added:
Currently we store the package dependencies in the lib folder; ideally, all dependencies should be available on PyPI.

Why is this needed:
It makes the environment setup easier: a single command (e.g. pip install google-datacatalog-looker-connector) would install each connector, which improves the developer experience and is less error prone.

[BUG] Looker - error.SDKError when scraping query_generated_sql data

What happened:
The Looker connector threw an error.SDKError when scraping query_generated_sql data.

What you expected to happen:
This should not stop the connector execution, since some Looker servers may have inconsistent data to some degree.

For example, I have one Looker instance that had the connector working, but once someone created an invalid query reference, the connector stopped working.

How to reproduce it (as minimally and precisely as possible):
Run the connector on a Looker instance with inconsistent query information.

Anything else we need to know?:
No
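
For illustration, a minimal sketch of the tolerant behavior requested above, assuming the looker_sdk package; scrape_generated_sql and the surrounding loop are hypothetical, not the connector's actual code:

import logging

from looker_sdk import error


def scrape_generated_sql(sdk, query_ids):
    """Collect generated SQL per query, skipping inconsistent queries."""
    sql_by_query = {}
    for query_id in query_ids:
        try:
            # run_query with result_format='sql' returns the generated SQL.
            sql_by_query[query_id] = sdk.run_query(query_id, 'sql')
        except error.SDKError as e:
            # Skip queries with invalid references instead of aborting
            # the whole scrape.
            logging.warning('Failed to scrape SQL for query %s: %s',
                            query_id, e)
    return sql_by_query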

[FEATURE] Tableau - Verify that the resolved GraphQL query is valid in unit tests and improve the error message

What would you like to be added:
Currently, breaking changes to tableau/scrape/metadata_api_constants.py are not caught by the unit tests.

For example, if we generate an invalid GraphQL query:

query getSites {
  tableauSites {
    luid
    uri
    name
    publishedDatasources {
      luid
      name
      upstreamTables {
        fullName
        database {{
          luid
        }}
      }
      upstreamDatabases {
        luid
        name
        connectionType
      }
      site {
        luid
        name
      }
      projectName
      owner {
        username
        name
      }
      isCertified
      certifierDisplayName
      certificationNote
      description
      vizportalUrlId
    }
    workbooks {
      luid
      name
      site {
        luid
        name
      }
      projectName
      owner {
        username
        name
      }
      sheets {
        id
        luid
        name
        path
        createdAt
        updatedAt
      }
      description
      vizportalUrlId
      createdAt
      updatedAt
      upstreamTables {
        fullName
        database {
          luid
        }
      }
      upstreamDatabases {
        luid
        name
        connectionType
      }
    }
  }
}

This query has a field with double braces ({{}}), which is invalid. All tests pass, yet when we run the Tableau connector it reports empty metadata (even if the Tableau server has assets) and succeeds without throwing an error.

So we should add this verification to the unit tests, and make the connector's scrape process raise an error if the query is invalid.

Why is this needed:
The connector is hard to maintain because breaking changes to the main query are not caught and not easily identified.
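
A minimal sketch of such a unit test, assuming the graphql-core package; the constant name RESOLVED_SITES_QUERY is hypothetical and would come from the resolved template in metadata_api_constants.py:

import unittest

from graphql import parse
from graphql.error import GraphQLSyntaxError

# Hypothetical: the fully resolved query string built from
# tableau/scrape/metadata_api_constants.py.
RESOLVED_SITES_QUERY = '''
query getSites {
  tableauSites {
    luid
    name
  }
}
'''


class MetadataApiConstantsTest(unittest.TestCase):

    def test_sites_query_is_valid_graphql(self):
        # parse() raises GraphQLSyntaxError on malformed queries, so a stray
        # '{{' would fail this test instead of silently yielding empty
        # scrape results.
        try:
            parse(RESOLVED_SITES_QUERY)
        except GraphQLSyntaxError as e:
            self.fail('Invalid GraphQL query: {}'.format(e))


if __name__ == '__main__':
    unittest.main()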

[FEATURE] Add support for Qlik

What would you like to be added:
Support for ingesting metadata from Qlik into Google Data Catalog.

Why is this needed:
Qlik is among the main providers in the BI systems market. Qlik customers might benefit from Data Catalog's powerful metadata management capabilities to discover and better manage their metadata.

[BUG] Tableau - Error when asset names contain a quote

What happened:
Got an error when dealing with a quote in filenames such as "chiffre d'affaire".

What you expected to happen:
Retrieve the data correctly.

How to reproduce it (as minimally and precisely as possible):
Try to import metadata related to a filename containing a quote, such as "Chiffre d'affaire" (French).

Anything else we need to know?:

This happens during this phase:
INFO:root:===> Synchronizing Tableau :: Data Catalog metadata...
Files without a quote (') do not encounter any issues.

Here are the error logs:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/tgadiole/projects/tableau/tableau/bin/google-datacatalog-tableau-connector", line 8, in
sys.exit(main())
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/datacatalog_connectors/tableau/tableau2datacatalog_cli.py", line 79, in main
Tableau2DataCatalogCli.run(argv[1:] if len(argv) > 0 else argv)
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/datacatalog_connectors/tableau/tableau2datacatalog_cli.py", line 32, in run
args.func(args)
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/datacatalog_connectors/tableau/tableau2datacatalog_cli.py", line 67, in __run_synchronizer
sync.DataCatalogSynchronizer(
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/datacatalog_connectors/tableau/sync/datacatalog_synchronizer.py", line 64, in run
self.__run_full_sync()
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/datacatalog_connectors/tableau/sync/datacatalog_synchronizer.py", line 71, in __run_full_sync
self.__sites_synchronizer.run()
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/datacatalog_connectors/tableau/sync/metadata_synchronizer.py", line 136, in run
ingestor.ingest_metadata(assembled_entries, tag_templates_dict)
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/datacatalog_connectors/commons/ingest/datacatalog_metadata_ingestor.py", line 65, in ingest_metadata
self.__ingest_entries(entry_group_name, assembled_entries_data, config)
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/datacatalog_connectors/commons/ingest/datacatalog_metadata_ingestor.py", line 94, in __ingest_entries
entry = self.__datacatalog_facade.upsert_entry(
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/datacatalog_connectors/commons/datacatalog_facade.py", line 100, in upsert_entry
persisted_entry = self.create_entry(
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/datacatalog_connectors/commons/datacatalog_facade.py", line 48, in create_entry
entry = self.__datacatalog.create_entry(parent=entry_group_name,
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/cloud/datacatalog_v1beta1/gapic/data_catalog_client.py", line 1481, in create_entry
return self.inner_api_calls["create_entry"](
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/api_core/gapic_v1/method.py", line 145, in call
return wrapped_func(*args, **kwargs)
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/api_core/retry.py", line 281, in retry_wrapped_func
return retry_target(
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/api_core/retry.py", line 184, in retry_target
return target()
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
return func(*args, **kwargs)
File "/Users/tgadiole/projects/tableau/tableau/lib/python3.8/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "", line 3, in raise_from
google.api_core.exceptions.InvalidArgument: 400 "YTD Chiffre d'affaires" is an invalid value for CreateEntryRequest.entry.display_name. It must: contain only unicode letters, numbers, underscores, dashes and spaces; not start or end with spaces; and be at most 200 bytes long when encoded in UTF-8.
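
For illustration, a minimal sketch of a sanitizer based on the validation message above; the helper name and replacement policy are assumptions, not the connector's actual fix:

import re


def sanitize_display_name(name):
    # Keep unicode letters, numbers, underscores, dashes and spaces, per the
    # CreateEntryRequest.entry.display_name rules; replace anything else
    # (e.g. the apostrophe in "chiffre d'affaires") with an underscore.
    sanitized = re.sub(r'[^\w\- ]', '_', name)
    # Trim surrounding spaces; note the API limit is 200 bytes in UTF-8,
    # so a character-based cut is only an approximation here.
    return sanitized.strip()[:200]


print(sanitize_display_name("YTD Chiffre d'affaires"))  # YTD Chiffre d_affaires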

[BUG] Looker connector throws "Requires authentication" error when scraping metadata takes too long

What happened:
I'm using the google-datacatalog-looker-connector to ingest metadata from a large Looker instance into Google Data Catalog. This Looker instance has hundreds of queries, and scraping their metadata takes too long (I believe more than 1h). After scraping metadata for such a long time, the connector fails with the message:

looker_sdk.error.SDKError: {"message":"Requires authentication.","documentation_url":"http://docs.looker.com/"}

What you expected to happen:
I expect the connector to ingest metadata no matter how large the source Looker instance is.

How to reproduce it (as minimally and precisely as possible):
Run any set of scrape operations that takes a very long time to complete (more than 1 hour).

Anything else we need to know?:
I didn't investigate how Looker's SDK manages the REST API credentials, but this error might be related to token expiration. Refreshing the token from time to time during the scrape stage might fix it.
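
A minimal sketch of that workaround, assuming looker_sdk exposes init40() and an auth session with login()/logout(); call_with_reauth is hypothetical:

import looker_sdk
from looker_sdk import error

sdk = looker_sdk.init40()  # reads credentials from looker.ini or env vars


def call_with_reauth(fn, *args, **kwargs):
    """Run an SDK call; on an auth failure, refresh the token and retry once."""
    try:
        return fn(*args, **kwargs)
    except error.SDKError as e:
        if 'Requires authentication' not in str(e):
            raise
        # Drop the expired token and log in again before retrying.
        sdk.auth.logout()
        sdk.auth.login()
        return fn(*args, **kwargs)


# Usage: wrap each long-running scrape call.
dashboards = call_with_reauth(sdk.all_dashboards)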

[FEATURE] Looker - Add option to filter the ingested types

What would you like to be added:
The Looker connector currently supports the asset types below:

  • Folder
  • Look
  • Dashboard
  • Dashboard Element (aka Tile)
  • Query

Add an option to choose which types are ingested into Data Catalog, as sketched below.

Why is this needed:
Depending on the Looker instance, there may be a lot of query elements, and ingesting them all takes a long time.
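
A minimal sketch of the requested option as a CLI flag; the flag name and choices are illustrative, the connector's real argument parser may differ:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--asset-types',
    nargs='+',
    choices=['folder', 'look', 'dashboard', 'dashboard_element', 'query'],
    default=['folder', 'look', 'dashboard', 'dashboard_element', 'query'],
    help='Asset types to ingest into Data Catalog (default: all).')

args = parser.parse_args(['--asset-types', 'folder', 'dashboard'])
print(args.asset_types)  # ['folder', 'dashboard']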

[FEATURE] Add support for Python 3.6 on looker2datacatalog

What would you like to be added:
Make the looker2datacatalog connector work with Python 3.6.

Why is this needed:
Currently tableau2datacatalog works with both Python 3.6 and Python 3.7; since the Looker connector does not work with Python 3.6, the CI executes only on Python 3.7.
With this change we can enable our CI to run on both versions and increase our version coverage.

[FEATURE] Make the WebsocketResponsesManager thread-safe

The WebsocketResponsesManager class (Qlik Sense connector, google.datacatalog_connectors.qlik.scrape.websocket_responses_manager) provides common features for handling responses over WebSocket communication sessions. Its methods are used concurrently by a given message consumer and producer pair, so the reading/writing operations should be made thread-safe to prevent errors.

This doesn't seem to be a critical issue, since we didn't face any concurrency-related errors in the scope of this class when testing the connector (in both the unit and manual integration tests), but it's worth keeping an eye on it for the future.
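
A minimal sketch of the intended locking, assuming a dict-backed responses store; the class below is illustrative, not the actual google.datacatalog_connectors.qlik implementation:

import threading


class ThreadSafeResponsesManager:

    def __init__(self):
        self.__lock = threading.Lock()
        self.__responses = {}

    def add_response(self, request_id, response):
        # The WebSocket message consumer takes the lock before mutating.
        with self.__lock:
            self.__responses[request_id] = response

    def pop_response(self, request_id):
        # The producer awaiting a reply takes the same lock, so a read
        # never interleaves with a partial write.
        with self.__lock:
            return self.__responses.pop(request_id, None)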

[FEATURE] Fields metadata to be included in the Qlik connector

Currently the Qlik connector scrapes only Dimensions, Measures, Sheets, and Visualizations using the Engine JSON API.

However, many of the visualizations also use "Fields", which are not captured by the connector.

https://help.qlik.com/en-US/sense/November2021/Subsystems/Hub/Content/Sense_Hub/Visualizations/data-in-your-visualization.htm

This field list can be retrieved by using "qType": "FieldList", similar to how "qType": "DimensionList" is used for Dimensions.
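
A minimal sketch of the Engine JSON API request involved, mirroring how DimensionList is scraped; the handle value and request id are illustrative:

import json

# CreateSessionObject on an open app handle with a FieldList definition;
# a subsequent GetLayout on the returned object yields qFieldList.qItems,
# one entry per field.
create_field_list_request = {
    'jsonrpc': '2.0',
    'id': 1,
    'handle': 1,  # handle of the previously opened app
    'method': 'CreateSessionObject',
    'params': [{
        'qInfo': {'qType': 'FieldList'},
        'qFieldListDef': {
            'qShowSystem': False,
            'qShowHidden': False,
            'qShowDerivedFields': True,
        },
    }],
}

print(json.dumps(create_field_list_request, indent=2))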

Thank you!

[BUG] List of systems does not get updated after Tableau sync

What happened:

I used the connector to sync data from a Tableau server to Google Data Catalog. The data came in fine, but the list of systems in Data Catalog does not get updated to include Tableau, even after days. Most likely this is an issue with Data Catalog itself and not the connector, and it may even be expected behavior.

What you expected to happen:

The list of systems should include Tableau.

How to reproduce it (as minimally and precisely as possible):

Sync a Tableau server to Data Catalog, then look at the list of systems in the Google Data Catalog main dashboard.

Anything else we need to know?:

(A screenshot of the resulting systems list was attached to the original issue.)

[BUG] Tableau connector always returns empty metadata

What happened:
In the v0.5.0 release, the Tableau connector always returns empty metadata in the scrape process.

What you expected to happen:
Return the existing workbooks/dashboards/sheets.

How to reproduce it (as minimally and precisely as possible):
Run the Tableau connector using the version from PyPI against a Tableau server containing metadata; no metadata will be ingested.

Anything else we need to know?:
In the previous version the metadata was returned.

[BUG] Initial full sync entries in Data Catalog are not searchable

After the successful initial/first run of the Tableau connector, entries like dashboards are ingested into Google Data Catalog, but searching by entry or by tag returns no results. The entries do exist under entry_group=tableau, though.

Expected: searching Data Catalog by entry name (like a dashboard name) or by the tag template "tableau_dashboard_metadata" should return the list of entries/views/dashboards/tables, etc.

To reproduce:
Create a Data Catalog service account with the catalog admin role: roles/datacatalog.admin
Create a Tableau developer demo with an example workbook.
Execute the exact steps in the README document to set up the environment, packages, and connector execution.

A couple of things to note:

  1. After the first run, once entries are created in the catalog:
    a. Either add a new dashboard to an existing workbook in Tableau and rerun the connector. Now all existing dashboards and the new dashboard under the same workbook are searchable in Data Catalog.
    b. Or, after the initial/first run, when entries are not searchable, edit a tag-template value for one entry in Data Catalog and save. Now that entry is searchable.

  2. I have noticed this pattern of initial/first entries not being searchable with other connectors too, like datacatalog-connectors-rdbms/google-datacatalog-mysql-connector, where a MySQL database, its tables, columns, or tag templates are not searchable unless there is a change in the DB or in that entry's metadata in the catalog. This makes me assume it's either an issue in Google Data Catalog or in the ingestion code: datacatalog-connectors/google-datacatalog-connectors-commons/

[BUG] Error when site name contains a space for the Tableau connector

What happened:

I am running the connector to ingest metadata from a Tableau server in the cloud into Google Data Catalog. The name of the site in question contains a space (like: My site). When this name is provided to the connector, it is in its space-free URL form (e.g. mysite).

Google Data Catalog fails because it does not support spaces:

google.api_core.exceptions.InvalidArgument: 400 
"https://us-east-1.online.tableau.com/#/site/My Site/workbooks/594378" is an invalid value for CreateEntryRequest.entry.linked_resource name. 
It must: contain only letters, numbers, periods, colons, slashs, underscores, dashes and hashes and be at most 200  bytes long when encoded in UTF-8.

What you expected to happen:

Metadata should be replicated to Google Data Catalog

How to reproduce it (as minimally and precisely as possible):

Use a site whose display name contains a space. Create a workbook inside it and run the connector.

Anything else we need to know?:

Looking at the code, I see an issue in the method __format_site_content_url, where '/site/{site_name}' is used to build the site URL part. This is not going to yield a valid URL in many cases.
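
A minimal sketch of a likely fix, assuming the site's URL-safe content URL (e.g. "MySite") is available from the Tableau API; the function body is illustrative, not the connector's actual code:

def format_site_content_url(site_content_url):
    # Tableau exposes a URL-safe content URL alongside the display name
    # (e.g. "MySite" for the site displayed as "My Site"); using it instead
    # of the display name avoids spaces that Data Catalog's linked_resource
    # validation rejects.
    return '/site/{}'.format(site_content_url) if site_content_url else ''


print(format_site_content_url('MySite'))  # /site/MySite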

[FEATURE] Add Tags with links to Looker on ingested Entries

What would you like to be added:
Add the following URLs to Looker-ingested Entries:

Folder: <hostname>/folders/<folder id>
Look: <hostname>/looks/<look id>
Dashboard: <hostname>/dashboards/<dashboard id>
Dashboard element: There is no individual URL for a dashboard element.
Query: <hostname>/explore/<model name>/<explore name>?qid=<query id, like "VZMahwm8XYSygcfpsep5aV">

Why is this needed:
Having those URLs improves the Data Catalog and Looker experience, making it easy for users to navigate between the two platforms.
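
A minimal sketch of a helper assembling those URLs; the function name and template keys are illustrative:

LOOKER_URL_TEMPLATES = {
    'folder': '{hostname}/folders/{asset_id}',
    'look': '{hostname}/looks/{asset_id}',
    'dashboard': '{hostname}/dashboards/{asset_id}',
    # Queries also need the model and explore names; the asset id is the
    # query slug (qid).
    'query': '{hostname}/explore/{model}/{explore}?qid={asset_id}',
}


def make_looker_url(asset_type, **kwargs):
    return LOOKER_URL_TEMPLATES[asset_type].format(**kwargs)


print(make_looker_url('look', hostname='https://mycompany.looker.com',
                      asset_id=42))
# https://mycompany.looker.com/looks/42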
