
tokern / data-lineage

296 stars · 8 watchers · 42 forks · 2.52 MB

Generate and Visualize Data Lineage from query history

Home Page: https://tokern.io/data-lineage/

License: MIT License

Languages: Python 87.27%, Jupyter Notebook 9.13%, Dockerfile 1.47%, Shell 1.44%, Makefile 0.69%
Topics: data-lineage, data-governance, python, postgresql, jupyter

data-lineage's Issues

could not translate host name "---" to address

I changed CATALOG_PASSWORD, CATALOG_USER, CATALOG_DB, and CATALOG_HOST accordingly and ran docker-compose -f tokern-lineage-engine.yml up.
It throws the following error:
return self.dbapi.connect(*cargs, **cparams)
tokern-data-lineage | File "/opt/pysetup/.venv/lib/python3.8/site-packages/psycopg2/__init__.py", line 122, in connect
tokern-data-lineage | conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
tokern-data-lineage | sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "-xxxxxxx.amazonaws.com" to address: Temporary failure in name resolution
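
A quick way to confirm whether that hostname is resolvable from wherever the container's Python environment runs is a minimal check like the sketch below (the hostname is a placeholder, not the real one from the log):

import socket

host = "your-cluster.xxxxxxx.amazonaws.com"  # placeholder for the real CATALOG_HOST value
try:
    # "Temporary failure in name resolution" means this lookup is what is failing
    print(socket.gethostbyname(host))
except socket.gaierror as error:
    print(f"DNS resolution failed: {error}")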

Error while parsing queries from json file

I was able to successfully load the catalog using dbcat, but I'm getting the following error when I try to parse queries using a file in JSON format (I also tried the given test file):

File "~/Python/3.8/lib/python/site-packages/data_lineage/parser/__init__.py", line 124, in parse
name = str(hash(sql))
TypeError: unhashable type: 'dict'

Here's line 124:

name = str(hash(sql))

Code executed:

from dbcat import catalog_connection
from data_lineage.parser import parse_queries, visit_dml_query
import json

with open("queries2.json", "r") as file:
    queries = json.load(file)

catalog_conf = """
catalog:
  user:test
  password: t@st
  host: 127.0.0.1
  port: 5432
  database: postgres
"""
catalog = catalog_connection(catalog_conf)

parsed = parse_queries(queries)

visited = visit_dml_query(catalog, parsed)
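
The TypeError indicates that parse() received a dict rather than a SQL string, which suggests each entry in queries2.json is a JSON object instead of a bare string. A minimal sketch that extracts the SQL text before parsing (the "query" key is an assumption about the file's layout, not confirmed by this issue):

with open("queries2.json", "r") as file:
    raw_entries = json.load(file)

# parse() hashes each item, so it needs plain SQL strings, not dicts.
# The "query" key below is assumed; use whichever field holds the SQL text.
sql_strings = [
    entry["query"] if isinstance(entry, dict) else entry
    for entry in raw_entries
]

parsed = parse_queries(sql_strings)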

nginx is timing out because scan takes a long time.

Looks like it's scanning now, but I'm getting lots of:

tokern-catalog    | 2021-10-14 13:52:05.252 UTC [36] DETAIL:  Key (source_id, name)=(142, foo) already exists.
tokern-catalog    | 2021-10-14 13:52:05.252 UTC [36] STATEMENT:  INSERT INTO schemata (name, source_id) VALUES ('foo', 142) RETURNING schemata.id
tokern-catalog    | 2021-10-14 13:52:08.597 UTC [36] ERROR:  duplicate key value violates unique constraint "unique_schema_name"
tokern-catalog    | 2021-10-14 13:52:08.597 UTC [36] DETAIL:  Key (source_id, name)=(142, bar) already exists.
tokern-catalog    | 2021-10-14 13:52:08.597 UTC [36] STATEMENT:  INSERT INTO schemata (name, source_id) VALUES ('bar', 142) RETURNING schemata.id
tokern-catalog    | 2021-10-14 13:52:08.675 UTC [36] ERROR:  duplicate key value violates unique constraint "unique_schema_name"

... which could be because we have the same schema names across different databases. Might be able to ignore this because we're only concerned with one database.

Another issue (I can open a separate ticket if desired) is that while the above keeps running, I get a 504 Server Error: Gateway Time-out for url: http://127.0.0.1:8000/api/v1/catalog/scanner - a different error than the gunicorn one from before. Is this nginx timing out now instead of gunicorn?

Originally posted by @peteclark3 in #75 (comment)

400 BAD Request when Calling /api/v1/catalog/sources

Hi,

I am trying to connect data_lineage to an external PostgreSQL database, but I am receiving a 400 BAD Request error when calling the /api/v1/catalog/sources endpoint. We deployed data_lineage using Docker. Following the example, we pass:

edw_db = {
    "username": "<external postgres username>",
    "password": "somepassw0rd|",
    "uri": "<external postgresql hostname>",
    "port": "<external postgresql port>",
    "database": "<external postgresql database>"
}

to

source = catalog.add_source(name="edw", source_type="postgresql", **edw_db)

but it seems we get the error here. Please help.

Tokern visualizer new feature

I am working with the docker-compose file. When I select the output node, I need only the inputs linked to it in the column_datalineage table to be highlighted, not all the inputs linked to the load node.

503 Service Unavailable When calling catalog.add_source()

Hi,

I have deployed tokern data lineage using Docker as stated in your documentation. I connected data lineage to an external PostgreSQL database, so I overrode the CATALOG_PASSWORD, CATALOG_USER, CATALOG_DB and CATALOG_HOST variables.

I have written a Python script which grabs all queries logged in PostgreSQL and passes them to tokern data lineage. However, I keep getting the following error:

503 Server Error: Service Unavailable for url: https://<host ip>:8000/api/v1/catalog/sources

This error occurs on the following call:
source = catalog.add_source(name="edw", source_type="postgresql", **edw_db)

The logs do not show any errors even when the log level is set to debug. Please help. Thank you.

CTE visiting

Currently, it doesn't appear that the dml_visitor will walk through common table expressions to build the lineage. Am I interpreting this wrong? Within visitor.py, lines 45 and 61 both visit the "with clause". There doesn't seem to be any functionality for handling the CommonTableExpr or CTEs within the parsed statements. This causes any statements with CTEs to throw an error when calling parse_queries, as no table is found when attempting to bind a CTE in a FROM clause.
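
For reference, a minimal example of the kind of statement that triggers the problem described above (table and column names are illustrative):

from data_lineage.parser import parse_queries

cte_query = """
WITH recent_pages AS (
    SELECT page_id, page_title FROM page
)
INSERT INTO page_summary (page_id, page_title)
SELECT page_id, page_title FROM recent_pages
"""

# The CTE "recent_pages" is not a catalog table, so binding the
# FROM clause fails as described in this issue.
parsed = parse_queries([cte_query])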

Support for Google BigQuery

Any possibility of supporting Google BigQuery?
BigQuery already has a query history feature, and we can retrieve it from the BigQuery logs exported table.

Could someone give me quick pointers for loading directly from a JSON file without parsing?

Hi,

I am working on a POC and have created metadata for Teradata. I will convert that metadata to the demo mock JSON that was in the kedro-viz modular example. The JSON will already be formatted and parsed. I just need quick pointers on how to load this JSON and display it in kedro-viz.

Sorry to bother everyone, but I am in a hurry. I have all the metadata, and I only learned about kedro and kedro-viz yesterday, so I just need a quick shortcut.

Thanks,

ModuleNotFoundError: No module named 'sqlalchemy.sql.roles'

from data_lineage import Analyze, Catalog

...
Traceback (most recent call last):
  File "/opt/workspace/anaconda2/envs/tf21/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/workspace/anaconda2/envs/tf21/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cobolbaby/.vscode/extensions/ms-python.python-2021.11.1422169775/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/cobolbaby/.vscode/extensions/ms-python.python-2021.11.1422169775/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/home/cobolbaby/.vscode/extensions/ms-python.python-2021.11.1422169775/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/opt/workspace/anaconda2/envs/tf21/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/opt/workspace/anaconda2/envs/tf21/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/opt/workspace/anaconda2/envs/tf21/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/cobolbaby/data/ubuntu/opt/workspace/git/lineage/analyse.py", line 3, in <module>
    from data_lineage import Analyze, Catalog
  File "/opt/workspace/anaconda2/envs/tf21/lib/python3.6/site-packages/data_lineage/__init__.py", line 10, in <module>
    from dbcat.catalog.models import JobExecutionStatus
  File "/opt/workspace/anaconda2/envs/tf21/lib/python3.6/site-packages/dbcat/__init__.py", line 7, in <module>
    from dbcat.catalog import Catalog
  File "/opt/workspace/anaconda2/envs/tf21/lib/python3.6/site-packages/dbcat/catalog/__init__.py", line 3, in <module>
    from .catalog import Catalog
  File "/opt/workspace/anaconda2/envs/tf21/lib/python3.6/site-packages/dbcat/catalog/catalog.py", line 9, in <module>
    from dbcat.catalog.models import (
  File "/opt/workspace/anaconda2/envs/tf21/lib/python3.6/site-packages/dbcat/catalog/models.py", line 5, in <module>
    from snowflake.sqlalchemy import URL
  File "/opt/workspace/anaconda2/envs/tf21/lib/python3.6/site-packages/snowflake/sqlalchemy/__init__.py", line 25, in <module>
    from . import base, snowdialect
  File "/opt/workspace/anaconda2/envs/tf21/lib/python3.6/site-packages/snowflake/sqlalchemy/base.py", line 17, in <module>
    from .custom_commands import AWSBucket, AzureContainer, ExternalStage
  File "/opt/workspace/anaconda2/envs/tf21/lib/python3.6/site-packages/snowflake/sqlalchemy/custom_commands.py", line 14, in <module>
    from sqlalchemy.sql.roles import FromClauseRole
ModuleNotFoundError: No module named 'sqlalchemy.sql.roles'
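
A quick way to see which SQLAlchemy copy is actually being imported in that environment (the exact version requirement isn't confirmed here, only that the installed copy is missing sqlalchemy.sql.roles):

import sqlalchemy

# snowflake-sqlalchemy's custom_commands imports sqlalchemy.sql.roles,
# so printing the version and location shows which SQLAlchemy is picked up.
print(sqlalchemy.__version__)
print(sqlalchemy.__file__)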

MySQL client binaries seem to be required

This is probably due to SQLAlchemy's requirement of mysqlclient, but when doing

pip install data-lineage

The following is seen

Collecting mysqlclient<3,>=1.3.6
  Using cached mysqlclient-2.1.1.tar.gz (88 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [16 lines of output]
      /bin/sh: mysql_config: command not found
      /bin/sh: mariadb_config: command not found
      /bin/sh: mysql_config: command not found
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/private/var/folders/th/yz4tb0ss5t3_4df1xnfrkg3r0000gn/T/pip-install-auypdvbk/mysqlclient_42a825d5ee084d6686c16912ef8320cc/setup.py", line 15, in <module>
          metadata, options = get_config()
        File "/private/var/folders/th/yz4tb0ss5t3_4df1xnfrkg3r0000gn/T/pip-install-auypdvbk/mysqlclient_42a825d5ee084d6686c16912ef8320cc/setup_posix.py", line 70, in get_config
          libs = mysql_config("libs")
        File "/private/var/folders/th/yz4tb0ss5t3_4df1xnfrkg3r0000gn/T/pip-install-auypdvbk/mysqlclient_42a825d5ee084d6686c16912ef8320cc/setup_posix.py", line 31, in mysql_config
          raise OSError("{} not found".format(_mysql_config_path))
      OSError: mysql_config not found
      mysql_config --version
      mariadb_config --version
      mysql_config --libs
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Installing the MySQL client fixes it.

Since you are using SQLAlchemy, this is out of your hands, but this issue is to suggest adding a note to that effect in the docs.

Support for large queries

calling

analyze.analyze(**{"query":query}, source=dl_source, start_time=datetime.now(), end_time=datetime.now())

with a large query, I get a "request too long" error. It seems that even though it is POSTing, it's still appending the query to the URL, and thus the request fails, e.g.:

tokern-data-lineage-visualizer | 10.10.0.1 - - [14/Oct/2021:14:39:00 +0000] "POST /api/v1/analyze?query=ANY_REALLY_LONG_QUERY_HERE

syntax error for Snowflake query history

After fixing issue #33, it is still failing. Note that I am using Snowflake.

import datetime
end_time = datetime.datetime.now()
start_time = end_time - datetime.timedelta(days=7)

query = f"""
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('{start_time.isoformat()}'),
    end_time_range_end=>to_timestamp_ltz('{end_time.isoformat()}')));
"""

cursors = conn.execute_string(
    sql_text=query
)

queries = []
for cursor in cursors:
  for row in cursor:
    print(f"{row[0]}")
    queries.append(row[0])

This shows query history as follows.

SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2021-04-25T13:46:32.544154'),
    end_time_range_end=>to_timestamp_ltz('2021-05-02T13:46:32.544154')));
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2021-04-25T13:46:20.237862'),
    end_time_range_end=>to_timestamp_ltz('2021-05-02T13:46:20.237862')));
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2021-04-25T13:45:18.371513'),
    end_time_range_end=>to_timestamp_ltz('2021-05-02T13:45:18.371513')));
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2021-04-25T13:44:27.187499'),
    end_time_range_end=>to_timestamp_ltz('2021-05-02T13:44:27.187499')));
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2021-04-24T07:25:55.213431'),
    end_time_range_end=>to_timestamp_ltz('2021-05-01T07:25:55.213431')));
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2021-04-24T07:25:23.433387'),
    end_time_range_end=>to_timestamp_ltz('2021-05-01T07:25:23.433387')));
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2021-04-24T07:10:29.311609'),
    end_time_range_end=>to_timestamp_ltz('2021-05-01T07:10:29.311609')));
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2021-04-24T07:03:48.882660'),
    end_time_range_end=>to_timestamp_ltz('2021-05-01T07:03:48.882660')));
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2021-04-24T07:02:13.962780'),
    end_time_range_end=>to_timestamp_ltz('2021-05-01T07:02:13.962780')));
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2021-04-24T07:02:03.205936'),
    end_time_range_end=>to_timestamp_ltz('2021-05-01T07:02:03.205936')));
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2021-04-01 00:00:00 +0800'),
    end_time_range_end=>to_timestamp_ltz('2021-05-01 00:00:00 +0800')));
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2021-04-30 23:59:59 +0800'),
    end_time_range_end=>to_timestamp_ltz('2021-04-26 00:00:00 +0800')));
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2021-04-30 23:59:59 +0800'),
    end_time_range_end=>to_timestamp_ltz('2021-04-6 00:00:00 +0800')));
put file:☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺.csv @-/staged;
PUT file:☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺.csv @-/staged;
COPY INTO tmp1
FROM @ANALYTICS_CUSTOM_PH/tmp1.csv
FILE_FORMAT = (
  TYPE = CSV  SKIP_HEADER = 1
);
LIST @ANALYTICS_CUSTOM_PH;
LIST @ANALYTICS_CUSTOM;
CREATE OR REPLACE TABLE tmp1(a INT, b STRING);
SHOW GRANTS TO USER identifier('"YOHEI"');
SELECT * FROM identifier('"SINGLIFE"."ANALYTICS_CUSTOM"."TMP1"') LIMIT 100;
COPY INTO tmp1
FROM @ANALYTICS_CUSTOM/tmp1.csv
FILE_FORMAT = (
  TYPE = CSV  SKIP_HEADER = 1
);
COPY INTO tmp1
FROM @ANALYTICS_CUSTOM/tmp1.csv
HEADER;
COPY INTO tmp1
FROM @ANALYTICS_CUSTOM/tmp1.csv
HEADER = TRUE;
COPY INTO tmp1
FROM @ANALYTICS_CUSTOM/tmp1.csv
FILE_FORMAT = (TYPE = CSV)
HEADER = TRUE;
COPY INTO tmp1
FROM @ANALYTICS_CUSTOM/tmp1.csv
FILE_FORMAT = (TYPE = CSV HEADER = TRUE);
COPY INTO tmp1
FROM @ANALYTICS_CUSTOM/tmp1.csv;
LIST @ANALYTICS_CUSTOM;
SHOW STAGES LIKE 'ANALYTICS_CUSTOM_PH' IN SCHEMA "SINGLIFE"."ANALYTICS_CUSTOM_PH"
SHOW STAGES LIKE 'ANALYTICS_CUSTOM' IN SCHEMA "SINGLIFE"."ANALYTICS_CUSTOM"
DESCRIBE STAGE "SINGLIFE"."ANALYTICS_CUSTOM"."ANALYTICS_CUSTOM"
DESCRIBE STAGE "SINGLIFE"."ANALYTICS_CUSTOM_PH"."ANALYTICS_CUSTOM_PH"
ALTER STAGE "SINGLIFE"."ANALYTICS_CUSTOM_PH"."ANALYTICS_CUSTOM_PH" SET URL = 's3://singlife-data-pf-sandbox-dev/analytics_custom_ph/'
ALTER STAGE "SINGLIFE"."ANALYTICS_CUSTOM"."ANALYTICS_CUSTOM" SET URL = 's3://singlife-data-pf-sandbox-dev/analytics_custom/'
SHOW GRANTS OF ROLE "DATA_ENGINEERING_ADVANCED"
SHOW GRANTS OF ROLE "DATA_ANALYST_ADVANCED_PH"
SHOW GRANTS OF ROLE "DATA_ANALYST_BASE_PH"
SHOW GRANTS OF ROLE "DATA_ANALYST_ADVANCED"
SHOW GRANTS OF ROLE "SNOWPIPE"
SHOW GRANTS OF ROLE "DEPLOY_ADMIN"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."LANDING_REALTIME"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."ANALYTICS"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_CUSTOM_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_CUSTOM_PH"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."ANALYTICS"
SHOW FUTURE GRANTS IN SCHEMA "SINGLIFE"."ANALYTICS"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_CUSTOM_PH"
SHOW GRANTS ON SCHEMA "SINGLIFE"."ANALYTICS_CUSTOM_PH"

Then, trying to parse the queries:

from data_lineage.parser import parse_queries, visit_dml_queries

# Parse all queries
parsed = parse_queries(queries)

# Visit the parse trees to extract source and target queries
visited = visit_dml_queries(catalog, parsed)

# Create a graph and visualize it

from data_lineage.parser import create_graph
graph = create_graph(catalog, visited)

import plotly
plotly.offline.iplot(graph.fig())

I got this error:

---------------------------------------------------------------------------
ParseError                                Traceback (most recent call last)
<ipython-input-12-151c67ea977c> in <module>
      2 
      3 # Parse all queries
----> 4 parsed = parse_queries(queries)
      5 
      6 # Visit the parse trees to extract source and target queries

/opt/conda/lib/python3.8/site-packages/data_lineage/parser/__init__.py in parse_queries(queries)
     17 
     18 def parse_queries(queries: List[str]) -> List[Parsed]:
---> 19     return [parse(query) for query in queries]
     20 
     21 

/opt/conda/lib/python3.8/site-packages/data_lineage/parser/__init__.py in <listcomp>(.0)
     17 
     18 def parse_queries(queries: List[str]) -> List[Parsed]:
---> 19     return [parse(query) for query in queries]
     20 
     21 

/opt/conda/lib/python3.8/site-packages/data_lineage/parser/node.py in parse(sql, name)
    319     if name is None:
    320         name = str(hash(sql))
--> 321     node = AcceptingNode(parse_sql(sql))
    322 
    323     return Parsed(name, node)

pglast/parser.pyx in pglast.parser.parse_sql()

ParseError: syntax error at or near "table", at location 24

Use markupsafe==2.0.1

$ data_lineage --catalog-user xxx --catalog-password yyy
Traceback (most recent call last):
  File "/opt/homebrew/bin/data_lineage", line 5, in <module>
    from data_lineage.__main__ import main
  File "/opt/homebrew/lib/python3.9/site-packages/data_lineage/__main__.py", line 7, in <module>
    from data_lineage.server import create_server
  File "/opt/homebrew/lib/python3.9/site-packages/data_lineage/server.py", line 5, in <module>
    import flask_restless
  File "/opt/homebrew/lib/python3.9/site-packages/flask_restless/__init__.py", line 22, in <module>
    from .manager import APIManager  # noqa
  File "/opt/homebrew/lib/python3.9/site-packages/flask_restless/manager.py", line 24, in <module>
    from flask import Blueprint
  File "/opt/homebrew/lib/python3.9/site-packages/flask/__init__.py", line 14, in <module>
    from jinja2 import escape
  File "/opt/homebrew/lib/python3.9/site-packages/jinja2/__init__.py", line 12, in <module>
    from .environment import Environment
  File "/opt/homebrew/lib/python3.9/site-packages/jinja2/environment.py", line 25, in <module>
    from .defaults import BLOCK_END_STRING
  File "/opt/homebrew/lib/python3.9/site-packages/jinja2/defaults.py", line 3, in <module>
    from .filters import FILTERS as DEFAULT_FILTERS  # noqa: F401
  File "/opt/homebrew/lib/python3.9/site-packages/jinja2/filters.py", line 13, in <module>
    from markupsafe import soft_unicode
ImportError: cannot import name 'soft_unicode' from 'markupsafe' (/opt/homebrew/lib/python3.9/site-packages/markupsafe/__init__.py)

Looks like that was removed in 2.1.0. You may want to specify markupsafe==2.0.1.

Snowflake source defaulting to prod even though I'm specifying a different db name

I'm adding a Snowflake source as follows, where sf_db_name is my database name, e.g. snowfoo (verified in the debugger):

source = catalog.add_source(
    name=f"sf1_{time.time_ns()}",
    source_type="snowflake",
    database=sf_db_name,
    username=sf_username,
    password=sf_password,
    account=sf_account,
    role=sf_role,
    warehouse=sf_warehouse,
)

... but when it goes to scan, it looks like the code thinks my database name is 'prod':

tokern-data-lineage | sqlalchemy.exc.ProgrammingError: (snowflake.connector.errors.ProgrammingError) 002003 (02000): SQL compilation error:
tokern-data-lineage | Database 'PROD' does not exist or not authorized.
tokern-data-lineage | [SQL:
tokern-data-lineage |     SELECT
tokern-data-lineage |         lower(c.column_name) AS col_name,
tokern-data-lineage |         c.comment AS col_description,
tokern-data-lineage |         lower(c.data_type) AS col_type,
tokern-data-lineage |         lower(c.ordinal_position) AS col_sort_order,
tokern-data-lineage |         lower(c.table_catalog) AS database,
tokern-data-lineage |         lower(c.table_catalog) AS cluster,
tokern-data-lineage |         lower(c.table_schema) AS schema,
tokern-data-lineage |         lower(c.table_name) AS name,
tokern-data-lineage |         t.comment AS description,
tokern-data-lineage |         decode(lower(t.table_type), 'view', 'true', 'false') AS is_view
tokern-data-lineage |     FROM
tokern-data-lineage |         prod.INFORMATION_SCHEMA.COLUMNS AS c
tokern-data-lineage |     LEFT JOIN
tokern-data-lineage |         prod.INFORMATION_SCHEMA.TABLES t
tokern-data-lineage |             ON c.TABLE_NAME = t.TABLE_NAME
tokern-data-lineage |             AND c.TABLE_SCHEMA = t.TABLE_SCHEMA
tokern-data-lineage |      ;
tokern-data-lineage |     ]
tokern-data-lineage | (Background on this error at: http://sqlalche.me/e/13/f405)

I'm trying to look through the tokern code repos to see where the disconnect might be happening, but I'm not sure yet.

libpq missing in docker container

I tried using the latest Docker file. When I tried to execute the sample notebook, it gave me the following error:

Traceback (most recent call last):
  File "~/Packages/User/lin_test.py", line 27, in <module>
    source = catalog.add_source(name="dev", source_type="redshift", **wikimedia_db)
  File "~/Library/Python/3.8/lib/python/site-packages/data_lineage/__init__.py", line 379, in add_source
    payload = self._post(path="sources", data=data, type="sources")
  File "~/Library/Python/3.8/lib/python/site-packages/data_lineage/__init__.py", line 202, in _post
    response.raise_for_status()
  File "/Library/Python/3.8/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: http://127.0.0.1:8000/api/v1/catalog/sources
  1. When I look at the log for data-lineage-visualizer, I see multiple lines with the same error:

2021/07/09 16:45:09 [error] 25#25: *12 connect() failed (113: Host is unreachable) while connecting to upstream, client:

  2. The data lineage app has been continuously restarting with the following error:
import psycopg2

File "/opt/pysetup/.venv/lib/python3.8/site-packages/psycopg2/__init__.py", line 51, in <module>

from psycopg2._psycopg import ( # noqa

ImportError: libpq.so.5: cannot open shared object file: No such file or directory

On a side note, I'm planning on doing the server implementation without using Docker; I thought I'd first see if I can get it working with the Docker file.

Originally posted by @siva-mudiyanur in #57 (comment)

Enhanced Logging into AWS Redshift

Hi team,

We are planning to run Data Lineage tool against AWS Redshift to generate column level lineage.
In the demo, we can see that the only option to connect to the source database is by using a username and password, which is not in line with our security policy. I am therefore raising this request for options such as AWS KMS tokens that we can use to log in to AWS Redshift without a username and password in the Python code.

Also, is there any option where we do NOT log in to the source database and instead generate lineage from the queries file? For example, could we download the DDL and have the tool use that DDL for query parsing rather than an active connection to the source database?

I appreciate your response, and please let me know if more information is required.

Debian Buster can't find version

Hi, I'm trying to install version 0.8 in a Docker image that runs on Debian Buster, and when pip runs the install command it prints the following warning/error:

#12 9.444 Collecting data-lineage==0.8.0 (from -r /project/requirements.txt (line 25))
#12 9.466   Could not find a version that satisfies the requirement data-lineage==0.8.0 (from -r /project/requirements.txt (line 25)) (from versions: 0.1.2, 0.2.0, 0.3.0, 0.5.1, 0.5.2, 0.6.0, 0.7.0)
#12 9.541 No matching distribution found for data-lineage==0.8.0 (from -r /project/requirements.txt (line 25))

Is this normal behavior? Do I have to add something before trying to install?

Update python base image

python:3.8.1-slim is old and has 21 security vulnerabilities.

At least consider swapping to python:3.8-slim.

Any way to increase timeout for scanning?

When I add my snowflake DB for scanning, using this bit of code (with the values replaced as per my snowflake database):

from data_lineage import Catalog

catalog = Catalog(docker_address)

# Register wikimedia datawarehouse with data-lineage app.

source = catalog.add_source(name="wikimedia", source_type="postgresql", **wikimedia_db)

# Scan the wikimedia data warehouse and register all schemata, tables and columns.

catalog.scan_source(source)

... I get

tokern-data-lineage-visualizer | 2021/10/08 21:51:40 [error] 34#34: *1 upstream prematurely closed connection while reading response header from upstream, client: 10.10.0.1, server: , request: "POST /api/v1/catalog/scanner HTTP/1.1", upstream: "http://10.10.0.3:4142/api/v1/catalog/scanner", host: "127.0.0.1:8000"

... I think it's because Snowflake isn't returning fast enough, but I'm not sure. I tried updating the warehouse size to large to make the scan faster, but I'm getting the same thing. It seems to time out pretty fast, at least for my large database. Any ideas?

Python 3.8.0 in an isolated venv, data-lineage 0.8.3. Thanks for this package!

What query format to pass to Analyzer.analyze(...)?

I am trying to use this example:
https://tokern.io/docs/data-lineage/queries
... first issue: this bit of code looks like it's just going to fetch a single row of the query history from Snowflake:

queries = []
with connection.get_cursor() as cursor:
  cursor.execute(query)
  row = cursor.fetchone()

  while row is not None:
    queries.append(row[0])

... is this intended? Note that it's using .fetchone().
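
For what it's worth, the loop as quoted never re-fetches, so it would also keep appending the same row. A corrected sketch of the fetch loop, keeping the same connection.get_cursor() helper from the docs example:

queries = []
with connection.get_cursor() as cursor:
    cursor.execute(query)
    row = cursor.fetchone()
    while row is not None:
        queries.append(row[0])   # first column is query_text
        row = cursor.fetchone()  # advance to the next row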

Then, second issue: when I go back to the example here: https://tokern.io/docs/data-lineage/example

I see this bit of code...

analyze = Analyze(docker_address)

for query in queries:
    print(query)
    analyze.analyze(**query, source=source, start_time=datetime.now(), end_time=datetime.now())

... what does the queries array look like? Or better yet, what does a single query item look like? Above it in the example, it looks to be a JSON payload:

with open("queries.json", "r") as file:
    queries = json.load(file)

... but I've no idea what the payload is supposed to look like.
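
For comparison, the "Support for large queries" issue earlier in this list calls analyze.analyze(**{"query": query}, ...), which suggests each item is a mapping with a single "query" key holding the SQL text. A minimal sketch under that assumption (the SQL string is a placeholder):

from datetime import datetime

# Assumed shape of queries.json: a list of objects with a "query" field.
queries = [
    {"query": "INSERT INTO target_table SELECT col_a, col_b FROM source_table"},
]

for query in queries:
    analyze.analyze(**query, source=source, start_time=datetime.now(), end_time=datetime.now())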

I've tried 8 different ways of passing this **query variable into analyze(...) - using the results from the snowflake example on https://tokern.io/docs/data-lineage/queries - but I can never seem to get it right. Either I get an error saying that ** expects a mapping when I use strings or tuples (which is fine, but what's the mapping the function expects?) - or I get an error in the API console itself like

tokern-data-lineage |     raise ValueError('Bad argument, expected a ast.Node instance or a tuple')
tokern-data-lineage | ValueError: Bad argument, expected a ast.Node instance or a tuple

Could we get a more concrete Snowflake example, or at the bare minimum an indication of what the query variable is supposed to look like?

Note that I am also trying to inspect the unit tests and use those as examples, but still not getting very far.

Thanks for this package!

Instructions are wrong

I followed the example, ran docker-compose up -d, and got the following error:

Traceback (most recent call last):
File "urllib3/connectionpool.py", line 670, in urlopen
File "urllib3/connectionpool.py", line 392, in _make_request
File "http/client.py", line 1255, in request
File "http/client.py", line 1301, in _send_request
File "http/client.py", line 1250, in endheaders
File "http/client.py", line 1010, in _send_output
File "http/client.py", line 950, in send
File "docker/transport/unixconn.py", line 43, in connect
FileNotFoundError: [Errno 2] No such file or directory

Are the instructions correct?

wrong selected field

As a test, I selected the field "page_id" and expected to find all nodes connected only(!) to this field.

(screenshot omitted)

Demo is wrong

Trying out a demo, I tried to run catalog.scan_source(source). But that does not exist. After some digging, it looks like this works:

from data_lineage import Scan

Scan('http://127.0.0.1:8000').start(source)

Please fix the demo pages.

update_schema API returns KeyError

Couple of issues:

  1. update_schema() is adding a default schema but the return seems to be failing:
Traceback (most recent call last):
  File "~/Library/Application Support/Sublime Text 3/Packages/User/lin_test.py", line 46, in <module>
    catalog.update_source(source,schema)
  File "~/Library/Python/3.8/lib/python/site-packages/data_lineage/__init__.py", line 489, in update_source
    attributes=payload["attributes"],
KeyError: 'attributes'

Code executed:

source = catalog.get_source("rs")
schema = catalog.get_schema("rs", "test")
catalog.update_source(source,schema)
  2. A query is sent to the server for parsing, but there is no update in the UI and no logs are generated on the lineage app either.

Last line from visualizer log:
10.10.0.1 - - [16/Jul/2021:15:53:08 +0000] "POST /api/v1/parse?query=<query>&source_id=7 HTTP/1.1" 200 181 "-" "python-requests/2.25.1"

*Removed the query as it's pretty long.

Originally posted by @siva-mudiyanur in #57 (comment)

Redis dependency not documented

Trying out a demo, I saw the scan command (see also #106) fail, with the server looking for port 6379 on localhost. Sure enough, starting a local Redis removed that problem. Can this be documented? It looks like the docker-compose file includes it; the instructions just don't mention it.

tokern_worker container keeps restarting with error: /docker-entrypoint.sh: 11: exec: rq: not found

Steps to reproduce:

  1. wget https://raw.githubusercontent.com/tokern/data-lineage/master/install-manifests/docker-compose/tokern-lineage-engine.yml
  2. Configure it to use an external Postgres database by changing the following parameters in tokern-lineage-engine.yml:
    CATALOG_HOST
    CATALOG_USER
    CATALOG_PASSWORD
    CATALOG_DB
  3. docker-compose -f tokern-lineage-engine.yml up -d
  4. run: docker ps (to list the status)
  5. run: docker logs -f tokern_worker (to get the logs)

(screenshot omitted)

Support ARM docker builds

This can't currently be run on ARM hardware without being in emulation mode. Given Mac M1 and AWS EC2 Graviton, it would be nice to have ARM support.

Unable to run demo: Key error "data"

I'm trying to run the data lineage wikimedia demo but I'm running into an error:

Traceback (most recent call last):
  File "/Users/georgebezerra/opt/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/georgebezerra/opt/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/georgebezerra/.vscode/extensions/ms-python.python-2021.12.1559732655/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/Users/georgebezerra/.vscode/extensions/ms-python.python-2021.12.1559732655/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/Users/georgebezerra/.vscode/extensions/ms-python.python-2021.12.1559732655/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/Users/georgebezerra/opt/anaconda3/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/Users/georgebezerra/opt/anaconda3/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/Users/georgebezerra/opt/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/georgebezerra/Dev/demo.py", line 19, in <module>
    source = catalog.add_source(name="wikimedia", source_type="postgresql", **wikimedia_db)
  File "/Users/georgebezerra/opt/anaconda3/lib/python3.8/site-packages/data_lineage/__init__.py", line 319, in add_source
    payload = self._post(path="sources", data=data, type="sources")
  File "/Users/georgebezerra/opt/anaconda3/lib/python3.8/site-packages/data_lineage/__init__.py", line 144, in _post
    return response.json()["data"]
KeyError: 'data'

The Docker piece seems to be running fine except for the tokern worker, which is returning the following message:

/docker-entrypoint.sh: 11: exec: rq: not found

This is running on a MacBook Pro with an M1 chip.

cannot import name 'parse_queries' from 'data_lineage.parser'

Hi, I am trying to parse query history from Snowflake in a Jupyter notebook.

data lineage version 0.3.0

!pip install snowflake-connector-python[secure-local-storage,pandas] data-lineage

import datetime
end_time = datetime.datetime.now()
start_time = end_time - datetime.timedelta(days=7)

query = f"""
SELECT query_text
FROM table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('{start_time.isoformat()}'),
    end_time_range_end=>to_timestamp_ltz('{end_time.isoformat()}')));
"""

cursors = conn.execute_string(
    sql_text=query
)

queries = []
for cursor in cursors:
  for row in cursor:
    print(row[0])
    queries.append(row[0])

from data_lineage.parser import parse_queries, visit_dml_queries

# Parse all queries
parsed = parse_queries(queries)

# Visit the parse trees to extract source and target queries
visited = visit_dml_queries(catalog, parsed)

# Create a graph and visualize it

from data_lineage.parser import create_graph
graph = create_graph(catalog, visited)

import plotly
plotly.offline.iplot(graph.fig())

Then I got this error. Would you help me find the root cause?

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-33-151c67ea977c> in <module>
----> 1 from data_lineage.parser import parse_queries, visit_dml_queries
      2 
      3 # Parse all queries
      4 parsed = parse_queries(queries)
      5 

ImportError: cannot import name 'parse_queries' from 'data_lineage.parser' (/opt/conda/lib/python3.8/site-packages/data_lineage/parser/__init__.py)

Connection drops on Mac when connecting to db in host machine

I see that this error log got appended to the lineage log after I posted the previous comment (it would have taken about 3-5 minutes):

ERROR:data_lineage.server:Exception on /api/main [GET]

Traceback (most recent call last):
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
psycopg2.OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/flask_restful/__init__.py", line 467, in wrapper
    resp = resource(*args, **kwargs)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/flask/views.py", line 89, in view
    return self.dispatch_request(*args, **kwargs)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/flask_restful/__init__.py", line 582, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/server.py", line 69, in get
    column_edges = self._catalog.get_column_lineages(args["job_ids"])
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dbcat/catalog/catalog.py", line 299, in get_column_lineages
    return query.all()
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3373, in all
    return list(self)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3535, in __iter__
    return self._execute_and_instances(context)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3560, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1011, in execute
    return meth(self, multiparams, params)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1124, in _execute_clauseelement
    ret = self._execute_context(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1316, in _execute_context
    self._handle_dbapi_exception(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1510, in _handle_dbapi_exception
    util.raise_(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

[SQL: SELECT column_lineage.id AS column_lineage_id, column_lineage.context AS column_lineage_context, column_lineage.source_id AS column_lineage_source_id, column_lineage.target_id AS column_lineage_target_id, column_lineage.job_execution_id AS column_lineage_job_execution_id, sources_1.id AS sources_1_id, sources_1.source_type AS sources_1_source_type, sources_1.name AS sources_1_name, sources_1.dialect AS sources_1_dialect, sources_1.uri AS sources_1_uri, sources_1.port AS sources_1_port, sources_1.username AS sources_1_username, sources_1.password AS sources_1_password, sources_1.database AS sources_1_database, sources_1.instance AS sources_1_instance, sources_1.cluster AS sources_1_cluster, sources_1.project_id AS sources_1_project_id, sources_1.project_credentials AS sources_1_project_credentials, sources_1.page_size AS sources_1_page_size, sources_1.filter_key AS sources_1_filter_key, sources_1.included_tables_regex AS sources_1_included_tables_regex, sources_1.key_path AS sources_1_key_path, sources_1.account AS sources_1_account, sources_1.role AS sources_1_role, sources_1.warehouse AS sources_1_warehouse, schemata_1.id AS schemata_1_id, schemata_1.name AS schemata_1_name, schemata_1.source_id AS schemata_1_source_id, tables_1.id AS tables_1_id, tables_1.name AS tables_1_name, tables_1.schema_id AS tables_1_schema_id, columns_1.id AS columns_1_id, columns_1.name AS columns_1_name, columns_1.data_type AS columns_1_data_type, columns_1.sort_order AS columns_1_sort_order, columns_1.table_id AS columns_1_table_id, sources_2.id AS sources_2_id, sources_2.source_type AS sources_2_source_type, sources_2.name AS sources_2_name, sources_2.dialect AS sources_2_dialect, sources_2.uri AS sources_2_uri, sources_2.port AS sources_2_port, sources_2.username AS sources_2_username, sources_2.password AS sources_2_password, sources_2.database AS sources_2_database, sources_2.instance AS sources_2_instance, sources_2.cluster AS sources_2_cluster, sources_2.project_id AS sources_2_project_id, sources_2.project_credentials AS sources_2_project_credentials, sources_2.page_size AS sources_2_page_size, sources_2.filter_key AS sources_2_filter_key, sources_2.included_tables_regex AS sources_2_included_tables_regex, sources_2.key_path AS sources_2_key_path, sources_2.account AS sources_2_account, sources_2.role AS sources_2_role, sources_2.warehouse AS sources_2_warehouse, schemata_2.id AS schemata_2_id, schemata_2.name AS schemata_2_name, schemata_2.source_id AS schemata_2_source_id, tables_2.id AS tables_2_id, tables_2.name AS tables_2_name, tables_2.schema_id AS tables_2_schema_id, columns_2.id AS columns_2_id, columns_2.name AS columns_2_name, columns_2.data_type AS columns_2_data_type, columns_2.sort_order AS columns_2_sort_order, columns_2.table_id AS columns_2_table_id, jobs_1.id AS jobs_1_id, jobs_1.name AS jobs_1_name, jobs_1.context AS jobs_1_context, jobs_1.source_id AS jobs_1_source_id, job_executions_1.id AS job_executions_1_id, job_executions_1.job_id AS job_executions_1_job_id, job_executions_1.started_at AS job_executions_1_started_at, job_executions_1.ended_at AS job_executions_1_ended_at, job_executions_1.status AS job_executions_1_status

FROM column_lineage LEFT OUTER JOIN columns AS columns_1 ON columns_1.id = column_lineage.source_id LEFT OUTER JOIN tables AS tables_1 ON tables_1.id = columns_1.table_id LEFT OUTER JOIN schemata AS schemata_1 ON schemata_1.id = tables_1.schema_id LEFT OUTER JOIN sources AS sources_1 ON sources_1.id = schemata_1.source_id LEFT OUTER JOIN columns AS columns_2 ON columns_2.id = column_lineage.target_id LEFT OUTER JOIN tables AS tables_2 ON tables_2.id = columns_2.table_id LEFT OUTER JOIN schemata AS schemata_2 ON schemata_2.id = tables_2.schema_id LEFT OUTER JOIN sources AS sources_2 ON sources_2.id = schemata_2.source_id LEFT OUTER JOIN job_executions AS job_executions_1 ON job_executions_1.id = column_lineage.job_execution_id LEFT OUTER JOIN jobs AS jobs_1 ON jobs_1.id = job_executions_1.job_id]

(Background on this error at: http://sqlalche.me/e/13/e3q8)

Originally posted by @siva-mudiyanur in #57 (comment)

Parser trips up on common snowflake query history

Currently, the parser trips up on many common Snowflake query history entries, such as select query_text from table(information_schema.query_history());, queries with the rm @SNOWFLAKE_... syntax, and queries with the keyword recluster (in the latter case, the error is syntax error at or near "recluster", at index 35). I am systematically removing these from analysis prior to sending them to the analyzer (see the sketch below), but just FYI: without doing this, the analyzer throws an exception.
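
A rough sketch of that kind of pre-filter, dropping only the statement shapes mentioned in this issue before handing the rest to the analyzer:

import re

# Patterns the parser is reported to choke on: table(...) functions,
# stage commands such as rm @..., and the RECLUSTER keyword.
SKIP_PATTERNS = [
    re.compile(r"\bfrom\s+table\s*\(", re.IGNORECASE),
    re.compile(r"^\s*rm\s+@", re.IGNORECASE),
    re.compile(r"\brecluster\b", re.IGNORECASE),
]

def filter_parseable(queries):
    return [q for q in queries if not any(p.search(q) for p in SKIP_PATTERNS)]

# `queries` is the list of query_history strings collected elsewhere.
parseable_queries = filter_parseable(queries)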

Data lineage for Teradata DW

Hi! I'm interested in extracting data lineage for multiple SQL queries in a Teradata database. As I read in other issues, you need access to the query history to let data-lineage do the work. If you have some documentation on how I could integrate this, I'm really interested in contributing a connector for Teradata!

Docker Compose is not considering environment variables

If you want to use an external Postgres database, replace the following parameters in tokern-lineage-engine.yml:

  • CATALOG_HOST
  • CATALOG_USER
  • CATALOG_PASSWORD
  • CATALOG_DB

This was my first approach, but it wasn't working. Here are my observations:

  1. catalog.add_source() was adding source values into the demo catalog despite the values given for the external host in the tokern-lineage-engine.yml file.
  2. catalog.scan_source() was erroring out with the same error shown above: psycopg2.OperationalError: could not translate host name "" to address: Temporary failure in name resolution

Values provided for External Catalog:

  CATALOG_PASSWORD: t@st_passw0rd
  CATALOG_USER: catalog_test
  CATALOG_DB: tokern
  CATALOG_HOST: "127.0.0.1"

Originally posted by @siva-mudiyanur in #57 (comment)
