reata / sqllineage

SQL Lineage Analysis Tool powered by Python

License: MIT License

Python 89.00% HTML 0.50% JavaScript 10.16% CSS 0.18% Dockerfile 0.16%

data-discovery data-lineage lineage metadata sql data-governance

sqllineage's Introduction

SQLLineage

SQL Lineage Analysis Tool powered by Python

Never get the hang of a SQL parser? SQLLineage comes to the rescue. Given a SQL command, SQLLineage will tell you its source and target tables, without you having to worry about Tokens, Keywords, Identifiers and all the other jargon used by SQL parsers.

Behind the scenes, SQLLineage leverages a pluggable parser library (sqlfluff or sqlparse) to parse the SQL command, analyzes the AST, stores the lineage information in a graph (using the graph library networkx), and presents the result in human-readable form.

Demo & Documentation

Talk is cheap, show me a demo.

Documentation is hosted online by readthedocs, where you can also check the release notes.

Quick Start

Install sqllineage via PyPI:

$ pip install sqllineage

Use the sqllineage command to parse a quoted query string:

$ sqllineage -e "insert into db1.table1 select * from db2.table2"
Statements(#): 1
Source Tables:
    db2.table2
Target Tables:
    db1.table1

Or you can parse a SQL file with the -f option:

$ sqllineage -f foo.sql
Statements(#): 1
Source Tables:
    db1.table_foo
    db1.table_bar
Target Tables:
    db2.table_baz

Advanced Usage

Multiple SQL Statements

Lineage is combined from multiple SQL statements, with intermediate tables identified:

$ sqllineage -e "insert into db1.table1 select * from db2.table2; insert into db3.table3 select * from db1.table1;"
Statements(#): 2
Source Tables:
    db2.table2
Target Tables:
    db3.table3
Intermediate Tables:
    db1.table1
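
How the summary follows from per-statement reads and writes can be sketched in a few lines of plain Python. This is only an illustration of the idea, not sqllineage's implementation; the `combine` helper and the `(reads, writes)` tuple layout are hypothetical.

```python
def combine(statements):
    """Combine per-statement (reads, writes) table sets into one summary.

    Hypothetical helper for illustration; not part of the sqllineage API.
    """
    all_reads, all_writes = set(), set()
    for reads, writes in statements:
        all_reads |= set(reads)
        all_writes |= set(writes)
    # A table that is both written and read somewhere in the script
    # is treated as intermediate.
    intermediate = all_reads & all_writes
    return {
        "source": all_reads - intermediate,
        "target": all_writes - intermediate,
        "intermediate": intermediate,
    }


summary = combine([
    ({"db2.table2"}, {"db1.table1"}),  # statement 1
    ({"db1.table1"}, {"db3.table3"}),  # statement 2
])
print(summary)
```

Note that this naive set arithmetic would also classify a self-dependent table (one read and written by the same statement) as intermediate.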

Verbose Lineage Result

If you want to see the lineage for each SQL statement, just toggle the verbose option:

$ sqllineage -v -e "insert into db1.table1 select * from db2.table2; insert into db3.table3 select * from db1.table1;"
Statement #1: insert into db1.table1 select * from db2.table2;
    table read: [Table: db2.table2]
    table write: [Table: db1.table1]
    table cte: []
    table rename: []
    table drop: []
Statement #2: insert into db3.table3 select * from db1.table1;
    table read: [Table: db1.table1]
    table write: [Table: db3.table3]
    table cte: []
    table rename: []
    table drop: []
==========
Summary:
Statements(#): 2
Source Tables:
    db2.table2
Target Tables:
    db3.table3
Intermediate Tables:
    db1.table1

Dialect-Awareness Lineage

By default, sqllineage uses the ansi dialect to parse and validate your SQL. However, some SQL syntax you take for granted in daily life might not be in the ANSI standard. In addition, different SQL dialects have different sets of SQL keywords, which further weakens sqllineage's capabilities when a keyword is used as a table name or column name. To get the most out of sqllineage, we strongly encourage you to pass the dialect to assist the lineage analysis.

Take the example below: the INSERT OVERWRITE statement is only supported by big data solutions like Hive/SparkSQL, and MAP is a reserved keyword in Hive, so it cannot be used as a table name there, while it is not reserved in SparkSQL. Both the ansi and hive dialects report a syntax error, and sparksql gives the correct result:

$ sqllineage -e "INSERT OVERWRITE TABLE map SELECT * FROM foo"
...
sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL

$ sqllineage -e "INSERT OVERWRITE TABLE map SELECT * FROM foo" --dialect=hive
...
sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL

$ sqllineage -e "INSERT OVERWRITE TABLE map SELECT * FROM foo" --dialect=sparksql
Statements(#): 1
Source Tables:
    <default>.foo
Target Tables:
    <default>.map

Use sqllineage --dialects to see all available dialects.

Column-Level Lineage

We also support column-level lineage in the command line interface. Set the level option to column, and all column lineage paths will be printed.

INSERT INTO foo
SELECT a.col1,
       b.col1     AS col2,
       c.col3_sum AS col3,
       col4,
       d.*
FROM bar a
         JOIN baz b
              ON a.id = b.bar_id
         LEFT JOIN (SELECT bar_id, sum(col3) AS col3_sum
                    FROM qux
                    GROUP BY bar_id) c
                   ON a.id = c.bar_id
         CROSS JOIN quux d;

INSERT INTO corge
SELECT a.col1,
       a.col2 + b.col2 AS col2
FROM foo a
         LEFT JOIN grault b
              ON a.col1 = b.col1;

Suppose this SQL is stored in a file called test.sql:

$ sqllineage -f test.sql -l column
<default>.corge.col1 <- <default>.foo.col1 <- <default>.bar.col1
<default>.corge.col2 <- <default>.foo.col2 <- <default>.baz.col1
<default>.corge.col2 <- <default>.grault.col2
<default>.foo.* <- <default>.quux.*
<default>.foo.col3 <- c.col3_sum <- <default>.qux.col3
<default>.foo.col4 <- col4

MetaData-Awareness Lineage

By observing the column lineage generated in the previous step, you may notice that:

<default>.foo.* <- <default>.quux.*: the wildcard is not expanded.
<default>.foo.col4 <- col4: col4 is not assigned with source table.

It's not perfect, because we don't know which columns are covered by the * of table quux. Likewise, given the context, col4 could come from bar, baz or quux. Without metadata, this is the best sqllineage can do.

Users can optionally provide metadata to sqllineage to improve the lineage result.

Suppose all the tables are created in a sqlite database stored in a file called db.db. In particular, table quux has columns col5 and col6, and baz has column col4.

sqlite3 db.db 'CREATE TABLE IF NOT EXISTS baz (bar_id int, col1 int, col4 int)';
sqlite3 db.db 'CREATE TABLE IF NOT EXISTS quux (quux_id int, col5 int, col6 int)';

Now, given the same SQL, column lineage is fully resolved:

$ SQLLINEAGE_DEFAULT_SCHEMA=main sqllineage -f test.sql -l column --sqlalchemy_url=sqlite:///db.db
main.corge.col1 <- main.foo.col1 <- main.bar.col1
main.corge.col2 <- main.foo.col2 <- main.baz.col1
main.corge.col2 <- main.grault.col2
main.foo.col3 <- c.col3_sum <- main.qux.col3
main.foo.col4 <- main.baz.col4
main.foo.col5 <- main.quux.col5
main.foo.col6 <- main.quux.col6

The default schema name in sqlite is main; we have to specify it here because the tables in the SQL file are unqualified.

SQLLineage leverages sqlalchemy to retrieve metadata from different SQL databases. See SQLLineage MetaData for more details.
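
The effect of metadata can be pictured with a small sketch: given a mapping from table name to column names, a wildcard expands to concrete columns, and an unqualified column resolves to the one table that defines it. The function names and the `metadata` dict are hypothetical and only model the idea; they are not sqllineage's API, and the columns of bar are assumed for the example.

```python
def expand_wildcard(table, metadata):
    """Return the columns behind `table.*`, or ["*"] when metadata is missing."""
    return list(metadata.get(table, ["*"]))


def resolve_column(column, candidate_tables, metadata):
    """Return the single candidate table that defines `column`, else None."""
    owners = [t for t in candidate_tables if column in metadata.get(t, [])]
    return owners[0] if len(owners) == 1 else None


# Hypothetical metadata: quux and baz follow the description in the text,
# the columns of bar are an assumption made for this example.
metadata = {
    "bar": ["id", "col1"],
    "baz": ["bar_id", "col1", "col4"],
    "quux": ["col5", "col6"],
}

print(expand_wildcard("quux", metadata))
print(resolve_column("col4", ["bar", "baz", "quux"], metadata))
```

When a column is defined by more than one candidate table, or by none, this sketch returns None rather than guessing.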

Lineage Visualization

One more cool feature: if you want a graph visualization of the lineage result, toggle the graph-visualization option.

Still using the above SQL file:

sqllineage -g -f foo.sql

A web server will be started, showing a DAG representation of the lineage result in the browser:

Table-Level Lineage

Column-Level Lineage

sqllineage's Issues

Refresh Table and Cache Table Should Not Count as Target Table

sqllineage -e "refresh table dual;" 
Statements(#): 1
Source Tables:
    
Target Tables:
    <unknown>.dual

subquery without alias raises exception

SELECT col1 FROM (SELECT col1 FROM tab1)

A subquery without an alias is valid syntax in SparkSQL, yet parsing fails with: "An Identifier is expected, got Parenthesis[value: (select col1 from tab1)] instead".

Since we're not assuming any specific SQL dialect, we should support this.

A Detailed Documentation Hosted by Readthedocs

Switch to GitHub Actions for CI

Given the current situation of Travis CI, it takes ~20 minutes for a build to start once the request is received. That is extremely slow and greatly impacts our efficiency.

Let's see if we can switch to GitHub Actions to achieve the same functionality. Given that we're using tox as CI interface, it should be fine for the switch.

table-wise lineage with sufficient test cases

This should refer to some kind of standard.
For the moment, QueryParser's test cases for Hive SQL seem the best choice.

special treatment for DDL

Currently sqllineage treats every SQL statement as DML. When there are DDLs, the lineage result is weird.

create table taba like tabb;
alter table taba rename to tabb;
drop table if exists taba;

test against Python 3.8

case-sensitive parsing

insert overwrite table tab_a
select * from tab_b
union all
select * from TAB_B

here tab_b and TAB_B will be parsed as two different tables.
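
One possible fix, sketched under the assumption that unquoted identifiers should compare case-insensitively, is to normalize names before they are stored in a set. `normalize_table_name` is a hypothetical helper, not an existing sqllineage function.

```python
def normalize_table_name(name):
    """Lower-case an unquoted identifier; keep quoted identifiers untouched."""
    quoted = len(name) >= 2 and name[0] == name[-1] and name[0] in '"`'
    return name if quoted else name.lower()


# Both spellings collapse into a single set entry.
source_tables = {normalize_table_name(n) for n in ["tab_b", "TAB_B"]}
print(source_tables)
```

Whether quoted identifiers should keep their case depends on the dialect, so treating them as case-sensitive here is an assumption.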

DAG Based Lineage Representation

Example Features:

Temporary Table as intermediate Node in this DAG
Insert overwrite table tab1 select * from tab1 union select * from tab2 , tab1 will keep a self dependent link with another link from tab2
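
The self-dependent case from the list above can be pictured with a plain adjacency mapping. This is a sketch using only built-in types, with a hypothetical `build_lineage_graph` helper; it says nothing about which graph package should eventually be chosen.

```python
def build_lineage_graph(edges):
    """Build a {source: {targets}} mapping from (source, target) pairs."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())  # make sure every node is a key
    return graph


# insert overwrite table tab1 select * from tab1 union select * from tab2
graph = build_lineage_graph([("tab1", "tab1"), ("tab2", "tab1")])

# A node that lists itself as a target carries a self-dependent link.
self_dependent = [node for node, targets in graph.items() if node in targets]
print(graph)
print(self_dependent)
```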

Some possible package for research:

networkx + matplotlib
graphviz
stackeddag

Things to consider when choosing our visualization package:

package should well supported (stars, release frequency)
size. I don't want sqllineage to be overweighted
terminal visualization support. Would be useful in NON-GUI environment

Cartesian product exception with ANSI-89 Syntax

When I use LineageRunner to analyze a SQL string like "select a.* from table a,table b", it throws a SQLLineageException:
SQLLineageException: An Identifier is expected, got IdentifierList[value: a, table] instead
How can I deal with that exception?

from sqllineage.runner import *
LineageRunner("select a.* from table1 a, table2 b").target_tables

---------------------------------------------------------------------------
SQLLineageException Traceback (most recent call last)
in ()
----> 1 LineageRunner("select a.* from table1 a, table2 b").target_tables

~/anaconda3/lib/python3.6/site-packages/sqllineage/runner.py in __init__(self, sql, encoding, verbose)
31 if s.token_first(skip_cm=True)
32 ]
---> 33 self._lineage_results = [LineageAnalyzer().analyze(stmt) for stmt in self._stmt]
34 self._combined_lineage_result = combine(*self._lineage_results)
35 self._verbose = verbose

~/anaconda3/lib/python3.6/site-packages/sqllineage/runner.py in <listcomp>(.0)
31 if s.token_first(skip_cm=True)
32 ]
---> 33 self._lineage_results = [LineageAnalyzer().analyze(stmt) for stmt in self._stmt]
34 self._combined_lineage_result = combine(*self._lineage_results)
35 self._verbose = verbose

~/anaconda3/lib/python3.6/site-packages/sqllineage/core.py in analyze(self, stmt)
91 else:
92 # DML parsing logic also applies to CREATE DDL
---> 93 self._extract_from_DML(stmt)
94 return self._lineage_result
95

~/anaconda3/lib/python3.6/site-packages/sqllineage/core.py in _extract_from_DML(self, token)
143 raise SQLLineageException(
144 "An Identifier is expected, got %s[value: %s] instead"
--> 145 % (type(sub_token).__name__, sub_token)
146 )
147 source_table_token_flag = False

SQLLineageException: An Identifier is expected, got IdentifierList[value: table1 a, table2 b] instead

Change Schema Default Value from <unknown> to <default>

Test Against MacOS and Windows

https://docs.travis-ci.com/user/languages/python/#running-python-tests-on-multiple-operating-systems

https://github.com/tornadoweb/tornado/blob/master/.travis.yml

let user choose whether to filter temp table or not

Currently, if a table is both a source table and a target table, sqllineage will identify it as a temp table and hide it from the user.

We should give users control over the display of temp tables.

friendly Exception

select * from where foo="bar"

This invalid SQL causes sqllineage to raise an AssertionError. We could do better: give the user some hints about a possible SQL syntax error.

While dealing with this issue, we could design a dedicated exception for sqllineage.

Enforce Black as Code Formatter

add a pre-commit git hook to reformat code using black
use flake8 to check compliance with the black code style; flake8 is already integrated in tox, and thus in TravisCI

setup install_requires and requirements.txt

Now install_requires in setup.py and requirements.txt both define sqlparse as a requirement. This duplication should be addressed in v0.1.0.

Besides, requirements.txt doesn't distinguish between install, development and test requirements. We should find a proper way to do so.

Incorrect Result for UPDATE statement

Hello reata. When using sqllineage to analyze the following SQL, I found a bug.
UPDATE tablea a INNER JOIN ( SELECT col2 FROM tableb GROUP BY col ) b ON b.col = a.col SET a.col1 = a.col2 * b.col2
This SQL updates tablea's col1 using tableb's col2. When I run sqllineage, it returns tableb as the source table but nothing as the target table. In my view, the source table should be tableb and the target table should be tablea, not empty.

Upgrade dependency version

remove the sqlparse <0.4.0 restriction.
update CI-related packages to the latest versions

Add Bandit As Security Issue Checker

drop table parsed as target table

$ sqllineage -e "DROP TABLE IF EXISTS tab1"
Statements(#): 1
Source Tables:
    
Target Tables:
    tab1

expect:
When a table is in target_tables, dropping it results in its removal from target_tables. Otherwise the DROP statement doesn't affect the result.
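
The expected behaviour can be sketched as a small state update over an ordered list of operations. The `apply_statements` helper and the `(operation, table)` pairs are hypothetical and only model the rule stated above; they are not how sqllineage represents statements.

```python
def apply_statements(statements):
    """Track target tables across ordered (operation, table) pairs."""
    target_tables = set()
    for operation, table in statements:
        if operation == "write":
            target_tables.add(table)
        elif operation == "drop":
            # discard() is a no-op when the table was never a target
            target_tables.discard(table)
    return target_tables


print(apply_statements([("drop", "tab1")]))
print(apply_statements([("write", "tab1"), ("drop", "tab1")]))
print(apply_statements([("write", "tab1"), ("drop", "tab2")]))
```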

Trim Leading Comment for Statement in Verbose Output

sqllineage -v -e "------------------------                    
dquote> select * from dual"
Statement #1: ------------------------select * from dual
table read: {Table: <unknown>.dual}
table write: {}
table rename: {}
table drop: {}
table with_: {}
==========
Summary:
Statements(#): 1
Source Tables:
    <unknown>.dual
Target Tables:

We should trim the leading comment line when showing the statement.
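
A minimal sketch of such trimming, assuming only `--` line comments need to be handled (block comments are ignored here). `trim_leading_comments` is a hypothetical helper name.

```python
def trim_leading_comments(statement):
    """Drop leading blank lines and `--` comment lines from a statement."""
    lines = statement.splitlines()
    while lines and (not lines[0].strip() or lines[0].strip().startswith("--")):
        lines.pop(0)
    return "\n".join(lines)


sql = "------------------------\nselect * from dual"
print(trim_leading_comments(sql))
```

Comments that follow the first line of actual SQL are left untouched, so the displayed statement otherwise stays as written.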

stable command line interface

including:

command line options
- verbose
- read file or stdin
- sql dialect?
stdout parsing result
- Tables accessed
- Joins
- Table lineage

this interface may refer to uber's queryparser

a startup docs for sqllineage's usage

Apart from the basic usage and introduction of sqllineage, this sphinx documentation should be the first versioned docs on ReadTheDocs.

The README.md should also be updated.

Set up Github Actions for PyPi Publish

statement granularity lineage result

In the command line interface, we currently show the lineage result per file. This causes problems like #23: some self-dependent SQL shouldn't have its table identified as a temp table. This pattern could easily be identified if we looked at each statement rather than the whole file.
On the other hand, if a table was created/inserted into, and later queried from, it's perfectly reasonable to identify it as a temp table.
So a statement-granularity lineage result may be the right choice, also concerning #29.
We could try operator overloading in the design.

Allow User to Specify Combiner

either the NaiveLineageCombiner in the sqllineage.combiners module or a user-written custom Combiner

Drop Graphviz Dependency

It's extremely painful to get graphviz & pygraphviz working on Windows, as shown in #87.

After all, we're only using the pygraphviz dot layout, not its drawing functionality. The drawing is actually done by matplotlib, using the graphviz dot layout. Maybe we can try porting the dot layout from C to Python.

This will greatly ease the usage of visualization.

pypi badges in README

including:

supported python version
latest version
license

See the requests repo for reference.

dedicated Table/Partition/Column Class

for things such as database info, alias name, etc

Support Wildcard for -f Option

Combine lineage results from multiple SQL files, the same way as for multiple SQL statements

Partition-level lineage

For self-dependent SQL, the statement may also insert data into a new partition based on data from an old one. This pattern can't be addressed since we consider the table as the minimum entity.
Should we consider partition-level lineage?

missing database/schema in lineage result

contributing guide

We should form a mature development strategy covering branches, milestones, pull requests and tag management, preferably documented as a contributing guide.

Empty Statement return

take the following sql as example

SELECT 1;

-- SELECT 2;

the parsing result says that it contains two statements, instead of one.
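
The expected behaviour can be sketched with a naive splitter that drops chunks containing nothing but comments or blank lines. This is an assumption-laden illustration with a hypothetical `split_statements` helper: it does not handle `;` inside string literals or block comments, and it is not how sqllineage splits statements.

```python
def split_statements(sql):
    """Naively split on ';' and drop chunks holding only comments or blanks."""
    statements = []
    for chunk in sql.split(";"):
        code_lines = [
            line
            for line in chunk.splitlines()
            if line.strip() and not line.strip().startswith("--")
        ]
        if code_lines:
            statements.append("\n".join(code_lines))
    return statements


sql = "SELECT 1;\n\n-- SELECT 2;\n"
print(split_statements(sql))
```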

Drop support for Python 3.5 in v1.0 release

Python 3.5 will reach End Of Life on 2020-09-13. We shall drop support for it before releasing v1.0.

white space in left join

select * from tab_a
left  join tab_b
on tab_a.x = tab_b.x

An extra whitespace in "left join" makes tab_b undetectable.
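
One conceivable workaround, shown only as a sketch, is to collapse runs of whitespace before parsing. `normalize_whitespace` is a hypothetical helper; note that it would also alter whitespace inside string literals, so it is not a complete fix.

```python
import re


def normalize_whitespace(sql):
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", sql).strip()


sql = "select * from tab_a\nleft  join tab_b\non tab_a.x = tab_b.x"
normalized = normalize_whitespace(sql)
print(normalized)
```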

combine tox and TravisCI

TravisCI inherently supports multiple Python versions, and for now it doesn't use tox directly. In fact, we're doing CI with TravisCI online and tox offline, so the same logic is repeated twice. This should be addressed in version 0.1.0.

Black Check in CI

Let's add a black check in GitHub Actions so that code is automatically checked on each push, along with the existing pre-commit hook.

Replace print to stderr with logging

Incorrect Result for Cases When Non-Reserved Keyword Used as Table Name

Hello reata,
I got the newest code from your GitHub to test whether sqllineage works.
It works very well for most Cartesian product SQL.
But when I input the SQL "select a.* from table a,table b", something strange happened:

`LineageRunner("select a.* from source a,source b").source_tables` returns "[Table: .a]" but not " table"

I ran this SQL just for fun. After finding this bug, I tested more and got other strange results:
LineageRunner("select a.* from source1 a,source b").source_tables returns "[Table: .source1]"
I wondered whether the table names matter, so I changed them:
LineageRunner("select * from care a,care b").source_tables returns "[Table: .care]". It works.
LineageRunner("select * from care1 a,care b").source_tables returns "[Table: .care, Table: .care1]". It works well.

`LineageRunner("select * from care a,care b").source_tables` returns "[Table: .care]" but not "a"

So I think sqllineage may not work well with some table names.

Add mypy As static type checker

Sort By Table Name in LineageResult Table Container

Support Create Table Like Statement

drop table  if exists tab_a;
create table if not exists tab_a like tab_b;

Drop before create: the result is that tab_a exists after the above statements.

If we switch the order:

create table if not exists tab_a like tab_b;
drop table  if exists tab_a;

Although the above statements make no sense, we should still report that tab_a does not exist.

Under current circumstances, in both cases tab_a is reported as not existing, since tab_a is identified as a temp table. This is related to #23.

comment in line raise AssertionError

select
*
from  -- comment
tab_a

This will raise AssertionError

Test Against Python3.9

Also, use Ubuntu 20.04 (Focal Fossa) as the build environment.

Column-Level Lineage

Since @ekimd has started the effort on column-level lineage analysis, I am creating this ticket to track all the questions we have to answer before we dive into implementation details. Considering our pure code analysis approach, which involves no metadata, I believe we have several design choices to make.

Question No.1: What's the data structure to represent Column-Level Lineage?
Currently we're using DiGraph from the networkx library to represent Table-Level Lineage, with tables as vertices and table-level lineage as edges, which is pretty straightforward. After changing to Column-Level, what's the plan?

Question No.2: How do we deal with select *?

INSERT OVERWRITE tab1
SELECT * FROM tab2;

In this case, we don't know which columns are in tab2.

Question No.3: How do we deal with a column without a table/alias prefix in a join?

INSERT OVERWRITE tab1
SELECT col2
FROM tab2
JOIN tab3
ON tab2.col1 = tab3.col1

In this case, we don't know whether col2 is coming from tab2 or tab3.

Question No.4: How do we visualize column-level lineage?

subquery mistake alias as table name

$ python -m sqllineage.core -e "SELECT col1 FROM (SELECT col2 from tab1) dt"
Statements(#): 1
Source Tables:
    dt
Target Tables:

expect source table as tab1

temp table checking

create table tab_a as select * from tab_b;
insert overwrite table tab_c select * from tab_a;
drop table tab_a;

Here tab_a is clearly a temp table, and yet it's marked as source table.
Possibly related to #29 #23

Distinguish between View and Table

multi-line sql causes AssertionError

SELECT * FROM
tab1

causes following exception

Traceback (most recent call last):
  File "/home/admin/.pyenv/versions/3.6.8/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/admin/.pyenv/versions/3.6.8/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/admin/repos/sqllineage/sqllineage/core.py", line 131, in <module>
    main()
  File "/home/admin/repos/sqllineage/sqllineage/core.py", line 119, in main
    print(LineageParser(sql))
  File "/home/admin/repos/sqllineage/sqllineage/core.py", line 23, in __init__
    self._extract_from_token(stmt)
  File "/home/admin/repos/sqllineage/sqllineage/core.py", line 75, in _extract_from_token
    assert isinstance(sub_token, Identifier)
AssertionError
