
lea


lea is a minimalist alternative to tools like dbt, SQLMesh, and Google's Dataform.

lea aims to be simple and opinionated, and yet offers the possibility to be extended. We happily use it every day at Carbonfact to manage our data warehouse. We will actively maintain it and add features, while welcoming contributions.

Right now lea is compatible with BigQuery (used at Carbonfact) and DuckDB (quack quack).


Installation

Use one of the following commands, depending on which warehouse you wish to use:

pip install lea-cli[duckdb]
pip install lea-cli[bigquery]

This installs the lea command. It also makes the lea Python library available.

Usage

Configuration

lea is configured by setting environment variables. The following variables are available:

# General configuration
LEA_USERNAME=max

# DuckDB 🦆
LEA_WAREHOUSE=duckdb
LEA_DUCKDB_PATH=duckdb.db

# BigQuery
LEA_WAREHOUSE=bigquery
LEA_BQ_LOCATION=EU
LEA_BQ_PROJECT_ID=carbonfact-dwh
LEA_BQ_DATASET_NAME=kaya
LEA_BQ_SERVICE_ACCOUNT=<a JSON dump of the service account file>
LEA_BQ_SCOPES=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive

These parameters can be provided in an .env file, or directly in the shell. Each command also has an --env flag to provide a path to an .env file.
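
For instance, assuming your development settings live in a file named .env.dev (the file name is arbitrary):

lea run --env .env.dev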

The prepare command has to be run once to create whatever needs creating. For instance, when using BigQuery, a dataset has to be created:

lea prepare

lea run

This is the main command. It runs the queries in the views directory.

lea run

The queries are run concurrently. They are organized in a DAG, which is traversed in topological order. The DAG's structure is determined automatically by analyzing the dependencies between queries.
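
As a rough illustration of the general idea (not lea's actual implementation), Python's standard graphlib module can order such a dependency graph topologically; the view names below are hypothetical:

import graphlib

# Each view maps to the set of views it depends on.
dependencies = {
    "staging.users": set(),
    "staging.orders": set(),
    "core.users": {"staging.users"},
    "core.orders": {"staging.orders", "core.users"},
}

# static_order() yields each view only after all of its dependencies.
for view in graphlib.TopologicalSorter(dependencies).static_order():
    print(view)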

File structure

Each query is expected to be placed under a schema, represented by a directory. Schemas can have sub-schemas. For instance, the following file structure is valid:

views/
    schema_1/
        table_1.sql
        table_2.sql
    schema_2/
        table_3.sql
        table_4.sql
        sub_schema_2_1/
            table_5.sql
            table_6.sql

Each view will be named according to its location, following the warehouse convention:

Warehouse   Dataset   Username   Schema   Table   Name
DuckDB      dataset   user       schema   table   schema.table (stored in dataset_user.db)
BigQuery    dataset   user       schema   table   dataset_user.schema__table

The convention in lea to reference a table in a sub-schema is to use a double underscore __:

Warehouse   Dataset   Username   Schema   Sub-schema   Table   Name
DuckDB      dataset   user       schema   sub          table   schema.sub__table (stored in dataset_user.db)
BigQuery    dataset   user       schema   sub          table   dataset_user.schema__sub__table
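
For instance, with LEA_BQ_DATASET_NAME=kaya and LEA_USERNAME=max (the values from the configuration example above), the query in views/core/users.sql would be materialized as kaya_max.core__users on BigQuery.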

Schemas are expected to be placed under a views directory. This can be changed by providing an argument to the run command:

lea run /path/to/views

This argument also applies to other commands in lea.

Development vs. production

By default, lea appends a _<user> suffix to the dataset name. This way you can have a development dataset and a production dataset. Use the --production flag to disable this behavior.

lea run --production

The <user> is determined automatically from the login name. It can be overridden by setting the LEA_USERNAME environment variable.

Select which views to run

A single view can be run:

lea run --select core.users

Several views can be run:

lea run --select core.users --select core.orders

Similar to dbt, lea also supports graph operators:

lea run --select core.users+   # users and everything that depends on it
lea run --select +core.users   # users and everything it depends on
lea run --select +core.users+  # users and all its ancestors and descendants

You can select all views in a schema:

lea run --select core/

This also works with sub-schemas:

lea run --select analytics.finance/

There are thus 8 possible operators:

schema.table    (table by itself)
schema.table+   (table with its descendants)
+schema.table   (table with its ancestors)
+schema.table+  (table with its ancestors and descendants)
schema/         (all tables in schema)
schema/+        (all tables in schema with their descendants)
+schema/        (all tables in schema with their ancestors)
+schema/+       (all tables in schema with their ancestors and descendants)

Combinations are possible:

lea run --select core.users+ --select +core.orders

There's an Easter egg that allows selecting views that have been committed or modified in the current Git branch:

lea run --select git
lea run --select git+  # includes all descendants

This becomes very handy when using lea in continuous integration. See dependency freezing for more information.

Workflow tips

The lea run command creates a .cache.pkl file during the run. This file is a checkpoint containing the state of the DAG. It is used to determine which queries to run next time. That is, if some queries have failed, only those queries and their descendants will be run again next time. The .cache.pkl is deleted once all queries have succeeded.
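
A minimal sketch of this checkpointing idea (illustrative only, not lea's actual cache format; run() and the view names are placeholders):

import pathlib
import pickle

CACHE_PATH = pathlib.Path(".cache.pkl")

def run(view):
    """Placeholder for executing a view's query; raises on failure."""
    print(f"running {view}")

# Views that succeeded in a previous, partially failed run.
done = pickle.loads(CACHE_PATH.read_bytes()) if CACHE_PATH.exists() else set()

for view in ["staging.users", "core.users"]:  # hypothetical topological order
    if view in done:
        continue  # already succeeded last time, skip it
    run(view)
    done.add(view)
    CACHE_PATH.write_bytes(pickle.dumps(done))  # checkpoint after each success

CACHE_PATH.unlink()  # all views succeeded: drop the checkpoint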

This checkpointing logic can be disabled with the --fresh flag.

lea run --fresh

The --fail-fast flag can be used to immediately stop if a query fails:

lea run --fail-fast

For debugging purposes, it is possible to print out a query and copy it to the clipboard:

lea run --select core.users --print | pbcopy

lea test

lea test

There are two types of tests:

  • Singular tests: queries which return failing rows. They are stored in a tests directory.
  • Assertion tests: comment annotations in the queries themselves:
    • @NO_NULLS: checks that all values in a column are not null.
    • @UNIQUE: checks that a column's values are unique.
    • @UNIQUE_BY(<by>): checks that a column's values are unique within a group.
    • @SET{<elements>}: checks that a column's values are in a given set of values.

Here's an example of a query annotated with assertion tests:

SELECT
    -- @UNIQUE
    -- @NO_NULLS
    user_id,
    -- @NO_NULLS
    address,
    -- @UNIQUE_BY(address)
    full_name,
    -- @SET{'A', 'B', 'AB', 'O'}
    blood_type
FROM core.users
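
Singular tests have no dedicated syntax: each one is a plain query in the tests directory that returns the offending rows, so an empty result means the test passes. A hypothetical tests/orders_have_users.sql might look like this:

-- Returns orders that reference a missing user; any returned row is a failure
SELECT orders.user_id
FROM core.orders AS orders
LEFT JOIN core.users AS users ON orders.user_id = users.user_id
WHERE users.user_id IS NULL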

As with the run command, there is a --production flag to disable the <user> suffix, and therefore target production data.

You can select a subset of views; only the tests that depend on them will run:

lea test --select-views core.users

lea docs

It is possible to generate documentation for the queries. This is done by inspecting the schema of the generated views and extracting the comments in the queries.

lea docs
    --output-dir docs  # where to put the generated files

This will also create a Mermaid diagram in the docs directory. This diagram is a visualization of the DAG. See here for an example.

lea diff

lea diff

This prints out a summary of the difference between development and production. Here is an example output:

  core__users
+ 42 rows
+ age
+ email

- core__coupons
- 129 rows
- coupon_id
- amount
- user_id
- has_aggregation_key

  core__orders
- discount
+ supplier

  core__sales
+ 100 rows

This is handy in pull requests. For instance, at Carbonfact, we have a dataset for each pull request. We compare it to the production dataset and post the diff as a comment in the pull request. The diff is updated every time the pull request is updated. Check out this example for more information.

lea teardown

lea teardown

This deletes the schema created by lea prepare. This is handy during continuous integration. For example, you might create a temporary schema in a branch. You would typically want to delete it after testing is finished and/or when the branch is merged.

Jinja templating

SQL queries can be templated with Jinja. A .sql.jinja extension is necessary for lea to recognise them.

You have access to an env variable within the template context, which is simply an access point to os.environ.
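
For instance, a hypothetical views/core/users.sql.jinja could read an environment variable like so:

-- COUNTRY is a hypothetical environment variable
SELECT *
FROM staging.users
WHERE country = '{{ env["COUNTRY"] }}'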

Python scripts

You can write views with Python scripts. The only requirement is that the script contains a pandas DataFrame with the same name as the script. For instance, users.py should contain a users variable.

import pandas as pd

users = pd.DataFrame(
    [
        {"id": 1, "name": "Max"},
        {"id": 2, "name": "Angie"},
    ]
)

Dependency freezing

The lea run command can be used to only refresh a subset of views. Let's say we have this DAG:

fee -> fi -> fo -> fum

Assuming LEA_USERNAME=max, running lea run --select fo+ will

  1. Execute fo and materialize it to fo_max.
  2. Execute fum and materialize it to fum_max.

This only works if fee_max and fi_max already exist. This might be the case if you've run a full refresh before. But if you're running a first refresh, then fee_max and fi_max won't exist! This is where the --freeze-unselected flag comes into play:

lea run --select fo+ --freeze-unselected

This means the main fee and fi tables will be used instead of fee_max and fi_max.

Dependency freezing is particularly useful when using lea in a CI/CD context. You can run the following command in a pull request:

lea run --select git+ --freeze-unselected

This will only run the modified views and their descendants. The dependencies of these modified views will be taken from production. The added benefit is that you are guaranteed to be comparing against the same tables when you run lea diff. Check out this article to learn more.
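
Putting the pieces together, a pull request pipeline might look something like this (a sketch; adapt to your CI system):

lea prepare
lea run --select git+ --freeze-unselected
lea test
lea diff
lea teardown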

Import lea as a Python library

lea is meant to be used as a CLI. But you can import it as a Python library too. For instance, we do this at Carbonfact to craft custom commands.

Parsing a directory of queries

>>> import lea

>>> client = lea.clients.DuckDB(':memory:')
>>> views = client.open_views('examples/jaffle_shop/views')
>>> views = [v for v in views if v.schema != 'tests']
>>> for view in sorted(views, key=str):
...     print(view)
...     print(sorted(view.dependencies))
analytics.finance.kpis
['core.orders']
analytics.kpis
['core.customers', 'core.orders']
core.customers
['staging.customers', 'staging.orders', 'staging.payments']
core.orders
['staging.orders', 'staging.payments']
staging.customers
[]
staging.orders
[]
staging.payments
[]

Organizing queries into a DAG

>>> import lea

>>> client = lea.clients.DuckDB(':memory:')
>>> views = client.open_views('examples/jaffle_shop/views')
>>> views = [v for v in views if v.schema != 'tests']
>>> dag = client.make_dag(views)
>>> dag.prepare()

>>> while dag.is_active():
...     for node in sorted(dag.get_ready()):
...         print(dag[node])
...         dag.done(node)
staging.customers
staging.orders
staging.payments
core.customers
core.orders
analytics.finance.kpis
analytics.kpis

Contributing

Feel free to reach out to [email protected] if you want to know more and/or contribute 😊

We have suggested some issues as good places to get started.

License

lea is free and open-source software licensed under the Apache License, Version 2.0.
