
infuseai / piperider

Code review for data in dbt

Home Page: https://www.piperider.io/

License: Apache License 2.0

Makefile 0.12% Python 57.68% Dockerfile 0.06% Shell 0.24% HTML 0.50% JavaScript 0.96% TypeScript 40.34% Batchfile 0.11%
data-pipeline data-profiling data-quality data-science data-exploration eda exploratory-data-analysis data-testing python data-observability

piperider's Introduction


Docs | Discord | Blog

Important

PipeRider has been superseded by Recce. We recommend that users requiring pre-merge data validation checks migrate to Recce. PipeRider will no longer be updated on a regular basis. You are still welcome to open a PR with bug fixes or feature requests. For questions and help regarding this update, please contact [email protected] or leave a message in the Recce Discord.

Code review for data in dbt

PipeRider automatically compares your data to highlight the difference in impacted downstream dbt models so you can merge your Pull Requests with confidence.

How it works:

  • Easy to connect your datasource -> PipeRider leverages the connection profiles in your dbt project to connect to the data warehouse
  • Generate profiling statistics of your models to get a high-level overview of your data
  • Compare target branch changes with the main branch in an HTML report
  • Post a quick summary of the data changes to your PR, so others can be confident too

Core concepts

  • Easy to install: Leveraging dbt's configuration settings, PipeRider can be installed within 2 minutes
  • Fast comparison: by collecting profiling statistics (e.g. uniqueness, averages, quantiles, histograms) and metric queries, comparing downstream data impact takes little time, speeding up your team's reviews
  • Valuable insights: various profiling statistics displayed in the HTML report give fast insights into your data

Quickstart

  1. Install PipeRider

    pip install piperider[<connector>]

    You can find all supported data source connectors here.

  2. Add PipeRider tag on your model: Go to your dbt project, and add the PipeRider tag on the model you want to profile.

    -- models/staging/stg_customers.sql
    {{ config(
       tags=["piperider"]
    ) }}
    
    select ...

    and list the models that will be run by PipeRider:

     dbt list -s tag:piperider --resource-type model
    
  3. Run PipeRider

    piperider run

To see the full quick start guide, please refer to the PipeRider documentation.

Features

  • Model profiling: PipeRider can profile your dbt models and obtain information such as basic data composition, quantiles, histograms, text length, top categories, and more.
  • Metric queries: PipeRider can integrate with dbt metrics and present the time-series data of metrics in the report.
  • HTML report: PipeRider generates a static HTML report each time it runs, which can be viewed locally or shared.
  • Report comparison: You can compare two previously generated reports or use a single command to compare the differences between the current branch and the main branch. The latter is designed specifically for code review scenarios. In our pull requests on GitHub, we not only want to know which files have been changed, but also the impact of these changes on the data. PipeRider can easily generate comparison reports with a single command to provide this information.
  • CI integration: The key to CI is automation, and in the code review process, automating this workflow is even more meaningful. PipeRider can easily integrate into your CI process. When new commits are pushed to your PR branch, reports can be automatically generated to provide reviewers with more confidence in the changes made when reviewing.

Example Report Demo

We use the example project git-repo-analytics to demonstrate how to use PipeRider + dbt + DuckDB to analyze the dbt-core repository. Here is the generated result (updated daily):

Run Report

Comparison Report

Comparison Summary in a PR

PipeRider Cloud (beta)

PipeRider Cloud allows you to upload reports and share them with your team members. For information on pricing plans, please refer to the pricing page.

PipeRider Compare Action

PipeRider provides the PipeRider Compare Action to quickly integrate into your GitHub Actions workflow. It has the following features:

  • Automatically generates a report comparing the PR branch to the main branch
  • Uploads the report to GitHub artifacts or PipeRider cloud
  • Adds a comment to the pull request with a comparison summary and a link to the report.

You can refer to the example workflow YAML and the example pull request.

Development

See setup dev environment and the contributing guidelines to get started.

We love chatting with our users! Let us know if you have any questions, feedback, or need help trying out PipeRider! ❤️

piperider's People

Contributors

ctiml, daveflynn, db220, dependabot[bot], even-wei, ggosiang, hlb, hsatac, jenswilms, johnsodk, jonycfu, kentwelcome, neighborhood999, popcornylu, qrtt1, siansiansu, wcchang1115


piperider's Issues

New metric: calculate duplicate rows

Summary

Duplicate rows are an obvious data quality problem. They commonly happen when the original data is not truncated before transformation. As a data profiling and data quality tool, we would like to understand whether there are duplicate rows in a table.

Duplicate rows are also supported in pandas-profiling.

Intended Outcome

  1. Allow opt-in. Counting duplicate rows is a compute-intensive job, so we would like to disable it by default and enable it through configuration.
  2. The count of duplicate rows is a table metric. We should show the duplicate row count and percentage.

How will it work?

A possible SQL query is:

with t as (
  select
    count(*) as duplicate_rows
  from interaction_log
  group by user_id, item_id, item_type, play_amount_second, client_upload_timestamp, server_upload_timestamp, interaction, pt
  having count(*) > 1
)
select sum(duplicate_rows) as duplicate_rows from t;
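
For illustration, the same aggregation could be built with SQLAlchemy; the following is a sketch only, where the engine URL and table name are placeholders, not PipeRider internals.

# Sketch: count duplicate rows by grouping on every column (illustrative, not PipeRider code).
from sqlalchemy import MetaData, Table, create_engine, func, select

engine = create_engine("sqlite:///warehouse.db")  # placeholder connection
table = Table("interaction_log", MetaData(), autoload_with=engine)

# Any group with count(*) > 1 contributes all of its rows to the duplicate total.
dup_groups = (
    select(func.count().label("duplicate_rows"))
    .select_from(table)
    .group_by(*table.columns)
    .having(func.count() > 1)
    .cte("t")
)

with engine.connect() as conn:
    total = conn.execute(select(func.sum(dup_groups.c.duplicate_rows))).scalar()
    print(total or 0)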

sc-28170

allow dbt usage with custom generate_schema_name

Is your feature request related to a problem? Please describe.
When running piperider run in our dbt project, we get an error:

Error: Target mismatched. Please run 'dbt compile -t default' to generate the new manifest

This is almost certainly due to the fact that we've customized generate_schema_name so that we can override the typical behavior and set custom schemas that aren't prefixed with the name of the default target schema. Basically we followed the "advanced path" documented here

Describe the solution you'd like
It would be great to make use of piperider in our dbt project even if the schema names aren't all prefixed with the default target schema name.

Additional context
Here's our override for generate_schema_name:

{% macro generate_schema_name(custom_schema_name, node) -%}

    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}

        {{ default_schema }}

    {%- else -%}

        {{ custom_schema_name | trim }}

    {%- endif -%}

{%- endmacro %}

Don't generate the assertion files in `piperider run`

Summary

Some users use PipeRider as a data profiling tool only. However, in the current journey, it always generates assertion files on the first run.

fetching metadata
[1/1] data ━━━━━━━━━━━  5/5 0:00:00
No assertion found
Do you want to auto generate recommended assertions for this datasource [Yes/no]?

The problem would be

  1. Users don't know what will happen when they enter yes or no.
  2. Even when answering NO, empty assertion files are still generated. Why not generate them only when the user wants to write tests?
  3. If the user answers YES, assertion files are generated from the current profiling result. However, if the user does not intend to write assertions right away, the generated assertions would be confusing for future runs.
  4. Another problem is that assertion files are generated for every table. It is not realistic to write all the tests at the same time.

Intended Outcome

  • Don't generate assertions in piperider run; use the generate-assertions command instead to generate templates or assertions.
  • The real-world case is to write tests table by table. It would be more reasonable to generate assertions -> edit assertion files -> test on a per-table basis.

How will it work?

  1. piperider run will not generate assertions.
  2. In generate-assertions, we have to specify the table to generate rather than all tables (e.g. piperider generate-assertions --table mytable).
  3. generate-assertions generates recommended assertions by default, but can generate empty assertions with --no-recommend.

Internal ticket sc-28737

Return non-zero exit code if test failed

Summary

Currently, when there are failed tests, PipeRider still returns a zero exit code.

If we would like to integrate PipeRider into a workflow system (e.g. Airflow) or CI system (e.g. Jenkins, GitHub Actions), it would be better to fail with a non-zero exit code so that the job can be marked as failed.

Intended Outcome

When any test fails, return a non-zero exit code (1 would be good enough).

How will it work?

In the piperider run command, call exit(1) if any test fails, as in the sketch below.
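
A minimal sketch of the idea (illustrative only, not PipeRider's actual CLI code):

# Sketch: map assertion failures to a non-zero process exit code for CI systems.
import sys

def finish(assertion_results):
    failed = [r for r in assertion_results if not r["passed"]]
    for r in failed:
        print(f"FAILED: {r['name']}")
    sys.exit(1 if failed else 0)  # any failed test -> exit code 1

if __name__ == "__main__":
    finish([{"name": "assert_row_count_in_range", "passed": False}])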

sc-28851

Support to profile with row limit

Is your feature request related to a problem? Please describe.
For big datasets, profiling is time-consuming and costly.

Describe the solution you'd like
Allow configuring the maximum number of rows to profile.

Describe alternatives you've considered
N/A

Additional context

Feature: better CI support

Users can run PipeRider through the CI process, so users can get reports instantly when code/schema changes, pipelines are executed, or perform data quality checks periodically.

Users can use PipeRider to generate report files, but it was hard before to get the latest report files from the command line interface and perform follow-up actions in the CI automation process.

By implementing this feature, users can use PipeRider to:

  • get the location of output files of reports or comparisons
  • compare the most recent two reports easily

So users can upload the generated reports to AWS S3, send notifications to Slack, and perform other actions by using tools through the CI process.

Feature Design

Requirement

  • Users can get the location of output files of reports or comparisons.
  • Users can compare the most recent two reports without providing any input. (It is more convenient, and users don’t need to know how PipeRider works internally)
  • We can track if a report is generated by CI.

Non-goals

  • upload files for users. They should be able to upload via the tools they use.
  • send notifications for users. They should be able to send notifications via webhook or other tools.

Use Cases

  • Users can get reports instantly when code/schema changes, pipelines are executed, or data quality checks are performed periodically.
  • Users can collect all reports in one place.
  • Data consumers can check the reports without running PipeRider or gaining data access.

User Journeys

  • User can use piperider generate-report -o $PATH to get the location of output.
  • User can use piperider compare-reports --last to generate a comparison report with the most recent two reports.

Milestones

  • Users can get the location of output files of reports or comparisons.
  • Users can compare the most recent two reports without providing any input.
  • Track if a report is generated by CI.

support to output comparison summary

Summary

One of the interesting use cases for the comparison report is the CI application. You can compare the results from different environments in a CI automation task (e.g. PR vs production, or staging vs production).

Intended Outcome

piperider compare-reports generates an additional markdown summary along with the comparison report. It allows you to paste the summary into PR comments.

How will it work?

When running piperider compare-reports, a summary.md is generated in the folder of the comparison report.

Improve the CLI UX for dbt project

Introduction

The main use case is to integrate PipeRider with a dbt project. The goal of this story is to improve the CLI UX when using PipeRider in the dbt project use case.

Here is the feature list:

  1. Implicit data sources from dbt profile/target
  2. piperider run would profile all table models in dbt project (rather than tables in the data source)
  3. Allow annotating a model as a candidate for PipeRider
  4. Use dbt list to select models for PipeRider

Features

Implicit data sources from dbt profile/target

Here is the dbt profiles.yml:

# profiles.yml
infusetude:
  target: dev
  outputs:
    dev:
      type: snowflake
      ...
      schema:
      threads: 4
    prod:
      type: snowflake
      ...
      schema:
      threads: 4

The new config generated by piperider init would look like this:

dataSources: []
dbt:
   projectDir: .

The dbt attribute indicates that it is a dbt project, and the implicit data sources dev and prod would be available in piperider run.

This makes profiles.yml the single source of truth for connecting to your data warehouse.

piperider run would profile all models in dbt project

For a dbt project, the original behavior is

  • piperider run: profile all tables in the data source
  • piperider run --dbt-state target/: profile all models in the target/run_results.json

The problem is that it does not allow ad-hoc profiling without another dbt run or dbt build.

The new behavior is

  • piperider run: profile all table models in the dbt project (rather than tables in the data source's database+schema)
  • piperider run --dbt-state /tmp/prod: profile all table models in the /tmp/prod/manifest.json, which is generated by dbt run --target-path /tmp/prod. See dbt manifest
  • piperider run --dbt-run-results: profile all table models run in the latest dbt run. It also integrates the dbt test results.

Allows to annotate a model as piperider candidate

Profiling is expensive. It is recommended to whitelist the models that are allowed to be profiled. We use dbt tags to mark a model for PipeRider.

To enable this feature, you need to edit the .piperider/config.yml with dbt.tag

dataSources: []
dbt:
   projectDir: .
   tag: piperider

and then mark your model with the piperider tag:

# models/myconfig.yml

version: 2

models:
  - name: model_name
    config:
      tags: [piperider]

    columns:
      - name: column_name
         ...

Remember to regenerate the manifest.json and run PipeRider:

# update the manifest.json
dbt compile
# run again
piperider run

Use dbt list to select model for piperider

For the ad-hoc profiling/exploration case, it is useful to select which models/metrics to query. PipeRider leverages the dbt list command to select models/metrics.

dbt list -s models/mymodel.sql | piperider run --dbt-list
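
A rough sketch of how a --dbt-list style option could consume that piped output (illustrative only, not the actual implementation):

# Sketch: read the model selection produced by `dbt list` from stdin.
import sys

def read_dbt_list():
    return [line.strip() for line in sys.stdin if line.strip()]

if __name__ == "__main__":
    models = read_dbt_list()
    print(f"Selected {len(models)} models for profiling: {models}")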

Breaking Changes

For dbt projects, the command behavior changes:

Old: piperider run
New: no equivalent; you would have to manually configure a non-dbt data source to do the profiling

Old: piperider run --dbt-state target/
New: piperider run --dbt-state target/ --dbt-run-results

sc-30406

Support multiple CSVs in a project

Is your feature request related to a problem? Please describe.
Currently, we only support profiling one CSV file.

Describe the solution you'd like

  1. If the path is a folder, load all csv and csv.gz files (see the sketch below).
  2. If the path is a package file (e.g. tar, tgz, zip, rar, ...), unpack it and load it with the folder behavior.
  3. Otherwise, load it as a normal CSV.
  4. Each file is an individual table.

[Non-goal] load all files as the data of a single table
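
For illustration, resolving the path argument could look roughly like this (a sketch; the helper name is made up, not PipeRider code):

# Sketch: resolve a CSV path argument into one table per file.
from pathlib import Path

def resolve_csv_tables(path_str):
    path = Path(path_str)
    if path.is_dir():
        files = sorted(path.glob("*.csv")) + sorted(path.glob("*.csv.gz"))
    else:
        files = [path]
    # Table name = file name without suffixes, e.g. "orders.csv.gz" -> "orders".
    return {f.name.split(".")[0]: f for f in files}

print(resolve_csv_tables("data/"))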

Describe alternatives you've considered
N/A

Additional context
sc-28310

Support multiple db and schema when using dbt state

Is your feature request related to a problem? Please describe.
Yes. When using dbt as a data source, it currently does not work at all if your database has custom databases and schemas per group of tables. I have a WIP fix that I might turn into a PR: it looks at the dbt state dir, figures out all databases (using Snowflake) and schemas for each grouping of tables, then profiles them all, concatenating all the reports into one. The WIP is working for me but is a little rough at the moment.

Describe the solution you'd like
See above

Describe alternatives you've considered
I am not sure a clean one exists if running on a dbt project that has custom database and schema macros

Additional context

Validation fails when db only has views but include_views=True

Describe the bug
When using the include_views feature, it would be nice if validation/diagnosing still worked when the database only has views. Currently the code only checks whether any tables exist, but if include_views is turned on it should check for the existence of tables and/or views.

To Reproduce
piperider init
edit config.yml to make include_views=true
run piperider diagnose -> Fails
and/or
run piperider run -> Fails

Expected behavior
Diagnose and/or run (validation) passes if there are only views in the database.

Screenshots
None

Desktop (please complete the following information):
Does not matter

Additional context

Showing only the parts that have changed

Show some kind of "change report card":

When comparing two reports it would be great to have a feature that only shows the data where something has changed.

Assertion results shown along column details

Hello, I want to suggest a feature: a change to the automatic report.
It would be useful to have the assertion results shown on the column details page; that would make it easier to spot failed tests per column.

Cannot run piperider

Describe the bug
Trying to follow the quick start guide to install and begin using PipeRider with dbt in VS Code. I could do all the steps until run; init and diagnose worked fine, but run gave me an error.

To Reproduce
Steps to reproduce the behaviour:

  1. Setup environment:
  2. Execute command: piperider run
  3. See error: Error: Profiler Exception: TypeError('SQLCompiler.init() got multiple values for argument 'cache_key'')

Expected behaviour
I expected it to be able to run

Screenshots
Skärmavbild 2023-01-20 kl  14 25 13

Desktop (please complete the following information):

  • OS: macOS v.12.6
  • Processor: 2,6 GHz 6-Core Intel Core i7
  • dbt version: 1.3.0
  • Python Version: 3.10.8
  • Version: piperider version 0.17.1

Make the Assertions more Flexible to Configure and Easier to Read

Problem

Currently we have these built-in assertion rules

https://docs.piperider.io/data-quality-assertions/assertion-configuration

However, here lists the problems

  1. For each metric, one or more assertion rules need to be provided to fulfill the needs, even though they are quite similar.
  2. The assertion operators for thresholds are quite limited. The operator could be any of >, >=, =, !=, <, <=, between.
  3. The test function name is not easy to understand, e.g. assert_row_count_in_range.
  4. The expected result is not easy to understand, e.g. {'count': [1, 10000000]}.

Expected Outcome

  1. We need a more generic way to define the assertion rules, which should be separated into metric, operator, and threshold.
  2. The test name should be the metric display name. (e.g. Rows, Missing percentage)
  3. The expected result should be more human-readable (e.g. ≥ 0.01, [1, 10000000))

How will it work?

Example 1: Test the row count in the range [1, 10000000) (i.e. 1 ≤ row_count < 10000000)

Old

assertion rule

PRICE:
  tests:
  - name: assert_row_count
    assert:
      min: 1
      max: 10000000

assertion display in CLI console

Status     Target        Test Function                Expected                   Actual
 ───────────────────────────────────────────────────────────────────────────────────────────────────
  [  OK  ]   PRICE         assert_row_count_in_range    {'count': [1, 10000000]}   157881

assertion display in report, similar to above

New

PRICE:
  - metric: row_count
    assert:
      gte: 1
      lt: 10000000
  1. Change the min, max to operator gt, gte, ….
  2. Change the name to metric
Status     Target        Test Name                Expected                   Actual
 ───────────────────────────────────────────────────────────────────────────────────────────────────
  [  OK  ]   PRICE         Rows    [1, 10000000)   157881
  1. Change the Test Function column name to Test Name
  2. Make the expected value more human-readable
  3. Make the Test Name the metric display name

Example 2: Test the missing percentage ≤ 0.01

world_city:
  tests:
  - metric: nulls_p
    assert:
      lte: 0.01
Status     Target        Test Name                Expected                   Actual
 ───────────────────────────────────────────────────────────────────────────────────────────────────
  [  OK  ]   PRICE.OPEN     Missing Percentage       ≤ 0.01                   0
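
A generic metric/operator/threshold evaluator along these lines might look like this (a sketch only; the operator keys mirror the proposal above):

# Sketch: evaluate a metric value against operator/threshold pairs (illustrative only).
import operator

OPS = {
    "gt": operator.gt, "gte": operator.ge,
    "lt": operator.lt, "lte": operator.le,
    "eq": operator.eq, "ne": operator.ne,
}

def evaluate(metric_value, asserts):
    # Pass only if every operator/threshold pair is satisfied.
    return all(OPS[op](metric_value, threshold) for op, threshold in asserts.items())

print(evaluate(157881, {"gte": 1, "lt": 10000000}))  # Example 1: row_count in [1, 10000000) -> True
print(evaluate(0.0, {"lte": 0.01}))                  # Example 2: nulls_p <= 0.01 -> True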

sc-28831

Officially supports BigQuery

Is your feature request related to a problem? Please describe.
We use BigQuery with dbt. It would be awesome for piperider to officially support it.

Describe the solution you'd like
I would like piperider to support BigQuery.

Describe alternatives you've considered
NA

Additional context
NA

Fetch table and column description from the data source

Problem

The description of a table/column can help us understand the purpose/usage of the table or column. Currently, the description can come from:

  1. piperider config
  2. dbt model or source description

However, it is common that we can get the table or column description from the information_schema. For example, in Snowflake, we can get the table description from here.

Expected Outcome

If the table or column has a description, show it in the report.

How will it work?
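
The issue leaves this section open. For illustration, a sketch of reading column comments from a Snowflake-style information_schema (the connection URL, schema, and table names are placeholders):

# Sketch: fetch column comments from information_schema.columns (Snowflake exposes a COMMENT column there).
from sqlalchemy import create_engine, text

engine = create_engine("snowflake://user:pass@account/db")  # placeholder URL
query = text(
    "select column_name, comment "
    "from information_schema.columns "
    "where table_schema = :schema and table_name = :table"
)

with engine.connect() as conn:
    for column_name, comment in conn.execute(query, {"schema": "PUBLIC", "table": "PRICE"}):
        if comment:
            print(column_name, "->", comment)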

Snowflake: Did not recognize type 'DATE' of column 'date'

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Setup environment
    Setup snowflake backend.
    Have a table with DATE column
  2. Execute command
     piperider run
    
  3. See error
/private/tmp/piperider-test/venv/lib/python3.9/site-packages/snowflake/sqlalchemy/snowdialect.py:539: SAWarning: Did not recognize type 'DATE' of column 'date'
  sa_util.warn(
Error: 'NullType' object is not callable

Expected behavior
No error

Screenshots
If applicable, add screenshots to help explain your problem.
image

Desktop (please complete the following information):

  • OS: [e.g. macOS]
  • Python Version [e.g. 3.9]
  • Version [e.g. v0.4.1]

Additional context
Add any other context about the problem here.

Error: unsupported operand type(s) for -: 'NoneType' and 'NoneType'

Describe the bug
'Piperider run' is not fully functioning, it gives the following error: "Error: unsupported operand type(s) for -: 'NoneType' and 'NoneType'

The error shows immediately after profiling all the table columns.

To Reproduce
Steps to reproduce the behavior:

  1. install piperider with postgres
  2. piperider init
  3. piperider run

Desktop

  • OS: macOS
  • Python Version [3.9.12]
  • Version [v0.2.0]

New PipeRider compare command

Introduction

To code review a model change, we would like to compare the profiling and metric queries between the PR branch and the main branch. However, there are many steps in the whole process:

  1. Switch between the PR branch and the main branch
  2. Run dbt for the two branches
  3. Run piperider for the two branches

In order to make the process well-defined and reproducible, we would like to source control the workflow for PR comparison somewhere and use a single unified command to run the workflow. This is where the piperider compare command comes in.

Requirements

  1. The default behavior of piperider compare is to compare the current branch with the main branch.
  2. Allow defining other recipes to do the comparison (e.g. the branch to switch to, the dbt commands to run).
  3. Use piperider compare --recipe <recipe> to change the comparison workflow

Default Compare Recipe

The default recipe is generated when running piperider init:

# .piperider/compare/default.yml
base:
  branch: main
  dbt:
    commands:
    - dbt deps
    - dbt build
  piperider:
    command: piperider run
target:
  dbt:
    commands:
    - dbt deps
    - dbt build
  piperider:
    command: piperider run

and then run

piperider compare

You can change the default recipe later on. Just use an editor to modify the configuration at .piperider/compare/default.yml.

Custom Compare Recipe

Create a new file at .piperider/compare/<recipe>.yml and use piperider compare --recipe <recipe> to run this recipe.

Example recipes

Dev recipe (since v0.20.0)

This is a process to run and compare against the same base report. The requirements are:

  1. The base result comes from a file.
  2. No dbt commands are run for the target; we assume the dbt commands are run separately.
# .piperider/compare/dev.yml
base:
  file: 'path/to/base/run.json'
target:
  piperider:
    command: 'piperider run'

Slim CI recipe

Slim CI is a technique where we only transform the new and changed models compared to the base state. Here are the requirements:

  1. We already have the production dbt state somewhere.
  2. Run the PR branch dbt command with Slim CI options.
  3. Run PipeRider with --dbt-run-results so that it only profiles the models run in the latest dbt run.
# .piperider/compare/slim-ci.yml
base:
  branch: main
  piperider:
    command: "piperider run --dbt-state path/to/prod/artifacts"
target:
  dbt:
    commands:
    - dbt deps
    - "dbt build --select state:changed+ --defer --state path/to/prod/artifacts"
  piperider:
    command: "piperider run --dbt-run-results"

value overflow when profiling table with large numeric IDs

I'm new to PipeRider and have really enjoyed it so far. The ease-of-use is pretty remarkable.

I've profiled a number of databases over the last few days and ran into an issue, in a handful of databases, where numeric ID columns cause an overflow error that prevents the profiler from processing a table. E.g.

[ 9/33] <redacted_table_name> 0/31 0:00:15
Error: Profiler Exception: ProgrammingError('(snowflake.connector.errors.ProgrammingError) 100058 (22000): Value overflow in a SUM aggregate
[SQL: WITH anon_2 AS
(SELECT <redacted_table_name>.<redacted_numeric_id> AS c, <redacted_table_name>.<redacted_numeric_id> AS orig
FROM <redacted_table_name> ),
anon_1 AS
(SELECT anon_2.c AS c, CASE WHEN (anon_2.c = %(c_1)s) THEN %(param_1)s END AS zero, CASE WHEN (anon_2.c < %(c_2)s) THEN %(param_2)s END AS negative, anon_2.orig AS orig
FROM anon_2)
SELECT count(*) AS _total, count(anon_1.orig) AS _non_nulls, count(anon_1.c) AS _valids, count(anon_1.zero) AS _zeros, count(anon_1.negative) AS _negatives, count(DISTINCT anon_1.c) AS _distinct, sum(CAST(anon_1.c AS FLOAT)) AS _sum, avg(anon_1.c) AS _avg, min(anon_1.c) AS _min, max(anon_1.c) AS _max, stddev(anon_1.c) AS _stddev
FROM anon_1]
[parameters: {'c_1': 0, 'param_1': 1, 'c_2': 0, 'param_2': 1}]
(Background on this error at: https://sqlalche.me/e/14/f405)')

The first time this happened I thought it was an obscure edge-case. I removed the problematic table from config.yml and moved on. In my environment, large numeric ID columns are fairly common, and the boxplot of the numeric ID values isn't meaningful. The result of this behavior is that we skip profiling some of our most frequently used data assets.

A quick & dirty workaround would be to copy the entire table, drop the problematic column, document its existence, and profile the copy. That would result in the generated HTML reports becoming disorganized. It's also expensive and slow.

Perhaps config.yml could be modified to accept an optional "exclude_column" property for any table in the includes list. Or maybe PipeRider could optionally warn on error, instead of fail. Or perhaps there's a way to write the profiling SQL statements to capture the data in a way that doesn't cause the overflow.

If a table has 100 columns, and it was possible to profile 99 without any errors, I'm sure there are other people like me who'd prefer to have 99 profiled columns and a warning rather than an error and no report.
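
The "warn on error" idea could look roughly like this (a sketch only; the function names are made up, not PipeRider's profiler):

# Sketch: profile column by column and downgrade per-column errors to warnings.
def profile_table(columns, profile_column):
    results, warnings = {}, []
    for col in columns:
        try:
            results[col] = profile_column(col)
        except Exception as exc:  # e.g. "Value overflow in a SUM aggregate" from Snowflake
            warnings.append(f"{col}: skipped ({exc})")
    # 99 profiled columns plus a warning beats no report at all.
    return results, warnings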

Support column value check assertion

Summary

  1. Format check: the text column matches certain criteria (or a regular expression).
  2. Column values in set: the value of the column should only be valid within a certain set.
  3. Reference check: like a foreign key constraint in a database. Check whether the values in a column exist in a certain column of another table.

Currently, the assertion rules are all based on the profiling result. However, that is not enough for a data quality tool. This story introduces the first assertions that support non-metric-based checks.

Intended Outcome

  1. Add at least 3 non-metric-based tests.
  2. Design the foundation of non-metric assertions.

How will it work?

world_city:
  columns:
    create_at:
      tests:
      - name: assert_column_value
        assert:
          gte: <greater than or equal to>
          gt: <greater than>
          lte: <less than or equal to>
          lt: <less than>
          like: "<sql like>"
          match: "<regex>" # may not supported in some data source
world_city:
  columns:
    create_at:
      tests:
      - name: assert_column_value
        assert:
          in: ["foo", "bar"]

sc-28973

Support to include/exclude tables through configuration

Is your feature request related to a problem? Please describe.
For each data source, there could be many tables to profile. In the current PipeRider, we can only choose to profile all tables or a single table.

Describe the solution you'd like
Allow defining the tables (and even the columns in a table) to profile.

Describe alternatives you've considered
N/A

Additional context
sc-28172

Support Snowflake multi-factor authentication

Is your feature request related to a problem? Please describe.
In the current Snowflake integration, only logging in with basic authentication (user/password) is supported. If the user enables MFA, there is no way to use PipeRider.

Describe the solution you'd like
Reference: the Snowflake profile configuration in dbt

  1. run piperider run
  2. send push notification to Duo Mobile
  3. Approve the request
  4. Continue

Describe alternatives you've considered
N/A

Additional context
N/A

can't use up/down button to select data source in windows terminal

Describe the bug
can't use up/down button to select data source in windows terminal

To Reproduce
Steps to reproduce the behavior:

  1. Execute command: piperider init
  2. When PipeRider asks data source, user can't use up/down button to select data source.

Desktop (please complete the following information):

  • OS: Windows
  • Version: 0.4.1

Have an individual test page in the report so that I can see all the test results for a run

Summary

Currently, the run report only visualizes the table/column profiling result. However, assertions are another key feature of PipeRider. It is desirable to see the test results on a separate page.

Intended Outcome

  1. In the generated report, we can see separate pages for profiling and testing
  2. Profiling report should be navigated through table/column hierarchy
  3. The testing report should be a table-like view with filtering and sorting.

How will it work?

  1. Add a testing page here, show the testing result in a data table
    Screen Shot 2022-09-29 at 9 29 35 AM
  2. The frontend iterates through the JSON to collect all tests into a list.

sc-28854

`ValueError: Invalid isoformat string` error when running `piperider compare-reports` against a metric

Describe the bug
Running piperider compare-reports --last against a dbt metric results in an error like ValueError: Invalid isoformat string: '2023-04-01 00:00:00'

To Reproduce
Steps to reproduce the behavior:

  1. Setup new virtual environment and run pip install "piperider[bigquery]==0.23.3"
  2. Initialize Piperider with piperider init
  3. Add the piperider tag to a metric configuration file
  4. Run piperider run twice to generate two reports
  5. Run piperider compare-reports --last --debug
  6. See an error like ValueError: Invalid isoformat string: '2023-04-01 00:00:00'

Expected behavior
The compare-reports command should create an HTML report and a Markdown report in the .piperider/comparisons/ directory

Desktop (please complete the following information):

  • OS: Debian GNU/Linux
  • Python Version 3.9.16
  • Version: dbt 1.4.5, dbt metrics 1.4.0, piperider[bigquery]==0.23.3

Additional context
Add any other context about the problem here.

The offending metric configuration can be found below:

version: 2
metrics:
  - name: total_seconds_engaged
    label: Total Seconds Engaged
    model: ref('fct_hits')
    description: "The number of seconds readers spent viewing a page"
    calculation_method: sum
    expression: seconds_engaged
    timestamp: date
    time_grains: [day, week, month, quarter, year, all_time]
    dimensions:
      - article_arc_id
    tags:
      - piperider

The full output of piperider compare-reports --last --debug is attached:

piperider_debug.txt

Support to generate the comparison report with target's tables only

Summary

In #511, we allow profiling the tables for a subset of models in a dbt project. It is desirable to compare only the tables that are profiled in the target run.

Intended Outcome

In the comparison report and summary, show only the tables available in the target run.

How will it work?

  1. piperider run --output path/to/base/

  2. Change the model mymodel, then transform the model and its downstream models: dbt run --select mymodel+

  3. piperider run --dbt-state ./target

  4. Compare the two runs

    piperider compare-reports \
       --base path/to/base/run.json \
       --target .piperider/output/latest/run.json \
       --tables-from=target-only

piperider runs on wrong schema by default

Describe the bug
The command piperider run --table <table_name> points to the wrong schema

To Reproduce
Steps to reproduce the behavior:

  1. Setup environment
Core:
  - installed: 1.4.6
  - latest:    1.4.6 - Up to date!

Plugins:
  - postgres: 1.4.6 - Up to date!
  - redshift: 1.4.0 - Up to date!
  2. Execute command
dbt_core_test.sql: select 1, getdate()
$ dbt run --select dbt_core_test
$ piperider run --table dbt_core_test
  3. See error
DataSource: dev
─────────────────────────────────────────────────────────────────────────────────────────────────────── Validating ───────────────────────────────────────────────────────────────────────────────────────────────────────
everything is OK.
─────────────────────────────────────────────────────────────────────────────────────────────────────── Profiling ────────────────────────────────────────────────────────────────────────────────────────────────────────
[0/1] METADATA      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0/? 0:00:00Error: No such table 'vincent_dbt_dev.dbt_core_test'

Expected behavior
PipeRider should get the correct table from the schema set in the dbt profiles.yml.

Screenshots

15:03:18  Running with dbt=1.4.6
15:03:19  Found 107 models, 4 tests, 1 snapshot, 0 analyses, 699 macros, 1 operation, 0 seed files, 61 sources, 0 exposures, 0 metrics
15:03:19  
15:03:19  Concurrency: 4 threads (target='dev')
15:03:19  
15:03:19  1 of 1 START sql view model dbt_vincentzhx.dbt_core_test ....................... [RUN]
15:03:19  1 of 1 OK created sql view model dbt_vincentzhx.dbt_core_test .................. [CREATE VIEW in 0.31s]
15:03:19  
15:03:19  Running 1 on-run-end hook
15:03:19  1 of 1 START hook: re_data.on-run-end.0 ........................................ [RUN]
15:03:19  1 of 1 OK hook: re_data.on-run-end.0 ........................................... [OK in 0.00s]
15:03:19  
15:03:19  
15:03:19  Finished running 1 view model, 1 hook in 0 hours 0 minutes and 0.87 seconds (0.87s).
15:03:20  
15:03:20  Completed successfully
15:03:20  
15:03:20  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

DataSource: dev
─────────────────────────────────────────────────────────────────────────────────────────────────────── Validating ───────────────────────────────────────────────────────────────────────────────────────────────────────
everything is OK.
─────────────────────────────────────────────────────────────────────────────────────────────────────── Profiling ────────────────────────────────────────────────────────────────────────────────────────────────────────
[0/1] METADATA      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0/? 0:00:00Error: No such table 'vincent_dbt_dev.dbt_core_test'

Desktop (please complete the following information):

  • OS: linux
  • Python Version 3.9.4
  • DBT Version 1.4.6

Additional context
Add any other context about the problem here.

allow Snowflake authentication via externalbrowser

It's common for Snowflake users to authenticate using SSO or a Snowflake key pair rather than username/password.

e.g. Jupyter using SSO:

import os
from snowflake.connector import connect

conn = connect(
    user=os.getenv('SNOWFLAKE_USER'),
    account=os.getenv('SNOWFLAKE_ACCOUNT'),
    authenticator="externalbrowser",
)

cursor = conn.cursor()
cursor.execute(sql)

# ... etc...

Great Expectations great_expectations.yml using key pair authentication:

config_version: 3.0

datasources:
  my_snowflake:
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      credentials:
        host: <REDACTED>.us-east-1
        username: <REDACTED>
        query:
          schema: <REDACTED>
          warehouse: <REDACTED>
          role: <REDACTED>
        private_key_path: /path/to/key/rsa_key.p8
        private_key_passphrase: <REDACTED>
        drivername: snowflake
      class_name: SqlAlchemyExecutionEngine

It would be great if PipeRider did that too, out-of-the-box. 🚗

Ability to self-host PipeRider

Summary

Currently, there is no easy way to list and view single runs.

Intended Outcome

Provide a command to run a mini server

piperider server
  • List the runs
  • Show the runs without generated report

How will it work?

  1. The server would start a Flask server for serving.
  2. Provide a REST API for listing runs and getting a run result.
  3. Share the common UI between the static report and the SPA.

profiler fails when date values span a very large range

I have an edge case in my source data where the range of dates in a column spans an absurdly wide range. These are technically valid dates in Snowflake but, in reality, these extreme values are caused by a data integrity issue.

When I try to profile a table with such extreme date values, PipeRider throws an error:

[1/1]   my_table_with_a_redacted_name          ━━━━━━━╸━━━━━━━━━━━━━━━   4/9 0:00:06
Error: Profiler Exception: ValueError('year 10001 is out of range')

I took a look at the status of the variables in profiler.py and noticed that:

dmin = 0001-01-01
dmax = 9221-01-01

Clearly that makes no sense, even if these are technically valid dates.

I was able to get the profiler working, for this table, by increasing the number of buckets. I edited line 1184 in profiler.py. From:

interval_years = math.ceil((dmax.year - dmin.year) / 50)

to:

interval_years = math.ceil((dmax.year - dmin.year) / 10000)

This made the histogram trickier to read (i.e. almost all the values were squished into a handful of buckets), but that's certainly better than an error message and no profiling report:

histogram-is-a-bit-ugly-better-than-no-report

From a user's perspective, it's not clear from the error message what action can be taken to get a profiling report. My dirty hack (increasing the number of buckets from 50 to 10k) allowed me to get a profile report. That's better than no profile report.

I'm not sure what the right solution is. I do think it shouldn't just fail when date ranges span a very wide range.
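
One possible mitigation, sketched below, is to cap the span used for the bucket math so extreme outliers degrade into a warning instead of an error (this is only an illustration of the idea, not the fix PipeRider adopted):

# Sketch: cap the year span before computing histogram intervals.
import datetime
import math

def date_histogram_interval(dmin, dmax, buckets=50, max_span_years=200):
    span = dmax.year - dmin.year
    if span > max_span_years:
        span = max_span_years  # treat extreme outliers as a capped span and emit a warning
    return max(1, math.ceil(span / buckets))

print(date_histogram_interval(datetime.date(1, 1, 1), datetime.date(9221, 1, 1)))  # 4 instead of 185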

`piperider init` command fails when macros are present in `dbt_project.yml`

Describe the bug
The piperider init command doesn't complete successfully when there's Jinja in the dbt_project.yml file (even though it looks like this was fixed in #658). Instead it yields an error like:

Error: Failed to load dbt project '<project_yml_filepath>'. 
 Reason: <macro_name> is undefined 

To Reproduce
Steps to reproduce the behavior:

  1. Setup new virtual environment and run pip install "piperider[bigquery]==0.23.0"
  2. Navigate to the root directory of a dbt project whose dbt_project.yml file contains jinja
  3. Execute piperider init and select the appropriate dbt_project.yml filepath
  4. See error

Expected behavior
Expected output

[ DBT ] Use the existing dbt project file: 
<project_yml_filepath>
[ DBT ] By default, PipeRider will profile the models and metrics
with 'piperider' tag
        Apply 'piperider' tag to your models or change the tag in
'.piperider/config.yml'

───────────────────── .piperider/config.yml ─────────────────────
   1 dataSources: []                                             
   2 dbt:                                                        
   3   projectDir: .                                             
   4   tag: piperider                                            
   5                                                             
   6 profiler:                                                   
   7 #   table:                                                  
   8 #     # the maximum row count to profile. (Default unlimited
   9 #     limit: 1000000                                        
  10 #     duplicateRows: false                                  
  11                                                             
  12 telemetry:                                                  
  13   id: <telemetry_id>                      
  14                                                             
───────────────── End of .piperider/config.yml ──────────────────
──────────── Recipe: .piperider/compare/default.yml ─────────────
   1 base:                                                       
   2   branch: main                                              
   3   dbt:                                                      
   4     commands:                                               
   5     - dbt deps                                              
   6     - dbt build                                             
   7   piperider:                                                
   8     command: piperider run                                  
   9 target:                                                     
  10   dbt:                                                      
  11     commands:                                               
  12     - dbt deps                                              
  13     - dbt build                                             
  14   piperider:                                                
  15     command: piperider run                                  
  16                                                             
───────────────────────── End of Recipe ─────────────────────────

Next step:
  Please execute command 'piperider diagnose' to verify 
configuration

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: Debian GNU/Linux
  • Python 3.9.16
  • dbt 1.4.5

Additional context
In our particular instance, dbt_project.yml contains three macros:

  • A custom macro used to set a variable value:
vars:
  run_date: "{{ get_current_date() }}"
  • Two nested built-in macros to configure an on-run-end hook:
on-run-end:
  - "{% if target.name == 'prod' %}{{ dbt_artifacts.upload_results(results) }}{% endif %}"

If any of these macros are present, the piperider init command fails.

Integrate the dbt state in the piperider run

Summary

dbt generates artifacts for each run in the target path. We can treat a snapshot of the artifacts as the state of the dbt run.

Intended Outcome

In PipeRider, we would like to integrate the dbt state for the following features:

  1. Profile only the tables that are processed in the run results of the state.
  2. Integrate the dbt tests from the run results.
  3. Get the table/column descriptions from the manifest in the state.

How will it work?

For a dbt data source, run without a dbt state

  1. piperider run
  2. All the tables are profiled, no dbt tests/description are integrated

For a dbt data source, run with dbt state

  1. dbt run --select mymodel+
  2. piperider run --dbt-state ./target
  3. Only the models processed by dbt are profiled by PipeRider, as sketched below.
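
For illustration, the run results could be read along these lines (a sketch based on the public dbt artifact schema; the path is the dbt default):

# Sketch: list the models executed in the latest dbt run from target/run_results.json.
import json

with open("target/run_results.json") as f:
    run_results = json.load(f)

models_run = [
    r["unique_id"]
    for r in run_results["results"]
    if r["unique_id"].startswith("model.") and r["status"] == "success"
]
print(f"{len(models_run)} models to profile: {models_run}")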

Breaking change

We remove the two options

$ piperider run --dbt-test
$ piperider run --dbt-build

Use the following command instead

$ dbt test
$ piperider run --dbt-state ./target

sc-29554

fe code setup issue

Describe the bug

Local setup and starting a single report failed.

To Reproduce
Steps to reproduce the behavior:

  1. Setup environment:
node --version              
v14.18.1
  2. Execute command
yarn
yarn run start:single
  3. See error
index.js:19 Uncaught RangeError: Invalid time value
    at DateTimeFormat.formatToParts (<anonymous>)
    at partsTimeZone (index.js:19:1)
    at tzIntlTimeZoneName (index.js:15:1)
    at Object.z (index.js:100:1)
    at index.js:344:1
    at Array.reduce (<anonymous>)
    at format (index.js:337:1)
    at formatInTimeZone (index.js:41:1)
    at formatReportTime (formatters.tsx:20:1)
    at SRTablesListPage (SRTablesListPage.tsx:27:1)
react-dom.development.js:18687 The above error occurred in the <SRTablesListPage> component:

    at SRTablesListPage (http://localhost:3002/static/js/bundle.js:3435:5)
    at component
    at Route (http://localhost:3002/static/js/bundle.js:356378:5)
    at Switch (http://localhost:3002/static/js/bundle.js:356432:5)
    at Router (http://localhost:3002/static/js/bundle.js:356364:69)
    at Suspense
    at AppSingle
    at App
    at EnvironmentProvider (http://localhost:3002/static/js/bundle.js:26937:24)
    at ColorModeProvider (http://localhost:3002/static/js/bundle.js:15589:21)
    at ThemeProvider (http://localhost:3002/static/js/bundle.js:44696:64)
    at ThemeProvider (http://localhost:3002/static/js/bundle.js:32475:27)
    at ChakraProvider (http://localhost:3002/static/js/bundle.js:26228:24)
    at ChakraProvider (http://localhost:3002/static/js/bundle.js:27649:23)

Consider adding an error boundary to your tree to customize error handling behavior.
Visit https://reactjs.org/link/error-boundaries to learn more about error boundaries.
logCapturedError @ react-dom.development.js:18687
react-dom.development.js:26923 Uncaught RangeError: Invalid time value
    at DateTimeFormat.formatToParts (<anonymous>)
    at partsTimeZone (index.js:19:1)
    at tzIntlTimeZoneName (index.js:15:1)
    at Object.z (index.js:100:1)
    at index.js:344:1
    at Array.reduce (<anonymous>)
    at format (index.js:337:1)
    at formatInTimeZone (index.js:41:1)
    at formatReportTime (formatters.tsx:20:1)
    at SRTablesListPage (SRTablesListPage.tsx:27:1)

Meanwhile, in the terminal everything looks successful.

Compiled successfully!

You can now view piperider-report in the browser.

  Local:            http://localhost:3002
  On Your Network:  http://192.168.1.159:3002

Note that the development build is not optimized.
To create a production build, use yarn build.

webpack compiled successfully
Files successfully emitted, waiting for typecheck results...
Issues checking in progress...
No issues found.

Expected behavior

Setup run successfully and can serve report locally

Screenshots
If applicable, add screenshots to help explain your problem.

image

Desktop (please complete the following information):

  • OS: macOS
  • Python Version 3.7.10
  • Piperider Version 0.8.0-dev

Additional context

I've followed the README.md in the static report folder to init, and generated single and comparison reports with some sample datasets.

Data source integration: SQL Server

Summary

Data source integration with Microsoft SQL Server

Intended Outcome

  • Support basic authentication: server, port, schema, user, password
  • Support the dbt mssql profile

How will it work?

Support new data source in piperider init

Connection bug with dbt-snowflake

Describe the bug
On running piperider init, there is an error message:

Sentry is attempting to send 1 pending error messages
Waiting up to 2 seconds
Press Ctrl-C to quit

However, when I run dbt debug, the connection is successful.

I have tried with dbt-snowflake==1.3 and dbt-snowflake==1.4

To Reproduce
Steps to reproduce the behavior:

  1. Setup environment.

pip install 'piperider[snowflake]'
pip install dbt-snowflake==1.3
pip install markupsafe==2.1.1

  2. Execute command: dbt debug
    It's a success.

  3. Execute another command: piperider init

  4. See error:

Error: expected token ',', got 'SNOWFLAKE_PASSWORD'
Sentry is attempting to send 1 pending error messages
Waiting up to 2 seconds
Press Ctrl-C to quit

Expected behavior
Config file is created and piperider is initialised

Desktop (please complete the following information):

  • OS: MacOs
  • Python Version 3.9.13

Can not execute `run` command on Windows. Show error message "[WinError 3] The system cannot find the path specified: 'C'"

Describe the bug
When running the piperider run command on the Windows platform, it shows the following error message:
[WinError 3] The system cannot find the path specified: 'C'

To Reproduce
Steps to reproduce the behavior:

  1. Install piperider on Windows
  2. Setup piperider project
  3. Run piperider by piperider run

Expected behavior
No error

Desktop (please complete the following information):

  • OS: Windows
  • Python Version: 3.9
  • Version: 0.4.2

Additional context
Add any other context about the problem here.

Support Custom Metric

Introduction

From in-app feedback

It would be nice if I could customize the metrics as well (not just assertions), I'm looking to fulfill the following metrics: skewness, mode, kurtosis, number of values with leading and trailing spaces, number of values with trailing spaces ONLY, number of values with leading spaces ONLY.

Please provide documentation if there is a way to do this, I'm currently trying to understand how custom assertions work but it would be much appreciated if we can customize the metrics as well.

Intended outcome

TBD

How will it work?

TBD

Allow running tests without profiling

Summary

Running the profiler is costly and time-consuming. It would be good to be able to run tests only.

Intended Outcome

  1. New command piperider test
  2. Run only the test-relevant queries.
  3. Get the run result.

How will it work?

  1. For all current built-in assertions, if the profiling result is not available, run the query.
  2. The profiling engine and assertion engine can share some query packages.

sc-28964
