greatexpectationslabs / ge_tutorials

Learn how to add data validation and documentation to a data pipeline built with dbt and Airflow.


ge_tutorials's People

Contributors: anthonyburdi, aylr, eugmandel, jcampbell, jdimatteo, jennytee, martinguindon, spbail


ge_tutorials's Issues

Great Expectations Error in Airflow execution

The airflow (GE) DAG is unable to execute successfully because the configs appear to be incompatible with the version of GE that is being installed within the container. See the error below:

*** Reading local file: /usr/local/airflow/logs/ge_tutorials_dag_with_ge/task_validate_source_data/2020-07-08T14:23:36.810518+00:00/1.log
[2020-07-08 14:23:40,135] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: ge_tutorials_dag_with_ge.task_validate_source_data 2020-07-08T14:23:36.810518+00:00 [queued]>
[2020-07-08 14:23:40,153] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: ge_tutorials_dag_with_ge.task_validate_source_data 2020-07-08T14:23:36.810518+00:00 [queued]>
[2020-07-08 14:23:40,153] {{taskinstance.py:866}} INFO - 
--------------------------------------------------------------------------------
[2020-07-08 14:23:40,153] {{taskinstance.py:867}} INFO - Starting attempt 1 of 1
[2020-07-08 14:23:40,153] {{taskinstance.py:868}} INFO - 
--------------------------------------------------------------------------------
[2020-07-08 14:23:40,164] {{taskinstance.py:887}} INFO - Executing <Task(PythonOperator): task_validate_source_data> on 2020-07-08T14:23:36.810518+00:00
[2020-07-08 14:23:40,168] {{standard_task_runner.py:53}} INFO - Started process 1006 to run task
[2020-07-08 14:23:40,211] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: ge_tutorials_dag_with_ge.task_validate_source_data 2020-07-08T14:23:36.810518+00:00 [running]> 26f657a26611
[2020-07-08 14:23:40,252] {{taskinstance.py:1128}} ERROR - You appear to have an invalid config version (1.0).
    The version number must be at least 2. Please see the migration guide at https://docs.greatexpectations.io/en/latest/how_to_guides/migrating_versions.html
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 966, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
    return_value = self.execute_callable()
  File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/dags/ge_tutorials_dag_with_great_expectations.py", line 70, in validate_source_data
    context = ge.data_context.DataContext(great_expectations_context_path)
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context.py", line 2148, in __init__
    project_config = self._load_project_config()
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context.py", line 2186, in _load_project_config
    return DataContextConfig.from_commented_map(config_dict)
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/types/base.py", line 86, in from_commented_map
    config = dataContextConfigSchema.load(commented_map)
  File "/usr/local/lib/python3.7/site-packages/marshmallow/schema.py", line 723, in load
    data, many=many, partial=partial, unknown=unknown, postprocess=True
  File "/usr/local/lib/python3.7/site-packages/marshmallow/schema.py", line 886, in _do_load
    field_errors=field_errors,
  File "/usr/local/lib/python3.7/site-packages/marshmallow/schema.py", line 1189, in _invoke_schema_validators
    partial=partial,
  File "/usr/local/lib/python3.7/site-packages/marshmallow/schema.py", line 774, in _run_validator
    validator_func(output, partial=partial, many=many)
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/types/base.py", line 298, in validate_schema
    data["config_version"], MINIMUM_SUPPORTED_CONFIG_VERSION
great_expectations.exceptions.UnsupportedConfigVersionError: You appear to have an invalid config version (1.0).
    The version number must be at least 2. Please see the migration guide at https://docs.greatexpectations.io/en/latest/how_to_guides/migrating_versions.html
[2020-07-08 14:23:40,254] {{taskinstance.py:1185}} INFO - Marking task as FAILED.dag_id=ge_tutorials_dag_with_ge, task_id=task_validate_source_data, execution_date=20200708T142336, start_date=20200708T142340, end_date=20200708T142340
[2020-07-08 14:23:50,113] {{logging_mixin.py:112}} INFO - [2020-07-08 14:23:50,113] {{local_task_job.py:103}} INFO - Task exited with return code 1
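The error above means the repo's great_expectations.yml still declares config_version: 1.0 while the GE release installed in the container requires at least 2. One hedged workaround is to pin great_expectations in requirements.txt to a release that predates the version-2 requirement (the exact boundary below is an assumption — check the GE changelog), or to follow the linked migration guide and bump the config instead:

```
# requirements.txt — pin assumed (not verified) to predate the
# config_version >= 2 requirement
great_expectations<0.11.0
```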

Issue with v3 tutorial -- Creating checkpoint creates error

When I run:

context.run_checkpoint(checkpoint_name=my_checkpoint_name)
context.open_data_docs()

I get this error:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\johnarmstrong\OneDrive - Corus Entertainment Inc\Documents\ge_tutorials\great_expectations\uncommitted/validations/getting_started_expectation_suite_taxi\demo\20220118-150433-my-run-name-template\20220118T150433.457469Z\444fa93fe34e9e162c5f910bca5b5916.json'
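For what it's worth, the mixed `\` and `/` separators in that path are both valid on Windows and are unlikely to be the cause by themselves; a minimal sketch (the path string is illustrative, not the actual one from the error) showing that Python treats them interchangeably:

```python
from pathlib import PureWindowsPath

# Windows accepts both "/" and "\" as separators; pathlib normalizes them.
mixed = r"great_expectations\uncommitted/validations/suite\run.json"
print(PureWindowsPath(mixed).as_posix())
# great_expectations/uncommitted/validations/suite/run.json
```

If the separators are not the problem, the file may genuinely be missing — worth checking whether the validation result was written at all under `uncommitted/validations/`.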

Unable to launch Airflow DBT tutorial using Docker

Out of the box, there appear to be several issues with running the Airflow DBT examples using Docker.

  1. The 1.0.x release of dbt should be installed using pip install dbt-core or pip install dbt-<connector>, e.g., pip install dbt-postgres. This applies both to the Dockerfile and to requirements.txt.
  2. Dependency resolution complains about a few packages, ultimately causing Airflow to fail to start. I've been able to resolve this by pinning dbt-postgres<1.0.0, wtforms==2.3.3, and werkzeug<1.0.0 in requirements.txt; the latter two pins are required by Airflow 1.10.9.
  3. The dbt_project.yml is missing the config-version: 2 setting, which prevents DAGs from executing.
  4. There is a typo in airflow/ge_tutorials_dag_with_great_expectations.py, though it looks like #16 addresses this:
webserver_1     | Traceback (most recent call last):
webserver_1     |   File "/usr/local/lib/python3.7/site-packages/airflow/models/dagbag.py", line 243, in process_file
webserver_1     |     m = imp.load_source(mod_name, filepath)
webserver_1     |   File "/usr/local/lib/python3.7/imp.py", line 171, in load_source
webserver_1     |     module = _load(spec)
webserver_1     |   File "<frozen importlib._bootstrap>", line 696, in _load
webserver_1     |   File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
webserver_1     |   File "<frozen importlib._bootstrap_external>", line 724, in exec_module
webserver_1     |   File "<frozen importlib._bootstrap_external>", line 860, in get_code
webserver_1     |   File "<frozen importlib._bootstrap_external>", line 791, in source_to_code
webserver_1     |   File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
webserver_1     |   File "/usr/local/airflow/dags/ge_tutorials_dag_with_great_expectations.py", line 21
webserver_1     |     "owner":` "Airflow",
webserver_1     |             ^
webserver_1     | SyntaxError: invalid syntax
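The pins from item 2 above can be expressed in requirements.txt roughly as follows (exact version choices are assumptions to verify against the Airflow 1.10.9 constraints):

```
dbt-postgres<1.0.0
wtforms==2.3.3
werkzeug<1.0.0
```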

No module named `custom_module`

I'm trying to import the ExpectColumnMaxToBeBetweenCustom Expectation from this tutorial, but I get `No module named 'custom_module'`.

I have copied the file from the Complete Example into the great_expectations/plugins directory.

For the suggested line `from custom_module import ExpectColumnMaxToBeBetweenCustom` to work, a few more steps are required:

  1. The column_custom_max_expectation.py file should be in great_expectations/plugins/custom_module instead of great_expectations/plugins
  2. In great_expectations/plugins/custom_module, there should be an __init__.py file
  3. In the __init__.py file there should be the line from .column_custom_max_expectation import ExpectColumnMaxToBeBetweenCustom
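The three steps above can be sketched as shell commands run from the project root (the `touch` here is a stand-in for copying the Complete Example file into place):

```shell
# create the package directory expected by `from custom_module import ...`
mkdir -p great_expectations/plugins/custom_module
# 1. place the Expectation implementation inside the package
touch great_expectations/plugins/custom_module/column_custom_max_expectation.py
# 2 & 3. add an __init__.py that re-exports the class
cat > great_expectations/plugins/custom_module/__init__.py <<'EOF'
from .column_custom_max_expectation import ExpectColumnMaxToBeBetweenCustom
EOF
```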

Dockerization for easy setup

I think it would be a good idea to dockerize the setup, which would make it easier to get started with the repo. Right now it requires a number of separate steps and configs, all of which could be handled inside a docker-compose file. If this is something that is of interest to you, I can work on it.

The idea would be to run the repo in one container and a Postgres database in another, which Airflow, GE, and dbt can connect to at a static address.
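A minimal docker-compose sketch of that layout might look like the following (service names, image tags, and the GE_TUTORIAL_DB_URL wiring are assumptions, not the repo's actual setup):

```yaml
version: "3"
services:
  postgres:
    image: postgres:12
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: warehouse
  airflow:
    build: .                     # image with airflow + dbt + great_expectations
    depends_on:
      - postgres
    environment:
      # containers reach Postgres at the static service name "postgres"
      GE_TUTORIAL_DB_URL: postgresql://airflow:airflow@postgres:5432/warehouse
    ports:
      - "8080:8080"
```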

execution error from task_transform_data_in_db

I tried to run the docker-compose file for this tutorial, but it errors at this task.
Which dbt and great_expectations versions does this tutorial use?
Thank you very much.

Reading local file: /usr/local/airflow/logs/ge_tutorials_dag_with_ge/task_transform_data_in_db/2021-05-17T17:02:07.041650+00:00/1.log
[2021-05-17 17:02:35,023] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: ge_tutorials_dag_with_ge.task_transform_data_in_db 2021-05-17T17:02:07.041650+00:00 [queued]>
[2021-05-17 17:02:35,037] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: ge_tutorials_dag_with_ge.task_transform_data_in_db 2021-05-17T17:02:07.041650+00:00 [queued]>
[2021-05-17 17:02:35,037] {{taskinstance.py:866}} INFO - 
--------------------------------------------------------------------------------
[2021-05-17 17:02:35,037] {{taskinstance.py:867}} INFO - Starting attempt 1 of 1
[2021-05-17 17:02:35,037] {{taskinstance.py:868}} INFO - 
--------------------------------------------------------------------------------
[2021-05-17 17:02:35,046] {{taskinstance.py:887}} INFO - Executing <Task(BashOperator): task_transform_data_in_db> on 2021-05-17T17:02:07.041650+00:00
[2021-05-17 17:02:35,049] {{standard_task_runner.py:53}} INFO - Started process 855 to run task
[2021-05-17 17:02:35,090] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: ge_tutorials_dag_with_ge.task_transform_data_in_db 2021-05-17T17:02:07.041650+00:00 [running]> e031efb82af6
[2021-05-17 17:02:35,106] {{bash_operator.py:82}} INFO - Tmp dir root location: 
 /tmp
[2021-05-17 17:02:35,107] {{bash_operator.py:105}} INFO - Temporary script location: /tmp/airflowtmp0lt4ylbu/task_transform_data_in_dboeetxzin
[2021-05-17 17:02:35,107] {{bash_operator.py:115}} INFO - Running command: dbt run --project-dir /usr/local/airflow/dbt
[2021-05-17 17:02:35,112] {{bash_operator.py:122}} INFO - Output:
[2021-05-17 17:02:37,650] {{bash_operator.py:126}} INFO - Running with dbt=0.19.1
[2021-05-17 17:02:37,660] {{bash_operator.py:126}} INFO - Encountered an error while reading the project:
[2021-05-17 17:02:37,661] {{bash_operator.py:126}} INFO -   ERROR: Runtime Error
[2021-05-17 17:02:37,661] {{bash_operator.py:126}} INFO -   Invalid config version: 1, expected 2
[2021-05-17 17:02:37,661] {{bash_operator.py:126}} INFO - 
[2021-05-17 17:02:37,661] {{bash_operator.py:126}} INFO - Error encountered in /usr/local/airflow/dbt/dbt_project.yml
[2021-05-17 17:02:37,665] {{bash_operator.py:126}} INFO - Encountered an error:
[2021-05-17 17:02:37,666] {{bash_operator.py:126}} INFO - Runtime Error
[2021-05-17 17:02:37,666] {{bash_operator.py:126}} INFO -   Could not run dbt
[2021-05-17 17:02:37,754] {{bash_operator.py:130}} INFO - Command exited with return code 2
[2021-05-17 17:02:37,761] {{taskinstance.py:1128}} ERROR - Bash command failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 966, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.7/site-packages/airflow/operators/bash_operator.py", line 134, in execute
    raise AirflowException("Bash command failed")
airflow.exceptions.AirflowException: Bash command failed
[2021-05-17 17:02:37,763] {{taskinstance.py:1185}} INFO - Marking task as FAILED.dag_id=ge_tutorials_dag_with_ge, task_id=task_transform_data_in_db, execution_date=20210517T170207, start_date=20210517T170235, end_date=20210517T170237
[2021-05-17 17:02:45,018] {{logging_mixin.py:112}} INFO - [2021-05-17 17:02:45,017] {{local_task_job.py:103}} INFO - Task exited with return code 1
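The `Invalid config version: 1, expected 2` error comes from dbt 0.17+ requiring `config-version: 2` in dbt_project.yml. A hedged sketch of the fix (the other keys shown are placeholders, not the repo's actual values):

```yaml
# dbt_project.yml
name: ge_tutorials          # placeholder project name
version: "1.0.0"
config-version: 2           # required by dbt >= 0.17 (e.g. dbt 0.19.1 above)
profile: default
```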

mixed concerns and dependency management issues

Hi there,

Excited to see this tutorial, as it's something we've been struggling with in the past.

I've tried mixing Airflow, dbt, and GE before. This approach has two issues:

  • dependency management nightmare: too many transitive dependencies
  • mixed concerns:
    • GE performs data quality checks
    • DBT creates/updates tables in your DW
    • Airflow triggers jobs

Here's the approach we've taken:

  • have 2 repos:
    • model definitions along with their expectations in 1 repo that spits out 2 dockerised containers:
      • one for dbt runs
      • one for GE runs
    • airflow dag definitions
      • triggers dbt container
      • triggers GE container

Could you validate our approach, please? Is this the way GE was designed to be used?

ModuleNotFoundError: No module named 'wtforms.compat'

Hey folks, I'm trying to use this repo, specifically the Airflow examples, for a PoC, but I keep getting the following error:

webserver_1     | [2021-11-09 15:56:44,142] {{settings.py:253}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=16
webserver_1     | Traceback (most recent call last):
webserver_1     |   File "/usr/local/bin/airflow", line 26, in <module>
webserver_1     |     from airflow.bin.cli import CLIFactory
webserver_1     |   File "/usr/local/lib/python3.7/site-packages/airflow/bin/cli.py", line 70, in <module>
webserver_1     |     from airflow.www.app import (cached_app, create_app)
webserver_1     |   File "/usr/local/lib/python3.7/site-packages/airflow/www/app.py", line 37, in <module>
webserver_1     |     from airflow.www.blueprints import routes
webserver_1     |   File "/usr/local/lib/python3.7/site-packages/airflow/www/blueprints.py", line 25, in <module>
webserver_1     |     from airflow.www import utils as wwwutils
webserver_1     |   File "/usr/local/lib/python3.7/site-packages/airflow/www/utils.py", line 35, in <module>
webserver_1     |     from wtforms.compat import text_type
webserver_1     | ModuleNotFoundError: No module named 'wtforms.compat'
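This traceback matches the wtforms pin mentioned in the Docker issue above: the `wtforms.compat` module was removed in WTForms 3.0, so a hedged fix is pinning it in requirements.txt (the exact version is the one reported to work elsewhere in this repo, not independently verified):

```
wtforms==2.3.3
```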

Docker dag run ge error - invalid config version (1.0)

Nice to have the Docker tutorial for convenience! However, the Dockerfile has probably gone out of date, since I get an error from the first task of the DAG with GE:

[2020-06-27 11:21:35,173] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: ge_tutorials_dag_with_ge.task_validate_source_data 2020-06-27T11:21:27.511382+00:00 [running]> 7617b740a5b3
[2020-06-27 11:21:35,247] {{taskinstance.py:1128}} ERROR - You appear to have an invalid config version (1.0).
    The version number must be at least 2. Please see the migration guide at https://docs.greatexpectations.io/en/latest/how_to_guides/migrating_versions.html

Unable to find match for config variable warehouse

Hi,

trying out the demo, and running into a problem when executing the following (by the way, the airflow test command required me to add an execution date for the DAG and task):

airflow test ge_tutorials_dag task_validate_source_data 2020-03-09

This command gives me the following error

  File "/home/kris/anaconda3/lib/python3.7/site-packages/great_expectations/data_context/util.py", line 168, in substitute_config_variable
    raise InvalidConfigError("Unable to find match for config variable {:s}. See https://great-expectations.readthedocs.io/en/latest/reference/data_context_reference.html#managing-environment-and-secrets".format(match.group(1)))
great_expectations.exceptions.InvalidConfigError: Unable to find match for config variable warehouse. See https://great-expectations.readthedocs.io/en/latest/reference/data_context_reference.html#managing-environment-and-secrets

I'm running MariaDB from a fresh install on my local machine (user root, no password), and my .dbt profile is:

$ cat ~/.dbt/profiles.yml 

# For more information on how to configure this file, please see:
# https://docs.getdbt.com/docs/profile

default:
  outputs:
    dev:
      type: mysql
      threads: 1
      host: 127.0.0.1
      port: 3306
      user: root
      pass: 
      dbname: warehouse

and I do have env vars set up:

GE_TUTORIAL_DB_URL=mysql://root@localhost:3306/ge    #also tried warehouse as db name instead of ge
GE_TUTORIAL_PROJECT_PATH=/home/kris/projects/ge_demo/ge_tutorials

(and airflow is setup correctly because I could see ge_tutorials_dag in the airflow ui)
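The error means great_expectations.yml references a config variable named `warehouse` that GE cannot resolve from either config_variables.yml or the environment. Assuming the tutorial's config contains a `${warehouse}` substitution (worth confirming in your great_expectations.yml), one hedged fix is to define it in uncommitted/config_variables.yml or export an environment variable of the same name:

```yaml
# great_expectations/uncommitted/config_variables.yml
# (connection string mirrors the MariaDB setup above; adjust as needed)
warehouse: mysql://root@127.0.0.1:3306/warehouse
```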

Sorry, I don't have time to investigate further; I just wanted to give feedback on your awesome tutorial. Maybe it also helps someone else if they hit a snag.

Thanks and keep up your awesome stuff :D
