etl-with-airflow's People

Contributors

dependabot[bot], dhuang, gtoonstra, jhtimmins

etl-with-airflow's Issues

How do you generate all of the SQL scripts/DDL?

Hello - this project is amazing. I have read over it literally 50 times and I find myself coming back to it over and over again when I am trying to find a nice solution for a problem.

I was wondering how you generated all of the DDL for creating tables and constraints. The issue I am facing now is that I have 116 files (CSVs dumped on an SFTP server) that I need to extract (I am following a pattern similar to your PredictableFileStorage approach). I need to load these files into the database, but it feels painful to write 116 CREATE TABLE scripts.

Thoughts -

  1. Use pandas to inspect dtypes to determine what the Postgres type should be (though NaNs/nulls in integer columns get cast to float, and some dates are read as object even though they should be dates).
  2. List all of the columns, dtypes, primary keys, etc. and have SQLAlchemy generate the tables - this is painful too because I need to know the columns in advance.
  3. Use format strings or Jinja2 templates, pass the list of columns per file, print out the DDL, and copy/paste that into a real SQL file - then just update specific columns or types and add constraints as I go along and learn the contents of the data (a sketch of this approach follows below).
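
A minimal sketch of options 1 and 3 combined, assuming the CSVs are small enough to sample with pandas; the file name, table name and type mapping below are illustrative, and the generated DDL is only a starting point to hand-tune:

    # Sketch: infer a rough Postgres CREATE TABLE statement from a CSV sample.
    # The dtype mapping is deliberately naive; integer columns containing NaNs
    # come back from pandas as float64 and will need manual correction.
    import pandas as pd

    PG_TYPES = {
        "int64": "BIGINT",
        "float64": "DOUBLE PRECISION",
        "bool": "BOOLEAN",
        "datetime64[ns]": "TIMESTAMP",
        "object": "TEXT",
    }

    def csv_to_ddl(csv_path, table_name, sample_rows=1000):
        df = pd.read_csv(csv_path, nrows=sample_rows)
        columns = ",\n    ".join(
            "{} {}".format(col, PG_TYPES.get(str(dtype), "TEXT"))
            for col, dtype in df.dtypes.items()
        )
        return "CREATE TABLE {} (\n    {}\n);".format(table_name, columns)

    if __name__ == "__main__":
        # hypothetical file; in practice loop over all 116 files from the SFTP dump
        print(csv_to_ddl("customers.csv", "staging.customers"))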

Error in Ad Hoc Query

Hello.

I'm trying to follow the tutorial, and I've created a postgres_oltp connection with login and password set to oltp_read and port 5432. When I try to run an Ad Hoc Query, I get an error message saying 'FATAL: Peer authentication failed for user "oltp_read"'. Can you give me some help with this?

Thank you
Marina
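
One likely cause, though it depends on the local setup: if the connection has no host, psycopg2 connects over the Unix socket and Postgres applies peer authentication, which requires the database user to match the OS user. Either switch the local entry in pg_hba.conf from peer to md5, or give the connection a host so it uses TCP with password authentication. A minimal sketch of the latter, with the host, database and password values as assumptions:

    # Sketch: define the postgres_oltp connection so psycopg2 connects over TCP
    # (password auth) instead of the local Unix socket (peer auth).
    from airflow import settings
    from airflow.models import Connection

    conn = Connection(
        conn_id="postgres_oltp",
        conn_type="postgres",
        host="localhost",      # a host value forces a TCP connection
        schema="orders",       # database name; adjust for your setup
        login="oltp_read",
        password="oltp_read",  # placeholder; use the real password
        port=5432,
    )

    session = settings.Session()
    session.add(conn)
    session.commit()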

Error running the file_ingest example

When running the file_ingest example I'm getting the error:
[2019-02-20 16:25:36,677] {models.py:1788} ERROR - The conn_id fs_source_system isn't defined
Traceback (most recent call last):
File "/home/tkiely/anaconda3/envs/Sunbird/lib/python3.6/site-packages/airflow/models.py", line 1657, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/tkiely/tools/airflow/dags/acme/operators/file_operators.py", line 60, in execute
src_hook = FSHook(conn_id=self.src_conn_id)
File "/home/tkiely/anaconda3/envs/Sunbird/lib/python3.6/site-packages/airflow/contrib/hooks/fs_hook.py", line 38, in init
conn = self.get_connection(conn_id)

I see where 'fs_source_system' is passed to the operator in the DAG.
Does this need to be defined in Airflow?
Thanks
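
For reference, FSHook resolves the conn_id against Airflow's connection table, so fs_source_system does need to be defined in Airflow (Admin -> Connections in the UI, or programmatically). A minimal sketch, where the path value is an assumption and should point at the directory the example ingests from:

    # Sketch: register the 'fs_source_system' connection that the operator's
    # FSHook looks up; FSHook reads the base directory from the extra field.
    import json

    from airflow import settings
    from airflow.models import Connection

    conn = Connection(
        conn_id="fs_source_system",
        conn_type="fs",
        extra=json.dumps({"path": "/tmp/ingest"}),  # hypothetical source directory
    )

    session = settings.Session()
    session.add(conn)
    session.commit()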

Why can you run the beeline command in a Docker container?

Hi, I am new to Airflow and I want to run it in a Docker container. My Docker image, based on puckel/docker-airflow, already has the JAR packages installed that the beeline command needs. In my Docker container I can run the HiveOperator but not the MySqlToHiveTransfer operator, because I found that the HiveCliHook's beeline LOAD DATA command can only run on the HiveServer2 host or load data that is already in HDFS.

So I wanted to ask: how are you able to run the HiveCliHook in a Docker container?

SQL path

Hello, I wanted to know what SQL path should be used to access the SQL scripts via template_searchpath. Thank you
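
For context, the example DAGs in this repository read the path from an Airflow Variable named sql_path and hand it to the DAG's template_searchpath, so it should be the directory that contains the repository's SQL templates. A hedged sketch, where the directory is an assumption for a typical install:

    # Sketch: set the 'sql_path' Variable once, then let the DAG use it as the
    # Jinja search path so templated *.sql files are resolved against it.
    from datetime import datetime

    from airflow import DAG
    from airflow.models import Variable

    # one-off setup (could also be done via Admin -> Variables in the UI);
    # the path below is hypothetical
    Variable.set("sql_path", "/usr/local/airflow/sql")

    tmpl_search_path = Variable.get("sql_path")

    dag = DAG(
        dag_id="example_templated_sql",
        start_date=datetime(2019, 1, 1),
        schedule_interval=None,
        template_searchpath=tmpl_search_path,
    )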

PostgresToPostgresOperator causing UnicodeEncodeError

Hi,

I'm failing to get your operator to work due to what seems to be a mismatch between SQLAlchemy and psycopg2/Postgres. See this Stack Overflow thread as well. The case is a transfer between two Postgres databases on Heroku.

Log excerpt from testing the Operator with a task:

[2017-06-23 14:43:23,814] {models.py:1342} INFO - Executing <Task(PostgresToPostgresOperator): test_transfer> on 2017-06-23 00:00:00
[2017-06-23 14:43:23,846] {postgres_to_postgres_operator.py:51} INFO - Executing: SELECT a,b,c FROM public.test LIMIT 1000;
[2017-06-23 14:43:23,846] {postgres_to_postgres_operator.py:55} INFO - Transferring Postgres query results into other Postgres database.
[2017-06-23 14:43:23,847] {base_hook.py:67} INFO - Using connection to: src_pg_conn_id
[2017-06-23 14:43:23,867] {postgres_to_postgres_operator.py:64} INFO - Inserting rows into Postgres
[2017-06-23 14:43:23,868] {base_hook.py:67} INFO - Using connection to: dest_pg_conn_id
[2017-06-23 14:43:23,886] {models.py:1417} ERROR - 'ascii' codec can't encode character u'\xfc' in position 177: ordinal not in range(128)
Traceback (most recent call last):
  File "/app/.heroku/miniconda/lib/python2.7/site-packages/airflow/models.py", line 1374, in run
    result = task_copy.execute(context=context)
  File "/app/plugins/postgres_to_postgres_operator.py", line 66, in execute
    dest_pg.insert_rows(table=self.pg_table, rows=cursor)
  File "/app/.heroku/miniconda/lib/python2.7/site-packages/airflow/hooks/dbapi_hook.py", line 220, in insert_rows
    ",".join(values))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 177: ordinal not in range(128)
[2017-06-23 14:43:23,887] {models.py:1441} INFO - Marking task as FAILED.

Did you come across this during development? Any tips on how to make the operator work?
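
Judging only from the traceback, Python 2's insert_rows builds the INSERT statement with str() per cell, which fails on non-ASCII unicode values. One hedged workaround, not the operator's actual code, is to encode unicode cells to UTF-8 before handing the rows over (names below are illustrative); moving to Python 3 avoids the problem entirely:

    # -*- coding: utf-8 -*-
    # Sketch (Python 2): pre-encode unicode cells so that the generic
    # dbapi_hook.insert_rows() can stringify every value without hitting
    # the 'ascii' codec error from the traceback above.
    def encode_rows(cursor, encoding="utf-8"):
        """Yield source rows with unicode cells encoded as UTF-8 byte strings."""
        for row in cursor:
            yield tuple(
                cell.encode(encoding) if isinstance(cell, unicode) else cell
                for cell in row
            )

    # inside a copy of the operator's execute(), pass the wrapped rows instead
    # of the raw cursor:
    # dest_pg.insert_rows(table=self.pg_table, rows=encode_rows(cursor))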

Dimension tables

Thanks for your examples - I used your dimension update SQL scripts as a model and it works well.

I was wondering about your thoughts on 'dimension snapshots', as opposed to inserting changes and updating end dates (the way your SQL scripts do):

https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a

I am curious about creating individual dimension partitions per date as snapshots, but I have several dimension tables and would need to go back to at least 2014, which is around 1,500 partitions per table. It seems a bit crazy to think about, but maybe that's just because I am not used to the idea.
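
For readers who have not seen the snapshot pattern from that article: instead of updating end dates in place, every run appends a complete copy of the dimension keyed by the snapshot date. A rough sketch of one daily snapshot task, with table and connection names that are hypothetical rather than taken from this repository:

    # Sketch: write one immutable dimension partition per execution date,
    # instead of SCD-style in-place updates of end dates.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.postgres_operator import PostgresOperator

    dag = DAG(
        dag_id="dim_customer_snapshot",
        start_date=datetime(2014, 1, 1),  # backfilling to 2014 gives the ~1,500 partitions mentioned above
        schedule_interval="@daily",
    )

    snapshot_dim_customer = PostgresOperator(
        task_id="snapshot_dim_customer",
        postgres_conn_id="postgres_dwh",  # hypothetical connection id
        sql="""
            INSERT INTO dwh.dim_customer_snapshot (snapshot_date, customer_id, name, address)
            SELECT DATE '{{ ds }}', customer_id, name, address
            FROM staging.customer;
        """,
        dag=dag,
    )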

Problem running docker-compose

http://take.ms/IWRwA shows what happens when I run docker-compose -f docker-compose-LocalExecutor.yml up --abort-on-container-exit

Relevant logs:

webserver_1  | Initialize database...
webserver_1  | /usr/local/lib/python2.7/dist-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
webserver_1  |   """)
webserver_1  | [2018-03-08 04:28:15,264] {__init__.py:57} INFO - Using executor LocalExecutor
webserver_1  | [2018-03-08 04:28:15,354] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
webserver_1  | [2018-03-08 04:28:15,374] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
webserver_1  | DB: postgresql+psycopg2://airflow:***@postgres:5432/airflow
webserver_1  | [2018-03-08 04:28:15,572] {db.py:287} INFO - Creating tables
webserver_1  | INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
webserver_1  | INFO  [alembic.runtime.migration] Will assume transactional DDL.
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade  -> e3a246e0dc1, current schema
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade e3a246e0dc1 -> 1507a7289a2f, create is_encrypted
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 1507a7289a2f -> 13eb55f81627, maintain history for compatibility with earlier migrations
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 13eb55f81627 -> 338e90f54d61, More logging into task_isntance
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 338e90f54d61 -> 52d714495f0, job_id indices
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 52d714495f0 -> 502898887f84, Adding extra to Log
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 502898887f84 -> 1b38cef5b76e, add dagrun
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 1b38cef5b76e -> 2e541a1dcfed, task_duration
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 2e541a1dcfed -> 40e67319e3a9, dagrun_config
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 40e67319e3a9 -> 561833c1c74b, add password column to user
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 561833c1c74b -> 4446e08588, dagrun start end
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 4446e08588 -> bbc73705a13e, Add notification_sent column to sla_miss
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade bbc73705a13e -> bba5a7cfc896, Add a column to track the encryption state of the 'Extra' field in connection
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade bba5a7cfc896 -> 1968acfc09e3, add is_encrypted column to variable table
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 1968acfc09e3 -> 2e82aab8ef20, rename user table
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 2e82aab8ef20 -> 211e584da130, add TI state index
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 211e584da130 -> 64de9cddf6c9, add task fails journal table
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 64de9cddf6c9 -> f2ca10b85618, add dag_stats table
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade f2ca10b85618 -> 4addfa1236f1, Add fractional seconds to mysql tables
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 4addfa1236f1 -> 8504051e801b, xcom dag task indices
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 8504051e801b -> 5e7d17757c7a, add pid field to TaskInstance
webserver_1  | INFO  [alembic.runtime.migration] Running upgrade 5e7d17757c7a -> 127d2bf2dfa7, Add dag_id/state index on dag_run table
webserver_1  | ERROR [airflow.models.DagBag] Failed to import: /usr/local/airflow/dags/orders_staging.py
webserver_1  | Traceback (most recent call last):
webserver_1  |   File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 263, in process_file
webserver_1  |     m = imp.load_source(mod_name, filepath)
webserver_1  |   File "/usr/local/airflow/dags/orders_staging.py", line 29, in <module>
webserver_1  |     tmpl_search_path = Variable.get("sql_path")
webserver_1  |   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 53, in wrapper
webserver_1  |     result = func(*args, **kwargs)
webserver_1  |   File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 3598, in get
webserver_1  |     raise KeyError('Variable {} does not exist'.format(key))
webserver_1  | KeyError: u'Variable sql_path does not exist'
webserver_1  | ERROR [airflow.models.DagBag] Failed to import: /usr/local/airflow/dags/process_order_fact.py
webserver_1  | Traceback (most recent call last):
webserver_1  |   File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 263, in process_file
webserver_1  |     m = imp.load_source(mod_name, filepath)
webserver_1  |   File "/usr/local/airflow/dags/process_order_fact.py", line 29, in <module>
webserver_1  |     tmpl_search_path = Variable.get("sql_path")
webserver_1  |   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 53, in wrapper
webserver_1  |     result = func(*args, **kwargs)
webserver_1  |   File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 3598, in get
webserver_1  |     raise KeyError('Variable {} does not exist'.format(key))
webserver_1  | KeyError: u'Variable sql_path does not exist'
webserver_1  | ERROR [airflow.models.DagBag] Failed to import: /usr/local/airflow/dags/product_staging.py
webserver_1  | Traceback (most recent call last):
webserver_1  |   File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 263, in process_file
webserver_1  |     m = imp.load_source(mod_name, filepath)
webserver_1  |   File "/usr/local/airflow/dags/product_staging.py", line 29, in <module>
webserver_1  |     tmpl_search_path = Variable.get("sql_path")
webserver_1  |   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 53, in wrapper
webserver_1  |     result = func(*args, **kwargs)
webserver_1  |   File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 3598, in get
webserver_1  |     raise KeyError('Variable {} does not exist'.format(key))
webserver_1  | KeyError: u'Variable sql_path does not exist'
webserver_1  | ERROR [airflow.models.DagBag] Failed to import: /usr/local/airflow/dags/customer_staging.py
webserver_1  | Traceback (most recent call last):
webserver_1  |   File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 263, in process_file
webserver_1  |     m = imp.load_source(mod_name, filepath)

Example Data Vault

Creating network "etl-with-airflow_default" with the default driver
Pulling webserver (puckel/docker-airflow:1.8.0)...
ERROR: manifest for puckel/docker-airflow:1.8.0 not found: manifest unknown: manifest unknown

ImportError: No module named acme.operators.dwh_operators

Hello

I'm running Airflow locally, and I followed the tutorial as far as "drop DAGs into Airflow." I'm now getting errors that look like this:

ERROR - Failed to import: /home/trent/airflow/dags/product_staging.py
Traceback (most recent call last):
File "/home/trent/.local/lib/python2.7/site-packages/airflow/models/dagbag.py", line 236, in process_file
m = imp.load_source(mod_name, filepath)
File "/home/trent/airflow/dags/product_staging.py", line 18, in
from acme.operators.dwh_operators import PostgresToPostgresOperator
ImportError: No module named acme.operators.dwh_operators

This is happening for every DAG. I checked, and $AIRFLOW_HOME/dags/acme/operators/dwh_operators.py exists on my machine. What do I need to do to get it to import correctly?
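
A hedged checklist rather than a definitive fix: Airflow puts the DAGs folder on sys.path when it parses files, so acme should be importable as long as every directory level contains an __init__.py. A small diagnostic sketch, assuming a default ~/airflow install:

    # Sketch: verify the expected layout and that the package resolves.
    #   ~/airflow/dags/
    #       acme/
    #           __init__.py
    #           operators/
    #               __init__.py
    #               dwh_operators.py
    import os
    import sys

    dags_folder = os.path.expanduser("~/airflow/dags")
    if dags_folder not in sys.path:
        sys.path.append(dags_folder)  # Airflow normally does this itself

    from acme.operators.dwh_operators import PostgresToPostgresOperator
    print(PostgresToPostgresOperator)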

Quick setup using docker

It would be a nice feature to have a docker-compose.yml so that the examples could easily be run with docker-compose up.
I will try to contribute that whenever I get back to analyzing the examples; so far I have only taken a brief look to get an overall idea of Airflow.

How to populate the tables in dwh schema?

Thanks for putting this tutorial together. I have the dockerized webserver up and running, with all DAGs able to run successfully, etc. But I'm having trouble understanding how to populate the dim_* and fact_orderline tables in the dwh schema. Even after running all of the DAGs successfully, there doesn't seem to be any data in these tables.

My tables in the 'orders' database seem to be populated correctly from all of your helpful insert statements. But am I still missing something here?

Project to define python modules in packages

Hi @gtoonstra,

I found your repository through Discover and think the approach of using Airflow for ETL is cool. Lately I've been using it for this purpose and have tried to implement a lot through PythonOperator. Over time I organized the code so that the processing logic lives in separate modules that can be installed via setuptools. Below is a link to a cookiecutter template I created. I'd appreciate any feedback you can give.

https://github.com/gilsondev/cookiecutter-airflow
