mara / mara-pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

License: MIT License

Python 84.12% CSS 1.01% JavaScript 13.13% PLpgSQL 1.41% Makefile 0.33%
etl data-integration python postgresql pipeline data


mara-pipelines's Issues

Passwords included in log output

When using the ExecuteSQL command with a PostgreSQL database, passwords are leaked to the logs:

[screenshot: password visible in the log output]

It would be better to pass sensitive values via Popen's env option.

(And if you also passed the query in via the process' stdin, you wouldn't have to worry about quoting with shlex.quote.)
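A minimal sketch of that approach (connection parameters are made up, this is not the actual mara-pipelines code): the password goes into the child process' environment and the query is fed via stdin.

import os
import subprocess

# Sketch only: PGPASSWORD keeps the credential out of the command line,
# and feeding the query via stdin avoids any shlex.quote handling.
query = 'SELECT 1;'
env = {**os.environ, 'PGPASSWORD': 'secret'}   # hypothetical credential

process = subprocess.Popen(
    ['psql', '--host', 'localhost', '--username', 'etl', '--dbname', 'etl',
     '--no-psqlrc', '--set=ON_ERROR_STOP=on'],
    stdin=subprocess.PIPE, env=env)
process.communicate(query.encode())
assert process.returncode == 0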

Dynamic Task/ParallelTask/Pipeline

Currently the data pipeline DAG is fixed at definition time and supports only limited dynamics, e.g. the task ParallelReadFile supports reading files whose number is unknown at definition time.

I would like to have similar dynamics in other areas as well:

Dynamic nodes

The following dynamic nodes could be implemented:

Dynamic tasks

An option to give a Task a python function which is executed at pipeline runtime and returns a list of commands to execute in order.

Dynamic parallel tasks

An option to give a ParallelTask a python function which is executed at pipeline runtime and returns a list of commands / command chains to be executed in parallel.

Dynamic pipeline

An option to define a DynamicPipeline whose nodes are defined within a python function which is executed at pipeline runtime.
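For illustration, the dynamic task variant could roughly look like this (DynamicTask and command_factory are hypothetical names, not existing mara-pipelines API; the sketch assumes the Node base class takes id and description):

from mara_pipelines import pipelines
from mara_pipelines.commands.bash import RunBash

class DynamicTask(pipelines.Node):
    """A task whose commands are produced by a function at pipeline runtime (hypothetical)"""
    def __init__(self, id: str, description: str, command_factory):
        super().__init__(id=id, description=description)
        self.command_factory = command_factory   # only called during execution, not in the UI

def build_commands():
    """Runs at pipeline runtime and returns the commands to execute in order"""
    tables = ['customers', 'orders']   # e.g. discovered by querying the database
    return [RunBash(f'echo "export {table}"') for table in tables]

task = DynamicTask(id='export_tables',
                   description='Exports all tables found at runtime',
                   command_factory=build_commands)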

Implement UI awareness

The dynamic node objects (Task/ParallelTask/Pipeline) must be defined in such a way that the python function which defines the actual commands/tasks/nodes is not run when interacting with the UI.

Implement node cost handling

These dynamic nodes should define sub-nodes for the dynamic node object. The pipeline execution should then intelligently retrieve the node cost from the database when the node has been executed in the past. E.g. a dynamic node could represent an export of a database table. By defining the sub-nodes, the pipeline execution can run the nodes with the highest node cost first to reduce overall execution time.

Example use cases

  • performing actions against tables in a database (e.g. exporting a table to a data lake). We don't know at definition time which tables exist in the database
  • performing actions against a data lake / lakehouse per table on disk (e.g. connecting the table to our database engine). We don't know at definition time which tables exist in the data lake / lakehouse.

Q: feasibility of substituting SQLite for PostgreSQL?

More of a question than an issue.. but maybe it could turn into a PR :)

Is it easy/hard/impossible/a good idea/a bad idea to substitute SQLite for Postgres? This looks really cool and promising, but I'm interested in reducing external (or local -- this is going to be dockerized) dependencies.

Thanks in advance!

Pass in the filename to the mapper script as an argument

It would be nice in certain debugging scenarios to actually know the filename (or parts of that filename) in the mapper scripts (e.g. to write it to the row in the final table).

It would basically mean adding `-- "{self.file_name}"` to the mapper script part of the pipe:

https://github.com/mara/mara-pipelines/blob/master/mara_pipelines/commands/files.py#L65-L71

                f'{uncompressor(self.compression)} "{pathlib.Path(config.data_dir()) / self.file_name}" \\\n' \
                + (f'  | {shlex.quote(sys.executable)} "{self.mapper_file_path()}" -- "{self.file_name}" \\\n' # <- changed

As far as I understand, since current mappers do not get any args, none should fail if they get one now... @hz-lschick @martin-loetzsch ?
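On the mapper side, the filename could then be picked up roughly like this (sketch, assuming the -- "{self.file_name}" change above):

# mapper.py -- sketch of a mapper that optionally receives the source file name
import sys

# drop the `--` separator; the file name is only present once the change is applied
args = [arg for arg in sys.argv[1:] if arg != '--']
file_name = args[0] if args else ''

for line in sys.stdin:
    # append the source file name as an extra column
    sys.stdout.write(line.rstrip('\n') + '\t' + file_name + '\n')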

Make PyPI package graphviz optional

Currently the PyPI package graphviz is required. I tested running a mara instance on Google App Engine, which worked fine except that you are not able to install additional Linux packages (unless you spin up a docker image, which costs extra).

Would be interesting to investigate making graphviz optional by using the JS engine d3-graphviz when python package graphviz is not installed.

In addition, when running mara_pipelines headless there is no reason for graphviz, so IMHO this package should be defined as an extra requirement.
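A sketch of how that could look in setup.py (the requirement lists are illustrative, not mara-pipelines' actual setup.py):

# setup.py -- sketch of moving graphviz into an optional extra
from setuptools import setup, find_packages

setup(
    name='mara-pipelines',
    packages=find_packages(),
    install_requires=[
        'mara-db',     # illustrative
        'mara-page',   # illustrative
    ],
    extras_require={
        # install with `pip install mara-pipelines[graphviz]`
        'graphviz': ['graphviz'],
    },
)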

Issues with _ParallelRead / Redesign adding optional Worker nodes

User story

I came across an interesting problem while reading multiple files via a subclass of _ParallelRead: with the current design, _ParallelRead collects the full list of files to be processed in RAM, decides which tasks to do first and only then starts the parallel reading.

I have a folder with over 1.3 million files on a cloud storage which needs to be processed. It takes ages to get to the point where mara starts reading (and when you hit an invalid file, everything starts over again...).

In addition, it looks to me like the scheduling is inefficient when loading a lot of files of different sizes (= different processing times).

Here is what I came up with

I redesigned the base class ParallelTask to support using generic Worker nodes instead of tasks. This is an optional mode which needs to be activated with self.use_workers = True.
The Worker nodes get their commands during runtime of the pipeline (in contrast to Task nodes, which require their commands to be defined upfront).

An additional function feed_workers in class ParallelTask can be overloaded. This method is run in a separate process during pipeline execution and yields the commands which are then passed over to the workers. You can either yield a single command or a command list. If you yield a command list, the whole list is passed to a single worker. (This is necessary because in some cases you want to execute several commands in order for a single file.)

Since the workers now get their files / commands passed at runtime from a message queue, I expect this logic to work better when many files need to be processed.

This new design does not work for all _ParallelRead execution options, so it is only used when possible.

PR: #74

Some points to note

  1. When the feed_workers function throws an exception, all commands already in the queue will still be processed by the worker tasks. There is no mechanism yet to inform the worker nodes that they should stop their work; they stop once all open commands in the queue have been picked up.
  2. When a worker node fails, the other workers and the feed worker process will continue their work until the queue is empty.

This implementation only works with ReadMode.ALL.
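A sketch of how an overloaded feed_workers could look, based on the description above (the exact API in the PR may differ):

import pathlib
from mara_pipelines import pipelines
from mara_pipelines.commands.bash import RunBash

class ReadManyFiles(pipelines.ParallelTask):
    """Sketch of a parallel task that feeds its workers at runtime"""
    def __init__(self, id: str, description: str, directory: str):
        super().__init__(id=id, description=description)
        self.directory = directory
        self.use_workers = True   # activate the optional worker mode

    def feed_workers(self):
        """Runs in a separate process and yields commands for the workers"""
        for path in sorted(pathlib.Path(self.directory).iterdir()):
            # yielding a list keeps these commands together on one worker
            yield [RunBash(f'echo "validate {path}"'),
                   RunBash(f'echo "read {path}"')]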

Duplication of pk in table data_integration_system_statistics

Hi,
I'm now using mara to create a pipeline and use multiprocessing to run the workflow in parallel.
In my situation, several pipelines may be running at the same time; however, this leads to duplicate timestamp keys being created.
Would it be possible to use another column (like node_output_id or something else) as the primary key? Or maybe use a global lock to avoid creating data at the same time.

EDIT: For now, I added another column named 'index' as the primary key and index column to fix this problem.

Thanks!
Sincerely,
Tony

Support different execution contexts

Currently mara pipelines are always executed locally. But I would like to have an option to sometimes execute them somewhere else, e.g. in another environment where the required resources are more readily available.

The idea

So I came up with the idea about execution contexts. Here is the rough idea:

  1. one can define an execution context for a pipeline or for a specific task
  2. the execution context then defines where the shell command shall be executed
  3. it should be possible to define multiple execution contexts within one pipeline
  4. an execution context has an "enter" / "exit" method which gives the option to spin up or release the resources required for the execution context

The current idea is to support the following execution contexts:

  1. BashExecutionContext - local bash (this is the current default behavior)
  2. SshBashExecutionContext - remote bash execution via ssh
  3. DockerExecutionContext - docker exec with optional start/stop of a container

Possible other options (Out of scope)

This concept could be extended in the future to add other options like:

  • executing a job on a remote server, spinning up predefined cloud resources before and releasing them afterwards
  • executing a job in Google Cloud Run
  • executing a pipeline in another pipeline engine e.g. Airflow

These ideas are just noted here and are out of scope for this issue.

Blueprint for the ExecutionContext base class

class ExecutionContext:
    """The execution context for a shell command"""

    def __init__(self):
        self.is_active: bool = False

    def __enter__(self):
        """Enters the execution context."""
        self.is_active = True
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        """Exits the execution context, freeing up used resources."""
        self.is_active = False
        return False  # do not suppress exceptions

    def run_shell_command(self, shell_command: str) -> bool:
        """Executes a shell command in the context"""
        raise NotImplementedError()
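Building on that blueprint, an SshBashExecutionContext could look roughly like this (hypothetical sketch, not an existing class):

import shlex
import subprocess

class SshBashExecutionContext(ExecutionContext):
    """Runs shell commands in a bash on a remote host via ssh"""

    def __init__(self, host: str, user: str):
        super().__init__()
        self.host = host
        self.user = user

    def run_shell_command(self, shell_command: str) -> bool:
        # run the command through bash on the remote host
        completed = subprocess.run(
            ['ssh', f'{self.user}@{self.host}',
             'bash -c ' + shlex.quote(shell_command)])
        return completed.returncode == 0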

Rename `data_integration_*` tables

With version 3 the package data_integration has been renamed to mara-pipelines, but the internal tables are still named data_integration_*.

I suggest renaming the tables to pipeline_* in the next major version. For example, data_integration_run would then become pipeline_run. The data could be moved to the new tables during db migration (needs to be implemented). For people still needing the old tables, one could create SQL views (CREATE VIEW data_integration_run AS SELECT * FROM pipeline_run) so that the tables remain accessible under their old names.

PostgreSQLDB(..) constructor doesn't handle port

Passing a port (as an int or a string) isn't honored; connections always go to 5432 no matter what you pass to the constructor.

mara_db.config.databases \
    = lambda: {'mara': mara_db.dbs.PostgreSQLDB(host='localhost', port=25432, user='etl', password='etl', database='etl')}

Pipeline.add upstreams parameter default should not be a list

https://github.com/mara/data-integration/blob/master/data_integration/pipelines.py#L162
def add(self, node: Node, upstreams: [typing.Union[str, Node]] = []) -> 'Pipeline':

This code creates a single list when the function is defined and sets it as the default for the upstreams parameter. All calls to this function will share that list. Here's a short illustrative example:

>>> def append(value, lst=[]):
...     lst.append(value)
...     return lst
...
>>> append(1)
[1]
>>> append(2)
[1, 2]

You probably want to set the default parameter to None and then initialize a new list inside the method if upstreams is None; otherwise all pipelines which don't set upstreams when this method is called will share the same upstreams list.
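A sketch of the suggested fix (the real Pipeline.add contains more logic than shown here):

import typing

class Node:
    pass

class Pipeline(Node):
    def __init__(self):
        self.upstreams_per_node: dict = {}

    def add(self, node: Node,
            upstreams: typing.Optional[typing.List[typing.Union[str, Node]]] = None) -> 'Pipeline':
        # None as default + a fresh list per call avoids the shared mutable default
        upstreams = list(upstreams) if upstreams is not None else []
        self.upstreams_per_node[node] = upstreams
        return self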

Tables do not exist: data_integration_node_run

I am getting the error:

packages/mara_pipelines/logging/node_cost.py", line 41, in node_durations_and_run_times
    GROUP BY node_path;""", {'path': node.path(), 'level': len(node.path())})
psycopg2.errors.UndefinedTable: relation "data_integration_node_run" does not exist
LINE 10:         FROM data_integration_node_run node
                      ^

Running the config code, I only see the output below. No tables were created:

Created database "postgresql+psycopg2://root@localhost/example_etl_mara

Discussion | Supporting different mara db engines than PostgreSQL

This issue holds a list of tasks to be done before db engines other than PostgreSQL can be used as the mara db alias.

  • replace cursor context calls with SQLAlchemy Query API
  • find alternative ways to store the node path. Currently it is stored as a string array. Potentially it could be stored in a single comma-separated string pipeline,my_parallel_task,before ...
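For illustration, a raw cursor-context query like the one in node_cost.py could become an SQLAlchemy call roughly like this (sketch, assuming SQLAlchemy 1.4+; the engine URL is made up and the table is reflected from the mara db):

import sqlalchemy as sa

engine = sa.create_engine('postgresql+psycopg2://root@localhost/example_etl_mara')
metadata = sa.MetaData()
node_run = sa.Table('data_integration_node_run', metadata, autoload_with=engine)

with engine.connect() as connection:
    # count runs per node path, independent of the underlying db engine
    result = connection.execute(
        sa.select(node_run.c.node_path, sa.func.count())
          .group_by(node_run.c.node_path))
    for row in result:
        print(row)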

Multiple root pipelines or pipelines collections

By design, there is one root pipeline to which all other pipelines are added. This might make sense when you just have one pipeline which does one job, but I have several pipelines which I don't want to execute together.
E.g. a pipeline for a daily refresh with several incremental refreshes, a pipeline to execute a complete full load, a pipeline running at a specific time to refresh specific data areas on demand, etc.

I came up with the following ideas for how this could be solved:

  • Multiple root pipelines --> when more than one is added, you have to specify the root pipeline name as an additional parameter in flask mara_pipelines.ui.run --pipeline ...
  • add a PipelineCollection class. This class would hold a collection of pipelines. You can't run a PipelineCollection, but you can run its sub-pipelines. This class could then be set as the root pipeline. A pipeline would then be called via flask mara_pipelines.ui.run --path <pipeline_name>.<path within the pipeline>

Does anyone have other ideas? Is there maybe a common way to solve this that I am not aware of?

Optimistic pipeline execution behavior

When a node in a pipeline fails, the whole pipeline fails. It would be great to have a more optimistic execution: when a node fails, just skip its downstream nodes instead of all open nodes, and then mark the pipeline as “failed”. This would match the execution logic of dbt and give the data engineer the option to just fix the failed nodes after the failure.
Currently I usually have to restart the whole pipeline even when only some small tasks at the start, which are not connected to other tasks, fail.

Run pipeline in UI stops when closing website

This is probably a known issue to many users of mara, but I want to mention it here because I think it should not be this way:

When running a pipeline via the UI using run or run with upstreams it runs in a flask UI thread. This creates several issues:

  • when you close the web page, the pipeline execution stops
  • when the network connection breaks for just a short period of time (e.g. you disconnect your laptop from a docking station) it stops as well

In addition, when you run a pipeline via the server e.g. from a scheduled CI script or a cron job, you are not able to "connect" to the execution to see what is currently running. The only option to see what is going on is to open the node page and refresh it manually.

Better monitoring of pipeline executions [EPIC]

User story

Currently the only way to see in the UI whether a pipeline ran through is to click on the pipeline and check its last runs. There you can see whether the pipeline is still running (unfinished), succeeded or failed. When you have a bunch of pipelines (e.g. in the root pipeline) which run on different time schedules, you have to click through all of them to know if everything worked out fine.

This shall be improved

Ideas how to solve this

  • add a dashboard showing the last runs of all pipelines
  • add a dashboard showing the currently running pipelines (based on table data_integration_run)
  • add a status icon behind a pipeline in the dependency graph and/or the child nodes page to show the status of the last run. Here we could use simple status icons like those used in GitHub Actions.

Redesign system statistics

Mara collects system statistics (e.g. CPU, disk, memory) when a pipeline is executed.

This is a collection ticket of things which should be redesigned in this logic.

  • add option to disable the collection of system statistic information. This should disable the cards in the web UI as well.
  • redesign system statistics, remove it from the standard EventHandler
  • add extras_require statistics with package psutil, remove psutil from install_requires (default package requirements)

Postgres Authentication issue while running the pipeline

Hi

I am trying Mara for the very first time. Pretty good. From the demo, I am running the code below, but I'm getting this error:

 Traceback (most recent call last):
  File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/site-packages/mara_pipelines/execution.py", line 67, in run
    node_durations_and_run_times = node_cost.node_durations_and_run_times(pipeline)
  File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/site-packages/mara_pipelines/logging/node_cost.py", line 22, in node_durations_and_run_times
    with mara_db.postgresql.postgres_cursor_context('mara') as cursor:
  File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/site-packages/mara_db/postgresql.py", line 19, in postgres_cursor_context
    host=db.host, port=db.port)  # type: psycopg2.extensions.connection
  File "/x/lib/python3.7/site-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: fe_sendauth: no password supplied

from mara_pipelines.commands.bash import RunBash
from mara_pipelines.pipelines import Pipeline, Task
from mara_pipelines.ui.cli import run_pipeline, run_interactively
from mara_pipelines.ui.cli import run_pipeline

pipeline = Pipeline(
    id='demo',
    description='A small pipeline that demonstrates the interplay between pipelines, tasks and commands')

pipeline.add(Task(id='ping_localhost', description='Pings localhost',
                  commands=[RunBash('ping -c 3 localhost')]))

sub_pipeline = Pipeline(id='sub_pipeline', description='Pings a number of hosts')

for host in ['google', 'amazon', 'facebook']:
    sub_pipeline.add(Task(id=f'ping_{host}', description=f'Pings {host}',
                          commands=[RunBash(f'ping -c 3 {host}.com')]))

sub_pipeline.add_dependency('ping_amazon', 'ping_facebook')
sub_pipeline.add(Task(id='ping_foo', description='Pings foo',
                      commands=[RunBash('ping foo')]), ['ping_amazon'])

pipeline.add(sub_pipeline, ['ping_localhost'])

pipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',
                  commands=[RunBash('sleep 2')]), ['sub_pipeline'])


if __name__ == '__main__':
    run_pipeline(pipeline)

I had created a separate .py file with the code to set up the db:

import mara_db.auto_migration
import mara_db.config
import mara_db.dbs

mara_db.config.databases \
    = lambda: {'mara': mara_db.dbs.PostgreSQLDB(
    host='localhost',
    user='postgres',
    password='postgres',
    database='example_etl_mara')}

mara_db.auto_migration.auto_discover_models_and_migrate()

It worked well and created the db. Now it is not clear to me whether I have to run this code when running the pipeline too?

Thanks

Adding support for Mara Storage

The implementation of the storage module from @ice1e0 has been merged into the master branch. I see this as a breaking change and therefore suggest publishing it in the next major version 4.

In addition, I think some additional commands should be added for working with files. Here are some examples:

  • WriteFile - Write an SQL Output to a file on a blob storage
  • CopyFile - copies a file from one location to another
  • RemoveFile - removes a file from a blob storage

optional additional commands:

  • ValidateFile - checks if a file is of a specific format and validates it against a schema file (XSD or JSON Schema)
  • ConvertFile - converts a file from one file format to another, e.g. xml to json
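To give an idea of the shape such commands could take, here is a sketch of a possible CopyFile command (hypothetical, not yet part of mara-pipelines; it assumes the usual Command.shell_command pattern and ignores the mara-storage configuration):

from mara_pipelines import pipelines

class CopyFile(pipelines.Command):
    """Copies a file from one storage location to another (sketch)"""

    def __init__(self, source_storage: str, source_file: str,
                 target_storage: str, target_file: str):
        super().__init__()
        self.source_storage = source_storage
        self.source_file = source_file
        self.target_storage = target_storage
        self.target_file = target_file

    def shell_command(self):
        # a real implementation would build the command from the configured
        # storages (local path, S3 bucket, ...) instead of a plain `cp`
        return f'cp "{self.source_file}" "{self.target_file}"'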

Implement a "continue on error" behavior for pipelines

Currently the pipeline executor stops once a task fails (after waiting for all currently running tasks to finish).

Sometimes it is desirable to continue execution of a pipeline upon failure (e.g. for consistency checks or frontend updates).

Such a feature could be implemented easily.

Reverting a bad create_xxx_data_table with file dependencies leads to inconsistent state

We just ran into this scenario: an incremental load job with create_xxx_data_table.sql + read_xxx.sql, where create_xxx_data_table.sql depends on the schema + both involved sql files.

The problem happened on a bad merge which made create_xxx_data_table.sql error out after the DROP and before the CREATE TABLE had run successfully. This merge got reverted, so the file afterwards had the same checksum as before the merge.

The problem was that the table got DROPed, but the checksum logic thought the file hadn't changed and therefore did not rerun it.

IMO the logic should be changed so that the old checksum is removed from the cache if the process resulted in an error.

I can send a PR if this is the right approach (or any other which might be better).

Rerun failed pipelines function

Add a re-run function to run all failed and skipped nodes from a failed pipeline run.

This should be put into a UI interface, e.g. next to an overview of the last pipeline executions:

[screenshot: overview of the last pipeline executions]

passing/overriding config via `flask ...run <param>`

I would like an option in the CLI command flask mara_pipelines.ui.run to override config values.

Example:

I have an app.config.py file:

from enum import Enum

class ProcessingMode(Enum):
    FULL = 'full'
    INCREMENTAL = 'incremental'

def processing_mode() -> ProcessingMode:
    """The processing mode to be used"""
    return ProcessingMode.INCREMENTAL

def default_window_in_days() -> int:
    """The default refresh window for models when running a incremental sync."""
    return 90

Now I would like to override the config for a single pipeline execution.

Sample design:

# sets the processing_mode to FULL
flask mara_pipelines.ui.run --patch-config app.config.processing_mode=FULL

# sets the default_window_in_days to 180 days
flask mara_pipelines.ui.run --patch-config app.config.default_window_in_days=180

The parameter should maybe not be called --patch-config <str>. Terraform, for example, uses -var=<...> / -var-file=<...>.

Install from PyPI missing statics, run_time_chart.sql

It appears that when I pip install mara-pipelines from PyPI (instead of from GitHub as is done in the example project), the static files don't get installed and neither does mara_pipelines/ui/run_time_chart.sql, which causes a bunch of 404s and a 500 when trying to access the flask page:

➜  app git:(master) ✗ flask run
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET / HTTP/1.1" 302 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/common.css HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/node-page.css HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/timeline-chart.css HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/node-page.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/utils.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/run-time-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/system-stats-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/kolorwheel.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/timeline-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/node-page.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/utils.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/run-time-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/system-stats-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/timeline-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/kolorwheel.js HTTP/1.1" 404 -
[2020-10-13 13:41:59,141] ERROR in app: Exception on /pipelines/run-time-chart [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python3.7/site-packages/mara_page/acl.py", line 108, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/mara_pipelines/ui/run_time_chart.py", line 30, in run_time_chart
    query = (pathlib.Path(__file__).parent / 'run_time_chart.sql').read_text()
  File "/usr/local/lib/python3.7/pathlib.py", line 1221, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
  File "/usr/local/lib/python3.7/pathlib.py", line 1208, in open
    opener=self._opener)
  File "/usr/local/lib/python3.7/pathlib.py", line 1063, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/site-packages/mara_pipelines/ui/run_time_chart.sql'
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/run-time-chart HTTP/1.1" 500 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/pipeline-children-table HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/dependency-graph HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/system-stats HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/timeline-chart HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/last-runs-selector HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/run-output-limited HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /mara-app/navigation-bar HTTP/1.1" 200 -
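The usual fix for this kind of problem is to declare the non-Python files as package data, roughly like this (sketch of a setuptools configuration; the file patterns are assumptions derived from the 404s above):

# setup.py -- sketch: ship the static files and run_time_chart.sql with the package
from setuptools import setup, find_packages

setup(
    name='mara-pipelines',
    packages=find_packages(),
    include_package_data=True,
    package_data={
        'mara_pipelines.ui': ['static/*', 'run_time_chart.sql'],
    },
)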

High memory issue with RunLogger

The RunLogger class caches Output events in a local variable (node_output). There are two issues with the current implementation which can cause high memory usage during pipeline execution:

  • when you have a long-running task / parallel task, a long output chain might be cached. The cached output should be written to the db after e.g. 500 entries.
  • the node_output dict is not freed after writing the output to the database. This results in the whole pipeline output being cached in RAM until the RunLogger instance is disposed (!!)

Handle SIGTERM

K8s sends a SIGTERM followed by a waiting period and SIGKILL. The pipeline runner currently only handles SIGINT (ctrl-c, via the exception python raises for this key combo which triggers the generic except: block in the main loop).

Python by default seems to react to SIGTERM by terminating. This results in situations where the pipeline run is abruptly killed, which leaves open runs around.

How to reproduce:

  • Run a pipeline
  • Kill all processes belonging to the run with SIGTERM (kill <list of pids> or close the docker container)

-> the runs (data_integration_run) and node_runs (data_integration_node_run) are left open (end_time is NULL).

We use the runs to not start a second run and the node_runs to display prometheus metrics per main pipeline (last run, currently running,...). Leaving them open means someone has to close them manually :-(
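One possible approach (sketch, not the actual implementation) would be to translate SIGTERM into the already-handled KeyboardInterrupt path so that the existing cleanup closes open runs:

import signal

def _handle_sigterm(signum, frame):
    # reuse the existing Ctrl-C cleanup logic in the run loop
    raise KeyboardInterrupt()

signal.signal(signal.SIGTERM, _handle_sigterm)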

parallel_tasks\files.py ParallelReadFile copies data into master table instead of child tables

ReadFile.shell_command is called by ParallelReadFile.read_command with the master table as the target_table parameter.
Postgres does not automatically redirect insert/copy from the master table to the child tables.

In ReadFile the correct child table name is written to a local variable,
l.140: target_table = self.target_table + '_' + day.strftime("%Y%m%d")
but this variable is not passed to the parallel_task function; it is only used to analyze the (empty) partition.

Possible solutions could be:

  • use Postgres Partitioning introduced with Version 10 instead of inheritance
  • Pass local target_table to parallel_commands

limit node cost calc.

The node cost calculation here just takes the whole history into account, which I think is not a good idea. There should be a limit, e.g. to the last 60 (successful) executions.

The perfect solution would probably be an exponential smoothing algorithm weighting older execution times less than newer ones (but I guess that would be a bit oversized).
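For reference, an exponentially weighted average over past run durations could look like this (sketch; alpha and the example values are made up):

def smoothed_cost(durations_oldest_first, alpha=0.3):
    """durations_oldest_first: run durations in seconds, oldest run first"""
    estimate = None
    for duration in durations_oldest_first:
        # newer durations get weight alpha, the running estimate keeps (1 - alpha)
        estimate = duration if estimate is None else alpha * duration + (1 - alpha) * estimate
    return estimate

print(smoothed_cost([90, 300, 100, 120]))   # weighted towards the most recent runs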

add HTTP Request command

I plan to add a new HttpRequestCommand which shall run an HTTP/HTTPS request via curl.

It should keep the output clean (using option -s) and fail when the request returns an error code (option -f).

Things to implement

  • support different request types (arg -X), using GET requests by default
  • support passing headers (option -H)
  • implement a command which generates the shell command string with an option to pass data from stdin (option --data-binary @-)

Possible use cases

  • calling an API function with an access token
  • running a serverless cloud function (AWS Lambda / Google Cloud Function / Azure Function command) via HTTP request

How to implement

  • add a new function http_request_command(...) in mara_pipelines.shell
  • add a new command HttpRequestCommand in mara_pipelines.commands.http
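A sketch of what http_request_command could generate (the final signature may well differ):

import shlex
from typing import Dict, Optional

def http_request_command(url: str, headers: Optional[Dict[str, str]] = None,
                         method: str = 'GET', body_from_stdin: bool = False) -> str:
    """Builds a curl command string: quiet (-s), failing on HTTP error codes (-f)"""
    command = ['curl', '-s', '-f', '-X', method]
    for name, value in (headers or {}).items():
        command += ['-H', shlex.quote(f'{name}: {value}')]
    if body_from_stdin:
        command += ['--data-binary', '@-']
    command.append(shlex.quote(url))
    return ' '.join(command)

print(http_request_command('https://example.com/api',
                           headers={'Authorization': 'Bearer TOKEN'}))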

ValueError: cannot find context for 'fork'

I assume this example code should run on my system without error:

from data_integration.commands.bash import RunBash
from data_integration.pipelines import Pipeline, Task
from data_integration.ui.cli import run_pipeline, run_interactively

pipeline = Pipeline(
    id='demo',
    description='A small pipeline that demonstrates the interplay between pipelines, tasks and commands')

pipeline.add(Task(id='ping_localhost', description='Pings localhost',
                  commands=[RunBash('ping -c 3 localhost')]))

sub_pipeline = Pipeline(id='sub_pipeline', description='Pings a number of hosts')

for host in ['google', 'amazon', 'facebook']:
    sub_pipeline.add(Task(id=f'ping_{host}', description=f'Pings {host}',
                          commands=[RunBash(f'ping -c 3 {host}.com')]))

sub_pipeline.add_dependency('ping_amazon', 'ping_facebook')
sub_pipeline.add(Task(id='ping_foo', description='Pings foo',
                      commands=[RunBash('ping foo')]), ['ping_amazon'])

pipeline.add(sub_pipeline, ['ping_localhost'])

pipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',
                  commands=[RunBash('sleep 2')]), ['sub_pipeline'])


run_pipeline(pipeline)

Here's the output of the script:

$ python historical.py
Traceback (most recent call last):
  File "C:\Users\david\Anaconda3\lib\multiprocessing\context.py", line 190, in get_context
    ctx = _concrete_contexts[method]
KeyError: 'fork'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "historical.py", line 28, in <module>
    run_pipeline(pipeline)
  File "C:\Users\david\Anaconda3\lib\site-packages\data_integration\ui\cli.py", line 46, in run_pipeline
    for event in execution.run_pipeline(pipeline, nodes, with_upstreams, interactively_started=interactively_started):
  File "C:\Users\david\Anaconda3\lib\site-packages\data_integration\execution.py", line 48, in run_pipeline
    multiprocessing_context = multiprocessing.get_context('fork')
  File "C:\Users\david\Anaconda3\lib\multiprocessing\context.py", line 238, in get_context
    return super().get_context(method)
  File "C:\Users\david\Anaconda3\lib\multiprocessing\context.py", line 192, in get_context
    raise ValueError('cannot find context for %r' % method)
ValueError: cannot find context for 'fork'

If it matters, I've also provisioned a PostgreSQL instance for mara:

import mara_db.auto_migration
import mara_db.config
import mara_db.dbs

mara_db.config.databases \
    = lambda: {'mara': mara_db.dbs.PostgreSQLDB(host='localhost', user='postgres', password = '', database='etl_mara')}

mara_db.auto_migration.auto_discover_models_and_migrate()

Don't show stack traces on Ctrl+c

For a few releases now, run_pipeline has printed various stack traces for KeyboardInterrupt when hitting Ctrl+C:

 ★ 0s
utils: ★ 4.9s
utils / initialize_utils: ★ 4.7s
...
^CProcess system_statistics:
Traceback (most recent call last):
  File "/usr/local/Cellar/[email protected]/3.8.1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/Cellar/[email protected]/3.8.1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/execution.py", line 61, in <lambda>
    target=lambda: system_statistics.generate_system_statistics(event_queue), name='system_statistics')
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/logging/system_statistics.py", line 86, in generate_system_statistics
    time.sleep(period)
KeyboardInterrupt
 Traceback (most recent call last):
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/execution.py", line 348, in run_pipeline
    _notify_all(event)
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/execution.py", line 340, in _notify_all
    raise e
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/execution.py", line 333, in _notify_all
    runlogger.handle_event(event)
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/logging/run_log.py", line 124, in handle_event
    with mara_db.postgresql.postgres_cursor_context('mara') as cursor:  # type: psycopg2.extensions.cursor
  File "/usr/local/Cellar/[email protected]/3.8.1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/mara-db/mara_db/postgresql.py", line 18, in postgres_cursor_context
    connection = psycopg2.connect(dbname=db.database, user=db.user, password=db.password,
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/.venv/lib/python3.8/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
KeyboardInterrupt
Cleaned up open runs/node_runs (run_id = 35)


While this is not a real problem, it would be nice if we could print something less heavy, e.g. `KeyboardInterrupt`.
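For illustration, the top level of the run loop could catch the interrupt and print a short note instead (sketch; run() is a stand-in, not the real pipeline run loop):

import sys
import time

def run():
    """stand-in for the pipeline run loop"""
    while True:
        time.sleep(1)

if __name__ == '__main__':
    try:
        run()
    except KeyboardInterrupt:
        # print a short note instead of the full stack trace
        print('KeyboardInterrupt', file=sys.stderr)
        sys.exit(130)   # conventional exit status for SIGINT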
