mara / mara-pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

License: MIT License

Python 84.12% CSS 1.01% JavaScript 13.13% PLpgSQL 1.41% Makefile 0.33%
etl data-integration python postgresql pipeline data


mara-pipelines's Issues

Passwords included in log output

When using the ExecuteSQL command with a PostgreSQL database, passwords are leaked to the logs:

[screenshot: password visible in the log output]

It would be better to pass sensitive values via Popen's env option.

(And if you also passed the query in via the process' stdin, you wouldn't have to worry about quoting with shlex.quote.)
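A minimal sketch of that approach (connection parameters are made up, this is not the actual mara-pipelines code): the password goes into the child process' environment and the query is fed via stdin.

import os
import subprocess

# Sketch only: PGPASSWORD keeps the credential out of the command line,
# and feeding the query via stdin avoids any shlex.quote handling.
query = 'SELECT 1;'
env = {**os.environ, 'PGPASSWORD': 'secret'}   # hypothetical credential

process = subprocess.Popen(
    ['psql', '--host', 'localhost', '--username', 'etl', '--dbname', 'etl',
     '--no-psqlrc', '--set=ON_ERROR_STOP=on'],
    stdin=subprocess.PIPE, env=env)
process.communicate(query.encode())
assert process.returncode == 0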

Dynamic Task/ParallelTask/Pipeline

Currently the data pipeline DAG is fixed at definition time and supports only limited dynamics, e.g. the task ParallelReadFile supports reading files whose number is unknown at definition time.

I would like to have similar dynamics in other areas as well:

Dynamic nodes

The following dynamic nodes could be implemented:

Dynamic tasks

An option to give a Task a python function which is executed at pipeline runtime and returns a list of commands to execute in order.

Dynamic parallel tasks

An option to give a ParallelTask a python function which is executed at pipeline runtime and returns a list of commands / command chains to be executed in parallel.

Dynamic pipeline

An option to define a DynamicPipeline whose nodes are defined within a python function which is executed at pipeline runtime.
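For illustration, the dynamic task variant could roughly look like this (DynamicTask and command_factory are hypothetical names, not existing mara-pipelines API; the sketch assumes the Node base class takes id and description):

from mara_pipelines import pipelines
from mara_pipelines.commands.bash import RunBash

class DynamicTask(pipelines.Node):
    """A task whose commands are produced by a function at pipeline runtime (hypothetical)"""
    def __init__(self, id: str, description: str, command_factory):
        super().__init__(id=id, description=description)
        self.command_factory = command_factory   # only called during execution, not in the UI

def build_commands():
    """Runs at pipeline runtime and returns the commands to execute in order"""
    tables = ['customers', 'orders']   # e.g. discovered by querying the database
    return [RunBash(f'echo "export {table}"') for table in tables]

task = DynamicTask(id='export_tables',
                   description='Exports all tables found at runtime',
                   command_factory=build_commands)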

Implement UI awareness

The dynamic node objects (Task/ParallelTask/Pipeline) must be defined in such a way that the python function which defines the actual commands/tasks/nodes is not run when interacting with the UI.

Implement node cost handling

These dynamic nodes should define sub-nodes for the dynamic node object. The pipeline execution should then intelligently retrieve the node cost from the database when the node has been executed in the past. E.g. a dynamic node could represent an export of a database table. By defining the sub-nodes, the pipeline execution can run the nodes with the highest node cost first to reduce overall execution time.

Example use cases

  • performing actions against tables in a database (e.g. exporting a table to a data lake). We don't know at definition time which tables exist in the database
  • performing actions against a data lake / lakehouse per table on disk (e.g. connecting the table to our database engine). We don't know at definition time which tables exist in the data lake / lakehouse.

Q: feasibility of substituting SQLite for PostgreSQL?

More of a question than an issue.. but maybe it could turn into a PR :)

Is it easy/hard/impossible/a good idea/a bad idea to substitute SQLite for Postgres? This looks really cool and promising, but I'm interested in reducing external (or local -- this is going to be dockerized) dependencies.

Thanks in advance!

Pass in the filename to the mapper script as an argument

It would be nice in certain debugging scenarios to actually know the filename (or parts of that filename) in the mapper scripts (e.g. to write it to the row in the final table).

It would basically mean adding `-- "{self.file_name}"` to the mapper script part of the pipe:

https://github.com/mara/mara-pipelines/blob/master/mara_pipelines/commands/files.py#L65-L71

                f'{uncompressor(self.compression)} "{pathlib.Path(config.data_dir()) / self.file_name}" \\\n' \
                + (f'  | {shlex.quote(sys.executable)} "{self.mapper_file_path()}" -- "{self.file_name}" \\\n' # <- changed

As far as I understand, since current mappers do not get any args, none should fail if they get one now... @hz-lschick @martin-loetzsch ?
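On the mapper side, the filename could then be picked up roughly like this (sketch, assuming the -- "{self.file_name}" change above):

# mapper.py -- sketch of a mapper that optionally receives the source file name
import sys

# drop the `--` separator; the file name is only present once the change is applied
args = [arg for arg in sys.argv[1:] if arg != '--']
file_name = args[0] if args else ''

for line in sys.stdin:
    # append the source file name as an extra column
    sys.stdout.write(line.rstrip('\n') + '\t' + file_name + '\n')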

Make PyPI package graphviz optional

Currently the PyPI package graphviz is required. I tested running a mara instance on Google App Engine, which worked fine except that you are not able to install additional Linux packages (unless you spin up a docker image, which costs extra).

Would be interesting to investigate making graphviz optional by using the JS engine d3-graphviz when python package graphviz is not installed.

In addition, when running mara_pipelines headless there is no reason for graphviz, so IMHO this package should be defined as an extra requirement.
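A sketch of how that could look in setup.py (the requirement lists are illustrative, not mara-pipelines' actual setup.py):

# setup.py -- sketch of moving graphviz into an optional extra
from setuptools import setup, find_packages

setup(
    name='mara-pipelines',
    packages=find_packages(),
    install_requires=[
        'mara-db',     # illustrative
        'mara-page',   # illustrative
    ],
    extras_require={
        # install with `pip install mara-pipelines[graphviz]`
        'graphviz': ['graphviz'],
    },
)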

Issues with _ParallelRead / Redesign adding optional Worker nodes

User story

I came across an interesting problem while reading multiple files via a subclass of _ParallelRead: with the current design, _ParallelRead collects the full list of files to be processed in RAM, decides which tasks to do first and only then starts the parallel reading.

I have a folder with over 1.3 million files on a cloud storage which needs to be processed. It takes ages to get to the point where mara starts reading (and when you hit an invalid file, everything starts over again...).

In addition, it looks to me like the scheduling is inefficient when loading a lot of files of different sizes (= different processing times).

Here is what I came up with

I redesigned the base class ParallelTask to support using generic Worker nodes instead of tasks. This is an optional mode which needs to be activated with self.use_workers = True.
The Worker nodes get their commands during runtime of the pipeline (in contrast to Task nodes, which require their commands to be defined upfront).

An additional function feed_workers in class ParallelTask can be overloaded. This method is run in a separate process during pipeline execution and yields the commands which are then passed over to the workers. You can either yield a single command or a command list. If you yield a command list, the whole list is passed to a single worker. (This is necessary because in some cases you want to execute several commands in order for a single file.)

Since the workers now get their files / commands passed at runtime from a message queue, I expect this logic to work better when many files need to be processed.

This new design does not work for all _ParallelRead execution options, so it is only used when possible.

PR: #74

Some points to note

  1. When the feed_workers function throws an exception, all commands already in the queue will still be processed by the worker tasks. There is no mechanism yet to inform the worker nodes that they should stop their work; they stop once all open commands in the queue have been picked up.
  2. When a worker node fails, the other workers and the feed worker process will continue their work until the queue is empty.

This implementation only works with ReadMode.ALL.
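A sketch of how an overloaded feed_workers could look, based on the description above (the exact API in the PR may differ):

import pathlib
from mara_pipelines import pipelines
from mara_pipelines.commands.bash import RunBash

class ReadManyFiles(pipelines.ParallelTask):
    """Sketch of a parallel task that feeds its workers at runtime"""
    def __init__(self, id: str, description: str, directory: str):
        super().__init__(id=id, description=description)
        self.directory = directory
        self.use_workers = True   # activate the optional worker mode

    def feed_workers(self):
        """Runs in a separate process and yields commands for the workers"""
        for path in sorted(pathlib.Path(self.directory).iterdir()):
            # yielding a list keeps these commands together on one worker
            yield [RunBash(f'echo "validate {path}"'),
                   RunBash(f'echo "read {path}"')]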

Duplication of pk in table data_integration_system_statistics

Hi,
I'm now using mara to create a pipeline and use multiprocessing to run the workflow in parallel.
In my situation, several pipelines may be running at the same time; however, this leads to duplicate timestamp keys being created.
Would it be possible to use another column (like node_output_id or something else) as the primary key? Or maybe use a global lock to avoid creating data at the same time.

EDIT: For now, I added another column named 'index' as the primary key and index column to fix this problem.

Thanks!
Sincerely,
Tony

Support different execution contexts

Currently mara pipelines are always executed locally. But I would like to have an option to sometimes execute them somewhere else, e.g. in another environment where the required resources are more readily available.

The idea

So I came up with the idea about execution contexts. Here is the rough idea:

  1. one can define an execution context for a pipeline or for a specific task
  2. the execution context then defines where the shell command shall be executed
  3. it should be possible to define multiple execution contexts within one pipeline
  4. an execution context has an "enter" / "exit" method which gives the option to spin up or release the resources required for the execution context

The current idea is to support the following execution contexts:

  1. BashExecutionContext - local bash (this is the current default behavior)
  2. SshBashExecutionContext - remote bash execution via ssh
  3. DockerExecutionContext - docker exec with optional start/stop of a container

Possible other options (Out of scope)

This concept could be extended in the future to add other options like:

  • executing a job on a remote server, spinning up predefined cloud resources before and releasing them afterwards
  • executing a job in Google Cloud Run
  • executing a pipeline in another pipeline engine e.g. Airflow

These ideas are just noted here and are out of scope for this issue.

Blueprint for the ExecutionContext base class

class ExecutionContext:
    """The execution context for a shell command"""

    def __init__(self):
        self.is_active: bool = False

    def __enter__(self):
        """Enters the execution context."""
        self.is_active = True
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        """Exits the execution context, freeing up used resources."""
        self.is_active = False
        return False  # do not suppress exceptions

    def run_shell_command(self, shell_command: str) -> bool:
        """Executes a shell command in the context"""
        raise NotImplementedError()
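Building on that blueprint, an SshBashExecutionContext could look roughly like this (hypothetical sketch, not an existing class):

import shlex
import subprocess

class SshBashExecutionContext(ExecutionContext):
    """Runs shell commands in a bash on a remote host via ssh"""

    def __init__(self, host: str, user: str):
        super().__init__()
        self.host = host
        self.user = user

    def run_shell_command(self, shell_command: str) -> bool:
        # run the command through bash on the remote host
        completed = subprocess.run(
            ['ssh', f'{self.user}@{self.host}',
             'bash -c ' + shlex.quote(shell_command)])
        return completed.returncode == 0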

Rename `data_integration_*` tables

With version 3 the package data_integration has been renamed to mara-pipelines, but the internal tables are still named data_integration_*.

I suggest renaming the tables to pipeline_* in the next major version. For example, data_integration_run would then become pipeline_run. The data could be moved to the new tables during db migration (needs to be implemented). For people still needing the old tables, one could create SQL views (CREATE VIEW data_integration_run AS SELECT * FROM pipeline_run) so that the tables remain accessible under their old names.

PostgreSQLDB(..) constructor doesn't handle port

Passing a port (as an int or a string) isn't honored; connections always go to 5432 no matter what you pass to the constructor.

mara_db.config.databases \
    = lambda: {'mara': mara_db.dbs.PostgreSQLDB(host='localhost', port=25432, user='etl', password='etl', database='etl')}

Pipeline.add upstreams parameter default should not be a list

https://github.com/mara/data-integration/blob/master/data_integration/pipelines.py#L162
def add(self, node: Node, upstreams: [typing.Union[str, Node]] = []) -> 'Pipeline':

This code creates a single list when the function is defined and sets it as the default for the upstreams parameter. All calls to this function will share that list. Here's a short illustrative example:

>>> def append(value, lst=[]):
...     lst.append(value)
...     return lst
...
>>> append(1)
[1]
>>> append(2)
[1, 2]

You probably want to set the default parameter to None and then initialize a new list inside the method if upstreams is None; otherwise all pipelines which don't set upstreams when this method is called will share the same upstreams list.
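A sketch of the suggested fix (the real Pipeline.add contains more logic than shown here):

import typing

class Node:
    pass

class Pipeline(Node):
    def __init__(self):
        self.upstreams_per_node: dict = {}

    def add(self, node: Node,
            upstreams: typing.Optional[typing.List[typing.Union[str, Node]]] = None) -> 'Pipeline':
        # None as default + a fresh list per call avoids the shared mutable default
        upstreams = list(upstreams) if upstreams is not None else []
        self.upstreams_per_node[node] = upstreams
        return self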

Tables do not exist: data_integration_node_run

I am getting the error:

packages/mara_pipelines/logging/node_cost.py", line 41, in node_durations_and_run_times
    GROUP BY node_path;""", {'path': node.path(), 'level': len(node.path())})
psycopg2.errors.UndefinedTable: relation "data_integration_node_run" does not exist
LINE 10:         FROM data_integration_node_run node
                      ^

Running the config code, I only see the output below. No tables were created:

Created database "postgresql+psycopg2://root@localhost/example_etl_mara

Discussion | Supporting different mara db engines than PostgreSQL

This issue holds a list of tasks to be done before db engines other than PostgreSQL can be used as the mara db alias.

  • replace cursor context calls with SQLAlchemy Query API
  • find alternative ways to store the node path. Currently it is stored as a string array. Potentially it could be stored in a single comma-separated string pipeline,my_parallel_task,before ...
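For illustration, a raw cursor-context query like the one in node_cost.py could become an SQLAlchemy call roughly like this (sketch, assuming SQLAlchemy 1.4+; the engine URL is made up and the table is reflected from the mara db):

import sqlalchemy as sa

engine = sa.create_engine('postgresql+psycopg2://root@localhost/example_etl_mara')
metadata = sa.MetaData()
node_run = sa.Table('data_integration_node_run', metadata, autoload_with=engine)

with engine.connect() as connection:
    # count runs per node path, independent of the underlying db engine
    result = connection.execute(
        sa.select(node_run.c.node_path, sa.func.count())
          .group_by(node_run.c.node_path))
    for row in result:
        print(row)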

Multiple root pipelines or pipelines collections

By design, there is one root pipeline to which all other pipelines are added. This might make sense when you just have one pipeline which does one job, but I have several pipelines which I don't want to execute together.
E.g. a pipeline for a daily refresh with several incremental refreshes, a pipeline to execute a complete full load, a pipeline running at a specific time to refresh specific data areas on demand, etc.

I came up with the following ideas for how this could be solved:

  • Multiple root pipelines --> when more than one is added, you have to specify the root pipeline name as an additional parameter in flask mara_pipelines.ui.run --pipeline ...
  • add a PipelineCollection class. This class would hold a collection of pipelines. You can't run a PipelineCollection, but you can run its sub-pipelines. This class could then be set as the root pipeline. A pipeline would then be called via flask mara_pipelines.ui.run --path <pipeline_name>.<path within the pipeline>

Does anyone have other ideas? Is there maybe a common way to solve this that I am not aware of?

Optimistic pipeline execution behavior

When a node in a pipeline fails, the whole pipeline fails. It would be great to have a more optimistic execution: when a node fails, just skip its downstream nodes instead of all open nodes, and then mark the pipeline as “failed”. This would match the execution logic of dbt and give the data engineer the option to just fix the failed nodes after the failure.
Currently I usually have to restart the whole pipeline even when only some small tasks at the start, which are not connected to other tasks, fail.

Run pipeline in UI stops when closing website

This is probably a known issue to many users of mara, but I want to mention it here because I think it should not be this way:

When running a pipeline via the UI using run or run with upstreams it runs in a flask UI thread. This creates several issues:

  • when you close the web page, the pipeline execution stops
  • when the network connection breaks for just a short period of time (e.g. you disconnect your laptop from a docking station) it stops as well

In addition, when you run a pipeline via the server e.g. from a scheduled CI script or a cron job, you are not able to "connect" to the execution to see what is currently running. The only option to see what is going on is to open the node page and refresh it manually.

Better monitoring of pipeline executions [EPIC]

User story

Currently the only way to see in the UI whether a pipeline ran through is to click on the pipeline and check its last runs. There you can see whether the pipeline is still running (unfinished), succeeded or failed. When you have a bunch of pipelines (e.g. in the root pipeline) which run on different time schedules, you have to click through all of them to know if everything worked out fine.

This shall be improved

Ideas how to solve this

  • add a dashboard showing the last runs of all pipelines
  • add a dashboard showing the currently running pipelines (based on table data_integration_run)
  • add a status icon behind a pipeline in the dependency graph and/or the child nodes page to show the status of the last run. Here we could use simple status icons like those used in GitHub Actions.

Redesign system statistics

Mara collects system statistics (e.g. CPU, disk, memory) when a pipeline is executed.

This is a collection ticket of things which should be redesigned in this logic.

  • add option to disable the collection of system statistic information. This should disable the cards in the web UI as well.
  • redesign system statistics, remove it from the standard EventHandler
  • add extras_require statistics with package psutil, remove psutil from install_requires (default package requirements)

Postgres Authentication issue while running the pipeline

Hi

I am trying Mara for the very first time. Pretty good. From the demo, I am running the code below, but I'm getting this error:

 Traceback (most recent call last):
  File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/site-packages/mara_pipelines/execution.py", line 67, in run
    node_durations_and_run_times = node_cost.node_durations_and_run_times(pipeline)
  File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/site-packages/mara_pipelines/logging/node_cost.py", line 22, in node_durations_and_run_times
    with mara_db.postgresql.postgres_cursor_context('mara') as cursor:
  File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/site-packages/mara_db/postgresql.py", line 19, in postgres_cursor_context
    host=db.host, port=db.port)  # type: psycopg2.extensions.connection
  File "/x/lib/python3.7/site-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: fe_sendauth: no password supplied

from mara_pipelines.commands.bash import RunBash
from mara_pipelines.pipelines import Pipeline, Task
from mara_pipelines.ui.cli import run_pipeline, run_interactively
from mara_pipelines.ui.cli import run_pipeline

pipeline = Pipeline(
    id='demo',
    description='A small pipeline that demonstrates the interplay between pipelines, tasks and commands')

pipeline.add(Task(id='ping_localhost', description='Pings localhost',
                  commands=[RunBash('ping -c 3 localhost')]))

sub_pipeline = Pipeline(id='sub_pipeline', description='Pings a number of hosts')

for host in ['google', 'amazon', 'facebook']:
    sub_pipeline.add(Task(id=f'ping_{host}', description=f'Pings {host}',
                          commands=[RunBash(f'ping -c 3 {host}.com')]))

sub_pipeline.add_dependency('ping_amazon', 'ping_facebook')
sub_pipeline.add(Task(id='ping_foo', description='Pings foo',
                      commands=[RunBash('ping foo')]), ['ping_amazon'])

pipeline.add(sub_pipeline, ['ping_localhost'])

pipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',
                  commands=[RunBash('sleep 2')]), ['sub_pipeline'])


if __name__ == '__main__':
    run_pipeline(pipeline)

I had created a separate .py file with the code to set up the db:

import mara_db.auto_migration
import mara_db.config
import mara_db.dbs

mara_db.config.databases \
    = lambda: {'mara': mara_db.dbs.PostgreSQLDB(
    host='localhost',
    user='postgres',
    password='postgres',
    database='example_etl_mara')}

mara_db.auto_migration.auto_discover_models_and_migrate()

It worked well and created the db. Now it is not clear to me whether I have to run this code when running the pipeline too?

Thanks

Adding support for Mara Storage

The implementation of the storage module from @ice1e0 has been merged into the master branch. I see this as a breaking change and therefore suggest publishing it in the next major version 4.

In addition, I think some additional commands should be added for working with files. Here are some examples:

  • WriteFile - Write an SQL Output to a file on a blob storage
  • CopyFile - copies a file from one location to another
  • RemoveFile - removes a file from a blob storage

optional additional commands:

  • ValidateFile - checks if a file is of a specific format and validates it against a schema file (XSD or JSON Schema)
  • ConvertFile - converts a file from one file format to another, e.g. xml to json
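To give an idea of the shape such commands could take, here is a sketch of a possible CopyFile command (hypothetical, not yet part of mara-pipelines; it assumes the usual Command.shell_command pattern and ignores the mara-storage configuration):

from mara_pipelines import pipelines

class CopyFile(pipelines.Command):
    """Copies a file from one storage location to another (sketch)"""

    def __init__(self, source_storage: str, source_file: str,
                 target_storage: str, target_file: str):
        super().__init__()
        self.source_storage = source_storage
        self.source_file = source_file
        self.target_storage = target_storage
        self.target_file = target_file

    def shell_command(self):
        # a real implementation would build the command from the configured
        # storages (local path, S3 bucket, ...) instead of a plain `cp`
        return f'cp "{self.source_file}" "{self.target_file}"'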

Implement a "continue on error" behavior for pipelines

Currently the pipeline executor stops once a task fails (after waiting for all currently running tasks to finish).

Sometimes it is desirable to continue execution of a pipeline upon failure (e.g. for consistency checks or frontend updates).

Such a feature could be implemented easily.

Reverting a bad create_xxx_data_table with file dependencies leads to inconsistent state

We just ran into this scenario: an incremental load job with create_xxx_data_table.sql + read_xxx.sql, where create_xxx_data_table.sql depends on the schema + both involved sql files.

The problem happened on a bad merge which made create_xxx_data_table.sql error out after the DROP and before the CREATE TABLE had run successfully. This merge got reverted, so the file afterwards had the same checksum as before the merge.

The problem was that the table got DROPed, but the checksum logic thought the file hadn't changed and therefore did not rerun it.

IMO the logic should be changed so that the old checksum is removed from the cache if the process resulted in an error.

I can send a PR if this is the right approach (or any other which might be better).

Rerun failed pipelines function

Add a re-run function to run all failed and skipped nodes from a failed pipeline run.

This should be put into a UI interface, e.g. next to an overview of the last pipeline executions:

[screenshot: overview of the last pipeline executions]

passing/overriding config via `flask ...run <param>`

I would like an option in the CLI command flask mara_pipelines.ui.run to override config values.

Example:

I have an app.config.py file:

from enum import Enum

class ProcessingMode(Enum):
    FULL = 'full'
    INCREMENTAL = 'incremental'

def processing_mode() -> ProcessingMode:
    """The processing mode to be used"""
    return ProcessingMode.INCREMENTAL

def default_window_in_days() -> int:
    """The default refresh window for models when running a incremental sync."""
    return 90

Now I would like to override the config for a single pipeline execution.

Sample design:

# sets the processing_mode to FULL
flask mara_pipelines.ui.run --patch-config app.config.processing_mode=FULL

# sets the default_window_in_days to 180 days
flask mara_pipelines.ui.run --patch-config app.config.default_window_in_days=180

The parameter should maybe not be called --patch-config <str>. Terraform, for example, uses -var=<...> / -var-file=<...>.

Install from PyPI missing statics, run_time_chart.sql

It appears that when I pip install mara-pipelines from PyPI (instead of from GitHub as is done in the example project), the static files don't get installed and neither does mara_pipelines/ui/run_time_chart.sql, which causes a bunch of 404s and a 500 when trying to access the flask page:

➜  app git:(master) ✗ flask run
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET / HTTP/1.1" 302 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/common.css HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/node-page.css HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/timeline-chart.css HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/node-page.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/utils.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/run-time-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/system-stats-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/kolorwheel.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/timeline-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/node-page.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/utils.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/run-time-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/system-stats-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/timeline-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/kolorwheel.js HTTP/1.1" 404 -
[2020-10-13 13:41:59,141] ERROR in app: Exception on /pipelines/run-time-chart [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python3.7/site-packages/mara_page/acl.py", line 108, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/mara_pipelines/ui/run_time_chart.py", line 30, in run_time_chart
    query = (pathlib.Path(__file__).parent / 'run_time_chart.sql').read_text()
  File "/usr/local/lib/python3.7/pathlib.py", line 1221, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
  File "/usr/local/lib/python3.7/pathlib.py", line 1208, in open
    opener=self._opener)
  File "/usr/local/lib/python3.7/pathlib.py", line 1063, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/site-packages/mara_pipelines/ui/run_time_chart.sql'
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/run-time-chart HTTP/1.1" 500 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/pipeline-children-table HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/dependency-graph HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/system-stats HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/timeline-chart HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/last-runs-selector HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/run-output-limited HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /mara-app/navigation-bar HTTP/1.1" 200 -
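The usual fix for this kind of problem is to declare the non-Python files as package data, roughly like this (sketch of a setuptools configuration; the file patterns are assumptions derived from the 404s above):

# setup.py -- sketch: ship the static files and run_time_chart.sql with the package
from setuptools import setup, find_packages

setup(
    name='mara-pipelines',
    packages=find_packages(),
    include_package_data=True,
    package_data={
        'mara_pipelines.ui': ['static/*', 'run_time_chart.sql'],
    },
)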

High memory issue with RunLogger

The RunLogger class caches Output events in a local variable (node_output). There are two issues with the current implementation which can cause high memory usage during pipeline execution:

  • when you have a long-running task / parallel task, a long output chain might be cached. The cached output should be written to the db after e.g. 500 entries.
  • the node_output dict is not freed after writing the output to the database. This results in the whole pipeline output being cached in RAM until the RunLogger instance is disposed (!!)

Handle SIGTERM

K8s sends a SIGTERM followed by a waiting period and SIGKILL. The pipeline runner currently only handles SIGINT (ctrl-c, via the exception python raises for this key combo which triggers the generic except: block in the main loop).

Python by default seems to react to SIGTERM by terminating. This results in situations where the pipeline run is abruptly killed, which leaves open runs around.

How to reproduce:

  • Run a pipeline
  • Kill all processes belonging to the run with SIGTERM (kill <list of pids> or close the docker container)

-> the runs (data_integration_run) and node_runs (data_integration_node_run) are left open (end_time is NULL).

We use the runs to not start a second run and the node_runs to display prometheus metrics per main pipeline (last run, currently running,...). Leaving them open means someone has to close them manually :-(
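One possible approach (sketch, not the actual implementation) would be to translate SIGTERM into the already-handled KeyboardInterrupt path so that the existing cleanup closes open runs:

import signal

def _handle_sigterm(signum, frame):
    # reuse the existing Ctrl-C cleanup logic in the run loop
    raise KeyboardInterrupt()

signal.signal(signal.SIGTERM, _handle_sigterm)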

parallel_tasks\files.py ParallelReadFile copies data into master table instead of child tables

ReadFile.shell_command is called by ParallelReadFile.read_command with the master table as the target_table parameter.
Postgres does not automatically redirect insert/copy from the master table to the child tables.

In ReadFile the correct child table name is written to a local variable,
l.140: target_table = self.target_table + '_' + day.strftime("%Y%m%d")
but this variable is not passed to the parallel_task function; it is only used to analyze the (empty) partition.

Possible solutions could be:

  • use Postgres Partitioning introduced with Version 10 instead of inheritance
  • Pass local target_table to parallel_commands

limit node cost calc.

The node cost calculation here just takes the whole history into account, which I think is not a good idea. There should be a limit, e.g. to the last 60 (successful) executions.

The perfect solution would probably be an exponential smoothing algorithm weighting older execution times less than newer ones (but I guess that would be a bit oversized).
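For reference, an exponentially weighted average over past run durations could look like this (sketch; alpha and the example values are made up):

def smoothed_cost(durations_oldest_first, alpha=0.3):
    """durations_oldest_first: run durations in seconds, oldest run first"""
    estimate = None
    for duration in durations_oldest_first:
        # newer durations get weight alpha, the running estimate keeps (1 - alpha)
        estimate = duration if estimate is None else alpha * duration + (1 - alpha) * estimate
    return estimate

print(smoothed_cost([90, 300, 100, 120]))   # weighted towards the most recent runs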

add HTTP Request command

I plan to add a new HttpRequestCommand which shall run an HTTP/HTTPS request via curl.

It should keep the output clean (using option -s) and fail when the request returns an error code (option -f).

Things to implement

  • support different request types (arg -X), using GET requests by default
  • support passing headers (option -H)
  • implement a command which generates the shell command string with an option to pass data from stdin (option --data-binary @-)

Possible use cases

  • calling an API function with an access token
  • running a serverless cloud function (AWS Lambda / Google Cloud Function / Azure Function command) via HTTP request

How to implement

  • add a new function http_request_command(...) in mara_pipelines.shell
  • add a new command HttpRequestCommand in mara_pipelines.commands.http
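A sketch of what http_request_command could generate (the final signature may well differ):

import shlex
from typing import Dict, Optional

def http_request_command(url: str, headers: Optional[Dict[str, str]] = None,
                         method: str = 'GET', body_from_stdin: bool = False) -> str:
    """Builds a curl command string: quiet (-s), failing on HTTP error codes (-f)"""
    command = ['curl', '-s', '-f', '-X', method]
    for name, value in (headers or {}).items():
        command += ['-H', shlex.quote(f'{name}: {value}')]
    if body_from_stdin:
        command += ['--data-binary', '@-']
    command.append(shlex.quote(url))
    return ' '.join(command)

print(http_request_command('https://example.com/api',
                           headers={'Authorization': 'Bearer TOKEN'}))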

ValueError: cannot find context for 'fork'

I assume this example code should run on my system without error:

from data_integration.commands.bash import RunBash
from data_integration.pipelines import Pipeline, Task
from data_integration.ui.cli import run_pipeline, run_interactively

pipeline = Pipeline(
    id='demo',
    description='A small pipeline that demonstrates the interplay between pipelines, tasks and commands')

pipeline.add(Task(id='ping_localhost', description='Pings localhost',
                  commands=[RunBash('ping -c 3 localhost')]))

sub_pipeline = Pipeline(id='sub_pipeline', description='Pings a number of hosts')

for host in ['google', 'amazon', 'facebook']:
    sub_pipeline.add(Task(id=f'ping_{host}', description=f'Pings {host}',
                          commands=[RunBash(f'ping -c 3 {host}.com')]))

sub_pipeline.add_dependency('ping_amazon', 'ping_facebook')
sub_pipeline.add(Task(id='ping_foo', description='Pings foo',
                      commands=[RunBash('ping foo')]), ['ping_amazon'])

pipeline.add(sub_pipeline, ['ping_localhost'])

pipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',
                  commands=[RunBash('sleep 2')]), ['sub_pipeline'])


run_pipeline(pipeline)

Here's the output of the script:

$ python historical.py
Traceback (most recent call last):
  File "C:\Users\david\Anaconda3\lib\multiprocessing\context.py", line 190, in get_context
    ctx = _concrete_contexts[method]
KeyError: 'fork'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "historical.py", line 28, in <module>
    run_pipeline(pipeline)
  File "C:\Users\david\Anaconda3\lib\site-packages\data_integration\ui\cli.py", line 46, in run_pipeline
    for event in execution.run_pipeline(pipeline, nodes, with_upstreams, interactively_started=interactively_started):
  File "C:\Users\david\Anaconda3\lib\site-packages\data_integration\execution.py", line 48, in run_pipeline
    multiprocessing_context = multiprocessing.get_context('fork')
  File "C:\Users\david\Anaconda3\lib\multiprocessing\context.py", line 238, in get_context
    return super().get_context(method)
  File "C:\Users\david\Anaconda3\lib\multiprocessing\context.py", line 192, in get_context
    raise ValueError('cannot find context for %r' % method)
ValueError: cannot find context for 'fork'

If it matters, I've also provisioned a PostgreSQL instance for mara:

import mara_db.auto_migration
import mara_db.config
import mara_db.dbs

mara_db.config.databases \
    = lambda: {'mara': mara_db.dbs.PostgreSQLDB(host='localhost', user='postgres', password = '', database='etl_mara')}

mara_db.auto_migration.auto_discover_models_and_migrate()

Don't show stack traces on Ctrl+c

For a few releases now, run_pipeline has printed various stack traces for KeyboardInterrupt when hitting Ctrl+C:

 ★ 0s
utils: ★ 4.9s
utils / initialize_utils: ★ 4.7s
...
^CProcess system_statistics:
Traceback (most recent call last):
  File "/usr/local/Cellar/[email protected]/3.8.1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/Cellar/[email protected]/3.8.1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/execution.py", line 61, in <lambda>
    target=lambda: system_statistics.generate_system_statistics(event_queue), name='system_statistics')
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/logging/system_statistics.py", line 86, in generate_system_statistics
    time.sleep(period)
KeyboardInterrupt
 Traceback (most recent call last):
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/execution.py", line 348, in run_pipeline
    _notify_all(event)
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/execution.py", line 340, in _notify_all
    raise e
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/execution.py", line 333, in _notify_all
    runlogger.handle_event(event)
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/logging/run_log.py", line 124, in handle_event
    with mara_db.postgresql.postgres_cursor_context('mara') as cursor:  # type: psycopg2.extensions.cursor
  File "/usr/local/Cellar/[email protected]/3.8.1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/mara-db/mara_db/postgresql.py", line 18, in postgres_cursor_context
    connection = psycopg2.connect(dbname=db.database, user=db.user, password=db.password,
  File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/.venv/lib/python3.8/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
KeyboardInterrupt
Cleaned up open runs/node_runs (run_id = 35)


While this is not a real problem, it would be nice if we could print something less heavy, e.g. `KeyboardInterrupt`.
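For illustration, the top level of the run loop could catch the interrupt and print a short note instead (sketch; run() is a stand-in, not the real pipeline run loop):

import sys
import time

def run():
    """stand-in for the pipeline run loop"""
    while True:
        time.sleep(1)

if __name__ == '__main__':
    try:
        run()
    except KeyboardInterrupt:
        # print a short note instead of the full stack trace
        print('KeyboardInterrupt', file=sys.stderr)
        sys.exit(130)   # conventional exit status for SIGINT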
