mara / mara-pipelines
A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
License: MIT License
Currently the pipeline DAG is fixed at compilation time and supports only limited dynamic behavior, e.g. the task ParallelReadFile supports reading files when the number of files is unknown at compilation time.
I would like to have similar dynamics in other areas as well.
The following dynamic nodes could be implemented:
- An option to give the Task a Python function which is executed at pipeline runtime and returns a list of commands to execute in order.
- An option to give the ParallelTask a Python function which is executed at pipeline runtime and returns a list of commands / command chains to be executed in parallel.
- An option to define a DynamicPipeline where the nodes are defined within a Python function which is executed at pipeline runtime.
The dynamic node objects (Task/ParallelTask/Pipeline) must be defined so that the Python function which defines the actual commands/tasks/nodes is not run when interacting with the UI.
These dynamic nodes should define sub-nodes of the dynamic node object. The pipeline execution should then retrieve the node cost from the database when the node has been executed in the past. E.g. a dynamic node could represent an export of a database table. With the sub-nodes defined, the pipeline execution can run the nodes with the highest node cost first to save execution time.
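To illustrate the deferred evaluation described above, here is a minimal sketch. The names DynamicTask, command_factory and list_export_commands are hypothetical and not part of mara-pipelines, and commands are plain strings instead of command objects. The point is only that the UI-facing method never calls the function, while the executor-facing method calls it once.

```python
from typing import Callable, List, Optional


class DynamicTask:
    """Hypothetical task whose commands are produced by a function at run time."""

    def __init__(self, id: str, command_factory: Callable[[], List[str]]):
        self.id = id
        self._command_factory = command_factory
        self._commands: Optional[List[str]] = None  # not resolved yet

    def describe(self) -> str:
        """Safe for a UI: never calls the factory."""
        return f"{self.id} (commands resolved at run time)"

    def commands(self) -> List[str]:
        """Meant for the executor only: resolves and caches the commands."""
        if self._commands is None:
            self._commands = list(self._command_factory())
        return self._commands


calls = []  # records each invocation of the factory


def list_export_commands() -> List[str]:
    calls.append(1)
    return [f"export_table {name}" for name in ("customers", "orders")]


task = DynamicTask("export_tables", list_export_commands)
```

Calling task.describe() leaves calls empty; only task.commands() triggers the function, and repeated calls reuse the cached list.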
More of a question than an issue.. but maybe it could turn into a PR :)
Is it easy/hard/impossible/a good idea/a bad idea to substitute sqlite for postgres? This looks really cool and promising, but I'm interested in reducing external (or local -- this is going to be dockerized) dependencies.
Thanks in advance!
It would be nice in certain debugging scenarios to actually know the filename (or any part of that filename) in the mapper scripts (e.g. to write it to the row in the final table).
It would basically mean adding -- "{self.file_name}" in the mapper script part of the pipe:
https://github.com/mara/mara-pipelines/blob/master/mara_pipelines/commands/files.py#L65-L71
f'{uncompressor(self.compression)} "{pathlib.Path(config.data_dir()) / self.file_name}" \\\n' \
+ (f' | {shlex.quote(sys.executable)} "{self.mapper_file_path()}" -- "{self.file_name}" \\\n' # <- changed
As far as I understand, as current mappers do not get any args, none should fail if they get one now... @hz-lschick @martin-loetzsch ?
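A mapper script could then pick up the extra argument roughly like this. This is only a sketch under the assumption that the file name arrives as the argument after "--", as in the proposed change; the helper names and the tab-separated row format are made up for illustration.

```python
from typing import Iterable, Iterator, List


def file_name_from_args(argv: List[str]) -> str:
    """Returns the argument following '--', or '' when none was passed."""
    if "--" in argv:
        position = argv.index("--")
        if position + 1 < len(argv):
            return argv[position + 1]
    return ""


def map_lines(lines: Iterable[str], file_name: str) -> Iterator[str]:
    """Appends the file name as an extra tab-separated column to each row."""
    for line in lines:
        row = line.rstrip("\n")
        yield row + "\t" + file_name + "\n"
```

In a real mapper one would call file_name_from_args(sys.argv) and write map_lines(sys.stdin, name) to sys.stdout. A mapper that ignores sys.argv keeps working unchanged.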
Currently the PyPI package graphviz is required. I tested running a mara instance on Google App Engine, which worked fine except that you are not able to install additional Linux packages (unless you spin up a docker image, which costs extra).
It would be interesting to investigate making graphviz optional by using the JS engine d3-graphviz when the Python package graphviz is not installed.
In addition, when running mara_pipelines headless there is no reason for graphviz, and therefore IMHO this package should be defined as an extra requirement.
I came across an interesting problem while reading multiple files via a subclass of _ParallelRead: with the current design, _ParallelRead loads all files to be processed into RAM, decides which tasks to do first and then starts the parallel reading.
I have a folder with over 1.3 million files on a cloud storage which needs to be processed. It takes ages to get to the point where mara starts reading (and when you have an invalid file, everything starts over again...)
In addition, it looks to me like the calculation is inefficient when loading a lot of files of different sizes (= different processing times).
I redesigned the base class ParallelTask to support using generic Worker nodes instead of tasks. This is an optional mode which needs to be activated with self.use_workers = True.
The Worker nodes get their commands during runtime of the pipeline (in contrast to the Task node, which requires that its commands are defined upfront).
An additional function feed_workers in class ParallelTask can be overridden. This method is run in a separate process during pipeline execution and yields the commands which are then passed over to the workers. You can either yield a single command or a command list. If you yield a command list, the whole list is passed to a single worker. (This is necessary because in some cases you want to execute several commands in order for a single file.)
Since the workers now get their files / commands passed at runtime from a "message queue", I expect this logic to work better when many files need to be processed.
This new design does not work for all _ParallelRead execution options, so it is only used when possible.
PR: #74
If the feed_workers function throws an exception, all commands already in the queue will still be processed by the worker tasks. Nothing is implemented to inform the worker nodes that they should stop their work; they will stop once all open commands in the queue have been picked up.
This implementation only works with ReadMode.ALL.
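The feeding logic can be pictured with a small stand-alone sketch. It is not the code from the PR: it uses threads and queue.Queue instead of separate processes, plain strings instead of command objects, and the names worker, run and _DONE are made up. It only shows how a generator can hand out single commands or command lists while the workers are already running.

```python
import queue
import threading
from typing import Iterator, List, Union

Command = str  # stand-in for a mara command object
_DONE = object()  # sentinel telling a worker to stop


def feed_workers() -> Iterator[Union[Command, List[Command]]]:
    """Yields single commands or command lists (a list stays on one worker)."""
    yield "read file_1.csv"
    yield ["unzip file_2.zip", "read file_2.csv"]
    yield "read file_3.csv"


def worker(commands: queue.Queue, results: list, lock: threading.Lock) -> None:
    while True:
        item = commands.get()
        if item is _DONE:
            break
        batch = item if isinstance(item, list) else [item]
        with lock:
            results.append(batch)  # a real worker would run each command in order


def run(number_of_workers: int = 2) -> list:
    commands: queue.Queue = queue.Queue()
    results: list = []
    lock = threading.Lock()
    workers = [threading.Thread(target=worker, args=(commands, results, lock))
               for _ in range(number_of_workers)]
    for thread in workers:
        thread.start()
    for item in feed_workers():  # commands are produced while workers already run
        commands.put(item)
    for _ in workers:
        commands.put(_DONE)  # one sentinel per worker
    for thread in workers:
        thread.join()
    return results
```

Because each worker takes the next item as soon as it is free, files with very different processing times balance out without any upfront cost calculation.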
Hi,
I'm now using mara to create a pipeline, and I use multiprocessing to run the workflow in parallel.
In my situation, several pipelines may run at the same time; however, this leads to duplicate timestamp keys being created.
Is it possible to use another column (like node_output_id or something else) as the primary key? Or maybe to use a global lock to avoid creating data at the same time?
EDIT: For now, I added another column named 'index' as the primary key and index column to fix this problem.
Thanks!
Sincerely,
Tony
Currently mara pipelines are always executed locally. But I would like to have an option to sometimes execute them somewhere else, e.g. in another environment where other resources are closer.
So I came up with the idea of execution contexts. Here is the rough idea:
The current idea is to support the following execution contexts:
- BashExecutionContext - local bash (this is the current default behavior)
- SshBashExecutionContext - remote bash execution via ssh
- DockerExecutionContext - docker exec with optional start/stop of a container
This concept could be extended in the future to add other options like:
These ideas are just noted here and are out of scope for this issue.
class ExecutionContext:
    """The execution context for a shell command"""

    def __init__(self):
        self.is_active: bool = False

    def __enter__(self):
        """Enters the execution context."""
        return self

    def __exit__(self, type, value, traceback) -> bool:
        """Exits the execution context freeing up used resources."""
        return False  # returning True would suppress exceptions raised inside the context

    def run_shell_command(self, shell_command: str) -> bool:
        """Executes a shell command in the context"""
        pass
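For the local case, a concrete context could look roughly like the sketch below. The class name LocalShellExecutionContext is hypothetical, the class is written stand-alone instead of subclassing the draft above so that it runs on its own, and it uses the system shell via subprocess rather than anything mara-specific.

```python
import subprocess


class LocalShellExecutionContext:
    """Hypothetical local context: runs commands through the system shell."""

    def __init__(self):
        self.is_active: bool = False

    def __enter__(self):
        self.is_active = True
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        self.is_active = False
        return False  # let exceptions propagate

    def run_shell_command(self, shell_command: str) -> bool:
        if not self.is_active:
            raise RuntimeError("the execution context has not been entered")
        completed = subprocess.run(shell_command, shell=True)
        return completed.returncode == 0  # True when the command succeeded


with LocalShellExecutionContext() as context:
    succeeded = context.run_shell_command("exit 0")
    failed = context.run_shell_command("exit 3")
```

An ssh or docker variant would keep the same three methods and only change what happens on enter, exit and run.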
With version 3 the package data_integration has been renamed to mara-pipelines, but the internal tables are still named data_integration_*.
I suggest renaming the tables to pipeline_* in the next major version. For example, data_integration_run would then get the name pipeline_run. The data could be moved to the new tables during db migration (needs to be implemented). For people still needing the old tables, one could create SQL views (CREATE VIEW data_integration_run AS SELECT * FROM pipeline_run) so that the tables can still be accessed under their old names.
Passing a port (as an int or a string) isn't honored, as it runs against 5432 no matter what you pass in to the constructor.
mara_db.config.databases \
= lambda: {'mara': mara_db.dbs.PostgreSQLDB(host='localhost', port=25432, user='etl', password='etl', database='etl')}
https://github.com/mara/data-integration/blob/master/data_integration/pipelines.py#L162
def add(self, node: Node, upstreams: [typing.Union[str, Node]] = []) -> 'Pipeline':
This code creates a new list once, when the function is defined, and sets it as the default for the upstreams parameter. All calls to this function that do not pass upstreams will share that list. Here's a short illustrative example:
>>> def append(value, lst=[]):
... lst.append(value)
... return lst
...
>>> append(1)
[1]
>>> append(2)
[1, 2]
You probably want to set the default parameter to None and then initialize a new list inside the method if upstreams is None, otherwise all pipelines which don't set upstreams when this method is called will share the same upstreams.
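A minimal sketch of that fix, applied to the append example above rather than to Pipeline.add itself:

```python
from typing import List, Optional


def append(value, lst: Optional[List] = None) -> List:
    if lst is None:
        lst = []  # a fresh list for every call that passes no list
    lst.append(value)
    return lst


first = append(1)
second = append(2)
```

With the None default, the second call no longer sees the value appended by the first one.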
I am getting the error:
packages/mara_pipelines/logging/node_cost.py", line 41, in node_durations_and_run_times
GROUP BY node_path;""", {'path': node.path(), 'level': len(node.path())})
psycopg2.errors.UndefinedTable: relation "data_integration_node_run" does not exist
LINE 10: FROM data_integration_node_run node
^
Running config code I only see the below. No tables were created:
Created database "postgresql+psycopg2://root@localhost/example_etl_mara
This issue holds a list of tasks to be done before db engines other than PostgreSQL can be used as the mara db alias.
pipeline,my_parallel_task,before
...
By design, there is one root pipeline to which all other pipelines are added. This might make sense when you have just one pipeline which does one task, but I have several pipelines which I don't want to execute together.
E.g. a pipeline for a daily refresh with several incremental refreshes, a pipeline to execute a complete full load, a pipeline running at a specific time to refresh specific data areas on demand, etc.
I came up with the following ideas how this could be solved:
flask mara_pipelines.ui.run --pipeline ...
PipelineCollection class. This pipeline class would have a collection of pipelines. You can't run a PipelineCollection, but you can run its sub-pipelines. This class could then be set as the root pipeline. A pipeline would then be called via flask mara_pipelines.ui.run --path <pipeline_name>.<path within the pipeline>
Does anyone have other ideas? Is there maybe a common way to solve this that I am not aware of?
When a node in a pipeline fails, the whole pipeline fails. It would be great to have a more optimistic execution: when a node in a pipeline fails, just skip its downstream nodes instead of all open nodes, and then mark the pipeline as "failed". This would match the execution logic of dbt and gives the data engineer the option to just fix the missing nodes after a failure.
Currently, most of the time I have to restart the whole pipeline, even when only some small tasks at the start, which are not connected to other tasks, fail.
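The selection of nodes to skip can be sketched as a plain graph traversal. This is not how the mara executor is implemented; the dict-based DAG and the function name downstream_of are assumptions for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_of(failed: str, downstreams: Dict[str, List[str]]) -> Set[str]:
    """Returns all nodes reachable from `failed` via downstream edges."""
    to_skip: Set[str] = set()
    pending = deque(downstreams.get(failed, []))
    while pending:
        node = pending.popleft()
        if node not in to_skip:
            to_skip.add(node)
            pending.extend(downstreams.get(node, []))
    return to_skip


# hypothetical DAG: node -> nodes that depend on it
dag = {
    "load_a": ["transform_a"],
    "transform_a": ["report"],
    "load_b": ["report"],
    "cleanup": [],
}
skipped = downstream_of("load_a", dag)
```

Here a failure of load_a skips transform_a and report, while load_b and cleanup can still run.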
This is probably a known issue to many users using mara but I want to mention it here because I think this should not be that way:
When running a pipeline via the UI using run or run with upstreams it runs in a flask UI thread. This creates several issues:
In addition, when you run a pipeline via the server e.g. from a scheduled CI script or a cron job, you are not able to "connect" to the execution to see what is currently running. The only option to see what is going on is to open the node page and refresh it manually.
Currently the only way to see in the UI whether a pipeline ran through is to click on the pipeline and check its last runs. There you can see whether the pipeline is still running (unfinished), succeeded or failed. When you have a bunch of pipelines (e.g. in the root pipeline) which run on different time schedules, you have to click through all of them to know if everything worked out fine.
This shall be improved
data_integration_run)
Mara collects system statistics (e.g. CPU, disk, memory) when a pipeline is executed.
This is a collection ticket of things which should be redesigned in this logic.
statistics with package psutil, remove psutil from install_requires (default package requirements)
Hi
I am trying Mara for the very first time. Pretty good. I am running the following code from the demo but I am getting this error:
Traceback (most recent call last):
File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/site-packages/mara_pipelines/execution.py", line 67, in run
node_durations_and_run_times = node_cost.node_durations_and_run_times(pipeline)
File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/site-packages/mara_pipelines/logging/node_cost.py", line 22, in node_durations_and_run_times
with mara_db.postgresql.postgres_cursor_context('mara') as cursor:
File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/contextlib.py", line 112, in __enter__
return next(self.gen)
File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/site-packages/mara_db/postgresql.py", line 19, in postgres_cursor_context
host=db.host, port=db.port) # type: psycopg2.extensions.connection
File "/x/lib/python3.7/site-packages/psycopg2/__init__.py", line 127, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: fe_sendauth: no password supplied
from mara_pipelines.commands.bash import RunBash
from mara_pipelines.pipelines import Pipeline, Task
from mara_pipelines.ui.cli import run_pipeline, run_interactively

pipeline = Pipeline(
    id='demo',
    description='A small pipeline that demonstrates the interplay between pipelines, tasks and commands')
pipeline.add(Task(id='ping_localhost', description='Pings localhost',
                  commands=[RunBash('ping -c 3 localhost')]))
sub_pipeline = Pipeline(id='sub_pipeline', description='Pings a number of hosts')
for host in ['google', 'amazon', 'facebook']:
    sub_pipeline.add(Task(id=f'ping_{host}', description=f'Pings {host}',
                          commands=[RunBash(f'ping -c 3 {host}.com')]))
sub_pipeline.add_dependency('ping_amazon', 'ping_facebook')
sub_pipeline.add(Task(id='ping_foo', description='Pings foo',
                      commands=[RunBash('ping foo')]), ['ping_amazon'])
pipeline.add(sub_pipeline, ['ping_localhost'])
pipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',
                  commands=[RunBash('sleep 2')]), ['sub_pipeline'])

if __name__ == '__main__':
    run_pipeline(pipeline)
I had created a separate .py file in which I used this code to set up the db:
import mara_db.auto_migration
import mara_db.config
import mara_db.dbs
mara_db.config.databases \
= lambda: {'mara': mara_db.dbs.PostgreSQLDB(
host='localhost',
user='postgres',
password='postgres',
database='example_etl_mara')}
mara_db.auto_migration.auto_discover_models_and_migrate()
It worked well and created the db. Now it is not clear to me whether I have to run this code when running the pipeline too?
Thanks
The implementation of the storage module from @ice1e0 has been merged into the master branch. I see this as a breaking change and therefore suggest publishing it in the next major version 4.
In addition, I think some additional commands should be added for working with files. Here are some samples:
optional additional commands:
I realize that these are just shell scripts.. but not clear how to manage errors (some of which may invalidate the remaining tasks). Otherwise.. pretty cool.
Currently the pipeline executor stops once a task fails (after waiting for all currently running tasks to finish).
Sometimes it is desirable to continue execution of a pipeline upon failure (e.g. for consistency checks or frontend updates).
Such a feature could be easily implemented
- by adding a parameter continue_on_error to class Pipeline in https://github.com/mara/data-integration/blob/master/data_integration/pipelines.py#L150.
- by extending the if node.parent in failed_pipelines: condition in https://github.com/mara/data-integration/blob/master/data_integration/execution.py#L109
We just ran into this scenario: an incremental load job with create_xxx_data_table.sql + read_xxx.sql. create_xxx_data_table.sql depends on the schema + both involved sql files.
The problem happened on a bad merge which resulted in create_xxx_data_table.sql erroring after the DROP and before the CREATE TABLE was run successfully. This merge got reverted, so the file afterwards had the same checksum as before the merge.
The problem was that the table got DROPed, but according to the checksum the file didn't change, so it was not rerun.
IMO the logic should be changed so that the old checksum is removed from the cache if the process resulted in an error.
I can send a PR if this is the right approach (or any other which might be better).
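The proposed behavior can be sketched independently of mara's actual cache. In this sketch the cache is a plain dict, the function name run_if_changed is made up, and the run callback stands in for executing the SQL file.

```python
import hashlib
from typing import Callable, Dict


def run_if_changed(name: str, content: str, run: Callable[[], bool],
                   cache: Dict[str, str]) -> str:
    """Runs `run` unless `content` matches the cached checksum for `name`."""
    checksum = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if cache.get(name) == checksum:
        return "skipped"
    if run():
        cache[name] = checksum
        return "succeeded"
    cache.pop(name, None)  # the proposed change: forget the old checksum on error
    return "failed"


cache: Dict[str, str] = {}
good = "DROP TABLE t; CREATE TABLE t (id INT)"
bad = "DROP TABLE t; CREATE TABEL t (id INT)"  # the bad merge
history = [
    run_if_changed("create_table.sql", good, lambda: True, cache),
    run_if_changed("create_table.sql", bad, lambda: False, cache),
    run_if_changed("create_table.sql", good, lambda: True, cache),  # after the revert
]
```

Without the cache.pop line, the third call would be skipped because the reverted file has the same checksum as the first run, which is exactly the situation described above.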
I would like an option in the cli command flask mara_pipelines.ui.run to override variables.
Example:
I have an app.config.py file:
from enum import Enum

class ProcessingMode(Enum):
    FULL = 'full'
    INCREMENTAL = 'incremental'

def processing_mode() -> ProcessingMode:
    """The processing mode to be used"""
    return ProcessingMode.INCREMENTAL

def default_window_in_days() -> int:
    """The default refresh window for models when running an incremental sync."""
    return 90
Now I would like to override the config for a single pipeline execution.
Sample design:
# sets the processing_mode to FULL
flask mara_pipelines.ui.run --patch-config app.config.processing_mode=FULL
# sets the default_window_in_days to 180 days
flask mara_pipelines.ui.run --patch-config app.config.default_window_in_days=180
The parameter should maybe not be called --patch-config <str>. Terraform, for example, uses -var=<...>/-var-file=<...>
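How such an argument could be parsed and applied is sketched below. Nothing here exists in mara-pipelines: parse_patch and apply_patch are hypothetical helpers, the module is a stand-in created with types.ModuleType, and the value conversion simply reuses the type of the current return value (which would not be good enough for booleans, for example).

```python
import enum
import types
from typing import Any, Tuple


def parse_patch(argument: str) -> Tuple[str, str, str]:
    """Splits 'module.path.function=value' into (module, function, value)."""
    target, _, raw_value = argument.partition("=")
    module_name, _, function_name = target.rpartition(".")
    if not module_name or not function_name or not raw_value:
        raise ValueError(f"expected <module>.<function>=<value>, got {argument!r}")
    return module_name, function_name, raw_value


def apply_patch(module: types.ModuleType, function_name: str, raw_value: str) -> None:
    """Replaces a config function with one returning the converted value."""
    current = getattr(module, function_name)()
    if isinstance(current, enum.Enum):
        value: Any = type(current)[raw_value]  # look up the enum member by name
    else:
        value = type(current)(raw_value)  # e.g. int('180')
    setattr(module, function_name, lambda: value)


class ProcessingMode(enum.Enum):
    FULL = 'full'
    INCREMENTAL = 'incremental'


config = types.ModuleType("app_config")  # stand-in for the real config module
config.processing_mode = lambda: ProcessingMode.INCREMENTAL
config.default_window_in_days = lambda: 90

for argument in ("app_config.processing_mode=FULL",
                 "app_config.default_window_in_days=180"):
    _, function_name, raw_value = parse_patch(argument)
    apply_patch(config, function_name, raw_value)
```

A real implementation would import the module named in the argument instead of using a stand-in, and would have to apply the patch before the pipeline is loaded.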
It appears that when I pip install mara-pipelines from PyPI (instead of from GitHub as is done in the example project), the statics don't get installed and neither does mara_pipelines/ui/run_time_chart.sql, which causes a bunch of 404s and a 500 when trying to access the flask page:
➜ app git:(master) ✗ flask run
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET / HTTP/1.1" 302 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/common.css HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/node-page.css HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/timeline-chart.css HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/node-page.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/utils.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/run-time-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/system-stats-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/kolorwheel.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/timeline-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:58] "GET /pipelines/static/node-page.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/utils.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/run-time-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/system-stats-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/timeline-chart.js HTTP/1.1" 404 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/static/kolorwheel.js HTTP/1.1" 404 -
[2020-10-13 13:41:59,141] ERROR in app: Exception on /pipelines/run-time-chart [GET]
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/usr/local/lib/python3.7/site-packages/mara_page/acl.py", line 108, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/mara_pipelines/ui/run_time_chart.py", line 30, in run_time_chart
query = (pathlib.Path(__file__).parent / 'run_time_chart.sql').read_text()
File "/usr/local/lib/python3.7/pathlib.py", line 1221, in read_text
with self.open(mode='r', encoding=encoding, errors=errors) as f:
File "/usr/local/lib/python3.7/pathlib.py", line 1208, in open
opener=self._opener)
File "/usr/local/lib/python3.7/pathlib.py", line 1063, in _opener
return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/site-packages/mara_pipelines/ui/run_time_chart.sql'
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/run-time-chart HTTP/1.1" 500 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/pipeline-children-table HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/dependency-graph HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/system-stats HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/timeline-chart HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/last-runs-selector HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /pipelines/run-output-limited HTTP/1.1" 200 -
172.19.0.1 - - [13/Oct/2020 13:41:59] "GET /mara-app/navigation-bar HTTP/1.1" 200 -
The RunLogger class caches the Output event and writes it to a local variable (node_output). There are two issues with the current implementation which might cause high memory usage during pipeline execution:
The node_output dict is not freed after writing the output to the database. As a result, the whole pipeline output is cached in RAM until the RunLogger instance is disposed (!!)
K8s sends a SIGTERM followed by a waiting period and SIGKILL. The pipeline runner currently only handles SIGINT (Ctrl+C, via the exception Python raises for this key combo, which triggers the generic except: block in the main loop).
By default, Python seems to react to SIGTERM by terminating. This results in situations where the pipeline run is abruptly killed, which leaves open runs around.
How to reproduce:
kill <list of pids> or close the docker container)
-> the runs (data_integration_run) and node_runs (data_integration_node_run) are left open (end_time is NULL).
We use the runs to not start a second run and the node_runs to display prometheus metrics per main pipeline (last run, currently running,...). Leaving them open means someone has to close them manually :-(
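One possible direction, sketched with the standard signal module only: convert SIGTERM into the same exception that Ctrl+C raises, so that the existing cleanup path is taken. Whether this fits the runner's process structure is an assumption, and the function names are made up.

```python
import signal


def raise_keyboard_interrupt(signum, frame):
    """Signal handler that converts the signal into a KeyboardInterrupt."""
    raise KeyboardInterrupt(f"received signal {signum}")


def install_sigterm_handler() -> None:
    """Treats SIGTERM like Ctrl+C; must be called from the main thread."""
    signal.signal(signal.SIGTERM, raise_keyboard_interrupt)
```

install_sigterm_handler() would be called once before the main loop starts. Child processes started afterwards would need their own handling, and the cleanup has to finish within the grace period before the SIGKILL arrives.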
ReadFile.shell_command is called by ParallelReadFile.read_command with the master table as param target_table.
Postgres does not redirect insert/copy from master table to child tables automatically.
In ReadFile the correct child name is written to a local param.
l.140: target_table = self.target_table + '' + day.strftime("%Y%m%d")
The param is not passed to function parallel_task, it is only used to analyze the (empty) partition.
Possible solutions could be:
The node cost calculation here just takes the whole history, which I think is not a good idea. There should be a limit, e.g. the last 60 (successful) executions.
The perfect solution would probably be to use exponential smoothing, weighting older execution times less than newer ones (but I guess that would be a bit oversized).
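For reference, simple exponential smoothing over a limited history is only a few lines. This sketch is not the node cost code from mara; the function name, the smoothing factor alpha and the in-memory list of durations are assumptions.

```python
from typing import List, Optional


def smoothed_duration(durations: List[float], alpha: float = 0.3,
                      limit: int = 60) -> Optional[float]:
    """Exponentially smoothed duration; `durations` is ordered oldest first."""
    recent = durations[-limit:]  # only the last `limit` executions count
    if not recent:
        return None
    estimate = recent[0]
    for duration in recent[1:]:
        # newer values get weight alpha, the accumulated past gets 1 - alpha
        estimate = alpha * duration + (1 - alpha) * estimate
    return estimate
```

A larger alpha makes the estimate follow recent runs more closely, while a smaller one keeps more of the history.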
I plan to add a new HttpRequestCommand which shall run an HTTP/HTTPS command via curl.
It should keep the output clean (using option -s) and fail when the request returns an error code (option -f).
- (-X), using GET requests by default
- (-H)
- (--data-binary @-)
- http_request_command(...) in mara_pipelines.shell
- HttpRequestCommand in mara_pipelines.commands.http
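A rough sketch of what building such a curl call could look like. The signature below is an assumption, since the issue only gives http_request_command(...), and the function is written stand-alone rather than as part of mara_pipelines.shell.

```python
import shlex
from typing import Dict, Optional


def http_request_command(url: str, http_method: str = "GET",
                         headers: Optional[Dict[str, str]] = None,
                         body_from_stdin: bool = False) -> str:
    """Builds a curl command line; the signature is an assumption."""
    parts = ["curl", "-s", "-f", "-X", http_method]  # silent, fail on HTTP errors
    for name, value in (headers or {}).items():
        parts += ["-H", f"{name}: {value}"]
    if body_from_stdin:
        parts += ["--data-binary", "@-"]  # read the request body from stdin
    parts.append(url)
    return " ".join(shlex.quote(part) for part in parts)


command = http_request_command("https://example.com/api", "POST",
                               headers={"Content-Type": "application/json"},
                               body_from_stdin=True)
```

Quoting every part with shlex.quote keeps header values and URLs with special characters from being interpreted by the shell.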
I assume this example code should run without error on my system:
from data_integration.commands.bash import RunBash
from data_integration.pipelines import Pipeline, Task
from data_integration.ui.cli import run_pipeline, run_interactively

pipeline = Pipeline(
    id='demo',
    description='A small pipeline that demonstrates the interplay between pipelines, tasks and commands')
pipeline.add(Task(id='ping_localhost', description='Pings localhost',
                  commands=[RunBash('ping -c 3 localhost')]))
sub_pipeline = Pipeline(id='sub_pipeline', description='Pings a number of hosts')
for host in ['google', 'amazon', 'facebook']:
    sub_pipeline.add(Task(id=f'ping_{host}', description=f'Pings {host}',
                          commands=[RunBash(f'ping -c 3 {host}.com')]))
sub_pipeline.add_dependency('ping_amazon', 'ping_facebook')
sub_pipeline.add(Task(id='ping_foo', description='Pings foo',
                      commands=[RunBash('ping foo')]), ['ping_amazon'])
pipeline.add(sub_pipeline, ['ping_localhost'])
pipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',
                  commands=[RunBash('sleep 2')]), ['sub_pipeline'])

run_pipeline(pipeline)
Here's the output of the script:
$ python historical.py
Traceback (most recent call last):
File "C:\Users\david\Anaconda3\lib\multiprocessing\context.py", line 190, in get_context
ctx = _concrete_contexts[method]
KeyError: 'fork'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "historical.py", line 28, in <module>
run_pipeline(pipeline)
File "C:\Users\david\Anaconda3\lib\site-packages\data_integration\ui\cli.py", line 46, in run_pipeline
for event in execution.run_pipeline(pipeline, nodes, with_upstreams, interactively_started=interactively_started):
File "C:\Users\david\Anaconda3\lib\site-packages\data_integration\execution.py", line 48, in run_pipeline
multiprocessing_context = multiprocessing.get_context('fork')
File "C:\Users\david\Anaconda3\lib\multiprocessing\context.py", line 238, in get_context
return super().get_context(method)
File "C:\Users\david\Anaconda3\lib\multiprocessing\context.py", line 192, in get_context
raise ValueError('cannot find context for %r' % method)
ValueError: cannot find context for 'fork'
If it matters, I've also provisioned a PostgreSQL instance for mara:
import mara_db.auto_migration
import mara_db.config
import mara_db.dbs
mara_db.config.databases \
= lambda: {'mara': mara_db.dbs.PostgreSQLDB(host='localhost', user='postgres', password = '', database='etl_mara')}
mara_db.auto_migration.auto_discover_models_and_migrate()
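The traceback shows that the 'fork' start method is not available on this platform (Windows). Detecting that up front is straightforward with the standard library, as in the sketch below; pick_start_method is a made-up helper. Note that this only avoids the ValueError: code that relies on fork semantics, e.g. lambdas as process targets, may still not work under 'spawn'.

```python
import multiprocessing


def pick_start_method(preferred: str = "fork") -> str:
    """Returns `preferred` when the platform supports it, else the default."""
    available = multiprocessing.get_all_start_methods()  # default method comes first
    return preferred if preferred in available else available[0]


context = multiprocessing.get_context(pick_start_method())
```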
For a few releases now, run_pipeline prints various stack traces for KeyboardInterrupt when hitting Ctrl+C:
★ 0s
utils: ★ 4.9s
utils / initialize_utils: ★ 4.7s
...
^CProcess system_statistics:
Traceback (most recent call last):
File "/usr/local/Cellar/python@3.8/3.8.1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/Cellar/python@3.8/3.8.1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/execution.py", line 61, in <lambda>
target=lambda: system_statistics.generate_system_statistics(event_queue), name='system_statistics')
File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/logging/system_statistics.py", line 86, in generate_system_statistics
time.sleep(period)
KeyboardInterrupt
Traceback (most recent call last):
File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/execution.py", line 348, in run_pipeline
_notify_all(event)
File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/execution.py", line 340, in _notify_all
raise e
File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/execution.py", line 333, in _notify_all
runlogger.handle_event(event)
File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/data-integration/data_integration/logging/run_log.py", line 124, in handle_event
with mara_db.postgresql.postgres_cursor_context('mara') as cursor: # type: psycopg2.extensions.cursor
File "/usr/local/Cellar/python@3.8/3.8.1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/packages/mara-db/mara_db/postgresql.py", line 18, in postgres_cursor_context
connection = psycopg2.connect(dbname=db.database, user=db.user, password=db.password,
File "/Users/mloetzsch/Projects/project-a/mara-example-project-2/.venv/lib/python3.8/site-packages/psycopg2/__init__.py", line 126, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
KeyboardInterrupt
Cleaned up open runs/node_runs (run_id = 35)
While this is not a real problem, it would be nice if we could print something less heavy, e.g. `KeyboardInterrupt`.
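One way to get the lighter output is to catch the exception at the top of each process target and print a single line. The wrapper below is a generic sketch with a made-up name, not a patch for run_pipeline.

```python
import sys
from typing import Callable


def run_quietly(target: Callable[[], None]) -> int:
    """Runs `target`; on Ctrl+C prints one line instead of a stack trace."""
    try:
        target()
    except KeyboardInterrupt:
        print("KeyboardInterrupt", file=sys.stderr)
        return 130  # conventional exit status for termination by SIGINT
    return 0
```

The same wrapper would have to be applied inside child processes such as system_statistics, because each process prints its own traceback.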