
neurosys-pl / magda

Library for building Modular and Asynchronous Graphs with Directed and Acyclic edges (MAGDA)

Home Page: https://neurosys-pl.github.io/magda/

License: Apache License 2.0

Languages: Python 96.04%, JavaScript 2.61%, CSS 0.88%, Dockerfile 0.47%
Topics: acyclic-graphs, asynchronous, asyncio, directed-graphs, magda, modular, parallelization, pipeline, python

magda's People

Contributors

danielpopek, jamebs, lukaszsus, mmaslankowska-neurosys, neurosys-public, p-mielniczuk


magda's Issues

Migrate to Result Object pattern

What?
We would like to use the Result Object pattern to pass results/errors throughout the pipelines.

Why?
Currently, any error during the pipeline lifecycle can break the whole process. This makes the pipelines unreliable.

How?
By changing the way results are propagated through the pipelines. Module.Result should have an additional field error, which will be a placeholder for any exception raised during the run.

Acceptance Criteria:

  • An exception raised for one request (run) doesn't impact the other requests (runs)
  • Module.Result has a new field error keeping any potential exceptions
  • When a module completes successfully, its output is set in Module.Result.result and Module.Result.error is kept empty
  • When a module fails (raises an exception), the exception is handled by the MAGDA Graph and assigned to Module.Result.error, while Module.Result.result is kept empty. Such a result is then passed to the next part of the pipeline.
  • ModuleRuntime.run is invoked by the Graph only when all previous Module.Result objects (not only those from dependent modules) are valid (i.e. Module.Result.error is None)
  • The Pipeline returns Tuple[Optional[Dict[str, Any]], Optional[Exception]]. The 1st value is the dictionary of all exposed results (if the request was processed successfully) or None (when an exception occurred during processing). The 2nd value is the exception raised during processing, or None if the request was processed successfully (see the sketch below).
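
A minimal sketch of the data shapes implied by these criteria - only the field names mentioned above are taken from the issue, everything else (the class name, the valid helper) is an assumption:

from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

@dataclass
class Result:
    result: Optional[Any] = None        # the module output on success
    error: Optional[Exception] = None   # the exception captured during the run, if any

    @property
    def valid(self) -> bool:
        # a result can be consumed by downstream modules only when no exception was captured
        return self.error is None

# The pipeline return type from the last criterion:
PipelineOutput = Tuple[Optional[Dict[str, Any]], Optional[Exception]]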

Update ray

What?
We would like to update the ray package.

Why?
To eliminate warnings:

FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI,
autoscaler, and dashboard will only be usable via pip install 'ray[default]'.
Please update your install command.

How?
By changing the dependency to ray[default].
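
For illustration only (the exact file where the dependency is declared is an assumption), the change amounts to:

# wherever the project pins its dependencies (e.g. setup.py install_requires - assumed):
install_requires = [
    'ray[default]',   # previously: 'ray'
]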

Update graphs to support asyncio coroutines

What?

We would like to change the behavior of module.run to be asynchronous (asyncio). This change will imply modifications to the graph executor. To provide backward compatibility, both coroutines and plain functions should be supported as module.run implementations.

Why?

The whole graph is currently sequential and blocking. If a module performs an IO operation (which could be awaited), the remaining modules are blocked by that operation. This update will open up more optimization possibilities.

How?

By updating the graph, graph executor and sequential pipeline to be based on asyncio coroutines. However, the module.run implementation can be either a standard function (def run()) or a coroutine (async def run()). This change shouldn't require updating existing module.run implementations.
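
A minimal sketch (not MAGDA's actual executor code) of how the executor could accept both a plain function and a coroutine as module.run:

import inspect

async def invoke_run(module, data, **kwargs):
    outcome = module.run(data, **kwargs)
    if inspect.isawaitable(outcome):
        # async def run(...) returned an awaitable - await it
        outcome = await outcome
    # a plain def run(...) already returned the final value
    return outcome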

Migrate documentation to Github Pages

What?

We would like to move the documentation from the wiki to GitHub Pages (built with Sphinx).

Why?

Documentation should be pinned to a specific version of the library. Currently, we cannot keep documentation for the latest deployed version together with documentation describing new, future features.

How?

By migrating the wiki content into a new docs folder. The documentation will be generated (with Sphinx) from those files and automatically uploaded to GitHub Pages. The new documentation should reflect all versions of the library.

Add backpressure mechanism

What?

We would like to add the ability to limit the number of processed and blocked results.

Why?

This will eliminate overflow errors when the first group processes much faster than the second one.

How?

By adding a new group option which sets an upper limit on the number of blocked results. When a group reaches the limit, it must wait until the number of blocked results drops before it begins processing a new request. The option/mechanism will be disabled by default.
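
A rough sketch of the mechanism with asyncio (the class and option names below are hypothetical, not MAGDA's API):

import asyncio
from typing import Optional

class Group:
    def __init__(self, backpressure_limit: Optional[int] = None):
        # None keeps the mechanism disabled (current behaviour)
        self._slots = asyncio.Semaphore(backpressure_limit) if backpressure_limit else None

    async def process(self, request):
        if self._slots is None:
            return await self._run(request)
        async with self._slots:
            # a new request is accepted only after a blocked result has been consumed
            return await self._run(request)

    async def _run(self, request):
        ...  # actual group processing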

Optional expose parameter added to module's config

What?
I'd like to be able to optionally determine in the YAML config whether a module is exposed. This setting should override the configuration in the decorator (if one exists).

Why?
The modification should increase the reusability of modules, which could be exposed in some pipelines and not in others.

How?
By adding an expose field to the YAML config and parsing it in ConfigReader. expose should accept a boolean or a string with a concrete name.

Examples:

modules:
- name: module-a1
  type: ModuleA
  expose: partial-results

- name: module-b1
  type: ModuleA
  depends_on:
  - module-a1

or

modules:
- name: module-a1
  type: ModuleA
  expose: True

- name: module-b1
  type: ModuleA
  depends_on:
  - module-a1

should have higher priority than:

@accepts(ModuleA)
@register('ModuleB')
@expose('articles')
@finalize
class ModuleB(Module.Runtime):
   ...

Testing modules

What?

We would like to create a pattern for testing modules (real ones with business logic). However, it's impractical to mock a module's internal methods (like build) or to wrap it into a complete pipeline. So, we need a new class - ModuleTestingWrapper.

Why?

To provide a reliable way of unit-testing real modules with a minimal baseline.

How?

By creating a new module magda.testing with all testing-related classes and mocks. The ModuleTestingWrapper should behave like a minimal SequentialPipeline, i.e. without graphs, dependency checking, etc., because it will always wrap only one module. However, it should support building a module and its whole life cycle (bootstrap-run-teardown), even if these steps are mocked.

The example code:

@accept(RawData)
@produce(ProcessedData)
@finalize
class MyModule(Module.Runtime):
    def bootstrap(self, **kwargs):
        self.do_something_with(self.parameters)

    def run(self, data: Module.ResultSet, **kwargs):
        return ProcessedData(data.get(RawData)[::-1])

import pytest

from magda.testing import ModuleTestingWrapper

async def test_should_not_compile():
    module = MyModule('my-module-a').set_parameters({}) # Missing required params

    with pytest.raises(Exception):
        await ModuleTestingWrapper(module).build() # it mimics pipeline.build()

async def test_should_process_correctly():
    module = MyModule('my-module-a').set_parameters(...)
    mock = await ModuleTestingWrapper(module).build()

    result = await mock.run(RawData('xyz'))
    assert result == ProcessedData('zyx')

Support getting Result via module's name

What?

I'd like to get/check results by a module's name (currently only the exposed name and references to Python classes are supported).

Why?

To cover more use cases of Module.ResultSet.

How?

Example:

modules:
- name: module-a1
  type: ModuleA

- name: module-a2
  type: ModuleA

- name: module-b
  type: ModuleB
  parameters:
    ref: module-a1
  depends_on:
  - module-a1
  - module-a2

@accepts(ModuleA)
@register('ModuleB')
@finalize
class ModuleB(Module.Runtime):
    def run(self, data: Module.ResultSet, *args, **kwargs):
        ref = self.parameters.get('ref')  # := module-a1
        ...
        # Currently:
        x = next((d.result for d in data.collection if d.name == ref))
        # It should be:
        x = data.get(ref)
        ...

Extend ReadMe

What?

We would like to extend the readme file.

Why?

To improve understanding of the library and make it easier to start working with it.

How?

By adding a Quick start section with easily reproducible minimal examples:

  1. with one Module and SequentialPipeline
  2. with adding an Interface to the previous example
  3. with recreating the pipeline from the YAML file

Partially missing logs when using ParallelPipeline and MagdaLogger with MagdaLogger.Config.Output.LOGGING

Problem description:

In the ParallelPipeline, Ray seems to block logs when the MagdaLogger.Config.Output.LOGGING output is used, so logs are partially missing. The problem doesn't occur when logging to standard output.

How to reproduce?
Run the file examples/example2.py with the default MagdaLoggerConfig changed so that it uses MagdaLogger.Config.Output.LOGGING.

Comparison of behavior:

  • Using ParallelPipeline + standard output: (screenshot - logs complete)

  • Using ParallelPipeline + logger output: (screenshot - logs partially missing)

ConfigReader regex doubt

Hello,

While reading the magda code, I have noticed a line that probably doesn't work as expected. The line is magda/config_reader.py:71:

declared_variables = list(set(re.findall(r'\${(\w+)}', config_str)))

I am quite sure that this regex also finds parameters that are commented out (with #), and I think this is undesired behavior.
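
For illustration, a small reproduction and one possible (naive) fix - stripping comments before scanning; this is only a suggestion, not the project's decision:

import re

config_str = """
modules:
  - name: module-a
    parameters:
      threshold: ${THRESHOLD}
#     legacy: ${OLD_PARAM}    <- commented out, yet still matched today
"""

# current behaviour: the commented-out placeholder is reported as well
print(set(re.findall(r'\${(\w+)}', config_str)))    # {'THRESHOLD', 'OLD_PARAM'}

# possible fix: drop comments first (naive - ignores '#' inside quoted strings)
uncommented = re.sub(r'#.*', '', config_str)
print(set(re.findall(r'\${(\w+)}', uncommented)))   # {'THRESHOLD'}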

Unused key in dictionary

Hello,

I have noticed that the dictionary key declared in test/test_config_reader.py:85 is probably unused:

'BIAS': 0.1,

The test title does not suggest that it was intentionally implemented this way. It is not a big issue - maybe we would like to cover this case in another test or just leave it as it is now :-)

Add logging system

What?

We would like to add an optional logging system which informs about all module/pipeline-related events, e.g. a module's bootstrap/teardown or the start of request processing.

Why?

To improve debugging and add an easily integrable verbose mode.

How?

By adding an option which enables/disables verbose mode. MAGDA will automatically print out logging messages while in verbose mode. The target (console/file/anything else) will be configurable.

Unmentioned requirement for loading a pipeline from a config file

There is a non-trivial requirement if you would like to use the functionality of loading a magda pipeline from a config file. Every magda Module.Runtime has to be registered with the ModuleFactory.register function. A convenient alternative is the @register decorator. However, this @register decorator has to be invoked before loading the pipeline. A very easy workaround is to import all modules that you have implemented with a single import:

from modules import *

This assumes that you import all modules in the modules/__init__.py file, for example as sketched below.
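
A tiny sketch of such an __init__.py (the module names are hypothetical):

# modules/__init__.py - re-export every implemented module so that
# `from modules import *` evaluates their @register decorators
from .module_a import ModuleA
from .module_b import ModuleB

__all__ = ['ModuleA', 'ModuleB']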

I propose two small changes. The first is to mention this suggestion in the documentation here or in this section. The second, which could be very useful, is to extend the error message with a question: 'Didn't you forget to import this module with the @register decorator?'

Support different logging levels

What?
We would like to be able to use MAGDA Loggers to print warnings, errors and debug information.

Why?
To make logs more readable.

How?

  1. By adding a new formatter showing the log level, e.g.:
[2021-08-16 12:05:14.366] [INFO] ParallelPipeline (Example) ModuleA (m1:g2) [BOOTSTRAP]
[2021-08-16 12:05:16.641] [WARNING] ParallelPipeline (Example) ModuleA (m1:g2) Missing parameter "a" - using default value "None"

Different colors should be used for the formatter above and for the messages (last part) at different levels.

  2. By adding new methods to the logger: debug(), warning(), error() and critical(), each printing at a specific level (sketched below).
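
A rough sketch of point 2 - a logger with level-specific methods (the class shape below is an assumption for illustration, not MAGDA's actual implementation):

class Logger:
    def _log(self, level: str, message: str) -> None:
        # the real formatter would also add the timestamp, pipeline and module info
        print(f'[{level}] {message}')

    def debug(self, message: str) -> None:
        self._log('DEBUG', message)

    def warning(self, message: str) -> None:
        self._log('WARNING', message)

    def error(self, message: str) -> None:
        self._log('ERROR', message)

    def critical(self, message: str) -> None:
        self._log('CRITICAL', message)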

Parameterize config file

What?

We would like to parameterize the config files of pipeline declarations (similar to parameterizing docker-compose.yml with environment variables).

Why?

Sometimes you would like to run the same pipeline with just a different parameter. In this situation you have to either modify the config or duplicate it (which is impractical).

How?

By updating ConfigReader to accept an optional dictionary of parameters and replacing all placeholders within the config file with the appropriate values from the provided dictionary.
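
A sketch of the idea, reusing the ${...} placeholder syntax that already appears in the ConfigReader regex; the parameters keyword argument of ConfigReader.read is an assumption:

import re

config_str = """
modules:
  - name: classifier
    type: ModuleA
    parameters:
      threshold: ${THRESHOLD}
"""

parameters = {'THRESHOLD': 0.75}

# the substitution itself could be as simple as:
resolved = re.sub(r'\${(\w+)}', lambda m: str(parameters[m.group(1)]), config_str)

# ...which ConfigReader could perform internally, e.g. (argument name assumed):
# pipeline = await ConfigReader.read(config_str, ModuleFactory, parameters=parameters)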

Support custom pipeline name in ConfigReader

What?
We would like to provide a custom name for pipelines created by ConfigReader (via code or configuration file).

Why?
Currently, only pipelines created manually can have a defined name.

How?

  1. By handling the name provided in the configuration file:
name: MyPipeline
modules: ...
groups: ...
shared_parameters: ...

In the above example, the pipeline will be called MyPipeline.

  2. By handling the name provided as an argument for ConfigReader:
with open(config_file, 'r') as config:
    self.pipeline = await ConfigReader.read(
        config,
        ModuleFactory,
        name="SuperPipeline",
        **rest_kwargs,
    )

In the above example, the pipeline will be called SuperPipeline.

  3. In case the names are provided both in the configuration file and as an argument, the name given as an argument has higher priority. So in the above example, the name SuperPipeline will be used instead of MyPipeline.
