neurosys-pl / magda
Library for building Modular and Asynchronous Graphs with Directed and Acyclic edges (MAGDA)
Home Page: https://neurosys-pl.github.io/magda/
License: Apache License 2.0
What?
We would like to use the Result Object pattern to pass results/errors throughout the pipelines.
Why?
Currently, any error during the pipeline lifecycle can break the whole process. This makes the pipelines unreliable.
How?
By changing the way results are propagated through the pipelines. The Module.Result should have an additional field, error, which will be a placeholder for any potential exceptions raised during the run.
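The proposed shape of Module.Result could be sketched as follows. This is a minimal illustration, not MAGDA's actual definition: only the error field comes from this proposal, while the result and name fields and the is_valid helper are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional


# Hypothetical sketch of Module.Result extended with an `error` field.
@dataclass
class Result:
    result: Optional[Any] = None        # module output on success, None on failure
    error: Optional[Exception] = None   # raised exception on failure, None on success
    name: Optional[str] = None          # producing module's name (assumed field)

    @property
    def is_valid(self) -> bool:
        # A result is valid when no exception was captured
        return self.error is None
```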
Acceptance Criteria:
- Module.Result has a new field error keeping any potential exceptions.
- On success, the output is stored in Module.Result.result and Module.Result.error is kept empty.
- On failure, the raised exception is stored in Module.Result.error and Module.Result.result is kept empty. Such a result is then passed to the next part of the pipeline.
- ModuleRuntime.run is invoked by Graph only when all previous Module.Result (not only from dependent modules) are valid (i.e. Module.Result.error is None).
- The pipeline returns Tuple[Optional[Dict[str, Any]], Optional[Exception]]. The 1st value is the dictionary of all exposed results (if the request was processed successfully) or None (when an exception occurred during the processing). The 2nd value is the exception raised during the processing, or None if the request was processed successfully.

What?
We would like to update the ray package.
Why?
To eliminate warnings:
FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI,
autoscaler, and dashboard will only be usable via pip install 'ray[default]'.
Please update your install command.
How?
By changing the dependency to ray[default].
What?
We would like to change the behavior of module.run to be asynchronous (asyncio). This change will imply modifications in the graph executor. To provide backward compatibility, both coroutines and plain functions should be supported as module.run implementations.
Why?
The whole graph is currently sequential and blocking. If a module performs an IO operation (which could be awaited), the remaining modules are blocked by this operation. This update will give more optimization possibilities.
How?
By updating the graph, the graph executor and the sequential pipeline to be based on asyncio coroutines. However, a module.run implementation can be either a standard function (def run()) or a coroutine (async def run()). This change shouldn't imply updating existing module.run implementations.
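The backward-compatible dispatch described above can be sketched in a few lines. This is a minimal illustration, not MAGDA's actual executor code; the function and module names are made up.

```python
import asyncio
import inspect


# Sketch: the executor awaits module.run when it is a coroutine function
# and calls it directly when it is a plain function.
async def invoke_run(run, *args, **kwargs):
    if inspect.iscoroutinefunction(run):
        return await run(*args, **kwargs)   # async def run()
    return run(*args, **kwargs)             # def run()


def legacy_run(data):
    return data.upper()


async def modern_run(data):
    await asyncio.sleep(0)  # e.g. an awaitable IO operation
    return data.upper()
```

Both styles then produce the same result through the single dispatch point.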
What?
We would like to move the documentation from the wiki to GitHub Pages (built with Sphinx).
Why?
Documentation should be pinned to a specific version of the library. Currently, we cannot keep documentation for the latest deployed version and for new future features together.
How?
By migrating the wiki content into a new docs folder. The documentation will be generated (with Sphinx) from those files and automatically uploaded to GitHub Pages. The new documentation should reflect all versions of the library.
What?
We would like to add the ability to limit the number of processed and blocked results.
Why?
This will eliminate overflow errors when the first group processes much faster than the second one.
How?
By adding a new group option which will set an upper limit on blocked results. When the group reaches the limit, it must wait until the number of blocked results is reduced before processing a new request. The option/mechanism will be disabled by default.
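One way to sketch such a limit is an asyncio.Semaphore acting as back-pressure. The class name and API below are assumptions for illustration, not the actual MAGDA option.

```python
import asyncio


# Hypothetical sketch: a group waits on acquire() before taking a new
# request and calls release() when a blocked result is consumed.
class BlockedResultsLimit:
    def __init__(self, limit=None):
        # Disabled by default, as the issue proposes
        self._semaphore = asyncio.Semaphore(limit) if limit else None

    async def acquire(self):
        if self._semaphore is not None:
            await self._semaphore.acquire()

    def release(self):
        if self._semaphore is not None:
            self._semaphore.release()


async def demo():
    limit = BlockedResultsLimit(limit=2)
    await limit.acquire()   # 1st blocked result
    await limit.acquire()   # 2nd blocked result - the limit is reached
    limit.release()         # one result consumed, a new request may start
    await limit.acquire()
    return True
```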
What?
I'd like to optionally determine in the YAML config whether a module is exposed or not. This setting should override the configuration in the decorator (if one exists).
Why?
The modification should increase the reusability of modules, which could be used in some pipelines as exposable and in others as non-exposable.
How?
By adding an expose field to the YAML config and parsing it in ConfigReader. Expose should accept a boolean or a string with a concrete name.
Examples:
modules:
  - name: module-a1
    type: ModuleA
    expose: partial-results
  - name: module-b1
    type: ModuleA
    depends_on:
      - module-a1
or
modules:
  - name: module-a1
    type: ModuleA
    expose: True
  - name: module-b1
    type: ModuleA
    depends_on:
      - module-a1
The YAML setting should have higher priority than:
@accepts(ModuleA)
@register('ModuleB')
@expose('articles')
@finalize
class ModuleB(Module.Runtime):
    ...
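The intended precedence can be sketched as follows. The function name and the fallback behavior for expose: True are assumptions for illustration, not MAGDA's actual ConfigReader logic.

```python
# Hypothetical sketch of resolving the exposed name: the YAML `expose`
# field, when present, overrides the @expose decorator.
def resolve_expose(yaml_expose, decorator_expose, module_name):
    if yaml_expose is None:
        return decorator_expose          # YAML says nothing - keep the decorator
    if yaml_expose is False:
        return None                      # explicitly non-exposable
    if yaml_expose is True:
        # Assumption: fall back to the decorator name or the module name
        return decorator_expose or module_name
    return yaml_expose                   # concrete name from YAML wins
```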
What?
We would like to create a pattern for testing modules (real ones with business logic). However, it's impractical to mock a module's internal methods (like build) or to wrap the module into a complete pipeline. So, we need a new class: ModuleTestingWrapper.
Why?
To give a reliable way of unit-testing real modules with a minimal baseline.
How?
By creating a new module magda.testing with all testing-related classes and mocks. The ModuleTestingWrapper should behave like a minimal SequentialPipeline, i.e. without graphs, dependency checking, etc., because it will always wrap only one module. However, it should support building modules and their whole life-cycle (bootstrap-run-teardown, even if they're mocked).
The example code:
@accept(RawData)
@produce(ProcessedData)
@finalize
class MyModule(Module.Runtime):
    def bootstrap(self, **kwargs):
        self.do_something_with(self.parameters)

    def run(self, data: Module.ResultSet, **kwargs):
        return ProcessedData(data.get(RawData)[::-1])
import pytest

from magda.testing import ModuleTestingWrapper


async def test_should_not_compile():
    module = MyModule('my-module-a').set_parameters({})  # Missing required params
    with pytest.raises(Exception):
        await ModuleTestingWrapper(module).build()  # it mimics pipeline.build()


async def test_should_process_correctly():
    module = MyModule('my-module-a').set_parameters(...)
    mock = await ModuleTestingWrapper(module).build()
    result = await mock.run(RawData('xyz'))
    assert result == ProcessedData('zyx')
What?
I'd like to get/check results by a module's name (currently, only the exposed name and references to Python classes are supported).
Why?
To cover more cases of using Module.ResultSet.
Example:
modules:
  - name: module-a1
    type: ModuleA
  - name: module-a2
    type: ModuleA
  - name: module-b
    type: ModuleB
    parameters:
      ref: module-a1
    depends_on:
      - module-a1
      - module-a2
@accepts(ModuleA)
@register('ModuleB')
@finalize
class ModuleB(Module.Runtime):
    def run(self, data: Module.ResultSet, *args, **kwargs):
        ref = self.parameters.get('ref')  # := module-a1
        ...
        # Currently:
        x = next((d.result for d in data.collection if d.name == ref))
        # It should be:
        x = data.get(ref)
        ...
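The requested lookup can be sketched roughly like this. The Result and ResultSet shapes below are simplified assumptions for illustration, not MAGDA's actual classes.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Result:
    result: Any
    name: str


class ResultSet:
    def __init__(self, collection: List[Result]):
        self.collection = collection

    def get(self, ref) -> Optional[Any]:
        if isinstance(ref, str):
            # Proposed: look the result up by the module's name
            return next((r.result for r in self.collection if r.name == ref), None)
        # Simplified stand-in for the existing class-based lookup
        return next((r.result for r in self.collection if isinstance(r.result, ref)), None)
```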
What?
We would like to extend the readme file.
Why?
To improve your understanding of the library and let you start working with it easily.
How?
By adding a Quick start section with easily reproducible minimal examples:
- Module and SequentialPipeline
- adding Interface to the previous example

Problem description:
In the Parallel pipeline, Ray seems to block logs when using the MagdaLogger.Config.Output.LOGGING output. Logs are partially missing. The problem doesn't exist when logging to standard output.
How to reproduce?
Run the file examples/example2.py with the default MagdaLoggerConfig changed so that it uses MagdaLogger.Config.Output.LOGGING.
Comparison of behavior:
Hello,
Reading the code of magda, I have noticed a line that probably doesn't work as expected. The line is magda/config_reader.py:71:
declared_variables = list(set(re.findall(r'\${(\w+)}', config_str)))
I am quite sure that this regex also finds parameters that are commented out (with #), and I think this is undesired behavior.
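A quick check confirms the suspicion: the regex happily matches placeholders in commented lines. The config content below is made up for illustration; the regular expression is the exact one from config_reader.py.

```python
import re

config_str = """
modules:
  - name: module-a1
    parameters:
      threads: ${THREADS}
#     bias: ${BIAS}        # commented out, but still matched
"""

# The exact expression from magda/config_reader.py:71
declared_variables = list(set(re.findall(r'\${(\w+)}', config_str)))
```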
Hello,
I have noticed that the dictionary key declared in test/test_config_reader.py:85 is probably unused:
'BIAS': 0.1,
The test title does not suggest that it is intentionally implemented this way. It is not a big issue, and maybe we would like to cover this case in another test or leave it as it is now :-)
What?
We would like to add an optional logging system which will report all module/pipeline-related events, e.g. a module's bootstrap/teardown or the beginning of request processing.
Why?
To improve debugging and add an easily integrable verbose mode.
How?
By adding an option which will enable/disable verbose mode. MAGDA will automatically print out logging messages while in verbose mode. The target (console/file/anything else) will be configurable.
There is a non-trivial requirement if you would like to use the functionality of loading magda's pipeline from the config file. Every magda Module.Runtime has to be registered with the ModuleFactory.register function. A convenient alternative is the @register decorator. However, this @register decorator has to be invoked somehow before loading the pipeline. A very easy workaround is to import all modules that you have implemented with one line of import:
from modules import *
This assumes that you imported all modules in modules/__init__.py
file.
I propose two small changes. The first is to mention this suggestion in the documentation, here or in this section. The second thing, which could be very useful, is to extend the error message with a question: 'Did you forget to import this module decorated with @register?'
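The second change could look roughly like this. The factory internals below are a guess for illustration, not MAGDA's actual ModuleFactory implementation.

```python
# Hypothetical sketch of a more helpful ModuleFactory error message.
class ModuleFactory:
    _registry = {}

    @classmethod
    def register(cls, name, module_cls):
        cls._registry[name] = module_cls

    @classmethod
    def create(cls, name):
        if name not in cls._registry:
            raise KeyError(
                f"Unknown module type: '{name}'. "
                "Did you forget to import this module decorated with @register?"
            )
        return cls._registry[name]
```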
What?
We would like to be able to use MAGDA Loggers to print warnings, errors and debug information.
Why?
To make logs more readable.
How?
1. By including the logging level in the message formatter, e.g.:
[2021-08-16 12:05:14.366] [INFO] ParallelPipeline (Example) ModuleA (m1:g2) [BOOTSTRAP]
[2021-08-16 12:05:16.641] [WARNING] ParallelPipeline (Example) ModuleA (m1:g2) Missing parameter "a" - using default value "None"
Different colors should be used for the above formatter and the messages (the last part) at different levels.
2. By adding new methods to the logger: debug(), warning(), error() and critical(), printing at the specific level.
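Both points can be sketched with Python's standard logging module. The formatter layout mimics the example above; this is an illustration of the idea, not MAGDA's own logger code.

```python
import logging

records = []


class ListHandler(logging.Handler):
    # Collects formatted records so the example is self-checking
    def emit(self, record):
        records.append(self.format(record))


logger = logging.getLogger('magda.example')
handler = ListHandler()
# 1. The level is included in the formatter, similar to "[...] [WARNING] ..."
handler.setFormatter(logging.Formatter('[%(asctime)s] [%(levelname)s] %(message)s'))
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

# 2. Level-specific methods
logger.debug('ModuleA (m1:g2) [BOOTSTRAP]')
logger.warning('Missing parameter "a" - using default value "None"')
```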
What?
We would like to parameterize the config files of pipeline declarations (similar to parameterizing docker-compose.yml with environment variables).
Why?
Sometimes you would like to run the same pipeline with just a different parameter. In this situation you have to modify the config or duplicate it (which is impractical).
How?
By updating ConfigReader to accept an optional dictionary of parameters and replacing all placeholders within the config file with the appropriate values from the provided dictionary.
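The substitution itself can be sketched in a few lines. The function name and behavior are assumptions; the ${...} placeholder syntax matches the regex already used in config_reader.py.

```python
import re


def apply_parameters(config_str, parameters):
    # Replace every ${NAME} placeholder with the value from the dictionary
    return re.sub(
        r'\${(\w+)}',
        lambda match: str(parameters[match.group(1)]),
        config_str,
    )


config = """
modules:
  - name: module-a1
    parameters:
      threads: ${THREADS}
"""
```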
What?
We would like to provide a custom name for pipelines created by ConfigReader (via code or configuration file).
Why?
Currently, only pipelines created manually can have a defined name.
How?
name: MyPipeline
modules: ...
groups: ...
shared_parameters: ...
In the above example, the pipeline will be called MyPipeline.
with open(config_file, 'r') as config:
    self.pipeline = await ConfigReader.read(
        config,
        ModuleFactory,
        name="SuperPipeline",
        **rest_kwargs,
    )
In the above example, the pipeline will be called SuperPipeline.
If both the configuration file and the name argument define a name, SuperPipeline will be used instead.