Defines the basic transform protocol to be used in the streamed source to sink concept. Also provides some generic transformer implementations such as `Compose` or `Result`.
The eotransform package defines `Source`, `Transform`, and `Sink` protocols to facilitate the creation of modularised processing pipelines. Adhering to a common contract makes it easier to mix and match processing blocks, allowing for better code reuse and more flexible pipelines.
We also provide a `streamed_process` function, which you can use to hide I/O latency when implementing these protocols. The package also provides some common transformations and sinks, such as `Compose` or `Result`.
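To make the composition idea concrete, here is a minimal, self-contained sketch of a `Compose`-style transformer. It deliberately does not import eotransform, and the name `ComposeSketch` is illustrative only; the library's actual `Compose` may differ in signature and behaviour.

```python
from typing import Callable


class ComposeSketch:
    """Chains callables left to right: ComposeSketch(f, g)(x) == g(f(x))."""

    def __init__(self, *transforms: Callable):
        self.transforms = transforms

    def __call__(self, x):
        # Feed the output of each transform into the next one.
        for transform in self.transforms:
            x = transform(x)
        return x


# Usage: double the input, then add one.
double_then_inc = ComposeSketch(lambda x: x * 2, lambda x: x + 1)
assert double_then_inc(3) == 7
```

Because each stage is itself a plain callable, a composed transformer satisfies the same contract as its parts and can be dropped into any pipeline slot.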
```shell
pip install eotransform
```
This example shows how to implement the `Transformer` protocol for a simple multiplication:
```python
class Multiply(Transformer[int, int]):
    def __init__(self, factor: int):
        self.factor = factor

    def __call__(self, x: int) -> int:
        return x * self.factor
```
This code snippet illustrates how to implement the `Sink` protocol, using a simple accumulation example:
```python
class AccumulatingSink(Sink[int]):
    def __init__(self):
        self.result = 0

    def __call__(self, x: int) -> None:
        self.result += x
```
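At its core, the source-to-sink contract can be exercised with a plain loop: a source is iterated, each item is passed through a transform, and the result is handed to a sink. The following self-contained sketch uses standalone copies of the classes above (without the protocol base classes) so it runs on its own; `streamed_process` adds threading on top of this same contract.

```python
class Multiply:
    """Standalone copy of the transformer above, minus the Transformer base."""

    def __init__(self, factor: int):
        self.factor = factor

    def __call__(self, x: int) -> int:
        return x * self.factor


class AccumulatingSink:
    """Standalone copy of the sink above, minus the Sink base."""

    def __init__(self):
        self.result = 0

    def __call__(self, x: int) -> None:
        self.result += x


sink = AccumulatingSink()
transform = Multiply(2)
for item in range(4):  # any iterable can act as a source
    sink(transform(item))
assert sink.result == 12  # 0 + 2 + 4 + 6
```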
In the following example, we show how to combine `ApplyToOkResult` and `SinkUnwrapped` to process data in a streamed fashion with proper error handling across thread boundaries.
```python
def a_data_source():
    for i in range(4):
        if i == 1:
            yield Result.error(RuntimeError("A runtime error occurred!"))
        else:
            yield Result.ok(i)


accumulated = AccumulatingSink()
sink = SinkUnwrapped(accumulated, ignore_exceptions={RuntimeError})
with ThreadPoolExecutor(max_workers=3) as ex:
    streamed_process(a_data_source(), ApplyToOkResult(Multiply(2)), sink, ex)
assert accumulated.result == 10
```
The following briefly describes the concept of streaming and how it can be used to hide I/O.
The most straightforward way to process data is to first load it completely and then process it.
This has the advantage of being simple to implement and maintain, as you don't need to be concerned with issues of parallelism.
For many cases this will work sufficiently well; however, it can stall your processing pipeline, because it needs to wait for data to be fetched. Often, an easy way to increase throughput is to interleave the I/O or data fetching with the processing of chunks.
With this streaming process you can utilise resources more effectively.
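The interleaving idea can be sketched with standard-library tools alone: while chunk *n* is being processed on the main thread, chunk *n + 1* is already being fetched on a background thread. This is a simplified illustration of what `streamed_process` does for you; the function names and the simulated I/O here are assumptions, not eotransform's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(i):
    # Stands in for a slow I/O operation, e.g. reading a chunk from disk.
    return i


def process(chunk):
    # Stands in for the compute-heavy transformation step.
    return chunk * chunk


def streamed(indices, executor):
    """Yield processed chunks while the next fetch runs in the background."""
    it = iter(indices)
    try:
        pending = executor.submit(fetch, next(it))
    except StopIteration:
        return
    for i in it:
        nxt = executor.submit(fetch, i)  # prefetch the next chunk
        yield process(pending.result())  # process the current chunk meanwhile
        pending = nxt
    yield process(pending.result())


with ThreadPoolExecutor(max_workers=1) as ex:
    results = list(streamed(range(4), ex))
assert results == [0, 1, 4, 9]
```

When fetching and processing take comparable time, this overlap can roughly halve the total wall-clock time compared to the sequential load-then-process approach.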
eotransform requires Python 3.8 and has these dependencies:
more-itertools
If you find this repository useful, please consider giving it a star or a citation:
```
@software{raml_bernhard_2023_8002789,
  author    = {Raml, Bernhard},
  title     = {eotransform},
  month     = jun,
  year      = 2023,
  publisher = {Zenodo},
  version   = {1.8.2},
  doi       = {10.5281/zenodo.8002789},
  url       = {https://doi.org/10.5281/zenodo.8002789}
}
```