python-streamz / streamz
Real-time stream processing for Python
Home Page: https://streamz.readthedocs.io/en/latest/
License: BSD 3-Clause "New" or "Revised" License
Would it be possible/desirable to separate node and edge creation?
When we create a streamz pipeline we are creating a directed task graph. Currently this is done by creating nodes and simultaneously attaching them to other nodes (creating an edge). However, we may run into the situation where we want a bunch of nodes sitting in a library waiting to be used. This way users could mix and match how their nodes connected, essentially adding edges after the nodes were imported.
Is it possible to dynamically change node parameters? What would be the best way to do this? Would it be possible to regain control of the terminal during emitting such that the pipeline could be edited in real time?
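On the first question (separating node and edge creation): as a rough illustration, not an existing API, one way to approximate this today is to ship nodes as small factory functions and only wire them together at use time. The factory names below are made up for the example:
from streamz import Stream

# "library" code: node builders, no edges yet
def incrementer(upstream):
    return upstream.map(lambda x: x + 1)

def printer(upstream):
    return upstream.sink(print)

# "user" code: edges are only created when the pieces are wired together
source = Stream()
node = incrementer(source)
printer(node)

source.emit(1)   # prints 2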
Here is my attempt at an example:
from streams import Stream
from operator import mul
s1 = Stream()
s2 = Stream()
op = s1.product(s2, mul)
L = op.sink_to_list()
a = [1, 2, 3]
b = [4, 5, 6]
for x, y in zip(a, b):
    s1.emit(x)
    s2.emit(y)
print(L)
[4, 5, 6, 8, 10, 12, 12, 15, 18]
The best way I can come up with is to stash all but one of the streams into a list and then, for each incoming piece of data, perform an itertools.product on the resulting iterable and push the result into some function. I'm not crazy about this, as it means we end up storing a bunch of stuff in memory, which seems un-stream-like.
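A rough sketch of that stash-and-product idea using the current API (the variable names are only illustrative, and this keeps everything seen on one stream in memory, which is exactly the concern above):
import itertools
from streamz import Stream

s1 = Stream()
s2 = Stream()

seen = []                 # everything seen on s2 so far -- the memory cost
s2.sink(seen.append)

pairs = []
# for each new s1 element, pair it against all stored s2 elements
s1.sink(lambda x: pairs.extend(itertools.product([x], seen)))

s2.emit(4)
s2.emit(5)
s1.emit(1)
print(pairs)              # [(1, 4), (1, 5)]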
It may be helpful to have a Gate node, or super node or at least some usage docs/pattern docs.
A gate node would filter one stream by another stream. This could be done by:
s1 = Stream()
s2 = Stream()
l = s1.zip_latest(s2).filter(lambda x: bool(x[1])).pluck(0).sink_to_list()
We have some options in terms of implementation:
I don't know which one is better.
Thoughts?
Should sink mirror map in having args and kwargs?
I'm not sure where to post this, but it seems to have something to do with streams. There must be something fundamental I don't understand about distributed. I would expect the following to result in a cached result on the cluster:
from distributed import Client
from streams import Stream
from dask import delayed
client = Client("xx.xx.x.x:xxxxx")
client.has_what()
#output: {'tcp://xx.xx.x.x:xxxxx': []}
s = Stream(); sout = s.map(delayed).map(lambda x : x).map(client.compute)
s.emit(10)
client.has_what()
#output: {'tcp://xx.xx.x.x:xxxxx': []}
s = Stream(); sout = s.map(delayed).map(lambda x : x).map(client.compute).map(print)
s.emit(10)
#output: <Future: status: pending, key: finalize-30855c9e1ecca7881ff1355bbe9335e7>
client.has_what()
#output: {'tcp://xx.xx.x.x:xxxxx': []}
However, as you can see, has_what() still reports nothing on the cluster.
On the other hand, this works:
from distributed import Client
from streams import Stream
from dask import delayed
client = Client("xx.xx.x.x:xxxxx")
client.compute(delayed(lambda x : x))
client.has_what()
#output {'tcp://xx.xx.x.x:xxxxx': ['finalize-2e05de10e50701fc34ec28edcdf03599']}
Do you know what is possibly happening? I have given a quick look and don't see anything obvious. This issue may be deeper than my current understanding of distributed.
We should eventually make sure this gets into the documentation (in reaction to #72)
Something like:
s = Stream()
s2 = s.map(lambda x : x + 1)
s2.map(print)
s.emit(1)
won't print
but
s = Stream()
s2 = s.map(lambda x : x + 1)
s2.sink(print)
s.emit(1)
will print.
Also, mention the subtlety that the first code block will print if run in an IPython terminal, because the return results are stored in IPython's output variables, i.e.:
In [1]: from streamz import Stream
In [2]: s = Stream()
In [3]: s2 = s.map(lambda x : x+1)
In [4]: s2.map(print)
Out[4]: <streamz.core.map at 0x7fc750b13ba8>
In [5]: _
Out[5]: <streamz.core.map at 0x7fc750b13ba8>
The dask extensions have given us the ability to parallelize at the "inside a node" level. However, some nodes can be run completely independently from one another. What is the best way to access that level of parallelism?
This causes problems when we try to pull stuff from pypi for the conda-forge release.
Is it possible to create a dataframe from a growing log file, or is pygtail a better approach?
In conversation with @ordirules and @mrocklin
The idea is that you may only want combine_latest to trigger on some of the updates.
Currently zip_latest only buffers the first stream; we may want to buffer multiple streams.
Could you provide an example of how to map functions that return Delayed objects?
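A minimal sketch of what such an example might look like, assuming a reachable scheduler (not an official recipe): wrap the work in delayed, submit it with client.compute, and block on the resulting future.
from dask import delayed
from distributed import Client
from streamz import Stream

client = Client()          # starts a local cluster if no address is given

@delayed
def inc(x):
    return x + 1

s = Stream()
s.map(inc).map(client.compute).map(lambda fut: fut.result()).sink(print)
s.emit(1)                  # prints 2 once the future resolves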
We are developing a tool for streaming N-D arrays from low-level languages (Fortran/C) to Python. We are using Redis and are in the process of designing an Xarray backend (nbren12/geostreams#6) as the user-facing API. We are exploring different stream-handling intermediaries that will allow us to do many of the common reactive programming tasks on a collection of key/value pairs mapped to Xarray objects. Ideally, we come up with a solution that works well with dask arrays too.
@nbren12 may want to expand my initial description.
References:
Currently streams contain reference cycles between parent and child. As a result, they never get cleaned up. Generally this is ok, they're pretty small in memory. It does make it difficult to stop operating on a stream though.
source.map(print) # this stream will print forever
This behavior can be both convenient and limiting.
We might instead choose to require that a reference to a stream be kept explicitly in order to keep it around.
printer = source.map(print) # printing continues as long as a reference to printer is kept
Then if we delete all references to the printer, the stream will be cleaned up and printing will stop
del printer
This would be a change in semantics. Do we want this change?
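To make the proposed semantics concrete, here is a tiny standalone sketch (not streamz code) of a source that holds its children weakly, so deleting the last reference to a downstream node stops it from receiving updates:
import weakref

class Node:
    def __init__(self):
        # children are held weakly: once nothing else references a child,
        # it is garbage collected and silently drops out of the stream
        self.children = weakref.WeakSet()

    def connect(self, child):
        self.children.add(child)

    def emit(self, x):
        for child in list(self.children):
            child.update(x)

class Printer(Node):
    def update(self, x):
        print(x)

source = Node()
printer = Printer()
source.connect(printer)
source.emit(1)    # prints 1
del printer       # once collected, source.emit(...) no longer prints anything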
I'm interested in streamz as a possible higher-level API for implementing ETL workflows that I'm currently using dask.distributed for. @mrocklin - is this something you think streamz would be useful for?
One of the issues I'm facing is that an Extract-Transform-Load pipeline is composed of multiple tasks (including error handling and analytics jobs) but when multiple pipelines are running in parallel it can be impossible to know which distributed-level tasks are running which pipelines in the Bokeh UI. It might be nice if you could give a name to a pipeline which could then be used to aggregate tasks running under that name.
This may be something better suited to dask/distributed itself but I thought I'd at least bring it up somewhere. I'm curious if this is something anyone else is interested in?
We may want the inverse of zip, which takes in a stream of tuples and then either: a) breaks them up into separate streams or b) selects one "column" of the tuples to report. I think a) might be more general.
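Option (b) can already be approximated with pluck; a small sketch:
from streamz import Stream

source = Stream()                         # emits (number, letter) tuples

firsts = source.pluck(0).sink_to_list()   # one "column"
seconds = source.pluck(1).sink_to_list()  # the other

source.emit((1, 'a'))
source.emit((2, 'b'))
print(firsts, seconds)                    # [1, 2] ['a', 'b']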
The name "streams" is overly generic. It is also taken on PyPI.
We should find a better name. One option would be to use the dask namespace and call this dask-streams, although this makes less sense if the vanilla stream implementation is to be a core feature.
How do we feel about git hygiene when merging? I personally tend to squash-and-merge PRs, except if git history is particularly clean (this happens very rarely for me). Any thoughts?
When people maintain larger projects it's usually nice to be able to assume a few things about git history
These conditions are very rarely held during normal active development of any feature branch, but are typically held when a branch is merged. We can use github's squash-and-merge feature to automatically squash all commits in a PR to one. This loses some history though, so there is a balance here if the git history of that branch is particularly valuable.
Most pydata projects that I know of use squash-and-merge by default.
Inherited methods on DaskStreams produce Streams
>>> type(source.to_dask())
DaskStream
>>> type(source.to_dask().buffer(10))
Stream
Somewhat related to #2
Would it be possible to add a kwarg to make a given node a strong ref?
It seems that delay doesn't work with zip. I keep getting a ValueError: Stream has multiple children error.
Should we have coverage statistics?
Logging the details here for now. This now raises an error:
In [1]: from streamz import Stream
In [2]: s = Stream()
In [3]: s2 = s.map(lambda x : x + 1)
In [4]: s.visualize()
Error: Unknown HTML element <lambda> on line 1
in label of node map; <lambda>
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
but this works:
In [1]: from streamz import Stream
In [2]: s =Stream()
In [3]: def inc(x):
...: return x+1
...:
In [4]: s2 =s.map(inc)
In [5]: s.visualize()
Out[5]: <IPython.core.display.Image object>
I did not have issues with it before.
I will try to get back to it later and add more details if I cannot figure it out. If anyone has a quick idea to save time that would be great. thanks!
We probably want a Stream variant that moves around not individual elements, but batches or sequences of elements. We probably also want a Stream variant that moves around Pandas dataframes. Each of these would probably want a different API. For example, map on a batched stream might look like the following:
class map(BatchStream):
    def __init__(self, func):
        ...

    def update(self, batch):
        new_batch = list(builtins.map(self.func, batch))
        self.emit(new_batch)
However each of these new collection-wise interfaces would probably want to compose with both the lower level local and dask Stream objects.
To that end maybe it makes sense to encapsulate a low-level stream within a user-level stream.
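A hedged sketch of that encapsulation idea: a user-level BatchStream that simply wraps a low-level Stream and translates batch-wise operations into element-wise ones. The class and method names here are only illustrative:
from streamz import Stream

class BatchStream:
    """Wrap a low-level Stream that moves lists of elements around."""
    def __init__(self, stream=None):
        self.stream = stream if stream is not None else Stream()

    def map(self, func):
        # apply func to every element of each batch
        return BatchStream(self.stream.map(lambda batch: [func(x) for x in batch]))

    def sink_to_list(self):
        return self.stream.sink_to_list()

    def emit(self, batch):
        self.stream.emit(batch)

b = BatchStream()
out = b.map(lambda x: x + 1).sink_to_list()
b.emit([1, 2, 3])
print(out)    # [[2, 3, 4]]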
It would be useful when explaining streamz to have a live source of data. Are there any good web APIs that we can query from somewhat rapidly without making anyone angry at us? Perhaps a time series of changing data like stock data? If anyone has time to search around the internet that would be helpful. If anyone finds something nice with requests or whatnot I'd be more than happy to tornado-ify it.
From the initial blog post:
Annotate elements: we want to pass through event time, processing time, and presumably other metadata
It might be good to have this discussion in the near future (now?), since we are having some discussion on the SHED side about features that are like this or may include this.
Not that this is necessarily the best option:
Make a dedicated object for metadata (really just a dict). Every node knows to check whether the thing that came down is this object before doing anything: if isinstance(x, MetadataObject), then just pass it on, append something to it, or modify it. The sinks know to ignore it for the most part.
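A small sketch of that idea; MetadataObject and the wrapper are both made up for illustration:
from streamz import Stream

class MetadataObject(dict):
    """Really just a dict that nodes recognize and pass through."""
    pass

def metadata_aware(func):
    # wrap a mapped function so metadata flows through untouched
    def wrapper(x):
        if isinstance(x, MetadataObject):
            return x
        return func(x)
    return wrapper

s = Stream()
L = s.map(metadata_aware(lambda x: x + 1)).sink_to_list()
s.emit(1)
s.emit(MetadataObject(event_time=123.4))
print(L)    # [2, {'event_time': 123.4}] -- the metadata passed through unchanged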
@mrocklin I seem to have pushed to master by accident. Do you want a revert and PR, or is the code ok? Sorry!
Any interest in web documentation?
It seems that lambda functions now make the graph plotting rather unhappy.
def test_create_file():
    source1 = Stream(stream_name='source1')
    source2 = Stream(stream_name='source2')
    n1 = source1.zip(source2)
    n2 = n1.map(add).scan(mul).map(lambda x: x + 1)
    n2.sink(source1.emit)

    with tmpfile(extension='png') as fn:
        visualize(n1, filename=fn)
        assert os.path.exists(fn)

    with tmpfile(extension='svg') as fn:
        n1.visualize(filename=fn, rankdir="LR")
        assert os.path.exists(fn)

    with tmpfile(extension='dot') as fn:
        n1.visualize(filename=fn, rankdir="LR")
        with open(fn) as f:
            text = f.read()

        for word in ['rankdir', 'source1', 'source2', 'zip', 'map', 'add',
                     'shape=box', 'shape=ellipse']:
            assert word in text
/home/christopher/mc/envs/dp_dev/bin/python /home/christopher/pycharm-2016.3/helpers/pycharm/_jb_pytest_runner.py --target test_graph.py::test_create_file
Testing started at 12:43 PM ...
Launching py.test with arguments test_graph.py::test_create_file in /home/christopher/dev/streamz/streamz/tests
============================= test session starts ==============================
platform linux -- Python 3.5.4, pytest-3.2.2, py-1.4.34, pluggy-0.4.0
rootdir: /home/christopher/dev/streamz, inifile:
plugins: xonsh-0.5.12, env-0.6.0
collected 1 item
test_graph.py .Error: Unknown HTML element <lambda> on line 1
in label of node map; <lambda>
Error: Unknown HTML element <lambda> on line 1
in label of node map; <lambda>
Error: Unknown HTML element <lambda> on line 1
in label of node map; <lambda>
I'm following up on this; I'm not certain how this broke from our previous versions.
Is it possible to emit more things than were put into the stream?
Firstly, this looks super interesting @mrocklin. There is a definite use case in finance / trading - we often build complex DAGs and ideally need some form of streaming service. In the past, I have used Luigi, but it doesn't really fit the streaming model and the scheduler isn't as intelligent as dask.
Some info about the process (skip if you don't care): generally we ingest data from multiple streaming sources, do some transformations or run data through a model, and output multiple streams to different trading strategies. We also need to do some sort of model fitting/backtesting/validation. Often these models or strategies are of varying complexity - some strategies may be able to trade with only one or two inputs (and we want them to be fast) and then others may require more complex calculations or several nodes to complete.
Ideally (and at a high level) we would like the ability to do the following:
Luigi offered most of this, except
I also believe we have most of this with dask. 1 is obvious, 2 is easy, and I had a working version of 3 using dask and joblib's memory.cache. The only missing piece is being able to move that dask graph into production. This could be done by creating a DAG in dask and calling it repeatedly with new data; however, when nodes complete at different times, the pipeline becomes only as fast as its slowest part, which isn't ideal when you want paths to complete ASAP.
I suppose my question is: given the information above, do you see this as a potential use case for streams, or should I be working harder to get dask to play how I want it to? I am very keen to contribute if this problem fits into the broader goals streams is trying to achieve.
If the motivation isn't clear I can try and provide some simple examples of what I mean
What do we do when a stream receives bad data that causes an exception to be raised? For example:
def foo(x):
    if x is None:
        raise Exception
    else:
        return x + 1

s = Stream()
s2 = s.map(foo)
s2.sink(print)
s.emit(1)
s.emit(None)
s.emit(2)
Here, foo is a point of vulnerability in the stream, where it may or may not cause the whole stream architecture to halt.
Is it worth trying to incorporate some quiet exception handling? I am not sure exactly how to tackle this so I'm being a little vague at this point. I can think of many ways of doing this. Here are a few:
Catch the exception in s.emit; note that in this case the source of the exception may be harder to find.
I'll think about it, but I would like to hear opinions from @mrocklin and @CJ-Wright (who has already handled this in his streams extension). My current method is to wrap all mapped functions to look for exceptions and return a document that flags it as having encountered an exception. This works only in my subclassed module, though. It would be nice to unify this, I think.
(Note: exceptions can occur not just in map but in other things like filter, etc. Other nodes like zip may also want to be exception-aware, so that they recognize when something passing through is bad data and pass this on, etc.)
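For concreteness, a minimal sketch of that wrapping approach; the quiet helper and the flagged-document format are illustrative, not the actual SHED implementation:
from streamz import Stream

def quiet(func):
    # wrap a mapped function so bad data is flagged instead of halting the stream
    def wrapper(x):
        try:
            return {'data': func(x), 'error': None}
        except Exception as e:
            return {'data': x, 'error': e}
    return wrapper

def foo(x):
    if x is None:
        raise ValueError("bad data")
    return x + 1

s = Stream()
s.map(quiet(foo)).sink(print)
s.emit(1)      # {'data': 2, 'error': None}
s.emit(None)   # {'data': None, 'error': ValueError('bad data')}
s.emit(2)      # {'data': 3, 'error': None} -- the stream keeps going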
Should we buffer the combine_latest emit_on stream?
Consider the following situation: we have two streams, a and b. We are going to use combine_latest to combine them, emitting only on a. Five entries come down from a, then one comes down from b, then one more from a. Because of the missing data from b, none of the five initial a entries have been emitted. Now that we have b data, do we expect it to emit six times (which would require buffering the emit_on stream), or only once (which may violate the idea of the emit_on stream always emitting)?
@mrocklin @ordirules @danielballan
Right now, graph visualization does not handle pipelines connected through emit well:
from streams import Stream
s1 = Stream()
s2 = Stream()
sout1 = s1.map(lambda x : x +1)
sout1.map(print)
# connect stream 2 to stream 1
s2.map(s1.emit)
from streams.graph import visualize
visualize(s2, 'stream2.png')
Result is this:
http://imgur.com/a/0Z0JA
What would solve this is that, rather than using emit, we could define a new function, say s2.inject(s1), which would add s2 as a child to s1. The only issue I see here is that we'll need an update method defined for the base stream object, which would just call emit. What do you think?
@mrocklin @CJ-Wright
I'm interested to hear of a better way, thanks!
It might be nice to have a pretty task graph in the readme.
Would it be possible to parse a dask task graph dictionary and turn it into a streaming pipeline?
Is it possible to color each node in graphviz? If so, we may want to come up with a color scheme such that every node type has a dedicated color for ease of viewing (and for making nice graphs for presentations/publications).
Would it be possible to send out an alpha release?
I don't mind conda-forge packaging.
Would it be possible to split the stream class into two classes?
The first class would hold the init, emit, child, and loop methods.
The second class would inherit from the first and implement map, sink, buffer, etc.
Inspiration:
I have a very specific data topology and I need the various functions to operate differently than they currently do. By splitting the class I'd be able to use the same base class and only have to re-implement map, filter, etc. for my data needs.
A similar proposition: would it be possible to make the various internals of Streams hot-swappable?
I don't need to change all the methods; delay, for example, most likely could stay the same, as could buffer. But map, filter, sliding_window, etc. won't work for the event-model data topology.
Thoughts?
@danielballan
This point has been raised a few times over PR's and issues here. However, I feel it deserves its own post.
This is strongly tied to #15.
Maybe a nice alternative for subclassing would be to use a metaclass to contain some of the boilerplate. Here is an example:
from functools import partial


class StreamMeta(type):
    ''' Handles the boilerplate of the function registry.
    '''
    def __init__(cls, name, bases, dct):
        # I want to modify the cls methods
        print("Preparing class: {}".format(cls))
        #print("Updating function registry...")
        # set up the function registry
        if not hasattr(cls, '_fun_reg'):
            cls._fun_reg = dict()

        # intercept the functions coming in and add them
        # (or modify) the function registry of streams
        for key, elem in dct.items():
            if not key.startswith("_") and callable(elem):
                cls._fun_reg[key] = elem
                delattr(cls, key)

        def __getattr__(self, name):
            #print("Checking function registry")
            if name in self._fun_reg:
                return partial(self._fun_reg[name], self)
            else:
                raise AttributeError

        cls.__getattr__ = __getattr__
        #print("Finishing initialization")
        super(StreamMeta, cls).__init__(name, bases, dct)


class Stream(object, metaclass=StreamMeta):
    # this was python 2.7, add for compatibility?
    #__metaclass__ = StreamMeta
    def map(self):
        print("Stream map")


s = Stream()
print("Stream's map method")
s.map()


class CustomStream(Stream):
    def map(self):
        print("Custom map")

    def map2(self):
        print("Custom map2")


s2 = CustomStream()
print("CustomStream's map method")
s2.map()
print("CustomStream's map2 method")
s2.map2()
print("Stream's map method")
s.map()
The output is:
Preparing class: <class '__main__.Stream'>
Stream's map method
Stream map
Preparing class: <class '__main__.CustomStream'>
CustomStream's map method
Custom map
CustomStream's map2 method
Custom map2
Stream's map method
Custom map
The _fun_reg dictionary is used to save the functions, in conjunction with __getattr__. The metaclass is used so that the stream classes may still be defined in the usual way. The drawback with this method is that only one instance of Stream may ever be used.
Any thoughts? Is this perhaps overdone? @mrocklin @CJ-Wright @danielballan
I am not sure where to put this. It seems that I am running into problems with this simple code:
from distributed import Client
import streamz.dask as sd
from streamz import Stream
from tornado import gen
from tornado.ioloop import IOLoop
from distributed import sync

HOSTNAME = "localhost"
PORT = 8786
net_string = "{}:{}".format(HOSTNAME, PORT)
client = Client(net_string)


def foo(x):
    print("incrementing")
    return x + 1


s = Stream()
s2 = sd.scatter(s)
s3 = s2.map(foo)
s4 = sd.gather(s3)
s4.sink(print)


@gen.coroutine
def start():
    for i in range(10):
        yield s.emit(1)


def start_run():
    loop = IOLoop()
    sync(loop, start)


start_run()
where I am getting a CancelledError: foo-dbfccbe1f665466a68906d614a6d8348 error. I am submitting the job to a dask-scheduler that I have initialized. Running the code a second time resolves the issue.
I am not sure what is going on, but I suspect this may have something to do with references to futures being lost before the information is gathered. My question is: am I doing something wrong, or is this a potential issue with the way streamz handles futures? If I find any more information I'll post here.
Also, please let me know where this may be best to post. One option is to start a #streamz tag on Stack Overflow. Thanks!
EDIT : Corrected typo in source code (s4 to s5)
One could imagine using accumulate to sum things together (or do something else) until some threshold is met and then starting the average all over again (resetting the result to the starting value). What would be the best way to do this?
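One hedged sketch: keep the running total as the accumulator state and reset it to the start value once the threshold is met (whether the final total should also be emitted before resetting is exactly the open question; the threshold value here is arbitrary):
from streamz import Stream

def sum_until(total, x, threshold=10):
    # accumulate a running sum; once the threshold is met, reset to the start value
    total = total + x
    if total >= threshold:
        return 0
    return total

s = Stream()
L = s.accumulate(sum_until, start=0).sink_to_list()
for i in [4, 4, 4, 4]:
    s.emit(i)
print(L)    # [4, 8, 0, 4] -- the third step resets instead of reporting 12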
I have a couple of DAG workflows in dask. These workflows are triggered via user interactions through a web API. I'd like to achieve the following:
What would be the correct approach to implement it?
Hi,
Could you please add an example of how to run an event loop in a separate thread so that one can use the timed_window() and delay() methods?
The only reference I found is that it works in a Jupyter notebook because that already has an event loop running in the background. This doesn't help me as I'm not running in a Jupyter notebook.
I'm using asyncio in Python 3.6 and mostly that's working fine, but I can't get the timed_window() function to work.
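A minimal sketch of running a Tornado IOLoop in a background thread. Whether and how the loop is handed to the time-based nodes (shown here as a loop= keyword on timed_window) may differ between streamz versions, so treat that part as an assumption and check your version's signature:
from threading import Thread
from tornado.ioloop import IOLoop
from streamz import Stream

# run a Tornado event loop in a daemon thread
loop = IOLoop()
Thread(target=loop.start, daemon=True).start()

# hand the loop to the time-dependent node (the keyword name is an assumption)
source = Stream()
source.timed_window(0.5, loop=loop).sink(print)

for i in range(10):
    source.emit(i)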
Would it be possible to have an .edit() method on the nodes so that it opens a GUI which we could use to add/delete nodes and connections?
I guess this is getting a bit close to vistrails and all but... (I also know that we may not want to spend our time writing a GUI)
Maybe color graph nodes by when the nodes trigger?
This could use the message concept in #111, where we pass a message which tracks each call time and node.
RX has some nice tutorials, with great diagrams showing how streams get combined, e.g. combine_latest. It would be nice if we could have these too, especially as we have extended some of the operators beyond their tutorials. Even better, it would be awesome if we could generate these diagrams from the streams themselves.
One potential way to do this:
Ideally we would create a whole bunch of these examples so we could put them in the docs. Then every time the docs are generated we could recreate them, just in case we introduce a new execution model or change the nodes (or even to check that the nodes behave as we expect them to).
@ordirules (since we were talking about this on #48)
When the pipeline branches, how do we know/specify which branch gets executed first?
Dask has a nice feature where it draws out the task graph and saves it as an image.
While this is not dask and some things could be very different, a similar feature could be very helpful to tracing exactly what is going on in an otherwise complex pipeline.