Comments (15)
You might want to look at Stream.from_textfile from streamz.
Probably something like the following:
source = Stream.from_textfile(...)
example = pd.DataFrame(...) # provide an empty example to tell streamz about column names and dtypes
df = source.map(parse).timed_window(0.5).to_dataframe(example=example)
# do stuff with df
source.start()
@apiszcz , were you wanting to produce a new dataframe every so often, as in the example above, or an accumulated dataframe that steadily grows with time and includes all data so far? I'm not sure how you would achieve the latter.
or an accumulated dataframe that steadily grows with time and includes all data so far? I'm not sure how you would achieve the latter.
I think that this is an anti-pattern. Data should always flow through streams and should not accumulate infinitely.
Instead I might ask "how did you want to convert rows into dataframes?" The example above catches all the data for the last 500ms. But you could imagine bundling up data by a fixed number (every 100 rows) or passing it along directly.
accumulate with pd.concat would do it, I suppose, but you would surely strain the system to death as the data continues to grow. If you wish to accumulate, you ought to only keep (near) constant-size state around, as you would for a simple mean.
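The constant-size-state idea is exactly the shape that streamz's accumulate expects: a step function taking the old state and a new element and returning (new state, value to emit). A minimal sketch of that pattern, without any streamz machinery, using a running mean:

```python
# Constant-size accumulator for a running mean: the state is just
# (count, total), never the full history, so memory stays flat no matter
# how long the stream runs.
def step(state, x):
    count, total = state
    count, total = count + 1, total + x
    return (count, total), total / count  # (new state, emitted mean)

state = (0, 0.0)
means = []
for x in [1, 2, 3, 4]:
    state, mean = step(state, x)
    means.append(mean)
# means == [1.0, 1.5, 2.0, 2.5]
```

The same step function can be handed to Stream.accumulate with returns_state=True; the loop above just simulates elements arriving one at a time.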
Probably a better example would be to buffer lines of text until some threshold, then batch-process them to parquet for later. There could be a source that watches the size of a file, without reading the text line-by-line to memory, and loads in a chunk of bytes when enough data has appeared.
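A rough sketch of that buffering idea, independent of streamz: collect lines until the buffer crosses a byte threshold, then flush the batch to a writer. The flush callback here stands in for whatever parquet-writing step you would use; the threshold logic is an assumption for illustration, not a streamz API.

```python
# Buffer lines until the batch holds at least threshold_bytes, then hand
# the whole batch to flush (e.g. a function that writes one parquet file).
def batch_lines(lines, threshold_bytes, flush):
    buf, size = [], 0
    for line in lines:
        buf.append(line)
        size += len(line)
        if size >= threshold_bytes:
            flush(buf)
            buf, size = [], 0
    if buf:
        flush(buf)  # flush the final partial batch

batches = []
batch_lines(["aaaa", "bb", "cccc", "d"], threshold_bytes=5, flush=batches.append)
# batches == [["aaaa", "bb"], ["cccc", "d"]]
```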
Probably a better example would be to buffer lines of text until some threshold
Yes I agree. Things like this already exist within streamz. You might consider partition or timed_window in streamz.core. We need more operations like this on the dataframe side. This starts to get interesting though.
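For reference, partition(n) emits a tuple for every n elements it receives, holding back any trailing partial group. A minimal non-streaming sketch of that behavior:

```python
# What streamz.core's partition node does, stripped of the streaming
# machinery: group every n consecutive items into a tuple.
def partition(n, items):
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == n:
            yield tuple(batch)
            batch = []

out = list(partition(3, range(7)))
# out == [(0, 1, 2), (3, 4, 5)]  -- the trailing partial batch is held back
```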
Are the examples provided above insufficient to get you started @apiszcz ?
That would be ok. If you have some data already then you might also read a little bit of your file and pass that dataframe instead (the example doesn't have to be empty).
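For instance, reading just a couple of rows gives you an example frame with the real column names and dtypes for free. StringIO stands in for your actual file here:

```python
import io
import pandas as pd

# Instead of constructing an empty example frame by hand, read a few rows
# of the real data; nrows keeps the cost negligible.
text = "a,b,c\n1,2,3\n4,5,6\n"
example = pd.read_csv(io.StringIO(text), nrows=2)
# example now carries the column names and inferred dtypes of the real file
```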
Ideally I could control the types.
from streamz import Stream
import pandas as pd
source = Stream.filenames('test.csv')
sdf = (source.map(pd.read_csv).to_dataframe(example=...))
sdf.mean().stream.sink(print)
import numpy as np  # needed for the dtype specs below

source = Stream.from_textfile(...)
example = pd.DataFrame({'a': pd.Series([], dtype=np.uint32),
                        'b': pd.Series([], dtype=np.uint32),
                        'c': pd.Series([], dtype=np.uint32)})
df = source.map(parse).timed_window(0.5).to_dataframe(example=example)
# do stuff with df
source.start()
Traceback (most recent call last):
  File "r2.py", line 7, in <module>
    sdf = (source.map(pd.read_csv).to_dataframe(example=...))
  File "lib\streamz\core.py", line 344, in to_dataframe
    return StreamingDataFrame(stream=self, example=example)
  File "lib\streamz\dataframe.py", line 267, in __init__
    return super(StreamingDataFrame, self).__init__(*args, **kwargs)
  File "lib\streamz\collection.py", line 33, in __init__
    assert isinstance(self.example, self._subtype)
AssertionError
Thanks for the error report. I've improved the error message in #106
df = source.map(parse).timed_window(0.5).to_dataframe(example=example)
to_dataframe will also expect to be given a stream of dataframes. So assuming that the output of parse is something like a python dict, we might do something like the following instead:
df = (source.map(parse)
            .timed_window(0.5)       # batch into lists every 500ms
            .filter(len)             # remove empty batches
            .map(pd.DataFrame)       # convert lists to pandas dataframes
            .to_dataframe(example=example))
But please understand that I'm just giving out loose examples here. Ultimately you'll probably have to learn about these operations. I wouldn't recommend copy-pasting what I write and expecting it to work :)
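For illustration only, a hypothetical parse for comma-separated integer lines; the column names and the int conversion are assumptions about the data, not anything the thread specifies:

```python
# Turn one line of text like "1,2,3\n" into the dict shape that the
# pipeline above expects before pd.DataFrame is applied to each batch.
def parse(line):
    a, b, c = line.strip().split(',')
    return {'a': int(a), 'b': int(b), 'c': int(c)}

row = parse("1,2,3\n")
# row == {'a': 1, 'b': 2, 'c': 3}
```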
Getting closer. I'll read the docs to operate on the df. Is mapping parse onto the stream needed?
from streamz import Stream
import pandas as pd
import numpy as np
source = Stream.from_textfile('test.csv')
example = pd.DataFrame({'a': pd.Series([], dtype=np.uint32),
'b': pd.Series([], dtype=np.uint32),
'c': pd.Series([], dtype=np.uint32)})
sdf = source.map(pd.read_csv).timed_window(2.0).to_dataframe(example=example)
# do stuff with df
source.start()
print(sdf)