
Comments (15)

mrocklin commented on June 6, 2024

You might want to look at Stream.from_textfile

from streamz.

mrocklin commented on June 6, 2024

Probably something like the following:

source = Stream.from_textfile(...)
example = pd.DataFrame(...)  # provide an empty example to tell streamz about column names and dtypes
df = source.map(parse).timed_window(0.5).to_dataframe(example=example)

# do stuff with df

source.start()


martindurant commented on June 6, 2024

@apiszcz, did you want to produce a new dataframe every so often, as in the example above, or an accumulated dataframe that steadily grows with time and includes all data so far? I'm not sure how you would achieve the latter.


mrocklin commented on June 6, 2024

or an accumulated dataframe that steadily grows with time and includes all data so far? I'm not sure how you would achieve the latter.

I think that this is an anti-pattern. Data should always flow through streams and should not accumulate infinitely.

Instead I might ask "how did you want to convert rows into dataframes?" The example above catches all the data for the last 500ms. But you could imagine bundling up data by a fixed number (every 100 rows) or passing it along directly.
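To make the "fixed number of rows" option concrete, here is a minimal plain-Python sketch of batching by count. The helper name `batch_by_count` is hypothetical and is not part of streamz; it only illustrates the idea, and it assumes that an incomplete trailing batch is simply held back.

```python
def batch_by_count(rows, n):
    """Group an iterable of rows into lists of exactly n rows.

    Hypothetical helper for illustration only. An incomplete final
    batch is not emitted.
    """
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == n:
            yield batch
            batch = []  # start collecting the next batch


# Seven rows in batches of three: the seventh row is never emitted.
batches = list(batch_by_count(range(7), 3))
```

Each emitted batch could then be turned into a small dataframe, which is the same shape of pipeline as the timed example above, only with a count-based trigger.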


martindurant commented on June 6, 2024

accumulate with pd.concat would do it, I suppose, but you would surely strain the system to death as the data continues to grow. If you wish to accumulate, you ought to only keep (near) constant-size state around, as you would for a simple mean.
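As a rough illustration of constant-size state, a running mean only needs a count and a total, however many values have gone by. The function name `update_mean` is hypothetical; a two-argument function of this shape (state, new value) is the kind of thing one would hand to an accumulate step, but this sketch is plain Python and does not use streamz.

```python
def update_mean(state, value):
    """Fold one new value into a (count, total) pair.

    The state stays the same size no matter how much data has been seen.
    """
    count, total = state
    return count + 1, total + value


state = (0, 0.0)
for x in [2.0, 4.0, 9.0]:
    state = update_mean(state, x)

count, total = state
mean = total / count
```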

Probably a better example would be to buffer lines of text until some threshold, then batch-process them to parquet for later. There could be a source that watches the size of a file, without reading the text line-by-line to memory, and loads in a chunk of bytes when enough data has appeared.
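A minimal sketch of the size-watching idea, using only the standard library: check how far the file has grown since the last read, and load the new bytes only once a threshold is reached. The function name `read_new_chunk` and the `threshold` parameter are assumptions for illustration; nothing like this is claimed to exist in streamz.

```python
import os
import tempfile


def read_new_chunk(path, offset, threshold):
    """Return (data, new_offset) for bytes appended after `offset`.

    Nothing is read until at least `threshold` new bytes are available;
    in that case the result is (b"", offset).
    """
    size = os.path.getsize(path)
    if size - offset < threshold:
        return b"", offset
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(size - offset)
    return data, offset + len(data)


# Small demonstration with a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.txt")
    with open(path, "wb") as f:
        f.write(b"a,b\n")  # 4 bytes: below the threshold
    first, offset = read_new_chunk(path, 0, threshold=8)
    with open(path, "ab") as f:
        f.write(b"1,2\n3,4\n")  # file is now 12 bytes
    second, offset = read_new_chunk(path, offset, threshold=8)
```

The returned chunk may end in the middle of a line, so a real version would need to carry the partial last line over to the next read before batch-processing the chunk.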


mrocklin commented on June 6, 2024

Probably a better example would be to buffer lines of text until some threshold

Yes I agree. Things like this already exist within streamz. You might consider partition or timed_window in streamz.core. We need more operations like this on the dataframe side. This starts to get interesting though.


apiszcz commented on June 6, 2024


mrocklin commented on June 6, 2024

Are the examples provided above insufficient to get you started, @apiszcz?


apiszcz commented on June 6, 2024


apiszcz commented on June 6, 2024


mrocklin commented on June 6, 2024

That would be ok. If you have some data already then you might also read a little bit of your file and pass that dataframe instead (the example doesn't have to be empty).


apiszcz commented on June 6, 2024

Ideally I could control the types.

from streamz import Stream
import pandas as pd

source = Stream.filenames('test.csv')
sdf = (source.map(pd.read_csv).to_dataframe(example=...))

sdf.mean().stream.sink(print)

source = Stream.from_textfile(...)
example = pd.DataFrame({'a': pd.Series([], dtype=np.uint32),
                        'b': pd.Series([], dtype=np.uint32),
                        'c': pd.Series([], dtype=np.uint32)})
df = source.map(parse).timed_window(0.5).to_dataframe(example=example)
# do stuff with df
source.start()

Traceback (most recent call last):
  File "r2.py", line 7, in <module>
    sdf = (source.map(pd.read_csv).to_dataframe(example=...))
  File "lib\streamz\core.py", line 344, in to_dataframe
    return StreamingDataFrame(stream=self, example=example)
  File "lib\streamz\dataframe.py", line 267, in __init__
    return super(StreamingDataFrame, self).__init__(*args, **kwargs)
  File "lib\streamz\collection.py", line 33, in __init__
    assert isinstance(self.example, self._subtype)
AssertionError


mrocklin commented on June 6, 2024

Thanks for the error report. I've improved the error message in #106


mrocklin commented on June 6, 2024

df = source.map(parse).timed_window(0.5).to_dataframe(example=example)

to_dataframe also expects to be given a stream of dataframes. So, assuming that the output of parse is something like a Python dict, we might do something like the following instead:

df = (source.map(parse)
    .timed_window(0.5)  # batch into lists every 500ms 
    .filter(None)  # remove empty batches
    .map(pd.DataFrame)  # convert lists to pandas dataframes
    .to_dataframe(example=example))

But please understand that I'm just giving out loose examples here. Ultimately you'll probably have to learn about these operations. I wouldn't recommend copy-pasting what I write and expecting it to work :)
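For orientation, here is one guess at what a `parse` function returning a dict could look like for comma-separated lines. The column names `a`, `b`, `c` are taken from the example dataframe posted earlier in the thread; the function body, its `columns` parameter, and the assumption that every field is an integer are illustrative only.

```python
def parse(line, columns=("a", "b", "c")):
    """Turn one comma-separated text line into a dict of integers.

    Hypothetical sketch: assumes one integer per column and no header
    row or quoted fields.
    """
    values = line.strip().split(",")
    return {name: int(value) for name, value in zip(columns, values)}


row = parse("1,2,3\n")
```

A list of such dicts is something `pd.DataFrame` can build a frame from, which is why batching the parsed rows first and converting each batch afterwards fits this shape of output.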


apiszcz commented on June 6, 2024

Getting closer. I will read the docs on how to operate on the df. Is mapping parse onto the stream needed?

from streamz import Stream
import pandas as pd
import numpy as np

source = Stream.from_textfile('test.csv')
example = pd.DataFrame({'a': pd.Series([], dtype=np.uint32),
                        'b': pd.Series([], dtype=np.uint32),
                        'c': pd.Series([], dtype=np.uint32)})
sdf = source.map(pd.read_csv).timed_window(2.0).to_dataframe(example=example)

# do stuff with df
source.start()
print(sdf)

