Is it possible to create a dataframe from a growing log file, or is pygtail a better a

Example: Read growing log file, create a dataframe or append to existing data frame? about streamz HOT 15 OPEN

apiszcz commented on June 6, 2024

Example: Read growing log file, create a dataframe or append to existing data frame?

from streamz.

Comments (15)

mrocklin commented on June 6, 2024

You might want to look at Stream.from_textfile

mrocklin commented on June 6, 2024

Probably something like the following:

source = Stream.from_textfile(...)
example = pd.DataFrame(...)  # provide an empty example to tell streamz about column names and dtypes
df = source.map(parse).timed_window(0.5).to_dataframe(example=example)

# do stuff with df

source.start()

martindurant commented on June 6, 2024

@apiszcz , were you wanting to produce a new dataframe every so often, as in the example above, or an accumulated dataframe that steadily grows with time and includes all data so far? I'm not sure how you would achieve the latter.

mrocklin commented on June 6, 2024

or an accumulated dataframe that steadily grows with time and includes all data so far? I'm not sure how you would achieve the latter.

I think that this is an anti-pattern. Data should always flow through streams and should not accumulate infinitely.

Instead I might ask "how did you want to convert rows into dataframes?" The example above catches all the data for the last 500ms. But you could imagine bundling up data by a fixed number (every 100 rows) or passing it along directly.

martindurant commented on June 6, 2024

accumulate with pd.concat would do it, I suppose, but you would surely strain the system to death as the data continues to grow. If you wish to accumulate, you ought to only keep (near) constant-size state around, as you would for a simple mean.
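As a minimal sketch of what "constant-size state" could look like for a mean: the state below is just a `(count, total)` pair, no matter how many values have passed through. The helper name `update_mean` is hypothetical; a function of this `(state, value) -> state` shape is the kind of thing one would hand to an accumulate-style operation.

```python
def update_mean(state, value):
    """Fold one value into a fixed-size (count, total) state."""
    count, total = state
    return count + 1, total + value


state = (0, 0.0)  # nothing seen yet
for value in [2.0, 4.0, 6.0, 8.0]:
    state = update_mean(state, value)

count, total = state
mean = total / count
print(mean)  # 5.0
```

The state never grows, in contrast to concatenating every batch onto one dataframe.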

Probably a better example would be to buffer lines of text until some threshold, then batch-process them to parquet for later. There could be a source that watches the size of a file, without reading the text line-by-line to memory, and loads in a chunk of bytes when enough data has appeared.
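A rough sketch of the file-size idea in plain Python (not an existing streamz source): check the size of the file, and only read once at least `min_bytes` of new data have appeared. The function name `read_new_bytes`, the threshold, and the file name are all assumptions made for illustration.

```python
import os
import tempfile


def read_new_bytes(path, offset, min_bytes):
    """Return (chunk, new_offset); chunk is b"" until min_bytes have appeared."""
    size = os.path.getsize(path)
    if size - offset < min_bytes:
        return b"", offset  # not enough new data yet, nothing is read
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read(size - offset)
    return chunk, offset + len(chunk)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "app.log")
    with open(path, "wb") as f:
        f.write(b"line 1\n")
    # Only 7 bytes so far, below the 16-byte threshold.
    first, offset = read_new_bytes(path, 0, min_bytes=16)
    with open(path, "ab") as f:
        f.write(b"line 2\nline 3\n")
    # 21 bytes are now available, so the whole chunk is loaded at once.
    second, offset = read_new_bytes(path, offset, min_bytes=16)
```

Note that a chunk read this way can end in the middle of a line if the writer is mid-write, so a real version would need to hold back the trailing partial line.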

mrocklin commented on June 6, 2024

Probably a better example would be to buffer lines of text until some threshold

Yes I agree. Things like this already exist within streamz. You might consider partition or timed_window in streamz.core. We need more operations like this on the dataframe side. This starts to get interesting though.

apiszcz commented on June 6, 2024

I need a dataframe at intervals; it will be discarded after processing. The average size will be <10K rows, definitely less than 100K rows, with 20-30 attributes, so there are no issues processing and discarding old data. The interval ranges from 1 to 20 seconds. I'm wondering if streamz or any similar capability exists to do this. Thank you for all the thoughts on this topic.

On Tue, Oct 17, 2017 at 3:12 PM, Martin Durant wrote:

@apiszcz, were you wanting to produce a new dataframe every so often, as in the example above, or an accumulated dataframe that steadily grows with time and includes all data so far? I'm not sure how you would achieve the latter.

mrocklin commented on June 6, 2024

Are the examples provided above insufficient to get you started @apiszcz ?

apiszcz commented on June 6, 2024

Will try this. ...

On Tue, Oct 17, 2017 at 3:02 PM, Matthew Rocklin wrote:

Probably something like the following:

source = Stream.from_textfile(...)
example = pd.DataFrame(...)  # provide an empty example to tell streamz about column names and dtypes
df = source.map(parse).timed_window(0.5).to_dataframe(example=example)

# do stuff with df

source.start()

apiszcz commented on June 6, 2024

I worked out those empty examples for another issue recently; someone suggested a set of Series. Is there a better way?

ndf = pd.DataFrame({'a': pd.Series([], dtype=np.uint8),
                    'b': pd.Series([], dtype=np.uint8),
                    'c': pd.Series([], dtype=np.float32),
                    'd': pd.Series([], dtype=np.uint32)})

mrocklin commented on June 6, 2024

That would be ok. If you have some data already then you might also read a little bit of your file and pass that dataframe instead (the example doesn't have to be empty).
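One way this might look, as a sketch: read a few rows with explicit dtypes and, if an empty example is still preferred, slice off all the rows. The column names, dtypes, and the inline `sample_text` standing in for the head of the real file are assumptions for illustration.

```python
import io

import pandas as pd

# Stand-in for the first few lines of the real file (hypothetical columns).
sample_text = "a,b,c,d\n1,2,0.5,10\n3,4,1.5,20\n"
dtypes = {"a": "uint8", "b": "uint8", "c": "float32", "d": "uint32"}

# Read a small sample, forcing the dtypes rather than letting pandas infer them.
sample = pd.read_csv(io.StringIO(sample_text), dtype=dtypes, nrows=2)

# Slicing off all rows keeps the column names and dtypes but no data.
example = sample.iloc[:0]
print(example.dtypes)
```

Either `sample` or `example` could then be passed as the `example=` argument.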

apiszcz commented on June 6, 2024

Ideally I could control the types.

from streamz import Stream
import pandas as pd

source = Stream.filenames('test.csv')
sdf = (source.map(pd.read_csv).to_dataframe(example=...))

sdf.mean().stream.sink(print)

source = Stream.from_textfile(...)
example = pd.DataFrame({'a': pd.Series([], dtype=np.uint32),
                        'b': pd.Series([], dtype=np.uint32),
                        'c': pd.Series([], dtype=np.uint32)})
df = source.map(parse).timed_window(0.5).to_dataframe(example=example)
# do stuff with df
source.start()

Traceback (most recent call last):
  File "r2.py", line 7, in <module>
    sdf = (source.map(pd.read_csv).to_dataframe(example=...))
  File "lib\streamz\core.py", line 344, in to_dataframe
    return StreamingDataFrame(stream=self, example=example)
  File "lib\streamz\dataframe.py", line 267, in __init__
    return super(StreamingDataFrame, self).__init__(*args, **kwargs)
  File "lib\streamz\collection.py", line 33, in __init__
    assert isinstance(self.example, self._subtype)
AssertionError

mrocklin commented on June 6, 2024

Thanks for the error report. I've improved the error message in #106

mrocklin commented on June 6, 2024

df = source.map(parse).timed_window(0.5).to_dataframe(example=example)

to_dataframe will also expect to be given a stream of dataframes. So assuming that the output of parse is something like a python dict, we might do something like the following instead:

df = (source.map(parse)
    .timed_window(0.5)  # batch into lists every 500ms 
    .filter(None)  # remove empty batches
    .map(pd.DataFrame)  # convert lists to pandas dataframes
    .to_dataframe(example=example))

But please understand that I'm just giving out loose examples here. Ultimately you'll probably have to learn about these operations. I wouldn't recommend copy-pasting what I write and expecting it to work :)

apiszcz commented on June 6, 2024

Getting closer. I will read the docs to learn how to operate on the df. Is mapping parse onto the stream needed?

from streamz import Stream
import pandas as pd
import numpy as np

source = Stream.from_textfile('test.csv')
example = pd.DataFrame({'a': pd.Series([], dtype=np.uint32),
                        'b': pd.Series([], dtype=np.uint32),
                        'c': pd.Series([], dtype=np.uint32)})
sdf = source.map(pd.read_csv).timed_window(2.0).to_dataframe(example=example)

# do stuff with df
source.start()
print(sdf)
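On whether a parse step is needed: `from_textfile` emits one line of text at a time, while `pd.read_csv` expects a path or a file-like object, so mapping `pd.read_csv` over single lines would most likely treat each line as a filename. A possible alternative, sketched below, is to batch the lines first and parse each batch in one go. The helper name `lines_to_frame`, the column names, and the assumption that the lines carry no header row are all illustrative.

```python
import io

import pandas as pd

# Hypothetical schema; adjust to the real log columns.
COLUMNS = ["a", "b", "c"]
DTYPES = {"a": "uint32", "b": "uint32", "c": "uint32"}


def lines_to_frame(lines):
    """Parse a batch of header-less CSV lines into one typed dataframe."""
    # Make sure every line ends with a newline before joining.
    text = "".join(line if line.endswith("\n") else line + "\n" for line in lines)
    return pd.read_csv(io.StringIO(text), names=COLUMNS, dtype=DTYPES, header=None)


batch = ("1,2,3\n", "4,5,6")
frame = lines_to_frame(batch)
print(frame)
```

A function like this could take the place of `pd.DataFrame` in the `.timed_window(...).filter(None).map(...)` chain suggested earlier in the thread, so that each non-empty batch of lines becomes one dataframe.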
