
Dask support (tsfresh issue, closed)

blue-yonder commented on August 24, 2024
Dask support


Comments (11)

nils-braun commented on August 24, 2024

This is quite an interesting topic. However, the dask DataFrame-creation API is highly incompatible with the pandas one and using dask correctly would mean some severe changes to the implementation.

So my questions are:
(1) do we really need this,
(2) are there use cases where we need to run over more data than fits into memory, and
(3) isn't the processing time in those cases so large that we should rather think about a "cluster-parallel" implementation?


jneuff commented on August 24, 2024

Yes, this would imply quite some changes. This issue was meant to trigger the discussion.

On (3): Dask actually offers distributed scheduling.


MaxBenChrist commented on August 24, 2024

(1) Maybe dask is not the right solution for our problem. But at the moment, when one wants to extract features for a large number of time series, one has to apply tsfresh chunkwise because some feature calculators are quite memory intensive. So let's say you have 16 GB of RAM; then you probably cannot process 2-3 GB of time series data in one chunk, so you have to split it by devices and pickle the feature DataFrames (see the rough sketch below, after this list). What do you think about renaming this issue to "allow extraction of features for big time series"?

(2) yes, definitely

(3) you are probably right here.
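
Roughly, the chunkwise workaround from (1) looks like the following sketch. The column names ("id", "time"), the chunk size, and the output path are just placeholders for illustration, not anything prescribed by tsfresh:

import pandas as pd
from tsfresh import extract_features

def extract_chunkwise(df: pd.DataFrame, chunk_size: int = 50, out_dir: str = "features"):
    # Extract features for a handful of devices at a time and pickle each result,
    # so that memory-hungry feature calculators never see the full dataset at once.
    device_ids = df["id"].unique()
    for start in range(0, len(device_ids), chunk_size):
        subset = df[df["id"].isin(device_ids[start:start + chunk_size])]
        features = extract_features(subset, column_id="id", column_sort="time")
        features.to_pickle(f"{out_dir}/features_{start}.pkl")  # persist and free memory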


mrocklin commented on August 24, 2024

Just came across this. If there are particular incompatibilities between Dask DataFrames and pandas DataFrames that affect this project, please let us know.

Also, it may be that the right approach isn't to use dask.dataframe directly, but instead to use it to pre-process data and then apply tsfresh functions over it. For example, it would be easy to use dask.dataframe to load large datasets, share a fixed window of data (maybe five minutes) between neighboring partitions (which are just pandas DataFrames), and then call tsfresh functions on each of those pandas DataFrames. This sort of solution (or something similar) is common for more complex applications like yours.

Anyway, let me know if dask developers can help.
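
A rough sketch of what I mean (the file pattern and column names are placeholders, the overlapping-window part is left out for brevity, and each id is assumed to live entirely inside one partition):

import dask
import dask.dataframe as dd
import pandas as pd
from tsfresh import extract_features

# Load a large dataset; each partition of the dask DataFrame is a pandas DataFrame.
ddf = dd.read_csv("sensor_data_*.csv")
partitions = ddf.to_delayed()

@dask.delayed
def features_for_partition(pdf: pd.DataFrame) -> pd.DataFrame:
    # Plain tsfresh call on an in-memory pandas DataFrame.
    return extract_features(pdf, column_id="id", column_sort="time")

results = dask.compute(*[features_for_partition(p) for p in partitions])
all_features = pd.concat(results)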


MaxBenChrist commented on August 24, 2024

@mrocklin, thank you for your nice message.

At the moment we are working on some other issues (e.g. shaping the tsfresh API, restructuring the test framework, ...). I discussed a possible dask implementation with @chmp (he has contributed a few things to the dask project). As soon as this comes up again and we have some spare time to tackle it, I will get in contact with you guys :).


MaxBenChrist commented on August 24, 2024

@mrocklin

We are working on supporting dask to calculate tsfresh features in a distributed fashion (see the distributed branch). dask is a great tool, and our first experiments on a cluster make me really excited about the upcoming tsfresh release.

I have one question regarding the pure argument of the client.map method. Essentially, tsfresh is a wrapper around numpy, scipy and pandas methods. Most of the features that we calculate, e.g. Fourier coefficients, are computed in C libraries.

In the dask documentation you say:

By default we assume that all functions are pure. If this is not the case we should use the pure=False keyword argument.

So, following this advice, I should set pure=False because we rely on those scipy/numpy methods. But on that same page you also mention:

This key should be the same across all computations with the same inputs and across all machines. If we run the computation above on any computer with the same environment then we should get the exact same key.

The scheduler avoids redundant computations. If the result is already in memory from a previous call then that old result will be used rather than recomputing it. Calls to submit or map are idempotent in the common case.

We want the same features to be calculated for the same input. So, to achieve that and reduce the number of redundant calculations, we have to set pure=True so that the same input gets the same computation key, right?
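
For concreteness, a toy sketch of the kind of call I have in mind (the feature function and the chunks here are illustrative stand-ins, not our actual code from the distributed branch):

from dask.distributed import Client
import numpy as np

def calculate_features(chunk):
    # Deterministic: only wraps numpy computations, so the same input
    # always produces the same output.
    return {"mean": float(np.mean(chunk)), "std": float(np.std(chunk))}

client = Client()  # local cluster, just for testing
chunks = [np.arange(10), np.arange(10), np.arange(20)]

# With pure=True, identical (function, input) pairs hash to the same task key,
# so the duplicated chunk above is only computed once.
futures = client.map(calculate_features, chunks, pure=True)
print(client.gather(futures))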


MaxBenChrist commented on August 24, 2024

Reference: PR #316


mrocklin commented on August 24, 2024

The pure= keyword is a bit inaccurate. Pure has a few meanings. It would be more accurate to define this as deterministic= instead. If you use pure=True then you are stating that applying the same function to the same arguments will always produce the same result.

If you submit the same function with the same arguments twice:

x = client.submit(inc, 1)
y = client.submit(inc, 1)

Under pure=True these will point to the same data. Under pure=False they will point to different data.

Relying on NumPy/SciPy/Pandas has no impact here. Most of those functions are deterministic.
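
A runnable version of the snippet, with inc and a local client filled in for illustration:

from dask.distributed import Client

def inc(n):
    return n + 1

client = Client()

x = client.submit(inc, 1)   # pure=True is the default
y = client.submit(inc, 1)
assert x.key == y.key       # same key: computed once, result shared

a = client.submit(inc, 1, pure=False)
b = client.submit(inc, 1, pure=False)
assert a.key != b.key       # forced to be treated as distinct tasks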


MaxBenChrist commented on August 24, 2024

The pure= keyword is a bit inaccurate. Pure has a few meanings. It would be more accurate to define this as deterministic= instead. If you use pure=True then you are stating that applying the same function to the same arguments will always produce the same result.

Thanks for the explanation!


mrocklin commented on August 24, 2024


MaxBenChrist commented on August 24, 2024

Dask support is now in master 👍
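
Roughly, usage looks like the following sketch. Treat the class and parameter names as a hedged example based on the tsfresh documentation rather than a guaranteed interface for every release; the scheduler address and the DataFrame df are placeholders:

from tsfresh import extract_features
from tsfresh.utilities.distribution import ClusterDaskDistributor

distributor = ClusterDaskDistributor(address="127.0.0.1:8786")  # address of a running dask scheduler

features = extract_features(
    timeseries_container=df,   # long-format pandas DataFrame, assumed to be in scope
    column_id="id",
    column_sort="time",
    distributor=distributor,
)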
