
Dask support (tsfresh issue, closed)

blue-yonder commented on August 24, 2024
Dask support


Comments (11)

nils-braun commented on August 24, 2024

This is quite an interesting topic. However, the dask DataFrame-creation API is highly incompatible with the pandas one and using dask correctly would mean some severe changes to the implementation.

So my questions are:
(1) do we really need this,
(2) are there use cases where we need to run over more data than fits into memory, and
(3) isn't the processing time in those cases so large that we should rather think about a "cluster-parallel" implementation?


jneuff commented on August 24, 2024

Yes, this would imply quite some changes. This issue was meant to trigger the discussion.

On (3): Dask actually offers distributed scheduling.


MaxBenChrist commented on August 24, 2024

(1) Maybe dask is not the right solution for our problem. But at the moment, when one wants to extract features for a large number of time series, one has to apply tsfresh chunkwise because some feature calculators are quite memory intensive. So let's say you have 16 GB of RAM; then you probably cannot process 2-3 GB of time series data in one chunk, so you have to split it by devices and pickle the feature DataFrames (see the rough sketch below, after this list). What do you think about renaming this issue to "allow extraction of features for big time series"?

(2) yes, definitely

(3) you are probably right here.
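
Roughly, the chunkwise workaround from (1) looks like the following sketch. The column names ("id", "time"), the chunk size, and the output path are just placeholders for illustration, not anything prescribed by tsfresh:

import pandas as pd
from tsfresh import extract_features

def extract_chunkwise(df: pd.DataFrame, chunk_size: int = 50, out_dir: str = "features"):
    # Extract features for a handful of devices at a time and pickle each result,
    # so that memory-hungry feature calculators never see the full dataset at once.
    device_ids = df["id"].unique()
    for start in range(0, len(device_ids), chunk_size):
        subset = df[df["id"].isin(device_ids[start:start + chunk_size])]
        features = extract_features(subset, column_id="id", column_sort="time")
        features.to_pickle(f"{out_dir}/features_{start}.pkl")  # persist and free memory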


mrocklin commented on August 24, 2024

Just came across this. If there are particular incompatibilities between Dask DataFrames and pandas DataFrames that affect this project, please let us know.

Also, it may be that the right approach isn't to use dask.dataframe directly, but instead to use it to pre-process data and then apply tsfresh functions over it. For example, it would be easy to use dask.dataframe to load large datasets, share a fixed window of data (maybe five minutes) between neighboring partitions (which are just pandas DataFrames), and then call tsfresh functions on each of those pandas DataFrames. This sort of solution (or something similar) is common for more complex applications like yours.

Anyway, let me know if dask developers can help.
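
A rough sketch of what I mean (the file pattern and column names are placeholders, the overlapping-window part is left out for brevity, and each id is assumed to live entirely inside one partition):

import dask
import dask.dataframe as dd
import pandas as pd
from tsfresh import extract_features

# Load a large dataset; each partition of the dask DataFrame is a pandas DataFrame.
ddf = dd.read_csv("sensor_data_*.csv")
partitions = ddf.to_delayed()

@dask.delayed
def features_for_partition(pdf: pd.DataFrame) -> pd.DataFrame:
    # Plain tsfresh call on an in-memory pandas DataFrame.
    return extract_features(pdf, column_id="id", column_sort="time")

results = dask.compute(*[features_for_partition(p) for p in partitions])
all_features = pd.concat(results)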


MaxBenChrist commented on August 24, 2024

@mrocklin, thank you for your nice message.

At the moment we are working on some other issues (e.g. shaping the tsfresh API, restructuring the test framework, ...). I discussed a possible dask implementation with @chmp (he has contributed a few things to the dask project). As soon as this comes up again and we have some spare time to tackle it, I will get in contact with you guys :).


MaxBenChrist commented on August 24, 2024

@mrocklin

We are working on supporting dask to calculate tsfresh features in a distributed fashion (see the distributed branch). dask is a great tool, and our first experiments on a cluster make me really excited about the upcoming tsfresh release.

I have one question regarding the pure argument of the client.map method. Essentially, tsfresh is a wrapper around numpy, scipy and pandas methods. Most of the features that we calculate, e.g. Fourier coefficients, are computed in C libraries.

In the dask documentation you say:

By default we assume that all functions are pure. If this is not the case we should use the pure=False keyword argument.

So, following this advice, I should set pure=False because we rely on those scipy/numpy methods. But on that same page you also mention:

This key should be the same across all computations with the same inputs and across all machines. If we run the computation above on any computer with the same environment then we should get the exact same key.

The scheduler avoids redundant computations. If the result is already in memory from a previous call then that old result will be used rather than recomputing it. Calls to submit or map are idempotent in the common case.

We want the same features to be calculated for the same input. So, to achieve that and reduce the number of redundant calculations, we have to set pure=True so that the same input gets the same computation key, right?
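
For concreteness, a toy sketch of the kind of call I have in mind (the feature function and the chunks here are illustrative stand-ins, not our actual code from the distributed branch):

from dask.distributed import Client
import numpy as np

def calculate_features(chunk):
    # Deterministic: only wraps numpy computations, so the same input
    # always produces the same output.
    return {"mean": float(np.mean(chunk)), "std": float(np.std(chunk))}

client = Client()  # local cluster, just for testing
chunks = [np.arange(10), np.arange(10), np.arange(20)]

# With pure=True, identical (function, input) pairs hash to the same task key,
# so the duplicated chunk above is only computed once.
futures = client.map(calculate_features, chunks, pure=True)
print(client.gather(futures))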


MaxBenChrist commented on August 24, 2024

Reference: PR #316


mrocklin commented on August 24, 2024

The pure= keyword is a bit inaccurate. Pure has a few meanings. It would be more accurate to define this as deterministic= instead. If you use pure=True then you are stating that applying the same function to the same arguments will always produce the same result.

If you submit the same function with the same arguments twice:

x = client.submit(inc, 1)
y = client.submit(inc, 1)

Under pure=True these will point to the same data. Under pure=False they will point to different data.

Relying on NumPy/SciPy/Pandas has no impact here. Most of those functions are deterministic.
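
A runnable version of the snippet, with inc and a local client filled in for illustration:

from dask.distributed import Client

def inc(n):
    return n + 1

client = Client()

x = client.submit(inc, 1)   # pure=True is the default
y = client.submit(inc, 1)
assert x.key == y.key       # same key: computed once, result shared

a = client.submit(inc, 1, pure=False)
b = client.submit(inc, 1, pure=False)
assert a.key != b.key       # forced to be treated as distinct tasks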


MaxBenChrist commented on August 24, 2024

The pure= keyword is a bit inaccurate. Pure has a few meanings. It would be more accurate to define this as deterministic= instead. If you use pure=True then you are stating that applying the same function to the same arguments will always produce the same result.

Thanks for the explanation!


mrocklin commented on August 24, 2024


MaxBenChrist commented on August 24, 2024

Dask support is now in master 👍
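
Roughly, usage looks like the following sketch. Treat the class and parameter names as a hedged example based on the tsfresh documentation rather than a guaranteed interface for every release; the scheduler address and the DataFrame df are placeholders:

from tsfresh import extract_features
from tsfresh.utilities.distribution import ClusterDaskDistributor

distributor = ClusterDaskDistributor(address="127.0.0.1:8786")  # address of a running dask scheduler

features = extract_features(
    timeseries_container=df,   # long-format pandas DataFrame, assumed to be in scope
    column_id="id",
    column_sort="time",
    distributor=distributor,
)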
