k-means Memory usage (dask-ml, CLOSED)

dask commented on August 10, 2024
k-means Memory usage

from dask-ml.

Comments (9)

TomAugspurger commented on August 10, 2024

@mrocklin I don't see the "failed to deserialize" error every time. Adding

diff --git a/distributed/protocol/serialize.py b/distributed/protocol/serialize.py
index d657dab..b7b492e 100644
--- a/distributed/protocol/serialize.py
+++ b/distributed/protocol/serialize.py
@@ -87,6 +87,8 @@ def typename(typ):
 def _find_lazy_registration(typename):
     toplevel, _, _ = typename.partition('.')
     if toplevel in lazy_registrations:
+        import time
+        time.sleep(.1)
         lazy_registrations.pop(toplevel)()
         return True
     else:

to serialize.py makes the error occur consistently. Perhaps there's a race condition in the lazy importer? I'll see if I can narrow down the example further.

TomAugspurger commented on August 10, 2024

Collecting some debugging observations:

The getitem normally takes ~0.01s with an indexer of length 1, and there's no increase in memory usage. When things slow down, it takes ~0.5s. The slower tasks include a disk-read-getitem:

[image: screen shot 2017-10-17 at 9 55 46 am]

(slow on the left, normal on the right)

When slicing with multiple indices (e.g. size=2), the task graphs sometimes look different. Fast:

[image: fast]

Slow:

[image: slow]

Not sure if this is meaningful or not. I suspect it is. I assumed the order of operations would be

getitem for each worker -> concatenate results.

But if it's

transfer blocks to single worker -> getitem

that would explain the slowdown and memory increase.
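
To make the difference between those two orderings concrete, here is a rough NumPy-only sketch. It is not how dask schedules anything; the list of blocks stands in for chunks held on different workers, and the function names and the `moved` byte counter are made up for illustration. It assumes a sorted indexer so both orderings return rows in the same order.

```python
import numpy as np


def getitem_then_concat(blocks, idx):
    """Index each block where it lives, then gather only the selected rows."""
    idx = np.asarray(idx)
    # Global row offset at which each block starts and stops.
    offsets = np.cumsum([0] + [len(b) for b in blocks])
    pieces, moved = [], 0
    for start, stop, block in zip(offsets[:-1], offsets[1:], blocks):
        # Translate the global indices that fall in this block to local ones.
        local = idx[(idx >= start) & (idx < stop)] - start
        piece = block[local]
        moved += piece.nbytes  # only the selected rows would be transferred
        pieces.append(piece)
    return np.concatenate(pieces), moved


def concat_then_getitem(blocks, idx):
    """Move every block to one place first, then index the full array."""
    moved = sum(b.nbytes for b in blocks)  # every block would be transferred
    full = np.concatenate(blocks)
    return full[np.asarray(idx)], moved


X = np.arange(24, dtype=float).reshape(12, 2)
blocks = np.split(X, 3)  # three "chunks" of four rows each
idx = [1, 6, 7]          # sorted indexer

rows_a, moved_a = getitem_then_concat(blocks, idx)
rows_b, moved_b = concat_then_getitem(blocks, idx)
```

Both functions return the same rows, but the second one moves every block before selecting anything, which is the kind of behavior that would show up as extra transfer time and memory.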

mrocklin commented on August 10, 2024

disk-read- time blocks are due to getting elements out of worker.data. This could mean that there are many elements in worker.data that are in memory, or more likely that there are a few elements that are stored on disk. This also corresponds to colored bars in the upper left memory use plot in the scheduler.

mrocklin commented on August 10, 2024

If your getitem/transfer vs transfer/getitem question refers to the above code:

for i in range(5):
    idx = np.random.randint(0, len(X), size=5)
    centers = X[idx].compute()
    mem()

Then you should be fine. Dask.array definitely does the intelligent thing here.

TomAugspurger commented on August 10, 2024

Interestingly, whether or not the indexer is sorted seems to matter. Adding a sorted(idx) before indexing:

centers = c.compute(X[sorted(idx)])

outputs:

3.84 GB
3.84 GB
3.84 GB
3.84 GB
3.84 GB

Trying it out in k_init now.
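
For k_init the order of the sampled rows probably doesn't matter, but if a caller does need the rows back in the original (unsorted) order, one option is to index with the sorted indexer and then undo the sort on the small result. A minimal sketch, using plain NumPy so it runs anywhere; `sorted_index_and_inverse` is a hypothetical helper, not something in dask or dask-ml:

```python
import numpy as np


def sorted_index_and_inverse(idx):
    """Return (sorted_idx, inverse) such that sorted_idx[inverse] == idx."""
    idx = np.asarray(idx)
    order = np.argsort(idx, kind="stable")  # positions that sort idx
    sorted_idx = idx[order]
    # Build the inverse permutation that restores the original order.
    inverse = np.empty_like(order)
    inverse[order] = np.arange(len(order))
    return sorted_idx, inverse


# Stand-in data; with a dask array you would compute X[sorted_idx] first
# and then reorder the small in-memory result.
X = np.arange(20, dtype=float).reshape(10, 2)
idx = np.array([7, 2, 9, 2])

sorted_idx, inverse = sorted_index_and_inverse(idx)
rows_sorted = X[sorted_idx]      # indexing only ever sees a sorted indexer
centers = rows_sorted[inverse]   # back in the caller's original order
```

The reordering happens on the handful of selected rows only, so it should be cheap compared with the indexing itself.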

mrocklin commented on August 10, 2024

There is a special fast-path within dask/array/slicing.py for when the index is sorted. You might search that file for issorted to find out what the other, non-fast path is doing.

mrocklin commented on August 10, 2024

Or just always sort

TomAugspurger commented on August 10, 2024

Just sorting works for me here. I'll still take a look in slicing.py to see if anything weird is going on.

TomAugspurger commented on August 10, 2024

Thanks!
