Comments (9)
@mrocklin I don't see the "failed to deserialize" error every time. Adding
diff --git a/distributed/protocol/serialize.py b/distributed/protocol/serialize.py
index d657dab..b7b492e 100644
--- a/distributed/protocol/serialize.py
+++ b/distributed/protocol/serialize.py
@@ -87,6 +87,8 @@ def typename(typ):
 def _find_lazy_registration(typename):
     toplevel, _, _ = typename.partition('.')
     if toplevel in lazy_registrations:
+        import time
+        time.sleep(.1)
         lazy_registrations.pop(toplevel)()
         return True
     else:
to serialize.py makes it consistent. Perhaps a race condition in the lazy importer? I'll see if I can narrow down the example further.
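If the inconsistency really is a race in the lazy importer (not confirmed here), one way to rule it out would be to make the check, the pop, and the registration call a single critical section. The sketch below is only an illustration: `lazy_registrations` mirrors the name in the diff, but `_registration_lock`, `find_lazy_registration_locked`, and `run_concurrently` are hypothetical names and this is not the actual distributed code.

```python
import threading

# Hypothetical stand-ins for illustration; not the real distributed internals.
lazy_registrations = {}
_registration_lock = threading.Lock()


def find_lazy_registration_locked(typename):
    """Run the pending registration for typename's top-level package at most once."""
    toplevel, _, _ = typename.partition(".")
    with _registration_lock:
        register = lazy_registrations.pop(toplevel, None)
        if register is None:
            # Nothing pending: either never registered or already run.
            return False
        # Still holding the lock, so other callers wait until this finishes.
        register()
        return True


def run_concurrently(typename, n_threads=8):
    """Call the locked lookup from several threads and collect the results."""
    results = []

    def worker():
        results.append(find_lazy_registration_locked(typename))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With the lock held across the whole lookup, a second thread cannot observe the entry as missing while the first thread's registration is still in progress.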
from dask-ml.
Collecting some debugging observations:
The getitem takes ~0.01s normally with an indexer of length 1. There's no increase in memory usage. When things slow down, tasks take ~0.5s. The slower tasks include a disk-read-getitem:
(slow on the left, normal on the right)
When slicing with multiple indices (e.g. size=2) the graphs sometimes look different. Fast:
Slow:
Not sure if this is meaningful or not. I suspect it is. I assumed the order of operations would be
getitem for each worker -> concatenate results.
But if it's
transfer blocks to single worker -> getitem
that would explain the slowdown and memory increase.
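To make the difference between the two orderings concrete, here is a toy sketch in plain Python. It is not how dask schedules anything; the blocks are ordinary lists, and `getitem_then_concatenate`, `transfer_then_getitem`, and the `moved` counter are hypothetical names that only count how many elements would have to leave their block.

```python
from bisect import bisect_right
from itertools import accumulate


def getitem_then_concatenate(blocks, idx):
    """Index inside each block first, so only the selected elements move."""
    # offsets[b] is the global position of the first element of block b.
    offsets = [0] + list(accumulate(len(b) for b in blocks))
    out, moved = [], 0
    for i in idx:
        b = bisect_right(offsets, i) - 1  # block that holds global index i
        out.append(blocks[b][i - offsets[b]])
        moved += 1
    return out, moved


def transfer_then_getitem(blocks, idx):
    """Gather every block in one place first, then index the combined data."""
    full = [x for b in blocks for x in b]
    moved = len(full)  # every element had to be moved
    return [full[i] for i in idx], moved


blocks = [list(range(0, 10)), list(range(10, 20)), list(range(20, 30))]
idx = [25, 3, 14]
print(getitem_then_concatenate(blocks, idx))
print(transfer_then_getitem(blocks, idx))
```

Both orderings return the same values, but the second moves the whole array, which would be consistent with the slowdown and memory increase described above.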
disk-read- time blocks are due to getting elements out of worker.data. This could mean that there are many elements in worker.data that are in memory, or more likely that there are a few elements that are stored on disk. This also corresponds to the colored bars in the upper-left memory use plot in the scheduler.
If your getitem/transfer vs transfer/getitem question refers to the above code:
for i in range(5):
    idx = np.random.randint(0, len(X), size=5)
    centers = X[idx].compute()
    mem()
Then you should be fine. Dask.array definitely does the intelligent thing here.
Interestingly, whether or not the indexer is sorted seems to matter. Adding a sorted(idx) before indexing:
centers = c.compute(X[sorted(idx)])
outputs:
3.84 GB
3.84 GB
3.84 GB
3.84 GB
3.84 GB
Trying it out in k_init now.
There is a special fast-path within dask/array/slicing.py for when the index is sorted. You might search that file for issorted to find what the other non-fast-path is doing.
Or just always sort
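If the caller cares about the original order of the indexer, "always sort" can be combined with undoing the permutation afterwards. A minimal sketch, assuming a hypothetical helper `take_sorted` and a caller-supplied `take` function; with a dask array `take` might be something like `lambda s: X[s].compute()`, which is an assumption and is not exercised here.

```python
def take_sorted(take, idx):
    """Call `take` with a sorted copy of idx, then restore the caller's order."""
    # order[pos] is the position in idx of the pos-th smallest index.
    order = sorted(range(len(idx)), key=lambda k: idx[k])
    sorted_idx = [idx[k] for k in order]
    rows = take(sorted_idx)  # the indexing itself only ever sees a sorted index
    out = [None] * len(idx)
    for pos, k in enumerate(order):
        out[k] = rows[pos]
    return out


# Toy usage with a plain list standing in for the array.
data = [i * i for i in range(10)]
print(take_sorted(lambda s: [data[i] for i in s], [7, 2, 5]))
```

For picking initial centers the order probably does not matter, in which case plain sorted(idx) as above is enough.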
Just sorting works for me here. I'll still take a look in slicing to see if anything weird is going on.
Thanks!