
dask-xgboost's Introduction

Dask-XGBoost

Warning

Dask-XGBoost has been deprecated and is no longer maintained. The functionality of this project has been included directly in XGBoost. To use Dask and XGBoost together, please use xgboost.dask instead (https://xgboost.readthedocs.io/en/latest/tutorials/dask.html).

Distributed training with XGBoost and Dask.distributed

This repository offers a legacy option to perform distributed training with XGBoost on Dask.array and Dask.dataframe collections.

pip install dask-xgboost

Please note that XGBoost now includes a Dask API as part of its official Python package. That API is independent of dask-xgboost and is now the recommended way to use Dask and XGBoost together. See the xgboost.dask documentation at https://xgboost.readthedocs.io/en/latest/tutorials/dask.html for more details on the new API.
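
For reference, here is a minimal sketch of the replacement xgboost.dask workflow. The scheduler address, file name, and 'target' column are placeholders, and the linked tutorial remains the authoritative reference:

import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client

client = Client('scheduler-address:8786')      # placeholder address

df = dd.read_csv('data.csv')                   # placeholder file
X = df.drop('target', axis=1)                  # assumes a 'target' column
y = df['target']

dtrain = xgb.dask.DaskDMatrix(client, X, y)    # distributed DMatrix
output = xgb.dask.train(
    client,
    {'objective': 'binary:logistic', 'tree_method': 'hist'},
    dtrain,
    num_boost_round=100,
)
booster = output['booster']                    # a plain xgboost.core.Booster
predictions = xgb.dask.predict(client, booster, X)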

Example

from dask.distributed import Client
client = Client('scheduler-address:8786')  # connect to cluster

import dask.dataframe as dd
df = dd.read_csv('...')  # use dask.dataframe to load and
df_train = ...           # preprocess data
labels_train = ...

import dask_xgboost as dxgb
params = {'objective': 'binary:logistic', ...}  # use normal xgboost params
bst = dxgb.train(client, params, df_train, labels_train)

>>> bst  # Get back normal XGBoost result
<xgboost.core.Booster at ... >

predictions = dxgb.predict(client, bst, data_test)

How this works

For more information on using Dask.dataframe for preprocessing see the Dask.dataframe documentation.

Once you have created suitable data and labels, you are ready for distributed training with XGBoost. Every Dask worker starts an XGBoost worker and gives it enough information to find the others. Then the Dask workers hand their in-memory pandas dataframes to XGBoost (one Dask dataframe is just many pandas dataframes spread across the memory of many machines). XGBoost handles the distributed training on its own, without further involvement from Dask, and then hands back a single xgboost.Booster result object.
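
In code, the per-worker step looks roughly like the sketch below. This is a simplified illustration of the pattern, not the library's exact implementation; env stands for the tracker environment broadcast from the scheduler, and local_data/local_labels for the pandas partitions already resident on the worker:

import xgboost as xgb

def train_part(env, params, local_data, local_labels, **kwargs):
    # Join the Rabit ring using the tracker information from the scheduler
    args = [('%s=%s' % item).encode() for item in env.items()]
    xgb.rabit.init(args)
    try:
        # Local pandas partitions become a local DMatrix; from here on
        # XGBoost coordinates the distributed training itself
        dtrain = xgb.DMatrix(local_data, label=local_labels)
        bst = xgb.train(params, dtrain, **kwargs)
    finally:
        xgb.rabit.finalize()
    return bst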

Larger Example

For a more serious example see

History

Conversation during development happened at dmlc/xgboost #2032

dask-xgboost's People

Contributors

evanfwelch, jacobtomlinson, javabrett, johnzed, jrbourbeau, ksangeek, kylejn27, mmccarty, mrocklin, mrphilroth, sultanorazbayev, tomaugspurger, tomlaube


dask-xgboost's Issues

AttributeError when using GridSearchCV with XGBClassifier

Hello,

I'm working on a small proof of concept. I use dask in my project and would like to use the XGBClassifier. I also need a parameter search and, of course, cross-validation mechanisms.

Unfortunately, when fitting the dask_xgboost.XGBClassifier, I get the following error:

Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_xgboost\core.py", line 97, in _train
AttributeError: 'DataFrame' object has no attribute 'to_delayed'

Although I call .fit() with two dask objects, somehow it becomes a pandas.DataFrame later on.

Here's the code I'm using:

import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask_ml.model_selection import GridSearchCV
from dask_xgboost import XGBClassifier
from distributed import Client
from sklearn.datasets import load_iris

if __name__ == '__main__':

    client = Client()

    data = load_iris()

    x = pd.DataFrame(data=data['data'], columns=data['feature_names'])
    x = dd.from_pandas(x, npartitions=2)

    y = pd.Series(data['target'])
    y = dd.from_pandas(y, npartitions=2)

    estimator = XGBClassifier(objective='multi:softmax', num_class=4)
    grid_search = GridSearchCV(
        estimator,
        param_grid={
            'n_estimators': np.arange(15, 105, 15)
        },
        scheduler='threads'
    )

    grid_search.fit(x, y)
    results = pd.DataFrame(grid_search.cv_results_)
    print(results.to_string())

I use the packages in the following versions:

pandas==0.23.3
numpy==1.15.1
dask==0.20.0
dask-ml==0.11.0
dask-xgboost==0.1.5

Note that I don't get this exception when using sklearn.ensemble.GradientBoostingClassifier.

Any help would be appreciated.

Mateusz

new release?

Thanks for maintaining this awesome package! I'm really digging it.

Would you consider doing a minor release to PyPI? The last one (1.10.0) was about 6 months ago, and I'm relying on installing from GitHub right now to get this fix: #40

Thanks for your time and consideration

Getting ValueError when fitting model

I'm getting the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-23f3b9353ae8> in <module>
      3 
      4 est=XGBClassifier()
----> 5 est.fit(X_train, y_train)

~/anaconda3/lib/python3.7/site-packages/dask_xgboost/core.py in fit(self, X, y, classes, eval_set, sample_weight_eval_set, eval_metric, early_stopping_rounds)
    515             missing=self.missing,
    516             n_jobs=self.n_jobs,
--> 517             early_stopping_rounds=early_stopping_rounds,
    518         )
    519 

~/anaconda3/lib/python3.7/site-packages/dask_xgboost/core.py in train(client, params, data, labels, dmatrix_kwargs, **kwargs)
    240     """
    241     return client.sync(
--> 242         _train, client, params, data, labels, dmatrix_kwargs, **kwargs
    243     )
    244 

~/anaconda3/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    754         else:
    755             return sync(
--> 756                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    757             )
    758 

~/anaconda3/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    331     if error[0]:
    332         typ, exc, tb = error[0]
--> 333         raise exc.with_traceback(tb)
    334     else:
    335         return result[0]

~/anaconda3/lib/python3.7/site-packages/distributed/utils.py in f()
    315             if callback_timeout is not None:
    316                 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 317             result[0] = yield future
    318         except Exception as exc:
    319             error[0] = sys.exc_info()

~/anaconda3/lib/python3.7/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

~/anaconda3/lib/python3.7/site-packages/tornado/gen.py in run(self)
    746                             exc_info = None
    747                     else:
--> 748                         yielded = self.gen.send(value)
    749 
    750                 except (StopIteration, Return) as e:

~/anaconda3/lib/python3.7/site-packages/dask_xgboost/core.py in _train(client, params, data, labels, dmatrix_kwargs, **kwargs)
    182 
    183     # Start the XGBoost tracker on the Dask scheduler
--> 184     host, port = parse_host_port(client.scheduler.address)
    185     env = yield client._run_on_scheduler(
    186         start_tracker, host.strip("/:"), len(worker_map)

~/anaconda3/lib/python3.7/site-packages/dask_xgboost/core.py in parse_host_port(address)
     29     if "://" in address:
     30         address = address.rsplit("://", 1)[1]
---> 31     host, port = address.split(":")
     32     port = int(port)
     33     return host, port

ValueError: not enough values to unpack (expected 2, got 1)

While trying to fit the model:

import pandas as pd

from dask.distributed import Client, progress
#from sklearn.ensemble import RandomForestClassifier
#from sklearn.model_selection import train_test_split
import joblib
from dask import dataframe as ddf
import numpy as np
from dask_ml.model_selection import train_test_split
from dask_ml.xgboost import XGBClassifier, train, predict

client=Client(processes=False,threads_per_worker=8,n_workers=1, memory_limit="16GB")


#split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(df["Store Area"], df[Y_column])

est=XGBClassifier()
est.fit(X_train, y_train)

Any idea why this might be happening?
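
For what it's worth, the parse_host_port helper shown at the bottom of the traceback assumes a host:port address. A minimal, self-contained sketch of that same logic reproduces the unpacking error; the inproc-style address below is an assumption about what the scheduler address of a local Client(processes=False), as used above, can look like:

def parse_host_port(address):
    # Same logic as the helper in the traceback above
    if "://" in address:
        address = address.rsplit("://", 1)[1]
    host, port = address.split(":")   # ValueError if there is no ':port' part
    return host, int(port)

print(parse_host_port("tcp://192.168.1.10:8786"))
# ('192.168.1.10', 8786)
print(parse_host_port("inproc://192.168.1.10/12345/1"))
# ValueError: not enough values to unpack (expected 2, got 1)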

import xgboost error

currently on dask 1.14 [complete] (from pip)
dask-xgboost also installed (from pip)
xgb 0.62 ubuntu 64

The following error occurs:

import dask_xgboost


ImportError Traceback (most recent call last)
in ()
----> 1 import dask_xgboost

/work/miniconda/lib/python3.5/site-packages/dask_xgboost/__init__.py in ()
----> 1 from .core import _train, train, predict

/work/miniconda/lib/python3.5/site-packages/dask_xgboost/core.py in ()
12 from distributed.client import _wait
13 from distributed.utils import sync
---> 14 from distributed.comm.core import parse_host_port
15 import xgboost as xgb
16

ImportError: cannot import name 'parse_host_port'

===========================================================

According to the documentation, parse_host_port exists in distributed.comm.addressing rather than distributed.comm.core:

http://distributed.readthedocs.io/en/latest/_modules/distributed/comm/addressing.html?highlight=parse_host_port
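
A minimal sketch of the workaround described above, assuming the function behaves the same in its documented location:

from distributed.comm.addressing import parse_host_port

host, port = parse_host_port("192.168.1.10:8786")
print(host, port)   # 192.168.1.10 8786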

Problem with rabit.init

After modifying the code with PR #4, I'm now running into errors when xgboost's rabit is started:

retry connect to ip(retry time 1): [localhost]
retry connect to ip(retry time 2): [localhost]
retry connect to ip(retry time 3): [localhost]
retry connect to ip(retry time 4): [localhost]
connect to (failed): [localhost]
Socket Connect Error:Invalid argument
distributed.nanny - WARNING - Worker process 34605 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker

These are logs from a local worker on my laptop when xgb.rabit.init(args) is called. I'm not sure how to go about debugging this.

'return' with argument error on dask_xgboost import

When trying to import dask_xgboost (installed via pip), I get the error:

File "/usr/local/lib/python2.7/dist-packages/dask_xgboost/core.py", line 126
    return result
SyntaxError: 'return' with argument inside generator

Thoughts?
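
For context, a hedged illustration of what the error means: in Python 2, return result inside a generator is a SyntaxError, so Tornado-era coroutines used raise gen.Return(result) instead (Python 3.3+ accepts the plain return). The function below is purely illustrative:

from tornado import gen

@gen.coroutine
def _train_sketch():
    result = yield gen.moment      # placeholder for the real asynchronous work
    raise gen.Return(result)       # Python-2-compatible equivalent of `return result`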

"module 'dask_xgboost' has no attribute * erors"

Hello,

I am experiencing strange behaviour when trying to use dask-xgboost. I am on a Windows 10 machine and am using Python 3.7. I am trying to test dask on my local machine first with a local cluster.

I have no errors when importing dask_xgboost, but when I try to call any of its functions I get a "no attribute" error. For some reason I am unable to use this package.
Two examples that throw the error:

  1. bst = dask_xgboost.train(client, params, X_train, y_train, num_boost_round=10) --> no attribute train
  2. model = dxgb.XGBClassifier() --> no attribute Xgb

This is probably something silly and perhaps Windows related.

Thanks for any help!

Training on selected subset of Dask DataFrame rows

In order to do CV iterations, I'd like to load all the data into a Dask DataFrame, select a subset of rows, and then train only on those selected rows. I've written some code that does this, and it works when I use a LocalCluster. When I try to use a true distributed cluster, I get an error when constructing the worker_map:

Traceback (most recent call last):
  File "/home/proth/.conda/envs/dask/lib/python3.5/site-packages/dask_xgboost/core.py", line 116, in _train
    worker_map[first(workers)].append(key_to_part_dict[key])
  File "/home/proth/.conda/envs/dask/lib/python3.5/site-packages/toolz/itertoolz.py", line 368, in first
    return next(iter(seq))
StopIteration

I tried to write a test that demonstrates this behavior, but the test passes. This makes me think that distributed.utils_test.gen_cluster uses a LocalCluster. Is there any way to run this test with a true distributed cluster?

@gen_cluster(client=True, timeout=None)
def test_subset(c, s, a, b):
    # Train in xgboost with selected data
    dtrain = xgb.DMatrix(df[df.x < 8], label=labels[:7])
    bst = xgb.train(param, dtrain)

    # Combine data and labels and select rows together
    newdf = df.copy()
    newdf["labels"] = labels
    ddf = dd.from_pandas(newdf, npartitions=4)
    ddf_trimmed = ddf[ddf.x < 8]

    # Split data and labels back apart
    ddf_train = ddf_trimmed.drop("labels", axis=1)
    ddf_labels = ddf_trimmed["labels"]

    # Train in dask-xgboost with trimmed data (fails with distributed cluster!)
    dbst = yield dxgb._train(c, param, ddf_train, ddf_labels)

    result = bst.predict(dtrain)
    dresult = dbst.predict(dtrain)

    correct = (result > 0.5) == labels[:7]
    dcorrect = (dresult > 0.5) == labels[:7]
    assert dcorrect.sum() == correct.sum()

[FEATURE DISCUSSION] Adding dedicated support for Dask-cuDF

Currently, RAPIDS deploys a forked version of this repository with additions that support a few things:

  1. New Random Forests (RF) interface in XGBoost
  2. Ingest of dask-cudf objects
  3. Ingest of GPU device-resident XGBoost.DMatrix objects.

We're considering abstracting these additional features and their associated dependencies into a compat.py so that we can contribute these feature additions back to this code repository without breaking its current pipeline, or requiring any additional code dependencies (in case users want the default experience).

We're wondering what the thoughts around this are, and if the community would be receptive to these additional features.

Cannot assign requested address

Hi,

On a simple Dask XGBoost run I get the error in the subject. The sample code looks like:

from dask_ml.xgboost import XGBRegressor
est = XGBRegressor(...)
x = dd.read_csv('somedata.csv')
y = x.y
del x['y'] 
est.fit(x, y)

And the error is as follows:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-4-d50d84593355> in <module>()
      5 y = x.y
      6 del x['y']
----> 7 est.fit(x, y)

/opt/conda/lib/python3.6/site-packages/dask_xgboost/core.py in fit(self, X, y)
    239         xgb_options = self.get_xgb_params()
    240         self._Booster = train(client, xgb_options, X, y,
--> 241                               num_boost_round=self.n_estimators)
    242         return self
    243 

/opt/conda/lib/python3.6/site-packages/dask_xgboost/core.py in train(client, params, data, labels, dmatrix_kwargs, **kwargs)
    167     """
    168     return sync(client.loop, _train, client, params, data,
--> 169                 labels, dmatrix_kwargs, **kwargs)
    170 
    171 

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    273             e.wait(10)
    274     if error[0]:
--> 275         six.reraise(*error[0])
    276     else:
    277         return result[0]

/opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in f()
    258             yield gen.moment
    259             thread_state.asynchronous = True
--> 260             result[0] = yield make_coro()
    261         except Exception as exc:
    262             error[0] = sys.exc_info()

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1097 
   1098                     try:
-> 1099                         value = future.result()
   1100                     except Exception:
   1101                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1105                     if exc_info is not None:
   1106                         try:
-> 1107                             yielded = self.gen.throw(*exc_info)
   1108                         finally:
   1109                             # Break up a reference to itself

/opt/conda/lib/python3.6/site-packages/dask_xgboost/core.py in _train(client, params, data, labels, dmatrix_kwargs, **kwargs)
    122     env = yield client._run_on_scheduler(start_tracker,
    123                                          host.strip('/:'),
--> 124                                          len(worker_map))
    125 
    126     # Tell each worker to train on the chunks/parts that it has locally

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1097 
   1098                     try:
-> 1099                         value = future.result()
   1100                     except Exception:
   1101                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1111                             exc_info = None
   1112                     else:
-> 1113                         yielded = self.gen.send(value)
   1114 
   1115                     if stack_context._state.contexts is not orig_stack_contexts:

/opt/conda/lib/python3.6/site-packages/distributed/client.py in _run_on_scheduler(self, function, *args, **kwargs)
   1911                                                      kwargs=dumps(kwargs))
   1912         if response['status'] == 'error':
-> 1913             six.reraise(*clean_exception(**response))
   1914         else:
   1915             raise gen.Return(response['result'])

/opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    690                 value = tp()
    691             if value.__traceback__ is not tb:
--> 692                 raise value.with_traceback(tb)
    693             raise value
    694         finally:

/opt/conda/lib/python3.6/site-packages/dask_xgboost/core.py in start_tracker()
     30     """ Start Rabit tracker """
     31     env = {'DMLC_NUM_WORKER': n_workers}
---> 32     rabit = RabitTracker(hostIP=host, nslave=n_workers)
     33     env.update(rabit.slave_envs())
     34 

/opt/conda/lib/python3.6/site-packages/dask_xgboost/tracker.py in __init__()
    166         for port in range(port, port_end):
    167             try:
--> 168                 sock.bind((hostIP, port))
    169                 self.port = port
    170                 break

OSError: [Errno 99] Cannot assign requested address

Any help will be greatly appreciated.

Thanks.

xgboost stops working after training finishes with dask.distributed.SSHCluster

What happened:
I have a cluster with 3 nodes and I tried to run xgb.dask.train() with SSHCluster as:

   with SSHCluster(["localhost", "node005", "node006"],worker_options={"nthreads": 2}) as cluster:
        cluster.scale(3)
        with Client(cluster) as client:
            output = xgb.dask.train(client,
                            {'verbosity': 1,
                             'tree_method': 'hist',
                             'n_estimators': 5,
                             'max_depth': 50,
                             'n_jobs': -1,
                             'random_state': 2,
                             'learning_rate': 0.1,
                             'min_child_weight': 1,
                             'seed': 0,
                             'subsample': 0.8,
                             'colsample_bytree': 0.8,
                             'gamma': 0,
                             'reg_alpha': 0,
                             'reg_lambda': 1},
                            dtrain,
                            num_boost_round=500, evals=[(dTest, 'train')])
            print("after train")
            bst = output['booster']
            history = output['history']
            with open('/tmp/model_xgbregressor0.pkl', 'wb') as f1:
                  pickle.dump(bst, f1)
            cluster.scale(0)
            client.shutdown()

I can see the model begin training on 2 nodes, as CPU usage for one Python process is about 100%. However, after training finishes (CPU usage drops under 10%), nothing happens on the nodes or the client: I never see the debug log "after train", and the model is not dumped on either node. It never finishes.

Environment:

  • Dask version: 2.20.0
  • Python version: 3.6
  • Operating System: CentOS-7
  • Install method (conda, pip, source): pip

[BUG] testcase failure - gen_cluster() got an unexpected keyword argument 'check_new_threads'

Describe the bug
pytest for the tests that call gen_cluster() fails with:

E   TypeError: gen_cluster() got an unexpected keyword argument 'check_new_threads'

Steps/Code to reproduce bug
This can be easily reproduced with distributed version 2.3.2.
Here is the complete error seen on the terminal:

$ pytest test_core.py
====================================== test session starts =======================================
platform linux -- Python 3.6.9, pytest-5.0.1, py-1.8.0, pluggy-0.12.0
rootdir: /home/sangeek/dask-xgb-tests/tests
plugins: cov-2.7.1, xdist-1.28.0, forked-1.0.2
collected 0 items / 1 errors

============================================= ERRORS =============================================
_________________________________ ERROR collecting test_core.py __________________________________
test_core.py:136: in <module>
    @gen_cluster(client=True, timeout=None, check_new_threads=False)
E   TypeError: gen_cluster() got an unexpected keyword argument 'check_new_threads'
!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!
==================================== 1 error in 3.70 seconds =====================================

Additional context
Looking at the code for the latest release of distributed -
https://github.com/dask/distributed/blob/a0d68d42b81fe0215098d7f87783e2d57b2aa703/distributed/utils_test.py#L831
I see that they have removed support for the parameter check_new_threads.

Removing that parameter from the call to gen_cluster() helps me get past this issue, i.e.

< @gen_cluster(client=True, timeout=None, check_new_threads=False)
---
> @gen_cluster(client=True, timeout=None)

I could have attempted to patch this change, but for the reasons mentioned in #47 I still can't run all the tests successfully, so I'm not sure whether there are any other side effects of this change.

xgboost not running when not using LocalCluster

Hey guys, I am trying to run a distributed xgboost model on dask. I am able to run it using a LocalCluster(); however, when I manually set up my scheduler with just 1 worker and use a Client pointing at that scheduler, I hit some exceptions. This is my code:

from dask.distributed import Client
import dask.dataframe as dd

client = Client('192.168.49.37:8786')
client.restart()

df = dd.read_csv("adult_comp_cont", storage_options={'anon' : True})
df = df[:100]
df.columns = [str(i) for i in range(6)] + ['target']
Y = client.persist(df['target'])
X = client.persist(df.drop('target', axis=1))

import dask_xgboost as dxgb
params = {'objective' :'binary:logistic', 'n_estimators' : 10, 'max_depth' : 3, 'learning_rate' : 0.033}
bst = dxgb.train(client, params, X, Y)
print("Predicted!")

The exception I am getting is:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/dask_xgboost/core.py", line 116, in _train
    worker_map[first(workers)].append(key_to_part_dict[key])
  File "/usr/local/lib/python3.5/dist-packages/toolz/itertoolz.py", line 368, in first
    return next(iter(seq))
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sagnikb/PycharmProjects/auto_ML/test_dask_distributed.py", line 15, in <module>
    bst = dxgb.train(client, params, X, Y)
  File "/usr/local/lib/python3.5/dist-packages/dask_xgboost/core.py", line 169, in train
    labels, dmatrix_kwargs, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/distributed/utils.py", line 253, in sync
    six.reraise(*error[0])
  File "/home/sagnikb/.local/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/distributed/utils.py", line 238, in f
    result[0] = yield make_coro()
  File "/usr/local/lib/python3.5/dist-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/usr/local/lib/python3.5/dist-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
RuntimeError: generator raised StopIteration

Process finished with exit code 1

This is my scheduler log:

(dask)bblite@MasterNode >>>>  source dask/bin/activate && dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:  tcp://192.168.49.37:8786
distributed.scheduler - INFO - Local Directory:    /tmp/scheduler-otyb311x
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://192.168.49.38:34850
distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.49.38:34850
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-3fecd394-79cf-11e8-9d33-945330cf5663
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Send lost future signal to clients
distributed.scheduler - INFO - Remove worker tcp://192.168.49.38:34850
distributed.core - INFO - Removing comms to tcp://192.168.49.38:34850
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Register tcp://192.168.49.38:41151
distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.49.38:41151
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Remove client Client-3fecd394-79cf-11e8-9d33-945330cf5663
distributed.scheduler - INFO - Close client connection: Client-3fecd394-79cf-11e8-9d33-945330cf5663

This is my worker log:

(dask)bblite@WorkerNode1:~$ source dask/bin/activate && dask-worker 192.168.49.37:8786
distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.49.38:39734'
distributed.worker - INFO -       Start worker at:  tcp://192.168.49.38:34850
distributed.worker - INFO -          Listening to:  tcp://192.168.49.38:34850
distributed.worker - INFO -              nanny at:        192.168.49.38:39734
distributed.worker - INFO - Waiting to connect to:   tcp://192.168.49.37:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   16.83 GB
distributed.worker - INFO -       Local Directory: /home/bblite/dask-worker-space/worker-bwwa2seh
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:   tcp://192.168.49.37:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tcp://192.168.49.38:34850
distributed.worker - INFO -       Start worker at:  tcp://192.168.49.38:41151
distributed.worker - INFO -          Listening to:  tcp://192.168.49.38:41151
distributed.worker - INFO -              nanny at:        192.168.49.38:39734
distributed.worker - INFO - Waiting to connect to:   tcp://192.168.49.37:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   16.83 GB
distributed.worker - INFO -       Local Directory: /home/bblite/dask-worker-space/worker-mqcum0th
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:   tcp://192.168.49.37:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - WARNING -  Compute Failed
Function:  read_block_from_file
args:      (<dask.bytes.core.OpenFile object at 0x7f9f01355978>, 0, 64000000, b'\n')
kwargs:    {}
Exception: FileNotFoundError(2, 'No such file or directory')

Is anyone else facing this issue? Thanks in advance.
Environment: Python 3.5, dask 0.17.5

train rabit module

Hi,

I have the following issue after I run:
bst = dxgb.train(client, params, X, y)

/usr/local/lib/python2.7/dist-packages/dask_xgboost/core.pyc in train_part()
69
70 args = [('%s=%s' % item).encode() for item in env.items()]
---> 71 xgb.rabit.init(args)
72 try:
73 logger.info("Starting Rabit, Rank %d", xgb.rabit.get_rank())

AttributeError: 'module' object has no attribute 'rabit'

I already reinstalled xgboost

Thanks for your help

CI failures

There are a few test failures on master that I came across in #33 and thought it was worth opening up a separate issue.

The test failures are due to a ChildProcessError. For example, the traceback for pytest dask_xgboost/tests/test_core.py::test_classifier on master is

Traceback details
[gw0] darwin -- Python 3.6.6 /Users/jbourbeau/miniconda/envs/quansight/bin/python
loop = <tornado.platform.asyncio.AsyncIOLoop object at 0x1c1a8e8080>

    def test_classifier(loop):  # noqa
>       with cluster() as (s, [a, b]):

dask_xgboost/tests/test_core.py:38:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../miniconda/envs/quansight/lib/python3.6/contextlib.py:81: in __enter__
    return next(self.gen)
../../miniconda/envs/quansight/lib/python3.6/site-packages/distributed/utils_test.py:626: in cluster
    scheduler_q = mp_context.Queue()
../../miniconda/envs/quansight/lib/python3.6/multiprocessing/context.py:102: in Queue
    return Queue(maxsize, ctx=self.get_context())
../../miniconda/envs/quansight/lib/python3.6/multiprocessing/queues.py:42: in __init__
    self._rlock = ctx.Lock()
../../miniconda/envs/quansight/lib/python3.6/multiprocessing/context.py:67: in Lock
    return Lock(ctx=self.get_context())
../../miniconda/envs/quansight/lib/python3.6/multiprocessing/synchronize.py:163: in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
../../miniconda/envs/quansight/lib/python3.6/multiprocessing/synchronize.py:81: in __init__
    register(self._semlock.name)
../../miniconda/envs/quansight/lib/python3.6/multiprocessing/semaphore_tracker.py:83: in register
    self._send('REGISTER', name)
../../miniconda/envs/quansight/lib/python3.6/multiprocessing/semaphore_tracker.py:90: in _send
    self.ensure_running()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <multiprocessing.semaphore_tracker.SemaphoreTracker object at 0xa16135320>

    def ensure_running(self):
        '''Make sure that semaphore tracker process is running.

        This can be run from any process.  Usually a child process will use
        the semaphore created by its parent.'''
        with self._lock:
            if self._pid is not None:
                # semaphore tracker was launched before, is it still running?
>               pid, status = os.waitpid(self._pid, os.WNOHANG)
E               ChildProcessError: [Errno 10] No child processes

../../miniconda/envs/quansight/lib/python3.6/multiprocessing/semaphore_tracker.py:46: ChildProcessError

[Feature Request] Add support for evals in the `train` function for early stopping

Hi,

Currently the dask-xgboost package does not support early stopping.

I'm not super familiar with dask or xgboost, but my rough thinking is that it should just be a matter of passing the validation-set DMatrix object to the training function here, and that the way of finding the location of the validation set on each worker should be similar to what we do now with the training dataset.

Please let me know if I am on the right track. I would love to see if I can get these eval functions working for dask-xgboost, but I would like to avoid falling into obvious traps, as I am not too familiar with the xgboost/dask backend. Thank you!
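
For reference, a minimal sketch of how evals and early stopping work in the single-node xgboost API that such a feature would mirror (the random arrays are placeholders, not real data):

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
dtrain = xgb.DMatrix(rng.rand(100, 5), label=rng.randint(0, 2, 100))
dval = xgb.DMatrix(rng.rand(30, 5), label=rng.randint(0, 2, 30))

bst = xgb.train(
    {'objective': 'binary:logistic'},
    dtrain,
    num_boost_round=100,
    evals=[(dval, 'validation')],     # evaluation set watched every round
    early_stopping_rounds=10,         # stop if no improvement for 10 rounds
)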

dashboard updates

When using dask-xgboost I see the profile dashboard page update, but nothing in the task stream or progress panes. Is this expected?

Training data not arriving to train_part

I am getting the error message ValueError: too many values to unpack (expected 2) when running dask_xgboost.train on a Hadoop cluster.

Digging into the logs shows that for one of the containers, list_of_parts = ['finalize-df33507e3c937f53053b5955c9a84040', 'finalize-c424c6035f86bdbdb8fafde5a9004238', 'finalize-f62e06bceea3e45d4502b6975e94c2fc', 'finalize-4de332f518ef9fbbe910824136a231a2', 'finalize-fe3e86b7ae5aa3e16009a8f39c35bc37'].

I call train with a dask dataframe and by hacking the dask_xgboost.core._train I was able to track the transformations of data to list_of_parts (the values of worker_map).

After to_delayed:

[Delayed(('split-0-0744fa3c657f0b5ea93ab842f9b006a6', 0)), 
Delayed(('split-0-0744fa3c657f0b5ea93ab842f9b006a6', 1)), 
Delayed(('split-0-0744fa3c657f0b5ea93ab842f9b006a6', 2)), 
Delayed(('split-0-0744fa3c657f0b5ea93ab842f9b006a6', 3)), 
Delayed(('split-0-0744fa3c657f0b5ea93ab842f9b006a6', 4)), ...

After zipping with the labels creating parts:

[Delayed('tuple-f99e57e3-9f4c-4f69-90ae-9f66607901cf'), 
Delayed('tuple-1734a065-e9d9-4ccd-9194-000d3cef869b'), 
Delayed('tuple-87e592ee-40eb-4c3d-85cb-aae63ee23b76'), 
Delayed('tuple-4dc1fd3e-a9b6-4f12-9c28-21dfb101c100'), 
Delayed('tuple-2438f4d1-f6d4-451a-834b-783677c32f69'), ...

The finalize strings are generated by client.compute(parts).

[<Future: status: pending, key: finalize-3cf2a1ef82ada3afa320cfa353eaace1>, 
<Future: status: pending, key: finalize-df33507e3c937f53053b5955c9a84040>, 
<Future: status: pending, key: finalize-fed822c53ef1f53bfc48c985ffbdf728>, 
<Future: status: pending, key: finalize-c424c6035f86bdbdb8fafde5a9004238>, 
<Future: status: pending, key: finalize-f6793429bfe6f2520871a391e7882c1e>, 
<Future: status: pending, key: finalize-56347b470161a91b696e25de5f2229b6>, ...]

After _wait(parts):

[<Future: status: finished, type: tuple, key: finalize-3cf2a1ef82ada3afa320cfa353eaace1>, 
<Future: status: finished, type: tuple, key: finalize-df33507e3c937f53053b5955c9a84040>, 
<Future: status: finished, type: tuple, key: finalize-fed822c53ef1f53bfc48c985ffbdf728>, 
<Future: status: finished, type: tuple, key: finalize-c424c6035f86bdbdb8fafde5a9004238>, 
<Future: status: finished, type: tuple, key: finalize-f6793429bfe6f2520871a391e7882c1e>, 
<Future: status: finished, type: tuple, key: finalize-56347b470161a91b696e25de5f2229b6>, 
<Future: status: finished, type: tuple, key: finalize-455e539014f1526f318b5dfc2fd96b6d>, ...

who_has:

{'finalize-3cf2a1ef82ada3afa320cfa353eaace1': ['tcp://10.195.208.244:44408'], 
'finalize-df33507e3c937f53053b5955c9a84040': ['tcp://10.195.102.71:44112'], 
'finalize-fed822c53ef1f53bfc48c985ffbdf728': ['tcp://10.195.208.252:35744'], 
'finalize-c424c6035f86bdbdb8fafde5a9004238': ['tcp://10.195.102.71:44112'], 
'finalize-f6793429bfe6f2520871a391e7882c1e': ['tcp://10.195.208.249:36605'], 
'finalize-56347b470161a91b696e25de5f2229b6': ['tcp://10.195.208.249:36605'], 
'finalize-455e539014f1526f318b5dfc2fd96b6d': ['tcp://10.195.208.244:44408'],

worker_map:

defaultdict(<class 'list'>, {
'tcp://10.195.208.244:44408': 
['finalize-3cf2a1ef82ada3afa320cfa353eaace1', 'finalize-455e539014f1526f318b5dfc2fd96b6d', 'finalize-4921531869b649367471b87f49589a90', 'finalize-4d457790610b74dd1fd2378a4b6e8f45', 'finalize-0b2ac9d0c0ba6e950836eed3cf0d1793'], 
'tcp://10.195.102.71:44112': 
['finalize-df33507e3c937f53053b5955c9a84040', 'finalize-c424c6035f86bdbdb8fafde5a9004238', 'finalize-f62e06bceea3e45d4502b6975e94c2fc', 'finalize-4de332f518ef9fbbe910824136a231a2', 'finalize-fe3e86b7ae5aa3e16009a8f39c35bc37'], ...
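
To make the mapping above concrete, here is a self-contained sketch of how a worker_map of this shape is assembled from a who_has result. It uses two of the keys and addresses listed above and mirrors the traceback's use of first(workers); it is not the library's exact code:

from collections import defaultdict
from toolz import first

# Two entries copied from the who_has output above
who_has = {
    'finalize-3cf2a1ef82ada3afa320cfa353eaace1': ['tcp://10.195.208.244:44408'],
    'finalize-df33507e3c937f53053b5955c9a84040': ['tcp://10.195.102.71:44112'],
}

worker_map = defaultdict(list)
for key, workers in who_has.items():
    # first(workers) raises StopIteration when a key has no worker,
    # which is the failure mode reported in some of the other issues here
    worker_map[first(workers)].append(key)

print(dict(worker_map))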

Run the central Rabit process on a worker

Currently we run Rabit's central process on the scheduler and the Rabit worker processes alongside the Dask workers. This has caused issues in two cases:

  1. Sometimes the scheduler has a more stripped down environment and doesn't have all of the libraries that the workers do.
  2. Sometimes the scheduler's networking position is somewhat different from the workers' (#23, #40)

We might consider instead running the tracker on a worker. This would also keep the scheduler more isolated. This is awkward if there is data on the worker where we want to run the tracker, but if we're comfortable moving data (as is the case in @RAMitchell 's rewrite) then maybe this doesn't matter.

@RAMitchell thought I'd bring this up now rather than later in case it affects things

xgboost does not train from existing model in distributed environment

When continuing training xgboost from an existing model in a distributed environment with more than 3 workers, xgboost does not train: nothing happens on the workers and it never finishes. In a local cluster, or a distributed cluster with fewer than 3 workers, training runs and finishes.

dxgb.train(client, params, X_train, y_train,
xgb_model=existing_model,...)

Using RandomizedCV with xgboost

Hi all,

I am trying to use randomized cross validation on dask-xgboost.
Here is the snippet of code I am trying to get to work:

from dask.distributed import Client
from dask_ml.model_selection import RandomizedSearchCV
from dask_ml.datasets import make_classification
from scipy.stats import uniform

import dask_xgboost as dxgboost

X, y = make_classification(chunks=50)

model = dxgboost.XGBClassifier()
client = Client()
metric = "f1"
data = X
target = y
bounds = {"base_score": uniform(0.3, 0.7), "max_depth": uniform(3, 40),
          "learning_rate": uniform(0.05, 0.4), "n_estimators": uniform(50, 200),
          "gamma": uniform(0, 10)}
results = RandomizedSearchCV(model, bounds, cv=3, scoring=metric,
                             scheduler=client.scheduler).fit(data, target).best_estimator_

While running this code, I get the error TypeError: 'Future' object does not support indexing.
I think that I am misusing the API. If somebody can confirm that and point me in the right direction, I will be grateful.
Thanks

unable to finish training

setup:
dask 0.14 (pip installed)
xgboost 0.62 (conda installed)
dask-xgboost 0.10.x (modified to import from distributed.comm.addressing so that import dask_xgboost works without error, #1)

I was following the example here, https://gist.github.com/mrocklin/3696fe2398dc7152c66bf593a674e4d9

[screenshot]

It produces the job, and it looks like it runs for a few minutes.

[screenshot]

However, there are some errors, and it neither finishes nor crashes my Python code.

[screenshot]

I wish I could provide more logs.

only one worker used out of 2 available

I have 3 physical machines. One is the scheduler, and the other 2 are workers (one worker process on each machine). When I submit a dask_xgboost job, I can see that only one worker is being used, not both. Any clue as to why this may be happening? Also, when I submit 2 jobs simultaneously, still only one worker is used.

dask - 0.16.1
dask-glm - 0.1.0
dask-ml - 0.6.0
dask-searchcv - 0.2.0
dask-xgboost - 0.1.5

[bug] Dask worker dies during dask-xgboost classifier training: test_core.py::test_classifier

A Dask worker dies during dask-xgboost classifier training; this is observed while running test_core.py::test_classifier.

Configuration used -

Dask Version: 2.9.2
Distributed Version: 2.9.3
XGBoost Version: 0.90
Dask-XGBoost Version: 0.1.9
OS-release : 4.14.0-115.16.1.el7a.ppc64le

Description / Steps:

  1. The test creates a cluster with two workers:
> /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/test_core.py(38)test_classifier()
-> with cluster() as (s, [a, b]):
(Pdb) n
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:45767
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:40743
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:40743
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                  612.37 GB
distributed.worker - INFO -       Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-c6ea91c7-746e-4c7a-9c13-f5afcd244966/worker-ebbqtfdu
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:33373
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:33373
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                  612.37 GB
distributed.worker - INFO -       Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-050815d2-54f6-4edc-9a03-dd075213449d/worker-i1yr8xvc
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:40743', name: tcp://127.0.0.1:40743, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:40743
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:33373', name: tcp://127.0.0.1:33373, memory: 0, processing: 0>
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:33373
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection

  2. After a couple of steps, fit is called for dask-xgboost:
-> a.fit(X2, y2)
(Pdb) distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
ndistributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373

distributed.worker - DEBUG - Execute key: array-original-8d35e675b41aad38dc334c7f79ea1982 worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: array-original-8d35e675b41aad38dc334c7f79ea1982, {'op': 'task-finished', 'status': 'OK', 'nbytes': 80, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2651937, 'stop': 1580372953.265216, 'thread': 140735736705456, 'key': 'array-original-8d35e675b41aad38dc334c7f79ea1982'}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2696354, 'stop': 1580372953.2696435, 'thread': 140735736705456, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 0)"}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2705007, 'stop': 1580372953.2705073, 'thread': 140735736705456, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2753158, 'stop': 1580372953.275466, 'thread': 140735736705456, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2762377, 'stop': 1580372953.2763371, 'thread': 140735736705456, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2805014, 'stop': 1580372953.2805073, 'thread': 140735736705456, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2813187, 'stop': 1580372953.2813244, 'thread': 140735736705456, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys

Dask worker dies -

distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - INFO - Run out-of-band function 'start_tracker'
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:40743', name: tcp://127.0.0.1:40743, memory: 1, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:40743     ===========================>>> One worker dies 
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
distributed.worker - DEBUG - Execute key: train_part-e17e49e3769aaa4870dc8cc01a1e015e worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - future state: train_part-e17e49e3769aaa4870dc8cc01a1e015e - RUNNING   ===  One worker is running infinitely 
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - future state: train_part-e17e49e3769aaa4870dc8cc01a1e015e - RUNNING
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373

It is not clear why the dask worker dies at that point.

Thanks!
Pradipta

Migrate to XGBoost mainline repository

@RAMitchell (an xgboost maintainer) was mentioning that it might be possible to migrate the whole of the dask-xgboost codebase into xgboost itself.

Any thoughts or concerns about this?

I think that the proposed API change would be to add a dask_client= keyword (or something similar) to the official train and predict methods.

XGBoost works with a local cluster, but fails with "no-workers" when using distributed.

The dask cluster is set up as:

dask-scheduler (named $MACHINE-dask-scheduler)
dask-worker" --container-arg="$MACHINE-dask-scheduler:8786
from dask.distributed import Client
from dask_ml.linear_model import LogisticRegression
from dask_ml.xgboost import XGBClassifier
from dask_ml.datasets import make_classification
import dask.dataframe as dd
import dask_xgboost
from dask_ml.model_selection import train_test_split

client = Client('127.0.0.1:8786')


X, y = make_classification(n_samples=20000, n_features=20,
                           chunks=10000, n_informative=4,
                           random_state=0)

And this works fine with

from dask_ml.linear_model import LogisticRegression
lr = LogisticRegression()
model = lr.fit(X, y)

But this doesn't work with

import dask_xgboost
params = {'objective': 'binary:logistic',
          'max_depth': 4, 'eta': 0.01, 'subsample': 0.5, 
          'min_child_weight': 0.5}

bst = dask_xgboost.train(client, params, X, y, num_boost_round=10)

Or

from dask_ml.xgboost import XGBClassifier
est = XGBClassifier()
est.fit(X, y)

Any idea of where to find more logs or what is going on?

Task: train_part...

Status | no-worker
Priority | (0, 54, 0)
Worker Restrictions | set(['tcp://10.0.0.60:42869'])
Suspicious | 1

Worker Logs

distributed.worker - INFO - Start worker at: tcp://10.0.0.59:39471
distributed.worker - INFO - Listening to: tcp://10.0.0.59:39471
distributed.worker - INFO - bokeh at: 10.0.0.59:35569
distributed.worker - INFO - nanny at: 10.0.0.59:33889
distributed.worker - INFO - Waiting to connect to: tcp://damien-dask-scheduler:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 16.82 GB
distributed.worker - INFO - Local Directory: /current/worker-TSgjNV
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://damien-dask-scheduler:8786
distributed.worker - INFO - -------------------------------------------------

Scheduler

distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://10.0.0.54:8786
distributed.scheduler - INFO - bokeh at: :8787
distributed.scheduler - INFO - Local Directory: /tmp/scheduler-zH1301
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://10.0.0.57:41765
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.57:41765
distributed.scheduler - INFO - Register tcp://10.0.0.58:41237
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.58:41237
distributed.scheduler - INFO - Receive client connection: Client-cea67f61-9127-11e9-8756-8c8590bca016
distributed.scheduler - INFO - Remove client Client-cea67f61-9127-11e9-8756-8c8590bca016
distributed.scheduler - INFO - Remove client Client-cea67f61-9127-11e9-8756-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-cea67f61-9127-11e9-8756-8c8590bca016
distributed.scheduler - INFO - Receive client connection: Client-d7cc7a23-9127-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Remove worker tcp://10.0.0.57:41765
distributed.scheduler - INFO - Remove worker tcp://10.0.0.58:41237
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register tcp://10.0.0.58:44763
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.58:44763
distributed.scheduler - INFO - Register tcp://10.0.0.57:45983
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.57:45983
distributed.scheduler - INFO - Send lost future signal to clients
distributed.scheduler - INFO - Remove worker tcp://10.0.0.57:45983
distributed.scheduler - INFO - Remove worker tcp://10.0.0.58:44763
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Register tcp://10.0.0.57:46453
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.57:46453
distributed.scheduler - INFO - Register tcp://10.0.0.58:44711
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.58:44711
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Receive client connection: Client-cdb9a29e-9129-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Remove worker tcp://10.0.0.58:44711
distributed.scheduler - INFO - Remove worker tcp://10.0.0.57:46453
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register tcp://10.0.0.58:41009
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.58:41009
distributed.scheduler - INFO - Register tcp://10.0.0.57:43319
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.57:43319
distributed.scheduler - INFO - Receive client connection: Client-683c9678-912a-11e9-8023-42010a000036
distributed.scheduler - INFO - Remove worker tcp://10.0.0.58:41009
distributed.scheduler - INFO - Remove worker tcp://10.0.0.57:43319
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register tcp://10.0.0.58:41003
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.58:41003
distributed.scheduler - INFO - Register tcp://10.0.0.57:45287
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.57:45287
distributed.scheduler - INFO - Remove worker tcp://10.0.0.57:45287
distributed.scheduler - INFO - Remove worker tcp://10.0.0.58:41003
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Remove client Client-cdb9a29e-9129-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Remove client Client-d7cc7a23-9127-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-cdb9a29e-9129-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-d7cc7a23-9127-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Register tcp://10.0.0.59:44611
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:44611
distributed.scheduler - INFO - Register tcp://10.0.0.60:45183
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:45183
distributed.scheduler - INFO - Receive client connection: Client-10ed5adc-9130-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Send lost future signal to clients
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:44611
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:45183
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Register tcp://10.0.0.59:37507
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:37507
distributed.scheduler - INFO - Register tcp://10.0.0.60:46703
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:46703
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Remove client Client-10ed5adc-9130-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-10ed5adc-9130-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Receive client connection: Client-b8b1ff42-913b-11e9-b1c6-8c8590bca016
distributed.scheduler - INFO - Remove client Client-b8b1ff42-913b-11e9-b1c6-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-b8b1ff42-913b-11e9-b1c6-8c8590bca016
distributed.scheduler - INFO - Receive client connection: Client-22a417cc-913d-11e9-b635-8c8590bca016
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:37507
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:46703
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register tcp://10.0.0.59:38681
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:38681
distributed.scheduler - INFO - Register tcp://10.0.0.60:46663
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:46663
distributed.scheduler - INFO - Send lost future signal to clients
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:38681
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:46663
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Register tcp://10.0.0.59:35635
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:35635
distributed.scheduler - INFO - Register tcp://10.0.0.60:39269
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:39269
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Receive client connection: Client-84114e0a-913e-11e9-b635-8c8590bca016
distributed.scheduler - INFO - Send lost future signal to clients
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:35635
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:39269
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Register tcp://10.0.0.59:45541
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:45541
distributed.scheduler - INFO - Register tcp://10.0.0.60:43083
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:43083
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:45541
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:43083
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register tcp://10.0.0.59:37297
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:37297
distributed.scheduler - INFO - Register tcp://10.0.0.60:43473
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:43473
distributed.scheduler - INFO - Remove client Client-84114e0a-913e-11e9-b635-8c8590bca016
distributed.scheduler - INFO - Remove client Client-22a417cc-913d-11e9-b635-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-84114e0a-913e-11e9-b635-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-22a417cc-913d-11e9-b635-8c8590bca016
distributed.scheduler - INFO - Receive client connection: Client-a070d614-913f-11e9-b68e-8c8590bca016
distributed.scheduler - INFO - Send lost future signal to clients
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:37297
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:43473
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Register tcp://10.0.0.59:33001
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:33001
distributed.scheduler - INFO - Register tcp://10.0.0.60:42869
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:42869
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:33001
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:42869
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register tcp://10.0.0.59:39471
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:39471
distributed.scheduler - INFO - Register tcp://10.0.0.60:36841
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:36841
sq-blocks>=0.6.1
bokeh
dask-ml[complete] 
distributed >= 1.15.2
gcsfs
jupyterlab
pyarrow==0.12.1 # 0.13.0 has bug for dask
dask_xgboost
xgboost==0.81.0

[BUG] testcase failure - TypeError: can not initialize DMatrix from COO

Describe the bug
pytest for test_sparse() fails with -

E   TypeError: can not initialize DMatrix from COO

Steps/Code to reproduce bug
This can be easily reproduced with xgboost 0.82 and 0.90.

============================= test session starts ==============================
platform linux -- Python 3.6.9, pytest-5.0.1, py-1.8.0, pluggy-0.12.0
rootdir: /home/sangeek/examples/dask-xgb-examples/tests
plugins: xdist-1.28.0, forked-1.0.2, cov-2.7.1
collected 1 item

test_sparse.py F [100%]

=================================== FAILURES ===================================
_________________________________ test_sparse __________________________________

def test_func():
    result = None
    workers = []
    with clean(timeout=active_rpc_timeout, **clean_kwargs) as loop:

        async def coro():
            with dask.config.set(config):
                s = False
                for i in range(5):
                    try:
                        s, ws = await start_cluster(
                            nthreads,
                            scheduler,
                            loop,
                            security=security,
                            Worker=Worker,
                            scheduler_kwargs=scheduler_kwargs,
                            worker_kwargs=worker_kwargs,
                        )
                    except Exception as e:
                        logger.error(
                            "Failed to start gen_cluster, retrying",
                            exc_info=True,
                        )
                    else:
                        workers[:] = ws
                        args = [s] + workers
                        break
                if s is False:
                    raise Exception("Could not start cluster")
                if client:
                    c = await Client(
                        s.address,
                        loop=loop,
                        security=security,
                        asynchronous=True,
                        **client_kwargs
                    )
                    args = [c] + args
                try:
                    future = func(*args)
                    if timeout:
                        future = gen.with_timeout(
                            timedelta(seconds=timeout), future
                        )
                    result = await future
                    if s.validate:
                        s.validate_state()
                finally:
                    if client and c.status not in ("closing", "closed"):
                        await c._close(fast=s.status == "closed")
                    await end_cluster(s, workers)
                    await gen.with_timeout(
                        timedelta(seconds=1), cleanup_global_workers()
                    )

                try:
                    c = await default_client()
                except ValueError:
                    pass
                else:
                    await c._close(fast=True)

                for i in range(5):
                    if all(c.closed() for c in Comm._instances):
                        break
                    else:
                        await gen.sleep(0.05)
                else:
                    L = [c for c in Comm._instances if not c.closed()]
                    Comm._instances.clear()
                    # raise ValueError("Unclosed Comms", L)
                    print("Unclosed Comms", L)

                return result

        result = loop.run_sync(
          coro, timeout=timeout * 2 if timeout else timeout
        )

/opt/anaconda3/envs/test-xgb-90-cpu/lib/python3.6/site-packages/distributed/utils_test.py:947:


/opt/anaconda3/envs/test-xgb-90-cpu/lib/python3.6/site-packages/tornado/ioloop.py:532: in run_sync
return future_cell[0].result()
/opt/anaconda3/envs/test-xgb-90-cpu/lib/python3.6/site-packages/distributed/utils_test.py:915: in coro
result = await future
/opt/anaconda3/envs/test-xgb-90-cpu/lib/python3.6/site-packages/tornado/gen.py:742: in run
yielded = self.gen.throw(*exc_info) # type: ignore
test_sparse.py:42: in test_sparse
dbst = yield dxgb.train(c, param, dX, dy)
/opt/anaconda3/envs/test-xgb-90-cpu/lib/python3.6/site-packages/tornado/gen.py:735: in run
value = future.result()
/opt/anaconda3/envs/test-xgb-90-cpu/lib/python3.6/site-packages/tornado/gen.py:742: in run
yielded = self.gen.throw(*exc_info) # type: ignore
/opt/anaconda3/envs/test-xgb-90-cpu/lib/python3.6/site-packages/dask_xgboost/core.py:153: in _train
results = yield client._gather(futures)
/opt/anaconda3/envs/test-xgb-90-cpu/lib/python3.6/site-packages/tornado/gen.py:735: in run
value = future.result()
/opt/anaconda3/envs/test-xgb-90-cpu/lib/python3.6/site-packages/distributed/client.py:1668: in _gather
six.reraise(type(exception), exception, traceback)
/opt/anaconda3/envs/test-xgb-90-cpu/lib/python3.6/site-packages/six.py:692: in reraise
raise value.with_traceback(tb)
/opt/anaconda3/envs/test-xgb-90-cpu/lib/python3.6/site-packages/dask_xgboost/core.py:83: in train_part
dtrain = xgb.DMatrix(data, labels, **dmatrix_kwargs)


                            ' {}'.format(type(data).__name__))
E   TypeError: can not initialize DMatrix from COO

/opt/anaconda3/envs/test-xgb-90-cpu/lib/python3.6/site-packages/xgboost/core.py:413: TypeError
----------------------------- Captured stderr call -----------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:38521
distributed.worker - INFO - Start worker at: tcp://127.0.0.1:39659
distributed.worker - INFO - Listening to: tcp://127.0.0.1:39659
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:38521
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 1.16 TB
distributed.worker - INFO - Local Directory: /home/sangeek/examples/dask-xgb-examples/tests/dask-worker-space/worker-wievyhvt
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Start worker at: tcp://127.0.0.1:40217
distributed.worker - INFO - Listening to: tcp://127.0.0.1:40217
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:38521
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 2
distributed.worker - INFO - Memory: 1.16 TB
distributed.worker - INFO - Local Directory: /home/sangeek/examples/dask-xgb-examples/tests/dask-worker-space/worker-v71yf5nk
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register tcp://127.0.0.1:39659
distributed.scheduler - INFO - Register tcp://127.0.0.1:40217
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:39659
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:40217
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Registered to: tcp://127.0.0.1:38521
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Registered to: tcp://127.0.0.1:38521
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-8ea853d0-ce35-11e9-ae1e-590974dc444e
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Run out-of-band function 'start_tracker'
distributed.worker - WARNING - Compute Failed
Function: train_part
args: ({'DMLC_NUM_WORKER': 2, 'DMLC_TRACKER_URI': '127.0.0.1', 'DMLC_TRACKER_PORT': 9091}, {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic', 'nthread': 1}, [(<COO: shape=(2, 2), dtype=int64, nnz=3, fill_value=0>, array([1, 0])), (<COO: shape=(2, 2), dtype=int64, nnz=3, fill_value=0>, array([1, 1]))])
kwargs: {'dmatrix_kwargs': {'feature_names': None}}
Exception: TypeError('can not initialize DMatrix from COO',)

distributed.worker - WARNING - Compute Failed
Function: train_part
args: ({'DMLC_NUM_WORKER': 2, 'DMLC_TRACKER_URI': '127.0.0.1', 'DMLC_TRACKER_PORT': 9091}, {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic', 'nthread': 2}, [(<COO: shape=(2, 2), dtype=int64, nnz=3, fill_value=0>, array([1, 0])), (<COO: shape=(2, 2), dtype=int64, nnz=3, fill_value=0>, array([1, 0])), (<COO: shape=(2, 2), dtype=int64, nnz=3, fill_value=0>, array([1, 1]))])
kwargs: {'dmatrix_kwargs': {'feature_names': None}}
Exception: TypeError('can not initialize DMatrix from COO',)

distributed.scheduler - INFO - Remove client Client-8ea853d0-ce35-11e9-ae1e-590974dc444e
distributed.scheduler - INFO - Remove client Client-8ea853d0-ce35-11e9-ae1e-590974dc444e
distributed.scheduler - INFO - Close client connection: Client-8ea853d0-ce35-11e9-ae1e-590974dc444e
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:40217
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:39659
distributed.scheduler - INFO - Remove worker tcp://127.0.0.1:40217
distributed.core - INFO - Removing comms to tcp://127.0.0.1:40217
distributed.scheduler - INFO - Remove worker tcp://127.0.0.1:39659
distributed.core - INFO - Removing comms to tcp://127.0.0.1:39659
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
=========================== 1 failed in 2.13 seconds ===========================

Additional context

It looks like XGBoost does not support DMatrix to be created from sparse.COO.
Looking at the documentation it looks like xgboost.DMatrix(data, ...) only supports -
data (string/numpy.array/scipy.sparse/pd.DataFrame/dt.Frame) – Data source of DMatrix. When data is string type, it represents the path libsvm format txt file, or binary file that xgboost can read from.
ref - https://xgboost.readthedocs.io/en/release_0.90/python/python_api.html

I see that making this change to use scipy.sparse.csr_matrix instead of sparse.COO helps me get past this issue -

228c228
<     dX = da.from_array(X, chunks=(2, 2)).map_blocks(sparse.COO)
---
>     dX = da.from_array(X, chunks=(2, 2)).map_blocks(scipy.sparse.csr_matrix)
237c237
<     _test_container(dbst, predictions_result, sparse.COO)
---
>     _test_container(dbst, predictions_result, scipy.sparse.csr_matrix)
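
For completeness, a minimal runnable sketch of that workaround outside the test suite (the toy data, client setup and parameters here are illustrative, not taken from the test file):

import numpy as np
import scipy.sparse
import dask.array as da
import dask_xgboost as dxgb
from dask.distributed import Client

client = Client()  # or connect to an existing scheduler

# Toy data; any 2-D float array works for the illustration.
X = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype="float64")
y = np.array([1, 0, 1, 0])

# Blocks become scipy CSR matrices, which xgboost.DMatrix accepts,
# unlike sparse.COO blocks.
dX = da.from_array(X, chunks=(2, 2)).map_blocks(scipy.sparse.csr_matrix)
dy = da.from_array(y, chunks=(2,))

params = {"objective": "binary:logistic", "max_depth": 2, "eta": 1}
bst = dxgb.train(client, params, dX, dy)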

Ensure that training and testing data align

Currently, if you provide training and testing data that have the same number of partitions but a different number of rows per partition, you get a non-informative error.

Given that we need to have all the data in memory anyway, we could just fix this for the user and balance partitions for them.

cc @jrbourbeau
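
A minimal sketch of the kind of check and rebalancing this issue asks dask-xgboost to do internally, written here at the user level (data and labels are placeholder dask collections that share an index; the helper is illustrative, not part of the library):

def partition_sizes(ddf):
    # Row count of each partition, computed eagerly.
    return ddf.map_partitions(len).compute().tolist()

# Same number of partitions but different rows per partition currently
# produces an uninformative error inside xgboost.
if partition_sizes(data) != partition_sizes(labels):
    # Aligning one collection to the other's divisions lines the rows up,
    # provided both share the same index and have known divisions.
    labels = labels.repartition(divisions=data.divisions)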

Check column names before passing on

e.g. if they're ints, xgboost will refuse them.

import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask_xgboost as xgb
from distributed import Client

df = pd.DataFrame({0: np.random.randint(0, 2, size=100),
                   1: np.random.uniform(0, 1, size=100),
                   2: np.random.uniform(0, 1, size=100)})
a = dd.from_pandas(df, 2)
labels = a.loc[:, 0]
data = a.loc[:, 1:]

c = Client()

xgb.train(c, {}, data, labels)
ValueError                                Traceback (most recent call last)
<ipython-input-6-ea984a812dfe> in <module>()
     14 c = Client()
     15
---> 16 xgb.train(c, {}, data, labels)

~/sandbox/dask-xgboost/dask_xgboost/core.py in train(client, params, data, labels, dmatrix_kwargs, **kwargs)
    167     """
    168     return sync(client.loop, _train, client, params, data,
--> 169                 labels, dmatrix_kwargs, **kwargs)
    170
    171

~/Envs/dask-dev/lib/python3.6/site-packages/distributed/distributed/utils.py in sync(loop, func, *args, **kwargs)
    252             e.wait(1000000)
    253     if error[0]:
--> 254         six.reraise(*error[0])
    255     else:
    256         return result[0]

~/Envs/dask-dev/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

~/Envs/dask-dev/lib/python3.6/site-packages/distributed/distributed/utils.py in f()
    236             yield gen.moment
    237             thread_state.asynchronous = True
--> 238             result[0] = yield make_coro()
    239         except Exception as exc:
    240             logger.exception(exc)

~/.virtualenvs/dask-dev/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

~/.virtualenvs/dask-dev/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

~/.virtualenvs/dask-dev/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

~/.virtualenvs/dask-dev/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

~/sandbox/dask-xgboost/dask_xgboost/core.py in _train(client, params, data, labels, dmatrix_kwargs, **kwargs)
    132
    133     # Get the results, only one will be non-None
--> 134     results = yield client._gather(futures)
    135     result = [v for v in results if v][0]
    136     raise gen.Return(result)

~/.virtualenvs/dask-dev/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

~/.virtualenvs/dask-dev/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

~/.virtualenvs/dask-dev/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

~/.virtualenvs/dask-dev/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

~/Envs/dask-dev/lib/python3.6/site-packages/distributed/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1305                             six.reraise(type(exception),
   1306                                         exception,
-> 1307                                         traceback)
   1308                     if errors == 'skip':
   1309                         bad_keys.add(key)

~/Envs/dask-dev/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    690                 value = tp()
    691             if value.__traceback__ is not tb:
--> 692                 raise value.with_traceback(tb)
    693             raise value
    694         finally:

~/sandbox/dask-xgboost/dask_xgboost/core.py in train_part()
     66     labels = concat(labels)
     67     dmatrix_kwargs["feature_names"] = getattr(data, 'columns', None)
---> 68     dtrain = xgb.DMatrix(data, labels, **dmatrix_kwargs)
     69
     70     args = [('%s=%s' % item).encode() for item in env.items()]

~/sandbox/xgboost/python-package/xgboost/core.py in __init__()
    294                 self.set_weight(weight)
    295
--> 296         self.feature_names = feature_names
    297         self.feature_types = feature_types
    298

~/sandbox/xgboost/python-package/xgboost/core.py in feature_names()
    663                        not any(x in f for x in set(('[', ']', '<')))
    664                        for f in feature_names):
--> 665                 raise ValueError('feature_names may not contain [, ] or <')
    666         else:
    667             # reset feature_types also

ValueError: feature_names may not contain [, ] or <
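
A minimal sketch of a user-side workaround until such a check exists (same dataframe as in the example above; renaming the integer columns to plain strings avoids the feature_names error):

# Give the feature columns string names that xgboost accepts.
data = data.rename(columns={1: "f1", 2: "f2"})

xgb.train(c, {}, data, labels)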

Incorrect overly simple model is returned

No matter how much I play with the input data or xgboost parameters, I can only get dask_xgboost.train to return an overly simple model that has 10 trees with a single leaf node on each tree. While training locally, I have output that looks like:

[18:55:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 0 pruned nodes, max_depth=0
[18:55:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 0 pruned nodes, max_depth=0
[18:55:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 0 pruned nodes, max_depth=0
[18:55:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 0 pruned nodes, max_depth=0
[18:55:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 0 pruned nodes, max_depth=0
[18:55:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 0 pruned nodes, max_depth=0
[18:55:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 0 pruned nodes, max_depth=0
[18:55:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 0 pruned nodes, max_depth=0

I'm guessing that this means the training parameters aren't making it to the xgboost train method, but printing them out immediately before the xgb.train(param, dtrain, **kwargs) call shows that each instance of this function has the correct parameters, specifying a nonzero max_depth and far more than 10 n_estimators.

You can test to see if your training outputs look like this with:

In [96]: bst = xgboost.Booster(model_file="saved_model.xgb")

In [97]: bst.get_dump()
Out[97]:
['0:leaf=-0.749752\n',
 '0:leaf=-0.551972\n',
 '0:leaf=-0.476789\n',
 '0:leaf=-0.43804\n',
 '0:leaf=-0.415481\n',
 '0:leaf=-0.401443\n',
 '0:leaf=-0.392314\n',
 '0:leaf=-0.386148\n',
 '0:leaf=-0.381795\n',
 '0:leaf=-0.378518\n']

My leaf values vary, but the structure of the model doesn't change.

Verify the benchmark of XgboostClassifier with initial xgboost

Hello,
I think I may have found a bug in the XGBClassifier in dask-xgboost.

from sklearn.datasets import load_iris
import dask.dataframe as dd
import pandas as pd
dataset = load_iris()
train = dataset.data
target = dataset.target

pdf = pd.DataFrame(data = train,columns=["1","2","3","4"])
pdf_y = pd.Series(target)

# convert the multi-class problem to a binary one to show the bug more easily
pdf_y.replace(2, 1, inplace=True)

from xgboost import XGBClassifier
est = XGBClassifier(n_estimators=30,max_depth=7,verbosity=0,learning_rate= 0.1)

est.fit(pdf, pdf_y)
est.score(pdf, pdf_y)

With the original (non-dask) xgboost, we easily get 100% accuracy.

from dask_ml.xgboost import XGBClassifier
from distributed import Client


client = Client()
est = XGBClassifier(n_estimators=30,max_depth=7,verbosity=1,learning_rate= 0.1)
df = dd.from_pandas(pdf,chunksize=640000)
df_y = dd.from_pandas(pdf_y,chunksize=640000).astype(int)
est.fit(df, df_y )
est.score(df, df_y )

With the same parameters and the same data, we only get 66% accuracy, and the problem is that the estimator's predict() returns 1 all the time, so the 66% figure is meaningless.

This is a simple example to show the bug. I have also tested my own project with the Titanic dataset and it has the same problem.

est.predict(df).compute()
returns 1 for every row of df.

Time for a new release?

I see the latest released version of dask-xgboost is 0.1.5, from 18 Nov 2017. Multiple fixes have gone in since then!
@mrocklin @TomAugspurger Could you please suggest if it is a good time to have a new release?

Migrate CI to GitHub Actions?

Due to changes in the Travis CI billing, the Dask org is migrating Travis CI to GitHub Actions.

This repo appears to use CircleCI. As we are putting in the effort to migrate many projects to GitHub Actions does it make sense to standardise here?

See dask/community#107 for more details.

Remove duplicate memory of data in _train()

I ran the project and noticed that in _train() the process holds memory roughly equal to the size of the data. However, the data inside train_part() (i.e. the return value of concat()) also takes up the same amount of memory. Since train_part() only uses the list_of_parts, would it be possible to delete the original data once the list_of_parts has been built in _train()?

I tried client.cancel() and del, but both failed.

How can I avoid dask-xgboost no-work status

I used 28 million samples to train dask-xgboost, but the status of the training task always shows no-worker.

from distributed import Client, progress
from dask.distributed import Client as Client2
import dask.dataframe as dd
import pandas as pd
import dask_xgboost as dxgb

filenames = [ '/data//000000_0','/data//000001_0']
global feature
global y_name
feature=["A","B"]

y_name = ["C"]
client2 = Client2("xx.xx.xx.xx:xx")

def data2dataframe(fn):
	df = pd.read_csv(fn, names =y_name+feature ,na_values='NULL',header=None,sep=',')
	df= df.fillna("0")
	for col in feature+y_name:
		df[col] = df[col].astype("float64")
	return (df[feature], df[y_name])

futures2 = client2.map(data2dataframe, filenames)
results= client2.gather(iter(futures2))

i=0
for re in results:
	if i==0:
		X_trains = re[0]
		y_trains = re[1]
	else:
		X_trains=pd.concat([X_trains,re[0]])
		y_trains=pd.concat([y_trains,re[1]])
	i=i+1

X_trains=dd.from_pandas(X_trains,npartitions=54)
y_trains=dd.from_pandas(y_trains,npartitions=54)
dd_train = X_trains
dd_train_label = y_trains
params = {'objective': 'binary:logistic',
		  'max_depth': 1, 'eta': 0.01, 'subsample': 0.5,
		  'min_child_weight': 1}

bst = dxgb.train(client2, params, dd_train, dd_train_label,num_boost_round=140)
predictions = dxgb.predict(client2, bst, dd_train)
print(predictions.persist())

Dask worker dies during dask-xgboost classifier - no sparse package, fails only on GPU

Dask worker dies after calling dask_xgboost.train method using GPU's (one or many).
Same code runs fine on CPU.

Configuration used - RAPIDS docker image
cuDF Version: 0.12.0b+2032.gab97331
Dask Version: 2.10.1
Dask cuDF Version: 0.12.0b+2032.gab97331
Dask XGBoost Version: 0.1.5
numpy Version: 1.18.1
pandas Version: 0.25.3
Scikit-Learn Version: 0.22.1

Description / Steps

# Local CUDA Cluster
from dask.distributed import Client
from dask.distributed import performance_report
from dask_cuda import LocalCUDACluster

# create a local CUDA cluster
cluster = LocalCUDACluster(silence_logs=False, n_workers=1)
client = Client(cluster)
client.run(cudf.set_allocator, "managed") 
client

distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:43901
distributed.scheduler - INFO - dashboard at: 127.0.0.1:8787
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:36881'
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:41279', name: 0, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:41279
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-b52dc8d4-6d07-11ea-b55f-0242ac110002
distributed.core - INFO - Starting established connection

Reading data from CSV: Pandas DF -> Dask DF -> Dask cudf
Here data processing on single GPU / multi GPU works fine. Then:

# instantiate params
params = {}

# general params
general_params = {'silent': 0}
params.update(general_params)

# booster params
n_gpus = 1  
booster_params = {}
booster_params['max_depth'] = 4
booster_params['grow_policy'] = 'lossguide'
booster_params['max_leaves'] = 2**8
booster_params['tree_method'] = 'gpu_hist'
booster_params['num_class'] = n_categories   
booster_params['n_gpus'] = 1
params.update(booster_params)

# learning task params
learning_task_params = {}
learning_task_params['eval_metric'] = 'mlogloss'
learning_task_params['objective'] = 'multi:softmax'
params.update(learning_task_params)

# model training settings
num_round = 100

# This was GPU training
bst = dask_xgboost.train(client, params, X_train_dask_cudf, y_train_dask_cudf, num_boost_round=num_round)

distributed.worker - INFO - Run out-of-band function 'start_tracker'
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:41279', name: 0, memory: 12, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:41279
distributed.scheduler - INFO - Lost all workers
distributed.nanny - INFO - Worker process 13746 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:33803', name: 0, memory: 0, processing: 8>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:33803
distributed.core - INFO - Starting established connection

The train_part indicates "no-worker" and just hangs there.

Remove duplicate memory of data in _train() in dask_xgboost

I ran the project and noticed that in _train() the process holds memory roughly equal to the size of the data. However, the data inside train_part() (i.e. the return value of concat()) also takes up the same amount of memory. Since train_part() only uses the list_of_parts, would it be possible to delete the original data once the list_of_parts has been built in _train()?

I tried client.cancel() and del, but both failed.

Impossible to reproduce model results

I've just opened this issue in the dask repo, but maybe here is better...

I'm using dask to implement a data pipeline with dask dataframes and dask-ml on a YARN cluster.

When I build an XGBoost model, the results are always different, even if I manually fix a seed with da.random.seed().

import dask_xgboost as dxgb


params = {'objective': 'binary:logistic', 'n_estimators': 420,
           'max_depth': 5, 'eta': .01,
          'subsample': .8, 'colsample_bytree': .8,
          'learning_rate': .05, 'scale_pos_weight': 1}

bst = dxgb.train(client, params, fitted.transform(X), y)

Is it possible to reproduce the results of a dask model, the way I can locally using sklearn instead of dask-ml?

Get dask-gateway scheduler address

When connecting to a dask-gateway the client.scheduler_address is a proxy address

>>>client.scheduler.address
'gateway://dask.training.anaconda.com:8786/4fd53916f0214703934701aa7a7eaf85'

I was able to solve this with the following change in core::_train, using client.scheduler_info()['address']:

    # Start the XGBoost tracker on the Dask scheduler
    host, port = parse_host_port(client.scheduler_info()['address'])
    env = yield client._run_on_scheduler(
        start_tracker, host.strip("/:"), len(worker_map)
    )

However, I get the following warning.

>>> from dask_xgboost import XGBRegressor
>>> xgb = XGBRegressor()
>>> xgb.fit(X, y)

/Users/adefusco/Applications/miniconda3/envs/xgb/lib/python3.7/site-packages/distributed/client.py:3299: RuntimeWarning: coroutine 'Client._update_scheduler_info' was never awaited
  self.sync(self._update_scheduler_info)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

I have verified that this update works correctly on a 9M-row training set and scales linearly from 4 to 8 workers (2 cores/worker). Is this the correct approach to get the actual scheduler address?

Archive this repo?

Since the functionality has been moved to xgboost itself. Would now be the appropriate time to archive this repo?

Early stopping eval_set is an in-memory array; this can be problematic for large datasets

When using early stopping, the eval set must be a numpy array, which is then duplicated across workers. This causes no problem for small eval_sets, but with larger eval_sets it can easily push workers past their memory caps.

The DaskDMatrix concept from dmlc/xgboost/dask.py seems like a great way to handle this. Maybe something that mimics that functionality could be implemented in this library.

I'd be happy to take a crack at this. Rather than reworking the whole library around a DaskDMatrix, it's probably simpler to do this with the eval_set data https://github.cloud.capitalone.com/dask/dask-xgboost/blob/master/dask_xgboost/core.py#L167-L203

Example

import dask_xgboost as dxgb
from dask.distributed import LocalCluster, Client
from dask_ml.datasets import make_regression

client = Client(LocalCluster(dashboard_address=":8887", memory_limit="100Mb"))

regress_kwargs = dict(n_features=60, chunks=100, random_state=0)
X_train, y_train = make_regression(n_samples=400000, **regress_kwargs)
# this produces data to push memory limits
X_test, y_test = make_regression(n_samples=180000, **regress_kwargs)

xgb_options = {'seed': 0,
               'tree_method': 'hist',
               'obj': 'rmse',
               'verbose': True}

model = dxgb.XGBRegressor(**xgb_options)

model.fit(X_train,
    y_train,
    eval_set=[(X_test.compute(), y_test.compute())],
    early_stopping_rounds=5,
    eval_metric='rmse'
)

After running this you should see KilledWorker exceptions

[QST] test_numpy() fails with "rabit::Init is already called in this thread"

I am using dask-xgboost 0.1.7 with xgboost 0.82.
test_core.py::test_numpy was failing for me; I looked into the failure and this is my understanding so far. I am a bit puzzled, as these tests were passing for me last week and, as far as I recall, with the same versions of the packages!
I need some help understanding what is going on here.

  1. test_core.py::test_numpy failed with rabit::Init is already called in this thread. And these are the details from pdb -
$ pytest test_core.py::test_numpy
====================================== test session starts =======================================
platform linux -- Python 3.6.8, pytest-4.6.2, py-1.8.0, pluggy-0.12.0
rootdir: ./tests
plugins: cov-2.7.1, forked-1.0.2, xdist-1.28.0
collected 1 item

test_core.py
>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB set_trace (IO-capturing turned off) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> ./tests/test_core.py(200)test_numpy()
-> dX = da.from_array(X, chunks=(2, 2))
(Pdb) n
> ./tests/test_core.py(201)test_numpy()
-> dy = da.from_array(y, chunks=(2,))
(Pdb)
> ./tests/test_core.py(202)test_numpy()
-> dbst = yield dxgb.train(c, param, dX, dy)
(Pdb)
[08:42:34] Tree method is automatically selected to be 'approx' for distributed training.
[08:42:34] Tree method is automatically selected to be 'approx' for distributed training.

> ./tests/test_core.py(203)test_numpy()
-> dbst = yield dxgb.train(c, param, dX, dy)  # we can do this twice
(Pdb)
[08:42:38] Tree method is automatically selected to be 'approx' for distributed training.
[08:42:38] Tree method is automatically selected to be 'approx' for distributed training.

> ./tests/test_core.py(205)test_numpy()
-> predictions = dxgb.predict(c, dbst, dX)
(Pdb)
rabit::Init is already called in this thread
  2. On seeing the comment # workaround for "Doing rabit call after Finalize" in the test case, I attempted to fix it with -
@@ -179,6 +179,7 @@ def test_dmatrix_kwargs(c, s, a, b):


 def _test_container(dbst, predictions, X_type):
+    xgb.rabit.init()  # workaround for "Doing rabit call after Finalize"
     dtrain = xgb.DMatrix(X_type(X), label=y)
     bst = xgb.train(param, dtrain)

@@ -195,7 +196,6 @@ def _test_container(dbst, predictions, X_type):

 @gen_cluster(client=True, timeout=None, check_new_threads=False)
 def test_numpy(c, s, a, b):
-    xgb.rabit.init()  # workaround for "Doing rabit call after Finalize"
     dX = da.from_array(X, chunks=(2, 2))
     dy = da.from_array(y, chunks=(2,))
     dbst = yield dxgb.train(c, param, dX, dy)

and this particular test case then passed, but it does not help me fix the failures when the overall test script is run. That still fails like this -

$ pytest
======================================================================================== test session starts =========================================================================================
platform linux -- Python 3.6.8, pytest-4.6.2, py-1.8.0, pluggy-0.12.0 -- ./anaconda3/envs/test-dask-xgb/bin/python
cachedir: .pytest_cache
rootdir: ./sandbox/dask-xgboost, inifile: setup.cfg
plugins: cov-2.7.1, forked-1.0.2, xdist-1.28.0
[gw0] linux Python 3.6.8 cwd: ./sandbox/dask-xgboost/dask_xgboost/tests
[gw0] Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:34:02)  -- [GCC 7.3.0]
gw0 [12]
scheduling tests via LoadScheduling
[gw0] [  8%] PASSED test_core.py::test_basic
[gw0] [ 16%] PASSED test_core.py::test_dmatrix_kwargs
[gw0] [ 25%] FAILED test_core.py::test_numpy
[gw0] [ 33%] FAILED test_core.py::test_scipy_sparse
[gw0] [ 41%] FAILED test_core.py::test_sparse
[gw0] [ 50%] PASSED test_core.py::test_errors
[gw0] [ 58%] FAILED test_core.py::test_classifier
[gw0] [ 66%] FAILED test_core.py::test_multiclass_classifier
[gw0] [ 75%] FAILED test_core.py::test_classifier_multi[array]
[gw0] [ 83%] FAILED test_core.py::test_classifier_multi[dataframe]
[gw0] [ 91%] FAILED test_core.py::test_regressor
[gw0] [100%] FAILED test_core.py::test_synchronous_api
./anaconda3/envs/test-dask-xgb/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 6 leaked semaphores to clean up at shutdown
..

Predict() Method Always Returns 1 (Binary Classification)

When you attempt to use dxgb.XGBClassifier's predict method, it always generates a prediction of 1 regardless of the predict_proba (sigmoid) output. See the minimal motivating example below, where I generate targets that are all 0. The model learns that it should generally predict 0 (low probabilities), but predict() still returns 1 for everything.

Note: you cannot pass a threshold parameter into .predict(), another notable gap.

import dask_xgboost as dxgb
from dask.distributed import Client
import dask.array as da
import numpy as np

client = Client()

X = np.random.randint(1,5,(10,2))
y = np.zeros(10)

X = da.from_array(X)
y = da.from_array(y)

model = dxgb.XGBClassifier(n_estimator=5)
model.fit(X, y)

sigmoids = model.predict_proba(X).compute()
preds = model.predict(X).compute()

print(sigmoids, preds)

Output:
(First list is sigmoids, second list is predictions)

[0.10914253 0.10914253 0.10914253 0.10914253 0.10914253 0.10914253
 0.10914253 0.10914253 0.10914253 0.10914253] [1 1 1 1 1 1 1 1 1 1]

It stems from line 537 of core.py

            cidx = (class_probs > 0).astype(np.int64)

Any single-dimensional class probability greater than 0 is evaluated as a 1. It's an easy fix: pass in a threshold parameter that replaces the hard-coded 0 with a configurable float, and default that value to 0.5.
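
A minimal sketch of the suggested change (the threshold argument and its 0.5 default are the reporter's proposal, not an existing dask-xgboost parameter):

import numpy as np

def probabilities_to_labels(class_probs, threshold=0.5):
    # Binary case: class_probs is a 1-D array of sigmoid outputs,
    # so compare against the threshold instead of against 0.
    if class_probs.ndim == 1:
        return (class_probs > threshold).astype(np.int64)
    # Multi-class case: take the most probable class per row.
    return np.argmax(class_probs, axis=1)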

XGBoostError: Boolean is not supported

Hi all, I am running the RAPIDS NYCTaxi notebook via the docker image rapidsai/rapidsai:0.10-cuda10.1-runtime-ubuntu18.04, but I am getting the error below at the training step. Any tips on how to fix it?

import dask_xgboost as dxgb_gpu

params = {
 'learning_rate': 0.3,
  'max_depth': 8,
  'objective': 'reg:squarederror',
  'subsample': 0.6,
  'gamma': 1,
  'silent': True,
  'verbose_eval': True,
  'tree_method':'gpu_hist',
  'n_gpus': 1
}

trained_model = dxgb_gpu.train(client, params, X_train, Y_train, num_boost_round=100)

Tracelog:

XGBoostError                              Traceback (most recent call last)
<timed exec> in <module>

/opt/conda/envs/rapids/lib/python3.6/site-packages/dask_xgboost/core.py in train(client, params, data, labels, dmatrix_kwargs, **kwargs)
    233     """
    234     return client.sync(_train, client, params, data,
--> 235                        labels, dmatrix_kwargs, **kwargs)
    236 
    237 

/opt/conda/envs/rapids/lib/python3.6/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    760         else:
    761             return sync(
--> 762                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    763             )
    764 

/opt/conda/envs/rapids/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    331     if error[0]:
    332         typ, exc, tb = error[0]
--> 333         raise exc.with_traceback(tb)
    334     else:
    335         return result[0]

/opt/conda/envs/rapids/lib/python3.6/site-packages/distributed/utils.py in f()
    315             if callback_timeout is not None:
    316                 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 317             result[0] = yield future
    318         except Exception as exc:
    319             error[0] = sys.exc_info()

/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py in run(self)
    740                     if exc_info is not None:
    741                         try:
--> 742                             yielded = self.gen.throw(*exc_info)  # type: ignore
    743                         finally:
    744                             # Break up a reference to itself

/opt/conda/envs/rapids/lib/python3.6/site-packages/dask_xgboost/core.py in _train(client, params, data, labels, dmatrix_kwargs, **kwargs)
    193 
    194     # Get the results, only one will be non-None
--> 195     results = yield client._gather(futures)
    196     result = [v for v in results if v]
    197     if not params.get('dask_all_models', False):

/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

/opt/conda/envs/rapids/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1699                             exc = CancelledError(key)
   1700                         else:
-> 1701                             raise exception.with_traceback(traceback)
   1702                         raise exc
   1703                     if errors == "skip":

/opt/conda/envs/rapids/lib/python3.6/site-packages/dask_xgboost/core.py in train_part()
     97         if dmatrix_kwargs is None:
     98             dmatrix_kwargs = {}
---> 99         dtrain = xgb.DMatrix(data, labels, **dmatrix_kwargs)
    100 
    101     elif labels[0] is None and isinstance(data[0], xgb.DMatrix):

/opt/conda/envs/rapids/lib/python3.6/site-packages/xgboost/core.py in __init__()
    512             self._init_from_dt(data, nthread)
    513         elif _use_columnar_initializer(data):
--> 514             self._init_from_columnar(data, missing)
    515         else:
    516             try:

/opt/conda/envs/rapids/lib/python3.6/site-packages/xgboost/core.py in _init_from_columnar()
    651             _LIB.XGDMatrixCreateFromArrayInterfaces(
    652                 interfaces, ctypes.c_int32(has_missing),
--> 653                 ctypes.c_float(missing), ctypes.byref(handle)))
    654         self.handle = handle
    655 

/opt/conda/envs/rapids/lib/python3.6/site-packages/xgboost/core.py in _check_call()
    199     """
    200     if ret != 0:
--> 201         raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    202 
    203 

XGBoostError: [16:36:13] /conda/conda-bld/xgboost_1571337679414/work/src/data/simple_csr_source.cu:161: Boolean is not supported.
Stack trace:
  [bt] (0) /opt/conda/envs/rapids/lib/libxgboost.so(+0xc9594) [0x7f80d2a83594]
  [bt] (1) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::data::SimpleCSRSource::FromDeviceColumnar(std::vector<xgboost::Json, std::allocator<xgboost::Json> > const&, bool, float)+0x743) [0x7f80d2c66443]
  [bt] (2) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::data::SimpleCSRSource::CopyFrom(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, float)+0xc74) [0x7f80d2ade9e4]
  [bt] (3) /opt/conda/envs/rapids/lib/libxgboost.so(XGDMatrixCreateFromArrayInterfaces+0x1c8) [0x7f80d2a91b08]
  [bt] (4) /opt/conda/envs/rapids/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f82df0f3630]
  [bt] (5) /opt/conda/envs/rapids/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f82df0f2fed]
  [bt] (6) /opt/conda/envs/rapids/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f82df10a00e]
  [bt] (7) /opt/conda/envs/rapids/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x13a45) [0x7f82df10aa45]
  [bt] (8) /opt/conda/envs/rapids/bin/python(_PyObject_FastCallDict+0x8b) [0x5603fddf67bb]
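
The stack trace points at DMatrix construction from the columnar (cudf) data, so a boolean column in X_train may be the culprit. A sketch of one possible workaround, assuming X_train is the dask-cudf dataframe from the snippet above:

# Cast any boolean columns to int8 before training so that DMatrix
# construction from the columnar data no longer sees a Boolean dtype.
bool_cols = [col for col, dtype in X_train.dtypes.items() if dtype == "bool"]
X_train = X_train.astype({col: "int8" for col in bool_cols})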

dxgb.train throws ValueError: need more than 1 value to unpack

I am running a master and a 5-node cluster on AWS.
All my feature variables (X_train) are continuous and have been properly cleaned, with null values filled. The target label (y_train) is 0 or 1 (float64).
I get the following error when trying to execute:
bst = dxgb.train(client, params, X_train, y_train), where X_train and y_train are data_train and labels_train.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-88-ff678a0c4ab4> in <module>()
----> 1 bst = dxgb.train(client, params, X_train, y_train)

/usr/local/lib/python2.7/site-packages/dask_xgboost/core.pyc in train(client, params, data, labels, dmatrix_kwargs, **kwargs)
    167     """
    168     return sync(client.loop, _train, client, params, data,
--> 169                 labels, dmatrix_kwargs, **kwargs)
    170 
    171 

/usr/local/lib/python2.7/site-packages/distributed/utils.pyc in sync(loop, func, *args, **kwargs)
    275             e.wait(10)
    276     if error[0]:
--> 277         six.reraise(*error[0])
    278     else:
    279         return result[0]

/usr/local/lib/python2.7/site-packages/distributed/utils.pyc in f()
    260             if timeout is not None:
    261                 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262             result[0] = yield future
    263         except Exception as exc:
    264             error[0] = sys.exc_info()

/usr/local/lib64/python2.7/site-packages/tornado/gen.pyc in run(self)
   1131 
   1132                     try:
-> 1133                         value = future.result()
   1134                     except Exception:
   1135                         self.had_exception = True

/usr/local/lib64/python2.7/site-packages/tornado/concurrent.pyc in result(self, timeout)
    259         if self._exc_info is not None:
    260             try:
--> 261                 raise_exc_info(self._exc_info)
    262             finally:
    263                 self = None

/usr/local/lib64/python2.7/site-packages/tornado/gen.pyc in run(self)
   1145                             exc_info = None
   1146                     else:
-> 1147                         yielded = self.gen.send(value)
   1148 
   1149                     if stack_context._state.contexts is not orig_stack_contexts:

/usr/local/lib/python2.7/site-packages/dask_xgboost/core.pyc in _train(client, params, data, labels, dmatrix_kwargs, **kwargs)
    119 
    120     # Start the XGBoost tracker on the Dask scheduler
--> 121     host, port = parse_host_port(client.scheduler.address)
    122     env = yield client._run_on_scheduler(start_tracker,
    123                                          host.strip('/:'),

/usr/local/lib/python2.7/site-packages/dask_xgboost/core.pyc in parse_host_port(address)
     22     if '://' in address:
     23         address = address.rsplit('://', 1)[1]
---> 24     host, port = address.split(':')
     25     port = int(port)
     26     return host, port

ValueError: need more than 1 value to unpack
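
For context, a minimal sketch of the failure mode in parse_host_port shown above (the address strings below are illustrative; any scheduler address without an explicit ':<port>' after the scheme triggers exactly this error):

def parse_host_port(address):
    if '://' in address:
        address = address.rsplit('://', 1)[1]
    host, port = address.split(':')  # ValueError when no ':' is present
    port = int(port)
    return host, port

parse_host_port('tcp://10.0.0.1:8786')       # ('10.0.0.1', 8786)
parse_host_port('inproc://10.0.0.1/1234/5')  # ValueError: need more than 1 value to unpack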

[Feature Request] Support evals_result

Hello,

Currently, the dask-xgboost package's train does not return evals_result.

I'm thinking it can be implemented in a similar way to https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/dask.py#L348

I'd be happy to open a PR with this change myself, but I'd like feedback on this implementation first, because I imagine having the existing train method return a dictionary rather than the booster object will be a breaking change for current users of this library. If this package is moving into dmlc/xgboost anyway then maybe that is acceptable; otherwise there's probably a cleaner way to return evals_result to the user.
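
For context, a sketch of how single-machine xgboost already exposes this history (dtrain and dvalid are placeholder DMatrix objects); the open question is how dask-xgboost's train should surface the same dictionary without breaking callers that expect a bare Booster:

import xgboost as xgb

evals_result = {}  # filled in place by xgb.train
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=10,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    evals_result=evals_result,
)
# evals_result now looks like:
# {'train': {'rmse': [...]}, 'valid': {'rmse': [...]}}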

question about source code logic

Dear community,

I am quite new to dask, and I am trying to figure out how exactly models like XGBoost do distributed training via Dask. I got confused by the code below from dask-xgboost/dask_xgboost/core.py: why do we only use [v for v in results if v][0] instead of the whole results list? In other words, what does the comment "only one will be non-None" mean?

# Get the results, only one will be non-None
    results = yield client._gather(futures)
    result = [v for v in results if v][0]
    num_class = params.get("num_class")
    if num_class:
        result.set_attr(num_class=str(num_class))
    raise gen.Return(result)

Thanks!

Jackie
