Dask worker dies during dask-xgboost classifier training; observed while running test_core.py::test_classifier
Dask Version: 2.9.2
Distributed Version: 2.9.3
XGBoost Version: 0.90
Dask-XGBoost Version: 0.1.9
OS-release: 4.14.0-115.16.1.el7a.ppc64le
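For reference, here is a hedged, standalone sketch of a reproduction along the lines of the failing test. The data construction (`make_data`, and the `X2`/`y2` shapes and chunking) is illustrative, not the exact test data from `test_core.py`; `dask_xgboost.XGBClassifier` and `distributed.LocalCluster` are the real APIs. Running with `processes=False` keeps the workers in-process, so a crash should surface as a traceback rather than a silently removed worker:

```python
# Hypothetical standalone reproduction, loosely adapted from
# test_core.py::test_classifier. The dataset below is illustrative only.
import numpy as np


def make_data(n_rows=100, n_cols=2, seed=0):
    """Build a small random binary-classification dataset."""
    rng = np.random.RandomState(seed)
    X = rng.random_sample((n_rows, n_cols))
    y = rng.randint(0, 2, n_rows)
    return X, y


if __name__ == "__main__":
    # Heavy imports live here so the helper above stays importable
    # without a dask installation.
    import dask.array as da
    from distributed import Client, LocalCluster
    import dask_xgboost as dxgb

    X_np, y_np = make_data()
    X2 = da.from_array(X_np, chunks=(50, 2))
    y2 = da.from_array(y_np, chunks=50)

    # processes=False runs workers as threads in this process, so a dying
    # worker raises here instead of disappearing behind a nanny.
    with LocalCluster(n_workers=2, threads_per_worker=1,
                      processes=False) as cluster:
        with Client(cluster) as client:
            clf = dxgb.XGBClassifier()
            clf.fit(X2, y2)
            print(clf.predict(X2).compute())
```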
> /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/test_core.py(38)test_classifier()
-> with cluster() as (s, [a, b]):
(Pdb) n
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:45767
distributed.worker - INFO - Start worker at: tcp://127.0.0.1:40743
distributed.worker - INFO - Listening to: tcp://127.0.0.1:40743
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 612.37 GB
distributed.worker - INFO - Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-c6ea91c7-746e-4c7a-9c13-f5afcd244966/worker-ebbqtfdu
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Start worker at: tcp://127.0.0.1:33373
distributed.worker - INFO - Listening to: tcp://127.0.0.1:33373
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 612.37 GB
distributed.worker - INFO - Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-050815d2-54f6-4edc-9a03-dd075213449d/worker-i1yr8xvc
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:40743', name: tcp://127.0.0.1:40743, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:40743
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Registered to: tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:33373', name: tcp://127.0.0.1:33373, memory: 0, processing: 0>
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:33373
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Registered to: tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
-> a.fit(X2, y2)
(Pdb) distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Execute key: array-original-8d35e675b41aad38dc334c7f79ea1982 worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: array-original-8d35e675b41aad38dc334c7f79ea1982, {'op': 'task-finished', 'status': 'OK', 'nbytes': 80, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2651937, 'stop': 1580372953.265216, 'thread': 140735736705456, 'key': 'array-original-8d35e675b41aad38dc334c7f79ea1982'}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2696354, 'stop': 1580372953.2696435, 'thread': 140735736705456, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 0)"}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2705007, 'stop': 1580372953.2705073, 'thread': 140735736705456, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2753158, 'stop': 1580372953.275466, 'thread': 140735736705456, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2762377, 'stop': 1580372953.2763371, 'thread': 140735736705456, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2805014, 'stop': 1580372953.2805073, 'thread': 140735736705456, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2813187, 'stop': 1580372953.2813244, 'thread': 140735736705456, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
import pandas.util.testing as tm # noqa: F401
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - INFO - Run out-of-band function 'start_tracker'
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
import pandas.util.testing as tm # noqa: F401
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
import pandas.util.testing as tm # noqa: F401
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:40743', name: tcp://127.0.0.1:40743, memory: 1, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:40743 ===========================>>> One worker dies
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
len(cache))
distributed.worker - DEBUG - Execute key: train_part-e17e49e3769aaa4870dc8cc01a1e015e worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - future state: train_part-e17e49e3769aaa4870dc8cc01a1e015e - RUNNING === One worker is running infinitely
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - future state: train_part-e17e49e3769aaa4870dc8cc01a1e015e - RUNNING
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
It is not clear why the Dask worker dies at that point.
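One possibility is that the worker process is killed by a fatal signal inside native xgboost/rabit code, which would leave no Python traceback, only the "Remove worker" message seen above. A way to check this is to enable the stdlib `faulthandler` on every worker before training, via `distributed`'s real `client.run` API, so a segfault or abort dumps a traceback to the worker's stderr before it dies (a debugging sketch, not a fix):

```python
# Debugging sketch: faulthandler.enable() makes a process print a Python
# traceback when it receives a fatal signal (SIGSEGV, SIGABRT, ...),
# which is exactly the case where a dask worker vanishes without output.
import faulthandler


def enable_faulthandler():
    """Turn on faulthandler in the calling process; returns True on success."""
    faulthandler.enable()
    return faulthandler.is_enabled()


# With a live client, run it on all workers before calling fit():
# client.run(enable_faulthandler)
```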