Fellow XGBoost Users, I am facing a strange problem that I am hoping


XGBoost repeatedly copying data across machines - slowing down computation (wormhole issue, closed)

dmlc commented on August 27, 2024


from wormhole.

Comments (5)

ankurd28 commented on August 27, 2024

So, after some digging, we found out why it was slow.
Distributed XGBoost with MPI is copying the data back and forth across the two machines, and that is slowing down the whole computation.

Does anybody have any ideas on how to fix this data copying issue?

Thanks,
Ankur


tqchen commented on August 27, 2024

The data is indeed loaded from the distributed data store, but only at startup time. So you can tell the difference by running a larger number of rounds.

The major goal of distributed xgboost is to scale up to data sizes that the single-machine version cannot handle. So it is entirely possible for the distributed version to run slower than the single-node version if the data fits on a single node.
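One way to check this from the outside is to time the same job at two different round counts and separate the one-time startup cost from the per-round cost. The following is a minimal sketch, assuming wall-clock time is roughly linear in the number of rounds; the function names and the timings are hypothetical, not part of xgboost and not measurements from this thread.

```python
def split_fixed_and_per_round(rounds_a, seconds_a, rounds_b, seconds_b):
    """Fit total = fixed + per_round * rounds through two timed runs.

    Assumes a linear cost model: a one-time startup cost (which would
    include data loading) plus a constant cost per boosting round.
    """
    if rounds_a == rounds_b:
        raise ValueError("the two runs must use different round counts")
    per_round = (seconds_b - seconds_a) / (rounds_b - rounds_a)
    fixed = seconds_a - per_round * rounds_a
    return fixed, per_round


def startup_share(fixed, per_round, rounds):
    """Fraction of total wall-clock time spent in the one-time startup phase."""
    return fixed / (fixed + per_round * rounds)


# Hypothetical timings (made-up numbers): 10 rounds took 50 s, 100 rounds took 140 s.
fixed, per_round = split_fixed_and_per_round(10, 50.0, 100, 140.0)
for rounds in (10, 100, 1000):
    share = startup_share(fixed, per_round, rounds)
    print(f"{rounds:>5} rounds: startup share = {share:.1%}")
```

If the estimated fixed cost dominates at small round counts and its share shrinks as rounds increase, that is consistent with the data being loaded once at startup. If the per-round cost itself is much higher than on a single machine, the overhead is recurring and this simple model cannot say where it comes from.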


ankurd28 commented on August 27, 2024

Hi Tianqi,

Thank you for your response!

So, if I understand you correctly, speed would be of secondary concern as long as distributed xgboost can scale up across machines. It is good to understand the design goal, since that makes clear the trade-offs that have been made in the development.

Having said that, do you have any ideas on how to speed up the distributed implementation of xgboost? In your opinion, would moving to the Hadoop framework give a speedup over the MPI framework? In other words, does the xgboost implementation on top of Hadoop also load data from a distributed data store over the network?

Thanks,
Ankur


tqchen commented on August 27, 2024

Hi @ankurd28, speed is definitely important for us.

As the data scales up, the cost of loading data over the network is minor compared to the running cost of training, in our experience. (This is different from data processing problems like MapReduce, where little computation is done on each example and data locality is crucial.)

Because more computation kicks in as we get more data, loading is likely not a problem for larger datasets. For small datasets, however, the running cost is already low, so the data loading bottleneck surfaces.


ankurd28 commented on August 27, 2024

Hi Tianqi,

Thanks a lot for your response!
I completely understand your point!

Best,
Ankur
