auniquesun / crosspoint-ddp

PyTorch DistributedDataParallel (DDP) implementation of the CVPR 2022 paper CrossPoint.

Languages: Shell 1.02%, Python 46.43%, Jupyter Notebook 52.54%

crosspoint-ddp's People

Contributors: auniquesun

Forkers: hhhtty

crosspoint-ddp's Issues

RuntimeError: [2] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout

Hello, Jerry Sun. Thank you for sharing your DDP training implementation for CrossPoint.

During training, I ran into the following error:
work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: [3] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout

It seems the processes failed to communicate with each other during the 'allgather' call.

Here are the parser settings:
Namespace(backend='nccl', batch_size=1024, class_choice=None, dropout=0.5, emb_dims=1024, epochs=250, eval=False, exp_name='exp', ft_dataset='ModelNet40', gpu_id=0, img_model_path='', k=20, lr=0.001, master_addr='localhost', master_port='12355', model='dgcnn', model_path='', momentum=0.9, no_cuda=False, num_classes=40, num_ft_points=1024, num_pt_points=2048, num_workers=32, print_freq=200, rank=-1, resume=False, save_freq=50, scheduler='cos', seed=1, test_batch_size=16, use_sgd=False, wb_key='local-e6f***', wb_url='http://localhost:28282', world_size=4)

I was training the model on a server with four NVIDIA 2080 Ti GPUs.
The running environment: Ubuntu 18.04, NVIDIA driver 525.89.02, CUDA 10.2.

Here is what I have tried so far:
To find out why the processes failed to communicate, I monitored the system with htop and nvidia-smi.

Only GPU 0 was doing any work; the other three were idle, although the program had allocated memory on all four GPUs. My guess is that the model was copied to all four GPUs but no data reached GPUs 1, 2, and 3, so the master process never got a response from the other processes.

Could you provide any ideas about how to fix the problem?

Thank you for your time! ;)
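
One way to narrow this down is to relaunch with NCCL logging enabled, so the log shows where each rank stalls while the communicator is being set up. The sketch below is only illustrative: nccl_debug_env is a hypothetical helper, not part of crosspoint-ddp, and it assumes the training script picks up MASTER_ADDR and MASTER_PORT from the environment. NCCL_DEBUG and NCCL_P2P_DISABLE are standard NCCL environment variables; whether disabling peer-to-peer transfers helps on this machine is untested.

```python
import os


def nccl_debug_env(master_addr, master_port, disable_p2p=False, base_env=None):
    """Return a copy of the environment with NCCL diagnostics switched on.

    Hypothetical helper for illustration; it is not part of crosspoint-ddp.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Rendezvous address and port (assumed to be read from the environment).
    env["MASTER_ADDR"] = str(master_addr)
    env["MASTER_PORT"] = str(master_port)
    # Ask NCCL to log communicator setup, which shows where a rank stalls.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Assumption: direct GPU peer-to-peer transfers might be the problem;
        # this tells NCCL not to use them.
        env["NCCL_P2P_DISABLE"] = "1"
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the training command, using the same address and port as in the parser settings above.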

Training so slow~~~

Hi, thank you very much for your efforts. When I ran training on three GPUs, it took a very long time, and GPU utilization never reached 100%; it was at 0% most of the time.
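
Utilization that sits near 0% often means the GPUs are waiting for input rather than computing, though nothing here confirms that. A rough check is to time how long the loop waits for each batch compared with how long the training step takes. The sketch below is a generic timing helper; data_wait_fraction and step_fn are hypothetical names, not part of crosspoint-ddp. On CUDA, step_fn should end with torch.cuda.synchronize(), otherwise asynchronous kernel launches make the step look faster than it is.

```python
import time


def data_wait_fraction(batches, step_fn):
    """Return the share of loop time spent waiting for the next batch.

    batches: any iterable of batches (for example a data loader).
    step_fn: callable that runs one training step on a batch.
    """
    wait = 0.0
    compute = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is data-loading wait
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # time spent here is the training step
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total > 0 else 0.0
```

A value close to 1.0 over a few dozen batches points at data loading; a value close to 0.0 means the slowdown is somewhere in the step itself.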

RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00)

Excuse me, when I ran distributed training, the log kept printing "DEBUG SenderThread:1236909 [sender.py:send():182] send: stats" and finally raised a RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00). The parser settings are as follows:
parser.add_argument('--master_addr', type=str, default='localhost', help='ip of master node')
parser.add_argument('--master_port', type=str, default='12355', help='port of master node')
Do I need to change these parameters?

Thanks for your reply.
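
The error reports worker_count=8 against world_size=2, which suggests that more processes registered with the same store than this run expects. One possible cause, not confirmed here, is that processes from an earlier launch are still holding port 12355. The sketch below uses only the Python standard library to check whether something is already listening on a port and to pick an unused one for --master_port; port_in_use and find_free_port are hypothetical helper names.

```python
import socket


def port_in_use(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return s.connect_ex((host, port)) == 0


def find_free_port(host="localhost"):
    """Ask the OS for a currently unused TCP port on host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]
```

If port_in_use("localhost", 12355) returns True before training starts, something is still bound to the default port; stopping the leftover processes or passing a fresh port via --master_port avoids sharing the store with them. Another process could grab the port between find_free_port and the actual launch, so this is a convenience, not a guarantee.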
