auniquesun / crosspoint-ddp

PyTorch DistributedDataParallel (DDP) implementation of the CVPR 2022 paper CrossPoint.
Hello, Jerry Sun. Thank you for sharing your implementation of DDP training for CrossPoint.
When I ran the training, I encountered the following issue:
work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout
It seems the processes failed to communicate with each other when the allgather was executed.
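For context, the failing call corresponds to torch.distributed.all_gather, which every rank must reach together; below is a minimal sketch of that pattern (the function and tensor names are illustrative, not the repo's actual code):

```python
# Minimal sketch of the collective that fails above (names are illustrative).
# Every rank must call all_gather at the same point; if one rank never
# arrives, the others block until the NCCL store lookup times out.
import torch
import torch.distributed as dist

def gather_features(local_feat: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    buffers = [torch.zeros_like(local_feat) for _ in range(world_size)]
    dist.all_gather(buffers, local_feat)  # blocks until all ranks participate
    return torch.cat(buffers, dim=0)
```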
Here are the parser settings:
Namespace(backend='nccl', batch_size=1024, class_choice=None, dropout=0.5, emb_dims=1024, epochs=250, eval=False, exp_name='exp', ft_dataset='ModelNet40', gpu_id=0, img_model_path='', k=20, lr=0.001, master_addr='localhost', master_port='12355', model='dgcnn', model_path='', momentum=0.9, no_cuda=False, num_classes=40, num_ft_points=1024, num_pt_points=2048, num_workers=32, print_freq=200, rank=-1, resume=False, save_freq=50, scheduler='cos', seed=1, test_batch_size=16, use_sgd=False, wb_key='local-e6f***', wb_url='http://localhost:28282', world_size=4)
I was training the model on a server with 4 NVIDIA RTX 2080 Ti GPUs.
The running environment: Ubuntu 18.04, NVIDIA driver 525.89.02, CUDA 10.2.
Here is what I tried in order to solve the problem:
To figure out the reason for the communication failure among processes, I monitored the system status with htop and nvidia-smi.
They showed that only GPU 0 was processing while the rest were idle, yet the program occupied memory on all four GPUs. I suppose the model was copied to the 4 GPUs, but no data was transmitted to GPUs 1, 2, and 3, so the master process could not get a response from the other processes.
Could you provide any ideas about how to fix the problem?
Thank you for your time! ;)
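For anyone hitting the same symptom, here is a minimal sketch of the per-process setup DDP usually expects, assuming a torch.multiprocessing.spawn launcher (main_worker and the stand-in model are illustrative; the address and port mirror the parser settings above). If torch.cuda.set_device(rank) is skipped, every process can end up computing on GPU 0 while still holding memory on all four GPUs, which would match the behavior described above.

```python
# Sketch of per-rank DDP initialization with an mp.spawn launcher
# (illustrative; mirrors master_addr='localhost', master_port='12355').
# Without torch.cuda.set_device(rank), every process may allocate and
# compute on GPU 0 even though all four GPUs show occupied memory.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def main_worker(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)            # pin this process to its own GPU
    model = torch.nn.Linear(1024, 40).cuda(rank)  # stand-in for DGCNN
    model = DDP(model, device_ids=[rank])
    # ... training loop with a DistributedSampler goes here ...
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(main_worker, args=(4,), nprocs=4)
```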
Have you tried using PointNet as a feature extractor for training?
I tried it and it failed, thank you!
My guess is that PointNet needs a different set of parameters. Thank you very much!
Excuse me, when I was running distributed training, the log kept outputting "DEBUG SenderThread: 1236909 [sender.py:send():182] send: stats", and finally reported a RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00). The parser settings are as follows:
parser.add_argument('--master_addr', type=str, default='localhost', help='ip of master node')
parser.add_argument('--master_port', type=str, default='12355', help='port of master node')
Do I need to change these parameters?
Thanks for your reply.
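On the barrier timeout itself: worker_count=8 against world_size=2 suggests more processes joined the rendezvous than expected, often stale workers from an earlier run still attached to the same port. One possible workaround is to pick a fresh port per run instead of the fixed default 12355; find_free_port below is an illustrative helper, not part of this repo.

```python
# Illustrative helper (not part of the repo): bind to port 0 so the OS
# assigns a free port, avoiding collisions with a stale rendezvous on 12355.
import socket

def find_free_port() -> str:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))                    # port 0 -> OS picks a free port
        return str(s.getsockname()[1])

# The chosen port must be set as MASTER_PORT before init_process_group and
# shared with every worker, e.g. by passing it through mp.spawn's args.
```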