salesforce / albef

Code for ALBEF: a new vision-language pre-training method

License: BSD 3-Clause "New" or "Revised" License

Languages: Jupyter Notebook 3.30%, Python 96.70%
Topics: vision-and-language, representation-learning, image-text, weakly-supervised-learning, contrastive-learning

albef's Introduction

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, NeurIPS 2021 Spotlight (Salesforce Research).

Announcement: ALBEF is now officially integrated into LAVIS - a one-stop library for language-and-vision research and applications!

This is the official PyTorch implementation of the ALBEF paper [Blog]. This repository supports pre-training on custom datasets, as well as finetuning on VQA, SNLI-VE, NLVR2, Image-Text Retrieval on MSCOCO and Flickr30k, and visual grounding on RefCOCO+. Pre-trained and finetuned checkpoints are released.

Requirements:

  • pytorch 1.8.0
  • transformers 4.8.1
  • timm 0.4.9

Download:

Visualization:

We provide code in visualize.ipynb to visualize the important areas in an image for each word in a text. Here is an example visualization using the visual grounding checkpoint.
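
As a rough illustration of the idea (this is not the notebook's actual code; the tensor names and shapes below are made up), a per-word heatmap can be formed from cross-attention maps and their gradients in the usual Grad-CAM fashion:

import torch
import torch.nn.functional as F

# Illustrative shapes: 1 image, 12 heads, 5 text tokens, 24x24 image patches.
B, H, L, P = 1, 12, 5, 24 * 24
cross_attn = torch.rand(B, H, L, P)    # cross-attention maps (stand-in values)
attn_grads = torch.randn(B, H, L, P)   # their gradients w.r.t. the image-text matching loss

# Grad-CAM-style weighting: attention * positive gradient, averaged over heads.
cam = (cross_attn * attn_grads.clamp(min=0)).mean(dim=1).reshape(B, L, 24, 24)

# Upsample each word's patch-level map to the input resolution (e.g. 384x384) for overlay.
heatmaps = F.interpolate(cam, size=(384, 384), mode='bilinear', align_corners=False)
print(heatmaps.shape)  # torch.Size([1, 5, 384, 384])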

Try the Replicate web demo here.

Pre-training on custom datasets:

  1. Prepare training json files where each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image} (a minimal sketch follows the command below).
  2. In configs/Pretrain.yaml, set the paths for the json files.
  3. Pre-train the model using 8 A100 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env Pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain 
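
For reference, a minimal sketch of building such a json file (the paths and captions here are placeholders, not real data):

import json

# Hypothetical image/caption pairs; replace with your own data.
pairs = [
    {'image': '/path/to/images/0001.jpg', 'caption': 'a dog running on the beach'},
    {'image': '/path/to/images/0002.jpg', 'caption': 'two people riding bicycles'},
]

# Each training json is simply a list of {'image': ..., 'caption': ...} dictionaries.
with open('my_pretrain_data.json', 'w') as f:
    json.dump(pairs, f)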

Image-Text Retrieval:

  1. Download MSCOCO or Flickr30k datasets from the original websites.
  2. Download and extract the provided dataset json files.
  3. In configs/Retrieval_coco.yaml or configs/Retrieval_flickr.yaml, set the paths for the json files and the image path.
  4. Finetune the pre-trained checkpoint using 8 A100 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/Retrieval_flickr \
--checkpoint [Pretrained checkpoint]
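
To evaluate a checkpoint without further training, the same script can be run with the --evaluate flag (as used in several of the issues below), for example:
python -m torch.distributed.launch --nproc_per_node=8 --use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/Retrieval_flickr \
--checkpoint [Finetuned checkpoint] \
--evaluate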

VQA:

  1. Download VQA v2 dataset and Visual Genome dataset from the original websites.
  2. Download and extract the provided dataset json files.
  3. In configs/VQA.yaml, set the paths for the json files and the image paths.
  4. Finetune the pre-trained checkpoint using 8 A100 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env VQA.py \
--config ./configs/VQA.yaml \
--output_dir output/vqa \
--checkpoint [Pretrained checkpoint]
  5. Evaluate the result using the official evaluation server (a quick format sanity check is sketched below).
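
The official server expects a JSON list of {'question_id': ..., 'answer': ...} entries. A quick sanity check of a result file before uploading might look like this (the file path is hypothetical):

import json

# Hypothetical path to the result file produced by fine-tuning.
with open('output/vqa/result/vqa_result.json') as f:
    results = json.load(f)

# Each entry should pair a question id with a single predicted answer string.
assert isinstance(results, list)
assert all({'question_id', 'answer'} <= set(r) for r in results)
print(f'{len(results)} answers ready for submission')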

Visual Entailment:

  1. Download SNLI-VE dataset from the original website.
  2. Download and extract the provided dataset json files.
  3. In configs/VE.yaml, set the paths for the json files and the image path.
  4. Finetune the pre-trained checkpoint using 8 A100 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env VE.py \
--config ./configs/VE.yaml \
--output_dir output/VE \
--checkpoint [Pretrained checkpoint]

Visual Grounding on RefCOCO+:

  1. Download MSCOCO dataset from the original website.
  2. Download and extract the provided dataset json files.
  3. In configs/Grounding.yaml, set the paths for the json files and the image path.
  4. Finetune the pre-trained checkpoint using 8 A100 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env Grounding.py \
--config ./configs/Grounding.yaml \
--output_dir output/RefCOCO \
--gradcam_mode itm \
--block_num 8 \
--checkpoint [Pretrained checkpoint]

NLVR2:

NLVR2 requires an additional pre-training step with text-assignment (TA) to adapt the model for image-pair inputs. In order to perform TA, first set the paths for the json training files in configs/NLVR_pretrain.yaml, then run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env Pretrain_nlvr.py \
--config ./configs/NLVR_pretrain.yaml \
--output_dir output/NLVR_pretrain \
--checkpoint [Pretrained checkpoint]

We provide the checkpoint after TA pre-training, which can be fine-tuned with the following steps.

  1. Download NLVR2 dataset from the original website.
  2. Download and extract the provided dataset json files.
  3. In configs/NLVR.yaml, set the paths for the json files and the image path.
  4. Finetune the pre-trained checkpoint using 8 A100 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env NLVR.py \
--config ./configs/NLVR.yaml \
--output_dir output/NLVR \
--checkpoint [TA pretrained checkpoint]

Citation

If you find this code to be useful for your research, please consider citing.

@inproceedings{ALBEF,
      title={Align before Fuse: Vision and Language Representation Learning with Momentum Distillation}, 
      author={Junnan Li and Ramprasaath R. Selvaraju and Akhilesh Deepak Gotmare and Shafiq Joty and Caiming Xiong and Steven Hoi},
      year={2021},
      booktitle={NeurIPS},
}

albef's People

Contributors

chenxwh, lijunnan1992, svc-scm


albef's Issues

Problem of NLVR_pretrain.yaml file

Problem 1:
Hello, I found that the "train_file" field in the config lists the following files:
train_file: ['/export/home/project/VL/dataset/caption/coco_karpathy_train.json',
'/export/home/project/VL/dataset/caption/vg_caption.json',
'/export/home/project/VL/dataset/pretrain_caption/conceptual_caption_train.json',
'/export/home/project/VL/dataset/pretrain_caption/conceptual_caption_val.json',
'/export/home/project/VL/dataset/pretrain_caption/sbu_caption.json'
]
Could you please provide the above JSON file?

Problem 2:
When fine-tuning on the NLVR2 task, do I need to first run with NLVR_pretrain.yaml and then with NLVR.yaml?

Training with apex fp16

Thanks a lot for the interesting work.
By the way, I see that in the configs you always set opt to adamw. Did you use apex with fp16 for training or inference? In the code, the apex optimizer is only used when opt == 'fusedadamw'.

The max length of tokenizer is 25?

Hi, I notice that you set the tokenizer's max length to 25, while CLIP, ALIGN, and FILIP use 76, roughly three times longer. Doesn't this prevent the model from handling long descriptions, which may lead to worse performance?
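
For reference, a small illustration of the truncation being discussed (the caption text is made up, and the exact tokenizer calls in the repository may differ):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
caption = ('a very long and detailed description of a photo that keeps going well past '
           'twenty five wordpiece tokens, which is the limit being discussed in this issue')

# Anything beyond max_length wordpieces is simply cut off.
short = tokenizer(caption, padding='max_length', truncation=True, max_length=25, return_tensors='pt')
longer = tokenizer(caption, padding='max_length', truncation=True, max_length=76, return_tensors='pt')
print(short['input_ids'].shape, longer['input_ids'].shape)  # torch.Size([1, 25]) torch.Size([1, 76])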

Grad-CAM visualization code

Could you explain the code used to extract the gradients here?
grads=model.text_encoder.base_model.base_model.encoder.layer[block_num].crossattention.self.get_attn_gradients()
(block_num=8)
Why did you use base_model.base_model here, and what is the connection to "3rd layer of the multimodal heads"?

pretraining datasets json files

Hi @LiJunnan1992,

Congrats on your great work, and thanks for releasing the code!! To help reproduce the pretraining experiments, could you release the dataset json files for the pretraining datasets as well? Thanks!

Best,
Jie

Can the pre-trained model be used directly for downstream tasks?

Thanks for the great work on vision-language multi-modal pre-training.
I have tried the finetuned model on the Flickr30k retrieval task, and the numbers look good.
My question is whether the pre-trained model can be applied directly to downstream tasks. When I use the pre-trained ALBEF.pth for the Flickr30k retrieval task, the recall results are quite unsatisfactory. Is this expected for all downstream tasks when using the pre-trained model, or is it specific to retrieval?
I hope to get some suggestions.

CUDA Out of Memory

While fine-tuning the network on a larger dataset (on 8 A100 GPUs) for text-to-image retrieval, I always run into CUDA out of memory after 1000-1500 batches. If the batch size is reduced to 16 (from 32), the issue appears after 2500-3000 batches.

I am currently using -

  • pytorch 1.8.0
  • transformers 4.8.1
  • timm 0.4.9
  • batch_size_train: 32
  • batch_size_test: 32
    E.g.:
    RuntimeError: CUDA out of memory. Tried to allocate 194.00 MiB (GPU 5; 39.59 GiB total capacity; 34.18 GiB already allocated;111.19 MiB free; 37.37 GiB reserved in total by PyTorch).

Did you face any of these issues during fine-tuning?

Training on a single GPU

Hi, kudos on your work, and thank you for releasing the code. I'm looking to use ALBEF to train on my custom data of about 50k-100k image-text pairs, and I had a few doubts.

  • Can you tell me how I should go about changing the pretraining command to support training on a single GPU? (See the sketch after this list.)
  • What is the minimum batch size ALBEF must be trained with to get practical results (since contrastive loss is involved)?
  • How many epochs are recommended when training on a single GPU in order to reach pre-training convergence?
  • I noticed in the sample pretrain config file that train_file is a list containing paths to multiple datasets. For my single dataset json, should the value of this key be a list containing the path (str) to the json, or just the path as a string?

I'm fairly new to this, and I appreciate the help. Thank you
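
For the first question, a plausible single-GPU variant of the documented launch command simply sets --nproc_per_node=1 (an assumption about the scripts, not an official recommendation; the batch size may still need to be reduced to fit memory):

python -m torch.distributed.launch --nproc_per_node=1 --use_env Pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain

For the last question, the config excerpt in the earlier issue shows train_file as a list, so a single dataset would be a one-element list containing its json path.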

About pretraining process

First of all, thanks for the great work.
I use 8 V100s to train the model with 4M images. It takes 5~6 hours per epoch, which seems too slow. Is that expected?
Could you please share the pre-training log? I want to check the loss curve of the pre-training process.
Thank you in advance.

how to test the model?

hi, I just want to test the effect of the model.
here is my test code:

python Retrieval.py --config ./configs/Retrieval_flickr.yaml --output_dir output/Retrieval_flickr --checkpoint model_file/ALBEF.pth --evaluate True

I have changed the relevant configuration files.

Am I right to test like this?

thx!

selecting pretraining checkpoints / monitoring pretraining performance

Hi @LiJunnan1992,

A few questions on pretraining: (1) how do you decide which checkpoint to use for each run, are you always using the last checkpoint? (2) how do you compare pretrained checkpoints across runs? (3) do you monitor pretraining performance by using one of the downstream tasks? If so, could you provide some more details?

Best,
Jie

Got key error when loading weights finetuning on Visual Grounding

Thank you for sharing your work. I ran into a problem when finetuning for visual grounding:
python -m torch.distributed.launch --nproc_per_node=8 --use_env Grounding.py \
--config ./configs/Grounding.yaml \
--output_dir output/RefCOCO \
--gradcam_mode itm \
--block_num 8 \
--checkpoint ./pretrain/refcoco.pth
refcoco.pth was downloaded from your link.
Here is the error message:
Traceback (most recent call last):
File "Grounding.py", line 295, in
main(args, config)
File "Grounding.py", line 187, in main
state_dict = checkpoint['model']
KeyError: 'model'

VQA VG answer weight

Hi Junnan,

I noticed the answers in the VQA dataset are assigned a weight value based on their frequency, and all answers to a question have a total weight of 1. For the VG dataset, it looks like the answers are assigned a weight of 0.5 (see https://github.com/salesforce/ALBEF/blob/main/dataset/vqa_dataset.py#L66). Since there is always a single answer to a VG question, following the same rule as the VQA dataset the weight should be 1. Is there any reason not to use a weight of 1 for VG answers?
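
A minimal sketch of the frequency-based weighting described above (the annotations are made up, and this is not the repository's exact code):

from collections import Counter

# Hypothetical human answers for one VQA question.
answers = ['blue', 'blue', 'blue', 'navy', 'blue', 'teal', 'blue', 'blue', 'blue', 'navy']

counts = Counter(answers)
total = sum(counts.values())
# Each answer's weight is its relative frequency, so the weights for a question sum to 1.
weights = {ans: n / total for ans, n in counts.items()}
print(weights)                 # {'blue': 0.7, 'navy': 0.2, 'teal': 0.1}
print(sum(weights.values()))   # 1.0 (up to float rounding)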

Best,
Jie

pretrain dataset raw images

Hi team, I'm trying to use the code to pretrain a model, and I looked at the JSON files for the pretraining datasets, but they only contain image paths. Could you please provide the raw image data, or tell me where I can download those images? I found that the image file names in the JSON files differ from the official versions, and the captions are also slightly different.

If it is hard to store all the raw images on your servers, perhaps you could provide a toy dataset for users to test the code with. Thanks!

results when using resnet?

Hi, Thanks for releasing ALBEF, it is a clean and well organized repo.

We use pre-extracted ResNet image features together with text for multimodal tasks. I have tried to use ALBEF in our project, but the results are not as expected. I wonder if this is because of the ResNet image features.

I see that you provide an ALBEF variant with a ResNet model; could you share any results on downstream tasks when using ResNet for images?

Visual Grounding, Whole sentence visualization in Fig.4

Hi, I tried to visualize the whole-sentence attention map via the [CLS] token, but the result is inaccurate.
For example, with text tokens [[CLS], "grey", "sweater"], the "grey" and "sweater" tokens fit the image well, but the [CLS] token heat map is incorrect. How did you get the result of Figure 4? Thank you!

I've removed the text attention mask:

    if show_CLS:
        cams = cams[:, :, :, 1:].reshape(image.size(0), 12, -1, 24, 24)
        grads = grads[:, :, :, 1:].clamp(0).reshape(image.size(0), 12, -1, 24, 24)
    else:
        cams = cams[:, :, :, 1:].reshape(image.size(0), 12, -1, 24, 24) * mask
        grads = grads[:, :, :, 1:].clamp(0).reshape(image.size(0), 12, -1, 24, 24) * mask

NCCL problems of pretrain

I hit the following problem in the pre-training phase:
Traceback (most recent call last):
File "Pretrain.py", line 215, in
Traceback (most recent call last):
File "Pretrain.py", line 215, in
main(args, config)
File "Pretrain.py", line 93, in main
main(args, config)utils.init_distributed_mode(args)

File "Pretrain.py", line 93, in main
File "/root/albef/ALBEF/utils.py", line 257, in init_distributed_mode
utils.init_distributed_mode(args)
File "/root/albef/ALBEF/utils.py", line 257, in init_distributed_mode
torch.distributed.barrier()
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
torch.distributed.barrier()
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Have you ever had a similar problem, or can you give me some advice? Thanks!

Some difference between the paper and code

On page 3 of the paper, p_m^{i2t} = exp(...) / sum_{m=1}^{M} exp(...). I found it is different in the code, where the denominator sums over M + batch_size candidates, so I think this detail in the paper is inconsistent with the code.
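
A small sketch of the distinction being pointed out, with illustrative shapes (this is not the repository's exact implementation):

import torch
import torch.nn.functional as F

B, M, D = 4, 8, 16    # batch size, queue size, embedding dim (illustrative)
img_feat  = F.normalize(torch.randn(B, D), dim=-1)
txt_feat  = F.normalize(torch.randn(B, D), dim=-1)   # in-batch text features
txt_queue = F.normalize(torch.randn(M, D), dim=-1)   # momentum queue of text features
temp = 0.07

# As noted above, the code's denominator runs over the queue *plus* the current batch,
# i.e. M + batch_size candidates rather than M.
all_txt = torch.cat([txt_feat, txt_queue], dim=0)    # (B + M, D)
p_i2t = F.softmax(img_feat @ all_txt.t() / temp, dim=1)
print(p_i2t.shape)    # torch.Size([4, 12]); each row sums to 1 over B + M candidates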

What is the difference between coco.json and coco_train.json

"coco.json" is in the "json_pretrain" dir and has 108MB.
"coco_train.json" is in the "data" dir and has 84.6MB.

I am confused about the difference.
By the way, does "data/coco_karpathy_train.json" correspond to "coco_train.json" or "coco.json"?
I want to retrain your model. Thanks.

The number of captions of VG

Hi, thanks for the great work!

I wonder why Table 8 says the number of captions of VG is 769K while in previous papers (e.g. ViLT, UNITER) it is ~5M?

vqa training

Hi, thanks for sharing. ALBEF is wonderful and your code is well organized.
What bothers me now is that training takes too much time: about 10 hours per epoch to finetune on the VQA task with 3 RTX 3090 GPUs. I'm wondering if there is a way to speed up the training process somehow?
Thanks again for your work.

RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)

Hi, when I ran Retrieval.py I hit this issue. How can I fix it?

Traceback (most recent call last):
File "Retrieval.py", line 382, in
main(args, config)
File "Retrieval.py", line 305, in main
train_stats = train(model, train_loader, optimizer, tokenizer, epoch, warmup_steps, device, lr_scheduler, config)
File "Retrieval.py", line 51, in train
loss_ita, loss_itm = model(image, text_input,alpha=alpha, idx=idx)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ma-user/modelarts/user-job-dir/ALBEF/models/model_retrieval.py", line 130, in forward
neg_idx = torch.multinomial(weights_t2i[b], 1).item()
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
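
For context, torch.multinomial raises exactly this error when every weight in the sampled row is zero (or negative). A tiny reproduction and one possible guard (an assumption, not the repository's fix):

import torch

weights = torch.zeros(5)          # all-zero sampling weights reproduce the error
try:
    torch.multinomial(weights, 1)
except RuntimeError as e:
    print(e)                      # invalid multinomial distribution (sum of probabilities <= 0)

# One possible guard: fall back to a uniform distribution when no weight is positive.
if weights.sum() <= 0:
    weights = torch.ones_like(weights)
neg_idx = torch.multinomial(weights, 1).item()
print(neg_idx)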

About VQA answer_list.json

Hi,
Thanks for your nice work! The released code is also very useful.

I am wondering how you obtained answer_list.json. It contains 3,128 answers, yet your paper says that you use 3,192 candidate answers, while existing methods usually report 3,129 answers in their papers.

I have tried taking the top 3,129 most frequent answers on the VQA train set or train+validation set, but the resulting list only overlaps with yours on ~2,700 answers.

Thanks!

Why step_size is set to be 100?

Hi thanks for your great work.
I have a minor question about your code.
I saw that the step_size is set to 100, and the warmup_iterations is step_size * warmup_epochs.
May I ask for the justification?

Thanks!

About memory allocation

I want to do the VQA finetuning. I do not have A100 GPUs and can only run on 4 2080Ti GPUs with 11019 MB of memory each. I tried reducing the batch size to 1 but still got "CUDA out of memory". I want to ask whether this is normal.

Problems about the test results

Hi, @LiJunnan1992

Thanks for your work and release!

However, I got a lower zero-shot performance on Flickr30k, which seems to be the same as the performance reported in #23:

# val set
{'txt_r1': 90.23668639053254, 'txt_r5': 98.0276134122288, 'txt_r10': 99.11242603550296, 'txt_r_mean': 95.7922419460881, 'img_r1': 75.75936883629191, 'img_r5': 92.50493096646943, 'img_r10': 95.75936883629191, 'img_r_mean': 88.00788954635108, 'r_mean': 91.90006574621958}
# test set
{'txt_r1': 88.5, 'txt_r5': 98.5, 'txt_r10': 99.2, 'txt_r_mean': 95.39999999999999, 'img_r1': 75.92, 'img_r5': 93.34, 'img_r10': 96.66, 'img_r_mean': 88.63999999999999, 'r_mean': 92.01999999999998}


I used this command:

python -m torch.distributed.launch --nproc_per_node=8 --use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/Retrieval_flickr \
--checkpoint sources/ALBEF.pth --evaluate

Problem about the released checkpoint

Hi, Thanks for your excellent work!
I want to ask which dataset the released checkpoints were pretrained on: the 4M dataset or the 4M+10M dataset? There are two pretraining datasets in your paper.

Thanks!

questions about Grounding evaluation

In the grounding task, you used 'dets.json' to evaluate your results. How did you generate the 'dets.json' file, and which object detector did you use?

Questions about Visual Grounding checkpoint and visualization

Thank you for your outstanding work! I'm having problems with the visual grounding task:

  1. How did you get refcoco.pth?
    Fine-tuning with the provided procedure produces a 3.3 GB checkpoint_best.pth file, as large as the pretrained model; however, the refcoco.pth checkpoint you provide is only 800 MB. Could you explain how you shrank the model size?
    I tried distill: True and distill: False in the config file, and it made no difference to the final size.
    python -m torch.distributed.launch --nproc_per_node=8 --use_env Grounding.py \ --config ./configs/Grounding.yaml \ --output_dir output/RefCOCO \ --gradcam_mode itm \ --block_num 8 \ --checkpoint [Pretrained checkpoint, size 3.3G]

  2. How to evaluate refcoco.pth?
    Setting distill: False in the config file does not work for me;
    python -m torch.distributed.launch --nproc_per_node=8 --use_env Grounding.py \ --config ./configs/Grounding.yaml \ --output_dir output/RefCOCO_albefpth \ --gradcam_mode itm \ --block_num 8 \ --evaluate \ --checkpoint refcoco.pth
    It fails with the following KeyError:
    Traceback (most recent call last): File "Grounding.py", line 295, in <module> main(args, config) File "Grounding.py", line 187, in main state_dict = checkpoint['model'] KeyError: 'model'

  3. How to visualize the 3.3G checkpoint_best.pth file generated by fine-tuning?
    During fine-tuning, the printed [val, test_A, test_B] metrics look fine. However, visualization.ipynb only works with refcoco.pth, not with the 3.3 GB checkpoint_best.pth generated by fine-tuning: the heat map is a total mess, not as expected. There seems to be a gap between checkpoint_best.pth and refcoco.pth.

A quick question about visual grounding and visualizing Grad-CAM

Thanks for releasing the code!

I have a quick question about visualizing Grad-CAM.
Do you have any particular reason for using the 3rd layer of the multimodal encoder?
I've tried other layers using your demo code for visualization, but the results generated from the 4th & 5th layers are quite inaccurate.

Thanks in advance :)

test in VQA dataset

When I test on the VQA task, I run:
python -m torch.distributed.launch --nproc_per_node=8 --use_env VQA.py \
--config ./configs/VQA.yaml \
--output_dir output/vqa \
--checkpoint ./ALBEF.pth
But after running one epoch, the result folder is empty. How do I get the results? Only when all training epochs have ended?

Configs for pre-training with more than 8 GPUs?

First of all, thanks for your great work! I am trying to speed up pretraining by using more GPUs (such as 32 or 64 A100s), but this leads to a performance drop under the original settings. I suspect the 8-GPU config (batch size, number of epochs, etc.) is not suitable for more GPUs. Have you tried pretraining with more GPUs, and could you please share the config?

Some results issue

Hi Junnan, I have the following questions about the result. Hope that you can help to clarify them, thanks.

  1. VQA:
    I get the result folder after fine-tuning on VQA dataset. Which json file should I use to get test-dev and test-std?

  2. SNLI-VE:
    This is the log file after fine-tuning on SNLI-VE dataset. You didn't update the best-epoch, so it is always 0. Should I pick the row which has the best val accuracy as the final result?

  3. Grouding
    This is the log file after fine-tuning on Ref-COCO. Should I pick the row which has the best val_d as the final result?

  4. NLVR2
    This is the log file after fine-tuning on NLVR2, but I didn't find dev and test-P as shown in your paper. Any idea?

Key Error When reshaping position embedding.

Hi there, I am trying to evaluate the pretrained coco model you provided. Originally I got the same error as the original poster in #7 . However, setting distill to False does not fix the issue for me. It seems as though the checkpoint is already the model, therefore the line in #7 can be changed from state_dict = checkpoint['model'] to state_dict = checkpoint. However when I do this it gives me another key error here on line 190:

m_pos_embed_reshaped = interpolate_pos_embed(state_dict['visual_encoder_m.pos_embed'],model.visual_encoder_m) 

This indicates that the key 'visual_encoder_m.pos_embed' does not exist. Is this an error with the code or the checkpoint, or am I doing something wrong? For what it's worth, the key visual_encoder.pos_embed does exist. Any guidance is appreciated.

Also, great work.

Cannot load image from CC3M

I get the following error:
PIL.UnidentifiedImageError: cannot identify image file '/home/ubuntu/data/CC3M/DownloadConceptualCaptions/validation/10481_3355970027'

The error is generated by this code in caption_dataset.py:
image = Image.open(ann['image']).convert('RGB')

By the way, I can only download 2.4M images from CC3M training; how did you download 2.95M images? Thanks.
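
For what it's worth, corrupt or partially downloaded files can be filtered out ahead of time with a check like the following (a generic sketch, not the repository's code):

from PIL import Image, UnidentifiedImageError

def is_readable(path):
    # True if PIL can decode the file; truncated or corrupt downloads return False.
    try:
        with Image.open(path) as im:
            im.verify()
        return True
    except (UnidentifiedImageError, OSError):
        return False

print(is_readable('/home/ubuntu/data/CC3M/DownloadConceptualCaptions/validation/10481_3355970027'))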

the number of training images in pretrained checkpoints

Can you tell me how many training images were used for the provided "pretrained checkpoints", 4M or 14M?

I tested your "pretrained checkpoints" on Flickr30k and got these results with 8 V100 GPUs (32 GB):
{'txt_r1': 88.5, 'txt_r5': 98.5, 'txt_r10': 99.2, 'txt_r_mean': 95.39999999999999, 'img_r1': 75.92, 'img_r5': 93.34, 'img_r10': 96.66, 'img_r_mean': 88.63999999999999, 'r_mean': 92.01999999999998}

which are slightly different from the results you report in Table 3 of the arXiv paper, especially TR@1.

Pretrain phase problem

I have a problem in the pre-training phase: partway through a run, the program aborts with:
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6593 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6593 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6594 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6595 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6596 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6600 closing signal SIGTERM
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
time.sleep(monitor_interval)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 6523 got signal: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
result = agent.run()
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 716, in run
self._shutdown(e.sigval)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 190, in _shutdown
self._pcontext.close(death_sig)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 330, in close
self._close(death_sig=death_sig, timeout=timeout)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 709, in _close
if handler.proc.poll() is None:
File "/usr/lib/python3.6/subprocess.py", line 875, in poll
return self._internal_poll()
File "/usr/lib/python3.6/subprocess.py", line 1403, in _internal_poll
pid, sts = _waitpid(self.pid, _WNOHANG)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 6523 got signal: 1

Have you ever had a similar problem?
