bl0 / moco

Unofficial PyTorch DistributedDataParallel implementation of "MoCo: Momentum Contrast for Unsupervised Visual Representation Learning"

Home Page: https://arxiv.org/abs/1911.05722

Python 98.69% Shell 1.31%
unsupervised-learning self-supervised-learning pytorch imagenet resnet-50 moco contrast-learning momentum-contrast

moco's Introduction

Unofficial implementation of MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

Highlight

  1. Effective. Important details mentioned in the paper, such as ShuffleBN and the distributed queue, are carefully implemented to reproduce the reported results (see the sketch after this list).
  2. Efficient. The implementation is based on PyTorch DistributedDataParallel and Apex automatic mixed precision. Training MoCo on the ImageNet dataset takes only about 40 hours with 8 V100 GPUs, less than the 3 days reported in the original MoCo paper.
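
For orientation, here is a minimal sketch of the ShuffleBN idea under DistributedDataParallel: gather the key batch from all GPUs, apply one global permutation, and give each GPU a different shard, so a query and its positive key never share BatchNorm statistics. The helper name and the exact gather/permute strategy below are illustrative assumptions, not the code in train.py.

import torch
import torch.distributed as dist

@torch.no_grad()
def shuffle_bn(x_key):
    """Illustrative ShuffleBN helper (not from train.py): globally shuffle the
    key batch across GPUs so BatchNorm statistics cannot leak between a query
    and its positive key."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # Gather the key batch from every GPU.
    gathered = [torch.zeros_like(x_key) for _ in range(world_size)]
    dist.all_gather(gathered, x_key)
    all_keys = torch.cat(gathered, dim=0)

    # Rank 0 draws one random permutation and broadcasts it so all GPUs agree.
    idx_shuffle = torch.randperm(all_keys.shape[0], device=x_key.device)
    dist.broadcast(idx_shuffle, src=0)
    idx_unshuffle = torch.argsort(idx_shuffle)

    # Each GPU keeps its own shard of the shuffled global batch.
    local_idx = idx_shuffle.view(world_size, -1)[rank]
    return all_keys[local_idx], idx_unshuffle

The returned idx_unshuffle is the inverse permutation; applying it to the gathered key features after the forward pass restores the original order so every key lines up with its query again.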

Requirements

The following environments have been tested:

  • Anaconda with python >= 3.6
  • pytorch>=1.3, torchvision, cuda=10.1/9.2
  • others: pip install termcolor opencv-python tensorboard
  • [Optional] apex: automatic mixed precision training.

Train and eval on imagenet

  • The pre-training stage:

    data_dir="./data/imagenet100"
    output_dir="./output/imagenet/K65536"
    python -m torch.distributed.launch --master_port 12347 --nproc_per_node=8 \
        train.py \
        --data-dir ${data_dir} \
        --dataset imagenet \
        --nce-k 65536 \
        --output-dir ${output_dir}

    The log, checkpoints and tensorboard events will be saved in ${output_dir}. Set --amp-opt-level to O1, O2, or O3 for mixed precision training. Run python train.py --help for more help.

  • The linear evaluation stage:

    python -m torch.distributed.launch --nproc_per_node=4 \
        eval.py \
        --dataset imagenet \
        --data-dir ${data_dir} \
        --pretrained-model ${output_dir}/current.pth \
        --output-dir ${output_dir}/eval

    The checkpoints and tensorboard log will be saved in ${output_dir}/eval. Set --amp-opt-level to O1, O2, or O3 for mixed precision training (a minimal Apex sketch follows below). Run python eval.py --help for more options.
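
For context, here is a rough sketch of what --amp-opt-level amounts to with the standard Apex API; the model, optimizer, and loss below are placeholders, and the actual wiring inside train.py and eval.py may differ.

import torch
import torch.nn as nn
from apex import amp  # requires NVIDIA Apex to be installed

model = nn.Linear(128, 128).cuda()   # stand-in for the real encoder
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)

# Wrap model and optimizer once before the training loop; "O1" mixes FP16/FP32.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(4, 128).cuda()
loss = model(x).pow(2).mean()        # dummy loss for illustration

# Scale the loss so FP16 gradients do not underflow, then step as usual.
optimizer.zero_grad()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()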

Pre-trained weights

Pre-trained model checkpoints and tensorboard logs for K = 16384 and 65536 on the ImageNet dataset can be downloaded from OneDrive.

BTW, the hyperparameters are also stored in the model checkpoint; you can read the full config from a checkpoint like this:

import torch

# Load a checkpoint and inspect the training configuration stored under 'opt'.
ckpt = torch.load('model.pth')
print(ckpt['opt'])

Performance comparison with original paper

K       Acc@1 (ours)     Acc@1 (MoCo paper)
16384   59.89 (model)    60.4
65536   60.79 (model)    60.6

Notes

MultiStepLR in PyTorch 1.4 is broken (see pytorch/pytorch#33229 for details), so if you are using PyTorch 1.4, do not set --lr-scheduler to step; use cosine instead.
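
For illustration, a cosine schedule of the kind --lr-scheduler cosine selects can be built directly from PyTorch's CosineAnnealingLR. This is a minimal sketch with a placeholder model and hyperparameters, not the repo's exact scheduler setup.

import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder model with MoCo-style SGD settings.
model = nn.Linear(128, 128)
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, weight_decay=1e-4)

# Anneal the learning rate from 0.03 towards 0 over 200 epochs along a half cosine.
scheduler = CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    # ... run one training epoch here ...
    scheduler.step()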

Acknowledgements

A lot of code is borrowed from CMC and lemniscate.

moco's People

Contributors

bl0

moco's Issues

Initialization inconsistent

Hi,

Good job! When I ran this code, I found that the models on different devices are not initialized with the same random seed, which may break the subsequent gradient synchronization. Is this a bug?

Best,
Zhijie
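
For context, one common way to guarantee identical initial weights across processes, regardless of per-process random seeds, is to broadcast the parameters from rank 0 right after constructing the encoder and the momentum encoder. The helper below is only an illustrative sketch of that pattern, not a claim about how this repository handles it.

import torch.distributed as dist

def sync_initial_weights(model):
    """Broadcast rank 0's parameters and buffers to all other processes so every
    replica starts from identical weights. (Illustrative helper, not from this repo.)"""
    for param in model.parameters():
        dist.broadcast(param.data, src=0)
    for buf in model.buffers():
        dist.broadcast(buf, src=0)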

Loss is not updated

The loss does not update when running the official MoCo code. Is it because the PyTorch version is wrong?

Momentum Update Issue

There seems to be a bug on this line:

p2.data.mul_(m).add_(1 - m, p1.detach().data)

I would expect the update to be this instead:

def moment_update(model, model_ema, m):
    """ model_ema = m * model_ema + (1 - m) model """
    for p1, p2 in zip(model.parameters(), model_ema.parameters()):
        p2.data.mul_(m).add_((1 - m) * p1.detach().data)
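
For what it's worth, the two spellings compute the same update: add_(1 - m, tensor) is the legacy overload of add_(tensor, alpha=1 - m), i.e. p2 += (1 - m) * p1. A small self-check, written with the modern keyword form:

import torch

m = 0.999
p1 = torch.randn(4)
p2_a = torch.randn(4)
p2_b = p2_a.clone()

# Legacy-style update, rewritten with the modern alpha keyword.
p2_a.mul_(m).add_(p1, alpha=1 - m)
# Explicit multiplication, as in the proposed rewrite above.
p2_b.mul_(m).add_((1 - m) * p1)

# Both evaluate to m * p2 + (1 - m) * p1.
assert torch.allclose(p2_a, p2_b)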

Training time on ImageNet-1K

Hi Bin, thanks a lot for your great code base. Would you kindly tell me how long it will take to train a ResNet-50 on ImageNet-1k?

Got KeyError when resuming an Apex model

Sorry to bother you.
Have you ever resumed a model from a checkpoint trained with Apex? I got a strange error when I tried to do so.

 -- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/anaconda3/envs/dl/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
File "/home/repos/moco/train.py", line 284, in run
    func(opt,cfg,logger)
File "/home/repos/moco/train.py", line 163, in train
    load_checkpoint(args, model, model_ema, contrast, optimizer, scheduler,logger)
File "/home/repos/moco/train.py", line 101, in load_checkpoint
    optimizer.load_state_dict(checkpoint['optimizer'])
File "/home/anaconda3/envs/dl/lib/python3.7/site-packages/torch/optim/optimizer.py", line 108, in load_state_dict
    saved_groups = state_dict['param_groups']
KeyError: 'param_groups'

I printed the keys of checkpoint['optimizer'] and got
dict_keys(['multiplier', 'warmup_epoch', 'finished', 'base_lrs', 'last_epoch', '_step_count', '_get_lr_called_within_step', '_last_lr', 'after_scheduler'])
There is indeed no param_groups.

Is anything changed in the optimizer when amp.initialize is used?
What's your opinion?
I'd appreciate your help.

Question about DDP training

Hi,
Thanks for your nice work.
I downloaded your code and tried to run it on Tiny ImageNet 200, but it seems to run on only one single GPU.
My PyTorch version is 1.2.0, and the script is as below:

#!/bin/bash
CUDA_VISIBLE_DEVICES=0,1,2,3
data_dir='/dataset/tiny-imagenet-200'
output_dir='./output/imagenet200'
python -m torch.distributed.launch --master_port 12347 --nproc_per_node=8 \
	train.py \
	--data-dir ${data_dir} \
	--dataset imagenet100 \
	--base-learning-rate 0.4 \
	--alpha 0.99 \
	--crop 0.08 \
	--nce-k 65536 \
	--nce-t 0.1 \
	--local_rank 0 \
	--num-workers 0 \
	--batch-size 256 \
	--output-dir ${output_dir}

The nvidia-smi results are shown below.

(nvidia-smi screenshot omitted)

Performance on imagenet100 and imagenet1k

Have you tried your implementation on the ImageNet100 dataset? I'm getting accuracy of around 69.0 with the default config (8 GPUs, lr 0.03, batch size 256), which is lower than the MoCo implementation in the CMC repo.

Why broadcast_buffers=False in train.py?

Hi,

I see this code in train.py:
model = DistributedDataParallel(model, device_ids=[args.local_rank], broadcast_buffers=False)

Since the queue is a buffer, does this mean that each GPU has its own buffer and updates it by itself? If so, should we sync the queue across GPUs?

Thanks!!
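
For context, a common way to keep per-GPU queue buffers consistent when broadcast_buffers=False is to all-gather the freshly computed keys before enqueuing, so every replica pushes the same global batch of negatives. The sketch below illustrates that pattern; it is not necessarily how this repo's distributed queue is implemented.

import torch
import torch.distributed as dist

@torch.no_grad()
def gather_keys(local_keys):
    """All-gather the key features from every GPU so each process can push the
    same global batch of negatives into its own queue. (Illustrative helper.)"""
    gathered = [torch.zeros_like(local_keys) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_keys)
    return torch.cat(gathered, dim=0)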

About queue size and learning rate

First of all, thank you for this amazing MoCo implementation! I have learned a lot from it!
So, I'm using this code to try to train on CIFAR-10/100, and the results are terribly bad.
I've changed the ResNet-50 into another network for small images (from SupCon, by the same author as CMC).
Now I'm thinking the queue size and learning rate should probably be quite different from the ImageNet settings.
May I ask for your opinion? How do you set your lr, and do you think the queue size should be much smaller when there is less training data than ImageNet? Looking forward to your reply!

P.S. MoCo uses a linear projection and SimCLR uses a non-linear projection to a 128-D space; do you think MoCo could use a non-linear projection as well?
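
For reference, SimCLR's non-linear head (later adopted by MoCo v2) is simply a small MLP in place of the single linear projection. A minimal sketch, with dimensions assumed for ResNet-50 features and a 128-D embedding:

import torch.nn as nn

# Two-layer MLP projection head instead of a single linear layer.
# 2048 = ResNet-50 feature dimension, 128 = embedding dimension (assumed here).
projection_head = nn.Sequential(
    nn.Linear(2048, 2048),
    nn.ReLU(inplace=True),
    nn.Linear(2048, 128),
)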

NaN loss in training ImageNet 1K

Thanks a lot for sharing the code.

But when I tried to train MoCo with ResNet-50 on ImageNet-1k, I got a NaN loss. The loss increases and the probability diminishes to 0 quickly at the start of training. You can check the attached figure for more details.

(training-curve screenshot omitted)

The training configs are exactly the same as those of your model_k-65536.pth. Any suggestions for solving this problem?

My training configs are as follows:
alpha=0.999,
aug='CJ',
batch_size=64,
crop=0.2,
data_root='/home/datasets',
dataset='imagenet',
epochs=200,
exp_name='MoCo/ddp/4-gpu_bs-256_shuffle_bn',
learning_rate=0.03,
local_rank=0,
lr_decay_epochs=[120, 160, 200],
lr_decay_rate=0.1,
model='resnet50',
model_folder='./output/imagenet/MoCo/ddp/4-gpu_bs-256_shuffle_bn/models',
model_width=1,
momentum=0.9,
nce_k=65536,
nce_t=0.07,
num_workers=4,
output_root='./output',
print_freq=10,
resume='',
save_freq=10,
start_epoch=1,
tb_folder='./output/imagenet/MoCo/ddp/4-gpu_bs-256_shuffle_bn/tensorboard',
tb_freq=500,
weight_decay=0.0001

My environment configuration is as follows:

Python 3.7
Pytorch 1.4.0
torchvision 0.5.0
CUDA 9.2
4-GPU Titan Xp 12 GB

I also tried the following environment, as you suggested in the README, but the loss still increases, leading to NaN.

Python 3.6
Pytorch 1.3.0
torchvision 0.4.1
CUDA 9.2
4-GPU Titan Xp 12 GB

In addition, what is your version of torchvision? Maybe I should follow your PyTorch configuration exactly.

Question about ImageNet100 and nce_k

Hi,
Thanks for your nice work.
I've noticed that you use a larger nce_k (126689) for ImageNet100 than for ImageNet1k, and this number seems a little odd. It seems more natural to use a smaller k on a smaller dataset. Is there any motivation or explanation for it?
By the way, does ImageNet100 have a precise definition, or are the 100 classes just randomly selected?
Looking forward to your reply.

Mixed precision training

Thanks for your excellent implementation. I am not sure whether using O1 or O2 will drop performance, and if it does, what are the results? Thank you!

Why is reduce_tensor not used in train.py?

I see you set
model = DistributedDataParallel(model, device_ids=[args.local_rank], broadcast_buffers=False)
and you only use reduce_tensor in eval.py.

Does this cause the loss to be backpropagated incorrectly in train.py?

about data parallel

Hi, thanks for providing this code. In your code you use DistributedDataParallel and implement batch shuffle. Can I use this batch shuffle with DataParallel, and if yes, should I? I don't quite understand why we should use batch shuffle. Thanks.
