lucidrains / byol-pytorch

Usable implementation of "Bootstrap Your Own Latent" self-supervised learning, from DeepMind, in PyTorch

License: MIT License

artificial-intelligence deep-learning self-supervised-learning

byol-pytorch's Introduction

Bootstrap Your Own Latent (BYOL), in Pytorch


Practical implementation of an astoundingly simple method for self-supervised learning that achieves a new state of the art (surpassing SimCLR) without contrastive learning and without having to designate negative pairs.

This repository offers a module with which one can easily wrap any image-based neural network (residual network, discriminator, policy network) to immediately start benefiting from unlabelled image data.

Update 1: There is now new evidence that batch normalization is key to making this technique work well

Update 2: A new paper has successfully replaced batch norm with group norm + weight standardization, refuting that batch statistics are needed for BYOL to work

Update 3: Finally, we have some analysis for why this works

Yannic Kilcher's excellent explanation

Now go save your organization from having to pay for labels :)

Install

$ pip install byol-pytorch

Usage

Simply plug in your neural network, specifying (1) the image dimensions and (2) the name (or index) of the hidden layer whose output is used as the latent representation for self-supervised training.

import torch
from byol_pytorch import BYOL
from torchvision import models

resnet = models.resnet50(pretrained=True)

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = 'avgpool'
)

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

def sample_unlabelled_images():
    return torch.randn(20, 3, 256, 256)

for _ in range(100):
    images = sample_unlabelled_images()
    loss = learner(images)
    opt.zero_grad()
    loss.backward()
    opt.step()
    learner.update_moving_average() # update moving average of target encoder

# save your improved network
torch.save(resnet.state_dict(), './improved-net.pt')

That's pretty much it. After much training, the residual network should now perform better on its downstream supervised tasks.
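
As a quick downstream sanity check (a sketch, not from this repo; the 10-class head is hypothetical), you can reload the saved weights and attach a classification head:

import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet50()
resnet.load_state_dict(torch.load('./improved-net.pt'))

# swap the final fully connected layer for a hypothetical 10-class downstream task
resnet.fc = nn.Linear(resnet.fc.in_features, 10)

# fine-tune as usual, or freeze everything except fc for linear evaluation
for p in resnet.parameters():
    p.requires_grad = False
resnet.fc.weight.requires_grad = True
resnet.fc.bias.requires_grad = True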

BYOL → SimSiam

A new paper from Kaiming He suggests that BYOL does not even need the target encoder to be an exponential moving average of the online encoder. I've decided to build in this option so that you can easily use that variant for training, simply by setting the use_momentum flag to False. If you go this route, you will no longer need to invoke update_moving_average, as shown in the example below.

import torch
from byol_pytorch import BYOL
from torchvision import models

resnet = models.resnet50(pretrained=True)

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = 'avgpool',
    use_momentum = False       # turn off momentum in the target encoder
)

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

def sample_unlabelled_images():
    return torch.randn(20, 3, 256, 256)

for _ in range(100):
    images = sample_unlabelled_images()
    loss = learner(images)
    opt.zero_grad()
    loss.backward()
    opt.step()

# save your improved network
torch.save(resnet.state_dict(), './improved-net.pt')

Advanced

While the hyperparameters have already been set to what the paper has found optimal, you can change them with extra keyword arguments to the base wrapper class.

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = 'avgpool',
    projection_size = 256,           # the projection size
    projection_hidden_size = 4096,   # the hidden dimension of the MLP for both the projection and prediction
    moving_average_decay = 0.99      # the moving average decay factor for the target encoder, already set at what paper recommends
)

By default, this library will use the augmentations from the SimCLR paper (which are also used in the BYOL paper). However, if you would like to specify your own augmentation pipeline, you can simply pass in your own custom augmentation function with the augment_fn keyword.

augment_fn = nn.Sequential(
    kornia.augmentation.RandomHorizontalFlip()
)

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = -2,
    augment_fn = augment_fn
)

In the paper, they ensure that one of the augmentation pipelines has a higher Gaussian blur probability than the other. You can adjust this to your heart's delight.

augment_fn = nn.Sequential(
    kornia.augmentation.RandomHorizontalFlip()
)

augment_fn2 = nn.Sequential(
    kornia.augmentation.RandomHorizontalFlip(),
    kornia.filters.GaussianBlur2d((3, 3), (1.5, 1.5))
)

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = -2,
    augment_fn = augment_fn,
    augment_fn2 = augment_fn2,
)

To fetch the embeddings or the projections, you simply have to pass a return_embedding = True flag to the BYOL learner instance.

import torch
from byol_pytorch import BYOL
from torchvision import models

resnet = models.resnet50(pretrained=True)

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = 'avgpool'
)

imgs = torch.randn(2, 3, 256, 256)
projection, embedding = learner(imgs, return_embedding = True)
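
The embedding is the backbone's hidden-layer output (2048-dimensional for the ResNet-50 'avgpool' layer) and is usually what you want downstream. A small sketch (not from the repo) comparing the two images above:

import torch.nn.functional as F

# cosine similarity between the two images' embeddings, a scalar in [-1, 1]
similarity = F.cosine_similarity(embedding[0], embedding[1], dim = 0)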

Distributed Training

The repository now offers distributed training with 🤗 Huggingface Accelerate. You just have to pass your own Dataset into the imported BYOLTrainer.

First, set up the configuration for distributed training by invoking the accelerate CLI

$ accelerate config

Then craft your training script as shown below, say in ./train.py

from torchvision import models

from byol_pytorch import (
    BYOL,
    BYOLTrainer,
    MockDataset
)

resnet = models.resnet50(pretrained = True)

dataset = MockDataset(256, 10000)

trainer = BYOLTrainer(
    resnet,
    dataset = dataset,
    image_size = 256,
    hidden_layer = 'avgpool',
    learning_rate = 3e-4,
    num_train_steps = 100_000,
    batch_size = 16,
    checkpoint_every = 1000     # improved model will be saved periodically to ./checkpoints folder 
)

trainer()

Then use the accelerate CLI again to launch the script

$ accelerate launch ./train.py

Alternatives

If your downstream task involves segmentation, please look at the following repository, which extends BYOL to 'pixel'-level learning.

https://github.com/lucidrains/pixel-level-contrastive-learning

Citation

@misc{grill2020bootstrap,
    title = {Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning},
    author = {Jean-Bastien Grill and Florian Strub and Florent Altché and Corentin Tallec and Pierre H. Richemond and Elena Buchatskaya and Carl Doersch and Bernardo Avila Pires and Zhaohan Daniel Guo and Mohammad Gheshlaghi Azar and Bilal Piot and Koray Kavukcuoglu and Rémi Munos and Michal Valko},
    year = {2020},
    eprint = {2006.07733},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}

@misc{chen2020exploring,
    title = {Exploring Simple Siamese Representation Learning},
    author = {Xinlei Chen and Kaiming He},
    year = {2020},
    eprint = {2011.10566},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

byol-pytorch's People

Contributors

gauenk, lucidrains, mark-selyaeff, naxalpha, umbertov, yieldthought


byol-pytorch's Issues

kornia 0.4.0 - torch 1.6.0 releases

I tried the sample code available in the README; it looks like the library depends on an earlier version of kornia (0.3.2). Since kornia only very recently released 0.4.0, this may be a bit of an early issue, but I figured it would be helpful. Note that the new kornia 0.4.0 depends on torch 1.6.0.

kornia: 0.4.0
torch: 1.6.0

Code

import torch
from byol_pytorch import BYOL
from torchvision import models

resnet = models.resnet50(pretrained=True)

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = 'avgpool'
)

Error

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-f272192ae7d5> in <module>
      8     resnet,
      9     image_size = 256,
---> 10     hidden_layer = 'avgpool'
     11 )

/opt/conda/lib/python3.7/site-packages/byol_pytorch/byol_pytorch.py in __init__(self, net, image_size, hidden_layer, projection_size, projection_hidden_size, augment_fn, moving_average_decay)
    157             RandomApply(filters.GaussianBlur2d((3, 3), (1.5, 1.5)), p=0.1),
    158             augs.RandomResizedCrop((image_size, image_size)),
--> 159             color.Normalize(mean=torch.tensor([0.485, 0.456, 0.406]), std=torch.tensor([0.229, 0.224, 0.225]))
    160         )
    161 

AttributeError: module 'kornia.color' has no attribute 'Normalize'

Short-Term Solution

Is there a way to run the sample code without the kornia dependency, e.g. by passing custom normalization and augmentations?
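
One possibility (a sketch, not from the repo, and untested against the version clash above: if the default kornia pipeline is still constructed eagerly in __init__, this won't help) is to pass your own pipeline via the augment_fn keyword using torchvision, whose transforms in torchvision >= 0.8 are nn.Modules that accept batched tensors:

import torch.nn as nn
from torchvision import transforms as T

# custom augmentations plus ImageNet normalization, standing in for the kornia defaults
augment_fn = nn.Sequential(
    T.RandomResizedCrop((256, 256)),
    T.RandomHorizontalFlip(),
    T.Normalize(mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]),
)

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = 'avgpool',
    augment_fn = augment_fn
)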

singleton question

Hi; thanks for the reference implementation.

I am trying to understand the logic behind the _get_target_encoder singleton. What is the purpose, especially considering the attribute is overwritten in the initializer already? It seems to serve no purpose that I can discern, but perhaps I am missing something?

Singleton Class Members

Forgive me for my unfamiliarity with software design, but I'm wondering why it is necessary to write a singleton wrapper for the projector and target_encoder. Is there any disadvantage to initializing them in __init__?
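
For context, the pattern being asked about is a lazy-initialization decorator. A sketch of what such a singleton wrapper typically looks like (reconstructed, not copied from the repo):

from functools import wraps

def singleton(cache_key):
    def inner_fn(fn):
        @wraps(fn)
        def wrapper(self, *args, **kwargs):
            instance = getattr(self, cache_key)
            if instance is not None:
                return instance              # reuse the cached module
            instance = fn(self, *args, **kwargs)
            setattr(self, cache_key, instance)
            return instance
        return wrapper
    return inner_fn

# usage sketch: the target encoder is only deep-copied on first use,
# i.e. after the online encoder has been moved to its final device
# @singleton('target_encoder')
# def _get_target_encoder(self):
#     return copy.deepcopy(self.online_encoder)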

example code error with multi-gpu?

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Traceback (most recent call last):
File "train.py", line 117, in
trainer.fit(model, train_loader)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 445, in fit
results = self.accelerator_backend.train()
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_spawn_accelerator.py", line 74, in train
mp.spawn(self.ddp_train, nprocs=self.nprocs, args=(self.mp_queue, model,))
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 148, in start_processes
process.start()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/opt/conda/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/opt/conda/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "/opt/conda/lib/python3.6/enum.py", line 46, in _break_on_call_reduce
raise TypeError('%r cannot be pickled' % self)
TypeError: <Resample.BILINEAR: 1> cannot be pickled
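
A possible workaround (an assumption based on the traceback, not a verified fix): ddp_spawn has to pickle the whole model, kornia augmentation enums included, whereas the plain ddp backend launches separate script processes and avoids that pickling path:

import pytorch_lightning as pl

# hypothetical: the same Trainer call as in the user's train.py, but with 'ddp'
trainer = pl.Trainer(gpus = 4, max_epochs = 100, distributed_backend = 'ddp')
trainer.fit(model, train_loader)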

Lightning example: double augmentation

Thanks for sharing this!

It seems to me that the Pytorch Lightning example augments the images twice; once in examples/lightning/train.py using Torchvision transforms and once in the BYOL class using Kornia augmentations.

/Søren

TPU/colab

Thanks for a very clean implementation. Have you enabled PL's TPU support and tried it?
I tried it on your train.py following the simple steps for enabling TPU (8 cores), and I got the dreaded "Input tensor is not an XLA tensor: torch.FloatTensor" error. It might help it go faster if you could fix and enable this.

Validation

Hey guys!

Awesome repo! Is there any way I can effortlessly add validation here, besides training the net with an fc layer after each epoch?
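
One lightweight option (a sketch, not part of this repo) is k-NN evaluation on frozen embeddings, which avoids training an fc head at all:

import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_accuracy(learner, train_imgs, train_labels, val_imgs, val_labels, k = 5):
    # embed both splits with the frozen online encoder
    _, train_emb = learner(train_imgs, return_embedding = True)
    _, val_emb = learner(val_imgs, return_embedding = True)
    train_emb = F.normalize(train_emb, dim = 1)
    val_emb = F.normalize(val_emb, dim = 1)
    sims = val_emb @ train_emb.t()                          # cosine similarities
    nn_indices = sims.topk(k, dim = 1).indices              # k nearest training samples
    preds = train_labels[nn_indices].mode(dim = 1).values   # majority vote
    return (preds == val_labels).float().mean().item()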

How to train the better backbone with BYOL?

I used the example code (byol-pytorch/examples/lightning/train.py) with my own data (about 0.7 billion images) to get an improved resnet50.
But when I replace the open-source pretrained resnet50 with the new resnet50 model, my target task is worse than with the original one (my target task is metric learning).
I tried training BYOL for 50, 100, and 200 epochs, but the target-task result is still worse.
Am I missing some settings? How can I evaluate the trained BYOL/improved model?
The code is below; I train the model with 4 V100 GPUs.


train_mode = True
load_self_pretrain = True

if train_mode:
    model = SelfSupervisedLearner(
        resnet,
        image_size = IMAGE_SIZE,
        hidden_layer = 'avgpool',
        projection_size = 256,
        projection_hidden_size = 4096,
        moving_average_decay = 0.99
    )
    if load_self_pretrain:
        # state_dict = load_state_dict_from_url(model_urls[arch],
        #                                         progress=progress)
        state_dict = torch.load(args.pretrain_path, map_location=torch.device('cpu'))
        resnet.load_state_dict(state_dict)

    ds = ImagesDataset(args.image_folder, IMAGE_SIZE)
    train_loader = DataLoader(ds, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, shuffle=True)
    if NUM_GPUS >=2:
        distributed_backend = "ddp"
    else:
        distributed_backend = None
    trainer = pl.Trainer(
        gpus = NUM_GPUS,
        max_epochs = EPOCHS,
        accumulate_grad_batches = 1,
        default_root_dir = save_path,
        distributed_backend=distributed_backend
    )

    trainer.fit(model, train_loader)
    improved_resnet = model.learner.net
    torch.save(improved_resnet.state_dict(), './improved-resnet50.pth')  

Memory leak?

Thanks a lot for this repo! When I try running this code applied to a SimCLR implementation, I get out-of-GPU memory errors after a few training iterations. Any idea what might cause this?

Feat: Add image dataset normalization

Appendix B, page 13, bottom:

In both training and evaluation, we normalize color channels by subtracting the average color and dividing by the standard deviation, computed on ImageNet, after applying the augmentations.
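
For reference, a sketch of what that would look like with this library's augment_fn keyword, appending ImageNet statistics after the augmentations (kornia.enhance.Normalize in recent kornia versions; older ones exposed it as kornia.color.Normalize):

import torch
import torch.nn as nn
import kornia

augment_fn = nn.Sequential(
    kornia.augmentation.RandomResizedCrop((256, 256)),
    kornia.augmentation.RandomHorizontalFlip(),
    kornia.enhance.Normalize(
        mean = torch.tensor([0.485, 0.456, 0.406]),
        std = torch.tensor([0.229, 0.224, 0.225]),
    ),
)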

BYOL collapses

Has anybody experienced collapse while training BYOL? After training representations with ResNet50 for 3 epochs, about 80% of the scalars in the representation vector are zeros and the loss is below 0.01.
Details: I'm using BYOL with momentum, batch size 256, while accumulating gradients over 4096/256 = 16 consecutive steps. The optimizer is Adam with LR = 0.2, as mentioned in the paper.

CKPT load

How can we load the ckpt to get the online model?

Representation Collapse

First of all, thank you so much for making this work open source!

My question isn't directly related to your implementation but rather a question about the paper -- I didn't know where else to ask, and I figured you'd have a pretty good understanding of the paper. Having said that, feel free to mark it closed if you think it is inappropriate.

I recently read the paper and I don't understand why the network doesn't cheat by simply learning to output 0s, or in their own words, by learning "collapsed representations." Any ideas?

The saved network is the same as the initial one?

Firstly, thank you so much for this clean implementation!!

The self-supervised training process looks good, but the saved (i.e. improved) model is exactly the same as the initial one on my side. Have you observed the same problem?

The code I tested:

import torch
from net.byol import BYOL
from torchvision import models

resnet = models.resnet50(pretrained=True)
param_1 = resnet.parameters()

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = 'avgpool'
)

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

def sample_unlabelled_images():
    return torch.randn(20, 3, 256, 256)

for _ in range(2):
    images = sample_unlabelled_images()
    loss = learner(images)
    opt.zero_grad()
    loss.backward()
    opt.step()
    learner.update_moving_average() # update moving average of target encoder

# save your improved network
torch.save(resnet.state_dict(), './checkpoints/improved-net.pt')

# restore the model      
resnet2 = models.resnet50()
resnet2.load_state_dict(torch.load('./checkpoints/improved-net.pt'))
param_2 = resnet2.parameters()

# test whether two models are the same 
for p1, p2 in zip(param_1, param_2):
    if p1.data.ne(p2.data).sum() > 0:
        print('They are different.')
print('They are same.')

How to save the entire network including the projection?

Hi,

Is it possible to save the entire network, including the projection layers?

In your example, you just show how to save the improved representation network.

# save your improved network
torch.save(resnet.state_dict(), './improved-net.pt')

Sorry if that's a dumb question, but I only work with Keras and have never worked with PyTorch before.
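
Not a dumb question. One straightforward approach (a sketch; note that the wrapper builds some submodules lazily, though it runs a mock forward pass at construction, which should instantiate them) is to save the wrapper's state dict, which contains the online encoder together with its projector and predictor:

# save the whole BYOL wrapper, projection layers included
torch.save(learner.state_dict(), './byol-full.pt')

# restore: reconstruct the wrapper around a fresh backbone, then load
learner = BYOL(resnet, image_size = 256, hidden_layer = 'avgpool')
learner.load_state_dict(torch.load('./byol-full.pt'))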

the optimizer

Hi! In the original paper the optimizer is LARS, but in your code it is the Adam optimizer.

Differences between BYOL and SimSiam

Thanks for your implementation of BYOL and SimSiam.
However, after reading those two papers, especially their implementation sections, I found there are also some other differences between the two structures, such as the MLP structure (the same for projection and prediction in BYOL, but not in SimSiam), weight decay (applied to different parts), and the choice of optimizer (LARS in BYOL, SGD in SimSiam).

Kornia transforms

Thanks for this clean and timely implementation. I have a few questions about the Kornia transforms.

Why did you not use torchvision.transforms in the data loader instead? Is it for performance reasons?

Transferring results on Cifar and other datasets

Thanks for open-sourcing this!

I notice that BYOL has a large gap on the downstream transfer datasets: e.g., SimCLR reaches 71.6% on CIFAR-100, while BYOL can reach 78.4%.

I understand that this might depend on the downstream training protocols. Could you provide us with sample code for that, especially for the LBFGS-optimized logistic regressor?

Error when using 2 GPUs

Hi! When I try to run the Pytorch lightning example code, I get the following error. Any idea how to fix this?

Epoch 0: 50%|#########################################################################5 | 1/2 [00:06<00:06, 6.06s/it, loss=3.94, v_num=10$
Traceback (most recent call last):
File "train.py", line 118, in
trainer.fit(model, train_loader)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 513, in fit
self.dispatch()
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in dispatch
self.accelerator.start_training(self)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 111, in start_training
self._results = trainer.run_train()
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 644, in run_train
self.train_loop.run_training_epoch()
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 492, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 650, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 434, in optimizer_step
using_lbfgs=is_lbfgs,
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1384, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 219, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 135, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 278, in optimizer_step
self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 283, in run_optimizer_step
self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 160, in optimizer_step
optimizer.step(closure=lambda_closure, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/torch/optim/adam.py", line 66, in step
loss = closure()
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 645, in train_step_and_backward_closure
split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 293, in training_step
training_step_output = self.trainer.accelerator.training_step(args)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 157, in training_step
return self.training_type_plugin.training_step(*args)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 287, in training_step
return self.model(*args, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 606, in forward
if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused
parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating los
s. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss
function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
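
The error message's first suggestion can be followed in Lightning (a sketch for pytorch-lightning 1.x, where DDPPlugin accepts find_unused_parameters; verify against your installed version):

import pytorch_lightning as pl
from pytorch_lightning.plugins import DDPPlugin

# let DDP tolerate parameters that don't contribute to the loss
trainer = pl.Trainer(
    gpus = 2,
    accelerator = 'ddp',
    plugins = [DDPPlugin(find_unused_parameters = True)],
)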

Feat: Performance tests

First of all: Many thanks for this rapid implementation of the paper's method. The implementation looks very clean.

I was wondering if you've done any performance testing to see if the implementation works as expected. E.g. by reproducing the Linear Evaluation on ImageNet experiment with BYOL.

Code error when using torch.nn.DataParallel for multi-gpu: AssertionError: hidden layer avgpool never emitted an output

First of all, thanks for your implementation.
This code runs well when I use a single GPU, but when I try to use multiple GPUs to speed up my pretraining process, something goes wrong.

The error looks like this (4 GPUs used):
Traceback (most recent call last):
File "ssl_train.py", line 478, in
main()
File "ssl_train.py", line 474, in main
run(args)
File "ssl_train.py", line 320, in run
summary_pretrain = pretrain_epoch(summary_pretrain, model, optimizer, dataloader_pretrain)
File "ssl_train.py", line 102, in pretrain_epoch
loss = model(data_pretrain)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
AssertionError: Caught AssertionError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jzj/projects/medical/camelyon16/bin/data/byol_pytorch.py", line 240, in forward
online_proj_one, _ = self.online_encoder(image_one)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jzj/projects/medical/camelyon16/bin/data/byol_pytorch.py", line 155, in forward
representation = self.get_representation(x)
File "/home/jzj/projects/medical/camelyon16/bin/data/byol_pytorch.py", line 151, in get_representation
assert hidden is not None, f'hidden layer {self.layer} never emitted an output'
AssertionError: hidden layer avgpool never emitted an output

so I try to print the hidden in get_representation(self, x):

def get_representation(self, x):
    if self.layer == -1:
        return self.net(x)

    if not self.hook_registered:
        self._register_hook()

    _ = self.net(x)
    hidden = self.hidden
    print('###################################')
    print(hidden)
    print('###################################')
    self.hidden = None
    assert hidden is not None, f'hidden layer {self.layer} never emitted an output'
    return hidden

and it turns out like this:
###################################
tensor([[1.0303, 1.0782, 1.0756, ..., 1.1541, 0.8629, 1.0167],
[0.9641, 1.0843, 1.1032, ..., 0.9906, 1.0737, 1.0357]],
grad_fn=)
###################################
###################################
tensor([[1.0060, 1.0688, 1.0976, ..., 1.0569, 0.9572, 1.2543],
[1.0713, 1.1613, 1.0059, ..., 1.0245, 0.9633, 0.8983]],
grad_fn=)
###################################
###################################
tensor([[1.0303, 1.0782, 1.0756, ..., 1.1541, 0.8629, 1.0167],
[0.9641, 1.0843, 1.1032, ..., 0.9906, 1.0737, 1.0357]])
###################################
###################################
tensor([[1.0060, 1.0688, 1.0976, ..., 1.0569, 0.9572, 1.2543],
[1.0713, 1.1613, 1.0059, ..., 1.0245, 0.9633, 0.8983]])
###################################
###################################
None
###################################
###################################
None
###################################
###################################
None
###################################
###################################
None
###################################

Error when running the example code

Hi! When I try to run the Pytorch lightning example code, I get the following error. Any idea how to fix this?

Traceback (most recent call last):
File "train.py", line 94, in
trainer.fit(model, train_loader)
File "/home/jwl2182/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line903, in fit
mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model, ))
File "/home/jwl2182/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/jwl2182/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/jwl2182/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 108, in join
(error_index, name)
Exception: process 0 terminated with signal SIGSEGV

Training loss decreased and then increased

Hi, I used your example on my own data. The training loss decreased and then increased after 100 epochs, which is weird. Did you encounter similar situations? Is it hard to train the model?
The batch size is 128/256, lr is 0.1/0.2, and weight_decay is 1e-6.

Feat: Gradient accumulation

I've never worked with lightning, but the paper mentions that

To avoid re-tuning other hyperparameters, we average gradients over N consecutive steps before updating the online network when reducing the batch size by a factor N. The target network is updated once every N steps, after the update of the online network; we accumulate the N steps in parallel in our runs.

Which can be done using something like

# some code
# Initialize dataset with batch size 10
opt.zero_grad()
for i, (input, target) in enumerate(dataset):
    pred = net(input)
    loss = crit(pred, target)
    # one graph is created here
    loss.backward()
    # graph is cleared here
    if (i+1)%10 == 0:
        # every 10 iterations of batches of size 10
        opt.step()
        opt.zero_grad()

from https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20

Might be useful to get maximum performance with your implementation.

Negative Loss, Transfer Learning/Fine-Tuning Question

Hi! Thanks for sharing this repo -- really clean and easy to use.

When training with the PyTorch Lightning script from the repo, my loss is negative (and gets more negative over time). Is this expected?
[screenshot: loss curve trending negative]


I'm curious to know if you've fine-tuned a pretrained model using BYOL as the README example suggests. If yes, how were the results? Any intuition regarding how many epochs to fine-tune for?

Thanks!

Can't load ckpt

I use byol-pytorch-master/examples/lightning/train.py to generate a ckpt locally after training, but when I load the ckpt I get the following errors. How should I load it? Thanks a lot!
[screenshot of the error]

0.5.6 failing Multiple GPU Error

Not able to run the code on multiple GPUs using torch.nn.DataParallel.
@lucidrains, this is the reason: self.hidden doesn't have the corresponding key.

hidden = self.hidden[x.device]
KeyError: device(type='cuda', index=0)

This is the exact line.
Please help with the same.
Thanks!

Error when running the model with DDP

Hi,

Thanks for providing such a cool implementation! When I ran your code with multiple GPUs, I ran into the below error.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Traceback (most recent call last):
File "train.py", line 139, in
trainer.fit(model, train_loader)
File "/home/suhong/miniconda3/envs/byol/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
results = self.accelerator_backend.train()
File "/home/suhong/miniconda3/envs/byol/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_spawn_accelerator.py", line 64, in train
mp.spawn(self.ddp_train, nprocs=self.nprocs, args=(self.mp_queue, model,))
File "/home/suhong/miniconda3/envs/byol/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/suhong/miniconda3/envs/byol/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 148, in start_processes
process.start()
File "/home/suhong/miniconda3/envs/byol/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/home/suhong/miniconda3/envs/byol/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/home/suhong/miniconda3/envs/byol/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/home/suhong/miniconda3/envs/byol/lib/python3.6/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/home/suhong/miniconda3/envs/byol/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/home/suhong/miniconda3/envs/byol/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "/home/suhong/miniconda3/envs/byol/lib/python3.6/enum.py", line 46, in _break_on_call_reduce
raise TypeError('%r cannot be pickled' % self)
TypeError: <Resample.BILINEAR: 1> cannot be pickled

Since there is no specification of the accelerator type, pytorch-lightning assigns ddp_spawn by default. Do you have any insight into resolving this error?

How to convert the trained ckpt to a PyTorch .pth model?

I used the example script to train a model and got a ckpt file, but how can I extract the trained resnet50.pth instead of the whole SelfSupervisedLearner? Sorry, I am new to the pytorch-lightning lib.
What I want is the self-supervised resnet50.pth, because I want it to replace the original ImageNet-pretrained one.
Thank you a lot.
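
A sketch of one way to do it (assuming the layout of examples/lightning/train.py, where the backbone's weights in the checkpoint are prefixed with 'learner.net.'; verify the prefix against your own checkpoint's keys):

import torch
from torchvision import models

ckpt = torch.load('path/to/checkpoint.ckpt', map_location = 'cpu')

# keep only the backbone weights, stripping the wrapper prefix
prefix = 'learner.net.'
backbone_state = {
    k[len(prefix):]: v
    for k, v in ckpt['state_dict'].items()
    if k.startswith(prefix)
}

resnet = models.resnet50()
resnet.load_state_dict(backbone_state)
torch.save(resnet.state_dict(), './resnet50.pth')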

BYOL Pretrained Model

Hello there,
I wonder if you host your BYOL pretrained model (ResNet50 in particular, trained over ImageNet) somewhere,
so that we can compare its performance to other methods.

Error when training my own 1D dataset

Thanks for your great work! Now I encounter a tricky problem:
The size of input x is [1024, 12, 7500], but after running 'image_one, image_two = self.augment1(x), self.augment2(x)' in line 195 of byol_pytorch, the sizes of image_one and image_two become [12, 7500].
I have checked my augment_fn and made sure they are right.
This problem sometimes happens at the very beginning and sometimes during the training loop.
Can you give me some advice? Thanks a lot!

Did you try using apex for the learner?

Hi lucidrains! Thanks for your amazing packages!

Could you please give me some advice?

I'm trying to use apex with byol-pytorch (with the default kornia augmentations) in DDP mode and got the following error:

Traceback (most recent call last):
File "byol.py", line 221, in
main(args)
File "byol.py", line 160, in main
loss = learner(X)
File "/root/PycharmProjects/torch16projects/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/PycharmProjects/torch16projects/venv/lib/python3.6/site-packages/apex/parallel/distributed.py", line 560, in forward
result = self.module(*inputs, **kwargs)
File "/root/PycharmProjects/torch16projects/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/PycharmProjects/torch16projects/venv/lib/python3.6/site-packages/apex/amp/_initialize.py", line 197, in new_fwd
**applier(kwargs, input_caster))
File "/root/PycharmProjects/torch16projects/venv/lib/python3.6/site-packages/byol_pytorch/byol_pytorch.py", line 187, in forward
image_one, image_two = self.augment(x), self.augment(x)
File "/root/PycharmProjects/torch16projects/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/PycharmProjects/torch16projects/venv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/root/PycharmProjects/torch16projects/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/PycharmProjects/torch16projects/venv/lib/python3.6/site-packages/kornia/augmentation/augmentation.py", line 63, in forward
output = self.apply_transform(input, self._params)
File "/root/PycharmProjects/torch16projects/venv/lib/python3.6/site-packages/kornia/augmentation/augmentation.py", line 709, in apply_transform
return F.apply_crop(input, params)
File "/root/PycharmProjects/torch16projects/venv/lib/python3.6/site-packages/kornia/augmentation/functional.py", line 595, in apply_crop
input, params['src'], params['dst'], resample_mode, align_corners=align_corners)
File "/root/PycharmProjects/torch16projects/venv/lib/python3.6/site-packages/kornia/geometry/transform/crop.py", line 177, in crop_by_boxes
dst_trans_src: torch.Tensor = get_perspective_transform(src_box.to(tensor.dtype), dst_box.to(tensor.dtype))
File "/root/PycharmProjects/torch16projects/venv/lib/python3.6/site-packages/kornia/geometry/transform/imgwarp.py", line 249, in get_perspective_transform
X, LU = torch.solve(b, A)
RuntimeError: "solve_cpu" not implemented for 'Half'

Does Kornia not support fp16, or am I doing something wrong? fp32 training is much more expensive.
How can I fix it, or can I use some arbitrary aug function?

Thanks for your time.

RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment

I tried the sample code available in the README, but I get an error.

byol-pytorch: 0.5.4
torch: 1.7.1
kornia: 0.4.1

Code:

import torch
from byol_pytorch import BYOL
from torchvision import models

resnet = models.resnet50(pretrained=True)

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = 'avgpool'
)

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

def sample_unlabelled_images():
    return torch.randn(20, 3, 256, 256)

for _ in range(100):
    images = sample_unlabelled_images()
    loss = learner(images)
    opt.zero_grad()
    loss.backward()
    opt.step()
    learner.update_moving_average() # update moving average of target encoder

# save your improved network
torch.save(resnet.state_dict(), './improved-net.pt')

Error

Traceback (most recent call last):
  File "demo.py", line 12, in <module>
    hidden_layer = 'avgpool'
  File "/cloud/byol-pytorch/byol_pytorch/byol_pytorch.py", line 224, in __init__
    self.forward(torch.randn(2, 3, image_size, image_size, device=device))
  File "/cloud/byol-pytorch/byol_pytorch/byol_pytorch.py", line 256, in forward
    target_encoder = self._get_target_encoder() if self.use_momentum else self.online_encoder
  File "/cloud/byol-pytorch/byol_pytorch/byol_pytorch.py", line 28, in wrapper
    instance = fn(self, *args, **kwargs)
  File "/cloud/byol-pytorch/byol_pytorch/byol_pytorch.py", line 229, in _get_target_encoder
    target_encoder = copy.deepcopy(self.online_encoder)
  File "/environment/python/versions/miniconda3-4.7.12/lib/python3.7/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/environment/python/versions/miniconda3-4.7.12/lib/python3.7/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/environment/python/versions/miniconda3-4.7.12/lib/python3.7/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/environment/python/versions/miniconda3-4.7.12/lib/python3.7/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/environment/python/versions/miniconda3-4.7.12/lib/python3.7/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/environment/python/versions/miniconda3-4.7.12/lib/python3.7/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/environment/python/versions/miniconda3-4.7.12/lib/python3.7/copy.py", line 161, in deepcopy
    y = copier(memo)
  File "/environment/python/versions/miniconda3-4.7.12/lib/python3.7/site-packages/torch/tensor.py", line 47, in __deepcopy__
    raise RuntimeError("Only Tensors created explicitly by the user "
RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment

Pretrained network

Hi, thanks for sharing the code and making it so easy to use.
I see in the example you set resnet = models.resnet50(pretrained=True).
Is this what is done in the paper? Shouldn't self-supervised-learned networks be trained from scratch?

Thanks again,
P.

Example Code Parameters

I believe the example code is wrong. The optimizer should only take the parameters from the online encoder and predictor, not all of the parameters as it currently does.
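
If so, a hedged sketch of the suggested fix (assuming the wrapper exposes online_encoder and online_predictor attributes, which the tracebacks elsewhere on this page suggest; verify against your installed version):

opt = torch.optim.Adam(
    list(learner.online_encoder.parameters()) + list(learner.online_predictor.parameters()),
    lr = 3e-4
)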

Error running the example code when I use 2 GPUs

Hi! When I try to run the PyTorch Lightning example code with
trainer = pl.Trainer(gpus=2, max_epochs=1000, distributed_backend='dp'); trainer.fit(model, train_loader)
I get the following error:
"AssertionError: hidden layer avgpool never emitted an output"
But when I use 1 GPU there is no error. Can you give me some advice?

Just wondering - DataLoader does not shuffle in the example

https://github.com/lucidrains/byol-pytorch/blob/master/examples/lightning/train.py#L95

Hi @lucidrains, thanks for sharing this great implementation.
Just wondering if we really don't need shuffle=True with the train_loader.

train_loader = DataLoader(ds, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS)

If we use a dataset big and random enough, this might not be a problem, but it would cause trouble with a smaller set.

Just close this if this is expected or you think this is not a problem.
Thanks anyway.

Cannot get embedding

When I test in an env with python 3.6 and torch 1.6, there is an error when I try to get the embedding:
Traceback (most recent call last):
File "test.py", line 14, in
projection, embedding = learner(imgs, return_embedding = True)
File "/home/dung/miniconda3/envs/tf_gpu_torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'return_embedding'

And it is weird that when I test in another env it works normally. Does anyone know the reason and how to fix this?

0.5.5 version no longer returns projections.

As mentioned in the readme

projection, embedding = learner(imgs, return_embedding = True)

Users should be able to get projections and embeddings; with 0.5.5 they can no longer do so.
Note: it's happening because of this.
