open-mmlab / mmselfsup

OpenMMLab Self-Supervised Learning Toolbox and Benchmark

Home Page: https://mmselfsup.readthedocs.io/en/latest/

License: Apache License 2.0

Python 97.67% Shell 2.20% Dockerfile 0.12%
self-supervised-learning unsupervised-learning pytorch moco simclr simsiam mae masked-image-modeling beit

mmselfsup's Introduction

 

📘Documentation | 🛠️Installation | 👀Model Zoo | 🆕Update News | 🤔Reporting Issues

🌟 MMPreTrain is a newly upgraded open-source framework for visual pre-training. It aims to provide multiple powerful pre-trained backbones and to support different pre-training strategies.

👉 The MMPreTrain 1.0 branch is in trial; everyone is welcome to try it and discuss with us! 👈

English | 简体中文

Introduction

MMSelfSup is an open source self-supervised representation learning toolbox based on PyTorch. It is a part of the OpenMMLab project.

The master branch works with PyTorch 1.8 or higher.

Major features

  • Methods All in One

MMSelfSup provides state-of-the-art methods in self-supervised learning. To enable comprehensive comparison across all benchmarks, most pre-training methods are run under the same settings.

  • Modular Design

MMSelfSup follows the modular code architecture of other OpenMMLab projects, making it flexible and convenient for users to build their own algorithms.

  • Standardized Benchmarks

    MMSelfSup standardizes the benchmarks including logistic regression, SVM / Low-shot SVM from linearly probed features, semi-supervised classification, object detection and semantic segmentation.

  • Compatibility

Since MMSelfSup adopts a similar design of modules and interfaces to other OpenMMLab projects, it supports smooth evaluation on downstream tasks such as object detection and segmentation with those projects.

What's New

MMSelfSup v1.0.0 has been released based on the main branch. Please refer to the Migration Guide for more details.

MMSelfSup v1.0.0 was released on 06/04/2023.

  • Support PixMIM.
  • Support DINO in projects/dino/.
  • Refactor file io interface.
  • Refine documentations.

MMSelfSup v1.0.0rc6 was released on 10/02/2023.

  • Support MaskFeat with video datasets in projects/maskfeat_video/.
  • Translate documentation into Chinese.

MMSelfSup v1.0.0rc5 was released on 30/12/2022.

  • Support BEiT v2, MixMIM, EVA.
  • Support ShapeBias for model analysis.
  • Add solution of FGIA ACCV 2022 (1st place).
  • Refactor t-SNE.

Please refer to Changelog for details and release history.

Differences between MMSelfSup 1.x and 0.x can be found in Migration.

Installation

MMSelfSup depends on PyTorch, MMCV, MMEngine and MMClassification.

Please refer to Installation for more detailed instructions.

Get Started

For tutorials, we provide User Guides for basic usage:

Pretrain

Downstream Tasks

Useful Tools

Advanced Guides and Colab Tutorials are also provided.

Please refer to FAQ for frequently asked questions.

Model Zoo

Please refer to Model Zoo.md for a comprehensive set of pre-trained models and benchmarks.

Supported algorithms:

More algorithms are planned.

Benchmark

Benchmark                                          | Setting
ImageNet Linear Classification (Multi-head)        | Goyal2019
ImageNet Linear Classification (Last)              |
ImageNet Semi-Sup Classification                   |
Places205 Linear Classification (Multi-head)       | Goyal2019
iNaturalist2018 Linear Classification (Multi-head) | Goyal2019
PASCAL VOC07 SVM                                   | Goyal2019
PASCAL VOC07 Low-shot SVM                          | Goyal2019
PASCAL VOC07+12 Object Detection                   | MoCo
COCO17 Object Detection                            | MoCo
Cityscapes Segmentation                            | MMSeg
PASCAL VOC12 Aug Segmentation                      | MMSeg

Contributing

We appreciate all contributions improving MMSelfSup. Please refer to the Contribution Guides for details about the contributing guidelines.

Acknowledgement

MMSelfSup is an open source project contributed by researchers and engineers from various colleges and companies. We appreciate all the contributors who implement their methods or add new features, as well as users who give valuable feedback. We hope the toolbox and benchmark can serve the growing research community by providing a flexible toolkit for reimplementing existing methods and developing new algorithms.

MMSelfSup originates from OpenSelfSup, and we appreciate all early contributions made to OpenSelfSup. A few contributors are listed here: Xiaohang Zhan (@XiaohangZhan), Jiahao Xie (@Jiahao000), Enze Xie (@xieenze), Xiangxiang Chu (@cxxgtxy), Zijian He (@scnuhealthy).

Citation

If you use this toolbox or benchmark in your research, please cite this project.

@misc{mmselfsup2021,
    title={{MMSelfSup}: OpenMMLab Self-Supervised Learning Toolbox and Benchmark},
    author={MMSelfSup Contributors},
    howpublished={\url{https://github.com/open-mmlab/mmselfsup}},
    year={2021}
}

License

This project is released under the Apache 2.0 license.

Projects in OpenMMLab

  • MMEngine: OpenMMLab foundational library for training deep learning models.
  • MMCV: OpenMMLab foundational library for computer vision.
  • MMEval: A unified evaluation library for multiple machine learning libraries.
  • MIM: MIM installs OpenMMLab packages.
  • MMClassification: OpenMMLab image classification toolbox and benchmark.
  • MMDetection: OpenMMLab detection toolbox and benchmark.
  • MMDetection3D: OpenMMLab's next-generation platform for general 3D object detection.
  • MMRotate: OpenMMLab rotated object detection toolbox and benchmark.
  • MMYOLO: OpenMMLab YOLO series toolbox and benchmark.
  • MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark.
  • MMOCR: OpenMMLab text detection, recognition, and understanding toolbox.
  • MMPose: OpenMMLab pose estimation toolbox and benchmark.
  • MMHuman3D: OpenMMLab 3D human parametric model toolbox and benchmark.
  • MMSelfSup: OpenMMLab self-supervised learning toolbox and benchmark.
  • MMRazor: OpenMMLab model compression toolbox and benchmark.
  • MMFewShot: OpenMMLab fewshot learning toolbox and benchmark.
  • MMAction2: OpenMMLab's next-generation action understanding toolbox and benchmark.
  • MMTracking: OpenMMLab video perception toolbox and benchmark.
  • MMFlow: OpenMMLab optical flow toolbox and benchmark.
  • MMEditing: OpenMMLab image and video editing toolbox.
  • MMGeneration: OpenMMLab image and video generative models toolbox.
  • MMDeploy: OpenMMLab model deployment framework.


mmselfsup's Issues

Questions about ODC's accuracy

Hi~ I'm trying your ODC (CVPR 2020) method on my own dataset. I'd like to know the accuracy of the model after training for 400 epochs. Many thanks!

AttributeError: 'Clustering' object has no attribute 'obj'

Hello, thanks for your work.
When I run the code, I get this error:

    __getattr__ = lambda self, name: _swig_getattr(self, Clustering, name)
  File "/home/anaconda3/envs/uda/lib/python3.6/site-packages/faiss/swigfaiss_avx2.py", line 80, in _swig_getattr
    raise AttributeError("'%s' object has no attribute '%s'" % (class_type.__name__, name))
AttributeError: 'Clustering' object has no attribute 'obj'

How to solve this Error? Thanks.
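
A sketch of a possible workaround, assuming this comes from the faiss API change that replaced Clustering.obj with per-iteration stats (the iteration_stats field and its access pattern below follow that assumed newer API):

    # Hedged sketch: in newer faiss releases, `Clustering.obj` was replaced
    # by `iteration_stats`; each entry's `.obj` holds that iteration's
    # k-means objective. `clus` is the faiss.Clustering object.
    stats = clus.iteration_stats
    losses = [stats.at(i).obj for i in range(stats.size())]
    final_loss = losses[-1] if losses else float('nan')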

ResNeXt?

Hello,
Do you support ResNeXt as the backbone in this repo?

Right way to reproduce results on ImageNet linear classification

Hi,

Thanks to the authors for their contribution to this great repo.
I am reproducing the results on ImageNet linear classification with:

  1. pretrained model: moco_r50_v2-e3b0c442.pth (downloaded from model zoo)
  2. config file: configs/benchmarks/linear_classification/imagenet/r50_last.py

I use 4 GPUs and accordingly adjust imgs_per_gpu=64.
The training script is:
bash benchmarks/dist_train_linear.sh configs/benchmarks/linear_classification/imagenet/r50_last.py moco_r50_v2-e3b0c442.pth

I only obtain the following performance:
head0_top1: 37.9560, head0_top5: 67.7320
While the reported performance is 67.69 on Top-1.

Are there any mistakes in my reproduction?

Any plans about speeding up?

Thanks for such impressive work.
However, the training settings mainly focus on a batch size of 256.
I have tested the cost on V100s, and it takes about 7 days to train MoCo/SimCLR with ResNet-50 (default settings).
I doubled the batch size and learning rate; however, the training speed remained almost unchanged. Perhaps AMP or DALI could be used to speed up the training process.

Moreover, have you tested the results with a larger batch size? If so, are the results comparable with the scores reported at batch size 256?
Thanks!

Issue training SimCLR

When I use the SimCLR training code, my code always hangs after exactly 19 epochs and 3750 iterations. I can terminate it and restart from the previous checkpoint, but it would be nice not to have to do this. Do you have any idea why this might happen?

How to reproduce the results of BYOL

I want to reproduce the results of BYOL myself. However, I'm not sure how to use multiple machines to train on the ImageNet dataset with batch size 4096. Can you give a simple example?

InterCLR implementation

Hi Xiaohang,
Thank you for sharing the amazing work.
I would like to check with you whether this repo contains the implementation and pretrained models of the InterCLR method. I read your recent paper "Delving into Inter-Image Invariance for Unsupervised Visual Representations" and would like to study the method further.

Thank you

How to save the top-k models

How can I save the top-k models?
I can only find this: checkpoint_config = dict(interval=10)
but when I change it to checkpoint_config = dict(save_top_k=5), I get an error.
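
For what it's worth, save_top_k looks like a PyTorch Lightning argument rather than an mmcv one. A minimal sketch of the closest mmcv-style config options, assuming a recent mmcv where max_keep_ckpts and save_best exist:

    # Keep only the latest 5 checkpoints (not top-k by metric):
    checkpoint_config = dict(interval=10, max_keep_ckpts=5)

    # Or, where an evaluation hook runs, additionally save the best
    # checkpoint by a validation metric (assumes mmcv's EvalHook
    # supports `save_best`):
    evaluation = dict(interval=10, save_best='auto')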

a table of accuracy comparison

Hi, all,

Thanks a lot for your great efforts. Just one suggestion: would you append a table comparing accuracy for each method you provide?

THX!

PyTorch Version Checking Issue

In openselfsup/models/necks.py, the PyTorch version is checked in order to properly support SyncBN.

        if StrictVersion(torch.__version__) < StrictVersion("1.4.0"):
            self.expand_for_syncbn = True
        else:
            self.expand_for_syncbn = False

However, it seems that StrictVersion (from distutils.version) cannot handle PyTorch version ids like 1.6.0a0+b0b9e70. Basically, it will raise:

ValueError: invalid version number '1.6.0a0+b0b9e70'
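
A minimal sketch of one possible fix, replacing StrictVersion with packaging.version.parse, which accepts PEP 440 local version ids such as 1.6.0a0+b0b9e70:

    from packaging.version import parse
    import torch

    # Inside the neck's __init__, mirroring the snippet above; parse()
    # handles local version ids like "1.6.0a0+b0b9e70" that StrictVersion
    # rejects.
    self.expand_for_syncbn = parse(torch.__version__) < parse("1.4.0")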

question about training moco

I used configs/selfsup/moco/r50_v2.py to train MoCo, but I found the loss increases to 9.60 at the start and then decreases slowly. Is this normal?

about BYOL performance

Hi,
Where can I find the config file for the BYOL model byol_r50-e3b0c442.pth in the Model Zoo?
I couldn't find 'selfsup/byol/r50_bs4096_ep200.py' in this git repository.
Also, I fine-tuned on an object detection task in mmdetection with faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py, and the bbox_mAP is only 27.0, which is too low compared with the 40.3 on your benchmark page.
Even though the architecture and some hyper-params differ from yours, the difference is too big.
With the ImageNet-pretrained (PyTorch) model, the mAP reproduces as 37.2, and with my personal BYOL model it is 36.3.
Do you have any idea about this accuracy drop with your BYOL model?
I trained it on 4 GPUs with 8 samples per GPU, so I increased the learning rate to 0.04 (x2) as guided in the mmdetection documentation.

SimCLR v2

Hi, thanks for your excellent work!
Is there a plan to add SimCLR v2 to this framework?

Linear evaluation

Thanks for your excellent work! But I have trouble with linear evaluation.
I got the model from the Model Zoo (BYOL (@XiaohangZhan) | selfsup/byol/r50_bs4096_ep200.py (@xieenze) | byol_r50-e3b0c442.pth), then used it for linear evaluation with the config configs/benchmarks/linear_classification/imagenet/r50_last.py. The result is very bad: the accuracy is very low and the loss does not converge.

INFO - head0_top1: 0.100
2021-01-06 10:17:54,353 - openselfsup - INFO - head0_top5: 0.530
2021-01-06 10:19:28,189 - openselfsup - INFO - Epoch [1][50/10010] lr: 3.000e+01, eta: 21 days, 16:32:18, time: 1.872, data_time: 1.130, memory: 3223, loss: 6.9107, acc: 0.0781
2021-01-06 10:20:23,091 - openselfsup - INFO - Epoch [1][100/10010] lr: 3.000e+01, eta: 17 days, 4:53:54, time: 1.098, data_time: 0.005, memory: 3223, loss: 6.9106, acc: 0.0938
2021-01-06 10:21:22,120 - openselfsup - INFO - Epoch [1][150/10010] lr: 3.000e+01, eta: 16 days, 0:39:29, time: 1.181, data_time: 0.212, memory: 3223, loss: 6.9093, acc: 0.1875
2021-01-06 10:22:06,845 - openselfsup - INFO - Epoch [1][200/10010] lr: 3.000e+01, eta: 14 days, 14:38:47, time: 0.894, data_time: 0.044, memory: 3223, loss: 6.9090, acc: 0.1406
2021-01-06 10:22:49,104 - openselfsup - INFO - Epoch [1][250/10010] lr: 3.000e+01, eta: 13 days, 15:29:32, time: 0.845, data_time: 0.012, memory: 3223, loss: 6.9092, acc: 0.0625
2021-01-06 10:23:24,521 - openselfsup - INFO - Epoch [1][300/10010] lr: 3.000e+01, eta: 12 days, 17:42:46, time: 0.708, data_time: 0.194, memory: 3223, loss: 6.9103, acc: 0.0781
2021-01-06 10:24:01,836 - openselfsup - INFO - Epoch [1][350/10010] lr: 3.000e+01, eta: 12 days, 3:39:37, time: 0.746, data_time: 0.080, memory: 3223, loss: 6.9107, acc: 0.0938
2021-01-06 10:24:39,800 - openselfsup - INFO - Epoch [1][400/10010] lr: 3.000e+01, eta: 11 days, 17:34:12, time: 0.759, data_time: 0.553, memory: 3223, loss: 6.9110, acc: 0.0781
2021-01-06 10:25:42,831 - openselfsup - INFO - Epoch [1][450/10010] lr: 3.000e+01, eta: 12 days, 1:12:04, time: 1.261, data_time: 0.118, memory: 3223, loss: 6.9087, acc: 0.1250
2021-01-06 10:26:34,674 - openselfsup - INFO - Epoch [1][500/10010] lr: 3.000e+01, eta: 12 days, 1:05:01, time: 1.037, data_time: 0.009, memory: 3223, loss: 6.9097, acc: 0.0312
2021-01-06 10:27:19,907 - openselfsup - INFO - Epoch [1][550/10010] lr: 3.000e+01, eta: 11 days, 21:38:43, time: 0.905, data_time: 0.071, memory: 3223, loss: 6.9083, acc: 0.1875
2021-01-06 10:27:57,860 - openselfsup - INFO - Epoch [1][600/10010] lr: 3.000e+01, eta: 11 days, 15:24:23, time: 0.759, data_time: 0.006, memory: 3223, loss: 6.9091, acc: 0.0469
2021-01-06 10:28:35,339 - openselfsup - INFO - Epoch [1][650/10010] lr: 3.000e+01, eta: 11 days, 9:55:22, time: 0.750, data_time: 0.240, memory: 3223, loss: 6.9095, acc: 0.0781
2021-01-06 10:29:15,188 - openselfsup - INFO - Epoch [1][700/10010] lr: 3.000e+01, eta: 11 days, 6:09:42, time: 0.797, data_time: 0.010, memory: 3223, loss: 6.9095, acc: 0.0000
2021-01-06 10:29:50,669 - openselfsup - INFO - Epoch [1][750/10010] lr: 3.000e+01, eta: 11 days, 1:16:58, time: 0.710, data_time: 0.490, memory: 3223, loss: 6.9088, acc: 0.1094
2021-01-06 10:30:55,669 - openselfsup - INFO - Epoch [1][800/10010] lr: 3.000e+01, eta: 11 days, 7:15:50, time: 1.300, data_time: 1.190, memory: 3223, loss: 6.9089, acc: 0.1094
2021-01-06 10:31:42,625 - openselfsup - INFO - Epoch [1][850/10010] lr: 3.000e+01, eta: 11 days, 6:38:31, time: 0.939, data_time: 0.574, memory: 3223, loss: 6.9096, acc: 0.0625
2021-01-06 10:32:24,059 - openselfsup - INFO - Epoch [1][900/10010] lr: 3.000e+01, eta: 11 days, 4:22:58, time: 0.829, data_time: 0.685, memory: 3223, loss: 6.9096, acc: 0.1094
2021-01-06 10:33:04,725 - openselfsup - INFO - Epoch [1][950/10010] lr: 3.000e+01, eta: 11 days, 2:08:10, time: 0.813, data_time: 0.138, memory: 3223, loss: 6.9096, acc: 0.1250
2021-01-06 10:33:43,538 - openselfsup - INFO - Epoch [1][1000/10010] lr: 3.000e+01, eta: 10 days, 23:35:52, time: 0.776, data_time: 0.016, memory: 3223, loss: 6.9105, acc: 0.0938
2021-01-06 10:34:18,924 - openselfsup - INFO - Epoch [1][1050/10010] lr: 3.000e+01, eta: 10 days, 20:23:37, time: 0.708, data_time: 0.107, memory: 3223, loss: 6.9107, acc: 0.1250
2021-01-06 10:34:53,551 - openselfsup - INFO - Epoch [1][1100/10010] lr: 3.000e+01, eta: 10 days, 17:17:18, time: 0.693, data_time: 0.028, memory: 3223, loss: 6.9074, acc: 0.1250
2021-01-06 10:36:03,863 - openselfsup - INFO - Epoch [1][1150/10010] lr: 3.000e+01, eta: 10 days, 23:04:14, time: 1.406, data_time: 0.225, memory: 3223, loss: 6.9089, acc: 0.0781
2021-01-06 10:36:50,300 - openselfsup - INFO - Epoch [1][1200/10010] lr: 3.000e+01, eta: 10 days, 22:50:38, time: 0.929, data_time: 0.495, memory: 3223, loss: 6.9094, acc: 0.0781
2021-01-06 10:37:32,179 - openselfsup - INFO - Epoch [1][1250/10010] lr: 3.000e+01, eta: 10 days, 21:37:18, time: 0.838, data_time: 0.497, memory: 3223, loss: 6.9096, acc: 0.0781
2021-01-06 10:38:13,449 - openselfsup - INFO - Epoch [1][1300/10010] lr: 3.000e+01, eta: 10 days, 20:21:44, time: 0.825, data_time: 0.171, memory: 3223, loss: 6.9097, acc: 0.1250
2021-01-06 10:38:49,444 - openselfsup - INFO - Epoch [1][1350/10010] lr: 3.000e+01, eta: 10 days, 18:06:38, time: 0.720, data_time: 0.458, memory: 3223, loss: 6.9100, acc: 0.1719
2021-01-06 10:39:23,363 - openselfsup - INFO - Epoch [1][1400/10010] lr: 3.000e+01, eta: 10 days, 15:36:25, time: 0.678, data_time: 0.192, memory: 3223, loss: 6.9101, acc: 0.0781
2021-01-06 10:40:13,955 - openselfsup - INFO - Epoch [1][1450/10010] lr: 3.000e+01, eta: 10 days, 16:28:05, time: 1.012, data_time: 0.242, memory: 3223, loss: 6.9092, acc: 0.1562
2021-01-06 10:41:16,828 - openselfsup - INFO - Epoch [1][1500/10010] lr: 3.000e+01, eta: 10 days, 19:32:39, time: 1.257, data_time: 0.360, memory: 3223, loss: 6.9111, acc: 0.0625
2021-01-06 10:42:02,019 - openselfsup - INFO - Epoch [1][1550/10010] lr: 3.000e+01, eta: 10 days, 19:15:12, time: 0.904, data_time: 0.324, memory: 3223, loss: 6.9088, acc: 0.0625
2021-01-06 10:42:43,456 - openselfsup - INFO - Epoch [1][1600/10010] lr: 3.000e+01, eta: 10 days, 18:19:43, time: 0.829, data_time: 0.141, memory: 3223, loss: 6.9093, acc: 0.0312
2021-01-06 10:43:22,875 - openselfsup - INFO - Epoch [1][1650/10010] lr: 3.000e+01, eta: 10 days, 17:07:12, time: 0.788, data_time: 0.326, memory: 3223, loss: 6.9098, acc: 0.0469
2021-01-06 10:43:58,435 - openselfsup - INFO - Epoch [1][1700/10010] lr: 3.000e+01, eta: 10 days, 15:21:05, time: 0.711, data_time: 0.283, memory: 3223, loss: 6.9104, acc: 0.1094
2021-01-06 10:44:35,390 - openselfsup - INFO - Epoch [1][1750/10010] lr: 3.000e+01, eta: 10 days, 13:54:17, time: 0.739, data_time: 0.651, memory: 3223, loss: 6.9094,

Training speed benchmark

Hi @XiaohangZhan
It would be nice to provide a training speed benchmark for these methods.
It doesn't have to be the actual total training time (sometimes training is interrupted, I know). Maybe just a seconds/epoch comparison would be a good illustration of the speed of each method.
Many thx!

some problems about imagenet

IsADirectoryError: [Errno 21] Is a directory: 'data/imagenet/train/n04335435'

1. Can you show an example of 'train.txt' and 'train_labeled.txt'?
2. For train_labeled.txt (for linear evaluation, "filename[space]label\n" in each line), can you explain the "label" more clearly?

SimCLRv2?

Hello,
Have you implemented SimCLRv2 as a baseline in this repo?
Best regards,
Jizong

Reading data doesn't work (ImageNet) moco

Thanks for this good work.
I set up the environment per the requirements. The Pillow version is kept under 6.2. However, it reports errors when I train MoCo on ImageNet.
Have you encountered such errors?

.local/lib/python3.6/site-packages/torchvision/transforms/transforms.py", line 636, in get_params
area = img.size[0] * img.size[1]
AttributeError: 'tuple' object has no attribute 'size'

About the Imagenet data preparation

Thanks for your repo! It's really helpful for me.
But when I was preparing ImageNet and Places205 (since my research area is medical imaging, I'm not familiar with these),
[image]
I found that the shared URL was empty.
So can you provide the right URL for preparing ImageNet? Thank you very much!

BN in linear classification stage of simclr

Based on your implementation, I found that bn(affine=False) is used when fine-tuning with SimCLR. But it seems that BN is not used in the official TF implementation.

I also re-implemented SimCLR with the same MLP architecture (referring to your LARS optimizer and using the Apex sync BN method). I found that the accuracy is much lower without BN in the linear-classification stage. Is BN necessary in a PyTorch-based implementation?
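
For concreteness, the linear-probe head described above can be sketched as follows, assuming 2048-d ResNet-50 features and 1000 ImageNet classes:

    import torch.nn as nn

    # Linear evaluation head with a non-affine BN in front of the
    # classifier, i.e. the bn(affine=False) setup discussed above:
    # features are normalized with batch statistics but get no learned
    # scale or shift.
    linear_head = nn.Sequential(
        nn.BatchNorm1d(2048, affine=False),
        nn.Linear(2048, 1000),
    )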

Standard of shared region

This is what I wanted!
And, I have a question.

Your figure about the relations among Unsupervised Learning, Self-Supervised Learning, and Representation Learning is impressive.

But what is the reason that de-occlusion and correspondence are not included in the shared region?

De-occlusion predicts the occluded region, and a process using correspondence must understand the true relations between corresponding pixels or patches.
Through this, I think the representation could be learned.

thank you.

Any update about the detection results?

I notice that INSTALL.md has updated the detection part while the results remain empty. I wonder whether you have reproduced the results on detection tasks as reported in the MoCo paper.
Thanks.

Some results on Apex mixed precision (AMP) training and large-batch training for MoCo v2

Hi, Thanks for providing such an amazing repo on self-supervised learning.

For SSL, it's important to validate its value on large-scale datasets, so it is critical that these algorithms can easily adopt large-batch training and mixed-precision training without a huge performance drop.

My experiments showed that AMP and large batches did not hurt MoCo v2 much, and Apex opt_level=O1 can double the throughput on some types of GPUs. Currently, I'm afraid I can't spare much time integrating Apex into this repo, so I am only providing some basic results on AMP & large-batch training for MoCo v2.

ImageNet1k               | AMP & Large batch  | Baseline
Batch size               | 4096               | 256
Mixed Precision Training | Apex opt_level: O1 | FP32
Init Lr                  | 0.48               | 0.03
WarmUp                   | No                 | No
Pre-training Loss        | 6.624              | 6.572
ImageNet Linear protocol | 66.60              | 67.28

AMP training on BYOL leads to NaN loss in linear eval

With AMP, I trained BYOL on 2 machines with 8 GPUs each, 16 GPUs in total. I did not change anything except the per-GPU batch size, from 32 to 128. Linear eval encountered a NaN loss.
Without AMP (same environment and same config file, just with AMP removed), the linear eval produced a normal loss.

I wonder if someone else has run into this problem. Is it an incompatibility between BYOL and AMP, or did I do something wrong?

About the performance of BYOL

[image: BYOL results table]

According to the above table, the results of BYOL are not fully reproduced. Were you able to fully reproduce the result?

Could you provide some examples of dataset setup?

Hi guys, thanks for the great work. I am trying to build upon your ODC network, but I found the code structure convoluted to understand. If I just want to retrain the ODC network, how should I set up my environment to run the bash file?

For example, what are data_train_list and data_train_root?

reproduce results and pre-trained models

Hello OpenSelfSup team,
Thank you for open-sourcing the amazing tool for SSL.
In the Model_Zoo, you report the Top-1 accuracy for many methods. I want to know how to reproduce them. Are they obtained using the exact same parameters you provide in the config files?
I find some parameters conflict with the settings you mention in the Readme file.
For example, how many training epochs should be used in the linear classification experiment, 100 or 200? In the config it says 100, but in your readme you mention most experiments use 200.

Another question: will you kindly release the pretrained models of the second stage (training on ImageNet with frozen features)?
For now, you provide the pretrained models of the first stage (self-supervised training). I really appreciate that.

Thank you !!

No module named 'openselfsup'

After I run the command below:
bash tools/dist_train.sh configs/selfsup/odc/r50_v1.py 4

I get the error below (the same traceback is printed by each process):

Traceback (most recent call last):
  File "tools/train.py", line 13, in <module>
    from openselfsup import __version__
ModuleNotFoundError: No module named 'openselfsup'

How to solve it?

Thanks.

SimCLR distributed training

Hi! Thanks for this awesome repo. The official SimCLR implementation (https://github.com/google-research/simclr) does not provide a multi-gpu implementation "for reasons such as global BatchNorm and contrastive loss across cores." Does this implementation solve that issue, and if so, does it attain the same performance as the official code/model?

About the hyperparameters setting in BYOL

This is an outstanding project and sorry to bother you again.
I notice that the hyperparameter settings reported in the BYOL paper differ between training for 1000 epochs and training for 300 epochs (0.996 vs. 0.99 for the exponential-moving-average parameter, 0.2 vs. 0.3 for the learning rate, and 1.5×10⁻⁶ vs. 10⁻⁶). However, this project reproduces BYOL with 200 epochs while using the 1000-epoch hyperparameter settings. Could this cause the gap between the reproduced result and the one reported in the paper?
Hoping for your reply, thanks!

About BYOL implementations

https://github.com/open-mmlab/OpenSelfSup/blob/3272f765c5b7d5bee9772233a5a4d7e3fb66e5bf/openselfsup/models/byol.py#L92

Dear Xiaohang,

Thanks for the excellent job!

I have noticed that there are four feed-forward passes in the following lines in BYOL.

proj_online_v1 = self.online_net(img_v1)[0]
proj_online_v2 = self.online_net(img_v2)[0]
with torch.no_grad():
  proj_target_v1 = self.target_net(img_v1)[0].clone().detach()
  proj_target_v2 = self.target_net(img_v2)[0].clone().detach()

I also noticed that, in an earlier version, the pairs of images were concatenated so that only two feed-forward passes were needed.

img_cat1 = torch.cat([img_v1, img_v2], dim=0)
img_cat2 = torch.cat([img_v2, img_v1], dim=0)
proj_online = self.online_net(img_cat1)[0]
with torch.no_grad():
  proj_target = self.target_net(img_cat2)[0].clone().detach()

My understanding is that these two implementations are equivalent except that the older one computes BN statistics slightly differently (both views share one batch). However, the newer one is more time-consuming than the older one. Could you provide some explanation of the possible consequences of using the older implementation? As we know, you marked this change as "fix BYOL" in the commit history.

I am looking forward to your reply.

Relation between learning rate and batch size

I found that there are some differences between your learning-rate setting and the paper's. Why did you set it like this?

In byol/r50_bs256_accumulate16_ep300.py:

    optimizer = dict(
        type='LARS', lr=4.8, weight_decay=0.000001, momentum=0.9,
        paramwise_options={
            '(bn|gn)(\d+)?.(weight|bias)': dict(weight_decay=0., lars_exclude=True),
            'bias': dict(weight_decay=0., lars_exclude=True)})

However, in the paper:
LearningRate = 0.2 × BatchSize/256
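
Numerically, the config's value is consistent with the paper's rule at a different base learning rate. A quick illustrative check (the 256 × 16 effective batch size is inferred from the config name):

    def scaled_lr(base_lr, batch_size, base_batch_size=256):
        # Linear scaling rule: lr = base_lr * batch_size / 256.
        return base_lr * batch_size / base_batch_size

    # Effective batch size of r50_bs256_accumulate16: 256 * 16 = 4096.
    print(scaled_lr(0.2, 4096))  # 3.2 -- the paper's base_lr of 0.2
    print(scaled_lr(0.3, 4096))  # 4.8 -- the config's lr=4.8 implies base_lr=0.3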

training on multi-gpu

Hi,
I am really new to self-supervised learning. I would like to know how you adjust the number of epochs when training on multiple GPUs.

I intend to train on ImageNet. Unfortunately, I cannot fit a mini-batch of 256 on one GPU. As far as I understand, when I train on 4 GPUs, even if I train for 100 epochs, the number of epochs trained will effectively be 100/4 = 25, since the number of gradient updates is also divided by 4. Please correct me if I am wrong.

Please let me know how you have accounted for that in your implementation.

Thanks
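
For reference, a small sanity check of the arithmetic involved, assuming the OpenMMLab convention that imgs_per_gpu is per process, so the effective batch size grows with the GPU count while an epoch remains one full pass over the dataset:

    imgs_per_gpu = 64
    num_gpus = 4
    effective_batch = imgs_per_gpu * num_gpus  # 256 in total across GPUs

    # One epoch is still one pass over the data, whatever the GPU count:
    dataset_size = 1_281_167                   # ImageNet-1k training images
    steps_per_epoch = dataset_size // effective_batch  # ~5004 updates/epoch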

Can you show the logs of linear classification training (ImageNet) using BYOL? The loss is too large

I strictly followed the readme and trained linear classification using BYOL (res50_last.py) without any modifications. The log shows that the loss is large (60~100) while the accuracy is not too low. Is this right?
Thanks!

2020-10-25 07:53:50,900 - openselfsup - INFO - Epoch [1][50/5005] lr: 3.000e+01, eta: 3 days, 11:02:10, time: 0.597, data_time: 0.514, memory: 3223, loss: 60.4163, acc: 7.1172
2020-10-25 07:53:54,992 - openselfsup - INFO - Epoch [1][100/5005] lr: 3.000e+01, eta: 1 day, 23:12:19, time: 0.082, data_time: 0.018, memory: 3223, loss: 133.8791, acc: 22.9766
2020-10-25 07:53:59,791 - openselfsup - INFO - Epoch [1][150/5005] lr: 3.000e+01, eta: 1 day, 11:54:50, time: 0.096, data_time: 0.014, memory: 3223, loss: 135.3538, acc: 28.3828
2020-10-25 07:54:04,156 - openselfsup - INFO - Epoch [1][200/5005] lr: 3.000e+01, eta: 1 day, 5:57:34, time: 0.087, data_time: 0.014, memory: 3223, loss: 126.6373, acc: 30.9141
2020-10-25 07:54:08,352 - openselfsup - INFO - Epoch [1][250/5005] lr: 3.000e+01, eta: 1 day, 2:18:07, time: 0.084, data_time: 0.016, memory: 3223, loss: 128.7639, acc: 34.4141
2020-10-25 07:54:12,944 - openselfsup - INFO - Epoch [1][300/5005] lr: 3.000e+01, eta: 1 day, 0:02:29, time: 0.092, data_time: 0.011, memory: 3223, loss: 96.6740, acc: 34.8281
2020-10-25 07:54:16,941 - openselfsup - INFO - Epoch [1][350/5005] lr: 3.000e+01, eta: 22:11:29, time: 0.080, data_time: 0.012, memory: 3223, loss: 87.8078, acc: 36.4219
2020-10-25 07:54:21,498 - openselfsup - INFO - Epoch [1][400/5005] lr: 3.000e+01, eta: 20:59:54, time: 0.091, data_time: 0.016, memory: 3223, loss: 101.1486, acc: 37.6719
2020-10-25 07:54:25,671 - openselfsup - INFO - Epoch [1][450/5005] lr: 3.000e+01, eta: 19:57:01, time: 0.083, data_time: 0.012, memory: 3223, loss: 105.0239, acc: 38.2734
2020-10-25 07:54:29,829 - openselfsup - INFO - Epoch [1][500/5005] lr: 3.000e+01, eta: 19:06:33, time: 0.083, data_time: 0.015, memory: 3223, loss: 72.1632, acc: 39.5703
2020-10-25 07:54:34,256 - openselfsup - INFO - Epoch [1][550/5005] lr: 3.000e+01, eta: 18:29:16, time: 0.089, data_time: 0.012, memory: 3223, loss: 73.9237, acc: 39.6719
2020-10-25 07:54:38,298 - openselfsup - INFO - Epoch [1][600/5005] lr: 3.000e+01, eta: 17:52:50, time: 0.081, data_time: 0.014, memory: 3223, loss: 65.0792, acc: 41.4375
2020-10-25 07:54:42,677 - openselfsup - INFO - Epoch [1][650/5005] lr: 3.000e+01, eta: 17:26:20, time: 0.088, data_time: 0.012, memory: 3223, loss: 53.3663, acc: 41.1719
2020-10-25 07:54:46,688 - openselfsup - INFO - Epoch [1][700/5005] lr: 3.000e+01, eta: 16:59:14, time: 0.080, data_time: 0.012, memory: 3223, loss: 59.9644, acc: 40.5469
2020-10-25 07:54:51,108 - openselfsup - INFO - Epoch [1][750/5005] lr: 3.000e+01, eta: 16:40:15, time: 0.088, data_time: 0.025, memory: 3223, loss: 84.6936, acc: 40.7188
2020-10-25 07:54:55,247 - openselfsup - INFO - Epoch [1][800/5005] lr: 3.000e+01, eta: 16:20:43, time: 0.083, data_time: 0.014, memory: 3223, loss: 76.7088, acc: 41.5156
2020-10-25 07:54:59,423 - openselfsup - INFO - Epoch [1][850/5005] lr: 3.000e+01, eta: 16:03:52, time: 0.084, data_time: 0.012, memory: 3223, loss: 73.0966, acc: 41.8984
2020-10-25 07:55:03,761 - openselfsup - INFO - Epoch [1][900/5005] lr: 3.000e+01, eta: 15:50:18, time: 0.087, data_time: 0.014, memory: 3223, loss: 57.4366, acc: 41.5781
2020-10-25 07:55:08,303 - openselfsup - INFO - Epoch [1][950/5005] lr: 3.000e+01, eta: 15:39:59, time: 0.091, data_time: 0.012, memory: 3223, loss: 71.7468, acc: 42.8359
2020-10-25 07:55:12,468 - openselfsup - INFO - Epoch [1][1000/5005] lr: 3.000e+01, eta: 15:27:35, time: 0.083, data_time: 0.013, memory: 3223, loss: 57.2820, acc: 42.9297
2020-10-25 07:55:16,611 - openselfsup - INFO - Epoch [1][1050/5005] lr: 3.000e+01, eta: 15:16:11, time: 0.083, data_time: 0.015, memory: 3223, loss: 68.9601, acc: 42.4688
2020-10-25 07:55:20,967 - openselfsup - INFO - Epoch [1][1100/5005] lr: 3.000e+01, eta: 15:07:23, time: 0.087, data_time: 0.014, memory: 3223, loss: 54.9672, acc: 42.8125
2020-10-25 07:55:25,295 - openselfsup - INFO - Epoch [1][1150/5005] lr: 3.000e+01, eta: 14:59:11, time: 0.087, data_time: 0.014, memory: 3223, loss: 69.3953, acc: 43.3672
2020-10-25 07:55:29,585 - openselfsup - INFO - Epoch [1][1200/5005] lr: 3.000e+01, eta: 14:51:23, time: 0.086, data_time: 0.013, memory: 3223, loss: 84.0508, acc: 43.1328
2020-10-25 07:55:33,901 - openselfsup - INFO - Epoch [1][1250/5005] lr: 3.000e+01, eta: 14:44:22, time: 0.086, data_time: 0.017, memory: 3223, loss: 72.7478, acc: 43.1797
2020-10-25 07:55:38,426 - openselfsup - INFO - Epoch [1][1300/5005] lr: 3.000e+01, eta: 14:39:13, time: 0.090, data_time: 0.017, memory: 3223, loss: 109.9391, acc: 44.0781
2020-10-25 07:55:42,844 - openselfsup - INFO - Epoch [1][1350/5005] lr: 3.000e+01, eta: 14:33:47, time: 0.088, data_time: 0.030, memory: 3223, loss: 50.3926, acc: 44.5156
2020-10-25 07:55:47,294 - openselfsup - INFO - Epoch [1][1400/5005] lr: 3.000e+01, eta: 14:28:57, time: 0.089, data_time: 0.013, memory: 3223, loss: 71.3163, acc: 44.1250
2020-10-25 07:55:51,594 - openselfsup - INFO - Epoch [1][1450/5005] lr: 3.000e+01, eta: 14:23:32, time: 0.086, data_time: 0.015, memory: 3223, loss: 96.4926, acc: 44.5156
2020-10-25 07:55:56,015 - openselfsup - INFO - Epoch [1][1500/5005] lr: 3.000e+01, eta: 14:19:11, time: 0.088, data_time: 0.022, memory: 3223, loss: 84.2186, acc: 44.6328

Any Plans about AMP support?

Many MMLab projects use NVIDIA Apex to support AMP.
I wonder whether there is a plan to support it in this project.
Thanks!
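
For context, the usual Apex O1 pattern can be sketched as below; model, optimizer, and loader are stand-ins for the user's own objects, and this is not an officially supported path in the repo:

    from apex import amp

    # O1 patches most ops to run in fp16 while keeping master weights in fp32.
    model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

    for img, _ in loader:
        loss = model(img)
        optimizer.zero_grad()
        # Loss scaling guards fp16 gradients against underflow.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()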

Error for deepcluster experiment

Hi OpenSelfSup team,
Thank you for your amazing work.
I want to report a bug I encountered in the DeepCluster experiment.

To reproduce:
bash tools/dist_train.sh configs/selfsup/deepcluster/r50.py 4

I get the error below:

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 2503/2503, 5.6 task/s, elapsed: 445s, ETA:     0s
PCA from dim 2048 to dim 256
Traceback (most recent call last):
  File "tools/train.py", line 145, in <module>
    main()
  File "tools/train.py", line 141, in main
    meta=meta)
  File "/home/qiang/codefiles/ssl/adv_ssl/openselfsup/apis/train.py", line 95, in train_model
    model, dataset, cfg, logger=logger, timestamp=timestamp, meta=meta)
  File "/home/qiang/codefiles/ssl/adv_ssl/openselfsup/apis/train.py", line 219, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/qiang/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 103, in run
    self.call_hook('before_run')
  File "/home/qiang/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 282, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/qiang/codefiles/ssl/adv_ssl/openselfsup/hooks/deepcluster_hook.py", line 43, in before_run
    self.deepcluster(runner)
  File "/home/qiang/codefiles/ssl/adv_ssl/openselfsup/hooks/deepcluster_hook.py", line 61, in deepcluster
    clustering_algo.cluster(features, verbose=True)
  File "/home/qiang/codefiles/ssl/adv_ssl/openselfsup/third_party/clustering.py", line 134, in cluster
    I, loss = run_kmeans(xb, self.k, verbose)
  File "/home/qiang/codefiles/ssl/adv_ssl/openselfsup/third_party/clustering.py", line 100, in run_kmeans
    losses = faiss.vector_to_array(clus.obj)
  File "/home/qiang/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/faiss/swigfaiss_avx2.py", line 1450, in <lambda>
    __getattr__ = lambda self, name: _swig_getattr(self, Clustering, name)
  File "/home/qiang/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/faiss/swigfaiss_avx2.py", line 80, in _swig_getattr
    raise AttributeError("'%s' object has no attribute '%s'" % (class_type.__name__, name))
AttributeError: 'Clustering' object has no attribute 'obj'
Traceback (most recent call last):
  File "/home/qiang/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/qiang/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/qiang/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/home/qiang/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)

My environment:

conda create -n open-mmlab python=3.7 -y  

conda activate open-mmlab  

conda install pytorch=1.6 torchvision cudatoolkit=10.1 faiss-gpu -c pytorch  

pip install -v -e .  # or "python setup.py develop"

Any suggestions on how to fix this? Is it a bug, or did I make a mistake?

Thank you!

With best regards,
Guocheng

Error with GaussianBlur

The latest torchvision has implemented GaussianBlur (pytorch/vision@4106dbb). The current code will cause an error due to duplicate registration of GaussianBlur. Would it be better to use torchvision's GaussianBlur or to rename GaussianBlur in this repo?
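
A sketch of the rename option, assuming the OpenSelfSup-style PIPELINES registry; the import path and the name RandomGaussianBlur are hypothetical:

    import random
    from PIL import ImageFilter
    from openselfsup.datasets.registry import PIPELINES  # assumed path

    @PIPELINES.register_module
    class RandomGaussianBlur(object):
        """Same behavior as the repo's GaussianBlur, under a non-clashing name."""

        def __init__(self, sigma_min=0.1, sigma_max=2.0):
            self.sigma_min = sigma_min
            self.sigma_max = sigma_max

        def __call__(self, img):
            # Sample a blur radius per call, as in the SimCLR augmentation.
            sigma = random.uniform(self.sigma_min, self.sigma_max)
            return img.filter(ImageFilter.GaussianBlur(radius=sigma))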

Performance of linear classification

2021-01-22 14:14:28,948 - openselfsup - INFO - Epoch [90][4900/5005] lr: 1.000e-04, eta: 0:00:11, time: 0.073, data_time: 0.016, memory: 3990, loss.1: 5.5691, acc.1: 9.6953, loss.2: 4.0403, acc.2: 26.8594, loss.3: 3.0600, acc.3: 40.9844, loss.4: 1.8161, acc.4: 61.0156, loss.5: 1.0221, acc.5: 74.2734, loss: 15.5076
2021-01-22 14:14:32,976 - openselfsup - INFO - Epoch [90][4950/5005] lr: 1.000e-04, eta: 0:00:06, time: 0.081, data_time: 0.016, memory: 3990, loss.1: 5.5685, acc.1: 9.6875, loss.2: 4.0313, acc.2: 27.5000, loss.3: 3.0790, acc.3: 40.4375, loss.4: 1.8001, acc.4: 61.6094, loss.5: 1.0146, acc.5: 74.8906, loss: 15.4934
2021-01-22 14:14:36,655 - openselfsup - INFO - Epoch [90][5000/5005] lr: 1.000e-04, eta: 0:00:00, time: 0.074, data_time: 0.016, memory: 3990, loss.1: 5.5739, acc.1: 9.1875, loss.2: 4.0170, acc.2: 27.0781, loss.3: 3.0770, acc.3: 40.3750, loss.4: 1.7988, acc.4: 61.3906, loss.5: 1.0219, acc.5: 74.2344, loss: 15.4886
2021-01-22 14:14:38,422 - openselfsup - INFO - Saving checkpoint at 90 epochs
2021-01-22 14:17:22,464 - openselfsup - INFO - head0_top1: 0.734
2021-01-22 14:17:22,748 - openselfsup - INFO - head1_top1: 0.934
2021-01-22 14:17:23,032 - openselfsup - INFO - head2_top1: 1.104
2021-01-22 14:17:23,316 - openselfsup - INFO - head3_top1: 1.176
2021-01-22 14:17:23,601 - openselfsup - INFO - head4_top1: 1.172
2021-01-22 14:17:23,684 - openselfsup - INFO - Epoch(val) [90][5005] head0_top1: 0.7340, head1_top1: 0.9340, head2_top1: 1.1040, head3_top1: 1.1760, head4_top1: 1.1720

The command line is:
bash benchmarks/dist_train_linear.sh configs/benchmarks/linear_classification/imagenet/r50_multihead.py pretrains/byol_r50_bs2048_accmulate2_ep200-e3b0c442.pth

However, the top-1 accuracies are very strange. I have pulled the newest code.

About the performance of SimCLR

This is a wonderful project!
I found a bug in the code: the word 'CosineAnealing' in the config file 'configs/selfsup/moco/r50_v2.py' should be 'CosineAnnealing'; I think this bug results from the spelling error fixed in mmcv.
I am also confused about the performance of SimCLR, which is much lower than that reported in the paper (e.g. 64.5% vs. 75.5% Top-5 accuracy for semi-supervised learning on 1% of ImageNet). Is this mainly because of the batch size (the batch size of your reproduction is only 256)?
Looking forward to your reply!

About the head of byol

It seems that the concepts backbone, neck, and head correspond to the encoder, projector, and predictor in the original BYOL paper. In the code here, the heads of the online network and target network are shared. However, the predictors are not shared in the original paper.
Am I missing something?
