
weighted-soft-label-distillation's Introduction

Rethinking soft labels for knowledge distillation: a bias-variance tradeoff perspective

Accepted by ICLR 2021

This is the official PyTorch implementation of the paper Rethinking soft labels for knowledge distillation: a bias-variance tradeoff perspective.

Requirements

  • Python >= 3.6
  • PyTorch >= 1.0.1

ImageNet Training

This code trains on ImageNet. Our pre-trained teacher models are the official PyTorch models. By default, we pack the ImageNet data into LMDB files for faster IO. The LMDB files can be created as follows.

  1. Generate the image list: python dataset/mk_img_list.py --image_path 'the path of your image data' --output_path 'the path to output the list file'

  2. Use the image list obtained above to build the LMDB file: python dataset/img2lmdb.py --image_path 'the path of your image data' --list_path 'the path of your image list' --output_path 'the path to output the lmdb file' --split 'split folder (train/val)'
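
For reference, here is a minimal sketch of how an LMDB file like the one produced above might be read back as a PyTorch Dataset. The key layout and value encoding assumed here (pickled (image bytes, label) tuples stored under zero-padded integer keys) are illustrative assumptions, not necessarily what dataset/img2lmdb.py writes; check that script for the actual format.

```python
import io
import pickle

import lmdb
from PIL import Image
from torch.utils.data import Dataset


class LMDBImageDataset(Dataset):
    """Sketch of an LMDB-backed image dataset (format assumptions noted above)."""

    def __init__(self, lmdb_path, transform=None):
        # Read-only environment; flags chosen to play nicely with DataLoader workers.
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False,
                             readahead=False, meminit=False)
        with self.env.begin(write=False) as txn:
            self.length = txn.stat()["entries"]
        self.transform = transform

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        with self.env.begin(write=False) as txn:
            value = txn.get(f"{index:08d}".encode())  # assumed key scheme
        img_bytes, label = pickle.loads(value)        # assumed value encoding
        img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, label
```

Such a dataset can then be wrapped in a standard torch.utils.data.DataLoader for training and validation.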

The main files are:

  • train_with_distillation.py: trains the model with our distillation method
  • imagenet_train_cfg.py: all dataset and hyperparameter settings
  • knowledge_distiller.py: our weighted soft label distillation loss
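
For intuition, here is a minimal sketch of a temperature-scaled distillation loss with per-sample weights on the soft-label term, in the spirit of weighted soft labels. The specific weighting rule shown (scaling each sample's KD term by how large the student's hard-label error is relative to the teacher's) and the hyperparameter names T and alpha are illustrative assumptions; the exact formulation used in this repo is in knowledge_distiller.py.

```python
import torch
import torch.nn.functional as F


def weighted_kd_loss(student_logits, teacher_logits, targets, T=2.0, alpha=2.5):
    """Illustrative weighted soft-label KD loss (not the repo's exact code)."""
    # Per-sample hard-label cross-entropy for student and teacher.
    ce_student = F.cross_entropy(student_logits, targets, reduction="none")
    ce_teacher = F.cross_entropy(teacher_logits, targets, reduction="none")

    # Illustrative sample weight in [0, 1): close to 1 when the student's error
    # is large relative to the teacher's, small on samples the teacher also
    # gets wrong (where its soft label is presumably less trustworthy).
    with torch.no_grad():
        weight = 1.0 - torch.exp(-ce_student / (ce_teacher + 1e-6))

    # Standard temperature-scaled KL divergence between teacher and student.
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    kd_per_sample = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=1)

    soft_loss = (weight * kd_per_sample).mean() * (T * T)
    hard_loss = ce_student.mean()
    return hard_loss + alpha * soft_loss
```

During training, teacher_logits would come from a frozen pre-trained teacher evaluated under torch.no_grad().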

Results

ImageNet

  • ResNet 18 (teacher: ResNet 34)

    Network      Method     Top-1 accuracy (%)
    ResNet 34    Teacher    73.31
    ResNet 18    Original   69.75
    ResNet 18    Proposed   72.04

  • MobileNetV1 (teacher: ResNet 50)

    Network       Method     Top-1 accuracy (%)
    ResNet 50     Teacher    76.16
    MobileNetV1   Original   68.87
    MobileNetV1   Proposed   71.52

Acknowledgments

This code builds on the following implementations: Overhaul and DenseNAS. Many thanks to their authors.

Reference

If you find this repo useful, please consider citing:

@inproceedings{zhou2021wsl,
  title={Rethinking soft labels for knowledge distillation: a bias-variance tradeoff perspective},
  author={Zhou, Helong and Song, Liangchen and Chen, Jiajie and Zhou, Ye and Wang, Guoli and Yuan, Junsong and Zhang, Qian},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2021}
}


weighted-soft-label-distillation's Issues

Hyper-parameter settings?

Hi, thank you for your nice work! Can you give me advice on how to set the temperature coefficient and the loss weight?

Hello, I have a question about training on CIFAR-100

Hi, I read your paper with great interest, thank you.

I implemented your method using your code here and tested it on CIFAR-100,
but in my case the gradients explode.
Training works well for the first 15 epochs, but from epoch 16 both the accuracy and the loss collapse to 0.
Even after reducing the learning rate (0.05 -> 0.01) I could not solve the exploding-gradient problem.
How can I fix this?

Thank you.

Assumption 1: a gap between "KD helps calibrate" and "KD reduces variance".

Your work is exciting and inspiring.

But there is a sizeable gap between "KD helps calibrate" and "KD reduces variance": the improvement could also come from bias reduction, i.e., the bias between the predicted probability and the accuracy, as defined in the calibration error.

In fact, as defined in Eq. 2 of Guo et al.'s calibration paper, couldn't the main cause of the reduced ECE be understood as a reduction in the bias of p?

Hi, I cannot reproduce your reported performance on CIFAR-100.

Hi there, I'm trying to use your method on CIFAR-100.
However, I cannot reproduce your performance even though I followed your script and hyper-parameter settings.
For instance, the ResNet110-ResNet32 pair reaches 74.12% in your paper, but my implementation reaches only 72.91%.
I was only able to reproduce your performance for the resnet56-resnet20 pair (72.01 / 72.15).
That is quite a large performance gap between your results and mine.
In addition, your repository only contains the ImageNet training script.
If you don't mind uploading the CIFAR-100 training script, I could train your method with it.

Thanks!

Minor questions on Eq.(2)

Hi, I have read your excellent work several times. The bias-variance idea is very interesting!

However, due to my limited knowledge of the bias-variance decomposition, I find the variance term in Equation (2) hard to understand; in particular, I do not see how to prove it.

I have referred to Heskes's paper, and it seems the derivation relies on the normalization constant Z in Equation (1). The identity is easy to prove if Z in Equation (1) is a constant, but I still cannot see the relationship between that term and Equation (2).

Could you please provide a detailed derivation of the variance term in Equation (2)? Thanks in advance.

Missing dataset files

Since the dataset directory is not provided, I tried to reproduce it myself, but there are some inconsistencies in the details. Could you please provide the dataset files?

KD loss keeps rising during training

Hello, I am using WSLD on my own dataset, but the KD loss keeps increasing. Is this normal? Could you please provide a training log on CIFAR-100 or ImageNet, so we can see what normal WSLD training looks like?

The pretrained teacher and hyper-parameters on CIFAR-100

Hi, thanks for the interesting work. I am trying to reproduce the results on CIFAR-100 but have failed so far, and I have some questions about the CIFAR-100 implementation. I would appreciate any suggestions. Specifically, is the training loss on CIFAR-100 the same as on ImageNet, except that $\alpha$ is set to 2.25 and T to 4? And are the pretrained, fixed teachers used in the experiments the same as those in CRD? Thank you in advance!
