Thanks for the nice work! I have a question regarding model training reported in the p

Question regarding Imagenet pretraining about uniformer (CLOSED)

MLDeS commented on July 18, 2024

Question regarding Imagenet pretraining


Andy1621 commented on July 18, 2024

All parts are initialized from ImageNet pretraining. For convolutions, if the temporal dimension is larger than 1, we copy the weights along the temporal dimension and average them. For self-attention, we copy the weights unchanged. Please check the code:

UniFormer/video_classification/slowfast/models/uniformer.py

Lines 387 to 421 in f92e423

    def inflate_weight(self, weight_2d, time_dim, center=False):
        if center:
            weight_3d = torch.zeros(*weight_2d.shape)
            weight_3d = weight_3d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
            middle_idx = time_dim // 2
            weight_3d[:, :, middle_idx, :, :] = weight_2d
        else:
            weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
            weight_3d = weight_3d / time_dim
        return weight_3d

    def get_pretrained_model(self, cfg):
        if cfg.UNIFORMER.PRETRAIN_NAME:
            checkpoint = torch.load(model_path[cfg.UNIFORMER.PRETRAIN_NAME], map_location='cpu')
            if 'model' in checkpoint:
                checkpoint = checkpoint['model']
            elif 'model_state' in checkpoint:
                checkpoint = checkpoint['model_state']

            state_dict_3d = self.state_dict()
            for k in checkpoint.keys():
                if checkpoint[k].shape != state_dict_3d[k].shape:
                    if len(state_dict_3d[k].shape) <= 2:
                        logger.info(f'Ignore: {k}')
                        continue
                    logger.info(f'Inflate: {k}, {checkpoint[k].shape} => {state_dict_3d[k].shape}')
                    time_dim = state_dict_3d[k].shape[2]
                    checkpoint[k] = self.inflate_weight(checkpoint[k], time_dim)

            if self.num_classes != checkpoint['head.weight'].shape[0]:
                del checkpoint['head.weight']
                del checkpoint['head.bias']
            return checkpoint
        else:
            return None
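To see the shape-based inflation loop in isolation, here is a minimal sketch on toy models. The two `nn.Sequential` models, the standalone `inflate_weight` function, and the final `load_state_dict(..., strict=False)` call are assumptions made for illustration; they are not taken from the repository.

```python
import torch
import torch.nn as nn


def inflate_weight(weight_2d, time_dim):
    # Averaged inflation: repeat along a new temporal axis, then divide by T.
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return weight_3d / time_dim


# Hypothetical stand-ins for an image model and its video counterpart.
image_model = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1))
video_model = nn.Sequential(nn.Conv3d(3, 4, kernel_size=(3, 3, 3), padding=1))

# Pretend this is the ImageNet checkpoint.
checkpoint = {k: v.clone() for k, v in image_model.state_dict().items()}
state_dict_3d = video_model.state_dict()

for k in list(checkpoint.keys()):
    if checkpoint[k].shape != state_dict_3d[k].shape:
        # Dimension 2 of a Conv3d weight is the temporal kernel size.
        time_dim = state_dict_3d[k].shape[2]
        checkpoint[k] = inflate_weight(checkpoint[k], time_dim)

# Assumption: the adapted checkpoint is loaded non-strictly.
missing, unexpected = video_model.load_state_dict(checkpoint, strict=False)
print(video_model[0].weight.shape)
```

Only the convolution weight changes shape here; the bias has the same shape in both models and is copied as is.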


MLDeS commented on July 18, 2024

Thanks a lot for the quick response, the pointer to the code helps a lot! Just two follow-up questions.

1. I understand that the ImageNet pretraining is done on the image-based UniFormer architectures and then transferred to the video UniFormer architectures by inflating the weights as above, right?
2. a) Is there a table comparing ImageNet pretraining against no pretraining? b) I see that Table 17 in the paper shows that inflating the weights to 3D performs better than 2D. What is the basis of this comparison? If it is a video model, wasn't the 3D inflation always done, whether centered on the middle slice or averaged equally across the time dimension? So what is the 2D comparison here?

Thanks a lot again for your time to answer the questions!


Andy1621 commented on July 18, 2024

For convolution inflation, I suggest reading the I3D paper.
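The property that motivates I3D-style inflation can be checked directly: on a video made of one repeated frame, an inflated 3D convolution gives the same response as the original 2D convolution on that frame. The sketch below is a standalone illustration with made-up tensor sizes, using a free-function copy of `inflate_weight` rather than the class method.

```python
import torch
import torch.nn.functional as F


def inflate_weight(weight_2d, time_dim, center=False):
    if center:
        # Put the 2D kernel in the middle temporal slice, zeros elsewhere.
        weight_3d = torch.zeros(*weight_2d.shape)
        weight_3d = weight_3d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
        weight_3d[:, :, time_dim // 2, :, :] = weight_2d
    else:
        # Repeat the 2D kernel over time and average.
        weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
        weight_3d = weight_3d / time_dim
    return weight_3d


torch.manual_seed(0)
time_dim = 3
weight_2d = torch.randn(8, 3, 3, 3)   # (out_channels, in_channels, kH, kW)
image = torch.randn(1, 3, 16, 16)     # (N, C, H, W)
# A "static" video: the same frame repeated time_dim times, (N, C, T, H, W).
video = image.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)

out_2d = F.conv2d(image, weight_2d, padding=1)
out_avg = F.conv3d(video, inflate_weight(weight_2d, time_dim), padding=(0, 1, 1))
out_ctr = F.conv3d(video, inflate_weight(weight_2d, time_dim, center=True),
                   padding=(0, 1, 1))

# No temporal padding, so the temporal output size is 1.
same_avg = torch.allclose(out_avg.squeeze(2), out_2d, atol=1e-4)
same_ctr = torch.allclose(out_ctr.squeeze(2), out_2d, atol=1e-4)
print(same_avg, same_ctr)
```

Both variants preserve the 2D response on static input; they differ only in how the temporal taps are initialized.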

As for your other questions:

1. Yes.
2. a) Without ImageNet pretraining, convergence is much slower; ImageNet pretraining is a common strategy in video training. b) "2D" means we do not inflate the convolution and instead merge the temporal dimension with the batch dimension. For attention, however, we still use spatiotemporal attention.
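One way to picture the "2D" baseline is a 2D convolution applied to every frame independently by folding time into the batch axis. The helper `conv2d_over_frames` and the tensor sizes below are hypothetical; this is a sketch of the idea, not the code used for the paper's comparison.

```python
import torch
import torch.nn.functional as F


def conv2d_over_frames(video, weight_2d, padding=1):
    # video: (N, C, T, H, W). Fold T into the batch axis, convolve, unfold.
    n, c, t, h, w = video.shape
    frames = video.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
    out = F.conv2d(frames, weight_2d, padding=padding)
    out = out.reshape(n, t, out.shape[1], out.shape[2], out.shape[3])
    return out.permute(0, 2, 1, 3, 4)  # back to (N, C_out, T, H, W)


torch.manual_seed(0)
video = torch.randn(2, 3, 4, 16, 16)
weight_2d = torch.randn(8, 3, 3, 3)

out_merged = conv2d_over_frames(video, weight_2d)
# Equivalent view: a 3D convolution whose temporal kernel size is 1.
out_3d = F.conv3d(video, weight_2d.unsqueeze(2), padding=(0, 1, 1))
print(out_merged.shape, torch.allclose(out_merged, out_3d, atol=1e-4))
```

In this form the convolution mixes no information across frames, so any temporal modeling has to come from the attention layers.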


MLDeS commented on July 18, 2024

Thanks a lot for the answers!
