tengdahan / memdpc

[ECCV'20 Spotlight] Memory-augmented Dense Predictive Coding for Video Representation Learning. Tengda Han, Weidi Xie, Andrew Zisserman.

License: Apache License 2.0

Python 100.00%

memdpc's Introduction

Memory-augmented Dense Predictive Coding for Video Representation Learning

This repository contains the implementation of Memory-augmented Dense Predictive Coding (MemDPC).

Links: [arXiv] [PDF] [Video] [Project page]

[MemDPC architecture figure]

News

  • 2020/09/08: uploaded evaluation code for action classification and pretrained weights on Kinetics-400.

  • 2020/08/26: corrected the DynamoNet statistics in the figure. DynamoNet uses 500K videos from YouTube-8M but only a 10-second clip from each, so the total video length is about 58 days.

Preparation

This repository is implemented in PyTorch 1.2, but newer versions should also work. It additionally requires cv2, joblib, tqdm, and tensorboardX.
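
The extra dependencies are all available from PyPI; a typical install (package names are the standard PyPI ones, pin versions as you prefer) is:

pip install opencv-python joblib tqdm tensorboardX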

For the dataset, please follow the instructions here.

Self-supervised training (MemDPC)

  • Change directory cd memdpc/

  • Train MemDPC on UCF101 rgb stream

python main.py --gpu 0,1 --net resnet18 --dataset ucf101 --batch_size 16 --img_dim 128 --epochs 500
  • Train MemDPC on Kinetics400 rgb stream
python main.py --gpu 0,1,2,3 --net resnet34 --dataset k400 --batch_size 16 --img_dim 224 --epochs 200

Evaluation

Finetune the entire network for action classification on UCF101. [finetuning architecture figure]

  • Change directory cd eval/

  • Train action classifier by finetuning the pretrained weights

python test.py --gpu 0,1 --net resnet34 --dataset ucf101 --batch_size 16 \
--img_dim 224 --epochs 500 --train_what ft --schedule 300 400
  • Train the action classifier by freezing the pretrained weights and training only a linear layer
python test.py --gpu 0,1 --net resnet34 --dataset ucf101 --batch_size 16 \
--img_dim 224 --epochs 100 --train_what last --schedule 60 80 --dropout 0.5

MemDPC pretrained weights

Citation

If you find the repo useful for your research, please consider citing our paper:

@InProceedings{Han20,
  author       = "Tengda Han and Weidi Xie and Andrew Zisserman",
  title        = "Memory-augmented Dense Predictive Coding for Video Representation Learning",
  booktitle    = "European Conference on Computer Vision",
  year         = "2020",
}

For any questions, feel free to open an issue or contact Tengda Han ([email protected]).

memdpc's People

Contributors

tengdahan

memdpc's Issues

question about the memory bank

Hi, thank you for sharing the great work and well-structured code. I have a quick question: how do you obtain the values of the memory bank? According to the code, is it just a bank of randomly initialized vectors? If so, have you done any ablation study on how sensitive the model is to the memory bank initialization?
Thanks so much!
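
For readers unfamiliar with the idea, here is a minimal sketch of a learnable memory bank with soft addressing; names and sizes are illustrative, not the repository's exact implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryBank(nn.Module):
    # A bank of learnable vectors; queries read out a softmax-weighted combination.
    def __init__(self, mem_size=1024, feature_dim=256):
        super().__init__()
        # randomly initialized, then updated by backprop like any other parameter
        self.memory = nn.Parameter(torch.randn(mem_size, feature_dim) * 0.01)

    def forward(self, query):              # query: (B, feature_dim)
        scores = query @ self.memory.t()   # (B, mem_size) dot-product scores
        weights = F.softmax(scores, dim=-1)
        return weights @ self.memory       # (B, feature_dim) soft read-out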

cannot run correctly for UCF101

Hi, thanks so much for sharing the wonderful code.
I have a problem: in dataset.py, the line
self.video_info = video_info.drop(drop_idx, axis=0)

filters out all the videos, so self.video_info becomes empty.
Is there anything wrong?
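
In case it helps others hitting this, a quick way to see why rows are being dropped is to inspect the csv before the drop. The file name and column names below are illustrative assumptions, not the repository's exact ones; the csv generated by process_data/src/write_csv.py is expected to hold a clip path and its frame count.

import pandas as pd

video_info = pd.read_csv('train_split01.csv', header=None, names=['vpath', 'nframes'])  # hypothetical path
print(video_info['nframes'].describe())   # if every frame count is 0, every row gets dropped
missing = video_info[video_info['nframes'] <= 0]
print(f'{len(missing)} of {len(video_info)} clips have no extracted frames')

One common cause of an empty self.video_info is that the frame-extraction step did not run, or the paths in the csv do not point at the extracted frames.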

Pretrained weights not available

Hi, the pretrained weights hosted on the Oxford domain are not available. Could you update them?

Thank you very much for your time!

Hidden state is needed for prediction?

As I read the paper, the GRU takes several previous clips (not predicted ones) and propagates a hidden state through them for the 'pred step' predictions.
But in your implementation, as far as I can tell, the GRU takes only one clip and predicts the following clips sequentially.
I wonder whether this can cause the following problems:

  1. Zero-initialized hidden states will be used for all inputs in every case.
  • For the first prediction, the zero-initialized state acts as just noise. Then why do we need a GRU rather than conv blocks?
  2. A big quality difference between predictions depending on their distance from the input.
  • Do we need sequential predictions from one clip? How about one prediction per clip?

I hope this does not bother you, and I would really appreciate your answer.
Thanks.
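
For context, a minimal sketch of the recurrent rollout being discussed, with illustrative names and sizes rather than the repository's exact code: the aggregator consumes the observed clip features into a hidden state, and each predicted feature is fed back in to produce the next one.

import torch
import torch.nn as nn

B, C, pred_step = 8, 256, 3
agg = nn.GRUCell(C, C)                  # context aggregator
predict = nn.Linear(C, C)               # maps the hidden state to a predicted future feature

context = torch.randn(5, B, C)          # features of the observed clips
h = torch.zeros(B, C)                   # zero-initialised hidden state
for x in context:                       # aggregate the observed clips
    h = agg(x, h)

preds = []
x = predict(h)                          # first future prediction
preds.append(x)
for _ in range(pred_step - 1):          # remaining rollout steps
    h = agg(x, h)                       # feed the prediction back as the next input
    x = predict(h)
    preds.append(x)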

Question about pre-processing of Kinetics-400

Hi Tengda,

Thanks for sharing your work! I have some questions regarding the Kinetics-400.

  1. Understanding that one must download the Kinetics videos oneself, may I ask how many videos you have crawled? Is there any possibility that you could share your downloaded videos for better replication?
  2. Which ffmpeg command do you use for resizing Kinetics videos to a short side of 256?

Looking forward to hearing from you!

Best Regards,
Hualin

Question about split in the paper.

Hi, Tengda
I read your paper and think it is really good work. Could you tell me which split of UCF101/HMDB51 you chose to train the SSL model and finetune on, for all the tables in your paper?
A. SSL on split 1, finetune on split 1, test on split 1; SSL on split 2, finetune on split 2, test on split 2; SSL on split 3, finetune on split 3, test on split 3; then average the results
B. SSL on split 1, finetune on split 1, test on split 1; finetune on split 2, test on split 2; finetune on split 3, test on split 3; then average the results
C. SSL on split 1, finetune on split 1, test on split 1
D. etc.

Thank you!

Replicate fine-tuning results

Hi,

I wonder how I should set the training config to reproduce the full-network fine-tuning result on HMDB51 in Table 1 (visual stream only, 41.2). Also, how should I adjust the training schedule to get the results in Figure 5 (visual stream only)?

Thank you, and I look forward to your response!

eval/test.py fails to run: np.ndarray.flatten in numpy 1.19

Thanks for sharing your code, excellent work!

I ran your project on Ubuntu 20.04, with Python 3.7 installed from backports and the following pip environment:

torch==1.4.0
tensorboardX==2.2
opencv-python==4.5.1.48
joblib==1.0.1
tqdm==4.59.0
matplotlib==3.3.4
torchvision==0.5.0
pandas==1.1.5
numpy==1.19.5
opencv_contrib_python==4.5.1.48

I trained on the UCF101 RGB stream, which worked well so far:

python3.7 main.py --gpu 0 --net resnet18 --workers 6 --dataset ucf101 --batch_size 32 --img_dim 128 --epochs 500

When I execute eval/test.py

cd eval
MODEL="../memdpc/log_tmp/memdpc_ucf101-128_resnet18_mem1024_bs16_lr0.001_seq8_pred3_len5_ds3/model/model_best_epoch445.pth.tar"
python3.7 test.py --gpu 0 --net resnet18 --dataset ucf101 --center_crop --img_dim 128 --test "${MODEL}"

I get this error in memdpc/dataset.py:

[ . . . ]
  File "../memdpc/dataset.py", line 162, in idx_sampler
    seq_idx = seq_idx.flatten(0)
TypeError: order must be str, not int

This may be related to the numpy version: I guess you used an older version of numpy, so flatten(0) worked for you. In newer versions of numpy you have to choose one of 'C', 'F', 'A', or 'K'.
To fix it I chose:

seq_idx.flatten()

which defaults to C-style flattening (.flatten('C')), see https://numpy.org/doc/1.19/reference/generated/numpy.ndarray.flatten.html

Is C-style flattening correct here, or what did you intend to do there?
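
For reference, a quick check in plain numpy (not repository code) confirming that the argument-free call already uses C order:

import numpy as np

a = np.arange(6).reshape(2, 3)
assert (a.flatten() == a.flatten('C')).all()   # default order is 'C'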

The pretrained model

Hi, great work! I have run it with my own dataset without pretraining, but the accuracy is low.
Could you provide ResNet2d3d pretrained on ImageNet, or some pretrained models on other datasets? Thanks!

Unable to run K400 training due to gpu memory constraint

Hi Tengda,

I tried to run the self-supervised training command python main.py --gpu 0,1,2,3 --net resnet34 --dataset k400 --batch_size 16 --img_dim 224 --epochs 200 on a 4-GPU machine with 16 GB of memory per GPU, but it runs out of GPU memory. I wonder if you have any alternative command or solution for this. Thank you!

Best Regards,
Hualin

About trained model weight

Hi,
I notice that you have released two pretrained weights. However, it still takes a long time to finetune. Could you release the final trained model weights so we can easily test with them? Thanks!

Hidden

[screenshot of the relevant code]
I think hidden.unsqueeze(0) only supports self.param['num_layers'] = 1. Should we instead keep the previous hidden state, without applying hidden = hidden[:,-1]?

Training time

Hi, Tengda. How much time did it take to train the initial self-supervised model on UCF101 and then fine-tune it on UCF101 again?
I am planning to train it, but GPU availability for the required time is a constraint.

transfer learning dataset

Hi, UCF101 has train splits 1 to 3 and test splits 1 to 3. Which train and test splits did you use to train and evaluate the transfer model?

What accuracy is normal during training?

I noticed that the accuracy computed for validation during training actually assesses the retrieval performance of the feature blocks using dot-product similarity. I was wondering how high the validation accuracy should be during training (e.g. for UCF101)?
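
For readers wondering what that number measures, a minimal illustration of top-1 retrieval accuracy under dot-product similarity (shapes and names are illustrative, not the repository's exact code):

import torch

pred = torch.randn(32, 256)             # predicted feature blocks
gt = torch.randn(32, 256)               # ground-truth feature blocks, matched by index
scores = pred @ gt.t()                  # (32, 32) dot-product similarity matrix
top1 = scores.argmax(dim=1)             # index of the most similar ground-truth block
acc = (top1 == torch.arange(32)).float().mean()   # fraction retrieved correctly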

configure for frozen fine-tuning

Hi, thanks for the great work. I just wonder if you can provide the configuration for the frozen training mentioned in Table 2. How many epochs and what learning rate did you use? Thank you!

Finetuning UCF101 head on pretrained kinetics400 rgb model

I used the pretrained Kinetics-400 RGB-stream base model you kindly provided, ran fine-tuning, and my results are not as expected. The project code runs without issues. The only change I made to your code is the numpy flattening fix from #11.

Would be great if you can provide your opinion on what I did and the results I got.

Used Python 3.7:

torch==1.4.0
tensorboardX==2.2
opencv-python==4.5.1.48
joblib==1.0.1
tqdm==4.59.0
matplotlib==3.3.4
torchvision==0.5.0
pandas==1.1.5
numpy==1.19.5
opencv_contrib_python==4.5.1.48
  • Downloaded and placed UCF101
  • Executed process_data/src/extract_ff.py to extract frames (commented out the flow-features code, RGB frames only)
  • Executed process_data/src/write_csv.py to generate the csv files (clip path and frame count)
  • Put the file process_data/data/ucf101/ClassInd.txt with the sorted list of the 101 labels
  • Fine-tuned (with batch_size = 8 because only 11 GB of GPU memory is available)
python3.7 test.py --gpu 0 --net resnet34 --dataset ucf101 --batch_size 8 --img_dim 224 --epochs 500 --train_what ft --pretrain ../pretrained/k400-rgb-224_resnet34_memdpc.pth.tar

[ . . . ]

Epoch: [499][855/856]   Loss 0.7534 (0.6072)    Acc: 0.7500 (0.8000)    T-data:0.00 T-batch:1.13
Epoch: [499]    T-epoch:973.63
100%|██████████████████████████████████████████████████████████████████████████████████| 99/99 [00:42<00:00,  2.32it/s]
Loss 1.2451     Acc: 0.6982
Training from ep 0 to ep 500 finished
  • Ran test-set evaluation on the best model from fine-tune training
MODEL="log_tmp/ucf101-224-sp1_resnet34_lc_bs8_lr0.001_wd0.001_ds3_seq8_len5_dp0.9_train-ft_pt=..-pretrained-k400-rgb-224_resnet34_memdpc.pth.tar/model/model_best_epoch494.pth.tar"
python3.7 test.py --gpu 0 --net resnet34 --dataset ucf101 --center_crop --img_dim 224 --test "${MODEL}"

[ . . . ]

=> loaded testing checkpoint 'log_tmp/ucf101-224-sp1_resnet34_lc_bs8_lr0.001_wd0.001_ds3_seq8_len5_dp0.9_train-ft_pt=..-pretrained-k400-rgb-224_resnet34_memdpc.pth.tar/model/model_best_epoch494.pth.tar' (epoch 494)

100%|████████████████████████████████████████████████████████████████████████████| 2658/2658 [00:00<00:00, 3867.80it/s]
Mean: Acc@1: 0.4105 Acc@5: 0.7129

From the results in your paper I expected something like 0.70 Acc@1, but I got 0.41. I'm unsure whether the Top-1 accuracy in Table 2 of your paper (MemDPC, K400 (28d), Res. 224, Arch. R-2D3D, depth 33, Modality V --> UCF 78.1%) was obtained with RGB or flow features. But the results of my training should be at least as good as the C2 variant in Table 1 (full training on UCF101 at 128 img dim using RGB input and memory size 1024 --> Top-1 of 68.2).

What is the expected accuracy for the training I did?
