
alpha-vl / convmae

468 stars · 11 watchers · 39 forks · 8.74 MB

ConvMAE: Masked Convolution Meets Masked Autoencoders

License: MIT License

Python 99.80% Shell 0.20%
backbone computer-vision masked-image-modeling object-detection semantic-segmentation mae

convmae's People

Contributors: alpha-vl, linziyi96, teleema


convmae's Issues

output of FastConvMAE

I used your FastConvMAE to pretrain on ImageNet data.

In your code, you said the output should be:

[screenshot]
However, when I used the pretrained model to predict, it gave me a prediction of size torch.Size([4, 196, 768]).
I also tested MAE mode, which gives a prediction of size torch.Size([1, 196, 768]).

Can you explain why?
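For what it's worth, one plausible explanation (an assumption on my part, not confirmed by the repo): if FastConvMAE uses complementary masking, four disjoint 25%-visible masks jointly cover every token, so each image yields four masked views and hence a leading dimension of 4. A minimal sketch of such mask generation:

```python
import torch

# Sketch of complementary masking (an assumption about FastConvMAE's behavior):
# four disjoint visible sets, each 25% of the 196 tokens, together cover all of
# them, so one image produces four masked views and four prediction tensors.
L = 196
perm = torch.randperm(L)
visible_sets = [perm[i * L // 4:(i + 1) * L // 4] for i in range(4)]
assert torch.cat(visible_sets).sort().values.equal(torch.arange(L))
```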

Hi

Hi,
Great work, congratulations!
How did you draw the pictures in the "Visualization" section of README.md?

unpatchify-related problems

There is a small problem in the open-source code: self.patch_embed is not defined in the model's unpatchify function, so the original image dimensions cannot be restored. I hope it can be fixed for our convenience. Thank you for your answer.
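For anyone hitting this before a fix lands, here is a self-contained unpatchify in the style of the original MAE reference code, taking the patch size as an argument instead of reading it from self.patch_embed (a sketch; adjust patch_size to the model you use):

```python
import torch

def unpatchify(x, patch_size=16, in_chans=3):
    """x: (N, L, patch_size**2 * in_chans) -> imgs: (N, in_chans, H, W).

    Assumes a square grid of patches, as in the MAE reference code.
    """
    p = patch_size
    h = w = int(x.shape[1] ** 0.5)
    assert h * w == x.shape[1], "token count must be a perfect square"
    x = x.reshape(x.shape[0], h, w, p, p, in_chans)
    x = torch.einsum('nhwpqc->nchpwq', x)  # interleave patches back into a grid
    return x.reshape(x.shape[0], in_chans, h * p, w * p)

imgs = unpatchify(torch.randn(2, 196, 768))  # -> torch.Size([2, 3, 224, 224])
```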

Training on a custom dataset

Could you provide a tutorial on how to train and finetune on a custom dataset? Also, how can the input image size be changed for detection? The current code does not seem to support custom image sizes.

Doubts about masking strategy

Hi! Thanks for the open-source code. I have a doubt about the masking strategy.
From the paper: "Uniformly masking stage-1 input tokens from the H/4 × W/4 feature maps would cause all tokens of stage-3 to have partially visible information and requires keeping all stage-3 tokens." Why would visible information pass to stage-3 if the image was masked in the first stage?
Thanks very much!
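For intuition: each stage-3 token pools a 4×4 block of stage-1 tokens, so randomly masking individual stage-1 tokens leaves almost every stage-3 token with some visible input. My reading of the paper is that ConvMAE instead samples the mask at stage-3 resolution and upsamples it to the earlier stages, along these lines:

```python
import torch
import torch.nn.functional as F

# Sketch: sample the random mask at stage-3 resolution (H/16 x W/16) and
# upsample it, so every stage-1/stage-2 token is fully visible or fully masked.
keep = (torch.rand(1, 1, 14, 14) > 0.75).float()               # keep ~25% of stage-3 tokens
mask_s2 = F.interpolate(keep, scale_factor=2, mode='nearest')  # 28 x 28 (H/8)
mask_s1 = F.interpolate(keep, scale_factor=4, mode='nearest')  # 56 x 56 (H/4)
```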

How long does the pretraining stage take on V100s?

Hi,

Thank you for your excellent work!
We would like to know how long ImageNet-1K pretraining takes on a machine with 8 V100s.
Also, will you release the manuscript about Faster ConvMAE soon? We can't wait to learn more details about it.

about the training loss

Hello! I observe that the training loss takes many epochs to decrease from 0.42 to 0.39. Does a training-loss decrease from 0.42 to 0.39 really make a big difference in the test results?

How to finetune on my own dataset

What should I do if I want to finetune the current pretrained model on my own dataset instead of ImageNet's val dataset? Could you answer this? Thank you very much.

Pretraining implementation

I have implemented pretraining code based on the MAE repo, but I wonder about one thing: in the decoder phase, do you (1) sum the features of all 3 stages and then normalize, or (2) normalize the last-stage feature and then sum it with the two previous ones? I ask because I got a NaN loss after 270 epochs with approach (1). By the way, have you ever seen a NaN loss during training?
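For concreteness, here is how I read the two orderings (a sketch with hypothetical stage features already projected to the decoder width; which ordering the authors actually use is exactly the question):

```python
import torch
import torch.nn as nn

norm = nn.LayerNorm(768)
s1, s2, s3 = torch.randn(3, 2, 196, 768).unbind(0)  # hypothetical stage features

fused_1 = norm(s1 + s2 + s3)   # (1) sum all three stages, then normalize
fused_2 = norm(s3) + s1 + s2   # (2) normalize the last stage, then add the others
```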

ImageNet Evaluation

Thanks for sharing the great work.
I had difficulty reproducing the evaluation results in FINETUNE.md. My evaluation results are:
* Acc@1 1.090 Acc@5 2.188 loss 8.955
* Accuracy of the network on the 50000 test images: 1.1%
That is obviously far too large a gap.

I downloaded ImageNet-1K following your guidance and prepared it following Jasonlee1995. Are there any details I have missed, or any specific requirements for preparing the dataset?
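As a side note, an Acc@1 of ~1% is near chance level for 1000 classes, which in my experience usually points to a mis-arranged val split (torchvision-style loaders expect val images sorted into per-class subfolders) or a checkpoint that did not actually load. A quick sanity check (the path is a placeholder):

```python
from torchvision import datasets

# Sanity check: a correctly prepared ImageNet-1K val split should report
# 1000 classes and 50000 images when read as an ImageFolder.
val = datasets.ImageFolder('path/to/imagenet/val')
print(len(val.classes), len(val))  # expect: 1000 50000
```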

Hi, I need help

Hello, I would like to ask how to display the per-class accuracy in the finetuning output, and how to use the downstream-task model to visualize detections.

Questions about ConvMAE-v2

ConvMAE-v2 is great work, and I am very interested in some of the details in the paper. Where can I find the code for ConvMAE-v2?

Total memory consumption for training with batch size 32.

I have tried training the ConvMAE detector (as provided in this repository) on 2 GPUs with 32 GB each (V100). It looks like I can only train with batch size = 2; going beyond batch size 2 raises CUDA out of memory. Also, with such a small batch size, training does not seem to produce a well-trained model. Could you tell me the recommended memory size for training the model with batch size = 32?

Thank you so much.
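Not an answer on the recommended memory, but a generic workaround sketch: gradient accumulation can approximate a batch of 32 from micro-batches of 2 (plain PyTorch below; a detectron2 pipeline would need the equivalent at its solver level). Note this only matches large-batch training for batch-size-independent layers; BatchNorm statistics still see the micro-batch.

```python
import torch
import torch.nn as nn

# Gradient-accumulation sketch (dummy model and data, not ConvMAE-specific):
# micro-batches of 2 accumulated over 16 steps approximate a 32-image batch.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(2, 8), torch.randn(2, 1)) for _ in range(32)]

accum_steps = 16
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                      # gradients add up across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```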

Question about ConvMAE-v2

Thank you for your excellent work!

When I load the ConvMAE-v2-Base pretrained checkpoint [https://drive.google.com/file/d/1gykVKNDlRn8eiuXk5bUj1PbSnHXFzLnI/view?usp=sharing], it has a cls_token parameter, which is not in models_convmae.py.

Does the ConvMAE-v2 model differ from models_convmae.py in some details? Thanks!

Running pretrained convvit on larger image sizes

Hi,
I would like to see how well the pretrained base model runs on my own dataset, but the current model is configured for an image size of 224.
In the original MAE code, the interpolate_pos_embed function lets the user resize the positional embedding to allow for larger images.
In your linear-probing code, that same call is commented out, and (understandably) it would not work the same way, as there are multiple positional embeddings to take care of.
Do you have a function that allows the pretrained model to run on different image sizes?
Thanks
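In case it helps, here is a minimal sketch of bicubic grid interpolation in the spirit of MAE's interpolate_pos_embed; ConvMAE would presumably need this applied to each stage's positional embedding separately (the function and its assumptions are mine, not from the repo):

```python
import torch
import torch.nn.functional as F

def interpolate_grid_pos_embed(pos_embed, new_h, new_w):
    """pos_embed: (1, H*W, C) grid positional embedding without a cls token."""
    n, l, c = pos_embed.shape
    h = w = int(l ** 0.5)
    assert h * w == l, "expects a square grid"
    pe = pos_embed.reshape(n, h, w, c).permute(0, 3, 1, 2)       # (1, C, H, W)
    pe = F.interpolate(pe, size=(new_h, new_w), mode='bicubic',
                       align_corners=False)
    return pe.permute(0, 2, 3, 1).reshape(n, new_h * new_w, c)

# e.g. a 14x14 grid (224px, patch 16) resized for 512px inputs (32x32 grid)
new_pe = interpolate_grid_pos_embed(torch.randn(1, 196, 768), 32, 32)
```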

How can I train 200 epochs for DET?

Hi ,
I want to train the pretrained model in the detectron2 framework for object detection,
but the code trains for only 1 epoch and then ends.
Is this a bug?
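One thing worth checking (an assumption about the cause, not a confirmed diagnosis): detectron2 schedules training by iterations rather than epochs, so the epoch count is implied by MAX_ITER, the batch size, and the dataset size. A sketch:

```python
from detectron2.config import get_cfg

# Detectron2 trains for SOLVER.MAX_ITER iterations; "epochs" are implicit.
# Rough numbers: COCO train2017 has ~118k images.
cfg = get_cfg()
cfg.SOLVER.IMS_PER_BATCH = 16
cfg.SOLVER.MAX_ITER = 118_000 * 200 // cfg.SOLVER.IMS_PER_BATCH  # ~200 epochs
```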

Visualization VIT feature

Hi, authors.

How did you visualize the attention maps shown in your results?

  1. Using the encoder (ViT)?
  2. Using the decoder (ViT)?

Given input x -> y = encoder(x) -> decoder(y), do you then use the final ViT output of decoder(y)?
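Not the authors' method, but one generic recipe: in MAE/timm-style Attention modules the softmaxed weights pass through attn_drop, so a forward hook there captures the attention map. A self-contained toy demonstrating the mechanism (the Attention class below is mine, not ConvMAE's):

```python
import torch
import torch.nn as nn

# Minimal MAE/timm-style attention; attn_drop sees the (N, heads, L, L) weights.
class Attention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.attn_drop = nn.Dropout(0.0)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        n, l, c = x.shape
        q, k, v = self.qkv(x).reshape(n, l, 3, self.heads,
                                      c // self.heads).permute(2, 0, 3, 1, 4)
        attn = self.attn_drop((q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1))
        return self.proj((attn @ v).transpose(1, 2).reshape(n, l, c))

maps = []
attn = Attention(64)
attn.attn_drop.register_forward_hook(lambda m, i, o: maps.append(o.detach()))
attn(torch.randn(1, 196, 64))
print(maps[0].shape)  # torch.Size([1, 4, 196, 196]) -- the attention map
```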

masked convolution

Hi! Thanks for the open-source code.
I noticed that the masked convolution in the code only masks the residual branch; the skip connection has no mask, as shown in line 119 of ConvMAE/vision_transformer.py. The corresponding code is:

x = x + self.drop_path(self.conv2(self.attn(mask * self.conv1(self.norm1(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)))))

Will this lead to information leakage in the convolution stages?
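For discussion, here is a hypothetical variant that re-applies the mask after the skip connection, which is one way leakage through the convolution could be cleaned up (a sketch, not the repo's code):

```python
import torch
import torch.nn as nn

# Hypothetical masked-conv residual block: the residual branch is masked before
# the convolution (as in the quoted line), and the output is re-masked after
# the skip so values spread into masked positions by the conv are zeroed again.
class MaskedConvBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)

    def forward(self, x, mask):
        # x: (N, C, H, W); mask: (N, 1, H, W), 1 = visible, 0 = masked
        y = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        y = self.conv(mask * y)
        return mask * (x + y)

block = MaskedConvBlock(16)
out = block(torch.randn(2, 16, 14, 14), torch.ones(2, 1, 14, 14))
```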

about the pretrained model convmae.pth

I downloaded your pretrained model, and when I tried to load it,
it gave me the following errors:

_IncompatibleKeys(missing_keys=['mask_token', 'decoder_pos_embed', 'stage1_output_decode.weight', 'stage1_output_decode.bias', 'stage2_output_decode.weight', 'stage2_output_decode.bias', 'decoder_embed.weight', 'decoder_embed.bias', 'decoder_blocks.0.norm1.weight', 'decoder_blocks.0.norm1.bias', 'decoder_blocks.0.attn.qkv.weight', 'decoder_blocks.0.attn.qkv.bias', 'decoder_blocks.0.attn.proj.weight', 'decoder_blocks.0.attn.proj.bias', 'decoder_blocks.0.norm2.weight', 'decoder_blocks.0.norm2.bias', 'decoder_blocks.0.mlp.fc1.weight', 'decoder_blocks.0.mlp.fc1.bias', 'decoder_blocks.0.mlp.fc2.weight', 'decoder_blocks.0.mlp.fc2.bias', 'decoder_blocks.1.norm1.weight', 'decoder_blocks.1.norm1.bias', 'decoder_blocks.1.attn.qkv.weight', 'decoder_blocks.1.attn.qkv.bias', 'decoder_blocks.1.attn.proj.weight', 'decoder_blocks.1.attn.proj.bias', 'decoder_blocks.1.norm2.weight', 'decoder_blocks.1.norm2.bias', 'decoder_blocks.1.mlp.fc1.weight', 'decoder_blocks.1.mlp.fc1.bias', 'decoder_blocks.1.mlp.fc2.weight', 'decoder_blocks.1.mlp.fc2.bias', 'decoder_blocks.2.norm1.weight', 'decoder_blocks.2.norm1.bias', 'decoder_blocks.2.attn.qkv.weight', 'decoder_blocks.2.attn.qkv.bias', 'decoder_blocks.2.attn.proj.weight', 'decoder_blocks.2.attn.proj.bias', 'decoder_blocks.2.norm2.weight', 'decoder_blocks.2.norm2.bias', 'decoder_blocks.2.mlp.fc1.weight', 'decoder_blocks.2.mlp.fc1.bias', 'decoder_blocks.2.mlp.fc2.weight', 'decoder_blocks.2.mlp.fc2.bias', 'decoder_blocks.3.norm1.weight', 'decoder_blocks.3.norm1.bias', 'decoder_blocks.3.attn.qkv.weight', 'decoder_blocks.3.attn.qkv.bias', 'decoder_blocks.3.attn.proj.weight', 'decoder_blocks.3.attn.proj.bias', 'decoder_blocks.3.norm2.weight', 'decoder_blocks.3.norm2.bias', 'decoder_blocks.3.mlp.fc1.weight', 'decoder_blocks.3.mlp.fc1.bias', 'decoder_blocks.3.mlp.fc2.weight', 'decoder_blocks.3.mlp.fc2.bias', 'decoder_blocks.4.norm1.weight', 'decoder_blocks.4.norm1.bias', 'decoder_blocks.4.attn.qkv.weight', 'decoder_blocks.4.attn.qkv.bias', 'decoder_blocks.4.attn.proj.weight', 'decoder_blocks.4.attn.proj.bias', 'decoder_blocks.4.norm2.weight', 'decoder_blocks.4.norm2.bias', 'decoder_blocks.4.mlp.fc1.weight', 'decoder_blocks.4.mlp.fc1.bias', 'decoder_blocks.4.mlp.fc2.weight', 'decoder_blocks.4.mlp.fc2.bias', 'decoder_blocks.5.norm1.weight', 'decoder_blocks.5.norm1.bias', 'decoder_blocks.5.attn.qkv.weight', 'decoder_blocks.5.attn.qkv.bias', 'decoder_blocks.5.attn.proj.weight', 'decoder_blocks.5.attn.proj.bias', 'decoder_blocks.5.norm2.weight', 'decoder_blocks.5.norm2.bias', 'decoder_blocks.5.mlp.fc1.weight', 'decoder_blocks.5.mlp.fc1.bias', 'decoder_blocks.5.mlp.fc2.weight', 'decoder_blocks.5.mlp.fc2.bias', 'decoder_blocks.6.norm1.weight', 'decoder_blocks.6.norm1.bias', 'decoder_blocks.6.attn.qkv.weight', 'decoder_blocks.6.attn.qkv.bias', 'decoder_blocks.6.attn.proj.weight', 'decoder_blocks.6.attn.proj.bias', 'decoder_blocks.6.norm2.weight', 'decoder_blocks.6.norm2.bias', 'decoder_blocks.6.mlp.fc1.weight', 'decoder_blocks.6.mlp.fc1.bias', 'decoder_blocks.6.mlp.fc2.weight', 'decoder_blocks.6.mlp.fc2.bias', 'decoder_blocks.7.norm1.weight', 'decoder_blocks.7.norm1.bias', 'decoder_blocks.7.attn.qkv.weight', 'decoder_blocks.7.attn.qkv.bias', 'decoder_blocks.7.attn.proj.weight', 'decoder_blocks.7.attn.proj.bias', 'decoder_blocks.7.norm2.weight', 'decoder_blocks.7.norm2.bias', 'decoder_blocks.7.mlp.fc1.weight', 'decoder_blocks.7.mlp.fc1.bias', 'decoder_blocks.7.mlp.fc2.weight', 'decoder_blocks.7.mlp.fc2.bias', 'decoder_norm.weight', 'decoder_norm.bias', 'decoder_pred.weight', 
'decoder_pred.bias'], unexpected_keys=[])
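Note that every missing key above is a decoder-side parameter and unexpected_keys is empty, so the released file apparently contains encoder weights only. Loading with strict=False is the usual pattern in that case (the file name and model constructor below are placeholders, not the repo's confirmed API):

```python
import torch
import models_convmae  # repo module; the constructor name below is a guess

model = models_convmae.convmae_convvit_base_patch16()  # hypothetical constructor
checkpoint = torch.load('convmae_base.pth', map_location='cpu')
state = checkpoint.get('model', checkpoint)        # weights may nest under 'model'
msg = model.load_state_dict(state, strict=False)   # tolerate missing decoder keys
print(msg.missing_keys)                            # decoder-side params, as above
```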

Time required to train one epoch.

Dear authors:
Thank you for sharing this excellent work! May I ask how the time overhead of ConvMAE pretraining compares to MAE? Could you provide the time required to train one epoch for the two methods on the same type of GPU?

Model settings and checkpoint do not match

Thanks for your great work!

But I have a problem with the model settings for your provided checkpoints.
When I load your checkpoints, the model settings that load correctly do not match what is written in the paper:

  1. the mlp_ratio of Large and Huge
  2. the patch_size of Huge

I want to find out what's going on, thanks a lot!

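One way to settle this from your side is to read the hyperparameters straight off the checkpoint tensor shapes (a sketch; the file name and state-dict keys below are guesses and may differ in the actual checkpoints):

```python
import torch

# Infer patch size and mlp_ratio from tensor shapes in the checkpoint.
checkpoint = torch.load('convmae_huge.pth', map_location='cpu')
state = checkpoint.get('model', checkpoint)
w = state['patch_embed1.proj.weight']       # hypothetical key; shape (C, 3, p, p)
print('stage-1 patch size:', w.shape[-1])
fc1 = state['blocks.0.mlp.fc1.weight']      # hypothetical key; (mlp_ratio*C, C)
print('mlp_ratio:', fc1.shape[0] / fc1.shape[1])
```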
