
SAMI: Masked AutoEncoders leveraging Segment-Anything

An unofficial PyTorch implementation of Masked AutoEncoder (MAE) image pretraining.

Based on my understanding of EfficientSAM's SAMI framework and the technical details given in the paper, I have tried to implement the SAMI pretraining framework, which uses SAM's ViT to improve the performance of small-scale ViT models, including ViT-Tiny and ViT-Small.

Unfortunately, I currently do not have sufficient computing resources to verify whether my implementation can reproduce the SAMI experimental results in the EfficientSAM paper.

First, follow this README to prepare SAM's ViT checkpoint, which will serve as the teacher model supervising the small-scale ViT during the SAMI pretraining stage.
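As a rough sketch of what "preparing the teacher" involves (not the repo's actual loader; the `image_encoder.` key prefix is taken from the official SAM checkpoints), one would extract the image-encoder weights from the SAM checkpoint and freeze the resulting model:

```python
import torch
import torch.nn as nn

def extract_image_encoder_state(state: dict) -> dict:
    """Keep only the image-encoder weights from a full SAM checkpoint dict."""
    prefix = "image_encoder."
    return {k[len(prefix):]: v for k, v in state.items() if k.startswith(prefix)}

def freeze(model: nn.Module) -> nn.Module:
    """The teacher is frozen during SAMI pretraining: eval mode, no gradients."""
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)
    return model
```

The filtered state dict would then be loaded into SAM's ViT (built by whatever builder the codebase provides) before pretraining starts.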

1. Pretrain

We provide the bash script train_pretrain.sh for pretraining. You can modify the hyperparameters in the script to suit your needs.

  • Single GPU
# bash train_pretrain.sh <model> <teacher model> <batch size> <data> <data path> <world size> <resume>
bash train_pretrain.sh vit_t vit_h 256 imagenet_1k /path/to/imagenet_1k/ 1 None
  • Multi GPUs
# bash train_pretrain.sh <model> <teacher model> <batch size> <data> <data path> <world size> <resume>
bash train_pretrain.sh vit_t vit_h 256 imagenet_1k /path/to/imagenet_1k/ 8 None
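Schematically, the pretraining objective as I understand it from the EfficientSAM paper: the frozen SAM ViT produces target features, and the student encoder plus decoder reconstruct those features at the masked positions. The sketch below is illustrative, not the repo's exact loss code; shapes and the masked-only averaging are assumptions.

```python
import torch
import torch.nn.functional as F

def sami_loss(student_feats: torch.Tensor,
              teacher_feats: torch.Tensor,
              mask: torch.Tensor) -> torch.Tensor:
    """student_feats, teacher_feats: (B, N, D); mask: (B, N), 1 = masked token."""
    # Per-token MSE between reconstructed and teacher features.
    per_token = F.mse_loss(student_feats, teacher_feats, reduction="none").mean(-1)
    # Average the reconstruction error over masked tokens only.
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```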

2. Finetune

We provide the bash script train_finetune.sh for finetuning. You can modify the hyperparameters in the script to suit your needs.

  • Single GPU
# bash train_finetune.sh <model> <batch size> <data> <data path> <world size> <resume>
bash train_finetune.sh vit_t 256 imagenet_1k /path/to/imagenet_1k/ 1 None
  • Multi GPUs
# bash train_finetune.sh <model> <batch size> <data> <data path> <world size> <resume>
bash train_finetune.sh vit_t 256 imagenet_1k /path/to/imagenet_1k/ 8 None

3. Evaluate

  • Evaluate the top-1 & top-5 accuracy of ViT-Tiny on the CIFAR10 dataset:
python train_finetune.py --dataset cifar10 -m vit_t --batch_size 256 --img_size 32 --patch_size 2 --eval --resume path/to/checkpoint
  • Evaluate the top-1 & top-5 accuracy of ViT-Tiny on the ImageNet-1K dataset:
python train_finetune.py --dataset imagenet_1k --root /path/to/imagenet_1k -m vit_t --batch_size 256 --img_size 224 --patch_size 16 --eval --resume path/to/checkpoint
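For reference, a minimal sketch of the top-1/top-5 metric that an `--eval` run reports (the standard torchvision-style computation; not necessarily the repo's exact code):

```python
import torch

def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, ks=(1, 5)):
    """logits: (B, C); targets: (B,). Returns accuracies in percent, one per k."""
    maxk = max(ks)
    _, pred = logits.topk(maxk, dim=1)       # (B, maxk) predicted class indices
    correct = pred.eq(targets.view(-1, 1))   # (B, maxk) hit matrix
    return [correct[:, :k].any(dim=1).float().mean().item() * 100 for k in ks]
```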

4. Experiments

Classification: ImageNet-1K

  • We use SAM's ViT-H as the teacher to supervise the small-scale ViT.
  • We use the AttentionPoolingClassifier as the classifier.
  • We finetune the models for 100 epochs on ImageNet-1K.
| Method | Model | Teacher   | Epoch | Top 1 | Weight | MAE weight |
|--------|-------|-----------|-------|-------|--------|------------|
| MAE    | ViT-T | -         | 100   |       |        |            |
| MAE    | ViT-S | -         | 100   |       |        |            |
| SAMI   | ViT-T | SAM ViT-H | 100   |       |        |            |
| SAMI   | ViT-S | SAM ViT-H | 100   |       |        |            |
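A rough sketch of what an attention-pooling head like the AttentionPoolingClassifier does (the repo's exact implementation may differ): a single learnable query cross-attends over the patch tokens, and the pooled vector feeds a linear classifier.

```python
import torch
import torch.nn as nn

class AttentionPoolingHead(nn.Module):
    def __init__(self, dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))   # learnable pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (B, N, D) patch tokens -> (B, num_classes) logits."""
        q = self.query.expand(tokens.size(0), -1, -1)   # (B, 1, D)
        pooled, _ = self.attn(q, tokens, tokens)        # attend over all tokens
        return self.fc(pooled.squeeze(1))
```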

Object detection: COCO

  • We use the small-scale ViT pretrained by SAMI as the backbone of ViTDet.
| Method | Model  | Backbone | Epoch | AP | Weight | MAE weight |
|--------|--------|----------|-------|----|--------|------------|
| SAMI   | ViTDet | ViT-T    | 100   |    |        |            |
| SAMI   | ViTDet | ViT-S    | 100   |    |        |            |

5. Acknowledgment

Thanks to Kaiming He for the inspiring work on MAE and for the official MAE source code.

sami's People

Contributors

yjh0410


sami's Issues

Shape unmatch: image_encoder.patch_embed.proj.weight

Hello, may I ask: the pretrained SAM ViT-H model uses a resolution of 1024x1024 with patch_size=16, but your code uses 32x32 with patch_size=2 on the CIFAR10 dataset. This causes the following issue when loading SAM's pretrained weights: `Shape unmatch: image_encoder.patch_embed.proj.weight`. Doesn't the teacher network then change?
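To illustrate why the mismatch occurs: a ViT's patch embedding is a Conv2d whose weight shape is (embed_dim, 3, patch_size, patch_size), so it depends directly on the patch size. The shapes below are illustrative, using SAM ViT-H's embed_dim of 1280:

```python
import torch.nn as nn

# SAM ViT-H patch embedding: patch_size=16 -> weight (1280, 3, 16, 16)
sam_patch_embed = nn.Conv2d(3, 1280, kernel_size=16, stride=16)
# CIFAR10 setting in the question: patch_size=2 -> weight (1280, 3, 2, 2)
cifar_patch_embed = nn.Conv2d(3, 1280, kernel_size=2, stride=2)
```

The two weight tensors cannot be loaded into each other, so the checkpoint loader skips (or fails on) `patch_embed.proj.weight`.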

Confused

Hello author, I used different teacher models in pretraining: medsam_checkpoint, sam_vit_b_checkpoint, and a run with no teacher model at all. I found that the choice of teacher makes a large difference in pretraining: with medsam_checkpoint, the loss starts at 18.9 and drops to 4.2 after 300 epochs of training, which at the start is worse than using no teacher model. Since my data consists of medical images, and the sam_vit_b_checkpoint and medsam_checkpoint models have exactly the same architecture, this confuses me. I hope you can give me some tips; thank you for your reply.

About the Cross-Attention Decoder

Thank you for your implementation of SAMI's training code. It has been incredibly helpful to me!

However, I have a question regarding the pretraining process. In the forward pass of the MaeDecoder, I noticed that the masked embeddings are reconstructed through self-attention, which is clever but seems inconsistent with the cross-attention described in the original paper.

I am confused about the Cross-Attention Decoder module. Could you please explain your understanding of the query, key, and value as described in Section 3.2 on page 3? And why did you implement the query as a learnable mask_token?
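For context, one common reading of the paper's cross-attention decoder (a sketch only, with positional embeddings omitted; this is not the repo's MaeDecoder): learnable mask tokens act as the queries, while the encoder's outputs for the visible tokens supply the keys and values, so masked positions are reconstructed by attending to visible ones.

```python
import torch
import torch.nn as nn

class CrossAttnDecoderSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visible: torch.Tensor, num_masked: int) -> torch.Tensor:
        """visible: (B, N_vis, D) encoder outputs; returns (B, num_masked, D)."""
        q = self.mask_token.expand(visible.size(0), num_masked, -1)
        # Queries = mask tokens; keys/values = visible-token features.
        out, _ = self.attn(q, visible, visible)
        return out
```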
