microsoft / focalnet

[NeurIPS 2022] Official code for "Focal Modulation Networks"

License: MIT License

Python 97.68% Jupyter Notebook 2.25% Shell 0.07%

focalnet's Introduction

This is the official PyTorch implementation of FocalNets:

"Focal Modulation Networks" by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan and Jianfeng Gao.


News

  • [11/07/2023] Researchers showed that Focal-UNet beats previous methods on several earth system analysis benchmarks. Check out their code, paper, and project!
  • [06/30/2023] 💥 Please find the FocalNet-DINO checkpoints on Hugging Face. The old links are deprecated.
  • [04/26/2023] By combining with FocalNet-Huge backbone, Focal-Stable-DINO achieves 64.8 AP on COCO test-dev without any test time augmentation! Check our Technical Report for more details!
  • [02/13/2023] FocalNet has been integrated into Keras; check out the tutorial!
  • [01/18/2023] Check out a curated paper list that introduces networks beyond attention, based on modern convolution and modulation!
  • [01/01/2023] Researchers showed that Focal-UNet beats Swin-UNet on several medical image segmentation benchmarks. Check out their code and paper, and happy new year!
  • [12/16/2022] 💥 We are pleased to release our FocalNet-Large-DINO checkpoint pretrained on Object365 and finetuned on COCO, which helps reach 63.5 mAP on COCO minival without test-time augmentation! Check it out!
  • [11/14/2022] We created a new repo FocalNet-DINO to hold the code to reproduce the object detection performance with DINO. We will be releasing the object detection code and checkpoints there. Stay tuned!
  • [11/13/2022] 💥 We release our large, xlarge and huge models pretrained on ImageNet-22K, including the one we used to achieve SoTA on the COCO object detection leaderboard!
  • [11/02/2022] We wrote a blog post to introduce the insights and techniques behind our FocalNets in a plain way, check it out!
  • [10/31/2022] 💥 We achieved a new SoTA with 64.2 box mAP on COCO minival and 64.4 box mAP (up from 64.3) on COCO test-dev, based on the powerful OD method DINO! We used a huge model size (700M), beating much larger attention-based models like SwinV2-G and BEIT-3. Check out our new version and stay tuned!
  • [09/20/2022] Our FocalNet has been accepted by NeurIPS 2022!
  • [04/02/2022] Created a Gradio demo on Hugging Face Spaces to visualize the modulation mechanism. Check it out!

Introduction

We propose FocalNets (Focal Modulation Networks), an attention-free architecture that achieves superior performance to SoTA self-attention (SA) methods across various vision benchmarks. SA is a first-interaction, last-aggregation (FILA) process, as shown above. Our focal modulation inverts the process into first-aggregation, last-interaction (FALI). This inversion brings several merits:

  • Translation invariance: It is performed for each target token with the context centered around it.
  • Explicit input-dependency: The modulator is computed by aggregating the short- and long-range context from the input and then applied to the target token.
  • Spatial- and channel-specific: It first aggregates the context spatially and then channel-wise, followed by an element-wise modulation.
  • Decoupled feature granularity: The query token preserves the individual information at the finest level, while coarser context is extracted around it. The two are decoupled but connected through the modulation operation.
  • Easy to implement: Both context aggregation and interaction can be implemented in a very simple and lightweight way (see the sketch after this list). It needs no softmax, no multiple attention heads, and no feature map rolling or unfolding.
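
To make the last point concrete, below is a minimal PyTorch sketch of the focal modulation idea: hierarchical depth-wise convolutions for context aggregation, per-level gating, and an element-wise modulation of the query. This is our own illustration under simplifying assumptions (the class and argument names are ours), not the repository's FocalModulation module:

import torch
import torch.nn as nn

class FocalModulationSketch(nn.Module):
    """Illustrative focal modulation block (not the official implementation)."""

    def __init__(self, dim, focal_levels=3, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.focal_levels = focal_levels
        # pre-linear projection: query, context, and per-level gates (+1 for the global level)
        self.f = nn.Linear(dim, 2 * dim + (focal_levels + 1))
        # hierarchical depth-wise convolutions for short- to long-range context
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=k, padding=k // 2, groups=dim, bias=False),
                nn.GELU(),
            )
            for k in kernel_sizes[:focal_levels]
        ])
        self.h = nn.Conv2d(dim, dim, kernel_size=1)   # modulator projection
        self.proj = nn.Linear(dim, dim)               # post-linear projection

    def forward(self, x):                             # x: (B, H, W, C)
        B, H, W, C = x.shape
        q, ctx, gates = torch.split(self.f(x), (C, C, self.focal_levels + 1), dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)                 # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)             # (B, L+1, H, W)
        ctx_all = 0
        for level, conv in enumerate(self.convs):     # first aggregation ...
            ctx = conv(ctx)
            ctx_all = ctx_all + ctx * gates[:, level:level + 1]
        ctx_global = ctx.mean(dim=(2, 3), keepdim=True)   # global average pooling level
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]
        modulator = self.h(ctx_all).permute(0, 2, 3, 1)   # (B, H, W, C)
        return self.proj(q * modulator)               # ... last interaction

A layer like this can stand in for a self-attention block in a stage-wise backbone; the only spatial operators involved are depth-wise convolutions and an average pool.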

Before getting started, see what our FocalNets have learned to perceive in images and where they modulate!

Finally, FocalNets are built with convolutional and linear layers, but go beyond them by introducing a new modulation mechanism that is simple, generic, effective and efficient. We hereby recommend:

Focal-Modulation May be What We Need for Visual Modeling!

Getting Started
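
As a quick, hedged usage sketch (not taken from the repo's documentation): the classification model is defined by the FocalNet class in classification/models/focalnet.py, and its constructor arguments (depths, embed_dim, focal_levels, focal_windows) also appear in the issues further down this page. The checkpoint filename below is illustrative, and any omitted arguments fall back to the repo's defaults, so double-check against the code:

import torch
from focalnet import FocalNet  # classification/models/focalnet.py in this repo

# FocalNet-T (SRF): depths [2,2,6,2], dim 96, two focal levels -> kernels [3,5]
model = FocalNet(depths=[2, 2, 6, 2], embed_dim=96,
                 focal_levels=[2, 2, 2, 2], focal_windows=[3, 3, 3, 3])

ckpt = torch.load("focalnet_tiny_srf.pth", map_location="cpu")  # illustrative path
model.load_state_dict(ckpt["model"])
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # ImageNet-1K head -> (1, 1000)
print(logits.shape)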

Benchmarking

Image Classification on ImageNet-1K

  • Strict comparison with multi-scale Swin and Focal Transformers:
Model Depth Dim Kernels #Params. (M) FLOPs (G) Throughput (imgs/s) Top-1 Download
FocalNet-T [2,2,6,2] 96 [3,5] 28.4 4.4 743 82.1 ckpt/config/log
FocalNet-T [2,2,6,2] 96 [3,5,7] 28.6 4.5 696 82.3 ckpt/config/log
FocalNet-S [2,2,18,2] 96 [3,5] 49.9 8.6 434 83.4 ckpt/config/log
FocalNet-S [2,2,18,2] 96 [3,5,7] 50.3 8.7 406 83.5 ckpt/config/log
FocalNet-B [2,2,18,2] 128 [3,5] 88.1 15.3 280 83.7 ckpt/config/log
FocalNet-B [2,2,18,2] 128 [3,5,7] 88.7 15.4 269 83.9 ckpt/config/log
  • Strict comparison with isotropic ViT models:
Model Depth Dim Kernels #Params. (M) FLOPs (G) Throughput (imgs/s) Top-1 Download
FocalNet-T 12 192 [3,5,7] 5.9 1.1 2334 74.1 ckpt/config/log
FocalNet-S 12 384 [3,5,7] 22.4 4.3 920 80.9 ckpt/config/log
FocalNet-B 12 768 [3,5,7] 87.2 16.9 300 82.4 ckpt/config/log

ImageNet-22K Pretraining

Model Depth Dim Kernels #Params. (M) Download
FocalNet-L [2,2,18,2] 192 [5,7,9] 207 ckpt/config
FocalNet-L [2,2,18,2] 192 [3,5,7,9] 207 ckpt/config
FocalNet-XL [2,2,18,2] 256 [5,7,9] 366 ckpt/config
FocalNet-XL [2,2,18,2] 256 [3,5,7,9] 366 ckpt/config
FocalNet-H [2,2,18,2] 352 [3,5,7] 687 ckpt/config
FocalNet-H [2,2,18,2] 352 [3,5,7,9] 689 ckpt/config

NOTE: We reorder the class names in ImageNet-22K so that we can directly use the first 1K logits for evaluating on ImageNet-1K. Note that the 851st class (label=850) of ImageNet-1K is missing from ImageNet-22K. Please refer to this labelmap. More discussion can be found in this issue.
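
As a hedged illustration of what the note above implies (dummy tensors only, no real checkpoint): because the reordered labelmap puts the 1K classes first, ImageNet-1K evaluation of a 22K head reduces to slicing the first 1000 logits.

import torch

# Dummy 22K logits for a batch of 4 images; a real model would produce these.
logits_22k = torch.randn(4, 21842)
# With the reordered labelmap, the first 1000 entries correspond to ImageNet-1K
# classes (modulo the missing class noted above), so evaluation just slices them.
logits_1k = logits_22k[:, :1000]
pred_1k = logits_1k.argmax(dim=-1)
print(pred_1k.shape)  # torch.Size([4])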

Object Detection on COCO

Backbone Kernels Lr Schd #Params. (M) FLOPs (G) box mAP mask mAP Download
FocalNet-T [9,11] 1x 48.6 267 45.9 41.3 ckpt/config/log
FocalNet-T [9,11] 3x 48.6 267 47.6 42.6 ckpt/config/log
FocalNet-T [9,11,13] 1x 48.8 268 46.1 41.5 ckpt/config/log
FocalNet-T [9,11,13] 3x 48.8 268 48.0 42.9 ckpt/config/log
FocalNet-S [9,11] 1x 70.8 356 48.0 42.7 ckpt/config/log
FocalNet-S [9,11] 3x 70.8 356 48.9 43.6 ckpt/config/log
FocalNet-S [9,11,13] 1x 72.3 365 48.3 43.1 ckpt/config/log
FocalNet-S [9,11,13] 3x 72.3 365 49.3 43.8 ckpt/config/log
FocalNet-B [9,11] 1x 109.4 496 48.8 43.3 ckpt/config/log
FocalNet-B [9,11] 3x 109.4 496 49.6 44.1 ckpt/config/log
FocalNet-B [9,11,13] 1x 111.4 507 49.0 43.5 ckpt/config/log
FocalNet-B [9,11,13] 3x 111.4 507 49.8 44.1 ckpt/config/log
  • Other detection methods
Backbone Kernels Method Lr Schd #Params. (M) FLOPs (G) box mAP Download
FocalNet-T [11,9,9,7] Cascade Mask R-CNN 3x 87.1 751 51.5 ckpt/config/log
FocalNet-T [11,9,9,7] ATSS 3x 37.2 220 49.6 ckpt/config/log
FocalNet-T [11,9,9,7] Sparse R-CNN 3x 111.2 178 49.9 ckpt/config/log

Semantic Segmentation on ADE20K

  • Resolution 512x512 and Iters 160k
Backbone Kernels Method #Params. (M) FLOPs (G) mIoU mIoU (MS) Download
FocalNet-T [9,11] UPerNet 61 944 46.5 47.2 ckpt/config/log
FocalNet-T [9,11,13] UPerNet 61 949 46.8 47.8 ckpt/config/log
FocalNet-S [9,11] UPerNet 83 1035 49.3 50.1 ckpt/config/log
FocalNet-S [9,11,13] UPerNet 84 1044 49.1 50.1 ckpt/config/log
FocalNet-B [9,11] UPerNet 124 1180 50.2 51.1 ckpt/config/log
FocalNet-B [9,11,13] UPerNet 126 1192 50.5 51.4 ckpt/config/log

Visualizations

There are three steps in our FocalNets:

  1. Contextualization with depth-wise conv;
  2. Multi-scale aggregation with gating mechanism;
  3. Modulator derived from context aggregation and projection.

We visualize them one by one.

  • Depth-wise convolution kernels learned in FocalNets:

Yellow colors represent higher values. Clearly, FocalNets learn to gather more local context at earlier stages and more global context at later stages.

  • Gating maps at last layer of FocalNets for different input images:

From left to right: the input image, the gating maps for focal levels 1, 2 and 3, and the global context. Clearly, our model has learned where to gather context depending on the visual content at different locations.

  • Modulator learned in FocalNets for different input images:

The modulator derived from our model automatically learns to focus on the foreground regions.

To run the visualization on your own, please refer to the visualization notebook.
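
For a rough idea of what such a visualization involves (our own hedged sketch, not the notebook's code), a modulator map hooked from a layer can be reduced over channels, upsampled to the input resolution, and displayed as a heatmap:

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def show_modulator(modulator, image_size=224):
    # modulator: (1, C, H, W) tensor captured with a forward hook (assumed shape)
    heat = modulator.norm(dim=1, keepdim=True)          # channel-wise magnitude
    heat = F.interpolate(heat, size=image_size, mode="bilinear", align_corners=False)
    plt.imshow(heat[0, 0].detach().cpu().numpy(), cmap="viridis")  # yellow = higher values
    plt.axis("off")
    plt.show()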

Citation

If you find this repo useful to your project, please consider citing it with the following BibTeX entry:

@inproceedings{yang2022focal,
      title={Focal Modulation Networks},
      author={Jianwei Yang and Chunyuan Li and Xiyang Dai and Jianfeng Gao},
      booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
      year={2022}
}

Acknowledgement

Our codebase is built on Swin Transformer and Focal Transformer. To achieve SoTA object detection performance, we rely heavily on the advanced DINO method and advice from its authors. We thank the authors for their nicely organized code!

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

focalnet's People

Contributors

cannacak, jwyang

focalnet's Issues

Gradient Overflow Issue for focalnet_small_lrf

Hello @jwyang

I am trying to reproduce your results for focalnet_small_srf with exactly the same hardware setup and configs you mentioned: 4 nodes (32 GPUs) with a batch size of 32 each (a total batch size of 1024). I am directly using the same configuration as in the repository.

I am using amp O1 as mentioned in the repository. However, I get gradient overflow, and both grad_norm and loss go to NaN after training for 24 epochs.

Have you ever encountered this issue? What's the solution?

Please see the log and config used for this training, attached.
log.txt

Any insights would be appreciated.

Increasing batch size negatively impacts mAP, is it because of padding ?

Hello,
I have noticed that running evaluations with batch size > 1 leads to much lower mAP, so I was wondering whether the reason is that the model (large fl4 with 5-scale DINO) was trained with only 1 image per GPU? This is not specified in focal-dino's README and I would like to make sure this is indeed the reason.

As an additional question, does anyone know why increasing the batch size does not improve the inference speed per image? I know it is not because of the FocalNet backbone, because I have observed the same effect with ResNet-50 and Swin backbones.

focalnet_large_fl4_o365_finetuned_on_coco.pth size mismatch

Hello,

Thank you for sharing your experiment.

I am trying to train an object detection based on focalnet large from this checkpoint :
https://github.com/FocalNet/FocalNet-DINO#training

However, a size mismatch is happening. It happened with the checkpoint pretrained on Object365 and then finetuned on the COCO dataset. I am using the config file "DINO_4scale_focalnet_large_fl4.py" instead of "DINO_4scale_focalnet_fl4.py", as I did not find the latter in the repo. I was wondering if the config file uploaded in the repo is the correct one?

Here is the message:

RuntimeError: Error(s) in loading state_dict for DINO:
        size mismatch for transformer.level_embed: copying a param with shape torch.Size([5, 256]) from checkpoint, the shape in current model is torch.Size([4, 256]).
        size mismatch for transformer.encoder.layers.0.self_attn.sampling_offsets.weight: copying a param with shape torch.Size([320, 256]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for transformer.encoder.layers.0.self_attn.sampling_offsets.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for transformer.encoder.layers.0.self_attn.attention_weights.weight: copying a param with shape torch.Size([160, 256]) from checkpoint, the shape in current model is torch.Size([128, 256]).
        size mismatch for transformer.encoder.layers.0.self_attn.attention_weights.bias: copying a param with shape torch.Size([160]) from checkpoint, the shape in current model is torch.Size([128]).
        size mismatch for transformer.encoder.layers.1.self_attn.sampling_offsets.weight: copying a param with shape torch.Size([320, 256]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for transformer.encoder.layers.1.self_attn.sampling_offsets.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for transformer.encoder.layers.1.self_attn.attention_weights.weight: copying a param with shape torch.Size([160, 256]) from checkpoint, the shape in current model is torch.Size([128, 256]).
        size mismatch for transformer.encoder.layers.1.self_attn.attention_weights.bias: copying a param with shape torch.Size([160]) from checkpoint, the shape in current model is torch.Size([128]).
        size mismatch for transformer.encoder.layers.2.self_attn.sampling_offsets.weight: copying a param with shape torch.Size([320, 256]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for transformer.encoder.layers.2.self_attn.sampling_offsets.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for transformer.encoder.layers.2.self_attn.attention_weights.weight: copying a param with shape torch.Size([160, 256]) from checkpoint, the shape in current model is torch.Size([128, 256]).
        size mismatch for transformer.encoder.layers.2.self_attn.attention_weights.bias: copying a param with shape torch.Size([160]) from checkpoint, the shape in current model is torch.Size([128]).
        size mismatch for transformer.encoder.layers.3.self_attn.sampling_offsets.weight: copying a param with shape torch.Size([320, 256]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for transformer.encoder.layers.3.self_attn.sampling_offsets.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for transformer.encoder.layers.3.self_attn.attention_weights.weight: copying a param with shape torch.Size([160, 256]) from checkpoint, the shape in current model is torch.Size([128, 256]).
        size mismatch for transformer.encoder.layers.3.self_attn.attention_weights.bias: copying a param with shape torch.Size([160]) from checkpoint, the shape in current model is torch.Size([128]).
        size mismatch for transformer.encoder.layers.4.self_attn.sampling_offsets.weight: copying a param with shape torch.Size([320, 256]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for transformer.encoder.layers.4.self_attn.sampling_offsets.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for transformer.encoder.layers.4.self_attn.attention_weights.weight: copying a param with shape torch.Size([160, 256]) from checkpoint, the shape in current model is torch.Size([128, 256]).
        size mismatch for transformer.encoder.layers.4.self_attn.attention_weights.bias: copying a param with shape torch.Size([160]) from checkpoint, the shape in current model is torch.Size([128]).
        size mismatch for transformer.encoder.layers.5.self_attn.sampling_offsets.weight: copying a param with shape torch.Size([320, 256]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for transformer.encoder.layers.5.self_attn.sampling_offsets.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for transformer.encoder.layers.5.self_attn.attention_weights.weight: copying a param with shape torch.Size([160, 256]) from checkpoint, the shape in current model is torch.Size([128, 256]).
        size mismatch for transformer.encoder.layers.5.self_attn.attention_weights.bias: copying a param with shape torch.Size([160]) from checkpoint, the shape in current model is torch.Size([128]).
        size mismatch for transformer.decoder.layers.0.cross_attn.sampling_offsets.weight: copying a param with shape torch.Size([320, 256]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for transformer.decoder.layers.0.cross_attn.sampling_offsets.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for transformer.decoder.layers.0.cross_attn.attention_weights.weight: copying a param with shape torch.Size([160, 256]) from checkpoint, the shape in current model is torch.Size([128, 256]).
        size mismatch for transformer.decoder.layers.0.cross_attn.attention_weights.bias: copying a param with shape torch.Size([160]) from checkpoint, the shape in current model is torch.Size([128]).
        size mismatch for transformer.decoder.layers.1.cross_attn.sampling_offsets.weight: copying a param with shape torch.Size([320, 256]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for transformer.decoder.layers.1.cross_attn.sampling_offsets.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for transformer.decoder.layers.1.cross_attn.attention_weights.weight: copying a param with shape torch.Size([160, 256]) from checkpoint, the shape in current model is torch.Size([128, 256]).
        size mismatch for transformer.decoder.layers.1.cross_attn.attention_weights.bias: copying a param with shape torch.Size([160]) from checkpoint, the shape in current model is torch.Size([128]).
        size mismatch for transformer.decoder.layers.2.cross_attn.sampling_offsets.weight: copying a param with shape torch.Size([320, 256]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for transformer.decoder.layers.2.cross_attn.sampling_offsets.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for transformer.decoder.layers.2.cross_attn.attention_weights.weight: copying a param with shape torch.Size([160, 256]) from checkpoint, the shape in current model is torch.Size([128, 256]).
        size mismatch for transformer.decoder.layers.2.cross_attn.attention_weights.bias: copying a param with shape torch.Size([160]) from checkpoint, the shape in current model is torch.Size([128]).
        size mismatch for transformer.decoder.layers.3.cross_attn.sampling_offsets.weight: copying a param with shape torch.Size([320, 256]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for transformer.decoder.layers.3.cross_attn.sampling_offsets.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for transformer.decoder.layers.3.cross_attn.attention_weights.weight: copying a param with shape torch.Size([160, 256]) from checkpoint, the shape in current model is torch.Size([128, 256]).
        size mismatch for transformer.decoder.layers.3.cross_attn.attention_weights.bias: copying a param with shape torch.Size([160]) from checkpoint, the shape in current model is torch.Size([128]).
        size mismatch for transformer.decoder.layers.4.cross_attn.sampling_offsets.weight: copying a param with shape torch.Size([320, 256]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for transformer.decoder.layers.4.cross_attn.sampling_offsets.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for transformer.decoder.layers.4.cross_attn.attention_weights.weight: copying a param with shape torch.Size([160, 256]) from checkpoint, the shape in current model is torch.Size([128, 256]).
        size mismatch for transformer.decoder.layers.4.cross_attn.attention_weights.bias: copying a param with shape torch.Size([160]) from checkpoint, the shape in current model is torch.Size([128]).
        size mismatch for transformer.decoder.layers.5.cross_attn.sampling_offsets.weight: copying a param with shape torch.Size([320, 256]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for transformer.decoder.layers.5.cross_attn.sampling_offsets.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for transformer.decoder.layers.5.cross_attn.attention_weights.weight: copying a param with shape torch.Size([160, 256]) from checkpoint, the shape in current model is torch.Size([128, 256]).
        size mismatch for transformer.decoder.layers.5.cross_attn.attention_weights.bias: copying a param with shape torch.Size([160]) from checkpoint, the shape in current model is torch.Size([128]).
        size mismatch for input_proj.0.0.weight: copying a param with shape torch.Size([256, 192, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 384, 1, 1]).
        size mismatch for input_proj.1.0.weight: copying a param with shape torch.Size([256, 384, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 768, 1, 1]).
        size mismatch for input_proj.2.0.weight: copying a param with shape torch.Size([256, 768, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 1536, 1, 1]).
        size mismatch for input_proj.3.0.weight: copying a param with shape torch.Size([256, 1536, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 1536, 3, 3]).

Visualization

Hi, I am trying to plot the visualization of modulator values as in Fig. 4 of your paper using the released checkpoint; however, the generated visualization shows chessboard-like square boxes (see the attached screenshots 0_attnmap and gradcam_orig).
The code is as follows:

activation = F.interpolate(activation, size=224, mode='bilinear')
ax.imshow(activation)

How do you upsample the map to show a more natural heatmap? Thank you for your time.

Setting up ImageNet-22K

Hi @jwyang,

Thank you for the very interesting work.

We are planning to train Focal-B on ImageNet-22K. Could you please provide instructions for preparing the ImageNet-22K dataset?

Currently on the official website, there are two variants available as follows:

Winter 2021 release

  • ImageNet21K
  • Processed version of ImageNet21K

Kindly let us know which of these versions was used in the main paper for pretraining on the IN-22K dataset.

Thank you and kind regards,
Muhammad Uzair

All checkpoint downloads unavailable

All checkpoint links lead to the same page:

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>PublicAccessNotPermitted</Code>
<Message>Public access is not permitted on this storage account. RequestId:677b1eb8-501e-0054-079c-93e2cc000000 Time:2023-05-31T08:48:21.6368816Z</Message>
</Error>

Correct batch size ?

Hello @jwyang! Thanks for the great work! I was looking at the log for FocalNet-S and the BATCH_SIZE is set to 32.

However, you mention that BATCH_SIZE should be 128 in the classification guide.

Which one should it be?

Also, how many nodes/GPUs were used to train this model?

How to modify in optimizer_config

Hi Team
Great work and thanks for your contribution!

I'm working on training the mask_rcnn_focalnet_small_patch4_mstrain_480-800_adamw_3x_coco_lrf.py with custom dataset in coco format.

  • The DistOptimizerHook has been deprecated in mmcv.runner.hooks.optimizer, and it is recommended to use OptimizerHook instead.
    # do not use mmdet version fp16
    fp16 = None
    optimizer_config = dict(
       type="DistOptimizerHook",
       update_interval=1,
       grad_clip=None,
       coalesce=True, 
       bucket_size_mb=-1,
       use_fp16=True,
    )
  • I've replaced it with the code below, which uses the default OptimizerHook:
    # do not use mmdet version fp16
    fp16 = None
    optimizer_config = dict(
        grad_clip=None,
    )
  • Installed mmcv from source:
    git clone https://github.com/open-mmlab/mmcv.git -b 1.x /openmmlab/mmcv \
    && cd /openmmlab/mmcv \
    && MMCV_WITH_OPS=1 pip install --no-cache-dir -e . -v

I would like to know the recommended configuration for fp16 and optimizer_config in place of DistOptimizerHook for distributed training.

where is ckpt_path = "focalnet_base_lrf.pth" and attention or focus scores

Hello, Jianwei:

I am trying to use FocalNet to get attention/focus scores for my images. I see it performs well in the modulation maps of the images below, and this is exactly what I need: I want to calculate each pixel's attention/focus score. In your image, yellow marks the highest attention scores (I am not sure "attention score" is the right name; basically I want to see which pixels get more attention from people's eyes).
And I have two questions:

  1. I tried to run the code in visualization.ipynb but cannot find "focalnet_base_lrf.pth"; where can I download it, please?
  2. Can you give a hint as to which part of the code produces the attention scores shown in the modulation map?

Thank you very much for your help in advance!

Best wishes,
Wen

Cannot download pre-trained weight files from Git repository.

Hello,

I am trying to download pre-trained weights from the repository (Sparse R-CNN with a FocalNet-Tiny backbone pre-trained on COCO object detection). Clicking the "ckpt" button gives me a "PublicAccessNotPermitted" error. Why are the files not open to the public? Can I download them from somewhere else?

Thanks in advance!

Issue in reproducing evaluation results with sparse rcnn checkpoint

First of all thank you for releasing the code and ckpts.
I am not able to reproduce the results using the config file "sparse_rcnn_focalnet_tiny_fpn_300_proposals_crop_mstrain_480-800_3x_coco_lrf.py" and its corresponding checkpoint given in the README. The code itself shows that there are mismatches in the state dictionary (see the attached screenshot).
To get this to run I also had to update line 1 of the config from _base_ = '../_base_/sparse_rcnn_focalnet_fpn.py' to _base_ = '../_base_/models/sparse_rcnn_focalnet_fpn.py'.
The metrics obtained are very low and close to zero!
Following is the command I used to run the evaluation:

python tools/test.py configs/focalnet/sparse_rcnn_focalnet_tiny_fpn_300_proposals_crop_mstrain_480-800_3x_coco_lrf.py ckpts/focalnet_tiny_lrf_sparsercnn_3x.pth --eval bbox

The mAP values obtained are attached as a screenshot.
Could you please look into this and let me know if I am doing something wrong here?

Suggested ways to fine-tune Focal-DINO (DINO-based), thanks.

Hi Sir, thanks for your great work~

For the Focal-DINO (DINO-based) release, is it possible to fine-tune Focal-DINO (with a custom dataset) using these latest checkpoints as the backbone/pretrained weights as well?

E.g.: o365_ckpt (pretrained on O365) as the backbone, and coco_ckpt (fine-tuned on COCO) as the pretrained weights.

[12/16/2022] πŸ’₯ We are pleased to release our [FocalNet-Large-DINO checkpoint](https://github.com/FocalNet/FocalNet-DINO#model-zoos) pretrained on Object365 and finetuned on COCO, which help to get 63.5 mAP without tta on COCO minival! Check it out

It seems we get a RuntimeError from 'loading state_dict for DINO' on our side so far (can it be ignored for DINO?), and any suggestion/advice would be a great help for us. Thanks in advance.

Traceback (most recent call last):
  File "main.py", line 398, in <module>
    main(args)
  File "main.py", line 251, in main
    _load_output = model_without_ddp.load_state_dict(_tmp_st, strict=False)
  File "/home/jovyan/.conda/envs/detrex_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DINO:
        size mismatch for transformer.level_embed: copying a param with shape torch.Size([5, 256]) from checkpoint, the shape in current model is torch.Size([4, 256])...

Merge Prelinear and Post-Linear layer?

Hi, thanks for releasing the code.
Looking at the diagram and the code implementation, I believe we can merge the post-linear projection layer of a previous focal block into the pre-linear layer of the next focal block, since they are both matrix multiplications without an activation in between. This would save parameters and inference time.
However, I am not sure of the effect if we drop the post-linear layer during training.
Looking forward to your opinion, thanks.

torch.utils.checkpoint + DDP

I found the useful Gradient Checkpointing trick in your implementation (i.e. the use_checkpoint flag).

I failed to use it with DDP training (it works fine on a single GPU); an error occurs like below:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 440 with name backbone.backbone.layers.3.blocks.2.mlp.fc2.bias has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.

My torch version is 1.10.0

Could you please provide some advice? Thanks.

How is global average pooling done, in detail?

In the FocalNet paper, it states:

"To capture global context of the whole input, which could be high-resolution, we apply a global average pooling on the L-th level feature map Z_(L+1) = Avg-Pool(Z_(L)). Thus, we obtain in total (L+1) feature maps"

I'm not finding any other details in the paper. Can @jwyang or others give more details? I don't understand how one starts with L feature maps and ends up with L+1 feature maps. If I understand correctly, the pooling is spatial, over the H x W dimensions. Once pooled, the dimension sizes will be smaller, e.g., a pooling size of 2 would give H/2 x W/2. So how are feature maps of smaller size added to maps of larger size to get Z_out?
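
For reference, one reading of the quoted passage (a sketch under our own assumptions, not the authors' code) is that the (L+1)-th map is simply the spatial mean of the L-th level kept as a 1x1 map; no resizing is needed because broadcasting expands it back to H x W when the gated sum over the L+1 levels is formed:

import torch

B, C, H, W = 2, 96, 56, 56
z_L = torch.randn(B, C, H, W)                  # L-th level feature map

# Global average pooling over H x W produces a (B, C, 1, 1) "map"; broadcasting
# expands it back to H x W when it is gated and summed with the other levels.
z_global = z_L.mean(dim=(2, 3), keepdim=True)  # shape (B, C, 1, 1)
gate_global = torch.rand(B, 1, H, W)           # per-location gate for the global level
contribution = z_global * gate_global          # broadcasts to (B, C, H, W)
print(contribution.shape)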

It seems that the shape of pre-trained focalnet_base_lrf is incompatible with Dino-base.

Thanks for your wonderful work! I was drawn here from the repository of FocalNet-DINO.

When I tried to repeat your experiments, it seems that the shape of the pre-trained focalnet_base_lrf model is incompatible with DINO-base, however.

The error log when loading the pre-trained focalnet-base in DINO is here:
RuntimeError: Error(s) in loading state_dict for FocalNet:
size mismatch for patch_embed.proj.weight: copying a param with shape torch.Size([128, 3, 4, 4]) from checkpoint, the shape in current model is torch.Size([128, 3, 7, 7]).
size mismatch for layers.0.downsample.proj.weight: copying a param with shape torch.Size([256, 128, 2, 2]) from checkpoint, the shape in current model is torch.Size([256, 128, 3, 3]).
size mismatch for layers.1.downsample.proj.weight: copying a param with shape torch.Size([512, 256, 2, 2]) from checkpoint, the shape in current model is torch.Size([512, 256, 3, 3]).
size mismatch for layers.2.downsample.proj.weight: copying a param with shape torch.Size([1024, 512, 2, 2]) from checkpoint, the shape in current model is torch.Size([1024, 512, 3, 3]).

For comparison, I have also tried focalnet_large_lrf_384_fl4 and the log is normal.
The log when loading pretrained focalnet-large in Dino is here:
_IncompatibleKeys(missing_keys=['norm0.weight', 'norm0.bias', 'norm1.weight', 'norm1.bias', 'norm2.weight', 'norm2.bias', 'norm3.weight', 'norm3.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.weight', 'head.bias'])

So I guess some details of focalnet_base_lrf have been modified somewhat during pre-training.
Your help would be much appreciated!

Can Train on 1 GPU?

Can FocalNet be trained/fine-tuned on a single GPU instead of a distributed 8-GPU setting?

Incomplete pre-trained weights

Hi,

All links in the repository to download pre-trained weights are still broken, as stated in issue #49. On Hugging Face, the only weights available are for classification and DINO-based models, while the repository's releases include only classification models and Sparse R-CNN with a FocalNet-Tiny backbone.

For example, neither Mask R-CNN-based nor semantic segmentation weights can be found.

Thanks in advance!

Use of nn.LayerNorm by FocalNet for a segmentation task and its alternatives

Excellent work!

By default, layer normalization is used as FocalNet(norm_layer=nn.LayerNorm). I'm wondering if it's a better choice for a semantic segmentation task. I would love to hear some thoughts on this.

Strangely enough, setting norm_layer=nn.BatchNorm2d caused several errors since x in nn.BatchNorm2d(embed_dim)(x) was found to be a 3D Tensor with embed_dim as its last dimension.

  1. If nn.LayerNorm is supposed to be the default normalization for FocalNet, then why do we even have it as an input parameter?
  2. If one wishes to use a different normalization, is there a quick fix?

Looking forward to getting awesome replies. Thanks!
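
On the second question, a generic workaround (our own hedged sketch, independent of this repo's internals) is to wrap a batch norm so it accepts the (B, L, C) channels-last token tensors that the norm layers receive:

import torch
import torch.nn as nn

class BatchNormChannelsLast(nn.Module):
    # Drop-in alternative to nn.LayerNorm for (B, L, C) token tensors.
    def __init__(self, dim):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x):              # x: (B, L, C)
        return self.bn(x.transpose(1, 2)).transpose(1, 2)

# Hypothetical usage (untested against this repo): model = FocalNet(norm_layer=BatchNormChannelsLast)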

Pre-weight of FocalNet backbone for custom segmentation network

Hello
How are you?
Thanks for contributing to this project.
I am going to use the FocalNet network as the backbone (encoder) of my custom network for semantic segmentation.
I am using my own custom decoder.
Where can I get the pre-trained weights of the FocalNet backbone for segmentation?
Of course, you shared pre-trained models for segmentation, but they were used ONLY with the Mask R-CNN or UPerNet methods.

visualize.ipynb not found

I recently read your FocalNet paper, and I feel very inspired. In the readme, you mention the visualization notebook, but the URL link is wrong. In addition, I saw Figure 1 in the paper: using the CAM and GradCAM tools can display interpretable images. I am very curious how you display interpretable images through code. Thank you for reading my question, and thank you very much for your answer.

[FocalNet-DINO] Training Command is not correct

"Train on COCO with 5scale DINO and FocalNet-L with 4 focal levels" command is not corrent, which would get an error as:

...
FileNotFoundError: file "/home/xxx/FocalNet-DINO/config/DINO/DINO_5scale_focalnet_fl4.py" does not exist
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1100) of binary: /home/xxx/.conda/envs/dino/bin/python
...

Could you please help resolve this error?

Your answer and guide will be appreciated!

Training on custom dataset

Hi Jianwei,

Thank you for providing the official implementation of FocalNet. I faced issues while trying to use the model for my research. I am working on a binary classification model with an input size of 360x880, and I have made the related adjustments. I am specifically using the FocalNet-Tiny LRF model. I observed that the training BCEWithLogits loss was not decreasing at all and stayed around 0.69 for the whole 30 epochs, while the accuracy stayed around 50%. For better performance, I also tried fine-tuning the ImageNet model while ignoring the head weights, which resulted in the same outcome. I just wanted to understand if you might have any idea what could cause this issue.

FocalNet onnx conversion

Hi,

I am trying to convert a FocalNet model to ONNX using mmdeploy. However, I am running into an error due to a size mismatch.

Config being used: cascade_mask_rcnn_focalnet_tiny_patch4_mstrain_480-800_adamw_3x_coco_lrf.py
Weights: focalnet_tiny_lrf_cascade_maskrcnn_3x.pth

The error I am encountering:

File "/DATA/scratch/FocalNet-MMDetection/tools/mmdeploy/mmdeploy/codebase/mmdet/models/roi_heads/cascade_roi_head.py", line 66, in cascade_roi_head__simple_test
bbox_pred = bbox_pred.reshape(batch_size, num_proposals_per_img, 4)
RuntimeError: shape '[1, 1000, 4]' is invalid for input of size 8000

Can you please point out what could be the reason for this error?

Issue reproducing evaluation metric for FocalNet+DINO+O365pretrain

Thank you for your great work!
However, I am having difficulty reproducing the evaluation metric for the model open-sourced in the link. Specifically, my evaluation result is 0.3 AP lower than what you reported in the README (see the attached screenshot).
My command used to run the evaluation is:

python -m torch.distributed.launch --nproc_per_node=4 main.py \
  --output_dir output/path \
  -c config/DINO/DINO_5scale_focalnet_large_fl4.py --coco_path coco/path \
  --eval --resume checkpoint/path

Could you please help me with this issue? I would be grateful if you could provide some guidance on what I might be doing wrong, or share any additional details about the exact process you used to compute the evaluation metric.
Thank you very much!

Training Time for "FocalNet for Object Detection with DINO"

First and foremost, I would like to express my sincere appreciation for your outstanding work.

As I go through the repository, I cannot find the total training time (including Object365 pretraining and COCO finetuning for FocalNet-L+DINO) or a training log file, which is the information I need. Could you tell me the total training time, and what and how many GPUs you used during the training of FocalNet-L+DINO? If it is too much trouble to calculate the total training time, could you please send me the training log file so I can calculate it myself?

Thank you very much!

Pretrained weight ImageNet22K

Hello,

I was wondering if the weights mentioned in the paper will be released soon:

  • FocalNet finetuned on ImageNet-1K after being pretrained on 22K
  • FocalNet-B pretrained on 22K

Visualization of gated outputs

RuntimeError Traceback (most recent call last)
/tmp/ipykernel_63586/1379561863.py in
23 fig.add_subplot(1, 5, i+2)
24 gates_i = (upsampler(gates[:, i:i+1])).cpu().detach()
---> 25 plt.imshow(gates_i.permute(1,2,0).numpy())
26 plt.axis('off')
27 x.axes.get_xaxis().set_visible(False)

RuntimeError: number of dims don't match in permute

The speed is relatively slow despite its low MACs.

We have tried FocalNet in our recognition task, but found it clearly slower than ViT, even though it uses much less GPU memory.
Have you observed the same problem? I guess it is because the inner detailed operations are not well parallelized.

How to use FocalNet-DINO pretrained with Object365

Hi,

Thank you for sharing the work!

I am using your model pretrained on the Object365 dataset, which you indicated here: https://github.com/FocalNet/FocalNet-DINO. However, when I run the code in https://github.com/FocalNet/FocalNet-DINO/blob/main/inference_and_visualization.ipynb to predict on some images, it reports an error in /FocalNet-DINO/models/dino/backbone.py:
NotImplementedError: Unknown backbone focalnet_large_fl4_pretrained_on_o365
I added focalnet_large_fl4_pretrained_on_o365 to the dict at https://github.com/FocalNet/FocalNet-DINO/blob/main/models/dino/backbone.py#L229 and https://github.com/FocalNet/FocalNet-DINO/blob/main/models/dino/backbone.py#L205, but NotImplementedError: Unknown backbone focalnet_large_fl4_pretrained_on_o365 still comes up, now from the function https://github.com/FocalNet/FocalNet-DINO/blob/main/models/dino/focal.py#L515. I found I cannot fix it due to the unknown parameter at https://github.com/FocalNet/FocalNet-DINO/blob/main/models/dino/focal.py#L531.

Could you help me to use the model?
Thank you for your time.

Model load_state_dict issue with 'focalnet_base_iso_16.pth'

Hello @jwyang:
I have another problem when I try the isotropic focalnets model of 'focalnet_base_iso_16.pth':

I initialize the model by

# isotropic FocalNets
model = FocalNet(depths=[12], patch_size=16, embed_dim=768, focal_levels=[3], focal_windows=[3], use_layerscale=True, use_postln=True).cuda()

and load 'focalnet_base_iso_16.pth' by

ckpt_path = "focalnet_base_iso_16.pth"
ckpt = torch.load(ckpt_path)
model.load_state_dict(ckpt['model'])
model.eval()

but I have the error as below:

Error(s) in loading state_dict for FocalNet:
Unexpected key(s) in state_dict: "layers.0.blocks.0.modulation.ln.weight", "layers.0.blocks.0.modulation.ln.bias", "layers.0.blocks.1.modulation.ln.weight", "layers.0.blocks.1.modulation.ln.bias", "layers.0.blocks.2.modulation.ln.weight", "layers.0.blocks.2.modulation.ln.bias", "layers.0.blocks.3.modulation.ln.weight", "layers.0.blocks.3.modulation.ln.bias", "layers.0.blocks.4.modulation.ln.weight", "layers.0.blocks.4.modulation.ln.bias", "layers.0.blocks.5.modulation.ln.weight", "layers.0.blocks.5.modulation.ln.bias", "layers.0.blocks.6.modulation.ln.weight", "layers.0.blocks.6.modulation.ln.bias", "layers.0.blocks.7.modulation.ln.weight", "layers.0.blocks.7.modulation.ln.bias", "layers.0.blocks.8.modulation.ln.weight", "layers.0.blocks.8.modulation.ln.bias", "layers.0.blocks.9.modulation.ln.weight", "layers.0.blocks.9.modulation.ln.bias", "layers.0.blocks.10.modulation.ln.weight", "layers.0.blocks.10.modulation.ln.bias", "layers.0.blocks.11.modulation.ln.weight", "layers.0.blocks.11.modulation.ln.bias".

It seems the architectures are different. Is some part of the code out of date here? I am still running visualization.ipynb.
Thank you!

And also about the model choices:

I see the visualization on the Hugging Face space you shared last time is quite good for visualizing attention; the model behind it should be 'focalnet_base_iso_16.pth', right? That is my understanding from checking the files there.

If my focus is to generate attention scores for all pixels in images, what model do you recommend? Is the pre-trained focalnet_base_iso_16.pth good enough, or would I be better off training one on my own data (mostly online advertisement images) starting from some pre-trained model? (It seems there is no way to evaluate the quality of the attention scores except by eye/intuition.)

Sorry for the somewhat long questions... Thanks a lot!

Best wishes,
Wen

ImageNet-22k classifier layout

Can you share the classifier layout of your ImageNet-22K heads? It does not match any canonical layout I'm aware of, and it would be useful to have, as these models are useful as-is for more than just pretraining (but not without the layout).

You have 21842 classes, which matches neither the fall11 release (21841) nor Google's internal one (21843). It also does not match winter21, which has fewer than 20k classes. I cannot add a background '0' class to offset 21841 -> 21842 (assuming lexicographical sorting of the synsets, which is standard practice for ImageNet classifier layouts).

Thanks

Where are the paths of training set images and annotations to be specified?

In the example config file configs/focalnet/cascade_mask_rcnn_focalnet_tiny_patch4_mstrain_480-800_adamw_3x_coco_srf.py
we see:

data_root = 'data/coco/'
data = dict(
    train=dict(pipeline=train_pipeline),
    test=dict(
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/'
    ),
    samples_per_gpu=1,
)

Why are ann_file and img_prefix defined in the test dict but not in the train dict? Where should the paths be specified for the training set?
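
For what it's worth, in mmdetection-style configs these paths are usually inherited from the _base_ dataset config, which is why only overrides appear here. Below is a hedged sketch of how the training paths could be set explicitly, following standard mmdetection conventions rather than anything documented in this repo:

# Assumed mmdetection-style config override; the base config normally defines
# the train dataset, so only the fields you want to change need to appear here.
data_root = 'data/coco/'
data = dict(
    train=dict(
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',
        pipeline=train_pipeline,
    ),
    test=dict(
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
    ),
    samples_per_gpu=1,
)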
