
w1oves / rein

[CVPR 2024] Official implementation of "Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation"

Home Page: https://zxwei.site/rein

License: GNU General Public License v3.0

Jupyter Notebook 48.47% Python 51.46% Shell 0.07%

rein's Introduction

[CVPR 2024] Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

Zhixiang Wei¹, Lin Chen², et al.
¹ University of Science and Technology of China  ² Shanghai AI Laboratory

Project page: https://zxwei.site/rein

Paper: https://arxiv.org/pdf/2312.04265.pdf

Rein is an efficient and robust fine-tuning method developed to effectively harness Vision Foundation Models (VFMs) for Domain Generalized Semantic Segmentation (DGSS). It achieves state-of-the-art results on Cityscapes→ACDC and on GTAV→Cityscapes+Mapillary+BDD100K. Using only synthetic data, Rein reaches 78.4% mIoU on the Cityscapes validation set; using only the Cityscapes training set, it reaches an average of 77.6% mIoU on the ACDC test set.

Visualization

Trained on Cityscapes, Rein generalizes to unseen driving scenes and cities: Nighttime Shanghai, Foggy Countryside, and Rainy Hollywood.

night_shanghai.mp4
rain_chicago.mp4
fog_beijing.mp4

Performance Under Various Settings (DINOv2).

Setting mIoU Config Log & Checkpoint
GTAV → Cityscapes 66.7 config log & checkpoint
+Synthia → Cityscapes 72.2 config log & checkpoint
+UrbanSyn → Cityscapes 78.4 config log & checkpoint
+1/16 of Cityscapes training → Cityscapes 82.5 config log & checkpoint
GTAV → BDD100K 60.0 config log & checkpoint
Cityscapes → ACDC 77.6 config log & checkpoint
Cityscapes → Cityscapes-C 60.0 config log & checkpoint

Performance For Various Backbones (Trained on GTAV).

Backbone Pretraining Cityscapes mIoU Config Log & Checkpoint
ResNet50 ImageNet1k 49.1 config log & checkpoint
ResNet101 ImageNet1k 45.9 config log & checkpoint
ConvNeXt-Large ImageNet21k 57.9 config log & checkpoint
ViT-Small DINOv2 55.3 config log & checkpoint
ViT-Base DINOv2 64.3 config log & checkpoint
CLIP-Large OPENAI 58.1 config log & checkpoint
SAM-Huge SAM 59.2 config log & checkpoint

Citation

If you find our code or data helpful, please cite our paper:

@InProceedings{Wei_2024_CVPR,
    author    = {Wei, Zhixiang and Chen, Lin and Jin, Yi and Ma, Xiaoxiao and Liu, Tianle and Ling, Pengyang and Wang, Ben and Chen, Huaian and Zheng, Jinjin},
    title     = {Stronger Fewer \& Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {28619-28630}
}

🔥 News!

  • We have uploaded the config for ResNet and ConvNeXt.

  • 🔥 We have uploaded the checkpoint and config for +1/16 of the Cityscapes training set, and it achieves 82.5% mIoU on the Cityscapes validation set!

  • Rein is accepted at CVPR 2024!

  • 🔥 Using only the data from the Cityscapes training set, we achieved an average mIoU of 77.56% on the ACDC test set! This result ranks first among DGSS methods on the ACDC benchmark! The checkpoint is available in the release.

  • 🔥 Using only synthetic data (UrbanSyn, GTAV, and Synthia), Rein achieved an mIoU of 78.4% on Cityscapes! The checkpoint is available in the release.

Try and Test

Experience the demo: users can open demo.ipynb in any Jupyter-supported editor to explore our demonstration.

For testing on the Cityscapes dataset, refer to the 'Environment Setup' and 'Dataset Preparation' sections below.

Environment Setup

To set up your environment, execute the following commands:

conda create -n rein -y
conda activate rein
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia -y
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.0"
pip install "mmsegmentation>=1.0.0"
pip install "mmdet>=3.0.0"
pip install xformers=='0.0.20' # optional for DINOv2
pip install -r requirements.txt
pip install future tensorboard
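
To confirm the environment matches the versions above, a quick check can be run inside the rein environment (a minimal sanity-check sketch, not part of the repository):

# sanity check of the installed toolchain
import torch, mmcv, mmengine, mmseg, mmdet
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mmcv:", mmcv.__version__)
print("mmengine:", mmengine.__version__)
print("mmsegmentation:", mmseg.__version__)
print("mmdet:", mmdet.__version__)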

Dataset Preparation

The preparation is similar to DDB.

Cityscapes: Download leftImg8bit_trainvaltest.zip and gt_trainvaltest.zip from Cityscapes Dataset and extract them to data/cityscapes.

Mapillary: Download MAPILLARY v1.2 from Mapillary Research and extract it to data/mapillary.

GTA: Download all image and label packages from TU Darmstadt and extract them to data/gta.

Prepare datasets with these commands:

cd Rein
mkdir data
# Convert data for validation if preparing for the first time
python tools/convert_datasets/gta.py data/gta # Source domain
python tools/convert_datasets/cityscapes.py data/cityscapes
# Convert Mapillary to Cityscapes format and resize for validation
python tools/convert_datasets/mapillary2cityscape.py data/mapillary data/mapillary/cityscapes_trainIdLabel --train_id
python tools/convert_datasets/mapillary_resize.py data/mapillary/validation/images data/mapillary/cityscapes_trainIdLabel/val/label data/mapillary/half/val_img data/mapillary/half/val_label

(Optional) ACDC: Download all image and label packages from ACDC and extract them to data/acdc.

(Optional) UrbanSyn: Download all image and label packages from UrbanSyn and extract them to data/urbansyn.

The final folder structure should look like this:

Rein
├── ...
├── checkpoints
│   ├── dinov2_vitl14_pretrain.pth
│   ├── dinov2_rein_and_head.pth
├── data
│   ├── cityscapes
│   │   ├── leftImg8bit
│   │   │   ├── train
│   │   │   ├── val
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val
│   ├── bdd100k
│   │   ├── images
│   │   |   ├── 10k
│   │   │   |    ├── train
│   │   │   |    ├── val
│   │   ├── labels
│   │   |   ├── sem_seg
│   │   |   |    ├── masks
│   │   │   |    |    ├── train
│   │   │   |    |    ├── val
│   ├── mapillary
│   │   ├── training
│   │   ├── cityscapes_trainIdLabel
│   │   ├── half
│   │   │   ├── val_img
│   │   │   ├── val_label
│   ├── gta
│   │   ├── images
│   │   ├── labels
├── ...

Pretraining Weights

  • Download: Download pre-trained weights from facebookresearch for testing. Place them in the project directory without changing the file name.
  • Convert: Convert pre-trained weights for training or evaluation.
    python tools/convert_models/convert_dinov2.py checkpoints/dinov2_vitl14_pretrain.pth checkpoints/dinov2_converted.pth
    (optional for 1024x1024 resolution)
    python tools/convert_models/convert_dinov2.py checkpoints/dinov2_vitl14_pretrain.pth checkpoints/dinov2_converted_1024x1024.pth --height 1024 --width 1024
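
As a quick sanity check that the conversion produced a usable file, the converted weights can be loaded and inspected with plain PyTorch (an illustrative sketch; it assumes the converted file is a plain state dict, which may differ from the script's actual output format):

import torch

# load the converted backbone weights on CPU and print a few entries
state = torch.load("checkpoints/dinov2_converted.pth", map_location="cpu")
if isinstance(state, dict) and "state_dict" in state:
    state = state["state_dict"]
print(len(state), "tensors in the converted checkpoint")
for name in list(state)[:5]:
    print(name, tuple(state[name].shape))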

Evaluation

Run the evaluation:

python tools/test.py configs/dinov2/rein_dinov2_mask2former_512x512_bs1x4.py checkpoints/dinov2_rein_and_head.pth --backbone dinov2_converted.pth

For most of the provided release checkpoints, you can run this command to evaluate:

python tools/test.py /path/to/cfg /path/to/checkpoint --backbone /path/to/dinov2_converted.pth #(or dinov2_converted_1024x1024.pth)

Training

Start training on a single GPU:

python tools/train.py configs/dinov2/rein_dinov2_mask2former_512x512_bs1x4.py

Start training on multiple GPUs:

PORT=12345 CUDA_VISIBLE_DEVICES=1,2,3,4 bash tools/dist_train.sh configs/dinov2/rein_dinov2_mask2former_1024x1024_bs4x2.py NUM_GPUS

Generate full weights

Because we fine-tune and save only the Rein and head weights, use this script if you need a complete set of segmentor weights:

python generate_full_weights.py --segmentor_save_path SEGMENTOR_SAVE_PATH --backbone CONVERTED_BACKBONE --rein_head REIN_HEAD
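
Conceptually, this step merges the converted backbone weights with the fine-tuned Rein and head weights into one segmentor checkpoint. A minimal sketch of that idea is below; it is illustrative only (the output filename is hypothetical, and the actual generate_full_weights.py may prefix or wrap keys differently):

import torch

# load the two partial checkpoints
backbone_sd = torch.load("checkpoints/dinov2_converted.pth", map_location="cpu")
rein_head_sd = torch.load("checkpoints/dinov2_rein_and_head.pth", map_location="cpu")

# combine into a single state dict; Rein/head entries win on key clashes
full_sd = {**backbone_sd, **rein_head_sd}
torch.save(full_sd, "checkpoints/full_segmentor.pth")  # hypothetical output path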

FAQs

Acknowledgment

Our implementation is mainly based on the following repositories. Thanks to their authors.


rein's People

Contributors

w1oves


rein's Issues

some difficulties training my own dataset

Hello,

Thank you very much for open-sourcing your code!

After downloading the code, I did the following:
1. Downloaded the pre-trained model "dinov2_vitl14_pretrain.pth" and generated "dinov2_converted.pth" with "python tools/convert_models/convert_dinov2.py checkpoints/dinov2_vitl14_pretrain.pth checkpoints/dinov2_converted.pth".
2. Created a "data" folder under Rein-train. It contains two folders, images and labels, each of which contains train and val subfolders holding my own '.png' files for the training and validation sets.
(screenshot)

3. Copied "cityscapes_512x512.py" from "configs/_base_/datasets", renamed it "Cfg2.py", and changed the paths in it to my own dataset.
4. Copied "rein_dinov2_mask2former_512x512_bs1x4.py" from "configs/dinov2", renamed it "Cfg1.py", and changed the corresponding _base_ paths.
Finally, running "python tools/train.py configs/dinov2/Cfg1.py" raised "ValueError: val_dataloader, val_cfg, and val_evaluator should be either all None or not None, but got val_dataloader=None, val_cfg={'type': 'ValLoop'}, val_evaluator=None". After commenting out val_cfg and test_cfg in "Cfg1.py", I then got "KeyError: 'cfg or default_args must contain the key "type", but got {'pipeline': [{'type': 'LoadImageFromFile'}, {'type': 'LoadAnnotations'}, {'type': 'RandomChoiceResize', 'scales': [256, 307, 358, 409, 460], 'resize_type': 'ResizeShortestEdge', 'max_size': 2048}, {'type': 'RandomCrop', 'crop_size': (512, 512), 'cat_max_ratio': 0.75}, {'type': 'RandomFlip', 'prob': 0.5}, {'type': 'PhotoMetricDistortion'}, {'type': 'PackSegInputs'}]}\nNone'".
Is this a problem with my dataset setup or with my code modifications? Could you give me some advice for this situation?
Looking forward to your reply!

The confusion about the details of the paper

Dear author, I have some confusion about the details of the paper as follows:

  1. Rein can apply a softmax function to align each patch with a unique instance. Why can a softmax function achieve this goal?
  2. This strategic selection allows models to sidestep unnecessary adjustments by assigning a high value to the first token and subsequently discarding it. I am confused about why the first token is assigned a high value, and how the model is made to assign a high value to the first token rather than to another token.

dataloader error

This is a great piece of work! However, I encountered some issues while trying to train the model using the command “python tools/train.py configs/resnet/rein_resnet50_mask2former_512x512_bs1x4.py” :

/home/xiaoxu/Rein/rein/models/backbones/dino_layers/swiglu_ffn.py:43: UserWarning: xFormers is available (SwiGLU)
warnings.warn("xFormers is available (SwiGLU)")
/home/xiaoxu/Rein/rein/models/backbones/dino_layers/attention.py:27: UserWarning: xFormers is available (Attention)
warnings.warn("xFormers is available (Attention)")
/home/xiaoxu/Rein/rein/models/backbones/dino_layers/block.py:33: UserWarning: xFormers is available (Block)
warnings.warn("xFormers is available (Block)")
Fail to import ReinsConvNeXt, if you need to use it, please install mmpretrain
Traceback (most recent call last):
File "/home/xiaoxu/Rein/tools/train.py", line 116, in
main()
File "/home/xiaoxu/Rein/tools/train.py", line 105, in main
runner = Runner.from_cfg(cfg)
^^^^^^^^^^^^^^^^^^^^
File "/home/xiaoxu/.conda/envs/rein/lib/python3.11/site-packages/mmengine/runner/runner.py", line 462, in from_cfg
runner = cls(
^^^^
File "/home/xiaoxu/.conda/envs/rein/lib/python3.11/site-packages/mmengine/runner/runner.py", line 342, in init
raise ValueError(
ValueError: val_dataloader, val_cfg, and val_evaluator should be either all None or not None, but got val_dataloader=None, val_cfg={'type': 'ValLoop'}, val_evaluator=None

I have confirmed that my environment configuration is as follows:

mmcv: 2.1.0
mmdet: 3.3.0
mmengine: 0.10.3
mmsegmentation: 1.2.2
Python: 3.11
PyTorch: 2.0.1
CUDA: 11.7
I am using the Cityscapes dataset. I have tried several methods to resolve this issue but have not found a solution yet. Do you have any suggestions or guidance to help solve this problem?
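
For context, the error comes from mmengine's all-or-none rule: val_dataloader, val_cfg, and val_evaluator must either all be defined or all be None. A minimal illustrative sketch of a consistent trio, following the val_* patterns shown in the configs later on this page (dataset paths are placeholders, not the repository's config):

# either define all three together ...
val_dataloader = dict(
    batch_size=1,
    num_workers=2,
    sampler=dict(type="DefaultSampler", shuffle=False),
    dataset=dict(
        type="CityscapesDataset",
        data_root="data/cityscapes/",
        data_prefix=dict(img_path="leftImg8bit/val", seg_map_path="gtFine/val"),
        pipeline=[
            dict(type="LoadImageFromFile"),
            dict(type="Resize", scale=(2048, 512), keep_ratio=True),
            dict(type="LoadAnnotations"),
            dict(type="PackSegInputs"),
        ],
    ),
)
val_cfg = dict(type="ValLoop")
val_evaluator = dict(type="IoUMetric", iou_metrics=["mIoU"])
# ... or disable validation entirely:
# val_dataloader = None
# val_cfg = None
# val_evaluator = None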

About loading EVA-02 weights

Congratulations! This paper inspired me a lot, but I still have some questions. For instance, there appear to be multiple versions of the EVA-02 model, and the choice doesn't seem to be addressed in the paper. Additionally, could you release the corresponding Python file for loading and selecting the EVA-02 model weights? I was unable to locate this information in the paper.
Thank you! By the way, your paper is well-written, and the figures are elegant.

Custom dataset

If I want to run the task on my own dataset,
cityscapes_type = "CityscapesDataset"
this Cityscapes-specific setting seems a bit odd. What if I don't want to use this dataset class?

Can't understand the design of Δfi

Sorry for my poor comprehension, but I can't understand this design even though I've read the paper many times. I'm curious whether removing the first column of S_i and the first row of T_i achieves the same result as removing any column of S_i and any row of T_i. I can't understand why it has to be the first one.

I'd appreciate it if you could explain it to me.

have you ever tried synthia->C, B, M?

Hello, thank you for your great work!
Have you ever trained Rein on the "SYNTHIA to others" setting (single-source)?
I wonder about the SYNTHIA result under the same setting as GTAV (i.e., crop size 512 x 512).
Is there an official report for this setting?
Thanks a lot.

Error

Hello author: I got the following error when reproducing the code:
(screenshot)
(screenshot)
The error appears at iteration 4400, saying the labels cannot be read, but all earlier training iterations ran fine and my dataset setup matches the code. Why might this happen? Looking forward to your reply!

Rein performing poorly with EVA-B

I am only getting about 23 mIoU on BDD and 48 on Cityscapes when training on Cityscapes using the EVA-02-B model + Mask2Former + Rein.
Any ideas what could lead to this huge performance drop compared to EVA-L in the paper? Training config is below.

backbone_norm_cfg = dict(eps=1e-06, requires_grad=True, type='LN')
bdd_crop_size = (
    512,
    512,
)
bdd_root = '../data/bdd100k/'
bdd_test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(keep_ratio=True, scale=(
        1280,
        720,
    ), type='Resize'),
    dict(type='LoadAnnotations'),
    dict(type='PackSegInputs'),
]
bdd_train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(scale=(
        1280,
        720,
    ), type='Resize'),
    dict(cat_max_ratio=0.75, crop_size=(
        512,
        512,
    ), type='RandomCrop'),
    dict(prob=0.5, type='RandomFlip'),
    dict(type='PhotoMetricDistortion'),
    dict(type='PackSegInputs'),
]
bdd_type = 'CityscapesDataset'
cityscapes_crop_size = (
    512,
    512,
)
cityscapes_root = '../data/cityscapes/'
cityscapes_test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(keep_ratio=True, scale=(
        2048,
        512,
    ), type='Resize'),
    dict(type='LoadAnnotations'),
    dict(type='PackSegInputs'),
]
cityscapes_train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(scale=(
        2048,
        512,
    ), type='Resize'),
    dict(cat_max_ratio=0.75, crop_size=(
        512,
        512,
    ), type='RandomCrop'),
    dict(prob=0.5, type='RandomFlip'),
    dict(type='PhotoMetricDistortion'),
    dict(type='PackSegInputs'),
]
cityscapes_type = 'CityscapesDataset'
crop_size = (
    512,
    512,
)
default_hooks = dict(
    checkpoint=dict(
        by_epoch=False,
        interval=1000,
        max_keep_ckpts=3,
        save_best='bdd_mIoU',
        type='CheckpointHook'),
    logger=dict(interval=50, log_metric_by_epoch=False, type='LoggerHook'),
    param_scheduler=dict(type='ParamSchedulerHook'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    timer=dict(type='IterTimerHook'),
    visualization=dict(type='SegVisualizationHook'))
default_scope = 'mmseg'
embed_multi = dict(decay_mult=0.0, lr_mult=1.0)
env_cfg = dict(
    cudnn_benchmark=True,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
find_unused_parameters = True
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=False)
model = dict(
    backbone=dict(
        depth=12,
        drop_path_rate=0.2,
        embed_dim=768,
        img_size=512,
        in_chans=3,
        init_values=None,
        intp_freq=True,
        mlp_ratio=2.6666666666666665,
        naiveswiglu=True,
        norm_layer=dict(eps=1e-06, requires_grad=True, type='LN'),
        num_heads=12,
        out_indices=[
            3,
            5,
            7,
            11,
        ],
        patch_size=16,
        pretrained=
        '../data/eva02_B_pt_in21k_p14to16.pt',
        pt_hw_seq_len=16,
        qkv_bias=True,
        reins_config=dict(
            embed_dims=768,
            link_token_to_query=True,
            lora_dim=16,
            num_layers=12,
            patch_size=16,
            token_length=100,
            type='LoRAReins'),
        rope=True,
        subln=True,
        type='ReinsEVA2',
        use_abs_pos_emb=True,
        use_checkpoint=False,
        use_rel_pos_bias=False,
        use_shared_rel_pos_bias=False,
        xattn=True),
    data_preprocessor=dict(
        bgr_to_rgb=True,
        mean=[
            123.675,
            116.28,
            103.53,
        ],
        pad_val=0,
        seg_pad_val=255,
        size=(
            512,
            512,
        ),
        std=[
            58.395,
            57.12,
            57.375,
        ],
        type='SegDataPreProcessor'),
    decode_head=dict(
        align_corners=False,
        enforce_decoder_input_project=False,
        feat_channels=256,
        in_channels=[
            768,
            768,
            768,
            768,
        ],
        loss_cls=dict(
            class_weight=[
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                0.1,
            ],
            loss_weight=2.0,
            reduction='mean',
            type='mmdet.CrossEntropyLoss',
            use_sigmoid=False),
        loss_dice=dict(
            activate=True,
            eps=1.0,
            loss_weight=5.0,
            naive_dice=True,
            reduction='mean',
            type='mmdet.DiceLoss',
            use_sigmoid=True),
        loss_mask=dict(
            loss_weight=5.0,
            reduction='mean',
            type='mmdet.CrossEntropyLoss',
            use_sigmoid=True),
        num_classes=19,
        num_queries=100,
        num_transformer_feat_level=3,
        out_channels=256,
        pixel_decoder=dict(
            act_cfg=dict(type='ReLU'),
            encoder=dict(
                init_cfg=None,
                layer_cfg=dict(
                    ffn_cfg=dict(
                        act_cfg=dict(inplace=True, type='ReLU'),
                        embed_dims=256,
                        feedforward_channels=1024,
                        ffn_drop=0.0,
                        num_fcs=2),
                    self_attn_cfg=dict(
                        batch_first=True,
                        dropout=0.0,
                        embed_dims=256,
                        im2col_step=64,
                        init_cfg=None,
                        norm_cfg=None,
                        num_heads=8,
                        num_levels=3,
                        num_points=4)),
                num_layers=6),
            init_cfg=None,
            norm_cfg=dict(num_groups=32, type='GN'),
            num_outs=3,
            positional_encoding=dict(normalize=True, num_feats=128),
            type='mmdet.MSDeformAttnPixelDecoder'),
        positional_encoding=dict(normalize=True, num_feats=128),
        replace_query_feat=True,
        strides=[
            4,
            8,
            16,
            32,
        ],
        train_cfg=dict(
            assigner=dict(
                match_costs=[
                    dict(type='mmdet.ClassificationCost', weight=2.0),
                    dict(
                        type='mmdet.CrossEntropyLossCost',
                        use_sigmoid=True,
                        weight=5.0),
                    dict(
                        eps=1.0,
                        pred_act=True,
                        type='mmdet.DiceCost',
                        weight=5.0),
                ],
                type='mmdet.HungarianAssigner'),
            importance_sample_ratio=0.75,
            num_points=12544,
            oversample_ratio=3.0,
            sampler=dict(type='mmdet.MaskPseudoSampler')),
        transformer_decoder=dict(
            init_cfg=None,
            layer_cfg=dict(
                cross_attn_cfg=dict(
                    attn_drop=0.0,
                    batch_first=True,
                    dropout_layer=None,
                    embed_dims=256,
                    num_heads=8,
                    proj_drop=0.0),
                ffn_cfg=dict(
                    act_cfg=dict(inplace=True, type='ReLU'),
                    add_identity=True,
                    dropout_layer=None,
                    embed_dims=256,
                    feedforward_channels=2048,
                    ffn_drop=0.0,
                    num_fcs=2),
                self_attn_cfg=dict(
                    attn_drop=0.0,
                    batch_first=True,
                    dropout_layer=None,
                    embed_dims=256,
                    num_heads=8,
                    proj_drop=0.0)),
            num_layers=9,
            return_intermediate=True),
        type='ReinMask2FormerHead'),
    test_cfg=dict(crop_size=(
        512,
        512,
    ), mode='slide', stride=(
        341,
        341,
    )),
    train_cfg=dict(),
    type='FrozenBackboneEncoderDecoder')
norm_cfg = dict(requires_grad=True, type='SyncBN')
num_classes = 19
optim_wrapper = dict(
    constructor='PEFTOptimWrapperConstructor',
    optimizer=dict(
        betas=(
            0.9,
            0.999,
        ),
        eps=1e-08,
        lr=0.0001,
        type='AdamW',
        weight_decay=0.05),
    paramwise_cfg=dict(
        custom_keys=dict({
            'learnable_tokens': dict(decay_mult=0.0, lr_mult=1.0),
            'level_embed': dict(decay_mult=0.0, lr_mult=1.0),
            'norm': dict(decay_mult=0.0),
            'query_embed': dict(decay_mult=0.0, lr_mult=1.0),
            'reins.scale': dict(decay_mult=0.0, lr_mult=1.0)
        }),
        norm_decay_mult=0.0))
param_scheduler = [
    dict(
        begin=0,
        by_epoch=False,
        end=40000,
        eta_min=0,
        power=0.9,
        type='PolyLR'),
]
randomness = dict(seed=42)
resume = False
test_cfg = dict(type='TestLoop')
test_dataloader = dict(
    batch_size=1,
    dataset=dict(
        datasets=[
            dict(
                data_prefix=dict(
                    img_path='images/10k/val',
                    seg_map_path='labels/sem_seg/masks/val'),
                data_root=
                '../data/bdd100k/',
                img_suffix='.jpg',
                pipeline=[
                    dict(type='LoadImageFromFile'),
                    dict(keep_ratio=True, scale=(
                        1280,
                        720,
                    ), type='Resize'),
                    dict(type='LoadAnnotations'),
                    dict(type='PackSegInputs'),
                ],
                seg_map_suffix='.png',
                type='CityscapesDataset'),
            dict(
                data_prefix=dict(
                    img_path='leftImg8bit/val', seg_map_path='gtFine/val'),
                data_root=
                '../data/cityscapes/',
                pipeline=[
                    dict(type='LoadImageFromFile'),
                    dict(keep_ratio=True, scale=(
                        2048,
                        512,
                    ), type='Resize'),
                    dict(type='LoadAnnotations'),
                    dict(type='PackSegInputs'),
                ],
                type='CityscapesDataset'),
        ],
        type='ConcatDataset'),
    num_workers=4,
    persistent_workers=True,
    sampler=dict(shuffle=False, type='DefaultSampler'))
test_evaluator = dict(
    dataset_keys=[
        'citys',
        'bdd',
    ],
    iou_metrics=[
        'mIoU',
    ],
    type='DGIoUMetric')
train_bdd = dict(
    data_prefix=dict(
        img_path='images/10k/train',
        seg_map_path='labels/sem_seg/masks/train'),
    data_root=
    '../data/bdd100k/',
    img_suffix='.jpg',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations'),
        dict(scale=(
            1280,
            720,
        ), type='Resize'),
        dict(cat_max_ratio=0.75, crop_size=(
            512,
            512,
        ), type='RandomCrop'),
        dict(prob=0.5, type='RandomFlip'),
        dict(type='PhotoMetricDistortion'),
        dict(type='PackSegInputs'),
    ],
    seg_map_suffix='.png',
    type='CityscapesDataset')
train_cfg = dict(max_iters=40000, type='IterBasedTrainLoop', val_interval=1000)
train_cityscapes = dict(
    data_prefix=dict(
        img_path='leftImg8bit/train', seg_map_path='gtFine/train'),
    data_root=
    '../data/cityscapes/',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations'),
        dict(scale=(
            2048,
            512,
        ), type='Resize'),
        dict(cat_max_ratio=0.75, crop_size=(
            512,
            512,
        ), type='RandomCrop'),
        dict(prob=0.5, type='RandomFlip'),
        dict(type='PhotoMetricDistortion'),
        dict(type='PackSegInputs'),
    ],
    type='CityscapesDataset')
train_dataloader = dict(
    batch_size=8,
    dataset=dict(
        data_prefix=dict(
            img_path='leftImg8bit/train', seg_map_path='gtFine/train'),
        data_root=
        '../data/cityscapes/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(scale=(
                2048,
                512,
            ), type='Resize'),
            dict(
                cat_max_ratio=0.75, crop_size=(
                    512,
                    512,
                ), type='RandomCrop'),
            dict(prob=0.5, type='RandomFlip'),
            dict(type='PhotoMetricDistortion'),
            dict(type='PackSegInputs'),
        ],
        type='CityscapesDataset'),
    num_workers=4,
    persistent_workers=True,
    pin_memory=True,
    sampler=dict(shuffle=True, type='InfiniteSampler'))
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(
        max_size=2048,
        resize_type='ResizeShortestEdge',
        scales=[
            256,
            307,
            358,
            409,
            460,
            512,
            563,
            614,
            665,
            716,
            768,
            819,
            870,
            921,
            972,
            1024,
        ],
        type='RandomChoiceResize'),
    dict(cat_max_ratio=0.75, crop_size=(
        512,
        512,
    ), type='RandomCrop'),
    dict(prob=0.5, type='RandomFlip'),
    dict(type='PhotoMetricDistortion'),
    dict(type='PackSegInputs'),
]
tta_model = dict(type='SegTTAModel')
val_bdd = dict(
    data_prefix=dict(
        img_path='images/10k/val', seg_map_path='labels/sem_seg/masks/val'),
    data_root=
    '../data/bdd100k/',
    img_suffix='.jpg',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(keep_ratio=True, scale=(
            1280,
            720,
        ), type='Resize'),
        dict(type='LoadAnnotations'),
        dict(type='PackSegInputs'),
    ],
    seg_map_suffix='.png',
    type='CityscapesDataset')
val_cfg = dict(type='ValLoop')
val_cityscapes = dict(
    data_prefix=dict(img_path='leftImg8bit/val', seg_map_path='gtFine/val'),
    data_root=
    '../data/cityscapes/',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(keep_ratio=True, scale=(
            2048,
            512,
        ), type='Resize'),
        dict(type='LoadAnnotations'),
        dict(type='PackSegInputs'),
    ],
    type='CityscapesDataset')
val_dataloader = dict(
    batch_size=1,
    dataset=dict(
        datasets=[
            dict(
                data_prefix=dict(
                    img_path='images/10k/val',
                    seg_map_path='labels/sem_seg/masks/val'),
                data_root=
                '../data/bdd100k/',
                img_suffix='.jpg',
                pipeline=[
                    dict(type='LoadImageFromFile'),
                    dict(keep_ratio=True, scale=(
                        1280,
                        720,
                    ), type='Resize'),
                    dict(type='LoadAnnotations'),
                    dict(type='PackSegInputs'),
                ],
                seg_map_suffix='.png',
                type='CityscapesDataset'),
            dict(
                data_prefix=dict(
                    img_path='leftImg8bit/val', seg_map_path='gtFine/val'),
                data_root=
                '../data/cityscapes/',
                pipeline=[
                    dict(type='LoadImageFromFile'),
                    dict(keep_ratio=True, scale=(
                        2048,
                        512,
                    ), type='Resize'),
                    dict(type='LoadAnnotations'),
                    dict(type='PackSegInputs'),
                ],
                type='CityscapesDataset'),
        ],
        type='ConcatDataset'),
    num_workers=4,
    persistent_workers=True,
    sampler=dict(shuffle=False, type='DefaultSampler'))
val_evaluator = dict(
    dataset_keys=[
        'citys',
        'bdd',
    ],
    iou_metrics=[
        'mIoU',
    ],
    type='DGIoUMetric')
vis_backends = [
    dict(type='LocalVisBackend'),
    dict(type='TensorboardVisBackend'),
]
visualizer = dict(
    name='visualizer',
    type='SegLocalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='TensorboardVisBackend'),
    ])
work_dir = './work_dirs/rein_eva02_mask2former_512x512_cityscapes'

Create model with input size of 128x128

Hello!

Can I convert the model so that it can take 128 x 128 images, and then run inference with the converted model? I guess I would have to change the config file as well, but is it possible?

Thanks
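
For reference, the conversion script in the 'Pretraining Weights' section accepts --height and --width flags, so a 128x128 variant could presumably be produced the same way and paired with a config whose crop and test sizes are adjusted to match (an untested assumption based on the documented flags):

python tools/convert_models/convert_dinov2.py checkpoints/dinov2_vitl14_pretrain.pth checkpoints/dinov2_converted_128x128.pth --height 128 --width 128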

About CLIP's configuration file

Thank you so much for your wonderful work!
When we were training Rein+CLIP, we realized that there was no configuration file for it, so we wrote our own based on your code and other samples, but we ran into a problem: CLIP uses a neck module, so should we just pass the backbone (Rein) output features into the neck, and feed the learnable tokens directly into the head?
Here's the config file we wrote:

_base_ = [
    "../_base_/datasets/dg_gta_512x512.py",
    "../_base_/default_runtime.py",
    "../_base_/models/clip-L_mask2former.py",
]
model = dict(
    backbone=dict(
        type="ReinsCLIPVisionTransformer",
        reins_config=dict(
            type="LoRAReins",
            token_length=100,
            embed_dims=1024,
            num_layers=24,
            patch_size=16,
            link_token_to_query=True,
            lora_dim=16,
        ),
    ),
    decode_head=dict(
        type="ReinMask2FormerHead",
    ),
)
train_pipeline = [
    dict(type="LoadImageFromFile"),
    dict(type="LoadAnnotations"),
    dict(
        type="RandomChoiceResize",
        scales=[int(512 * x * 0.1) for x in range(5, 21)],
        resize_type="ResizeShortestEdge",
        max_size=2048,
    ),
    dict(type="RandomCrop", crop_size={{_base_.crop_size}}, cat_max_ratio=0.75),
    dict(type="RandomFlip", prob=0.5),
    dict(type="PhotoMetricDistortion"),
    dict(type="PackSegInputs"),
]
train_dataloader = dict(batch_size=4, dataset=dict(pipeline=train_pipeline))

embed_multi = dict(lr_mult=1.0, decay_mult=0.0)
optim_wrapper = dict(
    constructor="PEFTOptimWrapperConstructor",
    optimizer=dict(
        type="AdamW", lr=0.0001, weight_decay=0.05, eps=1e-8, betas=(0.9, 0.999)
    ),
    paramwise_cfg=dict(
        custom_keys={
            "norm": dict(decay_mult=0.0),
            "query_embed": embed_multi,
            "level_embed": embed_multi,
            "learnable_tokens": embed_multi,
            "reins.scale": embed_multi,
        },
        norm_decay_mult=0.0,
    ),
)
param_scheduler = [
    dict(type="PolyLR", eta_min=0, power=0.9, begin=0, end=40000, by_epoch=False)
]

train_cfg = dict(type="IterBasedTrainLoop", max_iters=40000, val_interval=10000)
val_cfg = dict(type="ValLoop")
test_cfg = dict(type="TestLoop")
default_hooks = dict(
    timer=dict(type="IterTimerHook"),
    logger=dict(type="LoggerHook", interval=50, log_metric_by_epoch=False),
    param_scheduler=dict(type="ParamSchedulerHook"),
    checkpoint=dict(
        type="CheckpointHook", by_epoch=False, interval=4000, max_keep_ckpts=3
    ),
    sampler_seed=dict(type="DistSamplerSeedHook"),
    visualization=dict(type="SegVisualizationHook"),
)
find_unused_parameters = True
auto_scale_lr = dict(enable=False, base_batch_size=4)  # v2

Train freeze dinov2

Hello, I use the command "python tools/train.py configs/dinov2/dinov2_mask2former_512x512_bs1x4.py" to train the frozen DINOv2.

Due to GPU limitations, the batch size is set to 1, and I use three 1080Ti GPUs to train the model.

Then, I use these commands,
including "python tools/test.py configs/dinov2/dinov2_mask2former_512x512_bs1x4.py work_dirs/dinov2_mask2former_512x512_bs1x4/iter_40000.pth --backbone checkpoints/dinov2_converted.pth" or
"python tools/test.py configs/dinov2/dinov2_mask2former_512x512_bs1x4.py work_dirs/dinov2_mask2former_512x512_bs1x4/iter_40000.pth --backbone checkpoints/dinov2_vitl14_pretrain.pth" or
"python tools/test.py configs/dinov2/dinov2_mask2former_512x512_bs1x4.py work_dirs/dinov2_mask2former_512x512_bs1x4/iter_40000.pth" to test the trained model.

However, the performance is poor. The first two commands do not work for testing the model, since the mIoU is 0.0x on each dataset.

Testing with the last command gives mIoU of 30.3%, 22.8%, and 35.1% on Cityscapes, BDD, and Mapillary, respectively.

So, could you please tell me whether I used the wrong test command, or whether I need to change the number of training iterations because the batch size is smaller? If the number of training iterations needs to be changed, what other relevant parameters need to be changed, such as the learning rate?

How to combine rein and resnet or convnext?

Hello,

I notice that there are reins_resnet.py and reins_convnext.py in the code, but there is no config file for them. If I want to use Rein with a ResNet structure, how should I set the patch_size and zero_mlp_delta_f parameters, and do they matter much for the result?

Looking forward to your reply!

Ask for the Synthia dataset config file.

Hi,
Could you please provide the SYNTHIA dataset config file and the related data processing details?
It would be helpful for reproducing your results.
Thanks a lot.

Waiting online

checkpoint release for "dinov2_rein_and_head.pth"

Hi, Thank you for your great work!

I'd like to experiment with your demo, but it looks like the checkpoint has not been fully released.
There is a checkpoint for the DINOv2 backbone, but not for Rein and its head (i.e., dinov2_rein_and_head.pth).
I would like to request that you finish releasing the checkpoint.

Thanks a lot !

Some questions about cls_weight in rein_dinov2_mask2former.py

Dear author,
Thanks for sharing your code. When I trained the head on the REFUGE2 dataset, I found a problem here:
class_weight=[1.0] * num_classes + [0.1]
I noticed you set num_classes = 19, but I don't know why class_weight needs to be a list of length 19 + 1.

I set reduce_zero_label = True for the REFUGE2 dataset, whose classes are ('background', 'Optic Cup', 'Optic Disc'). With num_classes = 2 in the model, I only get 'background' and 'Optic Cup' in my segmentation mask.
When I set num_classes = 3 and class_weight = [0.1, 1, 2], the index goes out of bounds. I checked my labels:
class_weight: tensor([0.1000, 1.0000, 2.0000], device='cuda:0')
label:tensor([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3], device='cuda:0')
I don't know why my labels contain the value 3; I converted all pixel values to 0-2 before training.
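
For context on the first point: the Mask2Former head used here adds an extra "no object" category to its classification loss, which is why class_weight has num_classes + 1 entries, with the last one down-weighted to 0.1 (the config earlier on this page shows the same pattern: 19 ones plus 0.1 for num_classes=19). A minimal illustrative sketch for a 2-class task (not the repository's code):

num_classes = 2
# one weight per semantic class, plus a down-weighted "no object" class used by Mask2Former
class_weight = [1.0] * num_classes + [0.1]
assert len(class_weight) == num_classes + 1  # 3 entries for a binary task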

Question about source domain performance.

Hi,

Thank you for your insightful work, which has provided me with valuable perspectives.

I have a question regarding the performance on the source domain, specifically the mIoU on GTAV as presented in Table 4. On the source domain, does Rein outperform other methods such as VPT and Adaptformer?

Best regards.

How Did You Choose Checkpoints?

Hi,

Amazing work!

I am interested in understanding how you chose checkpoints during training. For example, when training on GTA5 -> City, did you evaluate performance on City every 8000 iterations? What was the rationale for selecting 40000 iterations as your training duration?

Thank you!

There are some issues while running the code

Hello, I'm having some problems running Rein's code. Since this is my first time working in this area, there may be mistakes in my setup. Could you please help with the following problems:

  1. After completing the configuration according to the readme.md and running the training command python tools/train.py configs/dinov2/rein_dinov2_mask2former_512x512_bs1x4.py, the error says that checkpoints/dinov2_converted.pth does not exist. There are only dinov2_rein_and_head.pth and dinov2_vitl14_pretrain.pth in the checkpoints folder. How do I obtain dinov2_converted.pth?
  2. I found that renaming dinov2_rein_and_head.pth to dinov2_converted.pth worked, but the results were not ideal.
  3. Where do I download the backbone checkpoints for the releases? I didn't find the corresponding file in DINOv2's GitHub; there are many models there.
  4. Is the 10k semantic segmentation dataset used in BDD100K the same data as in bdd100k_seg.zip? The download from the official website is not available. If not, can you provide a download link to the dataset?

how to convert eva model

There is a script to convert the DINO backbone. How do I need to adapt the following function to use EVA02-B?

import os.path as osp

import torch
import torch.nn.functional as F


def load_backbone(path: str):
    if not osp.isfile(path):
        raise FileNotFoundError(
            f"{path} does not exist (absolute path: {osp.abspath(path)})"
        )
    weight = torch.load(path, map_location="cpu")
    # Interpolate the position embedding (keeping the cls token) from the
    # pretraining grid (37x37) to the target grid (32x32).
    weight["pos_embed"] = torch.cat(
        (
            weight["pos_embed"][:, :1, :],
            F.interpolate(
                weight["pos_embed"][:, 1:, :]
                .reshape(1, 37, 37, 1024)
                .permute(0, 3, 1, 2),
                size=(32, 32),
                mode="bicubic",
                align_corners=False,
            )
            .permute(0, 2, 3, 1)
            .reshape(1, 1024, 1024),
        ),
        dim=1,
    )
    # Resize the patch embedding kernels from 14x14 to 16x16.
    weight["patch_embed.proj.weight"] = F.interpolate(
        weight["patch_embed.proj.weight"].float(),
        size=(16, 16),
        mode="bicubic",
        align_corners=False,
    )
    return weight
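
If it helps, the function above can then be used to produce the converted backbone file referenced in the README commands (a usage sketch assuming it is run from the repository root; adjust paths as needed):

weight = load_backbone("checkpoints/dinov2_vitl14_pretrain.pth")
torch.save(weight, "checkpoints/dinov2_converted.pth")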

Cityscapes best model weights

Hi,

Thanks for releasing this amazing work. I noticed that the best-performing model on cityscapes ("+1/16 of Cityscapes training Cityscapes") isn't available. Could you please release the weights of that model too?

Best,
Akshay

Reproducing Table 2

Hi @w1oves,

Congratulations on the excellent work.

I am keen to reproduce Table 2 from your paper but have been unable to locate all the necessary configuration files. Could you please provide guidance on which scripts to run and the specific hyperparameters required? I am particularly interested in the "Freeze" and "Rein" rows.

Thank you in advance for your assistance.

Question about full fine tuning exp. results

Hi, I'm interested in your impressive work!
And I have a question about full fine tuning experiment results.

For how many iterations did you train the full fine-tuning method?
In my experience, [EVA-02+ Mask2Former] is sufficiently saturated in 5k iterations.

Thank you :)

About the problems encountered when using GTA5+SYNTHIA configuration file

Hello, I got an error when using the GTA5+SYNTHIA configuration file in Releases:

Traceback (most recent call last):
  File "/workspace/Rein/tools/train.py", line 116, in <module>
    main()
  File "/workspace/Rein/tools/train.py", line 112, in main
    runner.train()
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/loops.py", line 286, in run
    self.run_iter(data_batch)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/loops.py", line 309, in run_iter
    outputs = self.runner.model.train_step(
  File "/opt/conda/lib/python3.10/site-packages/mmengine/model/base_model/base_model.py", line 114, in train_step
    losses = self._run_forward(data, mode='loss')  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/mmengine/model/base_model/base_model.py", line 361, in _run_forward
    results = self(**data, mode=mode)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/mmseg/models/segmentors/base.py", line 94, in forward
    return self.loss(inputs, data_samples)
  File "/opt/conda/lib/python3.10/site-packages/mmseg/models/segmentors/encoder_decoder.py", line 178, in loss
    loss_decode = self._decode_head_forward_train(x, data_samples)
  File "/opt/conda/lib/python3.10/site-packages/mmseg/models/segmentors/encoder_decoder.py", line 139, in _decode_head_forward_train
    loss_decode = self.decode_head.loss(inputs, data_samples,
  File "/opt/conda/lib/python3.10/site-packages/mmseg/models/decode_heads/mask2former_head.py", line 126, in loss
    losses = self.loss_by_feat(all_cls_scores, all_mask_preds,
  File "/opt/conda/lib/python3.10/site-packages/mmdet/models/dense_heads/maskformer_head.py", line 348, in loss_by_feat
    losses_cls, losses_mask, losses_dice = multi_apply(
  File "/opt/conda/lib/python3.10/site-packages/mmdet/models/utils/misc.py", line 219, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/opt/conda/lib/python3.10/site-packages/mmdet/models/dense_heads/mask2former_head.py", line 273, in _loss_by_feat_single
    avg_factor) = self.get_targets(cls_scores_list, mask_preds_list,
  File "/opt/conda/lib/python3.10/site-packages/mmdet/models/dense_heads/maskformer_head.py", line 237, in get_targets
    results = multi_apply(self._get_targets_single, cls_scores_list,
  File "/opt/conda/lib/python3.10/site-packages/mmdet/models/utils/misc.py", line 219, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/opt/conda/lib/python3.10/site-packages/mmdet/models/dense_heads/mask2former_head.py", line 213, in _get_targets_single
    gt_points_masks = point_sample(
  File "/opt/conda/lib/python3.10/site-packages/mmcv/ops/point_sample.py", line 270, in point_sample
    output = F.grid_sample(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 4244, in grid_sample
    return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
RuntimeError: grid_sampler(): expected grid to have size 3 in last dimension, but got grid with sizes [5, 12544, 1, 2]

mmcv version conflict

As described in the readme, mmcv >= 2.0.0 is recommended. However, when I try to prepare my Cityscapes dataset with the provided script (cityscapes.py), it fails with an error saying that mmcv.scandir is not found.
I have checked the mmcv version and found that mmcv stopped supporting this function from 2.0.0 onwards, so there is a conflict.
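
One possible workaround (an assumption on my part, not an official fix from the repository) is to import scandir from mmengine, which provides the utility that was dropped from mmcv 2.x, and patch the dataset script accordingly:

# hypothetical patch inside tools/convert_datasets/cityscapes.py:
# replace calls to mmcv.scandir(...) with mmengine's scandir
from mmengine.utils import scandir

for label_file in scandir("data/cityscapes/gtFine", suffix="_labelIds.png", recursive=True):
    print(label_file)  # each Cityscapes label file found under gtFine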

About feature-to-token similarity map and token.

Hello author! I am very interested in your work. Your work is excellent.

I have a few questions.

In the paper, why is the feature-to-token similarity map designed to choose columns 2 to m in Eq.(6)?
And why does the token denote the selection of rows 2 to m in Eq.(6)?
How does this process give high value to the first token?
What would be the impact of making the sum of each row in the feature-to-token similarity map equal to 1?

Thank you again for your work.

Error about demo.ipynb

/home/XXX/moe_seg/moe_exp_seg/pytorch_seg/Rein/rein/models/backbones/dino_layers/swiglu_ffn.py:43: UserWarning: xFormers is available (SwiGLU)
warnings.warn("xFormers is available (SwiGLU)")
/home/XXX/moe_seg/moe_exp_seg/pytorch_seg/Rein/rein/models/backbones/dino_layers/attention.py:27: UserWarning: xFormers is available (Attention)
warnings.warn("xFormers is available (Attention)")
/home/XXX/moe_seg/moe_exp_seg/pytorch_seg/Rein/rein/models/backbones/dino_layers/block.py:33: UserWarning: xFormers is available (Block)
warnings.warn("xFormers is available (Block)")
Loads checkpoint by local backend from path: checkpoints/dinov2_segmentor.pth
/home/XXX/anaconda3/envs/rein/lib/python3.11/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/TensorShape.cpp:3483.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/home/XXX/anaconda3/envs/rein/lib/python3.11/site-packages/mmengine/visualization/visualizer.py:196: UserWarning: Failed to add <class 'mmengine.visualization.vis_backend.LocalVisBackend'>, please provide the save_dir argument.
warnings.warn(f'Failed to add {vis_backend.class}, '

The code stops running at this stage, even though the model has been successfully loaded onto the GPU.

pretrained checkpoints for CLIP with Rein not available?

I cannot find any pretrained checkpoints in the checkpoints folder for CLIP with Rein. Also there is no link for this in the demo or Readme (only for dino). Could you also provide a link for a pretrained CLIP with Rein variant? Or am I missing this?

How to train on new dataset.

@w1oves
Thanks for sharing your code. I was wondering which part of the code I should modify if I want to train on my own dataset. Because your code is complicated, I am not sure whether it is enough to modify only train_dataloader, and I don't know how to modify train_dataloader properly.
Could you answer one of the following questions?
First question: How do you train the model without using mmengine?
Second question: If mmengine has to be used, how many parameter settings in your config file are redundant? In other words, how many of the parameters in configs/dinov2/rein_dinov2_mask2former_512x512_bs1x4.py can be removed during training?

If you could answer my question, I would be very grateful for it.

Issue when training on my own binary segmentation dataset

Hello,

Thank you very much for your paper and the open-source code. I've encountered an issue when attempting to use Rein on my own dataset and was hoping you could assist me in resolving it. I aim to apply Rein to my binary segmentation dataset, so I modified num_classes to 2 in configs/_base_/models/rein_dinov2_mask2former.py. Following your instructions, I trained the model using the command:
python tools/train.py configs/dinov2/rein_dinov2_mask2former_512x512_bs1x4.py
However, I encountered a size mismatch issue:

size mismatch for patch_embed.proj.weight: copying a param with shape torch.Size([1024, 3, 16, 16]) from checkpoint, the shape in current model is torch.Size([1024, 1, 16, 16]). missing keys in source state_dict: reins.scale, reins.learnable_tokens_a, reins.learnable_tokens_b, reins.mlp_token2feat.weight, reins.mlp_token2feat.bias, reins.mlp_delta_f.weight, reins.mlp_delta_f.bias, reins.transform.weight, reins.transform.bias, reins.merge.weight, reins.merge.bias

Although the training process continued and the trained model could be used for testing with the command:
python tools/test.py configs/dinov2/rein_dinov2_mask2former_512x512_bs1x4.py path/to/my/trained/pth --backbone dinov2_converted.pth
As the evaluation result appears poor, I suspect something is not correct. Could you advise on the correct approach for this situation? Or is Rein suitable for binary segmentation tasks?

I would be grateful for your reply.

Question about freezing backbone and Linking tokens to instances

Hi, I'm interested in your impressive work!
And I have a question about freezing the backbone and linking tokens to instances.

  1. Where is the code for freezing the backbone?
  2. Is it correct that the token is linked to the decoder's query during training?
    • In the released code, the tokens appear to be linked only once, at initialization.

        if use_rein:
            ...
            self.head.load_state_dict(head_state, strict=False)
            self.head.link(self.backbone.rein.link_to_querys())
      

Thank you :)

about Table 1.

Awesome work but a little issue. Is it possible to release the code of "Frozen backbone of VFMs"? I'm very interested in how you can achieve such high segmentation performance with these functionally different backbones (SAM, MAE, and CLIP) by only training the decoder.

Best.

Multi-gpu training problem

This is a good paper and a very interesting idea! There is a training command for a single GPU in the readme. For multi-GPU training, could you provide the corresponding command?

What is the difference between the ReinMask2FormerHead and original Mask2FormerHead?

Nice work. I noticed that you rewrite the Mask2FormerHead as ReinMask2FormerHead; I wonder what the difference between them is.
Also, when you use VPT to compare with your Rein, what are the detailed settings of VPT? For example, how many tokens are used, and are there any differences in hyperparameter settings?

Looking forward to your reply.

Resizing During Training and Eval

Hi! I noticed that the training pipeline for DG on GTA→Cityscapes trains on 512x512 crops of a downsampled GTA image (1280, 720). However, during evaluation on Cityscapes, you evaluate on 512x512 crops of a downsampled Cityscapes image (1024, 512).

Was this intended, since evaluation should occur at the original Cityscapes image size (2048, 1024)?

About GTA5+SYNTHIA config file

Hello, how do I configure GTA5+SYNTHIA? In the Release/source code/Rein-GTAV-Synthia/config/_base_/dataset/ that you provided, I have not found a dg_gta_syn_xx.py config file. I tried to set the config:

train_dataloader = dict(
    batch_size=2,
    num_workers=2,
    persistent_workers=False,
    pin_memory=False,
    sampler=dict(type="InfiniteSampler", shuffle=True),
    dataset=dict(
        type="ConcatDataset",
        datasets=[
            {{_base_.train_gta}},
            {{_base_.train_syn}},
        ],
    ),
)

but it failed to work.

reproducing dinov2 backbone results

Hello! This is amazing work!

I am currently attempting to reproduce the results of the paper, but I am getting significantly different outcomes and would like to ask for your opinion on whether there might be an issue.

According to the paper, the frozen setting of the DINOv2 backbone on the GTA2CBM benchmarks yields 61.1 mIoU. However, my implementation produces results around 63 mIoU (at most 63.44). When training Rein, the results were similar to those reported in the paper (64.3). For full fine-tuning, contrary to the paper, the performance was lower than the frozen setting, at 61.5.

Could you help me identify what might be wrong, or if there is no issue at all?

how to use new checkpoint

The newly released pth is trained on Cityscapes (crop size: 1024x1024, iterations: 40k, batch size: 8).
How do I use it?
I can't find a 1024x1024 config file.

about checkpoints trained on Cityscapes with crop size of (512,512)

Hi, Thank you for your great work!

I am now following your work and need to compare ours with yours on Cityscapes to BDD100k+Mapillary. However, I found that the checkpoint trained on Cityscapes uses a crop size of 1024x1024. Could you please provide the checkpoint with a crop size of 512x512 for a fair comparison?

Thanks a lot!

On the full training performance of Table 2

We are very grateful for your outstanding contributions to domain generalization. However, we are experiencing some problems with the execution of the code in question.

When we run the 'configs/dinov2/dinov2_mask2former_512x512_bs1x4.py' configuration file, we end up with citys_mIoU: 32.8100, bdd_mIoU: 26.7600, map_mIoU: 36.3800, mean_mIoU: 31.9833, which is confusing given the large gap from the 61.7 reported in Table 2. We ran multiple experiments with the command "python tools/train.py configs/dinov2/dinov2_mask2former_512x512_bs1x4.py", and the mean_mIoU stayed around 30.

We hope to hear from you. Thanks again for your work!
