mendelxu / san

Open-vocabulary Semantic Segmentation

Home Page: https://mendelxu.github.io/SAN/

License: MIT License

Languages: Python 99.42%, Dockerfile 0.46%, Shell 0.11%
Topics: cvpr2023, open-vocabulary-semantic-segmentation, prompt-tuning

san's Introduction

[CVPR2023-Highlight] Side Adapter Network for Open-Vocabulary Semantic Segmentation

[PAMI] SAN: Side Adapter Network for Open-Vocabulary Semantic Segmentation


This is the official implementation of our conference paper: "Side Adapter Network for Open-Vocabulary Semantic Segmentation" (main branch) and our journal paper: "SAN: Side Adapter Network for Open-Vocabulary Semantic Segmentation" (video branch).

Introduction

This paper presents a new framework for open-vocabulary semantic segmentation with a pre-trained vision-language model, named Side Adapter Network (SAN). Our approach models the semantic segmentation task as a region recognition problem. A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias, which is applied in the CLIP model to recognize the class of each mask. This decoupled design helps CLIP recognize the classes of the mask proposals. Since the attached side network can reuse CLIP features, it can be very light. In addition, the entire network can be trained end-to-end, allowing the side network to be adapted to the frozen CLIP model, which makes the predicted mask proposals CLIP-aware. Our approach is fast, accurate, and only adds a few additional trainable parameters. We evaluate our approach on multiple semantic segmentation benchmarks. Our method significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed.


Demo

  • Run the demo app on 🤗HuggingFace. (It is running on a low-spec machine and could be slow)
  • Run the demo app with docker.
    docker build -f docker/app.Dockerfile -t san_app .
    docker run -it --shm-size 4G -p 7860:7860 san_app
    

Installation

  1. Clone the repository
    git clone https://github.com/MendelXu/SAN.git
  2. Navigate to the project directory
    cd SAN
  3. Install the dependencies
    bash install.sh
    Hint: You can run the job in Docker instead of installing the dependencies locally. Run with the pre-built Docker image:
    docker run -it --gpus all --shm-size 8G mendelxu/pytorch:d2_nvcr_2008 /bin/bash
    
    or build your own image with the provided dockerfile docker/Dockerfile.

Data Preparation

See SimSeg for reference. The data should be organized like:

datasets/
    coco/
        ...
        train2017/
        val2017/
        stuffthingmaps_detectron2/
    VOC2012/
        ...
        images_detectron2/
        annotations_detectron2/
    pcontext/
        ...
        val/
    pcontext_full/
        ...
        val/
    ADEChallengeData2016/
        ...
        images/
        annotations_detectron2/
    ADE20K_2021_17_01/
        ...
        images/
        annotations_detectron2/        

Hint: In the code, these datasets are registered under the following dataset names:

coco_2017_*_stuff_sem_seg : COCO Stuff-171
voc_sem_seg_*: Pascal VOC-20
pcontext_sem_seg_*: Pascal Context-59
ade20k_sem_seg_*: ADE-150
pcontext_full_sem_seg_*: Pascal Context-459
ade20k_full_sem_seg_*: ADE-847
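
If you are unsure which of these names to pass to DATASETS.TEST later in the Usage section, the snippet below is a hedged way to list the registered splits from a Python shell. It assumes that importing the san package triggers the dataset registration as a side effect, which is how train_net.py picks the datasets up; if that assumption does not hold in your setup, run the same check from inside train_net.py instead.

    from detectron2.data import DatasetCatalog

    import san  # noqa: F401  # assumption: importing the package registers the datasets

    # Print every registered semantic-segmentation split, e.g. 'pcontext_sem_seg_val'.
    print(sorted(n for n in DatasetCatalog.list() if "sem_seg" in n))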

Usage

  • Pretrained Weights

    Model         Config                                      Weights      Logs
    SAN-ViT-B/16  configs/san_clip_vit_res4_coco.yaml         Huggingface  Log
    SAN-ViT-L/14  configs/san_clip_vit_large_res4_coco.yaml   Huggingface  Log
  • Evaluation

    • Evaluate a trained model on the validation sets of all datasets.
    python train_net.py --eval-only --config-file <CONFIG_FILE> --num-gpus <NUM_GPU> OUTPUT_DIR <OUTPUT_PATH> MODEL.WEIGHTS <TRAINED_MODEL_PATH>

    For example, evaluate our pre-trained model:

    # 1. Download SAN (ViT-B/16 CLIP) from https://huggingface.co/Mendel192/san/blob/main/san_vit_b_16.pth.
    # 2. put it at `output/model.pth`.
    # 3. evaluation
      python train_net.py --eval-only --config-file configs/san_clip_vit_res4_coco.yaml --num-gpus 8 OUTPUT_DIR ./output/trained_vit_b16 MODEL.WEIGHTS output/model.pth
    
    • Evaluate a trained model on the validation set of a single dataset.
    python train_net.py --eval-only --config-file <CONFIG_FILE> --num-gpus <NUM_GPU> OUTPUT_DIR <OUTPUT_PATH> MODEL.WEIGHTS <TRAINED_MODEL_PATH> DATASETS.TEST "('<FILL_DATASET_NAME_HERE>',)"
  • Visualization

    python visualize_json_results.py --input <JSON_RESULT> --output <WHERE TO SAVE VISUALIZATION RESULT> --dataset <DATASET>
    # example:
    # Generate the results.
    # python train_net.py --eval-only --config-file configs/san_clip_vit_res4_coco.yaml --num-gpus 1 OUTPUT_DIR ./output/trained_vit_b16 MODEL.WEIGHTS output/san/san_vit_b_16.pth DATASETS.TEST '("pcontext_sem_seg_val",)'
    # Visualizing
    # python visualize_json_results.py --input output/trained_vit_b16/inference/sem_seg_predictions.json --output output/viz --dataset pcontext_sem_seg_val

  • Training

    wandb off
    # [Optional] If you want to log the training logs to wandb.
    # wandb login
    # wandb on
    python train_net.py --config-file <CONFIG_FILE> --num-gpus <NUM_GPU> OUTPUT_DIR <OUTPUT_PATH> WANDB.NAME <WANDB_LOG_NAME>

    Hint: We use <> to denote variables that you should replace according to your own settings.

FAQ

If you cannot get a timely response from the author on GitHub, please e-mail me directly at [shea.mendel] [AT] [gmail.com].

License

Distributed under the MIT License. See LICENSE for more information.

Cite

If you find this work helpful, please cite our paper:

@inproceedings{xu2023side,
  title={Side Adapter Network for Open-Vocabulary Semantic Segmentation},
  author={Mengde Xu and Zheng Zhang and Fangyun Wei and Han Hu and Xiang Bai},
  booktitle={CVPR},
  year={2023}
}

san's People

Contributors

dependabot[bot], mendelxu, shea192, stupidzz


san's Issues

About training cost.

Dear authors, thanks for your amazing work. Could you tell me how long training (with CLIP ViT-B/16 and ViT-L/14) takes on the COCO-Stuff dataset using 8 NVIDIA V100 GPUs?
Thanks!

On using prompts to improve segmentation performance

Thank you for your excellent work!
I noticed that the appendix of your paper describes using prompts to improve segmentation performance. In the concrete implementation, do you use the prompt-augmented class labels for both training and testing, or do you train with single-word labels and only add the prompts at test time?
I would be very grateful if you could clarify this!
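
For readers unfamiliar with prompt templates: the snippet below is a hedged illustration of what "adding a prompt" to class labels means for a CLIP text encoder. The "a photo of a {}." template and the model/tokenizer names are common open_clip usage, not necessarily the exact templates used in SAN.

from open_clip import create_model_and_transforms, get_tokenizer
import torch

# Build a CLIP text encoder (requires the open_clip_torch package).
model, _, _ = create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = get_tokenizer("ViT-B-16")

class_names = ["cat", "dog", "grass"]
# Prompt-augmented labels: wrap each bare class name in a natural-language template.
prompts = [f"a photo of a {name}." for name in class_names]

with torch.no_grad():
    text_features = model.encode_text(tokenizer(prompts))  # one embedding per class name
print(text_features.shape)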

Torchscript implementation

Is there a faster implementation of this for deployment, e.g. a version with a compiled TorchScript model?
Thanks!

how to get pcontext-full dataset

Thanks for your great work! I met a problem when preparing the dataset.
(screenshot omitted)
What value should I set for the annotation dir? I guess it should be "dataset/pcontext-full/VOCdevkit/VOC2010/Annotations", but those are "xml" files, which are not compatible with the 'mat' format expected by the code. What should I do?
(screenshot omitted)

Error on training: double free or corruption (!prev)

Thanks for your inspiring work!

When running the provided code, I met the following error. However, when I set num_workers=0, the code runs smoothly without any errors.
Have you met this before, or do you have any idea how to debug it?

[04/05 09:08:31 d2.engine.train_loop]: Starting training from iteration 0
*** Error in `/home/anaconda3/envs/guangrui_cuda11/bin/python': double free or corruption (!prev): 0x000055bbaca66830 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x733cf)[0x7f52abbfc3cf]
/lib64/libc.so.6(+0x78c3e)[0x7f52abc01c3e]
/lib64/libc.so.6(+0x79917)[0x7f52abc02917]
/lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x58)[0x7f52ac367588]
/lib64/libpthread.so.0(+0x71f7)[0x7f52abf3c1f7]
/lib64/libpthread.so.0(+0x730f)[0x7f52abf3c30f]
/lib64/libpthread.so.0(pthread_join+0xdb)[0x7f52abf3e65b]
/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(pthreadpool_destroy+0x82)[0x7f5247df80b2]
/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(+0x344abe7)[0x7f5245641be7]
/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN2at15set_num_threadsEi+0x38)[0x7f524397c3c8]
/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0x5e9d3a)[0x7f52975a8d3a]
/home/anaconda3/envs/guangrui_cuda11/bin/python(+0x1401c5)[0x55bba920b1c5]
/home/anaconda3/envs/guangrui_cuda11/bin/python(+0xff58e)[0x55bba91ca58e]
/home/anaconda3/envs/guangrui_cuda11/bin/python(_PyFunction_Vectorcall+0x10b)[0x55bba925616b]
/home/anaconda3/envs/guangrui_cuda11/bin/python(PyVectorcall_Call+0x71)[0x55bba9207a41]
/home/anaconda3/envs/guangrui_cuda11/bin/python(_PyEval_EvalFrameDefault+0x207c)[0x55bba928e1dc]
/home/anaconda3/envs/guangrui_cuda11/bin/python(_PyFunction_Vectorcall+0x10b)[0x55bba925616b]
/home/anaconda3/envs/guangrui_cuda11/bin/python(+0xff56d)[0x55bba91ca56d]
/home/anaconda3/envs/guangrui_cuda11/bin/python(_PyEval_EvalCodeWithName+0x2d2)[0x55bba92552a2]
/home/anaconda3/envs/guangrui_cuda11/bin/python(_PyFunction_Vectorcall+0x1e3)[0x55bba9256243]
/home/anaconda3/envs/guangrui_cuda11/bin/python(+0xff56d)[0x55bba91ca56d]
/home/anaconda3/envs/guangrui_cuda11/bin/python(_PyFunction_Vectorcall+0x10b)[0x55bba925616b]
/home/anaconda3/envs/guangrui_cuda11/bin/python(+0xff819)[0x55bba91ca819]
/home/anaconda3/envs/guangrui_cuda11/bin/python(_PyEval_EvalCodeWithName+0x2d2)[0x55bba92552a2]
/home/anaconda3/envs/guangrui_cuda11/bin/python(_PyFunction_Vectorcall+0x1e3)[0x55bba9256243]
/home/anaconda3/envs/guangrui_cuda11/bin/python(+0x10050b)[0x55bba91cb50b]
/home/anaconda3/envs/guangrui_cuda11/bin/python(_PyEval_EvalCodeWithName+0x2d2)[0x55bba92552a2]
/home/anaconda3/envs/guangrui_cuda11/bin/python(PyEval_EvalCodeEx+0x44)[0x55bba9256054]
/home/anaconda3/envs/guangrui_cuda11/bin/python(PyEval_EvalCode+0x1c)[0x55bba92e45bc]
/home/anaconda3/envs/guangrui_cuda11/bin/python(+0x219664)[0x55bba92e4664]
/home/anaconda3/envs/guangrui_cuda11/bin/python(+0x24b874)[0x55bba9316874]
/home/anaconda3/envs/guangrui_cuda11/bin/python(PyRun_StringFlags+0x7d)[0x55bba93190cd]
/home/anaconda3/envs/guangrui_cuda11/bin/python(PyRun_SimpleStringFlags+0x3f)[0x55bba91e0488]
/home/anaconda3/envs/guangrui_cuda11/bin/python(+0x11598b)[0x55bba91e098b]
/home/anaconda3/envs/guangrui_cuda11/bin/python(Py_BytesMain+0x39)[0x55bba9319389]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f52abbaaa85]
/home/anaconda3/envs/guangrui_cuda11/bin/python(+0x1de553)[0x55bba92a9553]
======= Memory map: ========

Skipping the memory maps ...

Traceback (most recent call last):
File "train_net.py", line 281, in
launch(
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/detectron2/engine/launch.py", line 69, in launch
mp.start_processes(
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1163, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/multiprocessing/queues.py", line 107, in get
if not self._poll(timeout):
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 225138) is killed by signal: Aborted.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/detectron2/engine/launch.py", line 123, in _distributed_worker
main_func(*args)
File "/home/guangrui/san_ovSeg/train_net.py", line 275, in main
return trainer.train()
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 484, in train
super().train(self.start_iter, self.max_iter)
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 155, in train
self.run_step()
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
self._trainer.run_step()
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 281, in run_step
data = next(self._data_loader_iter)
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/detectron2/data/common.py", line 291, in iter
for d in self.dataset:
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in next
data = self._next_data()
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _next_data
idx, data = self._get_data()
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1325, in _get_data
success, data = self._try_get_data()
File "/home/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1176, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 225138, 225290) exited unexpectedly
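
If you hit the same crash, the workaround the report mentions (num_workers=0) does not require editing the code: detectron2 exposes the dataloader worker count as the DATALOADER.NUM_WORKERS config key, so a hedged example of the override is:

python train_net.py --config-file configs/san_clip_vit_res4_coco.yaml --num-gpus 8 OUTPUT_DIR ./output/debug DATALOADER.NUM_WORKERS 0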

question about clip-aware and clip-unaware

Hi, great work! I have some questions about CLIP-aware and CLIP-unaware in the paper. Since the CLIP model is locked, why can gradients pass through CLIP in the end-to-end training setting but are blocked in the two-stage training setting shown in Fig. 5 of the paper?
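
A hedged toy illustration of the point behind the question: freezing a module's parameters only stops those parameters from being updated, while the backward pass can still propagate gradients through the frozen module to whatever trainable inputs feed it (in SAN, the attention biases produced by the side network).

import torch
import torch.nn as nn

frozen = nn.Linear(4, 4)
for p in frozen.parameters():
    p.requires_grad_(False)  # "locked", like the frozen CLIP model

side_output = torch.randn(1, 4, requires_grad=True)  # stands in for the trainable side network's output
loss = frozen(side_output).sum()
loss.backward()

print(frozen.weight.grad)  # None: the frozen weights receive no update signal
print(side_output.grad)    # populated: gradients still flow *through* the frozen layer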

about caption

Where is the code that associates each image's category names with the labeled masks during training?

about the pretrained weights

Hi! Thank you for your excellent work!

Do you use pretrained weights? Where should I set them in the code? Thank you!

Training and prediction on my own dataset

Thank you very much for your work! My dataset has 3 classes, labeled 1, 2, and 3, plus a background class 0. I ran into some problems on my own dataset:
1. I trained with your method, and it works quite well on classes 0, 1, and 2, but class 3 always gets 0. I then found that in the semantic_inference function, the operation mask_cls = F.softmax(mask_cls, dim=-1)[..., :-1] removes the last channel, so one class is missing at prediction time. Why is that? Are you dropping the last class as background?
2. I replaced mask_cls = F.softmax(mask_cls, dim=-1)[..., :-1] with mask_cls = F.softmax(mask_cls, dim=-1) to keep all channels and ran prediction again. The results on classes 0, 1, and 2 became much worse than before, and class 3 still did not work. Do I need to retrain after this change? This line only seems to be used at inference time, so I am quite confused. I would appreciate any help, thanks!
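
As background for the two questions above, a hedged sketch of what the slicing does: MaskFormer-style heads output num_classes + 1 logits per query, where the extra last channel is a "no-object" category used during training, so [..., :-1] keeps only the real classes at inference. The shapes below are illustrative only.

import torch
import torch.nn.functional as F

num_queries, num_classes = 100, 3                     # e.g. 3 annotated foreground classes
mask_cls = torch.randn(num_queries, num_classes + 1)  # +1 "no-object" channel appended by the head
probs = F.softmax(mask_cls, dim=-1)[..., :-1]         # drop the "no-object" column
print(probs.shape)                                    # torch.Size([100, 3])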

On self-created COCO-format datasets

Hello,
thanks for your excellent work, but I have some questions: the data I annotated with labelme has four classes, and I have converted it to the stuffthingmaps format. Where do I need to make the corresponding modifications to train on my own dataset? In particular, in the register_coco_stuff_164k.py file, my dataset only has 4 classes and does not have the 91 COCO-Stuff class labels.

SAN ensemble

Thanks for your great work! Is there any code implementation of the "SAN ensemble" mentioned in Table 2 of the paper?

A question about COCO-Stuff-171 dataset preparation

This is excellent work!
I would like to report a small data-preparation issue so that the code is easier for everyone to use:
In Data Preparation you mention using the scripts from SimSeg to prepare the COCO-Stuff dataset, i.e. converting the non-contiguous category labels into contiguous ones in advance.
However, when this project registers the COCO-Stuff-171 dataset, the code shows that the original non-contiguous labels are still used.
Therefore, when preparing COCO-Stuff-171 there is no need to generate contiguous labels under $DETECTRON2_DATASETS/coco/stuffthingmaps_detectron2; the original annotations can be used directly.
In practice, only after fixing this did the provided checkpoints give the correct performance on COCO-Stuff-171, and training also worked normally.

Update: I found that the code does not actually use stuff_dataset_id_to_contiguous_id, and the data-processing pipeline provided by the authors is correct.
(My earlier confusion came from processing annotations that had already been processed once.)

There is also a question about "emsemble"

There is also a question about the emsemble, what should the weight be set to emsemble the origin san model and the model fine-tuned on coco stuff? or could you tell me the range the weights should be set?

w = float(os.environ["EWEIGHT"])  # ensemble weight read from an environment variable
# weighted geometric mean of the two models' per-class score maps
sem_seg = torch.pow(sem_segs[0][:min_cls], w) * torch.pow(sem_segs[1][:min_cls], 1 - w)

On handling the background class when evaluating mIoU

Dear authors,
Hello! Thank you for the excellent work and for open-sourcing the code.
I have a small question. In the inference code, I observed that the predicted logits at each iteration have shape <class_num x H x W>; taking VOC2012 as an example, the output logits are 20 x H x W. It seems that the background class (annotated as 255 in the ground truth) does not participate in the mIoU computation. Could this inflate the reported performance?
I ask because, when visualizing the logits via argmax, I observed that the predicted mask of a class covers the corresponding object well but also extends far beyond its region, yet the evaluated mIoU still reaches 90+.

Looking forward to your reply!
Best wishes!

Splitting the model into an Encoder and a Decoder

Hello! I really like this project.
Do you plan to support splitting this model into an encoder and a decoder, and allow inputting a single image into the model?
Thank you very much!

Run issue

Thanks for your wonderful work. When I run your code, I encounter the following error:
'''
Traceback (most recent call last):
File "/home/gyang/data/SAN/detectron2/detectron2/engine/train_loop.py", line 156, in train
self.after_step()
File "/home/gyang/data/SAN/detectron2/detectron2/engine/train_loop.py", line 190, in after_step
h.after_step()
File "/home/gyang/data/SAN/detectron2/detectron2/engine/hooks.py", line 556, in after_step
self._do_eval()
File "/home/gyang/data/SAN/detectron2/detectron2/engine/hooks.py", line 529, in _do_eval
results = self._func()
File "/home/gyang/data/SAN/detectron2/detectron2/engine/defaults.py", line 453, in test_and_save_results
self._last_eval_results = self.test(self.cfg, self.model)
File "/home/gyang/data/SAN/detectron2/detectron2/engine/defaults.py", line 602, in test
data_loader = cls.build_test_loader(cfg, dataset_name)
File "/home/gyang/data/SAN/train_net.py", line 102, in build_test_loader
return build_detection_test_loader(cfg, dataset_name)
File "/home/gyang/data/SAN/detectron2/detectron2/config/config.py", line 207, in wrapped
explicit_args = _get_args_from_config(from_config, *args, **kwargs)
File "/home/gyang/data/SAN/detectron2/detectron2/config/config.py", line 245, in _get_args_from_config
ret = from_config_func(*args, **kwargs)
File "/home/gyang/data/SAN/san/data/build.py", line 268, in _test_loader_from_config
dataset = get_detection_dataset_dicts(
File "/home/gyang/data/SAN/san/data/build.py", line 129, in get_detection_dataset_dicts
assert len(dicts), "Dataset '{}' is empty!".format(dataset_name)
AssertionError: Dataset 'pcontext_sem_seg_val' is empty!
'''
Would you mind helping me to fix this bug?
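
As a hedged first check for this kind of "Dataset ... is empty!" assertion: the dataset registration in this repo resolves paths relative to the DETECTRON2_DATASETS environment variable (it defaults to ./datasets, as the registration snippet later on this page shows), so make sure the variable points at the root that actually contains pcontext/val:

# illustrative; adjust the path to your own data root
export DETECTRON2_DATASETS=/path/to/datasets
ls $DETECTRON2_DATASETS/pcontext/val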

How to train on PASCAL VOC 2012 datasets

Hi! thank you for your good work.

  1. I want to train and test on the PASCAL VOC 2012 dataset. What should I change? /config/base-coco-stuff-164k-171.yaml? Should I change the original 'coco_2017_train_stuff_sem_seg' to 'voc_sem_seg_train' in the DATASETS:TRAIN: field?

  2. After I changed the yaml file as above, I still got the following error:

  File "/data/users/cliu/work12/SAN/san/model/criterion.py", line 222, in get_loss
    return loss_map[loss](outputs, targets, indices, num_masks)
  File "/data/users/cliu/work12/SAN/san/model/criterion.py", line 142, in loss_labels
    loss_ce = F.cross_entropy(
  File "/home/cliu/miniconda3/envs/san/lib/python3.9/site-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: weight tensor should be defined either for all or no classes

It looks like a code problem rather than an environment problem. Do you have any advice?

3. How do you generate the 'datasets/VOC2012/annotations_detectron2/train' data? Is it the 10582 images of the PASCAL VOC 2012 train set?

Thank you!
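
For question 1, the split can also be changed without editing the yaml, since detectron2 accepts config overrides on the command line; a hedged example (the VOC class count of 20 and the output path are assumptions, and this does not by itself resolve the error in question 2) is:

python train_net.py --config-file configs/san_clip_vit_res4_coco.yaml --num-gpus 8 OUTPUT_DIR ./output/voc MODEL.SAN.NUM_CLASSES 20 DATASETS.TRAIN '("voc_sem_seg_train",)' DATASETS.TEST '("voc_sem_seg_val",)'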

Zero-shot setting

Hello. Thanks for your fantastic work. I notice that your work SimSeg in ECCV 2022 considered both the zero-shot setting and the cross-dataset setting. How does SAN perform in the zero-shot setting?

Why use a masked self-attention layer when calculating X[SLS]?

Thank you for your excellent work. In Figure 3 of the paper, the black squares mean that a query is not updated by a key. Could you kindly explain why these queries are intentionally not updated by the keys? Why not let them be updated normally? I am curious whether this has any potential impact on the results.

pointsample

When comparing the prediction mask with the gt mask, I found you used pointsample() to align their sizes. Do you think it harms the learning performance? Have you tried other methods, or is pointsample() always used for DETR-like models?
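
For readers unfamiliar with the technique being asked about, here is a hedged, generic sketch of point-based mask supervision in the PointRend/Mask2Former style: instead of comparing full-resolution masks, the same randomly sampled points are read from the prediction and the ground truth, and the loss is computed only at those points. This is an illustration, not the repo's exact implementation.

import torch
import torch.nn.functional as F

pred_masks = torch.randn(2, 1, 64, 64)               # predicted mask logits
gt_masks = (torch.rand(2, 1, 64, 64) > 0.5).float()  # binary ground-truth masks

# 1000 random point coordinates per image, in grid_sample's [-1, 1] range.
points = torch.rand(2, 1, 1000, 2) * 2 - 1

pred_pts = F.grid_sample(pred_masks, points, align_corners=False).squeeze(1)
gt_pts = F.grid_sample(gt_masks, points, align_corners=False).squeeze(1)

# The loss only sees the sampled points, so prediction and GT resolutions need not match.
loss = F.binary_cross_entropy_with_logits(pred_pts, gt_pts)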

Some questions about paper details

Dear authors,

Thank you very much for your great contribution to open-vocabulary image segmentation.
I have read your paper and would like to ask two questions.
  1. Do you set the number of query tokens and the number of [SLS] tokens to be the same?
  2. (screenshot from the paper omitted)
  Does "category names" here refer to all the category names of the current dataset (with C denoting their number)?

Looking forward to your reply! Thanks!

Upsample masks during inference?

Hi, I was curious: do you upsample the predicted masks from 1/16 of the image resolution to the original image resolution using interpolation during inference?
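
Whether SAN does exactly this is the question above; purely as an illustration, upsampling low-resolution mask logits back to the input resolution with bilinear interpolation looks like:

import torch
import torch.nn.functional as F

mask_logits = torch.randn(1, 100, 40, 40)  # (batch, queries, H/16, W/16) for a 640x640 input
upsampled = F.interpolate(mask_logits, size=(640, 640), mode="bilinear", align_corners=False)
print(upsampled.shape)                     # torch.Size([1, 100, 640, 640])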

Model structure

Hi Mengde,
I'm trying to replicate the model demo. In the paper, the demo on Hugging Face is able to accept text-based prompt queries on the image; we just want to know which part of the code corresponds to this functionality. I went through the code and was unable to locate it. Can you give me some general guidelines?
Cheers,
Yupeng

TypeError: PatchEmbed.__init__() got an unexpected keyword argument 'dynamic_img_pad'

I got the following error:

Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/SAN/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data/NZH/detectron2/detectron2/engine/launch.py", line 123, in _distributed_worker
main_func(*args)
File "/data/NZH/SAN/train_net.py", line 260, in main
model = Trainer.build_model(cfg)
File "/data/NZH/detectron2/detectron2/engine/defaults.py", line 514, in build_model
model = build_model(cfg)
File "/data/NZH/detectron2/detectron2/modeling/meta_arch/build.py", line 22, in build_model
model = META_ARCH_REGISTRY.get(meta_arch)(cfg)
File "/data/NZH/detectron2/detectron2/config/config.py", line 189, in wrapped
explicit_args = _get_args_from_config(from_config_func, *args, **kwargs)
File "/data/NZH/detectron2/detectron2/config/config.py", line 245, in _get_args_from_config
ret = from_config_func(*args, **kwargs)
File "/data/NZH/SAN/san/model/san.py", line 133, in from_config
"side_adapter_network": build_side_adapter_network(
File "/data/NZH/SAN/san/model/side_adapter/side_adapter.py", line 26, in build_side_adapter_network
return SIDE_ADAPTER_REGISTRY.get(name)(cfg, input_shape)
File "/data/NZH/detectron2/detectron2/config/config.py", line 189, in wrapped
explicit_args = _get_args_from_config(from_config_func, *args, **kwargs)
File "/data/NZH/detectron2/detectron2/config/config.py", line 245, in _get_args_from_config
ret = from_config_func(*args, **kwargs)
File "/data/NZH/SAN/san/model/side_adapter/side_adapter.py", line 129, in from_config
vit = create_model(
File "/home/ubuntu/anaconda3/envs/SAN/lib/python3.10/site-packages/timm/models/_factory.py", line 114, in create_model
model = create_fn(
File "/data/NZH/SAN/san/model/side_adapter/timm_wrapper.py", line 68, in vit_w240n6d8_patch16
model = _create_vision_transformer(
File "/home/ubuntu/anaconda3/envs/SAN/lib/python3.10/site-packages/timm/models/vision_transformer.py", line 1510, in _create_vision_transformer
return build_model_with_cfg(
File "/home/ubuntu/anaconda3/envs/SAN/lib/python3.10/site-packages/timm/models/_builder.py", line 381, in build_model_with_cfg
model = model_cls(**kwargs)
File "/home/ubuntu/anaconda3/envs/SAN/lib/python3.10/site-packages/timm/models/vision_transformer.py", line 465, in init
self.patch_embed = embed_layer(
TypeError: PatchEmbed.init() got an unexpected keyword argument 'dynamic_img_pad'

However, it's weird that the vision_transformer in timm does not pass the param "dynamic_img_pad" to PatchEmbed.
(screenshot omitted)

A naive solution is to add the parameter "dynamic_img_pad=None" to the PatchEmbed in timm_wrapper; no negative impact has been found so far.
(screenshot omitted)
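
A hedged sketch of the signature change described above. The real PatchEmbed in san/model/side_adapter/timm_wrapper.py has more machinery and its constructor arguments may differ, so only copy the idea of accepting and ignoring the extra keyword:

import torch.nn as nn


class PatchEmbed(nn.Module):
    """Minimal patch embedding that tolerates newer timm keyword arguments."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768,
                 dynamic_img_pad=None, **unused_timm_kwargs):
        # `dynamic_img_pad` (and any other unknown keywords forwarded by newer
        # timm versions) is accepted here and simply ignored.
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, C, H, W) -> (B, N, C) token sequence
        return self.proj(x).flatten(2).transpose(1, 2)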

cv2 errors with docker instructions

TLDR
I got this error and fixed it by modifying lines 2 and 3 of app.Dockerfile.

Error: AttributeError: partially initialized module 'cv2' has no attribute 'gapi_wip_gst_GStreamerPipeline'

To replicate
docker build -f docker/app.Dockerfile -t san_app .
docker run -it --shm-size 4G -p 7860:7860 san_app

Fix
Change docker/app.Dockerfile to this:

RUN pip install 'git+https://github.com/facebookresearch/detectron2.git'
RUN pip install cython scipy shapely timm h5py submitit scikit-image wandb setuptools numpy Pillow pycocotools~=2.0.4 fvcore tabulate tqdm ftfy regex open_clip_torch cityscapesscripts tensorboard gradio

RUN useradd -m -u 1000 user
# Switch to the "user" user
USER user
# Set home to the user's home directory
ENV HOME=/home/user \
    PATH=/home/user/.local/bin:$PATH

# Set the working directory to the user's home directory
WORKDIR $HOME
RUN git clone https://github.com/MendelXu/SAN app

WORKDIR $HOME/app
ENV GRADIO_SERVER_NAME=0.0.0.0
EXPOSE 7860
RUN echo "gradio app.py">>run.sh
CMD ["script","-c","sh run.sh","/dev/null"]

Discussion
I also tried messing around with opencv-python versions, but only ran into other cv2 or libGL errors. Eventually, I found this issue in opencv/opencv-python#867 and noticed that detectron2 was also installing cv2. So I changed the dockerfile to install detectron2 first and not install opencv-python after that. And it worked.

Training on multiple datasets

Thank you for your excellent work. I have some questions about training on multiple datasets. For example, I want to train the SAN model on both coco dataset and ade_20k dataset. There are errors when I change TRAIN in Base-coco-stuff-164K-171.yaml due to the assertion assert (len(list(set(dataset_names))) == 1), "All images in a batch must be from the same dataset." How to solve this problem? Thanks a lot!

About testing

Hi, great work. I have a small, detailed question: MIN_SIZE_TEST is set to 640, so the shortest edge of each test image is at least 640, but then RandomCrop randomly crops a 640 x 640 patch, so a certain portion of the test images, and of their annotations, is discarded. Does this influence the test mAP, since a portion of the annotations is thrown away and the model is not evaluated on them?

Visualize

How do you visualize your quantitative results? Do you have any scripts for generating the semantic segmentation results? I'm not very familiar with the detectron2 framework.
Thank you!

CLIP image encoder

I have two questions.
(1) At present, I have a ViT-B/32 weight that I trained myself, but SAN's pre-training weight is ViT-B/16. How can I use my pre-trained ViT-B/32 weight with SAN? As we all know, official CLIP is trained with ViT-B/32, ViT-B/16, and ResNet backbones.
(2) Why doesn't the CLIP image encoder use ResNet?
Thanks!!

Performance difference between the provided log and my re-implementation.

Thanks for your appealing work!
After preparing the required datasets, I downloaded the pretrained weights of the "SAN-ViT-B/16" model and evaluated it using the provided script on all datasets. However, there are small differences between the provided log file and mine.

I have attached my evaluation log.txt.

What might be the reason for this difference?

train and test on my own dataset

Hi! I had some problems when I changed the datasets.
My dataset has 4 foreground classes and 1 background class. I followed other issues and registered it in ./san/data/datasets/register.py and __init__.py. I want to compute the mIoU on both foreground and background.

  1. I set CLASS_NAMES=(background, ...) in register.py (including background), and I set MODEL.SAN.NUM_CLASSES 5. I didn't change mask_cls=F.softmax(mask_cls, dim=-1)[..., :-1]. Do you think the mIoU I calculated this way is reasonable?

  2. I don't understand why the output of mask_cls=F.softmax(mask_cls, dim=-1) has a ..X..X..6 shape (6 classes), and why you delete the last dimension with [..., :-1]. Is it something to do with 255?

  3. While setting up my dataset, I saw the RGB image root and the semantic segmentation ground-truth root. But where is your class label root? Do the labels use image-level labels or semantic segmentation ground truths? Thanks!

Detail about 'background' class

Hi, thanks for your great work.

I have a question about the classifier in this paper 3.3.1 Pixel-wise Side Adapter Network:

where M denotes the number of categories in the training set.

I want to know whether the 'background' class is contained in the training set. If you used a background class, is it learnable or fixed?

Train on private dataset (only one category)

Thank you for your work.
When I used a private dataset (with only one category) for training, I first mimicked the VOC registration to register the dataset:

import os

from detectron2.data import DatasetCatalog, MetadataCatalog
from detectron2.data.datasets import load_sem_seg

CLASS_NAMES = (
    "polyp",
)


def _get_voc_meta(cat_list):
    ret = {
        "stuff_classes": cat_list,
    }
    return ret


def register_all_voc_11k(root):
    root = os.path.join(root, "pranet")
    meta = _get_voc_meta(CLASS_NAMES)

    for name, image_dirname, sem_seg_dirname in [
        ("train", "JPEGImages", "annotations_detectron2/train"),
        ("val", "JPEGImages", "annotations_detectron2/val"),
    ]:
        image_dir = os.path.join(root, image_dirname)
        gt_dir = os.path.join(root, sem_seg_dirname)
        all_name = f"pranet_sem_seg_{name}"
        DatasetCatalog.register(
            all_name,
            lambda x=image_dir, y=gt_dir: load_sem_seg(
                y, x, gt_ext="png", image_ext="jpg"
            ),
        )
        MetadataCatalog.get(all_name).set(
            image_root=image_dir,
            sem_seg_root=gt_dir,
            evaluator_type="sem_seg",
            ignore_label=255,
            **meta,
        )


_root = os.getenv("DETECTRON2_DATASETS", "datasets")
register_all_voc_11k(_root)

and then I used the code to train. The training loss seemed normal, but at prediction time I found that all the results are pure white (all pixels are predicted as foreground):
python train_net.py --config-file ./configs/san_clip_vit_res4_pranet.yaml --num-gpus 1 OUTPUT_DIR ./OUTPUT/vit_14 MODEL.SAN.NUM_CLASSES 1
Can you provide me with some help, or tell me where the problem might be? I would be very grateful!

Training on my own dataset

Hello, and sorry to bother you!
I would like to ask: if I want to train and test on my own dataset, which files do I need to modify?

question about the difference between the paper and code implementation

Thanks for your great work! I have some trouble when trying to understand the code. In visual.py, the sos tokens (I also wonder why they are called sos tokens instead of sls tokens) are computed as follows:

sos_token = cross_attn_layer(
    resblock,
    sos_token,
    x[1:,],
    attn_biases[i],
)

and cross_attn_layer is:

def cross_attn_layer(self: ResidualAttentionBlock, x, mem, attn_bias):
    # x: [K,N,C]
    # mem: [L,N,C]
    # attn_bias: [N*num_head,K,L]
    # return: [K,N,C]
    q_x = self.ln_1(x)
    k_x = v_x = self.ln_1(mem)
    x = x + self.ls_1(
        cross_attn_with_self_bias(self.attn, q_x, k_x, v_x, attn_mask=attn_bias)[0]
    )
    x = x + self.ls_2(self.mlp(self.ln_2(x)))
    return x

It uses the sos_token to obtain q and the visual tokens to obtain k and v, but in section 3 of the paper, the formula (3) seems to use the sls tokens to obtain v. I want to know why there is a discrepancy between the paper and the code, or if there is a problem with my understanding. Thanks!
