henghuiding / mevis

[ICCV 2023] MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions

Home Page: https://henghuiding.github.io/MeViS/

License: MIT License

Languages: Python 89.67%, Cuda 9.20%, C++ 1.02%, Shell 0.11%

Topics: multimodal-learning, referring-expression-comprehension, referring-expression-segmentation, referring-video-object-segmentation, video-understanding, mevis-dataset, mose-dataset

mevis's Introduction

MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions


🏠 [Project page]   📄 [arXiv]   📄 [PDF]   🔥 [Dataset Download]   🔥 [Evaluation Server]

This repository contains the code for the ICCV 2023 paper:

MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Chen Change Loy
ICCV 2023

Abstract

This work strives for motion-expression-guided video segmentation, which focuses on segmenting objects in video content based on a sentence describing the motion of the objects. Existing referring video object segmentation datasets downplay the importance of motion in video content for language-guided video object segmentation. To investigate the feasibility of using motion expressions to ground and segment objects in videos, we propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments. The goal of the MeViS benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms that leverage motion expressions as a primary cue for object segmentation in complex video scenes.

Figure 1. Examples of video clips from Motion expressions Video Segmentation (MeViS) are provided to illustrate the dataset's nature and complexity. The expressions in MeViS primarily focus on motion attributes, and the referred target objects cannot be identified by examining a single frame alone. For instance, the first example features three parrots with similar appearances, and the target object is identified as "The bird flying away". This object can only be recognized by capturing its motion throughout the video.

TABLE 1. Scale comparison between MeViS and existing language-guided video segmentation datasets.

| Dataset | Pub. & Year | Videos | Objects | Expressions | Masks | Obj/Video | Obj/Expn | Target |
|---|---|---|---|---|---|---|---|---|
| A2D Sentence | CVPR 2018 | 3,782 | 4,825 | 6,656 | 58k | 1.28 | 1 | Actor |
| DAVIS17-RVOS | ACCV 2018 | 90 | 205 | 205 | 13.5k | 2.27 | 1 | Object |
| Refer-YouTube-VOS | ECCV 2020 | 3,978 | 7,451 | 15,009 | 131k | 1.86 | 1 | Object |
| MeViS (ours) | ICCV 2023 | 2,006 | 8,171 | 28,570 | 443k | 4.28 | 1.59 | Object(s) |

MeViS Dataset Download

⬇️ Download the dataset from here ☁️.

Dataset Split

  • 2,006 videos & 28,570 sentences in total;
  • Train set: 1,662 videos & 23,051 sentences, used for training;
  • Val^u set: 50 videos & 793 sentences, used for offline evaluation (e.g., ablation studies) by users during training;
  • Val set: 140 videos & 2,236 sentences, used for CodaLab online evaluation;
  • Test set: 154 videos & 2,490 sentences (not released yet), used for evaluation during the competition periods. It is suggested to report results on the Val^u set and the Val set.

Online Evaluation

Please submit your results on the Val set to the CodaLab online evaluation server.

It is strongly suggested to first evaluate your model locally on the Val^u set before submitting your Val set results to the online evaluation system.

File Structure

The dataset follows a structure similar to Refer-YouTube-VOS. Each split of the dataset consists of three parts: JPEGImages, which holds the frame images; meta_expressions.json, which provides the referring expressions and video metadata; and mask_dict.json, which contains the ground-truth object masks. Ground-truth segmentation masks are saved in COCO RLE format, and expressions are organized in the same way as in Refer-YouTube-VOS.

Please note that while annotations for all frames in the Train set and the Val^u set are provided, the Val set only provides frame images and referring expressions for inference. A minimal loading sketch is given after the directory layout below.

mevis
├── train                       // Split Train
│   ├── JPEGImages
│   │   ├── <video #1  >
│   │   ├── <video #2  >
│   │   └── <video #...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
├── valid_u                     // Split Val^u
│   ├── JPEGImages
│   │   └── <video ...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
└── valid                       // Split Val
    ├── JPEGImages
    │   └── <video ...>
    │
    └── meta_expressions.json
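
For reference, below is a minimal Python sketch for reading the metadata and decoding one ground-truth mask. It assumes the Refer-YouTube-VOS-style layout described above (a top-level "videos" dictionary, per-expression "exp"/"anno_id" fields, and mask_dict.json mapping an annotation id to one COCO RLE per frame); these field names are assumptions, so check the files themselves if your copy differs.

import json
from pycocotools import mask as mask_utils  # pip install pycocotools

# Sketch only: the field names ('videos', 'exp', 'anno_id') and the per-frame
# RLE list in mask_dict.json are assumptions based on the Refer-YouTube-VOS-style
# organization described above.
with open('mevis/train/meta_expressions.json') as f:
    videos = json.load(f)['videos']
with open('mevis/train/mask_dict.json') as f:
    mask_dict = json.load(f)

video_id = next(iter(videos))                 # e.g. a video folder name
vid_data = videos[video_id]
for exp_id, exp_dict in vid_data['expressions'].items():
    print(exp_id, exp_dict['exp'])            # the motion expression text
    for anno_id in exp_dict.get('anno_id', []):
        per_frame_rles = mask_dict[str(anno_id)]      # one RLE (or None) per frame
        rle = next(r for r in per_frame_rles if r is not None)
        binary_mask = mask_utils.decode(rle)          # H x W uint8 array
        print('mask shape:', binary_mask.shape, 'area:', int(binary_mask.sum()))
    break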

Method Code Installation

Please see INSTALL.md

Inference

1. Val^u set

Obtain the output masks of the Val^u set:

python train_net_lmpm.py \
    --config-file configs/lmpm_SWIN_bs8.yaml \
    --num-gpus 8 --dist-url auto --eval-only \
    MODEL.WEIGHTS [path_to_weights] \
    OUTPUT_DIR [output_dir]

Obtain the J&F results on the Val^u set:

python tools/eval_mevis.py
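
For reference, J is the region similarity (mask IoU) and F is the contour accuracy (boundary F-measure); the benchmark reports their average, J&F. The snippet below is only a minimal sketch of the J term for a single frame, not the official tools/eval_mevis.py implementation.

import numpy as np

def region_similarity(pred, gt):
    """J for one frame: IoU between two binary masks (H x W arrays)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

# The reported J&F additionally averages the boundary F-measure (F) over all
# frames and expressions; that part is handled by tools/eval_mevis.py.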

2. Val set

Obtain the output masks of Val set for CodaLab online evaluation:

python train_net_lmpm.py \
    --config-file configs/lmpm_SWIN_bs8.yaml \
    --num-gpus 8 --dist-url auto --eval-only \
    MODEL.WEIGHTS [path_to_weights] \
    OUTPUT_DIR [output_dir] DATASETS.TEST '("mevis_test",)'

CodaLab Evaluation Submission Guideline

The submission should be a .zip file containing the predicted .png results for the Val set (for the current competition stage).

You can use the following commands to prepare the .zip submission file; a sketch for writing the masks in this layout is given after the example structure below:

cd [output_dir]
zip -r ../xxx.zip *

A submission example named sample_submission_valid.zip can be found on CodaLab.

sample_submission_valid.zip       // .zip file, which directly packages the 140 Val video folders
├── 0ab4afe7fb46                  // video folder name
│   ├── 0                         // expression_id folder name
│   │   ├── 00000.png             // .png files
│   │   ├── 00001.png
│   │   └── ....
│   │
│   ├── 1
│   │   ├── 00000.png
│   │   └── ....
│   │
│   └── ....
│
├── 0fea0cb75a25
│   ├── 0
│   │   ├── 00000.png
│   │   └── ....
│   │
│   └── ....
│
└── ....
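
If you write the .png files yourself rather than taking them directly from [output_dir], the sketch below (not part of the official code) saves one binary mask per frame under <video>/<expression_id>/<frame>.png. The 0/255 pixel convention is an assumption; compare your output against sample_submission_valid.zip before submitting.

import os
import numpy as np
from PIL import Image

def save_frame_mask(output_dir, video, exp_id, frame_name, mask):
    """Save a predicted binary mask (H x W, bool or 0/1) as a PNG in the
    <video>/<expression_id>/<frame>.png layout shown above."""
    out_dir = os.path.join(output_dir, video, exp_id)
    os.makedirs(out_dir, exist_ok=True)
    # 0/255 grayscale encoding assumed; verify against the sample submission.
    Image.fromarray(mask.astype(np.uint8) * 255).save(
        os.path.join(out_dir, frame_name + '.png'))

# Example (hypothetical values):
# save_frame_mask('output', '0ab4afe7fb46', '0', '00000', pred_mask)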

Training

First, download the backbone weights file (model_final_86143f.pkl) and convert it using the provided script:

wget https://dl.fbaipublicfiles.com/maskformer/mask2former/coco/instance/maskformer2_swin_tiny_bs16_50ep/model_final_86143f.pkl
python tools/process_ckpt.py

Then start training:

python train_net_lmpm.py \
    --config-file configs/lmpm_SWIN_bs8.yaml \
    --num-gpus 8 --dist-url auto \
    MODEL.WEIGHTS [path_to_weights] \
    OUTPUT_DIR [output_dir]

Note: we also support training ReferFormer by providing ReferFormer_dataset.py.

Models

Our results on the Val^u set and the Val set of the MeViS dataset.

  • Val^u set: used for offline evaluation by users themselves, e.g., for ablation studies
  • Val set: used for CodaLab online evaluation by the MeViS dataset organizers

| Backbone | Val^u J&F | Val^u J | Val^u F | Val J&F | Val J | Val F |
|---|---|---|---|---|---|---|
| Swin-Tiny & RoBERTa | 40.23 | 36.51 | 43.90 | 37.21 | 34.25 | 40.17 |

☁️ Google Drive

Acknowledgement

This project is based on VITA, GRES, Mask2Former, and VLT. Many thanks to the authors for their great work!

BibTeX

Please consider citing MeViS if it helps your research.

@inproceedings{MeViS,
  title={{MeViS}: A Large-scale Benchmark for Video Segmentation with Motion Expressions},
  author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Loy, Chen Change},
  booktitle={ICCV},
  year={2023}
}
@inproceedings{GRES,
  title={{GRES}: Generalized Referring Expression Segmentation},
  author={Liu, Chang and Ding, Henghui and Jiang, Xudong},
  booktitle={CVPR},
  year={2023}
}
@article{VLT,
  title={{VLT}: Vision-language transformer and query generation for referring segmentation},
  author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2023},
  publisher={IEEE}
}

A majority of videos in MeViS are from MOSE: Complex Video Object Segmentation Dataset.

@inproceedings{MOSE,
  title={{MOSE}: A New Dataset for Video Object Segmentation in Complex Scenes},
  author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Torr, Philip HS and Bai, Song},
  booktitle={ICCV},
  year={2023}
}

MeViS is licensed under a CC BY-NC-SA 4.0 License. The data of MeViS is released for non-commercial research purpose only.

mevis's People

Contributors

henghuiding, heshuting555


mevis's Issues

MultiScaleDeformableAttention import error

I'm using Python 3.8 with PyTorch 1.9 and CUDA 11.1. I have already set CUDA_HOME as follows:

export CUDA_HOME=/mnt/slurm_home/remelias/anaconda3/envs/vita30/
cd /mnt/slurm_home/remelias/MeViS-main/mask2former/modeling/pixel_decoder/ops/
sh make.sh

I'm still getting the MSDA import error.

Traceback (most recent call last):
line 22, in
import MultiScaleDeformableAttention as MSDA
ImportError: /mnt/slurm_home/remelias/anaconda3/envs/vita30/lib/python3.8/site-packages/MultiScaleDeformableAttention-1.0-py3.8-linux-x86_64.egg/MultiScaleDeformableAttention.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK2at10TensorBase8data_ptrIdEEPT_v

How can I resolve this?
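
A note on this class of error: an undefined-symbol import failure usually means the compiled extension and the runtime PyTorch come from different builds (an assumption here, not a confirmed diagnosis). A minimal sketch for recording the runtime environment before rebuilding with make.sh:

import os
import torch

# Print the runtime toolchain; it should match the environment that was active
# when MultiScaleDeformableAttention was compiled (rebuild with make.sh if not).
print('torch version:       ', torch.__version__)
print('torch built for CUDA:', torch.version.cuda)
print('CUDA available:      ', torch.cuda.is_available())
print('CUDA_HOME:           ', os.environ.get('CUDA_HOME'))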

Issue with Detectron2 config file

I have installed Detectron2 and all other libraries in a Python virtualenv and am facing AssertionError: Config file '' does not exist! when running train_net_lmpm.py. (A screenshot of the error was attached to the original issue.)

there are no 'expressions' in meta_valid.json

Is the meta_expressions.json you mention just meta_valid.json renamed? I renamed meta_valid.json to meta_expressions.json; however, the code throws this exception:

File "/ai/home/project/MeViS-main/lmpm/data/datasets/mevis.py", line 66, in load_mevis_json
for exp_id, exp_dict in vid_data['expressions'].items():
KeyError: 'expressions'

What should I do next?

Mask GroundTruth

I downloaded mask_dict.json from the Google cloud link; however, there seems to be an encoding error. Could you please provide an Annotations folder like those in YouTube-VOS and DAVIS?

Shape cannot match the size during training

During training, in the backbone, I got this error:

File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 694, in forward
value = value.reshape(B, self.num_heads, self.value_channels//self.num_heads, n_l)
RuntimeError: shape '[24, 1, 96, 40]' is invalid for input of size 368640

This happens in SpatialImageLanguageAttention. I found that num_heads is 1, so this is not multi-head attention, right? I don't know whether the shape or the size is wrong, so what is the expected shape or size?

The full error message is below:
Traceback (most recent call last):
File "train_net_lmpm.py", line 318, in
launch(
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/launch.py", line 69, in launch
mp.start_processes(
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/launch.py", line 123, in _distributed_worker
main_func(*args)
File "/root/MeViS/train_net_lmpm.py", line 312, in main
return trainer.train()
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 484, in train
super().train(self.start_iter, self.max_iter)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 155, in train
self.run_step()
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
self._trainer.run_step()
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 494, in run_step
loss_dict = self.model(data)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/lmpm/lmpm_model.py", line 281, in forward
return self.train_model(batched_inputs)
File "/root/MeViS/lmpm/lmpm_model.py", line 312, in train_model
features = self.backbone(images.tensor, lang_feat_sentence, lang_mask)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 785, in forward
y = super().forward(x, l, l_mask)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 470, in forward
x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww, l, l_mask)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 590, in forward
x_residual = self.fusion(x, l, l_mask)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 627, in forward
lang = self.image_lang_att(x, l, l_mask)  # (B, H*W, dim)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 694, in forward
value = value.reshape(B, self.num_heads, self.value_channels//self.num_heads, n_l)
RuntimeError: shape '[24, 1, 96, 40]' is invalid for input of size 368640

Hardware Information

Hello, could you share information about the hardware you are using? I couldn't find it in the paper.

results on referformer

Thanks for the excellent work. In Table 5 of the paper, ReferFormer reaches 31.0 J&F on your dataset; how were these results obtained? Was it evaluated directly on your validation set without training (i.e., directly using the pretrained ReferFormer), or evaluated after training on the training set?
