henghuiding / mose-api

[ICCV 2023] MOSE: A New Dataset for Video Object Segmentation in Complex Scenes

Home Page: https://henghuiding.github.io/MOSE/

Python 100.00%
benchmark complex-environment dataset iccv2023 video-object-segmentation video-segmentation

mose-api's Introduction

MOSE: A New Dataset for Video Object Segmentation in Complex Scenes

๐Ÿ [Homepage] โ€ƒ ๐Ÿ“„[Arxiv]

This repository contains information and tools for the MOSE dataset.

Download

[🔥 02.09.2023: Dataset has been released!]

โฌ‡๏ธ Get the dataset from:

📦 Or use gdown:

# train.tar.gz
gdown 'https://drive.google.com/uc?id=ID_removed_to_avoid_overaccesses_get_it_by_yourself'

# valid.tar.gz
gdown 'https://drive.google.com/uc?id=ID_removed_to_avoid_overaccesses_get_it_by_yourself'

# test set will be released when competition starts.

Please also check the SHA256 sums of the files to ensure data integrity (a small verification sketch follows the sums below):

3f805e66ecb576fdd37a1ab2b06b08a428edd71994920443f70d09537918270b train.tar.gz
884baecf7d7e85cd35486e45d6c474dc34352a227ac75c49f6d5e4afb61b331c valid.tar.gz
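
If sha256sum is not available, the sums can also be checked with a short Python snippet. This is a minimal sketch, not part of the official tools; it assumes the two archives are in the current directory.

# verify_checksums.py -- minimal sketch, assumes train.tar.gz / valid.tar.gz are in the current directory
import hashlib

EXPECTED = {
    "train.tar.gz": "3f805e66ecb576fdd37a1ab2b06b08a428edd71994920443f70d09537918270b",
    "valid.tar.gz": "884baecf7d7e85cd35486e45d6c474dc34352a227ac75c49f6d5e4afb61b331c",
}

for name, expected in EXPECTED.items():
    h = hashlib.sha256()
    with open(name, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
            h.update(chunk)
    print(name, "OK" if h.hexdigest() == expected else "MISMATCH")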

Evaluation

[🔥 02.16.2023: Our CodaLab competition is live now!]

Please submit your results on the CodaLab evaluation server.

File Structure

The dataset follows a structure similar to DAVIS and YouTube-VOS. It consists of two parts: JPEGImages, which holds the frame images, and Annotations, which contains the corresponding segmentation masks. Frame images are numbered with five-digit file names. Annotations are saved as palette-mode PNGs, as in DAVIS.

Please note that while annotations are provided for all frames in the training set, annotations for the validation set only include the first frame. A minimal loading sketch follows the directory tree below.

<train/valid.tar>
│
├── Annotations
│   │
│   ├── <video_name_1>
│   │   ├── 00000.png
│   │   ├── 00001.png
│   │   └── ...
│   │
│   ├── <video_name_2>
│   │   ├── 00000.png
│   │   ├── 00001.png
│   │   └── ...
│   │
│   └── <video_name_...>
│
└── JPEGImages
    │
    ├── <video_name_1>
    │   ├── 00000.jpg
    │   ├── 00001.jpg
    │   └── ...
    │
    ├── <video_name_2>
    │   ├── 00000.jpg
    │   ├── 00001.jpg
    │   └── ...
    │
    └── <video_name_...>
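
A minimal Python sketch for walking this structure and reading the palette-mode masks. This is an illustrative snippet, not part of the official API; it assumes the archive has been extracted to a local train/ directory and that NumPy and Pillow are installed.

import os

import numpy as np
from PIL import Image

root = "train"  # assumed path to the extracted train.tar.gz

for video in sorted(os.listdir(os.path.join(root, "JPEGImages"))):
    frame_dir = os.path.join(root, "JPEGImages", video)
    mask_dir = os.path.join(root, "Annotations", video)
    for frame_name in sorted(os.listdir(frame_dir)):
        frame = np.array(Image.open(os.path.join(frame_dir, frame_name)))  # H x W x 3 RGB frame
        mask_path = os.path.join(mask_dir, frame_name.replace(".jpg", ".png"))
        if not os.path.exists(mask_path):
            continue  # e.g. in the valid split only the first frame is annotated
        # Keep the PNG in palette mode: pixel values are object IDs, 0 is background (DAVIS convention)
        mask = np.array(Image.open(mask_path))  # H x W array of object IDs
        object_ids = [int(i) for i in np.unique(mask) if i != 0]
        # ... use frame, mask, and object_ids here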

BibTeX

Please consider citing MOSE if it helps your research.

@inproceedings{MOSE,
  title={{MOSE}: A New Dataset for Video Object Segmentation in Complex Scenes},
  author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Torr, Philip HS and Bai, Song},
  booktitle={ICCV},
  year={2023}
}

License

MOSE is licensed under a CC BY-NC-SA 4.0 License. The data of MOSE is released for non-commercial research purposes only.

mose-api's People

Contributors

changliu19, henghuiding

mose-api's Issues

Annotation tool

Thank you for your wonderful work! Could you share the annotation tool used to build the dataset? I would appreciate it if you could release the code for the annotation tool.

gdown is no longer working and the Google Drive download is very unstable

Hello!

We appreciate the release of this dataset! It's fantastic.
I would just like to point out that gdown is showing this error:

gdown https://drive.google.com/uc\?id\=10HYO-CJTaITalhzl_Zbz_Qpesh8F3gZR
Access denied with the following error:

Cannot retrieve the public link of the file. You may need to change
the permission to 'Anyone with the link', or have had many accesses.

You may still be able to access the file from the browser:

 https://drive.google.com/uc?id=10HYO-CJTaITalhzl_Zbz_Qpesh8F3gZR
And when downloading directly from Google Drive, the transfer breaks every couple of minutes and eventually fails. Baidu also shows errors when downloading. The most stable option so far is OneDrive.

Thanks.

Unsupervised VOS Setting

Thanks for your answer before!

I have another question about the unsupervised VOS setting, since I want to test some VIS methods on MOSE and need to follow the same protocol you used in the MOSE paper. From my understanding, unsupervised VOS with multiple objects = VIS (Video Instance Segmentation): both segment and track objects of predefined categories without a first-frame reference. Is that right?

Therefore, when you ran the methods listed in Table 5, did you train them on the MOSE training set (only on those videos where the first frame is exhaustively labeled) and then test them on the validation set (again only videos with an exhaustively labeled first frame) without a first-frame reference, which is the same way VIS methods are trained and tested? Is this the case?

Could you please clarify this? Thanks a lot!

Some folders only have one image

Hi,

I have recently realized that some folders only have one image, which is weird for a video dataset. For instance :

MOSE/train/JPEGImages/9eb92f21

Is this expected, or is the problem on my side?

DeAOT training & inference.

Since you uploaded the code for XMem, could you please also provide the train_datasets.py and eval_datasets.py of aot-benchmark for MOSE?
And did you change the training config, or is it the same as for YTB, e.g.
self.DATA_MOSE_REPEAT = 1, self.DATA_RANDOM_GAP_MOSE = 3?

Thanks a lot!

Training setting

Thanks for your work!

I want to ask about the training setting. Your paper said "We replace the training dataset of previous methods from YouTubeVOS with our MOSE and strictly follow their training settings on YouTube-VOS [3]."

Most previous works train on YouTube-VOS and DAVIS in the main training stage after image pre-training.
Did you remove the DAVIS dataset? If so, is it because you found that removing DAVIS works better?

Thanks a lot for your answer!

About the use of the dataset

Thanks for the data. This is my first time using a DAVIS-style dataset. I want to turn it into a human-segmentation dataset; is there any way I can extract all the humans from the segmentation annotations?

Question about the experimental result of STCN on table 3

I trained STCN on MOSE, but your paper had a different result.

What I've done so far:

  1. downloaded a pre-trained STCN model (the static-image pre-trained version)
  2. trained on MOSE only, with the same settings as stage 3 of STCN
  3. ran inference on the MOSE valid set using STCN's eval_generic.py
  4. uploaded the results to the MOSE CodaLab

and the score I got on MOSE codalab was 0.2601784555.

Do you have any idea why this discrepancy appears?

Why videos with only one frame?

Hi, why are there "videos" in the dataset with only one frame?
For example 330ac20d, 9eb92f21, a4287634, ce1ea47c.

I'm just curious whether there's a reason; otherwise, thanks for this dataset.

Possible Out-Of-Date SHA256sum for train.tar.gz

First of all, thank you for this amazing work! I would kindly ask you for confirmation, since I am facing an issue when checking data integrity with sha256sum.
I downloaded the .tar.gz training file from OneDrive, as it is the suggested source.
The sums reported in the corresponding file in the OneDrive folder match the ones reported on GitHub, but after downloading the train archive multiple times I always get the same value for that file, and it does not match the one reported in the OneDrive file or on GitHub.
I also tried downloading it from multiple PCs (Desktop + Windows + Chrome, Laptop + Arch Linux + Firefox), and I always obtain the same sha256sum, which differs from the one reported.
Am I doing something wrong, or is the sum value for train.tar.gz actually out of date?

Unsupervised VOS Evaluation

Hi, thanks for the great dataset!

I am interested in the unsupervised VOS part. Although the metafiles for the training and validation sets include the first_frame_exhaustive_anno field to denote whether the first frame is exhaustively annotated, the evaluation server on CodaLab does not seem to include specific results for the unsupervised VOS setting.

If that's the case, is there any other way to evaluate the unsupervised VOS setting so that we can compare with the Table 5 results in the MOSE paper? Thank you!

Unsupervised VOS

Hello, thank you very much for your work. Is there any dataset for unsupervised VOS?
