iriscxy / vmsmo


Official code and dataset link for ''VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles''

VMSMO

About the corpus

The VMSMO corpus consists of 184,920 document-summary pairs: 180,000 training pairs, 2,460 validation pairs, and 2,460 test pairs.

For now, we publish a link (https://drive.google.com/drive/folders/1MpVv9naDaLINIo4ZKjGoZZHqp7v3_b-A?usp=sharing) from which each case in the dataset can be downloaded. The dataset consists of train.json, valid.json, and test.json. Each item in a JSON file has the following fields:

- ID: the ID number of the news
- content: the content of news
- original_pictures: whether the original microblog has pictures
- video_url: video URL
- image_url: video cover image URL
- publish_place: the place of publication
- publish_time: the time the microblog was published
- publish_tool: the method used to publish the microblog
- Up_num: number of likes
- retweet_num: number of reposts
- comment_num: number of comments
- title: title of the Weibo post

Only the fields 'content', 'title', 'video_url', and 'image_url' are needed in our experiments. However, we keep all the information in the JSON files for possible future use.
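As a minimal sketch of working with these items, the snippet below keeps only the four fields the experiments use. It assumes each file holds a single top-level JSON array of objects, which is not confirmed here; `keep_needed_fields` and `load_items` are hypothetical helper names, not part of the released code.

```python
import json

# Fields named above as the only ones used in the experiments.
NEEDED_FIELDS = ("content", "title", "video_url", "image_url")


def keep_needed_fields(item):
    """Return a copy of one news item restricted to NEEDED_FIELDS.

    A field that is absent from the item is set to None instead of
    raising KeyError.
    """
    return {field: item.get(field) for field in NEEDED_FIELDS}


def load_items(path):
    """Load one split, e.g. train.json.

    Assumption: the file is a single JSON array of objects.
    """
    with open(path, encoding="utf-8") as handle:
        return [keep_needed_fields(item) for item in json.load(handle)]
```

If the files turn out to store one JSON object per line instead, `load_items` would need to parse each line separately; `keep_needed_fields` stays the same.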

About the code

Requirements

python = 3.6
tensorflow = 1.9
numpy = 1.16
opencv-python = 4.2

Commands

The preprocess folder contains videoprocess.py, which splits the videos into frames, and dataprocess.py, which reads the images and finds the image label for each video. Image features are then extracted with ResNet by resnet152_img.py in the sim folder.
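The frame-splitting step can be pictured with the sketch below. It is not the repository's videoprocess.py; it only computes which frame indices to keep, assuming candidates are spaced evenly so that 10 fall in every 120 frames (the ratio quoted from the paper in the issues further down). `candidate_frame_indices` and its parameters are hypothetical names.

```python
def candidate_frame_indices(total_frames, window=120, per_window=10):
    """Return evenly spaced frame indices, per_window of them per window.

    Assumption: "10 candidates from every 120 frames" is read as
    "keep every 12th frame"; the released code may select frames
    differently.
    """
    if total_frames < 0:
        raise ValueError("total_frames must be non-negative")
    if window <= 0 or per_window <= 0:
        raise ValueError("window and per_window must be positive")
    # Spacing between kept frames; never smaller than one frame.
    step = max(1, window // per_window)
    return list(range(0, total_frames, step))
```

The returned indices could then be used to decide which decoded frames to write to disk, for example while iterating over a video with OpenCV.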

Train:

python run_summarization.py --mode=train --data_path=* --test_path=* --vocab_path=* --log_root=logs --exp_name=vmsmo --max_enc_steps=100 --max_dec_steps=30 --vocab_size=50000 --lr=0.001

Test:

python run_summarization.py --mode=decode --data_path=* --test_path=* --vocab_path=* --log_root=logs --exp_name=vmsmo --max_enc_steps=100 --max_dec_steps=30 --vocab_size=50000 --lr=0.001

We also provide the crawler code used to collect videos and text from the Weibo website; see the crawler-weibo folder.

Citation

If you find our dataset and code useful, we would appreciate a citation.

@inproceedings{Li2020VMSMO,
  title={VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles},
  author={Mingzhe Li and Xiuying Chen and Shen Gao and Zhangming Chan and Dongyan Zhao and Rui Yan},
  booktitle = {EMNLP},
  year = {2020}
}

vmsmo's Issues

Hard to reproduce the code

Hello,

Thanks for your recent updates, and sorry to bother you again.
Could you please also share the specific files needed for data_path, test_path, and vocab_path in 'python run_summarization.py --mode=train --data_path=* --test_path=* --vocab_path=* --log_root=logs --exp_name=vmsmo --max_enc_steps=100 --max_dec_stpes=30 --vocab_size=50000 --lr=0.001'? It is still hard to reproduce the results in your paper by following the readme. Also, your paper states that '10 cover candidates are selected from every 120 frames'. Could you point out where this is implemented in your code?
Thank you!

A KeyError when running the code

Hello, I have run the code and I wonder whether train.json is missing some fields, such as 'start', 'end', and 'max_sim'. Thank you very much!
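One way to check this before training is to count how many items lack each key. A minimal sketch, assuming the items have already been parsed into a list of dicts; `count_missing_keys` is a hypothetical helper, and the key names are taken from the question above, not from the released code.

```python
from collections import Counter


def count_missing_keys(items, required=("start", "end", "max_sim")):
    """Count, for each required key, how many items do not contain it.

    Keys present in every item are left out of the result.
    """
    missing = Counter()
    for item in items:
        for key in required:
            if key not in item:
                missing[key] += 1
    return dict(missing)
```

An empty result means every item carries all of the listed keys; a count equal to the number of items suggests the field was never written to the file.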

when to release the dataset?

Hello, when will the dataset be released? Thanks.

pretrained model

Hi,

Could you please share your trained model for this dataset, so that we can do some fine-tuning work with other datasets? Thanks!

About the full dataset

Hello, when will the full dataset be released?

Baseline codes

Hi,

Thanks for sharing your work!
Could you please also share the baseline code (e.g., How2, MSMO, and MOF in Table 2)? Thanks!

How to get the dataset (most of the URLs you provided are now invalid)

Hello, most of the URLs you provide in the dataset return a 403 error, whether they are opened directly in the browser or fetched with requests.get.

It seems that these URLs have expired, and most of the data URLs have this problem, so the download method you provided no longer seems to work.

Could you provide the source dataset for direct download?

<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<h1>403 Forbidden</h1>
<p>You don't have permission to access the URL on this server.<hr/>Powered by Tengine</body>
</html>

I attach a list of links that worked and list of links which didn't from the dev set.

Any ideas on what could be the reason? Is it possible that some kind of authorization is required?
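To build such lists in a repeatable way, the links can be sorted by the HTTP status they return. This is a minimal sketch using only the Python standard library; `fetch_status` and `split_links` are hypothetical helper names, and sending a HEAD request is an assumption, since some servers answer HEAD differently from GET.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 403 Forbidden
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, ...


def split_links(urls, get_status=fetch_status):
    """Split urls into working ones (2xx) and (url, status) failures."""
    working, failing = [], []
    for url in urls:
        status = get_status(url)
        if status is not None and 200 <= status < 300:
            working.append(url)
        else:
            failing.append((url, status))
    return working, failing
```

Passing a different `get_status` makes it possible to try other request styles, or to test the splitting logic without network access.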