

Self-critical Sequence Training for Image Captioning

This is an unofficial implementation of Self-critical Sequence Training for Image Captioning.

The latest topdown and att2in2 models can achieve a CIDEr score of 1.12 on Karpathy's test split after self-critical training.

This is based on Ruotian's self-critical.pytorch repository.

Requirements

Python 2.7 and PyTorch 0.2 (along with torchvision)

Pretrained models

You need to download a pretrained ResNet model (resnet101.pth or resnet50.pth) for both training and evaluation. The models can be downloaded from here and should be placed in the data/imagenet_weights folder.

Pretrained image captioning models (trained on the COCO dataset) are provided here.

Train your own network on COCO

Download COCO dataset and preprocessing

First, download the COCO images from link. We need the 2014 training images and the 2014 validation images. Put them in the image/train2014/ and image/val2014/ directories; the image/ directory is referred to as $IMAGE_ROOT.

Download the preprocessed COCO captions from link on Karpathy's homepage. Extract dataset_coco.json from the zip file and copy it into data/. This file provides the preprocessed captions along with the standard train-val-test splits.

Once we have these, we can invoke the prepro_*.py scripts, which read all of this in and create the dataset (two feature folders, an HDF5 label file, and a JSON file).

$ python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json \
    --output_h5 data/cocotalk
$ python scripts/prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk \
    --images_root image

prepro_labels.py maps all words that occur 5 times or fewer to a special UNK token and creates a vocabulary from the remaining words.

The image information and vocabulary are dumped into cocotalk.json, and the discretized caption data are dumped into cocotalk_label.h5. Both are stored in the data/ folder.
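
For intuition, the vocabulary step is essentially frequency counting over the training captions. A minimal sketch, assuming the dataset_coco.json structure (names simplified; not the actual script):

from collections import Counter

def build_vocab(dataset, count_thr=5):
    # Count how often each token appears across all captions.
    counts = Counter()
    for img in dataset['images']:
        for sent in img['sentences']:
            counts.update(sent['tokens'])
    # Keep words that occur more than count_thr times; everything
    # else will be mapped to the special UNK token.
    vocab = [w for w, n in counts.items() if n > count_thr]
    vocab.append('UNK')
    return vocab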

prepro_feats.py extracts the ResNet-101 features of each image (both the pooled fc feature and the last conv feature). The fc features are saved as .npy files in data/cocotalk_fc and the attention features as .npz files in data/cocotalk_att; the resulting files take about 200GB.
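
Conceptually, the script does something like the following per image (a sketch using a recent torchvision API; the actual script also handles resizing, normalization, batching, and the attention feature size):

import torch
import torchvision.models as models

resnet = models.resnet101()
resnet.load_state_dict(torch.load('data/imagenet_weights/resnet101.pth'))
resnet.eval()

def extract_feats(img):
    # img: a normalized 1 x 3 x H x W float tensor.
    x = img
    # Run the ResNet trunk, stopping before the final avgpool/fc head.
    for layer in [resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
                  resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4]:
        x = layer(x)
    fc = x.mean(3).mean(2).squeeze(0)        # pooled 2048-d fc feature
    att = x.squeeze(0).permute(1, 2, 0)      # spatial conv feature map
    return fc.detach().numpy(), att.detach().numpy()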

(Check the prepro scripts for more options, like other resnet models or other attention sizes.)

Start training

$ python train.py --id fc --input_json data/cocotalk.json --input_fc_dir data/cocotalk_fc \
    --input_att_dir data/cocotalk_att --input_label_h5 data/cocotalk_label.h5 --batch_size 48 \
    --learning_rate 1e-3 --learning_rate_decay_start 0 --scheduled_sampling_start -1 \
    --checkpoint_path log_fc --save_checkpoint_every 800 --val_images_use 1000 --max_epochs 300 \
    --caption_model topdown --seq_per_img 1

The training script will dump checkpoints into the folder specified by --checkpoint_path (default: save/). To save disk space, only the best-performing checkpoint on validation and the latest checkpoint are kept.

To resume training, point --start_from to the directory containing infos.pkl and model.pth (usually you can just set --start_from and --checkpoint_path to the same path).

Note that the command above sets --scheduled_sampling_start to -1, which turns scheduled sampling off; set it to a non-negative epoch (e.g. 0) to enable scheduled sampling from that epoch onward.
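
For reference, scheduled sampling means that at each decoding step the ground-truth token is replaced, with some probability that is annealed up during training, by a token sampled from the model's own previous prediction. A minimal sketch of that choice (illustrative names, not the repo's exact code):

import torch

def step_input(seq, logprobs, t, ss_prob):
    # seq: ground-truth token ids, shape (batch, T)
    # logprobs: model log-probabilities from step t-1, shape (batch, vocab)
    # ss_prob: probability of feeding the model's own sample instead of
    #          the ground truth; 0 recovers pure teacher forcing.
    it = seq[:, t].clone()                       # teacher-forced tokens
    mask = torch.rand(seq.size(0)) < ss_prob     # items to resample
    if mask.any():
        sampled = torch.multinomial(logprobs.exp(), 1).squeeze(1)
        it[mask] = sampled[mask]
    return it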

If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to the validation cross-entropy loss, use the --language_eval 1 option, but don't forget to download the coco-caption code into the coco-caption directory.

For more options, see opts.py.

A few notes on training. To give you an idea, with the default settings one epoch over the MS COCO images is about 11,000 iterations. One epoch of training results in a validation loss of ~2.5 and a CIDEr score of ~0.68. By iteration 60,000, CIDEr climbs to about 0.84 (validation loss around 2.4, under scheduled sampling).

Train using self critical

First, preprocess the dataset to build the n-gram cache used for calculating the CIDEr score:

$ python scripts/prepro_ngrams.py --input_json data/dataset_coco.json --dict_json data/cocotalk.json \
    --output_pkl data/coco-train --split train

You also need to clone my forked cider repository.
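
Roughly speaking, the cache stores n-gram statistics (up to 4-grams) over the training captions, which CIDEr needs for its tf-idf weighting. A simplified sketch of the document-frequency part:

from collections import defaultdict

def ngrams(tokens, n_max=4):
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def build_doc_freq(all_refs):
    # all_refs: one list of reference captions per training image.
    # For the idf term, CIDEr needs the number of images whose
    # references contain each n-gram at least once.
    doc_freq = defaultdict(int)
    for refs in all_refs:
        seen = set()
        for ref in refs:
            seen.update(ngrams(ref.split()))
        for g in seen:
            doc_freq[g] += 1
    return doc_freq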

Then copy the model pretrained with cross-entropy loss (copying is not mandatory; it just keeps a backup):

$ bash scripts/copy_model.sh fc fc_rl

Then run:

$ python train.py --id fc_rl --caption_model fc --input_json data/cocotalk.json \
    --input_fc_dir data/cocotalk_fc --input_att_dir data/cocotalk_att --input_label_h5 data/cocotalk_label.h5 \
    --batch_size 10 --learning_rate 1e-3 --start_from log_fc_rl --checkpoint_path log_fc_rl \
    --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --self_critical_after 30

With self-critical training starting after 30 epochs, the CIDEr score goes up to 1.05 after 600k iterations (including the 30 epochs of pretraining).
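
At the heart of self-critical training is a REINFORCE update in which the model's own greedy decode serves as the reward baseline, so no learned value function is needed: a sampled caption is pushed up if its CIDEr beats the greedy caption's and pushed down otherwise. A minimal sketch of that loss (illustrative, not the repo's exact code):

import torch

def scst_loss(sample_logprobs, mask, reward_sample, reward_greedy):
    # sample_logprobs: log-prob of each sampled token, shape (batch, T)
    # mask: 1 for real tokens, 0 for padding, shape (batch, T)
    # reward_sample: CIDEr of the sampled captions, shape (batch,)
    # reward_greedy: CIDEr of the greedy captions (the baseline), shape (batch,)
    advantage = (reward_sample - reward_greedy).unsqueeze(1)
    # REINFORCE with the greedy score as the baseline.
    loss = -(advantage * sample_logprobs * mask).sum() / mask.sum()
    return loss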

Generate image captions

Evaluate on raw images

Now place all your images of interest into a folder (the commands below use image/test2014) and run the eval script:

$ python eval.py --model log_fc_rl/model-best.pth --infos_path log_fc_rl/infos_fc_rl-best.pkl \
    --image_folder image/test2014 --num_images 10
$ python eval.py --model log_fc/model-best.pth --infos_path log_fc/infos_fc-best.pkl \
    --image_folder image/test2014 --num_images 10

This tells the eval script to run on up to 10 images from the given folder. If you have a big GPU you can speed up the evaluation by increasing --batch_size. Use --num_images -1 to process all images. The eval script will create a vis.json file inside the vis folder, which can then be visualized with the provided HTML interface:

$ cd vis
$ python -m SimpleHTTPServer

Now visit localhost:8000 in your browser and you should see your predicted captions.

Evaluate on Karpathy's test split

$ python eval.py --dump_images 0 --num_images 5000 --model log_fc_rl/model-best.pth \
    --infos_path log_fc_rl/infos_fc_rl-best.pkl --language_eval 1
$ python eval.py --dump_images 0 --num_images 500 --model log_fc/model-best.pth \
    --infos_path log_fc/infos_fc-best.pkl --language_eval 1

The default split to evaluate is test. The default inference method is greedy decoding (--sample_max 1); to sample from the posterior instead, set --sample_max 0.
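
In code, the difference between the two modes is just argmax versus multinomial sampling over the model's per-step distribution (a sketch; names are illustrative):

import torch

def next_token(logprobs, sample_max):
    # logprobs: log-probabilities at the current step, shape (batch, vocab)
    if sample_max:
        return logprobs.argmax(1)                            # greedy decoding
    return torch.multinomial(logprobs.exp(), 1).squeeze(1)   # posterior sample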

Beam Search. Beam search can improve performance over greedy decoding by ~5%, at a modest extra cost. To turn on beam search, use --beam_size N with N greater than 1.
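
For intuition, here is a bare-bones beam search over a hypothetical step function that maps a partial token sequence to next-token log-probabilities (the repo's actual implementation lives in the model code):

import torch

def beam_search(step, beam_size, max_len, bos, eos):
    # Keep the beam_size highest-scoring partial captions at each step.
    beams = [([bos], 0.0)]                  # (token list, total log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:           # finished beams carry over
                candidates.append((tokens, score))
                continue
            logprobs = step(tokens)         # tensor of shape (vocab,)
            top_lp, top_ix = logprobs.topk(beam_size)
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((tokens + [ix], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]                      # best-scoring caption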

Miscellanea

Train on another dataset. Porting should be trivial if you can create a file like dataset_coco.json for your own dataset.
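
To give a sense of the expected format, each image entry in Karpathy's dataset_coco.json looks roughly like this (abridged; check your copy for the exact fields):

# One image entry from dataset_coco.json, shown as a Python dict:
entry = {
    "filepath": "val2014",
    "filename": "COCO_val2014_000000391895.jpg",
    "cocoid": 391895,
    "split": "train",          # one of train / val / test / restval
    "sentences": [
        {"tokens": ["a", "man", "riding", "a", "bike"],
         "raw": "A man riding a bike."},
        # ... typically 5 reference captions per image
    ],
}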

