The clipaudiocaption from dhecloud

Work based on the official implementation of the paper "ClipCap: CLIP Prefix for Image Captioning"

Description

Image captioning is a complicated task that usually relies on a pretrained detection network, which requires additional supervision in the form of object annotations. We present a new approach that does not require additional information (i.e. it requires only images and captions), and thus can be applied to any data. In addition, our model trains much faster than similar methods while achieving results comparable to the state of the art, even on the Conceptual Captions dataset, which contains over 3M images.

In our work, we use the CLIP model, which was already trained over an extremely large number of images and is thus capable of generating semantic encodings for arbitrary images without additional supervision. To produce meaningful sentences we fine-tune a pretrained language model, an approach that has proven successful for other natural language tasks. The key idea is to use the CLIP encoding as a prefix to the textual captions by employing a simple mapping network over the raw encoding, and then fine-tune our language model to generate a valid caption. In addition, we present another variant, where we utilize a transformer architecture for the mapping network and avoid fine-tuning GPT-2. Still, our light model achieves results comparable to the state of the art on the nocaps dataset.
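The prefix idea described above can be illustrated with a toy, dependency-free sketch. It is not the repository's code: the function names, the single random linear layer standing in for the trained mapping network, and the tiny dimensions are all assumptions chosen for readability. The point is only the shape bookkeeping: one CLIP vector becomes a few prefix "tokens" that are placed in front of the caption token embeddings.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Random weights and zero bias, standing in for a trained layer (assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def linear(x, layer):
    """Apply y = W x + b to a single vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def map_to_prefix(clip_embedding, layer, prefix_length, token_dim):
    """Map one CLIP vector to prefix_length vectors of size token_dim."""
    flat = linear(clip_embedding, layer)
    return [flat[i * token_dim:(i + 1) * token_dim] for i in range(prefix_length)]

def build_lm_input(prefix, caption_token_embeddings):
    """Place the prefix in front of the caption token embeddings."""
    return prefix + caption_token_embeddings

# Toy sizes (hypothetical); real CLIP and GPT-2 embeddings are far larger.
CLIP_DIM, TOKEN_DIM, PREFIX_LENGTH = 8, 4, 3

rng = random.Random(0)
layer = make_linear(CLIP_DIM, PREFIX_LENGTH * TOKEN_DIM, rng)
clip_embedding = [rng.uniform(-1, 1) for _ in range(CLIP_DIM)]
caption_embeddings = [[rng.uniform(-1, 1) for _ in range(TOKEN_DIM)] for _ in range(5)]

prefix = map_to_prefix(clip_embedding, layer, PREFIX_LENGTH, TOKEN_DIM)
lm_input = build_lm_input(prefix, caption_embeddings)
print(len(prefix), len(lm_input), len(lm_input[0]))  # 3 8 4
```

During training, the language model would see the whole sequence and be asked to predict the caption tokens that follow the prefix; in the variant that keeps GPT-2 frozen, only the mapping network's parameters would be updated.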

Training prerequisites

Clone, create environment and install dependencies:

git clone https://github.com/anushkajj/ClipAudioCaption.git && cd ClipAudioCaption
conda env create -f environment.yml
conda activate clip_prefix_caption

Clotho training

Download train_audio and captions to data.

Download the model weights to the root directory:

gdown --id 14pXWwB4Zm82rsDdvbGguLfx9F8aM7ovT -O model_wieghts.pt

Extract CLIP features using (output is data/clotho/oscar_split_ViT-B_32_train.pkl):

python parse_clotho.py --clip_model_type ViT-B/32 --caption_path ./data/clotho_captions_development.csv --audio_path ./data/development

Train with fine-tuning of GPT-2:

python train.py --data ./data/clotho/oscar_split_ViT-B_32_train.pkl --out_dir ./clotho_train/
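Before launching the two commands above, it can help to confirm that the files they read are in place. The following is a minimal sketch, assuming the paths are exactly those shown in the commands; the helper name missing_paths is hypothetical and not part of the repository.

```python
from pathlib import Path

def missing_paths(paths):
    """Return the subset of paths that do not exist, in the given order."""
    return [p for p in paths if not Path(p).exists()]

# Inputs read by each step, copied from the commands in this README.
PARSE_INPUTS = ["./data/clotho_captions_development.csv", "./data/development"]
TRAIN_INPUTS = ["./data/clotho/oscar_split_ViT-B_32_train.pkl"]

for stage, paths in (("parse_clotho.py", PARSE_INPUTS), ("train.py", TRAIN_INPUTS)):
    missing = missing_paths(paths)
    status = "ready" if not missing else "missing: " + ", ".join(missing)
    print(f"{stage}: {status}")
```

The train.py input is expected to be reported as missing until the feature extraction step has been run.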
