VISITRON: A multi-modal Transformer-based model for Cooperative Vision-and-Dialog Navigation (CVDN)

License: MIT No Attribution

VISITRON: Visual Semantics-aligned Interactively Trained Object-Navigator

Ayush Shrivastava, Karthik Gopalakrishnan, Yang Liu, Robinson Piramuthu, Gokhan Tür, Devi Parikh, Dilek Hakkani-Tür

Accepted to NAACL 2021, Visually Grounded Interaction and Language (ViGIL) Workshop

VISITRON

Setup

Clone the repo using:

git clone --recursive https://github.com/alexa/visitron.git

Matterport3D Dataset and Simulator

This codebase uses the Matterport3D Simulator. Detailed instructions on setting up the simulator and preprocessing the Matterport3D data for faster simulator performance are in Matterport3DSimulator_README. We provide a Docker setup to make installing the simulator easier.

We assume that the Matterport3D dataset is present at $MATTERPORT_DATA_DIR, which can be set using:

export MATTERPORT_DATA_DIR=<PATH_TO_MATTERPORT_DATASET>

Docker Setup

Build the Docker image:

docker build -t mattersim:visitron .

To run the Docker container and mount the codebase and the Matterport3D dataset, use:

nvidia-docker run -it --ipc=host --cpuset-cpus="$(taskset -c -p $$ | cut -f2 -d ':' | awk '{$1=$1};1')" --volume `pwd`:/root/mount/Matterport3DSimulator --mount type=bind,source=$MATTERPORT_DATA_DIR,target=/root/mount/Matterport3DSimulator/data/v1/scans,readonly mattersim:visitron
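The --cpuset-cpus value above pins the container to the CPUs already available to your shell. As a sketch of what that inner command computes (using a hard-coded sample of taskset output rather than a live PID):

```shell
# Sample output of `taskset -c -p $$` (the real command queries the current shell).
AFFINITY="pid 1234's current affinity list: 0-7"

# Take everything after the colon, then trim whitespace; this is the
# value that gets handed to --cpuset-cpus.
CPUS=$(echo "$AFFINITY" | cut -f2 -d ':' | awk '{$1=$1};1')
echo "$CPUS"   # -> 0-7
```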

Task Data Setup

NDH and R2R

mkdir -p srv/task_data
bash scripts/download_ndh_r2r_data.sh

RxR

Refer to the RxR repo for its setup, and copy the data to the srv/task_data/RxR/data folder.

Pre-training

Inside the docker container, run these commands to generate pre-training data.

For NDH, run

python scripts/generate_pretraining_data.py --dataset_to_use NDH --split train

For R2R, run

python scripts/generate_pretraining_data.py --dataset_to_use R2R --split train

The data is saved to srv/task_data/pretrain_data. By default, this script starts 8 worker processes to speed up execution; adjust --start_job_index, --end_job_index and --global_total_jobs to change how the work is partitioned.
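If the default partitioning is not a good fit for your machine, the job shards can be launched manually. A hedged sketch, assuming --start_job_index/--end_job_index select a half-open range out of --global_total_jobs shards (the loop prints the commands rather than running them, so the split is easy to inspect):

```shell
TOTAL_JOBS=8   # matches the script's default number of job shards
WORKERS=4      # how many processes to launch
PER=$((TOTAL_JOBS / WORKERS))

for i in $(seq 0 $((WORKERS - 1))); do
  START=$((i * PER))
  END=$((START + PER))
  # Printed rather than executed; append these to a shell with `&` to run them.
  echo "python scripts/generate_pretraining_data.py --dataset_to_use NDH --split train" \
       "--start_job_index $START --end_job_index $END --global_total_jobs $TOTAL_JOBS &"
done
```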

Image Features

Our pre-training approach requires object-level features from Faster R-CNN concatenated with orientation features.

First, follow the setup instructions from the bottom-up attention repo inside a Docker container to install Caffe. Note that the code from the bottom-up attention repo requires Python 2.

Then, extract object-level features using

python2 scripts/precompute_bottom-up_features.py

You can use --gpu_id to parallelize the feature extraction process over multiple GPUs.
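For example, a sketch that launches one extraction process per GPU (hypothetical: it assumes each process can be pointed at a disjoint shard of the scans, and it only prints the commands):

```shell
NUM_GPUS=2
for GPU in $(seq 0 $((NUM_GPUS - 1))); do
  # One process per device; printed here rather than executed.
  echo "python2 scripts/precompute_bottom-up_features.py --gpu_id $GPU &"
done
```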

Then, to concatenate orientation features to object-level features, use

python scripts/add_orientation_to_features.py

During fine-tuning, we use scene-level ResNet features. Download the ResNet features from this link, or extract them yourself using

python scripts/precompute_resnet_img_features.py

VISITRON Initialization

Before performing navigation-specific pre-training and fine-tuning, we initialize VISITRON with disembodied weights from the Oscar model. Download the Oscar pre-trained weights using

wget https://biglmdiag.blob.core.windows.net/oscar/pretrained_models/$MODEL_NAME.zip
unzip $MODEL_NAME.zip -d srv/oscar_weights/

where $MODEL_NAME is base-vg-labels or base-no-labels.
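To fetch both checkpoints in one go, the two downloads can be wrapped in a loop (shown here printing the commands rather than running them):

```shell
for MODEL_NAME in base-vg-labels base-no-labels; do
  echo "wget https://biglmdiag.blob.core.windows.net/oscar/pretrained_models/${MODEL_NAME}.zip"
  echo "unzip ${MODEL_NAME}.zip -d srv/oscar_weights/"
done
```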

Training

We provide pre-training, training and evaluation scripts in run_scripts/.

As an example, use the following command to run a script.

bash run_scripts/viewpoint_train/pretrain_ndh_r2r.sh $MODE

where $MODE is one of cpu, single-gpu 0, multi-gpu-dp or multi-gpu-ddp.

  • Use cpu to train on the CPU.
  • Use single-gpu 0 to train on a single GPU. To use a different GPU, replace 0 with its index.
  • Use multi-gpu-dp to train on all available GPUs with DataParallel.
  • Use multi-gpu-ddp to train on 4 GPUs with DistributedDataParallel. Change --nproc_per_node in the script to set the number of GPUs used in DistributedDataParallel mode.
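Internally, a run script dispatches on $MODE roughly like the hypothetical sketch below (train.py and the exact flags are placeholders; the real scripts in run_scripts/ define their own commands):

```shell
MODE=${1:-cpu}   # first argument to the run script, defaulting to cpu
case "$MODE" in
  cpu)           CMD="python train.py --no_cuda" ;;
  single-gpu)    CMD="CUDA_VISIBLE_DEVICES=${2:-0} python train.py" ;;  # $2 is the GPU index
  multi-gpu-dp)  CMD="python train.py" ;;  # DataParallel across all visible GPUs
  multi-gpu-ddp) CMD="python -m torch.distributed.launch --nproc_per_node 4 train.py" ;;
  *) echo "unknown mode: $MODE" >&2 ;;
esac
echo "$CMD"
```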

Pre-training scripts are in pretrain. Training scripts that use viewpoint selection as the action space are in viewpoint_train, and scripts for the turn-based action space are in turn_based_train. Scripts for training and evaluating the question-asking classifier are in classifier, and ablations contains the training scripts for Table 1 of the paper.

To pretrain our model on NDH and R2R, and finetune on NDH and RxR for viewpoint selection action space, run

  1. Pretrain on NDH+R2R for all objectives:
bash run_scripts/pretrain/pretrain_ndh_r2r.sh multi-gpu-ddp
  2. Pick the best pretrained checkpoint by evaluating using
bash run_scripts/pretrain/pretrain_ndh_r2r_val.sh multi-gpu-ddp

Change --model_name_or_path in run_scripts/viewpoint_train/pretrain_ndh_r2r.sh to load the best pretrained checkpoint.

  3. Finetune on NDH + RxR using
bash run_scripts/viewpoint_train/pretrain_ndh_r2r.sh multi-gpu-ddp
  4. Evaluate trained models using
bash run_scripts/viewpoint_train/pretrain_ndh_r2r_val.sh multi-gpu-ddp

You can then run the scripts in run_scripts/classifier to train the question-asking classifier.

For any run script, make sure these arguments point to the correct paths: img_feat_dir, img_feature_file, data_dir, model_name_or_path and output_dir.
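A small helper to sanity-check those paths before launching (the directory values below are placeholders; substitute whatever your run script actually passes):

```shell
# Placeholder paths; substitute the values your run script actually uses.
IMG_FEAT_DIR=srv/img_features
DATA_DIR=srv/task_data
OUTPUT_DIR=srv/results

for DIR in "$IMG_FEAT_DIR" "$DATA_DIR" "$OUTPUT_DIR"; do
  if [ -d "$DIR" ]; then
    echo "ok: $DIR"
  else
    echo "missing: $DIR (create it or fix the argument)"
  fi
done
```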

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Citation:

@inproceedings{visitron,
  title={VISITRON: Visual Semantics-aligned Interactively Trained Object-Navigator},
  author={Ayush Shrivastava and Karthik Gopalakrishnan and Yang Liu and Robinson Piramuthu and Gokhan T\"{u}r and Devi Parikh and Dilek Hakkani-T\"{u}r},
  booktitle={NAACL 2021, Visually Grounded Interaction and Language (ViGIL) Workshop},
  year={2021}
}

