
voltron-robotics's Introduction

Voltron Logo

arXiv PyTorch Code Style: Black Ruff License


Language-Driven Representation Learning for Robotics

Package repository for Voltron: Language-Driven Representation Learning for Robotics. Provides code for loading pretrained Voltron, R3M, and MVP representations for adaptation to downstream tasks, as well as code for pretraining such representations on arbitrary datasets.


Quickstart

This repository is built with PyTorch; while specified as a dependency for the package, we highly recommend that you install the desired version (e.g., with accelerator support) for your given hardware and environment manager (e.g., conda).

PyTorch installation instructions can be found here. This repository should work with PyTorch >= 1.12. Releases before 1.1.0 have been thoroughly tested with PyTorch 1.12.0, Torchvision 0.13.0, and Torchaudio 0.12.0. Note: Releases 1.1.0 and after assume PyTorch 2.0!

Once PyTorch has been properly installed, you can install this package via PyPI, and you're off!

pip install voltron-robotics

You can also install this package locally via an editable installation in case you want to run examples/extend the current functionality:

git clone https://github.com/siddk/voltron-robotics
cd voltron-robotics
pip install -e .

Usage

Voltron Robotics (package: voltron) is structured to provide easy access to pretrained Voltron models (and reproductions), to facilitate use for various downstream tasks. Using a pretrained Voltron model is easy:

from torchvision.io import read_image
from voltron import instantiate_extractor, load

# Load a frozen Voltron (V-Cond) model & configure a vector extractor
vcond, preprocess = load("v-cond", device="cuda", freeze=True)
vector_extractor = instantiate_extractor(vcond)()

# Obtain & Preprocess an image =>> can be from a dataset, or camera on a robot, etc.
#   => Feel free to add any language if you have it (Voltron models work either way!)
img = preprocess(read_image("examples/img/peel-carrot-initial.png"))[None, ...].to("cuda")
lang = ["peeling a carrot"]

# Extract both multimodal AND vision-only embeddings!
multimodal_embeddings = vcond(img, lang, mode="multimodal")
visual_embeddings = vcond(img, mode="visual")

# Use the `vector_extractor` to output dense vector representations for downstream applications!
#   => Pass this representation to model of your choice (object detector, control policy, etc.)
representation = vector_extractor(multimodal_embeddings)

Voltron representations can be used for a variety of different applications; the voltron-evaluation repository contains code for adapting Voltron representations to various downstream tasks (segmentation, object detection, control, etc.), covering all of the applications from our paper.


API

Voltron Framework

The package voltron provides the following functionality for using and adapting existing representations:

voltron.available_models()

Returns the names of the available Voltron models; right now, the following models (all models trained in the paper) are available:

  • v-cond – V-Cond (ViT-Small) trained on Sth-Sth; single-frame w/ language-conditioning.
  • v-dual – V-Dual (ViT-Small) trained on Sth-Sth; dual-frame w/ language-conditioning.
  • v-gen – V-Gen (ViT-Small) trained on Sth-Sth; dual-frame w/ language conditioning AND generation.
  • r-mvp – R-MVP (ViT-Small); reproduction of MVP trained on Sth-Sth.
  • r-r3m-vit – R-R3M (ViT-Small); reproduction of R3M trained on Sth-Sth.
  • r-r3m-rn50 – R-R3M (ResNet-50); reproduction of R3M trained on Sth-Sth.
  • v-cond-base – V-Cond (ViT-Base) trained on Sth-Sth; larger (86M parameter) variant of V-Cond.
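
For example (a minimal sketch; the exact ordering of the returned list may differ):

import voltron

# List the model identifiers accepted by voltron.load()
print(voltron.available_models())
# e.g., ['v-cond', 'v-dual', 'v-gen', 'r-mvp', 'r-r3m-vit', 'r-r3m-rn50', 'v-cond-base']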

voltron.load(name: str, device: str, freeze: bool, cache: str = "cache/")

Returns the model and the Torchvision Transform needed by the model, where name is one of the strings returned by voltron.available_models(); this in general follows the same API as OpenAI's CLIP.
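
A minimal sketch (the cache directory shown is just the documented default; adjust as needed):

from voltron import load

# Download (if needed) and load a frozen V-Cond model + its preprocessing transform
model, preprocess = load("v-cond", device="cuda", freeze=True, cache="cache/")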


Voltron models (v-{cond, dual, gen, ...}) returned by voltron.load() support the following:

model(img: Tensor, lang: Optional[List[str]], mode: str = "multimodal")

Returns a sequence of embeddings corresponding to the output of the multimodal encoder; note that lang can be None, which is totally fine for Voltron models! However, if you have any language (even a coarse task description), it'll probably be helpful!

The parameter mode in ["multimodal", "visual"] controls whether the output will contain the fused image patch and language embeddings, or only the image patch embeddings.
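
For instance (a minimal sketch, reusing vcond and the preprocessed image img from the Quickstart above):

# Fused image-patch + language embeddings =>> language is optional, but helpful
fused = vcond(img, ["peeling a carrot"], mode="multimodal")

# Image-patch embeddings only =>> no language required
patches = vcond(img, mode="visual")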

Note: For the API for the non-Voltron models (e.g., R-MVP, R-R3M), take a look at examples/verify.py; this file shows how representations from every model can be extracted.
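
For example, the reproductions are loaded the same way (a sketch; the exact forward-pass conventions for each reproduction are shown in examples/verify.py rather than assumed here):

from voltron import load

# R-MVP uses the same load() interface; see examples/verify.py for how
# representations are extracted from each reproduction model.
r_mvp, preprocess = load("r-mvp", device="cuda", freeze=True)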

Adaptation

See examples/usage.py and the voltron-evaluation repository for more examples on the various ways to adapt/use Voltron representations.
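
As a rough illustration, here is a hypothetical sketch (not code from this repository) of training a small behavior-cloning head on top of a frozen Voltron backbone; the 384-dimensional representation size, the 7-dimensional action head, and the dataloader of (image, language, action) batches are all assumptions to replace with your own setup:

import torch
import torch.nn as nn

from voltron import instantiate_extractor, load

# Frozen backbone; the MAP extractor + policy head are trained on the downstream task
vcond, preprocess = load("v-cond", device="cuda", freeze=True)
extractor = instantiate_extractor(vcond)().to("cuda")

REP_DIM = 384   # assumed extractor output dimension for ViT-Small; verify for your model
policy_head = nn.Sequential(nn.Linear(REP_DIM, 256), nn.ReLU(), nn.Linear(256, 7)).to("cuda")

optimizer = torch.optim.Adam(
    list(extractor.parameters()) + list(policy_head.parameters()), lr=1e-4
)

for img, lang, action in dataloader:  # hypothetical dataloader of (image, language, action) batches
    rep = extractor(vcond(img.to("cuda"), list(lang), mode="multimodal"))
    loss = nn.functional.mse_loss(policy_head(rep), action.to("cuda"))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()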


Contributing

Before committing to the repository, make sure to set up your dev environment! Here are the basic development environment setup guidelines:

  • Fork/clone the repository, performing an editable installation. Make sure to install with the development dependencies (e.g., pip install -e ".[dev]"); this will install black, ruff, and pre-commit.

  • Install pre-commit hooks (pre-commit install).

  • Branch for the specific feature/issue, issuing a PR against the upstream repository for review.



Repository Structure

High-level overview of repository/project file-tree:

  • docs/ - Package documentation & assets - including project roadmap.
  • voltron/ - Package source code; has all core utilities for model specification, loading, feature extraction, preprocessing, etc.
  • examples/ - Standalone examples scripts for demonstrating various functionality (e.g., extracting different types of representations, adapting representations in various contexts, pretraining, amongst others).
  • .pre-commit-config.yaml - Pre-commit configuration file (sane defaults + black + ruff).
  • LICENSE - Code is made available under the MIT License.
  • Makefile - Top-level Makefile (by default, supports linting - checking & auto-fix); extend as needed.
  • pyproject.toml - Following PEP 621, this file has all project configuration details (including dependencies), as well as tool configurations (for black and ruff).
  • README.md - You are here!

Citation

Please cite our paper if using any of the Voltron models, evaluation suite, or other parts of our framework in your work.

@inproceedings{karamcheti2023voltron,
  title={Language-Driven Representation Learning for Robotics},
  author={Siddharth Karamcheti and Suraj Nair and Annie S. Chen and Thomas Kollar and Chelsea Finn and Dorsa Sadigh and Percy Liang},
  booktitle={Robotics: Science and Systems (RSS)},
  year={2023}
}


voltron-robotics's Issues

Upload ALL Checkpoints for ALL Pretrained Models

We have 400 checkpoints for each of the pretrained models in this work ([v-cond, v-dual, v-gen, r-mvp, r-r3m-vit, r-r3m-rn50]); for future work, it would be nice to store these and make them accessible to the community.

Making these available via the HF Hub (similar to the Mistral checkpoints) might be a decent idea?

Code Release on evaluate_imitate.py

Hi, thanks a lot for your incredible framework and open-sourced code!

I was wondering whether you have any plans to release the evaluation framework for the "Language-Conditioned Imitation" task, namely evaluate_imitate.py (currently a blank file).

It would be quite interesting to port the language-conditioned imitation code to simulation frameworks (e.g., CoppeliaSim), and I hope doing so could help facilitate further research building on your work.

I hope this issue reaches you, and thanks again for your outstanding work!

[Roadmap] Add `xpretrain.py` Reference Script

This project has been in the works for a while; the version of the Something-Something-v2 dataset used for pretraining was actually the original 20BN-Sth-Sth version, prior to the transfer to Qualcomm.

We also trained on TPU compute generously provided by the TPU Research Cloud program on a heavily patched version of PyTorch XLA.

I'll clean up and post the original xpretrain.py (X for PyTorch XLA) script + preprocessing code as a high-priority item, but will refactor the pretraining pipeline to reflect:

  • New format (w/ new permissions) of the Something-Something-v2 dataset.
  • Cleaner Preprocessing Pipeline

Update Documentation & Add Examples

Add better documentation around the MAP extraction pipelines, and other options:

  • Document API for using instantiate_extractor
  • Add mean-pooling as a first-class extraction scheme
  • Add examples for using Voltron representations as a "drop-in" replacement; perhaps a Colab notebook demo?

VGen Pretraining NaN values

Hi,

I'm trying to replicate pretraining of the V-Gen model with Sth-Sth-v2. Even in the first batch of training, I encounter NaN values in the tensors. After some debugging, I found that the NaNs occur in the first forward pass through the decoder transformer. It seems there are some checks for NaN values in the code. Have you experienced the same issue? Any suggestions on how to fix it? Thanks!

Storage constraints for Something-v2 for inference

Hey @siddk,
Thanks for open-sourcing the framework!
I have a question about data loading: I want to evaluate/run inference with the pretrained models on a small subset of the Sth-Sth v2 data, and I have less than 80-100 GB of storage space for it.
The README in the pretrain folder says the data extraction might need hundreds of GBs and that the streaming_dataset might be a solution; could you elaborate? Or am I misinterpreting it? The dataset website says the data is about 56 GB after extraction, so perhaps the >100 GB of storage is needed only if the data is processed a certain way for Voltron pretraining.
Alternatively, do you know whether I could reduce storage needs by extracting the dataset at a lower FPS (I assume that should be fine, since the Voltron models encode single images?) or by preprocessing only a subset of the videos?

PS: also a minor correction to the README: the command to untar should be
cat 20bn-something-something-v2-?? | tar -xvzf -
instead of
cat 20bn-something-something-?? | tar -xvzf -

Extract vision-only embedding (no language)

I need a function that, given only a preprocessed image (224x224, normalized with ImageNet statistics), returns its embedding. Is the following the correct way to do this? I ask because I'm getting weird behavior with my robot (other embeddings work fine).

from voltron import instantiate_extractor, load

def make_extractor():  # factory wrapper (name illustrative); closes over the frozen model
    vcond, _ = load("v-cond-base", device="cuda", freeze=True)
    vector_extractor = instantiate_extractor(vcond)().cuda()

    def extractor(image):
        visual_embeddings = vcond(image, mode="visual")
        representation = vector_extractor(visual_embeddings)
        return representation

    return extractor

The embedding is used as input to a small MLP trained with behavior cloning. The MLP seems to underfit (loss is relatively high) with Voltron.

robot visual generation

Hey @siddk, thanks so much for sharing this repo; it's really amazing!

I have a question about downstream usage of the encoder: given a language description (e.g., "peel the carrot" from the demo shown in the paper) and an observation at the initial state $s_{t=0}$, is there any method I can use to predict the observation image of the resulting state $s_{t=k}$?

Upload Sth-Sth-v2 Dataset Index Files

We have all the index files (data / frame IDs seen per epoch, per batch) for all models we pretrained in a GCP bucket; it would be nice to figure out a longer-term hosting solution for future work (and maybe interpretability work).

Files are semi-big; see if HF Hub or AcademicTorrents are options?

[Roadmap] Add Support for Pretraining on Other Datasets

Something-Something-v2 was always meant to be a starting point; add hooks and a more unified API for pretraining on various video (and language) datasets:

  • Add API for general preprocessing -> index files.
  • Add tests for ensuring locked data across baselines (should anyone want that)

Download Bug

When I try to run the example code, I get the following error (see below). I believe this is an issue with the Google Drive download, because I can successfully load the model after downloading and copying it myself. Maybe you can fix this by changing the permissions on the file?

>>> from voltron import instantiate_extractor, load
>>> vcond, pre = load('v-cond', device='cuda')
Downloading...
From: https://drive.google.com/uc?id=1O4oqRIblfS6PdFlZzUcYIX-Rqe6LbvnD
To: /home/sudeep/cache/v-cond/v-cond-config.json
100%|██████████| 640/640 [00:00<00:00, 2.18MB/s]
Access denied with the following error:

 	Cannot retrieve the public link of the file. You may need to change
	the permission to 'Anyone with the link', or have had many accesses.

You may still be able to access the file from the browser:

	 https://drive.google.com/uc?id=12g5QckQSMKqrfr4lFY3UPdy7oLw4APpG

Downloading (…)okenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 5.70kB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 483/483 [00:00<00:00, 119kB/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 4.42MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 5.07MB/s]
Downloading (…)"pytorch_model.bin";: 100%|██████████| 268M/268M [00:04<00:00, 61.9MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sudeep/anaconda3/envs/torch/lib/python3.8/site-packages/voltron/models/materialize.py", line 102, in load
    state_dict, _ = torch.load(checkpoint_path, map_location=device)
  File "/home/sudeep/anaconda3/envs/torch/lib/python3.8/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/sudeep/anaconda3/envs/torch/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/sudeep/anaconda3/envs/torch/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'cache/v-cond/v-cond.pt'
