Mimictest

A simple testbed for robotics manipulation policies based on robomimic. All policies are rewritten in a simple way. We may further expand it to the LIBERO benchmark, which is also based on the robosuite simulator.

We also have policies trained and tested on the CALVIN benchmark, e.g., GR1-Training, which is the current SOTA on the hardest ABC->D task of CALVIN.

We also recommend other good frameworks / communities for robotics policy learning:
  • HuggingFace's LeRobot, which currently has ACT, Diffusion Policy (only the simple PushT task), TDMPC, and VQ-BeT. LeRobot has a nice robotics learning community on this Discord server.

  • CleanDiffuser, which implements multiple diffusion algorithms for imitation learning and reinforcement learning. Our implementation of the diffusion algorithms differs from CleanDiffuser's, but we thank their team members for their help.

  • Dr. Mu Yao organizes a nice robotics learning community for Chinese researchers; see the DeepTimber website and 知乎.

Please remember we build systems for you ヾ(^▽^*)). Feel free to ask me if you have any questions!

News

[2024.7.30] Add Florence policy with MLP action head & diffusion action head. Add RT-1 policy.

[2024.7.16] Add transformer version of Diffusion Policy.

[2024.7.15] Initial release which only contains UNet version of Diffusion Policy.

Features

Unified State and Action Space.
  • All policies share the same data pre-processing pipeline and predict actions as 3D Cartesian translation + 6D rotation + gripper open/close. The 3D translation can be relative to the current gripper position (abs_mode=False) or absolute in the world coordinate frame (abs_mode=True).

  • They perceive obs_horizon historical observations, generate chunk_size future actions, and execute test_chunk_size of the predicted actions. An example with obs_horizon=3, chunk_size=4, test_chunk_size=2 (a rollout sketch follows this list):

Policy sees: 		|o|o|o|
Policy predicts: 	| | |a|a|a|a|
Policy executes:	| | |a|a|
  • They use image input from both static and wrist cameras.
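
The example above can be written as a minimal receding-horizon rollout loop. This is an illustrative sketch, assuming a gym-style env and a hypothetical policy.predict() that returns chunk_size future actions; it is not mimictest's actual interface:

from collections import deque

obs_horizon, chunk_size, test_chunk_size = 3, 4, 2

obs_stack = deque(maxlen=obs_horizon)              # rolling window of past observations
obs = env.reset()
for _ in range(obs_horizon):
    obs_stack.append(obs)                          # pad the history with the first observation

done = False
while not done:
    actions = policy.predict(list(obs_stack))      # chunk_size future actions
    for action in actions[:test_chunk_size]:       # execute only the first test_chunk_size
        obs, reward, done, info = env.step(action)
        obs_stack.append(obs)
        if done:
            break
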
Multi-GPU training and simulation.
  • We achieve multi-GPU / multi-machine training with HuggingFace accelerate (a minimal training-loop sketch follows this list).

  • We achieve parallel simulation with the asynchronous vectorized environments provided by stable-baselines3. In practice, we train and evaluate the model on multiple GPUs; for each GPU training process, there are several parallel environments running on different CPU cores.
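
Training with accelerate typically follows the pattern below. This is a rough, self-contained sketch with a toy model and dataset standing in for mimictest's real policy and dataloader:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy model and data as stand-ins for the real policy and dataset.
model = nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 10)), batch_size=8)

accelerator = Accelerator()                      # reads the settings chosen via `accelerate config`
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    loss = nn.functional.mse_loss(model(x), y)   # placeholder loss
    accelerator.backward(loss)                   # handles mixed precision and gradient sync
    optimizer.step()
    optimizer.zero_grad()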

Optimizing data loading pipeline and profiling.
  • We implement a simple GPU data-prefetching mechanism.

  • Image preprocessing is performed on the GPU instead of the CPU.

  • You can perform detailed profiling of the training pipeline by setting do_profile=True and check the trace log with torch_tb_profiler. See this introduction to the PyTorch profiler. A profiling sketch follows this list.
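
For reference, the standard PyTorch profiler workflow looks roughly like this; whether do_profile=True wires things up exactly this way is an assumption, and the training step below is a stand-in:

import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

def train_step(batch):
    _ = (batch * 2).sum()                  # stand-in for one real optimization step

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],   # assumes a CUDA machine
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./log/profile"),
) as prof:
    for step in range(6):
        train_step(torch.randn(64, 10))
        prof.step()                        # advance the profiler schedule

# Inspect the trace with: tensorboard --logdir ./log/profile (requires torch_tb_profiler)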

Sorry...but you should tune the learning rate manually.
  • We try new algorithms here, so we do not know in advance when an algorithm will converge. Thus, we use a simple constant learning rate scheduler with warmup (a sketch follows this list). To get the best performance, you should set the learning rate manually: a high learning rate at the beginning of training and a lower one at the end.

  • Sometimes you need to freeze the visual encoder in the first training stage and unfreeze it once the loss converges. This can be done by setting freeze_vision_tower=<True/False> in the script.
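
The constant-with-warmup schedule can be expressed, for example, as a simple LambdaLR; the warmup length and the toy model below are assumptions, not mimictest's actual settings:

import torch
from torch import nn

model = nn.Linear(10, 10)                              # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
warmup_steps = 1000                                    # assumed warmup length

def warmup_then_constant(step):
    # linear warmup to the base learning rate, then hold it constant
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_constant)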

Supported Policies

We implement the following algorithms:

Google's RT-1.
  • Original implementation.

  • Our implementation supports EfficientNet v1/v2, and you can directly load pretrained weights through the torchvision API. Google's implementation only supports EfficientNet v1.

  • You should choose a text encoder from Sentence Transformers to generate text embeddings and send them to RT-1 (a sketch follows this list).

  • Our implementation predicts multiple continuous actions (see above) instead of a single discrete action. We find this setting gives better performance.

  • To get better performance, you should freeze the EfficientNet visual encoder in the first training stage and unfreeze it in the second stage.
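
The text-embedding step with Sentence Transformers looks roughly like this; the model name, instruction, and the final policy call are illustrative assumptions, not mimictest's exact interface:

from sentence_transformers import SentenceTransformer

text_encoder = SentenceTransformer("all-MiniLM-L6-v2")   # any Sentence Transformers model works
text_emb = text_encoder.encode(["pick up the square nut and place it on the peg"])
# actions = rt1(images, text_emb)                         # hypothetical policy call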

Cheng Chi's Diffusion Policy (UNet / Transformer).
  • Original implementation.

  • Our architecture is a copy of Cheng Chi's network. We test it in our pipeline and it achieves the same performance. Note that Diffusion Policy trains two ResNet visual encoders (one per camera view) from scratch, so we never freeze the visual encoders.

  • We also support predicting actions in epsilon / sample / v-prediction space, as well as other diffusion schedulers (a sketch follows this list). The DiffusionPolicy wrapper can easily adapt to different network designs.
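
For reference, the prediction space is a scheduler option in the diffusers library; whether mimictest exposes it exactly like this is an assumption:

from diffusers import DDIMScheduler

scheduler = DDIMScheduler(
    num_train_timesteps=100,
    prediction_type="epsilon",    # or "sample" / "v_prediction"
)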

Florence Policy, developed on Microsoft's Florence-2 VLM, which is trained with VQA, OCR, detection, and segmentation tasks on 900M images.
  • We develop the policy on top of the pretrained model.

  • Unlike OpenVLA and RT-2, Florence-2 is much smaller, with 0.23B (Florence-2-base) or 0.7B (Florence-2-large) parameters.

  • Unlike OpenVLA and RT-2, which generate discrete actions, our Florence policy generates continuous actions with a linear action head or a diffusion transformer action head (a linear-head sketch follows the figures below).

  • The following figures illustrate the architecture of the Florence policy. We always freeze the DaViT visual encoder of Florence-2; it is good enough that unfreezing it does not improve the success rate.

[Figure] Original Florence-2 network
[Figure] Florence policy with a linear action head
[Figure] Florence policy with a diffusion transformer action head
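
To make the "linear action head" concrete: it is essentially one projection from the VLM's output features to a chunk of continuous actions. The sketch below only illustrates that idea; the module names and dimensions are assumptions, not the actual mimictest implementation:

import torch
from torch import nn

hidden_dim, chunk_size, action_dim = 768, 4, 10    # assumed sizes: 3D trans + 6D rot + gripper

class LinearActionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, chunk_size * action_dim)

    def forward(self, vlm_features: torch.Tensor) -> torch.Tensor:
        # vlm_features: (batch, hidden_dim) pooled Florence-2 features
        return self.proj(vlm_features).view(-1, chunk_size, action_dim)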

Performance on Example Task

Square task with professional demos:

Policy                           Success Rate   Checkpoint    Model Size   Failure Cases
RT-1                             62%            HuggingFace   23.8M        HuggingFace
Diffusion Policy (UNet)          88.5%          HuggingFace   329M         HuggingFace
Diffusion Policy (Transformer)   90.5%          HuggingFace   31.5M        HuggingFace
Florence (linear head)           88.5%          HuggingFace   270.8M       HuggingFace
Florence (diffusion head)        92.7%          HuggingFace   279.9M       HuggingFace

*The success rate is the average over the 3 latest checkpoints; each checkpoint is evaluated with 96 rollouts.
*For diffusion models, each checkpoint stores both the trained model and its exponential moving average (EMA).

Failure analysis
  • RT-1:
    • Loses its grip after picking up the object, and the object falls: 1
    • Pauses before picking up the object: 6
    • Pauses before inserting the object into the target: 2
    • Believes the gripper has picked up the object when it actually has not: 3
    • The object gets stuck halfway while being inserted into the target, and the policy cannot recover: 1
  • Diffusion Policy (UNet):
    • Loses its grip after picking up the object, and the object falls: 2
    • Pauses before picking up the object: 2
    • Believes the gripper has picked up the object when it actually has not: 1
    • The object gets stuck halfway while being inserted into the target, and the policy cannot recover: 3
    • Successfully inserts the object into the target but then suddenly lifts and throws it away: 1
  • Diffusion Policy (Transformer):
    • Pauses before picking up the object: 1 (in the third-person view, the object is obscured by the gripper)
    • Believes the gripper has picked up the object when it actually has not: 1
    • Pauses before inserting the object into the target: 2
  • Florence (linear head):
    • Loses its grip after picking up the object, and the object falls: 1
    • Pauses before picking up the object: 6
    • Pauses before inserting the object into the target: 5
    • Believes the gripper has picked up the object when it actually has not: 1
    • Successfully inserts the object into the target but then suddenly lifts and throws it away: 1
  • Florence (diffusion transformer head):
    • Loses its grip after picking up the object, and the object falls: 6
    • Pauses before picking up the object: 4
    • Pauses before inserting the object into the target: 2
    • The object gets stuck halfway while being inserted into the target, and the policy cannot recover: 2
    • The object falls halfway while being inserted into the target, and the policy cannot recover: 1

Installation

To use robosuite 1.2, you need a conda environment with Python 3.8 or 3.9. You can use GitHub mirror sites to avoid connection problems in some regions.

conda create -n mimic python=3.9
conda activate mimic
git clone https://github.com/EDiRobotics/mimictest
cd mimictest
apt install curl git libgl1-mesa-dev libgl1-mesa-glx libglew-dev libosmesa6-dev software-properties-common net-tools unzip vim virtualenv wget xpra xserver-xorg-dev libglfw3-dev patchelf cmake
pip install -e .
pip install robosuite@https://github.com/cheng-chi/robosuite/archive/277ab9588ad7a4f4b55cf75508b44aa67ec171f0.tar.gz

You should also download the dataset, which contains robomimic_image.zip or robomimic_lowdim.zip, from the official link or HuggingFace. In this example, we use the hfd tool from HF-Mirror. You can set the environment variable export HF_ENDPOINT=https://hf-mirror.com to avoid connection problems in some regions.

apt install git-lfs aria2
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
./hfd.sh EDiRobotics/mimictest_data --dataset --tool aria2c -x 9

If you only want to download a subset of the data, e.g., the square task with image input:

./hfd.sh EDiRobotics/mimictest_data --dataset --tool aria2c -x 9 --include robomimic_image/square.zip

To use Florence-based models, you should also download one of them from HuggingFace, for example:

./hfd.sh microsoft/Florence-2-base --model --tool aria2c -x 9

Then set model_path in the script, for example:

# in Script/FlorenceImage.py
model_path = "/path/to/downloaded/florence/folder"
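
For reference, the downloaded checkpoint can be loaded with the standard transformers API roughly as follows; this is a sketch and may differ from mimictest's actual loading code:

from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "/path/to/downloaded/florence/folder"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
florence = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)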

You also need to install flash-attention. We recommend downloading a pre-built wheel from the official releases instead of building one yourself. For example (choose a wheel matching your system):

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.4cxx11abiTRUE-cp39-cp39-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.4cxx11abiTRUE-cp39-cp39-linux_x86_64.whl

Multi-GPU Training & Evaluation

  1. First run accelerate config to set environment parameters (number of GPUs, precision, etc.). We recommend using bf16.
  2. Download and unzip the dataset mentioned above.
  3. Check and modify the settings (e.g., train or eval, and the corresponding options) in the script you want to run under the Script directory. Each script represents a configuration of an algorithm.
  4. Then run:
accelerate launch Script/<the script you choose>.py

Possible Installation Problems

ImportError: /opt/conda/envs/test/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /lib/x86_64-linux-gnu/libLLVM-15.so.1)

Please check this link.
