A simple testbed for robot manipulation policies, built on robomimic. All policies are re-implemented in a simple, readable way. We may later extend it to the LIBERO benchmark, which is also based on the robosuite simulator.
We also have policies trained and tested on the CALVIN benchmark, e.g., GR1-Training, which is the current SOTA on the hardest ABC->D task of CALVIN.
We also recommend other good frameworks / communities for robotics policy learning.
- HuggingFace's LeRobot, which currently has ACT, Diffusion Policy (only the simple pusht task), TDMPC, and VQ-BeT. LeRobot has a nice robotics learning community on this discord server.
- CleanDiffuser, which implements multiple diffusion algorithms for imitation learning and reinforcement learning. Our implementation of diffusion algorithms is different from CleanDiffuser, but we thank their team members for their help.
- Dr. Mu Yao organizes a nice robotics learning community for Chinese researchers, see the DeepTimber website and 知乎.
Please remember we build systems for you ヾ(^▽^*)). Feel free to ask me if you have any questions!
[2024.7.30] Add Florence policy with MLP action head & diffusion action head. Add RT-1 policy.
[2024.7.16] Add transformer version of Diffusion Policy.
[2024.7.15] Initial release which only contains UNet version of Diffusion Policy.
Unified State and Action Space.
- All policies share the same data pre-processing pipeline and predict actions in 3D Cartesian translation + 6D rotation + gripper open/close. The 3D translation can be relative to the current gripper position (`abs_mode=False`) or expressed in the world coordinate frame (`abs_mode=True`).
- They perceive `obs_horizon` historical observations, generate `chunk_size` future actions, and execute `test_chunk_size` predicted actions. An example with `obs_horizon=3, chunk_size=4, test_chunk_size=2`:
Policy sees: |o|o|o|
Policy predicts: | | |a|a|a|a|
Policy executes: | | |a|a|
- They use image input from both static and wrist cameras.
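The observe / predict / execute loop described above can be sketched in plain Python. This is only an illustration, not the repository's code: `run_chunked_rollout`, `toy_policy`, and `toy_env_step` are hypothetical names, and padding the initial history by repeating the first observation is an assumption.

```python
from collections import deque


def run_chunked_rollout(policy, env_step, first_obs, num_steps,
                        obs_horizon=3, chunk_size=4, test_chunk_size=2):
    """Execute `num_steps` actions, re-planning every `test_chunk_size` steps."""
    assert 1 <= test_chunk_size <= chunk_size
    # Assumption: pad the initial history by repeating the first observation.
    history = deque([first_obs] * obs_horizon, maxlen=obs_horizon)
    executed = []
    while len(executed) < num_steps:
        # The policy sees the last obs_horizon observations
        # and predicts chunk_size future actions.
        chunk = policy(list(history))
        assert len(chunk) == chunk_size
        # Only the first test_chunk_size predicted actions are executed.
        for action in chunk[:test_chunk_size]:
            history.append(env_step(action))
            executed.append(action)
            if len(executed) == num_steps:
                break
    return executed


# Toy stand-ins (hypothetical): the observation is simply the last executed action.
policy_calls = []


def toy_policy(history):
    policy_calls.append(list(history))
    last = history[-1]
    return [last + i for i in range(1, 5)]  # 4 actions, matching chunk_size=4


def toy_env_step(action):
    return action


actions = run_chunked_rollout(toy_policy, toy_env_step, first_obs=0, num_steps=5)
print(actions)            # [1, 2, 3, 4, 5]
print(len(policy_calls))  # 3
```

Executing fewer actions than are predicted means the policy re-plans more often, at the cost of more forward passes per episode.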
Multi-GPU training and simulation.
- We achieve multi-GPU / multi-machine training with HuggingFace accelerate.
- We achieve parallel simulation with the asynchronous environments provided by Stable-Baselines3. In practice, we train and evaluate the model on multiple GPUs. Each GPU training process runs several parallel environments on different CPU cores.
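To make the process layout concrete, here is a small sketch of how a fixed number of evaluation rollouts could be divided among GPU processes and their parallel environments. The helper `split_rollouts` is hypothetical, and the counts of 4 GPU processes and 8 environments per process are assumptions chosen for illustration; the repository may distribute the work differently.

```python
def split_rollouts(total_rollouts, num_processes, envs_per_process):
    """Return, for each GPU process, the rollout count of each of its environments."""
    num_workers = num_processes * envs_per_process
    base, extra = divmod(total_rollouts, num_workers)
    # The first `extra` environments take one additional rollout.
    counts = [base + (1 if i < extra else 0) for i in range(num_workers)]
    return [
        counts[rank * envs_per_process:(rank + 1) * envs_per_process]
        for rank in range(num_processes)
    ]


# 96 rollouts on 4 GPU processes with 8 environments each (assumed numbers).
plan = split_rollouts(96, num_processes=4, envs_per_process=8)
for rank, per_env in enumerate(plan):
    print(f"process {rank}: {per_env}")
```

With these numbers every environment runs exactly 3 rollouts, so no process sits idle while others finish.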
Optimizing data loading pipeline and profiling.
- We implement a simple GPU data prefetching mechanism.
- Image preprocessing is performed on the GPU instead of the CPU.
- You can profile the training pipeline in detail by setting `do_profile=True` and inspecting the trace log with `torch_tb_profiler`. See the introduction to the PyTorch profiler.
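The idea behind prefetching is to prepare the next batch while the current one is being consumed. The sketch below is a framework-free stand-in that uses a background thread and a bounded queue; it is not the repository's implementation. The `Prefetcher` class and the `transfer` callable are hypothetical. In a PyTorch pipeline, `transfer` would typically be a host-to-device copy such as `batch.to(device, non_blocking=True)`.

```python
import queue
import threading


class Prefetcher:
    """Iterate over `batches` while a background thread prepares the next ones."""

    def __init__(self, batches, transfer, buffer_size=2):
        self._queue = queue.Queue(maxsize=buffer_size)
        self._sentinel = object()
        self._thread = threading.Thread(
            target=self._worker, args=(batches, transfer), daemon=True
        )
        self._thread.start()

    def _worker(self, batches, transfer):
        try:
            for batch in batches:
                # `put` blocks when the buffer is full, which bounds memory use.
                self._queue.put(transfer(batch))
        finally:
            # Always signal the end so the consumer never waits forever.
            self._queue.put(self._sentinel)

    def __iter__(self):
        while True:
            item = self._queue.get()
            if item is self._sentinel:
                return
            yield item


# Toy usage: multiplying by 10 stands in for moving a batch to the GPU.
prefetched = list(Prefetcher(range(5), transfer=lambda batch: batch * 10))
print(prefetched)  # [0, 10, 20, 30, 40]
```

A single worker thread keeps the batch order unchanged, which matters when results must be reproducible.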
Sorry...but you should tune the learning rate manually.
- We try new algorithms here, so we cannot know in advance when an algorithm will converge. Thus, we use a simple constant learning rate scheduler with warmup. To get the best performance, you should set the learning rate manually: a high learning rate at the beginning and a lower learning rate at the end.
- Sometimes you need to freeze the visual encoder in the first training stage and unfreeze it once the loss has converged. This can be done by setting `freeze_vision_tower=<True/False>` in the script.
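A constant learning rate with warmup can be written as a function of the training step. This is a minimal sketch that assumes a linear warmup starting from zero; the function name and the example values (`base_lr=1e-4`, `warmup_steps=100`) are illustrative and not taken from the scripts.

```python
def constant_lr_with_warmup(step, base_lr, warmup_steps):
    """Linearly ramp up to `base_lr` over `warmup_steps`, then stay constant."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


# Illustrative values only.
for step in (0, 50, 100, 10_000):
    print(step, constant_lr_with_warmup(step, base_lr=1e-4, warmup_steps=100))
```

Because the schedule never decays on its own, lowering the learning rate for the final part of training means resuming from a checkpoint with a smaller `base_lr`.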
We implement the following algorithms:
Google's RT1.
- Our implementation supports EfficientNet v1/v2, and you can directly load pretrained weights through the torchvision API. Google's implementation only supports EfficientNet v1.
- You should choose a text encoder from Sentence Transformers to generate text embeddings and send them to RT1.
- Our implementation predicts multiple continuous actions (see above) instead of a single discrete action. We find that our setting performs better.
- To get better performance, you should freeze the EfficientNet visual encoder in the 1st training stage and unfreeze it in the 2nd stage.
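To illustrate the difference between discrete and continuous action outputs: the original RT-1 discretizes each action dimension into 256 uniform bins, which introduces a quantization error of up to half a bin width, whereas a continuous head regresses the value directly. The sketch below shows that round trip. The helper names and the action range of [-1, 1] are assumptions for illustration.

```python
def discretize(value, low, high, num_bins=256):
    """Map a continuous value to the index of its uniform bin."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)  # the upper bound falls into the last bin


def undiscretize(index, low, high, num_bins=256):
    """Map a bin index back to the center of that bin."""
    return low + (index + 0.5) * (high - low) / num_bins


# Assumed action range [-1, 1]; the values are arbitrary examples.
low, high = -1.0, 1.0
for value in (-1.0, -0.3337, 0.0, 0.5, 1.0):
    recovered = undiscretize(discretize(value, low, high), low, high)
    print(f"{value:+.4f} -> {recovered:+.4f} (error {abs(recovered - value):.5f})")
```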
Cheng Chi's Diffusion Policy (UNet / Transformer).
- Original implementation.
- Our architecture is a copy of Cheng Chi's network. We tested it in our pipeline and it reaches the same performance. Note that Diffusion Policy trains 2 ResNet visual encoders for 2 camera views from scratch, so we never freeze the visual encoders.
- We also support predicting actions in epsilon / sample / v-space, as well as other diffusion schedulers. The `DiffusionPolicy` wrapper can easily adapt to different network designs.
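The three prediction spaces differ only in what the network is trained to output for a noised action. The scalar sketch below uses the standard forward process x_t = sqrt(ᾱ)·x_0 + sqrt(1−ᾱ)·ε and shows the training target for each space, plus how the clean action is recovered from it. The function names are hypothetical and the type strings follow the naming commonly used by diffusion schedulers; this is not the `DiffusionPolicy` wrapper itself.

```python
import math


def add_noise(x0, eps, alpha_bar):
    """Forward process: mix the clean action with Gaussian noise."""
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps


def training_target(x0, eps, alpha_bar, prediction_type):
    """What the network should output for each prediction space."""
    if prediction_type == "epsilon":
        return eps
    if prediction_type == "sample":
        return x0
    if prediction_type == "v_prediction":
        return math.sqrt(alpha_bar) * eps - math.sqrt(1.0 - alpha_bar) * x0
    raise ValueError(f"unknown prediction type: {prediction_type}")


def recover_x0(x_t, model_output, alpha_bar, prediction_type):
    """Invert the chosen parameterization to get the clean action back."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    if prediction_type == "epsilon":
        return (x_t - s * model_output) / a
    if prediction_type == "sample":
        return model_output
    if prediction_type == "v_prediction":
        return a * x_t - s * model_output
    raise ValueError(f"unknown prediction type: {prediction_type}")


# Arbitrary example values for one action dimension.
x0, eps, alpha_bar = 0.25, -1.3, 0.6
x_t = add_noise(x0, eps, alpha_bar)
for kind in ("epsilon", "sample", "v_prediction"):
    target = training_target(x0, eps, alpha_bar, kind)
    print(kind, round(recover_x0(x_t, target, alpha_bar, kind), 6))
```

With a perfect network output, all three spaces recover the same clean action; they differ in how prediction errors are weighted across noise levels.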
Florence Policy developed on Microsoft's Florence2 VLM, which is trained with VQA, OCR, detection and segmentation tasks on 900M images.
- We develop the policy on the pretrained model.
- Unlike OpenVLA and RT2, Florence2 is much smaller, with 0.23B (Florence-2-base) or 0.7B (Florence-2-large) parameters.
- Unlike OpenVLA and RT2, which generate discrete actions, our Florence policy generates continuous actions with a linear action head or a diffusion transformer action head.
- The following figure illustrates the architecture of the Florence policy. We always freeze the DaViT visual encoder of Florence2, which is so good that unfreezing it does not improve the success rate.
Square task with proficient-human (PH) demos:
Policy | Success Rate | Checkpoint | Model Size | Failure Cases |
---|---|---|---|---|
RT-1 | 62% | HuggingFace | 23.8M | HuggingFace |
Diffusion Policy (UNet) | 88.5% | HuggingFace | 329M | HuggingFace |
Diffusion Policy (Transformer) | 90.5% | HuggingFace | 31.5M | HuggingFace |
Florence (linear head) | 88.5% | HuggingFace | 270.8M | HuggingFace |
Florence (diffusion head) | 92.7% | HuggingFace | 279.9M | HuggingFace |
*The success rate is the average over the 3 latest checkpoints. Each checkpoint is evaluated with 96 rollouts.
*For diffusion models, we save both the trained model and the exponential moving average (EMA) of the trained model in a checkpoint.
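For readers unfamiliar with EMA weights, the update rule can be sketched on a plain list of parameters. The function name and the decay value of 0.9 are assumptions chosen for illustration; the scripts may use a different decay or a decay schedule.

```python
def ema_update(ema_params, params, decay):
    """Blend the current parameters into the running average."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Toy example: the trained parameters jump to 1.0 and then stay there.
ema = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, params=[1.0, 1.0], decay=0.9)
    print([round(value, 4) for value in ema])
```

The EMA copy changes more slowly than the trained weights, which smooths out step-to-step noise.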
Failure analysis
- RT-1:
  - Loses its grasp after picking up the object, and the object falls: 1
  - Pauses before picking up the object: 6
  - Pauses before inserting the object into the target: 2
  - Acts as if the gripper has picked up the object when it has not: 3
  - The object gets stuck halfway through insertion, and the policy does not know how to recover: 1
- Diffusion Policy (UNet):
  - Loses its grasp after picking up the object, and the object falls: 2
  - Pauses before picking up the object: 2
  - Acts as if the gripper has picked up the object when it has not: 1
  - The object gets stuck halfway through insertion, and the policy does not know how to recover: 3
  - Successfully inserts the object into the target but then suddenly lifts it and throws it away: 1
- Diffusion Policy (Transformer):
  - Pauses before picking up the object: 1 (in the third-person view, the object is obscured by the gripper)
  - Acts as if the gripper has picked up the object when it has not: 1
  - Pauses before inserting the object into the target: 2
- Florence (linear head):
  - Loses its grasp after picking up the object, and the object falls: 1
  - Pauses before picking up the object: 6
  - Pauses before inserting the object into the target: 5
  - Acts as if the gripper has picked up the object when it has not: 1
  - Successfully inserts the object into the target but then suddenly lifts it and throws it away: 1
- Florence (diffusion transformer head):
  - Loses its grasp after picking up the object, and the object falls: 6
  - Pauses before picking up the object: 4
  - Pauses before inserting the object into the target: 2
  - The object gets stuck halfway through insertion, and the policy does not know how to recover: 2
  - The object falls halfway through insertion, and the policy does not know how to recover: 1
To use robosuite 1.2, we need a conda environment with Python 3.8 or 3.9. You can use mirror sites of GitHub to avoid connection problems in some regions.
conda create -n mimic python=3.9
conda activate mimic
git clone https://github.com/EDiRobotics/mimictest
cd mimictest
apt install curl git libgl1-mesa-dev libgl1-mesa-glx libglew-dev libosmesa6-dev software-properties-common net-tools unzip vim virtualenv wget xpra xserver-xorg-dev libglfw3-dev patchelf cmake
pip install -e .
pip install robosuite@https://github.com/cheng-chi/robosuite/archive/277ab9588ad7a4f4b55cf75508b44aa67ec171f0.tar.gz
You should also download the dataset, which contains `robomimic_image.zip` or `robomimic_lowdim.zip`, from the official link or HuggingFace. In this example, I use the HF-Mirror tool. You can set the environment variable `export HF_ENDPOINT=https://hf-mirror.com` to avoid connection problems in some regions.
apt install git-lfs aria2
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
./hfd.sh EDiRobotics/mimictest_data --dataset --tool aria2c -x 9
If you only want to download a subset of the data, e.g., the square task with image input:
./hfd.sh EDiRobotics/mimictest_data --dataset --tool aria2c -x 9 --include robomimic_image/square.zip
To use Florence-based models, you should also complete the following steps. First, download one of the models from HuggingFace, for example:
./hfd.sh microsoft/Florence-2-base --model --tool aria2c -x 9
Then set `model_path` in the script, for example:
# in Script/FlorenceImage.py
model_path = "/path/to/downloaded/florence/folder"
You need to install flash-attention. We recommend downloading a pre-built wheel from the official release instead of building one yourself. For example (choose the wheel that matches your system):
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.4cxx11abiTRUE-cp39-cp39-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.4cxx11abiTRUE-cp39-cp39-linux_x86_64.whl
- First run `accelerate config` to set the environment parameters (number of GPUs, precision, etc.). We recommend using `bf16`.
- Download and unzip the dataset mentioned above.
- Check and modify the settings (e.g., train or eval, and the corresponding settings) in the script you want to run, under the `Script` directory. Each script represents a configuration of an algorithm.
- Then run `accelerate launch Script/<the script you choose>.py`.
ImportError: /opt/conda/envs/test/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /lib/x86_64-linux-gnu/libLLVM-15.so.1)
Please check this link.