[CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

Home Page: https://rlhf-v.github.io

Python 85.67% Shell 14.33%
chatbot gpt-4 llama multi-modality multimodal visual-language-learning rlhf-v

rlhf-v's Introduction

RLHF-V

Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

License

Brief Introduction

This repository hosts the code, data, and model weights of RLHF-V, a novel framework that aligns the behavior of Multimodal Large Language Models (MLLMs) through fine-grained correctional human feedback.

We collect fine-grained correctional feedback data, which can better credit the desired behavior, by asking human annotators to correct the hallucinated segments in model responses. Thanks to the high data efficiency of this approach, it takes only 1 hour on 8 A100 GPUs to reduce the hallucination rate of the base model by 34.8%. Specifically, we conduct experiments on Muffin, an MLLM trained on UniMM-Chat that has strong image understanding and reasoning abilities.

Visit our 🏠 project page and 📃 paper to explore more! And don't miss our interactive 🔥 demo!

🎈News

  • [2024.03.10] 📃 RLHF-V has been accepted to CVPR 2024!

  • [2024.02.04] 🔥 OmniLMM-12B, which is built with RLHF-V, achieves the #1 rank among open-source models on MMHal-Bench and even outperforms GPT-4V on Object HalBench! The demo is available here!

  • [2024.01.06] 🔥 A larger, more diverse set of fine-grained human correction data is now available on Hugging Face! 🔥 The new release contains about 5.7k fine-grained human correction samples covering the outputs of more powerful models (Qwen-VL-Chat, InstructBLIP, etc.). We also expand the image types from everyday scenes to diverse styles and themes (WikiArt, landmarks, scene texts, etc.).

  • [2023.12.15] 🗂 We merged a new subset into our Hugging Face dataset! It contains 1,065 fine-grained human preference samples annotated on the outputs of LLaVA-13B.

  • [2023.12.04] 📃 Our paper is now accessible on arXiv. We are still working hard to improve the diversity and amount of data. More high-quality data are on the way!


Dataset

We present the RLHF-V-Dataset, a human preference dataset constructed from fine-grained, segment-level human corrections. In practice, we obtain a total of 1.4k annotated samples covering a diverse set of detailed description instructions and question-answering instructions.
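
To make "segment-level correction" concrete, here is a minimal sketch of how a model response and its human-corrected version could be turned into a preference pair with the changed segments marked. The function name, the word-level tokenization, and the output fields are assumptions chosen for illustration; they are not the actual RLHF-V-Dataset schema.

```python
from difflib import SequenceMatcher


def build_preference_pair(original: str, corrected: str) -> dict:
    """Pair a model response with its human correction (hypothetical format).

    The corrected text is treated as the preferred response and the original
    as the rejected one; changed word spans are recorded for both sides.
    """
    orig_words = original.split()
    corr_words = corrected.split()
    matcher = SequenceMatcher(a=orig_words, b=corr_words, autojunk=False)

    rejected_segments, chosen_segments = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Words that the annotator removed or replaced in the original.
        if i2 > i1:
            rejected_segments.append(" ".join(orig_words[i1:i2]))
        # Words that the annotator inserted or substituted.
        if j2 > j1:
            chosen_segments.append(" ".join(corr_words[j1:j2]))

    return {
        "chosen": corrected,
        "rejected": original,
        "chosen_segments": chosen_segments,
        "rejected_segments": rejected_segments,
    }


pair = build_preference_pair(
    "A man is riding a red bicycle on the beach.",
    "A man is riding a blue bicycle on the beach.",
)
print(pair["rejected_segments"], pair["chosen_segments"])
```

Diffing at the word level keeps the unchanged context shared between both responses, so only the corrected spans differ between the chosen and rejected sides.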

RLHF-V Weights

We release RLHF-V model weights on Hugging Face.

We also provide our SFT weights, the model checkpoint obtained after fine-tuning Muffin on the VQAv2 dataset.

Install

  1. Install Muffin
cd RLHF-V
git clone https://github.com/thunlp/muffin Muffin

cd Muffin
# Creating conda environment
conda create -n muffin python=3.10
conda activate muffin

# Installing dependencies
pip install -e .

# Install specific version of transformers to make sure you can reproduce the experimental results in our papers
git clone --recursive git@github.com:huggingface/transformers.git
cd transformers
git checkout a92e0ad2e20ef4ce28410b5e05c5d63a5a304e65
pip install .
cd ..
  2. Prepare training environment

Install additional packages if you need to do training.

git clone --recursive https://github.com/Dao-AILab/flash-attention.git
cd flash-attention

# Note: Uncomment the following line if you have CUDA version <= 11.4
# git checkout ad11394

MAX_JOBS=8 python setup.py install
cd ..
  3. Prepare evaluation environment

To run Object HalBench evaluation, you also need the following packages:

jsonlines
nltk==3.8.1
spacy==3.7.0

# Download and install "en_core_web_trf" for spacy
# The wheel version we use can be downloaded from
# https://github.com/explosion/spacy-models/releases/tag/en_core_web_trf-3.7.2
# run pip install en_core_web_trf-3.7.2-py3-none-any.whl
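
Before running the evaluation, it can help to confirm that the packages listed above are installed at the pinned versions. The following is a minimal sketch using only the standard library; the helper name `check_requirements` and the report format are assumptions for illustration.

```python
from importlib import metadata

# Pins copied from the list above; jsonlines has no pinned version there.
REQUIRED = {"jsonlines": None, "nltk": "3.8.1", "spacy": "3.7.0"}


def check_requirements(required: dict) -> dict:
    """Return {package: status}: 'ok', 'missing', or a version-mismatch note."""
    report = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if wanted is None or installed == wanted:
            report[name] = "ok"
        else:
            report[name] = f"found {installed}, expected {wanted}"
    return report


for package, status in check_requirements(REQUIRED).items():
    print(f"{package}: {status}")
```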

Evaluation

LLaVA Bench

Run the following script to generate, evaluate, and summarize results for LLaVA Bench:

# cd RLHF-V

bash ./script/eval/eval_muffin_llavabench.sh ./RLHF-V_weight ./results/RLHF-V {YOUR_OPENAI_API_KEY}

Object HalBench

  1. Prepare COCO2014 annotations

The evaluation of Object HalBench relies on the caption and segmentation annotations from the COCO2014 dataset. Please first download the COCO2014 dataset from the COCO dataset's official website.

mkdir coco2014
cd coco2014

wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip

unzip annotations_trainval2014.zip
  2. Inference, evaluation, and summarization

Please replace {YOUR_COCO2014_ANNOTATION_DIR} with the path to the COCO2014 annotation directory (e.g. ./coco2014/annotations), and replace {YOUR_OPENAI_API_KEY} with a valid OpenAI API key.

# cd RLHF-V

bash ./script/eval_muffin_objhal.sh ./RLHF-V_weight ./results/RLHF-V {YOUR_COCO2014_ANNOTATION_DIR} {YOUR_OPENAI_API_KEY}
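
A quick check of the annotation directory before launching the script can save a failed run. This is a minimal sketch: the file names are the caption and instance files shipped in annotations_trainval2014.zip, but which of them the evaluation script actually reads is an assumption, and `missing_annotation_files` is a hypothetical helper.

```python
from pathlib import Path

# File names as shipped in annotations_trainval2014.zip; which of them the
# evaluation script actually reads is an assumption.
EXPECTED_FILES = (
    "captions_train2014.json",
    "captions_val2014.json",
    "instances_train2014.json",
    "instances_val2014.json",
)


def missing_annotation_files(annotation_dir: str) -> list:
    """Return the expected COCO2014 annotation files absent from annotation_dir."""
    root = Path(annotation_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_annotation_files("./coco2014/annotations")
if missing:
    print("Missing annotation files:", ", ".join(missing))
else:
    print("All expected annotation files are present.")
```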

MMHal Bench

  1. Prepare MMHal Data

Please download the MMHal evaluation data here, and save the file in eval/data.

  2. Run the following script to generate, evaluate, and summarize results for MMHal Bench:
# cd RLHF-V

bash ./script/eval_muffin_mmhal.sh ./RLHF-V_weight ./results/RLHF-V {YOUR_OPENAI_API_KEY}

RLHF-V Training

  1. Prepare environment

Please follow the instructions in the Install section to prepare the training environment, and make sure to upgrade to the latest Muffin code base:

cd Muffin

git pull
pip install -e .
  2. Prepare model checkpoint

Please download our SFT model checkpoint and save it to Muffin/RLHF-V_SFT_weight.

  3. Training

After setting up the Muffin environment on the latest code base, you can train your model as follows. The script automatically downloads our open-source training data from Hugging Face, generates log-probabilities (logps) with our SFT model, and runs DDPO training:

cd Muffin

ref_model=./RLHF-V_SFT_weight

bash ./script/train/run_RLHFV.sh \
    ./RLHFV_checkpoints/dpo_exp \
    master \
    RLHFV \
    1.1 \
    $ref_model \
    ./RLHF-V-Dataset \
    RLHFV_SFT \
    2160 \
    360 \
    0.1 \
    False \
    True
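
For orientation, the role of the reference logps in preference training can be sketched with the standard DPO objective for a single preference pair. This is the generic DPO loss, not the DDPO variant used by RLHF-V; the function name and the default `beta` are assumptions for illustration, and the mapping between this sketch and the positional arguments of the script above is not implied.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probabilities."""
    # Implicit rewards: how much more likely the policy makes each response
    # than the frozen reference model does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss falls as the policy prefers the chosen response more strongly
# than the reference model does.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

Because the reference log-probabilities stay fixed during training, they can be computed once in advance, which is presumably why the script generates logps with the SFT model before training starts.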

Licenses

Code License Data License

Usage and License Notices: The data, code, and checkpoint are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and ChatGPT. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Acknowledgement

  • Muffin: the codebase we built upon.
  • LLaVA-RLHF: we utilize the MMHal-Bench data and evaluation code constructed by them.
  • Object Hallucination: we refer to the CHAIR evaluation code included in the repository.

If you find RLHF-V useful for your research and applications, please cite using this BibTeX:

@article{yu2023rlhf,
  title={Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback},
  author={Yu, Tianyu and Yao, Yuan and Zhang, Haoye and He, Taiwen and Han, Yifeng and Cui, Ganqu and Hu, Jinyi and Liu, Zhiyuan and Zheng, Hai-Tao and Sun, Maosong and others},
  journal={arXiv preprint arXiv:2312.00849},
  year={2023}
}

rlhf-v's Issues

KeyError: 'RLHF-V-Dataset'

When I run run_RLHFV.sh with the following command:

bash ./script/train/run_RLHFV.sh ./RLHFV_checkpoints/dpo_exp master RLHFV 5.0 ./RLHF-V_SFT_weight RLHF-V-Dataset 1 320 40 0.5 False True

I get this error:

  File "/opt/tiger/xxx/Muffin/muffin/train/train_muffin.py", line 278, in make_dpo_data_module
    train_dataset = DPODataset(tokenizer=tokenizer,
  File "/opt/tiger/xxx/Muffin/muffin/train/train_muffin.py", line 178, in __init__
    self.list_data_dict = create_multi_data_source_dataset(multimodal_cfg['data_source_names'], multimodal_cfg['data_source_weights'])
  File "/opt/tiger/xxx/Muffin/muffin/train/train_muffin.py", line 144, in create_multi_data_source_dataset
    ds = SingleDataSourceDataset(name, *register_data_pathname)
  File "/opt/tiger/xxx/Muffin/muffin/data/data_processors.py", line 47, in __getitem__
    return self._dict[key]
KeyError: 'RLHF-V-Dataset'

Why does this happen?

Thank you very much
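
The traceback ends in a plain dictionary lookup (`self._dict[key]`), which suggests that the name passed on the command line was not among the registered data-source names. That reading is an assumption; the registry contents are not shown in this thread. The pattern, and a lookup that reports the known names, can be sketched as follows. `DataSourceRegistry` and its entries are hypothetical and are not Muffin's actual classes.

```python
class DataSourceRegistry:
    """Hypothetical name -> dataset-path registry, for illustration only."""

    def __init__(self):
        self._dict = {}

    def register(self, name, path):
        self._dict[name] = path

    def __getitem__(self, key):
        if key not in self._dict:
            known = ", ".join(sorted(self._dict)) or "<none>"
            raise KeyError(f"{key!r} is not registered; known names: {known}")
        return self._dict[key]


registry = DataSourceRegistry()
registry.register("example_source", "/path/to/example")  # placeholder entry

try:
    registry["RLHF-V-Dataset"]
except KeyError as err:
    print("lookup failed:", err)
```

Listing the registered names in the error message makes it easier to tell a typo in the argument from an argument passed in the wrong position.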

corrections & hallucinations

Great job, and thank you for writing this paper; it was very intriguing and informative. A question came up while I was reading.
The paper says you collected segment-level corrections and hallucinations, but I cannot find this kind of information in the dataset.

About training code and some question about the DDPO

Thanks for your amazing work!

I have several questions:

  1. I cannot find the training code or the shell scripts. Have they not been released yet?
  2. The DPO algorithm relies on human labels marking outputs of the pre-trained model as preferred or dispreferred, whereas RLHF-V uses manual corrections of the model output. Does that mean the model outputs are all dispreferred and the corrected versions are all preferred?

TypeError: forward() got an unexpected keyword argument 'position_ids'

I was trying to run DPO training using the provided command. I installed the Muffin library as instructed but got the following error. Updating to the latest transformers library did not help. Have you run into anything like this during your experiments? I am using 4 A100 GPUs (80 GB).

Traceback (most recent call last):
  File "/home/ubuntu/muffin/./muffin/train/train_mem_muffin.py", line 13, in <module>
    train()
  File "/home/ubuntu/muffin/muffin/train/train_muffin.py", line 473, in train
    trainer.train()
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/transformers/trainer.py", line 2725, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/muffin/muffin/train/trainers.py", line 189, in compute_loss
    concatenated_logp = forward_DPO(model,
  File "/home/ubuntu/muffin/muffin/train/trainers.py", line 123, in forward_DPO
    output = model(
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 748, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/ubuntu/muffin/muffin/model/muffin.py", line 338, in forward
    outputs = self.model(
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/muffin/muffin/model/muffin.py", line 298, in forward
    return super(Beit3LlavaLlamaModel, self).forward(
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 912, in forward
    layer_outputs = self._gradient_checkpointing_func(
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 748, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/ubuntu/mambaforge-pypy3/envs/muffin/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: forward() got an unexpected keyword argument 'position_ids'
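
This kind of TypeError appears when a caller passes a keyword that the callee's forward() does not declare. Since the Install section pins transformers to a specific commit, one plausible cause is a version mismatch between the installed transformers and the model code, but that is an assumption and is not confirmed in this thread. The sketch below shows how such a mismatch can be detected by inspecting a callable's signature; both forward functions are hypothetical stand-ins, not code from Muffin or transformers.

```python
import inspect


def accepts_kwarg(func, name: str) -> bool:
    """True if func can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind is not inspect.Parameter.POSITIONAL_ONLY:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def old_style_forward(hidden_states, attention_mask=None):
    """Stand-in for a forward() written against an older calling convention."""
    return hidden_states


def new_style_forward(hidden_states, attention_mask=None, position_ids=None):
    """Stand-in for a forward() that accepts position_ids."""
    return hidden_states


print(accepts_kwarg(old_style_forward, "position_ids"))
print(accepts_kwarg(new_style_forward, "position_ids"))
```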

Request inference script for logps data

Hi there,
Thanks for your awesome work on RLHF-V. I noticed that you provide weights for a pretrained model used to generate logps. Could you share the code for running this model to produce that data?

Cheers!

About model size

I hope this message finds you well. I am writing to you regarding your GitHub repository and, in particular, the models used in your recent paper. I have noticed that the models mentioned, such as the 13B Vicuna v1.0 and BEit3, differ in architecture from LLaVA-1.5.

In the interest of a fair comparison, I am interested in training a 7B model to align with LLaVA-1.5. However, I couldn't find explicit instructions or commands in your repository on how to train a 7B model. I understand that model size can significantly impact results, and I want to ensure a fair evaluation.

Could you kindly provide guidance or the necessary commands to train a 7B model using your repository? I appreciate your expertise and assistance in making the model comparison as unbiased as possible.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 484375) of binary

Hello, I am using 8 A100 GPUs and changed almost nothing in the scripts, but I ran into this problem. Could you take a look? I set GPUS_PER_NODE=8, and the command line is: srun --partition=llm3 --gres=gpu:8 --nodes=1 --cpus-per-task=16 --job-name=RLHF-V --ntasks-per-node=8 bash run_RLHFV.sh

detailed information:


> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484376 closing signal SIGTERM
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484378 closing signal SIGTERM
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484379 closing signal SIGTERM
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484380 closing signal SIGTERM
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484381 closing signal SIGTERM
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484383 closing signal SIGTERM
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484384 closing signal SIGTERM
> ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 484375) of binary: /mnt/petrelfs/xinglong/anaconda3/envs/muffin/bin/python
> Traceback (most recent call last):
>   File "/mnt/petrelfs/xinglong/anaconda3/envs/muffin/bin/torchrun", line 33, in <module>
>     sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
>   File "/mnt/petrelfs/xinglong/anaconda3/envs/muffin/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
>     return f(*args, **kwargs)
>   File "/mnt/petrelfs/xinglong/anaconda3/envs/muffin/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
>     run(args)
>   File "/mnt/petrelfs/xinglong/anaconda3/envs/muffin/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
>     elastic_launch(
>   File "/mnt/petrelfs/xinglong/anaconda3/envs/muffin/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
>     return launch_agent(self._config, self._entrypoint, list(args))
>   File "/mnt/petrelfs/xinglong/anaconda3/envs/muffin/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
>     raise ChildFailedError(
> torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
> =======================================================
> ./muffin/train/train_mem_muffin.py FAILED
> -------------------------------------------------------
> Failures:
>   <NO_OTHER_FAILURES>
> -------------------------------------------------------
> Root Cause (first observed failure):
> [0]:
>   time      : 2024-01-30_11:52:52
>   host      : SH-IDC1-10-140-1-167
>   rank      : 0 (local_rank: 0)
>   exitcode  : -9 (pid: 484375)
>   error_file: <N/A>
>   traceback : Signal 9 (SIGKILL) received by PID 484375
> =======================================================
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484385 closing signal SIGTERM
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484389 closing signal SIGTERM
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484391 closing signal SIGTERM
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484395 closing signal SIGTERM
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484397 closing signal SIGTERM
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484398 closing signal SIGTERM
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 484399 closing signal SIGTERM
> ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 484387) of binary: /mnt/petrelfs/xinglong/anaconda3/envs/muffin/bin/python
> Traceback (most recent call last):
>   File "/mnt/petrelfs/xinglong/anaconda3/envs/muffin/bin/torchrun", line 33, in <module>
>     sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
>   File "/mnt/petrelfs/xinglong/anaconda3/envs/muffin/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
>     return f(*args, **kwargs)
>   File "/mnt/petrelfs/xinglong/anaconda3/envs/muffin/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
>     run(args)
>   File "/mnt/petrelfs/xinglong/anaconda3/envs/muffin/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
>     elastic_launch(
>   File "/mnt/petrelfs/xinglong/anaconda3/envs/muffin/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
>     return launch_agent(self._config, self._entrypoint, list(args))
>   File "/mnt/petrelfs/xinglong/anaconda3/envs/muffin/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
>     raise ChildFailedError(
> torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
> =======================================================
> ./muffin/train/train_mem_muffin.py FAILED
> -------------------------------------------------------
> Failures:
>   <NO_OTHER_FAILURES>
> -------------------------------------------------------
> Root Cause (first observed failure):
> [0]:
>   time      : 2024-01-30_11:53:13
>   host      : SH-IDC1-10-140-1-167
>   rank      : 1 (local_rank: 1)
>   exitcode  : -9 (pid: 484387)
>   error_file: <N/A>
>   traceback : Signal 9 (SIGKILL) received by PID 484387
> =======================================================
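The log above reports `exitcode: -9` together with `Signal 9 (SIGKILL) received`. A negative exit code is the usual Python convention for "terminated by signal N". The sketch below only decodes such codes; the helper name `describe_exit_code` is hypothetical and not part of the repository. SIGKILL is always sent from outside the process. On Linux a common sender is the kernel out-of-memory killer or a scheduler enforcing a memory limit, but that is an assumption about this case and the log itself does not confirm it.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Describe a child-process exit code as reported by torchrun or subprocess."""
    if exitcode < 0:
        # Negative code: the process was terminated by signal number -exitcode.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {-exitcode}"
        return f"killed by {name}"
    if exitcode == 0:
        return "exited normally"
    return f"exited with status {exitcode}"


print(describe_exit_code(-9))
print(describe_exit_code(1))
```

If host memory is the suspect, comparing the number of launched processes with what the node can hold is a reasonable first check, since each process loads its own copy of the model before sharding.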

Missing Data (tsv files)

"For training simplicity, we generate the logp values based on RLHF-V_SFT-13B model and provide it in our dataset in advance."

However, I could not find this data at the dataset link; there is only one jsonl file.
How can I convert it to tsv? When I tried to generate the logp values myself, I got:
No such file or directory: './data/RLHF-V-Dataset/RLHF-V-Dataset-1401.tsv'
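For the generic part of the question, here is a minimal sketch of turning JSON Lines records into tab-separated text with the standard library. The function name `jsonl_to_tsv` and the field names `idx` and `question` are hypothetical. The column layout that the repository's training code expects in `RLHF-V-Dataset-1401.tsv` is not stated in this thread, so this does not claim to reproduce that file.

```python
import csv
import io
import json


def jsonl_to_tsv(jsonl_text: str, columns: list[str]) -> str:
    """Convert JSON Lines text to TSV text, keeping only the given columns."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)  # header row
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        row = []
        for col in columns:
            value = record.get(col, "")
            # Non-string values (numbers, lists, dicts) are stored as JSON text.
            if not isinstance(value, str):
                value = json.dumps(value, ensure_ascii=False)
            row.append(value)
        writer.writerow(row)
    return out.getvalue()


# Hypothetical records; the real dataset fields may differ.
sample = '{"idx": 0, "question": "What is shown?"}\n{"idx": 1, "question": "How many?"}\n'
print(jsonl_to_tsv(sample, ["idx", "question"]))
```

Fields that contain tabs or newlines are quoted by `csv.writer`, so a reader should use `csv.reader` with the same delimiter rather than splitting on tabs by hand.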

ImportError: cannot import name 'flash_attn_unpadded_qkvpacked_func' from 'flash_attn.flash_attn_interface'

I followed the README to install the environment and then ran muffin/script/train/run_RLHFV.sh, but it raises this ImportError:

Traceback (most recent call last):
  File "~/codes_MLLM/RLHF-V/muffin/./muffin/train/train_mem_muffin.py", line 6, in <module>
    from muffin.train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
  File "~/codes_MLLM/RLHF-V/muffin/muffin/train/llama_flash_attn_monkey_patch.py", line 12, in <module>
    from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
ImportError: cannot import name 'flash_attn_unpadded_qkvpacked_func' from 'flash_attn.flash_attn_interface' (~/miniconda3/envs/muffin/lib/python3.10/site-packages/flash_attn-2.4.2-py3.10-linux-x86_64.egg/flash_attn/flash_attn_interface.py)
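The traceback shows flash_attn 2.4.2 installed. In the 2.x releases the "unpadded" functions were renamed to "varlen" (`flash_attn_unpadded_qkvpacked_func` became `flash_attn_varlen_qkvpacked_func`), which would explain the failed import. One way to tolerate both names is to try them in order, as sketched below. The helper `import_first_available` is hypothetical, and whether the renamed function is a drop-in replacement inside `llama_flash_attn_monkey_patch.py` has not been verified here. Installing a 1.x release of flash-attn is the other option.

```python
import importlib


def import_first_available(module_name: str, candidate_names: list[str]):
    """Return the first attribute from candidate_names that the module defines."""
    module = importlib.import_module(module_name)
    for name in candidate_names:
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"none of {candidate_names} found in {module_name}")


# Demonstrated on the standard library so the snippet runs without flash-attn.
sqrt = import_first_available("math", ["no_such_function", "sqrt"])
print(sqrt(9.0))

# For the failing import, the call would look like this (assumes flash-attn is installed):
# qkvpacked_func = import_first_available(
#     "flash_attn.flash_attn_interface",
#     ["flash_attn_unpadded_qkvpacked_func", "flash_attn_varlen_qkvpacked_func"],
# )
```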

Question about MHumanEval

Dear authors, thank you for sharing the code for this impressive work! Do you have instructions on how to use the MHumanEval benchmark as an evaluator? The GitHub repository presents multiple benchmarks, but not MHumanEval. If you have already covered this elsewhere, could you direct me to the link? Thank you in advance!

CUDA ordinal error within run_RLHFV.sh

Any idea what is wrong?
I am using CUDA 12.2, torch 2.3.0, and transformers 4.28.0.

`
local:/mnt/task_runtime/RLHF-V/muffin# bash ./script/train/run_RLHFV.sh ./RLHFV_checkpoints/dpo_exp master RLHFV 1.1 $ref_model ./RLHF-V-Dataset RLHFV_SFT 2160 360 0.1 False True
Working Directory at /mnt/task_runtime/RLHF-V/muffin
Bash at /usr/bin/bash
Python at /miniconda/envs/muffin/bin/python
Tue Apr 30 23:50:16 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:00:08.0 Off | 0 |
| N/A 37C P0 67W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
master local_addr=local
ddpo weight is 1.1 beta is 0.1
pythonpath=:/mnt/task_runtime/RLHF-V/muffin
RUNNER=torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=local
Data config: ./RLHF-V-Dataset RLHFV_SFT
sft_output_dir=./RLHFV_checkpoints/dpo_exp/muffin_13b_DPO-RLHFV---beit3_large_patch16_448/checkpionts sft_logging_dir=./RLHFV_checkpoints/dpo_exp/muffin_13b_DPO-RLHFV---beit3_large_patch16_448/log
Load from ./RLHF-V_SFT_weight
W0430 23:50:23.074000 134467772659520 torch/distributed/run.py:757]
W0430 23:50:23.074000 134467772659520 torch/distributed/run.py:757] *****************************************
W0430 23:50:23.074000 134467772659520 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0430 23:50:23.074000 134467772659520 torch/distributed/run.py:757] *****************************************
/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
warnings.warn(
DPO data: ./RLHF-V-Dataset RLHFV_SFT
[rank6]: Traceback (most recent call last):
[rank6]: File "/mnt/task_runtime/RLHF-V/muffin/./muffin/train/train_mem_muffin.py", line 13, in
[rank6]: train()
[rank6]: File "/mnt/task_runtime/RLHF-V/muffin/muffin/train/train_muffin.py", line 452, in train
[rank6]: model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank6]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
[rank6]: obj = dtype(**inputs)
[rank6]: File "", line 118, in init
[rank6]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1227, in post_init
[rank6]: and (self.device.type != "cuda")
[rank6]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1662, in device
[rank6]: return self._setup_devices
[rank6]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in get
[rank6]: cached = self.fget(obj)
[rank6]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1652, in _setup_devices
[rank6]: torch.cuda.set_device(device)
[rank6]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/torch/cuda/init.py", line 399, in set_device
[rank6]: torch._C._cuda_setDevice(device)
[rank6]: RuntimeError: CUDA error: invalid device ordinal
[rank6]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank6]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank6]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank4]: Traceback (most recent call last):
[rank4]: File "/mnt/task_runtime/RLHF-V/muffin/./muffin/train/train_mem_muffin.py", line 13, in
[rank4]: train()
[rank4]: File "/mnt/task_runtime/RLHF-V/muffin/muffin/train/train_muffin.py", line 452, in train
[rank4]: model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank4]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
[rank4]: obj = dtype(**inputs)
[rank4]: File "", line 118, in init
[rank4]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1227, in post_init
[rank4]: and (self.device.type != "cuda")
[rank4]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1662, in device
[rank4]: return self._setup_devices
[rank4]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in get
[rank4]: cached = self.fget(obj)
[rank4]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1652, in _setup_devices
[rank4]: torch.cuda.set_device(device)
[rank4]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/torch/cuda/init.py", line 399, in set_device
[rank4]: torch._C._cuda_setDevice(device)
[rank4]: RuntimeError: CUDA error: invalid device ordinal
[rank4]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank4]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank4]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank1]: Traceback (most recent call last):
[rank1]: File "/mnt/task_runtime/RLHF-V/muffin/./muffin/train/train_mem_muffin.py", line 13, in
[rank1]: train()
[rank1]: File "/mnt/task_runtime/RLHF-V/muffin/muffin/train/train_muffin.py", line 452, in train
[rank1]: model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank1]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
[rank1]: obj = dtype(**inputs)
[rank1]: File "", line 118, in init
[rank1]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1227, in post_init
[rank1]: and (self.device.type != "cuda")
[rank1]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1662, in device
[rank1]: return self._setup_devices
[rank1]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in get
[rank1]: cached = self.fget(obj)
[rank1]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1652, in _setup_devices
[rank1]: torch.cuda.set_device(device)
[rank1]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/torch/cuda/init.py", line 399, in set_device
[rank1]: torch._C._cuda_setDevice(device)
[rank1]: RuntimeError: CUDA error: invalid device ordinal
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank7]: Traceback (most recent call last):
[rank7]: File "/mnt/task_runtime/RLHF-V/muffin/./muffin/train/train_mem_muffin.py", line 13, in
[rank7]: train()
[rank7]: File "/mnt/task_runtime/RLHF-V/muffin/muffin/train/train_muffin.py", line 452, in train
[rank7]: model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank7]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
[rank7]: obj = dtype(**inputs)
[rank7]: File "", line 118, in init
[rank7]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1227, in post_init
[rank7]: and (self.device.type != "cuda")
[rank7]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1662, in device
[rank7]: return self._setup_devices
[rank7]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in get
[rank7]: cached = self.fget(obj)
[rank7]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1652, in _setup_devices
[rank7]: torch.cuda.set_device(device)
[rank7]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/torch/cuda/init.py", line 399, in set_device
[rank7]: torch._C._cuda_setDevice(device)
[rank7]: RuntimeError: CUDA error: invalid device ordinal
[rank7]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank7]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank7]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank3]: Traceback (most recent call last):
[rank3]: File "/mnt/task_runtime/RLHF-V/muffin/./muffin/train/train_mem_muffin.py", line 13, in
[rank3]: train()
[rank3]: File "/mnt/task_runtime/RLHF-V/muffin/muffin/train/train_muffin.py", line 452, in train
[rank3]: model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank3]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
[rank3]: obj = dtype(**inputs)
[rank3]: File "", line 118, in init
[rank3]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1227, in post_init
[rank3]: and (self.device.type != "cuda")
[rank3]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1662, in device
[rank3]: return self._setup_devices
[rank3]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in get
[rank3]: cached = self.fget(obj)
[rank3]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1652, in _setup_devices
[rank3]: torch.cuda.set_device(device)
[rank3]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/torch/cuda/init.py", line 399, in set_device
[rank3]: torch._C._cuda_setDevice(device)
[rank3]: RuntimeError: CUDA error: invalid device ordinal
[rank3]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank3]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank3]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank5]: Traceback (most recent call last):
[rank5]: File "/mnt/task_runtime/RLHF-V/muffin/./muffin/train/train_mem_muffin.py", line 13, in
[rank5]: train()
[rank5]: File "/mnt/task_runtime/RLHF-V/muffin/muffin/train/train_muffin.py", line 452, in train
[rank5]: model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank5]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
[rank5]: obj = dtype(**inputs)
[rank5]: File "", line 118, in init
[rank5]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1227, in post_init
[rank5]: and (self.device.type != "cuda")
[rank5]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1662, in device
[rank5]: return self._setup_devices
[rank5]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in get
[rank5]: cached = self.fget(obj)
[rank5]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1652, in _setup_devices
[rank5]: torch.cuda.set_device(device)
[rank5]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/torch/cuda/init.py", line 399, in set_device
[rank5]: torch._C._cuda_setDevice(device)
[rank5]: RuntimeError: CUDA error: invalid device ordinal
[rank5]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank5]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank5]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank2]: Traceback (most recent call last):
[rank2]: File "/mnt/task_runtime/RLHF-V/muffin/./muffin/train/train_mem_muffin.py", line 13, in
[rank2]: train()
[rank2]: File "/mnt/task_runtime/RLHF-V/muffin/muffin/train/train_muffin.py", line 452, in train
[rank2]: model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank2]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
[rank2]: obj = dtype(**inputs)
[rank2]: File "", line 118, in init
[rank2]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1227, in post_init
[rank2]: and (self.device.type != "cuda")
[rank2]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1662, in device
[rank2]: return self._setup_devices
[rank2]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in get
[rank2]: cached = self.fget(obj)
[rank2]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/transformers/training_args.py", line 1652, in _setup_devices
[rank2]: torch.cuda.set_device(device)
[rank2]: File "/miniconda/envs/muffin/lib/python3.10/site-packages/torch/cuda/init.py", line 399, in set_device
[rank2]: torch._C._cuda_setDevice(device)
[rank2]: RuntimeError: CUDA error: invalid device ordinal
[rank2]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank2]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank2]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

W0430 23:50:38.337000 134467772659520 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 347345 closing signal SIGTERM
E0430 23:50:38.551000 134467772659520 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 347346) of binary: /miniconda/envs/muffin/bin/python
Traceback (most recent call last):
File "/miniconda/envs/muffin/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.3.0', 'console_scripts', 'torchrun')())
File "/miniconda/envs/muffin/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/miniconda/envs/muffin/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/miniconda/envs/muffin/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/miniconda/envs/muffin/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/miniconda/envs/muffin/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./muffin/train/train_mem_muffin.py FAILED

Failures:
[1]:
time : 2024-04-30_23:50:38
host : local
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 347347)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-04-30_23:50:38
host : local
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 347348)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-04-30_23:50:38
host : local
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 347349)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-04-30_23:50:38
host : local
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 347350)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-04-30_23:50:38
host : local
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 347351)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2024-04-30_23:50:38
host : local
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 347352)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-04-30_23:50:38
host : local
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 347346)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html`
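One thing stands out in the log above: `nvidia-smi` lists a single GPU (index 0), while the launcher line uses `--nproc_per_node=8`. Each worker calls `torch.cuda.set_device` with its local rank, so any rank with no matching GPU index would fail with "invalid device ordinal". The sketch below just makes that arithmetic explicit; the function name `invalid_local_ranks` is hypothetical, and the diagnosis is a reading of the log, not something the maintainers confirmed.

```python
def invalid_local_ranks(nproc_per_node: int, visible_gpus: int) -> list[int]:
    """Return the local ranks that have no GPU index to bind to."""
    return [rank for rank in range(nproc_per_node) if rank >= visible_gpus]


# Values read from the log above: --nproc_per_node=8, one GPU listed by nvidia-smi.
print(invalid_local_ranks(nproc_per_node=8, visible_gpus=1))
```

The result is ranks 1 through 7, the same set that reports the error in the log. At runtime, `torch.cuda.device_count()` gives the number of visible GPUs, so the process count passed to torchrun can be checked against it before launching.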
