
ultrafeedback's People

Contributors

cgq15, lifan-yuan, ningding97

ultrafeedback's Issues

How to obtain the comparison data?

Hello, and thank you to the author team for releasing the UltraFeedback dataset. I am currently trying to use it to train a reward model, but I have run into a question.

The dataset contains 64K instructions and 256K responses, and according to the paper's setup, 340K comparisons can be generated from them. How are these comparisons generated? I could not find this functionality in the project code. If it does exist, is it in the path below?

https://github.com/OpenBMB/UltraFeedback/tree/main/src/comparison_data_generation
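
In case it helps while waiting for an answer, here is a minimal sketch of how such pairwise comparisons could be assembled; the field names and the tie-handling rule are assumptions on my part, not the authors' actual pipeline:

from itertools import combinations

# Minimal sketch: build pairwise comparisons from each instruction's completions.
# The field names ("completions", "response", "overall_score") and the rule of
# skipping ties are assumptions, not the authors' actual pipeline.
def build_comparisons(records):
    comparisons = []
    for record in records:
        for a, b in combinations(record["completions"], 2):
            if a["overall_score"] == b["overall_score"]:
                continue  # skip ties; the paper's exact filtering may differ
            chosen, rejected = (a, b) if a["overall_score"] > b["overall_score"] else (b, a)
            comparisons.append({
                "instruction": record["instruction"],
                "chosen": chosen["response"],
                "rejected": rejected["response"],
            })
    return comparisons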

evol_instruct issues: prompts with missing data

import datasets

ds = datasets.load_dataset('openbmb/ultrafeedback')
print(ds['train'][490]['instruction'])

Gives

Add a requirement for the given prompt that the hashtag must also include the top 3 countries with the highest sustainable energy consumption in 2020, based on their percentage of total energy consumption.

But there is no "given prompt". This seems to be an issue with several of the evol_instruct prompts.
Also note that the completions for such samples include wild hallucinations, yet their ratings evaluate them as free of hallucinations.

In addition, even evol_instruct prompts that do include the prompt to be modified are often full of issues, with either the model or the evaluator interpreting it as a request to answer the original prompt.
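
A rough way to gauge how widespread the missing-context problem is (the "source" field name and the phrase-matching heuristic below are assumptions, not an official check):

import datasets

# Rough heuristic: count evol_instruct prompts that refer to a "given prompt",
# which suggests the original prompt was dropped during evolution. The "source"
# field name and the substring test are assumptions, not an official check.
ds = datasets.load_dataset("openbmb/ultrafeedback", split="train")
suspect = [
    ex["instruction"]
    for ex in ds
    if ex.get("source") == "evol_instruct" and "given prompt" in ex["instruction"].lower()
]
print(f"{len(suspect)} evol_instruct prompts mention a 'given prompt'")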

Code usage description

Hello

Thank you for your work! I'd like to build on your scripts to do something similar for a different language. Could you please describe the workflow one needs to follow to reproduce your work? I.e., which scripts to run, and in which order; where to store the downloaded flan, truthfulqa, ... datasets; etc. It is currently unclear to me how to proceed.

Thanks!

Bram

Missing harmlessness template or wrong reference

Description

Hi to whoever is reading this 🤗 I saw that within the code snippets you're referencing a harmlessness template, but there is none in the code (see the import below):

from preference_templates import system_prompt, instruction_following_template, truthfulness_template, honesty_template, harmlessness_template, helpfulness_template

Is that intended? Was there a harmlessness category that was dropped during the process? Is that something to be integrated within the next revision of the paper?

Thanks in advance!

About code and prompts

Thank you so much for your dedication; this is truly remarkable work. The lack of high-quality open-source preference data has indeed been a challenge. Do you have plans to also open-source the corresponding code and prompts, so we can utilize this workflow to generate our own datasets? Many thanks!

Issue with `overall_score` computation

Hi!

Congrats on this amazing project.

We've been exploring the data and identified an issue with very high overall_score responses. The issue seems to be related to this line. It causes responses with a critique rating of 1 to become a 10. We noticed this by looking at the critique rationale, which was highly negative for many (~2K) examples with an overall_score of 10.
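
A minimal sketch of how such cases can be surfaced for manual inspection (the nesting and field names, "completions", "critique", "overall_score", are assumptions about the released schema):

import datasets

# List completions with a perfect overall_score so their critiques can be
# inspected by hand. The nesting and field names ("completions", "critique",
# "overall_score") are assumptions about the released schema.
ds = datasets.load_dataset("openbmb/ultrafeedback", split="train")
perfect_critiques = []
for ex in ds:
    for comp in ex["completions"]:
        if comp.get("overall_score") == 10:
            perfect_critiques.append(comp.get("critique", ""))
print(f"{len(perfect_critiques)} completions have overall_score == 10")
for critique in perfect_critiques[:5]:
    print("-", critique[:200])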

Reproducing data generation

Thanks for providing the code to generate the UltraFeedback data. I tried running the files in the comparison_data_generation folder. First, the bash script run_vllm.sh points to a Python script named main_vllm_batch.py, which does not exist. Second, when I tried to run main_vllm.py, it seems that the function instruction_completion, which actually generates the responses from the model, is not used anywhere. Could you please describe how to run these scripts, i.e., generating the data from the models and then annotating it?
Thanks!
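
For what it's worth, a minimal sketch of the generation step one might expect instruction_completion to perform with vLLM; the model name, prompts, and sampling settings are placeholders, not the repository's actual configuration:

from vllm import LLM, SamplingParams

# Placeholder generation step: the model, prompts, and sampling settings below
# are illustrative only, not the values used by the UltraFeedback scripts.
prompts = ["Explain what a reward model is in one paragraph."]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)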

Training details for reproducing UltraCM

Thank you so much for sharing the data. It's very helpful for the RLHF community!

I found some hyper-parameters for training UltraCM in your paper, but I am still confused about the following questions:

  1. How do you prepare the training examples? It seems that the instruction, the completion, the feedback, and the overall score are filled into the ultracm_instruction_template defined on your demo page, but I'm not sure.
  2. How is the loss calculated? Did you apply masking to the input content, i.e., the instruction and the completion? (See the sketch after this list.)
  3. Did you compare tuning a critique model from an SFT model versus a pretrained checkpoint?
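
For question 2, a common recipe with Hugging Face Transformers is to set the labels of the prompt tokens to -100 so that only the feedback tokens contribute to the cross-entropy loss. A minimal sketch under that assumption (not necessarily the authors' exact setup):

import torch
from transformers import AutoTokenizer

# Mask the prompt (instruction + completion) with label -100 so the loss is
# computed only on the critique and score tokens. This is a common SFT recipe,
# not necessarily what the authors did; the template strings are placeholders.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

prompt = "### Instruction:\n...\n### Completion:\n...\n### Feedback:\n"   # filled template
target = "The response is accurate but overly verbose. Overall Score: 7"  # critique + score

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]

input_ids = torch.tensor(prompt_ids + target_ids)
labels = torch.tensor([-100] * len(prompt_ids) + target_ids)  # prompt tokens ignored by the loss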

Thanks again for your efforts!

The overall score is not matching with the principles

Hi,
I found that some answers with a higher overall_score have a lower helpfulness rating in the evol_instruct.jsonl dataset, where the principle is 100% helpfulness.

For example, the scores of the 9th sample in the evol_instruct.jsonl dataset are as follows:

model             helpfulness  honesty  instruction following  truthfulness  overall score
gpt-3.5-turbo     4            5        4                      5             7
llama-2-70b-chat  4            4        5                      5             7.5
mpt-30b-chat      3            4        3                      5             6.5
vicuna-33b        5            4        4                      5             6.5

The answer from vicuna-33b has the highest helpfulness rating but the lowest overall score.

My question is: should I pick the answer with the highest overall score or the one with the highest helpfulness rating as the preferred answer, or should I use the mean of the four principles?

Any suggestions would be appreciated, thanks.
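
For the mean-of-principles option mentioned above, a minimal sketch of one possible selection rule (the layout of the annotations field is an assumption about the released JSON, not a recommendation from the authors):

from statistics import mean

# Pick the preferred completion by the mean of the four fine-grained ratings
# instead of overall_score. The "annotations" layout and the handling of "N/A"
# ratings below are assumptions about the released JSON.
ASPECTS = ["helpfulness", "honesty", "instruction_following", "truthfulness"]

def aspect_mean(completion):
    ratings = []
    for aspect in ASPECTS:
        rating = completion["annotations"][aspect]["Rating"]
        if rating != "N/A":
            ratings.append(int(rating))
    return mean(ratings) if ratings else 0.0

def pick_preferred(completions):
    return max(completions, key=aspect_mean)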
