openbmb / ultrafeedback

A large-scale, fine-grained, diverse preference dataset (and models).

License: MIT License
Hello, and thank you to the author team for releasing the UltraFeedback dataset. I am currently trying to use it to train a reward model, but I have run into a question.
The dataset contains 64K instructions and 256K responses. According to the paper's setup, 340K comparisons can be generated from this data. How are these comparisons generated? I could not find this functionality in the project code. If it is in the repository, is it under the following path?
https://github.com/OpenBMB/UltraFeedback/tree/main/src/comparison_data_generation
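For what it's worth, one plausible reading of the numbers (my assumption, not confirmed by the repo): with 256K responses over 64K instructions there are 4 completions per instruction, and taking all pairwise comparisons per instruction gives C(4, 2) = 6 pairs, so 64K × 6 = 384K comparisons, of which ~340K might remain after filtering ties or failed annotations. A quick sanity check of that arithmetic:

```python
from itertools import combinations

# Hypothetical illustration, not the official script: with 4 model
# responses per instruction, pairwise comparison yields C(4, 2) pairs.
responses_per_instruction = 4
pairs_per_instruction = len(
    list(combinations(range(responses_per_instruction), 2))
)
num_instructions = 64_000

total_comparisons = num_instructions * pairs_per_instruction
print(pairs_per_instruction)  # 6
print(total_comparisons)      # 384000
```

This is only a back-of-the-envelope reading; the authors' actual pairing and filtering procedure may differ.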
```python
import datasets

ds = datasets.load_dataset('openbmb/ultrafeedback')
print(ds['train'][490]['instruction'])
```

gives
Add a requirement for the given prompt that the hashtag must also include the top 3 countries with the highest sustainable energy consumption in 2020, based on their percentage of total energy consumption.
But there is no "given prompt". This seems to be an issue with several of the evol_instruct prompts.
Also note that the completions for such samples include wild hallucinations, yet the ratings evaluate them as free of hallucinations.
In addition, even evol_instruct prompts that do include the prompt to be modified often have issues, with either the model or the evaluator interpreting them as a request to answer the original prompt.
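To get a rough count of how widespread this is, one could scan instructions for phrases like "the given prompt" and flag those that reference a prompt without containing one. A toy sketch of the idea (in practice you would iterate over the evol_instruct split of the dataset rather than this in-memory list, and the heuristic is mine, not the authors'):

```python
import re

# Flag instructions that refer to a "given prompt" or "original prompt".
# This is a coarse heuristic; it does not check whether the referenced
# prompt is actually embedded in the instruction text.
pattern = re.compile(r"\b(?:the|this) (?:given|original) prompt\b",
                     re.IGNORECASE)

instructions = [
    "Add a requirement for the given prompt that the hashtag must also "
    "include the top 3 countries with the highest sustainable energy "
    "consumption in 2020.",
    "Translate the following sentence into French: 'Hello, world.'",
]

suspect = [text for text in instructions if pattern.search(text)]
print(len(suspect))  # 1
```

Running something like this over the full split would give a lower bound on the number of affected samples.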
Hello
Thank you for your work! I'd like to build on your scripts to do something similar for a different language. Could you please describe the workflow one needs to follow to reproduce your work? I.e., which scripts to run in which order, where to store the downloaded flan, truthfulqa, ... datasets, etc. It is unclear to me how to proceed.
Thanks!
Bram
Hi to whoever is reading this 🤗 I saw that within the code snippets you're referencing a harmlessness template, while there is none in the code (see import below).
Is that intended? Was there a harmlessness category that was dropped during the process? Is that something to be integrated within the next revision of the paper?
Thanks in advance!
Thank you so much for your dedication; this is truly remarkable work. The lack of high-quality open-source preference data has indeed been a challenge. Do you have plans to also open-source the corresponding code and prompts, so we can utilize this workflow to generate our own datasets? Many thanks!
Hi!
Congrats on this amazing project.
We've been exploring the data and identified an issue with very high overall_score responses. The issue seems to be related to this line, which causes responses with a critique rating of 1 to become a 10. We noticed this by looking at the critique rationale, which was highly negative for many (~2K) examples with an overall_score of 10.
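A quick way to surface the kind of examples described above is to look for records whose overall_score is maximal while the critique text is clearly negative. A toy illustration on in-memory records (the field names mirror the dataset informally and are my assumption, not the exact schema; the negative-word list is an ad-hoc heuristic):

```python
# Records whose critique is clearly negative yet whose overall_score is 10
# are candidates for the rating-parsing bug discussed above.
records = [
    {"overall_score": 10,
     "critique": "The response is unhelpful and ignores the instruction."},
    {"overall_score": 10,
     "critique": "Excellent, accurate, and complete answer."},
    {"overall_score": 4,
     "critique": "Partially correct but misses key details."},
]

NEGATIVE_MARKERS = ("unhelpful", "incorrect", "ignores", "fails")

suspicious = [
    r for r in records
    if r["overall_score"] == 10
    and any(marker in r["critique"].lower() for marker in NEGATIVE_MARKERS)
]
print(len(suspicious))  # 1
```

Adapting this to the real dataset would require checking the actual critique/score field names first.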
Great work, and thanks for the contribution! May I ask if you have plans to release the training code for UltraRM/UltraCM?
Thanks for providing the code to generate the UltraFeedback data. I tried running the files in the comparison_data_generation folder. First, the bash script run_vllm.sh points to a Python script named main_vllm_batch.py, which is not there. Second, when I tried to run main_vllm.py, it seems that the function instruction_completion, which actually generates the responses from the model, is not used anywhere. Could you please describe how to run these scripts to generate the data from the models and then annotate it?
Thanks!
Thank you so much for sharing the data. It's very helpful for the RLHF community!
I found some hyper-parameters for training UltraCM in your paper, but I am also confused about the ultracm_instruction_template as defined on your demo page. I'm not sure... Thanks again for your efforts!
Hello, what does the type field in helpfulness and truthfulness mean? What do 1 and 2 stand for, respectively?
Hello, I see that the dataset is entirely in English. Can a reward model or critique model trained on English data be used for Chinese? Do you plan to open-source a Chinese RLHF dataset later?
Hi,
I found that some answers with a higher overall_score have a lower helpfulness score in the evol_instruct.jsonl dataset, where the annotation principle is 100% helpfulness.
For example, the scores of the 9th sample in evol_instruct.jsonl are as follows:
models | helpfulness | honesty | instruction following | truthfulness | overall score |
---|---|---|---|---|---|
gpt-3.5-turbo | 4 | 5 | 4 | 5 | 7 |
llama-2-70b-chat | 4 | 4 | 5 | 5 | 7.5 |
mpt-30b-chat | 3 | 4 | 3 | 5 | 6.5 |
vicuna-33b | 5 | 4 | 4 | 5 | 6.5 |
The answer of vicuna-33b has the highest helpfulness but the lowest overall score.
My question is: should I pick the answer with the highest overall score or the highest helpfulness score as the preferred answer, or should I use the mean of the four principles?
Any suggestions would be appreciated, thanks.
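The three strategies in the question can be compared directly on the scores quoted above. A sketch (the dictionary keys are my own naming, not the dataset schema; note that the aspect-mean strategy produces a three-way tie here, which `max` breaks by insertion order):

```python
from statistics import mean

# Scores quoted for the 9th sample of evol_instruct.jsonl.
scores = {
    "gpt-3.5-turbo":    {"helpfulness": 4, "honesty": 5,
                         "instruction_following": 4, "truthfulness": 5,
                         "overall": 7.0},
    "llama-2-70b-chat": {"helpfulness": 4, "honesty": 4,
                         "instruction_following": 5, "truthfulness": 5,
                         "overall": 7.5},
    "mpt-30b-chat":     {"helpfulness": 3, "honesty": 4,
                         "instruction_following": 3, "truthfulness": 5,
                         "overall": 6.5},
    "vicuna-33b":       {"helpfulness": 5, "honesty": 4,
                         "instruction_following": 4, "truthfulness": 5,
                         "overall": 6.5},
}

ASPECTS = ("helpfulness", "honesty", "instruction_following", "truthfulness")

# Strategy 1: pick by overall score.
by_overall = max(scores, key=lambda m: scores[m]["overall"])
# Strategy 2: pick by helpfulness alone.
by_helpfulness = max(scores, key=lambda m: scores[m]["helpfulness"])
# Strategy 3: pick by mean of the four fine-grained aspects
# (gpt-3.5-turbo, llama-2-70b-chat, and vicuna-33b all tie at 4.5).
by_mean = max(scores, key=lambda m: mean(scores[m][a] for a in ASPECTS))

print(by_overall)      # llama-2-70b-chat
print(by_helpfulness)  # vicuna-33b
print(by_mean)         # gpt-3.5-turbo (first of the tied models)
```

The three strategies disagree on this sample, which is exactly the question being asked; which one is "right" depends on what signal the downstream reward model should learn, and that is for the authors to advise on.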