
ultrafeedback's People

Contributors

cgq15, lifan-yuan, ningding97

ultrafeedback's Issues

How to obtain the comparison data?

Hello, and thank you to the author team for releasing the UltraFeedback dataset. I am currently trying to use it to train a reward model, but I have run into a question.

The dataset contains 64K instructions and 256K responses, and according to the paper's setup, 340K comparisons can be generated from them. How are these comparisons generated? I could not find this functionality in the project code. If it does exist, is it in the path below?

https://github.com/OpenBMB/UltraFeedback/tree/main/src/comparison_data_generation
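
In case it helps while waiting for an answer, here is a minimal sketch of how such pairwise comparisons could be assembled; the field names and the tie-handling rule are assumptions on my part, not the authors' actual pipeline:

from itertools import combinations

# Minimal sketch: build pairwise comparisons from each instruction's completions.
# The field names ("completions", "response", "overall_score") and the rule of
# skipping ties are assumptions, not the authors' actual pipeline.
def build_comparisons(records):
    comparisons = []
    for record in records:
        for a, b in combinations(record["completions"], 2):
            if a["overall_score"] == b["overall_score"]:
                continue  # skip ties; the paper's exact filtering may differ
            chosen, rejected = (a, b) if a["overall_score"] > b["overall_score"] else (b, a)
            comparisons.append({
                "instruction": record["instruction"],
                "chosen": chosen["response"],
                "rejected": rejected["response"],
            })
    return comparisons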

evol_instruct issues: prompts with missing data

import datasets

ds = datasets.load_dataset('openbmb/ultrafeedback')
print(ds['train'][490]['instruction'])

Gives

Add a requirement for the given prompt that the hashtag must also include the top 3 countries with the highest sustainable energy consumption in 2020, based on their percentage of total energy consumption.

But there is no "given prompt". This seems to be an issue with several of the evol_instruct prompts.
Also note that the completions for such samples include wild hallucinations, yet their ratings evaluate them as free of hallucinations.

In addition, even evol_instruct prompts that do include the prompt to be modified are often full of issues, with either the model or the evaluator interpreting it as a request to answer the original prompt.
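
A rough way to gauge how widespread the missing-context problem is (the "source" field name and the phrase-matching heuristic below are assumptions, not an official check):

import datasets

# Rough heuristic: count evol_instruct prompts that refer to a "given prompt",
# which suggests the original prompt was dropped during evolution. The "source"
# field name and the substring test are assumptions, not an official check.
ds = datasets.load_dataset("openbmb/ultrafeedback", split="train")
suspect = [
    ex["instruction"]
    for ex in ds
    if ex.get("source") == "evol_instruct" and "given prompt" in ex["instruction"].lower()
]
print(f"{len(suspect)} evol_instruct prompts mention a 'given prompt'")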

Code usage description

Hello

Thank you for your work! I'd like to build on your scripts to do something similar for a different language. Could you please describe the workflow one needs to follow to reproduce your work? I.e., which scripts to run, and in which order; where to store the downloaded flan, truthfulqa, ... datasets; etc. It is currently unclear to me how to proceed.

Thanks!

Bram

Missing harmlessness template or wrong reference

Description

Hi to whoever is reading this 🤗 I saw that within the code snippets you're referencing a harmlessness template, but there is none in the code (see the import below):

from preference_templates import system_prompt, instruction_following_template, truthfulness_template, honesty_template, harmlessness_template, helpfulness_template

Is that intended? Was there a harmlessness category that was dropped during the process? Is that something to be integrated within the next revision of the paper?

Thanks in advance!

About code and prompts

Thank you so much for your dedication; this is truly remarkable work. The lack of high-quality open-source preference data has indeed been a challenge. Do you have plans to also open-source the corresponding code and prompts, so we can utilize this workflow to generate our own datasets? Many thanks!

Issue with `overall_score` computation

Hi!

Congrats on this amazing project.

We've been exploring the data and identified an issue with very high overall_score responses. The issue seems to be related to this line. It causes responses with a critique rating of 1 to become a 10. We noticed this by looking at the critique rationale, which was highly negative for many (~2K) examples with an overall_score of 10.
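
A minimal sketch of how such cases can be surfaced for manual inspection (the nesting and field names, "completions", "critique", "overall_score", are assumptions about the released schema):

import datasets

# List completions with a perfect overall_score so their critiques can be
# inspected by hand. The nesting and field names ("completions", "critique",
# "overall_score") are assumptions about the released schema.
ds = datasets.load_dataset("openbmb/ultrafeedback", split="train")
perfect_critiques = []
for ex in ds:
    for comp in ex["completions"]:
        if comp.get("overall_score") == 10:
            perfect_critiques.append(comp.get("critique", ""))
print(f"{len(perfect_critiques)} completions have overall_score == 10")
for critique in perfect_critiques[:5]:
    print("-", critique[:200])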

Reproducing data generation

Thanks for providing the code to generate the UltraFeedback data. I tried running the files in the comparison_data_generation folder. First, the bash script run_vllm.sh points to a Python script named main_vllm_batch.py, which does not exist. Second, when I tried to run main_vllm.py, it seems that the function instruction_completion, which actually generates the responses from the model, is not used anywhere. Could you please describe how to run these scripts, i.e., generating the data from the models and then annotating it?
Thanks!
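
For what it's worth, a minimal sketch of the generation step one might expect instruction_completion to perform with vLLM; the model name, prompts, and sampling settings are placeholders, not the repository's actual configuration:

from vllm import LLM, SamplingParams

# Placeholder generation step: the model, prompts, and sampling settings below
# are illustrative only, not the values used by the UltraFeedback scripts.
prompts = ["Explain what a reward model is in one paragraph."]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)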

Training details for reproducing UltraCM

Thank you so much for sharing the data. It's very helpful for the RLHF community!

I found some hyper-parameters for training UltraCM in your paper, but I am still confused about the following questions:

  1. How do you prepare the training examples? It seems that the instruction, the completion, the feedback, and the overall score are filled into the ultracm_instruction_template defined on your demo page, but I'm not sure.
  2. How is the loss calculated? Did you apply masking to the input content, i.e., the instruction and the completion? (See the sketch after this list.)
  3. Did you compare tuning a critique model from an SFT model versus a pretrained checkpoint?
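
For question 2, a common recipe with Hugging Face Transformers is to set the labels of the prompt tokens to -100 so that only the feedback tokens contribute to the cross-entropy loss. A minimal sketch under that assumption (not necessarily the authors' exact setup):

import torch
from transformers import AutoTokenizer

# Mask the prompt (instruction + completion) with label -100 so the loss is
# computed only on the critique and score tokens. This is a common SFT recipe,
# not necessarily what the authors did; the template strings are placeholders.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

prompt = "### Instruction:\n...\n### Completion:\n...\n### Feedback:\n"   # filled template
target = "The response is accurate but overly verbose. Overall Score: 7"  # critique + score

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]

input_ids = torch.tensor(prompt_ids + target_ids)
labels = torch.tensor([-100] * len(prompt_ids) + target_ids)  # prompt tokens ignored by the loss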

Thanks again for your efforts!

The overall score is not matching with the principles

Hi,
I found that some answers with a higher overall_score have a lower helpfulness rating in the evol_instruct.jsonl dataset, where the principle is 100% helpfulness.

For example, the scores of the 9th sample in the evol_instruct.jsonl dataset are as follows:

model             helpfulness  honesty  instruction following  truthfulness  overall score
gpt-3.5-turbo     4            5        4                      5             7
llama-2-70b-chat  4            4        5                      5             7.5
mpt-30b-chat      3            4        3                      5             6.5
vicuna-33b        5            4        4                      5             6.5

The answer from vicuna-33b has the highest helpfulness rating but the lowest overall score.

My question is: should I pick the answer with the highest overall score or the one with the highest helpfulness rating as the preferred answer, or should I use the mean of the four principles?

Any suggestions would be appreciated, thanks.
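
For the mean-of-principles option mentioned above, a minimal sketch of one possible selection rule (the layout of the annotations field is an assumption about the released JSON, not a recommendation from the authors):

from statistics import mean

# Pick the preferred completion by the mean of the four fine-grained ratings
# instead of overall_score. The "annotations" layout and the handling of "N/A"
# ratings below are assumptions about the released JSON.
ASPECTS = ["helpfulness", "honesty", "instruction_following", "truthfulness"]

def aspect_mean(completion):
    ratings = []
    for aspect in ASPECTS:
        rating = completion["annotations"][aspect]["Rating"]
        if rating != "N/A":
            ratings.append(int(rating))
    return mean(ratings) if ratings else 0.0

def pick_preferred(completions):
    return max(completions, key=aspect_mean)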
