Comments (9)
I'd be happy to take a look at this.
If I understand correctly, you'd like me to:
- Start with the pretrained RankGen model
- Use the dot-product method between posts explained in the RankGen paper to score the summaries from the SFF dataset
- Treat the dot-product score as the reward for the RM loss from SFF
A few further questions:
- Do we want to then implement the policy optimization step as well?
- What is the baseline generative model we are using, e.g. to do comparisons with #77?
- What kind of compute resources do you expect this to require?
from open-assistant.
> 2. Use the dot product method between posts explained in rankgen paper to score the summaries from the SFF dataset

I am not sure the method described in the RankGen paper can be directly applied here (I only skimmed the paper). If you know more, let me know... My first thoughts were to take the RankGen embedding, project it to a scalar reward, and then finetune with a pairwise cross-entropy loss, e.g. `loss = -torch.mean(torch.log(torch.sigmoid(pos - neg)))`.
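That projection-plus-pairwise-loss idea can be sketched as follows. This is a minimal sketch, not code from the repo: the `RewardHead` name, the embedding size, and the random inputs standing in for real RankGen embeddings are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Projects a fixed-size text embedding down to a scalar reward."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(embed_dim, 1)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.proj(embedding).squeeze(-1)  # (batch,)

def pairwise_rm_loss(pos: torch.Tensor, neg: torch.Tensor) -> torch.Tensor:
    """Pairwise cross-entropy: push reward(preferred) above reward(rejected)."""
    return -torch.mean(torch.log(torch.sigmoid(pos - neg)))

# Toy usage: embeddings of the preferred and rejected summary of the same post.
head = RewardHead(embed_dim=16)
pos_emb, neg_emb = torch.randn(4, 16), torch.randn(4, 16)
loss = pairwise_rm_loss(head(pos_emb), head(neg_emb))
loss.backward()  # gradients flow into the projection head
```

`torch.nn.functional.logsigmoid(pos - neg)` is a numerically safer equivalent of the `log(sigmoid(...))` composition above.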
> A few further questions:
> 1. Do we want to then implement the policy optimization step as well?
That will be the topic of a separate issue.
> 2. What is the baseline generative model we are using, e.g. to do comparisons with
The idea is to use an equal train/val split of the OpenAI summarization dataset for both reward models: the instructor-based one from #77 and the other one based on RankGen. The simplest metric we could compare is accuracy, i.e. how often the models match the preference ranking provided by humans, e.g. `rm(human_preferred_example) > rm(other_element_of_examples_pair)`.
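That accuracy metric takes only a few lines to compute. A sketch, assuming the reward model has already been run and each pair holds the scores for the human-preferred summary and the other one:

```python
def preference_accuracy(pairs):
    """Fraction of pairs where the reward model ranks the human-preferred
    summary above the alternative. Each pair is
    (rm_score_preferred, rm_score_other)."""
    hits = sum(1 for preferred, other in pairs if preferred > other)
    return hits / len(pairs)

# Toy scores: the RM agrees with the human ranking on 3 of 4 pairs.
scores = [(0.9, 0.2), (0.4, 0.7), (1.3, 0.1), (0.8, 0.5)]
print(preference_accuracy(scores))  # 0.75
```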
> 3. What kind of compute resources do you expect this to require?
That would be a good first thing you could find out and report back. :-)
As another reference, you could also take a look at the RM code by @copycat: https://github.com/theblackcat102/copycat/blob/master/train_critic.py
> I am not sure if the method described in the rankgen paper can be directly applied here (I only skimmed the paper)? If you know more, let me know... My first thoughts were to take the rankgen embedding and project it to a scalar reward and then finetune with cross-entropy loss (e.g. `loss = -torch.mean(torch.log(torch.sigmoid(pos - neg)))`).
Okay, yeah, I also just skimmed it, but it seems RankGen works by embedding the prefix and suffix into vectors of the same size and then taking their dot product, which is, of course, already a scalar and (in theory) should be fine to use as the reward.
My intuition is that just doing a linear probe from the prefix or combining the prefix and suffix into a single string and projecting down from that embedding wouldn't take advantage of the pretrained weights as much, but maybe I can experiment with both approaches.
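The dot-product-as-reward idea could look like the sketch below. The two linear layers are random stand-ins for RankGen's actual prefix and suffix encoders (which map text to same-sized vectors); the toy feature vectors replace real tokenized text.

```python
import torch
import torch.nn as nn

embed_dim = 32
# Stand-ins for RankGen's prefix/suffix encoders: both map their input to
# vectors of the same size so the dot product is well defined.
prefix_encoder = nn.Linear(100, embed_dim)
suffix_encoder = nn.Linear(100, embed_dim)

def reward(prefix_feats: torch.Tensor, suffix_feats: torch.Tensor) -> torch.Tensor:
    """RankGen-style score: dot product of the prefix and suffix embeddings.
    Already a scalar per example, so it can be used directly as a reward."""
    p = prefix_encoder(prefix_feats)   # (batch, embed_dim)
    s = suffix_encoder(suffix_feats)   # (batch, embed_dim)
    return (p * s).sum(dim=-1)         # (batch,)

post = torch.randn(2, 100)      # stands in for the post / prefix
summary = torch.randn(2, 100)   # stands in for the candidate summary / suffix
r = reward(post, summary)
print(r.shape)  # torch.Size([2])
```

The same pairwise loss from above can then be applied to `reward(post, preferred)` and `reward(post, rejected)` without any extra projection head.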
> That will be the topic of a separate issue.
👍
> The idea is to use an equal train/val split of the OpenAI summarization dataset for both reward models ... the instructor based #77 one and the other one based on rankgen. The simplest metric that we could compare is accuracy (how often the models match the preference ranking provided by humans, e.g. `rm(human_preferred_example) > rm(other_element_of_examples_pair)`).
Sounds good. I was unsure whether we were going to do the full RLHF process as well, so I was wondering which LM we were supposed to optimize. It makes sense to compare the reward models independently first, though.
> That would be a good first thing you could find out and report back. :-)
Will do!
> My intuition is that just doing a linear probe from the prefix or combining the prefix and suffix into a single string and projecting down from that embedding wouldn't take advantage of the pretrained weights as much, but maybe I can experiment with both approaches.
Ok, I see ... in our case we indeed have a well-defined prefix (e.g. the user's instruction / the conversation so far). For other cases it is probably possible to split the text somewhere and compute the 'coherence' of those two segments (as they do in the paper for beam search). The interesting part for us would then be training the model on the ranking data that we generate: we will be able to get multiple results for a given prefix from our DB, with combined ranking scores, which allows us to generate preference pairs for training.
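Turning a ranked set of completions for one prefix into pairwise training examples might look like this; the data layout (text plus combined ranking score) is a hypothetical stand-in for whatever the DB actually returns:

```python
from itertools import combinations

def ranking_to_pairs(completions):
    """completions: list of (text, combined_ranking_score) for one prefix.
    Returns (preferred, rejected) text pairs for the pairwise RM loss,
    one pair per ordered combination of distinct completions."""
    ordered = sorted(completions, key=lambda c: c[1], reverse=True)
    return [(a[0], b[0]) for a, b in combinations(ordered, 2)]

replies = [("summary A", 2.5), ("summary B", 0.3), ("summary C", 1.1)]
print(ranking_to_pairs(replies))
# [('summary A', 'summary C'), ('summary A', 'summary B'), ('summary C', 'summary B')]
```

N ranked completions yield N*(N-1)/2 pairs, so even a handful of ranked replies per prefix gives a useful amount of pairwise training signal.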
You can also take a look at the trlx library for a reward function implementation.
Hey, a few Qs:
- Any suggestions for how to compare with the other tasks? I'm thinking we might want to make comparing our pretrained models its own issue, since all these rankers are being built in parallel by different people... Otherwise, perhaps we should just use the loss or some other figures of merit, e.g. treat this as binary classification and use standard accuracy, F1 score, ROC, etc., but I think we should at least have a fixed list of the summarize-from-feedback examples, right?
- I heard there was a Discord; can I get an invite?
- I'll make a PR tonight with a sketch of what I'm doing. So far I have just been working with the webgpt dataset, but I think I have the basic training infrastructure set up for the model. It seems like my PC can't handle the xl t5 model, but I can definitely train a t5-base model and maybe t5-large. Will we have training infra somewhere for this project?
> You can also take a look at trlx library for reward function implementation.
Thanks I'll take a look :)
Added a draft PR; mainly working out of this notebook for now:
As per discussion with @theblackcat102, I decided to build this on top of their trainer. The code for this is in the new PR. I am also training the model on W&B here