
Clarification on dataset mixer (alignment-handbook, open issue)

deep-diver commented on May 28, 2024


Comments (2)

shabie commented on May 28, 2024

From the comments, it looks like ONLY training samples from dataset_1, dataset_2, and dataset_3 are considered. There is no explanation of how each dataset contributes to the test_xxx split.

Each dataset should have separate train and test splits. The docstring makes this clear: the split names are expected to start with train_ and test_ respectively. The percentages sample a fraction of the datapoints from each train split. The corresponding test split is taken in full, since subsampling for validation seems pointless (unless validation is very expensive, in which case maybe).

If the confusion was that the data mixer automatically uses the "unused" part of the train split as a test dataset (the way sklearn lets us do), then no, that doesn't happen here. I like that, because the test set can never be mistakenly used for training just by changing the percentages of the mix.
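To make the behaviour described above concrete, here is a minimal pure-Python sketch. It is not the alignment-handbook implementation: `mix_splits`, the `sources` layout, and the demo data are all hypothetical names made up for illustration. It only mirrors the two rules as I understand them: the fraction subsamples each train split, and each test split is taken in full.

```python
import random


def mix_splits(sources, mixer, seed=42):
    """Hypothetical sketch of the mixing rules (not the handbook's code).

    sources: dataset name -> {"train": [...], "test": [...]}
    mixer:   dataset name -> fraction of that dataset's train split to keep
    """
    rng = random.Random(seed)
    train, test = [], []
    for name, frac in mixer.items():
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"fraction for {name!r} must be in [0, 1], got {frac}")
        # Subsample only the train split according to the fraction.
        rows = list(sources[name]["train"])
        rng.shuffle(rows)
        train.extend(rows[: int(frac * len(rows))])
        # The test split is taken in full; leftover train rows are never used as test data.
        test.extend(sources[name]["test"])
    rng.shuffle(train)
    return {"train": train, "test": test}


# Made-up demo data: two datasets, each with 10 train rows and 2 test rows.
demo_sources = {
    "dataset_1": {
        "train": [f"d1-train-{i}" for i in range(10)],
        "test": [f"d1-test-{i}" for i in range(2)],
    },
    "dataset_2": {
        "train": [f"d2-train-{i}" for i in range(10)],
        "test": [f"d2-test-{i}" for i in range(2)],
    },
}
demo_mixed = mix_splits(demo_sources, {"dataset_1": 0.5, "dataset_2": 1.0})
```

With these assumptions, changing a fraction only changes how many train rows are kept; the test rows stay the same no matter what the mix is.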

Anyhow, all of this is based on my understanding of the code. Hope it helps, and if I am wrong, please correct me :)


deep-diver commented on May 28, 2024

Thank you @shabie

I think it could be common to have a test dataset in a single repo while the training data comes from multiple sources.

At least this is my use case.
To do this, I ended up merging multiple datasets into a single one myself. I'm just hoping this could be done in alignment-handbook too.
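For reference, the manual merge mentioned above could look roughly like the sketch below. It is plain Python over lists of dicts, and `merge_train_sources` plus the demo rows are hypothetical names for illustration only. With Hugging Face `datasets` objects, `concatenate_datasets` would play the role of the loop.

```python
def merge_train_sources(train_sources, test_rows):
    """Hypothetical helper: combine several train sources with one shared test set.

    train_sources: source name -> list of row dicts
    test_rows:     list of row dicts from the single test repo
    """
    merged = []
    for name, rows in train_sources.items():
        # Tag each row with its origin so the merged set stays traceable.
        merged.extend({**row, "source": name} for row in rows)
    return {"train": merged, "test": list(test_rows)}


# Made-up demo rows: two train sources, one standalone test set.
demo_train_sources = {
    "source_a": [{"text": "a0"}, {"text": "a1"}],
    "source_b": [{"text": "b0"}],
}
demo_test_rows = [{"text": "t0"}, {"text": "t1"}]
demo_merged = merge_train_sources(demo_train_sources, demo_test_rows)
```

The result is a single dataset with one train split and one test split, which is the shape the mixer seems to expect from each entry.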
