
Comments (7)

timothylimyl commented on May 14, 2024

Yeah, I was thinking the same: it should be per_device_train_batch_size: 4 instead of 8, since the assumption is 8 GPUs here.
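For concreteness, here is a minimal sketch of the batch-size arithmetic, assuming no gradient accumulation and taking the suggested 8 → 4 change on 8 GPUs to imply an intended global batch of 32 (an inference from this thread, not a confirmed figure):

```python
# Effective (global) batch = per-device batch x number of GPUs x grad accum steps.
# The target of 32 is inferred from the suggested fix above, not confirmed.
def global_batch_size(per_device: int, num_gpus: int, grad_accum: int = 1) -> int:
    return per_device * num_gpus * grad_accum

print(global_batch_size(per_device=4, num_gpus=8))  # 32 -- suggested fix on 8 GPUs
print(global_batch_size(per_device=8, num_gpus=8))  # 64 -- config as written on 8 GPUs
print(global_batch_size(per_device=8, num_gpus=4))  # 32 -- batch 8 matches on 4 GPUs
```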

However, I think the mistake kind of propagated into their own replication: https://huggingface.co/alignment-handbook/zephyr-7b-dpo-full

For the officially released model, as you mentioned, the DPO batch parameters seem to contradict their replication based on this repo.

Also, the official model card has no details on the SFT part, so I do not yet know whether the HF team's official release used LoRA or full fine-tuning.


liutianlin0121 commented on May 14, 2024

Hi!

I re-ran MT-bench to compare the two public DPO-trained zephyr-7b checkpoints:

  1. HuggingFaceH4/zephyr-7b-beta, and
  2. alignment-handbook/zephyr-7b-dpo-full

[Figure: MT-bench radar chart comparing the two checkpoints]

The MT-bench score of HuggingFaceH4/zephyr-7b-beta (blue curves above) closely reproduces the number reported in the paper. The number is 7.34 in the paper (Table 1), and the score from my re-run is 7.37.

But the MT-bench score of alignment-handbook/zephyr-7b-dpo-full (yellow curves above) was worse overall. The score is 7.09.

There could be multiple reasons, such as:

  • the randomness of the GPT-4 evaluation used in MT-bench (if anyone has the resources to re-run MT-bench multiple times, that would be great; see the sketch after this list)
  • the difference in the SFT step (the two models used different SFT checkpoints)
  • the difference in the DPO step (e.g., the global batch size difference that I mentioned; I am not sure if this is the only difference).
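On the first point, here is a minimal sketch of how judge noise could be quantified, assuming each MT-bench re-run's per-question scores were exported to a JSON file (the file names and record layout below are hypothetical):

```python
import json
import statistics

# Hypothetical inputs: each file holds one re-run's per-question scores,
# e.g. [{"question_id": 81, "score": 8.0}, ...].
runs = ["mt_bench_run1.json", "mt_bench_run2.json", "mt_bench_run3.json"]

means = []
for path in runs:
    with open(path) as f:
        scores = [rec["score"] for rec in json.load(f)]
    means.append(sum(scores) / len(scores))

# If the spread across re-runs is small relative to the 7.37 vs 7.09 gap,
# judge randomness alone probably does not explain the difference.
print(f"mean of per-run means: {statistics.mean(means):.2f}")
print(f"std dev across runs:  {statistics.stdev(means):.2f}")
```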

I am wondering if you have any insights, @lewtun. It would be great if we could use the recipe to re-train the stronger HuggingFaceH4/zephyr-7b-beta with an MT-bench score of 7.37. 🙏


timothylimyl commented on May 14, 2024

Anyway, I am running a DPO experiment on 4 GPUs, leaving the per-device batch size at 8. If the loss is the same, then I can confirm that the official release used the ....-sft-full model and that the batch size is correct.


timothylimyl commented on May 14, 2024

@liutianlin0121 I do not think the issue is with replicating the model (since we re-train following the provided recipes). It seems that even the officially released Hugging Face model's score has degraded.


liutianlin0121 commented on May 14, 2024

> @liutianlin0121 I do not think the issue is with replicating the model (since we re-train following the provided recipes). It seems that even the officially released Hugging Face model's score has degraded.

Yeah, my objective is to reproduce the original model HuggingFaceH4/zephyr-7b-beta. Using the existing codebase, I suppose I can reproduce the handbook model alignment-handbook/zephyr-7b-dpo-full, but the latter is somehow weaker on MT-bench than the former.


timothylimyl commented on May 14, 2024

> But the MT-bench score of alignment-handbook/zephyr-7b-dpo-full (yellow curves above) was worse overall. The score is 7.09.

@liutianlin0121,

It seems that I misunderstood your post.

Just to confirm: you were able to replicate (closely enough) the official HF model's MT-Bench score of 7.37?


liutianlin0121 commented on May 14, 2024

@timothylimyl Yes, I was able to reproduce the MT-Bench score for the official model, but I ran that evaluation a few weeks ago. To debug, perhaps it would be useful to take a look at the GPT-4-generated judgements at data/mt_bench/model_judgment/gpt-4_single.jsonl. Do they appear reasonable?

In one of my early MT-Bench runs, I used too many concurrent API calls with

python gen_judgment.py --model-list [LIST-OF-MODEL-ID] --parallel A_LARGE_NUMBER_LIKE_8_or_16

This caused errors in the GPT-4 model judgements at data/mt_bench/model_judgment/gpt-4_single.jsonl: some score fields were populated with $error, and those entries were silently omitted when computing the mean scores. After switching to a single concurrent API call, the evaluation was not much slower. I am not sure whether this happened in your evaluation, but it may be helpful to manually inspect several model judgements.
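As a starting point for that inspection, here is a minimal sketch that scans the judgement file for failed entries, assuming it is JSON lines with a score field (the field names follow my reading of FastChat's single-answer grading output and should be treated as assumptions):

```python
import json

# Path as mentioned above; "score", "question_id", and "model" are assumed field names.
path = "data/mt_bench/model_judgment/gpt-4_single.jsonl"

bad, total = 0, 0
with open(path) as f:
    for line in f:
        rec = json.loads(line)
        total += 1
        # Failed judgements may carry a sentinel (e.g. $error or -1) instead of a valid score.
        if not isinstance(rec.get("score"), (int, float)) or rec["score"] < 0:
            bad += 1
            print(rec.get("question_id"), rec.get("model"), rec.get("score"))

print(f"{bad}/{total} judgements look malformed")
```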

