
Comments (7)

timothylimyl commented on May 14, 2024

Yeah, I was thinking the same: it should be per_device_train_batch_size: 4 instead of 8, since the assumption is 8 GPUs here.
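For concreteness, here is a minimal sketch of the batch-size arithmetic, assuming no gradient accumulation and taking the suggested 8 → 4 change on 8 GPUs to imply an intended global batch of 32 (an inference from this thread, not a confirmed figure):

```python
# Effective (global) batch = per-device batch x number of GPUs x grad accum steps.
# The target of 32 is inferred from the suggested fix above, not confirmed.
def global_batch_size(per_device: int, num_gpus: int, grad_accum: int = 1) -> int:
    return per_device * num_gpus * grad_accum

print(global_batch_size(per_device=4, num_gpus=8))  # 32 -- suggested fix on 8 GPUs
print(global_batch_size(per_device=8, num_gpus=8))  # 64 -- config as written on 8 GPUs
print(global_batch_size(per_device=8, num_gpus=4))  # 32 -- batch 8 matches on 4 GPUs
```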

However, I think the mistake kind of propagated into their own replication: https://huggingface.co/alignment-handbook/zephyr-7b-dpo-full

For the officially released model, as you mentioned, the DPO batch parameters seem to contradict their replication based on this repo.

Also, the official model card has no details on the SFT part, so I do not yet know whether the HF team's official release used LoRA or full fine-tuning.


liutianlin0121 commented on May 14, 2024

Hi!

I re-ran MT-bench to compare the two public DPO-trained zephyr-7b checkpoints:

  1. HuggingFaceH4/zephyr-7b-beta, and
  2. alignment-handbook/zephyr-7b-dpo-full

[Figure: MT-bench radar chart comparing the two checkpoints]

The MT-bench score of HuggingFaceH4/zephyr-7b-beta (blue curves above) closely reproduces the number reported in the paper. The number is 7.34 in the paper (Table 1), and the score from my re-run is 7.37.

But the MT-bench score of alignment-handbook/zephyr-7b-dpo-full (yellow curves above) was worse overall. The score is 7.09.

There could be multiple reasons, such as:

  • the randomness of the GPT-4 evaluation used in MT-bench (if anyone has the resources to re-run MT-bench multiple times, that would be great; see the sketch after this list)
  • the difference in the SFT step (the two models used different SFT checkpoints)
  • the difference in the DPO step (e.g., the global batch size difference that I mentioned; I am not sure if this is the only difference).
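On the first point, here is a minimal sketch of how judge noise could be quantified, assuming each MT-bench re-run's per-question scores were exported to a JSON file (the file names and record layout below are hypothetical):

```python
import json
import statistics

# Hypothetical inputs: each file holds one re-run's per-question scores,
# e.g. [{"question_id": 81, "score": 8.0}, ...].
runs = ["mt_bench_run1.json", "mt_bench_run2.json", "mt_bench_run3.json"]

means = []
for path in runs:
    with open(path) as f:
        scores = [rec["score"] for rec in json.load(f)]
    means.append(sum(scores) / len(scores))

# If the spread across re-runs is small relative to the 7.37 vs 7.09 gap,
# judge randomness alone probably does not explain the difference.
print(f"mean of per-run means: {statistics.mean(means):.2f}")
print(f"std dev across runs:  {statistics.stdev(means):.2f}")
```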

I am wondering if you have any insights, @lewtun. It would be great if we could use the recipe to re-train the stronger HuggingFaceH4/zephyr-7b-beta with an MT-bench score of 7.37. 🙏


timothylimyl commented on May 14, 2024

Anyway, I am running a DPO experiment on 4 GPUs, leaving the per-device batch size at 8. If the loss is the same, then I can confirm that the official release used the ....-sft-full model and that the batch size is correct.


timothylimyl commented on May 14, 2024

@liutianlin0121 I do not think the issue is with replicating the model (since we re-train following the provided recipes). It seems that even the officially released Hugging Face model's score has degraded.


liutianlin0121 commented on May 14, 2024

> @liutianlin0121 I do not think the issue is with replicating the model (since we re-train following the provided recipes). It seems that even the officially released Hugging Face model's score has degraded.

Yeah, my objective is to reproduce the original model HuggingFaceH4/zephyr-7b-beta. Using the existing codebase, I suppose I can reproduce the handbook model alignment-handbook/zephyr-7b-dpo-full, but the latter is somehow weaker on MT-bench than the former.


timothylimyl commented on May 14, 2024

> But the MT-bench score of alignment-handbook/zephyr-7b-dpo-full (yellow curves above) was worse overall. The score is 7.09.

@liutianlin0121,

It seems that I misunderstood your post.

Just to confirm: you were able to replicate (closely enough) the official HF model's MT-Bench score of 7.37?


liutianlin0121 commented on May 14, 2024

@timothylimyl Yes, I was able to reproduce the MT-Bench score for the official model, but I ran that evaluation a few weeks ago. To debug, perhaps it would be useful to take a look at the GPT-4-generated judgements at data/mt_bench/model_judgment/gpt-4_single.jsonl. Do they appear reasonable?

In one of my early MT-Bench runs, I used too many concurrent API calls with

python gen_judgment.py --model-list [LIST-OF-MODEL-ID] --parallel A_LARGE_NUMBER_LIKE_8_or_16

This caused errors in the GPT-4 model judgements at data/mt_bench/model_judgment/gpt-4_single.jsonl: some score fields were populated with $error, and those entries were silently omitted when computing the mean scores. After switching to a single concurrent API call, the evaluation was not much slower. I am not sure whether this happened in your evaluation, but it may be helpful to manually inspect several model judgements.
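As a starting point for that inspection, here is a minimal sketch that scans the judgement file for failed entries, assuming it is JSON lines with a score field (the field names follow my reading of FastChat's single-answer grading output and should be treated as assumptions):

```python
import json

# Path as mentioned above; "score", "question_id", and "model" are assumed field names.
path = "data/mt_bench/model_judgment/gpt-4_single.jsonl"

bad, total = 0, 0
with open(path) as f:
    for line in f:
        rec = json.loads(line)
        total += 1
        # Failed judgements may carry a sentinel (e.g. $error or -1) instead of a valid score.
        if not isinstance(rec.get("score"), (int, float)) or rec["score"] < 0:
            bad += 1
            print(rec.get("question_id"), rec.get("model"), rec.get("score"))

print(f"{bad}/{total} judgements look malformed")
```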

