gpt2-lora-practice

This is a practice project for LoRA fine-tuning of GPT-2, with performance evaluation across different optimizers and learning-rate schedulers.

Details

We apply LoRA to the last two layers of GPT-2, including the attention (attn) layer as well as the feed-forward (fnn) layer, on the dataset used in the official LoRA example. However, we ran into trouble when trying to fine-tune the LayerNorm (LN) layers in the same way.
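
Below is a minimal sketch of this setup using the Hugging Face peft library (the notebook may implement LoRA differently); the module names follow the Hugging Face GPT-2 implementation, and the rank, alpha, and dropout values are placeholders rather than the exact ones used in our runs.

```python
from transformers import GPT2LMHeadModel
from peft import LoraConfig, get_peft_model

model = GPT2LMHeadModel.from_pretrained("gpt2")
n_layer = model.config.n_layer  # 12 for the base GPT-2

# Target only the last two transformer blocks, both the attention and
# feed-forward projections (GPT-2 implements these as Conv1D modules).
target_modules = []
for i in (n_layer - 2, n_layer - 1):
    target_modules += [
        f"h.{i}.attn.c_attn",
        f"h.{i}.attn.c_proj",
        f"h.{i}.mlp.c_fc",
        f"h.{i}.mlp.c_proj",
    ]

lora_cfg = LoraConfig(
    r=8,                   # placeholder rank; the experiments sweep several values
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=target_modules,
    fan_in_fan_out=True,   # needed because GPT-2 uses Conv1D instead of Linear
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```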

We use AdamW, CAME, and Adafactor as optimizers, and LambdaLR, ExponentialLR, CosineAnnealingLR, and AdafactorSchedule as learning-rate schedulers. For AdamW and CAME, we found that both LambdaLR and CosineAnnealingLR perform well, with the latter slightly better. Adafactor only works with AdafactorSchedule, and even then its convergence is poor; all of these optimizers fail with ExponentialLR. To understand why, we also tried a different initialization scheme, replacing kaiming_uniform_ with kaiming_normal_, but the situation did not improve.
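
A hedged sketch of how these optimizer/scheduler pairs can be wired up is shown below; CAME comes from the third-party came_pytorch package, and the learning rate and step counts are assumptions rather than the notebook's exact settings.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
from transformers.optimization import Adafactor, AdafactorSchedule
from came_pytorch import CAME  # third-party package: pip install came-pytorch

trainable_params = [p for p in model.parameters() if p.requires_grad]

def build_optimizer(name, num_training_steps, lr=1e-4):
    """Return (optimizer, scheduler) for one of the configurations compared above."""
    if name == "adamw":
        opt = torch.optim.AdamW(trainable_params, lr=lr)
        sched = CosineAnnealingLR(opt, T_max=num_training_steps)  # best pairing we observed
    elif name == "came":
        opt = CAME(trainable_params, lr=lr)
        sched = CosineAnnealingLR(opt, T_max=num_training_steps)
    else:  # "adafactor": only ran at all with its own relative-step schedule
        opt = Adafactor(trainable_params, lr=None, scale_parameter=True,
                        relative_step=True, warmup_init=True)
        sched = AdafactorSchedule(opt)
    return opt, sched
```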

It seems that the existing LoRA method cannot be applied to the LN layer; one may instead consider the tuning method in "Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning".
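
A simple alternative in the spirit of that paper (a sketch, not the paper's exact recipe) is to skip LoRA on the LN layers entirely and just unfreeze their affine parameters in the last two blocks:

```python
# Unfreeze the LayerNorm weights/biases of blocks 10 and 11 in the (peft-wrapped)
# 12-layer GPT-2; these few parameters are then trained directly alongside LoRA.
for name, param in model.named_parameters():
    if ("h.10." in name or "h.11." in name) and ("ln_1" in name or "ln_2" in name):
        param.requires_grad = True
```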

Observations

Illustrations of the different methods can be found in "./save/attention/attention" and "./save/fnn/fnn".

The LoRA performance with AdamW and CAME is stable and good. Adafactor may need more tuning to work normally, since the mean of the LoRA matrix shows a monotonically changing trend.

For AdamW, the mean of the last layer decreases quickly at first, and similarly for the second-to-last layer, which suggests using a larger initial learning rate with decay for LoRA.

For CAME, the mean of the last layer changes dramatically, which may suggest a smaller learning rate; for the second-to-last layer, the mean of the LoRA matrix stays above zero, which may suggest an initialization scheme with larger values.

For Adafactor, we tried different LoRA ranks and different initialization schemes, but it always fails to converge in this setting.

Update 2024.4.16: uploaded the weights_distribution rar file.

We provide histograms of the parameter distributions of LoRA A and LoRA B in the second-to-last and last layers, for both the attention (attn) and feed-forward (fnn) layers. The distribution shifts are compared across AdamW, CAME, and Adafactor.

Within the first batch * gradient_accumulation_steps, we report histograms of the parameter distributions at every gradient accumulation step, because the distributions change rapidly while converging. After that, we report the distributions only every num_update_steps, since the changes become small as the model approaches convergence.
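
The sketch below illustrates this logging schedule; dataloader, optimizer, scheduler, gradient_accumulation_steps, and num_update_steps are assumed names standing in for the notebook's own training loop.

```python
import matplotlib.pyplot as plt

def log_lora_histograms(model, tag, out_dir="./save"):
    # Dump a histogram for every LoRA A / LoRA B weight matrix.
    for name, param in model.named_parameters():
        if "lora_A" in name or "lora_B" in name:
            plt.figure()
            plt.hist(param.detach().float().cpu().numpy().ravel(), bins=100)
            plt.title(f"{name} - {tag}")
            plt.savefig(f"{out_dir}/{name.replace('.', '_')}_{tag}.png")
            plt.close()

micro_step = 0
for batch in dataloader:
    loss = model(**batch).loss / gradient_accumulation_steps
    loss.backward()
    micro_step += 1
    if micro_step <= gradient_accumulation_steps:
        # First optimizer update: log at every gradient accumulation step.
        log_lora_histograms(model, f"within_batch_step{micro_step}")
    if micro_step % gradient_accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        update = micro_step // gradient_accumulation_steps
        if update % num_update_steps == 0:
            # Later training: log only every num_update_steps updates.
            log_lora_histograms(model, f"update{update}")
```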

AdamW

For example, with AdamW, LoRA A of the last layer starts with a uniform distribution (Lora A lastlayer - Epoch 15_within a batch); after one full num_update_steps, the LoRA A parameters shift toward a Gaussian distribution with mean 0 (Lora A lastlayer - Epoch 1). As the number of training iterations increases, the parameters eventually settle into a narrower Gaussian distribution (Lora A lastlayer - Epoch 42).

Similarly, with AdamW, LoRA B of the last layer starts from an all-zero distribution; after one round of updates its parameters take on a concave distribution (Lora B lastlayer - Epoch 47_within a batch), and they finally converge to a normal distribution (Lora B lastlayer - Epoch 42).

This observation suggests that zero initialization is not a good strategy for LoRA B under AdamW; a normal initialization seems more natural, since LoRA B eventually converges to a normal distribution anyway. However, standard LoRA initializes one of A and B to all zeros precisely so that the product BA is zero at the start and the adapted model initially matches the pretrained model. In our experiments, both A and B end up normally distributed under AdamW. A remaining question is what kind of normal initialization (which mean and standard deviation) would be appropriate.
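
For reference, the snippet below shows how such an alternative initialization could be tried on a peft-wrapped model; the std of 0.02 is an arbitrary assumption, and note that a non-zero LoRA B means the adapted model no longer starts exactly from the pretrained weights.

```python
import torch
import torch.nn as nn

with torch.no_grad():
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and "lora_A" in name:
            nn.init.kaiming_normal_(module.weight)              # instead of the default kaiming_uniform_
        elif isinstance(module, nn.Linear) and "lora_B" in name:
            nn.init.normal_(module.weight, mean=0.0, std=0.02)  # instead of all zeros; BA != 0 at init
```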

CAME

With CAME, both LoRA A and LoRA B take on a clear normal distribution after one training update and converge to narrow normal distributions, with B having a smaller standard deviation (Lora A lastlayer - Epoch 42, Lora B lastlayer - Epoch 42). This may imply that normal initialization should replace zero initialization for LoRA B.

Adafactor

As for Adafactor, LoRA A barely changes toward a normal distribution, while LoRA B shifts slowly toward a concave distribution. In addition, the second-to-last layer converges better than the last layer, which is only close to converging. We therefore think that more training rounds, with normal initialization for LoRA A and uniform initialization for LoRA B, may lead to better performance. As for the learning rate, we already use the dedicated AdafactorSchedule. (Lora A secondlayer - Epoch 42, Lora B secondlayer - Epoch 42)


Similar observations hold for the fnn layer, which seems to converge more slowly than the attn layer. For example, for LoRA B of the last layer under Adafactor, the attn version has already changed from a normal to a concave distribution. (Lora B lastlayer - Epoch 42, Lora B lastlayer - Epoch 42)
