The implementation of layerwise learning rate decay in ELECTRA (closed)

google-research commented on July 2, 2024

The implementation of layerwise learning rate decay

from electra.

Comments (2)

clarkkev commented on July 2, 2024 1

For the layerwise learning rate decay, we count the task-specific layer added on top of the pre-trained transformer as an additional layer of the model, so the learning rate for the last layer of ELECTRA should be learning_rate * 0.8. But you've still found a bug: it is actually learning_rate * 0.8^2.

The bug happened because there used to be a pooler layer in ELECTRA before we removed the next-sentence-prediction task. In that case, the learning rates per layer were:

task-specific softmax: learning_rate
pooler: learning_rate * 0.8
transformer layer 24: learning_rate * 0.8^2
transformer layer 23: learning_rate * 0.8^3
...
However, when we removed the pooler layer, we didn't update the learning rates accordingly. I guess in practice this didn't hurt performance much, so I'm leaving it as-is for now to keep results reproducible.
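
The schedule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in the ELECTRA repository: the function name `layerwise_lrs`, its parameters, and the component names are hypothetical, and the base learning rate is set to 1.0 so that each value is just the decay multiplier. Embeddings are left out for brevity.

```python
def layerwise_lrs(base_lr, n_layers, decay=0.8, has_pooler=False):
    """Return {component name: learning rate}, counting depth from the top.

    Hypothetical helper: the task-specific layer sits at depth 0, an optional
    pooler at depth 1, and the transformer layers follow from top to bottom.
    """
    names = ["task_specific"]
    if has_pooler:
        names.append("pooler")
    # Top transformer layer first (e.g. layer 24), then 23, ..., down to 1.
    names += [f"transformer_layer_{i}" for i in range(n_layers, 0, -1)]
    return {name: base_lr * decay ** depth for depth, name in enumerate(names)}


# Schedule from when the pooler still existed: layer 24 gets decay^2.
with_pooler = layerwise_lrs(1.0, n_layers=24, has_pooler=True)

# Intended schedule once the pooler is gone: layer 24 gets decay^1.
intended = layerwise_lrs(1.0, n_layers=24, has_pooler=False)

# The behaviour described in this issue: the pooler is gone, but the
# transformer layers keep the exponents from the old schedule.
actual = {k: v for k, v in with_pooler.items() if k != "pooler"}

for name in ("task_specific", "transformer_layer_24", "transformer_layer_23"):
    print(name, "intended:", intended[name], "actual:", actual[name])
```

Under these assumptions, every transformer layer in `actual` ends up with a learning rate that is a factor of 0.8 smaller than in `intended`, while the task-specific layer is unaffected.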

importpandas commented on July 2, 2024

I got it, thanks for your explanation.
