Hi there,
Thanks for your question. It sounds like you are hitting an early plateau, a common problem in language model training: the validation loss stops improving after a certain number of epochs even though the training loss continues to decrease.
There are a few possible causes. One is overfitting: the model is too complex for the data, or the training data is not diverse enough, so it memorizes rather than generalizes. Another is a learning rate that is too high, which causes the optimizer to jump around the loss landscape instead of converging.
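One way to act on the learning-rate hypothesis is to cut the rate automatically once the validation loss stops improving. A minimal PyTorch sketch using `ReduceLROnPlateau` (the parameter and the loss values here are placeholders, not from megabyte-pytorch):

```python
import torch

# Placeholder parameter standing in for a real model.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=1e-3)

# Halve the learning rate after 2 epochs without improvement,
# directly targeting the "learning rate too high" failure mode.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

# Simulated validation losses: one improvement, then a plateau.
for val_loss in [2.0, 1.9, 1.9, 1.9, 1.9, 1.9]:
    scheduler.step(val_loss)

print(optimizer.param_groups[0]["lr"])  # reduced from 1e-3
```

Passing the validation loss to `scheduler.step()` each epoch is all that is needed; the schedule adapts on its own when the plateau appears.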
In your case, the batch size of 4 is a likely contributor. With so few examples per step, the gradient estimates are very noisy, which can stall convergence and encourage the model to latch onto quirks of individual batches rather than generalize.
You can try to address the early plateau problem by doing the following:
- Increase the batch size. Larger batches give less noisy gradient estimates, which stabilizes training; gradient accumulation achieves the same effect when memory is limited.
- Use a different optimizer. AdamW, for example, decouples weight decay from the adaptive update, which tends to regularize better than plain Adam.
- Reduce the learning rate. Smaller steps help the model settle into a minimum instead of jumping around the loss landscape.
- Add regularization. Techniques such as dropout and weight decay (L2 regularization) help prevent overfitting.
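Several of these suggestions can be combined in a few lines. A hedged PyTorch sketch (the tiny model below is a placeholder, not the actual MEGABYTE architecture):

```python
import torch
import torch.nn as nn

# Placeholder byte-level model; the real architecture is more complex.
model = nn.Sequential(
    nn.Embedding(256, 64),   # 256 byte values -> 64-dim embeddings
    nn.Dropout(p=0.1),       # dropout regularization
    nn.Linear(64, 256),      # project back to byte logits
)

# AdamW decouples weight decay (L2-style regularization) from the
# adaptive update, and a reduced learning rate takes smaller steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# A larger batch (32 vs. 4) gives less noisy gradient estimates.
batch = torch.randint(0, 256, (32, 128))   # (batch, sequence) of bytes
logits = model(batch)                      # (32, 128, 256)
loss = nn.functional.cross_entropy(logits.view(-1, 256), batch.view(-1))

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(logits.shape)
```

The specific values (dropout 0.1, weight decay 0.01, lr 1e-4, batch 32) are common starting points, not tuned recommendations.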
If you are still experiencing the early plateau problem after trying these suggestions, then you may need to increase the size of your dataset. This will give the model more data to learn from and help it to generalize better to new data.
As for your question about scaling training to larger devices: yes, other hyperparameters usually need adjusting. A common heuristic is to increase the batch size to use the extra memory and to scale the learning rate proportionally (the linear scaling rule), often with a warmup period. Optimizer choice, such as AdamW, can matter here as well.
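The linear scaling heuristic can be sketched in a few lines (the names here are illustrative, not from megabyte-pytorch, and the rule is a starting point rather than a guarantee):

```python
# Linear scaling rule (a common large-batch training heuristic):
# scale the learning rate in proportion to the batch size.
BASE_LR = 1e-4        # learning rate tuned at the original batch size
BASE_BATCH_SIZE = 4   # the batch size from the question

def scaled_lr(new_batch_size: int) -> float:
    """Return a learning rate scaled proportionally to batch size."""
    return BASE_LR * new_batch_size / BASE_BATCH_SIZE

# Moving from batch size 4 to 32 suggests an 8x larger learning rate.
print(scaled_lr(32))  # 8e-04
```

In practice the scaled rate is usually reached gradually via warmup rather than applied from step one.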
I hope this helps!
from megabyte-pytorch.