Hi there,
Thanks for your question. It sounds like you are hitting an early plateau, a common problem in language model training: the validation loss stops improving after a certain number of epochs even though the training loss continues to decrease.
There are a few possible causes. One is overfitting: the model is too complex for the data, or the training data is not diverse enough, so it memorizes rather than generalizes. Another is a learning rate that is too high, which causes the optimizer to jump around the loss landscape instead of converging.
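One way to act on the learning-rate hypothesis is to cut the rate automatically once the validation loss stops improving. A minimal PyTorch sketch using `ReduceLROnPlateau` (the parameter and the loss values here are placeholders, not from megabyte-pytorch):

```python
import torch

# Placeholder parameter standing in for a real model.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=1e-3)

# Halve the learning rate after 2 epochs without improvement,
# directly targeting the "learning rate too high" failure mode.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

# Simulated validation losses: one improvement, then a plateau.
for val_loss in [2.0, 1.9, 1.9, 1.9, 1.9, 1.9]:
    scheduler.step(val_loss)

print(optimizer.param_groups[0]["lr"])  # reduced from 1e-3
```

Passing the validation loss to `scheduler.step()` each epoch is all that is needed; the schedule adapts on its own when the plateau appears.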
In your case, the batch size of 4 is a likely contributor. With so few examples per step, the gradient estimates are very noisy, which can stall convergence and encourage the model to latch onto quirks of individual batches rather than generalize.
You can try to address the early plateau problem by doing the following:
- Increase the batch size. Larger batches give less noisy gradient estimates, which stabilizes training; gradient accumulation achieves the same effect when memory is limited.
- Use a different optimizer. AdamW, for example, decouples weight decay from the adaptive update, which tends to regularize better than plain Adam.
- Reduce the learning rate. Smaller steps help the model settle into a minimum instead of jumping around the loss landscape.
- Add regularization. Techniques such as dropout and weight decay (L2 regularization) help prevent overfitting.
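Several of these suggestions can be combined in a few lines. A hedged PyTorch sketch (the tiny model below is a placeholder, not the actual MEGABYTE architecture):

```python
import torch
import torch.nn as nn

# Placeholder byte-level model; the real architecture is more complex.
model = nn.Sequential(
    nn.Embedding(256, 64),   # 256 byte values -> 64-dim embeddings
    nn.Dropout(p=0.1),       # dropout regularization
    nn.Linear(64, 256),      # project back to byte logits
)

# AdamW decouples weight decay (L2-style regularization) from the
# adaptive update, and a reduced learning rate takes smaller steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# A larger batch (32 vs. 4) gives less noisy gradient estimates.
batch = torch.randint(0, 256, (32, 128))   # (batch, sequence) of bytes
logits = model(batch)                      # (32, 128, 256)
loss = nn.functional.cross_entropy(logits.view(-1, 256), batch.view(-1))

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(logits.shape)
```

The specific values (dropout 0.1, weight decay 0.01, lr 1e-4, batch 32) are common starting points, not tuned recommendations.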
If you are still experiencing the early plateau problem after trying these suggestions, then you may need to increase the size of your dataset. This will give the model more data to learn from and help it to generalize better to new data.
As for your question about scaling training to larger devices: yes, other hyperparameters usually need adjusting. A common heuristic is to increase the batch size to use the extra memory and to scale the learning rate proportionally (the linear scaling rule), often with a warmup period. Optimizer choice, such as AdamW, can matter here as well.
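The linear scaling heuristic can be sketched in a few lines (the names here are illustrative, not from megabyte-pytorch, and the rule is a starting point rather than a guarantee):

```python
# Linear scaling rule (a common large-batch training heuristic):
# scale the learning rate in proportion to the batch size.
BASE_LR = 1e-4        # learning rate tuned at the original batch size
BASE_BATCH_SIZE = 4   # the batch size from the question

def scaled_lr(new_batch_size: int) -> float:
    """Return a learning rate scaled proportionally to batch size."""
    return BASE_LR * new_batch_size / BASE_BATCH_SIZE

# Moving from batch size 4 to 32 suggests an 8x larger learning rate.
print(scaled_lr(32))  # 8e-04
```

In practice the scaled rate is usually reached gradually via warmup rather than applied from step one.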
I hope this helps!
from megabyte-pytorch.