Comments (5)
Multi-node training is now supported as of #208.
@LifeIsStrange you can also open issues on Megatron-LM now, thanks for contributing!
from deeplearningexamples.
I read in this article that a BERT model was trained on 64 GPUs in 3.3 days:
https://medium.com/future-vision/bert-meets-gpus-403d3fbed848?fbclid=IwAR0bFskUVVKDRyYF-9cQGgRXeq7dTvteGHi10HaTG5zI7_eE8oW-BfrxYQw
Was that done with this repo or with the PyTorch version at https://github.com/NVIDIA/Megatron-LM?
Could you help me train a BERT model with distributed training? Thanks.
from deeplearningexamples.
So far we have published scripts only for single-node training. Stay tuned for distributed multi-node training scripts; we will publish them soon.
from deeplearningexamples.
@swethmandava
Why doesn't Megatron allow us to open issues?
For example, it would be nice if it supported
https://github.com/zihangdai/xlnet
which is the new state of the art (it consistently beats BERT), as you can see on paperswithcode.com.
It also does not yet support multi-GPU: zihangdai/xlnet#218
(It would be nice to support ERNIE 2.0 too, but that is less of a priority.)
from deeplearningexamples.
BTW, NVIDIA is already contributing to XLNet, e.g. this NVIDIA employee:
zihangdai/xlnet#200
So let's be consistent.
from deeplearningexamples.