Comments (1)
Update the Azure doc to address the following issues:
(Regarding the CIFAR accuracy statement) Can we rephrase this comment? It requires a subtle understanding of data parallelism, and I'm worried it will be misinterpreted as a critique of DeepSpeed.
(Before the Azure CLI details) Maybe add a first bullet about creating an Azure account, for users who are new?
I think it would be good to add a sentence or two telling users about the hardware that our sample configuration uses. We should also link to the SKU info page so pricing is clear before they follow the tutorial.