Comments (2)
I kept train_batch_size the same across training runs and found that increasing num_gpus caused the loss to increase. I don't know why.
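For context, DeepSpeed's config requires train_batch_size to equal train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs. So with train_batch_size fixed, adding GPUs changes how the batch is split. A minimal sketch of that arithmetic follows; the helper name `accumulation_steps` and the numbers are made up for illustration and are not part of DeepSpeed.

```python
def accumulation_steps(train_batch_size: int, micro_batch_per_gpu: int, num_gpus: int) -> int:
    """Hypothetical helper: accumulation steps implied by a fixed global batch size."""
    samples_per_step = micro_batch_per_gpu * num_gpus
    if train_batch_size % samples_per_step != 0:
        raise ValueError("train_batch_size must be divisible by micro_batch_per_gpu * num_gpus")
    return train_batch_size // samples_per_step


# Illustrative numbers only: global batch of 32, micro-batch of 4 per GPU.
for num_gpus in (1, 2, 4):
    steps = accumulation_steps(32, 4, num_gpus)
    print(f"num_gpus={num_gpus}: gradient_accumulation_steps={steps}")
```

The samples seen per optimizer step stay the same, but the number and size of the pieces the loss is averaged over do not, which is where the reply below picks up.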
@mhy9989
Oh, I forgot about this. Now that I think about it, it was a stupid question lol.
If you take something like loss.mean(), you can see it's a different operation once you derive the backprop: each micro-batch is normalized by its own token count, so accumulating those means is not the same as taking one mean over the full batch.
On top of numerical error, the two will never be exactly the same.
Of course, I don't think grad accum should cause any performance degradation or divergence.
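A toy illustration of that point, using made-up per-token losses rather than the tensors from the script below. The second micro-batch has fewer valid tokens, as if some labels were masked with -100.

```python
# Hypothetical per-token losses for two micro-batches of unequal valid length.
micro_batches = [[2.0, 2.0, 2.0, 2.0], [4.0, 4.0]]

# One mean over every token in the full batch.
all_tokens = [loss for mb in micro_batches for loss in mb]
full_batch_mean = sum(all_tokens) / len(all_tokens)

# Mean of per-micro-batch means, which is what averaging accumulated means gives.
mean_of_means = sum(sum(mb) / len(mb) for mb in micro_batches) / len(micro_batches)

print(f"full-batch mean: {full_batch_mean:.4f}")
print(f"mean of means:   {mean_of_means:.4f}")
```

The two numbers only agree when every micro-batch has the same number of valid tokens.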
import copy
import random

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


# Set random seed for reproducibility
def set_seed(seed_val: int = 42):
    random.seed(seed_val)
    np.random.seed(seed_val)
    torch.manual_seed(seed_val)
    torch.cuda.manual_seed_all(seed_val)


# Settings
seed = 42
vocab_size = 32768
d_embd = 1024
bsz = 32
seq_len = 512
n = d_embd
dtype = torch.bfloat16

# Create input and target
set_seed(seed)
x = torch.randn((bsz, seq_len, d_embd)).cuda().to(dtype=dtype)
y = torch.randint(0, vocab_size, (bsz, seq_len)).cuda()
# Mask the last 20 targets of the last sample (-100 is CrossEntropyLoss's default ignore_index)
y[-1, -20:] = -100


class Model(nn.Module):
    def __init__(self, vocab_size, d_embd):
        super(Model, self).__init__()
        self.vocab_size = vocab_size
        self.ffn = nn.Linear(d_embd, d_embd, bias=False)
        self.unemb = nn.Linear(d_embd, vocab_size, bias=False)

    def forward(self, x, y, reduction):
        # Compute logits, then upcast to fp32 before the loss
        x = self.unemb(F.relu(self.ffn(x))).float()
        x = x.contiguous().view(-1, self.vocab_size)
        y = y.contiguous().view(-1).to(x.device)
        assert x.size(0) == y.size(0), f"x.size()({x.size()}) != y.size()({y.size()})"
        loss = nn.CrossEntropyLoss(reduction=reduction)(x, y)
        num_valid_tokens = (y != -100).sum()
        if reduction == 'sum':
            # Normalize the summed loss by the number of non-ignored tokens
            loss = loss / num_valid_tokens
        print(f'x.size(): {x.size()}, num_valid_tokens: {num_valid_tokens}')
        return loss


set_seed(seed)
model = Model(vocab_size, d_embd).cuda().to(dtype=dtype)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Second model with identical initial weights, trained with gradient accumulation
set_seed(seed)
model_ = Model(vocab_size, d_embd).cuda().to(dtype=dtype)
optimizer_ = optim.Adam(model_.parameters(), lr=0.001)

reduction = 'sum'
# reduction = 'mean'
num_accum = 2
micro_bsz = bsz // num_accum

for epoch in range(5):
    # Full batch, no accumulation
    loss = model(x, y, reduction)
    loss.backward()
    ffn_grad_cache = copy.deepcopy(model.ffn.weight.grad)
    unemb_grad_cache = copy.deepcopy(model.unemb.weight.grad)
    optimizer.step()
    optimizer.zero_grad()

    # Same batch split into num_accum micro-batches
    avg_loss = 0.0
    for accum in range(num_accum):
        x_ = x[accum * micro_bsz:(accum + 1) * micro_bsz, :, :]
        y_ = y[accum * micro_bsz:(accum + 1) * micro_bsz, :]
        loss_ = model_(x_, y_, reduction)
        avg_loss += loss_.detach()
        # Scale so the accumulated gradient is an average over micro-batches, not a sum
        (loss_ / num_accum).backward()
    avg_loss /= num_accum
    ffn_grad_cache_ = copy.deepcopy(model_.ffn.weight.grad)
    unemb_grad_cache_ = copy.deepcopy(model_.unemb.weight.grad)
    optimizer_.step()
    optimizer_.zero_grad()

    print(f'''
reduction: {reduction}
num_accum: {num_accum}
loss (not accum): {loss.item()}
loss (accum): {avg_loss.item()}
loss diff? : {(loss - avg_loss).item()}
ffn_grad allclose?: {torch.allclose(ffn_grad_cache, ffn_grad_cache_)}, abs diff max: {(ffn_grad_cache - ffn_grad_cache_).abs().max()}
unemb_grad allclose?: {torch.allclose(unemb_grad_cache, unemb_grad_cache_)}, abs diff max: {(unemb_grad_cache - unemb_grad_cache_).abs().max()}
''')
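
One way to make the accumulated gradient match the full-batch one, up to floating-point error, is to normalize every micro-batch by the total number of valid tokens in the whole batch instead of its own count. The sketch below checks this on a made-up scalar model in plain Python, not on the model from the script; `token_grad`, the weight `w`, and the data are all hypothetical.

```python
import math


def token_grad(w: float, x: float, t: float) -> float:
    """Gradient of the per-token loss (w * x - t) ** 2 with respect to w."""
    return 2.0 * x * (w * x - t)


w = 0.5
# (input, target) pairs; the second micro-batch has fewer valid tokens.
micro_batches = [
    [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (4.0, 1.0)],
    [(5.0, 1.0), (6.0, 1.0)],
]
total_tokens = sum(len(mb) for mb in micro_batches)

# Reference: gradient of the mean loss over the full batch.
full_grad = sum(token_grad(w, x, t) for mb in micro_batches for x, t in mb) / total_tokens

# Accumulation with per-micro-batch means, averaged over micro-batches.
per_mb_mean_grad = sum(
    sum(token_grad(w, x, t) for x, t in mb) / len(mb) for mb in micro_batches
) / len(micro_batches)

# Accumulation with each micro-batch sum divided by the global token count.
global_norm_grad = sum(
    sum(token_grad(w, x, t) for x, t in mb) / total_tokens for mb in micro_batches
)

print(f"full batch:            {full_grad:.4f}")
print(f"per-micro-batch means: {per_mb_mean_grad:.4f}")
print(f"global normalization:  {global_norm_grad:.4f}")
print("global normalization matches full batch:", math.isclose(full_grad, global_norm_grad))
```

In the script above, the analogous change would be to count the valid tokens in y once before the accumulation loop, use reduction='sum', and divide each micro-batch loss by that global count rather than by num_accum. I haven't run that variant, so treat it as a suggestion rather than a verified result.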