Comments (2)

mhy9989 commented on July 22, 2024

I kept train_batch_size the same in different trainings, and found that increasing num_gpus would cause the loss to increase. I don't know why.
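(For context: DeepSpeed derives train_batch_size as train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs, so holding train_batch_size fixed while adding GPUs silently changes the per-GPU micro-batch size or the accumulation steps, and with them how the loss gets reduced. A minimal sketch of that arithmetic, with illustrative numbers:)

# train_batch_size = micro_batch_per_gpu * grad_accum_steps * num_gpus
train_batch_size = 64
grad_accum_steps = 2  # illustrative
for num_gpus in (1, 2, 4):
    micro_batch_per_gpu = train_batch_size // (grad_accum_steps * num_gpus)
    print(f"num_gpus={num_gpus} -> micro_batch_per_gpu={micro_batch_per_gpu}")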

SeunghyunSEO commented on July 22, 2024

@mhy9989
Oh, I forgot about this; now that I think about it, it was a silly question lol.
If you reduce each micro-batch with something like loss.mean(), the backward pass computes a different function from the full-batch backward, so on top of floating-point error the gradients will never be exactly the same (see the small sketch below).
Of course, gradient accumulation itself should not cause any performance degradation or divergence.
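A minimal numeric sketch of the reduction mismatch (token counts made up for illustration): when micro-batches contain different numbers of valid tokens, averaging per-micro-batch means weights the smaller micro-batch's tokens more heavily than one global mean would.

import torch

# two micro-batches with unequal valid-token counts
losses_a = torch.tensor([1.0, 1.0, 1.0, 1.0])  # 4 valid tokens
losses_b = torch.tensor([2.0, 2.0])            # 2 valid tokens (rest padded out)

global_mean = torch.cat([losses_a, losses_b]).mean()     # (4*1 + 2*2) / 6 = 1.3333...
mean_of_means = (losses_a.mean() + losses_b.mean()) / 2  # (1 + 2) / 2 = 1.5
print(global_mean.item(), mean_of_means.item())          # the two reductions disagree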

import copy
import random

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Set random seeds for reproducibility
def set_seed(seed_val: int = 42):
    random.seed(seed_val)
    np.random.seed(seed_val)
    torch.manual_seed(seed_val)
    torch.cuda.manual_seed_all(seed_val)
    
# hyperparameters
seed = 42
vocab_size = 32768
d_embd = 1024
bsz = 32
seq_len = 512
dtype = torch.bfloat16

# create input and target; mark the last 20 tokens of the last sequence as
# ignore_index (-100) so the micro-batches end up with unequal valid-token counts
set_seed(seed)
x = torch.randn((bsz, seq_len, d_embd)).cuda().to(dtype=dtype)
y = torch.randint(0, vocab_size, (bsz, seq_len)).cuda()
y[-1, -20:] = -100

class Model(nn.Module):
    def __init__(self, vocab_size, d_embd):
        super(Model, self).__init__()
        self.vocab_size = vocab_size
        self.ffn = nn.Linear(d_embd, d_embd, bias=False)
        self.unemb = nn.Linear(d_embd, vocab_size, bias=False)

    def forward(self, x, y, reduction):
        x = self.unemb(F.relu(self.ffn(x))).float()
        x = x.contiguous().view(-1, self.vocab_size)
        y = y.contiguous().view(-1).to(x.device)
        assert x.size(0) == y.size(0), f"x.size() ({x.size()}) != y.size() ({y.size()})"
        loss = nn.CrossEntropyLoss(reduction=reduction)(x, y)  # ignore_index defaults to -100
        num_valid_tokens = (y != -100).sum()
        if reduction == 'sum':
            # normalize by this call's own valid-token count (per micro-batch, not global)
            loss = loss / num_valid_tokens
        print(f'x.size(): {x.size()}, num_valid_tokens: {num_valid_tokens}')
        return loss

set_seed(seed)
model = Model(vocab_size, d_embd).cuda().to(dtype=dtype)
optimizer = optim.Adam(model.parameters(), lr=0.001)
set_seed(seed)
model_ = Model(vocab_size, d_embd).cuda().to(dtype=dtype)
optimizer_ = optim.Adam(model_.parameters(), lr=0.001)

reduction = 'sum'   # toggle between 'sum' (normalized by valid tokens) and 'mean'
# reduction = 'mean'

num_accum = 2

for epoch in range(5):
    # reference: one full-batch forward/backward
    loss = model(x, y, reduction)
    loss.backward()
    ffn_grad_cache = copy.deepcopy(model.ffn.weight.grad)
    unemb_grad_cache = copy.deepcopy(model.unemb.weight.grad)
    optimizer.step()
    optimizer.zero_grad()

    # gradient accumulation over num_accum micro-batches; scale each micro-batch
    # loss by 1/num_accum (the standard gradient-accumulation convention) so the
    # accumulated gradient is comparable to the full-batch gradient
    avg_loss = 0.0
    for accum in range(num_accum):
        x_ = x[accum * (bsz // num_accum):(accum + 1) * (bsz // num_accum), :, :]
        y_ = y[accum * (bsz // num_accum):(accum + 1) * (bsz // num_accum), :]
        loss_ = model_(x_, y_, reduction)
        avg_loss += loss_.detach()
        (loss_ / num_accum).backward()

    avg_loss /= num_accum
    ffn_grad_cache_ = copy.deepcopy(model_.ffn.weight.grad)
    unemb_grad_cache_ = copy.deepcopy(model_.unemb.weight.grad)
    optimizer_.step()
    optimizer_.zero_grad()

    print(f'''
    reduction: {reduction}
    num_accum: {num_accum}
    loss (not accum): {loss}
    loss (accum): {avg_loss}
    loss diff?: {loss - avg_loss}
    ffn_grad allclose?: {torch.allclose(ffn_grad_cache, ffn_grad_cache_)}, abs diff max: {(ffn_grad_cache - ffn_grad_cache_).abs().max()}
    unemb_grad allclose?: {torch.allclose(unemb_grad_cache, unemb_grad_cache_)}, abs diff max: {(unemb_grad_cache - unemb_grad_cache_).abs().max()}
    ''')
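For exact equivalence (up to floating point), one common fix is to keep reduction='sum' but divide every micro-batch loss by the global valid-token count of the whole effective batch instead of each micro-batch's own count. A minimal sketch continuing from the snippet above (in real multi-GPU training the global count would have to be all-reduced first; this is an assumed setup, not DeepSpeed's built-in behavior):

# compute the global valid-token count once, before the micro-batch loop
global_valid_tokens = (y != -100).sum()

for accum in range(num_accum):
    x_ = x[accum * (bsz // num_accum):(accum + 1) * (bsz // num_accum)]
    y_ = y[accum * (bsz // num_accum):(accum + 1) * (bsz // num_accum)]
    logits = model_.unemb(F.relu(model_.ffn(x_))).float().view(-1, vocab_size)
    # sum over this micro-batch, but normalize by the *global* token count, so
    # summing gradients across micro-batches reproduces the full-batch gradient
    loss_ = F.cross_entropy(logits, y_.reshape(-1), reduction='sum') / global_valid_tokens
    loss_.backward()

With this normalization no extra 1/num_accum factor is needed, since the shared denominator already makes the micro-batch losses sum to the full-batch loss.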
