Comments (7)
Could you try float32 first, without AMP? I don't have an environment available right now; I'll take a look for you tomorrow.
Isn't main_grad only used for AMP?
inp = paddle.normal(mean=0, std=0.01, shape=[1, 32, 32]).astype('float32')
Using fp32 for the input is fine. AMP will cast it to 16-bit internally, so changing the input to fp32 makes the script run.
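A minimal sketch of that suggestion (assuming the same auto_cast settings as the script later in this thread, and current dygraph AMP behavior): the input stays fp32 and AMP downcasts it inside the context.

import paddle

# Keep the input fp32; O2 auto_cast downcasts it for white-list ops like linear.
inp = paddle.normal(mean=0, std=0.01, shape=[1, 32, 32]).astype('float32')
layer = paddle.nn.Linear(32, 32)
with paddle.amp.auto_cast(True, level="O2", dtype="bfloat16"):
    out = layer(inp)
print(out.dtype)  # expected: bfloat16, since linear ran under auto_cast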
Paddle currently supports main_grad in two ways (see the sketch after this list):
- wrap the optimizer with mix_precision_utils.MixPrecisionOptimizer
- call paddle.amp.decorate with master_grad=True
The two approaches cannot be enabled at the same time; in the current distributed setting, the first one is recommended. The framework will unify the two usages later.
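A minimal sketch of the two options, using the same names as the script later in this thread (exact behavior may differ across Paddle versions); enable only one of them:

import paddle
from paddle.distributed.fleet.utils import mix_precision_utils

model = paddle.nn.Linear(32, 32)

# Option 1: the fleet utils keep an fp32 main_grad on each parameter.
model = mix_precision_utils.MixPrecisionLayer(model, dtype="bfloat16")
opt = paddle.optimizer.AdamW(parameters=model.parameters(), multi_precision=True)
opt = mix_precision_utils.MixPrecisionOptimizer(opt)

# Option 2 (do not combine with option 1): amp.decorate with master_grad=True.
# model = paddle.amp.decorate(models=model, dtype="bfloat16", level="O2", master_grad=True)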
Following your advice, I used an FP32 input and wrapped the optimizer only with mix_precision_utils.MixPrecisionOptimizer, but the test still hits the same error.
Running the script below locally works for me; maybe our environments are not aligned?
(The script copies the PaddleNLP code directly, so it runs as a single file.)
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
from paddle import _C_ops
from paddle.framework import core
import numpy as np
from paddle.distributed.fleet.utils import mix_precision_utils
def is_fused_matmul_bias_supported():
    if (paddle.is_compiled_with_cuda() and not paddle.is_compiled_with_rocm()) or paddle.is_compiled_with_xpu():
        return hasattr(core.eager.ops.legacy, "fused_gemm_epilogue")
    else:
        return False
if is_fused_matmul_bias_supported():
    origin_linear = paddle.incubate.nn.functional.fused_linear
else:
    origin_linear = paddle.nn.functional.linear
class FusedLinearWithGradAdd(paddle.autograd.PyLayer):
    @staticmethod
    def forward(ctx, x, weight, bias=None, name=None):
        y = origin_linear(x, weight, bias)
        ctx.save_for_backward(x, weight, bias)
        return y

    @staticmethod
    def backward(ctx, y_grad):
        x, weight, bias = ctx.saved_tensor()
        x_grad = paddle.matmul(y_grad, weight, transpose_y=True)
        # _C_ops.fused_linear_param_grad_add(x, y_grad, dw, db, multi_precision, has_bias)
        if bias is None:
            if hasattr(weight, "main_grad"):
                # Accumulate directly into the fp32 main_grad (multi_precision=True).
                weight.main_grad, _ = _C_ops.fused_linear_param_grad_add(
                    x, y_grad, weight.main_grad, None, True, False
                )
                return x_grad, None
            else:
                if weight.grad is not None:
                    # Accumulate into the existing grad in place.
                    weight.grad, _ = _C_ops.fused_linear_param_grad_add(
                        x, y_grad, weight.grad, None, False, False
                    )
                    return x_grad, None
                else:
                    weight_grad, _ = _C_ops.fused_linear_param_grad_add(
                        x, y_grad, None, None, False, False
                    )
                    return x_grad, weight_grad

        # bias is present: also produce/accumulate the bias gradient.
        if hasattr(weight, "main_grad") and hasattr(bias, "main_grad"):
            weight.main_grad, bias.main_grad = _C_ops.fused_linear_param_grad_add(
                x, y_grad, weight.main_grad, bias.main_grad, True, True
            )
            return x_grad, None, None
        else:
            if weight.grad is not None:
                assert bias.grad is not None
                weight.grad, bias.grad = _C_ops.fused_linear_param_grad_add(
                    x, y_grad, weight.grad, bias.grad, False, True
                )
                return x_grad, None, None
            else:
                weight_grad, bias_grad = _C_ops.fused_linear_param_grad_add(
                    x, y_grad, None, None, False, True
                )
                return x_grad, weight_grad, bias_grad
def mock_layers():
    paddle.nn.functional.linear = FusedLinearWithGradAdd.apply
    if is_fused_matmul_bias_supported():
        paddle.incubate.nn.functional.fused_linear = FusedLinearWithGradAdd.apply

mock_layers()
def create_optimizer(model, use_pure_bf16, use_main_grad):
    if use_main_grad:
        assert use_pure_bf16
        model = mix_precision_utils.MixPrecisionLayer(model, dtype="bfloat16")
    optimizer = paddle.optimizer.AdamW(
        parameters=model.parameters(),
        learning_rate=0.0001,
        multi_precision=use_pure_bf16,
    )
    if use_main_grad:
        optimizer = mix_precision_utils.MixPrecisionOptimizer(optimizer)
    return optimizer
class Net(paddle.nn.Layer):
    """Network used for recompute testing."""

    def __init__(self):
        super().__init__()
        self.layer = paddle.nn.Linear(32, 32)

    def forward(self, inp):
        out = self.layer(inp)
        return out
def main():
    paddle.seed(10)
    model = Net()
    optimizer = create_optimizer(model, use_pure_bf16=True, use_main_grad=True)
    model = paddle.amp.decorate(models=model, dtype="bfloat16", level='O2', master_grad=True)
    model.train()
    for _ in range(10):
        inp = paddle.normal(mean=0, std=0.01, shape=[1, 32, 32]).astype('float32')
        inp.stop_gradient = False
        with paddle.amp.auto_cast(True, level="O2", dtype="bfloat16"):
            out = model(inp)
            loss = out.mean()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        print(loss)

if __name__ == "__main__":
    main()
Resolved; closing this issue.