deeplink-org / diopi

67.0 67.0 31.0 6.16 MB

License: BSD 3-Clause "New" or "Revised" License

CMake 2.17% C++ 42.37% Cuda 4.16% Shell 0.20% Python 45.90% C 5.17% Makefile 0.01% Batchfile 0.01%

diopi's People

Contributors

bonbon-tang, caikun-pjlab, ccjincong, chrysantd, cokedong, dx111, fengsibo, gong-air, hellozmz, hu-qingqing, jfxu-st, jingguo-st, leungchinan, lljbash, lwj-st, miaoyyu, neoszhang, poi-wx, shshenhao, windstamp, wugeshui, xiaobotj, xintian-514, yangbofun, yewentao256, yeyeye333, z379035389, zhangzefeng92, zhaoguochun1995, zsksmhq


diopi's Issues

[New op] Support SyncBatchnorm on MLU

Adapt the SyncBatchnorm operator, complete functional verification on the 370M8, and merge it.
SyncBatchnorm involves the following cnnl interfaces:

  • 9.296. SyncBatchnormBackwardReduce
  • 9.297. SyncBatchNormElemt
  • 9.298. SyncBatchNormBackwardElemt
  • 9.299. SyncBatchNormBackwardElemtV2
  • 9.300. SyncBatchNormGatherStatsWithCounts
  • 9.301. SyncBatchNormStats

See the reference documentation.
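For reference, the stats → gather-stats-with-counts → elemt decomposition these interfaces mirror (the same split PyTorch uses for SyncBatchNorm) can be sketched in numpy. The function names below are illustrative only, not the cnnl API:

```python
import numpy as np

def local_stats(x):
    # Per-device statistics over (N, H, W) of an NCHW tensor,
    # analogous to SyncBatchNormStats.
    mean = x.mean(axis=(0, 2, 3))
    var = x.var(axis=(0, 2, 3))
    count = x.shape[0] * x.shape[2] * x.shape[3]
    return mean, var, count

def gather_stats_with_counts(means, variances, counts, eps=1e-5):
    # Combine per-device stats into global mean/invstd,
    # analogous to SyncBatchNormGatherStatsWithCounts.
    counts = np.asarray(counts, dtype=np.float64)[:, None]
    total = counts.sum()
    mean = (np.stack(means) * counts).sum(0) / total
    # Combine variances via E[x^2] - (E[x])^2 across devices.
    ex2 = ((np.stack(variances) + np.stack(means) ** 2) * counts).sum(0) / total
    invstd = 1.0 / np.sqrt(ex2 - mean ** 2 + eps)
    return mean, invstd

def elemt(x, mean, invstd, weight, bias):
    # Elementwise normalization with the global stats,
    # analogous to SyncBatchNormElemt.
    m, s, w, b = (v[None, :, None, None] for v in (mean, invstd, weight, bias))
    return (x - m) * s * w + b
```

Splitting a batch across "devices" and recombining with these helpers reproduces the single-device statistics exactly, which is the property the backward-reduce interfaces rely on as well.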

ctc_loss adaptation

A difference between PyTorch's logic and cnnlCTCLoss's logic breaks the backward-pass adaptation unless PyTorch itself is modified.
log_alpha does not affect the forward computation or its result, but it does affect the backward logic. The reduction parameter that cnnlCTCLoss needs when computing backward gradients cannot be obtained from the PyTorch interface being adapted. See the CTCLoss problem summary.
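To see why the backward pass needs the reduction: with reduction='mean' every per-sample gradient is scaled by 1/N (PyTorch's 'mean' additionally divides each sample's loss by its target length), so a fused backward kernel that takes reduction as a parameter cannot be driven from an interface that never receives it. A toy numpy illustration of the scaling, not the cnnl interface:

```python
import numpy as np

# Per-sample CTC losses as they come out of the forward pass.
losses = np.array([2.0, 4.0, 6.0])

# Gradient of the reduced loss w.r.t. each per-sample loss:
g_sum = np.ones_like(losses)                  # reduction='sum' (or 'none')
g_mean = np.ones_like(losses) / losses.size   # reduction='mean' scales by 1/N

assert np.allclose(g_mean * losses.size, g_sum)
```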

使用sh scripts/build_impl.sh cuda编译cuda版本报错

在用使用sh scripts/build_impl.sh cuda的时候编译报错:
报错显示:
“impl/../diopi_test/codegen/gen.py" : [Errno 2] No such file or directory
我搜索代码工程中的gen.py文件,发现有两处:adaptor/codegen/gen.py 和diopi_test/diopi_stub/codegen/gen.py
我替换CMakelists中的文件路径,发现使用上述两个文件编译均不成功。

我的环境是一个python的虚拟环境,python3.8.5 , nvcc的版本11.8,GPU为A100

Cannot run the quick start example

Describe the bug

cd impl && sh scripts/build_impl.sh torch

That is, build the torch operator library. After the build finishes, the following shared libraries are produced:
/home/xiongchao/work/DIOPI/impl/lib
├── export_functions.cpython-39-x86_64-linux-gnu.so
├── export_runtime.cpython-39-x86_64-linux-gnu.so
├── libdiopi_impl.so
└── libdiopirt.so
The test Python script imports export_functions.cpython-39-x86_64-linux-gnu.so, and the make log below shows the operator implementations live in libdiopi_impl.so:
(screenshot)
Yet running readelf -d on the other three .so files shows no dependency on libdiopi_impl.

Environment information
cuda 11.8
torch 2.0.0 (installed via conda)
gcc 9.4
DIOPI commit id 6f6da63
Screenshots
During testing it keeps reporting "not implemented":

python main.py --mode run_test --fname relu
2023-10-17 16:16:25,460 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:25,462 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:25,465 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:25,465 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:25,918 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:26,234 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:26,266 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:26,272 - ConformanceTest - ERROR - NotImplemented: relu not implement

Running other APIs from the Python side also fails.
(screenshot)

My understanding is that, in theory, there should be a dependency on libdiopi_impl.so? Thanks for any help.
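One common workaround when a Python extension resolves symbols from a library it does not list as a DT_NEEDED dependency is to preload that library with RTLD_GLOBAL before importing the extension. A sketch only; the path is taken from the report above, and whether this helps depends on how the build actually resolves the diopi symbols:

```python
import ctypes

# Path from the report above; adjust to your build output.
IMPL = "/home/xiongchao/work/DIOPI/impl/lib/libdiopi_impl.so"

try:
    # RTLD_GLOBAL puts the library's symbols into the global namespace,
    # making them visible to extensions imported afterwards.
    ctypes.CDLL(IMPL, mode=ctypes.RTLD_GLOBAL)
except OSError as e:
    print("could not preload:", e)

# import export_functions  # should now be able to resolve diopi* symbols
```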

[DeepLink v0.26] DIOPI bi-monthly development plan

[Planned version]

v0.26

[Expected period]

July to August

[Development content]

  1. Support training the LLaMA large model end to end
  2. Define the operator interfaces required by 6 new models and other promising models, and verify the implementations on cuda;
  3. Implement and verify the operator interfaces of the 16 defined models on camb;
  4. Implement and verify the operator interfaces of a total of 7 defined models on two other hardware platforms;
  5. Improve the DIOPI conformance test, e.g. add testing of non-contiguous inputs.

Documentation updates and missing build instructions

  1. [Docs not fully updated] Many parts of the GitHub documentation are outdated and contain errors, e.g. the non-existent header file diopi_register.h
    https://github.com/DeepLink-org/DIOPI/blob/main/README.md
    https://github.com/DeepLink-org/DIOPI/blob/main/impl/README.md
  2. [Add build instructions] Document the common build commands and options. The build currently defaults to DTEST=OFF; to test ops/models you must set DTEST=ON. The DIOPI documentation never mentions this; it only appears in the DIPU documentation: https://github.com/DeepLink-org/dipu/blob/main/QuickStart.md

flash_attn cannot be installed when setting up the Llama2 environment on Ascend

A user ran into this while validating DeepLink on Ascend; the original message follows.
--------------------- original email ---------------------

I see that the DeepLink community has integrated Ascend chips and adapted the llama model. While validating DeepLink on Ascend, I hit the following error installing flash_attn during Llama2 environment setup:

flash_attn turns out to be strongly tied to CUDA, which blocks the validation flow and would require porting work. Have you run into large models depending on libraries tied to a specific chip architecture during your own validation? If so, could you share your solution?

(screenshot)

"Artificial Intelligence — Operator Interfaces" operator development checklist

"Artificial Intelligence — Operator Interfaces, Part 1: Basic Math" — minimal set (79 operators), operators to add or modify (28):

  • 7.2.1.4 Create a dense tensor filled with a specified value
  • 7.2.1.5 Create an uninitialized tensor
  • 7.2.1.7 Create a dense tensor of uniformly distributed random numbers
  • 7.2.1.8 Create a dense tensor of normally distributed random numbers
  • 7.2.1.13 Create a dense tensor evenly spaced over a linear range
  • 7.2.2.1 Shape query
  • 7.2.2.3 Infinity check
  • 7.2.3.2 Reshape a tensor
  • 7.2.6.3 Logical NOT on a tensor
  • 7.2.7.3 Bitwise XOR
  • 7.2.9.3 Truncate toward zero
  • 7.2.9.4 Round to nearest
  • 7.2.10.3 Tangent
  • 7.2.10.5 Arccosine
  • 7.2.11.1 Hyperbolic sine
  • 7.2.11.2 Hyperbolic cosine
  • 7.2.11.4 Inverse hyperbolic sine
  • 7.2.11.5 Inverse hyperbolic cosine
  • 7.2.11.6 Inverse hyperbolic tangent
  • 7.2.12.2 Exponential function (extended)
  • 7.2.12.4 Base-e logarithm (extended)
  • 7.2.13.2 Prefix sum
  • 7.2.14.2 Index of minimum
  • 7.2.14.3 Sort indices
  • 7.2.15.1 Complex construction
  • 7.2.15.2 Complex conjugate
  • 7.2.15.3 Get imaginary part
  • 7.2.15.4 Get real part

"Artificial Intelligence — Operator Interfaces, Part 2: Neural Networks" — minimal set (22 operators), operators to add or modify (20):

  • 7.2.1.3 Piecewise-linear approximation of the sigmoid function
  • 7.2.1.6 Rectified linear unit
  • 7.2.1.7 Thresholded rectified linear unit
  • 7.2.1.8 Exponential linear unit
  • 7.2.1.10 Parametric rectified linear unit
  • 7.2.1.11 Scaled exponential linear unit
  • 7.2.1.14 Softplus function
  • 7.2.1.15 Softsign function
  • 7.2.1.18 Error function
  • 7.2.3.1 Dropout
  • 7.2.4.1 Batch normalization
  • 7.2.4.2 Group normalization
  • 7.2.4.3 Layer normalization
  • 7.2.4.4 Instance normalization
  • 7.2.4.7 Lp-norm normalization
  • 7.2.5.1 1D pooling
  • 7.2.5.2 2D pooling
  • 7.2.5.3 3D pooling
  • 7.2.6.6 3D transposed convolution
  • 7.2.11.1 Grid interpolation sampling

Casting NaN from float32 to float16 yields 0

Describe the bug
Casting NaN from float32 to float16 produces 0 instead of NaN.

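For reference, IEEE-754 narrowing conversions preserve NaN; the expected behaviour can be checked against numpy on CPU:

```python
import math
import numpy as np

x = np.array([np.nan], dtype=np.float32)
y = x.astype(np.float16)  # the narrowing cast must preserve NaN

assert math.isnan(float(y[0]))  # [nan], not [0.]
```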

NonZero Problem

Describe the bug
When testing FCOS on MLU, there are bugs related to Nonzero and index that occur only for the bool dtype. We have worked around them via bool2int for now; however, cnnl should support bool the same way it supports int. #545
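The bool2int workaround mentioned above amounts to casting the mask to an integer dtype before calling nonzero; a numpy sketch of the equivalence (the MLU kernel itself is cnnl's and is not shown here):

```python
import numpy as np

mask = np.array([[True, False], [False, True]])

# Workaround: cast bool -> int32 before nonzero; the indices are identical.
idx_bool = np.nonzero(mask)
idx_int = np.nonzero(mask.astype(np.int32))

assert all((a == b).all() for a, b in zip(idx_bool, idx_int))
```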

Operators needed for the lmdeploy adaptation that lack a Huawei implementation

1. DIOPI interfaces that have no Huawei implementation yet:
diopiexpand
diopiSplitWithSizes
diopiLinspace
2. DIOPI extension interfaces that have no Huawei implementation yet; they share the lightlm implementation.
diopiRMSNorm

Adapting mmcv's modulated deform conv operator

Describe the bug
When running real models, the modulated deform conv operator hits cases where the output tensor's shape is inconsistent with the shapes of the offset and mask tensors, so the cnnl kernel's requirements cannot be met.
(screenshot)

(screenshot)

For now there is no transformation that can align these three tensors' shapes to satisfy the kernel, because every dimension of these NCHW-format tensors already carries a definite, concrete meaning.

[CNNL] [Error]:[cnnlReduce] indices shouldn't be null when indices_size_inbytes > 0.

The diopi (camb) implementation together with dipu appears to have a problem:

import torch
import torch_dipu
device = torch.device("dipu")
x = torch.randn(2, 2).to(device)
x.sum()

Error message:


>>> x.sum()
[2023-5-22 17:48:58] [CNNL] [Error]:[cnnlReduce] indices shouldn't be null when indices_size_inbytes > 0.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: /mnt/share/share_data/dongkaixing/mmcv_dipu/dipu_poc_new/torch_dipu/csrc_dipu/aten/ops/AutoGenedKernels.cpp:1168'diopiadaptor::diopiSum(ctx, outDiopiTensorHandle, selfDiopiTensorHandle, diopi_size);' error, error code is 1error message is diopiErrorOccurred: diopiErrorOccurred: cnnl error 3 : CNNL_STATUS_BAD_PARAM at /mnt/share/share_data/dongkaixing/mmcv_dipu/dipu_poc_new/third_party/DIOPI/DIOPI-IMPL/camb/functions/reduce.cpp:121 called by `reduce_dim_impl` at /mnt/share/share_data/dongkaixing/mmcv_dipu/dipu_poc_new/third_party/DIOPI/DIOPI-IMPL/camb/functions/reduce.cpp:157
 called by `diopiSum` at /mnt/share/share_data/dongkaixing/mmcv_dipu/dipu_poc_new/third_party/DIOPI/DIOPI-IMPL/camb/functions/reduce.cpp:171


diopiSum appears to be the failing call, from torch_dipu/csrc_dipu/aten/ops/AutoGenedKernels.cpp:1168

diopi commit: 4b1481d
dipu commit: b1b2d7bcad8c65bc9b5e650f45eecbc9ddfe2032

Aborted during Ascend testing

Describe the bug
(screenshot)
There is no other error output; the stack obtained with gdb seems to show the error occurs while loading the runtime and functions.
(screenshot)


Use diopiTensorCopyFromBuffer for the H2D copy

Current state
In diopi_test, the D2H transfer is done by calling diopiTensorCopyToBuffer(). This function exposes the tensor's pointer and is a weak symbol, so a vendor can override it and, based on the tensor metadata and hardware characteristics, perform efficient data-type and memory-layout conversion during the copy.

The H2D path is different: the Tensor constructor calls device_memcpy_h2d_async directly, copying to the device as linear memory.
Expectation
H2D should mirror D2H by calling diopiTensorCopyFromBuffer. Vendors could then override that function to handle data-type and memory-layout conversion during H2D.

Unit-test documentation is not detailed enough

The unit-test documentation is not detailed enough, e.g. on whether options such as using only a certain data type, testing only forward or backward, or comparing only a certain output are supported, and on how to use those options.

Dangerous warnings in C++ codes

Describe the bug
Too many dangerous warnings, such as "no return statement", "may be used uninitialized", and "control reaches end of non-void function"...

To Reproduce

$ cd impl; bash scripts/build_impl.sh torch

Expected behavior
No warnings during compilation.

It would be better to add -Werror to the compile options.
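A sketch of how that could look in the impl build's CMakeLists; whether to apply it globally or per target is a project decision, and the flag list below is only a starting point:

```cmake
# Promote the warning classes above to errors; -Wall/-Wextra surface them.
add_compile_options(-Wall -Wextra -Werror=return-type -Werror=uninitialized)
# Or, more strictly, fail on every warning:
# add_compile_options(-Wall -Wextra -Werror)
```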

printDevData is somewhat over-engineered

template <typename T>
void print_tensor_impl(diopiContextHandle_t ctx, DiopiTensor tensor, int max_num = -1, std::string label = "") {
    printf("[%s, %s, %p] \ndata: ", label.c_str(), DiopiDataType::dataTypeStr(tensor.dtype()).c_str(), tensor.data());
    int N = max_num;
    if (max_num == -1 || N > tensor.numel()) {
        N = tensor.numel();
    }
    void* cpu_ptr = malloc(N * sizeof(T));
    CNRT_CHECK(cnrtMemcpyAsync(cpu_ptr, tensor.data(), N * sizeof(T), getStream(ctx), CNRT_MEM_TRANS_DIR_DEV2HOST));
    syncStreamInCtx(ctx);
    for (int i = 0; i < N; ++i) {
        std::cout << *((T*)cpu_ptr + i) << ", ";
    }
    std::cout << "\n";
    free(cpu_ptr);
}

void print_tensor(diopiContextHandle_t ctx, DiopiTensor tensor, int max_num = -1, std::string label = "") {
    if (tensor.dtype() == diopi_dtype_int64) {
        print_tensor_impl<int64_t>(ctx, tensor, max_num, label);
    } else if (tensor.dtype() == diopi_dtype_float64) {
        print_tensor_impl<double>(ctx, tensor, max_num, label);
    } else if (tensor.dtype() == diopi_dtype_int32) {
        print_tensor_impl<int32_t>(ctx, tensor, max_num, label);
    } else if (tensor.dtype() == diopi_dtype_float32) {
        print_tensor_impl<float>(ctx, tensor, max_num, label);
    } else {
        std::cout << "@@@ unsupported datatype " << DiopiDataType::dataTypeStr(tensor.dtype()) << "\n";
    }
}

Written this way, printing a tensor becomes simpler:

diopiContextHandle_t ctx;
DiopiTensor a, b, c;

// print directly
print_tensor(ctx, a);

// print ten elements
print_tensor(ctx, b, 10);

// print all elements, with a label to make it easy to find
print_tensor(ctx, c, -1, "tensor c");


Index put kernel bug and overflow issue

  1. Index put kernel bug (accumulate is true and indices are bool): when is this expected to be fixed?
  2. There is also an overflow issue: on the 370 the result is a random number.
  • When accumulate = true and the data type of input is int32, int16, int8 or uint8, the accumulation result supports overflow on MLU500 series (excluding tp_520) and higher platforms; on platforms earlier than the MLU500 series, overflow is not supported and undefined behaviour may occur if overflow arises.
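For comparison, an index_put-style accumulation on CPU (numpy's np.add.at) wraps on integer overflow rather than producing random values; a sketch assuming int8 input with accumulate=True semantics:

```python
import numpy as np

a = np.array([120], dtype=np.int8)
idx = np.array([0, 0])

# accumulate=True semantics: add 10 at index 0 twice; 120 + 10 + 10
# overflows int8 and wraps modulo 256 on CPU.
np.add.at(a, idx, 10)

assert a[0] == -116  # 140 - 256
```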

[User feedback] Questions from validating DeepLink with the Llama2-7b model

A user hit problems while validating DeepLink with the Llama2-7b model; the original message follows.
--------------------- original email ---------------------
We are currently validating DeepLink with the Llama2-7b model and need help with two problems:

  1. Training Llama2-7b on Ascend via DeepLink, to verify that training works on both NVIDIA and Ascend with no code changes. I used the llama2-chinese scripts, but while setting up the Llama2 environment I found flash_attn strongly tied to CUDA and impossible to install on Ascend. I suspect validating DeepLink requires specific scripts, but I could not find any among the evaluated models on your GitHub. Could you provide them? Many thanks!

  2. Training Llama2-7b on NVIDIA via DeepLink raises IndexError: map::at. We initially traced it to the nonexistent :7 in device='cuda:7'; that issue is reportedly fixed, but after updating DeepLink the error below still occurs. Have you seen this? Is there a temporary workaround?

(screenshot)
