deeplink-org / diopi

67.0 67.0 31.0 6.16 MB

License: BSD 3-Clause "New" or "Revised" License

CMake 2.17% C++ 42.37% Cuda 4.16% Shell 0.20% Python 45.90% C 5.17% Makefile 0.01% Batchfile 0.01%

diopi's People

Contributors

bonbon-tang, caikun-pjlab, ccjincong, chrysantd, cokedong, dx111, fengsibo, gong-air, hellozmz, hu-qingqing, jfxu-st, jingguo-st, leungchinan, lljbash, lwj-st, miaoyyu, neoszhang, poi-wx, shshenhao, windstamp, wugeshui, xiaobotj, xintian-514, yangbofun, yewentao256, yeyeye333, z379035389, zhangzefeng92, zhaoguochun1995, zsksmhq


diopi's Issues

[New op] Support SyncBatchnorm on MLU

Adapt the SyncBatchnorm operator, complete functional verification on the 370M8, and merge it.
SyncBatchnorm involves the following cnnl interfaces:

  • 9.296. SyncBatchnormBackwardReduce
  • 9.297. SyncBatchNormElemt
  • 9.298. SyncBatchNormBackwardElemt
  • 9.299. SyncBatchNormBackwardElemtV2
  • 9.300. SyncBatchNormGatherStatsWithCounts
  • 9.301. SyncBatchNormStats

See the reference documentation.
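For reference, the stats → gather-stats-with-counts → elemt decomposition these interfaces mirror (the same split PyTorch uses for SyncBatchNorm) can be sketched in numpy. The function names below are illustrative only, not the cnnl API:

```python
import numpy as np

def local_stats(x):
    # Per-device statistics over (N, H, W) of an NCHW tensor,
    # analogous to SyncBatchNormStats.
    mean = x.mean(axis=(0, 2, 3))
    var = x.var(axis=(0, 2, 3))
    count = x.shape[0] * x.shape[2] * x.shape[3]
    return mean, var, count

def gather_stats_with_counts(means, variances, counts, eps=1e-5):
    # Combine per-device stats into global mean/invstd,
    # analogous to SyncBatchNormGatherStatsWithCounts.
    counts = np.asarray(counts, dtype=np.float64)[:, None]
    total = counts.sum()
    mean = (np.stack(means) * counts).sum(0) / total
    # Combine variances via E[x^2] - (E[x])^2 across devices.
    ex2 = ((np.stack(variances) + np.stack(means) ** 2) * counts).sum(0) / total
    invstd = 1.0 / np.sqrt(ex2 - mean ** 2 + eps)
    return mean, invstd

def elemt(x, mean, invstd, weight, bias):
    # Elementwise normalization with the global stats,
    # analogous to SyncBatchNormElemt.
    m, s, w, b = (v[None, :, None, None] for v in (mean, invstd, weight, bias))
    return (x - m) * s * w + b
```

Splitting a batch across "devices" and recombining with these helpers reproduces the single-device statistics exactly, which is the property the backward-reduce interfaces rely on as well.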

ctc_loss adaptation

A difference between PyTorch's logic and cnnlCTCLoss's logic breaks the backward-pass adaptation unless PyTorch itself is modified.
log_alpha does not affect the forward computation or its result, but it does affect the backward logic. The reduction parameter that cnnlCTCLoss needs when computing backward gradients cannot be obtained from the PyTorch interface being adapted. See the CTCLoss problem summary.
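To see why the backward pass needs the reduction: with reduction='mean' every per-sample gradient is scaled by 1/N (PyTorch's 'mean' additionally divides each sample's loss by its target length), so a fused backward kernel that takes reduction as a parameter cannot be driven from an interface that never receives it. A toy numpy illustration of the scaling, not the cnnl interface:

```python
import numpy as np

# Per-sample CTC losses as they come out of the forward pass.
losses = np.array([2.0, 4.0, 6.0])

# Gradient of the reduced loss w.r.t. each per-sample loss:
g_sum = np.ones_like(losses)                  # reduction='sum' (or 'none')
g_mean = np.ones_like(losses) / losses.size   # reduction='mean' scales by 1/N

assert np.allclose(g_mean * losses.size, g_sum)
```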

使用sh scripts/build_impl.sh cuda编译cuda版本报错

在用使用sh scripts/build_impl.sh cuda的时候编译报错:
报错显示:
“impl/../diopi_test/codegen/gen.py" : [Errno 2] No such file or directory
我搜索代码工程中的gen.py文件,发现有两处:adaptor/codegen/gen.py 和diopi_test/diopi_stub/codegen/gen.py
我替换CMakelists中的文件路径,发现使用上述两个文件编译均不成功。

我的环境是一个python的虚拟环境,python3.8.5 , nvcc的版本11.8,GPU为A100

Cannot run the quick start example

Describe the bug

cd impl && sh scripts/build_impl.sh torch

That is, build the torch operator library. After the build finishes, the following shared libraries are produced:
/home/xiongchao/work/DIOPI/impl/lib
├── export_functions.cpython-39-x86_64-linux-gnu.so
├── export_runtime.cpython-39-x86_64-linux-gnu.so
├── libdiopi_impl.so
└── libdiopirt.so
The test Python script imports export_functions.cpython-39-x86_64-linux-gnu.so, and the make log below shows the operator implementations live in libdiopi_impl.so:
(screenshot)
Yet running readelf -d on the other three .so files shows no dependency on libdiopi_impl.

Environment information
cuda 11.8
torch 2.0.0 (installed via conda)
gcc 9.4
DIOPI commit id 6f6da63
Screenshots
During testing it keeps reporting "not implemented":

python main.py --mode run_test --fname relu
2023-10-17 16:16:25,460 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:25,462 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:25,465 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:25,465 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:25,918 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:26,234 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:26,266 - ConformanceTest - ERROR - NotImplemented: relu not implement
2023-10-17 16:16:26,272 - ConformanceTest - ERROR - NotImplemented: relu not implement

Running other APIs from the Python side also fails.
(screenshot)

My understanding is that, in theory, there should be a dependency on libdiopi_impl.so? Thanks for any help.
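One common workaround when a Python extension resolves symbols from a library it does not list as a DT_NEEDED dependency is to preload that library with RTLD_GLOBAL before importing the extension. A sketch only; the path is taken from the report above, and whether this helps depends on how the build actually resolves the diopi symbols:

```python
import ctypes

# Path from the report above; adjust to your build output.
IMPL = "/home/xiongchao/work/DIOPI/impl/lib/libdiopi_impl.so"

try:
    # RTLD_GLOBAL puts the library's symbols into the global namespace,
    # making them visible to extensions imported afterwards.
    ctypes.CDLL(IMPL, mode=ctypes.RTLD_GLOBAL)
except OSError as e:
    print("could not preload:", e)

# import export_functions  # should now be able to resolve diopi* symbols
```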

[DeepLink v0.26] DIOPI bi-monthly development plan

[Planned version]

v0.26

[Expected period]

July to August

[Development content]

  1. Support training the LLaMA large model end to end
  2. Define the operator interfaces required by 6 new models and other promising models, and verify the implementations on cuda;
  3. Implement and verify the operator interfaces of the 16 defined models on camb;
  4. Implement and verify the operator interfaces of a total of 7 defined models on two other hardware platforms;
  5. Improve the DIOPI conformance test, e.g. add testing of non-contiguous inputs.

Documentation updates and missing build instructions

  1. [Docs not fully updated] Many parts of the GitHub documentation are outdated and contain errors, e.g. the non-existent header file diopi_register.h
    https://github.com/DeepLink-org/DIOPI/blob/main/README.md
    https://github.com/DeepLink-org/DIOPI/blob/main/impl/README.md
  2. [Add build instructions] Document the common build commands and options. The build currently defaults to DTEST=OFF; to test ops/models you must set DTEST=ON. The DIOPI documentation never mentions this; it only appears in the DIPU documentation: https://github.com/DeepLink-org/dipu/blob/main/QuickStart.md

flash_attn cannot be installed when setting up the Llama2 environment on Ascend

A user ran into this while validating DeepLink on Ascend; the original message follows.
--------------------- original email ---------------------

I see that the DeepLink community has integrated Ascend chips and adapted the llama model. While validating DeepLink on Ascend, I hit the following error installing flash_attn during Llama2 environment setup:

flash_attn turns out to be strongly tied to CUDA, which blocks the validation flow and would require porting work. Have you run into large models depending on libraries tied to a specific chip architecture during your own validation? If so, could you share your solution?

(screenshot)

"Artificial Intelligence — Operator Interfaces" operator development checklist

"Artificial Intelligence — Operator Interfaces, Part 1: Basic Math" — minimal set (79 operators), operators to add or modify (28):

  • 7.2.1.4 Create a dense tensor filled with a specified value
  • 7.2.1.5 Create an uninitialized tensor
  • 7.2.1.7 Create a dense tensor of uniformly distributed random numbers
  • 7.2.1.8 Create a dense tensor of normally distributed random numbers
  • 7.2.1.13 Create a dense tensor evenly spaced over a linear range
  • 7.2.2.1 Shape query
  • 7.2.2.3 Infinity check
  • 7.2.3.2 Reshape a tensor
  • 7.2.6.3 Logical NOT on a tensor
  • 7.2.7.3 Bitwise XOR
  • 7.2.9.3 Truncate toward zero
  • 7.2.9.4 Round to nearest
  • 7.2.10.3 Tangent
  • 7.2.10.5 Arccosine
  • 7.2.11.1 Hyperbolic sine
  • 7.2.11.2 Hyperbolic cosine
  • 7.2.11.4 Inverse hyperbolic sine
  • 7.2.11.5 Inverse hyperbolic cosine
  • 7.2.11.6 Inverse hyperbolic tangent
  • 7.2.12.2 Exponential function (extended)
  • 7.2.12.4 Base-e logarithm (extended)
  • 7.2.13.2 Prefix sum
  • 7.2.14.2 Index of minimum
  • 7.2.14.3 Sort indices
  • 7.2.15.1 Complex construction
  • 7.2.15.2 Complex conjugate
  • 7.2.15.3 Get imaginary part
  • 7.2.15.4 Get real part

"Artificial Intelligence — Operator Interfaces, Part 2: Neural Networks" — minimal set (22 operators), operators to add or modify (20):

  • 7.2.1.3 Piecewise-linear approximation of the sigmoid function
  • 7.2.1.6 Rectified linear unit
  • 7.2.1.7 Thresholded rectified linear unit
  • 7.2.1.8 Exponential linear unit
  • 7.2.1.10 Parametric rectified linear unit
  • 7.2.1.11 Scaled exponential linear unit
  • 7.2.1.14 Softplus function
  • 7.2.1.15 Softsign function
  • 7.2.1.18 Error function
  • 7.2.3.1 Dropout
  • 7.2.4.1 Batch normalization
  • 7.2.4.2 Group normalization
  • 7.2.4.3 Layer normalization
  • 7.2.4.4 Instance normalization
  • 7.2.4.7 Lp-norm normalization
  • 7.2.5.1 1D pooling
  • 7.2.5.2 2D pooling
  • 7.2.5.3 3D pooling
  • 7.2.6.6 3D transposed convolution
  • 7.2.11.1 Grid interpolation sampling

Casting NaN from float32 to float16 yields 0

Describe the bug
Casting NaN from float32 to float16 produces 0 instead of NaN.

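For reference, IEEE-754 narrowing conversions preserve NaN; the expected behaviour can be checked against numpy on CPU:

```python
import math
import numpy as np

x = np.array([np.nan], dtype=np.float32)
y = x.astype(np.float16)  # the narrowing cast must preserve NaN

assert math.isnan(float(y[0]))  # [nan], not [0.]
```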

NonZero Problem

Describe the bug
When testing FCOS on MLU, there are bugs related to Nonzero and index that occur only for the bool dtype. We have worked around them via bool2int for now; however, cnnl should support bool the same way it supports int. #545
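The bool2int workaround mentioned above amounts to casting the mask to an integer dtype before calling nonzero; a numpy sketch of the equivalence (the MLU kernel itself is cnnl's and is not shown here):

```python
import numpy as np

mask = np.array([[True, False], [False, True]])

# Workaround: cast bool -> int32 before nonzero; the indices are identical.
idx_bool = np.nonzero(mask)
idx_int = np.nonzero(mask.astype(np.int32))

assert all((a == b).all() for a, b in zip(idx_bool, idx_int))
```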

Operators needed for the lmdeploy adaptation that lack a Huawei implementation

1. DIOPI interfaces that have no Huawei implementation yet:
diopiexpand
diopiSplitWithSizes
diopiLinspace
2. DIOPI extension interfaces that have no Huawei implementation yet; they share the lightlm implementation.
diopiRMSNorm

Adapting mmcv's modulated deform conv operator

Describe the bug
When running real models, the modulated deform conv operator hits cases where the output tensor's shape is inconsistent with the shapes of the offset and mask tensors, so the cnnl kernel's requirements cannot be met.
(screenshot)

(screenshot)

For now there is no transformation that can align these three tensors' shapes to satisfy the kernel, because every dimension of these NCHW-format tensors already carries a definite, concrete meaning.

[CNNL] [Error]:[cnnlReduce] indices shouldn't be null when indices_size_inbytes > 0.

The diopi (camb) implementation together with dipu appears to have a problem:

import torch
import torch_dipu
device = torch.device("dipu")
x = torch.randn(2, 2).to(device)
x.sum()

Error message:


>>> x.sum()
[2023-5-22 17:48:58] [CNNL] [Error]:[cnnlReduce] indices shouldn't be null when indices_size_inbytes > 0.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: /mnt/share/share_data/dongkaixing/mmcv_dipu/dipu_poc_new/torch_dipu/csrc_dipu/aten/ops/AutoGenedKernels.cpp:1168'diopiadaptor::diopiSum(ctx, outDiopiTensorHandle, selfDiopiTensorHandle, diopi_size);' error, error code is 1error message is diopiErrorOccurred: diopiErrorOccurred: cnnl error 3 : CNNL_STATUS_BAD_PARAM at /mnt/share/share_data/dongkaixing/mmcv_dipu/dipu_poc_new/third_party/DIOPI/DIOPI-IMPL/camb/functions/reduce.cpp:121 called by `reduce_dim_impl` at /mnt/share/share_data/dongkaixing/mmcv_dipu/dipu_poc_new/third_party/DIOPI/DIOPI-IMPL/camb/functions/reduce.cpp:157
 called by `diopiSum` at /mnt/share/share_data/dongkaixing/mmcv_dipu/dipu_poc_new/third_party/DIOPI/DIOPI-IMPL/camb/functions/reduce.cpp:171


diopiSum appears to be the failing call, from torch_dipu/csrc_dipu/aten/ops/AutoGenedKernels.cpp:1168

diopi commit: 4b1481d
dipu commit: b1b2d7bcad8c65bc9b5e650f45eecbc9ddfe2032

Aborted during Ascend testing

Describe the bug
(screenshot)
There is no other error output; the stack obtained with gdb seems to show the error occurs while loading the runtime and functions.
(screenshot)


Use diopiTensorCopyFromBuffer for the H2D copy

Current state
In diopi_test, the D2H transfer is done by calling diopiTensorCopyToBuffer(). This function exposes the tensor's pointer and is a weak symbol, so a vendor can override it and, based on the tensor metadata and hardware characteristics, perform efficient data-type and memory-layout conversion during the copy.

The H2D path is different: the Tensor constructor calls device_memcpy_h2d_async directly, copying to the device as linear memory.
Expectation
H2D should mirror D2H by calling diopiTensorCopyFromBuffer. Vendors could then override that function to handle data-type and memory-layout conversion during H2D.

Unit-test documentation is not detailed enough

The unit-test documentation is not detailed enough, e.g. on whether options such as using only a certain data type, testing only forward or backward, or comparing only a certain output are supported, and on how to use those options.

Dangerous warnings in C++ codes

Describe the bug
Too many dangerous warnings, such as "no return statement", "may be used uninitialized", and "control reaches end of non-void function"...

To Reproduce

$ cd impl; bash scripts/build_impl.sh torch

Expected behavior
No warnings during compilation.

It would be better to add -Werror to the compile options.
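A sketch of how that could look in the impl build's CMakeLists; whether to apply it globally or per target is a project decision, and the flag list below is only a starting point:

```cmake
# Promote the warning classes above to errors; -Wall/-Wextra surface them.
add_compile_options(-Wall -Wextra -Werror=return-type -Werror=uninitialized)
# Or, more strictly, fail on every warning:
# add_compile_options(-Wall -Wextra -Werror)
```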

printDevData is somewhat over-engineered

template <typename T>
void print_tensor_impl(diopiContextHandle_t ctx, DiopiTensor tensor, int max_num = -1, std::string label = "") {
    printf("[%s, %s, %p] \ndata: ", label.c_str(), DiopiDataType::dataTypeStr(tensor.dtype()).c_str(), tensor.data());
    int N = max_num;
    if (max_num == -1 || N > tensor.numel()) {
        N = tensor.numel();
    }
    void* cpu_ptr = malloc(N * sizeof(T));
    CNRT_CHECK(cnrtMemcpyAsync(cpu_ptr, tensor.data(), N * sizeof(T), getStream(ctx), CNRT_MEM_TRANS_DIR_DEV2HOST));
    syncStreamInCtx(ctx);
    for (int i = 0; i < N; ++i) {
        std::cout << *((T*)cpu_ptr + i) << ", ";
    }
    std::cout << "\n";
    free(cpu_ptr);
}

void print_tensor(diopiContextHandle_t ctx, DiopiTensor tensor, int max_num = -1, std::string label = "") {
    if (tensor.dtype() == diopi_dtype_int64) {
        print_tensor_impl<int64_t>(ctx, tensor, max_num, label);
    } else if (tensor.dtype() == diopi_dtype_float64) {
        print_tensor_impl<double>(ctx, tensor, max_num, label);
    } else if (tensor.dtype() == diopi_dtype_int32) {
        print_tensor_impl<int32_t>(ctx, tensor, max_num, label);
    } else if (tensor.dtype() == diopi_dtype_float32) {
        print_tensor_impl<float>(ctx, tensor, max_num, label);
    } else {
        std::cout << "@@@ unsupported datatype " << DiopiDataType::dataTypeStr(tensor.dtype()) << "\n";
    }
}

Written this way, printing a tensor becomes simpler:

diopiContextHandle_t ctx;
DiopiTensor a, b, c;

// print directly
print_tensor(ctx, a);

// print ten elements
print_tensor(ctx, b, 10);

// print all elements, with a label to make it easy to find
print_tensor(ctx, c, -1, "tensor c");


Index put kernel bug and overflow issue

  1. Index put kernel bug (accumulate is true and indices are bool): when is this expected to be fixed?
  2. There is also an overflow issue: on the 370 the result is a random number.
  • When accumulate = true and the data type of input is int32, int16, int8 or uint8, the accumulation result supports overflow on MLU500 series (excluding tp_520) and higher platforms; on platforms earlier than the MLU500 series, overflow is not supported and undefined behaviour may occur if overflow arises.
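For comparison, an index_put-style accumulation on CPU (numpy's np.add.at) wraps on integer overflow rather than producing random values; a sketch assuming int8 input with accumulate=True semantics:

```python
import numpy as np

a = np.array([120], dtype=np.int8)
idx = np.array([0, 0])

# accumulate=True semantics: add 10 at index 0 twice; 120 + 10 + 10
# overflows int8 and wraps modulo 256 on CPU.
np.add.at(a, idx, 10)

assert a[0] == -116  # 140 - 256
```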

[User feedback] Questions from validating DeepLink with the Llama2-7b model

A user hit problems while validating DeepLink with the Llama2-7b model; the original message follows.
--------------------- original email ---------------------
We are currently validating DeepLink with the Llama2-7b model and need help with two problems:

  1. Training Llama2-7b on Ascend via DeepLink, to verify that training works on both NVIDIA and Ascend with no code changes. I used the llama2-chinese scripts, but while setting up the Llama2 environment I found flash_attn strongly tied to CUDA and impossible to install on Ascend. I suspect validating DeepLink requires specific scripts, but I could not find any among the evaluated models on your GitHub. Could you provide them? Many thanks!

  2. Training Llama2-7b on NVIDIA via DeepLink raises IndexError: map::at. We initially traced it to the nonexistent :7 in device='cuda:7'; that issue is reportedly fixed, but after updating DeepLink the error below still occurs. Have you seen this? Is there a temporary workaround?

(screenshot)
