
rwkv-cuda's Introduction

BlinkDL

A minimalist deep learning library in JavaScript using WebGL + asm.js. Runs in your browser.

Currently it is a proof-of-concept (inference only). Note: Convolution is buggy when memories overlap.

The WebGL backend is powered by weblas: https://github.com/waylonflinn/weblas.

Example

https://withablink.coding.me/goPolicyNet/ : a weiqi (baduk, go) policy network in AlphaGo style:

[board image]

const N = 19;
const NN = N * N;
const nFeaturePlane = 8;
const nFilter = 128;

const x = new BlinkArray();
x.Init('weblas');
x.nChannel = nFeaturePlane;
x.data = new Float32Array(nFeaturePlane * NN);
for (var i = 0; i < NN; i++)
    x.data[5 * NN + i] = 1; // set feature plane for empty board

// pre-act residual network with 6 residual blocks
const bak = new Float32Array(nFilter * NN);
x.Convolution(nFilter, 3);
x.CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak);
x.BatchNorm().ReLU().Convolution(1, 1).Softmax();

[performance image]

Usage

<script src='weblas.js' type='text/javascript'></script>
<script src='BlinkDL.js' type='text/javascript'></script>

Todo

  • Convolution (3x3_pad_1 and 1x1), BatchNorm, ReLU, Softmax
  • Pooling layer
  • FC layer
  • Strided convolution
  • Transposed convolution
  • Webworker and async
  • Faster inference with weblas pipeline, WebGPU, WebAssembly
  • Memory manager
  • Training

rwkv-cuda's People

Contributors

bbuf, blealtan, blinkdl, mrsteyk, www


rwkv-cuda's Issues

Multi-GPU support seems to have a problem

I am using the RWKV_Role_Playing project.
When running with --strategy='cuda:1 fp32 *26 -> cuda:0 fp32' --jit_on=1 --cuda_on=1,
it crashes with the following error during generation:

Traceback (most recent call last):
  File "/srv/RWKV/venv/lib/python3.10/site-packages/gradio/routes.py", line 422, in run_predict
    output = await app.get_blocks().process_api(
  File "/srv/RWKV/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1323, in process_api
    result = await self.call_function(
  File "/srv/RWKV/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1051, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/srv/RWKV/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/srv/RWKV/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/srv/RWKV/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/srv/RWKV/RWKV_Role_Playing/modules/ui.py", line 75, in __load_char
    chatbot = self.chat_model.load_init_prompt(char['user'], char['bot'], char['action_start'],
  File "/srv/RWKV/RWKV_Role_Playing/modules/chat.py", line 48, in load_init_prompt
    out, model_tokens, model_state = self.model_utils.run_rnn(model_tokens, model_state, self.model_utils.fix_tokens(self.model_utils.pipeline.encode(init_prompt)))
  File "/srv/RWKV/RWKV_Role_Playing/modules/model_utils.py", line 32, in run_rnn
    out, model_state = self.model.forward(tokens[:self.CHUNK_LEN], model_state)
  File "/srv/RWKV/RWKV_Role_Playing/rwkv/model.py", line 607, in forward
    x, state[i*5+0], state[i*5+1], state[i*5+2], state[i*5+3] = ATT(
torch.jit.Error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/srv/RWKV/RWKV_Role_Playing/rwkv/model.py", line 39, in cuda_att_seq
    def cuda_wkv(T: int, C: int, w, u, k, v, aa, bb, pp):
        assert 1 * C % min(C, 32) == 0
        assert k.dtype == torch.float16
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        w = w.contiguous()
        u = u.contiguous()
RuntimeError: AssertionError:

^CKeyboard interruption in main thread... closing server.

Removing --cuda_on=1 (disabling RWKV-CUDA) makes it work normally.
Keeping --cuda_on=1 but switching to an fp16 strategy (--strategy='cuda:1 fp16 ...') also works.
Using --strategy='cuda:1 fp32 *26 -> cpu fp32' fails again.

So far, reproducing by elimination, the conclusion appears to be:
when RWKV-CUDA is enabled, no strategy that splits the model across multiple devices can be used.
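
The traceback above shows the custom CUDA path in rwkv/model.py asserting k.dtype == torch.float16, so any layer kept in fp32 that is routed through the CUDA kernel will trip that assert. Below is a minimal sketch of a configuration that stays on the fp16 CUDA path, assuming the rwkv pip package API that RWKV_Role_Playing wraps; the model path and token ids are placeholders, not taken from the issue.

import os
# Set before importing rwkv.model; mirrors --jit_on=1 / --cuda_on=1 above.
os.environ["RWKV_JIT_ON"] = "1"
os.environ["RWKV_CUDA_ON"] = "1"

from rwkv.model import RWKV

# fp16 on the CUDA-kernel layers satisfies the k.dtype == torch.float16 assert.
model = RWKV(model="path/to/RWKV-4-model", strategy="cuda:1 fp16")  # placeholder path
out, state = model.forward([187, 510, 1563], None)  # placeholder token ids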

python3 run.py failed

(gh_baize-chatbot) ub2004@ub2004-B85M-A0:~/llm_dev/RWKV-CUDA/wkv$ python3 run.py
Using /home/ub2004/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Creating extension directory /home/ub2004/.cache/torch_extensions/py38_cu117/wkv...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ub2004/.cache/torch_extensions/py38_cu117/wkv/build.ninja...
Building extension module wkv...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF wkv_op.o.d -DTORCH_EXTENSION_NAME=wkv -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/ub2004/.local/lib/python3.8/site-packages/torch/include -isystem /home/ub2004/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/ub2004/.local/lib/python3.8/site-packages/torch/include/TH -isystem /home/ub2004/.local/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 /wd4624 -c /home/ub2004/llm_dev/RWKV-CUDA/wkv/cuda/wkv_op.cpp -o wkv_op.o
FAILED: wkv_op.o
c++ -MMD -MF wkv_op.o.d -DTORCH_EXTENSION_NAME=wkv -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/ub2004/.local/lib/python3.8/site-packages/torch/include -isystem /home/ub2004/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/ub2004/.local/lib/python3.8/site-packages/torch/include/TH -isystem /home/ub2004/.local/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 /wd4624 -c /home/ub2004/llm_dev/RWKV-CUDA/wkv/cuda/wkv_op.cpp -o wkv_op.o
c++: error: /wd4624: No such file or directory
[2/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=wkv -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/ub2004/.local/lib/python3.8/site-packages/torch/include -isystem /home/ub2004/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/ub2004/.local/lib/python3.8/site-packages/torch/include/TH -isystem /home/ub2004/.local/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' --use_fast_math --extra-device-vectorization -std=c++17 -c /home/ub2004/llm_dev/RWKV-CUDA/wkv/cuda/wkv_cuda_v2.cu -o wkv_cuda_v2.cuda.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/ub2004/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "run.py", line 86, in
wkv_cuda = load(name="wkv", sources=["cuda/wkv_op.cpp", f"cuda/wkv_cuda_v{CUDA_KERNEL_VERSION}.cu"],
File "/home/ub2004/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/home/ub2004/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/ub2004/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/ub2004/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'wkv'
(gh_baize-chatbot) ub2004@ub2004-B85M-A0:~/llm_dev/RWKV-CUDA/wkv$
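
For context, the flag that fails here is /wd4624, an MSVC-style "disable warning C4624" switch; on Linux, c++ (g++) treats it as an input file, hence "No such file or directory". Below is a hedged sketch of guarding that flag by platform in the torch.utils.cpp_extension.load call that the traceback points at; the CUDA flag list and kernel version are assumptions based on the log above, not the repository's exact run.py.

import os
from torch.utils.cpp_extension import load

CUDA_KERNEL_VERSION = 2  # placeholder; matches wkv_cuda_v2.cu in the log above

wkv_cuda = load(
    name="wkv",
    sources=["cuda/wkv_op.cpp", f"cuda/wkv_cuda_v{CUDA_KERNEL_VERSION}.cu"],
    verbose=True,
    # /wd4624 is MSVC-only; g++ rejects it, so pass it on Windows alone.
    extra_cflags=["/wd4624"] if os.name == "nt" else [],
    extra_cuda_cflags=["--use_fast_math", "--extra-device-vectorization"],
)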

Hello author, a question

Hello author, we reproduced your code and found that using CUDA can indeed provide a large speedup. We are currently researching how to use CUDA to accelerate a UNet network to solve some problems, and would like to ask you some questions about CUDA. If, after learning more, you are interested in us, we would like to collaborate with you on some research. If that works for you, please reply. @BlinkDL

Error when running in Colab

Running python run.py produces the following error:

Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py37_cu113/timex/build.ninja...
Building extension module timex...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF timex_op.o.d -DTORCH_EXTENSION_NAME=timex -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.7/dist-packages/torch/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.7/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 /wd4624 -c /content/RWKV-CUDA/depthwise_conv1d/cuda/timex_op.cpp -o timex_op.o 
FAILED: timex_op.o 
c++ -MMD -MF timex_op.o.d -DTORCH_EXTENSION_NAME=timex -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.7/dist-packages/torch/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.7/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 /wd4624 -c /content/RWKV-CUDA/depthwise_conv1d/cuda/timex_op.cpp -o timex_op.o 
c++: error: /wd4624: No such file or directory
[2/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=timex -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.7/dist-packages/torch/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.7/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' --use_fast_math --extra-device-vectorization -DTmax=768 -DBF=8 -DBB=2 -std=c++14 -c /content/RWKV-CUDA/depthwise_conv1d/cuda/timex_cuda_v3.cu -o timex_cuda_v3.cuda.o 
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/cpp_extension.py", line 1746, in _run_ninja_build
    env=env)
  File "/usr/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run.py", line 72, in <module>
    verbose=True, extra_cuda_cflags=['--use_fast_math', '--extra-device-vectorization', f'-DTmax={T_MAX}', f'-DBF={B_GROUP_FORWARD}', f'-DBB={B_GROUP_BACKWARD}'], extra_cflags=['/wd4624'])
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/cpp_extension.py", line 1156, in load
    keep_intermediates=keep_intermediates)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/cpp_extension.py", line 1367, in _jit_compile
    is_standalone=is_standalone)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/cpp_extension.py", line 1472, in _write_ninja_file_and_build_library
    error_prefix=f"Error building extension '{name}'")
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/cpp_extension.py", line 1756, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'timex'
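
This is the same root cause as the wkv build failure above: the load call at run.py line 72 (quoted in the traceback) passes extra_cflags=['/wd4624'], an MSVC-only switch that Linux g++ treats as a file path. A hedged variant of that call for the timex extension, with the flag guarded by platform; the Tmax/BF/BB values are taken from the nvcc line in the log, and the variable names mirror those shown in the traceback.

import os
from torch.utils.cpp_extension import load

T_MAX, B_GROUP_FORWARD, B_GROUP_BACKWARD = 768, 8, 2  # values from the nvcc line above

timex_cuda = load(
    name="timex",
    sources=["cuda/timex_op.cpp", "cuda/timex_cuda_v3.cu"],
    verbose=True,
    extra_cuda_cflags=["--use_fast_math", "--extra-device-vectorization",
                       f"-DTmax={T_MAX}", f"-DBF={B_GROUP_FORWARD}", f"-DBB={B_GROUP_BACKWARD}"],
    extra_cflags=["/wd4624"] if os.name == "nt" else [],  # MSVC-only warning switch
)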

Is the 1-D depthwise conv still critical for RWKV?

It seems RWKV-v4 doesn't use the 1-D depthwise kernel you developed earlier. Is the 1-D depthwise CUDA kernel in this repo still a critical operator for RWKV?

I just want to check: if I intend to contribute to this project, which CUDA kernel should I work on? Should I work on the code in the 1-D depthwise folder or the code in the WKV folders of this repo?

Considering contributing it to PyTorch?

Hey @BlinkDL, I already posted under your issue in the PyTorch main repo. Would you consider contributing the 1-D conv code to PyTorch? This is very relevant to the project I am currently working on, and I would also be glad to help you with it.

Best,
Julien
