Comments (12)

LaurentMazare avatar LaurentMazare commented on August 22, 2024

Matmuls are actually also supported for transposed layouts on top of contiguous ones, so we enforce contiguous or transposed-contiguous inputs, as these are supported on all devices plus mkl and accelerate, and are likely to be the fastest. Supporting arbitrary strides is dodgy (e.g. on mkl), hence the limitation at the moment.
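
For concreteness, a minimal sketch of the two accepted layouts, assuming the current candle-core API:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let a = Tensor::randn(0f32, 1.0, (4, 8), &dev)?;
    let b = Tensor::randn(0f32, 1.0, (2, 8), &dev)?;

    // `t()` only swaps the strides of the last two dims, so `b.t()?` is
    // "transposed contiguous": this layout is accepted by matmul as-is.
    let c = a.matmul(&b.t()?)?; // (4, 8) x (8, 2) -> (4, 2)

    // A narrowed view has an arbitrary stride and is neither contiguous nor
    // transposed contiguous, so it needs an explicit copy before the matmul.
    let b2 = Tensor::randn(0f32, 1.0, (8, 8), &dev)?.narrow(1, 0, 4)?;
    let d = a.matmul(&b2.contiguous()?)?; // (4, 8) x (8, 4) -> (4, 4)

    println!("{:?} {:?}", c.shape(), d.shape());
    Ok(())
}
```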

from candle.

EricLBuehler avatar EricLBuehler commented on August 22, 2024

I plan on writing this for CUDA; would that work? Do you know of any good resources or places where I can find a sample implementation?

from candle.

LaurentMazare avatar LaurentMazare commented on August 22, 2024

I'm not sure it would. We use cublas for the matmul (actually this function), and it doesn't let you work with arbitrary strides except for the batch dimensions; you can only specify transposed or not transposed, which is what we're doing.
The recent lazy reshape work should make the situation a bit better though and avoid some unnecessary copies (and there is one more change to come on this front that will add a fast copy mode in some specific cases).
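
Purely as an illustration (this is not candle's actual check), the layouts that map onto cublas' transpose / no-transpose flags are the ones where the last two dims are packed in either row-major or column-major order:

```rust
// Illustrative only: classify the last two dims of a layout as row-major
// contiguous, transposed contiguous, or neither.
#[derive(Debug, PartialEq)]
enum MatmulLayout {
    Contiguous,
    TransposedContiguous,
    Unsupported,
}

fn classify(dims: &[usize], strides: &[usize]) -> MatmulLayout {
    let n = dims.len();
    if n < 2 || strides.len() != n {
        return MatmulLayout::Unsupported;
    }
    let (m, k) = (dims[n - 2], dims[n - 1]);
    let (s_m, s_k) = (strides[n - 2], strides[n - 1]);
    if s_k == 1 && s_m == k {
        MatmulLayout::Contiguous
    } else if s_m == 1 && s_k == m {
        MatmulLayout::TransposedContiguous
    } else {
        MatmulLayout::Unsupported
    }
}

fn main() {
    // A (4, 8) row-major matrix has strides (8, 1).
    assert_eq!(classify(&[4, 8], &[8, 1]), MatmulLayout::Contiguous);
    // The same buffer seen through a transpose: dims (8, 4), strides (1, 8).
    assert_eq!(classify(&[8, 4], &[1, 8]), MatmulLayout::TransposedContiguous);
    // A slice with a gap between rows is neither, so it needs a copy.
    assert_eq!(classify(&[4, 8], &[16, 1]), MatmulLayout::Unsupported);
}
```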

from candle.

EricLBuehler avatar EricLBuehler commented on August 22, 2024

That sounds exciting! I am trying to create a fast LLM engine with Candle, aiming to be faster than llama.cpp, so any optimization is beneficial. I have been using my fork as a playground to test out new ideas, and I have implemented some fused kernels. Are there any directions I could look into for optimizing performance, specifically on CUDA?

from candle.

LaurentMazare avatar LaurentMazare commented on August 22, 2024

Random thoughts on performance:

  • Be sure to use the latest github version, as performance on cuda for transformer architectures has improved a lot over the last two weeks (as has metal).
  • You should also enable flash-attn if possible (see the sketch after this list).
  • Fused kernels are certainly good.
  • The recently introduced inplace-ops also make it possible to avoid allocations and reduce memory-bandwidth usage, though they are a bit tricky to use manually at the moment.
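
For the flash-attn point, a rough sketch of calling the fused kernel directly, assuming the candle-flash-attn crate, a CUDA device and f16 inputs of shape (batch, seq_len, num_heads, head_dim):

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::new_cuda(0)?;
    let (b, seq, heads, head_dim) = (1, 128, 32, 128);
    let q = Tensor::randn(0f32, 1.0, (b, seq, heads, head_dim), &dev)?.to_dtype(DType::F16)?;
    let k = Tensor::randn(0f32, 1.0, (b, seq, heads, head_dim), &dev)?.to_dtype(DType::F16)?;
    let v = Tensor::randn(0f32, 1.0, (b, seq, heads, head_dim), &dev)?.to_dtype(DType::F16)?;
    let softmax_scale = 1f32 / (head_dim as f32).sqrt();
    // Single fused kernel for causal attention instead of matmul + softmax + matmul.
    let out = candle_flash_attn::flash_attn(&q, &k, &v, softmax_scale, true)?;
    println!("{:?}", out.shape());
    Ok(())
}
```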

from candle.

EricLBuehler avatar EricLBuehler commented on August 22, 2024

I have been following the recent changes and look forward to trying them out! I am running what is essentially the quantized_llama code as my test case, so I don't think flash attention is an option for that, unfortunately (unless we implemented an f32 version).

Regarding inplace-ops, could that be implemented for CUDA by passing the src slice = dst slice (depending on kernel impl, of course)?

from candle.

LaurentMazare avatar LaurentMazare commented on August 22, 2024

Regarding inplace-ops, could that be implemented for CUDA by passing the src slice = dst slice (depending on kernel impl, of course)?

Yes, we don't have a cuda test to show this, but here is the cpu one: custom_op_test.
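
To make the "src slice = dst slice" idea concrete, here is the CPU-side shape of it (leaving candle's actual op traits aside): the in-place variant is the same kernel with the destination aliasing the source, which removes an allocation and roughly halves the memory traffic.

```rust
// Illustrative only: out-of-place vs in-place application of the same elementwise op.
fn relu_out_of_place(src: &[f32], dst: &mut [f32]) {
    for (d, s) in dst.iter_mut().zip(src.iter()) {
        *d = s.max(0.0);
    }
}

fn relu_in_place(buf: &mut [f32]) {
    // Same loop body, but reading and writing the same buffer.
    for v in buf.iter_mut() {
        *v = v.max(0.0);
    }
}

fn main() {
    let src = vec![-1.0f32, 2.0, -3.0];
    let mut dst = vec![0.0f32; src.len()];
    relu_out_of_place(&src, &mut dst);

    let mut buf = vec![-1.0f32, 2.0, -3.0];
    relu_in_place(&mut buf);
    assert_eq!(dst, buf);
}
```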

Also, besides this, I've spent quite some time using nsys profile and nsys-ui lately, and it's really great for seeing where the time is actually spent on the cuda side. For the in-house transformer model I've been looking at, ~85% of the time is spent in the matmul kernels, so there is not much to be gained by optimizing the other kernels.

from candle.

EricLBuehler avatar EricLBuehler commented on August 22, 2024

I ran a flamegraph with mistral.rs, and unfortunately it shows that much of the time is spent in the call to synchronize the GPU (almost 2x more than the model shows). It seems strange that llama.cpp can be so much faster, but looking at their code, I think they have their own matmul kernel which allows non-contiguous tensors: https://github.com/ggerganov/llama.cpp/blob/c342d070c64a1ffe35d22c1b16b672e684a30297/ggml-cuda.cu#L1111. Do you think that avoiding the .contiguous calls and using a kernel like this could improve performance? I wonder what measures can currently be taken to optimize candle.

from candle.

LaurentMazare avatar LaurentMazare commented on August 22, 2024

Is it a normal cpu flamegraph? If so, it's not representative of where the time is actually spent if you use cuda. I've optimized the synchronization points with the gpu a bit - removing most of them in #1886 - and it actually doesn't make much of a difference except for very small transformers (which we do use internally, so that was actually quite a win there); for a 7b it's negligible, at least in my setup. You certainly see them disappearing in nsys-ui, but that doesn't change much as the time is spent in the matmul kernels.

Here is the kernel occupation on my RTX 2080 for the quantized example using a q4 7b model; you can see that we spend almost all the time in the dequantize_mul_mat_vec_q4_0... kernels that do the matmul, while the copy kernels are almost invisible. On a full exec I get 95% in the matmul, 1% in the rms-norm, 0.9% in the optimized kv-concatenation kernel, and the actual contiguous ops are reported as 0.0% in nsys-ui.

[nsys-ui kernel trace screenshot: 20240330-trace]

from candle.

EricLBuehler avatar EricLBuehler commented on August 22, 2024

Is it a normal cpu flamegraph?

Yes

I wonder what llama.cpp is doing that makes it so much faster. Clearly, candle is spending much of its time in matmul, so there is probably not much more to trim. I noticed that they use some sort of graph for model execution; do you know of any other techniques they employ?

from candle.

LaurentMazare avatar LaurentMazare commented on August 22, 2024

I'm not sure what llama.cpp is doing differently; certainly interested if you manage to dig some of this out.
Also, just to mention, I double-checked and with the latest backend tweaks all the contiguous ops in the inference loop for the quantized example are actually no-ops (they don't trigger any cuda kernel or anything). I'm looking at removing them as there are some spurious checks, but I'm pretty sure there isn't much to be gained on that front.
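
A quick way to check this from the user side, assuming the current candle-core is_contiguous/contiguous API:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let t = Tensor::randn(0f32, 1.0, (2, 3, 4), &dev)?;

    // Already contiguous: `contiguous()` is expected to be a cheap no-op,
    // no copy kernel should be launched.
    assert!(t.is_contiguous());
    let _same = t.contiguous()?;

    // After a transpose the layout is no longer contiguous, so this one does copy.
    let tt = t.transpose(1, 2)?;
    assert!(!tt.is_contiguous());
    let _copied = tt.contiguous()?;
    Ok(())
}
```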

from candle.

EricLBuehler avatar EricLBuehler commented on August 22, 2024

Thanks for the help!

from candle.
