
Comments (4)

daineAMD commented on July 20, 2024

Hi @jordan44665, thanks for opening this issue.

I see that you're using hipMallocManaged() to allocate your device memory. The HIP managed memory documentation recommends calling hipDeviceGetAttribute() to query hipDeviceAttributeManagedMemory before allocating memory using hipMallocManaged(). Can you please try this to see if it is supported on your device? I think this is the root of your performance woes.
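A minimal sketch of that check might look like the following (`hipDeviceGetAttribute()` and `hipDeviceAttributeManagedMemory` are the actual HIP API; the surrounding error handling and device index are just illustrative):

```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

int main() {
    int device = 0;   // illustrative: query the first device
    int managed = 0;
    // Ask whether this device supports managed (unified) memory
    // before relying on hipMallocManaged().
    hipError_t err = hipDeviceGetAttribute(
        &managed, hipDeviceAttributeManagedMemory, device);
    if (err != hipSuccess) {
        std::fprintf(stderr, "hipDeviceGetAttribute failed: %s\n",
                     hipGetErrorString(err));
        return 1;
    }
    std::printf("Managed memory %s on device %d\n",
                managed ? "supported" : "NOT supported", device);
    return 0;
}
```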

If it is not supported, can you try changing hipMallocManaged to hipMalloc in your code? I hope this will make the performance more in line with what you're expecting.

Also note that there is a one-time startup cost to load gemm kernels for the first call to hipblasSgemm, as seen with your 512 case being slower than your 1024 case. When measuring performance you can try adding "cold call(s)" outside of the timing framework to avoid measuring this one-time cost.
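One way to exclude that one-time cost is an untimed warm-up call before the timed loop. A sketch, assuming the handle and device buffers `dA`, `dB`, `dC` (each at least n*n floats) are set up as in your code:

```cpp
#include <chrono>
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>

// Average seconds per n x n sgemm, excluding the one-time
// kernel-load cost via an untimed warm-up call.
double time_sgemm(hipblasHandle_t handle, int n, int iters,
                  const float* dA, const float* dB, float* dC) {
    const float alpha = 1.0f, beta = 0.0f;

    // Untimed "cold" call absorbs the one-time startup cost.
    hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N, n, n, n,
                 &alpha, dA, n, dB, n, &beta, dC, n);
    hipDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N, n, n, n,
                     &alpha, dA, n, dB, n, &beta, dC, n);
    // gemm launches are asynchronous; wait before stopping the clock.
    hipDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / iters;
}
```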

Thanks,
Daine

from hipblas.

jordan44665 commented on July 20, 2024

Hi @daineAMD,

I changed hipMallocManaged() to hipMalloc() (see attached source). The W6800 numbers are mostly better than before, but still significantly slower than expected. Here are the results:

---------------- W6800 -----------------
size 256 average 0.256122 s
size 512 average 0.000284425 s
size 1024 average 0.000909363 s
size 2048 average 0.00233341 s
size 4096 average 0.0078005 s
size 8192 average 0.071839 s

---------- RTX 2060 ---------
size 256 average 5.9744e-05 s
size 512 average 6.02688e-05 s
size 1024 average 0.000408064 s
size 2048 average 0.00289365 s
size 4096 average 0.0236227 s
size 8192 average 0.156054 s


I just changed the following functions:

gpuMallocManaged() to gpuMalloc() in main() -- three instances

cudaMallocManaged() to cudaMalloc()
hipMallocManaged() to hipMalloc()

If you like, I can attach the full source with the modifications.
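For context, the gpuMalloc()/gpuMallocManaged() naming suggests a portability shim along these lines; this is a hypothetical reconstruction, and the actual macro layer in the source may differ:

```cpp
// Hypothetical portability shim matching the gpuMalloc() /
// gpuMallocManaged() names used in main(); the real code may differ.
#if defined(__HIP_PLATFORM_AMD__)
  #include <hip/hip_runtime.h>
  #define gpuMalloc        hipMalloc
  #define gpuMallocManaged hipMallocManaged
#else
  #include <cuda_runtime.h>
  #define gpuMalloc        cudaMalloc
  #define gpuMallocManaged cudaMallocManaged
#endif
```

With a shim like this, switching the three call sites in main() from gpuMallocManaged() to gpuMalloc() changes both the CUDA and HIP builds at once.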


daineAMD commented on July 20, 2024

Hi again,

These numbers look fairly reasonable to me, at least for the large sizes. Counting 2·n³ flops per sgemm, those times correspond to ~17.6 Tflops for size = 4096 and ~15.3 Tflops for size = 8192, which seems fairly close to peak performance on this hardware.

The smaller sizes likely haven't gotten as much tuning focus on navi21. With the AMD backend, we use Tensile for gemm kernels; you can try opening an issue at the Tensile repo listing the specific sizes you're interested in and see if they can help with tuning. hipBLAS itself is just a thin wrapper over the rocBLAS/cuBLAS backends, so it doesn't have any major impact on performance.

As for the size = 256 performance, it's still absorbing the one-time startup cost; it would be more accurate to do a "warm-up" call to hipblasSgemm() before benchmarking the timed calls.

Thanks again,
Daine


daineAMD commented on July 20, 2024

@jordan44665 I'll close this issue as I haven't heard back in a while and I think the crux of the issue is resolved. Hopefully Tensile folks can help you out if you have tuning requests for any specific sizes. Feel free to re-open if I haven't addressed any of your concerns, thanks!

