Comments (4)
Hi @jordan44665, thanks for opening this issue.
I see that you're using hipMallocManaged() to allocate your device memory. The HIP managed memory documentation recommends calling hipDeviceGetAttribute() to query hipDeviceAttributeManagedMemory before allocating memory with hipMallocManaged(). Can you please try this to see if managed memory is supported on your device? I think this is the root of your performance woes.
If it is not supported, can you try changing hipMallocManaged to hipMalloc in your code? I hope this will bring the performance more in line with what you're expecting.
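If it helps, here's a minimal sketch of that check (assuming device 0; the buffer size is just illustrative):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    // Query whether device 0 supports managed (unified) memory.
    int managed = 0;
    hipDeviceGetAttribute(&managed, hipDeviceAttributeManagedMemory, 0);
    printf("managed memory supported: %d\n", managed);

    float* dA = nullptr;
    const size_t bytes = 1024 * 1024 * sizeof(float);
    if (managed) {
        // Supported: hipMallocManaged() gives a pointer valid on host and device.
        hipMallocManaged((void**)&dA, bytes);
    } else {
        // Not supported: fall back to device-only memory and explicit copies.
        hipMalloc((void**)&dA, bytes);
    }
    hipFree(dA);
    return 0;
}
```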
Also note that there is a one-time startup cost to load the gemm kernels on the first call to hipblasSgemm, which is why your 512 case is slower than your 1024 case. When measuring performance, you can add "cold call(s)" outside of the timing framework to avoid measuring this one-time cost.
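Something along these lines (a rough sketch, not your exact harness; it assumes dA, dB, dC are device buffers holding at least N*N floats and that the handle has already been created):

```cpp
#include <hip/hip_runtime.h>
#include <hipblas.h>
#include <chrono>
#include <cstdio>

// Rough sketch: time one N x N sgemm, excluding the one-time kernel-load cost.
void bench_sgemm(hipblasHandle_t handle, int N,
                 const float* dA, const float* dB, float* dC) {
    const float alpha = 1.0f, beta = 0.0f;

    // Cold call outside the timed region absorbs the kernel-load overhead.
    hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N, N, N, N,
                 &alpha, dA, N, dB, N, &beta, dC, N);
    hipDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N, N, N, N,
                 &alpha, dA, N, dB, N, &beta, dC, N);
    hipDeviceSynchronize();  // the launch is asynchronous; wait before stopping the clock
    auto t1 = std::chrono::steady_clock::now();

    printf("size %d took %g s\n", N,
           std::chrono::duration<double>(t1 - t0).count());
}
```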
Thanks,
Daine
Hi @daineAMD,
I changed hipMallocManaged() to hipMalloc() (see attached source). The W6800 numbers are mostly better than before, but still significantly slower than expected. Here are the results:
---------------- W6800 -----------------
size 256 average 0.256122 s
size 512 average 0.000284425 s
size 1024 average 0.000909363 s
size 2048 average 0.00233341 s
size 4096 average 0.0078005 s
size 8192 average 0.071839 s
---------- RTX 2060 ---------
size 256 average 5.9744e-05 s
size 512 average 6.02688e-05 s
size 1024 average 0.000408064 s
size 2048 average 0.00289365 s
size 4096 average 0.0236227 s
size 8192 average 0.156054 s
I just changed the following functions (the substitution is sketched below):
- gpuMallocManaged() to gpuMalloc() in main() -- three instances
- cudaMallocManaged() to cudaMalloc()
- hipMallocManaged() to hipMalloc()
If you like, I can attach the full source with the modifications.
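Roughly, each substitution went from the unified-memory allocation to a device-only one plus an explicit copy (a sketch; the buffer names and size are just illustrative):

```cpp
#include <hip/hip_runtime.h>
#include <cstdlib>

int main() {
    const int N = 1024;
    const size_t bytes = (size_t)N * N * sizeof(float);

    // Before: hipMallocManaged((void**)&dA, bytes) gave a pointer usable from
    // both host and device, with migration handled by the runtime.

    // After: device-only allocation; host/device transfers become explicit.
    float* hA = (float*)calloc((size_t)N * N, sizeof(float)); // host staging buffer
    float* dA = nullptr;
    hipMalloc((void**)&dA, bytes);
    hipMemcpy(dA, hA, bytes, hipMemcpyHostToDevice);

    // ... run the gemm benchmark on dA ...

    hipFree(dA);
    free(hA);
    return 0;
}
```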
Hi again,
These numbers look fairly reasonable to me, at least for the large sizes. An N x N sgemm performs 2*N^3 flops, so those times correspond to ~17.6 Tflops for size = 4096 and ~15.3 Tflops for size = 8192, which is fairly close to peak performance on this hardware.
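For reference, a tiny sketch of that arithmetic:

```cpp
#include <cstdio>

// Effective Tflop/s of an N x N sgemm: the operation performs 2*N^3 flops.
double sgemm_tflops(int N, double seconds) {
    return 2.0 * (double)N * N * N / seconds / 1e12;
}

int main() {
    printf("%.1f Tflops\n", sgemm_tflops(4096, 0.0078005)); // ~17.6
    printf("%.1f Tflops\n", sgemm_tflops(8192, 0.071839));  // ~15.3
    return 0;
}
```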
The smaller sizes likely haven't gotten as much tuning focus on navi21. With the AMD backend we use Tensile for the gemm kernels; you can try opening an issue on the Tensile repo, specify the sizes you're interested in tuning, and see if they can help. hipBLAS is just a thin wrapper around the rocBLAS/cuBLAS backends, so it doesn't itself have any major impact on performance.
As for the size = 256 performance: it's still absorbing the startup initialization cost, so it would be more accurate to do a "warm-up" call to hipblasSgemm() before benchmarking (see the sketch in my first comment).
Thanks again,
Daine
@jordan44665 I'll close this issue since I haven't heard back in a while and I think the crux of the issue is resolved. Hopefully the Tensile folks can help you out if you have tuning requests for specific sizes. Feel free to re-open if I haven't addressed any of your concerns, thanks!