
Comments (4)

daineAMD commented on July 20, 2024

Hi @jordan44665, thanks for opening this issue.

I see that you're using hipMallocManaged() to allocate your device memory. The HIP managed memory documentation recommends calling hipDeviceGetAttribute() to query hipDeviceAttributeManagedMemory before allocating memory using hipMallocManaged(). Can you please try this to see if it is supported on your device? I think this is the root of your performance woes.
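A minimal sketch of that check might look like the following (`hipDeviceGetAttribute()` and `hipDeviceAttributeManagedMemory` are the actual HIP API; the surrounding error handling and device index are just illustrative):

```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

int main() {
    int device = 0;   // illustrative: query the first device
    int managed = 0;
    // Ask whether this device supports managed (unified) memory
    // before relying on hipMallocManaged().
    hipError_t err = hipDeviceGetAttribute(
        &managed, hipDeviceAttributeManagedMemory, device);
    if (err != hipSuccess) {
        std::fprintf(stderr, "hipDeviceGetAttribute failed: %s\n",
                     hipGetErrorString(err));
        return 1;
    }
    std::printf("Managed memory %s on device %d\n",
                managed ? "supported" : "NOT supported", device);
    return 0;
}
```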

If it is not supported, can you try changing hipMallocManaged to hipMalloc in your code? I hope this will make the performance more in line with what you're expecting.

Also note that there is a one-time startup cost to load gemm kernels for the first call to hipblasSgemm, as seen with your 512 case being slower than your 1024 case. When measuring performance you can try adding "cold call(s)" outside of the timing framework to avoid measuring this one-time cost.
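One way to exclude that one-time cost is an untimed warm-up call before the timed loop. A sketch, assuming the handle and device buffers `dA`, `dB`, `dC` (each at least n*n floats) are set up as in your code:

```cpp
#include <chrono>
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>

// Average seconds per n x n sgemm, excluding the one-time
// kernel-load cost via an untimed warm-up call.
double time_sgemm(hipblasHandle_t handle, int n, int iters,
                  const float* dA, const float* dB, float* dC) {
    const float alpha = 1.0f, beta = 0.0f;

    // Untimed "cold" call absorbs the one-time startup cost.
    hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N, n, n, n,
                 &alpha, dA, n, dB, n, &beta, dC, n);
    hipDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N, n, n, n,
                     &alpha, dA, n, dB, n, &beta, dC, n);
    // gemm launches are asynchronous; wait before stopping the clock.
    hipDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / iters;
}
```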

Thanks,
Daine

from hipblas.

jordan44665 commented on July 20, 2024

Hi @daineAMD,

I changed hipMallocManaged() to hipMalloc() (see attached source). The W6800 numbers are mostly better than before, but still significantly slower than expected. Here are the results:

---------------- W6800 -----------------
size 256 average 0.256122 s
size 512 average 0.000284425 s
size 1024 average 0.000909363 s
size 2048 average 0.00233341 s
size 4096 average 0.0078005 s
size 8192 average 0.071839 s

---------- RTX 2060 ---------
size 256 average 5.9744e-05 s
size 512 average 6.02688e-05 s
size 1024 average 0.000408064 s
size 2048 average 0.00289365 s
size 4096 average 0.0236227 s
size 8192 average 0.156054 s


I just changed the following functions:

gpuMallocManaged() to gpuMalloc() in main() -- three instances

cudaMallocManaged() to cudaMalloc()
hipMallocManaged() to hipMalloc()

If you like, I can attach the full source with the modifications.
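For context, the gpuMalloc()/gpuMallocManaged() naming suggests a portability shim along these lines; this is a hypothetical reconstruction, and the actual macro layer in the source may differ:

```cpp
// Hypothetical portability shim matching the gpuMalloc() /
// gpuMallocManaged() names used in main(); the real code may differ.
#if defined(__HIP_PLATFORM_AMD__)
  #include <hip/hip_runtime.h>
  #define gpuMalloc        hipMalloc
  #define gpuMallocManaged hipMallocManaged
#else
  #include <cuda_runtime.h>
  #define gpuMalloc        cudaMalloc
  #define gpuMallocManaged cudaMallocManaged
#endif
```

With a shim like this, switching the three call sites in main() from gpuMallocManaged() to gpuMalloc() changes both the CUDA and HIP builds at once.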


daineAMD commented on July 20, 2024

Hi again,

These numbers look fairly reasonable to me, at least for the large sizes. Counting 2·n³ flops per sgemm, those times correspond to ~17.6 Tflops for size = 4096 and ~15.3 Tflops for size = 8192, which seems fairly close to peak performance on this hardware.

The smaller sizes likely haven't gotten as much tuning focus on navi21. With the AMD backend, we use Tensile for gemm kernels; you can try opening an issue at the Tensile repo listing the specific sizes you're interested in and see if they can help with tuning. hipBLAS itself is just a thin wrapper over the rocBLAS/cuBLAS backends, so it doesn't have any major impact on performance.

As for the size = 256 performance, it's still absorbing the one-time startup cost; it would be more accurate to do a "warm-up" call to hipblasSgemm() before benchmarking the timed calls.

Thanks again,
Daine


daineAMD commented on July 20, 2024

@jordan44665 I'll close this issue as I haven't heard back in a while and I think the crux of the issue is resolved. Hopefully Tensile folks can help you out if you have tuning requests for any specific sizes. Feel free to re-open if I haven't addressed any of your concerns, thanks!

