Hi, First off I want to say that you have made some great software,

Amount of synchronization seemingly a bit hurtful for AMD GPUs (fluidx3d issue, closed)

projectphysx commented on August 26, 2024 1

from fluidx3d.

Comments (4)

ProjectPhysX commented on August 26, 2024

Hi @Epliz,

thank you!

HIP seems to only support 7 (!) super expensive AMD GPUs, so I'll stick with OpenCL. Maybe your findings give AMD an incentive to optimize their OpenCL runtime :)

At least one synchronization is required per time step. Otherwise, with an unlimited number of time steps, the OpenCL queue gets new entries every couple of milliseconds, but kernels can't complete fast enough, so the queue and system memory fill up within seconds, causing a crash.
For multi-GPU I have already minimized the number of synchronization points to what's absolutely necessary, and disabled the additional synchronization for single-GPU if there is more than one domain.
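
To make the queue-growth argument concrete, here is a minimal toy model in Python. It is not FluidX3D code and does not call OpenCL. The function name, the timings, and the `sync_every` parameter are all hypothetical and chosen only for illustration. It assumes an in-order queue where the host can enqueue a kernel faster than the device can run it.

```python
from collections import deque

def max_queue_depth(steps, enqueue_ms, kernel_ms, sync_every=None):
    """Toy model of an in-order command queue (hypothetical, not OpenCL).

    Returns the peak number of enqueued-but-unfinished kernels.
    """
    host_time = 0.0      # host clock in ms
    device_free = 0.0    # time at which the device finishes its last kernel
    pending = deque()    # finish times of kernels still in the queue
    peak = 0
    for step in range(1, steps + 1):
        host_time += enqueue_ms               # host enqueues one kernel
        start = max(device_free, host_time)   # kernels run strictly in order
        device_free = start + kernel_ms
        pending.append(device_free)
        while pending and pending[0] <= host_time:
            pending.popleft()                 # this kernel has already finished
        peak = max(peak, len(pending))
        if sync_every and step % sync_every == 0:
            # Blocking wait: the host stalls until the device is idle.
            host_time = max(host_time, device_free)
    return peak
```

With, say, `enqueue_ms=0.1` and `kernel_ms=1.0`, calling the function without `sync_every` gives a backlog that keeps growing with the number of steps, while `sync_every=1` keeps it at a single pending kernel. Larger `sync_every` values bound the backlog at roughly that many kernels, which is the trade-off discussed in this thread.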

Can you send me the standard benchmark results for MI50/MI100 please? From my experience, AMD GPUs are quite sensitive to box size, so maybe go with 464³ resolution instead of the standard 256³, that runs fastest on the Radeon VII. Would be good additions to the table!

Regards,
Moritz

Epliz commented on August 26, 2024

Hey,

Actually, I was wrong: when the kernel code is the same, there is not really any difference between HIP and OpenCL beyond the small synchronization difference.
To illustrate what I proposed, I put my changes in a fork at https://github.com/Epliz/FluidX3D/commits/master .
I also added some other small optimizations, so in total the fork contains the following:

- Less synchronization
- Specializing the stream_collide kernel to remove the t parameter
- Using "near loads" and "near stores" whenever possible; these leverage variants of the load/store instructions that can avoid some 64-bit operations

I think the changes are correct, but to be honest I am not sure how to test them properly.
With those changes, on a 256^3 grid, on my MI100 the perf goes from 5220 peak MLUPS (baseline using your repo) to 5572 peak MLUPS. The gains come predominantly from the sync optimization, to the point that you don't really need to take the other commits.
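
As a rough sanity check on those numbers, the short Python sketch below converts the two MLUPS figures into time per step. The helper name is hypothetical, and treating the whole difference as per-step overhead is an assumption, not something measured here.

```python
def step_time_ms(grid_points, mlups):
    """Time for one lattice update of the whole grid, in milliseconds."""
    return grid_points / (mlups * 1e6) * 1e3

N = 256**3                        # grid points in a 256^3 box
before = step_time_ms(N, 5220)    # baseline MLUPS reported above
after = step_time_ms(N, 5572)     # MLUPS with the fork's changes

speedup = 5572 / 5220             # about 1.067, i.e. roughly 6.7 % faster
saved_us = (before - after) * 1e3 # about 0.2 ms saved per time step
```

At roughly 3 ms per step, a saving of about 0.2 ms per step is consistent with a fixed per-step cost such as a synchronization point, although other causes cannot be ruled out from these two figures alone.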
On V100 there is no perf change at all.

I hope my changes are valid, and that they can be of help.

Best,
Epliz

ProjectPhysX commented on August 26, 2024

Hi @Epliz,

during the last week I have experimented some more with reducing synchronization barriers in every time step. In headless mode, the performance difference on Radeon VII is measurable but insignificant, and in interactive graphics mode, it makes the graphics freeze repeatedly as graphics kernels then don't get placed in the queue often enough. I also tried event-based synchronization in multi-GPU but that isn't faster either. Removing synchronization from every time step creates more trouble than it has benefits.

Specializing stream_collide for even and odd time steps defeats the purpose of modular code. Passing the t parameter directly has no disadvantage. I couldn't measure a significant performance difference between a 32-bit and a 64-bit time parameter. I was considering passing time only as even/odd (0/1) or as a 32-bit integer, but left it as a 64-bit integer, as the time step could be used as a seed for random number generation in the future.

In my testing, the few 64-bit integer operations for array index calculation also don't significantly impact performance. When using 64-bit integers everywhere (for the global ID and for computing all the neighbor grid indices), there is a significant difference, so I stick to 32-bit for the max grid size and only extend to 64-bit for array accesses in higher dimensions, like in the fi array. This makes sure that up to 2³² grid points work without integer overflow. The application is so strongly bandwidth-bound that even with the few 64-bit integer operations at a 1:64 INT64:FP32 ratio, it remains well within the bandwidth limit.
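
A small Python sketch of why the fi index needs 64 bits even when the grid index fits in 32 bits. It assumes a structure-of-arrays layout where the linear index is `i * N + n` (direction `i`, grid point `n`, `N` grid points) and 19 directions. Both are assumptions for illustration, as are the function names; the 32-bit variant emulates unsigned wraparound with a mask.

```python
MASK32 = 0xFFFFFFFF

def fi_index_32(n, i, N):
    """Index computed entirely in unsigned 32-bit arithmetic (wraps around)."""
    return (i * N + n) & MASK32

def fi_index_64(n, i, N):
    """Index computed in 64-bit arithmetic; no wraparound at these sizes."""
    return i * N + n

N = 1024**3          # 2^30 grid points: the grid index itself fits in 32 bits
n, i = 5, 18         # last direction of a hypothetical 19-direction set

wrapped = fi_index_32(n, i, N)   # 18 * 2^30 exceeds 2^32, so this wraps
correct = fi_index_64(n, i, N)
```

Here `wrapped` lands at `2 * N + n`, inside the data of direction 2 instead of direction 18, so a 32-bit index would silently read the wrong values. Only this one multiply-add per access needs the wider type; the neighbor computations on `n` can stay 32-bit.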

Regards,
Moritz

Epliz commented on August 26, 2024

Hi Moritz,

Thanks for trying, and I think your results make sense.
As for the effect of reducing synchronization: based on Amdahl's law, I would expect a bigger improvement on faster GPUs like the MI100 or MI250X, and a smaller improvement on the Radeon VII.
Have you had the chance to try it on an MI250X?
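
The Amdahl's law reasoning can be sketched in a few lines of Python. The model assumes each time step costs a compute part that shrinks on faster GPUs plus a fixed synchronization overhead. The function name and all timings below are hypothetical, not measurements.

```python
def speedup_without_overhead(compute_ms, overhead_ms):
    """Speedup from removing a fixed per-step overhead (Amdahl-style model)."""
    return (compute_ms + overhead_ms) / compute_ms

overhead_ms = 0.2    # hypothetical fixed synchronization cost per step

# Hypothetical compute times per step: a faster and a slower GPU.
fast_gpu = speedup_without_overhead(3.0, overhead_ms)
slow_gpu = speedup_without_overhead(6.0, overhead_ms)
```

Because the overhead is fixed, it is a larger fraction of the step on the faster GPU, so `fast_gpu` comes out larger than `slow_gpu`. This only holds if the synchronization cost really is independent of GPU speed, which is an assumption of the model.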

In any case, thanks for trying.
Best,
Epliz
