Comments (4)

ProjectPhysX commented on May 13, 2024

Hi @Epliz,

thank you!

HIP seems to only support 7 (!) super expensive AMD GPUs, so I'll stick with OpenCL. Maybe your findings give AMD an incentive to optimize their OpenCL runtime :)

At least one synchronization is required per time step. Otherwise, with an unlimited number of time steps, the OpenCL queue gets new entries every couple of milliseconds, but kernels can't complete fast enough, so the queue and system memory fill up within seconds, causing a crash.
For multi-GPU I have already minimized the number of synchronization points to what's absolutely necessary, and disabled the additional synchronization for single-GPU if there is more than one domain.
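
The queue growth described above can be illustrated with a small toy model. This is only a sketch, not FluidX3D or OpenCL code: the function name `max_pending_kernels`, the in-order queue model, and the timings (1 ms per enqueue, 5 ms per kernel) are all hypothetical assumptions chosen for illustration.

```python
from collections import deque

def max_pending_kernels(steps, enqueue_ms, kernel_ms, sync_every=None):
    """Toy model of an in-order command queue (hypothetical, not OpenCL).

    The host enqueues one kernel per time step and spends enqueue_ms per step;
    the device needs kernel_ms per kernel. Returns the largest number of
    kernels that were queued or running at the moment of an enqueue.
    """
    host_time = 0.0         # time at which the host issues the next enqueue
    device_free = 0.0       # time at which the device finishes all queued work
    finish_times = deque()  # finish times of kernels that have not completed yet
    max_pending = 0
    for step in range(steps):
        # Drop kernels that have already completed by the current host time.
        while finish_times and finish_times[0] <= host_time:
            finish_times.popleft()
        # The new kernel starts as soon as the device is free.
        device_free = max(host_time, device_free) + kernel_ms
        finish_times.append(device_free)
        max_pending = max(max_pending, len(finish_times))
        # Optional blocking synchronization: the host waits for the device.
        if sync_every and (step + 1) % sync_every == 0:
            host_time = max(host_time, device_free)
        host_time += enqueue_ms
    return max_pending

# Assumed timings: the host enqueues faster than the device can execute.
unsynced = max_pending_kernels(steps=1000, enqueue_ms=1.0, kernel_ms=5.0)
synced = max_pending_kernels(steps=1000, enqueue_ms=1.0, kernel_ms=5.0, sync_every=1)
print("max pending without sync:", unsynced)
print("max pending with one sync per step:", synced)
```

In this model the backlog grows with the number of steps when there is no synchronization, while one blocking synchronization per step keeps at most one kernel pending.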

Can you send me the standard benchmark results for MI50/MI100 please? From my experience, AMD GPUs are quite sensitive to box size, so maybe go with 464³ resolution, which runs fastest on the Radeon VII, instead of the standard 256³. These would be good additions to the table!

Regards,
Moritz

from fluidx3d.

Epliz commented on May 13, 2024

Hey,

Actually, I was wrong: when the kernel code is the same, there is not really any difference between HIP and OpenCL beyond the small synchronization difference.
To illustrate what I proposed, I put my changes in a fork at https://github.com/Epliz/FluidX3D/commits/master .
I also added some other small optimizations, so in total the fork contains the following:

  1. Less synchronization
  2. Specialize the stream_collide kernel to remove the t parameter
  3. Use "near loads" and "near stores" whenever possible; these use variants of the load/store instructions that can avoid some 64-bit operations

I think the changes are correct, but to be honest I am not sure how to test them properly.
With those changes, on a 256^3 grid, on my MI100 the perf goes from 5220 peak MLUPS (baseline using your repo) to 5572 peak MLUPS. The gains come predominantly from the sync optimization, to the point that you don't really need to take the other commits.
On V100 there is no perf change at all.
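
To put the MI100 numbers above in perspective, here is a short worked calculation. It is a sketch that assumes MLUPS means million lattice updates per second and that every cell of the 256^3 grid is updated exactly once per time step; the helper `step_time_us` is a hypothetical name.

```python
# Figures quoted above: 256^3 grid on an MI100.
cells = 256 ** 3            # lattice points updated per time step
baseline_mlups = 5220.0     # peak MLUPS with the unmodified repository
optimized_mlups = 5572.0    # peak MLUPS with the fork's changes

def step_time_us(mlups, n_cells):
    """Microseconds per time step, assuming one lattice update per cell per step."""
    # n_cells / (mlups * 1e6) is the step time in seconds,
    # which equals n_cells / mlups in microseconds.
    return n_cells / mlups

speedup = optimized_mlups / baseline_mlups
saved_us = step_time_us(baseline_mlups, cells) - step_time_us(optimized_mlups, cells)

print(f"speedup: {speedup:.3f}x ({(speedup - 1) * 100:.1f} % more throughput)")
print(f"time per step: {step_time_us(baseline_mlups, cells):.0f} us "
      f"-> {step_time_us(optimized_mlups, cells):.0f} us")
print(f"saved per step: {saved_us:.0f} us")
```

Under these assumptions the gain is a bit under 7 %, which corresponds to roughly 0.2 ms saved per time step.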

I hope my changes are valid, and that they can be of help.

Best,
Epliz

ProjectPhysX commented on May 13, 2024

Hi @Epliz,

during the last week I have experimented some more with reducing synchronization barriers in every time step. In headless mode, the performance difference on Radeon VII is measurable but insignificant, and in interactive graphics mode, it makes the graphics freeze repeatedly as graphics kernels then don't get placed in the queue often enough. I also tried event-based synchronization in multi-GPU but that isn't faster either. Removing synchronization from every time step creates more trouble than it has benefits.

Specializing stream_collide for even and odd time steps defeats the purpose of modular code. Passing the t parameter directly has no disadvantage. I couldn't measure a significant performance difference between a 32-bit and a 64-bit time parameter. I was considering passing time only as even/odd (0/1) or as a 32-bit integer, but left it as a 64-bit integer as the time step could be used as a seed for random number generation in the future.

In my testing, the few 64-bit integer operations for array index calculation also don't significantly impact performance. When using 64-bit integers everywhere (for the global ID and for computing all the neighbor grid indices), there is a significant difference, so I stick to 32-bit for the max grid size and only extend to 64-bit for array accesses in higher dimensions, like in the fi array. This makes sure that up to 2³² grid points work without integer overflow. The application is so strongly bandwidth-bound that even with the few 64-bit integer operations at a 1:64 INT64:FP32 ratio, it remains well within the bandwidth limit.
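
The reason for widening only the array access can be sketched as follows. This is an illustration under assumptions, not the actual kernel code: it assumes a layout in which population i of cell n is stored at i*N + n, 19 populations per cell, and a hypothetical 1024³ grid. The names `index_f_32` and `index_f_64` are made up for this example, and 32-bit wraparound is emulated with a bit mask.

```python
MASK32 = 0xFFFFFFFF  # emulates unsigned 32-bit wraparound

def index_f_32(n, i, N):
    """Index computed entirely in 32-bit arithmetic (wraps on overflow)."""
    return (i * N + n) & MASK32

def index_f_64(n, i, N):
    """Index with the multiplication and addition widened to 64 bits."""
    return i * N + n

N = 1024 ** 3   # hypothetical grid size; the cell index itself fits in 32 bits
Q = 19          # assumed number of populations per cell
n = N - 1       # last cell
i = Q - 1       # last population

largest = index_f_64(n, i, N)
wrapped = index_f_32(n, i, N)

print("cell index fits in 32 bits:", n <= MASK32)
print("64-bit array index:", largest)
print("32-bit array index (wrapped):", wrapped)
```

The cell index stays within 32 bits, but the array index for the higher populations does not, so a pure 32-bit calculation would silently address the wrong element.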

Regards,
Moritz

Epliz commented on May 13, 2024

Hi Moritz,

Thanks for trying, and I think your results make sense.
For the effect of reducing synchronization, based on Amdahl's law, I would expect a bigger improvement on faster GPUs like the MI100 or MI250X, while the improvement should be smaller on the Radeon VII.
Have you had the chance to try on the MI250X?
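
As a rough illustration of this argument: if each time step consists of kernel time plus a fixed synchronization overhead, removing the overhead helps more the shorter the kernel time is. The sketch below uses hypothetical numbers (a 200 µs overhead and three made-up kernel times), not measurements from any GPU.

```python
def speedup_without_sync(kernel_us, sync_us):
    """Speedup from removing a fixed per-step overhead (Amdahl-style estimate)."""
    return (kernel_us + sync_us) / kernel_us

sync_us = 200.0  # hypothetical fixed synchronization cost per time step

# Hypothetical kernel times per step, from a slower to a faster GPU.
for kernel_us in (6000.0, 3000.0, 1500.0):
    gain = (speedup_without_sync(kernel_us, sync_us) - 1.0) * 100.0
    print(f"kernel {kernel_us:6.0f} us per step -> {gain:4.1f} % faster without sync")
```

With the overhead held constant, halving the kernel time roughly doubles the relative gain.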

In any case, thanks for trying.
Best,
Epliz
