Comments (3)

iassiour commented on August 11, 2024

Hi @msimberg

> Do you think there's a chance this could be fixed to be a constant time operation in future releases (and if so, should I open a separate issue for that somewhere)?

I have raised an internal ticket to track the issue. We will work towards optimising this use case in a future release.

> Any ideas about what's going on with hipDeviceSynchronize and just launching a kernel?

Launching the kernel has the side effect of creating the null stream at that point.
If the null stream exists, the search during hipEventRecord stops earlier; otherwise it continues to the end of the streamSet.
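
To illustrate the early exit described above, here is a minimal sketch in plain C++. It is only a model under stated assumptions, not the HIP runtime's code: `ModelStream` and `entries_visited` are hypothetical names, and the position of the null stream within the set is an assumption of the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a runtime stream object; this is not a HIP type.
struct ModelStream {
    bool is_null_stream;
};

// Counts how many entries a linear search visits before it finds the null
// stream. If the set contains no null stream, every entry is visited.
std::size_t entries_visited(const std::vector<ModelStream>& stream_set) {
    std::size_t visited = 0;
    for (const ModelStream& s : stream_set) {
        ++visited;
        if (s.is_null_stream) {
            break;  // early exit: the null stream was found
        }
    }
    assert(visited <= stream_set.size());  // sanity check on the count
    return visited;
}
```

In this model, a set of 128 streams without a null stream costs 128 visits per search, while the same set with a null stream at the front costs a single visit. That would be consistent with a kernel launch (which creates the null stream) making later hipEventRecord calls cheaper.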

Calling only hipDeviceSynchronize (without the kernel) does not seem to have an effect in my tests. I do see that the last 3 iterations are marginally faster, maybe because of cache effects, but when running the test with more iterations, changing the synchronize_index does not seem to make any difference.

from hip.

iassiour commented on August 11, 2024

I think the dependence on the number of created streams (even if you do not use any of them) can be explained by the fact that during hipEventRecord(events[i], stream) the runtime needs to synchronize with the null stream. As currently implemented, it iterates through the whole set of created streams to find the stream to synchronize with (the null stream in this case).

To confirm that this is the case, and as a possible workaround, can you try using a non-blocking stream in hipEventRecord,
i.e. create the stream with:
check_error(hipStreamCreateWithFlags(&stream, hipStreamNonBlocking));

This flag indicates that the stream does not need any synchronization with the null stream (no work is submitted to the null stream here anyway), so the runtime skips the iteration through the large stream set. This can be faster, and the timings should stay constant as the number of streams increases.
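
The expected effect of the flag can be sketched as a small cost model in plain C++. This is an illustration only, not HIP runtime code: `ModelStreamKind` and `scan_cost` are hypothetical names, and the model assumes the worst case where no null stream exists, so a blocking stream triggers a scan over every created stream.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the two stream kinds; these are not HIP flags.
enum class ModelStreamKind { Blocking, NonBlocking };

// Number of stream-set entries examined when recording an event.
// Assumption: a blocking stream scans the full set (no null stream present),
// while a non-blocking stream skips the scan entirely.
std::size_t scan_cost(ModelStreamKind kind, std::size_t created_streams) {
    if (kind == ModelStreamKind::NonBlocking) {
        return 0;  // no synchronization with the null stream is needed
    }
    return created_streams;  // linear in the number of created streams
}
```

Under these assumptions the cost for a blocking stream grows linearly with the number of created streams, and the cost for a non-blocking stream stays constant, which is the behaviour the suggested test should show if the explanation is right.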

msimberg commented on August 11, 2024

Thanks @iassiour for the quick response.

> The way this is currently implemented it iterates through the whole set of created streams in order to get the right stream to synchronise with (the nullstream in this case).

That's a bit unfortunate, but there's not much that can be done about it for released versions. Do you think there's a chance this could be fixed to be a constant time operation in future releases (and if so, should I open a separate issue for that somewhere)? I understand this may have been sufficient for most use cases, but given a system with 8 GPUs and, say, 16 streams for each GPU, you're already up to 128 streams. We're lucky in this case to use only one MPI process per GPU, so we can limit the number of streams, but it would be interesting to explore making use of all GPUs on a node from a single process in the future, and then I see this becoming a bottleneck again.

> To confirm that this is the case, and as a possible workaround, can you try using a non-blocking stream in hipEventRecord,
> i.e. create the stream with:
> check_error(hipStreamCreateWithFlags(&stream, hipStreamNonBlocking));

I'll give this a try, thanks for the hint! We've actually had that enabled before, but had to disable it due to issues with rocBLAS. I'm hoping we do not see those issues anymore.

Any ideas about what's going on with hipDeviceSynchronize and just launching a kernel?
