Comments (3)
Hi @msimberg
Do you think there's a chance this could be fixed to be a constant time operation in future releases (and if so, should I open a separate issue for that somewhere)?
I have raised an internal ticket to track the issue. We will work towards optimising this use case for a future release.
Any ideas about what's going on with hipDeviceSynchronize and just launching a kernel?
Launching the kernel has the side effect of creating the nullstream at that point.
If the nullstream exists, the search during hipEventRecord stops earlier; otherwise it continues to the end of the streamSet.
Calling hipDeviceSynchronize only (without the kernel) does not seem to have an effect in my tests. I do see that the last 3 iterations are marginally faster, maybe because of cache effects, but when running the test with more iterations, changing the synchronize_index does not seem to make any difference.
I think that the dependence on the number of created streams (even if you do not use any of them) can be explained by the fact that during hipEventRecord(events[i], stream) the runtime needs to synchronize with the nullstream. As currently implemented, it iterates through the whole set of created streams in order to find the right stream to synchronise with (the nullstream in this case).
To confirm that this is the case, and as a possible workaround, can you try using a non-blocking stream (one created with hipStreamNonBlocking) in hipEventRecord,
i.e. create the stream with:
check_error(hipStreamCreateWithFlags(&stream, hipStreamNonBlocking));
This flag indicates that the stream does not need any synchronisation with the nullstream (there is no work submitted to the nullstream here anyway), so the runtime will skip the iteration through the large stream set. This should be faster, and the timings should stay constant as the number of streams increases.
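As a rough illustration of the behaviour described above, here is a toy model in plain C++ that counts how many stream-set entries a simulated event record would visit. It is only a sketch of the explanation in this thread, not the runtime's actual code: `ToyStream`, `make_toy_stream_set` and `toy_event_record_visits` are hypothetical names, and placing the nullstream at the front of the set is an assumption made purely so that the "stops earlier" case is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: these types are hypothetical and are NOT the HIP runtime's
// real data structures.
struct ToyStream {
    bool is_null_stream;  // true for the single nullstream
    bool non_blocking;    // models a stream created with hipStreamNonBlocking
};

// Build a set of `n` user streams. If `with_null_stream` is true, a nullstream
// entry is placed at the front (its real position is an assumption).
std::vector<ToyStream> make_toy_stream_set(std::size_t n, bool with_null_stream) {
    std::vector<ToyStream> set;
    set.reserve(n + 1);
    if (with_null_stream) set.push_back({true, false});
    for (std::size_t i = 0; i < n; ++i) set.push_back({false, false});
    return set;
}

// Count how many entries a simulated event record visits while searching the
// set for the nullstream to synchronise with.
std::size_t toy_event_record_visits(const std::vector<ToyStream>& set,
                                    const ToyStream& recording_stream) {
    assert(!recording_stream.is_null_stream);  // the model covers user streams only
    // A non-blocking stream needs no nullstream synchronisation: skip the search.
    if (recording_stream.non_blocking) return 0;
    std::size_t visits = 0;
    for (const ToyStream& s : set) {
        ++visits;
        if (s.is_null_stream) break;  // found it: the search stops early
    }
    return visits;  // equals set.size() when no nullstream exists yet
}
```

In this model, a blocking stream with no nullstream present visits every entry, so the cost grows linearly with the number of created streams, while a non-blocking stream visits none regardless of the set size.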
Thanks @iassiour for the quick response.
As currently implemented, it iterates through the whole set of created streams in order to find the right stream to synchronise with (the nullstream in this case).
That's a bit unfortunate, but there's not much that can be done about it for released versions. Do you think there's a chance this could be fixed to be a constant time operation in future releases (and if so, should I open a separate issue for that somewhere)? I understand this may have been sufficient for most use cases, but given a system with 8 GPUs and, say, 16 streams for each GPU, you're already up to 128 streams. We're lucky in this case to use only one MPI process per GPU, so we can limit the number of streams, but it would be interesting to explore making use of all GPUs on a node from a single process in the future, and then I see this becoming a bottleneck again.
To confirm that this is the case, and as a possible workaround, can you try using a non-blocking stream (one created with hipStreamNonBlocking) in hipEventRecord,
i.e. create the stream with:
check_error(hipStreamCreateWithFlags(&stream, hipStreamNonBlocking));
I'll give this a try, thanks for the hint! We've actually had that enabled before, but had to disable it due to issues with rocBLAS. I'm hoping we do not see those issues anymore.
Any ideas about what's going on with hipDeviceSynchronize and just launching a kernel?