clr's Introduction

AMD ROCm Software

ROCm is an open-source stack, composed primarily of open-source software, designed for graphics processing unit (GPU) computation. ROCm consists of a collection of drivers, development tools, and APIs that enable GPU programming from low-level kernel to end-user applications.

With ROCm, you can customize your GPU software to meet your specific needs. You can develop, collaborate, test, and deploy your applications in a free, open source, integrated, and secure software ecosystem. ROCm is particularly well-suited to GPU-accelerated high-performance computing (HPC), artificial intelligence (AI), scientific computing, and computer aided design (CAD).

ROCm is powered by AMD’s Heterogeneous-computing Interface for Portability (HIP), an open-source C++ GPU programming environment and its corresponding runtime. HIP lets ROCm developers write portable applications that can be deployed on a range of platforms, from dedicated gaming GPUs to exascale HPC clusters.

ROCm supports programming models, such as OpenMP and OpenCL, and includes all necessary open source software compilers, debuggers, and libraries. ROCm is fully integrated into machine learning (ML) frameworks, such as PyTorch and TensorFlow.

Getting the ROCm Source Code

AMD ROCm is built from open source software, so you can modify any ROCm component by downloading its source code and rebuilding it. The source code for each component can be cloned from its GitHub repository using git. To make it easy to download the correct version of each component, the ROCm repository contains a repo manifest file called default.xml. You can use this manifest file to download the source code for ROCm software.

Installing the repo tool

The repo tool from Google allows you to manage multiple git repositories simultaneously. Run the following commands to install the repo tool:

mkdir -p ~/bin/
curl https://storage.googleapis.com/git-repo-downloads/repo > ~/bin/repo
chmod a+x ~/bin/repo

Note: The ~/bin/ folder is used as an example. You can specify a different folder to install the repo tool into if you desire.

Installing git-lfs

Some ROCm projects use the Git Large File Storage (LFS) format that may require you to install git-lfs. Refer to Git Large File Storage for more information. For example, to install git-lfs for Ubuntu, use the following command:

sudo apt-get install git-lfs

Downloading the ROCm source code

The following example shows how to use the repo tool to download the ROCm source code. If you choose a directory other than ~/bin/ to install the repo tool, you must use that chosen directory in the code as shown below:

mkdir -p ~/ROCm/
cd ~/ROCm/
~/bin/repo init -u http://github.com/ROCm/ROCm.git -b roc-6.0.x
~/bin/repo sync

Note: Using this sample code will cause the repo tool to download the open source code associated with the specified ROCm release. Ensure that you have ssh-keys configured on your machine for your GitHub ID prior to the download as explained at Connecting to GitHub with SSH.

Building the ROCm source code

Each ROCm component repository contains directions for building that component, such as the rocSPARSE documentation Installation and Building for Linux. Refer to the specific component documentation for instructions on building the repository.

Each release of the ROCm software supports specific hardware and software configurations. Refer to System requirements (Linux) for the currently supported hardware and operating systems.

ROCm documentation

This repository contains the manifest file for ROCm releases, changelogs, and release information.

The default.xml file contains information for all repositories and the associated commit used to build the current ROCm release; default.xml uses the Manifest Format repository.

Source code for our documentation is located in the /docs folder of most ROCm repositories. The develop branch of our repositories contains content for the next ROCm release.

The ROCm documentation homepage is rocm.docs.amd.com.

Building the documentation

For a quick-start build, use the following code. For more options and detail, refer to Building documentation.

cd docs
pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

Alternatively, a CMake build is supported.

cmake -B build
cmake --build build --target=doc

Older ROCm releases

For release information for older ROCm releases, refer to the CHANGELOG.

clr's People

Contributors

aaronenyeshi, aditya4d1, agunashe, alexvlx, alexxamd, arsenm, aryansalmanpour, bensander, chaunceyhui, chriskitching, chrispaquot, emankov, gandryey, gargrahul, iassiour, jasonttang, jaydeeppatel1111, jujiang-del, kjayapra-amd, mangupta, mhbliao, saleelk, sarbojitamd, satyanveshd, scchan, shadidashmiz, sunway513, tomsang, vsytch, yxsamliu

clr's Issues

Stress_printf_ComplexKernelMultStream failed on Radeon VII

Running the hip-5.7.1 stress tests on a Radeon VII results in one failure:

Filters: Stress_printf_ComplexKernelMultStream
Test - Stress_printf_ComplexKernelMultStream start
estimatedPrintSize = 141484838915489, actualFileSize = 4322107392
estimatedLinesPrinted = 45243684, actualLinesPrinted = 44421378
   
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
printf_stress is a Catch v2.13.4 host application.
Run with -? for options
   
-------------------------------------------------------------------------------
Stress_printf_ComplexKernelMultStream
-------------------------------------------------------------------------------
/fast/portage/dev-util/hip-5.7.1-r4/work/hip-tests-rocm-5.7.0/catch/stress/printf/Stress_printf_ComplexKernels.cc:465
...............................................................................
   
/fast/portage/dev-util/hip-5.7.1-r4/work/hip-tests-rocm-5.7.0/catch/stress/printf/Stress_printf_ComplexKernels.cc:481: FAILED:
  REQUIRE( TestPassed )
with expansion:
  false

===============================================================================
test cases: 1 | 1 failed
assertions: 1 | 1 failed

Driver: Linux kernel 6.5.10
Userspace Environment: Gentoo

Build fails with strict-aliasing violations

I tried to compile with LTO: -flto=4 -Werror=odr -Werror=lto-type-mismatch -Werror=strict-aliasing

The -Werror=* flags are important to detect cases where the compiler can try to optimize based on assuming UB cannot happen, and miscompile code that has UB in it. strict-aliasing issues are always bad but LTO can make them even worse.

I got this error:

FAILED: rocclr/CMakeFiles/rocclr.dir/platform/memory.cpp.o 
/usr/bin/x86_64-pc-linux-gnu-g++ -DATI_OS_LINUX -DCL_TARGET_OPENCL_VERSION=220 -DCL_USE_DEPRECATED_OPENCL_1_0_APIS -DCL_USE_DEPRECATED_OPENCL_1_1_APIS -DCL_USE_DEPRECATED_OPENCL_1_2_APIS -DCL_USE_DEPRECATED_OPENCL_2_0_APIS -DCOMGR_DYN_DLL -DHAVE_CL2_HPP -DHIP_MAJOR_VERSION=5 -DHIP_MINOR_VERSION=7 -DLITTLEENDIAN_CPU -DOPENCL_C_MAJOR=2 -DOPENCL_C_MINOR=0 -DOPENCL_MAJOR=2 -DOPENCL_MINOR=1 -DROCCLR_SUPPORT_NUMA_POLICY -DUSE_COMGR_LIBRARY -DWITH_HSA_DEVICE -DWITH_LIGHTNING_COMPILER -I/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr -I/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/compiler/lib -I/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/compiler/lib/include -I/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/compiler/lib/backends/common -I/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/device -I/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/elf -I/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/include -I/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/opencl/khronos/headers/opencl2.2/CL -I/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/opencl/khronos/headers/opencl2.2/CL/.. -I/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/opencl/khronos/headers/opencl2.2/CL/../.. -I/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/opencl/khronos/headers/opencl2.2/CL/../../.. -I/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/opencl/khronos/headers/opencl2.2/CL/../../../.. 
-I/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/opencl/khronos/headers/opencl2.2/CL/../../../../amdocl  -march=native -fstack-protector-all -O2 -pipe -fdiagnostics-color=always -frecord-gcc-switches -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -fstack-clash-protection -flto=4 -Werror=odr -Werror=lto-type-mismatch -Werror=strict-aliasing  -Wformat -Werror=format-security -std=c++17 -fPIC -MD -MT rocclr/CMakeFiles/rocclr.dir/platform/memory.cpp.o -MF rocclr/CMakeFiles/rocclr.dir/platform/memory.cpp.o.d -o rocclr/CMakeFiles/rocclr.dir/platform/memory.cpp.o -c /var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/platform/memory.cpp
/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/platform/memory.cpp: In function ‘int amd::round_to_even(float)’:
/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/platform/memory.cpp:1285:19: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
 1285 |   if (fabsf(v) < *reinterpret_cast<const float*>(&magic[0])) {
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/platform/memory.cpp:1286:23: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
 1286 |     float magicVal = *reinterpret_cast<const float*>(&magic[v < 0.0f]);
      |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/platform/memory.cpp: In function ‘uint16_t amd::float2half_rtz(float)’:
/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/platform/memory.cpp:1311:13: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
 1311 |   if (x >= *reinterpret_cast<float*>(&values[0])) {
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/platform/memory.cpp:1312:15: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
 1312 |     if (x == *reinterpret_cast<float*>(&values[4])) {
      |               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/platform/memory.cpp:1319:12: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
 1319 |   if (x < *reinterpret_cast<float*>(&values[1])) {
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/platform/memory.cpp:1324:12: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
 1324 |   if (x < *reinterpret_cast<float*>(&values[2])) {
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/platform/memory.cpp:1325:11: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
 1325 |     x *= *reinterpret_cast<float*>(&values[3]);
      |           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1plus: some warnings being treated as errors
ninja: build stopped: cannot make progress due to previous errors.

Downstream report: https://bugs.gentoo.org/858383
Full build log: build.log
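
The diagnostics above all come from reading a float through a pointer to an object of a different type. A minimal sketch of the usual strict-aliasing-safe alternative is to copy the bytes with std::memcpy, which compilers typically reduce to a plain register move. The helper names bits_to_float and float_to_bits are illustrative assumptions, not code from rocclr:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of rocclr): reinterpret the bits of a
// 32-bit integer as a float without dereferencing a type-punned pointer.
inline float bits_to_float(std::uint32_t bits) {
  static_assert(sizeof(float) == sizeof(std::uint32_t),
                "float is assumed to be 32 bits wide");
  float value;
  std::memcpy(&value, &bits, sizeof(value));  // well-defined byte copy
  return value;
}

// Inverse direction, for completeness.
inline std::uint32_t float_to_bits(float value) {
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}
```

With such a helper, an expression like *reinterpret_cast<const float*>(&magic[0]) could become bits_to_float(magic[0]), assuming magic holds 32-bit integers. In C++20, std::bit_cast<float>(bits) expresses the same conversion directly.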

Consider marking the stack non-executable in assembly files

Hi,

With GNU Binutils >= 2.39, new warnings were added to ld, indicating that hardware stack protection is disabled in some object files of hipamd.

An example warning:

/usr/libexec/gcc/x86_64-pc-linux-gnu/ld: warning: hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.o: missing .note.GNU-stack section implies executable stack
/usr/libexec/gcc/x86_64-pc-linux-gnu/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker

Following https://wiki.gentoo.org/wiki/Hardened/GNU_stack_quickstart, please consider adding .section .note.GNU-stack,"",%progbits to the assembly sources to fix the affected files:

!WX --- ---  ./clr-rocm-5.7.1_build/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.o
RWX --- ---  ./clr-rocm-5.7.1_build/hipamd/lib/libhiprtc-builtins.so.5.7.31921
RWX --- ---  ./clr-rocm-5.7.1_build/hipamd/lib/libamdhip64.so.5.7.31921
RWX --- ---  ./clr-rocm-5.7.1_build/hipamd/lib/libhiprtc.so.5.7.31921
!WX --- ---  ./clr-rocm-5.7.1_build/hip_pch.o

Patch for Gentoo: https://github.com/gentoo/gentoo/blob/d3ef88d985b6656ddf42c3b202ada11a16e91e6c/dev-util/hip/files/hip-5.7.1-exec-stack.patch

hipMemcpy2D fails with invalid arg if `hipMemcpyDefault` used

This simple repro fails with ROCm 5.7

#include <hip/hip_runtime.h>
#include <cstdio>

constexpr int width = 64;
constexpr int height = 64;
constexpr int src_sz = width * height;
constexpr int dst_offset = src_sz; // This larger offset fails
// constexpr int dst_offset = 1; // This small offset passes
constexpr int dst_sz = src_sz + dst_offset;

int main() {
  float *src, *dst;

  // Allocate memory on the device
  hipMalloc(&src, src_sz * sizeof(float));
  hipMallocManaged(&dst, dst_sz * sizeof(float));

  size_t pitch = width * sizeof(float); // no padding

  auto err = hipMemcpy2D(dst + dst_offset, pitch, src, pitch,
                         width * sizeof(float), height, hipMemcpyDefault);
  if (err != hipSuccess)
    printf("hipMemcpy2D failed: %s\n", hipGetErrorString(err));

  hipFree(src);
  hipFree(dst);
}

Things that fix the invalid argument return code:

  1. Making the offset smaller
  2. Changing policy from hipMemcpyDefault to hipMemcpyDeviceToHost
  3. Changing hipMallocManaged to hipMalloc

$ hipcc --version
HIP version: 5.7.31921-d1770ee1b
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.7.0 23352 d1e13c532a947d0cbfc94759c00dcf152294aa13)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.7.0/llvm/bin
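
To check that the repro's arguments are themselves in bounds, the byte range touched by a pitched 2D copy can be worked out by hand. The sketch below redoes that arithmetic in plain C++ with no HIP calls. The namespace repro and the helper bytes_spanned_2d are hypothetical names introduced for illustration, and the result says nothing about why the runtime rejects the call:

```cpp
#include <cstddef>

namespace repro {

// Values taken from the repro above.
constexpr std::size_t width = 64;
constexpr std::size_t height = 64;
constexpr std::size_t src_sz = width * height;        // elements in src
constexpr std::size_t dst_offset = src_sz;            // element offset into dst
constexpr std::size_t dst_sz = src_sz + dst_offset;   // elements in dst

// Hypothetical helper: distance in bytes from the first byte to one past the
// last byte touched by a 2D copy of `rows` rows of `row_bytes` bytes each,
// where consecutive rows start `pitch` bytes apart.
constexpr std::size_t bytes_spanned_2d(std::size_t pitch, std::size_t row_bytes,
                                       std::size_t rows) {
  return rows == 0 ? 0 : (rows - 1) * pitch + row_bytes;
}

constexpr std::size_t pitch = width * sizeof(float);  // no padding
constexpr std::size_t span = bytes_spanned_2d(pitch, pitch, height);

// The copy reads exactly the whole source allocation...
static_assert(span == src_sz * sizeof(float), "source range fits");
// ...and writes exactly the second half of the destination allocation.
static_assert(dst_offset * sizeof(float) + span == dst_sz * sizeof(float),
              "destination range ends at the end of dst");

}  // namespace repro
```

Both static assertions hold, so under these assumptions the copy stays inside both allocations, which is consistent with the report that the invalid-argument error is unexpected.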

HIPCC_VERBOSE breaks cmake variables

hip-config-amd.cmake uses HIPCC to fill some variables. If HIPCC_VERBOSE is enabled, such a variable also includes debug strings, which later breaks the build.

This was not obvious to track down when my build failed, so I want to save others from shooting themselves in the foot. Perhaps HIPCC could be called explicitly with HIPCC_VERBOSE=0?

HIP build failing with hiprtc related error. #23 (CONT-D)

This is continued from the following issue, which was closed without confirmation: HIP build failing with hiprtc related error. #23
First, I want to ask why you are closing issues without even confirming with me that the build works. The build is still failing, so please do not close this one, because I can't re-open it!

Your last post here makes absolutely no sense:
the idea here is, there are two repos you need to build HIP on AMD platforms (there are several but for our purpose we can ignore them).
As I said, clr was not building due to a hiprtc-related error. Are you saying to check out hip and clr? Which one should be built first?
If I build clr, it fails for the same reason.

It starts with which version of ROCm you have installed.
You can figure it out by running cat /opt/rocm/.info/version-dev

You will get a number such as 5.6 or 5.7.

After that, you need to check out the repos
https://github.com/ROCm-Developer-Tools/clr
https://github.com/ROCm-Developer-Tools/hip
at rocm-5.n, where 5.n is whatever ROCm version you have installed, and then try to build them.

Here the clr and hip versions need to match.

hipUserObjectRetain return hipErrorInvalidValue?

Should it be returning an error here instead of success?

hipError_t hipUserObjectRetain(hipUserObject_t object, unsigned int count) {
  HIP_INIT_API(hipUserObjectRetain, object, count);
  if (object == nullptr || count == 0 || count > INT_MAX) {
    HIP_RETURN(hipErrorInvalidValue);
  }
  if (!hipUserObject::isUserObjvalid(object)) {
    HIP_RETURN(hipSuccess);  // <-- the return in question
  }
  object->increaseRefCount(count);
  HIP_RETURN(hipSuccess);
}

https://github.com/ROCm-Developer-Tools/clr/blob/9fdee05aeea7db0e20bc65770556aefd1d2c78f0/hipamd/src/hip_graph.cpp#L2460C20-L2460C20

hipamd: SIGSEGV when code for particular device architecture is absent

ROCm 5.6.0

This bug has 2 parts.

  1. PlatformState::init returns immediately if digestFatBinary fails, leaving not only the failed binary uninitialized but also all binaries that happen to be further down the list. The application gets no indication of this condition and, by default, no diagnostic message is printed.
  2. hip::Function::getStatFunc and other functions then use a null pointer from modules_, and the program crashes.
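
One way to picture the first part, and a possible mitigation, is a loop that records a per-binary status instead of returning on the first failure. This is only a sketch with hypothetical names (FatBinary, digest, init_all); it does not reproduce the real PlatformState::init or digestFatBinary code:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a registered fat binary.
struct FatBinary {
  explicit FatBinary(bool has_code) : has_code_for_device(has_code) {}
  bool has_code_for_device;  // whether a matching code object exists
  bool initialized = false;  // set by digest() on success
};

// Hypothetical stand-in for the digest step: succeeds only if the binary
// contains code for the current device.
inline bool digest(FatBinary& bin) {
  bin.initialized = bin.has_code_for_device;
  return bin.initialized;
}

// Processes every binary even if an earlier one fails, and returns the
// number of failures so the caller can surface a diagnostic.
inline std::size_t init_all(std::vector<FatBinary>& binaries) {
  std::size_t failures = 0;
  for (FatBinary& bin : binaries) {
    if (!digest(bin)) {
      ++failures;  // remember the failure, but keep going
    }
  }
  return failures;
}
```

Code that later looks up a function would still need to check the initialized flag (or perform the equivalent null check) before dereferencing, which corresponds to the second part of the report.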

Freezes / Low performance when GPU is idling

1 - When I'm rendering video in DaVinci Resolve and I start some application that loads the GPU (Firefox with YouTube / mpv / even vkMark), rendering speed actually increases by about 30-50%.
2 - Geekbench takes an immense amount of time to complete the OpenCL tests if the GPU is idle.
But if I start a video in the background, the time again drops dramatically (there's no big difference in the total score, though).
3 - I usually get a 1-second playback delay in DaVinci Resolve, but quite often I get 5-second freezes.
When I do anything that puts load on the GPU (open Latte Dock / resize a window / start Discord), it magically unfreezes.
Playing a video in the background eliminates all freezes.

  • I haven't experienced the same on Windows, though.
  • Capping the GPU clock at max doesn't make a difference.

Specs:

  • CPU: Ryzen 4300U
  • GPU: integrated
  • OS: Arch Linux
  • Kernel: linux / linux zen
  • RAM: 40 GB
  • Drivers: mesa + rocm-opencl-runtime / opencl-amd 5.6.1 / 5.7.1

Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP.dbg (3570.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback 
  Platform Extensions function suffix             AMD
  Platform Host timer resolution                  1ns

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 1
  Device Name                                     gfx90c:xnack-
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 2.0 
  Driver Version                                  3570.0 (HSA1.1,LC)
  Device OpenCL C Version                         OpenCL C 2.0 
  Device Type                                     GPU
  Device Board Name (AMD)                         AMD Radeon Graphics
  Device PCI-e ID (AMD)                           0x1636
  Device Topology (AMD)                           PCI-E, 0000:05:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               5
  SIMD per compute unit (AMD)                     4
  SIMD width (AMD)                                16
  SIMD instruction width (AMD)                    1
  Max clock frequency                             1400MHz
  Graphics IP (AMD)                               9.0
  Device Partition                                (core)
    Max number of sub-devices                     5
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             256
  Preferred work group size (AMD)                 256
  Max work group size (AMD)                       1024
  Preferred work group size multiple (kernel)     64
  Wavefront width (AMD)                           64
  Preferred / native vector sizes                 
    char                                                 4 / 4       
    short                                                2 / 2       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 1 / 1        (cl_khr_fp16)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             No
    Round to nearest                              No
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              4294967296 (4GiB)
  Global free memory (AMD)                        4014080 (3.828GiB) 4014080 (3.828GiB)
  Global memory channels (AMD)                    4
  Global memory banks per channel (AMD)           4
  Global memory bank width (AMD)                  256 bytes
  Error Correction support                        No
  Max memory allocation                           3650722200 (3.4GiB)
  Unified memory for Host and Device              No
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   Yes
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics                 
    SVM                                           0 bytes
    Global                                        0 bytes
    Local                                         0 bytes
  Max size for global variable                    3650722200 (3.4GiB)
  Preferred total size of global vars             4294967296 (4GiB)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        16384 (16KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             5686
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 8192 images
    Base address alignment for 2D image buffers   256 bytes
    Pitch alignment for 2D image buffers          256 pixels
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             16384x16384x8192 pixels
    Max number of read image args                 128
    Max number of write image args                8
    Max number of read/write image args           64
  Max number of pipe args                         16
  Max active pipe reservations                    16
  Max pipe packet size                            3650722200 (3.4GiB)
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Local memory size per CU (AMD)                  65536 (64KiB)
  Local memory banks (AMD)                        32
  Max number of constant args                     8
  Max constant buffer size                        3650722200 (3.4GiB)
  Preferred constant buffer size (AMD)            16384 (16KiB)
  Max size of kernel argument                     1024
  Queue properties (on host)                      
    Out-of-order execution                        No
    Profiling                                     Yes
  Queue properties (on device)                    
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                262144 (256KiB)
    Max size                                      8388608 (8MiB)
  Max queues on device                            1
  Max events on device                            1024
  Prefer user sync for interop                    Yes
  Number of P2P devices (AMD)                     0
  Profiling timer resolution                      1ns
  Profiling timer offset since Epoch (AMD)        0ns (Thu Jan  1 03:00:00 1970)
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Thread trace supported (AMD)                  No
    Number of async queues (AMD)                  8
    Max real-time compute queues (AMD)            8
    Max real-time compute units (AMD)             5
  printf() buffer size                            4194304 (4MiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  AMD Accelerated Parallel Processing
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [AMD]
  clCreateContext(NULL, ...) [default]            Success [AMD]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx90c:xnack-
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx90c:xnack-
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx90c:xnack-

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.3.2
  ICD loader Profile                              OpenCL 3.0

It looks like the GPU is entering some kind of power-save mode, and to exit this mode it needs to draw some graphics :)
Thank you for reading

[Issue]: CO lookup in fatbin should only fail when none of the GPUs have matching CO

Problem Description

I noticed that in a multi-GPU system (in my case, a gfx90c iGPU and a gfx1032 dGPU), a fat binary must have code objects for all architectures in order to run and produce correct outputs, yet most of the time I only want to run on one architecture. This also poses issues for users that have an integrated GPU because obviously no one would compile against an iGPU, meaning that I have to use HIP_VISIBLE_DEVICES to limit access to only the dGPU every time I run a ROCm binary or libraries like PyTorch.

I believe the problem is with this line, where we set hip_status to hipErrorNoBinaryForGpu even if only one device has unmatched CO. We should only set hip_status to hipErrorNoBinaryForGpu if none of the devices have matching CO.
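
The proposed behavior can be sketched as a small aggregation rule: report failure only when no visible device has a matching code object. The enum and function below are hypothetical stand-ins written for illustration, not the actual hip_fatbin.cpp logic or the real hipError_t type:

```cpp
#include <vector>

// Hypothetical status values mirroring the two outcomes discussed above.
enum class Status { Success, NoBinaryForGpu };

// has_matching_co[i] says whether device i found a compatible code object
// in the fat binary. The lookup fails only if no device found one.
inline Status aggregate_status(const std::vector<bool>& has_matching_co) {
  for (bool matched : has_matching_co) {
    if (matched) {
      return Status::Success;  // at least one device can run the code
    }
  }
  return Status::NoBinaryForGpu;  // no device matched (or no devices at all)
}
```

For the system in this report, where gfx1032 has a matching code object and gfx90c does not, the input would be {true, false} and the sketch yields Status::Success.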

Operating System

Solus 4.5 Resilience

CPU

AMD Ryzen 7 5800H with Radeon Graphics

GPU

AMD Radeon RX6600M

ROCm Version

ROCm 6.0.0

ROCm Component

clr

Steps to Reproduce

The following assumes one has two GPUs with incompatible architectures. In my case, I have a gfx1032 (device index 0) and gfx90c (device index 1). Please adjust the arch names accordingly.

  1. Use the official vectorAdd example. Compile against only the architecture with device index 0: hipcc --offload-arch=gfx1032 -o vectoradd_hip vectoradd_hip.cpp
  2. Run AMD_LOG_LEVEL=1 ./vectoradd_hip
  3. Now I get the following error:
:1:hip_fatbin.cpp           :256 : 1271514880 us: [pid:7468  tid:0x7f6318e9ca80] Cannot find CO in the bundle for ISA: amdgcn-amd-amdhsa--gfx90c:xnack- 

:1:hip_fatbin.cpp           :109 : 1271514917 us: [pid:7468  tid:0x7f6318e9ca80] Missing CO for these ISAs - 
:1:hip_fatbin.cpp           :112 : 1271514929 us: [pid:7468  tid:0x7f6318e9ca80]      amdgcn-amd-amdhsa--gfx90c:xnack-
:1:hip_fatbin.cpp           :302 : 1271514949 us: [pid:7468  tid:0x7f6318e9ca80] Releasing COMGR data failed with status 2 
 System minor 3
 System major 10
 agent prop name AMD Radeon RX 6600M
hip Device prop succeeded 
FAILED: 1048576 errors
:1:hip_fatbin.cpp           :83  : 1271770378 us: [pid:7468  tid:0x7f6318e9ca80] All Unique FDs are closed
  4. However, if I hide the GPU with device index 1 by running HIP_VISIBLE_DEVICES=0 AMD_LOG_LEVEL=1 ./vectoradd_hip, I get:
 System minor 3
 System major 10
 agent prop name AMD Radeon RX 6600M
hip Device prop succeeded 
PASSED!
:1:hip_fatbin.cpp           :83  : 1584583122 us: [pid:7749  tid:0x7fe017517a80] All Unique FDs are closed

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

rocminfo output
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 5800H with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 5800H with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3201                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    61576860(0x3ab969c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    61576860(0x3ab969c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    61576860(0x3ab969c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1032                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6600M                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
    L3:                      32768(0x8000) KB                   
  Chip ID:                 29695(0x73ff)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2720                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            28                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 115                                
  SDMA engine uCode::      76                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1032         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx90c                             
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      1024(0x400) KB                     
  Chip ID:                 5688(0x1638)                       
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   2048                               
  Internal Node ID:        2                                  
  Compute Unit:            8                                  
  SIMDs per CU:            4                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 469                                
  SDMA engine uCode::      40                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx90c:xnack-   
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***   

Additional Information

No response

hipamd: SIGSEGV when compiled with -march=znver4

Due to unaligned allocations, the library crashes in nontemporalMemcpy at _mm512_stream_si512, which requires 64-byte-aligned memory but is used here to copy default-aligned objects: https://github.com/ROCm-Developer-Tools/clr/blob/5914ac3c6e9b3848023a7fa25e19e560b1c38541/rocclr/device/rocm/rocvirtual.cpp#L2793
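
One common way to avoid this class of crash is to check alignment before taking the streaming-store path and to fall back to a plain copy otherwise. The following is a minimal sketch of that guard, not the fix adopted in clr: `IsAligned`, `CopyPath`, and `GuardedCopy` are hypothetical names, and the streaming branch uses `std::memcpy` as a placeholder so the example builds without AVX-512.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// True when p is aligned to `alignment` bytes (alignment must be a power of two).
inline bool IsAligned(const void* p, std::size_t alignment) {
  return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Which branch the copy took; returned only so the choice is observable.
enum class CopyPath { kStreaming, kPlain };

// Hypothetical guarded copy. Checking both pointers is conservative: the
// store address of _mm512_stream_si512 must be 64-byte aligned, and an
// aligned 64-byte load from the source would need the same.
inline CopyPath GuardedCopy(void* dst, const void* src, std::size_t size) {
  constexpr std::size_t kStreamAlign = 64;
  if (IsAligned(dst, kStreamAlign) && IsAligned(src, kStreamAlign) &&
      size % kStreamAlign == 0) {
    // Placeholder: a real implementation would issue 64-byte streaming
    // stores here instead of calling memcpy.
    std::memcpy(dst, src, size);
    return CopyPath::kStreaming;
  }
  // Unaligned pointers or a ragged size: a plain copy is always safe.
  std::memcpy(dst, src, size);
  return CopyPath::kPlain;
}
```

The alternative design is to force 64-byte alignment on every buffer that can reach the copy routine, which keeps the fast path unconditional but has to be enforced at every allocation site.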

Originally reported at https://bugs.gentoo.org/915969 as part of a rocBLAS and MIOpen update (the failure in the hipamd module loader causes crashes in dependent libraries).

Incompatibilities between bfloat16 types

There are (for some reason) two bfloat16 types in HIP: __hip_bfloat16 and hip_bfloat16. The former is a C type, whereas the latter is a C++ type.

Judging from HIPIFY, hip_bfloat16 is the preferred version here:
https://github.com/ROCm-Developer-Tools/HIPIFY/blob/0e353a6af8b4d4b1d63aa7706b297fb3a33a7ef0/src/CUDA2HIP_Device_types.cpp#L33

While I can understand that it's now hard to change the confusing headers (CUDA uses cuda_bf16.h while HIP uses hip_bfloat16.h for the preferred type, and hip_bf16.h is already taken by the 'bad' bfloat16), my main problem is that there are missing overloads for hip_bfloat16. Specifically, hip_bfloat16 does not overload any of the built-ins that operate on bfloat16 in CUDA, for example, the functions defined here:
https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____BFLOAT16__COMPARISON.html
and here:
https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____BFLOAT16__MISC.html

This makes it quite annoying to write code that must be ported between CUDA and HIP.

In CUDA, this is implemented by having a cuda_bf16.hpp and cuda_bf16.h which operate on the same type rather than 2 different, incompatible ones.
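
Until such overloads exist, user code can paper over the gap with a thin shim that routes comparisons through float. The sketch below only illustrates the idea on the host and makes several assumptions: `Bf16Like` is a hypothetical stand-in for hip_bfloat16, the `compat_*` names are invented here (they mirror CUDA's `__hgt`, `__heq`, and `__hisnan`), and the float conversion truncates rather than rounds to nearest.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a bfloat16 type: the upper 16 bits of a float.
struct Bf16Like {
  std::uint16_t bits;
};

// Convert by truncation (an assumption; real conversions usually round).
inline Bf16Like FromFloat(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return Bf16Like{static_cast<std::uint16_t>(u >> 16)};
}

// Widen back to float by placing the stored bits in the upper half.
inline float ToFloat(Bf16Like v) {
  std::uint32_t u = static_cast<std::uint32_t>(v.bits) << 16;
  float f;
  std::memcpy(&f, &u, sizeof f);
  return f;
}

// Shim functions with invented names, implemented via float.
inline bool compat_hgt(Bf16Like a, Bf16Like b) { return ToFloat(a) > ToFloat(b); }
inline bool compat_heq(Bf16Like a, Bf16Like b) { return ToFloat(a) == ToFloat(b); }
inline bool compat_hisnan(Bf16Like a) { return std::isnan(ToFloat(a)); }
```

A shim like this still has to be maintained separately for CUDA and HIP, which is the porting burden the issue is asking to have removed.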

[Issue/clr]: clEnqueueReleaseD3D11ObjectsKHR returned CL_INVALID_GL_OBJECT

Problem Description

Hello, our users encountered a DX11->OpenCL texture sharing issue after updating to the Adrenalin 24.1.1 driver. After rolling the driver back to 23.12.1, everything worked fine again.

static void opencl_unmap_from_d3d11(AVHWFramesContext *dst_fc,
                                    HWMapDescriptor *hwmap)
{
    AVOpenCLFrameDescriptor    *desc = hwmap->priv;
    OpenCLDeviceContext *device_priv = dst_fc->device_ctx->internal->priv;
    OpenCLFramesContext *frames_priv = dst_fc->internal->priv;
    cl_event event;
    cl_int cle;

    cle = device_priv->clEnqueueReleaseD3D11ObjectsKHR(
        frames_priv->command_queue, desc->nb_planes, desc->planes,
        0, NULL, &event);
    if (cle != CL_SUCCESS) {
        av_log(dst_fc, AV_LOG_ERROR, "Failed to release texture "
              "handle: %d.\n", cle);
    }

    opencl_wait_events(dst_fc, &event, 1);
}

[AVHWFramesContext @ 00000153372f0200] Failed to release texture handle: -60.

The log shows that clEnqueueReleaseD3D11ObjectsKHR() returned an unrelated error code: CL_INVALID_GL_OBJECT (-60).

Returning -60 here makes no sense, because CL_INVALID_GL_OBJECT is used exclusively for OpenGL/CL sharing, not DX11/CL sharing. According to the OpenCL documentation, this value is also not among the documented return values of this function.
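
The point can be made concrete with a small check of a return code against the documented set. This is a sketch under assumptions: the numeric values are copied from the Khronos OpenCL headers so the example builds without an OpenCL SDK, the list of documented errors reflects one reading of the cl_khr_d3d11_sharing text and should be verified against the specification, and `IsDocumentedReleaseD3D11Result` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <array>

// Values mirror the Khronos headers (CL/cl.h, CL/cl_d3d11.h).
constexpr int kClSuccess = 0;
constexpr int kClOutOfHostMemory = -6;
constexpr int kClInvalidValue = -30;
constexpr int kClInvalidContext = -34;
constexpr int kClInvalidCommandQueue = -36;
constexpr int kClInvalidMemObject = -38;
constexpr int kClInvalidEventWaitList = -57;
constexpr int kClInvalidGlObject = -60;  // OpenGL/CL sharing only
constexpr int kClD3D11ResourceNotAcquiredKhr = -1009;

// Hypothetical helper: true when `code` is one of the results assumed to be
// documented for clEnqueueReleaseD3D11ObjectsKHR.
inline bool IsDocumentedReleaseD3D11Result(int code) {
  constexpr std::array<int, 8> kDocumented = {
      kClSuccess,          kClOutOfHostMemory,     kClInvalidValue,
      kClInvalidContext,   kClInvalidCommandQueue, kClInvalidMemObject,
      kClInvalidEventWaitList, kClD3D11ResourceNotAcquiredKhr};
  return std::find(kDocumented.begin(), kDocumented.end(), code) !=
         kDocumented.end();
}
```

Under these assumptions, `IsDocumentedReleaseD3D11Result(-60)` is false, which matches the complaint that the driver returns a code callers have no reason to expect.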

After digging deeper into AMD's OpenCL runtime (clr), I found that the return code -60, which should only be used by OpenGL/CL interop, does appear on the return path of this DX11/CL sharing function. And you guys refactored this part of the code not long ago.

https://github.com/ROCm/clr/blame/8ff39a54fc790454b95b325eb2d9cdfa06ba7968/opencl/amdocl/cl_gl.cpp#L1597
https://github.com/ROCm/clr/blame/8ff39a54fc790454b95b325eb2d9cdfa06ba7968/opencl/amdocl/cl_gl.cpp#L1583
https://github.com/ROCm/clr/blame/8ff39a54fc790454b95b325eb2d9cdfa06ba7968/opencl/amdocl/cl_gl.cpp#L1708
https://github.com/ROCm/clr/blame/8ff39a54fc790454b95b325eb2d9cdfa06ba7968/opencl/amdocl/cl_gl.cpp#L1693

RUNTIME_ENTRY(cl_int, clEnqueueReleaseD3D11ObjectsKHR,
              (cl_command_queue command_queue, cl_uint num_objects, const cl_mem* mem_objects,
               cl_uint num_events_in_wait_list, const cl_event* event_wait_list, cl_event* event)) {
  return amd::clEnqueueReleaseExtObjectsAMD(command_queue, num_objects, mem_objects,
                                            num_events_in_wait_list, event_wait_list, event,
                                            CL_COMMAND_RELEASE_D3D11_OBJECTS_KHR);
}
RUNTIME_EXIT

Operating System

10.0.19045 (Windows 10 22H2)

CPU

AMD Ryzen 9 5950X 16-Core Processor

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.0.0

ROCm Component

clr

Steps to Reproduce

  1. Prepare a 1080p or 4k video. It can be any common video format such as H.264, HEVC or AV1.

  2. Download and unzip the jellyfin-ffmpeg6 6.0.1-1, which is the video transcoder of Jellyfin Media Server.

  3. Run the following command in CMD or PowerShell. This FFmpeg command uses DX11/CL sharing so that the D3D11VA decoder, OpenCL filter and AMF encoder interact directly, avoiding extra copies.

// Input file path is `C:\ANY_H264_HEVC_AV1_VIDEO.mp4`, you can change it
// Output file path is `C:\output.mp4`, you can change it

ffmpeg.exe  -init_hw_device d3d11va=dx11:,vendor=0x1002 -init_hw_device opencl=ocl@dx11 \
-filter_hw_device ocl -hwaccel d3d11va -hwaccel_output_format d3d11 -autorotate 0 -i C:\ANY_H264_HEVC_AV1_VIDEO.mp4 \
-autoscale 0 -an -sn -c:v h264_amf -quality speed -b:v 20M -maxrate 20M \
-vf "hwmap=derive_device=opencl,scale_opencl=w=1920:h=1080:format=nv12,hwmap=derive_device=d3d11va:reverse=1,format=d3d11" \
-vframes 5000 -y C:\output.mp4
  4. It should fail immediately with error code -60 (CL_INVALID_GL_OBJECT).
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (native) -> h264 (h264_amf))
  Stream #0:1 -> #0:1 (copy)
Press [q] to stop, [?] for help
[AVHWFramesContext @ 000001f01432de40] Failed to release texture handle: -60.
  5. Downgrade the driver to the older Adrenalin 23.12.1 and repeat the steps above; the ffmpeg command then runs without any issue.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

Issue threads from users:

[Issue]: libamdocl64.so causes segfault

Problem Description

model name : 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
OS:
NAME="Arch Linux"
CPU:
model name : 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
GPU:
Name: 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
Marketing Name: 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
Name: gfx1100
Marketing Name: AMD Radeon RX 7900 XTX
Name: amdgcn-amd-amdhsa--gfx1100

Operating System

Arch

CPU

11th Gen Intel i7-11700KF (16) @ 4.900GHz

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.0.0, ROCm 5.7.1

ROCm Component

ROCm

Steps to Reproduce

  1. Install rocm-opencl-runtime
  2. Run clinfo

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
  Uuid:                    CPU-XX
  Marketing Name:          11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      49152(0xc000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4900
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            16
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    65683344(0x3ea3f90) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65683344(0x3ea3f90) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    65683344(0x3ea3f90) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1100
  Uuid:                    GPU-c0e9f44d3ad3931a
  Marketing Name:          AMD Radeon RX 7900 XTX
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29772(0x744c)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2371
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 528
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS:
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Additional Information

I have noticed that as soon as libamdocl64.so is registered as an ICD, clinfo crashes with a segfault.

[Issue]: gfx900 *ERROR* ring page0 timeout

Problem Description

I realise gfx900 is no longer a supported GPU; however, is clr known to work on it with recent ROCm releases? 5.4.x worked fine. I'm using mainline llvm-17 and have tested recent stable kernels, currently v6.7.2.

I've tried ROCm 5.7.1 through 6.0.2.

I'm building ROCm for gfx900 only. When running OpenCL kernels, the GPU locks up and gets reset.

[52736.307610] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=6324, emitted seq=6327
[52736.307978] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[52736.308308] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[52736.320463] amdgpu: Failed to suspend process 0x8011
[52736.442595] [drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
[52736.471016] amdgpu 0000:03:00.0: amdgpu: BACO reset
[52737.034968] amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume

clinfo works fine:

Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP.dbg (3602.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback
  Platform Extensions function suffix             AMD
  Platform Host timer resolution                  1ns

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 1
  Device Name                                     gfx900:xnack-
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 2.0
  Driver Version                                  3602.0 (HSA1.1,LC)
  Device OpenCL C Version                         OpenCL C 2.0
  Device Type                                     GPU
  Device Board Name (AMD)                         AMD Radeon RX Vega
  Device PCI-e ID (AMD)                           0x687f
  Device Topology (AMD)                           PCI-E, 0000:03:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               64
  SIMD per compute unit (AMD)                     4
  SIMD width (AMD)                                16
  SIMD instruction width (AMD)                    1
  Max clock frequency                             1630MHz
  Graphics IP (AMD)                               9.0
  Device Partition                                (core)
    Max number of sub-devices                     64
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             256
  Preferred work group size (AMD)                 256
  Max work group size (AMD)                       1024
  Preferred work group size multiple (kernel)     64
  Wavefront width (AMD)                           64
  Preferred / native vector sizes
    char                                                 4 / 4
    short                                                2 / 2
    int                                                  1 / 1
    long                                                 1 / 1
    half                                                 1 / 1        (cl_khr_fp16)
    float                                                1 / 1
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              8573157376 (7.984GiB)
  Global free memory (AMD)                        8177664 (7.799GiB) 8177664 (7.799GiB)
  Global memory channels (AMD)                    64
  Global memory banks per channel (AMD)           4
  Global memory bank width (AMD)                  256 bytes
  Error Correction support                        No
  Max memory allocation                           7287183768 (6.787GiB)
  Unified memory for Host and Device              No
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   Yes
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics
    SVM                                           0 bytes
    Global                                        0 bytes
    Local                                         0 bytes
  Max size for global variable                    7287183768 (6.787GiB)
  Preferred total size of global vars             8573157376 (7.984GiB)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        16384 (16KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 8192 images
    Base address alignment for 2D image buffers   256 bytes
    Pitch alignment for 2D image buffers          256 pixels
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             16384x16384x8192 pixels
    Max number of read image args                 128
    Max number of write image args                8
    Max number of read/write image args           64
  Max number of pipe args                         16
  Max active pipe reservations                    16
  Max pipe packet size                            2992216472 (2.787GiB)
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Local memory size per CU (AMD)                  65536 (64KiB)
  Local memory banks (AMD)                        32
  Max number of constant args                     8
  Max constant buffer size                        7287183768 (6.787GiB)
  Preferred constant buffer size (AMD)            16384 (16KiB)
  Max size of kernel argument                     1024
  Queue properties (on host)
    Out-of-order execution                        No
    Profiling                                     Yes
  Queue properties (on device)
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                262144 (256KiB)
    Max size                                      8388608 (8MiB)
  Max queues on device                            1
  Max events on device                            1024
  Prefer user sync for interop                    Yes
  Number of P2P devices (AMD)                     0
  Profiling timer resolution                      1ns
  Profiling timer offset since Epoch (AMD)        0ns (Thu Jan  1 01:00:00 1970)
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Thread trace supported (AMD)                  No
    Number of async queues (AMD)                  8
    Max real-time compute queues (AMD)            8
    Max real-time compute units (AMD)             64
  printf() buffer size                            4194304 (4MiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [AMD]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx900:xnack-
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx900:xnack-
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx900:xnack-

ICD loader properties
  ICD loader Name                                 Khronos OpenCL ICD Loader
  ICD loader Vendor                               Khronos Group
  ICD loader Version                              3.0.5
  ICD loader Profile                              OpenCL 3.0

Operating System

Gentoo

CPU

AMD FX8370E

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.0.0, ROCm 5.7.1

ROCm Component

clr

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

rocminfo --support
ROCk module is loaded

HSA System Attributes

Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents


Agent 1


Name: AMD FX-8370E Eight-Core Processor
Uuid: CPU-XX
Marketing Name: AMD FX-8370E Eight-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 16384(0x4000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4700
BDFID: 0
Internal Node ID: 0
Compute Unit: 8
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 16283176(0xf87628) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 16283176(0xf87628) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16283176(0xf87628) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx900
Uuid: GPU-0215054809580924
Marketing Name: AMD Radeon RX Vega
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 4096(0x1000) KB
Chip ID: 26751(0x687f)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1630
BDFID: 768
Internal Node ID: 1
Compute Unit: 64
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 468
SDMA engine uCode:: 434
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx900:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

[Issue]: HIP causes blender to freeze

Problem Description

I tried the latest version available on my system (5.7.1), and it causes Blender to freeze when opening Preferences or loading a .blend file. The process becomes unkillable, requiring a hard reset.
I then downloaded the latest packages I could get from the Arch Archive (6.0.0), and they cause the same issue when opening Preferences or selecting "GPU Compute" under Render. I also got a system freeze after adding the OpenCL runtime (also 6.0.0): the system freezes a few seconds after opening Preferences, although the cursor still works.

Operating System

Manjaro Linux

CPU

Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz

GPU

AMD Radeon RX 5500 XT

ROCm Version

ROCm 5.7.1 and 6.0.0

ROCm Component

HIP

Steps to Reproduce

pacman -S rocm-hip-runtime
open blender
open edit > preferences

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            4                                  
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    16232460(0xf7b00c) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16232460(0xf7b00c) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16232460(0xf7b00c) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1012                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 5500 XT              
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 29504(0x7340)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1900                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            22                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    1280(0x500)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 146                                
  SDMA engine uCode::      41                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1012:xnack-  
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

Additional Information

No response

With DEBUG_CLR_GRAPH_PACKET_CAPTURE enabled, packet capture does not work for kernel nodes modified by hipGraphExecKernelNodeSetParams

precondition: enable DEBUG_CLR_GRAPH_PACKET_CAPTURE

steps:

  1. hipGraphCreate
  2. hipGraphAddKernelNode
  3. hipGraphInstantiate
  4. hipGraphExecKernelNodeSetParams
  5. hipGraphLaunch
  6. hipStreamSynchronize

After hipGraphInstantiate, the AQL packet is generated by CaptureAQLPackets and is later used in the hipGraphLaunch step.

However, when hipGraphExecKernelNodeSetParams is called, the AQL packet is modified, and the modified packet is not the one executed by hipGraphLaunch.

I think this is an issue.

[Issue]: ROCm 5.7.3, RCCL 2.19.4: GPU kernel can't printf. Hash value collision detected

Problem Description

In the RCCL file prims_simple.h, I added a printf call to this kernel function:

__device__ __forceinline__ void genericOp(
    intptr_t srcIx, intptr_t dstIx, int nelem, bool postOp
  ) {
  constexpr int DirectRecv = /*1 &&*/ Direct && DirectRecv1;
  constexpr int DirectSend = /*1 &&*/ Direct && DirectSend1;
  constexpr int Src = SrcBuf != -1;
  constexpr int Dst = DstBuf != -1;
  nelem = nelem < 0 ? 0 : nelem;
  int sliceSize = stepSize*StepPerSlice;
  sliceSize = max(divUp(nelem, 16*SlicePerChunk)*16, sliceSize/32);
  int slice = 0;
  int offset = 0;
  if (tid == 0) {
    printf("in genericOp \n");
  }

When I run the RCCL test with the command ./build/sendrecv_perf -b 8 -e 128M -f 2 -t 1 -g 2, it reports this error:

enquence.cc Current function: ncclLaunchKernel line 1090
:1:rocvirtual.cpp :2945: 74877529363 us: [pid:44406 tid:0x7f26f4922c00] Pcie atomics not enabled, hostcall not supported
:1:rocvirtual.cpp :3280: 74877529375 us: [pid:44406 tid:0x7f26f4922c00] AQL dispatch failed!
yz-adm3: Test NCCL failure /home/yang.yang/yy/work/test-rccl/build/src/hipify/common.cu.cpp:451 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '

After reading the explanation at https://rocm.docs.amd.com/en/latest/about/CHANGELOG.html#non-hostcall-hip-printf, I added the following setting to the RCCL CMakeLists.txt file:

target_compile_options(rccl PRIVATE -mprintf-kind=buffered)

makefiles/common.mk:
CXXFLAGS := -DCUDA_MAJOR=$(CUDA_MAJOR) -DCUDA_MINOR=$(CUDA_MINOR) -fPIC -fvisibility=hidden \
            -Wall -mprintf-kind=buffered -g -Wno-unused-function -Wno-sign-compare -std=c++11 -Wvla \
            -I $(CUDA_INC) \
            $(CXXFLAGS)

After recompiling RCCL, the test reported this error:

enquence.cc Current function: ncclLaunchKernel line 1090
:1:devhcprintf.cpp :265 : 81559524344 us: [pid:65800 tid:0x7f0d2c53d440] Hash value collision detected, printf buffer ill formed
:1:rocvirtual.cpp :3188: 81559524353 us: [pid:65800 tid:0x7f0d2c53d440]
Could not print data from the printf buffer!
:1:rocvirtual.cpp :3280: 81559524355 us: [pid:65800 tid:0x7f0d2c53d440] AQL dispatch failed!
:1:devhcprintf.cpp :265 : 81559524402 us: [pid:65799 tid:0x7ff8fd860440] Hash value collision detected, printf buffer ill formed
:1:rocvirtual.cpp :3188: 81559524410 us: [pid:65799 tid:0x7ff8fd860440]
Could not print data from the printf buffer!
:1:rocvirtual.cpp :3280: 81559524416 us: [pid:65799 tid:0x7ff8fd860440] AQL dispatch failed!
[rank0]: RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
[rank1]: RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

I have set these environment variables:
export HIP_KERNEL_PRINTF=1
export HIP_ENABLE_PRINTF=1
export HCC_ENABLE_PRINTF=1
export AMD_LOG_LEVEL=1

I am using a Linux server with two GPUs. Without printf, the program executes normally. How should I solve this problem?

Operating System

22.04.1 LTS (Jammy Jellyfish)

CPU

12th Gen Intel(R) Core(TM) i7-12700

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 5.7.0

ROCm Component

HIP, HIPCC, rccl

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

[Issue]: missing hip/nvidia_detail/nvidia_hip_runtime.h

Problem Description

Hi, I am testing HIP on an Intel CPU workstation with an Nvidia GPU.
I could build hip/clr as shown in the AMD documentation (after running dos2unix on all sources), but testing square.cu yields an error message saying that "hip/nvidia_detail/nvidia_hip_runtime.h" is missing.
It looks like there is only a hip/amd_detail folder, and I am wondering where I can find the nvidia_detail folder.
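
One quick way to confirm which backend header folders an install actually contains is to list the *_detail directories under its include root. This is a minimal sketch: the function name list_hip_detail_dirs is hypothetical, and the include root is an assumption (whatever install prefix was used when building hip/clr, which is not stated above).

```shell
# Hypothetical helper: list the "*_detail" header folders below <root>/hip.
# $1 is the include root of the HIP install (an assumption; adjust to your prefix).
list_hip_detail_dirs() {
  find "$1/hip" -maxdepth 1 -type d -name '*_detail' -exec basename {} \; | sort
}
```

Calling it with the include directory of the build prints one folder name per line; if only amd_detail appears, the nvidia_detail headers were not installed into that prefix.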

Operating System

RHEL8.8

CPU

intel xeon

GPU

AMD Instinct MI300X, AMD Radeon VII

ROCm Version

ROCm 6.0.0

ROCm Component

clr

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

The GPU selection above is fake: there is no option for an Nvidia GPU.

Odd behaviour with darktable watermarks on Framework 13 AMD 7840U

I have a Framework 13 AMD (Ryzen 7840U, 64GB RAM, 2TB nvme) with the rocm-opencl-runtime installed. darktable recognizes the GPU but when I export an image that has a simple-text-shadow watermark, the following happens:
[Image: dog1]
If I switch to simple-text it works correctly:
[Image: dog2]
I contacted the darktable folks and they threw you guys under the bus.

Update AMD_PLATFORM_BUILD_NUMBER

In rocclr/utils/versions.hpp there is:

#define AMD_PLATFORM_BUILD_NUMBER 3581

This was last updated in March 2023 (one year ago).

However, the corresponding value in ROCm 6.1 RC is currently 3614.

Please keep this public repo in sync with the released ROCm and with the ROCm RC.

This version number is useful for tracking which changes are where. I don't understand how the version in the public repo has not been updated for a year while there are ongoing commits to it. Is the version number simply out of sync, or have the changes made to CLR over the past year not yet been published here?
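
To compare checkouts or releases, the number can be read straight from the header. This is a minimal sketch: the function name platform_build_number is hypothetical, and it assumes the macro is defined on a single line as quoted above.

```shell
# Hypothetical helper: print the value of AMD_PLATFORM_BUILD_NUMBER from a header.
# $1 is the path to versions.hpp (rocclr/utils/versions.hpp in a clr checkout).
platform_build_number() {
  awk '$1 == "#define" && $2 == "AMD_PLATFORM_BUILD_NUMBER" { print $3; exit }' "$1"
}
```

Run from the root of a clr checkout, `platform_build_number rocclr/utils/versions.hpp` prints the number defined in that tree, which can then be compared with the value reported for a given ROCm release.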

Simple HIP driver code crashes when launched in parallel on multi gpu system

Running the simple HIP code:

#include <cstdio>
#include <hip/hip_runtime.h>

#define CHECK(Res)                                                             \
  if (Res != hipSuccess) {                                                     \
    printf(#Res " Failed!\n");                                                 \
    return 1;                                                                  \
  }

int main() {
  hipDevice_t Dev;
  CHECK(hipDeviceGet(&Dev, 0));
  hipCtx_t Ctx;
  CHECK(hipDevicePrimaryCtxRetain(&Ctx, Dev));
  CHECK(hipCtxSetCurrent(Ctx));
  hipEvent_t Ev;
  CHECK(hipEventCreateWithFlags(&Ev, hipEventDefault));
  CHECK(hipEventRecord(Ev, 0));
  CHECK(hipEventDestroy(Ev));
  CHECK(hipDevicePrimaryCtxRelease(Dev));
}

Crashes when run in parallel:

$ cat run.sh 
export AMD_LOG_LEVEL=4

for i in {1..500}; do
   {
     output_file=$(mktemp)  # Create a temporary file for the output
     ./a.out &> $output_file
     if [[ $? -ne 0 ]]; then  # Check if the exit status is non-zero
         cat "$output_file" > error.log   # Save the output
     fi
     rm "$output_file"       # Remove the temporary file
   } &
done
wait
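
One caveat with the script above: every failing run writes to the same error.log with `>`, so when several runs fail concurrently, the log that survives is whichever failing run wrote last. The following is a minimal sketch of a variant that keeps one log per failing run. The function name run_parallel and the run-N.log naming are hypothetical and not part of the original report.

```shell
# Hypothetical helper: run a command <count> times in parallel and keep
# one log file per failing run in <log_dir>.
# Usage: run_parallel <count> <log_dir> <command> [args...]
run_parallel() {
  local count="$1" log_dir="$2"
  shift 2
  mkdir -p "$log_dir"
  local i
  for i in $(seq 1 "$count"); do
    {
      local out
      out=$(mktemp)                  # temporary file for this run's output
      if "$@" > "$out" 2>&1; then
        rm -f "$out"                 # success: discard the output
      else
        mv "$out" "$log_dir/run-$i.log"   # failure: keep a per-run log
      fi
    } &
  done
  wait                               # wait for all background runs
}
```

With AMD_LOG_LEVEL exported as in the original script, `run_parallel 500 ./logs ./a.out` would leave one run-N.log in ./logs for each run that exited with a non-zero status.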

Here is the error.log:

$ cat error.log 
:3:rocdevice.cpp            :434 : 136942947233 us: 44761: [tid:0x7f8f2a317f00] Initializing HSA stack.
:3:comgrctx.cpp             :33  : 136957862005 us: 44761: [tid:0x7f8f2a317f00] Loading COMGR library.
:3:rocdevice.cpp            :202 : 136957862078 us: 44761: [tid:0x7f8f2a317f00] Numa selects cpu agent[3]=0x308ec0(fine=0x3090e0,coarse=0x304c40) for gpu agent=0x305930
:3:rocdevice.cpp            :1635: 136957862488 us: 44761: [tid:0x7f8f2a317f00] HMM support: 1, xnack: 0, direct host access: 0

:4:rocdevice.cpp            :2012: 136957864821 us: 44761: [tid:0x7f8f2a317f00] Allocate hsa host memory 0x7f8cf9200000, size 0x101000
:4:rocdevice.cpp            :2012: 136957865040 us: 44761: [tid:0x7f8f2a317f00] Allocate hsa host memory 0x7f8cf9000000, size 0x101000
:3:rocdevice.cpp            :202 : 136957872821 us: 44761: [tid:0x7f8f2a317f00] Numa selects cpu agent[3]=0x308ec0(fine=0x3090e0,coarse=0x304c40) for gpu agent=0x3279c0
:3:rocdevice.cpp            :1635: 136957873016 us: 44761: [tid:0x7f8f2a317f00] HMM support: 1, xnack: 0, direct host access: 0

:4:rocdevice.cpp            :2012: 136957873079 us: 44761: [tid:0x7f8f2a317f00] Allocate hsa host memory 0x7f8f2a324000, size 0x70
:4:rocdevice.cpp            :2012: 136957873385 us: 44761: [tid:0x7f8f2a317f00] Allocate hsa host memory 0x7f8cf8e00000, size 0x101000
:4:rocdevice.cpp            :2012: 136957873805 us: 44761: [tid:0x7f8f2a317f00] Allocate hsa host memory 0x7f8cf8c00000, size 0x101000
:4:runtime.cpp              :83  : 136957873877 us: 44761: [tid:0x7f8f2a317f00] init
:3:hip_context.cpp          :48  : 136957873881 us: 44761: [tid:0x7f8f2a317f00] Direct Dispatch: 1
:3:hip_device.cpp           :169 : 136957873903 us: 44761: [tid:0x7f8f2a317f00] hipDeviceGet: Returned hipSuccess : 
:3:hip_context.cpp          :383 : 136957873918 us: 44761: [tid:0x7f8f2a317f00]  hipDevicePrimaryCtxRetain ( 0x7ffe1b0eec00, 0 ) 
:3:hip_context.cpp          :394 : 136957873922 us: 44761: [tid:0x7f8f2a317f00] hipDevicePrimaryCtxRetain: Returned hipSuccess : 
:3:hip_context.cpp          :179 : 136957873930 us: 44761: [tid:0x7f8f2a317f00]  hipCtxSetCurrent ( context:0x38c670 ) 
:3:hip_context.cpp          :193 : 136957873934 us: 44761: [tid:0x7f8f2a317f00] hipCtxSetCurrent: Returned hipSuccess : 
:3:hip_event.cpp            :321 : 136957873942 us: 44761: [tid:0x7f8f2a317f00]  hipEventCreateWithFlags ( 0x7ffe1b0eebf8, 0 ) 
:3:hip_event.cpp            :327 : 136957873948 us: 44761: [tid:0x7f8f2a317f00] hipEventCreateWithFlags: Returned hipSuccess : event:0x38d980
:3:hip_event.cpp            :396 : 136957873955 us: 44761: [tid:0x7f8f2a317f00]  hipEventRecord ( event:0x38d980, stream:<null> ) 
:3:rocdevice.cpp            :2822: 136957873966 us: 44761: [tid:0x7f8f2a317f00] number of allocated hardware queues with low priority: 0, with normal priority: 0, with high priority: 0, maximum per priority is: 4
:4:command.cpp              :349 : 136959305223 us: 44761: [tid:0x7f8f2a317f00] Command (InternalMarker) enqueued: 0x38ea60
run.sh: line 4: 40305 Segmentation fault      ./a.out &> $output_file

Using rocm-5.6.0.

$ rocminfo
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD EPYC 7A53 64-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7A53 64-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    130797524(0x7cbcfd4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    130797524(0x7cbcfd4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    130797524(0x7cbcfd4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    AMD EPYC 7A53 64-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7A53 64-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        1                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    132112468(0x7dfe054) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    132112468(0x7dfe054) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    132112468(0x7dfe054) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 3                  
*******                  
  Name:                    AMD EPYC 7A53 64-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7A53 64-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        2                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    132112468(0x7dfe054) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    132112468(0x7dfe054) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    132112468(0x7dfe054) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 4                  
*******                  
  Name:                    AMD EPYC 7A53 64-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7A53 64-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    3                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        3                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    132090580(0x7df8ad4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    132090580(0x7df8ad4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    132090580(0x7df8ad4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 5                  
*******                  
  Name:                    gfx90a                             
  Uuid:                    GPU-a5c82df98194e170               
  Marketing Name:                                             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    4                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29704(0x7408)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1700                               
  BDFID:                   49408                              
  Internal Node ID:        4                                  
  Compute Unit:            110                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    2048(0x800)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    67092480(0x3ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    67092480(0x3ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 6                  
*******                  
  Name:                    gfx90a                             
  Uuid:                    GPU-01c9def4489c62b5               
  Marketing Name:                                             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    5                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29704(0x7408)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1700                               
  BDFID:                   50688                              
  Internal Node ID:        5                                  
  Compute Unit:            110                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    2048(0x800)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    67092480(0x3ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    67092480(0x3ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

OS:

$ cat /etc/os-release 
NAME="SLES"
VERSION="15-SP4"
VERSION_ID="15.4"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP4"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp4"
DOCUMENTATION_URL="https://documentation.suse.com/"

Missing warp match functions in HIP

Hi,

As pointed out in ROCm/hipamd#65, match_any/match_all are not available in HIP.
They are available in CUDA (cf. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-match-functions ) and can be implemented on AMD GPUs with Vega or newer architectures (the intrinsic corresponds to "WaveMatch" in HLSL Shader Model 6.5, https://microsoft.github.io/DirectX-Specs/d3d/HLSL_ShaderModel6_5.html#wavematch-function , which is supported on Vega+).

Therefore it seems like they can and should be added.

For example, match_any can be implemented as shown in llvm/llvm-project#62477:

static inline __device__ uint64_t  __match_any(int value) {
  bool active = true;
  uint64_t result = 0;

  while (active) {
    // determine what threads have the same value as the currently first active thread
    int first_active_value = __builtin_amdgcn_readfirstlane(value);
    int predicate = (value == first_active_value);
    uint64_t m = __ballot(predicate); // THIS LINE IS PROBLEMATIC

    // if the current thread has the same value, set its result mask to the current one
    if (predicate) {
      result |= m;
      active = false;
    }
  }

  return result;
}

Compiler bugs used to make implementations like the one above hard to get right, but they have been fixed.
Feel free to use that code if you want to.
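
To check the loop above without a GPU, here is a host-side sketch of the semantics match_any is expected to have: each lane receives a bitmask of all lanes that hold the same value. The name match_any_reference and the use of a std::vector to stand in for a wavefront are assumptions for illustration. It mirrors the readfirstlane/ballot loop but is plain C++, not HIP code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of the loop above. lanes[i] is the value held by lane i
// (at most 64 lanes, matching a 64-wide wavefront). Hypothetical helper,
// not part of HIP.
std::vector<uint64_t> match_any_reference(const std::vector<int>& lanes) {
  assert(lanes.size() <= 64);
  std::vector<uint64_t> result(lanes.size(), 0);
  std::vector<bool> active(lanes.size(), true);

  for (;;) {
    // Model of readfirstlane: value of the lowest still-active lane.
    std::size_t first = 0;
    while (first < lanes.size() && !active[first]) ++first;
    if (first == lanes.size()) break;  // every lane has its mask
    const int first_active_value = lanes[first];

    // Model of ballot: one bit per active lane whose value matches.
    uint64_t m = 0;
    for (std::size_t i = 0; i < lanes.size(); ++i)
      if (active[i] && lanes[i] == first_active_value) m |= uint64_t{1} << i;

    // Matching lanes record the mask and leave the loop.
    for (std::size_t i = 0; i < lanes.size(); ++i)
      if (active[i] && lanes[i] == first_active_value) {
        result[i] = m;
        active[i] = false;
      }
  }
  return result;
}
```

For lane values {7, 3, 7, 3, 5}, lanes 0 and 2 get mask 0b00101, lanes 1 and 3 get 0b01010, and lane 4 gets 0b10000.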

Best regards,
Epliz

[Issue]: ODR Violations Due to Missing inline Specifiers in bfloat16 Conversion Functions

Problem Description

Starting from ROCm v5.7, the introduction of certain bfloat16 conversion functions in the header files has led to One Definition Rule (ODR) violations when building projects. This is due to some host functions not being specified as inline or static, resulting in linkage errors across multiple translation units.

Temporary Workaround

A temporary workaround involves manually modifying the header file to add the inline keyword to the __HOST_DEVICE__ macro definition. Specifically, changing line 96 in /opt/rocm/include/hip/amd_detail/amd_hip_bf16.h to:

#define __HOST_DEVICE__ __host__ __device__ inline

This resolves the linkage issue but is not a sustainable solution.

Expected Behavior

The host functions should be defined with inline or static specifiers to prevent ODR violations and ensure that the header files can be safely included across multiple translation units without causing linkage errors.
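
The rule behind this request: a non-template function defined in a header needs inline (or static). Otherwise every translation unit that includes the header emits its own external definition, and the linker reports duplicates. Below is a minimal host-only sketch. MY_HOST_DEVICE and float_to_bf16_bits are hypothetical stand-ins, not the names or the conversion logic used in amd_hip_bf16.h.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the header's qualifier macro. The real macro
// also carries the HIP host/device qualifiers; only `inline` matters for
// the ODR question, so this host-only sketch keeps just that.
#define MY_HOST_DEVICE inline

// Imagine this definition lives in a header included by many .cpp files.
// Without `inline`, each translation unit would export its own copy and
// the link step would fail with duplicate-definition errors.
MY_HOST_DEVICE uint16_t float_to_bf16_bits(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));      // well-defined type punning
  return static_cast<uint16_t>(bits >> 16);  // truncate; no rounding
}
```

Marking the function static would also link, at the cost of one private copy per translation unit.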

Additional Context

It appears that there is an ongoing effort to fix this issue, as seen in the commit 86bd518981b364c138f9901b28a529899d8654f3. However, this fix does not seem to be included in any of the ROCm releases.
Users attempting to install vLLM on ROCm, specifically after vLLM-rocm is merged into the mainline vLLM, may encounter issues due to the aforementioned ODR violations. It would be beneficial for the community if such fixes were included in an official ROCm release to avoid the need for manual intervention and to ensure clean and maintainable codebases.

Operating System

Ubuntu 22.04.3

CPU

AMD EPYC 7763 64-Core Processor

GPU

AMD Instinct MI250, AMD Instinct MI210

ROCm Version

ROCm 6.0.0, ROCm 5.7.1

ROCm Component

No response

Steps to Reproduce

  1. Build a project on ROCm that includes the bfloat16 conversion functions from the header file /opt/rocm/include/hip/amd_detail/amd_hip_bf16.h.
  2. Observe the ODR violations at link time when multiple translation units include this header.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

Blender crashes with HIP 5.6.0 on AMD Ryzen 5 5625U

Main Problem

Rendering a project with HIP crashes Blender 3.6.2. This is on Arch Linux. A backtrace of the crash follows.

Backtraces

#0  0x00007fff84d58db8 in hip_impl::ihipOccupancyMaxActiveBlocksPerMultiprocessor(int*, int*, int*, amd::Device const&, ihipModuleSymbol_t*, int, unsigned long, bool) [clone .constprop.0]
    (maxBlocksPerCU=maxBlocksPerCU@entry=0x7ffe609a7b68, numBlocksPerGrid=numBlocksPerGrid@entry=0x7ffe609a7b70, bestBlockSize=bestBlockSize@entry=0x7ffe609a7b5c, device=..., func=func@entry=0x7ffe69001780, inputBlockSize=1024, inputBlockSize@entry=0, dynamicSMemSize=0, bCalcPotentialBlkSz=true)
    at /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.0/hipamd/src/hip_platform.cpp:344
#1  0x00007fff84c43c35 in hipModuleOccupancyMaxPotentialBlockSize(int*, int*, hipFunction_t, size_t, int)
    (gridSize=<optimized out>, blockSize=0x7ffe8ffb9558, f=0x7ffe69001780, dynSharedMemPerBlk=<optimized out>, blockSizeLimit=<optimized out>)
    at /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.0/hipamd/src/hip_platform.cpp:426
#2  0x00005555580a265b in ccl::HIPDeviceKernels::load(ccl::HIPDevice*) ()
#3  0x00005555580a2577 in ccl::HIPDevice::load_kernels(unsigned int) ()
#4  0x0000555557ef37ad in ccl::Scene::load_kernels(ccl::Progress&) ()
#5  0x0000555558043836 in ccl::Session::run_update_for_next_iteration() ()
#6  0x0000555558044e2b in ccl::Session::run_main_render_loop() ()
#7  0x000055555804593c in ccl::Session::thread_render() ()
#8  0x0000555558045b03 in ccl::Session::thread_run() ()
#9  0x000055555841e4de in ccl::thread::run(void*) ()
#10 0x00007fffe56e1943 in std::execute_native_thread_routine(void*) (__p=0x7ffe8ff0a620) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:104
#11 0x00007fffe528c9eb in start_thread (arg=<optimized out>) at pthread_create.c:444
#12 0x00007fffe5310dfc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
Thread 74 "blender" received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7ffe64aad000 (LWP 50176)]
0x00007fff86358db8 in hip_impl::ihipOccupancyMaxActiveBlocksPerMultiprocessor(int*, int*, int*, amd::Device const&, ihipModuleSymbol_t*, int, unsigned long, bool) [clone .constprop.0] (maxBlocksPerCU=maxBlocksPerCU@entry=0x7ffe64aa8b48, numBlocksPerGrid=numBlocksPerGrid@entry=0x7ffe64aa8b50, 
    bestBlockSize=bestBlockSize@entry=0x7ffe64aa8b3c, device=..., func=func@entry=0x7ffe6ae01780, inputBlockSize=1024, inputBlockSize@entry=0, dynamicSMemSize=0, 
    bCalcPotentialBlkSz=true) at /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.0/hipamd/src/hip_platform.cpp:344
344         VgprWaves = maxVGPRs / amd::alignUp(wrkGrpInfo->usedVGPRs_, VgprGranularity);                                                                              
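
The faulting line is an integer division, and on x86-64 Linux an integer division by zero is reported as SIGFPE. One plausible reading, which the backtrace alone does not confirm, is that usedVGPRs_ is 0 for this kernel, so rounding it up to the granularity still gives 0. The sketch below shows the arithmetic with a hypothetical align_up helper (not the amd::alignUp source) and a hypothetical guard.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round value up to a multiple of alignment
// (alignment must be a power of two). Not the amd::alignUp source.
constexpr std::size_t align_up(std::size_t value, std::size_t alignment) {
  return (value + alignment - 1) & ~(alignment - 1);
}

// Guarded form of the division on the faulting line. Returns 0 waves when
// the rounded register count is 0. Choosing 0 here is an assumption; a real
// fix might instead treat the kernel as not VGPR-limited.
constexpr std::size_t vgpr_waves(std::size_t max_vgprs, std::size_t used_vgprs,
                                 std::size_t granularity) {
  const std::size_t rounded = align_up(used_vgprs, granularity);
  return rounded == 0 ? 0 : max_vgprs / rounded;
}
```

With a granularity of 8, align_up(0, 8) is 0, so an unguarded division would trap, while align_up(1, 8) is 8 and divides safely.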

System Information

  • Arch Linux
  • ROCm 5.6.0
  • Mesa 23.1.6
  • Linux 6.4.12

Fail to build shared libs and OpenCL not detected with static lib - libamdocl64.a: invalid ELF header

Hi there
I can't build rocm-clr with BUILD_SHARED_LIBS=ON.

So, I used:

cmake \
    -Wno-dev \
    -S "." \
    -B build \
    -DCMAKE_C_COMPILER=%{install_prefix}/llvm/bin/clang \
    -DLLVM_DIR=%{install_prefix}/llvm/lib/cmake/llvm \
    -DCLR_BUILD_OCL=ON \
    -DROCM_PATH=%{_usr} \
    -DClang_DIR=%{install_prefix}/llvm/lib/cmake/clang \
    -DLLD_DIR=%{install_prefix}/llvm/lib/cmake/lld \
    -DBUILD_SHARED_LIBS=OFF

with install_prefix = /usr/lib64/rocm

The build runs fine and libamdocl64.a is created (no .so is created, obviously).

I created /etc/OpenCL/vendors/amdocl64.icd containing the path /usr/lib64/rocm//libamdocl64.a.
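
The "invalid ELF header" message in the title is consistent with how ICD files are normally consumed: the OpenCL ICD loader opens the listed path as a shared library, and a static archive is an ar file rather than an ELF object. Here is a small sketch that classifies a file by its leading magic bytes. classify_library is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstring>
#include <fstream>
#include <string>

enum class FileKind { Elf, ArArchive, Other };

// Classify a file by its first bytes: ELF files start with 0x7F 'E' 'L' 'F',
// ar archives (static .a libraries) start with "!<arch>\n".
// Unreadable or too-short files are reported as Other.
FileKind classify_library(const std::string& path) {
  char magic[8] = {};
  std::ifstream in(path, std::ios::binary);
  in.read(magic, sizeof(magic));
  const std::streamsize n = in.gcount();
  if (n >= 4 && std::memcmp(magic, "\x7f" "ELF", 4) == 0) return FileKind::Elf;
  if (n >= 8 && std::memcmp(magic, "!<arch>\n", 8) == 0) return FileKind::ArArchive;
  return FileKind::Other;
}
```

A path that classifies as ArArchive cannot be loaded at run time, so the .icd file would need to point at a shared object (.so) instead.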

rocminfo finds my GPU:

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 9 5900X 12-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 9 5900X 12-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3700                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            24                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    32782820(0x1f439e4) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32782820(0x1f439e4) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32782820(0x1f439e4) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1032                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6600                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
    L3:                      32768(0x8000) KB                   
  Chip ID:                 29695(0x73ff)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2750                               
  BDFID:                   2304                               
  Internal Node ID:        1                                  
  Compute Unit:            28                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 109                                
  SDMA engine uCode::      76                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1032         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

but clinfo from the ROCm package fails with:

dlerror: /usr/lib64/rocm/libamdocl64.a: invalid ELF header
ERROR: clGetPlatformIDs(-1001)

I am afraid I am in a catch-22 situation: amdocl64.icd seems to require libamdocl64.so, but the build only gives me libamdocl64.a.

What is the way to get out of that? I have not been able to find any hints in https://rocm.docs.amd.com/en/latest/

Thanks for your support!

HIP build failing with hiprtc related error.

I am seeing the following error during the clr build (ROCm 5.6.1).
The build command is a typical cmake invocation with its parameters, followed by make:

    HIP_CLANG_PATH=$CONFIG_INSTALL_PREFIX/llvm/bin CXX=$CONFIG_INSTALL_PREFIX/llvm/bin/clang++ cmake .. \
        -DCMAKE_CXX_COMPILER=$CONFIG_INSTALL_PREFIX/llvm/bin/clang++ \
        -DCMAKE_PREFIX_PATH=$CONFIG_INSTALL_PREFIX \
        -DClang_DIR=$CONFIG_INSTALL_PREFIX/llvm/lib/cmake/clang/ \
        -DCLR_BUILD_HIP=ON \
        -DHIP_COMMON_DIR=$ROCM_SRC_FOLDER/HIP \
        -DROCCLR_PATH=$ROCM_SRC_FOLDER/clr/rocclr \
        -DHIPCC_BIN_DIR=$ROCM_SRC_FOLDER/HIPCC/bin 
    make -j32

Partial logs:

[100%] Building CXX object hipamd/src/CMakeFiles/amdhip64.dir/hip_vm.cpp.o
[100%] Building CXX object hipamd/src/CMakeFiles/amdhip64.dir/hiprtc/hiprtcComgrHelper.cpp.o
In file included from /home/jd538/ROCm-5.6/clr/hipamd/src/hip_peer.cpp:23:
In file included from /home/jd538/ROCm-5.6/clr/hipamd/src/hip_internal.hpp:28:
/home/jd538/ROCm-5.6/clr/hipamd/src/hip_formatting.hpp:416:10: error: comparison of different enumerati>
    case HIPRTC_JIT_NUM_OPTIONS:
         ^~~~~~~~~~~~~~~~~~~~~~
/home/jd538/ROCm-5.6/clr/hipamd/src/hip_formatting.hpp:413:10: error: comparison of different enumerati>
    case HIPRTC_JIT_FMA:
         ^~~~~~~~~~~~~~
/home/jd538/ROCm-5.6/clr/hipamd/src/hip_formatting.hpp:410:10: error: comparison of different enumerati>
    case HIPRTC_JIT_PREC_SQRT:

[Issue]: clr-rocm-6.0.2/rocclr/os/os_posix.cpp:321: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.

Problem Description

I'm on Gentoo Linux ppc64le (4K page size) using linux-6.7.6.
GPU is AMD RX 570 (mesa 24.0.1).
LLVM is 17.0.6.
I managed to build rocm-opencl-runtime-6.0.2 successfully, but I had to use the -DNO_WARN_X86_INTRINSICS compile flag; otherwise the build fails.
Full build log without -DNO_WARN_X86_INTRINSICS: rocm-opencl-runtime-6.0.2.build.log
I have also been carrying this patch since v5; it used to fix the tests:

--- ./opencl/tests/ocltst/module/perf/OCLPerfKernelThroughput.h.orig    2024-02-26 09:53:53.925778934 +0100
+++ ./opencl/tests/ocltst/module/perf/OCLPerfKernelThroughput.h 2024-02-26 09:54:09.165774504 +0100
@@ -45,7 +45,7 @@
 #define UNSIGNED_LARGE_INT unsigned long long
 #define MAX_LOOP_ITER 10
 typedef cl_float4 float4;
-typedef void (*CPUKernel)(__m128 *, __m128 *, unsigned int);
+typedef void (*CPUKernel)(__ibm128 *, __ibm128 *, unsigned int);
 
 class OCLPerfKernelThroughput : public OCLTestImp {
  public:

Unfortunately both clinfo and rocminfo still fail at runtime like they used to fail with 5.4.3:

talos2 ~ # clinfo 
clinfo: /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/os/os_posix.cpp:321: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.
Aborted (core dumped)

clinfo: /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/os/os_posix.cpp:321: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.

Program received signal SIGABRT, Aborted.
0x00003ffff7ca819c in ?? () from /usr/lib64/libc.so.6
(gdb) backtrace
#0  0x00003ffff7ca819c in ?? () from /usr/lib64/libc.so.6
#1  0x00003ffff7c4525c in raise () from /usr/lib64/libc.so.6
#2  0x00003ffff7c2543c in abort () from /usr/lib64/libc.so.6
#3  0x00003ffff7c39398 in ?? () from /usr/lib64/libc.so.6
#4  0x00003ffff7c39444 in __assert_fail () from /usr/lib64/libc.so.6
#5  0x00003ffff78cd504 in amd::Os::currentStackInfo (base=base@entry=0x100073630, size=size@entry=0x100073638) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/os/os_posix.cpp:321
#6  0x00003ffff78fbd98 in amd::HostThread::HostThread (this=0x1000735d0) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/thread/thread.cpp:34
#7  0x00003ffff78fbe8c in amd::Thread::init () at /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/thread/thread.cpp:170
#8  0x00003ffff78ccae8 in amd::Os::init () at /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/os/os_posix.cpp:170
#9  amd::Os::init () at /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/os/os_posix.cpp:155
#10 0x00003ffff783d0b8 in amd::init () at /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/os/os_posix.cpp:136
#11 0x00003ffff7fa5dfc in ?? () from /lib64/ld64.so.2
#12 0x00003ffff7fb9f18 in ?? () from /lib64/ld64.so.2
#13 0x00003ffff7f9f420 in _dl_catch_exception () from /lib64/ld64.so.2
#14 0x00003ffff7fba0d8 in ?? () from /lib64/ld64.so.2
#15 0x00003ffff7f9f37c in _dl_catch_exception () from /lib64/ld64.so.2
#16 0x00003ffff7fbb97c in ?? () from /lib64/ld64.so.2
#17 0x00003ffff7c9ed24 in ?? () from /usr/lib64/libc.so.6
#18 0x00003ffff7f9f37c in _dl_catch_exception () from /lib64/ld64.so.2
#19 0x00003ffff7f9f4fc in ?? () from /lib64/ld64.so.2
#20 0x00003ffff7c9e5f8 in ?? () from /usr/lib64/libc.so.6
#21 0x00003ffff7c9ee34 in dlopen () from /usr/lib64/libc.so.6
#22 0x00003ffff7f408a0 in ?? () from /usr/lib64/libOpenCL.so.1
#23 0x00003ffff7f3419c in ?? () from /usr/lib64/libOpenCL.so.1
#24 0x00003ffff7f40228 in ?? () from /usr/lib64/libOpenCL.so.1
#25 0x00003ffff7f404e4 in ?? () from /usr/lib64/libOpenCL.so.1
#26 0x00003ffff7cacf40 in ?? () from /usr/lib64/libc.so.6
#27 0x00003ffff7f40858 in ?? () from /usr/lib64/libOpenCL.so.1
#28 0x00003ffff7f34118 in ?? () from /usr/lib64/libOpenCL.so.1
#29 0x00003ffff7f36498 in clGetPlatformIDs () from /usr/lib64/libOpenCL.so.1
#30 0x0000000100008b58 in ?? ()
#31 0x00003ffff7c25c2c in ?? () from /usr/lib64/libc.so.6
#32 0x00003ffff7c25e6c in __libc_start_main () from /usr/lib64/libc.so.6
#33 0x0000000000000000 in ?? ()
talos2 ~ # rocminfo 
ROCk module is loaded
Segmentation fault (core dumped)

ROCk module is loaded

Program received signal SIGSEGV, Segmentation fault.
0x00003ffff7e5840c in rocr::os::callback (info=0x3fffffffda60, size=<optimized out>, data=0x3fffffffdb40) at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/util/lnx/os_linux.cpp:314
warning: 314	/var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/util/lnx/os_linux.cpp: No such file or directory
(gdb) backtrace
#0  0x00003ffff7e5840c in rocr::os::callback (info=0x3fffffffda60, size=<optimized out>, data=0x3fffffffdb40) at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/util/lnx/os_linux.cpp:314
#1  0x00003ffff77be50c in dl_iterate_phdr () from /usr/lib64/libc.so.6
#2  0x00003ffff7e58780 in rocr::os::GetLoadedToolsLib () at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/util/lnx/os_linux.cpp:332
#3  0x00003ffff7ebc3a8 in rocr::core::Runtime::LoadTools (this=this@entry=0x10003f1b0) at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/runtime/runtime.cpp:1745
#4  0x00003ffff7ebd460 in rocr::core::Runtime::Load (this=0x10003f1b0) at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/runtime/runtime.cpp:1539
#5  0x00003ffff7ebd688 in rocr::core::Runtime::Acquire () at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/runtime/runtime.cpp:116
#6  0x00003ffff7e8e1e8 in rocr::HSA::hsa_init () at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/runtime/hsa.cpp:206
#7  0x00003ffff7ed42fc in hsa_init () at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/common/hsa_table_interface.cpp:68
#8  0x00000001000027cc in ?? ()
#9  0x00003ffff7625c2c in ?? () from /usr/lib64/libc.so.6
#10 0x00003ffff7625e6c in __libc_start_main () from /usr/lib64/libc.so.6
#11 0x0000000000000000 in ?? ()

Operating System

Gentoo Linux ppc64le (4K page size)

CPU

IBM Power 9

GPU

AMD RX 570

ROCm Version

ROCm 6.0.2

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

The calls to currentStackInfo cause issues if a library uses stackful coroutines

The context here is JuliaGPU/AMDGPU.jl#549, where Julia switches the thread stack in order to have goroutine-like tasks. That causes some issues, specifically in the call to https://github.com/ROCm-Developer-Tools/clr/blob/1a0c3e4dc45237716c08a6b74be5ff6405422ead/rocclr/os/os_posix.cpp#L290-L312 from https://github.com/ROCm-Developer-Tools/clr/blob/d96481fb3609720058180ca5aa02b1da57df68a5/rocclr/thread/thread.cpp#L32-L36 .
From a quick look, the stackBase_ and stackSize_ fields do not seem to be used anywhere in the library, so I was curious whether they have any use, or whether they could be removed to avoid issues like these.

No OpenCL devices found after creating OpenGL context

Since updating to the latest AMD Radeon Software for Linux, our software, Zivid Studio, is unable to find any AMD OpenCL device.

In debugging this issue, I found that if clGetDeviceIDs is called after an OpenGL context is initialized it returns CL_DEVICE_NOT_FOUND. When this happens, the following is written to the kernel log:

[Mon Oct  9 14:28:21 2023] amdgpu: Failed to create process VM object
[Mon Oct  9 14:28:21 2023] amdgpu: Failed to create process VM object

Here is a minimal example that reproduces the issue on my system:

#include <GL/glut.h>
#include <CL/cl.h>
#include <stdio.h>
#include <string.h>

void listOpenCLDevices() {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, NULL, &num_platforms);
    if (num_platforms == 0) {
        printf("No OpenCL platforms found\n");
        return;
    }
    cl_platform_id platforms[num_platforms];
    clGetPlatformIDs(num_platforms, platforms, NULL);

    printf("OpenCL Platforms:\n");
    for (cl_uint i = 0; i < num_platforms; i++) {
        char name[128];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("  Platform %u: %s\n", i, name);

        // num_devices stays 0 if the query fails (e.g. CL_DEVICE_NOT_FOUND)
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);

        printf("  Devices:\n");
        if (num_devices == 0) {
            continue;
        }
        cl_device_id devices[num_devices];
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, num_devices, devices, NULL);

        for (cl_uint j = 0; j < num_devices; j++) {
            char device_name[128];
            clGetDeviceInfo(devices[j], CL_DEVICE_NAME, sizeof(device_name), device_name, NULL);
            printf("    Device %u: %s\n", j, device_name);
        }
    }
}

int main(int argc, char **argv) {
    if (argc == 2 && strcmp(argv[1], "nogl") == 0) {
        listOpenCLDevices();
        return 0;
    }
    
    // Initialize GLUT and create a window
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
    glutInitWindowSize(640, 480);
    glutCreateWindow("Test");

    // List OpenCL devices
    listOpenCLDevices();

    return 0;
}

And here is a Makefile to compile it:

CC = gcc
CFLAGS = -Wall -Wextra -O2
LIBS = -lGL -lGLU -lglut -lOpenCL

all: program

program: main.c
	$(CC) $(CFLAGS) -o program main.c $(LIBS)

clean:
	rm -f program

Output:

❯ ./program     
dlerror: /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so: cannot open shared object file: No such file or directory
OpenCL Platforms:
  Platform 0: AMD Accelerated Parallel Processing
  Devices:
  Platform 1: Intel(R) OpenCL
  Devices:
    Device 0: AMD Ryzen 9 7950X3D 16-Core Processor

When running without OpenGL:

❯ ./program nogl
dlerror: /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so: cannot open shared object file: No such file or directory
OpenCL Platforms:
  Platform 0: AMD Accelerated Parallel Processing
  Devices:
    Device 0: gfx1034
    Device 1: gfx1036
  Platform 1: Intel(R) OpenCL
  Devices:
    Device 0: AMD Ryzen 9 7950X3D 16-Core Processor

System Information:
CPU: AMD Ryzen 9 7950X3D 16-Core Processor
GPU: AMD Radeon RX 6500 XT
OS: Ubuntu 22.04.3 LTS (Kernel 6.2)

Sorry if this is the wrong place to report this issue.

[QA] error: use of undeclared identifier '__asm__'

Problem Description

I use inline asm in OpenCL code, but I get the error "error: use of undeclared identifier '__asm__'" on RX588 with driver 20.4.2.

It works on RX6900xt, but the error appears on RX 588.

My code looks like this:

 __asm__ volatile("v_mad_u64_u32 %[t], null, %[aj], %[bi], 0;"      \
                           : [t] "=v"(tl[j])                                 \
                           : [aj] "v"(a->data[j]), [bi] "v"(b->data[i]));    \

Operating System

Windows 10

CPU

amd R9

GPU

AMD Radeon VII

ROCm Version

ROCm 5.5.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

Why don't we just set all the fences to dirty?

I noticed a lock before resetFenceDirty() in iHipWaitActiveStreams (3c6505c). My question is: why don't we just set all the fences to dirty after dispatchBarrierPacket (see the line below)?

if (cache_state == amd::Device::kCacheStateSystem) {

Would adding another lock and loop for this hurt performance?

Why can't the other CacheState values just set the fences dirty, like kCacheStateSystem?

Thanks and hope you answer!

OpenCL Performance: way to extract parallel kernel execution with out-of-order command queue

I'm developing an OpenCL application PRPLL/GpuOwl https://github.com/preda/gpuowl/tree/prpll for a primes search project.

The app runs a long series of kernels serially, in a long loop, e.g. let's say this is the sequence of kernels submitted:
A, B, C, D, A, B, C, D, and so on. As these kernels are to be run serially, it's natural to use an in-order queue.

So initially we had a single process, with a single in-order queue.

An observation was made that when running two such processes in parallel (independent processes, running on the same GPU), the performance is a bit better than "half". I.e. the aggregate throughput was improved by running two processes in parallel on one GPU vs. running a single process on the GPU.

Taking this observation into account, I wanted to reproduce the same behaviour (observed when running two processes) in a single process by running two "logical" streams of kernels in a single process. The logic being that while each stream is serial, there is parallelism between the two streams that can be exploited by the GPU. E.g. we want to run A1,B1,C1,D1 on stream1, and A2,B2,C2,D2 on stream2, then A1 can be executed on the GPU in parallel with any kernel from stream 2. (by "stream" I mean a logical sequence of kernels that must be executed serially/in-order).

My first approach was to use two in-order command-queues, allocating one queue to each logical "stream". But I hit this bug ROCm/ROCR-Runtime#186 which causes one hot thread (100%CPU) and perf degradation when using two queues.

As a consequence, I decided to use a single out-of-order command queue, and model the serial dependence inside the logical streams with OpenCL event wait-lists. Unfortunately after implementing this, I realized that there is no parallelism exploited between the two "streams". It appears that no kernels are executed in parallel at all, even though some could and should be executed in parallel.

Example: let's assume these are the kernels submitted to the out-of-order queue:
A1,C2,B1,D2,C1, with dependence modelled through events: A1<B1<C1 and C2<D2. Then A1 and C2 could be run in parallel on the GPU.

(Another scenario is: A1,B1,A2 with the dependence A1<B1; here A1 and A2 are eligible for parallel running, though this fact is less obvious. I would hope this parallelism opportunity can be exploited as well.)

But this is not what is observed: by timing the kernels, I obtain a profile that is consistent with all the kernels being run serially.

When the kernels are run by way of two processes, I see that the "running" time of the kernels grows (almost doubles) as a consequence of two processes using the GPU in parallel. The kernels from the two processes are effectively executed in parallel, and this is seen in the per-kernel running time, and in the overall improved throughput.

But when the kernels are run through an "interleaved out-of-order queue", the running time of each kernel does not increase. That means that each kernel is executed "standalone", and no parallelism is exploited. The aggregate throughput is consistent with running serially (lower than when running through two processes).

Basically, I want to be able to obtain the same level of parallelism and performance by running a single process (either with multiple queues, or with a single out-of-order queue) as what is obtained by running two processes with a single in-order queue each.

The story can be reproduced using this project (at the given commit, or generally the "prpll" branch):
https://github.com/preda/gpuowl/tree/7520fade45359f07f19151085d1dff5480ab29a9
compiling with make in the source folder,
executing echo PRP=118845473 > work-1.txt
and running with
./build-debug/prpll -d 0 -prp 118063003 -verbose

(basically the above runs two PRP tests for the two numbers mentioned, one in the work-1.txt file, and one on the command line).

Support non-x86_64 platform

Hi,
Some code uses <immintrin.h>, which cannot be built on non-x86_64 platforms. Could you add a generic fallback implementation?
Thanks!

Segfaults when compiled with `-march=native`

ROCm: 5.7.1
gcc (Gentoo 13.2.1_p20230826 p7) 13.2.1 20230826
Linux desktop 6.6.0-pf2 #1 SMP PREEMPT_DYNAMIC
AMD Ryzen 9 5900X + 7900XTX

When I compile with -march=native, I get segfaults in OpenCL and HIP applications. For example, when I run ./ocltst -m liboclperf.so I get the following trace:

(gdb) bt
#0  _mm256_stream_si256 (__B=..., __A=<optimized out>) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/device/rocm/rocvirtual.cpp:2807
#1  roc::nontemporalMemcpy (size=96, src=0x55c4f440bdb0, dst=0x7f0985600000) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/device/rocm/rocvirtual.cpp:2807
#2  roc::VirtualGPU::submitKernelInternal (this=this@entry=0x7f0994000c70, sizes=..., kernel=..., parameters=<optimized out>, eventHandle=eventHandle@entry=0x55c4f4063ad0, sharedMemBytes=<optimized out>, vcmd=<optimized out>, aql_packet=<optimized out>)
    at /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/device/rocm/rocvirtual.cpp:3099
#3  0x00007f0aa4d646bc in roc::VirtualGPU::submitKernel (this=0x7f0994000c70, vcmd=...) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/device/rocm/rocvirtual.cpp:3278
#4  0x00007f0aa4d248dc in amd::HostQueue::loop (this=this@entry=0x55c4f3ed83e0, virtualDevice=0x7f0994000c70) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/platform/commandqueue.cpp:217
#5  0x00007f0aa4d25f0b in amd::HostQueue::Thread::run (this=0x55c4f3ed8508, data=0x55c4f3ed83e0) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/platform/commandqueue.hpp:172
#6  0x00007f0aa4cb493d in amd::Thread::main (this=this@entry=0x55c4f3ed8508) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/thread/thread.cpp:93
#7  0x00007f0aa4d18422 in amd::Thread::entry (thread=0x55c4f3ed8508) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/os/os_posix.cpp:340
#8  0x00007f0aa4ead2b9 in ?? () from /usr/lib64/libc.so.6
#9  0x00007f0aa4f3030c in ?? () from /usr/lib64/libc.so.6

If I remove -march=native, everything works fine.

A typical compilation line that produces the segfault looks like this

[82/107] /usr/bin/x86_64-pc-linux-gnu-g++ -DATI_OS_LINUX -DCL_TARGET_OPENCL_VERSION=220 -DCL_USE_DEPRECATED_OPENCL_1_0_APIS -DCL_USE_DEPRECATED_OPENCL_1_1_APIS -DCL_USE_DEPRECATED_OPENCL_1_2_APIS -DCL_USE_DEPRECATED_OPENCL_2_0_APIS -DCOMGR_DYN_DLL -DHAVE_CL2_HPP -DLITTLEENDIAN_CPU -DOPENCL_C_MAJOR=2 -DOPENCL_C_MINOR=0 -DOPENCL_MAJOR=2 -DOPENCL_MINOR=1 -DROCCLR_SUPPORT_NUMA_POLICY -DUSE_COMGR_LIBRARY -DWITH_HSA_DEVICE -DWITH_LIGHTNING_COMPILER -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/compiler/lib -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/compiler/lib/include -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/compiler/lib/backends/common -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/device -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/elf -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/include -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/opencl/khronos/headers/opencl2.2/CL -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/opencl/khronos/headers/opencl2.2/CL/.. -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/opencl/khronos/headers/opencl2.2/CL/../.. -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/opencl/khronos/headers/opencl2.2/CL/../../.. -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/opencl/khronos/headers/opencl2.2/CL/../../../.. 
-I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/opencl/khronos/headers/opencl2.2/CL/../../../../amdocl  -march=native -O2 -pipe -std=c++17 -fPIC -MD -MT rocclr/CMakeFiles/rocclr.dir/device/rocm/rocvirtual.cpp.o -MF rocclr/CMakeFiles/rocclr.dir/device/rocm/rocvirtual.cpp.o.d -o rocclr/CMakeFiles/rocclr.dir/device/rocm/rocvirtual.cpp.o -c /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.7.1/work/clr-rocm-5.7.1/rocclr/device/rocm/rocvirtual.cpp

Flags selected by -march=native:

$ gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
 /usr/libexec/gcc/x86_64-pc-linux-gnu/13/cc1 -E -quiet -v - -march=znver3 -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -msse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mvpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mclzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mmwaitx -mno-pconfig -mpku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mrdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -msha -mshstk -mno-tbm -mno-tsxldtrk -mvaes -mno-waitpkg -mwbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint -mno-amx-complex --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=512 -mtune=znver3 -dumpbase -

[Issue]: warning clangrt builtins lib not found when CXX doesn't exist

Problem Description

Got warning

-- The Fortran compiler identification is GNU 9.4.0
-- The C compiler identification is GNU 9.4.0
-- The HIP compiler identification is Clang 17.0.0
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /usr/bin/gfortran - skipped
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting HIP compiler ABI info
-- Detecting HIP compiler ABI info - done
-- Check for working HIP compiler: /opt/rocm-6.0.0/llvm/bin/clang++ - skipped
-- Detecting HIP compile features
-- Detecting HIP compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
CMake Warning (dev) at /opt/rocm-6.0.0/lib/cmake/hip/hip-config-amd.cmake:156 (message):
  clangrt builtins lib not found: No such file or directory
Call Stack (most recent call first):
  /opt/rocm-6.0.0/lib/cmake/hip/hip-config.cmake:149 (include)
  CMakeLists.txt:3 (find_package)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Configuring done
-- Generating done
-- Build files have been written to: /home/yeluo/opt/cmake_gpu/test_rocm_nocxx

There is another reference to CXX that generates warnings for projects that do not use CXX:

set(HIP_CXX_COMPILER ${CMAKE_CXX_COMPILER})

It seems that

-- Check for working HIP compiler: /opt/rocm-6.0.0/llvm/bin/clang++ - skipped

The HIP compiler has already been found above, so I don't understand why HIP_CXX_COMPILER is looked up via CMAKE_CXX_COMPILER. I would expect no warning.

Operating System

Any linux

CPU

Any CPU

GPU

AMD Instinct MI250

ROCm Version

ROCm 6.0.0

ROCm Component

clr

Steps to Reproduce

CMakeLists.txt

cmake_minimum_required(VERSION 3.21 FATAL_ERROR)
project(qe LANGUAGES Fortran C HIP)
find_package(hip CONFIG)

command:

cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_Fortran_COMPILER=gfortran .

Need patch first if testing with rocm 6.0.0 release

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

[Issue]: hip-5.7.1 failed tests: invalid device pointer Code 17 with hipIpcOpenMemHandle( , , hipIpcMemLazyEnablePeerAccess)

Problem Description

Running the Gentoo HIP tests on a dual MI100 system results in two test failures:

The following tests FAILED:                                                                                                                                                                                        
        1699 - Unit_hipIpcMemAccess_Semaphores (Timeout)                                                                                                                                                           
        1715 - Unit_hipIpcEventHandle_Functional (Timeout)
1699/1801 Testing: Unit_hipIpcMemAccess_Semaphores
1699/1801 Test: Unit_hipIpcMemAccess_Semaphores
Command: "/tmp/portage/dev-util/hip-5.7.1-r1/work/hip-tests-rocm-5.7.0/catch_build/catch_tests/multiproc/MultiProc" "Unit_hipIpcMemAccess_Semaphores"
Directory: /tmp/portage/dev-util/hip-5.7.1-r1/work/hip-tests-rocm-5.7.0/catch_build/catch_tests/multiproc
"Unit_hipIpcMemAccess_Semaphores" start time: Jan 18 01:15 CST
Output:
----------------------------------------------------------
Filters: Unit_hipIpcMemAccess_Semaphores

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MultiProc is a Catch v2.13.4 host application.
Run with -? for options

-------------------------------------------------------------------------------
Unit_hipIpcMemAccess_Semaphores
-------------------------------------------------------------------------------
/tmp/portage/dev-util/hip-5.7.1-r1/work/hip-tests-rocm-5.7.0/catch/multiproc/hipIpcMemAccessTest.cc:80
...............................................................................

/tmp/portage/dev-util/hip-5.7.1-r1/work/hip-tests-rocm-5.7.0/catch/multiproc/hipIpcMemAccessTest.cc:152: FAILED:
  REQUIRE( false )
with message:
  Error: invalid device pointer
      Code: 17
      Str: hipIpcOpenMemHandle(reinterpret_cast<void **>(&B_d), shrd_mem->
  memHandle, hipIpcMemLazyEnablePeerAccess)
      In File: /tmp/portage/dev-util/hip-5.7.1-r1/work/hip-tests-rocm-5.7.0/
  catch/multiproc/hipIpcMemAccessTest.cc
      At line: 152

===============================================================================
test cases: 1 | 1 failed
assertions: 7 | 6 passed | 1 failed

<end of output>

Operating System

Gentoo Prefix on upstream Linux kernel 6.1.69-1.1

CPU

AMD EPYC 7702 64-Core Processor

GPU

AMD Instinct MI100

ROCm Version

ROCm 5.7.1

ROCm Component

HIP

Steps to Reproduce

On a fresh Gentoo Linux install, with the following configuration files:

/etc/portage/package.accept_keywords

*/* ~amd64

/etc/portage/env/test.conf

FEATURES="${FEATURES} test"

/etc/portage/package.env/0-test

dev-util/hip test.conf

Then run emerge --verbose hip

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          NO

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7702 64-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            128                                
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    528222572(0x1f7c096c) KB           
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    528222572(0x1f7c096c) KB           
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    528222572(0x1f7c096c) KB           
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7702 64-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        1                                  
  Compute Unit:            128                                
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    528395328(0x1f7eac40) KB           
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    528395328(0x1f7eac40) KB           
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    528395328(0x1f7eac40) KB           
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 3                  
*******                  
  Name:                    gfx908                             
  Uuid:                    GPU-736edc556309c935               
  Marketing Name:          AMD Instinct MI100                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29580(0x738c)                      
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1502                               
  BDFID:                   768                                
  Internal Node ID:        2                                  
  Compute Unit:            120                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 64                                 
  SDMA engine uCode::      18                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 4                  
*******                  
  Name:                    gfx908                             
  Uuid:                    GPU-4c8c97e3b14bcfdc               
  Marketing Name:          AMD Instinct MI100                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    3                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29580(0x738c)                      
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1502                               
  BDFID:                   8960                               
  Internal Node ID:        3                                  
  Compute Unit:            120                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 64                                 
  SDMA engine uCode::      18                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Additional Information

The full build & test log

build.log.gz

Test details:

LastTest.log.gz

Out-of-date install instructions within hipamd

hipamd/INSTALL.md contains build instructions that were rendered out of date by ROCm 5.6 and the move of hipamd and ROCclr into the current clr repo. Those instructions should be updated, removed, or at least given a warning banner.

[Issue]: hip-config-amd.cmake choked when there is no CXX compiler

Problem Description

My application uses only the C and Fortran compilers, plus HIP as a CMake language.

CMake Error at /opt/rocm-6.0.0/lib/cmake/hip/hip-config-amd.cmake:121 (if):
  if given arguments:

    "STREQUAL" "Clang"

  Unknown arguments specified
Call Stack (most recent call first):
  /opt/rocm-6.0.0/lib/cmake/hip/hip-config.cmake:149 (include)
  CMakeLists.txt:3 (find_package)

This is due to CMAKE_CXX_COMPILER_ID not being defined when the CXX language is not enabled:
https://github.com/ROCm/clr/blob/74edd40d26b049d7e9fd39faade8a6a83915f6df/hipamd/hip-config-amd.cmake#L149C6-L149C27
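
The reported argument list ("STREQUAL" "Clang", with nothing on the left-hand side) is consistent with an unquoted reference to an undefined variable, which CMake expands to zero arguments rather than to an empty string. The sketch below is a toy model of that expansion rule in Python, not CMake's implementation. The function name and token format are hypothetical, list splitting on semicolons is ignored, and the exact text of the condition in hip-config-amd.cmake is an assumption based on the error message.

```python
def expand_arguments(tokens, variables):
    """Toy model of CMake argument expansion (illustrative only).

    tokens: list of (text, quoted) pairs. Text of the form "${NAME}" is a
    variable reference; anything else is a literal.
    """
    args = []
    for text, quoted in tokens:
        if text.startswith("${") and text.endswith("}"):
            # An undefined variable expands to the empty string.
            value = variables.get(text[2:-1], "")
        else:
            value = text
        # A quoted argument always yields exactly one argument, even if empty.
        # An unquoted argument that expands to "" yields no argument at all.
        if quoted or value != "":
            args.append(value)
    return args


# Assumed shape of the failing condition: if(${VAR} STREQUAL "Clang")
unquoted = [("${CMAKE_CXX_COMPILER_ID}", False), ("STREQUAL", False), ("Clang", True)]
# Same condition with the variable reference quoted: if("${VAR}" STREQUAL "Clang")
quoted = [("${CMAKE_CXX_COMPILER_ID}", True), ("STREQUAL", False), ("Clang", True)]

no_cxx = {}  # CXX not enabled, so CMAKE_CXX_COMPILER_ID is undefined
print(expand_arguments(unquoted, no_cxx))
print(expand_arguments(quoted, no_cxx))
```

With the unquoted form, the model leaves only STREQUAL and Clang, which matches the arguments shown in the error. With the quoted form, the left-hand operand survives as an empty string, so the comparison is well formed and simply evaluates to false.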

Operating System

Any Linux

CPU

Any CPU

GPU

AMD Instinct MI250X

ROCm Version

ROCm 6.0.0

ROCm Component

clr

Steps to Reproduce

CMakeLists.txt

cmake_minimum_required(VERSION 3.20 FATAL_ERROR)
project(qe LANGUAGES Fortran C HIP)
find_package(hip CONFIG)

Configure command:

cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_Fortran_COMPILER=gfortran .

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

cmake error at embed PCH with tag rocm-5.7.0

I built clr from source, using the tag rocm-5.7.0.
CMake reports an error:

'sh' '-c' '/home/marco/workspace/clr/hipamd/src/hip_embed_pch.sh /home/marco/workspace/HIP/include /home/marco/workspace/clr/build/hipamd/include /home/marco/workspace/clr/hipamd/include /home/marco/rocm/lib/cmake/llvm/../../..'
+/home/marco/rocm/lib/cmake/llvm/../../../bin/clang -O3 --rocm-path=/home/marco/workspace/clr/build/hipamd/include/.. -std=c++17 -nogpulib -isystem /home/marco/workspace/clr/build/hipamd/include -isystem /home/marco/workspace/HIP/include -isystem /home/marco/workspace/clr/hipamd/include --cuda-device-only --cuda-gpu-arch=gfx1030 -x hip /tmp/hip_pch.22317/hip_pch.h -E
clang: error: cannot find HIP runtime; provide its path via '--rocm-path', or pass '-nogpuinc' to build without HIP runtime
CMake Error at hipamd/src/CMakeLists.txt:182 (message):
Failed to embed PCH

My CMake settings are:
export HIP_DIR=/home/marco/workspace/HIP
export HIPCC_DIR=/home/marco/workspace/HIPCC
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ROCM_PATH -DCMAKE_PREFIX_PATH=$ROCM_PATH -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DHIPCC_BIN_DIR=$HIPCC_DIR/build -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DROCM_PATH=$ROCM_PATH

My llvm-project is also built from source, with the tag rocm-5.7.0.
I wonder how to fix this. If you need more information, please let me know.

Darktable fails to run with ROCm OpenCL runtime (but works with amdgpu-pro runtime)

I've opened an issue in the darktable repo but I wanted to report this here since I'm not sure which side is responsible for fixing the issue (I'm not clear what the underlying issue is in the first place...).

This is on an up-to-date Arch Linux system.
The program works fine when I use the runtime provided by the opencl-legacy-amdgpu-pro AUR package.
When using the runtime provided by the rocm-opencl-runtime package, I get the error below:

$ darktable -d opencl
Gtk-Message: 18:23:42.555: Failed to load module "appmenu-gtk-module"
     0,1665 [dt_get_sysresource_level] switched to 2 as `large'
     0,1665   total mem:       31995MB
     0,1665   mipmap cache:    3999MB
     0,1665   available mem:   21871MB
     0,1665   singlebuff:      499MB
     0,1665   OpenCL tune mem: WANTED
     0,1665   OpenCL pinned:   WANTED
[opencl_init] opencl related configuration options:
[opencl_init] opencl: ON
[opencl_init] opencl_scheduling_profile: 'default'
[opencl_init] opencl_library: 'default path'
[opencl_init] opencl_device_priority: '*/!0,*/*/*'
[opencl_init] opencl_mandatory_timeout: 200
[opencl_init] opencl library 'libOpenCL' found on your system and loaded
[opencl_init] found 1 platform
[opencl_init] found 1 device

[dt_opencl_device_init]
   DEVICE:                   0: 'gfx1032'
   PLATFORM NAME & VENDOR:   AMD Accelerated Parallel Processing, Advanced Micro Devices, Inc.
   CANONICAL NAME:           amdacceleratedparallelprocessinggfx1032
   DRIVER VERSION:           3570.0 (HSA1.1,LC)
   DEVICE VERSION:           OpenCL 2.0 
   DEVICE_TYPE:              GPU
   GLOBAL MEM SIZE:          8176 MB
   MAX MEM ALLOC:            6950 MB
   MAX IMAGE SIZE:           16384 x 16384
   MAX WORK GROUP SIZE:      256
   MAX WORK ITEM DIMENSIONS: 3
   MAX WORK ITEM SIZES:      [ 1024 1024 1024 ]
   ASYNC PIXELPIPE:          NO
   PINNED MEMORY TRANSFER:   WANTED
   MEMORY TUNING:            WANTED
   FORCED HEADROOM:          400
   AVOID ATOMICS:            NO
   MICRO NAP:                250
   ROUNDUP WIDTH:            16
   ROUNDUP HEIGHT:           16
   CHECK EVENT HANDLES:      128
   PERFORMANCE:              6.715
   TILING ADVANTAGE:         0.000
   DEFAULT DEVICE:           NO
   KERNEL BUILD DIRECTORY:   /usr/share/darktable/kernels
   KERNEL DIRECTORY:         /home/kant/.cache/darktable/cached_v1_kernels_for_AMDAcceleratedParallelProcessinggfx1032_35700HSA11LC
   CL COMPILER OPTION:       -cl-fast-relaxed-math
PHI node has multiple entries for the same basic block with different incoming values!
  %967 = phi float [ %largephi.extractslice0, %sw.default ], [ %largephi.extractslice055, %sw.bb667 ], [ %largephi.extractslice059, %sw.bb663 ], [ %largephi.extractslice063, %sw.bb659 ], [ %largephi.extractslice067, %sw.bb655 ], [ %largephi.extractslice071, %sw.bb646 ], [ %largephi.extractslice075, %_Z4fmodff.exit16 ], [ %largephi.extractslice079, %_Z4fmodff.exit13 ], [ %largephi.extractslice083, %_Z4fmodff.exit ], [ %largephi.extractslice087, %sw.bb562 ], [ %largephi.extractslice091, %sw.bb555 ], [ %largephi.extractslice095, %sw.bb533 ], [ %largephi.extractslice099, %if.then502 ], [ %largephi.extractslice0103, %if.else517 ], [ %largephi.extractslice0107, %if.then456 ], [ %largephi.extractslice0111, %if.else471 ], [ %largephi.extractslice0115, %if.then393 ], [ %largephi.extractslice0119, %if.else408 ], [ %largephi.extractslice0123, %if.then338 ], [ %largephi.extractslice0127, %if.else353 ], [ %largephi.extractslice0131, %if.then283 ], [ %largephi.extractslice0135, %if.else298 ], [ %largephi.extractslice0139, %if.then224 ], [ %largephi.extractslice0143, %if.else241 ], [ %largephi.extractslice0147, %sw.bb193 ], [ %largephi.extractslice0151, %sw.bb180 ], [ %largephi.extractslice0155, %sw.bb168 ], [ %largephi.extractslice0159, %sw.bb158 ], [ %largephi.extractslice0163, %sw.bb147 ], [ %largephi.extractslice0167, %if.then116 ], [ %largephi.extractslice0171, %if.else131 ], [ %largephi.extractslice0175, %sw.bb71 ], [ %largephi.extractslice0179, %sw.bb ], [ %largephi.extractslice0183, %if.end ], [ %largephi.extractslice0187, %if.end ], [ %largephi.extractslice0191, %if.end ], [ %largephi.extractslice0195, %if.end ], [ %largephi.extractslice0199, %if.end ]
label %if.end
  %largephi.extractslice0183 = extractelement <4 x float> %div, i64 0
  %largephi.extractslice0195 = extractelement <4 x float> %div, i64 0
in function blendop_Lab
LLVM ERROR: Broken function found, compilation aborted!
Aborted (core dumped)

How to insert cxx flag '-fno-stack-protector' for clang when using rtc?

On Gentoo, /etc/clang/gentoo-hardened.cfg enables -fstack-protector-strong by default, which causes errors when compiling AMDGPU kernels. We have manually added -fno-stack-protector to hipcc to avoid issues like https://bugs.gentoo.org/890377, but I encountered a similar one when running the rocFFT-5.7.1 tests:

[ RUN      ] pow2_1D/accuracy_test.vs_fftw/real_forward_len_2_double_ip_batch_4_istride_1_R_ostride_1_HI_idist_4_odist_2_ioffset_0_0_ooffset_0_0
LLVM ERROR: Cannot select: 0x7f0a5820b760: i64 = FrameIndex<0>
In function: fft_rtc_fwd_len1_factors_1_wgs_64_tpt_1_dim1_dp_ip_CI_unitstride_sbrr_R2C_dirReg_CB
LLVM ERROR: Cannot select: 0x560eae9a0a00: i64 = FrameIndex<0>
In function: fft_rtc_fwd_len1_factors_1_wgs_64_tpt_1_dim1_dp_ip_CI_unitstride_sbrr_R2C_dirReg

If I comment out -fstack-protector-strong in /etc/clang/gentoo-hardened.cfg, the test executes normally.

It seems that the test uses rtc, which reads /etc/clang/gentoo-hardened.cfg. As a workaround, how can I pass -fno-stack-protector as a compilation argument to rtc?

h2rcp produces incorrect result on rocm5.6

Calling h2rcp() in ROCm 5.6 appears to treat the underlying half storage as a short, convert that integer to a float, and take the reciprocal of the result. Instead of 1/4.0 = 0.25, it produces 0.000057.
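
The arithmetic behind that hypothesis can be checked on the host. The sketch below is plain Python, not HIP, and its helper names are hypothetical. It reads the IEEE 754 half-precision bit pattern of the input as a 16-bit integer and takes the reciprocal of that integer. It only shows that the numbers are consistent with the reported output. It does not prove what the header actually does.

```python
import struct


def half_bits(value: float) -> int:
    """Return the binary16 bit pattern of value, read as a signed 16-bit short."""
    return struct.unpack("<h", struct.pack("<e", value))[0]


def misread_reciprocal(value: float) -> float:
    """Reciprocal of the half's raw bits taken as an integer (the suspected bug)."""
    return 1.0 / float(half_bits(value))


for x in (4.0, 9.0):
    bits = half_bits(x)
    print(f"x={x}: bits=0x{bits:04x} ({bits}), "
          f"correct 1/x={1.0 / x:f}, misread 1/bits={misread_reciprocal(x):f}")
```

For 4.0 the bit pattern is 0x4400, which is 17408 as an integer, and 1/17408 prints as 0.000057 with the %f format, the same value reported above.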

I tested this with gfx1010 in the docker image rocm/dev-ubuntu-20.04:5.6-complete, but targeting gfx1030 gives an identical kernel disassembly, so the same error should happen there.

https://github.com/ROCm-Developer-Tools/hipamd/blob/4209792929ddf54ba9530813b7879cfdee42df14/include/hip/amd_detail/amd_hip_fp16.h#L1677-L1680

#include <stdio.h>
#include <stdlib.h>  // malloc, free
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

__device__ __forceinline__ __half2 __alternate_h2rcp(__half2 x) {
    return _Float16_2{static_cast<_Float16>(__builtin_amdgcn_rcph(static_cast<__half2_raw>(x).data.x)),
        static_cast<_Float16>(__builtin_amdgcn_rcph(static_cast<__half2_raw>(x).data.y))};
}

__global__ void do_rcp(half2* result, const half2* source) 
{
    result[0] = h2rcp(source[0]);
    result[1] = __alternate_h2rcp(source[1]);
}

int main(int argc, char *argv[]) 
{
    half2 *src_d, *result_d;
    half2 *src_h, *result_h;
    size_t N = 2;
    size_t Nbytes = N * sizeof(half2);

    src_h = (half2*)malloc(Nbytes);
    result_h = (half2*)malloc(Nbytes);
    src_h[0] = __floats2half2_rn(4.0f, 9.0f);
    src_h[1] = __floats2half2_rn(4.0f, 9.0f);

    hipMalloc(&src_d, Nbytes);
    hipMalloc(&result_d, Nbytes);
    hipMemcpy(src_d, src_h, Nbytes, hipMemcpyHostToDevice);

    hipLaunchKernelGGL(do_rcp, dim3(1), dim3(1), 0, 0,
        result_d, src_d);
    
    hipMemcpy(result_h, result_d, Nbytes, hipMemcpyDeviceToHost);

    printf("rocm: 1/%f = %f\n", __low2float(src_h[0]), __low2float(result_h[0]));
    printf("rocm: 1/%f = %f\n", __high2float(src_h[0]), __high2float(result_h[0]));
    printf("alternate: 1/%f = %f\n", __low2float(src_h[1]), __low2float(result_h[1]));
    printf("alternate: 1/%f = %f\n", __high2float(src_h[1]), __high2float(result_h[1]));
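
    // Release device and host buffers.
    hipFree(src_d);
    hipFree(result_d);
    free(src_h);
    free(result_h);
    return 0;
}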

[Issue]: `roc::NullDevice::importExtSemaphore` (`hipImportExternalSemaphore`) crash

Problem Description

When running the 20_hip_vulkan example, the program crashes with the following error:

hipVulkan: /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.2/rocclr/device/rocm/rocdevice.hpp:241: virtual bool roc::NullDevice::importExtSemaphore(void**, const amd::Os::FileDesc&, amd::ExternalSemaphoreHandleType): Assertion `false && "ShouldNotReachHere()"' failed.

Operating System

Arch Linux

CPU

AMD Ryzen 9 3950X 16-Core Processor

GPU

AMD Radeon Pro W6800, AMD Radeon VII

ROCm Version

ROCm 6.0.0

ROCm Component

clr

Steps to Reproduce

  1. Compile https://github.com/ROCm/hip-tests/tree/rocm-6.0.x/samples/2_Cookbook/20_hip_vulkan
  2. Run it

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 9 3950X 16-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 9 3950X 16-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3500                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    131805100(0x7db2fac) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131805100(0x7db2fac) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    131805100(0x7db2fac) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1030                            
  Uuid:                    GPU-fa924265fc8053c4               
  Marketing Name:          AMD Radeon RX 6800 XT              
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      4096(0x1000) KB                    
    L3:                      131072(0x20000) KB                 
  Chip ID:                 29631(0x73bf)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2575                               
  BDFID:                   3072                               
  Internal Node ID:        1                                  
  Compute Unit:            72                                 
  SIMDs per CU:            2                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 116                                
  SDMA engine uCode::      83                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1030         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

Additional Information

The program crashes in hipImportExternalSemaphore, which is documented as implemented but is actually a stub in rocclr/device/rocm/rocdevice.hpp.

The HSA backend appears to implement the same function, but I couldn't find a simple way to build clr with that backend enabled.
