
ENCCS/CSC HIP101 workshop - Porting CUDA codes to HIP

April 16 2021

Important links

  • This page: https://hackmd.io/@enccs/hip101-april2021
  • Workshop page: https://enccs.se/events/2021/04/hip101-workshop/
  • Github repository: https://github.com/csc-training/hip
  • Using Puhti: https://github.com/csc-training/hip/blob/main/using-puhti.md
  • SLURM reservation for this training: hip101
  • AMD porting guide: https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-porting-guide.html
  • AMD ROCm documentation: https://rocmdocs.amd.com/en/latest/

Instructor

Helpers

  • Artem Zhmurov
  • Mark Abraham
  • Roberto Di Remigio
  • Xin Li
  • Qiang Li

Code of conduct

We strive to follow the Code of Conduct developed by The Carpentries organisation to foster a welcoming environment for everyone (see https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html). In short:

  • Use welcoming and inclusive language
  • Be respectful of different viewpoints and experiences
  • Gracefully accept constructive criticism
  • Focus on what is best for the community
  • Show courtesy and respect towards other community members

:::danger At the bottom of this document you can ask questions about the workshop content! We use the Zoom chat only for reporting Zoom problems and such. :::


Schedule

Time         Topic
09:00–10:00  Introduction to AMD architecture, HIP, and Hipify tools
10:00–10:15  Break
10:15–10:45  Deep dive into Hipify tools and examples
10:45–11:30  Lunch
11:30–12:10  Hands-on sessions
12:10–12:20  Break
12:20–12:40  GROMACS and CMake
12:40–13:00  Hands-on sessions (test your own code if you want)

Icebreaker question

If you got access tomorrow to computational resources 10x larger than what you currently have, would you be able to take advantage of them? If so, how?


Structure of the repository

docs
porting
└── codes
    ├── saxpy/cuda 
    ├── saxpy/cublas     
    ├── Discrete_Hankel_Transform
    ├── Heat-Equation
    ├── 2D Wave Propagation
    ├── KMeans clustering
    └── Vector_addition

Puhti

Please use the pre-assigned account:

  • trainingXXX: NAME
  • training018 : Cristian-Vasile A.
  • training019 : Xavier A.
  • training020 : Farid A.
  • training021 : Arash Alizad B.
  • training022 : Abdulrahman Azab M.
  • training023 : Markus B.
  • training024 : Oscar Bulancea L.
  • training025 : Wei Der C.
  • training026 : Gibson C.
  • training027 : Robert C.
  • training028 : Marco Crialesi E.
  • training029 : Tewodros D.
  • training030 : Yaxing D.
  • training031 : Pascal H.
  • training032 : Johan H.
  • training033 : Patric H.
  • training034 : Chia-Jung H.
  • training035 : Zuhair I.
  • training036 : Niclas J.
  • training037 : Esko J.
  • training038 : Joe J.
  • training039 : Bijoy J.
  • training040 : Kimmo K.
  • training041 : Umair K.
  • training042 : Tuomas K.
  • training043 : Oskar L.
  • training044 : Ashenafi L.
  • training045 : Qiang L.
  • training046 : Floriano M.
  • training047 : Talgat M.
  • training048 : Anders M.
  • training049 : Meghdoot M.
  • training050 : Gnanavel M.
  • training051 : Pedro O.
  • training052 : Johannes P.
  • training053 : Janne P.
  • training054 : Chaitanya P.
  • training055 : Jarno R.
  • training056 : Zilvinas R.
  • training057 : Tuomas R.
  • training058 : Kari R.
  • training059 : Assa Aravindh S.
  • training060 : Nicolo S.
  • training061 : Suraj S.
  • training062 : Pedro Simoes C.
  • training063 : Samuele S.
  • training064 : Pascal S.
  • training065 : Ronith S.
  • training066 : Walter T.
  • training067 : Fedor U.
  • training068 : Olav V.
  • training069 : Masi V.
  • training070 : Henric Z.
  • training071 : Hongyang Z.

Connect to Puhti

  • Connect with SSH using your training account, e.g. ssh trainingXXX@puhti.csc.fi (see the Using Puhti link above). After entering your password you should be located in the directory:
/users/trainingXXX

Explore the environment

 module list

Currently Loaded Modules:
  1) intel/19.0.4   2) hpcx-mpi/2.4.0   3) intel-mkl/2019.0.4   4) StdEnv

SLURM

  • Submit script sub.sh
sbatch sub.sh
  • Check the status of a job
squeue -u $USER
  • Cancel a job
scancel JOBID
  • Sample batch job script
#!/bin/bash
#SBATCH --job-name=hip_test
#SBATCH --account=project_2000745
#SBATCH --partition=gpu
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8000
#SBATCH --gres=gpu:v100:1
#SBATCH --reservation=gpu_training

module load hip

srun my_hip_program

HIP

  • Load HIP module
module load hip/4.0.0c

module list

Currently Loaded Modules:
  1) StdEnv   2) gcc/9.1.0   3) cuda/11.1.0   4) hip/4.0.0c   5) intel-mkl/2019.0.4   6) hpcx-mpi/2.4.0

There is also a module hip/4.0.0, but we also created hip/4.0.0c, which is an installation built from the source code. In the future the module name will match the version.

  • hipconfig
hipconfig

HIP version  : 4.0.20496-4f163c6

== hipconfig
HIP_PATH     : /appl/opt/rocm/rocm-4.0.0c/hip
ROCM_PATH    : /appl/opt/rocm/rocm-4.0.0c/
HIP_COMPILER : clang
HIP_PLATFORM : nvcc
HIP_RUNTIME  : ROCclr
CPP_CONFIG   :  -D__HIP_PLATFORM_NVCC__=  -I/appl/opt/rocm/rocm-4.0.0c/hip/include -I/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2//include

== nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0

=== Environment Variables
PATH=/appl/opt/rocm/rocm-4.0.0c/hip/bin:/appl/spack/install-tree/gcc-9.1.0/hwloc-2.0.2-wqrgpf/bin:/appl/opt/ucx/1.9.0-cuda/bin:/appl/spack/install-tree/gcc-9.1.0/openmpi-4.0.5-ym53tz/bin:/appl/spack/install-tree/gcc-9.1.0/hdf5-1.12.0-wtlera/bin:/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/bin:/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/appl/bin:/users/markoman/.local/bin:/users/markoman/bin
CUDA_PATH=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/
HIPFORT_ARCH=nvptx
HIP_PLATFORM=nvcc
LD_LIBRARY_PATH=/appl/opt/rocm/rocm-4.0.0c/hip/lib:/appl/spack/install-tree/gcc-9.1.0/hwloc-2.0.2-wqrgpf/lib:/appl/opt/ucx/1.9.0-cuda/lib:/appl/spack/install-tree/gcc-9.1.0/openmpi-4.0.5-ym53tz/lib:/appl/spack/install-tree/gcc-9.1.0/hdf5-1.12.0-wtlera/lib:/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/tbb/lib/intel64_lin/gcc4.7:/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/compiler/lib/intel64_lin:/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64_lin:/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64:/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/lib64:/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/lib:/appl/opt/rocm/rocm-4.0.0/hiprand/lib:/appl/opt/rocm/rocm-4.0.0c/hipblas/hipblas/lib
HIP_RUNTIME=ROCclr
HIPFORT_GPU=sm_70
CUDA_INSTALL_ROOT=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2
HIPFORT_HOME=/appl/opt/rocm/rocm-4.0.0c//hipfort/
HIPFORT_ARCHGPU=nvptx-sm_70
HIPCC_OPTS=--x cu
HIP_COMPILER=clang
HIP_PATH=/appl/opt/rocm/rocm-4.0.0c/hip

== Linux Kernel
Hostname     : puhti-login1.bullx
Linux puhti-login1.bullx 3.10.0-1062.33.1.el7.x86_64 #1 SMP Thu Aug 13 10:55:03 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	RedHatEnterpriseServer
Description:	Red Hat Enterprise Linux Server release 7.7 (Maipo)
Release:	7.7
Codename:	Maipo
  • The wrapper used to compile on NVIDIA systems is called hipcc
 which hipcc
/appl/opt/rocm/rocm-4.0.0c/hip/bin/hipcc
  • You can read the file /appl/opt/rocm/rocm-4.0.0c/hip/bin/hipcc for more information
hipcc -h

Usage  : nvcc [options] <inputfile>

Options for specifying the compilation phase
============================================
More exactly, this option specifies up to which stage the input files must be compiled,
according to the following compilation trajectories for different input file types:
        .c/.cc/.cpp/.cxx : preprocess, compile, link
        .o               : link
        .i/.ii           : compile, link
        .cu              : preprocess, cuda frontend, PTX assemble,
                           merge with host C code, compile, link
        .gpu             : cicc compile into cubin
        .ptx             : PTX assemble into cubin.

Porting CUDA codes to HIP

General Guidelines

  • Start porting the CUDA code on an NVIDIA system
  • When that is finished, compile the code with HIP on an AMD system (we have no access to AMD hardware yet)
  • HIP can be used on both AMD and NVIDIA GPUs (see the sketch after this list)
  • The script hipconvertinplace-perl.sh can hipify all the files in a directory
  • Some HIP libraries seem not to work on NVIDIA systems
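
As a quick orientation, these are the typical one-to-one replacements the Hipify tools perform (a sketch based on the saxpy exercise below, not an exhaustive list):

// CUDA                                       HIP
// cudaMalloc(&d_x, bytes);               ->  hipMalloc(&d_x, bytes);
// cudaMemcpy(d_x, x, bytes,
//            cudaMemcpyHostToDevice);    ->  hipMemcpy(d_x, x, bytes, hipMemcpyHostToDevice);
// kernel<<<grid, block, shm, strm>>>(a); ->  hipLaunchKernelGGL(kernel, grid, block, shm, strm, a);
// cudaFree(d_x);                         ->  hipFree(d_x);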

VERBOSE Mode

  • If you want to see the command that hipcc actually executes, set the following:
export HIPCC_VERBOSE=1
  • For example, on Puhti, the command:
hipcc "--gpu-architecture=sm_70" -g -O3 -I../common -c core_cuda.cu -o core_cuda.o

would also print the command that was actually executed:

hipcc-cmd: /appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2//bin/nvcc -D__HIP_ROCclr__ -Wno-deprecated-gpu-targets  -isystem /appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2//include -isystem /appl/opt/rocm/rocm-4.0.0c/hip/include  --gpu-architecture=sm_70 -g -O3 -I../common -c core_cuda.cu -o core_cuda.o

Debug hipcc

Add in your submission script before srun:

export AMD_LOG_LEVEL=4

For example, with no debug mode:

srun: error: r01g01: task 0: Segmentation fault
srun: Terminating job step 4339273.0

with debug mode:

:3:rocdevice.cpp            :458 : 2193024923864 us: Initializing HSA stack.
:1:rocdevice.cpp            :466 : 2193024923948 us: hsa_init failed.
:4:runtime.cpp              :82  : 2193024923950 us: init
srun: error: r01g01: task 0: Segmentation fault
srun: Terminating job step 4339273.0

The conclusion is that the library used requires AMD hardware, so it crashes immediately. In a real execution you will observe a lot of output.
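
To catch such failures at the call site rather than as a bare segmentation fault, it helps to wrap every HIP API call in an error check. A minimal sketch (the HIP_CHECK macro is our own convention, not part of the HIP API):

#include <cstdio>
#include <cstdlib>
#include "hip/hip_runtime.h"

// Abort with the failing call's location if a HIP API call does not
// return hipSuccess.
#define HIP_CHECK(cmd)                                            \
    do {                                                          \
        hipError_t err = (cmd);                                   \
        if (err != hipSuccess) {                                  \
            fprintf(stderr, "HIP error: %s at %s:%d\n",           \
                    hipGetErrorString(err), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Example usage:
//   HIP_CHECK(hipMalloc(&d_x, N * sizeof(float)));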

Exercises - Demonstration

At this point, we assume that you have cloned the GitHub repository.

Clone the Git repository of the training:

$ git clone https://github.com/csc-training/hip.git
$ cd hip
$ export rootdir=$PWD

Acknowledgment: Some exercises were provided by Cristian-Vasile Achim, Jussi Enkovaara, and AMD, or were found online.

Exercise: SAXPY CUDA

Steps

SAXPY stands for Single-Precision A*X Plus Y: it combines a scalar multiplication and a vector addition (y = a*x + y).

cd ${rootdir}/porting/codes/saxpy/cuda
Check the file saxpy.cu with an editor
#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<30;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float)); 
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on the N elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
Compile CUDA code
make clean
make
Submit
sbatch sub.sh

Check the files out_* and error_*

The error output includes the execution time, which is close to 7.1 seconds, and the out_* file includes the max error, which should be 0.

Hipify

cp Makefile saxpy.cu sub.sh ../hip/
cd ../hip
  • Examine the hipify procedure
module load hip/4.0.0c
hipexamine-perl.sh saxpy.cu

  info: converted 14 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:7 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:3 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:3 define:0 extern_shared:0 kernel_launch:1 )
  warn:0 LOC:42 in 'saxpy.cu'
  hipMemcpy 3
  hipFree 2
  hipMemcpyHostToDevice 2
  hipMalloc 2
  hipLaunchKernelGGL 1
  hipMemcpyDeviceToHost 1

There are no warnings, so the whole code can be hipified.

hipify-perl --inplace saxpy.cu 

The file saxpy.cu is hipified:

#include "hip/hip_runtime.h"
#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<30;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  hipMalloc(&d_x, N*sizeof(float));
  hipMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  hipMemcpy(d_x, x, N*sizeof(float), hipMemcpyHostToDevice);
  hipMemcpy(d_y, y, N*sizeof(float), hipMemcpyHostToDevice);

  // Perform SAXPY on the N elements
  hipLaunchKernelGGL(saxpy, dim3((N+255)/256), dim3(256), 0, 0, N, 2.0f, d_x, d_y);

  hipMemcpy(y, d_y, N*sizeof(float), hipMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  hipFree(d_x);
  hipFree(d_y);
  free(x);
  free(y);
}

14 CUDA calls were converted with no errors.
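
Note how the triple-chevron launch was mapped onto hipLaunchKernelGGL: the two extra arguments after the grid and block dimensions are the dynamic shared memory size and the stream (both 0 here):

// CUDA:  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
// HIP:   hipLaunchKernelGGL(kernel, gridDim, blockDim, sharedMemBytes, stream, kernel arguments...);
hipLaunchKernelGGL(saxpy, dim3((N+255)/256), dim3(256), 0, 0, N, 2.0f, d_x, d_y);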

Compile
  • Edit the Makefile and change nvcc to hipcc
CC        = nvcc

Modify to

CC        = hipcc
  • Clean and compile
make clean
make
  • Submit
sbatch sub.sh

Check the files out_* and error_*

The error output includes the execution time, which is close to 7.32 seconds, and the out_* file includes the max error, which should be 0. The overhead compared to the CUDA version therefore seems to be close to 3%.

The solution is here: https://github.com/csc-training/hip/tree/main/porting/codes/saxpy/hip_solution
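
The timings above cover the whole job. To time just the kernel, one could use HIP events; a minimal sketch (assuming the hipified saxpy.cu above, with saxpy, N, d_x, and d_y as defined there):

// Drop-in replacement for the kernel-launch section of the hipified saxpy.cu.
hipEvent_t start, stop;
hipEventCreate(&start);
hipEventCreate(&stop);

hipEventRecord(start, 0);
hipLaunchKernelGGL(saxpy, dim3((N+255)/256), dim3(256), 0, 0, N, 2.0f, d_x, d_y);
hipEventRecord(stop, 0);
hipEventSynchronize(stop);

float ms = 0.0f;
hipEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("Kernel time: %f ms\n", ms);

hipEventDestroy(start);
hipEventDestroy(stop);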

If you want to change the .cu file to .cpp
mv saxpy.cu saxpy.cpp
  • Edit Makefile

    • HIP Makefile with .cu
# Compiler can be set below, or via environment variable
CC        = hipcc
OPTIMIZE  = yes
#
#===============================================================================
# Program name & source code list
#===============================================================================
program = saxpy
source = saxpy.cu
obj = $(source:.cu=.o)
#===============================================================================
# Sets Flags
#===============================================================================
# Standard Flags
CFLAGS := -Xcompiler -Wall
# Linker Flags
LDFLAGS =
# Optimization Flags
ifeq ($(OPTIMIZE),yes)
  CFLAGS += -O3
endif

#===============================================================================
# Targets to Build
#===============================================================================
#
$(program): $(obj) Makefile
        $(CC) $(CFLAGS) $(obj) -o $@ $(LDFLAGS)

%.o: %.cu Makefile
        $(CC) $(CFLAGS) -c $< -o $@
    • HIP Makefile with .cpp
# Compiler can be set below, or via environment variable
CC        = hipcc
HIP_CU    = hipcc --x cu
OPTIMIZE  = yes
#
#===============================================================================
# Program name & source code list
#===============================================================================
program = saxpy
source = saxpy.cpp
obj = $(source:.cpp=.o)
#===============================================================================
# Sets Flags
#===============================================================================
# Standard Flags
CFLAGS := -Xcompiler -Wall
# Linker Flags
LDFLAGS =
# Optimization Flags
ifeq ($(OPTIMIZE),yes)
  CFLAGS += -O3
endif

#===============================================================================
# Targets to Build
#===============================================================================
#
$(program): $(obj) Makefile
        $(CC) $(CFLAGS) $(obj) -o $@ $(LDFLAGS)

%.o: %.cpp Makefile
        $(HIP_CU) $(CFLAGS) -c $< -o $@

Lessons learned:

  • Replace nvcc with hipcc in the Makefile
  • Hipify in-place with hipify-perl --inplace filename
  • On an NVIDIA system, if the HIP code is in a file with the extension .cpp, use hipcc --x cu instead of plain hipcc

Exercise: SAXPY CUBLAS

Steps

SAXPY but using cuBLAS

cd ${rootdir}/porting/codes/saxpy/cublas
Check the file with an editor
#include <iostream>
#include "cublas_v2.h"
using namespace std;

int N = 1 << 30;

int main(){
        float *a_h, *b_h;
        a_h = new float[N];
        b_h = new float[N];
        float *a_d, *b_d;
        for(int i = 0; i < N; i++){
                a_h[i] = 1.0f;
                b_h[i] = 2.0f ;
        }
        cublasHandle_t handle;
        cublasCreate(&handle);
        cudaMalloc((void**) &a_d, sizeof(float) * N);
        cudaMalloc((void**) &b_d, sizeof(float) * N);
        cublasSetVector( N, sizeof(float), a_h, 1, a_d, 1);
        cublasSetVector( N, sizeof(float), b_h, 1, b_d, 1);
        const float s = 2.0f;
        cublasSaxpy( handle, N, &s, a_d, 1, b_d, 1);
        cublasGetVector( N, sizeof(float), b_d, 1, b_h, 1);
        cudaFree(a_d);
        cudaFree(b_d);
        cublasDestroy(handle);
        float maxError = 0.0f;

        for(int i = 0; i < N; i++)
                maxError = max(maxError, abs(b_h[i]-4.0f));

        cout << "Max error: " << maxError << endl;


        delete[] a_h;
        delete[] b_h;
        return 0;
}
Compile CUDA code
make clean
make
Submit
sbatch sub.sh

Check the files out_* and error_*

The error output includes the execution time, which is close to 7.1 seconds, and the out_* file includes the max error, which should be 0.

Hipify

  • Examine the hipify procedure
make clean
cp * ../hipblas
cd ../hipblas
module load hip/4.0.0
hipexamine-perl.sh saxpy_cublas.cu 
  info: converted 12 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:4 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:6 device_library:0 device_function:0 include:0 include_cuda_main_header:1 type:1 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 )
  warn:0 LOC:39 in 'saxpy_cublas.cu'
  hipFree 2
  hipMalloc 2

We observe that there are 6 library calls that will be converted. Hipify the code:

hipify-perl --inplace saxpy_cublas.cu 

Now the code is:

#include <iostream>
#include "hipblas.h"
using namespace std;

const int N = 1 << 30;

int main(){
        float *a_h, *b_h;
        a_h = new float[N];
        b_h = new float[N];
        float *a_d, *b_d;
        for(int i = 0; i < N; i++){
                a_h[i] = 1.0f;
                b_h[i] = 2.0f ;
        }
        hipblasHandle_t handle;
        hipblasCreate(&handle);
        hipMalloc((void**) &a_d, sizeof(float) * N);
        hipMalloc((void**) &b_d, sizeof(float) * N);
        hipblasSetVector( N, sizeof(float), a_h, 1, a_d, 1);
        hipblasSetVector( N, sizeof(float), b_h, 1, b_d, 1);
        const float s = 2.0f;
        hipblasSaxpy( handle, N, &s, a_d, 1, b_d, 1);
        hipblasGetVector( N, sizeof(float), b_d, 1, b_h, 1);
        hipFree(a_d);
        hipFree(b_d);
        hipblasDestroy(handle);
        float maxError = 0.0f;

        for(int i = 0; i < N; i++)
                maxError = max(maxError, abs(b_h[i]-4.0f));

        cout << "Max error: " << maxError << endl;


        delete[] a_h;
        delete[] b_h;
        return 0;
}

Compile

  • Modify the Makefile to:
...
CC        = hipcc
...
CFLAGS := -Xcompiler -Wall -I/appl/opt/rocm/rocm-4.0.0c/hipblas/hipblas/include
...
LDFLAGS = -L/appl/opt/rocm/rocm-4.0.0c/hipblas/hipblas/lib/ -lhipblas
...
  • Set the environment variable so that the hipBLAS library is found at runtime, then compile
export LD_LIBRARY_PATH=/appl/opt/rocm/rocm-4.0.0c/hipblas/hipblas/lib/:$LD_LIBRARY_PATH

Load the custom installation of ROCm

module load hip/4.0.0c
make clean
make
  • Submit your job script
sbatch sub.sh
  • Check the out* and error* files.

The solution is here: https://github.com/csc-training/hip/tree/main/porting/codes/saxpy/hipblas_solution

Lessons learned:

  • Always link with the appropriate library when it is available
  • Do not forget to declare the LD_LIBRARY_PATH environment variable
  • Adjust the Makefile

Exercise: Discrete_Hankel_Transform

Description: https://github.com/csc-training/hip/tree/main/porting/codes/Discrete_Hankel_Transform

Steps

cd ${rootdir}/porting/codes/Discrete_Hankel_Transform/cuda
Compile and Execute
nvcc -arch=sm_70  -o code Code.cu
sbatch sub.sh

Hipify, Compile, and Execute

cp * ../hip/
cd ../hip/

$ hipexamine-perl.sh Code.cu 
  info: converted 46 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:24 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:11 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:9 define:0 extern_shared:0 kernel_launch:2 )
  warn:0 LOC:220 in 'Code.cu'
  hipMalloc 9
  hipMemcpy 9
  hipMemcpyHostToDevice 7
  hipFree 3
  hipLaunchKernelGGL 2
  hipMemGetInfo 2
  hipMemcpyDeviceToHost 2
  hipMemset 1

$ hipify-perl --inplace Code.cu 

$ ls
bessel_zeros.in  Code.cu  Code.cu.prehip  README.md  sub.sh

hipcc -arch=sm_70  -o code Code.cu
sbatch sub.sh

The solution is here: https://github.com/csc-training/hip/tree/main/porting/codes/Discrete_Hankel_Transform/hip_solution

Exercise: Heat Equation

Example implementations of the two-dimensional heat equation.

CUDA

cd ${rootdir}/porting/codes/heat-equation/cuda
  • Load CUDA-aware MPI
module load openmpi
  • Compile and execute
make
sbatch sub.sh
  • Check the out* file, for example:
cat out_5003895
Average temperature at start: 59.763305
Iteration took 0.179 seconds.
Average temperature: 59.281239
Reference value with default arguments: 59.281239

Hipify

make clean
mkdir ../hip
cp *.cpp *.h *.cu ../hip/
cd ../hip

hipexamine-perl.sh 
  info: converted 13 CUDA->HIP refs ( error:0 init:0 version:0 device:1 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:0 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:1 )
  warn:0 LOC:90 in './core_cuda.cu'
  info: converted 3 CUDA->HIP refs ( error:0 init:0 version:0 device:2 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 )
  warn:0 LOC:200 in './setup.cpp'

  info: TOTAL-converted 16 CUDA->HIP refs ( error:0 init:0 version:0 device:3 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:2 include_cuda_main_header:0 type:0 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:1 )
  warn:0 LOC:702
  kernels (1 total) :   evolve_kernel(1)

  hipMemcpy 4
  hipMemcpyHostToDevice 3
  hipMalloc 2
  hip_runtime_api 2
  hipGetDeviceCount 1
  hipDeviceSynchronize 1
  hipLaunchKernelGGL 1
  hipSetDevice 1
  hipMemcpyDeviceToHost 1

hipconvertinplace-perl.sh .
  info: converted 13 CUDA->HIP refs ( error:0 init:0 version:0 device:1 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:0 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:1 )
  warn:0 LOC:90 in './core_cuda.cu'
  info: converted 3 CUDA->HIP refs ( error:0 init:0 version:0 device:2 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 )
  warn:0 LOC:200 in './setup.cpp'

  info: TOTAL-converted 16 CUDA->HIP refs ( error:0 init:0 version:0 device:3 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:2 include_cuda_main_header:0 type:0 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:1 )
  warn:0 LOC:702
  kernels (1 total) :   evolve_kernel(1)

  hipMemcpy 4
  hipMemcpyHostToDevice 3
  hipMalloc 2
  hip_runtime_api 2
  hipGetDeviceCount 1
  hipDeviceSynchronize 1
  hipLaunchKernelGGL 1
  hipSetDevice 1
  hipMemcpyDeviceToHost 1

ls
core.cpp	 core_cuda.cu	      heat.h	     io.cpp	    main.cpp	     Makefile	    setup.cpp	      sub2.sh  utilities.cpp
core.cpp.prehip  core_cuda.cu.prehip  heat.h.prehip  io.cpp.prehip  main.cpp.prehip  Makefile_orig  setup.cpp.prehip  sub.sh   utilities.cpp.prehip

Update the Makefile

  • Original Makefile
ifeq ($(COMP),)
COMP=gnu
endif
...

ifeq ($(COMP),gnu)
CXX=mpicxx
CC=gcc
NVCC=nvcc
NVCCFLAGS=-g -O3 -I$(COMMONDIR)
CCFLAGS=-g -O3 -Wall -I$(COMMONDIR)
LDFLAGS=
LIBS=-lpng -lcudart
endif

EXE=heat_cuda
OBJS=main.o core.o core_cuda.o setup.o utilities.o io.o
OBJS_PNG=$(COMMONDIR)/pngwriter.o

...
$(EXE): $(OBJS) $(OBJS_PNG)
        $(CXX) $(CCFLAGS) $(OBJS) $(OBJS_PNG) -o $@ $(LDFLAGS) $(LIBS)

%.o: %.cpp
        $(CXX) $(CCFLAGS) -c $< -o $@

%.o: %.c
        $(CC) $(CCFLAGS) -c $< -o $@

%.o: %.cu
        $(NVCC) $(NVCCFLAGS) -c $< -o $@
  • Tips

    • Use hipcc to compile the code with HIP calls
    • hipcc adds the options needed to link against the default GPU runtime libraries (but not MPI etc.)
  • New Makefile, taking into account that hipcc uses nvcc underneath

ifeq ($(COMP),)
COMP=hipcc
endif
...
ifeq ($(COMP),hipcc)
CXX=hipcc
CC=gcc
NVCC=hipcc --x cu
NVCCFLAGS=-g -O3 -I$(COMMONDIR)
CXXFLAGS=-g -O3 -Xcompiler -Wall -I$(COMMONDIR)
CCFLAGS=-g -O3 -Wall -I$(COMMONDIR)
LDFLAGS=
LIBS=-lpng -lmpi
endif
...
$(EXE): $(OBJS) $(OBJS_PNG)
        $(CXX) $(CXXFLAGS) $(OBJS) $(OBJS_PNG) -o $@ $(LDFLAGS) $(LIBS)

%.o: %.cpp
        $(CXX) $(CXXFLAGS) -c $< -o $@

%.o: %.c
        $(CC) $(CCFLAGS) -c $< -o $@

%.o: %.cu
        $(NVCC) $(NVCCFLAGS) -c $< -o $@
  • Compile and Execute
make
sbatch sub.sh

Exercise: 2D WAVE Propagation

2D Wave Propagation

CUDA and CPU

The 2D Wave Propagation case was provided by Ludovic Rass.

cd ${rootdir}/porting/codes/wave_2d/cuda_cpu
ls 
compile.sh  sub.sh  vizme2D.m  Wave_2D.c  Wave_2D.cu

The file Wave_2D.c is for the CPU and Wave_2D.cu is for the GPU.

  • Compile and Submit
cat compile.sh
#!/bin/bash
g++ -O3 Wave_2D.c -o wcpu

nvcc -arch=sm_70 -O3 Wave_2D.cu -o wgpu

./compile.sh

sbatch sub.sh
  • Check the out* file
cat out_5015029 
Perf: 220 iterations took 7.392e-03 seconds @ 32.9915 GB/s.
Process uses GPU with id 0 .
Perf: 220 iterations took 3.312e-03 seconds @ 73.6352 GB/s.

The CUDA code achieves about 2.2 times the bandwidth of the CPU version. Of course, this depends on the problem size, which in this case seems small.

HIP

There is already a script to compile the code in the hip folder. Copy the CUDA file and job script to the ../hip directory:

cp *.cu sub.sh ../hip
cd ../hip
Hipify
hipify-perl --print-stats --inplace Wave_2D.cu 
  info: converted 28 CUDA->HIP refs ( error:2 init:0 version:0 device:9 context:0 module:0 memory:4 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:4 include:0 include_cuda_main_header:1 type:1 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:3 )
  warn:0 LOC:126 in 'Wave_2D.cu'
  hipDeviceReset 3
  hipDeviceSynchronize 3
  hipLaunchKernelGGL 3
  hipMemcpy 2
  hipGetDevice 1
  hipError_t 1
  hipFree 1
  hipDeviceSetCacheConfig 1
  hipMalloc 1
  hip_runtime 1
  hipSetDevice 1
  hipMemcpyDeviceToHost 1
  hipMemcpyHostToDevice 1
  hipSuccess 1
  hipGetErrorString 1
  hipGetLastError 1
  hipFuncCachePreferL1 1
  • Compile and submit

Before you proceed with the submission, edit sub.sh and comment out the srun command that executes the CPU binary

./compile.sh
sbatch sub.sh
  • From the output file
cat out_*
Process uses GPU with id 0 .
Perf: 220 iterations took 3.385e-03 seconds @ 72.0481 GB/s.

The HIP version provides similar results to the CUDA version with a small overhead.

Exercise: KMeans

Parallel k-means clustering code

CUDA

cd ${rootdir}/porting/codes/kmeans/cuda
ls
cuda_io.cu  cuda_kmeans.cu  cuda_main.cu  cuda_wtime.cu  Image_data  kmeans.h  LICENSE  Makefile  README  sample.output  sub.sh
  • Compile and Execute
make cuda
sbatch sub.sh
  • We can check the out* and error* files
Writing coordinates of K=128 cluster centers to file "Image_data/color17695.bin.cluster_centres"
Writing membership of N=17695 data objects to file "Image_data/color17695.bin.membership"

Performing **** Regular Kmeans (CUDA version) ****
Input file:     Image_data/color17695.bin
numObjs       = 17695
numCoords     = 9
numClusters   = 128
threshold     = 0.0010
Loop iterations    = 131
I/O time           =     0.0529 sec
Computation timing =     0.2059 sec

HIP

  • Copy the data to the ../hip directory
cp -r *.cu *.h Image_data ../hip 
cd ../hip
  • Hipify
hipconvertinplace-perl.sh .
  info: converted 28 CUDA->HIP refs ( error:0 init:0 version:0 device:4 context:0 module:0 memory:13 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:1 include:0 include_cuda_main_header:0 type:1 literal:0 numeric_literal:5 define:0 extern_shared:2 kernel_launch:2 )
  warn:0 LOC:372 in './cuda_kmeans.cu'
  info: converted 8 CUDA->HIP refs ( error:3 init:0 version:0 device:0 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:2 literal:0 numeric_literal:1 define:1 extern_shared:0 kernel_launch:0 )
  warn:0 LOC:79 in './kmeans.h'

  info: TOTAL-converted 36 CUDA->HIP refs ( error:3 init:0 version:0 device:4 context:0 module:0 memory:13 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:1 include:1 include_cuda_main_header:0 type:3 literal:0 numeric_literal:6 define:1 extern_shared:2 kernel_launch:2 )
  warn:0 LOC:843
  kernels (1 total) :   compute_delta(1)

  hipMemcpy 5
  hipFree 4
  hipMalloc 4
  hipMemcpyHostToDevice 3
  hipError_t 2
  hipDeviceSynchronize 2
  hipLaunchKernelGGL 2
  hipGetErrorString 2
  HIP_DYNAMIC_SHARED 2
  hipMemcpyDeviceToHost 2
  hipGetDevice 1
  hipDeviceProp_t 1
  hipGetDeviceProperties 1
  hipSuccess 1
  hipGetLastError 1
  • Compile and execute
make -f Makefile.hip 
sbatch sub.sh
  • The output file
Performing **** Regular Kmeans (CUDA version) ****
Input file:     Image_data/color17695.bin
numObjs       = 17695
numCoords     = 9
numClusters   = 128
threshold     = 0.0010
Loop iterations    = 131
I/O time           =     0.0081 sec
Computation timing =     0.2000 sec

Exercise: Madgraph 4 GPU

This code was developed in the context of porting the MadGraph5_aMC@NLO event generator software onto GPU hardware. MadGraph5_aMC@NLO can generate code for various physics processes in different programming languages (Fortran, C, C++).

cd ${rootdir}/porting/codes/
mkdir madgraph4gpu
cd madgraph4gpu
wget https://github.com/madgraph5/madgraph4gpu/archive/master.zip
unzip master.zip
cd madgraph4gpu-master/
ls
epoch0	epoch1	epoch2	README.md  test  tools
cd epoch1/
cp -r cuda hip

Hipify

hipconvertinplace-perl.sh hip/
  info: converted 7 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:3 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:4 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 )
  warn:0 LOC:841 in 'hip/gg_tt/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cu'
...
 info: TOTAL-converted 294 CUDA->HIP refs ( error:11 init:0 version:0 device:3 context:0 module:0 memory:59 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:14 library:6 device_library:16 device_function:35 include:0 include_cuda_main_header:1 type:26 literal:0 numeric_literal:31 define:79 extern_shared:0 kernel_launch:13 )
  warn:2 LOC:15920
  warning: unconverted cudaTearDown : 2
  kernels (2 total) :   sigmaKin(3)  gProc::sigmaKin(2)

  hipMemcpy 23
  hipMemcpyToSymbol 19
  hipMemcpyDeviceToHost 18
  hipLaunchKernelGGL 18
  hipFree 14
  hipMalloc 11
  hipPeekAtLastError 9
  hipMemcpyHostToDevice 6
  hipDeviceReset 5
  hipSuccess 5
  HIP_SYMBOL 5
  hipError_t 4
  hipGetErrorString 4
  hipDoubleComplex 3
  hip_runtime 3
  hipHostFree 3
  hipHostMalloc 3
  hipMemcpy3D 3
  hipFloatComplex 3
  hipMemcpy2D 3
  HIPRAND_STATUS_SUCCESS 2
  HIPRAND_RNG_PSEUDO_PHILOX4_32_10 2
  HIPRAND_RNG_PSEUDO_MT19937 2
  HIPRAND_RNG_PSEUDO_MTGP32 2
  HIPRAND_RNG_PSEUDO_XORWOW 2
  HIPRAND_RNG_PSEUDO_MRG32K3A 2
  hipComplex 1
  hip_complex 1
  hipCsubf 1
  hipCmulf 1
  hipCaddf 1
  hipCdiv 1
  hipCrealf 1
  hipCsub 1
  hipCreal 1
  hipCimag 1
  hipCmul 1
  hipCadd 1
  hipCimagf 1
  hipCdivf 1
  • The warning is about a cuda-prefixed variable declared by the developers, so it is safe to ignore. hiprand does not work in our environment at the moment, so running this code requires actual AMD hardware; nevertheless, the code is ported. It still requires tuning and checking.

CUDA Fortran

  • As we discussed already, there is no straightforward approach for CUDA Fortran.
  • The HIP functions are callable from C and, via extern "C", callable from Fortran
  • Procedure:
    • Port the CUDA Fortran kernels to HIP kernels in C++ (hipfort helps with calling the HIP runtime from Fortran)
    • Wrap the kernel launch in a C function
    • Call the C function from Fortran through the Fortran 2003 C binding, using pointers etc.

Exercise: SAXPY CUDA Fortran

We have the following example: a SAXPY code in CUDA Fortran. To hipify this code, we follow the procedure above.

$ cd ${rootdir}/porting/codes/cuda_fortran_saxpy/cuda
$ ls
main.cuf
cat main.cuf
module mathOps
contains
  attributes(global) subroutine saxpy(x, y, a)
    implicit none
    real :: x(:), y(:)
    real, value :: a
    integer :: i, n
    n = size(x)
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= n) y(i) = y(i) + a*x(i)
  end subroutine saxpy 
end module mathOps

program testSaxpy
  use mathOps
  use cudafor
  implicit none
  integer, parameter :: N = 40000
  real :: x(N), y(N), a
  real, device :: x_d(N), y_d(N)
  type(dim3) :: grid, tBlock

  tBlock = dim3(256,1,1)
  grid = dim3(ceiling(real(N)/tBlock%x),1,1)

  x = 1.0; y = 2.0; a = 2.0
  x_d = x
  y_d = y
  call saxpy<<<grid, tBlock>>>(x_d, y_d, a)
  y = y_d
  write(*,*) 'Max error: ', maxval(abs(y-4.0))
end program testSaxpy
  • Compile the code and submit
./compile.sh
sbatch sub.sh
  • Check the out* and error* files
 cat out_*
 Max error:     0.000000
 
 cat error*
 real	0m0.404s
 ...

Create the HIP kernel

  • Original kernel
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= n) y(i) = y(i) + a*x(i)
  • HIP kernel
__global__ void saxpy(float *y, float *x, float a, int n)
{
    size_t i = blockDim.x * blockIdx.x  + threadIdx.x;
    if (i < n) y[i] = y[i] + a*x[i];
}
Observations
  • __global__ means that the function will be executed on the GPU and called from the host
  • Fortran variables such as blockDim%x become blockDim.x in C/C++. All of these have to be changed, but a find-and-replace with sed makes this easy
  • Array accesses also differ: y(i) becomes y[i], and Fortran indexing starts at 1 while C/C++ starts at 0 (hence the blockIdx%x - 1 in the Fortran kernel); again sed can help
  • Overall, be careful not to introduce mistakes; always check the results

Wrap the kernel launch in a C function

extern "C"
{
  // The device pointers arrive from Fortran as type(c_ptr) passed by
  // reference, hence the float** arguments.
  void launch(float **dout, float **da, float db, int N)
  {

     dim3 tBlock(256,1,1);
     dim3 grid(ceil((float)N/tBlock.x),1,1);
    
    hipLaunchKernelGGL((saxpy), grid, tBlock, 0, 0, *dout, *da, db, N);
  }
}

The new Fortran 2003 code

program testSaxpy
  use iso_c_binding
  use hipfort
  use hipfort_check

  implicit none
  interface
     subroutine launch(y,x,b,N) bind(c)
       use iso_c_binding
       implicit none
       type(c_ptr) :: y,x
       integer, value :: N
       real, value :: b
     end subroutine
  end interface

  type(c_ptr) :: dx = c_null_ptr
  type(c_ptr) :: dy = c_null_ptr
  integer, parameter :: N = 40000
  integer, parameter :: bytes_per_element = 4
  integer(c_size_t), parameter :: Nbytes = N*bytes_per_element
  real, allocatable,target,dimension(:) :: x, y


  real, parameter ::  a=2.0
  real :: x_d(N), y_d(N)

  call hipCheck(hipMalloc(dx,Nbytes))
  call hipCheck(hipMalloc(dy,Nbytes))

  allocate(x(N))
  allocate(y(N))

  x = 1.0;y = 2.0

  call hipCheck(hipMemcpy(dx, c_loc(x), Nbytes, hipMemcpyHostToDevice))
  call hipCheck(hipMemcpy(dy, c_loc(y), Nbytes, hipMemcpyHostToDevice))

  call launch(dy, dx, a, N)

  call hipCheck(hipDeviceSynchronize())

  call hipCheck(hipMemcpy(c_loc(y), dy, Nbytes, hipMemcpyDeviceToHost))

  write(*,*) 'Max error: ', maxval(abs(y-4.0))

  call hipCheck(hipFree(dx))
  call hipCheck(hipFree(dy))

  deallocate(x)
  deallocate(y)

end program testSaxpy

Makefile

  • hipfort provides a Makefile called Makefile.hipfort
export HIPFORT_HOME=${ROCM_PATH}/hipfort/
include ${HIPFORT_HOME}/bin/Makefile.hipfort

OUTPUT_DIR ?= $(PWD)
APP         = $(OUTPUT_DIR)/saxpy

.DEFAULT_GOAL := all

all: $(APP)

$(APP): $(OUTPUT_DIR)/main.o $(OUTPUT_DIR)/hipsaxpy.o
	$(FC) $^ $(LINKOPTS) -o $(APP)

$(OUTPUT_DIR)/main.o: main.f03
	$(FC)  -c $^ -o $(OUTPUT_DIR)/main.o

$(OUTPUT_DIR)/hipsaxpy.o: hipsaxpy.cpp
	$(CXX) --x cu -c $^ -o $(OUTPUT_DIR)/hipsaxpy.o

clean:
	rm -f $(APP) *.o *.mod *~

Tip: We are not sure how safe this is, but if all your .cpp files contain HIP calls on an NVIDIA system, you could define export HIPCC_COMPILE_FLAGS_APPEND="--x cu" and avoid modifying the Makefile. Be careful, as this can break something else.

  • Compile and submit
module load hip/4.0.0c
make
sbatch sub.sh

Gromacs

GROMACS is a molecular dynamics package mainly designed for simulations of proteins, lipids, and nucleic acids.

Do not follow these instructions during the workshop, as they could take a long time; they are documented here to help you with your own case

Download Gromacs and uncompress:

wget https://ftp.gromacs.org/gromacs/gromacs-2021.tar.gz
tar zxvf gromacs-2021.tar.gz
cd gromacs-2021
ls
admin  api  AUTHORS  build  cmake  CMakeLists.txt  computed_checksum  COPYING  CPackInit.cmake	CTestConfig.cmake  docs  INSTALL  python_packaging  README  scripts  share  src  tests

Hipify

Let's hipify the application automatically with the hipconvertinplace-perl.sh script

cd src
 hipconvertinplace-perl.sh . 
 
 info: converted 10 CUDA->HIP refs ( error:0 init:0 version:0 device:3 context:0 module:0 memory:4 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:1 literal:0 numeric_literal:2 define:0 extern_shared:0 kernel_launch:0 )
  warn:0 LOC:89 in './gromacs/gpu_utils/tests/devicetransfers.cu'
  info: converted 13 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:2 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:2 literal:0 numeric_literal:8 define:1 extern_shared:0 kernel_launch:0 )
  warn:0 LOC:126 in './gromacs/gpu_utils/pinning.cu'
  info: converted 12 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:3 literal:0 numeric_literal:0 define:3 extern_shared:0 kernel_launch:0 )
  warn:0 LOC:113 in './gromacs/gpu_utils/pmalloc_cuda.cu'
  warning: ./gromacs/gpu_utils/gpu_utils.cu:#69 : static bool cudaProfilerRun = ((getenv("NVPROF_ID") != nullptr));
  warning: ./gromacs/gpu_utils/gpu_utils.cu:#90 :             isPinned = (memoryAttributes.type == cudaMemoryTypeHost);
  warning: ./gromacs/gpu_utils/gpu_utils.cu:#114 :     if (cudaProfilerRun)
  warning: ./gromacs/gpu_utils/gpu_utils.cu:#126 :     if (cudaProfilerRun)
  warning: ./gromacs/gpu_utils/gpu_utils.cu:#143 :     if (cudaProfilerRun)
  warning: ./gromacs/gpu_utils/gpu_utils.cu:#154 :  * \param[in] cudaCallName   name of CUDA peer access call
  warning: ./gromacs/gpu_utils/gpu_utils.cu:#160 :                                 const char*          cudaCallName)
  warning: ./gromacs/gpu_utils/gpu_utils.cu:#165 :                 gmx::formatString("%s from GPU %d to GPU %d failed", cudaCallName, gpuA, gpuB);
  warning: ./gromacs/gpu_utils/gpu_utils.cu:#175 :                         gpuA, gpuB, cudaCallName, gmx::getDeviceErrorString(stat).c_str());
  info: converted 32 CUDA->HIP refs ( error:1 init:0 version:0 device:6 context:0 module:0 memory:0 virtual_memory:0 addressing:2 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:4 graphics:0 profiler:4 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:7 literal:0 numeric_literal:7 define:0 extern_shared:0 kernel_launch:0 )
  warn:9 LOC:239 in './gromacs/gpu_utils/gpu_utils.cu'
  info: converted 13 CUDA->HIP refs ( error:0 init:0 version:0 device:1 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:5 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:4 literal:0 numeric_literal:2 define:1 extern_shared:0 kernel_launch:0 )
  warn:0 LOC:103 in './gromacs/gpu_utils/device_stream.cu'
  warning: ./gromacs/hardware/device_management.cu:#137 :     // it is enough to check for cudaErrorDevicesUnavailable only here because
  warning: ./gromacs/hardware/device_management.cu:#139 :     if (cu_err == cudaErrorDevicesUnavailable)
...
  info: converted 3 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:3 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 )
  warn:0 LOC:320 in './gromacs/ewald/pme_spread.cu'
  warning: ./gromacs/ewald/pme_solve.cu:260: unsupported device function "__shfl_down_sync":         virxx += __shfl_down_sync(activeMask, virxx, 1, width);
  warning: ./gromacs/ewald/pme_solve.cu:261: unsupported device function "__shfl_up_sync":         viryy += __shfl_up_sync(activeMask, viryy, 1, width);
...
  info: converted 1 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:1 extern_shared:0 kernel_launch:0 )
  warn:0 LOC:2709 in './external/googletest/googletest/include/gtest/internal/gtest-port.h'

  info: TOTAL-converted 377 CUDA->HIP refs ( error:9 init:0 version:5 device:27 context:0 module:0 memory:29 virtual_memory:0 addressing:2 stream:18 event:27 external_resource_interop:0 stream_memory:0 execution:3 graph:0 occupancy:0 texture:3 surface:0 peer:4 graphics:0 profiler:4 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:18 device_library:0 device_function:38 include:1 include_cuda_main_header:5 type:81 literal:0 numeric_literal:85 define:11 extern_shared:7 kernel_launch:0 )
  warn:45 LOC:869044
  warning: unconverted cudaProfilerRun : 4
  warning: unconverted cudaCallName : 4
  warning: unconverted cudaErrorDevicesUnavailable : 2
  warning: unconverted cudaMemoryTypeHost : 1
  kernels (0 total) : 

  hipError_t 47
  hipSuccess 46
  hipEventDestroy 8
  hipMemcpyAsync 8
  hipSetDevice 8
  HIP_DYNAMIC_SHARED 7
  hipFuncSetCacheConfig 6
  hipErrorInvalidValue 5
  ...
  hipStream_t 4
  hipHostMalloc 4
  hipGetDevice 4
  hipStreamQuery 4
  hip_runtime 4
  hipStreamSynchronize 4
  ...  
  hipHostMallocDefault 2
  hipProfilerStop 2
  hipPointerAttribute_t 2
  hipDeviceEnablePeerAccess 2
  ...
  hipDeviceGetStreamPriorityRange 1
  hipGetErrorName 1
  hipPeekAtLastError 1
  hipErrorInvalidDeviceFunction 1
  HIPFFT_SUCCESS 1

Issues

  1. Description: We should check all the warnings, as they indicate that something was not translated; this does not necessarily mean it is a mistake. If a developer declares a variable called cudaXXX, the tool will report a warning, but that is expected.
warning: ./gromacs/gpu_utils/gpu_utils.cu:#69 : static bool cudaProfilerRun = ((getenv("NVPROF_ID") != nullptr));
warning: ./gromacs/gpu_utils/gpu_utils.cu:#90 :             isPinned = (memoryAttributes.type == cudaMemoryTypeHost);
warning: ./gromacs/gpu_utils/gpu_utils.cu:#114 :     if (cudaProfilerRun)
warning: ./gromacs/gpu_utils/gpu_utils.cu:#126 :     if (cudaProfilerRun)
warning: ./gromacs/gpu_utils/gpu_utils.cu:#143 :     if (cudaProfilerRun)
warning: ./gromacs/gpu_utils/gpu_utils.cu:#154 :  * \param[in] cudaCallName   name of CUDA peer access call
warning: ./gromacs/gpu_utils/gpu_utils.cu:#160 :                                 const char*          cudaCallName)
warning: ./gromacs/gpu_utils/gpu_utils.cu:#165 :                 gmx::formatString("%s from GPU %d to GPU %d failed", cudaCallName, gpuA, gpuB);
warning: ./gromacs/gpu_utils/gpu_utils.cu:#175 :                         gpuA, gpuB, cudaCallName, gmx::getDeviceErrorString(stat).c_str());
warning: ./gromacs/hardware/device_management.cu:#137 :     // it is enough to check for cudaErrorDevicesUnavailable only here because
warning: ./gromacs/hardware/device_management.cu:#139 :     if (cu_err == cudaErrorDevicesUnavailable)
warn:45 LOC:869044
warning: unconverted cudaProfilerRun : 4
warning: unconverted cudaCallName : 4
warning: unconverted cudaErrorDevicesUnavailable : 2
warning: unconverted cudaMemoryTypeHost : 1

Solution:

Check the files with the warnings:

  • For example, the warning about cudaProfilerRun is not actually a serious issue, as it is a variable declared by the developers
static bool cudaProfilerRun
  • Similarly for
 const char* cudaCallName
    if (cu_err == cudaErrorDevicesUnavailable)
    {
        return DeviceStatus::Unavailable;
    }

However, if we try to track the call path from the code, we can see:


cu_err == cudaErrorDevicesUnavailable)
└── cu_err = checkCompiledTargetCompatibility(deviceInfo.id, deviceInfo.prop);
    └── static cudaError_t checkCompiledTargetCompatibility(int deviceId, const cudaDeviceProp& deviceProp)
        {
         cudaFuncAttributes attributes;
         cudaError_t        stat = cudaFuncGetAttributes(&attributes, dummy_kernel);
         ...
         return stat;
         }

Thus, the returned value comes from the call to cudaFuncGetAttributes. In the HIP API PDF, which is more up to date than the web site (for v4.0 see https://github.com/RadeonOpenCompute/ROCm/blob/master/HIP-API_Guide_v4.0.pdf), we can find that

4.8.2.11 hipFuncGetAttributes()

Returns
hipSuccess, hipErrorInvalidValue, hipErrorInvalidDeviceFunction

Thus we can manually replace cudaErrorDevicesUnavailable with hipErrorInvalidDeviceFunction in the file ./gromacs/hardware/device_management.cu
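
After the manual edit, the check shown earlier would read (a sketch):

// ./gromacs/hardware/device_management.cu after the manual replacement:
if (cu_err == hipErrorInvalidDeviceFunction)
{
    return DeviceStatus::Unavailable;
}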

  • About the warning:
warning: unconverted cudaMemoryTypeHost : 1

We can see from the HIP programming guide PDF that hipMemoryTypeHost exists, and we are not sure why it was not converted; it could be a bug in the tool. So we manually make the appropriate modification in the file ./gromacs/gpu_utils/gpu_utils.cu

  2. Description: The warp-level primitives (the __shfl_*_sync variants) are not supported by HIP (yet)
warning: ./gromacs/ewald/pme_solve.cu:260: unsupported device function "__shfl_down_sync":         virxx += __shfl_down_sync(activeMask, virxx, 1, width);

Github issue: ROCm/HIP#1491

Solution:

Change the calls of __shfl_*_sync to __shfl_*, for example __shfl_down_sync to __shfl_down, as sketched below.
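
For the line from the warning above, the change would look like this (a sketch; note that the HIP variants take no mask argument):

// CUDA warp-level primitive with explicit mask:
virxx += __shfl_down_sync(activeMask, virxx, 1, width);
// HIP equivalent at the time of writing (mask argument dropped):
virxx += __shfl_down(virxx, 1, width);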

  3. Description: Gromacs uses FFT

Solution:

hipFFT seems not to be able to compile on NVIDIA systems; it is possible with the hip/4.0.0 module, but the explanation is different.

  4. Description:

Error: src/gromacs/utility/cuda_version_information.cu(49): error: identifier "hipDriverGetVersion" is undefined

Code:

#include "gmxpre.h"
#include "cuda_version_information.h"
#include "gromacs/utility/stringutil.h"

namespace gmx
{

std::string getCudaDriverVersionString()
{
    int cuda_driver = 0;
    if (hipDriverGetVersion(&cuda_driver) != hipSuccess)
    {
        return "N/A";
    }
    return formatString("%d.%d", cuda_driver / 1000, cuda_driver % 100);
}

std::string getCudaRuntimeVersionString()
{
    int cuda_runtime = 0;
    if (hipRuntimeGetVersion(&cuda_runtime) != hipSuccess)
    {
        return "N/A";
    }
    return formatString("%d.%d", cuda_runtime / 1000, cuda_runtime % 100);
}

} // namespace gmx

Solution:

Add #include "hip/hip_runtime.h"
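
The top of the file would then look like this (a sketch):

#include "gmxpre.h"
#include "hip/hip_runtime.h" // added: declares hipDriverGetVersion and hipRuntimeGetVersion
#include "cuda_version_information.h"
#include "gromacs/utility/stringutil.h"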

Compilation

This is one example; it does not mean that this is the best way:

CXX=/appl/opt/rocm/rocm-4.0.0/hip/bin/hipcc cmake -DGMX_GPU=CUDA ..
make

CMake

One CMake file from Gromacs, cmake/gmxManageNvccConfig.cmake, is the following:

#
# This file is part of the GROMACS molecular simulation package.
#
# Copyright (c) 2012,2013,2014,2015,2016 by the GROMACS development team.
# Copyright (c) 2017,2018,2019,2020, by the GROMACS development team, led by
# Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
# and including many others, as listed in the AUTHORS file in the
# top-level source directory and at http://www.gromacs.org.
...
# set up host compiler and its options
if(CUDA_HOST_COMPILER_CHANGED)
    set(CUDA_HOST_COMPILER_OPTIONS "")

    if(APPLE AND CMAKE_C_COMPILER_ID MATCHES "GNU")
        # Some versions of gcc-4.8 and gcc-4.9 have produced errors
        # (in particular on OS X) if we do not use
        # -D__STRICT_ANSI__. It is harmless, so we might as well add
        # it for all versions.
        list(APPEND CUDA_HOST_COMPILER_OPTIONS "-D__STRICT_ANSI__")
    endif()

    work_around_glibc_2_23()

    set(CUDA_HOST_COMPILER_OPTIONS "${CUDA_HOST_COMPILER_OPTIONS}"
        CACHE STRING "Options for nvcc host compiler (do not edit!).")

    mark_as_advanced(CUDA_HOST_COMPILER CUDA_HOST_COMPILER_OPTIONS)
endif()

if (GMX_CUDA_TARGET_SM OR GMX_CUDA_TARGET_COMPUTE)
    set(GMX_CUDA_NVCC_GENCODE_FLAGS)
    set(_target_sm_list ${GMX_CUDA_TARGET_SM})
    foreach(_target ${_target_sm_list})
        list(APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_${_target},code=sm_${_target}")
    endforeach()
    set(_target_compute_list ${GMX_CUDA_TARGET_COMPUTE})
    foreach(_target ${_target_compute_list})
        list(APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_${_target},code=compute_${_target}")
    endforeach()
else()

  if(CUDA_VERSION VERSION_LESS "11.0")
        list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_30,code=sm_30")
    endif()
    list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_35,code=sm_35")
...
    list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_70,code=compute_70")
    if(NOT CUDA_VERSION VERSION_LESS "10.0")
        list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_75,code=compute_75")
    endif()
    if(NOT CUDA_VERSION VERSION_LESS "11.0")
        list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_80,code=compute_80")
    endif()
endif()

if((_cuda_nvcc_executable_or_flags_changed OR CUDA_HOST_COMPILER_CHANGED OR NOT GMX_NVCC_WORKS) AND NOT WIN32)
    message(STATUS "Check for working NVCC/C++ compiler combination with nvcc '${CUDA_NVCC_EXECUTABLE}'")
    execute_process(COMMAND ${CUDA_NVCC_EXECUTABLE} -ccbin ${CUDA_HOST_COMPILER} -c ${CUDA_NVCC_FLAGS} ${CUDA_NVCC_FLAGS_${_build_type}} ${CMAKE_SOURCE_DIR}/cmake/TestCUDA.cu
        RESULT_VARIABLE _cuda_test_res
        OUTPUT_VARIABLE _cuda_test_out
        ERROR_VARIABLE  _cuda_test_err
        OUTPUT_STRIP_TRAILING_WHITESPACE)

...
endif() # GMX_CHECK_NVCC

macro(GMX_SET_CUDA_NVCC_FLAGS)
    set(CUDA_NVCC_FLAGS "${GMX_CUDA_NVCC_FLAGS};${CUDA_NVCC_FLAGS}")
endmacro()


function(gmx_cuda_add_library TARGET)
    add_definitions(-DHAVE_CONFIG_H)
    # Source files generated by NVCC can include gmxmpi.h, and so
    # need access to thread-MPI.
    include_directories(SYSTEM ${PROJECT_SOURCE_DIR}/src/external/thread_mpi/include)
    # Source files can also contain topology related files and need access to
    # the remaining external headers
    include_directories(SYSTEM ${PROJECT_SOURCE_DIR}/src/external)

    # Now add all the compilation options
    gmx_cuda_target_compile_options(CUDA_${TARGET}_CXXFLAGS)
    list(APPEND CMAKE_CXX_FLAGS ${CUDA_${TARGET}_CXXFLAGS})
    foreach(build_type ${build_types_with_explicit_flags})
        list(APPEND CMAKE_CXX_FLAGS_${build_type} ${CUDA_${TARGET}_CXXFLAGS_${build_type}})
    endforeach()

    cuda_add_library(${TARGET} ${ARGN})
endfunction()

Hipify

hipify-cmakefile cmake/gmxManageNvccConfig.cmake > cmake/gmxManageHipConfig.cmake
 warning: cmake/gmxManageNvccConfig.cmake:#38 : unsupported macro/option : # - use the CUDA_HOST_COMPILER if defined by the user, otherwise
 warning: cmake/gmxManageNvccConfig.cmake:#39 : unsupported macro/option : # - check if nvcc works with CUDA_HOST_COMPILER and the generated nvcc and C++ flags
 warning: cmake/gmxManageNvccConfig.cmake:#42 : unsupported macro/option : #   * CUDA_HOST_COMPILER_OPTIONS    - the full host-compiler related option list passed to nvcc
 warning: cmake/gmxManageNvccConfig.cmake:#44 : unsupported macro/option : # Note that from CMake 2.8.10 FindCUDA defines CUDA_HOST_COMPILER internally,
 warning: cmake/gmxManageNvccConfig.cmake:#59 : unsupported macro/option :         list(APPEND CUDA_HOST_COMPILER_OPTIONS "-D_FORCE_INLINES")
 warning: cmake/gmxManageNvccConfig.cmake:#60 : unsupported macro/option :         set(CUDA_HOST_COMPILER_OPTIONS ${CUDA_HOST_COMPILER_OPTIONS} PARENT_SCOPE)
 warning: cmake/gmxManageNvccConfig.cmake:#64 : unsupported macro/option : gmx_check_if_changed(CUDA_HOST_COMPILER_CHANGED CUDA_HOST_COMPILER)
 warning: cmake/gmxManageNvccConfig.cmake:#67 : unsupported macro/option : if(CUDA_HOST_COMPILER_CHANGED)
 warning: cmake/gmxManageNvccConfig.cmake:#68 : unsupported macro/option :     set(CUDA_HOST_COMPILER_OPTIONS "")
 warning: cmake/gmxManageNvccConfig.cmake:#75 : unsupported macro/option :         list(APPEND CUDA_HOST_COMPILER_OPTIONS "-D__STRICT_ANSI__")
 warning: cmake/gmxManageNvccConfig.cmake:#80 : unsupported macro/option :     set(CUDA_HOST_COMPILER_OPTIONS "${CUDA_HOST_COMPILER_OPTIONS}"
 warning: cmake/gmxManageNvccConfig.cmake:#83 : unsupported macro/option :     mark_as_advanced(CUDA_HOST_COMPILER CUDA_HOST_COMPILER_OPTIONS)
 warning: cmake/gmxManageNvccConfig.cmake:#178 : unsupported macro/option : list(APPEND GMX_CUDA_NVCC_FLAGS "${CUDA_HOST_COMPILER_OPTIONS}")
 warning: cmake/gmxManageNvccConfig.cmake:#210 : unsupported macro/option : if((_cuda_nvcc_executable_or_flags_changed OR CUDA_HOST_COMPILER_CHANGED OR NOT GMX_NVCC_WORKS) AND NOT WIN32)
 warning: cmake/gmxManageNvccConfig.cmake:#212 : unsupported macro/option :     execute_process(COMMAND ${CUDA_NVCC_EXECUTABLE} -ccbin ${CUDA_HOST_COMPILER} -c ${HIP_NVCC_FLAGS} ${HIP_NVCC_FLAGS_${_build_type}} ${CMAKE_SOURCE_DIR}/cmake/TestCUDA.cu

The new CMake file looks like:

 if(HIP_VERSION VERSION_LESS "11.0")
        list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_30,code=sm_30")
    endif()
    list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_35,code=sm_35")
 ...
    if(NOT HIP_VERSION VERSION_LESS "11.0")
        # Requesting sm or compute 35, 37, or 50 triggers deprecation messages with
        # nvcc 11.0, which we need to suppress for use in CI
        list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-Wno-deprecated-gpu-targets")
    endif()    
    ...
    macro(GMX_SET_CUDA_NVCC_FLAGS)
    set(HIP_NVCC_FLAGS "${GMX_CUDA_NVCC_FLAGS};${HIP_NVCC_FLAGS}")
endmacro()
  • The tool will be improved over time; you can probably also add more variables by editing the file /appl/opt/rocm/rocm-4.0.0c/hip/bin/hipify-cmakefile

There will also be many improvements regarding CMake; see ROCm/HIP#2158 (comment)

Optional exercises

Vector Addition

Known issues

  • Some HIP libraries need dependencies that are not available on the NVIDIA platform; this needs investigation.

  • If your CUDA kernel launch includes a dim3() call, hipify will convert it wrongly (issue in Github). This was fixed on February 24th but is not yet installed on Puhti.

  • In CUDA, __CUDACC__ is defined by nvcc, but the HIP equivalent __HIPCC__ is defined in hip_runtime.h. Thus, if code uses __CUDACC__ without #include <cuda_runtime_api.h>, one needs to manually add #include <hip_runtime.h> so that the automatically converted __HIPCC__ gets defined (see the sketch below). Issue in Github
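
A minimal illustration of this pitfall (a hypothetical file, for illustration only):

// Before hipify, this guard relied on nvcc defining __CUDACC__ implicitly.
// After hipify it becomes __HIPCC__, which is only defined once the HIP
// runtime header has been included, so the include must be added by hand:
#include "hip/hip_runtime.h"

#ifdef __HIPCC__
// device-specific code path
#endif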

Questions, answers and notes

  • question
    • answer

:::info Always ask questions at the very bottom of this document, right above this. :::
