yzhaiustc / optimizing-sgemm-on-nvidia-turing-gpus

Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.

License: GNU General Public License v3.0

Languages: Cuda 95.04%, MATLAB 3.54%, Makefile 0.81%, Shell 0.61%
Topics: cuda, gemm, nvidia, optimization

optimizing-sgemm-on-nvidia-turing-gpus's Introduction

How to optimize SGEMM on NVIDIA GPUs

"HPC is all about reducing data movement". Optimizing GEMM on GPU and CPU platforms share the same idea: to hide the memory latency with massive parallelism, cache-/register-level data re-use, and manual prefetching. On CPUs, both instruction-level and data-level parallelisms are exploited as well as delicate prefetching schemes are designed to hide the memory latency. Meanwhile, we partition the input matrices and pack them before computing to ensure a "smooth and low-latancy" computing kernel. The prefetching for matrix C is especially critical to the CPU GEMM performance.

On GPUs, we also need to take advantage of the low-latency "cache", i.e. shared memory. GPUs offer rich opportunities to exploit data re-use at both the shared-memory level and the register level. More details can be found in the official documentation of CUTLASS.

Questions are welcome at [email protected].

Hardware platforms and software configurations

  • GPU: NVIDIA GeForce RTX 2080 Super.
  • We compiled the program with gcc 7.3.0 under Ubuntu 18.04.5 LTS.
  • NVIDIA cuBLAS version: CUDA cuBLAS 11.3.1.68.

How to run

Just three steps.

  • First, modify the path of nvcc in the Makefile.
  • Second, run make to compile. A binary executable sgemm_gpu will be generated.
  • Third, run the binary with ./sgemm_gpu [kernel_number], where kernel_number selects the kernel to benchmark: 0 runs NVIDIA cuBLAS and 1-11 run the 11 kernels demonstrating the optimization strategies.

Step-wise Optimizations

Here we use a column-major implementation of SGEMM. Neither A nor B is transposed.

Kernel 1 (naive version)

source code

Kernel 1 is the most naive implementation of SGEMM in CUDA: the triple-for-loop implementation with register re-use when updating C(i,j). In this version, each thread block (TB) is responsible for a 32x32 sub-block of C, and each thread computes only a single element of the C matrix.
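
A minimal sketch of such a naive kernel is shown below. It assumes column-major storage with leading dimensions lda, ldb and ldc; the names and launch configuration are illustrative, not the repository's exact code.

// Sketch of the naive kernel: one thread per C(i, j), column-major storage.
// Names and launch configuration are illustrative, not the repository's code.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, int lda,
                            const float *B, int ldb,
                            float beta, float *C, int ldc) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // row of C
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // column of C
    if (i < M && j < N) {
        float acc = 0.f;                             // C(i, j) kept in a register
        for (int k = 0; k < K; ++k)
            acc += A[i + k * lda] * B[k + j * ldb];
        C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
    }
}
// Launched, e.g., with dim3 block(32, 32) and a grid covering the M x N output.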

Kernel 2 (Kernel 1 + 32x32x32 tiling)

source code

Kernel 2 partitions matrix A and matrix B into 32x32 blocks. These 32x32 blocks are loaded into shared memory before being used in the GEMM computation. When loading the data into shared memory (this is called packing in CPU GEMM), each thread loads/stores one element, and we set 1024 threads per TB using __launch_bounds__(1024). After packing completes, all threads are synchronized and each then computes its own element. Since each TB still computes a 32x32 block of C, each thread still handles a single element of C. In short, this version adds cache blocking on top of the previous one, with the parameter set {Ms,Ns,Ks}={32,32,32}.
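
A sketch of the idea, again with illustrative names and assuming M, N and K are multiples of 32:

// Sketch of 32x32x32 shared-memory tiling (illustrative; assumes M, N, K are
// multiples of 32). Each of the 1024 threads packs one element of A and one of B
// per K-step, then all threads synchronize before computing.
#define TS 32
__global__ void __launch_bounds__(1024)
sgemm_tile32(int M, int N, int K, float alpha,
             const float *A, int lda, const float *B, int ldb,
             float beta, float *C, int ldc) {
    __shared__ float sa[TS][TS], sb[TS][TS];
    int tx = threadIdx.x, ty = threadIdx.y;          // 32 x 32 threads per TB
    int row = blockIdx.x * TS + tx;                  // global row of C
    int col = blockIdx.y * TS + ty;                  // global column of C
    float acc = 0.f;
    for (int kk = 0; kk < K; kk += TS) {
        sa[ty][tx] = A[row + (kk + ty) * lda];       // pack a 32x32 tile of A
        sb[ty][tx] = B[(kk + tx) + col * ldb];       // pack a 32x32 tile of B
        __syncthreads();
        for (int k = 0; k < TS; ++k)
            acc += sa[k][tx] * sb[ty][k];
        __syncthreads();
    }
    C[row + col * ldc] = alpha * acc + beta * C[row + col * ldc];
}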

Kernel 3 (minor update on Kernel 2)

source code

We apply a simple optimization on top of kernel 2 here: store threadIdx.x in a local variable before it is re-used many times, which reduces the number of live registers and helps the compiler optimize. The performance improves slightly in this step.
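
In spirit, the change looks like the following fragment (hypothetical names, not the repository's exact code):

// Read the built-in index once and derive the tile coordinates from the copy,
// instead of re-evaluating threadIdx.x at every use.
int tx  = threadIdx.x;
int row = tx & 31;    // tx % 32: row inside the 32x32 tile
int col = tx >> 5;    // tx / 32: column inside the 32x32 tile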

Kernel 4 (Kernel 3 + reducing bank conflicts in shared memory)

source code

In the previous version, the access pattern on shared memory is not ideal. We re-order it: the shared-memory tiles are kept column-major, but matrix B is transposed while being packed into shared memory. This doubles the performance.
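
The general principle is illustrated by the hypothetical fragment below (not the repository's exact indexing): with 32 four-byte banks, a warp should read a 32x32 shared tile along its fast dimension. Here lane is the thread's lane id within its warp and k is a loop index.

// Hypothetical illustration of shared-memory bank conflicts on a 32x32 tile.
__shared__ float tile[32][32];
int lane = threadIdx.x & 31;
// tile[lane][k]: addresses lane*32 + k -> every lane hits the same bank (32-way conflict).
float bad  = tile[lane][k];
// tile[k][lane]: addresses k*32 + lane -> one word per bank (conflict-free).
float good = tile[k][lane];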

Kernel 5 (Kernel 4 + 4x1 micro kernel)

source code

In this step, we ask each thread to compute 4 elements of the C matrix. Therefore, we now have 256 threads in a TB to compute the 32x32 block of C that the TB is responsible for. In CPU-GEMM terminology, the micro kernel's shape is 4x1: after the packing routine completes, each thread loads a 4x1 slice of A and a 1x1 element of B and computes C(4x1) += A(4x1)*B(1x1). Starting from this step, we use 256 threads per TB.
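
The per-thread inner loop then looks roughly like this (hypothetical fragment; sa(i,k) and sb(k,j) are accessor macros over the packed shared tiles, row4 is the first of the thread's four rows inside the 32x32 C tile, and col is its column):

// Hypothetical 4x1 micro kernel: each thread owns four consecutive rows of one
// column of the 32x32 C tile.
float acc[4] = {0.f, 0.f, 0.f, 0.f};
for (int k = 0; k < KS; ++k) {
    float b = sb(k, col);                 // one B element, re-used four times
    acc[0] += sa(row4 + 0, k) * b;
    acc[1] += sa(row4 + 1, k) * b;
    acc[2] += sa(row4 + 2, k) * b;
    acc[3] += sa(row4 + 3, k) * b;
}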

Kernel 6 (Kernel 5 + vectorized load/store)

source code

Since our target machine supports 128-bit transactions from DRAM, we can apply vectorized loads using the float4 data type.
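
For example (hypothetical fragment), four consecutive floats of one column of A can be moved from global to shared memory in a single 128-bit load and store, assuming the shared tile is declared as __shared__ float sa[KS][MS] so that each packed column is contiguous, row4 is a 4-aligned row offset and kcol is the column of the current K-slice being packed:

// Hypothetical fragment: 128-bit vectorized copy via float4.
// Requires the global and shared addresses to be 16-byte aligned.
float4 av = *reinterpret_cast<const float4 *>(&A[row4 + kcol * lda]);
*reinterpret_cast<float4 *>(&sa[kcol][row4]) = av;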

Kernel 7 ({Ms,Ns,Ks}={64,64,16}, {Mr,Nr}={4,4})

source code

Considering that a TB has an ample register budget (64K registers) while we only assign 256 threads per TB, it is safe, performance-wise, to give each thread more work. We now ask each thread to compute a 4x4 sub-matrix of C, which gains massive data re-use at the register level compared with the previous step.

Additionally, when the input matrices are large, we can increase Ms and Ns and still have enough TBs to keep all streaming multiprocessors busy. Here we increase {Ms,Ns} from {32,32} to {64,64} but decrease Ks from 32 to 16 to keep the shared memory consumption unchanged. Since nothing else changes besides these tiling parameters, we deduce that asking TBs to do more work benefits performance when the input matrices are large enough.
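
The per-thread accumulation for a 4x4 register tile looks roughly like this (hypothetical fragment; sa(i,k) and sb(k,j) again denote the packed shared tiles, and row4/col4 are the thread's 4-aligned coordinates inside the {Ms,Ns} tile):

// Hypothetical 4x4 register tile: 4 values of A and 4 values of B are loaded
// per k-step and combined into 16 FMAs, so each loaded value is re-used 4 times.
float acc[4][4] = {{0.f}};
for (int k = 0; k < KS; ++k) {
    float a[4], b[4];
    for (int i = 0; i < 4; ++i) a[i] = sa(row4 + i, k);
    for (int j = 0; j < 4; ++j) b[j] = sb(k, col4 + j);
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            acc[i][j] += a[i] * b[j];
}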

Kernel 8 ({Ms,Ns,Ks}={128,128,8}, {Mr,Nr}={8,8})

source code

We assign more work to both each TB and each thread.
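
Concretely (hypothetical macros mirroring the parameter set in the heading), the footprint works out as follows:

// Hypothetical parameter macros for this step. Shared memory per K-slice:
//   A tile: 128 x 8 floats = 4 KB,  B tile: 8 x 128 floats = 4 KB.
// Threads per TB: (MS / MR) * (NS / NR) = 16 * 16 = 256, each owning an
// 8x8 register tile of C (64 accumulators per thread).
#define MS 128
#define NS 128
#define KS 8
#define MR 8
#define NR 8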

Kernel 9 (Kernel 8 + warp-level tiling/parallelism)

source code

Since each warp contains 32 threads, and reads of the same shared-memory address within a warp are served by a single broadcast, we introduce a warp tile {Mw,Nw}={4xMr,8xNr} to exploit warp-level parallelism. We refer readers to (Huang, 2018) for more details.
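
A hypothetical mapping from thread id to warp-tile coordinates is sketched below, assuming the MR/NR macros from the previous step and a 4x8 arrangement of lanes inside each warp (the exact arrangement is an assumption, not the repository's code):

// Hypothetical warp-level tiling: 256 threads = 8 warps per TB, each warp covering
// an Mw x Nw = (4*MR) x (8*NR) region of the 128x128 C tile.
int tid     = threadIdx.x;
int warp_id = tid >> 5;                  // which warp within the TB (0..7)
int lane    = tid & 31;                  // lane within the warp (0..31)
int warp_row = warp_id & 3;              // 4 x 2 grid of warps over the 128x128 tile
int warp_col = warp_id >> 2;
int lane_row = lane & 3;                 // 4 x 8 grid of lanes within the warp
int lane_col = lane >> 2;
int thread_row = warp_row * (4 * MR) + lane_row * MR;   // first row of this thread's MRxNR C tile
int thread_col = warp_col * (8 * NR) + lane_col * NR;   // first column of that tile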

Kernel 10 (Kernel 9 + prefetching (Huang, 2018))

source code

Kernel 11 (Kernel 10 + double buffering to remove a sync)

source code

We introduce a double-buffering strategy for the shared-memory buffers, which removes an unnecessary __syncthreads() inside the loop body and pushes the performance to the limit.
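
The main loop then looks roughly like the hypothetical sketch below: with two copies of each tile, the threads pack the next K-slice into one buffer while computing from the other, so only one barrier per iteration is needed. Here pack_tiles and compute_tiles are placeholders for the packing and micro-kernel code from the earlier steps, and acc is the per-thread register tile.

// Hypothetical double-buffered main loop: two copies of each shared tile.
// While buffer 'buf' is being consumed, the next K-slice is packed into 1 - buf,
// so the second __syncthreads() of the single-buffer version disappears.
__shared__ float sa[2][KS * MS], sb[2][KS * NS];
int buf = 0;
pack_tiles(sa[0], sb[0], /* k-slice start */ 0);   // prologue: fill the first buffer
__syncthreads();
for (int kk = KS; kk < K; kk += KS) {
    pack_tiles(sa[1 - buf], sb[1 - buf], kk);      // prefetch next slice into the idle buffer
    compute_tiles(acc, sa[buf], sb[buf]);          // consume the current buffer
    __syncthreads();                               // one barrier per iteration
    buf = 1 - buf;
}
compute_tiles(acc, sa[buf], sb[buf]);              // epilogue: last slice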

Compare with cuBLAS

The performance of our best kernel is close to that of the closed-source NVIDIA cuBLAS. Let us compute their efficiencies. Peak performance of an NVIDIA RTX 2080 Super (GFLOPS) = boost frequency (GHz) * number of CUDA cores * 2 (FMA). Therefore, peak perf = 1.815 GHz * 3072 * 2 = 11151.36 GFLOPS = 11.15 TFLOPS. Our best performance is 10.384 TFLOPS, while cuBLAS' best is 10.717 TFLOPS; both are observed at the largest input, a 6144x6144x6144 SGEMM. In terms of efficiency, we reach 93.1% of the peak while cuBLAS reaches 96.1%.

Some extra notes

It should be noted that the efficiency of both our kernel and cuBLAS can increase further when they are fed larger input matrices, because more parallelism helps to better hide the latency. To shorten the test time on my local machine, I truncate the test cases at a fairly small input (m=n=k=6144); interested readers are encouraged to run more tests themselves. One might also be interested in the performance of irregularly shaped SGEMM, such as i) tall-and-skinny matrices or ii) cases where k is much larger than both m and n. For tall-and-skinny matrices, one can choose different parameters for the macro kernel, the micro kernel and the warp-level tiling. For the latter case, where k is much larger than m and n, one needs to obtain parallelism by splitting k, with a mutex lock guarding the accumulation into C. This is indeed more complicated, but one can still borrow brilliant ideas from the source code of CUTLASS.
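
As a rough illustration of the split-k idea (a hypothetical sketch, not code from this repository; atomicAdd is used here as a simpler stand-in for the mutex mentioned above), each TB computes a partial product over its own slice of k and merges it into C atomically:

// Hypothetical split-k sketch: blockIdx.z selects a slice of the k dimension,
// each TB computes a partial C for its slice, and partial results are merged
// with atomicAdd. Assumes C has already been pre-scaled by beta.
__global__ void sgemm_splitk_naive(int M, int N, int K, int k_slices, float alpha,
                                   const float *A, int lda, const float *B, int ldb,
                                   float *C, int ldc) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // row of C
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // column of C
    int k_len   = (K + k_slices - 1) / k_slices;     // length of each k slice
    int k_begin = blockIdx.z * k_len;
    int k_end   = min(k_begin + k_len, K);
    if (i < M && j < N) {
        float acc = 0.f;
        for (int k = k_begin; k < k_end; ++k)
            acc += A[i + k * lda] * B[k + j * ldb];
        atomicAdd(&C[i + j * ldc], alpha * acc);     // merge partial results across slices
    }
}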

optimizing-sgemm-on-nvidia-turing-gpus's Issues

Failed verification against cuBLAS SGEMM

hello,

Thanks for your great work! I found that with non-square inputs, e.g. M = 35, N = 79, K = 19, all kernels fail verification. Do you have any idea about this issue?

best regards

thanks for your code!

Your code is very helpful for understanding the optimization of matrix multiplication on GPUs, thanks very much!
Are there higher-performance approaches that beat kernel 11?

kernel3

Hello, I will ask my question in Chinese. In kernel 3, you changed the block size from (32,32) to (1024), and you listed three benefits of this change: 1. storing threadIdx.x before re-using it massively; 2. reducing the number of live registers; 3. helping the compiler optimize.
I don't quite understand what these points mean, and I cannot find corresponding explanations in books or online. Could you please explain them in more detail? A reference would be even better!

Cannot build the project

Dear @yzhaiustc, thanks for your amazing effort on this repo. However, we cannot build the project because helper_functions.h is missing. Can you provide this file? Thanks!

Problem with event-based timing

for (n_count = 0; n_count < N; n_count++) {
    cudaEventRecord(beg);                         // start event just before the launch
    test_kernel(kernel_num, m, n, k, alpha, dA, dB, beta, dC, err);
    cudaEventRecord(end);                         // stop event right after the launch
    cudaEventSynchronize(beg);
    cudaEventSynchronize(end);
    cudaEventElapsedTime(&ms, beg, end);          // per-iteration time in milliseconds
    elapsed_time += ms;
}

Measuring the time this way is more accurate.

Error in reducing bank conflicts in kernel 4

Hi, the shared memory in kernel 4 is column-major according to lines 6 and 7. But line 27 accesses 'sb4' by 'row' (which is consecutive in thread id) on the 2nd dimension. That means consecutive threads still access scattered shared memory.
The speed of this code on my RTX 2080 Ti is:

M=N=K   Average elapsed time (s)   Performance (GFLOPS)
256     0.000049                   679.173522
512     0.000324                   828.578150
768     0.001080                   838.995059
1024    0.002519                   852.364426
1280    0.004891                   857.639395
1536    0.007778                   931.870010
1792    0.010060                   1144.078582
2048    0.014952                   1148.978555

The speed above is even slower than the naive implementation of kernel 1, which does not use shared memory.
I tried to solve the bank conflicts by accessing 'sb4' with the consecutive thread index 'row' on the 1st dimension, like this:

sa4(row, col) = A(row, col);       // pack the A tile
sb4(row, col) = B(row, col);       // pack the B tile without transposing it
A += (lda << 5); B += 32;          // advance both to the next 32-wide K-slice
__syncthreads();
for (int inner_k_count = 0; inner_k_count < KS; inner_k_count++) {
    tmp += sa4(row, inner_k_count) * sb4(inner_k_count, col);
}

The speed improved as below, and it is now faster than the naive kernel 1:

M=N=K   Average elapsed time (s)   Performance (GFLOPS)
256     0.000031                   1072.821737
512     0.000183                   1463.878317
768     0.000608                   1489.360494
1024    0.001381                   1554.788349
1280    0.002745                   1528.041037
1536    0.004654                   1557.176527
1792    0.007357                   1564.402972
2048    0.009352                   1836.994741
