
mratsim / laser

261 stars  16 watchers  15 forks  3.74 MB

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers

License: Apache License 2.0

Languages: Nim 78.25%, C 14.33%, C++ 7.42%
Topics: high-performance-computing, deep-learning, blas, gemm, convolution, jit, assembler, simd, openmp, tensor

laser's People

Contributors

awr1, mratsim


laser's Issues

performance of gemm_strided vs numpy

python

time python $timn_D/tests/nim/all/t0147.py
1000.0
python $timn_D/tests/nim/all/t0147.py  5.26s user 0.13s system 293% cpu 1.840 total

import numpy as np
p=1000
a=np.ones((p,p))
b=np.ones((p,p))

for i in np.arange(100):
  c=np.matmul(a,b)

print(c[0,0])

laser

nim c -d:release -d:case2 $timn_D/src/timn/apps/laser.nim
time $timn_D/src/timn/apps/laser
1000.0
$timn_D/src/timn/apps/laser  5.35s user 0.03s system 99% cpu 5.405 total

import pkg/laser/primitives/matrix_multiplication/gemm

when defined(case2):
  proc test =
    # todo: different numbers
    let p1 = 1000
    let p2 = p1
    let p3 = p1

    type T = float

    var a = newSeq[T](p1 * p2)
    for i in 0..<a.len: a[i] = 1.0
    var b = newSeq[T](p2 * p3)
    for i in 0..<b.len: b[i] = 1.0
    var c = newSeq[T](p1 * p3)

    for i in 0..<100:
      gemm_strided(
        p1, p2, p3, # CHECKME ; not sure if order correct, would be nice to document M,N,K in `gemm_strided`
        1.0,
        a[0].addr, p1, 1,
        b[0].addr, p2, 1,

        0.0,
        c[0].addr, p1, 1,
      )
    # echo a
    # echo b
    echo c[0]

test()
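
Worth noting from the timings above: the laser run used 99% CPU (i.e. single-threaded) while numpy used 293%. The compile command lacks the -d:openmp flag used by the other benchmarks in this repo, so a multithreaded run (an untested assumption) would be:

nim c -d:release -d:openmp -d:case2 $timn_D/src/timn/apps/laser.nim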

Fused assignation shortcut

Currently the way to implement fast sigmoid would be:

var x = randomTensor([1000, 1000], 1.0)
var output = newTensor[float64](x.shape)
forEach o in output, xi in x:
  o = 1 / (1 + exp(-xi))

which is quite wordy.

Reusing the Arraymancer syntax for broadcasting would be:

let output = 1 ./ (1 .+ exp(-x))

but this would allocate an intermediate tensor for each step:

  • x0 = -x
  • x1 = exp(x0)
  • x2 = 1 .+ x1
  • x3 = 1/x2

Unfortunately we cannot use anything other than `=` in a let/var statement, like `let x .= 1 / (1 + exp(-x))`.
But we can use `let x = fuse: 1 / (1 + exp(-x))` to request that the code generate the `forEach` automatically.
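
A minimal sketch of what the hypothetical `fuse` macro would expand to, reusing the `forEach` form above (the `fuse` name and its expansion are illustrative, not an existing API):

# let output = fuse: 1 / (1 + exp(-x))
# would expand to a single fused pass, with no intermediate allocations:
var output = newTensor[float64](x.shape)
forEach o in output, xi in x:
  o = 1 / (1 + exp(-xi))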

Matrix multiplication: Nested parallelism

cc @Laurae2

Benchmark on a dual Xeon Gold 6154 vs MKL:

Warmup: 0.9943 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 2304, N: 2304)
B matrix shape: (M: 2304, N: 2304)
Output shape: (M: 2304, N: 2304)
Required number of operations: 24461.181 millions
Required bytes:                   42.467 MB
Arithmetic intensity:            576.000 FLOP/byte
Theoretical peak single-core:    118.400 GFLOP/s
Theoretical peak multi:         4262.400 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Intel MKL benchmark
Collected 100 samples in 0.658 seconds
Average time: 6.211 ms
Stddev  time: 2.274 ms
Min     time: 5.648 ms
Max     time: 28.398 ms
Perf:         3938.203 GFLOP/s

Display output[0] to make sure it's not optimized away
566.68505859375

Laser production implementation
Collected 100 samples in 4.067 seconds
Average time: 40.303 ms
Stddev  time: 12.542 ms
Min     time: 35.367 ms
Max     time: 121.945 ms
Perf:         606.927 GFLOP/s

Display output[0] to make sure it's not optimized away
566.68505859375

PyTorch Glow: libjit matmul implementation
Collected 100 samples in 36.837 seconds
Average time: 368.372 ms
Stddev  time: 3.071 ms
Min     time: 362.655 ms
Max     time: 380.193 ms
Perf:         66.403 GFLOP/s

Display output[0] to make sure it's not optimized away
566.6849975585938

According to the paper

[2] Anatomy of High-Performance Many-Threaded Matrix Multiplication
Smith et al

Parallelism should be done around jc (dimension nc)

[figure: the five-loop GEMM algorithm from the paper, with parallelism around the jc loop]

Note that nc is often 4096 so we might need another distribution scheme.
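
For reference, a runnable toy sketch of the BLIS-style loop nest (no packing, no microkernel, illustrative block sizes), marking the jc loop where the paper places the outermost parallelism:

const mc = 64
const kc = 64
const nc = 256

proc gemmBlocked(M, N, K: int, A, B: seq[float], C: var seq[float]) =
  ## C[MxN] += A[MxK] * B[KxN], all row-major.
  for jc in countup(0, N-1, nc):      # parallelism goes here (dimension nc)
    for pc in countup(0, K-1, kc):    # the B panel would be packed here
      for ic in countup(0, M-1, mc):  # the A block would be packed here
        for i in ic ..< min(ic + mc, M):
          for p in pc ..< min(pc + kc, K):
            for j in jc ..< min(jc + nc, N):
              C[i*N + j] += A[i*K + p] * B[p*N + j]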

Fast image loading primitives

On small networks (ResNet-18) data augmentation on CPU will be the bottleneck and the GPU will not be fully used, leading to funky solutions like: https://www.sagivtech.com/2017/09/19/optimizing-pytorch-training-code/ (which uses multiprocessing to bypass the Python GIL and resorts to killing spawned threads)

Image loading needs to be fast. Benchmarks like https://t0rakka.silvrback.com/jpeg-decoding-benchmark show that libjpeg-turbo can be a bottleneck. See the repo https://github.com/t0rakka/mango/tree/master/source/mango/jpeg, Nvidia nvJPEG https://developer.nvidia.com/nvjpeg and Nvidia DALI (Data Augmentation Library) https://github.com/NVIDIA/DALI.

Alternative libraries to look at:

Regression on GEMM allocation

The gemm float32 benchmark is segfaulting with the following stack trace:

Traceback (most recent call last)
gemm_bench_float32.nim   gemm_bench_float32
gemm_bench_float32.nim   benchLaserGEMM
gemm.nim                 gemm_strided
gemm_tiling.nim          newTiles
alloc.nim                alloc
alloc.nim                rawAlloc
alloc.nim                getBigChunk
alloc.nim                removeChunkFromMatrix2
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

This happens with or without openmp.

The offending commit is e898f02 (#26)

[Showstopper regression] emit does not generate proper symbol

A regression in emit broke OpenMP:

template omp_parallel_if*(condition: bool, body: untyped) =
  let predicate = condition # Make symbol valid and ensure it's lvalue
  {.emit: "#pragma omp parallel if (`predicate`)".}
  block: body

used in

omp_parallel_if(parallelize):
  body

Now generates:

predicateX60gensym409679_ = parallelize;
#pragma omp parallel if (predicate)
{
  ...
}
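
A possible workaround (an assumption on my part, requiring a Nim version that supports the array form of emit, which resolves Nim symbols to their actual mangled C names):

template omp_parallel_if*(condition: bool, body: untyped) =
  let predicate = condition  # ensure the symbol is a valid lvalue
  # The array form interpolates `predicate` with its gensym'd C name.
  {.emit: ["#pragma omp parallel if (", predicate, ")"].}
  block: body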

Optimised random sampling methods

Random sampling methods for common distributions are required.

Normal distribution:

Currently Arraymancer uses Box-Muller: https://github.com/mratsim/Arraymancer/blob/d05ef61847601fe253837329e0c8c47be18e8d9d/src/tensor/init_cpu.nim#L209-L235

import math, random

proc randomNormal(mean = 0.0, std = 1.0): float =
  ## Random number in the normal distribution using Box-Muller method
  ## See https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform
  var valid {.global.} = false
  var x {.global.}, y {.global.}, rho {.global.}: float
  if not valid:
    x = rand(1.0)
    y = rand(1.0)
    rho = sqrt(-2.0 * ln(1.0 - y))
    valid = true
    return rho*cos(2.0*PI*x)*std+mean
  else:
    valid = false
    return rho*sin(2.0*PI*x)*std+mean

The polar method and Ziggurat algorithm are apparently faster.
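
The polar method avoids Box-Muller's trigonometric calls; a minimal sketch, assuming Marsaglia's polar variant is the one meant:

import math, random

proc randomNormalPolar(mean = 0.0, std = 1.0): float =
  ## Rejection-sample a point in the unit disk, then transform it.
  var u, v, s: float
  while true:
    u = 2.0 * rand(1.0) - 1.0
    v = 2.0 * rand(1.0) - 1.0
    s = u*u + v*v
    if s > 0.0 and s < 1.0: break
  result = mean + std * u * sqrt(-2.0 * ln(s) / s)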

Reading

https://fac.ksu.edu.sa/sites/default/files/introduction-to-probability-model-s.ross-math-cs.blog_.ir_.pdf

Mysterious 2x perf regression on GEMM

With no code or hardware change at all, after a month there is a 2x perf regression; OpenBLAS is also a bit slower (with no package update):

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 0.101 seconds
Average time: 9.440 ms
Stddev  time: 0.141 ms
Min     time: 9.315 ms
Max     time: 9.733 ms
Perf:         1499.508 GFLOP/s

Laser production implementation
Collected 10 samples in 0.146 seconds
Average time: 14.000 ms
Stddev  time: 25.706 ms
Min     time: 5.839 ms
Max     time: 87.161 ms
Perf:         1011.102 GFLOP/s

PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 10 samples in 2.041 seconds
Average time: 204.123 ms
Stddev  time: 0.763 ms
Min     time: 203.362 ms
Max     time: 205.862 ms
Perf:         69.349 GFLOP/s

MKL-DNN reference GEMM benchmark
Collected 10 samples in 0.351 seconds
Average time: 34.305 ms
Stddev  time: 5.588 ms
Min     time: 30.013 ms
Max     time: 49.684 ms
Perf:         412.645 GFLOP/s

MKL-DNN JIT AVX benchmark
Collected 10 samples in 0.130 seconds
Average time: 11.230 ms
Stddev  time: 8.353 ms
Min     time: 7.725 ms
Max     time: 34.426 ms
Perf:         1260.573 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 10 samples in 0.083 seconds
Average time: 7.716 ms
Stddev  time: 7.932 ms
Min     time: 4.601 ms
Max     time: 30.078 ms
Perf:         1834.643 GFLOP/s
Mean Relative Error compared to vendor BLAS: 3.045843413929106e-06

I suspect an issue with GNU OpenMP (libgomp). (MKL-DNN is linked against Intel OpenMP.)

[Benchmarks] Cleanup fp_reduction_latency benchmarks

Update for devel OpenMP

Following the merging of nim-lang/Nim#9493 the OpenMP annotation string will need to be patched when compiled for 0.19.1/0.20 vs 0.19.

Additionally, we can handle both forEach and reduceEach in a single macro, as nb_chunks: var int would no longer need to be passed.

forEach would then only generate #pragma omp for instead of #pragma omp parallel for, relying on an enclosing #pragma omp parallel.

Devel regression "object constructor needs an object type"

There is a regression on devel when compiling the matrix multiplication bench:

gemm_bench_float32.nim(74, 8) Warning: import os.nim instead; ospaths is deprecated [Deprecated]
gemm_bench_float32.nim(145, 8) template/generic instantiation of `bench` from here
gemm_bench_float32.nim(150, 17) template/generic instantiation of `gemm_strided` from here
../../laser/primitives/matrix_multiplication/gemm.nim(201, 46) template/generic instantiation of `dispatch` from here
../../laser/primitives/matrix_multiplication/gemm.nim(194, 14) template/generic instantiation of `apply` from here
../../laser/primitives/matrix_multiplication/gemm.nim(185, 30) template/generic instantiation of `gemm_impl` from here
../../laser/primitives/matrix_multiplication/gemm.nim(126, 29) template/generic instantiation of `pack_B_kc_nc` from here
../../laser/primitives/matrix_multiplication/gemm_tiling.nim(122, 6) Error: object constructor needs an object type

parallel reduction

hi, I wanted to try out laser. I have this code working:

import sequtils, random, times, strformat
import laser/openmp  # assumed module path for omp_* and omp_parallel_chunks_default

proc pmin(s: var seq[float32]): float32 {.noInline.} =

  var min_by_thread = newSeq[float32](omp_get_max_threads())
  for v in min_by_thread.mitems:
    v = float32.high

  omp_parallel_chunks_default(s.len, chunk_offset, chunk_size):
    #[
    attachGC()
    min_by_thread[omp_get_thread_num()] = min(
        min_by_thread[omp_get_thread_num()],
        min(s[chunk_offset..<(chunk_offset + chunk_size)])
        )
    detachGC()
    ]#

    var thread_min = min_by_thread[omp_get_thread_num()]
    # echo chunk_offset, " ", chunk_size

    for idx in chunk_offset ..< chunk_offset + chunk_size:
      thread_min = min(s[idx], thread_min)
    min_by_thread[omp_get_thread_num()] = thread_min

  result = min(min_by_thread)

do I need an omp_critical section for the final result? and/or any other problems?
And here is my calling code, adapted from your examples/:

proc main() =
  randomize(42) # Reproducibility
  var x = newSeqWith(800_000_000, float32 rand(1.0))
  x[200_000_001] = -42.0'f32
  echo omp_get_num_threads(), " ", omp_get_max_threads()

  var t = cpuTime()
  let m = min(x)

  echo "serial  :", m, &" in {cpuTime() - t:.2f} seconds"

  for i in 0..10:
    t = cpuTime()
    let mp = x.pmin()
    doAssert abs(mp - m) < 1e-10
    echo "parallel:", mp, &" in {cpuTime() - t:.2f} seconds"

main()

Exponential: Dual Xeon Gold 6154 result

cc @Laurae2

Warmup: 0.9938 s, result 224 (displayed to avoid compiler optimizing warmup away)

A - tensor shape: [5000000]
Required number of operations:     5.000 millions
Required bytes:                   20.000 MB
Arithmetic intensity:              0.250 FLOP/byte
Theoretical peak single-core:     86.400 GFLOP/s
Theoretical peak multi:          172.800 GFLOP/s
a[0]: -9.999997138977051

Baseline <math.h>
Collected 300 samples in 5.625 seconds
Average time: 17.733 ms
Stddev  time: 0.054 ms
Min     time: 17.698 ms
Max     time: 18.148 ms
Perf:         0.282 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.540005829767324e-05

SSE mathfun
Collected 300 samples in 2.094 seconds
Average time: 6.021 ms
Stddev  time: 0.043 ms
Min     time: 5.976 ms
Max     time: 6.544 ms
Perf:         0.830 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.540006193565205e-05

SSE fast_exp_sse (low order polynomial)
Collected 300 samples in 1.211 seconds
Average time: 3.073 ms
Stddev  time: 0.062 ms
Min     time: 3.019 ms
Max     time: 3.734 ms
Perf:         1.627 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.545032061287202e-05

AVX2 fmath
Collected 300 samples in 1.060 seconds
Average time: 2.558 ms
Stddev  time: 0.067 ms
Min     time: 2.473 ms
Max     time: 3.056 ms
Perf:         1.955 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.540006193565205e-05

AVX2 FMA Minimax
Collected 300 samples in 1.042 seconds
Average time: 2.492 ms
Stddev  time: 0.076 ms
Min     time: 2.383 ms
Max     time: 3.050 ms
Perf:         2.006 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.539992369245738e-05

AVX2 mathfun
Collected 300 samples in 1.307 seconds
Average time: 3.382 ms
Stddev  time: 0.067 ms
Min     time: 3.275 ms
Max     time: 3.906 ms
Perf:         1.478 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.540006193565205e-05

AVX+FMA Schraudolph-approx
Collected 300 samples in 0.933 seconds
Average time: 2.128 ms
Stddev  time: 0.066 ms
Min     time: 2.062 ms
Max     time: 2.607 ms
Perf:         2.350 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.625692963600159e-05

Bench SIMD Math Prims
Collected 300 samples in 4.826 seconds
Average time: 15.079 ms
Stddev  time: 0.189 ms
Min     time: 14.945 ms
Max     time: 15.937 ms
Perf:         0.332 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.539986548479646e-05

gemm_strided: error: always_inline function '_mm256_setzero_pd' requires target feature 'xsave'

import pkg/laser/primitives/matrix_multiplication/gemm

#[
error:
/tmp/nim/nimcache/laser_gemm_ukernel_avx.c:416:10: error: always_inline function '_mm256_setzero_pd' requires target feature 'xsave', but would be inlined into function 'gebb_ukernel_float64_x86_AVX_Ecs27YPxbc6EG9arud9a0ZTQ' that is compiled without support for 'xsave'
        AB0_0 = _mm256_setzero_pd();
]#
proc test =
  let a = [[1.0, 2, 3],
           [1.0, 1, 1],
           [1.0, 1, 1]]

  let b = [[1.0, 1],
           [1.0, 1],
           [1.0, 1]]

  let ab = [[6.0, 6],
            [3.0, 3],
            [3.0, 3]]

  var res_ab: array[3, array[2, float]]
  gemm_strided(
    3, 2, 3,
    1.0,  a[0][0].unsafeAddr, 3, 1,
          b[0][0].unsafeAddr, 2, 1,
    0.0,  res_ab[0][0].addr,  2, 1
    )

when defined(case1):
  proc test =
    # todo: different numbers
    let p1 = 3
    let p2 = p1
    let p3 = p1

    type T = float

    var a = newSeq[T](p1 * p2)
    for i in 0..<a.len: a[i] = 1.0
    var b = newSeq[T](p2 * p3)
    for i in 0..<b.len: b[i] = 1.0
    var c = newSeq[T](p1 * p3)

    gemm_strided(
      p1, p2, p3, # CHECKME ; not sure if order correct, would be nice to document M,N,K in `gemm_strided`

      1.0,
      a[0].addr, p1, 1,
      b[0].addr, p2, 1,

      0.0,
      c[0].addr, p1, 1,
    )
    echo c

test()
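
A possible workaround, assuming the root cause is that clang pulls in the AVX intrinsic headers without the xsave target feature enabled: forward the target features to the C compiler, e.g.

nim c -d:release --passC:"-march=native" repro.nim
# or, more narrowly
nim c -d:release --passC:"-mavx -mxsave" repro.nim

(repro.nim is a placeholder for the file above.)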

[Design] Error model

The error model is still to be decided.

Currently, asserts are used: https://github.com/numforge/laser/blob/be3326bca5d9096e912530c6ed946bb89ee01b6f/laser/strided_iteration/map_foreach.nim#L106-L110

Literature:

Challenges:

  • should the exposed interface be left to a higher-level lib? (i.e. leave only asserts, as those are removed in release builds)
  • If not:
    • Distinguish between recoverable and unrecoverable errors (bugs).
    • For embedded devices the recoverable-errors "API" must be documented, whether error codes or exceptions are chosen.
  • Macros like forEach cannot use error codes; a high-level wrapper should take care of that

Parallel strided iteration does not scale linearly

On the following benchmark on my dual core machine with hyper-threading enabled:

https://github.com/numforge/laser/blob/d725651c0f8ca1da3f761e27e9846bc81b9341f3/benchmarks/loop_iteration/iter_bench_prod.nim (branch foreach-dsl #4)

I get the following scores without OpenMP:

Warmup: 1.1911 s, result 224 (displayed to avoid compiler optimizing warmup away)

Production implementation for tensor iteration - float64
Collected 1000 samples in 8.931 seconds
Average time: 8.930ms
Stddev  time: 0.345ms
Min     time: 8.723ms
Max     time: 14.785ms

Display output[[0,0]] to make sure it's not optimized away
-0.41973403633413

Production implementation for tensor iteration - float64
Collected 1000 samples in 29.597 seconds
Average time: 29.597ms
Stddev  time: 1.125ms
Min     time: 27.002ms
Max     time: 38.606ms

Display output[[0,0]] to make sure it's not optimized away
1.143903810108473

and with OpenMP:

Warmup: 1.1874 s, result 224 (displayed to avoid compiler optimizing warmup away)

Production implementation for tensor iteration - float64
Collected 1000 samples in 4.094 seconds
Average time: 4.092ms
Stddev  time: 0.206ms
Min     time: 3.897ms
Max     time: 5.459ms

Display output[[0,0]] to make sure it's not optimized away
-0.41973403633413

Production implementation for tensor iteration - float64
Collected 1000 samples in 24.025 seconds
Average time: 24.022ms
Stddev  time: 1.127ms
Min     time: 22.379ms
Max     time: 33.763ms

Display output[[0,0]] to make sure it's not optimized away
1.143903810108473

Potential explanations:

  • hitting the maximum memory bandwidth:
    • Parallel strided iteration also means parallel cache misses.
    • One way to alleviate that would be to use GCC/Clang's builtin_prefetch (see the sketch after this list).
    • Note: this does not reflect the experiments in https://github.com/zy97140/omp-benchmark-for-pytorch. Furthermore, computing the next element's location shouldn't require any memory access.
      Unfortunately, even though it requires no memory access, the CPU cannot use instruction-level parallelism to overlap the main computation with computing the next element's location, because of the branching involved.
  • accesses to shape and strides for strided iteration are slow:
    • This might happen if the compiler doesn't create a local copy or keep them in registers, because it cannot prove that no shape/strides access mutates them.
  • strided iteration is running out of registers and/or instruction cache space:
    • Unlikely: those are per-core resources, not per-socket, so we should still see a linear increase.
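
A minimal sketch of wrapping the builtin from Nim (GCC/Clang only; laser may already ship an equivalent wrapper, this one is purely illustrative):

# Expose __builtin_prefetch; with one argument it defaults to a read prefetch.
proc prefetch(data: pointer) {.importc: "__builtin_prefetch", nodecl.}

# Inside the strided loop, prefetch the next element's location while the
# current element is being processed, e.g.: prefetch(buf[nextIdx].addr)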

[Lux] Multithreading for JIT code

This issue tracks multithreading solutions for JIT code.

Description

At the moment, Lux only targets Nim and so can use OpenMP for threading.

In the future, Lux will probably add a JIT option via LLVM IR; this will reduce code size, enable code generation specialized for concrete sizes, and allow targeting new architectures that would otherwise require complex C extensions.

Example: CUDA introduces __global__ and magic like blockIdx.x * blockDim.x + threadIdx.x, which requires some gymnastics for Nim to generate code that doesn't throw "undefined".

Unfortunately, when doing JIT on CPU we lose OpenMP support, as OpenMP is implemented in Clang and lowered to library calls in the LLVM IR. So we need an alternative solution.

Solutions to explore

  1. Reuse Nim threadpool library.

  2. Implement a threading library from scratch

  3. Wrap a C/C++ library (note that C++ will cause issues with cpuinfo with some compilers, due to it using C99)

  4. Wait for OpenMP IR to be merged in LLVM see:

OpenMP code transformation

from https://stackoverflow.com/questions/52285368/how-does-llvm-translate-openmp-multi-threaded-code-with-runtime-library-calls

This OMP code

extern float foo( void );
int main () {
    int i;
    float r = 0.0;
    #pragma omp parallel for schedule(dynamic) reduction(+:r)
    for ( i = 0; i < 10; i ++ ) {
        r += foo();
    }
}

is transformed into

extern float foo( void );
int main () {
    static int zero = 0;
    auto int gtid;
    auto float r = 0.0;
    __kmpc_begin( & loc3, 0 );
    // The gtid is not actually required in this example so could be omitted;
    // We show its initialization here because it is often required for calls into
    // the runtime and should be locally cached like this.
    gtid = __kmpc_global_thread_num( & loc3 );
    __kmpc_fork_call( & loc7, 1, main_7_parallel_3, & r );
    __kmpc_end( & loc0 );
    return 0;
}

struct main_10_reduction_t_5 { float r_10_rpr; };

static kmp_critical_name lck = { 0 };
static ident_t loc10; // loc10.flags should contain KMP_IDENT_ATOMIC_REDUCE bit set
                      // if compiler has generated an atomic reduction.
void main_7_parallel_3( int *gtid, int *btid, float *r_7_shp ) {
    auto int i_7_pr;
    auto int lower, upper, liter, incr;
    auto struct main_10_reduction_t_5 reduce;
    reduce.r_10_rpr = 0.F;
    liter = 0;
    __kmpc_dispatch_init_4( & loc7,*gtid, 35, 0, 9, 1, 1 );
    while ( __kmpc_dispatch_next_4( & loc7, *gtid, & liter, & lower, & upper, & incr
      ) ) {
        for( i_7_pr = lower; upper >= i_7_pr; i_7_pr ++ )
          reduce.r_10_rpr += foo();
    }
    switch( __kmpc_reduce_nowait( & loc10, *gtid, 1, 4, & reduce, main_10_reduce_5, &lck ) ) {
        case 1:
           *r_7_shp += reduce.r_10_rpr;
           __kmpc_end_reduce_nowait( & loc10, *gtid, & lck );
           break;
        case 2:
           __kmpc_atomic_float4_add( & loc10, *gtid, r_7_shp, reduce.r_10_rpr );
           break;
        default:;
    }
}

in LLVM IR:

[...]
; Function Attrs: nounwind uwtable
define dso_local i32 @main() local_unnamed_addr #0 {
entry:
  %i = alloca i32, align 4
  %r = alloca float, align 4
  %0 = bitcast i32* %i to i8*
  call void @llvm.lifetime.start.p0i8(i64 4, i8* nonnull %0) #4
  %1 = bitcast float* %r to i8*
  call void @llvm.lifetime.start.p0i8(i64 4, i8* nonnull %1) #4
  store float 0.000000e+00, float* %r, align 4, !tbaa !2
  call void (%struct.ident_t*, i32, void (i32*, i32*, ...)*, ...) @__kmpc_fork_call(%struct.ident_t* nonnull @0, i32 2, void (i32*, i32*, ...)* bitcast (void (i32*, i32*, i32*, float*)* @.omp_outlined. to void (i32*, i32*, ...)*), i32* nonnull %i, float* nonnull %r) #4
  call void @llvm.lifetime.end.p0i8(i64 4, i8* nonnull %1) #4
  call void @llvm.lifetime.end.p0i8(i64 4, i8* nonnull %0) #4
  ret i32 0
}
[...]

; Function Attrs: norecurse nounwind uwtable
define internal void @.omp_outlined.(i32* noalias nocapture readonly %.global_tid., i32* noalias nocapture readnone %.bound_tid., i32* nocapture readnone dereferenceable(4) %i, float* nocapture dereferenceable(4) %r) #2 {
entry:
  %.omp.lb = alloca i32, align 4
  %.omp.ub = alloca i32, align 4
  %.omp.stride = alloca i32, align 4
  %.omp.is_last = alloca i32, align 4
  %r1 = alloca float, align 4
  %.omp.reduction.red_list = alloca [1 x i8*], align 8
  %0 = bitcast i32* %.omp.lb to i8*
  call void @llvm.lifetime.start.p0i8(i64 4, i8* nonnull %0) #4
  store i32 0, i32* %.omp.lb, align 4, !tbaa !6
  %1 = bitcast i32* %.omp.ub to i8*
  call void @llvm.lifetime.start.p0i8(i64 4, i8* nonnull %1) #4
  store i32 9, i32* %.omp.ub, align 4, !tbaa !6
  %2 = bitcast i32* %.omp.stride to i8*
  call void @llvm.lifetime.start.p0i8(i64 4, i8* nonnull %2) #4
  store i32 1, i32* %.omp.stride, align 4, !tbaa !6
  %3 = bitcast i32* %.omp.is_last to i8*
  call void @llvm.lifetime.start.p0i8(i64 4, i8* nonnull %3) #4
  store i32 0, i32* %.omp.is_last, align 4, !tbaa !6
  %4 = bitcast float* %r1 to i8*
  call void @llvm.lifetime.start.p0i8(i64 4, i8* nonnull %4) #4
  store float 0.000000e+00, float* %r1, align 4, !tbaa !2
  %5 = load i32, i32* %.global_tid., align 4, !tbaa !6
  tail call void @__kmpc_dispatch_init_4(%struct.ident_t* nonnull @0, i32 %5, i32 35, i32 0, i32 9, i32 1, i32 1) #4
  %6 = call i32 @__kmpc_dispatch_next_4(%struct.ident_t* nonnull @0, i32 %5, i32* nonnull %.omp.is_last, i32* nonnull %.omp.lb, i32* nonnull %.omp.ub, i32* nonnull %.omp.stride) #4
  %tobool14 = icmp eq i32 %6, 0
  br i1 %tobool14, label %omp.dispatch.end, label %omp.dispatch.body

omp.dispatch.cond.loopexit:                       ; preds = %omp.inner.for.body, %omp.dispatch.body
  %7 = call i32 @__kmpc_dispatch_next_4(%struct.ident_t* nonnull @0, i32 %5, i32* nonnull %.omp.is_last, i32* nonnull %.omp.lb, i32* nonnull %.omp.ub, i32* nonnull %.omp.stride) #4
  %tobool = icmp eq i32 %7, 0
  br i1 %tobool, label %omp.dispatch.end, label %omp.dispatch.body

omp.dispatch.body:                                ; preds = %entry, %omp.dispatch.cond.loopexit
  %8 = load i32, i32* %.omp.lb, align 4, !tbaa !6
  %9 = load i32, i32* %.omp.ub, align 4, !tbaa !6, !llvm.mem.parallel_loop_access !8
  %cmp12 = icmp sgt i32 %8, %9
  br i1 %cmp12, label %omp.dispatch.cond.loopexit, label %omp.inner.for.body

omp.inner.for.body:                               ; preds = %omp.dispatch.body, %omp.inner.for.body
  %.omp.iv.013 = phi i32 [ %add4, %omp.inner.for.body ], [ %8, %omp.dispatch.body ]
  %call = call float @foo() #4, !llvm.mem.parallel_loop_access !8
  %10 = load float, float* %r1, align 4, !tbaa !2, !llvm.mem.parallel_loop_access !8
  %add3 = fadd float %call, %10
  store float %add3, float* %r1, align 4, !tbaa !2, !llvm.mem.parallel_loop_access !8
  %add4 = add nsw i32 %.omp.iv.013, 1
  %11 = load i32, i32* %.omp.ub, align 4, !tbaa !6, !llvm.mem.parallel_loop_access !8
  %cmp = icmp slt i32 %.omp.iv.013, %11
  br i1 %cmp, label %omp.inner.for.body, label %omp.dispatch.cond.loopexit, !llvm.loop !8

omp.dispatch.end:                                 ; preds = %omp.dispatch.cond.loopexit, %entry
  %12 = bitcast [1 x i8*]* %.omp.reduction.red_list to float**
  store float* %r1, float** %12, align 8
  %13 = bitcast [1 x i8*]* %.omp.reduction.red_list to i8*
  %14 = call i32 @__kmpc_reduce_nowait(%struct.ident_t* nonnull @1, i32 %5, i32 1, i64 8, i8* nonnull %13, void (i8*, i8*)* nonnull @.omp.reduction.reduction_func, [8 x i32]* nonnull @.gomp_critical_user_.reduction.var) #4
  switch i32 %14, label %.omp.reduction.default [
    i32 1, label %.omp.reduction.case1
    i32 2, label %.omp.reduction.case2
  ]

.omp.reduction.case1:                             ; preds = %omp.dispatch.end
  %15 = load float, float* %r, align 4, !tbaa !2
  %16 = load float, float* %r1, align 4, !tbaa !2
  %add5 = fadd float %15, %16
  store float %add5, float* %r, align 4, !tbaa !2
  call void @__kmpc_end_reduce_nowait(%struct.ident_t* nonnull @1, i32 %5, [8 x i32]* nonnull @.gomp_critical_user_.reduction.var) #4
  br label %.omp.reduction.default

.omp.reduction.case2:                             ; preds = %omp.dispatch.end
  %17 = bitcast float* %r to i32*
  %atomic-load = load atomic i32, i32* %17 monotonic, align 4, !tbaa !2
  %18 = load float, float* %r1, align 4, !tbaa !2
  br label %atomic_cont

atomic_cont:                                      ; preds = %atomic_cont, %.omp.reduction.case2
  %19 = phi i32 [ %atomic-load, %.omp.reduction.case2 ], [ %23, %atomic_cont ]
  %20 = bitcast i32 %19 to float
  %add7 = fadd float %18, %20
  %21 = bitcast float %add7 to i32
  %22 = cmpxchg i32* %17, i32 %19, i32 %21 monotonic monotonic
  %23 = extractvalue { i32, i1 } %22, 0
  %24 = extractvalue { i32, i1 } %22, 1
  br i1 %24, label %.omp.reduction.default, label %atomic_cont

.omp.reduction.default:                           ; preds = %atomic_cont, %.omp.reduction.case1, %omp.dispatch.end
  call void @llvm.lifetime.end.p0i8(i64 4, i8* nonnull %4) #4
  call void @llvm.lifetime.end.p0i8(i64 4, i8* nonnull %3) #4
  call void @llvm.lifetime.end.p0i8(i64 4, i8* nonnull %2) #4
  call void @llvm.lifetime.end.p0i8(i64 4, i8* nonnull %1) #4
  call void @llvm.lifetime.end.p0i8(i64 4, i8* nonnull %0) #4
  ret void
}
[...]

performance of avx512 bit ops and popcounts

as requested, I am opening an issue.
somalier calculates relatedness between pairs of samples using bitwise operations and popcounts here

where genotypes is effectively:

type genotypes* = tuple[hom_ref:seq[uint64], het:seq[uint64], hom_alt:seq[uint64]]

and currently, those seqs have a len of about 300.
so for 10K samples, doing relatedness for all-vs-all, it will do ~50 million calls to the IBS function linked above.

the "simple" avx512 version is here which is identical in speed to the version currently in somalier

the unrolled version is here. this gives ~10-15% speedup.

this problem is embarrassingly parallel and it's currently single-threaded, so I could also improve speed that way, but I was interested to explore the simd stuff.

I'd be interested to hear your thoughts; maybe the "parallelization" scheme should be changed to load 8 samples at once instead of the current 8 (x64) sites at once.
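
For reference, a minimal sketch of the kind of bitwise kernel being discussed (illustrative only, not somalier's actual IBS code; the field pairing is an assumption):

import bitops

type genotypes* = tuple[hom_ref: seq[uint64], het: seq[uint64], hom_alt: seq[uint64]]

proc sharedHets(a, b: genotypes): int =
  ## Count sites where both samples are heterozygous:
  ## 64 sites per word, via AND + popcount.
  for i in 0 ..< a.het.len:
    result += popcount(a.het[i] and b.het[i])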

Benchmark example using Intel MKL (for history)

(this issue is for history and potential improvements for laser later: especially AVX-512 and dual-port AVX-512)

After chatting for hours with @mratsim about how to benchmark Laser on a 72-thread machine and getting a working MKL setup, here is an example benchmark using Intel MKL. We assume multiple MKL installations, and use the specific version stored in /opt/intel/compilers_and_libraries_2019.0.117 with the following settings:

[image: MKL environment settings]

We also assume you do not have a Nim installation; if you do, you know which lines to skip.

Change the number of threads right at the beginning (OMP_NUM_THREADS). We are using commit 990e59f

source /opt/intel/mkl/bin/mklvars.sh intel64
export OMP_NUM_THREADS=1

curl https://nim-lang.org/choosenim/init.sh -sSf | sh
git clone --recursive git://github.com/numforge/laser
cd laser
git checkout 990e59f
git submodule init
git submodule update

Before compiling, change https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/blas.nim#L5 to the following (change the MKL folders if needed):

  const blas = "libmkl_intel_ilp64.so"
  {.passC: "-I'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include' -L'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin'".}

Change the following here https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/gemm/gemm_bench_float32.nim#L53-L55 to:

  M     = 2304
  K     = 2304
  N     = 2304

Tune the following to your likings, here I used my Dual Xeon Gold 6154 and put 100 repeated computations:

  NbSamples = 100    # This might stress the allocator when packing if the matrices are big
  CpuGhz = 3.7      # Assuming no turbo
  NumCpuCores = 36
  CpuFlopCycle = 32 # AVX2: 2 FMA/cycle = 2 x 8 x 2 (2 FMAs x 8 floats x (1 add + 1 mul))

For the CpuFlopCycle, you need to check the implemented instructions here:

https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_ukernel_avx_fma.nim#L10-L23

Also, tune this to your preference https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_tiling.nim#L234-L235 (I tuned again for my Dual Xeon Gold 6154):

  result.mc = min(768 div T.sizeof, M)
  result.kc = min(4096 div T.sizeof, K)

And now you can compile with MKL (change the MKL folders if needed):

mkdir build
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin nim cpp --dynlibOverride:libmkl_intel_ilp64 --passL:"/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.a -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl" --passC:"-D_GNU_SOURCE -L$MKLROOT/lib/intel64_lin -DMKL_ILP64 -m64" -r -d:release -d:openmp -o:build/bench_gemm benchmarks/gemm/gemm_bench_float32.nim

On a Dual Xeon 6154 setup (36 physical cores / 72 logical cores, 3.7 GHz all turbo), you should get the following:

Tool           FLOPS
Intel MKL      4 TFLOPS
Laser          600 GFLOPS
PyTorch Glow   60 GFLOPS

As you can see, we are nearly reaching the maximum possible theoretical performance:

[image: measured performance vs theoretical peak]

[GEMM] Significant performance regression (divided by 5)

Since #28, which fixed #27, another strange regression appeared, dividing perf by 5:

from a March 23 build

$  ./build/gemm_f32_omp

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Reference loop
Collected 10 samples in 10.421 seconds
Average time: 1041.539 ms
Stddev  time: 3.983 ms
Min     time: 1035.329 ms
Max     time: 1047.674 ms
Perf:         13.591 GFLOP/s

OpenBLAS benchmark
Collected 10 samples in 0.091 seconds
Average time: 8.438 ms
Stddev  time: 6.319 ms
Min     time: 6.240 ms
Max     time: 26.393 ms
Perf:         1677.596 GFLOP/s

Laser production implementation
Collected 10 samples in 0.087 seconds
Average time: 8.035 ms
Stddev  time: 4.186 ms
Min     time: 6.517 ms
Max     time: 19.913 ms
Perf:         1761.855 GFLOP/s

PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 10 samples in 1.900 seconds
Average time: 189.987 ms
Stddev  time: 2.893 ms
Min     time: 188.794 ms
Max     time: 198.044 ms
Perf:         74.509 GFLOP/s

MKL-DNN reference GEMM benchmark
Collected 10 samples in 0.368 seconds
Average time: 36.043 ms
Stddev  time: 5.048 ms
Min     time: 34.275 ms
Max     time: 50.364 ms
Perf:         392.748 GFLOP/s

MKL-DNN JIT AVX benchmark
Collected 10 samples in 0.105 seconds
Average time: 9.758 ms
Stddev  time: 5.933 ms
Min     time: 7.715 ms
Max     time: 26.624 ms
Perf:         1450.731 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 10 samples in 0.088 seconds
Average time: 8.154 ms
Stddev  time: 10.128 ms
Min     time: 4.733 ms
Max     time: 36.938 ms
Perf:         1736.020 GFLOP/s
Mean Relative Error compared to vendor BLAS: 3.045843413929106e-06

From a recent rebuild

$  ./build/gemm_omp_f32

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Laser production implementation
Collected 10 samples in 0.555 seconds
Average time: 54.917 ms
Stddev  time: 5.027 ms
Min     time: 53.250 ms
Max     time: 69.218 ms
Perf:         257.765 GFLOP/s

[GEMM] Enhance serial implementation

With #20, the parallel schedule seems to scale perfectly on many cores:

$  OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 ./build/gemm_f32_serial
Warmup: 0.9036 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    230.400 GFLOP/s
Theoretical peak multi:         4147.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 1.238 seconds
Average time: 123.713 ms
Stddev  time: 0.444 ms
Min     time: 123.335 ms
Max     time: 124.890 ms
Perf:         114.425 GFLOP/s

Laser production implementation
Collected 10 samples in 1.465 seconds
Average time: 146.392 ms
Stddev  time: 0.644 ms
Min     time: 146.006 ms
Max     time: 147.802 ms
Perf:         96.697 GFLOP/s
Mean Relative Error compared to OpenBLAS: 1.243059557509696e-07

------------------------------------------------------------

$  ./build/gemm_f32_omp
Warmup: 0.9021 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    230.400 GFLOP/s
Theoretical peak multi:         4147.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 0.079 seconds
Average time: 7.739 ms
Stddev  time: 4.368 ms
Min     time: 6.020 ms
Max     time: 20.097 ms
Perf:         1829.200 GFLOP/s

Laser production implementation
Collected 10 samples in 0.083 seconds
Average time: 8.126 ms
Stddev  time: 4.777 ms
Min     time: 6.241 ms
Max     time: 21.632 ms
Perf:         1742.123 GFLOP/s
Mean Relative Error compared to OpenBLAS: 0.01456451416015625

with 96.7 GFLOP/s * 18 cores = 1740 GFLOP/s on my machine.

However the single-threaded implementation is still quite often below OpenBLAS.

Causes:

  1. To fix regressions in #20, interleaving the load of the next A micropanel with the computation on the current A micropanel had to be removed and is currently commented out: https://github.com/numforge/laser/blob/ebb01ad40f30d495f0f4b02ef1ff49c3f54230cd/laser/primitives/matrix_multiplication/gemm_ukernel_generator.nim#L237-L242

     It should be reintroduced.

  2. mc and kc should be tuned depending on the available L1 and L2 cache and the TLB (see the sketch below).
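
A minimal sketch of cache-driven tile selection, assuming the usual BLIS heuristic (a kc x nr panel of B resident in L1, an mc x kc block of A resident in L2). The cache sizes and nr value are illustrative constants, not laser's:

const
  L1Size = 32 * 1024    # per-core L1d, illustrative
  L2Size = 1024 * 1024  # per-core L2, illustrative
  NR     = 16           # microkernel width, illustrative

proc tileSizes[T](M, K: int): tuple[mc, kc: int] =
  # Keep a kc x NR packed panel of B in L1 ...
  result.kc = min(L1Size div (NR * sizeof(T)), K)
  # ... and an mc x kc packed block of A in L2.
  result.mc = min(L2Size div (result.kc * sizeof(T)), M)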

Transpose does not scale well with multithreading

Using a Dual Intel Xeon Gold 6154 on commit 990e59f.

Compilation flags used: nim cpp --passC:"-D_GNU_SOURCE" --passL:"-lpthread" -r -d:release -d:openmp -o:build/bench_transpose benchmarks/transpose/transpose_bench.nim

Multithreaded results:

Hint: ./build/bench_transpose  [Exec]
Warmup: 0.9945 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 4000, N: 2000)
Output shape: (M: 2000, N: 4000)
Required number of operations:     8.000 millions
Required bytes:                   32.000 MB
Arithmetic intensity:              0.250 FLOP/byte

Laser ForEachStrided
Collected 250 samples in 0.500 seconds
Average time: 1.518 ms
Stddev  time: 2.158 ms
Min     time: 1.153 ms
Max     time: 25.356 ms
Perf:         5.271 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Naive transpose
Collected 250 samples in 0.266 seconds
Average time: 1.062 ms
Stddev  time: 0.418 ms
Min     time: 0.936 ms
Max     time: 3.818 ms
Perf:         7.530 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Naive transpose - input row iteration
Collected 250 samples in 0.400 seconds
Average time: 1.598 ms
Stddev  time: 2.117 ms
Min     time: 0.969 ms
Max     time: 23.107 ms
Perf:         5.006 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Collapsed OpenMP
Collected 250 samples in 0.411 seconds
Average time: 1.642 ms
Stddev  time: 2.530 ms
Min     time: 0.924 ms
Max     time: 31.653 ms
Perf:         4.871 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Collapsed OpenMP - input row iteration
Collected 250 samples in 0.445 seconds
Average time: 1.781 ms
Stddev  time: 2.011 ms
Min     time: 1.162 ms
Max     time: 24.661 ms
Perf:         4.492 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking
Collected 250 samples in 0.068 seconds
Average time: 0.270 ms
Stddev  time: 0.222 ms
Min     time: 0.239 ms
Max     time: 2.669 ms
Perf:         29.637 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking - input row iteration
Collected 250 samples in 0.179 seconds
Average time: 0.715 ms
Stddev  time: 0.279 ms
Min     time: 0.657 ms
Max     time: 3.240 ms
Perf:         11.184 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling
Collected 250 samples in 0.066 seconds
Average time: 0.265 ms
Stddev  time: 0.159 ms
Min     time: 0.241 ms
Max     time: 2.447 ms
Perf:         30.189 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling - input row iteration
Collected 250 samples in 0.056 seconds
Average time: 0.223 ms
Stddev  time: 0.095 ms
Min     time: 0.203 ms
Max     time: 1.459 ms
Perf:         35.896 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking with Prefetch
Collected 250 samples in 0.069 seconds
Average time: 0.277 ms
Stddev  time: 0.160 ms
Min     time: 0.252 ms
Max     time: 2.446 ms
Perf:         28.844 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling + Prefetch - input row iteration
Collected 250 samples in 0.175 seconds
Average time: 0.698 ms
Stddev  time: 1.759 ms
Min     time: 0.371 ms
Max     time: 18.627 ms
Perf:         11.455 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Production implementation
Collected 250 samples in 0.144 seconds
Average time: 0.574 ms
Stddev  time: 0.975 ms
Min     time: 0.382 ms
Max     time: 12.650 ms
Perf:         13.933 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Without OpenMP: nim cpp --passC:"-D_GNU_SOURCE" --passL:"-lpthread" -r -d:release -o:build/bench_transpose benchmarks/transpose/transpose_bench.nim

Singlethreaded results:

Hint: ./build/bench_transpose  [Exec]
Warmup: 0.9940 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 4000, N: 2000)
Output shape: (M: 2000, N: 4000)
Required number of operations:     8.000 millions
Required bytes:                   32.000 MB
Arithmetic intensity:              0.250 FLOP/byte

Laser ForEachStrided
Collected 250 samples in 9.080 seconds
Average time: 35.957 ms
Stddev  time: 0.289 ms
Min     time: 35.666 ms
Max     time: 37.249 ms
Perf:         0.222 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Naive transpose
Collected 250 samples in 8.580 seconds
Average time: 34.320 ms
Stddev  time: 0.320 ms
Min     time: 32.876 ms
Max     time: 35.604 ms
Perf:         0.233 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Naive transpose - input row iteration
Collected 250 samples in 8.637 seconds
Average time: 34.549 ms
Stddev  time: 0.243 ms
Min     time: 34.378 ms
Max     time: 35.767 ms
Perf:         0.232 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Collapsed OpenMP
Collected 250 samples in 8.674 seconds
Average time: 34.695 ms
Stddev  time: 0.361 ms
Min     time: 33.291 ms
Max     time: 36.134 ms
Perf:         0.231 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Collapsed OpenMP - input row iteration
Collected 250 samples in 8.694 seconds
Average time: 34.775 ms
Stddev  time: 0.339 ms
Min     time: 34.471 ms
Max     time: 36.496 ms
Perf:         0.230 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking
Collected 250 samples in 2.383 seconds
Average time: 9.533 ms
Stddev  time: 0.172 ms
Min     time: 9.345 ms
Max     time: 10.990 ms
Perf:         0.839 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking - input row iteration
Collected 250 samples in 4.512 seconds
Average time: 18.047 ms
Stddev  time: 0.232 ms
Min     time: 17.833 ms
Max     time: 19.423 ms
Perf:         0.443 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling
Collected 250 samples in 3.625 seconds
Average time: 14.498 ms
Stddev  time: 0.236 ms
Min     time: 14.244 ms
Max     time: 15.882 ms
Perf:         0.552 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling - input row iteration
Collected 250 samples in 2.491 seconds
Average time: 9.964 ms
Stddev  time: 0.222 ms
Min     time: 9.820 ms
Max     time: 11.652 ms
Perf:         0.803 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking with Prefetch
Collected 250 samples in 2.583 seconds
Average time: 10.331 ms
Stddev  time: 0.169 ms
Min     time: 9.836 ms
Max     time: 11.829 ms
Perf:         0.774 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling + Prefetch - input row iteration
Collected 250 samples in 2.699 seconds
Average time: 10.796 ms
Stddev  time: 0.216 ms
Min     time: 10.669 ms
Max     time: 12.463 ms
Perf:         0.741 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Production implementation
Collected 250 samples in 2.712 seconds
Average time: 10.849 ms
Stddev  time: 0.181 ms
Min     time: 10.708 ms
Max     time: 12.350 ms
Perf:         0.737 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318
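
For reference, a minimal sketch of the 2D-tiling transpose pattern benchmarked above (illustrative tile size; laser's production kernel differs):

const Tile = 64  # elements per tile side, illustrative

proc transposeTiled(M, N: int, src: seq[float32], dst: var seq[float32]) =
  ## dst[N x M] = transpose of src[M x N], both row-major.
  ## Tiling keeps both the read and the write streams cache-friendly.
  for ii in countup(0, M-1, Tile):
    for jj in countup(0, N-1, Tile):
      for i in ii ..< min(ii + Tile, M):
        for j in jj ..< min(jj + Tile, N):
          dst[j*M + i] = src[i*N + j]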

System Profile Dual Xeon Gold 6154

Using: https://github.com/Mysticial/Flops/tree/master

Compare also scaling with: https://github.com/numforge/laser/blob/master/benchmarks/system_profile_i9-9980XE.md

Turbo table:

Threads   Non-AVX   AVX   AVX2   AVX-512
1         3.7       3.7   3.6    3.5
72        3.7       3.7   3.3    2.7

Compile flags:

flags="-O3 -pthread -std=c++11"

~/Documents/FLOPS/Flops/version3/binaries-linux$ ./2017-SkylakePurley 
Running Skylake Purley tuned binary with 1 thread...

Single-Precision - 128-bit AVX - Add/Sub
    GFlops = 29.6
    Result = 3.77062e+06

Double-Precision - 128-bit AVX - Add/Sub
    GFlops = 14.8
    Result = 1.89082e+06

Single-Precision - 128-bit AVX - Multiply
    GFlops = 29.616
    Result = 3.76264e+06

Double-Precision - 128-bit AVX - Multiply
    GFlops = 14.808
    Result = 1.88361e+06

Single-Precision - 128-bit AVX - Multiply + Add
    GFlops = 28.464
    Result = 3.01437e+06

Double-Precision - 128-bit AVX - Multiply + Add
    GFlops = 14.232
    Result = 1.49273e+06

Single-Precision - 128-bit FMA3 - Fused Multiply Add
    GFlops = 59.232
    Result = 3.75951e+06

Double-Precision - 128-bit FMA3 - Fused Multiply Add
    GFlops = 29.616
    Result = 1.88064e+06

Single-Precision - 256-bit AVX - Add/Sub
    GFlops = 56.448
    Result = 7.13658e+06

Double-Precision - 256-bit AVX - Add/Sub
    GFlops = 28.224
    Result = 3.60599e+06

Single-Precision - 256-bit AVX - Multiply
    GFlops = 56.544
    Result = 7.1869e+06

Double-Precision - 256-bit AVX - Multiply
    GFlops = 28.224
    Result = 3.57487e+06

Single-Precision - 256-bit AVX - Multiply + Add
    GFlops = 54.24
    Result = 5.74009e+06

Double-Precision - 256-bit AVX - Multiply + Add
    GFlops = 27.12
    Result = 2.85433e+06

Single-Precision - 256-bit FMA3 - Fused Multiply Add
    GFlops = 112.896
    Result = 7.17821e+06

Double-Precision - 256-bit FMA3 - Fused Multiply Add
    GFlops = 56.448
    Result = 3.58505e+06

Single-Precision - 512-bit AVX512 - Add/Sub
    GFlops = 109.568
    Result = 1.38866e+07

Double-Precision - 512-bit AVX512 - Add/Sub
    GFlops = 54.912
    Result = 6.97365e+06

Single-Precision - 512-bit AVX512 - Multiply
    GFlops = 109.824
    Result = 1.39655e+07

Double-Precision - 512-bit AVX512 - Multiply
    GFlops = 54.912
    Result = 6.95621e+06

Single-Precision - 512-bit AVX512 - Multiply + Add
    GFlops = 109.824
    Result = 1.16442e+07

Double-Precision - 512-bit AVX512 - Multiply + Add
    GFlops = 54.912
    Result = 5.8363e+06

Single-Precision - 512-bit AVX512 - Fused Multiply Add
    GFlops = 219.648
    Result = 1.39502e+07

Double-Precision - 512-bit AVX512 - Fused Multiply Add
    GFlops = 109.44
    Result = 6.91304e+06


Running Skylake Purley tuned binary with 72 thread(s)...

Single-Precision - 128-bit AVX - Add/Sub
    GFlops = 1061.66
    Result = 1.34525e+08

Double-Precision - 128-bit AVX - Add/Sub
    GFlops = 529.344
    Result = 6.71256e+07

Single-Precision - 128-bit AVX - Multiply
    GFlops = 1059.94
    Result = 1.34484e+08

Double-Precision - 128-bit AVX - Multiply
    GFlops = 530.76
    Result = 6.73558e+07

Single-Precision - 128-bit AVX - Multiply + Add
    GFlops = 1060.75
    Result = 1.12212e+08

Double-Precision - 128-bit AVX - Multiply + Add
    GFlops = 530.016
    Result = 5.6014e+07

Single-Precision - 128-bit FMA3 - Fused Multiply Add
    GFlops = 2124.67
    Result = 1.3479e+08

Double-Precision - 128-bit FMA3 - Fused Multiply Add
    GFlops = 1060.51
    Result = 6.73101e+07

Single-Precision - 256-bit AVX - Add/Sub
    GFlops = 1896.51
    Result = 2.404e+08

Double-Precision - 256-bit AVX - Add/Sub
    GFlops = 945.344
    Result = 1.19742e+08

Single-Precision - 256-bit AVX - Multiply
    GFlops = 1893.7
    Result = 2.40188e+08

Double-Precision - 256-bit AVX - Multiply
    GFlops = 949.776
    Result = 1.20462e+08

Single-Precision - 256-bit AVX - Multiply + Add
    GFlops = 1895.42
    Result = 2.00145e+08

Double-Precision - 256-bit AVX - Multiply + Add
    GFlops = 946.608
    Result = 1.0001e+08

Single-Precision - 256-bit FMA3 - Fused Multiply Add
    GFlops = 3793.15
    Result = 2.40581e+08

Double-Precision - 256-bit FMA3 - Fused Multiply Add
    GFlops = 1897.92
    Result = 1.20378e+08

Single-Precision - 512-bit AVX512 - Add/Sub
    GFlops = 3106.56
    Result = 3.94345e+08

Double-Precision - 512-bit AVX512 - Add/Sub
    GFlops = 1549.82
    Result = 1.96823e+08

Single-Precision - 512-bit AVX512 - Multiply
    GFlops = 3101.95
    Result = 3.94153e+08

Double-Precision - 512-bit AVX512 - Multiply
    GFlops = 1557.89
    Result = 1.97929e+08

Single-Precision - 512-bit AVX512 - Multiply + Add
    GFlops = 3114.62
    Result = 3.29293e+08

Double-Precision - 512-bit AVX512 - Multiply + Add
    GFlops = 1555.2
    Result = 1.64751e+08

Single-Precision - 512-bit AVX512 - Fused Multiply Add
    GFlops = 6241.54
    Result = 3.96636e+08

Double-Precision - 512-bit AVX512 - Fused Multiply Add
    GFlops = 3111.17
    Result = 1.97388e+08
