
Gaius.jl's Introduction

Gaius.jl

Because Caesar.jl was taken


Gaius.jl is a multi-threaded BLAS-like library that uses a divide-and-conquer strategy for parallelism and is built on top of the fantastic LoopVectorization.jl. Gaius spawns threads using Julia's depth-first parallel task runtime, so Gaius's routines may be fearlessly nested inside multi-threaded Julia programs.
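For example, here is a minimal sketch (matrix sizes are arbitrary) of calling Gaius from tasks you spawn yourself:

using Gaius

# Gaius participates in Julia's depth-first task scheduler, so calling it from
# your own spawned tasks composes instead of oversubscribing the machine.
tasks = [Threads.@spawn Gaius.mul(rand(256, 256), rand(256, 256)) for _ in 1:4]
results = fetch.(tasks)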

Gaius is not stable or well tested. Only use it if you're adventurous.

Note: Gaius is not actively maintained and I do not anticipate doing further work on it. However, you may find it useful as a relatively simple playground for learning about the implementation of linear algebra routines.

There are other, more promising projects that may result in a scalable, multi-threaded pure Julia BLAS library such as:

  1. Tullio.jl
  2. Octavian.jl

In general:

  • Octavian is the most performant.
  • Tullio is the most flexible.

Quick Start

julia> using Gaius

julia> Gaius.mul!(C, A, B) # (multi-threaded) multiply A×B and store the result in C (overwriting the contents of C)

julia> Gaius.mul(A, B) # (multi-threaded) multiply A×B and return the result

julia> Gaius.mul_serial!(C, A, B) # (single-threaded) multiply A×B and store the result in C (overwriting the contents of C)

julia> Gaius.mul_serial(A, B) # (single-threaded) multiply A×B and return the result

Remember to start Julia with multiple threads, e.g. via one of the following (a quick sanity check is shown after this list):

  • julia -t auto
  • julia -t 4
  • Set the JULIA_NUM_THREADS environment variable to 4 before starting Julia
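A quick way to confirm that threads are actually available (plain Base Julia, not a Gaius API):

julia> Threads.nthreads() # should be greater than 1 for the multi-threaded routines to help
4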

The functions in this list are part of the public API of Gaius:

  • Gaius.mul!
  • Gaius.mul
  • Gaius.mul_serial!
  • Gaius.mul_serial

All other functions are internal (private).

Matrix Multiplication

Currently, fast, native matrix multiplication is only implemented between matrices of types Matrix{<:Union{Float64, Float32, Int64, Int32, Int16}} and StructArray{Complex}. Support for other commonly encountered numeric struct types such as Rational and Dual numbers is planned.

Using Gaius


Gaius defines the public functions Gaius.mul and Gaius.mul!. Gaius.mul is used like the regular * operator between two matrices, whereas Gaius.mul! takes in three matrices C, A, B and stores A*B in C, overwriting the contents of C.

The functions Gaius.mul and Gaius.mul! use multithreading. If you want the single-threaded variants, use Gaius.mul_serial and Gaius.mul_serial! respectively.

julia> using Gaius, BenchmarkTools, LinearAlgebra

julia> A, B, C = rand(104, 104), rand(104, 104), zeros(104, 104);

julia> @btime mul!($C, $A, $B); # from LinearAlgebra
  68.529 μs (0 allocations: 0 bytes)

julia> @btime Gaius.mul!($C, $A, $B); # from Gaius
  31.220 μs (80 allocations: 10.20 KiB)

julia> using Gaius, BenchmarkTools

julia> A, B = rand(104, 104), rand(104, 104);

julia> @btime $A * $B;
  68.949 μs (2 allocations: 84.58 KiB)

julia> @btime let * = Gaius.mul # Locally use Gaius.mul as * operator.
           $A * $B
       end;
  32.950 μs (82 allocations: 94.78 KiB)

julia> versioninfo()
Julia Version 1.4.0-rc2.0
Commit b99ed72c95* (2020-02-24 16:51 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 6

Multi-threading in Gaius works by recursively splitting matrices into sub-blocks to operate on. You can change the matrix sub-block size by calling mul! with the block_size keyword argument. If left unspecified, Gaius will use a (very rough) heuristic to choose a good block size based on the size of the input matrices.

The size heuristics I use are likely not yet optimal for everyone's machines.
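For example, here is a sketch of overriding the heuristic; the value 64 is an arbitrary illustration, not a recommendation:

julia> using Gaius

julia> A, B = rand(2000, 2000), rand(2000, 2000); C = zeros(2000, 2000);

julia> Gaius.mul!(C, A, B)                  # block size chosen by Gaius's heuristic

julia> Gaius.mul!(C, A, B; block_size = 64) # force 64×64 sub-blocks instead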

Complex Numbers


Gaius supports the multiplication of matrices of complex numbers, but they must first be converted explicitly to structs of arrays using StructArrays.jl (otherwise the multiplication will be done by OpenBLAS):

julia> using Gaius, StructArrays, LinearAlgebra, BenchmarkTools

julia> begin
           n = 150
           A = randn(ComplexF64, n, n)
           B = randn(ComplexF64, n, n)
           C = zeros(ComplexF64, n, n)


           SA =  StructArray(A)
           SB =  StructArray(B)
           SC = StructArray(C)

           @btime Gaius.mul!($SC, $SA, $SB)         # Gaius, structs of arrays
           @btime LinearAlgebra.mul!($C, $A, $B)    # OpenBLAS, arrays of structs
           SC ≈ C
       end
   515.587 μs (80 allocations: 10.53 KiB)
   546.481 μs (0 allocations: 0 bytes)
 true

Benchmarks

Floating Point Performance


The following benchmarks were run on this machine:

julia> versioninfo()
Julia Version 1.4.0-rc2.0
Commit b99ed72c95* (2020-02-24 16:51 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 6

and compared to OpenBLAS running with 6 threads (BLAS.set_num_threads(6)). I would be keenly interested in seeing analogous benchmarks on a machine with an AVX512 instruction set and/or Intel's MKL.
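The comparison setup was roughly the following (a sketch, not the exact benchmark script; the matrix size is illustrative):

julia> using Gaius, LinearAlgebra, BenchmarkTools

julia> BLAS.set_num_threads(6); # give OpenBLAS the same six threads

julia> n = 512; A, B = rand(n, n), rand(n, n); C_blas, C_gaius = zeros(n, n), zeros(n, n);

julia> @btime mul!($C_blas, $A, $B);         # OpenBLAS

julia> @btime Gaius.mul!($C_gaius, $A, $B);  # Gaius

julia> C_blas ≈ C_gaius
true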

Float64 Matrix Multiplication

Float32 Matrix Multiplication

Note that these are log-log plots.

Gaius outperforms OpenBLAS over a large range of matrix sizes, but does begin to appreciably fall behind around 800 x 800 matrices for Float64 and 650 x 650 matrices for Float32. I believe there is a large amount of performance left on the table in Gaius and I look forward to beating OpenBLAS for more matrix sizes.

Complex Floating Point Performance


Here is Gaius operating on Complex{Float64} structs-of-arrays competing relatively evenly against OpenBLAS operating on Complex{Float64} arrays-of-structs:

Complex{Float64} Matrix Multiplication

I think with some work, we can do much better.

Integer Performance


These benchmarks (run on the same machine as above) compare Gaius against Julia's generic matrix multiplication implementation (OpenBLAS does not provide integer matmul), which is not multi-threaded.

Int64 Matrix Multiplication

Int32 Matrix Multiplication

Note that these are log-log plots.

Benchmarks performed on a machine with the AVX512 instruction set show an even greater performance gain.

If you find yourself in a high performance situation where you want to multiply matrices of integers, I think this provides a compelling use-case for Gaius, since it will outperform its competition at any matrix size and, for large matrices, will benefit from multi-threading.

Other BLAS Routines

I have not yet worked on implementing other standard BLAS routines with this strategy, but doing so should be relatively straightforward.

Safety

If you must break the law, do it to seize power; in all other cases observe it.

-Gaius Julius Caesar

If you use only the functions Gaius.mul!, Gaius.mul, Gaius.mul_serial!, and Gaius.mul_serial, automatic array size-checking will occur before the matrix multiplication begins. This can be turned off in mul! by calling Gaius.mul!(C, A, B; sizecheck=false), in which case no size checks are performed and all sorts of bad, segfaulty things can happen.
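For example, a minimal sketch of the two call styles:

julia> using Gaius

julia> A, B = rand(100, 50), rand(50, 80); C = zeros(100, 80);

julia> Gaius.mul!(C, A, B);                  # sizes are checked before multiplying

julia> Gaius.mul!(C, A, B; sizecheck=false); # skips the check; only safe when you already know the sizes agree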

All other functions in this package are to be considered internal and should not be expected to check for safety or obey the law. The functions Gaius.gemm_kernel! and Gaius.add_gemm_kernel! may be of utility, but be warned that they do not check array sizes.

Gaius.jl's Issues

Analyzing matrix-matrix multiplication performance degradation in "more balanced" branch

This is the benchmark backing up the reason why I rolled back matrix-matrix multiplication in #83 (05417dc).

Here's the "speedup" of the target (7b4ba9a) with respect to the baseline (42b2820). A value smaller than 1 indicates the target is slower:

[figure: speedup of target relative to baseline]

Here's a more "raw data" plot, showing the speedup for each version with respect to the performance in julia --threads=1:

[figure: speedup of each version relative to julia --threads=1]

Analyzing the task DAG

The above result was puzzling to me, so I tried to figure out why. But the analysis itself confuses me more. I'm not sure if it's useful to share a confusing analysis, but, just as a record, here's where I'm stuck now.

I used TaskDAGAnalyzers.jl to analyze the DAG created from matrix-matrix multiplication. You can find the full script at: https://gist.github.com/tkf/94b1a4110d90f26251c1d2de4999cab2

NOTE: I'm not sure if TaskDAGAnalyzers.jl is doing the right thing ATM, partially because its output is puzzling. So, take it with a grain of salt.

If I run Gaius.jl's GEMM with three 256x256 matrices, it reports the following metrics for the target (c8a1e13, which is equivalent to the first commit of #83):

TaskDAGAnalyzers.summary:
work: 1.6 ms (single-thread run-time T₁)
span: 109 μs (theoretical fastest run-time T∞)
parallelism (work/span): 14.872990869043585

and for baseline (08da3a7):

TaskDAGAnalyzers.summary:
work: 4.5 ms (single-thread run-time T₁)
span: 80 μs (theoretical fastest run-time T∞)
parallelism (work/span): 56.691479003345485

The result that the span is shorter in the baseline is compatible with the above measurements. OTOH, the baseline has a larger amount of work. If I believe these numbers, I'd expect the target to perform well at least at low thread counts.
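For reference, the parallelism figures above are just work divided by span (T₁ / T∞); recomputing them from the rounded numbers shown gives roughly the same values (the small differences presumably come from rounding in the displayed work and span):

julia> round(1.6e-3 / 109e-6; digits=2)   # target
14.68

julia> round(4.5e-3 / 80e-6; digits=2)    # baseline
56.25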

DAG dump

The shape of the task DAG is as expected, in that the target has fewer sequential dependencies.

[figures: task DAG dumps for the baseline (08da3a7) and the target (c8a1e13)]

Gaius broken due to Static.StaticInt not being an integer

Gaius._print_num_threads_warning expects integers. But as of Static v0.7, Static.StaticInt is no longer an Integer subtype. Therefore, Gaius is currently broken:

julia> using Gaius
[ Info: Precompiling Gaius [bffe22d1-cb55-4f4e-ac2c-f4dd4bf58912]
ERROR: InitError: MethodError: no method matching _print_num_threads_warning(::Static.StaticInt{6}, ::Int64)
Closest candidates are:
  _print_num_threads_warning(::Integer, ::Integer) at ~/.julia/packages/Gaius/xYgac/src/init.jl:12
during initialization of module Gaius

This is the related issue on Static.jl: SciML/Static.jl#73

Possible solutions would be to make Static.StaticInt <: Integer again, or to explicitly support Static.StaticInt in the corresponding Gaius functions.
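As a rough, untested sketch of the second option, one could add a forwarding method that peels off the StaticInt before hitting the existing ::Integer method (argument names here are invented, and the signature simply mirrors the error message above; this assumes the usual Int(::StaticInt) conversion):

import Static

# Hypothetical forwarding method inside Gaius, not a tested patch:
function _print_num_threads_warning(nthreads::Static.StaticInt, nblocks::Integer)
    return _print_num_threads_warning(Int(nthreads), nblocks)
end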

Implement PointerVector to Support BLAS2 functions

I want to eventually add in some BLAS2 support (i.e. matrix vector products) but to do so I'll need to make a PointerVector type.

Most of it is pretty straightforward, just copy-pasting definitions from PointerMatrix with slight tweaks, but I'm not sure about the maybeview function. As is, the matrix definition is:

@inline Base.maybeview(A::PointerMatrix, r::UnitRange, c::UnitRange) = PointerMatrix(gesp(A.ptr, (first(r) - 1, (first(c) - 1))), (length(r), length(c)))

but I'm not exactly sure what gesp (https://github.com/chriselrod/VectorizationBase.jl/blob/master/src/vectorizable.jl#L262) does. Could you advise @chriselrod?
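For reference, a direct (untested) guess at the vector analogue, mirroring the matrix definition above, would be something like the following; whether this is the right use of gesp is exactly the open question:

# Untested guess at the PointerVector version of maybeview:
@inline Base.maybeview(v::PointerVector, r::UnitRange) =
    PointerVector(gesp(v.ptr, (first(r) - 1,)), length(r))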

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Support Int16

Great package!
Is there something that limits this package to Int64 and Int32? Would it be possible to also support Int16?

name typo?

In src/kernels.jl, line 108 defines function add_gemm_kernel(. Should this be function add_gemm_kernel!(?

Improve Performance

Here I wanted to talk about my experiments in PaddedMatrices.
I'd also suggest that if Gaius's performance were at least as good, I could use it and have PaddedMatrices mostly just define array types. My discussion of matmul here however will focus on Julia's base array type, rather than anything defined in PaddedMatrices -- which is why it'd make just as much sense for that code to live in a separate repo.
Although the implementation does currently make use of the PtrArray type defined in PaddedMatrices.

Additionally, OpenBLAS's release notes suggested AVX512 and AVX2 should both see much better performance with release 0.3.9 than 0.3.7 (which Julia is currently on), so I checked out this PR, which upgrades Julia's OpenBLAS to 0.3.9:
JuliaLang/julia#35113
Hopefully it'll get merged to master soon (looks like AArch64 is having problems).

I'm seeing a phenomenal performance improvement by OpenBLAS on that release.

PaddedMatrices currently provides two different implementation approaches: jmul! and jmulh!. It also offers jmult!, which is a threaded version of jmul!.
The functions aren't too complicated. It should be straightforward to copy them here, if you'd like to change approaches.

Here is a plot of percent theoretical performance vs size:
[figure: percent of theoretical peak performance vs. matrix size]

Interestingly, beyond around 200x200, OpenBLAS was now the fastest by a fairly comfortable margin.
It also looks like my implementation needs work. Aside from being slower than both OpenBLAS and MKL, it seems to slowly decline as size increases beyond 1000x1000, while OpenBLAS improves steadily and MKL sporadically.

But it's a marked improvement over Gaius on my computer -- hovering around 75% of peak CPU, while Gaius declined to barely more than a third of that by the largest tested sizes.

Script to generate the plot:

using Gaius, StructArrays, LinearAlgebra, BenchmarkTools
using PaddedMatrices: jmul!
BLAS.set_num_threads(1); Base.Threads.nthreads()

randa(::Type{T}, dim...) where {T} = rand(T, dim...)
randa(::Type{T}, dim...) where {T <: Signed} = rand(T(-100):T(200), dim...)

const LIBDIRECTCALLJIT = "/home/chriselrod/.julia/dev/LoopVectorization/benchmark/libdcjtest.so"
istransposed(x) = false
istransposed(x::Adjoint) = true
istransposed(x::Transpose) = true
mkl_set_num_threads(N::Integer) = ccall((:set_num_threads, LIBDIRECTCALLJIT), Cvoid, (Ref{UInt32},), Ref(N % UInt32))
function mklmul!(C::AbstractVecOrMat{Float32}, A::AbstractVecOrMat{Float32}, B::AbstractVecOrMat{Float32})
    M, N = size(C); K = size(B, 1)
    ccall(
        (:sgemmjit, LIBDIRECTCALLJIT), Cvoid,
        (Ptr{Float32},Ptr{Float32},Ptr{Float32},Ref{UInt32},Ref{UInt32},Ref{UInt32},Ref{Bool},Ref{Bool}),
        parent(C), parent(A), parent(B),
        Ref(M % UInt32), Ref(K % UInt32), Ref(N % UInt32),
        Ref(istransposed(A)), Ref(istransposed(B))
    )
end
function mklmul!(C::AbstractVecOrMat{Float64}, A::AbstractVecOrMat{Float64}, B::AbstractVecOrMat{Float64})
    M, N = size(C); K = size(B, 1)
    ccall(
        (:dgemmjit, LIBDIRECTCALLJIT), Cvoid,
        (Ptr{Float64},Ptr{Float64},Ptr{Float64},Ref{UInt32},Ref{UInt32},Ref{UInt32},Ref{Bool},Ref{Bool}),
        parent(C), parent(A), parent(B),
        Ref(M % UInt32), Ref(K % UInt32), Ref(N % UInt32),
        Ref(istransposed(A)), Ref(istransposed(B))
    )
end

mkl_set_num_threads(1)

function runbench(::Type{T}) where {T}
    (StructVector ∘ map)([2, 4, 8:8:128..., round.(Int, (10:65) .^2.2)...]) do sz
        n, k, m = sz, sz, sz
        C1 = zeros(T, n, m)
        C2 = zeros(T, n, m)
        C3 = zeros(T, n, m)
        C4 = zeros(T, n, m)
        A  = randa(T, n, k)
        B  = randa(T, k, m)

        opb = @elapsed mul!(C1, A, B)
        if 2opb < BenchmarkTools.DEFAULT_PARAMETERS.seconds
            opb = min(opb, @belapsed mul!($C1, $A, $B))         #samples=100
        end
        lvb = @elapsed blocked_mul!(C2, A, B)
        if 2lvb < BenchmarkTools.DEFAULT_PARAMETERS.seconds
            lvb = min(lvb, @belapsed blocked_mul!($C2, $A, $B)) #samples=100
        end
        @assert C1 ≈ C2
        pmb = @elapsed jmul!(C3, A, B)
        if 2pmb < BenchmarkTools.DEFAULT_PARAMETERS.seconds
            pmb = min(pmb, @belapsed jmul!($C3, $A, $B))         #samples=100
        end
        @assert C1 ≈ C3
        if T <: Integer
            @show (matrix_size=sz, lvBLAS=lvb, OpenBLAS=opb, PaddedMatrices = pmb)
        else
            mklb = @elapsed mklmul!(C4, A, B)
            if 2mklb < BenchmarkTools.DEFAULT_PARAMETERS.seconds
                mklb = min(mklb, @belapsed mklmul!($C4, $A, $B))         #samples=100
            end
            @assert C1 ≈ C4
            @show (matrix_size=sz, lvBLAS=lvb, OpenBLAS=opb, PaddedMatrices = pmb, MKL = mklb)
        end
    end
end
tf64 = runbench(Float64);
tf32 = runbench(Float32);
ti64 = runbench(Int64);
ti32 = runbench(Int32);

gflops(sz, st) = 2e-9 * sz^3 /st
using VectorizationBase: REGISTER_SIZE, FMA3
# I don't know how to query GHz;
# Your best bet would be to check your bios
# Alternatives are to look up your CPU model or `watch -n1 "cat /proc/cpuinfo | grep MHz"`
# Boosts and avx downclocking complicate it.
const GHz = 4.1 
const W64 = REGISTER_SIZE ÷ sizeof(Float64) # vector width
const FMA_RATIO = FMA3 ? 2 : 1
const INSTR_PER_CLOCK = 2 # I don't know how to query this, but true for most recent CPUs
const PEAK_DGFLOPS = GHz * W64 * FMA_RATIO * INSTR_PER_CLOCK

using DataFrames, VegaLite
function create_df(res)
    df = DataFrame(
        Size = res.matrix_size,
        Gaius = res.lvBLAS,
        PaddedMatrices = res.PaddedMatrices,
        OpenBLAS = res.OpenBLAS,
        MKL = res.MKL
    );
    dfs = stack(df, [:Gaius, :PaddedMatrices, :OpenBLAS, :MKL], variable_name = :Library, value_name = :Time);
    dfs.GFLOPS = gflops.(dfs.Size, dfs.Time);
    dfs.Percent_Peak = 100 .* dfs.GFLOPS ./ PEAK_DGFLOPS;
    dfs
end

res = create_df(tf64)
plt = res |> @vlplot(
    :line, color = :Library,
    x = {:Size, scale={type=:log}}, y = {:Percent_Peak},#, scale={type=:log}},
    width = 900, height = 600
)
save(joinpath(PICTURES, "gemmf64.png"), plt)

It could use some work on threading. It eventually hung and then crashed on interrupt, which is a known issue with multi-threading on Julia master.

julia> tf64 = runbench(Float64);
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2, lvBLAS = 2.3133534136546185e-8, OpenBLAS = 1.61001001001001e-8, PaddedMatrices = 2.432831325301205e-8, MKL = 9.716261879619851e-8)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 4, lvBLAS = 2.709547738693467e-8, OpenBLAS = 1.1704939626783753e-7, PaddedMatrices = 2.841608040201005e-8, MKL = 1.4730528846153845e-7)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 8, lvBLAS = 4.726619433198381e-8, OpenBLAS = 1.4663221153846155e-7, PaddedMatrices = 5.7628687690742625e-8, MKL = 1.6828853754940714e-7)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 16, lvBLAS = 1.292370203160271e-7, OpenBLAS = 2.966718146718147e-7, PaddedMatrices = 1.4987575392038602e-7, MKL = 2.406112469437653e-7)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 24, lvBLAS = 2.877228464419476e-7, OpenBLAS = 5.840718232044199e-7, PaddedMatrices = 3.277236842105263e-7, MKL = 4.06315e-7)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 32, lvBLAS = 6.516871165644171e-7, OpenBLAS = 1.0868e-6, PaddedMatrices = 7.048827586206896e-7, MKL = 8.12183908045977e-7)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 40, lvBLAS = 5.594e-6, OpenBLAS = 1.9357e-6, PaddedMatrices = 1.2344e-6, MKL = 1.3357e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 48, lvBLAS = 6.238e-6, OpenBLAS = 2.9292222222222223e-6, PaddedMatrices = 2.477444444444444e-6, MKL = 2.0276666666666667e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 56, lvBLAS = 5.435e-6, OpenBLAS = 4.584571428571428e-6, PaddedMatrices = 3.721e-6, MKL = 3.086625e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 64, lvBLAS = 5.819e-6, OpenBLAS = 6.3324e-6, PaddedMatrices = 5.145333333333333e-6, MKL = 3.002e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 72, lvBLAS = 1.8404e-5, OpenBLAS = 1.1107e-5, PaddedMatrices = 7.32275e-6, MKL = 3.200625e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 80, lvBLAS = 2.0688e-5, OpenBLAS = 2.1968e-5, PaddedMatrices = 9.406e-6, MKL = 3.353125e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 88, lvBLAS = 1.9789e-5, OpenBLAS = 2.0916e-5, PaddedMatrices = 1.3432e-5, MKL = 3.8824285714285714e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 96, lvBLAS = 1.8742e-5, OpenBLAS = 2.2784e-5, PaddedMatrices = 1.7908e-5, MKL = 4.099428571428571e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 104, lvBLAS = 2.8341e-5, OpenBLAS = 2.3449e-5, PaddedMatrices = 2.0834e-5, MKL = 4.346714285714285e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 112, lvBLAS = 2.9761e-5, OpenBLAS = 2.3784e-5, PaddedMatrices = 2.5557e-5, MKL = 4.823285714285715e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 120, lvBLAS = 3.4951e-5, OpenBLAS = 2.3948e-5, PaddedMatrices = 3.3876e-5, MKL = 4.676857142857143e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 128, lvBLAS = 3.4679e-5, OpenBLAS = 2.4699e-5, PaddedMatrices = 4.0365e-5, MKL = 5.17e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 158, lvBLAS = 4.6798e-5, OpenBLAS = 5.6155e-5, PaddedMatrices = 8.2639e-5, MKL = 1.0184e-5)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 195, lvBLAS = 6.4276e-5, OpenBLAS = 6.7389e-5, PaddedMatrices = 0.000157312, MKL = 1.3662e-5)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 237, lvBLAS = 0.000102349, OpenBLAS = 8.5743e-5, PaddedMatrices = 0.000276445, MKL = 2.4355e-5)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 282, lvBLAS = 0.000129655, OpenBLAS = 9.3392e-5, PaddedMatrices = 0.000390776, MKL = 3.1123e-5)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 332, lvBLAS = 0.000178715, OpenBLAS = 0.000143093, PaddedMatrices = 0.00061251, MKL = 4.6239e-5)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 387, lvBLAS = 0.000251371, OpenBLAS = 0.000233954, PaddedMatrices = 0.000880218, MKL = 8.6194e-5)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 446, lvBLAS = 0.000341801, OpenBLAS = 0.000334483, PaddedMatrices = 0.001141395, MKL = 0.000148388)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 509, lvBLAS = 0.000471455, OpenBLAS = 0.000403684, PaddedMatrices = 0.001459838, MKL = 0.000218345)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 578, lvBLAS = 0.000601629, OpenBLAS = 0.000588513, PaddedMatrices = 0.00173781, MKL = 0.000281904)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 651, lvBLAS = 0.000865018, OpenBLAS = 0.000740441, PaddedMatrices = 0.002665009, MKL = 0.000396309)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 728, lvBLAS = 0.000850863, OpenBLAS = 0.000860389, PaddedMatrices = 0.003238482, MKL = 0.000501772)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 811, lvBLAS = 0.001391089, OpenBLAS = 0.001634212, PaddedMatrices = 0.00392088, MKL = 0.000721424)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 898, lvBLAS = 0.001917808, OpenBLAS = 0.001884708, PaddedMatrices = 0.004728201, MKL = 0.000931751)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 990, lvBLAS = 0.002527852, OpenBLAS = 0.002173313, PaddedMatrices = 0.005980197, MKL = 0.001318157)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1088, lvBLAS = 0.003033522, OpenBLAS = 0.00255263, PaddedMatrices = 0.00694296, MKL = 0.001701482)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1190, lvBLAS = 0.004519446, OpenBLAS = 0.003599941, PaddedMatrices = 0.008287474, MKL = 0.002416439)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1297, lvBLAS = 0.005846859, OpenBLAS = 0.004166108, PaddedMatrices = 0.010642048, MKL = 0.002729079)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1409, lvBLAS = 0.007676825, OpenBLAS = 0.005031821, PaddedMatrices = 0.012124711, MKL = 0.003461087)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1527, lvBLAS = 0.009942827, OpenBLAS = 0.006043474, PaddedMatrices = 0.014811819, MKL = 0.004338835)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1649, lvBLAS = 0.01252906, OpenBLAS = 0.007397781, PaddedMatrices = 0.016704083, MKL = 0.005259515)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1777, lvBLAS = 0.015304498, OpenBLAS = 0.008571394, PaddedMatrices = 0.019045784, MKL = 0.006528608)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1910, lvBLAS = 0.019378427, OpenBLAS = 0.009448816, PaddedMatrices = 0.022792723, MKL = 0.008063704)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2048, lvBLAS = 0.051137142, OpenBLAS = 0.010611481, PaddedMatrices = 0.025143273, MKL = 0.010806103)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2191, lvBLAS = 0.030902612, OpenBLAS = 0.013496804, PaddedMatrices = 0.030446266, MKL = 0.011869482)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2340, lvBLAS = 0.040938443, OpenBLAS = 0.015844881, PaddedMatrices = 0.033634225, MKL = 0.015386527)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2494, lvBLAS = 0.047911804, OpenBLAS = 0.018747578, PaddedMatrices = 0.040595921, MKL = 0.018101961)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2654, lvBLAS = 0.058224365, OpenBLAS = 0.022381999, PaddedMatrices = 0.044077997, MKL = 0.020562614)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2819, lvBLAS = 0.073470818, OpenBLAS = 0.027041187, PaddedMatrices = 0.050594931, MKL = 0.025797043)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2989, lvBLAS = 0.090583819, OpenBLAS = 0.030988165, PaddedMatrices = 0.056894353, MKL = 0.030153094)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 3165, lvBLAS = 0.10973272, OpenBLAS = 0.037257813, PaddedMatrices = 0.067711555, MKL = 0.03372817)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 3346, lvBLAS = 0.13072293, OpenBLAS = 0.042420543, PaddedMatrices = 0.078059161, MKL = 0.037653038)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 3533, lvBLAS = 0.171608963, OpenBLAS = 0.053996652, PaddedMatrices = 0.084271564, MKL = 0.047052213)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 3725, lvBLAS = 0.212338202, OpenBLAS = 0.063929694, PaddedMatrices = 0.126757958, MKL = 0.054310823)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 3923, lvBLAS = 0.247940048, OpenBLAS = 0.074915892, PaddedMatrices = 0.158934426, MKL = 0.063976922)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 4127, lvBLAS = 0.295284462, OpenBLAS = 0.085741933, PaddedMatrices = 0.175556075, MKL = 0.070077551)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 4336, lvBLAS = 0.315173005, OpenBLAS = 0.095162078, PaddedMatrices = 0.196636965, MKL = 0.08202249)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 4551, lvBLAS = 0.426154178, OpenBLAS = 0.108096275, PaddedMatrices = 0.221490565, MKL = 0.100128825)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 4771, lvBLAS = 0.452505485, OpenBLAS = 0.124190155, PaddedMatrices = 0.245447973, MKL = 0.105866294)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 4997, lvBLAS = 0.504627015, OpenBLAS = 0.146173103, PaddedMatrices = 0.277543906, MKL = 0.125505135)
^Cfatal: error thrown and no exception handler available.
InterruptException()
jl_mutex_unlock at /home/chriselrod/Documents/languages/julia-old/src/locks.h:143 [inlined]
jl_task_get_next at /home/chriselrod/Documents/languages/julia-old/src/partr.c:441
poptaskref at ./task.jl:702
wait at ./task.jl:709 [inlined]
task_done_hook at ./task.jl:444
jl_apply at /home/chriselrod/Documents/languages/julia-old/src/julia.h:1687 [inlined]
jl_finish_task at /home/chriselrod/Documents/languages/julia-old/src/task.c:198
start_task at /home/chriselrod/Documents/languages/julia-old/src/task.c:697
unknown function (ip: (nil))

Results for the largest size:

julia> M = round(Int, 65^2.2)
9737

julia> A = rand(M,M); B = rand(M,M); C1 = Matrix{Float64}(undef,M,M); C2 = similar(C1); C3 = similar(C1); C4 = similar(C1);

julia> @time blocked_mul!(C1, A, B);
  5.515995 seconds (184.73 k allocations: 22.786 MiB, 0.16% gc time)

julia> @time jmult!(C2, A, B);
  1.578601 seconds (172.36 k allocations: 21.697 MiB)

julia> @time mul!(C3, A, B);
  0.890681 seconds

julia> @time mklmul!(C4, A, B);
  0.889272 seconds

julia> C1 ≈ C2 ≈ C3 ≈ C4
true

Associated GFLOPS (peak is min(num_cores, Threads.nthreads()) times the single threaded peak; that was 18x in this case):

julia> gflops(M, 5.515995) |> x -> (x, 100x/PEAK_DGFLOPS) # Gaius
(334.7199838118055, 14.17344104894163)

julia> gflops(M, 1.578601) |> x -> (x, 100x/PEAK_DGFLOPS) # PaddedMatrices
(1169.5886149229605, 49.52526316577576)

julia> gflops(M, 0.890681) |> x -> (x, 100x/PEAK_DGFLOPS) # OpenBLAS
(2072.923703442647, 87.77624083005789)

julia> gflops(M, 0.889272) |> x -> (x, 100x/PEAK_DGFLOPS) # MKL
(2076.208131039772, 87.91531720188738)

So OpenBLAS and MKL were both close to 88% at this point.
PaddedMatrices was at just under half the theoretical peak, and Gaius was at about 14%.
A lot of room for improvement in both cases.

PaddedMatrices is very slow to start using threads. In the copy and paste from when the benchmarks were running, Gaius was faster until 2191 x 2191 matrices, where they were equally fast, but by 4997 x 4997, PaddedMatrices was approaching twice as fast, and at 9737 it was three times faster -- but still far behind OpenBLAS and MKL.

MKL did very well with multi-threading, even at relatively small sizes.
jmult! could be modified to ramp up thread use more intelligently.

The approach is summarized by this graphic:
[figure: the five BLIS GEMM loops around the microkernel]
For which you can thank the BLIS team.
LoopVectorization takes care of the microkernel, and 2 loops around it.
So the jmul! and jmult! functions add the other 3 loops (and also handle the remainders).
The function PaddedMatrices.matmul_params returns the mc, kc, and nc.
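For readers unfamiliar with that blocking scheme, here is a rough, self-contained Julia sketch of the loop structure (this is not the PaddedMatrices code; mc, kc, and nc are placeholder block sizes, and the microkernel is written as a naive triple loop where LoopVectorization would normally generate SIMD code):

# Rough sketch of the five-loop GEMM blocking structure described above.
function blocked_gemm!(C, A, B; mc = 72, kc = 256, nc = 600)
    M, N = size(C)
    K = size(A, 2)
    fill!(C, zero(eltype(C)))
    for jc in 1:nc:N                          # loop 5: column blocks of B and C
        for pc in 1:kc:K                      # loop 4: blocks of the shared dimension
            for ic in 1:mc:M                  # loop 3: row blocks of A and C
                # loops 2 and 1 plus the microkernel, shown here as a plain
                # triple loop for clarity
                for j in jc:min(jc + nc - 1, N), k in pc:min(pc + kc - 1, K)
                    b = B[k, j]
                    @simd for i in ic:min(ic + mc - 1, M)
                        @inbounds C[i, j] += A[i, k] * b
                    end
                end
            end
        end
    end
    return C
end

Threading loop 3 and/or loop 5 (as jmult! does) then amounts to spawning tasks over the ic or jc iterations.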

I've also compiled BLIS, so I'll add it to future benchmarks as well.

The above BLIS loop link also talks about threading. BLIS threads any of the loops aside from the k loop (I don't think you can multi-thread that one). PaddedMatrices.jmult! threads both the 3rd and 5th. However, rather than threading in a smart way (i.e., in a manner reflecting the total number of threads), it spawns a new task per iteration. When there are a lot of iterations, that's an excessive number of high-overhead tasks.
On the other hand, the outermost loop takes steps of size 600 on my computer -- which also means it takes a long time before it uses many threads at all.

Would you be in favor of replacing the namesake recursive implementation with this nested loop-based GEMM implementation?

Bikeshedding: A better package name?

Love the package(!) but not the name, to be honest. I think we can find something better!

I like the Caesar quote. If we want to relate to it, I'd prefer Caesar.jl over Gaius.jl.

Other names that came to my mind were Bolt.jl, a la Usain Bolt, the fastest man on the planet, and Sonic.jl, in reference to Sonic the Hedgehog. But I'll try to think about further alternatives.
