tensorbfs / cutropicalgemm.jl Goto Github PK

View Code? Open in Web Editor NEW

7.0 7.0 0.0 1.51 MB

The fastest Tropical number matrix multiplication on GPU

License: MIT License

Julia 100.00%

cuda gemm tropical-algebra

cutropicalgemm.jl's People

Contributors

Stargazers

Watchers

cutropicalgemm.jl's Issues

Register package (after setting up CI)

Setting up a benchmark repo

Items to be compared.

GemmKernels.jl
https://github.com/hpcgarage/cuASR (if compilable)
CUDA.jl fallback

Failure of BenchmarkTools

BenchmarkTools are not working correctly:

julia> using TropicalNumbers, CUDA, BenchmarkTools, LinearAlgebra, CuTropicalGEMM

julia> a = Tropical.(CUDA.randn(4096, 4096));

julia> @btime $a * $a;
  3.375 μs (7 allocations: 256 bytes)

julia> @benchmark $a * $a
BenchmarkTools.Trial: 158 samples with 8 evaluations.
 Range (min … max):   3.554 μs …    1.733 s  ┊ GC (min … max): 0.00% … 0.07%
 Time  (median):      3.976 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.475 ms ± 137.779 ms  ┊ GC (mean ± σ):  0.06% ± 0.01%

  █                                                          ▄
  █▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▄
  3.55 μs       Histogram: log(frequency) by time      13.5 ms <

 Memory estimate: 256 bytes, allocs estimate: 7.

Comparing to results directly from the C-CUDA tests, the result of @ benchmark is correct.

Silent break on [email protected]

I am benchmarking an application on CUDA version v12.2, but the CuTropicalGEMM complaints that it can not find the bianry. Then I check the the TropicalGEMMC_jll, it seems the relevant binary was not built.

I am wondering if we can add relevant binaries files to
https://github.com/JuliaBinaryWrappers/TropicalGemmC_jll.jl/tree/main/src/wrappers

Meanwhile, can we provide better error information? Silent break is difficult to debug.

Optimizations for long and narrow matrices

In this package we are using a padding stragety to handle the boundary elements as that of GEMM, and the minimum size of the block is set as $64 \times 32$ and $32 \times 64$ for matrix A and matrix B.
So that for narrow matrices which are widely used in tensor network calculations, there will be tons of useless calculations.
For example, when the size of the matrices are $4 \times 4 \times 10^6$, what is actually calculated are matrices with size of $64 \times 32 \times 10^6$, and only $\frac{1}{128}$ of these calculation are useful.

Optimizations for such long and narrow matrices are needed.

Overwrite `LinearAlgebra.mul!`

I wish the behavior of * operation can change upon using this package. Otherwise, one needs to modify the program to let this package to speed up the code.

Example:

https://github.com/TensorBFS/TropicalGEMM.jl/blob/5ed2f0ae3e39e82f25ef75c44615a846ac8a5b8c/src/gemm.jl#L146

More semiring algebras

From book: "Machine Learning: a probabilistic perspective"

How to proceed

The number types should be implemented by making a pull request to:
https://github.com/TensorBFS/TropicalNumbers.jl
Then we register a new version of TropicalNumbers.
Bind the tropical matrix multiplications over CuArrays to LinearAlgebra.mul!

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Polish README

A good template to follow is (it just underwent a strict review of JOSS):
https://github.com/TensorBFS/TensorInference.jl

Some aspects that could be improved:

The benchmark should be less messy, like using @btimes instead of @benchmark. The benchmark plot is not very good (following the bad example in TropicalGEMM.). I like more with the previous bar plot.
Missing section: How to contribute.
Warn users that this package is under GPL license and explain why.

Min-Multiply (as we discussed, it can be replaced by Max-Multiply, hence not very useful)
Min-Max
Max-Min
Or-And

This is a low priority issue.

Investigate the performance issues and consider moving to GemmKernels.jl

          Sorry for the previous chaos, I thought these parts will not be publish as part of the package.

The following changes have been made:

The .so file is uploaded to gist as an artifact, so that there no more binary in the repo now.
I relocated all the files into folder src, test and benchmark.
Scripts used for benchmarks are given, including the fall back implementation in CUDA.jl. However I found something strange: it seems that CUDA.@sync do not work when using the function from a .so lib, so I failed the benchmark our code in julia.

The new benchmark result is show here:

Originally posted by @ArrogantGao in #1 (comment)

Unstable result in the GenericTensorNetwork example

          I see. I notice that although the code is much faster and all tests pass, the current version still can not produce the correct result in the following test case.

The output is probabilistic, hence it is very likely you did not sync threads after some computation. Can you please help me to make the following code produce correct result?

using GenericTensorNetworks, GenericTensorNetworks.Graphs
using CUDA
g = Graphs.random_regular_graph(200, 3)
optimizer = TreeSA(ntrials=3)
gp = IndependentSet(g; optimizer=optimizer)
contraction_complexity(gp)
@time CUDA.@sync solve(gp, SizeMax(); usecuda=true, T=Float32)
using CuTropicalGEMM
# If you run the following line multiple times, the result changes.
@time CUDA.@sync solve(gp, SizeMax(); usecuda=true, T=Float32)

Originally posted by @GiggleLiu in #9 (comment)

Cleanup repo

Some files are not used.

travis.yml
Artifacts.toml

tensorbfs / cutropicalgemm.jl Goto Github PK

cutropicalgemm.jl's People

Contributors

Stargazers

Watchers

cutropicalgemm.jl's Issues

Example:

How to proceed

Recommend Projects

Recommend Topics

Recommend Org

Jobs