tensorbfs / cutropicalgemm.jl
The fastest Tropical number matrix multiplication on GPU
License: MIT License
Items to be compared.
BenchmarkTools is not working correctly:
julia> using TropicalNumbers, CUDA, BenchmarkTools, LinearAlgebra, CuTropicalGEMM
julia> a = Tropical.(CUDA.randn(4096, 4096));
julia> @btime $a * $a;
3.375 μs (7 allocations: 256 bytes)
julia> @benchmark $a * $a
BenchmarkTools.Trial: 158 samples with 8 evaluations.
Range (min … max): 3.554 μs … 1.733 s ┊ GC (min … max): 0.00% … 0.07%
Time (median): 3.976 μs ┊ GC (median): 0.00%
Time (mean ± σ): 13.475 ms ± 137.779 ms ┊ GC (mean ± σ): 0.06% ± 0.01%
█ ▄
█▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▄
3.55 μs Histogram: log(frequency) by time 13.5 ms <
Memory estimate: 256 bytes, allocs estimate: 7.
Comparing with the results obtained directly from the C-CUDA tests, the result of @benchmark is the correct one.
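The tiny @btime number is explained by the asynchronous nature of CUDA kernel launches: without a device synchronization, only the launch overhead is measured, not the kernel itself. A minimal sketch of how to obtain a correct timing, assuming the same setup as above:

```julia
using TropicalNumbers, CUDA, BenchmarkTools, CuTropicalGEMM

a = Tropical.(CUDA.randn(4096, 4096))

# Measures only the asynchronous kernel launch (microseconds).
@btime $a * $a

# Blocks until the kernel actually finishes, giving the real wall time.
@btime CUDA.@sync($a * $a)
```

This is the standard CUDA.jl benchmarking idiom; the long tail in the @benchmark histogram above is the occasional sample that happens to wait for the device.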
I am benchmarking an application on CUDA v12.2, but CuTropicalGEMM complains that it cannot find the binary. Checking TropicalGemmC_jll, it seems the relevant binary was not built.
I am wondering if we can add the relevant binary files to
https://github.com/JuliaBinaryWrappers/TropicalGemmC_jll.jl/tree/main/src/wrappers
Meanwhile, can we provide better error messages? A silent failure is difficult to debug.
In this package we use a padding strategy to handle the boundary elements, as in standard GEMM, and the minimum block size is fixed for both matrix A and matrix B.
As a result, for the narrow matrices that are widely used in tensor network calculations, there will be tons of useless calculations.
For example, when the matrices are much narrower than the minimum block size, most of the padded computation is wasted.
Optimizations for such long and narrow matrices are needed.
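The overhead can be quantified by comparing the padded work with the useful work. The tile sizes below are hypothetical placeholders, not the package's actual block sizes; the point is only how quickly the wasted fraction grows for narrow matrices:

```julia
# Round a dimension up to the next tile multiple.
padded(x, tile) = cld(x, tile) * tile

# Fraction of the padded m×k×n work that is wasted when tiling an
# m×k by k×n product with fixed (hypothetical) tile sizes.
function wasted_fraction(m, k, n; tm=64, tk=32, tn=64)
    full = padded(m, tm) * padded(k, tk) * padded(n, tn)
    used = m * k * n
    return 1 - used / full
end

wasted_fraction(4096, 4096, 4096)  # 0.0: square matrices tile exactly
wasted_fraction(4096, 4096, 2)     # ≈ 0.97: a narrow matrix wastes almost the whole tile
```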
I wish the behavior of the * operation would change upon loading this package. Otherwise, one needs to modify the program to let this package speed up the code.
From book: "Machine Learning: a probabilistic perspective"
That is, * and LinearAlgebra.mul! for TropicalNumbers types should dispatch to the GPU routines.
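One way to get this behavior without touching user code is to overload LinearAlgebra.mul! for tropical CuArrays when the package is loaded, so that the generic * (which calls mul! internally) picks up the fast path. A rough sketch under that assumption; tropical_gemm! is a hypothetical name for the package's internal kernel launcher, not its actual API:

```julia
using LinearAlgebra, CUDA, TropicalNumbers

# Hypothetical internal entry point; the real function name in
# CuTropicalGEMM may differ.
function tropical_gemm!(C, A, B)
    # ... launch the custom tropical GEMM CUDA kernel here ...
    return C
end

# With this method defined, a plain `A * B` on tropical CuMatrix
# operands routes through the fast routine instead of the generic
# GPU matmul.
function LinearAlgebra.mul!(C::CuMatrix{T}, A::CuMatrix{T},
                            B::CuMatrix{T}) where {T<:Tropical}
    return tropical_gemm!(C, A, B)
end
```

Note that this is type piracy on types the package does not own, which is presumably why it is an explicit design decision rather than a one-line change.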
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml
to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix
on this issue.
I'll open a PR within a few hours, please be patient!
A good template to follow (it just underwent a strict JOSS review):
https://github.com/TensorBFS/TensorInference.jl
Some aspects that could be improved:
Use @btime instead of @benchmark.
The benchmark plot is not very good (it follows the bad example in TropicalGEMM); I like the previous bar plot more.
Add a wrapper to allow dispatching to different routines based on the tensor element type. This will make our life much easier, e.g. in tensor network simulation.
This requires migrating the package to TensorBFS.
In https://github.com/hpcgarage/cuASR, they also implement the following extra tropical algebras:
Min-Multiply (as we discussed, it can be replaced by Max-Multiply, hence not very useful)
Min-Max
Max-Min
Or-And
This is a low priority issue.
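All of these algebras share the same loop structure and differ only in the pair of binary operations, so a single kernel parameterized by (⊕, ⊗) and the additive identity would cover them all. A CPU reference sketch of the idea; a GPU kernel would be parameterized the same way:

```julia
# Reference (CPU) semiring matrix multiply: C[i,j] = ⊕ₖ A[i,k] ⊗ B[k,j].
function semiring_matmul(⊕, ⊗, A::AbstractMatrix, B::AbstractMatrix, zero_el)
    m, k = size(A)
    k2, n = size(B)
    @assert k == k2
    C = fill(zero_el, m, n)
    for j in 1:n, l in 1:k, i in 1:m
        C[i, j] = ⊕(C[i, j], ⊗(A[i, l], B[l, j]))
    end
    return C
end

A = [1.0 2.0; 3.0 4.0]
semiring_matmul(min, max, A, A, Inf)              # Min-Max algebra
semiring_matmul(max, min, A, A, -Inf)             # Max-Min algebra
semiring_matmul(|, &, A .> 2, A .> 2, false)      # Or-And algebra
```

The additive identity (Inf for Min-Max, -Inf for Max-Min, false for Or-And) is what the padding elements would have to be initialized to in each case.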
Sorry for the previous chaos; I thought these parts would not be published as part of the package.
The following changes have been made:
The new benchmark result is shown here:
Originally posted by @ArrogantGao in #1 (comment)
I see. I notice that although the code is much faster and all tests pass, the current version still cannot produce the correct result in the following test case.
The output is probabilistic, hence it is very likely that some computation is missing a thread synchronization. Could you help me make the following code produce the correct result?
using GenericTensorNetworks, GenericTensorNetworks.Graphs
using CUDA
g = Graphs.random_regular_graph(200, 3)
optimizer = TreeSA(ntrials=3)
gp = IndependentSet(g; optimizer=optimizer)
contraction_complexity(gp)
@time CUDA.@sync solve(gp, SizeMax(); usecuda=true, T=Float32)
using CuTropicalGEMM
# If you run the following line multiple times, the result changes.
@time CUDA.@sync solve(gp, SizeMax(); usecuda=true, T=Float32)
Originally posted by @GiggleLiu in #9 (comment)
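For reference, the classic pattern behind this kind of run-to-run nondeterminism in CUDA.jl kernels is reading a shared-memory tile before every thread has finished writing it. A toy illustration of the race, not the actual CuTropicalGEMM kernel:

```julia
using CUDA

# A tile of the input is staged in shared memory. Without the
# sync_threads() barrier, a thread may read another thread's slot
# before it has been written, so the output changes between runs.
function tile_kernel(out, A)
    tile = CuStaticSharedArray(Float32, 32)
    i = threadIdx().x
    tile[i] = A[i]
    sync_threads()                    # <- removing this causes the race
    out[i] = tile[i] + tile[33 - i]   # reads a value written by another thread
    return
end

A = CUDA.rand(Float32, 32)
out = CUDA.zeros(Float32, 32)
@cuda threads=32 tile_kernel(out, A)
```

Every sync_threads() must be reached by all threads of the block, so the barrier also cannot sit inside a divergent branch.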
Some files are not used.
travis.yml
Artifacts.toml