Topic: gemm

Something interesting about gemm

👇 Here are 65 public repositories matching this topic...

andylolu2 / simplegemm

gemm,The simplest but fast implementation of matrix multiplication in CUDA.

User: andylolu2

cuda gemm matrix-multiplication

aredden / torch-cublas-hgemm

gemm,PyTorch half precision gemm lib w/ fused optional bias + optional relu/gelu

User: aredden

cuda float16 gemm pytorch

blackccpie / fastconv

gemm,fast 2D convolution implementation benchmark

User: blackccpie

cpp convolution gemm simd avx im2col toeplitz

boooc / implementation-of-a-flexible-and-energy-efficient-accelerator-for-sparse-convolution-neural-network

gemm,A Flexible and Energy Efficient Accelerator For Sparse Convolution Neural Network

User: boooc

accelerator convolutional-neural-networks deep-neural-networks dla eyeriss gemm hardware-accelerator im2col rtl sparse-matrix

bruce-lee-ly / cuda_back2back_hgemm

gemm,Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.

User: bruce-lee-ly

cublas cuda gemm gpu hgemm matrix-multiply nvidia tensor-core back2back-gemm back2back-hgemm

bruce-lee-ly / cuda_hgemm

gemm,Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

User: bruce-lee-ly

cuda gemm matrix-multiply tensor-core hgemm cublas nvidia gpu

bruce-lee-ly / cuda_hgemv

gemm,Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.

User: bruce-lee-ly

cublas cuda gemm gemv gpu hgemm matrix-multiply nvidia tensor-core cuda-core

cambriconeco / bangc_gemm_tutorial

gemm,

User: cambriconeco

cambricon bangc gemm algorithm

cnugteren / clblast

gemm,Tuned OpenCL BLAS

User: cnugteren

blas opencl blas-libraries clblas matrix-multiplication gemm gpu

coffeebeforearch / mmul

gemm,Serial and parallel implementations of matrix multiplication

User: coffeebeforearch

mmul serial matrix-multiplication benchmarks gemm parallel

cp2k / dbcsr

gemm,DBCSR: Distributed Block Compressed Sparse Row matrix library

Organization: cp2k

Home Page: https://cp2k.github.io/dbcsr/

cp2k blas matrix-multiplication gemm cuda sparse-matrix openmp-parallelization mpi hpc linear-algebra

cyrusmsk / gemm_apple

gemm,GEMM on Apple Silicon

User: cyrusmsk

applesilicon benchmark deep-learning gemm m1-mac

deftruth / cuda-learn-notes

gemm,🎉CUDA notes / hand-written CUDA kernels for large models / C++ notes, updated irregularly: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.

User: deftruth

Home Page: https://github.com/DefTruth/cuda-learn-notes

cuda cuda-kernels cuda-programming elementwise flash-attention flash-attention-2 gemm gemv layernorm rmsnorm

dev0x13 / gemm-benchmark-2023

gemm,Benchmarks of some modern (2023) high-performance floating-point GEMM implementations compared to the Mojo language

User: dev0x13

benchmark gemm mojo

digital-nomad-cheng / matmul_cuda_kernel_tvm

gemm,Automatically generates optimized MatMul CUDA kernels using the TVM auto-scheduler.

User: digital-nomad-cheng

gemm gemm-optimization tvm cuda hpc matmul gpu

enp1s0 / cumpsgemm

gemm,Fast SGEMM emulation on Tensor Cores

User: enp1s0

Home Page: https://arxiv.org/abs/2303.08989

cuda gpu fp32 gemm half-precision mixed-precision tensorcore tensorcores

enp1s0 / ozimmu

gemm,FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme

User: enp1s0

Home Page: https://arxiv.org/abs/2306.11975

cuda gemm mixed-precision tensorcore tensorcores

eth-cscs / spla

gemm,Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.

Organization: eth-cscs

linear-algebra gemm cuda rocm mpi

flame / blislab

gemm,BLISlab: A Sandbox for Optimizing GEMM

Organization: flame

gemm matrix-multiplication code-optimization blis

flame / how-to-optimize-gemm

gemm,

Organization: flame

gemm matrix-multiplication gotoblas blis code-optimization

foreverrookie / cuda-opt-samples

gemm,CUDA optimization samples including sgemm, reduce... To be continued.

User: foreverrookie

cuda gemm gpu reduce

hma02 / cublasgemm-benchmark

gemm,Code for benchmarking GPU performance based on cublasSgemm and cublasHgemm

User: hma02

gpu benchmarking gemm cublas gpu-performance cuda

hma02 / cublashgemm-p100

gemm,Code for testing native float16 matrix multiplication performance on Tesla P100 and V100 GPUs based on cublasHgemm

User: hma02

gpu precision float16 half-precision p100 cublas gemm v100

ivishalr / gemm

gemm,Fast matrix multiplication implementation in the C programming language. The algorithm is similar to what NumPy uses to compute dot products.

User: ivishalr

c gemm gemm-optimization matrix-multiplication

jhson989 / fast-conv

gemm,Fast Convolution Implementation via CUDA

User: jhson989

convolution cuda gemm

joerucodes / cuda-gemm-kernel

gemm,My attempt at making a GEMM kernel...

User: joerucodes

cuda cuda-kernels cuda-programming gemm gemm-optimization gemms parallel-computing

kaiserklayton / lpa_cnn

gemm,Low Precision Arithmetic for Convolutional Neural Network Inference

User: kaiserklayton

convolutional-neural-networks deep-learning image-recognition caffe benchmarking 8-bit gemm

karhoutam / cuda-kernels

gemm,Some common CUDA kernel implementations (Not the fastest).

User: karhoutam

cuda-kernels cuda-programming cuda-learning gemm layernorm relu softmax

merledu / magma-si

gemm,Matrix Accelerator Generator for GeMM Operations based on SIGMA Architecture in CHISEL HDL

Organization: merledu

accelerator chisel chisel-generator chisel3 gemm matrix matrix-multiplication

gemm,The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers

User: mratsim

high-performance-computing deep-learning blas gemm convolution jit assembler simd openmp tensor

mz24cn / gemm_optimization

gemm,This repository targets performance optimization of the OpenCL gemm function. It compares the sgemm performance of several libraries (clBLAS, CLBlast, MIOpenGEMM, Intel MKL on CPU, and cuBLAS on CUDA) across different matrix sizes, vendor hardware, and operating systems. Prebuilt x86_64 binaries for MSVC, MinGW, and Linux (CentOS) are provided and work out of the box.

User: mz24cn

blas cublas clblas clblast mkl sgemm gemm-optimization clnet gemm opencl

opennmt / ctranslate2

gemm,Fast inference engine for Transformer models

Organization: opennmt

Home Page: https://opennmt.net/CTranslate2

neural-machine-translation cpp mkl quantization cuda thrust opennmt deep-neural-networks openmp onednn

pminhtam / xnor_conv_pytorch_extension

gemm,XNOR-Net with binary conv2d kernels with XNOR GEMM op, support both CPU and GPU.

User: pminhtam

binary-convolutions binary-op cpp cuda pytorch-extension xnor-convolutions xnor-net binary-neural-networks pytorch gemm

rocm / hipblaslt

gemm,hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library

Organization: rocm

Home Page: https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/index.html

amd assembly blas gemm gpu-computing hip machine-learning matrix-multiplication rocm