
[Q&A] Cutlass and contributing (flashinfer issue, closed)

jeromeku commented on June 28, 2024

[Q&A] Cutlass and contributing

from flashinfer.

Comments (3)

yzh119 commented on June 28, 2024

Sure, we are exploring CuTe, and we believe it's the best way to use TMA.

The main reason we are still sticking with our custom implementation is that we haven't figured out how to use TMA for sparse memory loading (e.g. in paged attention prefill). We also have an ongoing effort to support AMD GPUs, and I suppose porting CUTLASS 3.0 code to ROCm might be hard (please correct me if I'm wrong).

I'll gradually replace some of the existing code with higher-level abstractions over the next few months, and yes, we welcome your contributions.
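
To make the sparse-loading concern concrete, here is a minimal sketch of paged KV-cache addressing. It assumes a page table that maps logical pages to physical pages in a shared pool. `PAGE_SIZE`, `token_row`, and `contiguous_runs` are hypothetical names for illustration and are not FlashInfer's API. The point is only that a tile of consecutive tokens can land in several disjoint address ranges, while a TMA tiled copy, as far as I understand it, describes a regularly strided box.

```python
# Hypothetical layout: a paged KV cache with fixed-size pages.
PAGE_SIZE = 4  # tokens per page (assumed for illustration)


def token_row(page_table, token_idx, page_size=PAGE_SIZE):
    """Map a logical token index to its row in the physical KV pool."""
    page = page_table[token_idx // page_size]  # indirection through the page table
    return page * page_size + token_idx % page_size


def contiguous_runs(rows):
    """Split physical rows into maximal half-open runs of consecutive rows."""
    runs = []
    for r in rows:
        if runs and r == runs[-1][1]:
            runs[-1][1] = r + 1  # extend the current run
        else:
            runs.append([r, r + 1])  # start a new run
    return [tuple(run) for run in runs]


# One request whose three logical pages live in scattered physical pages.
page_table = [7, 2, 5]
rows = [token_row(page_table, t) for t in range(10)]
runs = contiguous_runs(rows)
print(rows)  # physical row of each logical token
print(runs)  # one (start, end) range per physical page touched
```

Each run would need its own bulk copy (or a gather), which is the part that does not map cleanly onto a single tile descriptor.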


jeromeku commented on June 28, 2024

@yzh119 Thanks for the response!

Is there an easy way to tweak the source install such that only a few of the kernels are compiled (e.g., prefill + decode only, or some subset thereof)? I realize that there are certain environment variables that can be set to limit the template instantiations but haven't found a coarser-grained way of building only select kernel categories.


jeromeku commented on June 28, 2024

@yzh119 never mind: I ended up stripping out certain kernels, then using DEFINE flags and torch.utils.cpp_extension.load to JIT-compile specific kernel instantiations.
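
For anyone trying the same approach, a rough sketch of what this can look like follows. It is not the exact setup used here: the module name, source path, and macro names (`HEAD_DIM`, `ENABLE_FP16`) are hypothetical placeholders, and it assumes the stripped-down sources select instantiations through preprocessor macros. Only `torch.utils.cpp_extension.load` and its `extra_cflags`, `extra_cuda_cflags`, and `extra_include_paths` arguments are real PyTorch API.

```python
def define_flags(defines):
    """Turn {"NAME": value} into -DNAME=value flags (a None value gives -DNAME)."""
    return [f"-D{k}" if v is None else f"-D{k}={v}"
            for k, v in sorted(defines.items())]


def build_load_kwargs(name, sources, defines, include_dirs=()):
    """Assemble keyword arguments for torch.utils.cpp_extension.load."""
    flags = define_flags(defines)
    return dict(
        name=name,
        sources=list(sources),
        extra_cflags=flags,        # passed to the host C++ compiler
        extra_cuda_cflags=flags,   # passed to nvcc
        extra_include_paths=list(include_dirs),
        verbose=True,
    )


def jit_load(**kwargs):
    """Compile and import the extension; needs PyTorch and a CUDA toolchain."""
    from torch.utils.cpp_extension import load  # imported lazily on purpose
    return load(**kwargs)


kwargs = build_load_kwargs(
    name="decode_only",                               # hypothetical module name
    sources=["csrc/decode_only.cu"],                  # hypothetical source file
    defines={"HEAD_DIM": 128, "ENABLE_FP16": None},   # hypothetical macros
)
print(kwargs["extra_cuda_cflags"])
# module = jit_load(**kwargs)  # uncomment on a machine with nvcc available
```

Keeping the flag construction separate from the `load` call makes it easy to check which instantiations will be requested before paying for a compile.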

