Comments (3)
T4 GPU doesn't support BF16 matmul
It actually does, but it wouldn't use Tensor Cores and would be incredibly slow
XLA switches BF16 matmul to F32 matmul on T4
This is a fairly recent change I made; you could try to find the commit that introduced it. Without that change, matmuls are >4x slower from what I recall (depending on the shape)
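The rewrite described above can be sketched in plain Python. This is a simplification, not XLA's implementation: bf16 is emulated here by truncating a float32 to its top 16 bits (real conversions round to nearest even), and all helper names are mine.

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Emulate bf16 by keeping the top 16 bits of the float32 encoding
    (simplified truncation; hardware rounds to nearest even)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(bits: int) -> float:
    """Widen bf16 back to f32. This is exact: every bf16 value is an f32."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

def bf16_matmul_via_f32(a, b):
    """Sketch of the rewrite: upcast both bf16 operands to f32,
    then run an ordinary f32 matmul."""
    n, k, m = len(a), len(b), len(b[0])
    a32 = [[bf16_bits_to_f32(f32_to_bf16_bits(v)) for v in row] for row in a]
    b32 = [[bf16_bits_to_f32(f32_to_bf16_bits(v)) for v in row] for row in b]
    return [[sum(a32[i][t] * b32[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]
```

Since the widening step is exact, the f32 matmul produces the same results a native bf16-input, f32-accumulate matmul would; the cost is the extra conversion work.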
If I understand correctly, "BF16" matmul should be the same performance as F32.
Why would it? T4 has neither vector nor TensorCore support for BF16, so it has to emulate it, slowly.
Or do you mean on T4? On T4, you can look at the GPU profile.
Here the problem is we use Triton for fusions, which recently dropped support for pre-Ampere GPUs (or at least they aren't officially supported). Without fusions, we need to run an extra kernel to cast from BF16 to F32, which can be as expensive as the matmul itself.
from xla.
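The cost of that extra cast kernel can be put into a back-of-the-envelope model. The T4-like bandwidth and FLOP numbers below are illustrative assumptions, not measurements:

```python
# Sketch: why an unfused bf16->f32 cast can cost as much as the matmul it
# feeds. The hardware numbers are assumptions for illustration only.
BYTES_BF16, BYTES_F32 = 2, 4
BANDWIDTH = 320e9   # assumed memory bandwidth, bytes/s
F32_FLOPS = 8e12    # assumed peak f32 throughput, FLOP/s

def cast_time(elems: int) -> float:
    """Standalone cast kernel: memory-bound, reads bf16 and writes f32."""
    return elems * (BYTES_BF16 + BYTES_F32) / BANDWIDTH

def matmul_time(n: int, k: int, m: int) -> float:
    """Max of compute-bound and memory-bound estimates for an f32 matmul."""
    compute = 2 * n * k * m / F32_FLOPS
    memory = BYTES_F32 * (n * k + k * m + n * m) / BANDWIDTH
    return max(compute, memory)

# Small output, long contraction dimension: under this model, casting the
# two operands takes longer than the matmul itself.
n, k, m = 128, 4096, 128
casts = cast_time(n * k) + cast_time(k * m)
mm = matmul_time(n, k, m)
```

If the cast is fused into the matmul, the 6 bytes per element of standalone cast traffic disappears and only the matmul's own time remains.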
Why would it?
Sorry, misspoke a bit. I meant that I'd expect that the emulation on T4 would be in the ballpark of (or at least not slower than) F32. But it sounds like it could be slower than F32 because of the extra cast?
Yes. Since we support CUTLASS fusions, I might look into supporting that fusion (cast into matmul) via CUTLASS.
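A cast-into-matmul fusion of this kind can be sketched in plain Python (bf16 again emulated by truncating float32 to its top 16 bits; the function names are mine, not CUTLASS's): the widening happens per element inside the matmul loop, so no separate cast kernel runs and no f32 copy of the operands is materialized.

```python
import struct

def load_bf16_as_f32(x: float) -> float:
    """Emulate loading a bf16 value and widening it to f32 in-register
    (bf16 approximated by truncating float32 to its top 16 bits)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", (bits >> 16) << 16))[0]

def fused_cast_matmul(a, b):
    """Cast-into-matmul fusion: widen each operand element as it is loaded,
    instead of running a standalone cast pass over a and b first."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(load_bf16_as_f32(a[i][t]) * load_bf16_as_f32(b[t][j])
                 for t in range(k))
             for j in range(m)]
            for i in range(n)]
```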