Comments (3)
T4 GPU doesn't support BF16 matmul
It actually does, but it wouldn't use Tensor Cores and would be incredibly slow
XLA switches BF16 matmul to F32 matmul on T4
This is a fairly recent change I made; you could try to find the commit that introduced it. Without that change, matmuls are >4x slower from what I recall (depending on the shape)
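The rewrite described above can be sketched in plain Python. This is a simplification, not XLA's implementation: bf16 is emulated here by truncating a float32 to its top 16 bits (real conversions round to nearest even), and all helper names are mine.

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Emulate bf16 by keeping the top 16 bits of the float32 encoding
    (simplified truncation; hardware rounds to nearest even)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(bits: int) -> float:
    """Widen bf16 back to f32. This is exact: every bf16 value is an f32."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

def bf16_matmul_via_f32(a, b):
    """Sketch of the rewrite: upcast both bf16 operands to f32,
    then run an ordinary f32 matmul."""
    n, k, m = len(a), len(b), len(b[0])
    a32 = [[bf16_bits_to_f32(f32_to_bf16_bits(v)) for v in row] for row in a]
    b32 = [[bf16_bits_to_f32(f32_to_bf16_bits(v)) for v in row] for row in b]
    return [[sum(a32[i][t] * b32[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]
```

Since the widening step is exact, the f32 matmul produces the same results a native bf16-input, f32-accumulate matmul would; the cost is the extra conversion work.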
If I understand correctly, "BF16" matmul should be the same performance as F32.
Why would it? T4 has neither vector nor TensorCore support for BF16, so it has to emulate it, slowly.
Or do you mean on T4? On T4, you can look at the GPU profile.
Here the problem is we use Triton for fusions, which recently dropped support for pre-Ampere GPUs (or at least they aren't officially supported). Without fusions, we need to run an extra kernel to cast from BF16 to F32, which can be as expensive as the matmul itself.
from xla.
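The cost of that extra cast kernel can be put into a back-of-the-envelope model. The T4-like bandwidth and FLOP numbers below are illustrative assumptions, not measurements:

```python
# Sketch: why an unfused bf16->f32 cast can cost as much as the matmul it
# feeds. The hardware numbers are assumptions for illustration only.
BYTES_BF16, BYTES_F32 = 2, 4
BANDWIDTH = 320e9   # assumed memory bandwidth, bytes/s
F32_FLOPS = 8e12    # assumed peak f32 throughput, FLOP/s

def cast_time(elems: int) -> float:
    """Standalone cast kernel: memory-bound, reads bf16 and writes f32."""
    return elems * (BYTES_BF16 + BYTES_F32) / BANDWIDTH

def matmul_time(n: int, k: int, m: int) -> float:
    """Max of compute-bound and memory-bound estimates for an f32 matmul."""
    compute = 2 * n * k * m / F32_FLOPS
    memory = BYTES_F32 * (n * k + k * m + n * m) / BANDWIDTH
    return max(compute, memory)

# Small output, long contraction dimension: under this model, casting the
# two operands takes longer than the matmul itself.
n, k, m = 128, 4096, 128
casts = cast_time(n * k) + cast_time(k * m)
mm = matmul_time(n, k, m)
```

If the cast is fused into the matmul, the 6 bytes per element of standalone cast traffic disappears and only the matmul's own time remains.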
Why would it?
Sorry, misspoke a bit. I meant that I'd expect that the emulation on T4 would be in the ballpark of (or at least not slower than) F32. But it sounds like it could be slower than F32 because of the extra cast?
Yes. Since we support CUTLASS fusions, I might look into supporting that fusion (cast into matmul) via CUTLASS.
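A cast-into-matmul fusion of this kind can be sketched in plain Python (bf16 again emulated by truncating float32 to its top 16 bits; the function names are mine, not CUTLASS's): the widening happens per element inside the matmul loop, so no separate cast kernel runs and no f32 copy of the operands is materialized.

```python
import struct

def load_bf16_as_f32(x: float) -> float:
    """Emulate loading a bf16 value and widening it to f32 in-register
    (bf16 approximated by truncating float32 to its top 16 bits)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", (bits >> 16) << 16))[0]

def fused_cast_matmul(a, b):
    """Cast-into-matmul fusion: widen each operand element as it is loaded,
    instead of running a standalone cast pass over a and b first."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(load_bf16_as_f32(a[i][t]) * load_bf16_as_f32(b[t][j])
                 for t in range(k))
             for j in range(m)]
            for i in range(n)]
```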