Comments (9)
Thanks for the question.
> But why is the computeType "fp32_type" used for hgemm? If I change it back to fp16_type, the performance drops to 38 TFLOPs again.
Following Daine's comments, there are two categories of gemm functions: high-precision accumulate (HPA) functions, where the compute type differs from (and is more precise than) the data type, and non-HPA functions, where all types are the same. For the fp16 data type, i.e. hgemm, where the input and output data types are fp16, you have two options:
1- Non-HPA functions, where the input/output/compute types are all fp16. This function is slow and less accurate, due to the underlying instructions used for this operation. This is what you are using now, and we generally don't recommend users use it.
2- HPA functions (gemm_ex), where the input/output data types are fp16 and the compute type is fp32. This one is fast and more accurate. From the user's perspective, the two options look the same, because the input/output data types are fp16 in both.
> Can SGEMM and CGEMM do the mixed-precision calculation, too?
No. The HPA functions are available only if the input data type is fp16/bf16/int8.
> Do you have a complete and detailed table explaining which type combinations can be used together for A/B/C, just like cublasGemmEx?
Yes, please refer to section 3.7.1 (rocblas-bench) of the updated rocBLAS user guide; I recently added a table with this information.
No, you should ensure that each datatype you provide matches the data actually stored in the corresponding matrix. The pointers are simply cast to the specified datatype (see rocblas_gemm_ex_template() and gemm_ex_typecasting() if you're interested in the code for the rocBLAS backend).
You can also take a look at the cuBLAS documentation and rocBLAS documentation for the supported datatypes and some other info regarding this function.
Thanks. No other questions.
Hi @mathbird,
You're correct in reading that the theoretical peak performance of the MI250X for FP16 is ~383 TFLOPs. This comes from the following calculation:
theoretical perf = (frequency) * (# CUs) * (flop/cycle/cu)
For the MI250X, the peak frequency is 1700 MHz, and each compute die has 110 compute units. For FP16, we have 1024 flops/cycle/cu. This gives: theoretical perf = 1700 MHz * 110 CUs * 1024 flops/cycle/cu = ~191.5 TFLOPs per GCD. Each MI250X has 2 GCDs, giving ~383 TFLOPs theoretical peak performance. Note that a hipBLAS call will only use a single GCD, so the theoretical performance here would more accurately be ~191 TFLOPs. This number is theoretical and can be limited by other factors such as clock throttling.
hipblasHgemm() will not be able to get this performance, but you should be able to get substantially better performance if you use the mixed precision hipblasGemmEx(...) function. With FP16 input/output and FP32 compute, you should see performance more in line with what is expected. An example call would be as follows:
hipblasDatatype_t fp16_type = HIPBLAS_R_16F;  // storage type for A, B, and C
hipblasDatatype_t fp32_type = HIPBLAS_R_32F;  // compute (accumulation) type
status = hipblasGemmEx(handle, transA, transB, m, n, k, alpha,
                       dA, fp16_type, lda,
                       dB, fp16_type, ldb, beta,
                       dC, fp16_type, ldc,
                       fp32_type, HIPBLAS_GEMM_DEFAULT);
You can take a look at the gemmEx documentation in hipblas.h, or feel free to ask any questions you have and I'll be happy to help.
Thanks,
Daine
Thanks, Daine. I did observe a big performance improvement with GemmEx! But why is the computeType "fp32_type" used for hgemm? If I change it back to fp16_type, the performance drops to 38 TFLOPs again.
Can SGEMM and CGEMM do the mixed-precision calculation, too? Do you have a complete and detailed table explaining which type combinations can be used together for A/B/C, just like cublasGemmEx?
Thanks for the useful info. If d_A, d_B, and d_C are all allocated as float (32-bit) using hipMalloc, will the following GemmEx call convert d_A and d_B correctly to bf16 numbers and produce the right d_C?
CHECK_HIPBLAS_ERROR(hipblasGemmEx(handle, transa, transb, M, N, K, (const hipblasHalf*)&alpha,
d_A, HIPBLAS_R_16B, M,
d_B, HIPBLAS_R_16B, K, (const hipblasHalf*)&beta,
d_C, HIPBLAS_R_32F, N,
HIPBLAS_R_32F, HIPBLAS_GEMM_DEFAULT));
@mathbird Do you have any further questions or should we go ahead and close this issue?
I tried the following type combination; it compiled, but the test produced the error "rocBLAS error: HIPBLAS_STATUS_INVALID_ENUM". Did I miss anything? Or can complex single-precision gemm only use HIPBLAS_C_32F inputs on MI250?
CHECK_HIPBLAS_ERROR(hipblasGemmEx(handle, transa, transb, M, N, K, (const hipblasComplex *)&alpha,
d_A, HIPBLAS_C_16B, M,
d_B, HIPBLAS_C_16B, K, (const hipblasComplex*)&beta,
d_C, HIPBLAS_C_32F, N,
HIPBLAS_C_32F, HIPBLAS_GEMM_DEFAULT));
That error code is incorrect; I see we are missing some type conversions in hipBLAS. I'll make a PR for that today.
However, that type combination isn't supported by rocBLAS or cuBLAS, so once the above is fixed, it will just return HIPBLAS_STATUS_NOT_SUPPORTED. What you have is "bfloat16 complex" for A/B and "float complex" for C/compute. For the rocBLAS backend, our only support for HIPBLAS_C_32F in GemmEx is essentially the same as hipblasCgemm(), i.e. a_type = b_type = c_type = compute_type = HIPBLAS_C_32F.