rocm / rocsolver

Next generation LAPACK implementation for ROCm platform

Home Page: https://rocm.docs.amd.com/projects/rocSOLVER/en/latest/

License: Other

CMake 1.34% Shell 0.35% C++ 80.08% C 17.02% Groovy 0.16% Python 0.76% MATLAB 0.30%

lapack linear-algebra rocm

rocsolver's Introduction

rocSOLVER

rocSOLVER is a work-in-progress implementation of a subset of LAPACK functionality on the ROCm platform.

Documentation

For a detailed description of the rocSOLVER library, its implemented routines, the installation process and user guide, see the rocSOLVER documentation.

How to build documentation

Please follow the instructions below to build the documentation.

cd docs

pip3 install -r sphinx/requirements.txt

python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

Building rocSOLVER

To download the rocSOLVER source code, clone this repository with the command:

git clone https://github.com/ROCmSoftwarePlatform/rocSOLVER.git

rocSOLVER requires rocBLAS as a companion GPU BLAS implementation. For more information about rocBLAS and how to install it, see the rocBLAS documentation.

After a standard installation of rocBLAS, the following commands will build rocSOLVER and install to /opt/rocm:

cd rocSOLVER
./install.sh -i

Once installed, rocSOLVER can be used just like any other library with a C API. The header file will need to be included in the user code, and both the rocBLAS and rocSOLVER shared libraries will become link-time and run-time dependencies for the user application.

If you are a developer contributing to rocSOLVER, you may wish to run ./scripts/install-hooks to install the git hooks for autoformatting. You may also want to take a look at the contributing guidelines.

Using rocSOLVER

The following code snippet shows how to compute the QR factorization of a general m-by-n real matrix in double precision using rocSOLVER. A longer version of this example is provided by example_basic.cpp in the samples directory. For a description of the rocsolver_dgeqrf function, see the rocSOLVER API documentation.

/////////////////////////////
// example.cpp source code //
/////////////////////////////

#include <algorithm> // for std::min
#include <stddef.h>  // for size_t
#include <vector>
#include <hip/hip_runtime_api.h> // for hip functions
#include <rocsolver/rocsolver.h> // for all the rocsolver C interfaces and type declarations

int main() {
  rocblas_int M;
  rocblas_int N;
  rocblas_int lda;

  // here is where you would initialize M, N and lda with desired values

  rocblas_handle handle;
  rocblas_create_handle(&handle);

  size_t size_A = size_t(lda) * N;          // the size of the array for the matrix
  size_t size_piv = size_t(std::min(M, N)); // the size of array for the Householder scalars

  std::vector<double> hA(size_A);      // creates array for matrix in CPU
  std::vector<double> hIpiv(size_piv); // creates array for householder scalars in CPU

  double *dA, *dIpiv;
  hipMalloc(&dA, sizeof(double)*size_A);      // allocates memory for matrix in GPU
  hipMalloc(&dIpiv, sizeof(double)*size_piv); // allocates memory for scalars in GPU

  // here is where you would initialize matrix A (array hA) with input data
  // note: matrices must be stored in column major format,
  //       i.e. entry (i,j) should be accessed by hA[i + j*lda]

  // copy data to GPU
  hipMemcpy(dA, hA.data(), sizeof(double)*size_A, hipMemcpyHostToDevice);
  // compute the QR factorization on the GPU
  rocsolver_dgeqrf(handle, M, N, dA, lda, dIpiv);
  // copy the results back to CPU
  hipMemcpy(hA.data(), dA, sizeof(double)*size_A, hipMemcpyDeviceToHost);
  hipMemcpy(hIpiv.data(), dIpiv, sizeof(double)*size_piv, hipMemcpyDeviceToHost);

  // the results are now in hA and hIpiv, so you can use them here

  hipFree(dA);                        // de-allocate GPU memory
  hipFree(dIpiv);
  rocblas_destroy_handle(handle);     // destroy handle
}

The exact command used to compile the example above may vary depending on the system environment, but here is a typical example:

/opt/rocm/bin/hipcc -I/opt/rocm/include -c example.cpp
/opt/rocm/bin/hipcc -o example -L/opt/rocm/lib -lrocsolver -lrocblas example.o
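
The column-major layout mentioned in the example's comments (entry (i,j) stored at hA[i + j*lda]) can be sketched on the host as follows. This is a minimal illustration only: col_major_index and to_col_major are hypothetical helper names, not part of rocSOLVER, and the sketch assumes the input starts out as a dense row-major array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a rocSOLVER function): position of entry (i, j)
// in a column-major array with leading dimension lda.
inline std::size_t col_major_index(int i, int j, int lda) {
  assert(i >= 0 && j >= 0 && i < lda);
  return static_cast<std::size_t>(i) + static_cast<std::size_t>(j) * lda;
}

// Hypothetical helper: copies a dense row-major M-by-N matrix into a
// column-major buffer of size lda * N (padding rows, if any, are zero).
inline std::vector<double> to_col_major(const std::vector<double>& row_major,
                                        int M, int N, int lda) {
  assert(lda >= M);
  assert(row_major.size() == static_cast<std::size_t>(M) * N);
  std::vector<double> hA(static_cast<std::size_t>(lda) * N, 0.0);
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      hA[col_major_index(i, j, lda)] =
          row_major[static_cast<std::size_t>(i) * N + j];
    }
  }
  return hA;
}
```

A buffer produced this way has the layout the example expects for hA before it is copied to the GPU.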

rocsolver's Issues

Poor performance of jacobi eigensolver routine.

We've found the CUDA Jacobi eigensolvers useful for certain matrix size ranges, so I tried the rocSOLVER versions. For rocsolver_dsygvj with N=240 on a Radeon VII, I'm seeing performance an order of magnitude or more worse than LAPACK's dsygvd running on a single CPU core. I enabled profile logging and have included the output from a run that made 4 calls in total to rocsolver_dsygvj. Is this the expected level of performance at this point, or is it possible that something is wrong with my machine's setup?

------- PROFILE -------
rocsolver_sygvj_hegvj_template: Calls: 4, Total Time: 3262.748 ms (in nested functions: 3262.723 ms)
    sygv_update_info: Calls: 4, Total Time: 4.217 ms
    rocsolver_syevj_heevj_template: Calls: 4, Total Time: 2895.013 ms (in nested functions: 2895.002 ms)
        syevj_small_kernel: Calls: 4, Total Time: 2895.002 ms
    rocsolver_trsm_lower_template: Calls: 4, Total Time: 1.259 ms (in nested functions: 1.248 ms)
        conj_nonunit_backward_substitution_kernel: Calls: 4, Total Time: 1.248 ms
    rocsolver_sygst_hegst_template: Calls: 4, Total Time: 121.691 ms (in nested functions: 121.644 ms)
        rocblas_syr2k_template: Calls: 12, Total Time: 6.292 ms
        rocblas_symm_template: Calls: 24, Total Time: 4.878 ms
        rocsolver_trsm_lower_template: Calls: 24, Total Time: 3.027 ms (in nested functions: 2.986 ms)
            nonunit_forward_substitution_kernel: Calls: 12, Total Time: 1.653 ms
            conj_nonunit_forward_substitution_kernel: Calls: 12, Total Time: 1.333 ms
        rocsolver_sygs2_hegs2_template: Calls: 16, Total Time: 107.447 ms (in nested functions: 96.767 ms)
            rocblas_trsv_template: Calls: 944, Total Time: 52.673 ms
            rocblas_syr2_template: Calls: 944, Total Time: 18.072 ms
            rocblas_scal_template: Calls: 944, Total Time: 11.694 ms
            sygs2_set_diag1: Calls: 960, Total Time: 14.328 ms
    rocsolver_potrf_template: Calls: 4, Total Time: 240.543 ms (in nested functions: 240.493 ms)
        rocblas_syrk_template: Calls: 8, Total Time: 154.154 ms
        rocsolver_trsm_lower_template: Calls: 8, Total Time: 1.809 ms (in nested functions: 1.789 ms)
            conj_nonunit_forward_substitution_kernel: Calls: 8, Total Time: 1.789 ms
        chk_positive: Calls: 12, Total Time: 0.175 ms
        rocsolver_potf2_template: Calls: 12, Total Time: 83.880 ms (in nested functions: 82.554 ms)
            rocblas_gemv_template: Calls: 948, Total Time: 32.727 ms
            sqrtDiagOnward: Calls: 960, Total Time: 13.006 ms
            rocblas_dot_template: Calls: 960, Total Time: 21.591 ms
            rocblas_scal_template: Calls: 948, Total Time: 15.074 ms
            reset_info: Calls: 12, Total Time: 0.156 ms
        reset_info: Calls: 12, Total Time: 0.475 ms
iota_n: Calls: 4, Total Time: 0.823 ms
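
To read the profile above, it can help to express each entry as a share of the total. The sketch below uses a hypothetical helper, percent_of_total, with the numbers copied from the log: syevj_small_kernel (2895.002 ms) accounts for roughly 88.7% of the rocsolver_sygvj_hegvj_template total (3262.748 ms), while rocsolver_potrf_template and rocsolver_sygst_hegst_template contribute roughly 7.4% and 3.7%.

```cpp
#include <cassert>

// Hypothetical helper: share of a profile entry relative to the total, in percent.
inline double percent_of_total(double part_ms, double total_ms) {
  assert(total_ms > 0.0);
  return 100.0 * part_ms / total_ms;
}

// Values copied from the profile log above.
constexpr double kTotalMs = 3262.748;       // rocsolver_sygvj_hegvj_template
constexpr double kSyevjKernelMs = 2895.002; // syevj_small_kernel
constexpr double kPotrfMs = 240.543;        // rocsolver_potrf_template
constexpr double kSygstMs = 121.691;        // rocsolver_sygst_hegst_template
```

Calling percent_of_total(kSyevjKernelMs, kTotalMs) shows that nearly all of the time is spent in the single Jacobi kernel rather than in the reduction to standard form.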

SVD incorrect for 1x1 matrices

When rocsolver_sgesvd is called on a 1x1 matrix, the returned V is not correct:

#include <algorithm>
#include <stdio.h>
#include <vector>
#include <hip/hip_runtime_api.h>
#include <rocsolver.h>
#include <iostream>

int main() {

  rocblas_handle handle;
  rocblas_create_handle(&handle);

  float A, U, V;
  rocblas_int info;
  float S, E;

  float *dA, *dU, *dV;
  rocblas_int *dinfo;
  float *dS, *dE;

  hipMalloc(&dA, sizeof(float));
  hipMalloc(&dU, sizeof(float));
  hipMalloc(&dV, sizeof(float));
  hipMalloc(&dS, sizeof(float));
  hipMalloc(&dE, sizeof(float));
  hipMalloc(&dinfo, sizeof(rocblas_int));

  A = 1.23;
  std::cout << "input" << std::endl;
  std::cout << "A = " << A << std::endl;

  hipMemcpy(dA, &A, sizeof(float), hipMemcpyHostToDevice);

  rocsolver_sgesvd(handle, rocblas_svect_all, rocblas_svect_all, 1, 1, dA, 1, dS, dU, 1, dV, 1, dE, rocblas_inplace, dinfo);

  hipMemcpy(&U, dU, sizeof(float), hipMemcpyDeviceToHost);
  hipMemcpy(&V, dV, sizeof(float), hipMemcpyDeviceToHost);
  hipMemcpy(&S, dS, sizeof(float), hipMemcpyDeviceToHost);
  hipMemcpy(&info, dinfo, sizeof(rocblas_int), hipMemcpyDeviceToHost);

  std::cout << std::endl;
  std::cout << "output" << std::endl;
  std::cout << "info = " << info << std::endl;
  std::cout << "U = " << U << std::endl;
  std::cout << "S = " << S << std::endl;
  std::cout << "V = " << V << std::endl;
}

(run on a RX480, ubuntu 20.04.1, rocm 4.0-23, rocblas 2.32.0.2844-cc18d25f, rocsolver 3.10.0.183-6b820a8)

input
A = 1.23

output
info = 0
U = 1
S = 1.23
V = 1.23

Here V should be 1 instead.
The same bug is also present in rocsolver_dgesvd, rocsolver_cgesvd and rocsolver_zgesvd.
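
For reference, the expected result can be checked on the CPU without any GPU library. For a real 1x1 matrix A = a, one valid SVD is U = sign(a), S = |a|, V = 1, so that U * S * V reproduces a. The sketch below is a host-only illustration; Svd1x1, svd_1x1_reference and reconstructs are hypothetical names, and the sign convention is one valid choice rather than necessarily the one rocSOLVER or LAPACK uses.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the three scalars of a 1x1 SVD.
struct Svd1x1 {
  float u, s, v;
};

// One valid SVD of the 1x1 matrix [a]: the sign goes into U, V stays 1.
inline Svd1x1 svd_1x1_reference(float a) {
  Svd1x1 r;
  r.s = std::fabs(a);
  r.u = (a < 0.0f) ? -1.0f : 1.0f;
  r.v = 1.0f;
  return r;
}

// Checks that u * s * v reproduces a up to a relative tolerance.
inline bool reconstructs(const Svd1x1& r, float a, float tol = 1e-6f) {
  assert(tol >= 0.0f);
  return std::fabs(r.u * r.s * r.v - a) <= tol * std::fmax(1.0f, std::fabs(a));
}
```

With the values reported above (U = 1, S = 1.23, V = 1.23), the product is about 1.51 rather than 1.23, so the reconstruction check fails.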

cc @deven-amd

Cannot include rocsolver.h from C code in ROCm 3.7

When using rocsolver.h from C code in ROCm 3.7 (needed just to get the handle type at a minimum), there are many errors related to the pivot = 1 arguments:

/opt/rocm/include/rocsolver-functions.h:2548:76: error: C does not support default arguments
                                                   const rocblas_int pivot = 1);

This did not occur in ROCm 3.5 of course.

Submitted by member of CoE Frontier team.
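
One common way for a shared C/C++ header to offer a default argument to C++ callers while staying valid C is to hide the default behind a macro. The sketch below illustrates that general pattern only; EXAMPLE_DEFAULT_ARG and example_factorize are hypothetical names, and this is not a statement of how rocSOLVER resolved the issue.

```cpp
#include <cassert>

// Hypothetical macro: expands to a default argument in C++ and to nothing in C,
// so the same declaration is accepted by both compilers.
#ifdef __cplusplus
#define EXAMPLE_DEFAULT_ARG(value) = value
extern "C" {
#else
#define EXAMPLE_DEFAULT_ARG(value)
#endif

// Hypothetical declaration standing in for a header entry with a pivot flag.
int example_factorize(int n, const int pivot EXAMPLE_DEFAULT_ARG(1));

#ifdef __cplusplus
}
#endif

// Definition, compiled here as C++; it returns n when pivoting is requested
// and -n otherwise, purely so that the default can be observed.
extern "C" int example_factorize(int n, const int pivot) {
  assert(n >= 0);
  return pivot ? n : -n;
}
```

C++ callers can then write example_factorize(4) and get pivot = 1, while C callers must pass the argument explicitly.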

Possible test data corruption for rocSOLVER (version rocm 4.3.0)

I'm packaging rocSOLVER-4.3.0 for Gentoo and testing the package with OpenBLAS as the CPU reference, and I encountered a segmentation fault (detailed information here). Choosing netlib LAPACK as the reference does not trigger the issue, so at first I suspected a bug in OpenBLAS. After a short investigation with the OpenBLAS team, however, I found that the input data passed to the reference LAPACK appears to be corrupted and causes the segfault (somehow netlib runs normally even with the same suspicious input). So I am turning here for help finding the cause and a solution.

My environment:

kernel(GPU driver): Linux 5.15.8
compiler: gcc-11.2; llvm-rocm-4.3.0 + hipcc-4.3.0
ROCm: all 4.3.0, gentoo packages
googletest: 1.11.0
openblas: 0.3.19
netlib lapack: 3.10.0

Segfault with batched syev/heev from ROCm 4.2.0

Hello, I am a CuPy core contributor working on supporting the new eigensolvers in CuPy, but I found that the batched version would just segfault (the non-batched version works fine and passes our tests). Below is a super minimal reproducer:

#include <iostream>
#include <cstdlib>   // for exit
#include <algorithm> // for std::min
#include <stddef.h>  // for size_t
#include <vector>
#include <hip/hip_runtime_api.h> // for hip functions
#include <rocsolver.h> // for all the rocsolver C interfaces and type declarations

int main() {
  int batch_size = 2;
  rocblas_int M = 3;
  rocblas_int N = 3;
  rocblas_int lda = 3;

  rocblas_handle handle;
  rocblas_create_handle(&handle);

  size_t size_A = size_t(lda) * N;

  // A = [ [1, 2, 3,
  //        2, 5, 4,
  //        3, 4, 9],
  //       [2, 2, 3,
  //        2, 6, 4,
  //        3, 4, 10]]
  std::vector<float> hA = {1, 2, 3, 2, 5, 4, 3, 4, 9, 2, 2, 3, 2, 6, 4, 3, 4, 10}; // creates array for matrix in CPU

  float *dA, *dW, *dE;
  hipMalloc(&dA, sizeof(float)*size_A*batch_size);
  hipMalloc(&dW, sizeof(float)*M*batch_size);
  hipMalloc(&dE, sizeof(float)*M*batch_size);
  int *info;
  hipMalloc(&info, sizeof(int)*batch_size);

  hipMemcpy(dA, hA.data(), sizeof(float)*size_A*batch_size, hipMemcpyHostToDevice);
  rocblas_status status;

  #ifdef DO_BATCH
  // this would segfault
  status = rocsolver_ssyev_batched(handle, rocblas_evect_original, rocblas_fill_upper,
                                   N, reinterpret_cast<float* const*>(dA), lda,
                                   dW, N,
                                   dE, N, info, batch_size);
  if (status != 0) {
    std::cout << "got error: " << status << std::endl;
    exit(1);
  }
  #else
  // but this would not
  status = rocsolver_ssyev(handle, rocblas_evect_original, rocblas_fill_upper,
                           N, dA, lda,
                           dW,
                           dE, info);
  if (status != 0) {
    std::cout << "got error: " << status << std::endl;
    exit(1);
  }
  #endif

  // don't bother doing cleanup as it's unimportant here...
  return 0;
}

When compiled with

hipcc test_batched_syev.cpp -I$ROCM_PATH/include -L$ROCM_PATH/lib -lrocblas -lrocsolver -o test_batched_syev

the non-batched version is called (and operates on the first matrix in A) and runs fine, but when compiled with

hipcc test_batched_syev.cpp -I$ROCM_PATH/include -L$ROCM_PATH/lib -lrocblas -lrocsolver -DDO_BATCH -o test_batched_syev

the batched version is called and segfaults:

$ ./test_batched_syev
:0:rocdevice.cpp            :2533: 47346961354 us: Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION:  The agent attempted to access memory beyond the largest legal address. code: 0x29
Aborted (core dumped)

This happens for all four types (ssyev, dsyev, cheev, zheev), so I assume the bug might be in the type dispatch layer?

cc: @amathews-amd (for awareness)
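
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the rocSOLVER documentation describes the A argument of the _batched routines as an array of pointers (one per matrix) residing on the GPU, whereas the reproducer reinterprets a contiguous buffer of matrix data as that pointer array. The host-only sketch below shows how per-matrix pointers could be derived from a contiguous buffer; make_batch_pointers is a hypothetical helper, not a rocSOLVER function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: builds one pointer per matrix for a batch stored
// contiguously, with `stride` elements between consecutive matrices.
// In a real program `base` would be the device pointer (dA in the reproducer),
// and the resulting pointer array would itself have to be copied into device
// memory (for example with hipMalloc and hipMemcpy) before being passed to a
// _batched routine. That device-side step is not shown here.
inline std::vector<float*> make_batch_pointers(float* base, std::size_t stride,
                                               int batch_count) {
  assert(base != nullptr && batch_count >= 0);
  std::vector<float*> ptrs(static_cast<std::size_t>(batch_count));
  for (int b = 0; b < batch_count; ++b) {
    ptrs[static_cast<std::size_t>(b)] = base + static_cast<std::size_t>(b) * stride;
  }
  return ptrs;
}
```

For data that is already contiguous, the _strided_batched variants (for example rocsolver_ssyev_strided_batched) take a base pointer and a stride directly, which avoids building a pointer array at all.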

`HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION` in `rocsolver_dgels_batched`

I am playing around with the batched dgels implementation. However, I cannot run more than one call. The second call crashes with:

:0:rocdevice.cpp :2614: 1042550426289 us: 1410912: [tid:0x155505b6a700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION: The agent attempted to execute an illegal shader instruction. code: 0x2a

Any ideas? Thanks.

hipcc --version
HIP version: 5.1.20531-cacfa990
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.1.0 22114 5cba46feb6af367b1cafaa183ec42dbfb8207b14)

Source Code

Unable to dlopen librocsolver.so

TensorFlow needs to dynamically load the rocSOLVER lib at runtime (TF does the same for all ROCm libs).

Attempting to do so results in the following error

2021-03-24 14:12:48.355140: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] 
Could not load dynamic library 'librocsolver.so'; 
dlerror: /opt/rocm-4.0.1/lib/librocsolver.so: undefined symbol: _Z20rocblas_ger_templateILb0EfPKfKPffE15rocblas_status_P15_rocblas_handleiiPKT3_lPKT1_iiiSC_iiiPT2_iiii

2021-03-24 14:12:48.355196: F tensorflow/stream_executor/lib/statusor.cc:34] 
Attempting to fetch value instead of handling error Failed precondition: 
Could not load dynamic library 'librocsolver.so'; 
dlerror: /opt/rocm-4.0.1/lib/librocsolver.so: undefined symbol: _Z20rocblas_ger_templateILb0EfPKfKPffE15rocblas_status_P15_rocblas_handleiiPKT3_lPKT1_iiiSC_iiiPT2_iiii

You can reproduce this error using the following simple standalone program

// main.cpp
// /opt/rocm/bin/hipcc -std=c++11 main.cpp -ldl -o test_runner
// ./test_runner
#include <dlfcn.h>
#include <iostream>

#define ROCSOLVER_LIB_NAME "/opt/rocm-4.0.1/lib/librocsolver.so"


void* open_roc_lib(const char* lib_name) {
  void* handle = dlopen(lib_name, RTLD_NOW | RTLD_LOCAL);
  if (!handle) {
    std::cout << dlerror() << "\n";
    return nullptr;
  }
  std::cout << "loaded " << lib_name << "\n";
  return handle;
}

int main() {
  void* rocsolver_handle = open_roc_lib(ROCSOLVER_LIB_NAME);
  if (rocsolver_handle) {
    dlclose(rocsolver_handle); // only close a handle that was actually opened
  }
  return 0;
}

It seems that the rocBLAS library is not in the list of dependencies of the rocSOLVER library. Could that be the cause of this error?

root@prj47-rack-15:/root# ldd /opt/rocm-4.0.1/lib/librocsolver.so
	linux-vdso.so.1 (0x00007fffdccd9000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f28eff6e000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f28efd4f000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f28ef9b1000)
	libamdhip64.so.4 => /opt/rocm-4.0.1/lib/libamdhip64.so.4 (0x00007f28eea2b000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f28ee6a2000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f28ee2b1000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f291a23f000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f28ee0ad000)
	libhsa-runtime64.so.1 => /opt/rocm-4.0.1/lib/libhsa-runtime64.so.1 (0x00007f28edc98000)
	libamd_comgr.so.1 => /opt/rocm-4.0.1/lib/libamd_comgr.so.1 (0x00007f28e7160000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f28e6f58000)
	libhsakmt.so.1 => /opt/rocm-4.0.1/lib/libhsakmt.so.1 (0x00007f28e6d2f000)
	libelf.so.1 => /usr/lib/x86_64-linux-gnu/libelf.so.1 (0x00007f28e6b15000)
	libtinfo.so.5 => /lib/x86_64-linux-gnu/libtinfo.so.5 (0x00007f28e68eb000)
	libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f28e66e0000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f28e64c3000)

/cc @whchung @ekuznetsov139 @sunway513

Change upper bound for fmt dependency on rocsolver spack package

Is the upper bound of the fmt dependency of the rocsolver spack package here for a specific reason or just because the other versions are not tested? https://github.com/spack/spack/blob/develop/var/spack/repos/builtin/packages/rocsolver/package.py#L125
I need fmt@9 for my spack build-env, and it seems to work well when I remove this upper bound, although I didn't test rocsolver exhaustively.

Behaviour with `hipStreamNonBlocking` streams in handles

Is it safe to use streams created with hipStreamNonBlocking in rocblas_handles?

We have a test case using rocsolver_potrf where it seems like using a stream with hipStreamNonBlocking leads to incorrect results. Unfortunately the test case is not particularly small or self-contained at the moment, so before I attempt to minimize the test case I wanted to ask here if it's possible that some (or all) rocsolver functions actually assume that the stream synchronizes with the default stream (as it would without hipStreamNonBlocking)? Possibly for the workspace allocations or similar?

add topic tags in About section

I suggest adding topics such as lapack, linear-algebra, rocm in the About section at https://github.com/ROCm/rocSOLVER , as described at https://docs.github.com/en/[email protected]/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/classifying-your-repository-with-topics

Low performance of xPOTRF.

The Cholesky decomposition doesn't perform well on an MI50.

Using matrices of size 10240, the double precision performance is just 150 GFlop/s, and increasing the matrix size to 20480 decreases the performance further to 110 GFlop/s.
Full output of https://gist.github.com/rasolca/8a302639a75f79bfb3f767a2b4ab3014:

iteration: 0, size: 10240
Perf: 59.6001 GFlop/s
iteration: 1, size: 10240
Perf: 159.634 GFlop/s
iteration: 2, size: 10240
Perf: 159.479 GFlop/s
iteration: 3, size: 10240
Perf: 158.875 GFlop/s
iteration: 4, size: 10240
Perf: 159.327 GFlop/s
iteration: 0, size: 20480
Perf: 110.742 GFlop/s
iteration: 1, size: 20480
Perf: 110.517 GFlop/s
iteration: 2, size: 20480
Perf: 109.787 GFlop/s
iteration: 3, size: 20480
Perf: 109.32 GFlop/s
iteration: 4, size: 20480
Perf: 109.335 GFlop/s

For comparison, an Nvidia P100 reaches ~70% of its peak performance with a 10240 matrix using cuSolver.
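
To relate these rates to wall-clock time, the sketch below assumes the usual leading-order operation count of n^3/3 for the Cholesky factorization of a real n-by-n matrix. Whether the linked benchmark uses exactly this count is an assumption, and potrf_flops and seconds_at_rate are hypothetical helper names.

```cpp
#include <cassert>

// Leading-order flop count for the Cholesky factorization of a real
// n-by-n matrix (assumed here to be n^3 / 3).
inline double potrf_flops(double n) {
  assert(n > 0.0);
  return n * n * n / 3.0;
}

// Time in seconds needed to execute `flops` operations at `gflops_rate` GFlop/s.
inline double seconds_at_rate(double flops, double gflops_rate) {
  assert(gflops_rate > 0.0);
  return flops / (gflops_rate * 1e9);
}
```

Under that assumption, n = 10240 corresponds to about 3.58e11 operations, or roughly 2.25 s per factorization at 159 GFlop/s, and n = 20480 to about 2.86e12 operations, or roughly 26 s at 110 GFlop/s.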

Symmetric eigensolvers missing

I'm not sure if this is an Ubuntu packaging problem or an issue with the synchronization of docs and code, but the ROCm documentation at https://rocmdocs.amd.com/en/latest/ lists symmetric eigensolver routines in rocSOLVER. I installed the 4.1 Ubuntu packages, but the functions are not there.

Release or tag for ROCm 2.9?

Hi,

will there be a release or tag for ROCm 2.9?
This would make packaging easier.

build errors with fmt=8.1.1

Build log

In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:9:
/usr/include/fmt/ranges.h:737:21: error: no matching member function for call to 'format'
                 .format(std::get<sizeof...(T) - N>(value.tuple), ctx);
                 ~^~~~~~
/usr/include/fmt/ranges.h:695:12: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::do_format<fmt::basic_format_context<fmt::appender, char>, 4UL>' requested here
  return do_format(value, ctx,
         ^
/usr/include/fmt/core.h:1282:22: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::format<fmt::basic_format_context<fmt::appender, char>>' requested here
  ctx.advance_to(f.format(*static_cast<qualified_type*>(arg), ctx));
                   ^
/usr/include/fmt/core.h:1261:21: note: in instantiation of function template specialization 'fmt::detail::value<fmt::basic_format_context<fmt::appender, char>>::format_custom_arg<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>, fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>>' requested here
  custom.format = format_custom_arg<
                  ^
/usr/include/fmt/ostream.h:131:15: note: in instantiation of function template specialization 'fmt::make_args_checked<const char *&, const char &, fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>, char[34], char>' requested here
       fmt::make_args_checked<Args...>(format_str, args...));
            ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:184:14: note: in instantiation of function template specialization 'fmt::print<char[34], const char *&, const char &, fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>, char>' requested here
      fmt::print(*bench_os, "./rocsolver-bench -f {} -r {} {}\n", func_name,
           ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:281:13: note: in instantiation of function template specialization 'rocsolver_logger::log_bench<rocblas_complex_num<float>, rocsolver_logvalue<const char *>, rocsolver_logvalue<int>, rocsolver_logvalue<const char *>, rocsolver_logvalue<int>>' requested here
          log_bench<T>(entry.level, func_prefix, func_name, rocsolver_make_logvalue(args)...);
          ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.cpp:11:5: note: in instantiation of function template specialization 'rocsolver_logger::log_enter_top_level<rocblas_complex_num<float>, const char *, int, const char *, int>' requested here
  ROCSOLVER_ENTER_TOP("lacgv", "-n", n, "--incx", incx);
  ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:34:43: note: expanded from macro 'ROCSOLVER_ENTER_TOP'
          rocsolver_logger::instance()->log_enter_top_level<T>(handle, "rocsolver", name, \
                                        ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.cpp:49:12: note: in instantiation of function template specialization 'rocsolver_lacgv_impl<rocblas_complex_num<float>>' requested here
  return rocsolver_lacgv_impl<rocblas_float_complex>(handle, n, x, incx);
         ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logvalue.hpp:40:10: note: candidate function template not viable: 'this' argument has type 'const __tuple_element_t<0UL, tuple<fmt::formatter<rocsolver_logvalue<const char *>>, fmt::formatter<rocsolver_logvalue<int>>, fmt::formatter<rocsolver_logvalue<const char *>>, fmt::formatter<rocsolver_logvalue<int>>>>' (aka 'const fmt::formatter<rocsolver_logvalue<const char *>>'), but method is not marked const
  auto format(rocsolver_logvalue<T> wrapper, FormatCtx& ctx)
       ^
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.cpp:5:
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.hpp:12:
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocblas.hpp:10:
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/init_scalars.hpp:13:
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:9:
/usr/include/fmt/ranges.h:737:21: error: no matching member function for call to 'format'
                 .format(std::get<sizeof...(T) - N>(value.tuple), ctx);
                 ~^~~~~~
/usr/include/fmt/ranges.h:741:14: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::do_format<fmt::basic_format_context<fmt::appender, char>, 3UL>' requested here
    return do_format(value, ctx, std::integral_constant<size_t, N - 1>());
           ^
/usr/include/fmt/ranges.h:695:12: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::do_format<fmt::basic_format_context<fmt::appender, char>, 4UL>' requested here
  return do_format(value, ctx,
         ^
/usr/include/fmt/core.h:1282:22: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::format<fmt::basic_format_context<fmt::appender, char>>' requested here
  ctx.advance_to(f.format(*static_cast<qualified_type*>(arg), ctx));
                   ^
/usr/include/fmt/core.h:1261:21: note: in instantiation of function template specialization 'fmt::detail::value<fmt::basic_format_context<fmt::appender, char>>::format_custom_arg<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>, fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>>' requested here
  custom.format = format_custom_arg<
                  ^
/usr/include/fmt/ostream.h:131:15: note: in instantiation of function template specialization 'fmt::make_args_checked<const char *&, const char &, fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>, char[34], char>' requested here
       fmt::make_args_checked<Args...>(format_str, args...));
            ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:184:14: note: in instantiation of function template specialization 'fmt::print<char[34], const char *&, const char &, fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>, char>' requested here
      fmt::print(*bench_os, "./rocsolver-bench -f {} -r {} {}\n", func_name,
           ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:281:13: note: in instantiation of function template specialization 'rocsolver_logger::log_bench<rocblas_complex_num<float>, rocsolver_logvalue<const char *>, rocsolver_logvalue<int>, rocsolver_logvalue<const char *>, rocsolver_logvalue<int>>' requested here
          log_bench<T>(entry.level, func_prefix, func_name, rocsolver_make_logvalue(args)...);
          ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.cpp:11:5: note: in instantiation of function template specialization 'rocsolver_logger::log_enter_top_level<rocblas_complex_num<float>, const char *, int, const char *, int>' requested here
  ROCSOLVER_ENTER_TOP("lacgv", "-n", n, "--incx", incx);
  ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:34:43: note: expanded from macro 'ROCSOLVER_ENTER_TOP'
          rocsolver_logger::instance()->log_enter_top_level<T>(handle, "rocsolver", name, \
                                        ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.cpp:49:12: note: in instantiation of function template specialization 'rocsolver_lacgv_impl<rocblas_complex_num<float>>' requested here
  return rocsolver_lacgv_impl<rocblas_float_complex>(handle, n, x, incx);
         ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logvalue.hpp:40:10: note: candidate function template not viable: 'this' argument has type 'const __tuple_element_t<1UL, tuple<fmt::formatter<rocsolver_logvalue<const char *>>, fmt::formatter<rocsolver_logvalue<int>>, fmt::formatter<rocsolver_logvalue<const char *>>, fmt::formatter<rocsolver_logvalue<int>>>>' (aka 'const fmt::formatter<rocsolver_logvalue<int>>'), but method is not marked const
  auto format(rocsolver_logvalue<T> wrapper, FormatCtx& ctx)
       ^
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.cpp:5:
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.hpp:12:
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocblas.hpp:10:
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/init_scalars.hpp:13:
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:9:
/usr/include/fmt/ranges.h:737:21: error: no matching member function for call to 'format'
                 .format(std::get<sizeof...(T) - N>(value.tuple), ctx);
                 ~^~~~~~
/usr/include/fmt/ranges.h:741:14: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::do_format<fmt::basic_format_context<fmt::appender, char>, 2UL>' requested here
    return do_format(value, ctx, std::integral_constant<size_t, N - 1>());
           ^
/usr/include/fmt/ranges.h:741:14: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::do_format<fmt::basic_format_context<fmt::appender, char>, 3UL>' requested here
/usr/include/fmt/ranges.h:695:12: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::do_format<fmt::basic_format_context<fmt::appender, char>, 4UL>' requested here
  return do_format(value, ctx,
         ^
/usr/include/fmt/core.h:1282:22: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::format<fmt::basic_format_context<fmt::appender, char>>' requested here
  ctx.advance_to(f.format(*static_cast<qualified_type*>(arg), ctx));
                   ^
/usr/include/fmt/core.h:1261:21: note: in instantiation of function template specialization 'fmt::detail::value<fmt::basic_format_context<fmt::appender, char>>::format_custom_arg<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>, fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>>' requested here
  custom.format = format_custom_arg<
                  ^
/usr/include/fmt/ostream.h:131:15: note: in instantiation of function template specialization 'fmt::make_args_checked<const char *&, const char &, fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>, char[34], char>' requested here
       fmt::make_args_checked<Args...>(format_str, args...));
            ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:184:14: note: in instantiation of function template specialization 'fmt::print<char[34], const char *&, const char &, fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>, char>' requested here
      fmt::print(*bench_os, "./rocsolver-bench -f {} -r {} {}\n", func_name,
           ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:281:13: note: in instantiation of function template specialization 'rocsolver_logger::log_bench<rocblas_complex_num<float>, rocsolver_logvalue<const char *>, rocsolver_logvalue<int>, rocsolver_logvalue<const char *>, rocsolver_logvalue<int>>' requested here
          log_bench<T>(entry.level, func_prefix, func_name, rocsolver_make_logvalue(args)...);
          ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.cpp:11:5: note: in instantiation of function template specialization 'rocsolver_logger::log_enter_top_level<rocblas_complex_num<float>, const char *, int, const char *, int>' requested here
  ROCSOLVER_ENTER_TOP("lacgv", "-n", n, "--incx", incx);
  ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:34:43: note: expanded from macro 'ROCSOLVER_ENTER_TOP'
          rocsolver_logger::instance()->log_enter_top_level<T>(handle, "rocsolver", name, \
                                        ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.cpp:49:12: note: in instantiation of function template specialization 'rocsolver_lacgv_impl<rocblas_complex_num<float>>' requested here
  return rocsolver_lacgv_impl<rocblas_float_complex>(handle, n, x, incx);
         ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logvalue.hpp:40:10: note: candidate function template not viable: 'this' argument has type 'const __tuple_element_t<2UL, tuple<fmt::formatter<rocsolver_logvalue<const char *>>, fmt::formatter<rocsolver_logvalue<int>>, fmt::formatter<rocsolver_logvalue<const char *>>, fmt::formatter<rocsolver_logvalue<int>>>>' (aka 'const fmt::formatter<rocsolver_logvalue<const char *>>'), but method is not marked const
  auto format(rocsolver_logvalue<T> wrapper, FormatCtx& ctx)
       ^
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.cpp:5:
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.hpp:12:
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocblas.hpp:10:
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/init_scalars.hpp:13:
In file included from /home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:9:
/usr/include/fmt/ranges.h:737:21: error: no matching member function for call to 'format'
                 .format(std::get<sizeof...(T) - N>(value.tuple), ctx);
                 ~^~~~~~
/usr/include/fmt/ranges.h:741:14: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::do_format<fmt::basic_format_context<fmt::appender, char>, 1UL>' requested here
    return do_format(value, ctx, std::integral_constant<size_t, N - 1>());
           ^
/usr/include/fmt/ranges.h:741:14: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::do_format<fmt::basic_format_context<fmt::appender, char>, 2UL>' requested here
/usr/include/fmt/ranges.h:741:14: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::do_format<fmt::basic_format_context<fmt::appender, char>, 3UL>' requested here
/usr/include/fmt/ranges.h:695:12: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::do_format<fmt::basic_format_context<fmt::appender, char>, 4UL>' requested here
  return do_format(value, ctx,
         ^
/usr/include/fmt/core.h:1282:22: note: in instantiation of function template specialization 'fmt::formatter<fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>>::format<fmt::basic_format_context<fmt::appender, char>>' requested here
  ctx.advance_to(f.format(*static_cast<qualified_type*>(arg), ctx));
                   ^
/usr/include/fmt/core.h:1261:21: note: (skipping 1 context in backtrace; use -ftemplate-backtrace-limit=0 to see all)
  custom.format = format_custom_arg<
                  ^
/usr/include/fmt/ostream.h:131:15: note: in instantiation of function template specialization 'fmt::make_args_checked<const char *&, const char &, fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>, char[34], char>' requested here
       fmt::make_args_checked<Args...>(format_str, args...));
            ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:184:14: note: in instantiation of function template specialization 'fmt::print<char[34], const char *&, const char &, fmt::tuple_join_view<char, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &, rocsolver_logvalue<const char *> &, rocsolver_logvalue<int> &>, char>' requested here
      fmt::print(*bench_os, "./rocsolver-bench -f {} -r {} {}\n", func_name,
           ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:281:13: note: in instantiation of function template specialization 'rocsolver_logger::log_bench<rocblas_complex_num<float>, rocsolver_logvalue<const char *>, rocsolver_logvalue<int>, rocsolver_logvalue<const char *>, rocsolver_logvalue<int>>' requested here
          log_bench<T>(entry.level, func_prefix, func_name, rocsolver_make_logvalue(args)...);
          ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.cpp:11:5: note: in instantiation of function template specialization 'rocsolver_logger::log_enter_top_level<rocblas_complex_num<float>, const char *, int, const char *, int>' requested here
  ROCSOLVER_ENTER_TOP("lacgv", "-n", n, "--incx", incx);
  ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logger.hpp:34:43: note: expanded from macro 'ROCSOLVER_ENTER_TOP'
          rocsolver_logger::instance()->log_enter_top_level<T>(handle, "rocsolver", name, \
                                        ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/auxiliary/rocauxiliary_lacgv.cpp:49:12: note: in instantiation of function template specialization 'rocsolver_lacgv_impl<rocblas_complex_num<float>>' requested here
  return rocsolver_lacgv_impl<rocblas_float_complex>(handle, n, x, incx);
         ^
/home/acxz/vcs/git/github/rocm-arch/rocm-arch/rocsolver/src/rocSOLVER-rocm-5.0.1/library/src/include/rocsolver_logvalue.hpp:40:10: note: candidate function template not viable: 'this' argument has type 'const __tuple_element_t<3UL, tuple<fmt::formatter<rocsolver_logvalue<const char *>>, fmt::formatter<rocsolver_logvalue<int>>, fmt::formatter<rocsolver_logvalue<const char *>>, fmt::formatter<rocsolver_logvalue<int>>>>' (aka 'const fmt::formatter<rocsolver_logvalue<int>>'), but method is not marked const
  auto format(rocsolver_logvalue<T> wrapper, FormatCtx& ctx)
       ^
4 errors generated when compiling for gfx1030.

sysinfo:
OS: Arch Linux
fmt: 8.1.1
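
The last note in each error above says that `format` is being called on a `const` formatter object, but the method is not marked `const`. The sketch below reproduces that rule in plain C++ without fmt. `logvalue`, `formatter_sketch` and `render` are hypothetical names, not rocSOLVER or fmt code. Marking the member function `const` is what allows the call through a const object; whether rocSOLVER resolved the issue this way is an assumption.

```cpp
#include <string>

// Hypothetical stand-in for a wrapped log value; not the rocSOLVER type.
template <typename T>
struct logvalue
{
    T value;
};

// Hypothetical stand-in for a formatter specialization.
struct formatter_sketch
{
    // Marked const, so it can be called on a const formatter object.
    // Removing this const makes the call in render() fail to compile with
    // a diagnostic similar to "method is not marked const".
    std::string format(logvalue<int> wrapper) const
    {
        return std::to_string(wrapper.value);
    }
};

// Mimics a library that only holds its formatters as const objects.
inline std::string render(const formatter_sketch& f, logvalue<int> v)
{
    return f.format(v);
}
```

Calling `render(formatter_sketch{}, logvalue<int>{42})` goes through the const reference in the same way the tuple-join code path in the log does.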

[feature request] rocsolver_zgetrs and rocsolver_getri_batched

rocsolver_zgetrs is the complex version of rocsolver_dgetrs
rocsolver_getri_batched is used after rocsolver_getrf_batched to complete matrix inversion in a batched fashion.

Question: Does the mention of rocsolver_getrf in README.md refer to availability of all the s/c/d/z versions?

Low performance of batched zgetrs on MI100

I have a batched zgetrs case with 140x140 matrices, 1 right-hand side, and a batch count of 384. This case averages around 0.090 seconds on MI100 (ROCm 4.2 and 4.3); on Summit with a V100 it averages 0.00055 seconds, around 163 times faster. This is currently the biggest bottleneck when running our test case on 3 Spock nodes.

Here is a reproducer:
https://github.com/bd4/hip-bugs/blob/main/batched_zgetrs.cxx
It needs this file decompressed into the same directory to read its input data:
https://www.mcs.anl.gov/~ballen/zgetrs.txt.bz2
The file is around 260 MB uncompressed.

I see there has been performance work recently on getrf - should this affect getrs as well?

8 test failures against openblas

After resolving OpenMathLib/OpenBLAS#3513 and #363, I ran the tests for rocSOLVER-rocm-4.3.0 against openblas-0.3.19, and 8 test failures were reported:

[  FAILED  ] 8 tests, listed below:
[  FAILED  ] checkin_lapack/SYGS2.batched__float/16, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'L' (76, 0x4C) })
[  FAILED  ] checkin_lapack/SYGS2.batched__float/18, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'U' (85, 0x55) })
[  FAILED  ] checkin_lapack/SYGS2.strided_batched__float/16, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'L' (76, 0x4C) })
[  FAILED  ] checkin_lapack/SYGS2.strided_batched__float/18, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'U' (85, 0x55) })
[  FAILED  ] checkin_lapack/SYGST.batched__float/16, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'L' (76, 0x4C) })
[  FAILED  ] checkin_lapack/SYGST.batched__float/18, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'U' (85, 0x55) })
[  FAILED  ] checkin_lapack/SYGST.strided_batched__float/16, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'L' (76, 0x4C) })
[  FAILED  ] checkin_lapack/SYGST.strided_batched__float/18, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'U' (85, 0x55) })

The full test log, gzip-compressed:
rocSOLVER-rocm4.3.0-against-openblas.log.gz

Earlier I reported 8 test failures when comparing rocBLAS against openblas; they may be related to these failures.

Environment

Hardware	description
GPU	Vega 20 [Radeon VII]
CPU	AMD Ryzen 7 5800X

Software	version
Linux	5.15.8
ROCK	Upstream Kernel
ROCR	v4.3.0
rocBLAS	rocm-4.3.0 with Tensile asm_full
Host Compiler	gcc-11.2
Device Compiler	hipcc-4.3.0, llvm-rocm-4.3.0

Low performance of DSYEVD

Hello, I am currently profiling a quantum chemistry code that I recently hipified. The algorithm iteratively diagonalizes a symmetric matrix, and for this purpose I hooked up DSYEVD from rocsolver. From the initial timings, I notice that the diagonalization is very slow and in fact kills the overall performance. See the two snapshots below, where I compare a computation (20 iterations) on MI100 with the same computation on an NVIDIA V100. For the latter, we use Dndsyevd from cusolver. As column 8 (DIAG_TIME) of the tables shows, the rocsolver DSYEVD diagonalization is surprisingly slow. Can someone help me figure out what is happening here?

rocsolver_dsyevd on MI100

cusolver_dndsyevd on V100

I have attached a test program in which I hooked up the same Fortran drivers and diagonalize the same matrix as in the first iteration of the above example.
To compile and run the example, simply run make and make run. This should print the diagonalization time with and without memory operations ("Time rocDIAG" and "Time rocsolver_dsyevd" respectively). I hope this is useful. Thanks!
roctest.tar.gz

Low performance for medium size matrices

I use the example code to do QR factorization for matrices of different sizes and I get the following results:

QR factorization of a rank 3 matrix took 34701 microseconds
QR factorization of a rank 30 matrix took 1316 microseconds
QR factorization of a rank 300 matrix took 5766996 microseconds
QR factorization of a rank 3000 matrix took 71824 microseconds
QR factorization of a rank 30000 matrix took 473530 microseconds

It looks like the factorization of a medium-size matrix (rank 300) is really slow. Is this a bug or intended?
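
One thing worth ruling out with timings like these is a one-time cost (for example, initialization on a first call) being counted in a single measurement. The sketch below is a generic C++ timing helper, not part of the rocSOLVER example: it runs a callable for a few untimed warm-up iterations, then reports the median of several timed runs. `median_runtime_us` and its parameters are hypothetical names. For GPU work, the callable would also need to wait for the device to finish before returning; that is an assumption about how the measured code is structured.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: median wall-clock time of `work` in microseconds.
// `warmup` untimed calls absorb one-time costs; `runs` timed calls follow.
template <typename F>
double median_runtime_us(F&& work, std::size_t warmup, std::size_t runs)
{
    using clock = std::chrono::steady_clock;
    for(std::size_t i = 0; i < warmup; ++i)
        work();

    std::vector<double> samples;
    samples.reserve(runs);
    for(std::size_t i = 0; i < runs; ++i)
    {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        samples.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    if(samples.empty())
        return 0.0;

    // Partial sort so that the middle element is the (upper) median.
    auto mid = samples.begin() + samples.size() / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

Using the median rather than a single run makes one unusually slow call, such as the 300-by-300 case above, easier to tell apart from a consistent slowdown.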

Size of E of SYEV and other LAPACK(like) functions

The documentation currently doesn't mention the size of the E-parameter in these functions.

I've found that for rocsolver_syev it should be n * 3 - 1. Is this correct? Quite a few other projects use |A| = n * n. However, if it were just the values of the tridiagonal matrix, shouldn't it be n * 3 - 2? It would be nice if someone could verify this, and also update the docs 😄
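
The element counts behind the candidates above can be checked with simple arithmetic: a general tridiagonal n-by-n matrix has n diagonal entries plus n - 1 entries on each of its two off-diagonals, while a symmetric one needs only one off-diagonal. The sketch below encodes just that arithmetic in C++. The function names are hypothetical, and it says nothing about the size rocSOLVER actually requires for E, which the documentation or the maintainers would need to confirm.

```cpp
#include <cstddef>

// Entries on the main diagonal of an n-by-n matrix.
constexpr std::size_t diagonal_count(std::size_t n)
{
    return n;
}

// Entries on one off-diagonal (super- or sub-diagonal).
constexpr std::size_t off_diagonal_count(std::size_t n)
{
    return n > 0 ? n - 1 : 0;
}

// All nonzero entries of a general tridiagonal matrix: 3n - 2 for n >= 1.
constexpr std::size_t tridiagonal_count(std::size_t n)
{
    return diagonal_count(n) + 2 * off_diagonal_count(n);
}

// Distinct entries of a symmetric tridiagonal matrix: 2n - 1 for n >= 1.
constexpr std::size_t symmetric_tridiagonal_count(std::size_t n)
{
    return diagonal_count(n) + off_diagonal_count(n);
}

static_assert(tridiagonal_count(4) == 3 * 4 - 2, "general case is 3n - 2");
static_assert(symmetric_tridiagonal_count(4) == 2 * 4 - 1, "symmetric case is 2n - 1");
```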

new rocsolver build error with fmt 9

Building rocsolver from the AUR on Manjaro fails:
`[ 10%] Building CXX object library/src/CMakeFiles/rocsolver.dir/specialized/roclapack_getri_specialized_kernels_c.cpp.o

In file included from /var/tmp/pamac-build-chameleon/rocsolver/src/rocSOLVER-rocm-5.2.3/library/src/common/rocsolver_logger.cpp:12:

In file included from /var/tmp/pamac-build-chameleon/rocsolver/src/rocSOLVER-rocm-5.2.3/library/src/include/rocsolver_logger.hpp:7:

/usr/include/fmt/format.h:3122:38: error: implicit instantiation of undefined template 'fmt::detail::dragonbox::float_info'

const auto f = basic_fp<typename info::carrier_uint>(converted_value);
                                 ^

/usr/include/fmt/format.h:1408:46: error: implicit instantiation of undefined template 'fmt::detail::dragonbox::float_info'

using carrier_uint = typename dragonbox::float_info<Float>::carrier_uint;`

rocblas_evect type definition not found in rocblas_module.f90

Hello, I am working on a Fortran driver to use rocsolver_dsyevd. Since there is no wrapper function available (at least to my knowledge), I looked at your basic Fortran example and wrote the following in my driver.

    interface
        function rocsolver_dsyevd(handle, evect, uplo, N, A, lda, D, E, info) &
                result(c_int) &
                bind(c, name = 'rocsolver_dsyevd')
            use iso_c_binding
            implicit none
            type(c_ptr), value :: handle
            integer(kind(rocblas_evect_original)), value :: evect
            integer(kind(rocblas_fill_full)), value :: uplo
            integer(c_int), value :: N
            type(c_ptr), value :: A
            integer(c_int), value :: lda
            type(c_ptr), value :: D
            type(c_ptr), value :: E
            type(c_ptr), value :: info
        end function rocsolver_dsyevd
    end interface

According to the rocSOLVER manual, I am supposed to pass a rocblas_evect type parameter to the function, but this type doesn't exist in rocblas_module.f90 of ROCm 4.5. I am thinking about adding something like the following to my driver as a quick fix, but I don't know the correct integers to assign. Can you please help me with this?

    enum, bind(c)
        enumerator :: rocblas_evect_original = foo_int
        enumerator :: rocblas_evect_tridiagonal = bar_int
        enumerator :: rocblas_evect_none = baz_int
    end enum

Thanks!

Generating rocsolver-export.h

Is it possible to generate rocsolver-export.h on a system with no AMD gpu?