Problem description
First, thanks for releasing this code, great work! I saw various implementations of lower triangular solve (forward substitution), but none for upper triangular solve (back-substitution). In particular, I want to test a parallel upper triangular solve with HDagg partitioning, which is very useful for applying incomplete Cholesky preconditioners.
Related findings
I did a fairly exhaustive search over many related repositories but did not find a ready-to-use implementation. Several findings are still quite useful, though. They are summarized below:
- The sparse_blas/sptrsv_lt.cpp file only contains a serial `ltsolve` for upper sptrsv, unlike sparse_blas/sptrsv.cpp, which contains executors for serial/level-set/blocked/LBC partitionings in CSR/CSC formats.
- The cholesky_demo.cpp in the sympiler-bench repository contains the solve phase of a complete Cholesky factorization. The call `sym_chol->solve_only()` performs both forward and backward substitutions. The call stack goes to cholesky_solver.cpp (calls `solve_phase_ll_blocked_parallel()`), then solve_phase.cpp (calls `H2LeveledBlockedLsolve()` and `H2LeveledBlockedLTsolve()`), and finally Triangular_BCSC.cpp, where `H2LeveledBlockedLTsolve` is implemented as:
```cpp
for (int i1 = levels - 1; i1 >= 0; --i1) {
  int j1 = 0;
#pragma omp parallel //shared(lValues)//private(map, contribs)
  {
#pragma omp for schedule(static) private(j1, tempVec)
    for (j1 = levelPtr[i1]; j1 < levelPtr[i1 + 1]; ++j1) {
      //tempVec = new double[n]();
      tempVec = (double *) calloc(n, sizeof(double));
      for (int k1 = parPtr[j1 + 1] - 1; k1 >= parPtr[j1]; --k1) {
        int i = partition[k1];
        int curCol = sup2col[i];
        int nxtCol = sup2col[i + 1];
        ...
```
Compare this with the forward solve `H2LeveledBlockedLsolve`:
```cpp
for (int i1 = 0; i1 < levels; ++i1) {
  int j1 = 0;
#pragma omp parallel //shared(lValues)//private(map, contribs)
  {
#pragma omp for schedule(static) private(j1, tempVec)
    for (j1 = levelPtr[i1]; j1 < levelPtr[i1 + 1]; ++j1) {
      //tempVec = new double[n]();
      tempVec = (double *) calloc(n, sizeof(double));
      for (int k1 = parPtr[j1]; k1 < parPtr[j1 + 1]; ++k1) {
        int i = partition[k1];
        int curCol = sup2col[i];
        int nxtCol = sup2col[i + 1];
        ...
```
So both the first loop (`i1`) and the third loop (`k1`) run in reversed order, which makes sense for a backward solve. The second loop (`j1`) is parallel, so its order doesn't matter.
This is the only parallel back-substitution I found in the Sympiler-related repos. However, a general (non-supernodal) version is still missing. I am also not sure whether HDagg partitioning can reuse the same executor code, or whether it is specific to LBC partitioning.
- The example/SpTRSV_runtime.h in the aggregation repository contains various lower sptrsv solver wrappers: `SpTrSv_LL_LBC`, `SpTrSv_LL_HDAGG`, `SpTrSv_LL_Tree_HDAGG`, etc. Most of them call into the `sptrsv_csr_lbc()` executor defined in sptrsv.cpp:
```cpp
void sptrsv_csr_lbc(int n, int *Lp, int *Li, double *Lx, double *x,
                    int level_no, int *level_ptr,
                    int *par_ptr, int *partition) {
#pragma omp parallel
  {
    for (int i1 = 0; i1 < level_no; ++i1) {
#pragma omp for schedule(auto)
      for (int j1 = level_ptr[i1]; j1 < level_ptr[i1 + 1]; ++j1) {
        for (int k1 = par_ptr[j1]; k1 < par_ptr[j1 + 1]; ++k1) {
          int i = partition[k1];
          for (int j = Lp[i]; j < Lp[i + 1] - 1; j++) {
            x[i] -= Lx[j] * x[Li[j]];
          }
          x[i] /= Lx[Lp[i + 1] - 1];
        }
      }
    }
  }
}
```
Am I right that simply reversing the order of the `i1` and `k1` loops would yield the backward version?
- The sparse_blas/sptrsv.cpp in the HDagg-benchmark repository contains more lower sptrsv implementations, for example the CSC (right-looking) versions `sptrsv_csc_lbc` and `sptrsv_csc_group_lbc`, and the buffered version `sptrsv_csr_lbc_buffer`. However, all of them assume lower-triangular matrices (forward substitution).
```cpp
void sptrsv_csc_lbc(int n, int *Lp, int *Li, double *Lx, double *x,
                    int level_no, int *level_ptr, int *par_ptr, int *partition)
{
#pragma omp parallel
  {
    // iterate over l-partitions
    for (int i1 = 0; i1 < level_no; ++i1)
    {
      // Iterate over all the w-partitions of a l-partition
#pragma omp for schedule(auto)
      for (int j1 = level_ptr[i1]; j1 < level_ptr[i1 + 1]; ++j1)
      {
        // Iterate over all the node of a w-partition
        for (int k1 = par_ptr[j1]; k1 < par_ptr[j1 + 1]; ++k1)
        {
          //Detect the node
          int i = partition[k1];
          ...
```
While it is true that the CSR format of L can be equivalently viewed as the CSC format of L^T (now upper-triangular), one still needs to provide a backward version of `sptrsv_csc` (looping backward) and cannot simply reuse the existing `sptrsv_csc_lbc` (which assumes a lower-triangular CSC matrix). Thus the difference between the CSR and CSC (left- and right-looking) implementations might lead to different performance, but both only accept a lower-triangular L.
- The trsv_test.cpp in the SpMP repository (used as a performance baseline for HDagg) contains both parallel forward and backward solves. The serial `backwardSolveRef` simply reverses the outer loop of `forwardSolveRef`. The parallel `backwardSolveWithBarrier` reverses all three loops of `forwardSolveWithBarrier`. Both forward and backward solves share the same `barrierSchedule`.
Questions
In summary, I have several questions:
- What is the proper way to derive parallel back-substitution code that is compatible with both tree (LBC) and DAG (HDagg) partitioning? I am thinking of manually reversing the loops in the `sptrsv_csr_lbc()` function. Did I miss anything?
- Am I right that the forward solve for L and the backward solve for L^T can use the same tree/DAG partitioning result, and thus pay for only one analysis/inspection phase?
- In SpMP, only the CSR format is used throughout all algorithms (both forward and backward). In sympiler/aggregation, however, many functions require both CSR and CSC formats. For example, in SpTRSV_runtime.cpp:

```cpp
SpTrSv_LL_Tree_HDAGG LL_lvl_obj(Lower_A_CSR, Lower_A_CSC, y_correct,
                                "LL Tree HDAGG ", core, isLfactor, bin);
```

even though the actual executor only uses the CSR arrays:

```cpp
sptrsv_csr_lbc(n_, L1_csr_->p, L1_csr_->i, L1_csr_->x, x_in_,
               final_level_no, final_level_ptr.data(),
               final_part_ptr.data(), final_node_ptr.data());
```

Is it possible to use only CSR (or only CSC) and save half of the memory?
Any suggestions would be helpful, thanks again!