Problem description
First, thanks for releasing this code, great work! I saw various implementations of lower triangular solve (forward substitution), but none for upper triangular solve (back-substitution). In particular, I want to test a parallel upper triangular solve with HDagg partitioning, which is very useful for applying incomplete Cholesky preconditioners.
Related findings
I did a fairly exhaustive search over many related repositories but did not find a ready-to-use implementation. Several findings are still quite useful, though. They are summarized below:
- The sparse_blas/sptrsv_lt.cpp file only contains a serial `ltsolve` for upper sptrsv, unlike sparse_blas/sptrsv.cpp, which contains executors for serial/level-set/blocked/LBC partitionings in CSR/CSC formats.
- The cholesky_demo.cpp in the sympiler-bench repository contains the solve phase of a complete Cholesky factorization. The call `sym_chol->solve_only()` performs both forward and backward substitutions. The call stack goes to cholesky_solver.cpp (calls `solve_phase_ll_blocked_parallel()`), then solve_phase.cpp (calls `H2LeveledBlockedLsolve()` and `H2LeveledBlockedLTsolve()`), and finally Triangular_BCSC.cpp, where `H2LeveledBlockedLTsolve` is implemented as:
```cpp
for (int i1 = levels - 1; i1 >= 0; --i1) {
  int j1 = 0;
#pragma omp parallel //shared(lValues)//private(map, contribs)
  {
#pragma omp for schedule(static) private(j1, tempVec)
    for (j1 = levelPtr[i1]; j1 < levelPtr[i1 + 1]; ++j1) {
      //tempVec = new double[n]();
      tempVec = (double *) calloc(n, sizeof(double));
      for (int k1 = parPtr[j1 + 1] - 1; k1 >= parPtr[j1]; --k1) {
        int i = partition[k1];
        int curCol = sup2col[i];
        int nxtCol = sup2col[i + 1];
        ...
```
Compare this with the forward solve `H2LeveledBlockedLsolve`:
```cpp
for (int i1 = 0; i1 < levels; ++i1) {
  int j1 = 0;
#pragma omp parallel //shared(lValues)//private(map, contribs)
  {
#pragma omp for schedule(static) private(j1, tempVec)
    for (j1 = levelPtr[i1]; j1 < levelPtr[i1 + 1]; ++j1) {
      //tempVec = new double[n]();
      tempVec = (double *) calloc(n, sizeof(double));
      for (int k1 = parPtr[j1]; k1 < parPtr[j1 + 1]; ++k1) {
        int i = partition[k1];
        int curCol = sup2col[i];
        int nxtCol = sup2col[i + 1];
        ...
```
So both the first loop (`i1`) and the third loop (`k1`) run in reversed order, which makes sense for a backward solve. The second loop (`j1`) is parallel, so its order doesn't matter.
This is the only parallel back-substitution I found in the Sympiler-related repos. However, a general (non-supernodal) version is still missing. I am also not sure whether HDagg partitioning can reuse the same executor code, or whether it is specific to LBC partitioning.
- The example/SpTRSV_runtime.h in the aggregation repository contains various lower sptrsv solver wrappers: `SpTrSv_LL_LBC`, `SpTrSv_LL_HDAGG`, `SpTrSv_LL_Tree_HDAGG`, etc. Most of them call into the `sptrsv_csr_lbc()` executor defined in sptrsv.cpp:
```cpp
void sptrsv_csr_lbc(int n, int *Lp, int *Li, double *Lx, double *x,
                    int level_no, int *level_ptr,
                    int *par_ptr, int *partition) {
#pragma omp parallel
  {
    for (int i1 = 0; i1 < level_no; ++i1) {
#pragma omp for schedule(auto)
      for (int j1 = level_ptr[i1]; j1 < level_ptr[i1 + 1]; ++j1) {
        for (int k1 = par_ptr[j1]; k1 < par_ptr[j1 + 1]; ++k1) {
          int i = partition[k1];
          for (int j = Lp[i]; j < Lp[i + 1] - 1; j++) {
            x[i] -= Lx[j] * x[Li[j]];
          }
          x[i] /= Lx[Lp[i + 1] - 1];
        }
      }
    }
  }
}
```
Am I right that simply reversing the order of the `i1` and `k1` loops would yield the backward version?
- The sparse_blas/sptrsv.cpp in the HDagg-benchmark repository contains more lower sptrsv implementations, for example the CSC (right-looking) versions `sptrsv_csc_lbc` and `sptrsv_csc_group_lbc`, and the buffered version `sptrsv_csr_lbc_buffer`. However, all of them assume lower-triangular matrices (forward substitution).
```cpp
void sptrsv_csc_lbc(int n, int *Lp, int *Li, double *Lx, double *x,
                    int level_no, int *level_ptr, int *par_ptr, int *partition)
{
#pragma omp parallel
  {
    // iterate over l-partitions
    for (int i1 = 0; i1 < level_no; ++i1)
    {
      // Iterate over all the w-partitions of a l-partition
#pragma omp for schedule(auto)
      for (int j1 = level_ptr[i1]; j1 < level_ptr[i1 + 1]; ++j1)
      {
        // Iterate over all the node of a w-partition
        for (int k1 = par_ptr[j1]; k1 < par_ptr[j1 + 1]; ++k1)
        {
          //Detect the node
          int i = partition[k1];
          ...
```
While it is true that the CSR format of L can be equivalently viewed as the CSC format of L^T (now upper-triangular), one still needs to provide a backward version of `sptrsv_csc` (looping backward) and cannot simply reuse the existing `sptrsv_csc_lbc` (which assumes a lower-triangular CSC matrix). Thus the difference between the CSR and CSC (left- and right-looking) implementations might lead to different performance, but both only accept a lower-triangular L.
- The trsv_test.cpp in the SpMP repository (used as a performance baseline for HDagg) contains both parallel forward and backward solves. The serial `backwardSolveRef` simply reverses the outer loop of `forwardSolveRef`. The parallel `backwardSolveWithBarrier` reverses all three loops of `forwardSolveWithBarrier`. Both forward and backward solves share the same `barrierSchedule`.
Questions
In summary, I have several questions:
- What is the proper way to derive parallel back-substitution code that is compatible with both tree (LBC) and DAG (HDagg) partitioning? I am thinking of manually reversing the loops in the `sptrsv_csr_lbc()` function. Did I miss anything?
- Am I right that the forward solve for L and the backward solve for L^T can use the same tree/DAG partitioning result, and thus pay for only one analysis/inspection phase?
- In SpMP, only the CSR format is used throughout all algorithms (both forward and backward). In sympiler/aggregation, however, many functions require both CSR and CSC formats. For example, in SpTRSV_runtime.cpp:

```cpp
SpTrSv_LL_Tree_HDAGG LL_lvl_obj(Lower_A_CSR, Lower_A_CSC, y_correct,
                                "LL Tree HDAGG ", core, isLfactor, bin);
```

even though the actual executor only uses the CSR arrays:

```cpp
sptrsv_csr_lbc(n_, L1_csr_->p, L1_csr_->i, L1_csr_->x, x_in_,
               final_level_no, final_level_ptr.data(),
               final_part_ptr.data(), final_node_ptr.data());
```

Is it possible to use only CSR (or only CSC) and save half of the memory?
Any suggestions would be helpful, thanks again!