

IruNikZe commented on May 30, 2024

Hello,

Suspected Main Problem

As far as I can judge, the problem also affects the WhittakerSmooth class, because all of these functionalities rely on the sparse matrix representation of the involved weighting and penalty matrices. This is way more performant than using pure dense NumPy arrays, yet dense arrays (in a special banded storage) are part of the solution here.
With scipy.sparse, most of the functions (especially spsolve) do not distinguish between general sparse matrices that can have nonzeros at any position, e.g.,

A = [[ a    0    0    0    b]
     [ 0    c    0    d    0]
     [ 0    e    f    0    g]
     [ 0    0    0    h    0]
     [ 0    0    0    i    j]]

and the highly structured sparse format encountered in all the described functionalities (I'll call them Whittaker-like), which is banded (only a few central diagonals are nonzero) and symmetric, e.g.,

B = [[ a    b    0    0    0]
     [ b    c    d    0    0]
     [ 0    d    e    f    0]
     [ 0    0    f    g    h]
     [ 0    0    0    h    i]]

Usually, sparse solvers rely on LU decomposition, but as far as I'm aware, the LU decomposition of a sparse matrix like A is not necessarily sparse as well; due to fill-in it can become dense, so the gain is not as large as it could be, even when using a high-performance sparse solver like spsolve from pypardiso.
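For orientation, here is a minimal sketch of the kind of system involved (toy sizes and names; difference = 2 assumed): the system matrix is assembled with scipy.sparse and handed to the general spsolve, which cannot exploit the banded structure.

import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import spsolve

n = 1000                 # number of points in the spectrum
lam = 1e5                # smoothing parameter
y = np.random.rand(n)    # toy "spectrum"

# second-order finite-difference matrix D, shape (n - 2, n)
D = diags([1.0, -2.0, 1.0], offsets=[0, 1, 2], shape=(n - 2, n))
A = identity(n) + lam * (D.T @ D)   # banded and symmetric, like B above

z = spsolve(A.tocsc(), y)   # general sparse LU; the banded structure is not exploited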

Proposed solution

In contrast to this, B has an LU decomposition that is (almost) as sparse as B itself, with (almost) the same banded structure, so the performance can be increased quite a lot. The LAPACK wrapper solve_banded from scipy.linalg exploits this together with a special banded storage as a dense array, which makes the execution way faster.
Actually, one can go down even further. Since for Whittaker-like problems the matrix to invert is I + lam * D.T @ D, where I is the identity matrix and D.T @ D is the square of the finite difference matrix D (which is of order difference), the problem simplifies. D.T @ D has as many zero eigenvalues as the difference order, and adding I to it basically lifts the main diagonal of D.T @ D, so all the resulting eigenvalues are lifted to be > 0 (at least mathematically, though not numerically in all cases). This allows relying on a Cholesky decomposition rather than an LU decomposition, which is again faster because it only computes a lower triangular matrix L instead of both a lower and an upper triangular matrix L and U (and the banded structure (almost) remains in both cases).
Again, SciPy offers the LAPACK wrapper solveh_banded for doing so.
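To make the banded storage concrete, here is a minimal sketch for a matrix shaped like the symmetric example B above (one superdiagonal, n_upp = 1); the values are swapped for diagonally dominant ones so that the Cholesky-based solver is guaranteed to succeed:

import numpy as np
from scipy.linalg import solveh_banded

main = np.array([4.0, 4.0, 4.0, 4.0, 4.0])   # a, c, e, g, i in B above
sup = np.array([1.0, 1.0, 1.0, 1.0])         # b, d, f, h in B above

# upper banded storage: row 0 = superdiagonal (left-padded with one zero),
# row 1 = main diagonal, following ab[u + i - j, j] = B[i, j]
ab_upper = np.vstack([np.concatenate(([0.0], sup)), main])

x = solveh_banded(ab_upper, np.ones(5))   # solves B @ x = [1, 1, 1, 1, 1]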
Typically, I first attempt solveh_banded, and upon a numpy.linalg.LinAlgError, which indicates that the Cholesky decomposition was not possible, I fall back to the LU decomposition via solve_banded, which is more stable due to the fact that it also uses partial pivoting (row swapping) to ensure stability.

import numpy as np
from scipy.linalg import solve_banded, solveh_banded

a_banded = ...  # matrix to invert in banded storage with `n_upp` super- and `n_upp` subdiagonals
b = ...  # right-hand side vector

try:
    # Attempt banded Cholesky decomposition
    x = solveh_banded(
        a_banded[0 : n_upp + 1, ::],  # only the superdiagonals and the main diagonal since symmetric
        b,
        lower=False,  # only the superdiagonals and main diagonal were taken
    )

except np.linalg.LinAlgError:
    # Fall back to banded LU decomposition with partial pivoting;
    # the tuple of sub-/superdiagonal counts is the first positional argument
    x = solve_banded(
        (n_upp, n_upp),  # number of sub- and superdiagonals
        a_banded,
        b,
    )

Further tweaks

What I just described is as far as one can get for the general case of arbitrary difference order of D. However, there are special cases that allow for further enhancement:

  • difference = 1: this results in a tridiagonal I + lam * D.T @ D, which can be solved even faster. This is already included in scipy.linalg.solve_banded for the LU decomposition, but unfortunately there is no equivalent in scipy.linalg.solveh_banded for the Cholesky factorisation
  • difference = 2: this results in a pentadiagonal I + lam * D.T @ D, for whose solution the Python package pentapy is highly optimised (see the sketch right after this list). As an added bonus, difference = 2 is used in many baseline correction algorithms, so I'd say this will give the best speedup for AirPLS and ArPLS.
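A minimal sketch of the difference = 2 case (toy sizes; pp.solve and its is_flat flag as given in the pentapy documentation, so treat the exact signature as an assumption):

import numpy as np
import pentapy as pp
from scipy.sparse import diags, identity

n = 1000
lam = 1e5
y = np.random.rand(n)

# pentadiagonal system matrix for difference = 2
D = diags([1.0, -2.0, 1.0], offsets=[0, 1, 2], shape=(n - 2, n))
A = (identity(n) + lam * (D.T @ D)).toarray()

# is_flat=False: pass the full matrix and let pentapy extract the five bands
z = pp.solve(A, y, is_flat=False)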

pentapy also gives an overview of the performance of a variety of banded solvers, which indicates that there is still a lot of performance to gain when moving away from SciPy's spsolve:
[Figure: performance comparison of various sparse/banded solvers, from the pentapy documentation]

Besides, there is another added bonus. All these functions can solve (I + lam * D.T @ D) @ Z = Y for Z also when Y has multiple columns, i.e., multiple spectra to smooth. So the same expensive factorisation - once computed - can be reused for solving each column of Y for the corresponding column of Z, which is way less expensive since each solve is only a forward and a backward substitution. On top of that, the loops then happen in low-level languages and not in Python anymore. Consequently, the Python loops in all the Whittaker-like functions of chemotools could simply be dropped for performance (sketched below).
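For illustration, a minimal sketch (reusing the toy banded matrix from above): solveh_banded accepts a two-dimensional right-hand side, so all spectra are solved from a single factorisation.

import numpy as np
from scipy.linalg import solveh_banded

n, n_spectra = 5, 3
Y = np.random.rand(n, n_spectra)   # one spectrum per column

ab_upper = np.array([
    [0.0, 1.0, 1.0, 1.0, 1.0],     # superdiagonal (left-padded)
    [4.0, 4.0, 4.0, 4.0, 4.0],     # main diagonal
])

Z = solveh_banded(ab_upper, Y)     # one factorisation, all columns solved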


paucablop commented on May 30, 2024

Hi @IruNikZe, thanks a lot for the fantastic explanation - it looks super promising! I will turn this issue into a branch and you can add your changes there!!


paucablop commented on May 30, 2024

@IruNikZe I have assigned you to that issue, and this is the branch you can add your changes to 😄 😄

44-improve-airpls-and-arpls-performance-sparse-matrix-operations


IruNikZe commented on May 30, 2024

Hi-hi,
It's been a while, so I wanted to give a quick update.
From testing I figured out two things:

1) Numerical instability of the Whittaker smoother

Problem
For difference orders m > 4 and long signals (n > 1000), solving the linear system (W + lambda * D^T @ D) z = W y becomes unstable. Unfortunately, these sizes are in the order of magnitude of typical spectra, and m = 6 is required for choosing lambda based on the signal in a spatially adaptive fashion (which is the next step I want to tackle in a later issue).
The problem affects both the banded Cholesky decomposition and the more robust (but more expensive) banded pivoted LU decomposition. pentapy fails as well because it solves the system using the same matrix.

Cause
The problem comes from squaring the linear system before solving, because W + lambda * D^T @ D = sqrt(W)^T @ sqrt(W) + sqrt(lambda) * sqrt(lambda) * D^T @ D, so

  • the weights sqrt(W) are squared, which boosts any imbalance in the weight distribution even further and also introduces imbalance into normal weights (compare 1 and 1.5, which are close, with their squares 1 and 2.25, which lie further apart - and here the order of magnitude was still the same, so it can get far worse)
  • sqrt(lambda), which can be arbitrarily small or large, is squared as well, which amplifies the next point even more
  • the squaring of D is the most critical part in this context: since D has fewer rows than columns, D^T @ D cannot have full rank and has m zero eigenvalues, so when lambda is so large that lambda * D^T @ D is basically the only numerically present summand, the system breaks down

All in all, this blows up the condition number of A = W + lambda * D^T @ D, since it is effectively squared. A high condition number indicates that the system is ill-conditioned and that the slightest perturbations have strong effects on the final solution. With the limited precision of float64, rounding errors build up until the system - even though solvable from a mathematical point of view - cannot be solved numerically (see the Wikipedia article on the condition number and related articles).
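A quick toy illustration of the squaring (sizes and lambda are arbitrary): the condition number of A = B^T @ B is the square of that of B = [[sqrt(W)], [sqrt(lambda) * D]].

import numpy as np

n, lam = 200, 1e8
W_sqrt = np.eye(n)                     # sqrt(W) for unit weights
D = np.diff(np.eye(n), n=2, axis=0)    # difference matrix of order m = 2

B = np.vstack([W_sqrt, np.sqrt(lam) * D])
A = B.T @ B                            # = W + lambda * D.T @ D

print(np.linalg.cond(B))               # stays moderate
print(np.linalg.cond(A))               # roughly the square of cond(B)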

Partial solution
While it's not possible to properly cure the case where lambda is so large that the system becomes too ill-conditioned to be solved (at least this problem should not and cannot be tackled from within the solver), all the other cases can be handled well by avoiding the squaring just described.
This requires some insight into how the solver works. Basically, A is decomposed into two uniquely defined triangular matrices L and L^T that have the same entries but are transposed, i.e., A = L @ L^T. Triangular matrices are easy to solve, but we don't actually need uniquely defined factors, because A = S @ S^T can be solved in exactly the same way as for L, as long as S is lower triangular.
One such decomposition can be derived by writing A = B^T @ B, where B is a full (non-triangular) matrix with no special structure. From the above, it is obvious that B = [[sqrt(W)], [sqrt(lambda) * D]] (if you multiply it out term by term, you get A). Now, B needs to be triangularised, which can be achieved by a QR decomposition B = Q @ R. Here, Q is an orthonormal matrix (Q^T = Q^(-1); Q^T @ Q = I) and R is upper triangular.
With this, A = B^T @ B = R^T @ Q^T @ Q @ R = R^T @ I @ R = R^T @ R, i.e., we get a triangular decomposition, but A was never formed by squaring, which keeps the condition number relatively low (it is not squared like before), i.e., the system can be solved with strongly improved numerical accuracy. Since Q cancels out, only R needs to be formed. To keep the computations accurate, I resorted to a floating-point-exact sparse Givens rotation as well as fused multiply-add operations (FMA).
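To make the idea concrete, here is a dense NumPy sketch (illustrative names and toy sizes; the actual implementation works on the banded structures with Givens rotations instead of np.linalg.qr):

import numpy as np
from scipy.linalg import solve_triangular

n, lam = 200, 1e8
w = np.random.rand(n) + 0.5            # positive weights (diagonal of W)
y = np.random.rand(n)                  # signal to smooth
D = np.diff(np.eye(n), n=2, axis=0)    # difference matrix of order m = 2

# B stacks sqrt(W) on top of sqrt(lambda) * D; A = B.T @ B is never formed
B = np.vstack([np.diag(np.sqrt(w)), np.sqrt(lam) * D])
R = np.linalg.qr(B, mode="r")          # only R is needed since Q cancels out

# solve R.T @ R @ z = W @ y by one forward and one backward substitution
rhs = w * y                            # W @ y for diagonal W
z = solve_triangular(R, solve_triangular(R.T, rhs, lower=True))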
For this, I went down to Cython (1) and combined it with SciPy's cython_lapack for solving the triangular system after the QR decomposition.
I'll go for a draft pull request first, because we then need to think about packaging in more detail and probably first have to set up the packaging in the CI pipeline (if we wanna go down this road at all). If not, I can also rewrite it in NumPy, which will be slower and less accurate since it has no FMA, or in Numba, which also does not easily offer FMA.

(1) I also made a Rust implementation, which would probably make the package more scalable as functionality grows and new external functionalities are required - something that can be kind of a nightmare in Cython.

Added bonus
Since the QR decomposition was some effort to code and also to optimise, I went a step further and extended the system of the Whittaker smoother to solve (K^T @ W_1 @ K + lambda * D^T @ W_2 @ D) z = K^T @ W_1 @ y, where W_2 will be needed later for spatial adaptiveness.
The added bonus is the kernel matrix K, which also needs to be banded. Assume your spectrum was blurred by, e.g., a Lorentzian peak (like the natural linewidth); this can be deconvolved out, and z will then not merely be a smooth version of y, but a smooth version with the peak width reduced, since the Lorentzian is "missing". Of course, other point spread functions can also be deconvolved out with this.
The original smoothing solver is just the special case K = I (sketched below).
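The same stacking generalises to the extended system; a dense sketch with illustrative names (K = I recovers the plain smoother):

import numpy as np
from scipy.linalg import solve_triangular

n, lam = 200, 1e4
y = np.random.rand(n)
w1 = np.ones(n)                        # diagonal of W_1
w2 = np.ones(n - 2)                    # diagonal of W_2 (spatial adaptiveness later)
K = np.eye(n)                          # banded kernel; identity = plain smoothing
D = np.diff(np.eye(n), n=2, axis=0)

# B.T @ B = K.T @ W_1 @ K + lambda * D.T @ W_2 @ D without ever forming it
B = np.vstack([
    np.sqrt(w1)[:, None] * K,                  # sqrt(W_1) @ K
    np.sqrt(lam) * np.sqrt(w2)[:, None] * D,   # sqrt(lambda) * sqrt(W_2) @ D
])
R = np.linalg.qr(B, mode="r")
rhs = K.T @ (w1 * y)                   # right-hand side K.T @ W_1 @ y
z = solve_triangular(R, solve_triangular(R.T, rhs, lower=True))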

2) Performance of baseline algorithms

Problem
The baseline algorithms can take some time to converge.

Cause
The initialisation of the baseline algorithms is key here. Right now, they are initialised with a constant vector, but with a more elaborate initialisation, the convergence can be sped up dramatically.

Solution
Typically, one would divide the signal into something like 20 windows, find the minimum of each window, and interpolate between those minima (see the sketch after the list below). This estimate is often very close to the baseline (given that lambda is well chosen).
Alongside this, I would also give the baseline two modes:

  • the baseline is below the peaks (e.g., absorbance spectroscopy)
  • the baseline is above the peaks (e.g., transmittance spectroscopy)
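A minimal sketch of the initialisation (the window count and linear interpolation are illustrative choices; for the "baseline above the peaks" mode one would take window maxima instead of minima):

import numpy as np

def initial_baseline(y, n_windows=20):
    """Rough baseline estimate from window minima, linearly interpolated."""
    edges = np.linspace(0, y.size, n_windows + 1, dtype=int)
    centers = 0.5 * (edges[:-1] + edges[1:])
    minima = np.array([y[lo:hi].min() for lo, hi in zip(edges[:-1], edges[1:])])
    return np.interp(np.arange(y.size), centers, minima)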

Implications
I would then have to adapt the tests for ArPLS and AirPLS, because the different initialisation will change the results. Thus, the tests will fail because they just compare against the result of the very first implementation (at least that's my guess).

Any feedback is highly welcome, but I guess when the draft pull request is submitted, things will get clearer 😁
