Hello,
Suspected Main Problem
As far as I can judge, the problem also affects the `WhittakerSmooth` class, because all of these rely on the sparse matrix representation of the involved weighting and penalty matrices. This is already way more performant than using pure dense NumPy arrays, yet dense arrays are part of the solution here.
With `scipy.sparse`, most of the functions (especially `spsolve`) do not distinguish between general sparse matrices that can have nonzeros at any position, e.g.,
A = [[ a 0 0 0 b]
[ 0 c 0 d 0]
[ 0 e f 0 g]
[ 0 0 0 h 0]
[ 0 0 0 i j]]
and the highly structured sparse format encountered in all the described functionalities (I'll call them Whittaker-Like) which is banded (only a few central diagonals are non-zero) and symmetric, e.g.,
B = [[ a b 0 0 0]
[ b c d 0 0]
[ 0 d e f 0]
[ 0 0 f g h]
[ 0 0 0 h i]]
Usually, sparse solvers rely on LU decomposition, but as far as I'm aware, the LU decomposition of a sparse matrix like `A` is not necessarily sparse itself but can be dense, so the gain is not as large as it could be, even when using a high-performance sparse solver like `spsolve` from `pypardiso`.
Proposed solution
In contrast to this, `B` has an LU decomposition that is (almost) as sparse as `B` itself, with (almost) the same banded structure, so the performance can be increased quite a lot. The LAPACK wrapper `solve_banded` from `scipy.linalg` exploits this, together with a special banded storage as a dense array, which makes the execution way faster.
Actually, one can go down even further. For Whittaker-like problems, the matrix to invert is `I + lam * D.T @ D`, where `I` is the identity matrix and `D.T @ D` is the square of the finite difference matrix `D` (which is of order `difference`), so the problem simplifies. `D.T @ D` has `difference` zero eigenvalues, and adding `I` to it basically lifts the main diagonal of `D.T @ D`. So, all the resulting eigenvalues are lifted to be > 0 (at least mathematically, though not numerically in all cases). This allows relying on a Cholesky decomposition rather than an LU decomposition, which is again faster because it only computes a lower triangular matrix `L` instead of both a lower and an upper triangular matrix `L` and `U` (the banded structure (almost) remains in both cases).
Again, SciPy offers the LAPACK wrapper `solveh_banded` for doing so.
Typically, I first attempt to use `solveh_banded`, and upon a `numpy.linalg.LinAlgError`, which indicates that the Cholesky decomposition was not possible, I fall back to the LU decomposition via `solve_banded`, which is more stable because it also uses partial pivoting (row swapping) to ensure stability.
import numpy as np
from scipy.linalg import solve_banded, solveh_banded

a_banded = ...  # matrix to invert in banded storage with `n_upp` super- and `n_upp` subdiagonals
b = ...  # right-hand side vector

try:
    # Attempt the banded Cholesky decomposition; since the matrix is
    # symmetric, only the superdiagonals and the main diagonal are needed
    x = solveh_banded(
        a_banded[0 : n_upp + 1, :],
        b,
        lower=False,  # only the superdiagonals and the main diagonal were taken
    )
except np.linalg.LinAlgError:
    # Fall back to the banded (pivoted) LU decomposition
    x = solve_banded(
        (n_upp, n_upp),  # numbers of sub- and superdiagonals
        a_banded,
        b,
    )
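As a minimal, self-contained illustration of this fallback (the values of `n`, `lam`, and `difference` are just illustrative), one can build `I + lam * D.T @ D` densely, pack it into the banded storage, and solve:

```python
import numpy as np
from scipy.linalg import solve_banded, solveh_banded

n, lam, difference = 50, 100.0, 2
n_upp = difference  # bandwidth of D.T @ D equals the difference order

# dense construction for clarity; D is the finite difference matrix
D = np.diff(np.eye(n), n=difference, axis=0)
A = np.eye(n) + lam * (D.T @ D)

# pack A into the banded storage expected by solve_banded:
# ab[n_upp + i - j, j] = A[i, j] for |i - j| <= n_upp
ab = np.zeros((2 * n_upp + 1, n))
for i in range(n):
    for j in range(max(0, i - n_upp), min(n, i + n_upp + 1)):
        ab[n_upp + i - j, j] = A[i, j]

y = np.sin(np.linspace(0.0, 3.0, n))  # toy right-hand side
try:
    # symmetric positive definite, so the banded Cholesky succeeds here
    x = solveh_banded(ab[: n_upp + 1, :], y, lower=False)
except np.linalg.LinAlgError:
    x = solve_banded((n_upp, n_upp), ab, y)
# x agrees with a dense np.linalg.solve(A, y)
```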
Further tweaks
What I just described is as far as one can get for the general case of an arbitrary `difference` order of `D`. However, there are special cases that allow for further enhancement:
- `difference = 1`: this results in a tridiagonal `I + lam * D.T @ D`, which can be solved even faster. This is already included in `scipy.linalg.solve_banded` for the LU decomposition, but unfortunately, there is no equivalent in `scipy.linalg.solveh_banded` for the Cholesky factorisation.
- `difference = 2`: this results in a pentadiagonal `I + lam * D.T @ D`, for whose solution the Python package `pentapy` is highly optimised. As an added bonus, `difference = 2` is used in many baseline correction algorithms, so I'd say this will give the best speedup for the Air- and ArPLS.
`pentapy` also gives an overview of the performance of a variety of banded solvers, which indicates that there is still a lot of performance to gain when moving away from SciPy's `spsolve`.
Besides, there is another added bonus. All these functions can solve `(I + lam * D.T @ D) Z = Y` for `Z` also when `Y` has multiple columns, i.e., multiple spectra to smooth. So, the same expensive factorisation, once computed, can be reused to solve each column of `Y` for the corresponding column of `Z`, which is way less expensive since the solving is only a forward and a backward substitution. On top of that, the loops then also happen in low-level languages and not in Python anymore. Consequently, the Python loops in all the Whittaker-like functions of `chemotools` could simply be dropped for performance.
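A short sketch of this multi-column point (the diagonal-wise matrix assembly is just one way to build the banded storage): `solveh_banded` accepts a two-dimensional right-hand side, so one factorisation serves a whole batch of spectra:

```python
import numpy as np
from scipy.linalg import solveh_banded

n, lam = 200, 10.0
D = np.diff(np.eye(n), n=2, axis=0)   # second-order differences
A = np.eye(n) + lam * (D.T @ D)       # pentadiagonal, symmetric

# upper banded storage for solveh_banded: row 2 - k holds diagonal k
ab = np.zeros((3, n))
for k in range(3):
    ab[2 - k, k:] = np.diag(A, k=k)

Y = np.random.default_rng(0).normal(size=(n, 25))  # 25 noisy "spectra"
Z = solveh_banded(ab, Y, lower=False)  # one factorisation, 25 cheap solves
```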
from chemotools.
Hi @IruNikZe thanks a lot for the fantastic explanation and it looks super promising! I will turn this issue into a branch and you can add your changes there!!
@IruNikZe I have assigned you to that issue, and this is the branch you can add your changes to 😄 😄
44-improve-airpls-and-arpls-performance-sparse-matrix-operations
Hi-hi,
It's been a while, so I wanted to give a quick update.
From testing I figured out two things:
1) Numerical instability of Whittaker smoother
Problem
For difference orders m > 4 and large signal lengths n > 1000, solving the linear system `(W + lambda * D^T @ D) z = W y` becomes unstable. Unfortunately, these sizes are in the order of magnitude of typical spectra, and m = 6 is required for choosing `lambda` based on the signal in a spatially adaptive fashion (which is the next step I want to tackle in a later issue).
The problem affects both the banded Cholesky decomposition as well as the more robust (but more expensive) banded pivoted LU decomposition. `pentapy` fails as well because it uses the same matrix to solve the system.
Cause
The problem comes from squaring the linear system before solving, because `W + lambda * D^T @ D = sqrt(W)^T @ sqrt(W) + (sqrt(lambda) * D)^T @ (sqrt(lambda) * D)`, so:
- the weights `sqrt(W)` are being squared, which boosts any imbalanced weight distribution even further and also introduces more imbalance into normal weights (compare 1 and 1.5, which are close, while their squares 1 and 2.25 are further apart; here the order of magnitude stayed the same, so it can get way worse)
- `sqrt(lambda)`, which can be arbitrarily small or large, is also squared
- the squaring of `D` is the most critical in this context. Since `D` does not have full column rank (fewer rows than columns in this case), `D^T @ D` has m zero eigenvalues, so solving a system where `lambda` is so large that `lambda * D^T @ D` is basically the only numerically present summand causes a breakdown of the system.
All in all, this blows up the condition number of `A = W + lambda * D^T @ D`, since it is squared. High condition numbers indicate that the system is ill-conditioned and the slightest perturbations have strong effects on the final solution. With the limited precision of `float64`, this causes a build-up of rounding errors until the system, even though solvable from a mathematical point of view, cannot be solved numerically (see the Wikipedia article on the condition number and related articles).
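The squaring argument can be checked numerically (a toy sketch; the sizes are arbitrary and the weights are set to unity for simplicity): the condition number of `A = B^T @ B` is exactly the square of the condition number of `B = [[sqrt(W)], [sqrt(lambda) * D]]`:

```python
import numpy as np

n, m, lam = 60, 4, 1e3
D = np.diff(np.eye(n), n=m, axis=0)  # m-th order difference matrix
W_sqrt = np.eye(n)                   # unit weights for simplicity
B = np.vstack([W_sqrt, np.sqrt(lam) * D])
A = B.T @ B                          # equals W + lam * D.T @ D

cond_B = np.linalg.cond(B)  # ratio of extreme singular values of B
cond_A = np.linalg.cond(A)  # equals cond_B ** 2, since eig(A) = sv(B) ** 2
```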
Partial solution
While it's not possible to properly cure the case when `lambda` is so large that the system becomes too ill-conditioned to be solved (at least this problem should not and cannot be tackled from within the solver), all the other cases can be handled by avoiding the squaring just described.
This requires some insight into how the solver works, but basically, `A` is decomposed into two uniquely defined triangular matrices `L` and `L^T` that have the same entries but are transposed, i.e., `A = L @ L^T`. Triangular matrices are easy to solve, but we don't need uniquely defined matrices, because `A = S @ S^T` can be solved in the exact same way as for `L`, as long as `S` is lower triangular.
One such decomposition can be derived by writing `A = B^T @ B`, where `B` is a full matrix with no special structure. From the above, it is obvious that `B = [[sqrt(W)], [sqrt(lambda) * D]]` (if you multiply it out one by one, you get `A`). Now, `B` needs to be triangularised, which can be achieved by a QR decomposition `B = Q @ R`. Here, `Q` is an orthonormal matrix (`Q^T = Q^(-1)`; `Q^T @ Q = I`), and `R` is upper triangular.
With this, `A = B^T @ B = R^T @ Q^T @ Q @ R = R^T @ I @ R = R^T @ R`, i.e., we get a triangular decomposition, but `A` was never formed by squaring, which leaves the condition number relatively low (it is not squared like before), i.e., the system can be solved with strongly improved numerical accuracy. Since `Q` cancels out, only `R` needs to be formed. To keep the computations accurate, I resorted to a floating-point-exact sparse Givens rotation as well as fused multiply-add (FMA) operations.
For this, I went down to Cython (1), which I combined with SciPy's `cython_lapack` for solving the triangular system after the QR decomposition.
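A dense NumPy sketch of the idea (the actual implementation described above uses a sparse Givens-based QR in Cython, so everything here is illustrative): form `R` from `B` without ever squaring, then solve `R^T @ (R @ z) = W @ y` with two triangular solves:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(1)
n, m, lam = 100, 3, 1e4
D = np.diff(np.eye(n), n=m, axis=0)
w = rng.uniform(0.5, 1.5, size=n)  # diagonal of the weight matrix W
y = rng.normal(size=n)

# B stacks sqrt(W) on top of sqrt(lam) * D, so B.T @ B = W + lam * D.T @ D
B = np.vstack([np.diag(np.sqrt(w)), np.sqrt(lam) * D])
R = qr(B, mode="economic")[1]  # only R is needed; Q cancels out

# solve R.T @ (R @ z) = W @ y via forward, then backward substitution
rhs = w * y
z = solve_triangular(R, solve_triangular(R.T, rhs, lower=True), lower=False)
```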
I'll go for a draft pull request first, because we then need to think about packaging in more detail and probably first need to set up the packaging in the CI pipeline (if we wanna go down this road at all). If not, I can also rewrite it in NumPy, which will be slower and less accurate since it has no FMA, or in Numba, which also does not easily offer an FMA.
(1) I also wrote a Rust implementation, which would probably make the package more scalable as functionality grows and new external functionalities are required, which can be kind of a nightmare in Cython.
Added bonus
Since the QR decomposition took some effort to code and also to optimise, I went a step further and extended the system of the Whittaker smoother to solve `(K^T @ W_1 @ K + lambda * D^T @ W_2 @ D) z = K^T @ W_1 @ y`, where `W_2` will be needed later for the spatial adaptiveness.
The added bonus is the kernel matrix `K`, which also needs to be banded. Assume your spectrum was blurred by, e.g., a Lorentzian peak (like the natural linewidth); this can be deconvolved out, and `z` will not become a smooth version of `y`, but a smooth version with the peak width reduced, since the Lorentzian is "missing". Of course, other point spread functions can also be deconvolved out with this.
The original smoothing solver simply has `K = I`.
2) Performance of baseline algorithms
Problem
The baseline algorithms can take some time to converge.
Cause
The initialisation of the baseline algorithms is key here. Right now, they are initialised with a constant vector, but with a more elaborate initialisation, the convergence can be sped up dramatically.
Solution
Typically, one would divide the signal into, say, 20 windows, find the minima of these windows, and interpolate between them. This estimate is often very close to the baseline (given that `lambda` is well chosen).
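A hedged sketch of this initialisation (the function name and the window count of 20 are just illustrative):

```python
import numpy as np

def initial_baseline(y, n_windows=20):
    """Interpolate the window minima of ``y`` as a crude baseline estimate."""
    n = y.size
    edges = np.linspace(0, n, n_windows + 1).astype(int)
    centers = (edges[:-1] + edges[1:]) / 2.0
    minima = np.array([y[lo:hi].min() for lo, hi in zip(edges[:-1], edges[1:])])
    # linear interpolation of the window minima over the full signal axis
    return np.interp(np.arange(n), centers, minima)

# toy spectrum: rising baseline plus one narrow peak
x = np.arange(500)
y = 0.001 * x + np.exp(-((x - 250.0) ** 2) / 50.0)
base = initial_baseline(y)  # tracks the trend, ignores the peak
```

For the mode where the baseline lies above the peaks, flipping the sign of `y` (or equivalently taking window maxima) gives the analogous estimate.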
Alongside this, I would also give the baseline two modes:
- the baseline is below the peaks (e.g., absorbance spectroscopy)
- the baseline is above the peaks (e.g., transmittance spectroscopy)
Implications
I would then have to adapt the tests for `ArPLS` and `AirPLS`, because the different initialisation will change the results. Thus, the tests will fail because they just reference the results of the very first implementation (at least that's my guess).
Any feedback is highly welcome, but I guess when the draft pull request is submitted, things will get clearer 😁