paucablop / chemotools

Integrate your chemometric tools with the scikit-learn API 🧪 🤖

Home Page: https://paucablop.github.io/chemotools/

License: MIT License

Python 99.98% Ruby 0.01% TeX 0.02%
chemometrics machine-learning python scikit-learn sklearn spectra hacktoberfest artificial-intelligence autoencoders deep-learning

chemotools's Introduction

Hi there 👋 I am Pau Cabaneros (paucablop)!

👀 I love modelling biological 🧬 and chemical ⚛ systems to better understand how they can be used to help society

🔭 I am currently working as a process analytical technology scientist and data scientist in the biotech industry 👨‍🔬

♥ I enjoy open source projects and sharing knowledge through code 📚

🌱 I am working to improve my devops and automation skills 🤹‍♀️

chemotools's Issues

Improve AirPLS and ArPLS performance - sparse matrix operations

Description

AirPLS (Adaptive Iteratively Reweighted Penalized Least Squares) and ArPLS (Asymmetrically Reweighted Penalized Least Squares) are powerful algorithms for removing complex non-linear baselines from spectral signals. However, their computational cost can be significant, especially when processing large numbers of spectra. Currently, we use the csc_matrix representation from scipy.sparse to optimize performance, but further improvements are needed.

Improvement Attempts

To improve performance, I tried just-in-time (JIT) compilation of some key functions using numba. However, numba does not support the csc_matrix type, so the code cannot be JIT compiled as it stands. I looked for a numba-compatible representation of sparse matrices but could not find one, so I wrote my own, together with some functions for basic algebra operations on it (the code is in a Gist). Unfortunately, this did not improve performance over the current implementation.

Hacktoberfest Challenge

We invite open source developers to contribute to our project during Hacktoberfest. The goal is to improve the performance of both algorithms.

Here are some ideas to work on:

  • Find a more efficient way to JIT compile the code using tools like numba.
  • Investigate parallel or distributed computing techniques to speed up the processing of multiple spectra.
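As a starting point for the second idea, here is a minimal sketch of an ArPLS-style baseline estimate applied to several spectra through a thread pool. It is not the chemotools implementation: the function names `arpls_baseline` and `correct_spectra`, the default values of `lam`, `ratio`, `max_iter` and `n_workers`, and the use of `ThreadPoolExecutor` are all assumptions made for illustration, and the reweighting rule follows my reading of the ArPLS paper listed under Resources.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve
from scipy.special import expit


def arpls_baseline(y, lam=1e4, ratio=0.01, max_iter=50):
    """Estimate the baseline of one spectrum with an ArPLS-style loop (sketch)."""
    n = y.size
    # Second-order difference operator; the penalty lam * D.T @ D is built once.
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n), format="csc")
    H = lam * (D.T @ D)
    w = np.ones(n)
    z = y.astype(float)
    for _ in range(max_iter):
        A = (sparse.diags(w) + H).tocsc()
        z = spsolve(A, w * y)
        d = y - z
        neg = d[d < 0]
        if neg.size < 2 or neg.std() == 0:
            break  # not enough points below the baseline to estimate the noise
        m, s = neg.mean(), neg.std()
        # Logistic reweighting, written with expit to avoid overflow in exp.
        w_new = expit(-2.0 * (d - (2.0 * s - m)) / s)
        if np.linalg.norm(w - w_new) / np.linalg.norm(w) < ratio:
            break
        w = w_new
    return z


def correct_spectra(X, n_workers=4, **kwargs):
    """Subtract the estimated baseline from every row of X using a thread pool."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        baselines = list(pool.map(lambda row: arpls_baseline(row, **kwargs), X))
    return X - np.vstack(baselines)
```

Each spectrum is independent, so the rows can be handed to separate workers without changing the result. Whether threads give a real speed-up has not been benchmarked here; a process pool or joblib may scale better and would be worth comparing.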

How to Contribute

Here are the contributing guidelines

Contact

We can have the conversation in the Issue or the Discussion

Resources

Here are some relevant resources and references for understanding the theory and implementation of the AirPLS and ArPLS algorithms:

  • Paper on AirPLS: Z.-M. Zhang, S. Chen, and Y.-Z. Liang, Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst 135 (5), 1138-1146 (2010).
  • Paper on ArPLS: S.-J. Baek, A. Park, Y.-J. Ahn, and J. Choo, Baseline correction using asymmetrically reweighted penalized least squares smoothing.

Improve compatibility with the `set_output` API from `scikit-learn`

Description

All chemotools transformers are compatible with scikit-learn; that is the objective of chemotools. One of the recent scikit-learn releases introduced the set_output API, which lets the user choose pandas as the output container. The transformer then returns a pandas.DataFrame instead of the default numpy.ndarray. This works with most chemotools transformers, but I have run into two specific issues:

👉 The column names are lost after the transformation

When I configure a chemotools transformer to produce a pandas.DataFrame, the output has no column names. I compared this with other scikit-learn transformers (such as StandardScaler()), and they do keep the column names in the output.

👉 The API does not work when the transformer reduces the number of features

Some transformers reduce the number of features in the dataset (e.g., they select a subset of its columns). These are the variable selection transformers. I don't really know how to fix this issue.

Hacktoberfest Challenge

We invite open source developers to contribute to our project during Hacktoberfest. The goal is to improve compatibility with the set_output API.

How to Contribute

Here are the contributing guidelines

Contact

We can have the conversation in the Issue or the Discussion (#45)

Resources

👉 Link to the set_output API from scikit-learn

👉 Link to the problem description

BaselineShift fails on 2d array with `dtype=int`

The BaselineShift class fails on a 2D array with an int dtype. The problem lies in the current implementation and reduces to this:

import numpy as np

x = np.array([[1, 2, 3]])  # dtype int
new_x = np.add(x, 0.5)     # dtype float: [[1.5, 2.5, 3.5]]
x[0] = new_x[0]            # the floats are truncated when written into the int array
print(x)                   # [[1 2 3]]: the shift is lost

There are several ways to fix this. The easiest is to cast the array being transformed to float. Alternatively, we can use numpy.apply_along_axis, which avoids the problem because it writes the results into a newly allocated output array (float in this case) rather than back into the integer input.
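The first fix can be sketched as follows. The helper `shift_rows` and its `shifts` argument are hypothetical names used only to illustrate the cast; they are not part of chemotools.

```python
import numpy as np


def shift_rows(X, shifts):
    """Add one scalar shift to each row of X and return a float array."""
    # Copy and promote to float first, so row assignment cannot truncate.
    X_out = np.array(X, dtype=float)
    for i, shift in enumerate(shifts):
        X_out[i] = np.add(X_out[i], shift)
    return X_out


x = np.array([[1, 2, 3]])  # dtype int
print(shift_rows(x, [0.5]))
```

The input array is left untouched because `np.array(..., dtype=float)` makes a copy.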

I don't know if this is a major issue, but could be useful regardless of its importance.

Update: I see that other augmentation classes have a similar implementation and the same issue. Should I make the changes and open a pull request? Let me know (with the solution of your choice).

SPC file format reader

Are there any plans to include an SPC file format reader? Or can you recommend a different Python OSS library that can be used with chemotools?

Add docstrings

Add docstrings to the following methods:

Scatter

  • Multiplicative signal correction
  • Standard normal variate

Baseline

  • AirPls
  • ArPls
  • Constant baseline correction
  • Cubic spline correction
  • Linear correction
  • Non negativity
  • Polynomial correction
  • Subtract reference

Derivative

  • Norris-Williams
  • Savitzky-Golay

Scale

  • Index scaler
  • Min max scaler
  • Norm scaler

Smooth

  • Mean filter
  • Median filter
  • Savitzky-Golay filter
  • Whittaker smooth

Variable selection

  • Range cut

Substitute check_array by ._validate_data()

Right now, the input of each method is checked by a custom function called check_input. This function uses the check_array() function from sklearn.utils. However, it is better practice to use the ._validate_data() method inherited from the BaseEstimator class. In addition, ._validate_data() also sets n_features_in_, so we do not have to do it explicitly, which makes the code more readable.
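A minimal sketch of what this could look like, assuming a hypothetical `MeanCenter` transformer that is not part of chemotools. As far as I know, `_validate_data` is a private method and newer scikit-learn releases expose a public `validate_data` helper in `sklearn.utils.validation` instead, so the sketch tries the public helper first and falls back to the private method.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

try:  # newer scikit-learn releases: public helper
    from sklearn.utils.validation import validate_data
except ImportError:  # older releases: fall back to the private method
    def validate_data(estimator, X, **kwargs):
        return estimator._validate_data(X, **kwargs)


class MeanCenter(TransformerMixin, BaseEstimator):
    """Hypothetical transformer used only to illustrate input validation."""

    def fit(self, X, y=None):
        # reset=True records n_features_in_ on the estimator.
        X = validate_data(self, X, reset=True)
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        # reset=False checks X against the n_features_in_ seen during fit.
        X = validate_data(self, X, reset=False)
        return X - self.mean_


scaler = MeanCenter().fit(np.array([[1.0, 2.0], [3.0, 4.0]]))
print(scaler.n_features_in_)
```

With this pattern a call to `transform` with the wrong number of columns raises a ValueError without any explicit bookkeeping in the transformer.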

ModuleNotFoundError when loading from chemotools.variable_selection import RangeCut

I wanted to try your Coffee Spectra Classifier exercise in google colab but when I run the line

from chemotools.variable_selection import RangeCut

I receive the following error.


ModuleNotFoundError Traceback (most recent call last)
in <cell line: 7>()
5 from chemotools.scatter import StandardNormalVariate
6
----> 7 from chemotools.variable_selection import RangeCut
8 pipeline = make_pipeline(
9 StandardNormalVariate(),

ModuleNotFoundError: No module named 'chemotools.variable_selection'


NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

Can you please help?
Thank you
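The thread does not record the cause, so the following is only a diagnostic sketch. It reports which version of the package is installed in the current environment and whether the submodule can be located at all. The helper names `module_available` and `installed_version` are made up for this example.

```python
import importlib.util
from importlib import metadata


def module_available(name):
    """Return True if `name` (for example "pkg.submodule") can be located."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of `name` is itself missing.
        return False


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("chemotools"))
print(module_available("chemotools.variable_selection"))
```

If the package is installed but the submodule is not found, comparing the reported version with the latest release is a reasonable next step.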

Harmonize naming convention with other libraries for better LLM integration

Some of the functions implemented in chemotools are similar to functions available in other libraries. One example is the Savitzky-Golay filter, which is inherited from scipy. The argument names of the Savitzky-Golay method in chemotools should be the same as in scipy, so that LLMs have a better chance of making the right suggestions.
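To illustrate the idea, here is a sketch of a transformer whose parameter names mirror those of `scipy.signal.savgol_filter` (`window_length`, `polyorder`, `deriv`, `mode`). The class `SavGolSmoother` and its default values are hypothetical and do not describe the current chemotools API.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.base import BaseEstimator, TransformerMixin


class SavGolSmoother(TransformerMixin, BaseEstimator):
    """Hypothetical smoother with the same parameter names as savgol_filter."""

    def __init__(self, window_length=5, polyorder=2, deriv=0, mode="interp"):
        self.window_length = window_length
        self.polyorder = polyorder
        self.deriv = deriv
        self.mode = mode

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Filter along axis=1 so every row (spectrum) is smoothed independently.
        return savgol_filter(
            X,
            window_length=self.window_length,
            polyorder=self.polyorder,
            deriv=self.deriv,
            mode=self.mode,
            axis=1,
        )


spectra = np.tile(np.linspace(0.0, 1.0, 20), (2, 1))
smoothed = SavGolSmoother(window_length=5, polyorder=2).fit_transform(spectra)
print(smoothed.shape)
```

Keeping the names identical means a call written for scipy can be translated to the transformer by copying the keyword arguments unchanged.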
