riken-aip / pyhsiclasso

Versatile Nonlinear Feature Selection Algorithm for High-dimensional Data

License: MIT License

Python 100.00%
feature-selection feature-extraction machine-learning-algorithms nonlinear python blackbox-algorithm

pyhsiclasso's Introduction

pyHSICLasso


pyHSICLasso is a package implementing the Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso), a black-box (nonlinear) feature selection method that accounts for nonlinear input-output relationships. HSIC Lasso can be regarded as a convex variant of the widely used minimum redundancy maximum relevance (mRMR) feature selection algorithm.

Advantage of HSIC Lasso

  • Can find nonlinearly related features efficiently.
  • Can find non-redundant features.
  • Can obtain a globally optimal solution.
  • Can deal with both regression and classification problems through kernels.

Feature Selection

The goal of supervised feature selection is to find a subset of input features that are responsible for predicting the output values. Using this method, you can capture nonlinear dependencies between inputs and outputs and compute the globally optimal solution efficiently even for high-dimensional problems. Its effectiveness has been demonstrated through feature selection experiments for classification and regression with thousands of features. Finding a subset of features in high-dimensional supervised learning is an important problem with many real-world applications such as gene selection from microarray data, document categorization, and prosthesis control.

Install

$ pip install -r requirements.txt
$ python setup.py install

or

$ pip install pyHSICLasso

Usage

First, pyHSICLasso provides a single entry point, the HSICLasso() class.

This class has the following methods (a minimal instantiation sketch follows the list).

  • input
  • regression
  • classification
  • dump
  • plot_path
  • plot_dendrogram
  • plot_heatmap
  • get_features
  • get_features_neighbors
  • get_index
  • get_index_score
  • get_index_neighbors
  • get_index_neighbors_score
  • save_param
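
As a minimal sketch, the class is imported and instantiated once, and every method above is then called on that object (the Example section below shows a full session):

>>> from pyHSICLasso import HSICLasso
>>> hsic_lasso = HSICLasso()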

The input method accepts data in the following formats.

  • MATLAB file (.mat)
  • .csv
  • .tsv
  • numpy's ndarray

Input file

When using .mat, .csv, or .tsv files, the data are read into a pandas dataframe whose rows correspond to samples. The output variable should be in a column named class. If you wish to use your own column name, specify the output variable(s) as a list (output_list=['tag']). The remaining columns are the values of each feature. The following is sample data in CSV format.

class,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10
-1,2,0,0,0,-2,0,-2,0,2,0
1,2,2,0,0,-2,0,0,0,2,0
...

For multi-variate output cases, you can specify the outputs with the same list (output_list). See the sample code for details; a minimal loading sketch follows.
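
For instance, a minimal loading sketch, assuming a hypothetical file my_data.csv whose output column is named tag:

>>> from pyHSICLasso import HSICLasso
>>> hsic_lasso = HSICLasso()
>>> # output_list names the column(s) to use as the output variable
>>> # instead of the default "class" column.
>>> hsic_lasso.input("my_data.csv", output_list=['tag'])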

Save results to a CSV file

If you want to save the feature selection results to a CSV file, call the following function:

>>> hsic_lasso.save_param()

To remove the effect of specific covariates

In biological applications, we may want to remove the effect of covariates such as gender and/or age. In such cases, we can pre-specify the covariates X in the classification or regression functions as follows:

>>> hsic_lasso.regression(5,covars=X)

>>> hsic_lasso.classification(10,covars=X)

Please check example/sample_covars.py for details.
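
For illustration only, a sketch of what the covariate matrix X might look like, assuming hypothetical age and gender covariates coded numerically (one row per sample, in the same order as the input data):

>>> import numpy as np
>>> X = np.array([[34, 0], [51, 1], [47, 0]])
>>> hsic_lasso.regression(5, covars=X)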

To handle a large number of samples

HSIC Lasso scales well with respect to the number of features d. However, the vanilla HSIC Lasso requires O(dn^2) memory and may run out of memory if the number of samples n exceeds 1000. In such cases, we can use the block HSIC Lasso, which requires only O(dnBM) memory, where B << n is the block parameter and M is the permutation parameter used to stabilize the final result. Block HSIC Lasso is enabled by specifying the B and M parameters in the regression or classification function. Currently, the defaults are B=20 and M=3, respectively. If you wish to use the vanilla HSIC Lasso, use B=0 and M=1.
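
A minimal sketch of both variants, selecting 5 features (B=20 and M=3 are the stated defaults; B=0 and M=1 recovers the vanilla HSIC Lasso):

>>> hsic_lasso.regression(5, B=20, M=3)
>>> # Vanilla HSIC Lasso: may require O(dn^2) memory for large n.
>>> hsic_lasso.regression(5, B=0, M=1)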

Example

>>> from pyHSICLasso import HSICLasso
>>> import numpy as np
>>> hsic_lasso = HSICLasso()

>>> hsic_lasso.input("data.mat")

>>> hsic_lasso.input("data.csv")

>>> hsic_lasso.input("data.tsv")

>>> hsic_lasso.input(np.array([[1, 1, 1], [2, 2, 2]]), np.array([0, 1]))

You can specify the number of features to select as the argument of regression or classification.

>>> hsic_lasso.regression(5)

>>> hsic_lasso.classification(10)

As for output, you can plot the result, dump a textual summary of the analysis, or retrieve the indices of the selected features. Note that the dump() function requires at least 5 features in the dataset.

>>> hsic_lasso.plot_path()
# plot the regularization path

>>> hsic_lasso.dump()
============================================== HSICLasso : Result ==================================================
| Order | Feature      | Score | Top-5 Related Feature (Relatedness Score)                                          |
| 1     | 1100         | 1.000 | 100          (0.979), 385          (0.104), 1762         (0.098), 762          (0.098), 1385         (0.097)|
| 2     | 100          | 0.537 | 1100         (0.979), 385          (0.100), 1762         (0.095), 762          (0.094), 1385         (0.092)|
| 3     | 200          | 0.336 | 1200         (0.979), 264          (0.094), 1482         (0.094), 1264         (0.093), 482          (0.091)|
| 4     | 1300         | 0.140 | 300          (0.984), 1041         (0.107), 1450         (0.104), 1869         (0.102), 41           (0.101)|
| 5     | 300          | 0.033 | 1300         (0.984), 1041         (0.110), 41           (0.106), 1450         (0.100), 1869         (0.099)|
>>> hsic_lasso.get_index()
[1099, 99, 199, 1299, 299]

>>> hsic_lasso.get_index_score()
array([0.09723658, 0.05218047, 0.03264885, 0.01360242, 0.00319763])

>>> hsic_lasso.get_features()
['1100', '100', '200', '1300', '300']

>>> hsic_lasso.get_index_neighbors(feat_index=0,num_neighbors=5)
[99, 384, 1761, 761, 1384]

>>> hsic_lasso.get_features_neighbors(feat_index=0,num_neighbors=5)
['100', '385', '1762', '762', '1385']

>>> hsic_lasso.get_index_neighbors_score(feat_index=0,num_neighbors=5)
array([0.9789888 , 0.10350618, 0.09757666, 0.09751763, 0.09678892])

>>> hsic_lasso.save_param() # Save selected features and their neighbors

Citation

If you use this software for your research, please cite the following two papers: the original HSIC Lasso and its block counterpart.

@article{yamada2014high,
  title={High-dimensional feature selection by feature-wise kernelized lasso},
  author={Yamada, Makoto and Jitkrittum, Wittawat and Sigal, Leonid and Xing, Eric P and Sugiyama, Masashi},
  journal={Neural computation},
  volume={26},
  number={1},
  pages={185--207},
  year={2014},
  publisher={MIT Press}
}

@article{climente2019block,
  title={Block HSIC Lasso: model-free biomarker detection for ultra-high dimensional data},
  author={Climente-Gonz{\'a}lez, H{\'e}ctor and Azencott, Chlo{\'e}-Agathe and Kaski, Samuel and Yamada, Makoto},
  journal={Bioinformatics},
  volume={35},
  number={14},
  pages={i427--i435},
  year={2019},
  publisher={Oxford University Press}
}

References

Applications of HSIC Lasso

  • Takahashi, Y., Ueki, M., Yamada, M., Tamiya, G., Motoike, I., Saigusa, D., Sakurai, M., Nagami, F., Ogishima, S., Koshiba, S., Kinoshita, K., Yamamoto, M., Tomita, H. Improved metabolomic data-based prediction of depressive symptoms using nonlinear machine learning with feature selection. Translational Psychiatry volume 10, Article number: 157 (2020).

Contributors

Developers

Name : Makoto Yamada (Kyoto University/RIKEN AIP), Héctor Climente-González (RIKEN AIP)

E-mail : [email protected]

Distributor

Name : Hirotaka Suetake (RIKEN AIP)

E-mail : [email protected]

pyhsiclasso's People

Contributors

hclimente, inktoyou, myamada0321, suecharo


pyhsiclasso's Issues

Number of selected features

Hello, I just tried this tool on metabolomics data I have. Interestingly, HSIC Lasso selects just 76 metabolites out of the 2035 available. The R-squared score using these selected metabolites is only 0.18, compared to about 0.60 for Lasso on the original 2035 metabolites. My assumption is that the number of selected features is probably too small. I used SVR (kernel='rbf') from sklearn after feature selection with HSIC Lasso.
Is there a way to increase the number of features HSIC Lasso selects?

As a predictor?

Hey, that's awesome and I'm trying to use it in my thesis, but may I ask how to use it as a classifier? I have looked through the whole code, but how do I fit it to different subsets and get an overall precision score?

Modeling combinatorial effects of features?

Hi,

I've been extending this HSIC-LASSO implementation to use specific types of distance-based kernels for microbiome data. I'd like to verify whether my understanding of the implementation and purpose of the "block" HSIC implementation is correct. First off, my understanding is that the "block" part of HSIC-LASSO is an optimization to speed up kernel computation time, correct? Second, in the code, I have noticed that the "block" HSIC LASSO kernel computation essentially constructs "mini" (subsets of) kernels based on subsets of samples for single features (over the range of d features). If I am reading this correctly, the kernel computation is constructed for a single dimension only, which misses modeling the combinatorial effects of multiple features. Of course, this is not ideal when combinatorial effects are present in the data. Perhaps I am missing something or not looking at the full picture. Could someone please elaborate on this? Thank you!

Multi-variate output support

Currently, the HSIC Lasso can only handle uni-variate output. Thus, we plan to extend the HSIC Lasso to multi-variate output.

Is there a way to extract the predicted value of the trained HSIC Lasso (Regression)?

After HSIC Lasso (Regression) has finished executing, we will have the beta values for every feature in the training dataset. Therefore, is there a way to determine the predicted value of a given instance? I am trying to evaluate the model fit via mean squared error, as done in the original paper (High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso, Section 4.3.2)

Block Lasso selects fewer features than the vanilla algorithm

When I used block Lasso with a threshold of 77 features (from 770 features), I got only 57 features. The block size was a divisor of the number of data instances. However, when I set the block size to zero, I got exactly 77 features. Is it normal for block Lasso to return fewer features? This happened when I used the permutation parameter M with value one.

The other difference is that when I use vanilla Lasso I get the following warning:
C:\Program Files\Python37\lib\site-packages\pyHSICLasso\nlars.py:77: RuntimeWarning: divide by zero encountered in true_divide gamma1 = (C - c[I]) / (XtXw[A[0]] - XtXw[I])

Block lasso had no warnings.

Then I tried block Lasso with M=2. I got 77 features, but also the following warnings:
C:\Program Files\Python37\lib\site-packages\pyHSICLasso\nlars.py:77: RuntimeWarning: invalid value encountered in true_divide gamma1 = (C - c[I]) / (XtXw[A[0]] - XtXw[I])
C:\Program Files\Python37\lib\site-packages\pyHSICLasso\nlars.py:83: RuntimeWarning: invalid value encountered in less_equal gamma[gamma <= 1e-9] = np.inf
C:\Program Files\Python37\lib\site-packages\pyHSICLasso\nlars.py:85: RuntimeWarning: invalid value encountered in less mu = min(gamma)
C:\Program Files\Python37\lib\site-packages\pyHSICLasso\nlars.py:77: RuntimeWarning: divide by zero encountered in true_divide gamma1 = (C - c[I]) / (XtXw[A[0]] - XtXw[I])

At last, I tried M=3, also got 77 features and the same warning as with vanilla Lasso.

I have two questions. Should I use M=1 with no warnings and fewer features, or M=3 with the same warning the vanilla Lasso had? Are these warnings of some importance, or are they within normal expected behavior?

UPDATE
Now I tried to get 9200 features out of 92000 with block Lasso with B=19 and M=3, but I got even fewer features than before: only 33. Should I scale M with the number of features?

Clarification on the difference between an input vs. output kernel

y_kernel We employ the Gaussian kernel for inputs. For output kernels,

Hi, I'm wondering if some clarification could be provided on this difference.

In addition, is it necessary that the y_kernel and x_kernel are the same? My intuition is that they should be, but from what I can see in the code, that is not enforced. What is the rationale for allowing y and X to be projected into different spaces?

input

What does Y represent when a numpy array is the input? How do I use it? I'm a little confused.
