erdogant / pca

pca: A Python Package for Principal Component Analysis.

Home Page: https://erdogant.github.io/pca

License: MIT License

pca biplot explained-variance principal-component-analysis 3d-plot outliers hotelling-t2

pca's Introduction


pca: A Python Package for Principal Component Analysis. The core of pca is built on sklearn functionality to find maximum compatibility when combining with other packages. But this package can do a lot more: besides regular PCA, it can also perform SparsePCA and TruncatedSVD, and the best approach is chosen depending on your input data. ⭐️ Star this repo if you like it ⭐️

Other functionalities of pca include:

  • Biplot to plot the loadings
  • Determine the explained variance
  • Extract the best performing features
  • Scatter plot with the loadings
  • Outlier detection using Hotelling's T2 and/or SPE/DmodX

Support

Your ❤️ is important to keep maintaining my packages. You can support in various ways: have a look at the sponsor page. Report bugs and issues, or help out with developing new features! If you don't have the time to help or are still learning, you can also take a Medium Membership using my referral link to keep reading all my hands-on blogs. If you don't need that either, there is always the coffee! Thank you! Buy Me a Coffee at ko-fi.com

Read the Medium blog for more details.

On the documentation pages you can find detailed information about how pca works, together with many examples.

Installation

pip install pca
Import the pca package:
from pca import pca

Quick start

Make a biplot, plot the explained variance, and create 3D plots.
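A minimal sketch to get going (the calls mirror those quoted in the issues further down; the iris data is used purely for illustration):

import pandas as pd
from sklearn.datasets import load_iris
from pca import pca

# Small example dataset.
data = load_iris()
X = pd.DataFrame(data=data.data, columns=data.feature_names, index=data.target)

# Initialize and fit; normalize=True standardizes the features first.
model = pca(n_components=3, normalize=True)
results = model.fit_transform(X)

# Cumulative explained variance plot.
fig, ax = model.plot()

# 2D and 3D biplots with the top loadings drawn as arrows.
fig, ax = model.biplot(n_feat=4, legend=True)
fig, ax = model.biplot3d(n_feat=4, legend=True)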

Normalizing out the first and subsequent components from the data. This is useful if the data is separated in its first component(s) by unwanted or biased variance, such as sex, experiment location, etc.
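As a rough illustration of the idea in plain numpy/sklearn (not the package's own API): project onto the components, zero out the unwanted leading component(s), and map back.

import numpy as np
from sklearn.decomposition import PCA

# X: (n_samples, n_features) stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

p = PCA().fit(X)
scores = p.transform(X)

# Zero out the first component and reconstruct the data without it.
scores[:, 0] = 0
X_norm = scores @ p.components_ + p.mean_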

Make the biplot. It can be nicely seen that the feature with the most variance (f1) is almost horizontal in the plot, whereas the feature with the second-most variance (f2) is almost vertical. This is expected because most of the variance is in f1, followed by f2, etc.

(Figure: explained variance plot)

Biplot in 2D and 3D. Here we see the nice addition of the expected f3 in the plot in the z-direction.

(Figures: biplot and biplot3d)

To detect outliers across the multi-dimensional space of PCA, Hotelling's T2 test is incorporated. This basically means that chi-square tests are computed across the top n_components (default: PC1 to PC5). It is expected that the highest variance (and thus the outliers) will be seen in the first few components because of the nature of PCA, so going deeper into PC space may not be required, but the depth is optional. This approach results in a P-value matrix (samples x PCs), for which the P-values per sample are then combined using Fisher's method. This makes it possible to detect outliers and rank them (strongest to weakest). The alpha parameter determines the sensitivity of the outlier detection (default: 0.05).
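A hedged sketch of how the detected outliers can be retrieved (the detect_outliers and alpha parameters and the y_bool column are taken from the issue reports further down):

import numpy as np
import pandas as pd
from pca import pca

# Dummy data with a few injected outliers.
np.random.seed(42)
X = pd.DataFrame(np.random.randint(low=1, high=10, size=(500, 10)))
X.iloc[10:15, 8:] = 15

model = pca(n_components=5, alpha=0.05, normalize=True, detect_outliers=['ht2', 'spe'])
results = model.fit_transform(X)

# Per-sample statistics; y_bool flags the Hotelling's T2 outliers.
outliers = model.results['outliers']
print(outliers[outliers['y_bool']])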


Citation

Please cite pca in your publications if it has been useful for your research (see citation).

Maintainers


pca's People

Contributors

alyetama, erdogant, hovinh, nickgirardo, nightvision04, tgy


pca's Issues

Circular import issue

To whom it may concern,

After following your installation tutorial, I am stuck on importing the package.
This is the only line in my script:
from pca import pca

And here is the error.
ImportError: cannot import name 'pca' from partially initialized module 'pca' (most likely due to a circular import)

I am using an M1 Mac with Python 3.8 installed via conda and pip. I didn't use the GitHub link, though.

Thanks!

customizing marker shape

Really nice, easy to use package. Thanks for putting it together!

I did want to ask if it was possible to bring a second column of annotations (in addition to the 'y' you use as convention), in order to customize the marker shape. I suspect not based on the API, but it would be a great enhancement in the future.

Cheers,
Jason

Unable to set colors for labels

Problem

model.biplot(
    y=df['target'],  # list of 'Y' and 'N' values or any other qualifiers
    cmap=mpl.colors.ListedColormap(['red', 'green']),
)
  1. You can't set 'Y' to be green and 'N' to be red. They can swap colors depending on the data.
  2. If there are only 'Y's or only 'N's, the color is black.

The current approach is kind of okay if you want to distinguish different categories, but it doesn't allow consistency across different datasets.

I haven't found a workaround.

UPD: master...koutoftimer:pca:master; it is not great by any means, but at least it works.

Feature request: use helper function approach for plotting

Problem

The only way to nest figures is via helper function as stated in Matplotlib Quickstart.

The current implementation requires monkey patching matplotlib before the biplot call if one would like to have several subplots on the figure.

Workaround

from contextlib import contextmanager
import matplotlib.pyplot as plt

@contextmanager
def monkey_patched_figure_and_subplot(fig: plt.Figure, ax: plt.Axes):
    # Temporarily redirect pyplot figure/subplot creation to existing objects.
    old_figure = plt.figure
    old_add_subplot = fig.add_subplot

    def new_figure(*args, **kwargs):
        return fig

    def new_add_subplot(*args, **kwargs):
        return ax

    plt.figure = new_figure
    fig.add_subplot = new_add_subplot
    try:
        yield
    finally:
        # Restore the originals even if the plotting call raises.
        plt.figure = old_figure
        fig.add_subplot = old_add_subplot

...

model = pca(n_components=2)
model.fit_transform(data)
with monkey_patched_figure_and_subplot(fig, ax):
    model.biplot()

Suggestion

Allow passing ax and fig to biplot and other methods.

Outlier detection plots

Great library, thanks for sharing! Building all the plots is very convenient :)

This is a feature request to add plots that are useful for outlier detection:

  • SPE/DmodX score vs row scatter with DCrit shown for moderate outliers
  • Hotelling’s T^2 on the scores plot for strong outliers

For Hotelling’s this SO post has one way.
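For reference, a minimal sketch of one common formulation (plain numpy/scipy/matplotlib, not part of this package): the 95% T2 limit from the F-distribution drawn as an ellipse over the first two score columns.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f

# Stand-in scores: (n_samples, 2) array of PC1/PC2 values.
rng = np.random.default_rng(1)
scores = rng.normal(scale=[3.0, 1.0], size=(200, 2))

n, k = scores.shape
# 95% Hotelling's T2 control limit for k components.
t2_lim = k * (n - 1) / (n - k) * f.ppf(0.95, k, n - k)

# Ellipse semi-axes scale with the per-component score variances.
sx = np.sqrt(scores[:, 0].var(ddof=1) * t2_lim)
sy = np.sqrt(scores[:, 1].var(ddof=1) * t2_lim)

theta = np.linspace(0, 2 * np.pi, 200)
plt.scatter(scores[:, 0], scores[:, 1], s=10)
plt.plot(sx * np.cos(theta), sy * np.sin(theta), 'r--', label='95% T2 limit')
plt.xlabel('PC1'); plt.ylabel('PC2'); plt.legend()
plt.show()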

Figure size

Is there a way to reduce the figure dimension on Jupyter?
For example: model.scatter gives a huge figure. I wanted to make it smaller.
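Not confirmed by this thread, but a later issue quotes a figsize keyword in a biplot() call, so presumably (assumption) the same keyword works here:

# Assumption: scatter() accepts figsize like biplot() does in a later issue.
fig, ax = model.scatter(figsize=(6, 4))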

Feature: Add loadings scale on secondary axes

When visualising biplots, it's often useful to interpret the loadings placed on each feature in order to make deductions such as "principal component 1 places approximately double the loading on features A, B & C than it places on feature D".

Without the scale of the loading vectors in each principal component direction explicitly known, however, this can lead to misleading deductions.

This feature request is thus to add the loading scales to each principal component axis.

Example of biplot with loading scales shown:

(image)
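A hedged matplotlib sketch of what such a secondary loading axis could look like (secondary_xaxis/secondary_yaxis are standard matplotlib; arrow_scale is a hypothetical stand-in for whatever factor the package stretches the loading arrows by):

import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the axes returned by a biplot; arrow_scale is hypothetical.
fig, ax = plt.subplots()
rng = np.random.default_rng(0)
ax.scatter(*rng.normal(size=(2, 100)))
arrow_scale = 3.0

# Secondary axes that re-express score units as loading units.
sec_x = ax.secondary_xaxis('top', functions=(lambda s: s / arrow_scale,
                                             lambda l: l * arrow_scale))
sec_x.set_xlabel('PC1 loading')
sec_y = ax.secondary_yaxis('right', functions=(lambda s: s / arrow_scale,
                                               lambda l: l * arrow_scale))
sec_y.set_ylabel('PC2 loading')
plt.show()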

Optionally suppress the center-annotations in scatter plots

When creating scatter plots (using scatterd), labels are plotted at the center of mass for each point cloud, see below.

Is it possible to optionally suppress those labels? A couple of reasons for this:

  • For smaller point clouds and larger number of classes, this can become quite messy.
  • The information is redundant, as the legend also indicates the class category.
  • The labels can interfere with other content in the plot, see for example the label for the category "versicolor", that interferes with the arrow-annotation "petal length".
(image)

I noticed around L656 in pca.py that the labels are extracted from the available arguments. But it appears not to be possible to disable the creation of these annotations.

The Rolls-Royce improvement would be to optionally include those cloud-mass annotations when placing texts with "adjustText". But again, adding a switch to disable these cloud annotations would be nice!

How to remove text labels from scatterplots?

Hi,

Thanks for the great package. Whenever I plot my data with the scatter() function, I get large text labels on each point. I'd like to remove these, but if I use any of the "fontcolor", "fontsize", or "fontweight" arguments, I get a TypeError stating "scatter() got an unexpected keyword argument 'fontcolor'" or the like. Do you know what may be happening?

Thanks!

Conda recipe?

Hi @erdogant,

Thank you for providing this library. It's exactly what I've been looking for.

It would be great to have this installable through conda-forge. Would you be interested in this? If yes, I would be happy to assist with drafting a recipe for the conda build.

Best,
Vini

AttributeError if pca().fit_transform used on list

The following case will throw an AttributeError:

from pca import pca
X = [[1,2],
     [2,1],
     [3,3]]

p = pca(n_components=2, normalize=True)
p.fit_transform(X)

328             if verbose>=3: print('[pca] >n_components is set to %d' %(self.n_components))
    329 
--> 330         self.n_feat = np.min([self.n_feat, X.shape[1]])
    331 
    332         if (not self.onehot) and (not self.normalize) and isinstance(X, pd.DataFrame) and (str(X.values.dtype)=='bool'):

AttributeError: 'list' object has no attribute 'shape'

Can we check X's type and convert it to np.ndarray if it is a list?
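A hedged sketch of the suggested guard (the helper name _as_array is hypothetical; where it belongs inside fit_transform is up to the maintainer):

import numpy as np

def _as_array(X):
    # Coerce plain Python lists to a numpy array so X.shape exists.
    if isinstance(X, list):
        X = np.array(X)
    return X

X = _as_array([[1, 2], [2, 1], [3, 3]])
print(X.shape)  # (3, 2)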

Data Column Adjusted PCA

If y_continuous_cmap is set to True, use a continuous cmap in the plot

Could I create a PR for mapping y to a continuous color map? I was thinking syntax could be like this:

ax = model.biplot3d(n_feat=10, legend=False, y_continuous_cmap=True)

And it could generate something like the right side (in terms of the point colors).

(image)

Fit_transform but no transform()

Hi, maybe I've missed it, but could you add a transform method to supplement fit_transform? Ideally I'd like to call this on new data after it's already been fit and return PCs without refitting.
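For what it's worth, a later issue in this thread uses exactly such a method; a sketch of that usage (the update_outlier_params flag is taken from that issue):

import numpy as np
from pca import pca

X_train = np.random.rand(100, 5)
X_new = np.random.rand(10, 5)

model = pca(n_components=2)
model.fit_transform(X_train)

# Project new samples onto the fitted components without refitting.
PC_new = model.transform(X_new, update_outlier_params=False)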

PLS dmodx

Hi Erdogant,

How would it be possible to compute DmodX in PLS?

Thanks
Pablo

Verbose parameter not passed to scatterd() in pca.scatter()

In pca.py#L675 the call to scatterd.scatterd() does not pass in a verbose parameter. This results in some verbose output when calling pca.scatter() regardless of verbose initialization for the PCA object.

  fig, ax = scatterd(x=xs,
                     y=ys,
                     z=zs,
                     ....
                     fig=fig,
                     ax=ax)

Should probably become

  fig, ax = scatterd(x=xs,
                     y=ys,
                     z=zs,
                     ....
                     fig=fig,
                     ax=ax,
                     verbose=verbose)

Unable to turn off the default plot title of `biplot()`.

How do I remove the default plot title of biplot()?
I tried using the title parameter of the biplot() function, but I'm still unable to remove the default plot title.
fig, ax = model.biplot(label=True, legend=False, cmap='autumn', figsize=(15,10), title=None)
I also tried to modify the returned axis of biplot():
ax.set_title(' ', loc='center')
but it is not modifying the title.

What is the purpose of calling fig.set_visible(visible)?

I like this package a lot, thanks for the work!

Often, users wish to adjust the appearance of plots before displaying them. When model.biplot() is called, one can suppress the call to plt.show() by setting the argument visible=False. However, this also sets fig.set_visible(False). In my Python / IPython environment, this has the side effect that a later call to plt.show() will just show an empty figure, see below.

(image)

Setting the visible-status of the figure is not particularly user-friendly. I had to dig through this package's code to understand why the figure is not shown properly. For me, only the following works:

fig, ax = model.biplot(n_feat=4, visible=False)
ax.axis("scaled")  # Some user adjustment...
fig.set_visible(True) # Required, otherwise the figure will be empty!
fig.show()

The question: What's the purpose of the call to fig.set_visible()? I recommend omitting this in pca.py (I've counted two occurrences).

Another thing one might want to consider: don't call plt.show() in plotting routines! That's often left to the users in other plotting packages. Personally, I find it intrusive when a package determines the time when a window appears in my scripts 😉

NIPALS decomposition method

Hello,
I've tried your PCA package, it's great and I want to say: thank you for your efforts.
I'm looking for the NIPALS decomposition method, because in our domain (spectral data) this method works better than SVD.
Do you have this method in your package?
Thanks for listening.
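For readers unfamiliar with it, a minimal numpy sketch of the standard NIPALS iteration (not part of this package): each component's score/loading pair is refined until convergence, then the matrix is deflated.

import numpy as np

def nipals(X, n_components=2, tol=1e-8, max_iter=500):
    # Returns scores T and loadings P with X (centered) ~ T @ P.T.
    X = X - X.mean(axis=0)
    T = np.zeros((X.shape[0], n_components))
    P = np.zeros((X.shape[1], n_components))
    for k in range(n_components):
        t = X[:, [np.argmax(X.var(axis=0))]]   # start: highest-variance column
        for _ in range(max_iter):
            p = X.T @ t / (t.T @ t)            # loading estimate
            p /= np.linalg.norm(p)
            t_new = X @ p                      # score estimate
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        X = X - t @ p.T                        # deflate before next component
        T[:, [k]] = t
        P[:, [k]] = p
    return T, P

T, P = nipals(np.random.rand(50, 8), n_components=3)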

Defining Necessary Number of Dimensions

For fun I also borrowed some other data from This Link to see how personality and test performance can be condensed into a dimensionally reduced model.
personality_score.csv

Question 1: What is the proper way of selecting a sufficient number of dimensions to preserve the data while avoiding noise? Kaiser–Meyer–Olkin, Levene, and others all seem to be better descriptors than the "eigenvalue > 1" rule.
Question 2: Can PCA be integrated with something else such that it can behave like PCR or Lasso regression (as in reducing the number of unnecessary columns before attempting to be accurate)?
Question 3: Can ICA be used to discover significant columns? It is seen as a way to isolate components after using PCA to assess the proper dimension count.

!pip install pca
from pandas import read_csv
from pca import pca

df = read_csv('https://files.catbox.moe/4nztka.csv')
df = df.drop(columns=df.columns[0])
y = df[['AFQT']]
X = df.drop(columns=['AFQT'])
model = pca(normalize=True)
results = model.fit_transform(X)
print(model.results['explained_var'])
fig, ax = model.plot()
fig.savefig('personality_performance.png')

(image: personality_performance.png)

how to plot part of data by function model.scatter()

hi,

Thanks for your python library, I am using the library for PCA.

But have a question, for the function model.scatter()

model.scatter(SPE=True, hotellingt2=True, legend=False, label=None, cmap='Set1', ='#ffffff')

The data has 2000 rows; can we plot just the data from row 1000 to row 2000?

Or another dataset?

Thanks a lot.

Willa.
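A hedged workaround rather than a package feature: slice the scores returned by fit_transform and scatter them manually (the results['PC'] key and its PC1/PC2 columns are assumed here):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pca import pca

X = pd.DataFrame(np.random.rand(2000, 5))
model = pca(n_components=2)
results = model.fit_transform(X)

# Assumption: scores live in results['PC'] with columns 'PC1' and 'PC2'.
PC = results['PC'].iloc[1000:2000]
fig, ax = plt.subplots()
ax.scatter(PC['PC1'], PC['PC2'], s=10)
ax.set_xlabel('PC1'); ax.set_ylabel('PC2')
plt.show()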

Is DmodX estimate consistent with SIMCA (Umetrics)?

Hi Erdogan,

Thanks for building this package! Much needed for Python ecosystem.
I'm wondering if the DmodX outputted by pca.spe_dmodx() is consistent with SIMCA? I'm aware that SIMCA's methods are proprietary, but did you get a chance to compare the results?

Thanks!

Other graphs of interest (Correlation Plots)?

A PCA correlation matrix plot, sorted by similarity, might be good for seeing which variable has the most effect on PC0/PC1. https://www.reneshbedre.com/blog/principal-component-analysis.html#perform-pca-using-scikit-learn

P.S. Multicollinearity graphs are useful, but they are not quite PCA. Maybe the first factor as the amount of green, the second as the amount of red, and the third factor as the amount of blue? https://www.algorhythmblog.be/2022/04/05/visualizing-multicollinearity-in-python/

model.biplot() error when pd.options.mode.copy_on_write = True

Python 3.11.4
Pandas 2.0.3
pca 2.0.3

Code:

from sklearn.datasets import load_wine
import pandas as pd
from pca import pca

pd.options.mode.copy_on_write = True

data = load_wine()
df = pd.DataFrame(index=data.target, data=data.data, columns=data.feature_names)
model = pca(normalize=True, detect_outliers=['ht2', 'spe'], n_std=2)
results = model.fit_transform(df)

model.biplot(SPE=False, HT2=True, density=True)

Results:

ValueError: assignment destination is read-only

Fixed code (no errors):

pd.options.mode.copy_on_write = False
model.biplot(SPE=False, HT2=True, density=True)
pd.options.mode.copy_on_write = True

Typo in axis label

Typo in pca.py#L970 and pca.py#L1401: It should be "principal" not "principle".

Biplot `n_features` has no effect

I'm recreating figure 10.1 from the book Introduction to Statistical Learning with this library, specifically creating a 2-d biplot from the USA Arrests dataset.

However, when creating a biplot, only the first 2 loading vectors are displayed irrespective of what I pass to n_features. In addition, although it is a separate issue, the loading vector label is plotted outside the scale of the plot.

From a quick scan of the code it looks like the issue is in the compute_topfeat method, where n_feat never gets taken into account, but rather n_pcs gets iterated over twice.

P.S. - great work on this library. Exactly what I was looking for when googling "PCA biplots python". For that reason, I wouldn't mind helping out with maintaining this library if need be.

Allow train/infer mode for outlier detection in Hotelling's T2 and SPE?

Hi erdogant,

Firstly, I use your package in daily work and feel grateful for your sharing of this cool package with the community.

I'm employing Hotelling's T2 and SPE in the form of a quality-control chart, in which each product attribute is plotted as a single data point; those that deviate above a certain threshold are flagged and intervened on. Given this context, I have a training dataset from which the required parameters are extracted in train mode, i.e. mean(X) & var(X) for Hotelling's T2 and g_ell_center & cov for SPE, and then reused to transform newly arriving data in infer mode.

I have implemented this feature in the compute_outliers() function and wonder if I can contribute this to the pca package?

Sincerely,
Vinh

Plotting an ellipsoid in a 3D scatter plot

Hi Erdogant,

Is there built-in functionality to plot an ellipsoid in a 3D scatter plot?
I want to use the three most significant components of my data to do an outlier detection by using this ellipsoid.
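Not a built-in as far as this thread shows; a hedged numpy/matplotlib sketch of a covariance ellipsoid around the top three score columns (the 95% coverage assumes roughly normal scores):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

# Stand-in scores: (n_samples, 3) array of the top three PCs.
rng = np.random.default_rng(0)
scores = rng.multivariate_normal([0, 0, 0], np.diag([9, 4, 1]), size=300)

center = scores.mean(axis=0)
cov = np.cov(scores, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
radii = np.sqrt(vals * chi2.ppf(0.95, df=3))   # 95% under normality

# Parametric unit sphere, stretched and rotated by the covariance.
u, v = np.mgrid[0:2 * np.pi:40j, 0:np.pi:20j]
sphere = np.stack([np.cos(u) * np.sin(v), np.sin(u) * np.sin(v), np.cos(v)])
ell = np.einsum('ij,jkl->ikl', vecs * radii, sphere) + center[:, None, None]

ax = plt.figure().add_subplot(projection='3d')
ax.scatter(*scores.T, s=8)
ax.plot_wireframe(*ell, color='r', alpha=0.2)
plt.show()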

Cumulative explained variance plot (Scree plot) starting from 0 instead of 1

Hi @erdogant,

I have been looking for a while for a good Python library that makes all PCA plots beautiful and I think your package achieves this goal by far.

On the other hand, while using your package I noticed that the explained variance plot starts from 0 (like normal Python indexes), and because of that I think it might be erasing the "last component" when generating the plot; moreover, the "0" component always starts from 0, which is not the common case for a scree plot:

(screenshot)

If I wasn't clear enough with the issue, I'll be glad to answer your questions.
Cheers,
Camilo.

Scaling biplot

As I see it, there is no way of scaling the biplot axes with the variance of the components as in e.g. R's biplot.prcomp (see here). This is really important: scaled this way, the feature lengths and the angles between them relate to the variances and covariances of the variables, making much more information visible on the biplot.

E.g. in the iris dataset, 'sepal length' and 'petal length' are correlated (0.87), yet they are almost orthogonal in the biplot provided by the pca library. Also, scaling the observations makes it possible to map the observations onto the variables, giving qualitative and quantitative insight into the PCA.
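For concreteness, a hedged sketch of the usual correlation-biplot scaling in plain sklearn/matplotlib (arrows are loadings times the square root of the explained variance, so for standardized data their angles approximate the correlations between variables):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data = load_iris()
Xc = (data.data - data.data.mean(0)) / data.data.std(0)   # standardize
p = PCA(n_components=2).fit(Xc)

# Correlation-scaled arrows; rescaled scores as in biplot.prcomp scale=1.
arrows = p.components_.T * np.sqrt(p.explained_variance_)
scores = p.transform(Xc) / np.sqrt(p.explained_variance_)

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=8)
for (x, y), name in zip(arrows, data.feature_names):
    ax.arrow(0, 0, x, y, color='r', head_width=0.02)
    ax.annotate(name, (x, y))
plt.show()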

Transform with update_outlier_params=False will still change Hotelling T2 outlier results on the fit data

I have found edge cases where transforming new unseen data changes the results in the 'outliers' dataframe for the original data used in fit, even with update_outlier_params=False. This specifically applies to the Hotelling's T2 statistic only.

Digging into it, the cause is the usage of all rows of the PC dataframe in hotellingsT2(), called by compute_outliers() from transform().

The hotellingsT2() function uses all rows of the PC dataframe to compute the outliers in the new data, and the results don't change for the calculation of y_score (as the mean and var are locked), nor for the y_proba or Pcomb variables.

But the calculation of Pcorr using multitest_correction() is directly affected by using more rows than before, and it is this column that is compared to alpha to determine the y_bool column in results['outliers'].

So, in short, fitting data then transforming data with update_outlier_params=False will change the y_proba and y_bool of original fit data in certain cases.

I experimented and created a simple dummy-data example that replicates this. To be fair, I'm not even sure this is a huge concern, but I figure the expectation is that the outlier params of previously fit data won't change if update_outlier_params=False. And it showed up in the usage I'm building.

This example changes the number of HotellingT2 outliers (as determined by y_bool) of original fit data from 1 to 0.

import numpy as np
import pandas as pd

from pca import pca

# Create dataset
np.random.seed(42)
X_orig = pd.DataFrame(np.random.randint(low=1, high=10, size=(10000, 10)))
# Insert Outliers
X_orig.iloc[500:510, 8:] = 15

# PCA Training
model = pca(n_components=5, alpha=0.05, n_std=3, normalize=True, random_state=42)
results = model.fit_transform(X=X_orig)

outliers_original = model.results['outliers']

# Create New Data
X_new = pd.DataFrame(np.random.randint(low=1, high=10, size=(1000, 10)))

# Transform New Data
model.transform(X=X_new, update_outlier_params=False)
outliers_new = model.results['outliers']

# Compare Original Points Outlier Results Before and After Transform
print("Before:", outliers_original['y_bool'].value_counts())
print("After:", outliers_new.iloc[:n_total]['y_bool'].value_counts())

I'm not sure what the fix is from a statistics standpoint, whether it's running the multitest differently or checking for changes, etc. But I wanted to raise the question.

I understand that inherently it makes sense for the y_proba to change for the previous data once more is added in, so it seems more a philosophical problem than a statistical one, but as someone tracking outliers as more and more data is transformed, it showed up.

Enable transparency alpha?

Would you be comfortable if we added transparency alpha for the samples? I use your library for lots of stuff at work with particularly dense datasets. I was thinking something like:

model.biplot(..., transparency_alpha=0.2)

Anyhow, let me know if you would like a pull request for it.
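In the meantime, a hedged workaround via the returned matplotlib axes (plain matplotlib, not a package option; model is an already-fitted pca instance):

# biplot() returns the matplotlib figure and axes, so the scatter
# collections can be made translucent after the fact.
fig, ax = model.biplot(n_feat=4)
for coll in ax.collections:
    coll.set_alpha(0.2)
fig.canvas.draw_idle()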

External set

Hi!

I have a question: if you want to work with an external set to check the outliers and the other functions of this library, how would that be possible?

Thanks!
Pablo

Logistic PCA and PPMI-based methods?

I am currently awaiting datasets with a data format of "liked items by user", where certain items are similar in nature.
There are currently a few ways of reducing dimensionality:

What are the trade-offs and characteristics of each method? Are there other methods for a large number of binary data columns?
