erdogant / pca

pca: A Python Package for Principal Component Analysis.

Home Page: https://erdogant.github.io/pca

License: MIT License

pca biplot explained-variance principal-component-analysis 3d-plot outliers hotelling-t2

pca's Introduction


pca: A Python Package for Principal Component Analysis. The core of pca is built on sklearn functionality to find maximum compatibility when combining with other packages. But this package can do a lot more: besides regular PCA, it can also perform SparsePCA and TruncatedSVD, and the best approach is chosen depending on your input data. ⭐️ Star this repo if you like it ⭐️

Other functionalities of pca include:

  • Biplot to plot the loadings
  • Determine the explained variance
  • Extract the best performing features
  • Scatter plot with the loadings
  • Outlier detection using Hotelling's T2 and/or SPE/DmodX

Support

Your ❤️ is important to keep maintaining my packages. You can support in various ways: have a look at the sponsor page. Report bugs and issues, or help out with developing new features! If you don't have the time to help or are still learning, you can also take a Medium Membership using my referral link to keep reading all my hands-on blogs. If you don't need that either, there is always the coffee! Thank you! Buy Me a Coffee at ko-fi.com

Read the Medium blog for more details.

On the documentation pages you can find detailed information about how pca works, together with many examples.

Installation

pip install pca
Import the pca package:
from pca import pca

Quick start

Make a biplot, plot the explained variance, and create 3D plots.
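A minimal sketch to get going (the calls mirror those quoted in the issues further down; the iris data is used purely for illustration):

import pandas as pd
from sklearn.datasets import load_iris
from pca import pca

# Small example dataset.
data = load_iris()
X = pd.DataFrame(data=data.data, columns=data.feature_names, index=data.target)

# Initialize and fit; normalize=True standardizes the features first.
model = pca(n_components=3, normalize=True)
results = model.fit_transform(X)

# Cumulative explained variance plot.
fig, ax = model.plot()

# 2D and 3D biplots with the top loadings drawn as arrows.
fig, ax = model.biplot(n_feat=4, legend=True)
fig, ax = model.biplot3d(n_feat=4, legend=True)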

Normalizing out the first and subsequent components from the data. This is useful if the data is separated in its first component(s) by unwanted or biased variance, such as sex, experiment location, etc.
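As a rough illustration of the idea in plain numpy/sklearn (not the package's own API): project onto the components, zero out the unwanted leading component(s), and map back.

import numpy as np
from sklearn.decomposition import PCA

# X: (n_samples, n_features) stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

p = PCA().fit(X)
scores = p.transform(X)

# Zero out the first component and reconstruct the data without it.
scores[:, 0] = 0
X_norm = scores @ p.components_ + p.mean_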

Make the biplot. It can be nicely seen that the feature with the most variance (f1) is almost horizontal in the plot, whereas the feature with the second-most variance (f2) is almost vertical. This is expected because most of the variance is in f1, followed by f2, etc.

(Figure: explained variance plot)

Biplot in 2D and 3D. Here we see the nice addition of the expected f3 in the plot in the z-direction.

(Figures: biplot and biplot3d)

To detect outliers across the multi-dimensional space of PCA, Hotelling's T2 test is incorporated. This basically means that chi-square tests are computed across the top n_components (default: PC1 to PC5). It is expected that the highest variance (and thus the outliers) will be seen in the first few components because of the nature of PCA, so going deeper into PC space may not be required, but the depth is optional. This approach results in a P-value matrix (samples x PCs), for which the P-values per sample are then combined using Fisher's method. This makes it possible to detect outliers and rank them (strongest to weakest). The alpha parameter determines the sensitivity of the outlier detection (default: 0.05).
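A hedged sketch of how the detected outliers can be retrieved (the detect_outliers and alpha parameters and the y_bool column are taken from the issue reports further down):

import numpy as np
import pandas as pd
from pca import pca

# Dummy data with a few injected outliers.
np.random.seed(42)
X = pd.DataFrame(np.random.randint(low=1, high=10, size=(500, 10)))
X.iloc[10:15, 8:] = 15

model = pca(n_components=5, alpha=0.05, normalize=True, detect_outliers=['ht2', 'spe'])
results = model.fit_transform(X)

# Per-sample statistics; y_bool flags the Hotelling's T2 outliers.
outliers = model.results['outliers']
print(outliers[outliers['y_bool']])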


Citation

Please cite pca in your publications if it has been useful for your research (see citation).

Maintainers


pca's People

Contributors

alyetama, erdogant, hovinh, nickgirardo, nightvision04, tgy


pca's Issues

Circular import issue

To whom it may concern,

After following your installation tutorial, I am stuck on importing the package.
This is the only line in my script:
from pca import pca

And here is the error.
ImportError: cannot import name 'pca' from partially initialized module 'pca' (most likely due to a circular import)

I am using an M1 Mac with Python 3.8 installed via conda and pip. I didn't use the GitHub link, though.

Thanks!

customizing marker shape

Really nice, easy to use package. Thanks for putting it together!

I did want to ask if it was possible to bring a second column of annotations (in addition to the 'y' you use as convention), in order to customize the marker shape. I suspect not based on the API, but it would be a great enhancement in the future.

Cheers,
Jason

Unable to set colors for labels

Problem

model.biplot(
    y=df['target'],  # list of 'Y' and 'N' values or any other qualifiers
    cmap=mpl.colors.ListedColormap(['red', 'green']),
)
  1. You can't set 'Y' to be green and 'N' to be red. They can swap colors depending on the data.
  2. If there are only 'Y's or only 'N's, the color is black.

The current approach is kind of okay if you want to distinguish different categories, but it doesn't allow consistency across different datasets.

I haven't found a workaround.

UPD: master...koutoftimer:pca:master; it is not great by any means, but at least it works.

Feature request: use helper function approach for plotting

Problem

The only way to nest figures is via helper function as stated in Matplotlib Quickstart.

The current implementation requires monkey patching matplotlib before the biplot call if one would like to have several subplots on the figure.

Workaround

from contextlib import contextmanager
import matplotlib.pyplot as plt

@contextmanager
def monkey_patched_figure_and_subplot(fig: plt.Figure, ax: plt.Axes):
    # Temporarily redirect pyplot figure/subplot creation to existing objects.
    old_figure = plt.figure
    old_add_subplot = fig.add_subplot

    def new_figure(*args, **kwargs):
        return fig

    def new_add_subplot(*args, **kwargs):
        return ax

    plt.figure = new_figure
    fig.add_subplot = new_add_subplot
    try:
        yield
    finally:
        # Restore the originals even if the plotting call raises.
        plt.figure = old_figure
        fig.add_subplot = old_add_subplot

...

model = pca(n_components=2)
model.fit_transform(data)
with monkey_patched_figure_and_subplot(fig, ax):
    model.biplot()

Suggestion

Allow passing ax and fig to biplot and other methods.

Outlier detection plots

Great library, thanks for sharing! Building all the plots is very convenient :)

This is a feature request to add plots that are useful for outlier detection:

  • SPE/DmodX score vs row scatter with DCrit shown for moderate outliers
  • Hotelling’s T^2 on the scores plot for strong outliers

For Hotelling’s this SO post has one way.
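For reference, a minimal sketch of one common formulation (plain numpy/scipy/matplotlib, not part of this package): the 95% T2 limit from the F-distribution drawn as an ellipse over the first two score columns.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f

# Stand-in scores: (n_samples, 2) array of PC1/PC2 values.
rng = np.random.default_rng(1)
scores = rng.normal(scale=[3.0, 1.0], size=(200, 2))

n, k = scores.shape
# 95% Hotelling's T2 control limit for k components.
t2_lim = k * (n - 1) / (n - k) * f.ppf(0.95, k, n - k)

# Ellipse semi-axes scale with the per-component score variances.
sx = np.sqrt(scores[:, 0].var(ddof=1) * t2_lim)
sy = np.sqrt(scores[:, 1].var(ddof=1) * t2_lim)

theta = np.linspace(0, 2 * np.pi, 200)
plt.scatter(scores[:, 0], scores[:, 1], s=10)
plt.plot(sx * np.cos(theta), sy * np.sin(theta), 'r--', label='95% T2 limit')
plt.xlabel('PC1'); plt.ylabel('PC2'); plt.legend()
plt.show()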

Figure size

Is there a way to reduce the figure dimension on Jupyter?
For example: model.scatter gives a huge figure. I wanted to make it smaller.
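Not confirmed by this thread, but a later issue quotes a figsize keyword in a biplot() call, so presumably (assumption) the same keyword works here:

# Assumption: scatter() accepts figsize like biplot() does in a later issue.
fig, ax = model.scatter(figsize=(6, 4))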

Feature: Add loadings scale on secondary axes

When visualising biplots, it's often useful to interpret the loadings placed on each feature in order to make deductions such as "principal component 1 places approximately double the loading on features A, B & C than it places on feature D".

Without the scale of the loading vectors in each principal component direction explicitly known, however, this can lead to misleading deductions.

This feature request is thus to add the loading scales to each principal component axis.

Example of biplot with loading scales shown:

(image)
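A hedged matplotlib sketch of what such a secondary loading axis could look like (secondary_xaxis/secondary_yaxis are standard matplotlib; arrow_scale is a hypothetical stand-in for whatever factor the package stretches the loading arrows by):

import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the axes returned by a biplot; arrow_scale is hypothetical.
fig, ax = plt.subplots()
rng = np.random.default_rng(0)
ax.scatter(*rng.normal(size=(2, 100)))
arrow_scale = 3.0

# Secondary axes that re-express score units as loading units.
sec_x = ax.secondary_xaxis('top', functions=(lambda s: s / arrow_scale,
                                             lambda l: l * arrow_scale))
sec_x.set_xlabel('PC1 loading')
sec_y = ax.secondary_yaxis('right', functions=(lambda s: s / arrow_scale,
                                               lambda l: l * arrow_scale))
sec_y.set_ylabel('PC2 loading')
plt.show()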

Optionally suppress the center-annotations in scatter plots

When creating scatter plots (using scatterd), labels are plotted at the center of mass for each point cloud, see below.

Is it possible to optionally suppress those labels? A couple of reasons for this:

  • For smaller point clouds and larger number of classes, this can become quite messy.
  • The information is redundant, as the legend also indicates the class category.
  • The labels can interfere with other content in the plot, see for example the label for the category "versicolor", that interferes with the arrow-annotation "petal length".
(image)

I noticed around L656 in pca.py that the labels are extracted from the available arguments. But it appears not to be possible to disable the creation of these annotations.

The Rolls-Royce improvement would be to optionally include those cloud-mass annotations when placing texts with "adjustText". But again, adding a switch to disable these cloud annotations would be nice!

How to remove text labels from scatterplots?

Hi,

Thanks for the great package. Whenever I plot my data with the scatter() function, I get large text labels on each point. I'd like to remove these, but if I use any of the "fontcolor", "fontsize", or "fontweight" arguments, I get a TypeError stating "scatter() got an unexpected keyword argument 'fontcolor'" or the like. Do you know what may be happening?

Thanks!

Conda recipe?

Hi @erdogant,

Thank you for providing this library. It's exactly what I've been looking for.

It would be great to have this installable through conda-forge. Would you be interested in this? If yes, I would be happy to assist with drafting a recipe for the conda build.

Best,
Vini

AttributeError if pca().fit_transform used on list

The following case will throw an AttributeError:

from pca import pca
X = [[1,2],
     [2,1],
     [3,3]]

p = pca(n_components=2, normalize=True)
p.fit_transform(X)

328             if verbose>=3: print('[pca] >n_components is set to %d' %(self.n_components))
    329 
--> 330         self.n_feat = np.min([self.n_feat, X.shape[1]])
    331 
    332         if (not self.onehot) and (not self.normalize) and isinstance(X, pd.DataFrame) and (str(X.values.dtype)=='bool'):

AttributeError: 'list' object has no attribute 'shape'

Can we check X's type and convert it to np.ndarray if it is a list?
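A hedged sketch of the suggested guard (the helper name _as_array is hypothetical; where it belongs inside fit_transform is up to the maintainer):

import numpy as np

def _as_array(X):
    # Coerce plain Python lists to a numpy array so X.shape exists.
    if isinstance(X, list):
        X = np.array(X)
    return X

X = _as_array([[1, 2], [2, 1], [3, 3]])
print(X.shape)  # (3, 2)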

Data Column Adjusted PCA

If y_continuous_cmap is set to True, use a continuous cmap in the plot

Could I create a PR for mapping y to a continuous color map? I was thinking syntax could be like this:

ax = model.biplot3d(n_feat=10, legend=False, y_continuous_cmap=True)

And it could generate something like the right side (in terms of the point colors).

(image)

Fit_transform but no transform()

Hi, maybe I've missed it, but could you add a transform method to supplement fit_transform? Ideally I'd like to call this on new data after it's already been fit and return PCs without refitting.
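For what it's worth, a later issue in this thread uses exactly such a method; a sketch of that usage (the update_outlier_params flag is taken from that issue):

import numpy as np
from pca import pca

X_train = np.random.rand(100, 5)
X_new = np.random.rand(10, 5)

model = pca(n_components=2)
model.fit_transform(X_train)

# Project new samples onto the fitted components without refitting.
PC_new = model.transform(X_new, update_outlier_params=False)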

PLS dmodx

Hi Erdogant,

How would it be possible to compute DmodX in PLS?

Thanks
Pablo

Verbose parameter not passed to scatterd() in pca.scatter()

In pca.py#L675 the call to scatterd.scatterd() does not pass in a verbose parameter. This results in some verbose output when calling pca.scatter() regardless of verbose initialization for the PCA object.

  fig, ax = scatterd(x=xs,
                     y=ys,
                     z=zs,
                     ....
                     fig=fig,
                     ax=ax)

Should probably become

  fig, ax = scatterd(x=xs,
                     y=ys,
                     z=zs,
                     ....
                     fig=fig,
                     ax=ax,
                     verbose=verbose)

Unable to turn off the default plot title of `biplot()`.

How do I remove the default plot title of biplot()?
I tried using the title parameter of the biplot() function, but I'm still unable to remove the default plot title.
fig, ax = model.biplot(label=True, legend=False, cmap='autumn', figsize=(15,10), title=None)
I also tried to modify the returned axis of biplot():
ax.set_title(' ', loc='center')
but it is not modifying the title.

What is the purpose of calling fig.set_visible(visible)?

I like this package a lot, thanks for the work!

Often, users wish to adjust the appearance of plots before displaying them. When model.biplot() is called, one can suppress the call to plt.show() by setting the argument visible=False. However, this also sets fig.set_visible(False). In my Python / IPython environment, this has the side effect that a later call to plt.show() will just show an empty figure, see below.

(image)

Setting the visible-status of the figure is not particularly user-friendly. I had to dig through this package's code to understand why the figure is not shown properly. For me, only the following works:

fig, ax = model.biplot(n_feat=4, visible=False)
ax.axis("scaled")  # Some user adjustment...
fig.set_visible(True) # Required, otherwise the figure will be empty!
fig.show()

The question: What's the purpose of the call to fig.set_visible()? I recommend omitting this in pca.py (I've counted two occurrences).

Another thing one might want to consider: don't call plt.show() in plotting routines! That's often left to the users in other plotting packages. Personally, I find it intrusive when a package determines the time when a window appears in my scripts 😉

NIPALS decomposition method

Hello,
I've tried your PCA package, it's great and I want to say: thank you for your efforts.
I'm looking for the NIPALS decomposition method, because in our domain (spectral data) this method works better than SVD.
Do you have this method in your package?
Thanks for listening.
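For readers unfamiliar with it, a minimal numpy sketch of the standard NIPALS iteration (not part of this package): each component's score/loading pair is refined until convergence, then the matrix is deflated.

import numpy as np

def nipals(X, n_components=2, tol=1e-8, max_iter=500):
    # Returns scores T and loadings P with X (centered) ~ T @ P.T.
    X = X - X.mean(axis=0)
    T = np.zeros((X.shape[0], n_components))
    P = np.zeros((X.shape[1], n_components))
    for k in range(n_components):
        t = X[:, [np.argmax(X.var(axis=0))]]   # start: highest-variance column
        for _ in range(max_iter):
            p = X.T @ t / (t.T @ t)            # loading estimate
            p /= np.linalg.norm(p)
            t_new = X @ p                      # score estimate
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        X = X - t @ p.T                        # deflate before next component
        T[:, [k]] = t
        P[:, [k]] = p
    return T, P

T, P = nipals(np.random.rand(50, 8), n_components=3)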

Defining Necessary Number of Dimensions

For fun I also borrowed some other data from This Link to see how personality and test performance can be condensed into a dimensionally reduced model.
personality_score.csv

Question 1: What is the proper way of selecting a sufficient number of dimensions to preserve the data while avoiding noise? Kaiser–Meyer–Olkin, Levene, and others all seem to be better descriptors than the "eigenvalue > 1" rule.
Question 2: Can PCA be integrated with something else such that it can behave like PCR or Lasso regression (as in reducing the number of unnecessary columns before attempting to be accurate)?
Question 3: Can ICA be used to discover significant columns? It is seen as a way to isolate components after using PCA to assess the proper dimension count.

!pip install pca
from pandas import read_csv
from pca import pca

df = read_csv('https://files.catbox.moe/4nztka.csv')
df = df.drop(columns=df.columns[0])
y = df[['AFQT']]
X = df.drop(columns=['AFQT'])
model = pca(normalize=True)
results = model.fit_transform(X)
print(model.results['explained_var'])
fig, ax = model.plot()
fig.savefig('personality_performance.png')

(image: personality_performance.png)

how to plot part of data by function model.scatter()

hi,

Thanks for your python library, I am using the library for PCA.

But have a question, for the function model.scatter()

model.scatter(SPE=True, hotellingt2=True, legend=False, label=None, cmap='Set1', ='#ffffff')

The data has 2000 rows; can we plot just the data from row 1000 to row 2000?

Or another dataset?

Thanks a lot.

Willa.
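A hedged workaround rather than a package feature: slice the scores returned by fit_transform and scatter them manually (the results['PC'] key and its PC1/PC2 columns are assumed here):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pca import pca

X = pd.DataFrame(np.random.rand(2000, 5))
model = pca(n_components=2)
results = model.fit_transform(X)

# Assumption: scores live in results['PC'] with columns 'PC1' and 'PC2'.
PC = results['PC'].iloc[1000:2000]
fig, ax = plt.subplots()
ax.scatter(PC['PC1'], PC['PC2'], s=10)
ax.set_xlabel('PC1'); ax.set_ylabel('PC2')
plt.show()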

Is DmodX estimate consistent with SIMCA (Umetrics)?

Hi Erdogan,

Thanks for building this package! Much needed for Python ecosystem.
I'm wondering if the DmodX outputted by pca.spe_dmodx() is consistent with SIMCA? I'm aware that SIMCA's methods are proprietary, but did you get a chance to compare the results?

Thanks!

Other graphs of interest (Correlation Plots)?

A PCA correlation matrix plot, sorted by similarity, might be good for seeing which variable has the most effect on PC0/PC1. https://www.reneshbedre.com/blog/principal-component-analysis.html#perform-pca-using-scikit-learn

P.S. Multicollinearity graphs are useful, but they are not quite PCA. Maybe the first factor as the amount of green, the second as the amount of red, and the third factor as the amount of blue? https://www.algorhythmblog.be/2022/04/05/visualizing-multicollinearity-in-python/

model.biplot() error when pd.options.mode.copy_on_write = True

Python 3.11.4
Pandas 2.0.3
pca 2.0.3

Code:

from sklearn.datasets import load_wine
import pandas as pd
from pca import pca

pd.options.mode.copy_on_write = True

data = load_wine()
df = pd.DataFrame(index=data.target, data=data.data, columns=data.feature_names)
model = pca(normalize=True, detect_outliers=['ht2', 'spe'], n_std=2)
results = model.fit_transform(df)

model.biplot(SPE=False, HT2=True, density=True)

Results:

ValueError: assignment destination is read-only

Fixed code (no errors):

pd.options.mode.copy_on_write = False
model.biplot(SPE=False, HT2=True, density=True)
pd.options.mode.copy_on_write = True

Typo in axis label

Typo in pca.py#L970 and pca.py#L1401: It should be "principal" not "principle".

Biplot `n_features` has no effect

I'm recreating figure 10.1 from the book Introduction to Statistical Learning with this library, specifically creating a 2-d biplot from the USA Arrests dataset.

However, when creating a biplot, only the first 2 loading vectors are displayed irrespective of what I pass to n_features. In addition, although it is a separate issue, the loading vector label is plotted outside the scale of the plot.

From a quick scan of the code it looks like the issue is in the compute_topfeat method, where n_feat never gets taken into account, but rather n_pcs gets iterated over twice.

P.S. - great work on this library. Exactly what I was looking for when googling "PCA biplots python". For that reason, I wouldn't mind helping out with maintaining this library if need be.

Allow train/infer mode for outlier detection in Hotelling's T2 and SPE?

Hi erdogant,

Firstly, I use your package in daily work and feel grateful for your sharing of this cool package with the community.

I'm employing Hotelling's T2 and SPE in the form of a quality-control chart, in which each product attribute is plotted as a single data point; those that deviate above a certain threshold are flagged and intervened on. Given this context, I have a training dataset from which the required parameters are extracted in train mode, i.e. mean(X) & var(X) for Hotelling's T2 and g_ell_center & cov for SPE, and then reused to transform newly arriving data in infer mode.

I have implemented this feature in the compute_outliers() function and wonder if I can contribute this to the pca package?

Sincerely,
Vinh

Plotting an ellipsoid in a 3D scatter plot

Hi Erdogant,

Is there built-in functionality to plot an ellipsoid in a 3D scatter plot?
I want to use the three most significant components of my data to do an outlier detection by using this ellipsoid.
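Not a built-in as far as this thread shows; a hedged numpy/matplotlib sketch of a covariance ellipsoid around the top three score columns (the 95% coverage assumes roughly normal scores):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

# Stand-in scores: (n_samples, 3) array of the top three PCs.
rng = np.random.default_rng(0)
scores = rng.multivariate_normal([0, 0, 0], np.diag([9, 4, 1]), size=300)

center = scores.mean(axis=0)
cov = np.cov(scores, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
radii = np.sqrt(vals * chi2.ppf(0.95, df=3))   # 95% under normality

# Parametric unit sphere, stretched and rotated by the covariance.
u, v = np.mgrid[0:2 * np.pi:40j, 0:np.pi:20j]
sphere = np.stack([np.cos(u) * np.sin(v), np.sin(u) * np.sin(v), np.cos(v)])
ell = np.einsum('ij,jkl->ikl', vecs * radii, sphere) + center[:, None, None]

ax = plt.figure().add_subplot(projection='3d')
ax.scatter(*scores.T, s=8)
ax.plot_wireframe(*ell, color='r', alpha=0.2)
plt.show()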

Cumulative explained variance plot (Scree plot) starting from 0 instead of 1

Hi @erdogant,

I have been looking for a while for a good Python library that makes all PCA plots beautiful and I think your package achieves this goal by far.

On the other hand, while using your package I noticed that the explained variance plot starts from 0 (like normal Python indexes), and because of that I think it might be erasing the "last component" when generating the plot; moreover, the "0" component always starts from 0, which is not the common case for a scree plot:

(screenshot)

If I wasn't clear enough with the issue, I'll be glad to answer your questions.
Cheers,
Camilo.

Scaling biplot

As I see it, there is no way of scaling the biplot axes with the variance of the components as in e.g. R's biplot.prcomp (see here). This is really important: scaled this way, the feature lengths and the angles between them relate to the variances and covariances of the variables, making much more information visible on the biplot.

E.g. in the iris dataset, 'sepal length' and 'petal length' are correlated (0.87), yet they are almost orthogonal in the biplot provided by the pca library. Also, scaling the observations makes it possible to map the observations onto the variables, giving qualitative and quantitative insight into the PCA.
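For concreteness, a hedged sketch of the usual correlation-biplot scaling in plain sklearn/matplotlib (arrows are loadings times the square root of the explained variance, so for standardized data their angles approximate the correlations between variables):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data = load_iris()
Xc = (data.data - data.data.mean(0)) / data.data.std(0)   # standardize
p = PCA(n_components=2).fit(Xc)

# Correlation-scaled arrows; rescaled scores as in biplot.prcomp scale=1.
arrows = p.components_.T * np.sqrt(p.explained_variance_)
scores = p.transform(Xc) / np.sqrt(p.explained_variance_)

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=8)
for (x, y), name in zip(arrows, data.feature_names):
    ax.arrow(0, 0, x, y, color='r', head_width=0.02)
    ax.annotate(name, (x, y))
plt.show()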

Transform with update_outlier_params=False will still change Hotelling T2 outlier results on the fit data

I have found edge cases where transforming new unseen data changes the results in the 'outliers' dataframe for the original data used in fit, even with update_outlier_params=False. This specifically applies to the Hotelling's T2 statistic only.

Digging into it, the cause is the usage of all rows of the PC dataframe in hotellingsT2(), called by compute_outliers() from transform().

The hotellingsT2() function uses all rows of the PC dataframe to compute the outliers in the new data, and the results don't change for the calculation of y_score (as the mean and var are locked), nor for the y_proba or Pcomb variables.

But the calculation of Pcorr using multitest_correction() is directly affected by using more rows than before, and it is this column that is compared to alpha to determine the y_bool column in results['outliers'].

So, in short, fitting data then transforming data with update_outlier_params=False will change the y_proba and y_bool of original fit data in certain cases.

I experimented and created a simple dummy-data example that replicates this. To be fair, I'm not even sure this is a huge concern, but I figure the expectation is that the outlier params of previously fit data won't change if update_outlier_params=False. And it showed up in the usage I'm building.

This example changes the number of HotellingT2 outliers (as determined by y_bool) of original fit data from 1 to 0.

import numpy as np
import pandas as pd

from pca import pca

# Create dataset
np.random.seed(42)
X_orig = pd.DataFrame(np.random.randint(low=1, high=10, size=(10000, 10)))
# Insert Outliers
X_orig.iloc[500:510, 8:] = 15

# PCA Training
model = pca(n_components=5, alpha=0.05, n_std=3, normalize=True, random_state=42)
results = model.fit_transform(X=X_orig)

outliers_original = model.results['outliers']

# Create New Data
X_new = pd.DataFrame(np.random.randint(low=1, high=10, size=(1000, 10)))

# Transform New Data
model.transform(X=X_new, update_outlier_params=False)
outliers_new = model.results['outliers']

# Compare Original Points Outlier Results Before and After Transform
print("Before:", outliers_original['y_bool'].value_counts())
print("After:", outliers_new.iloc[:n_total]['y_bool'].value_counts())

I'm not sure what the fix is from a statistics standpoint, whether it's running the multitest differently or checking for changes, etc. But I wanted to raise the question.

I understand that inherently it makes sense for the y_proba to change for the previous data once more is added in, so it seems more a philosophical problem than a statistical one, but as someone tracking outliers as more and more data is transformed, it showed up.

Enable transparency alpha?

Would you be comfortable if we added transparency alpha for the samples? I use your library for lots of stuff at work with particularly dense datasets. I was thinking something like:

model.biplot(..., transparency_alpha=0.2)

Anyhow, let me know if you would like a pull request for it.
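In the meantime, a hedged workaround via the returned matplotlib axes (plain matplotlib, not a package option; model is an already-fitted pca instance):

# biplot() returns the matplotlib figure and axes, so the scatter
# collections can be made translucent after the fact.
fig, ax = model.biplot(n_feat=4)
for coll in ax.collections:
    coll.set_alpha(0.2)
fig.canvas.draw_idle()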

External set

Hi!

I have a question: if you want to work with an external set to check the outliers and the other functions of this library, how would that be possible?

Thanks!
Pablo

Logistic PCA and PPMI-based methods?

I am currently awaiting datasets with a data format of "liked items by user", where certain items are similar in nature.
There are currently a few ways of reducing dimensionality:

What are the trade-offs and characteristics of each method? Are there other methods for a large number of binary data columns?
