Principal Component Analysis in scikit-learn

Introduction

Now that you've seen the curse of dimensionality, it's time to take a look at a dimensionality reduction technique! This will help you overcome the challenges of the curse of dimensionality (amongst other things). Essentially, PCA, or Principal Component Analysis, attempts to capture as much information from the dataset as possible while reducing the overall number of features.

Objectives

You will be able to:

  • Explain at a high level how PCA works
  • Explain use cases for PCA
  • Implement PCA using the scikit-learn library
  • Determine the optimal number of components (n_components) when performing PCA by observing the explained variance

Generate some data

First, you need some data to perform PCA on. With that, here's a quick dataset you can generate using NumPy:

import numpy as np

x1 = np.linspace(-10, 10, 100)
# A linear relationship, plus a little noise
x2 = np.array([xi*2 + np.random.normal(loc=0, scale=0.5) for xi in x1]) 

# Stack the two features into a single (100, 2) data array
X = np.column_stack([x1, x2])

Let's also generate a quick plot of this simple dataset to further orient ourselves:

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('darkgrid')

plt.scatter(x1, x2);

(Figure: scatter plot of x1 vs x2, showing a noisy straight-line relationship.)

PCA with scikit-learn

Now onto PCA. First, take a look at how simple it is to implement PCA with scikit-learn:

from sklearn.decomposition import PCA

pca = PCA()
transformed = pca.fit_transform(X)

And you can once again plot the updated dataset:

plt.scatter(transformed[:,0], transformed[:,1]);

(Figure: scatter plot of the PCA-transformed data.)

# The principal axes (one unit-length eigenvector per row)
pca.components_
array([[ 0.44712728,  0.89447035],
       [-0.89447035,  0.44712728]])
# The per-feature means that were subtracted to center the data
pca.mean_
array([7.10542736e-17, 2.61587597e-02])

Interpret Results

Let's take a look at what went on here. PCA transforms the dataset along principal axes. The first of these axes is designed to capture the maximum variance within the data. From here, additional axes are constructed which are orthogonal to the previous axes and continue to account for as much of the remaining variance as possible.
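
Both of these claims are easy to verify numerically. Here's a quick check, assuming the pca object and transformed array fitted above: the components are orthogonal unit vectors, and almost all of the variance falls along the first axis.

# The principal axes are unit length and mutually orthogonal
print(np.linalg.norm(pca.components_[0]))               # ~1.0
print(np.dot(pca.components_[0], pca.components_[1]))   # ~0.0

# Nearly all of the variance lies along the first principal axis
print(np.var(transformed[:, 0]), np.var(transformed[:, 1]))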

For the current 2-d case, the axes which the data were projected onto look like this:

plt.scatter(x1, x2);
# Both principal axes start at the mean of the data...
ax1, ay1 = pca.mean_[0], pca.mean_[1]
# ...and extend one unit along each principal component
ax2, ay2 = pca.mean_[0] + pca.components_[0][0], pca.mean_[1] + pca.components_[0][1]
ax3, ay3 = pca.mean_[0] + pca.components_[1][0], pca.mean_[1] + pca.components_[1][1]
plt.plot([ax1, ax2], [ay1, ay2], color='red')
plt.plot([ax1, ax3], [ay1, ay3], color='red');

(Figure: the original scatter plot with the two principal axes drawn in red.)

So, the updated graph you saw is the same dataset rotated onto these red axes:

plt.scatter(transformed[:,0], transformed[:,1])
plt.axhline(color='red')
plt.axvline(color='red');

(Figure: the transformed data plotted against the new axes, drawn in red.)

Note the small scale of the y-axis. You can also plot the transformed dataset on the new axes with a scale similar to what you saw before:

plt.scatter(transformed[:,0], transformed[:,1])
plt.axhline(color='red')
plt.axvline(color='red')
plt.ylim(-10,10);

(Figure: the transformed data with the y-axis rescaled to a range comparable to the original plot.)

Again, this is the geometric interpretation of what just happened: the same dataset expressed in a new coordinate system aligned with the red principal axes.

Determine the Explained Variance

Typically, you would use PCA to actually reduce the number of dimensions. In this case, you've simply reparameterized the dataset along new axes. That said, if you look at the first of these axes alone, you can already see most of the pattern in the data captured by that principal component. Moreover, scikit-learn lets you quickly determine how much of the dataset's overall variance each principal component accounts for:

pca.explained_variance_ratio_
array([9.99717770e-01, 2.82229957e-04])

Keep in mind that these quantities are not cumulative: the second principal component accounts only for variance that the first component did not capture. You can view the cumulative (total) explained variance using np.cumsum():

np.cumsum(pca.explained_variance_ratio_)
array([0.99971777, 1.        ])
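
In practice, you would choose n_components so that the retained components explain "enough" of this variance. As a minimal sketch (the 0.99 threshold here is just an illustrative choice), scikit-learn will pick that number for you if you pass a fraction between 0 and 1 as n_components; you can then map the reduced data back to the original feature space with inverse_transform:

# Keep the smallest number of components explaining at least 99% of the variance
pca_99 = PCA(n_components=0.99, svd_solver='full')
reduced = pca_99.fit_transform(X)

print(pca_99.n_components_)   # 1 -- the second component adds almost nothing
print(reduced.shape)          # (100, 1)

# Reconstruct an approximation of the original 2-D data from the single component
restored = pca_99.inverse_transform(reduced)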

Visualize the Principal Component

To help demonstrate the structure captured by the first principal component, color the points of the original dataset along a palette and then visualize the first component on its own with the same colors.

plt.scatter(x1, x2, c=sns.color_palette('RdBu', n_colors=100));

(Figure: the original data colored along the 'RdBu' palette.)

plt.scatter(transformed[:,0], [0 for i in range(100)], c=sns.color_palette('RdBu', n_colors=100));

(Figure: the first principal component alone, with the same color ordering preserved.)

Steps for Performing PCA

The theory behind PCA rests upon many foundational concepts of linear algebra. After all, PCA is re-encoding a dataset into an alternative basis (the axes). Here are the exact steps:

  1. Recenter each feature of the dataset by subtracting that feature's mean from the feature vector
  2. Calculate the covariance matrix for your centered dataset
  3. Calculate the eigenvectors of the covariance matrix
    1. You'll further investigate the concept of eigenvectors in the upcoming lesson
  4. Project the dataset into the new feature space: multiply the mean-centered data by the matrix of eigenvectors (see the NumPy sketch below)
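
To make these steps concrete, here is a minimal NumPy sketch of the same procedure. This is an illustration only, not scikit-learn's actual implementation; it assumes the X array generated earlier in this lesson.

# 1. Center each feature by subtracting its mean
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the centered data (features as columns)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigenvalues/eigenvectors of the covariance matrix, sorted largest first
eig_vals, eig_vecs = np.linalg.eigh(cov)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# 4. Project the centered data onto the eigenvectors
X_projected = X_centered.dot(eig_vecs)

# Up to sign, the columns of eig_vecs match the rows of pca.components_,
# and X_projected matches the transformed array from fit_transform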

You can see some of these intermediate steps from the pca instance object itself.

# Pulling up the original feature means which were used to center the data
pca.mean_ 
array([7.10542736e-17, 2.61587597e-02])
# Pulling up the covariance matrix of the mean centered data
pca.get_covariance() 
array([[ 34.35023637,  68.62012556],
       [ 68.62012556, 137.32181974]])
# Pulling up the eigenvectors of the covariance matrix
pca.components_ 
array([[ 0.44712728,  0.89447035],
       [-0.89447035,  0.44712728]])

Summary

In this lesson, you looked at implementing PCA with scikit-learn and the geometric interpretations of principal components. From here, you'll get a chance to practice implementing PCA yourself before going on to code some of the underlying components implemented by scikit-learn using NumPy.
