pauldevos / ml-utils

This project forked from stefmolin/ml-utils

Machine learning utility functions and classes.

License: MIT License

Python 100.00%

ml-utils's Introduction

Machine learning utility functions and classes

All examples derived from chapters 9-11 in my book: Hands-On Data Analysis with Pandas.

Note: This package uses scikit-learn for metrics calculation; however, with the exception of the PartialFitPipeline, the functionality should work for other purposes provided the input data is in the proper format.

Setup

# editable install; this should also install the packages listed in requirements.txt
$ pip install -e ml-utils # path to top level where setup.py is

# if the requirements were not installed, install them explicitly
$ pip install -r requirements.txt

Example Usage

Classification

Plot a confusion matrix as a heatmap:

>>> from ml_utils.classification import confusion_matrix_visual
>>> confusion_matrix_visual(y_test, preds, ['white', 'red'])

confusion matrix
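For reference, the counts shown in such a heatmap can be tallied by hand. The sketch below is not how confusion_matrix_visual() is implemented; confusion_counts() is a hypothetical helper, the labels are made-up data, and the layout (rows for actual classes, columns for predicted classes) is assumed to follow scikit-learn's convention.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Hypothetical helper: rows are actual classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual, predicted] += 1
    return matrix

# made-up labels: 0 = white, 1 = red
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0])

counts = confusion_counts(y_true, y_pred, n_classes=2)
print(counts)
```

The diagonal holds the correct predictions; the off-diagonal cells are the two kinds of misclassification.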

ROC curves for binary classification can be visualized as follows:

>>> from ml_utils.classification import plot_roc
>>> plot_roc(y_test, white_or_red.predict_proba(X_test)[:,1])

ROC curve

Use ml_utils.classification.plot_multi_class_roc() for a multi-class ROC curve.

Precision-recall curves for binary classification can be visualized as follows:

>>> from ml_utils.classification import plot_pr_curve
>>> plot_pr_curve(y_test, white_or_red.predict_proba(X_test)[:,1])

precision recall curve

Use ml_utils.classification.plot_multi_class_pr_curve() for a multi-class precision-recall curve.

Finding probability thresholds that yield target TPR/FPR:

>>> from ml_utils.classification import find_threshold_roc
>>> find_threshold_roc(
...     y_jan, model.predict_proba(X_jan)[:,1], fpr_below=0.05, tpr_above=0.75
... ).max()
0.011191747078992526
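To illustrate the idea behind this kind of search, here is a minimal NumPy sketch that scans candidate thresholds and keeps those meeting both targets. It is an assumption about the general approach, not the package's code: thresholds_meeting_roc_targets() is a hypothetical helper, the data is made up, and the targets are treated as inclusive bounds.

```python
import numpy as np

def thresholds_meeting_roc_targets(y_true, scores, fpr_below, tpr_above):
    """Hypothetical helper: thresholds whose FPR and TPR satisfy both targets."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    kept = []
    for threshold in np.unique(scores):
        predicted_positive = scores >= threshold
        tpr = (predicted_positive & y_true).sum() / y_true.sum()
        fpr = (predicted_positive & ~y_true).sum() / (~y_true).sum()
        # assumption: both bounds are inclusive
        if fpr <= fpr_below and tpr >= tpr_above:
            kept.append(threshold)
    return np.array(kept)

# made-up labels and predicted probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]

roc_thresholds = thresholds_meeting_roc_targets(
    y_true, scores, fpr_below=0.25, tpr_above=0.75
)
print(roc_thresholds, roc_thresholds.max())
```

Taking the maximum of the qualifying thresholds, as in the example above, picks the strictest cutoff that still meets both targets.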

Finding probability thresholds that yield target precision/recall:

>>> from ml_utils.classification import find_threshold_pr
>>> find_threshold_pr(
...     y_jan, model.predict_proba(X_jan)[:,1], min_precision=0.95, min_recall=0.75
... ).max()
0.011191747078992526
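The same kind of scan can be sketched for precision and recall. Again this is only an illustration under stated assumptions: thresholds_meeting_pr_targets() is a hypothetical helper, the data is made up, and both minimums are treated as inclusive.

```python
import numpy as np

def thresholds_meeting_pr_targets(y_true, scores, min_precision, min_recall):
    """Hypothetical helper: thresholds whose precision and recall meet both minimums."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    kept = []
    for threshold in np.unique(scores):
        predicted_positive = scores >= threshold
        true_positives = (predicted_positive & y_true).sum()
        # at least one score equals the threshold, so the denominator is never zero
        precision = true_positives / predicted_positive.sum()
        recall = true_positives / y_true.sum()
        if precision >= min_precision and recall >= min_recall:
            kept.append(threshold)
    return np.array(kept)

# made-up labels and predicted probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]

pr_thresholds = thresholds_meeting_pr_targets(
    y_true, scores, min_precision=0.8, min_recall=0.75
)
print(pr_thresholds, pr_thresholds.max())
```

Note that the qualifying thresholds need not be contiguous: here 0.6 drops out because the negative sample scored at 0.6 lowers precision at that cutoff.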

Elbow Point Plot

Use the elbow point method to find a good value for k when using k-means clustering:

>>> from sklearn.cluster import KMeans
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from ml_utils.elbow_point import elbow_point

>>> elbow_point(
...     kmeans_data, # features that will be passed to fit() method of the pipeline
...     Pipeline([
...         ('scale', StandardScaler()), ('kmeans', KMeans(random_state=0))
...     ])
... )

elbow point plot with k-means
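The quantity usually plotted for the elbow method is the k-means inertia (within-cluster sum of squared distances) for each k. The sketch below computes it directly with scikit-learn on made-up data with three well-separated clusters; it is an assumption about what an elbow plot shows, not the implementation of elbow_point().

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# made-up data: three tight clusters around known centers
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
data = np.vstack([
    center + rng.normal(scale=0.5, size=(50, 2)) for center in centers
])

inertias = {}
for k in range(1, 7):
    pipeline = Pipeline([
        ('scale', StandardScaler()),
        ('kmeans', KMeans(n_clusters=k, n_init=10, random_state=0))
    ]).fit(data)
    inertias[k] = pipeline.named_steps['kmeans'].inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

The elbow is the k after which adding clusters stops reducing inertia by much; with this data the drop should flatten after k=3.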

Pipeline with partial_fit()

>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.preprocessing import StandardScaler
>>> from ml_utils.partial_fit_pipeline import PartialFitPipeline

>>> model = PartialFitPipeline([
...     ('scale', StandardScaler()),
...     ('sgd', SGDClassifier(
...         # note: scikit-learn >= 1.1 names this loss 'log_loss'; 'log' was removed in 1.3
...         random_state=0, max_iter=1000, tol=1e-3, loss='log',
...         average=1000, learning_rate='adaptive', eta0=0.01
...     ))
... ]).fit(X_2018, y_2018)

>>> model.partial_fit(X_2019, y_2019)
PartialFitPipeline(memory=None, steps=[
    ('scale', StandardScaler(copy=True, with_mean=True, with_std=True)),
    ('sgd', SGDClassifier(
       alpha=0.0001, average=1000, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.01, fit_intercept=True,
       l1_ratio=0.15, learning_rate='adaptive', loss='log', max_iter=1000,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=0, shuffle=True, tol=0.001,
       validation_fraction=0.1, verbose=0, warm_start=False
    ))
])
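As background, incremental fitting with plain scikit-learn objects looks roughly like the sketch below: each new batch updates the scaler's running statistics and then the classifier. This is an assumption about the general pattern, not a description of how PartialFitPipeline works internally; make_batch() and its data are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_batch(n=200):
    """Hypothetical helper: made-up two-class data with well-separated means."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + 6 * y[:, None]
    return X, y

scaler = StandardScaler()
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit() needs every label on the first call

for X_batch, y_batch in (make_batch(), make_batch()):
    scaler.partial_fit(X_batch)  # update the running mean and variance
    clf.partial_fit(scaler.transform(X_batch), y_batch, classes=classes)

X_new, y_new = make_batch(50)
predictions = clf.predict(scaler.transform(X_new))
print(scaler.n_samples_seen_, predictions[:10])
```

Only estimators that implement partial_fit() can be chained this way, which is presumably why the note at the top singles out PartialFitPipeline as scikit-learn specific.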

PCA

Use PCA with two components to see if the classification problem is linearly separable:

>>> import matplotlib.pyplot as plt
>>> from ml_utils.pca import pca_scatter
>>> pca_scatter(wine_X, wine_y, 'wine is red?')
>>> plt.title('Wine Kind PCA (2 components)')

PCA scatter in 2D
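For intuition about what the two plotted axes are, here is a NumPy-only sketch of projecting data onto its first two principal components via the SVD. The data is made up, and this is the textbook computation rather than the code inside pca_scatter(), which may also scale the features first.

```python
import numpy as np

# made-up data: five features with very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# PCA works on centered data
X_centered = X - X.mean(axis=0)

# rows of Vt are the principal axes, ordered by decreasing singular value
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

# coordinates of every sample along the first two principal axes
projected = X_centered @ Vt[:2].T
print(projected.shape)
```

Each row of projected is one point in the 2D scatter; coloring the points by class label gives the kind of plot shown above.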

Try in 3D:

>>> from ml_utils.pca import pca_scatter_3d
>>> pca_scatter_3d(wine_X, wine_y, 'wine is red?', elev=20, azim=-10)
>>> plt.title('Wine Type PCA (3 components)')

PCA scatter in 3D

See how much variance is explained by PCA components, cumulatively:

>>> from sklearn.decomposition import PCA
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import MinMaxScaler
>>> from ml_utils.pca import pca_explained_variance_plot

>>> pipeline = Pipeline([
...     ('normalize', MinMaxScaler()), ('pca', PCA(8, random_state=0))
... ]).fit(X_train, y_train)

>>> pca_explained_variance_plot(pipeline.named_steps['pca'])

cumulative explained variance of PCA components
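The numbers behind such a plot can be sketched with NumPy alone: the squared singular values of the centered data give each component's variance, and a running sum of their shares gives the cumulative curve. The data and the 95% target below are made up, and all components are kept so the curve ends at 1.

```python
import numpy as np

# made-up data: five features with very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X_centered = X - X.mean(axis=0)

singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained_variance = singular_values ** 2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()
cumulative = np.cumsum(explained_variance_ratio)

# smallest number of components whose cumulative share reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3), n_needed)
```

The individual shares in explained_variance_ratio are what a scree plot displays, while cumulative is what the plot above displays.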

See how much variance each PCA component explains:

>>> from sklearn.decomposition import PCA
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import MinMaxScaler
>>> from ml_utils.pca import pca_scree_plot

>>> pipeline = Pipeline([
...     ('normalize', MinMaxScaler()), ('pca', PCA(8, random_state=0))
... ]).fit(w_X_train, w_y_train)

>>> pca_scree_plot(pipeline.named_steps['pca'])

scree plot

Regression

With the test y values and the predicted y values, we can look at the residuals:

>>> from ml_utils.regression import plot_residuals
>>> plot_residuals(y_test, preds)

residuals plots
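As a reminder of what is being plotted, a residual is conventionally the actual value minus the predicted value. The sketch below uses made-up numbers and assumes the package follows that sign convention.

```python
import numpy as np

y_test = np.array([3.0, 5.0, 7.5, 10.0])  # made-up actual values
preds = np.array([2.5, 5.5, 7.0, 11.0])   # made-up predictions

# assumption: residual = actual - predicted, so positive means under-prediction
residuals = y_test - preds
mean_residual = residuals.mean()
print(residuals, mean_residual)
```

Residuals scattered evenly around zero with no visible pattern suggest the model is not systematically over- or under-predicting.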

Look at the adjusted R^2 of the linear regression model, lm:

>>> from ml_utils.regression import adjusted_r2
>>> adjusted_r2(lm, X_test, y_test)
0.9289371493826968
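For context, the textbook adjusted R^2 penalizes R^2 for the number of features: 1 - (1 - R^2)(n - 1)/(n - p - 1), with n samples and p features. The sketch below applies that formula to made-up inputs; adjusted_r2_from_r2() is a hypothetical helper, and it is assumed, not confirmed, that the package's adjusted_r2() uses the same formula.

```python
def adjusted_r2_from_r2(r2, n_samples, n_features):
    """Hypothetical helper applying the textbook adjusted R^2 formula."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# made-up inputs: R^2 of 0.9 from 101 samples and 10 features
adjusted = adjusted_r2_from_r2(r2=0.9, n_samples=101, n_features=10)
print(round(adjusted, 4))
```

The adjusted value is always at most the plain R^2, and the gap grows as features are added relative to the number of samples.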

ml-utils's People

Contributors

stefmolin
