
Machine-Learning-Regression

MODELS

  1. Linear Regression - Simple and Multiple
  2. Regularization - Ridge (L2), Lasso (L1)
  3. Nearest neighbors and kernel regression

ALGORITHMS

  1. Gradient Descent
  2. Coordinate Descent

GENERAL CONCEPTS

  1. Loss function
  2. Bias-Variance Trade off
  3. Cross Validation
  4. Sparsity
  5. Overfitting
  6. Model selection
  7. Feature selection

Information Modules

  • Module 1 - Simple Regression
  • Module 2 - Multiple Regression
  • Module 3 - Assessing Performance
  • Module 4 - Ridge Regression
  • Module 5 - Feature Selection & Lasso
  • Module 6 - Nearest Neighbor & Kernel Regression

Simple Regression

  • Uses a single input and fits a line to the data; the parameters are the intercept and slope coefficients.

Cost of the line

  • Residual sum of squares (RSS) - the sum of the squared differences between the observed and predicted values.
  • Use RSS to assess different fits of the model.
  • Choose the fit that minimizes RSS on the training data over the intercept and slope (see the sketch below).
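A minimal Python sketch of the closed-form RSS minimizer for simple regression (function and variable names are illustrative, not taken from the notebooks):

```python
import numpy as np

def simple_regression_fit(x, y):
    """Closed-form least-squares line y ~ intercept + slope * x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

def rss(x, y, intercept, slope):
    """Residual sum of squares of a candidate line."""
    residuals = np.asarray(y) - (intercept + slope * np.asarray(x))
    return np.sum(residuals ** 2)
```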

Gradient Descent

  • Iterative algorithm that repeatedly moves in the direction of the negative gradient.
  • For convex functions it converges to the optimum, given a suitable step size (see the sketch below).
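A hedged sketch of gradient descent on the simple-regression RSS; the step size, tolerance, and names are illustrative assumptions:

```python
import numpy as np

def gradient_descent_simple(x, y, step_size=1e-3, tolerance=1e-6, max_iter=100000):
    """Gradient descent on RSS(w0, w1) for the line y ~ w0 + w1 * x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w0, w1 = 0.0, 0.0
    for _ in range(max_iter):
        errors = (w0 + w1 * x) - y          # prediction errors
        grad_w0 = 2 * np.sum(errors)        # dRSS/dw0
        grad_w1 = 2 * np.sum(errors * x)    # dRSS/dw1
        w0 -= step_size * grad_w0           # step against the gradient
        w1 -= step_size * grad_w1
        if np.sqrt(grad_w0 ** 2 + grad_w1 ** 2) < tolerance:
            break
    return w0, w1
```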

Multiple Regression

  • Allows fitting more complicated relationships between a single input and the output, e.g. polynomial regression, seasonality, etc.
  • It also incorporates more inputs, using multiple features of these inputs to compute the prediction.
  • The model is a weighted sum of features h_j of the inputs plus noise: y_i = sum_j w_j h_j(x_i) + epsilon_i, where epsilon_i is the error / noise term.

Cost -> RSS for multiple regression

  • RSS(w) -> sum of the squared differences between the observed outputs and the predicted values.
  • Predicted value = h(x_i)^T w, the transpose of the feature vector times the coefficient vector (in matrix form, y-hat = H w).

Gradient Descent

  • The gradient also gives the closed-form solution (set it to zero): w-hat = (H^T H)^{-1} H^T y. Complexity of the inverse: O(D^3), where D is the number of features.
  • Gradient of the RSS: -2 H^T (y - H w).
  • Gradient descent avoids the matrix inverse but requires a step size (see the sketch below).
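A minimal sketch of both routes for multiple regression, the normal-equation solution and gradient descent on the RSS (names and defaults are illustrative assumptions):

```python
import numpy as np

def multiple_regression_closed_form(H, y):
    """Normal-equation solution w = (H^T H)^{-1} H^T y (inverse costs O(D^3) in the number of features)."""
    return np.linalg.solve(H.T @ H, H.T @ y)

def multiple_regression_gradient_descent(H, y, step_size=1e-5, tolerance=1e-3, max_iter=100000):
    """Minimize RSS(w) = ||y - Hw||^2 by stepping against its gradient -2 H^T (y - Hw)."""
    w = np.zeros(H.shape[1])
    for _ in range(max_iter):
        gradient = -2 * H.T @ (y - H @ w)
        w -= step_size * gradient
        if np.linalg.norm(gradient) < tolerance:
            break
    return w
```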

Assessing Performance

  • Various measures to assess how well the model fit performs.

Measuring Loss

  • A measure of how well the fit performs.
  • The cost of using the estimated parameters w-hat to predict at x when the true value is y.
  • Absolute error - symmetric error - Absolute difference between true and predicted values.
  • Squared error - symmetric error - Squared difference between the actual and predicted values.

3 Measures of errors

  1. Training Error - average of the loss over the training dataset. Not a good measure of the model's predictive performance.
  2. Generalization / True Error - average loss over every possible observation the model could see; it cannot be computed exactly.
  3. Test Error - evaluates the fit (trained on the training data) on a held-out test set. It is a noisy approximation of the generalization error (see the sketch below).
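A minimal sketch of training and test error as the average squared loss; the H_train / H_test names are illustrative assumptions:

```python
import numpy as np

def average_squared_error(H, y, w):
    """Average squared loss of the fit w on a dataset (H, y)."""
    residuals = y - H @ w
    return np.mean(residuals ** 2)

# Assuming H_train/y_train and H_test/y_test are given and w_hat was fit on the training data:
# train_error = average_squared_error(H_train, y_train, w_hat)   # optimistic
# test_error  = average_squared_error(H_test, y_test, w_hat)     # noisy proxy for the true error
```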

Error vs. Model complexity

  • Training error - decreases with model complexity.
  • Generalization error - decreases and then increases with model complexity.
  • Test error - a noisy approximation of the true (generalization) error.

Overfit

  • Occurs when the training error keeps decreasing while the true error increases.
  • At this point the magnitudes of the coefficients grow large.

3 sources of prediction error

  1. Noise - inherent to the data; it cannot be controlled.
  2. Bias - a measure of how well the model can capture the true relationship, averaged over all possible training sets.
  3. Variance - a measure of how much the fitted function varies from one training set to another of the same size.

Bias-Variance tradeoff

  • Good predictive performance requires both low bias and low variance.
  • As model complexity increases, bias decreases and variance increases.
  • Mean squared error (MSE) = bias^2 + variance; this sum quantifies the bias-variance tradeoff.

Model selection and Assessment

  • Fit the model on the training data set.
  • Select between different models on the validation set.
  • Test the final performance on the test data (a simple split sketch follows this list).
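A hedged sketch of a random training / validation / test split; the fractions and names are illustrative assumptions:

```python
import numpy as np

def train_validation_test_split(H, y, val_frac=0.1, test_frac=0.1, seed=0):
    """Randomly partition a dataset into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val, n_test = int(len(y) * val_frac), int(len(y) * test_frac)
    val_idx = idx[:n_val]
    test_idx = idx[n_val:n_val + n_test]
    train_idx = idx[n_val + n_test:]
    return (H[train_idx], y[train_idx]), (H[val_idx], y[val_idx]), (H[test_idx], y[test_idx])
```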

Ridge Regression

  • As model complexity increases, models tend to become overfit.
  • A symptom of overfitting -> the magnitudes of the coefficients grow large.
  • Ridge regression trades off between bias and variance.
  • Ridge total cost = measure of fit (RSS on training data) + measure of the magnitude of the coefficients.
  • This is L2-regularized regression: cost(w) = RSS(w) + lambda * ||w||_2^2, where lambda is the tuning parameter.

Coefficient path

  • The magnitudes of the coefficients decrease as the tuning parameter lambda increases.

Ridge closed-form solution: w-hat = (H^T H + lambda * I)^{-1} H^T y -> complexity O(D^3) (see the sketch below).
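A minimal sketch of that closed-form ridge solution; for simplicity it penalizes every column of H, which is an illustrative choice:

```python
import numpy as np

def ridge_regression_closed_form(H, y, l2_penalty):
    """Ridge solution w = (H^T H + lambda * I)^{-1} H^T y.
    Note: this penalizes every coefficient, including any intercept column in H."""
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + l2_penalty * np.eye(D), H.T @ y)
```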

Cross-Validation

  • Used when there is insufficient data to form a separate validation set.
  • In that case, perform k-fold cross-validation.
  • Here the training set is divided into blocks and each block in turn is treated as the validation set:
    • on the remaining training blocks the parameters / coefficients are estimated.
    • on the held-out validation block the error is computed.
  • The average error across all validation blocks is the cross-validation error (see the sketch below).
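A hedged k-fold cross-validation sketch; it reuses the ridge_regression_closed_form helper sketched above, which is an assumption rather than code from the repository, and assumes the rows are already shuffled:

```python
import numpy as np

def k_fold_cross_validation(H, y, k, l2_penalty):
    """Average validation error of ridge regression across k folds."""
    n = len(y)
    errors = []
    for i in range(k):
        start, end = (n * i) // k, (n * (i + 1)) // k
        H_val, y_val = H[start:end], y[start:end]                 # held-out block
        H_train = np.concatenate([H[:start], H[end:]])            # remaining blocks
        y_train = np.concatenate([y[:start], y[end:]])
        w = ridge_regression_closed_form(H_train, y_train, l2_penalty)
        errors.append(np.mean((y_val - H_val @ w) ** 2))
    return np.mean(errors)
```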

Feature Selection & Lasso

Various methods to search over models with different numbers of features.

  • All subsets - exhaustive approach; the feature combination with the lowest RSS is chosen.
  • Greedy algorithms - e.g. forward selection - may give a suboptimal solution but eventually cover the desired model sizes and are far more efficient (see the sketch below).
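An illustrative forward-selection sketch, greedily adding the feature that most reduces RSS at each step; the names and the lstsq-based refit are assumptions:

```python
import numpy as np

def forward_stepwise_selection(H, y, max_features):
    """Greedily grow the feature set, adding the single feature that lowers RSS the most."""
    remaining = list(range(H.shape[1]))
    selected = []
    while remaining and len(selected) < max_features:
        best_rss, best_feature = None, None
        for j in remaining:
            cols = selected + [j]
            w = np.linalg.lstsq(H[:, cols], y, rcond=None)[0]   # refit with the candidate feature
            candidate_rss = np.sum((y - H[:, cols] @ w) ** 2)
            if best_rss is None or candidate_rss < best_rss:
                best_rss, best_feature = candidate_rss, j
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected
```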

Lasso objective function - L1 regularized regression

  • It leads to sparse solutions.
  • Lasso cost(w) = RSS(w) + lambda * ||w||_1 (the L1 norm of the weights).

Coefficient path

  • Here the solution becomes sparser as lambda increases: more coefficients become exactly zero, which performs feature selection.

Coordinate Descent

  • Plain gradient descent is difficult here because the derivative of the absolute value is undefined at zero; one option is sub-gradients, an alternative is coordinate descent.
  • Iterate through the different dimensions of the objective (the different features of the regression model), optimizing over one coordinate at a time while holding the others fixed.
  • Each lasso coordinate update is a "soft-thresholding" step, which produces sparse solutions (see the sketch below).
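A hedged sketch of lasso coordinate descent with soft-thresholding, written for unnormalized features and penalizing every coefficient; the names and defaults are illustrative assumptions:

```python
import numpy as np

def lasso_coordinate_descent(H, y, l1_penalty, tolerance=1e-6, max_iter=1000):
    """Minimize RSS(w) + lambda * ||w||_1 one coordinate at a time via soft-thresholding."""
    w = np.zeros(H.shape[1])
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(H.shape[1]):
            z_j = np.sum(H[:, j] ** 2)
            # rho_j: correlation of feature j with the residual that excludes feature j's contribution
            rho_j = H[:, j] @ (y - H @ w + w[j] * H[:, j])
            if rho_j < -l1_penalty / 2:
                new_wj = (rho_j + l1_penalty / 2) / z_j
            elif rho_j > l1_penalty / 2:
                new_wj = (rho_j - l1_penalty / 2) / z_j
            else:
                new_wj = 0.0   # soft-thresholding zeroes out weak features
            max_change = max(max_change, abs(new_wj - w[j]))
            w[j] = new_wj
        if max_change < tolerance:
            break
    return w
```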

Nearest Neighbor & Kernel Regression - Nonparametric fits

1-NN - simple procedure

  • Look for the most similar dataset observation and base the predictions on it.

Weighted k-NN

  • Weight the more similar observations more heavily than the less similar ones among the k nearest neighbors.
  • Take the weighted average of their outputs to form the prediction (see the sketch below).
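A minimal weighted k-NN sketch; the inverse-distance weighting is an illustrative choice, not necessarily the scheme used in the notebooks:

```python
import numpy as np

def weighted_knn_predict(H_train, y_train, x_query, k):
    """Predict with the k nearest neighbors, weighting closer neighbors more heavily."""
    distances = np.linalg.norm(H_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    weights = 1.0 / (distances[nearest] + 1e-12)   # small constant avoids division by zero
    return np.sum(weights * y_train[nearest]) / np.sum(weights)
```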

Kernel Regression

  • Weight all the points rather than only the nearest neighbors.
  • The kernel has a bandwidth lambda outside which observations receive zero weight; within the bandwidth, the weights decay with distance from the target point.
  • It leads to locally constant fits.
  • Parametric fits, by contrast, correspond to global constant fits (see the sketch below).
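A minimal kernel-regression sketch using an Epanechnikov kernel; the kernel choice and names are illustrative assumptions:

```python
import numpy as np

def epanechnikov_kernel(distances, bandwidth):
    """Weights decay within the bandwidth and are exactly 0 outside it."""
    u = distances / bandwidth
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def kernel_regression_predict(H_train, y_train, x_query, bandwidth):
    """Locally constant fit at x_query: kernel-weighted average of the training outputs."""
    distances = np.linalg.norm(H_train - x_query, axis=1)
    weights = epanechnikov_kernel(distances, bandwidth)
    if weights.sum() == 0:
        return np.nan   # no training point falls within the bandwidth
    return np.sum(weights * y_train) / np.sum(weights)
```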
