linear_regression_project's Introduction

Applying Linear Regression to a Lego dataset

Overview

Can a linear regression model help predict the price of a Lego set from specific variables?

Process followed

Libraries used

  • pandas
  • statsmodels
  • scipy.stats
  • matplotlib.pyplot

Steps

1. Data collection

Dataset imported from: https://github.com/seankross/lego/tree/master/data-tidy
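A minimal sketch of the import step with pandas. The link above points to a directory and the README does not name the file that was read, so the loader takes whatever path or URL you pass it. The in-memory sample and its column names are hypothetical stand-ins so the snippet runs offline.

```python
import io

import pandas as pd


def load_lego_sets(source):
    """Read the Lego sets table from a CSV path, URL, or file-like object."""
    return pd.read_csv(source)


# Hypothetical stand-in data; replace with the CSV file from the data-tidy directory.
sample_csv = io.StringIO(
    "Name,Year,Pieces,Price\n"
    "Set A,2010,120,9.99\n"
    "Set B,2015,540,49.99\n"
)

lego = load_lego_sets(sample_csv)
print(lego.shape)
print(lego.dtypes)
```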

2. Data cleaning:

  • Deleting 8 columns

  • Deleting ~2000 empty rows

  • Reducing the number of themes

  • Replacing "Year" with "Age" to avoid treating the data as a time series

  • Creating dummies for non-numerical columns

  • Checking if the dataset has duplicates - does not have any

  • Transforming the numerical columns toward a normal distribution using the Box-Cox method:

Distribution in the dataset: [Figure: Distribution]

Both columns are log-normally distributed.

Distribution after applying the Box-Cox method with lambda = 0 (equivalent to taking the natural logarithm): [Figure: Distribution_transformed]
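The transformation can be sketched as follows with scipy.stats. The data here is synthetic log-normal data and the column names `price` and `pieces` are assumptions, since the README does not list the columns that were transformed.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, strictly positive, log-normal data standing in for the real columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.lognormal(mean=3.0, sigma=0.8, size=500),
    "pieces": rng.lognormal(mean=5.0, sigma=1.0, size=500),
})

for col in ["price", "pieces"]:
    # With lmbda=0 the Box-Cox transform reduces to the natural logarithm.
    df[col + "_bc"] = stats.boxcox(df[col].to_numpy(), lmbda=0)

# Skewness should shrink toward 0 once the log-normal data is transformed.
for col in ["price", "pieces"]:
    print(col, "skew before:", stats.skew(df[col]), "after:", stats.skew(df[col + "_bc"]))
```

Box-Cox requires strictly positive inputs, so zero or negative values would need to be handled before this step.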

  • Identifying outliers:

[Figure: Outliers]

Solution: creating 2 new columns that flag outliers
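The cleaning steps above can be sketched as below. This is an illustration on a toy table, not the project's code: the column names, the reference year used to compute "Age", and the 1.5 × IQR rule for flagging outliers are all assumptions, as the README does not state how outliers were identified. Only the name `Outliers_pieces` comes from the model equation in the Results section.

```python
import pandas as pd

# Toy data with hypothetical column names.
raw = pd.DataFrame({
    "Year": [2010, 2015, 2015, 2012, 2014, 2013],
    "Theme": ["City", "Star Wars", "City", "Duplo", "City", "Star Wars"],
    "Pieces": [120, 540, 95, 30, 150, 4000],
    "Price": [9.99, 49.99, 7.99, 14.99, 12.99, 399.99],
})

REFERENCE_YEAR = 2016  # assumption: the year against which "Age" is measured

clean = raw.dropna().copy()                     # drop empty rows
clean["Age"] = REFERENCE_YEAR - clean["Year"]   # replace "Year" with "Age"
clean = clean.drop(columns="Year")
print("duplicates:", clean.duplicated().sum())  # check for duplicate rows


def flag_outliers(series, k=1.5):
    """Return 1 where a value lies outside [Q1 - k*IQR, Q3 + k*IQR], else 0."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return ((series < q1 - k * iqr) | (series > q3 + k * iqr)).astype(int)


# Flag columns are kept as predictors instead of deleting the outlying rows.
clean["Outliers_pieces"] = flag_outliers(clean["Pieces"])
clean["Outliers_price"] = flag_outliers(clean["Price"])  # hypothetical name

# Dummy variables for the non-numerical column; drop_first avoids the dummy trap.
clean = pd.get_dummies(clean, columns=["Theme"], drop_first=True)
print(clean.columns.tolist())
```

Keeping the flags as columns lets the regression estimate a separate shift for unusually large sets rather than discarding them.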

3. Regression analysis: dropping 9 more variables after the first analysis, based on their high p-values

4. Checking the 5 assumptions for linear regression

  1. Multicollinearity

[Figure: Multicollinearity]

After dropping 5 more columns, the assumption is verified.

  2. Linearity

[Figure: Linearity]

The assumption is verified.

  3. Autocorrelation

The Durbin-Watson statistic is 1.11. A value near 2 would indicate no autocorrelation; a value this far below 2 indicates positive autocorrelation in the residuals.
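The Durbin-Watson statistic is approximately 2(1 − r), where r is the lag-1 correlation of the residuals, so 1.11 corresponds to r of roughly 0.44. The sketch below computes the statistic with statsmodels on two synthetic residual series (assumptions for illustration): white noise, and an AR(1) series with coefficient 0.5.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
n = 1000

# Uncorrelated residuals: the statistic should land close to 2.
white = rng.normal(size=n)

# Positively autocorrelated residuals: e[t] = 0.5 * e[t-1] + noise.
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.5 * ar1[t - 1] + white[t]

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)
print("white noise:", round(dw_white, 2))
print("AR(1), 0.5 :", round(dw_ar1, 2))
```

With a fitted statsmodels OLS result, the same function can be applied to its `resid` attribute.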

  4. Homoscedasticity

[Figure: Homoscedasticity]

The assumption is potentially not verified.

  5. Exogeneity of residuals

[Figure: Exogeneity]

The assumption is not verified: according to the Anderson-Darling test (p-value < 0.05), the residuals do not follow a normal distribution.
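A minimal sketch of an Anderson-Darling normality check that reports a p-value. It uses `normal_ad` from statsmodels. The README does not say which implementation was used, so this choice is an assumption. The two residual series are synthetic.

```python
import numpy as np
from statsmodels.stats.diagnostic import normal_ad

rng = np.random.default_rng(5)

# Residuals drawn from a normal distribution.
normal_resid = rng.normal(size=500)

# Strongly right-skewed residuals (centred exponential), clearly non-normal.
skewed_resid = rng.exponential(size=500) - 1.0

stat_n, p_n = normal_ad(normal_resid)
stat_s, p_s = normal_ad(skewed_resid)

# Null hypothesis: the sample comes from a normal distribution.
print("normal sample: statistic", round(stat_n, 3), "p-value", p_n)
print("skewed sample: statistic", round(stat_s, 3), "p-value", p_s)
```

A p-value below 0.05, as reported for this model, rejects normality of the residuals.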

Results

The final linear regression model has a high R²; however, not all 5 assumptions are verified.

Result of the OLS model:

[Figure: Results]

The model equation: ln(y) = -0.21 + 1.8 × Outliers_pieces + β1 × Themes + β2 × Packagings + β3 × Availability + 0.68 × Number_pieces
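Because the response is ln(y), each coefficient acts multiplicatively on the price: raising a predictor by one unit multiplies the predicted price by exp(coefficient), all else held fixed. The helper names below are hypothetical and only illustrate this back-transformation with the published coefficient of Outliers_pieces.

```python
import math


def price_multiplier(coefficient, delta=1.0):
    """Factor applied to the predicted price when a predictor rises by `delta`."""
    return math.exp(coefficient * delta)


def predicted_price(log_prediction):
    """Back-transform a predicted ln(y) to the price scale."""
    return math.exp(log_prediction)


# A set flagged by Outliers_pieces (coefficient 1.8) versus an unflagged one.
print(round(price_multiplier(1.8), 2))  # 6.05
```

So, under this equation, a set flagged as a piece-count outlier is predicted to cost about six times as much as an otherwise identical unflagged set.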

Plotting the model: [Figure: Plotting]

Conclusion

The model can be used to explain the current prices of these Lego sets; however, it is not yet usable for predictions because not all assumptions are satisfied.

linear_regression_project's People

Contributors

camillelib
