devosmitachatterjee2018 / statistical_learning_for_big_data_report12062020

The project covers the statistical analysis of high-dimensional data using classification, feature selection, clustering, and dimension reduction techniques.

Python 100.00%
l1-regularization elastic-net-regression logistic-regression standardization normalization tsne kmeans-clustering

statistical_learning_for_big_data_report12062020's Introduction

Context

The assignment is part of the course 'Statistical_Learning_for_Big_Data' (course code MVE441) at Chalmers.

Project 1

A high-dimensional dataset with binary labels is provided. It contains

  • the training feature matrix (X_train) of order 323 * 800
  • the training binary response vector (y_train) of order 323 * 1
  • the validation feature matrix (X_valid) of order 175 * 800
  • the validation binary response vector (y_valid) of order 175 * 1.

The task is to fit two different classification methods on the training dataset, compare their performance on the validation dataset, determine the best predictors for classification, and explain the selection.

Responsibilities for project 1

  • Perform an exploratory data analysis to understand the dataset by summarizing its main characteristics, either statistically or visually.
    • Data size
    • Data type
    • Missing data
    • Duplicate data
    • Constant columns
    • Distribution and count of class labels of the binary response variable
  • Standardize the data.
  • Because the training response variable (y_train) is binary and the number of features exceeds the number of observations (p > n), use the following two penalized logistic regression methods for classification and feature selection.
    • L1-regularized logistic regression
    • Elastic net-regularized logistic regression.
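
The steps listed above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the code from the report: the data is synthetic (generated with make_classification as a stand-in for X_train, y_train, X_valid, y_valid), the helper name explore is hypothetical, and the values of C and l1_ratio are illustrative rather than tuned.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def explore(X, y):
    """Collect the exploratory checks from the list in one dictionary (hypothetical helper)."""
    classes, counts = np.unique(y, return_counts=True)
    return {
        "shape": X.shape,
        "dtype": str(X.dtype),
        "missing_values": int(np.isnan(X).sum()),
        "duplicate_rows": int(X.shape[0] - np.unique(X, axis=0).shape[0]),
        "constant_columns": int((np.ptp(X, axis=0) == 0).sum()),
        "class_counts": dict(zip(classes.tolist(), counts.tolist())),
    }


# Synthetic stand-in for the course data (assumption): binary labels and p > n.
X, y = make_classification(
    n_samples=200, n_features=300, n_informative=10, n_redundant=0,
    random_state=0,
)
X_train, y_train = X[:130], y[:130]
X_valid, y_valid = X[130:], y[130:]

report = explore(X_train, y_train)
print(report)

# Standardize with statistics from the training split only, then apply
# the same transform to the validation split.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_valid_std = scaler.transform(X_valid)

# The saga solver supports both the L1 and the elastic net penalty.
models = {
    "l1": LogisticRegression(
        penalty="l1", solver="saga", C=0.1, max_iter=5000, random_state=0
    ),
    "elastic_net": LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.1,
        max_iter=5000, random_state=0,
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train_std, y_train)
    # Features with a nonzero coefficient are the ones the penalty kept.
    selected = np.flatnonzero(model.coef_[0])
    results[name] = {
        "valid_accuracy": model.score(X_valid_std, y_valid),
        "selected": selected,
    }
    print(name, results[name]["valid_accuracy"], len(selected))
```

Both penalties shrink coefficients toward zero, and the L1 part sets many of them exactly to zero, so the surviving indices in selected give one candidate answer to the "best predictors" question. In practice C and l1_ratio would be chosen by cross-validation on the training data.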

Project 2

A high-dimensional dataset is provided in the form of the data matrix (X) of order 302 * 728.

The project is to perform an exploratory data analysis, discover clusters in the data, and find the five variables that are most indicative of each cluster found.
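
The README does not say how the five indicative variables are chosen. One possible approach, shown here purely as an assumption, is to score each variable by a one-vs-rest standardized mean difference between a cluster and all other observations, then keep the five highest scores. The function name top_variables_per_cluster and the toy data are hypothetical.

```python
import numpy as np


def top_variables_per_cluster(X, labels, n_top=5):
    """Rank variables per cluster by a one-vs-rest standardized mean difference."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    top = {}
    for cluster in np.unique(labels):
        inside = X[labels == cluster]
        outside = X[labels != cluster]
        # Pooled spread per variable; the small constant avoids division by zero.
        spread = np.sqrt(0.5 * (inside.var(axis=0) + outside.var(axis=0))) + 1e-12
        score = np.abs(inside.mean(axis=0) - outside.mean(axis=0)) / spread
        # Indices of the n_top highest-scoring variables, best first.
        top[cluster.item()] = np.argsort(score)[::-1][:n_top].tolist()
    return top


# Toy data (assumption): three clusters, each shifted along its own five variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 20))
labels = np.repeat([0, 1, 2], 50)
for c in range(3):
    X[labels == c, 5 * c:5 * (c + 1)] += 4.0

top = top_variables_per_cluster(X, labels)
print(top)
```

On the toy data each cluster should recover the block of variables it was shifted along. Other rankings, such as feature importances from a classifier trained on the cluster labels, would be equally reasonable.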

Responsibilities for project 2

  • Perform an exploratory data analysis to understand the dataset by summarizing its main characteristics, either statistically or visually.
    • Data size
    • Data type
    • Missing data
    • Duplicate data
    • Constant columns
  • Normalize the data.
  • Use t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction of the data (X).
  • Apply k-means clustering to the t-SNE-reduced data with the optimal number of clusters k.
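
The normalization, embedding, and clustering steps above might look roughly like this with scikit-learn. Several choices here are assumptions rather than details from the report: the data is synthetic (make_blobs stands in for X), "normalize" is read as min-max scaling to [0, 1], the perplexity is an illustrative value, and the "optimal" k is taken to be the one with the highest silhouette score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the data matrix X (assumption).
X, _ = make_blobs(n_samples=150, n_features=50, centers=3, random_state=0)

# "Normalize" is read here as min-max scaling of each column to [0, 1] (assumption).
X_norm = MinMaxScaler().fit_transform(X)

# Reduce to two dimensions; perplexity must be smaller than the sample count.
embedding = TSNE(
    n_components=2, perplexity=30, init="pca", random_state=0
).fit_transform(X_norm)

# Score a small range of candidate k values on the embedded data.
scores = {}
for k in range(2, 7):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
    scores[k] = silhouette_score(embedding, candidate)

# Keep the k with the highest silhouette score and refit with it.
best_k = max(scores, key=scores.get)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(embedding)
print(best_k, scores[best_k])
```

Because t-SNE preserves local neighborhoods rather than global distances, clusters found in the embedding are best checked against the original variables before being interpreted.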

Environment

Windows, Python.
