indrikwijaya / fyp-ml-for-genomics

This repo contains relevant files for my final year project in NUS (2018).

Languages: TeX 1.44%, Jupyter Notebook 98.47%, Python 0.10%

FYP-ML-For-Genomics

This repo contains relevant files for my final year project in NUS (2018), 'Machine Learning For Genomics'.

Abstract

This report explores the performance of several unsupervised learning algorithms, specifically clustering algorithms, on short time-series gene expression data. Much biological data takes the form of short time series, yet few studies address this setting. Standard machine learning algorithms normally work well on longer time series, but they tend to fail to separate short time series into meaningful clusters because the series are not long enough to develop distinct, clear patterns. As a result, data that should not be clustered together may end up in the same cluster. We examine four algorithms: Short Time-series Expression Miner (STEM), Gaussian Mixture Model, K-means and Hierarchical Clustering. STEM was developed specifically to cluster short time-series data, whereas the other three are standard machine learning algorithms that are still widely used to cluster time-series data.

STEM is a widely cited algorithm for clustering short time-series data.
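
Of the standard algorithms named in the abstract, K-means is the simplest to sketch. The block below is a minimal illustration, not the code used in this repo: the profiles are made up, the helper names (`zscore`, `kmeans`) are hypothetical, and seeding the centroids with the first `k` points is an assumption chosen only to keep the run deterministic. Each profile is z-scored first so that the Euclidean distance used by K-means reflects the shape of the series and not its magnitude.

```python
from math import sqrt


def zscore(profile):
    """Centre and scale a profile so that only its shape remains."""
    n = len(profile)
    mean = sum(profile) / n
    sd = sqrt(sum((v - mean) ** 2 for v in profile) / n)
    if sd == 0:
        raise ValueError("constant profile has no shape to standardise")
    return [(v - mean) / sd for v in profile]


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up 7-point profiles: two rising, two falling.
profiles = [
    [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0],
    [3.0, 2.4, 2.0, 1.4, 1.0, 0.6, 0.0],
    [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7],
    [1.0, 0.8, 0.7, 0.5, 0.3, 0.2, 0.0],
]
labels, centroids = kmeans([zscore(p) for p in profiles], k=2)
print(labels)
```

With only 7 time points per gene, each cluster centroid is estimated from very little shape information, which is one reason the choice of `k` and of the distance measure matters so much here.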

Datasets

  1. Unlabelled gene expression values
  2. Short time-series, 7 time points
  3. N ~ 10k data points, 2 sets: Bulk mitochondria and Crude mitochondria
  4. Special subset: Mitochondria genes (~ 1.1k data points)

Conclusion

STEM as benchmark

  1. The only algorithm that determines the optimal number of clusters with a statistical test
  2. Takes into account the sequential nature of time series
  3. Generates its model profiles independently of the data
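
The idea of data-independent profiles can be sketched as follows. This is a rough illustration under stated assumptions, not STEM's actual implementation: it assumes that every model profile starts at 0 and changes by an integer amount of at most `c` units between consecutive time points, and that a gene is assigned to the profile it correlates with most strongly. The function names and the example gene are hypothetical. The selection of a representative subset of profiles and the statistical test for significant profiles are left out.

```python
from itertools import product
from math import sqrt


def candidate_profiles(n_points, c=1):
    """Enumerate profiles that start at 0 and change by an integer in [-c, c] per step."""
    steps = range(-c, c + 1)
    profiles = []
    for deltas in product(steps, repeat=n_points - 1):
        profile = [0]
        for d in deltas:
            profile.append(profile[-1] + d)
        profiles.append(tuple(profile))
    return profiles


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def closest_profile(gene, profiles):
    """Return the candidate profile with the highest correlation to the gene."""
    return max(profiles, key=lambda p: pearson(gene, p))


# The candidates depend only on the number of time points and on c,
# never on the expression values themselves.
profiles = candidate_profiles(n_points=7, c=1)
gene = [0.0, 0.4, 0.9, 1.1, 0.8, 0.3, -0.1]  # made-up expression changes
best = closest_profile(gene, profiles)
print(len(profiles), best)
```

Because the candidate set is fixed before any gene is examined, the same profiles can be reused across datasets that share the same number of time points, such as the two 7-point sets described above.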

Findings

  1. STEM may exclude many genes (which can also be viewed as removing noise)
  2. Algorithms that use the correlation coefficient as the distance measure perform better
  3. Use STEM to initialize the number of clusters
  4. Euclidean distance is not a good distance measure for these short time series
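
The difference between the two distance measures in findings 2 and 4 can be seen on a toy example. The three profiles below are made up for illustration, and correlation distance is taken here to be one minus the Pearson correlation coefficient, which is a common convention and an assumption about what the report used.

```python
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient of two non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_distance(x, y):
    """1 - r: 0 for identical shapes, 2 for exactly opposite shapes."""
    return 1.0 - pearson(x, y)


a = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]        # rise then fall, large amplitude
b = [0.0, 0.1, 0.2, 0.3, 0.2, 0.1, 0.0]        # same shape, small amplitude
c = [0.0, -0.1, -0.2, -0.3, -0.2, -0.1, 0.0]   # opposite shape, small amplitude

# Euclidean distance groups b with c because both have small values.
print(euclidean(a, b), euclidean(b, c))
# Correlation distance groups b with a because both have the same shape.
print(correlation_distance(a, b), correlation_distance(b, c))
```

Euclidean distance places `b` closer to its mirror image `c` than to `a`, even though `a` and `b` follow the same trend, while correlation distance ranks them the other way round.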
