FYP-ML-For-Genomics

This repo contains the files for my final-year project at NUS (2018), 'Machine Learning for Genomics'.

Abstract

This report explores the performance of unsupervised learning algorithms, clustering in particular, on short time-series data of gene expression values. Much biological data comes as short time series, yet few studies address this setting. Standard machine learning algorithms normally work well on longer time series, but they often fail to separate short time series into meaningful clusters because the series are too short to develop clear, distinct patterns. As a result, series that do not belong together may end up in the same cluster. We compare four algorithms: Short Time-series Expression Miner (STEM), Gaussian Mixture Model, K-means and Hierarchical Clustering. STEM was developed specifically for clustering short time-series data, whereas the other three are standard machine learning algorithms that are still widely used to cluster time-series data.

STEM is a widely cited algorithm for short time-series data.

Datasets

Unlabelled gene expression values
Short time-series, 7 time points
N ~ 10k data points, 2 sets: Bulk mitochondria and Crude mitochondria
Special subset: Mitochondria genes (~ 1.1k data points)

Conclusion

STEM as benchmark

The only algorithm of the four that gives the optimal number of clusters using a statistical test
Takes into account the sequential nature of time-series
Generates profiles independently of the data
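
To illustrate what "profiles independent of data" means, here is a minimal sketch, not the STEM implementation. It assumes, based on my reading of the STEM paper, that model profiles start at 0 and move up or down by a bounded number of integer units between consecutive time points, and that each gene is assigned to the profile it correlates with best. The function names (candidate_profiles, closest_profile), the example gene and the small parameter values are made up for illustration. The real tool additionally selects a smaller representative subset of profiles and runs a permutation test to decide which profiles are significant; both steps are omitted here.

```python
import math
from itertools import product


def candidate_profiles(n_points, max_step):
    """Enumerate profiles that start at 0 and change by an integer in
    [-max_step, max_step] between consecutive time points.
    The profiles depend only on these two parameters, not on any data."""
    steps = range(-max_step, max_step + 1)
    profiles = []
    for deltas in product(steps, repeat=n_points - 1):
        profile = [0]
        for d in deltas:
            profile.append(profile[-1] + d)
        profiles.append(tuple(profile))
    return profiles


def pearson(x, y):
    """Pearson correlation; returns 0.0 for a flat series, where the
    correlation is undefined (an assumption made for this sketch)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def closest_profile(gene, profiles):
    """Assign a gene to the profile with the highest correlation."""
    return max(profiles, key=lambda p: pearson(gene, p))


# Small example: 4 time points, steps of at most 1 unit.
profiles = candidate_profiles(n_points=4, max_step=1)
print(len(profiles))  # 3 ** 3 = 27 candidate profiles

gene = [0.0, 0.9, 2.1, 1.8]  # hypothetical expression values
best = closest_profile(gene, profiles)
print(best, round(pearson(gene, best), 3))
```

The number of candidates grows as (2 * max_step + 1) ** (n_points - 1), so 7 time points with a maximum step of 2 already give 15625 candidates, which is why a reduced set of profiles is needed in practice.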

Findings

STEM may exclude many genes (or remove noise)
Algorithms using the correlation coefficient as the distance measure perform better
Use STEM to initialize the number of clusters
Euclidean distance is not a good distance measure
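
The difference between the two distance measures in the findings above can be seen on a toy example. This is a sketch with made-up 7-point profiles, not data from the project: two genes with the same shape but different amplitude are far apart in Euclidean distance, yet identical under a correlation-based distance (taken here as 1 minus the Pearson correlation, one common convention).

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series.
    Undefined (division by zero) if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(x, y):
    """0 for identical shapes, 1 for uncorrelated, 2 for opposite shapes."""
    return 1.0 - pearson(x, y)


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical expression profiles over 7 time points.
gene_a = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]  # rises, then falls
gene_b = [0.0, 3.0, 6.0, 9.0, 6.0, 3.0, 0.0]  # same shape, 3x amplitude
gene_c = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # keeps rising

for name, other in (("b", gene_b), ("c", gene_c)):
    print(
        name,
        round(euclidean_distance(gene_a, other), 2),
        round(correlation_distance(gene_a, other), 2),
    )
```

Euclidean distance places gene_a closer to gene_c, which has a different shape, than to gene_b, while the correlation-based distance groups gene_a with gene_b.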
