Project 1c: Predicting Genetic Interactions
For this project, you will implement and run the featurization and random forest procedure described in Yu, et al. (Cell Systems, 2016) on the S. cerevisiae (baker's yeast) data from Costanzo, et al. (Science, 2010).
Data
The input data for your algorithm are a matrix of genetic interaction scores for pairs of genes and a hierarchy of gene sets. The genetic interactions are stored as a square NumPy matrix, with a corresponding file that lists the gene names for the rows/columns. The hierarchy is stored in a tab-separated text file, where each line lists the genes (leaves) in a set (internal node) of the hierarchy.
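As a rough illustration of the format described above, the following sketch loads a matrix, its gene list, and a hierarchy file. The function names, the file names, the use of `.npy` files read with `np.load`, one gene name per line in the gene list, and hierarchy lines that contain only gene names (no set identifier) are all assumptions made for this example; check them against the files in data/examples before relying on them.

```python
import os
import tempfile

import numpy as np


def load_interactions(matrix_path, genes_path):
    """Load a square score matrix and the gene names for its rows/columns."""
    scores = np.load(matrix_path)  # assumes the matrix was saved with np.save
    with open(genes_path) as f:
        genes = [line.strip() for line in f if line.strip()]
    if scores.ndim != 2 or scores.shape[0] != scores.shape[1]:
        raise ValueError("expected a square matrix")
    if scores.shape[0] != len(genes):
        raise ValueError("gene list length does not match matrix size")
    return scores, genes


def load_hierarchy(path):
    """Read one gene set per line; gene names are separated by tabs."""
    gene_sets = []
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            members = frozenset(g for g in fields if g)
            if members:  # skip blank lines
                gene_sets.append(members)
    return gene_sets


# Tiny made-up dataset (hypothetical file names and values).
with tempfile.TemporaryDirectory() as tmp:
    matrix_path = os.path.join(tmp, "scores.npy")
    genes_path = os.path.join(tmp, "genes.txt")
    hierarchy_path = os.path.join(tmp, "hierarchy.tsv")

    np.save(matrix_path, np.array([[0.0, -0.2, 0.1],
                                   [-0.2, 0.0, 0.3],
                                   [0.1, 0.3, 0.0]]))
    with open(genes_path, "w") as f:
        f.write("A\nB\nC\n")
    with open(hierarchy_path, "w") as f:
        f.write("A\tB\nA\tB\tC\n")

    scores, genes = load_interactions(matrix_path, genes_path)
    gene_sets = load_hierarchy(hierarchy_path)
```

Checking that the matrix is square and matches the gene list up front tends to catch row/column misalignment early, before it can silently corrupt the features computed later.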
Example data
You can find a small example dataset for your project in data/examples.
Real data
You will need to download real data for your project and process it into the same format as the example data. You will create an S. cerevisiae hierarchy from the Gene Ontology, and use the genetic interaction data from Costanzo, et al. (Science, 2010). (Costanzo, et al. recently took down the website that hosted the files from their paper, so you can access a copy I downloaded previously using this link.)
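One possible way to turn Gene Ontology annotations into gene sets is to propagate each gene from the term it is annotated to up to all of that term's ancestors, then write one tab-separated line of genes per term. The sketch below shows this on a toy ontology. The function names, the dictionary inputs (`parents`, `direct_genes`), the term and gene names, and the `min_size` filter are hypothetical choices for illustration, not part of the assignment; parsing the actual Gene Ontology and annotation files is not shown.

```python
from collections import defaultdict


def propagate_annotations(parents, direct_genes):
    """Return {term: genes annotated to the term or to any descendant}."""
    term_genes = defaultdict(set)
    for term, genes in direct_genes.items():
        # Walk up from the annotated term to every ancestor.
        stack = [term]
        seen = set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            term_genes[current].update(genes)
            stack.extend(parents.get(current, ()))
    return dict(term_genes)


def format_hierarchy_lines(term_genes, min_size=2):
    """Build one tab-separated line of sorted gene names per gene set."""
    lines = []
    for term in sorted(term_genes):
        genes = sorted(term_genes[term])
        if len(genes) >= min_size:  # assumed filter: drop very small sets
            lines.append("\t".join(genes))
    return lines


# Hypothetical toy ontology: child term -> list of parent terms.
parents = {"T2": ["T1"], "T3": ["T1"], "T4": ["T2", "T3"]}
# Hypothetical direct annotations: term -> genes annotated to it.
direct_genes = {"T2": {"A"}, "T3": {"B"}, "T4": {"C"}}

term_genes = propagate_annotations(parents, direct_genes)
hierarchy_lines = format_hierarchy_lines(term_genes)
```

Because a Gene Ontology term can have more than one parent, the traversal tracks visited terms so that shared ancestors are processed only once per annotated term.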