greenelab / model-free-data

Case-control genetics datasets evolved to be epistatic

Home Page: https://doi.org/b44mk9

License: Creative Commons Zero v1.0 Universal


This repository contains the data for the following two studies:

  1. Evolving hard problems: Generating human genetics datasets with a complex etiology (Preferred Citation)
    Daniel S. Himmelstein, Casey S. Greene, Jason H. Moore
    BioData Mining (2011) DOI: 10.1186/1756-0381-4-21 · PMID: 21736753 · PMCID: PMC3154150

  2. A Model Free Method to Generate Human Genetics Datasets with Complex Gene-Disease Relationships
    Casey S. Greene, Daniel S. Himmelstein, Jason H. Moore
    Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (2010) DOI: 10.1007/978-3-642-12211-8_7

Previously, this data was hosted at http://discovery.dartmouth.edu/model_free_data/. In July 2016, the data was migrated to this repository and modernized. The legacy site is no longer supported; use this repository instead. The data contents are equivalent, although directory structures, file extensions, and text-file header rows have been refined.

Introduction

This repository contains genetics datasets simulated to be complex. Each dataset contains 3,000 observations (rows), which represent the samples/subjects of a genetic association study. Subjects are classified as cases (diseased, Class column is 1) or controls (healthy, Class column is 0). The remaining columns (e.g. X1, X2, X3) represent biallelic SNPs and are coded as 0 for homozygous for one allele, 1 for heterozygous, and 2 for homozygous for the alternate allele. Datasets were created with 3, 4, and 5 SNPs. For example, a subset of a 5-SNP dataset looks like:

X1 X2 X3 X4 X5 Class
0 1 1 1 1 0
0 0 0 1 0 0
2 1 1 2 0 0
1 1 0 1 0 1
1 1 1 1 2 1
1 2 2 0 0 1
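
As a minimal sketch of how this layout could be parsed and checked, the snippet below reads the excerpt above using only the Python standard library. The `read_dataset` helper is hypothetical, and the whitespace delimiter is an assumption based on the excerpt; the files in the repository may use a different delimiter.

```python
import io

# Excerpt copied from the example above; real datasets hold 3,000 rows.
EXCERPT = """X1 X2 X3 X4 X5 Class
0 1 1 1 1 0
0 0 0 1 0 0
2 1 1 2 0 0
1 1 0 1 0 1
1 1 1 1 2 1
1 2 2 0 0 1
"""


def read_dataset(handle):
    """Parse a whitespace-delimited dataset into (snp_names, genotypes, classes).

    Hypothetical helper: assumes a header row whose last column is 'Class'.
    """
    header = handle.readline().split()
    if not header or header[-1] != "Class":
        raise ValueError("expected the last column to be 'Class'")
    genotypes, classes = [], []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        *snps, label = (int(value) for value in line.split())
        if any(g not in (0, 1, 2) for g in snps):
            raise ValueError("SNP genotypes must be coded 0, 1, or 2")
        if label not in (0, 1):
            raise ValueError("Class must be 0 (control) or 1 (case)")
        genotypes.append(snps)
        classes.append(label)
    return header[:-1], genotypes, classes


snp_names, genotypes, classes = read_dataset(io.StringIO(EXCERPT))
n_cases = sum(classes)
n_controls = len(classes) - n_cases
print(snp_names, n_cases, n_controls)
```

To read a file from disk instead, pass an open file object in place of the `io.StringIO` wrapper.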

Datasets were created using an evolution strategy. Each run used a population size of 1,000 (number of datasets) and evolved for 2,000 generations. Each generation consisted of introducing mutations and then selecting for survival the datasets that were optimal for at least one attribute. Only the Pareto-optimal datasets from the final generation of a run were retained. From the Pareto-optimal datasets yielded by a run, a single "best" dataset was chosen.
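
To illustrate what "Pareto-optimal" means here, the following is a generic non-dominated filter. It is not the authors' implementation, which is described in the cited papers. The two objectives and the score values are invented for illustration, and every objective is assumed to be "higher is better".

```python
def dominates(a, b):
    """Return True if score tuple a is at least as good as b on every
    objective and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the indices of score tuples that no other tuple dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical scores per candidate dataset:
# (joint predictiveness, negated lower-order predictiveness).
scores = [(0.9, -0.3), (0.8, -0.1), (0.7, -0.2), (0.9, -0.4)]
front = pareto_front(scores)
print(front)  # → [0, 1]
```

Candidates 2 and 3 are dropped because another candidate matches or beats them on both objectives; candidates 0 and 1 each win on a different objective, so both survive.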

We created datasets with the following attributes:

  • n-way: the number of SNPs in the dataset, ranging from 3 to 5. The joint predictiveness of all SNPs was maximized, while the predictiveness of lower-order SNP combinations was minimized.
  • NoLow: all runs were optimized to have no one-way (marginal) or two-way (pairwise epistatic) associations. NoLow refers to whether, in addition to minimizing 1- and 2-way effects, all lower-order effects were minimized. For example, fivewayNoLow maximized the 5-way effect, while minimizing 1-, 2-, 3-, and 4-way effects.
  • HWE: whether SNPs were optimized to maintain Hardy-Weinberg equilibrium.
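
For readers unfamiliar with the HWE attribute, one common way to quantify departure from Hardy-Weinberg equilibrium at a single SNP is a chi-square statistic comparing observed genotype counts with those expected from the allele frequency. The sketch below is only an illustration of that standard test; the source does not state how HWE was scored during evolution, and `hwe_chi_square` is a hypothetical helper.

```python
def hwe_chi_square(genotypes):
    """Chi-square statistic for departure from Hardy-Weinberg equilibrium.

    genotypes: list of 0/1/2 codes for one SNP across subjects.
    """
    n = len(genotypes)
    observed = [genotypes.count(g) for g in (0, 1, 2)]
    # Frequency of the allele counted by the 0/1/2 coding.
    q = (observed[1] + 2 * observed[2]) / (2 * n)
    p = 1 - q
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    # Skip classes with zero expected count (monomorphic SNP).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Counts of 25/50/25 match the expectation for an allele frequency of 0.5.
in_equilibrium = [0] * 25 + [1] * 50 + [2] * 25
# No heterozygotes at all is a strong departure.
no_heterozygotes = [0] * 50 + [2] * 50

print(hwe_chi_square(in_equilibrium))    # → 0.0
print(hwe_chi_square(no_heterozygotes))  # → 100.0
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05.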

In total, eight types of datasets were created by combining the attributes above. For each dataset type, 100 runs were performed, resulting in 100 best datasets per type. Most users will be interested in only the best datasets, since multiple datasets from the same run (set) may not be independent.

Access datasets

The following table links to the repository location of the datasets for a given attribute combination. SNPs refers to the number of SNPs in each dataset. 1-Way through 5-Way indicate whether associations of that order exist: No means datasets were evolved not to have that effect; Yes means datasets were evolved to have that effect; and Possibly means datasets were not optimized for that effect.

SNPs  1-Way  2-Way  3-Way     4-Way     5-Way  HWE  Name
3     No     No     Yes       -         -      No   threeway
3     No     No     Yes       -         -      Yes  HWthreeway
4     No     No     Possibly  Yes       -      No   fourway
4     No     No     Possibly  Yes       -      Yes  HWfourway
4     No     No     No        Yes       -      No   fourwayNoLow
5     No     No     Possibly  Possibly  Yes    No   fiveway
5     No     No     Possibly  Possibly  Yes    Yes  HWfiveway
5     No     No     No        No        Yes    No   fivewayNoLow

A dash (-) marks association orders that exceed the number of SNPs in the dataset.

Need help?

If you have any questions or feedback, please submit an Issue on GitHub.

License

No rights reserved. This repository, including its datasets, is released under the CC0 Public Domain Dedication (see LICENSE.md).
