dwinter / dfe

Fitting, simulating and generally exploring the distribution of fitness effects from MA studies

R 53.66% C++ 46.34%

dfe's Introduction

# Distributions of fitness effects

This is a work-in-progress package that aims to provide functions for fitting existing and new models of the distribution of fitness effects to data from mutation accumulation (MA) experiments.

As of Jan 2015, everything here is bleedingly alpha and will almost certainly change in the future.

## Package design

### Simulating MA experiments

Functions whose names start with rma_ simulate the fitness effects arising from a mutation accumulation study; the rest of the name gives the distribution from which the fitness effects are drawn. Current options are:

rma_normal()
rma_gamma()
rma_FGM()

rma_FGM() simulates mutations under a parameterization of Fisher's Geometric Model (by default, fitness is determined by the squared distance from the origin; user-defined fitness functions are allowed).
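
As a rough illustration of the idea behind rma_FGM(), the sketch below draws each mutation as a random displacement in phenotype space and records the resulting change in fitness. Everything in it is an assumption made for illustration: the name `fgm_effects_sketch`, its arguments, the Gaussian mutation steps and the `exp(-sum(z^2))` fitness function are hypothetical and are not taken from the package.

```r
# Hypothetical sketch of Fisher's Geometric Model; not the package's rma_FGM().
# Assumed fitness function: depends only on the squared distance from the origin.
sq_dist_fitness <- function(z) exp(-sum(z^2))

fgm_effects_sketch <- function(n, start, sigma = 0.1, fitness = sq_dist_fitness) {
  w0 <- fitness(start)  # fitness of the unmutated phenotype
  vapply(seq_len(n), function(i) {
    # Assumed mutation model: an independent normal step on every trait
    mutant <- start + rnorm(length(start), mean = 0, sd = sigma)
    fitness(mutant) - w0  # fitness effect of a single mutation
  }, numeric(1))
}

set.seed(1)
effects <- fgm_effects_sketch(1000, start = c(0.5, 0, 0))
mean(effects > 0)  # simulated proportion of beneficial mutations
```

A user-defined fitness function can be passed through the `fitness` argument. When the starting phenotype sits at the origin, no mutation in this sketch can be beneficial.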

### Likelihood for observed data

Functions starting dma_*() calculate likelihood (densities) under the normal or gamma distributed models:

dma_gamma()
dma_normal()

There are also ML fitting functions, fit_ma*(), which are... a work in progress.

### Miscellaneous functions

BM() calculates Bateman-Mukai (method of moments) estimators. moments_gamma() calculates the mean and variance of a given Gamma distribution. moments_FGM() estimates the mean, variance, skewness and proportion of beneficial mutations via simulation.
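
For readers unfamiliar with these quantities, here is a minimal sketch of the standard formulas. The function names and arguments are hypothetical; the package's BM() and moments_gamma() may take different inputs and return different structures.

```r
# Hypothetical helpers, not the package's own functions.

# Bateman-Mukai estimators from the change in mean fitness (delta_m)
# and the increase in among-line variance (delta_v).
bm_sketch <- function(delta_m, delta_v) {
  list(U_min = delta_m^2 / delta_v,     # lower bound on the mutation rate
       s_max = delta_v / abs(delta_m))  # upper bound on the mean effect size
}

# Mean and variance of a Gamma distribution with the given shape and rate.
gamma_moments_sketch <- function(shape, rate) {
  c(mean = shape / rate, variance = shape / rate^2)
}

bm_sketch(delta_m = -0.02, delta_v = 0.001)
gamma_moments_sketch(shape = 2, rate = 13)
```

The Bateman-Mukai values are bounds rather than point estimates, because the method assumes that all mutations have the same effect.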

dfe's Issues

Generic rma_* functions

It would be nice to have a "generic" rma function that lets users supply any probability distribution from which to sample fitness effects:

discrete_dfe <- function(n, a,p) sample(a, n, prob=p, replace=TRUE)
rma_custom(n=100, dfe=discrete_dfe, a=c(0.01, 0.1), p=c(0.1, 0.9))

It may even be cleaner to include the existing rma functions in a framework like this.
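
One way such a function could look is sketched below. The name `rma_custom_sketch`, the `Ut` and `Ve` arguments, the Poisson number of mutations per line and the normal environmental noise are all assumptions made for illustration; the issue itself only fixes the `n`, `dfe` and pass-through arguments.

```r
# Sampler from the issue: draws n effects from the values in `a` with probabilities `p`.
discrete_dfe <- function(n, a, p) sample(a, n, prob = p, replace = TRUE)

# Hypothetical sketch of a generic simulator; Ut and Ve are assumed arguments.
rma_custom_sketch <- function(n, dfe, ..., Ut = 1, Ve = 0) {
  n_mut <- rpois(n, lambda = Ut)  # assumed: Poisson number of mutations per line
  genetic <- vapply(n_mut, function(k) {
    if (k == 0) 0 else sum(dfe(k, ...))  # extra arguments are forwarded to the sampler
  }, numeric(1))
  genetic + rnorm(n, mean = 0, sd = sqrt(Ve))  # assumed: normal environmental noise
}

set.seed(1)
sim <- rma_custom_sketch(n = 100, dfe = discrete_dfe, a = c(0.01, 0.1), p = c(0.1, 0.9))
```

Because the sampler only has to accept a count as its first argument, base functions such as rgamma or rnorm could be passed as `dfe` in the same way.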

Use Rcpp objects where possible

At the moment we flip and flop between std::vectors and Rcpp::[Type]Vectors.

At the very least, the Rcpp flavour is easier to read, and explicitly using these types for anything that is passed to or from Rcpp probably saves a little time otherwise spent converting between types.

Implement known mutation models

This will likely end up being a meta-issue, with smaller issues connected to it as they come up.

  • Models in which all mutations have an effect
    • Normal
    • Gamma
    • IG
  • Proportion-neutral models
    • Normal
    • Gamma
    • IG
  • Simulation functions for both (can have proportion neutral = 0)
    • Normal
    • Gamma
    • IG

Catch gsl error

The default behaviour for gsl errors is to abort, bringing the whole R session down.

There are cases in which the integrand is not well behaved (like #3), where it would make more sense to return NA or -999999 (thus allowing loops and fitting functions to charge on ahead).

It's possible to turn this behaviour off with gsl_set_error_handler_off(), but the better course seems to be to write our own error-handling function.

N-G integral is divergent for some simulated values

At present, some fitness values, including those simulated with our rma_gamma function, break the density function. There doesn't seem to be any rhyme or reason to which fitness values do this, but they occur most often when the mean effect of mutations is large and the mutation rate is low:

dma_gamma(B=13, Ut=1.3, a=2, log=TRUE, Ve=1e-4, w=0.889)
dma_gamma(B=13, Ut=1.3, a=2, log=TRUE, Ve=1e-4, w=0.888)
dma_gamma(B=13, Ut=1.3, a=2, log=TRUE, Ve=1e-4, w=0.890)

One workaround, as demonstrated above, might be to catch these errors and take values very slightly either side of the error-producing one. This likely relates to the errors we want to catch in #2.
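
The workaround described above could be sketched roughly as follows. The wrapper name, the `eps` step and the stand-in density are hypothetical, and averaging the two neighbouring log-densities is a crude interpolation chosen only for illustration. Note that this only helps when the failure surfaces as an R error; a gsl abort that takes down the session cannot be caught with tryCatch.

```r
# Hypothetical wrapper: evaluate a log-density at w and, if that fails,
# average the values a small step (eps) either side of w.
density_or_neighbours <- function(dens, w, eps = 1e-3, ...) {
  try_one <- function(x) tryCatch(dens(w = x, ...), error = function(e) NA_real_)
  value <- try_one(w)
  if (!is.na(value)) return(value)
  mean(c(try_one(w - eps), try_one(w + eps)))  # NA if a neighbour also fails
}

# Stand-in density that fails at one specific value, for demonstration only.
fragile_density <- function(w, log = TRUE) {
  if (isTRUE(all.equal(w, 0.889))) stop("integral is divergent")
  dnorm(w, mean = 0.9, sd = 0.05, log = log)
}

density_or_neighbours(fragile_density, w = 0.889)
```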

Constraints for optimization

At present, the way we are setting up calls to mle means the box constraints are not being respected.

We either need to drop down to calling optim directly (and lose mle class features such as the AIC and coef methods) or get to the bottom of programmatically setting upper and lower bounds with a variable number of starting values.
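
For reference, the first option might look something like the sketch below, which calls optim directly with the "L-BFGS-B" method so that the bounds are enforced. The Gamma negative log-likelihood is a stand-in for the package's likelihood functions, the bound values are arbitrary, and reading "variable number of starts" as a start vector of varying length is an assumption.

```r
# Negative log-likelihood of a Gamma sample; a stand-in for the package's likelihoods.
gamma_nll <- function(par, x) {
  -sum(dgamma(x, shape = par[1], rate = par[2], log = TRUE))
}

set.seed(1)
x <- rgamma(200, shape = 2, rate = 13)

start <- c(shape = 1, rate = 1)
fit <- optim(start, gamma_nll, x = x, method = "L-BFGS-B",
             lower = rep(1e-6, length(start)),  # bounds sized to match the start vector
             upper = rep(1e3, length(start)))

fit$par                                     # estimates lie within the box constraints
aic <- 2 * length(fit$par) + 2 * fit$value  # AIC computed by hand from the minimised NLL
```

Computing AIC by hand, as in the last line, recovers part of what is lost by leaving the mle class behind.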

Memory leak with gsl_integration_workspace

Whenever a gsl_integration_workspace is created, it should be freed with gsl_integration_workspace_free. This is especially important for the likelihood functions, which may be called many times during fitting.

At present, at least dma_gamma_known can choke due to this bug.

Consistent ordering / naming of arguments

All functions relating to DFEs, including the internal functions, should have a consistent ordering of arguments, and the arguments should have clear names.

At present the rate parameter of the Gamma is variously called Beta or B, and the shape is called a. These should all be replaced by rate and shape to make their meaning clear. The mean of the normal distribution is called s, which we should replace with mean_effect.

In terms of ordering of arguments, an idealised function would look like this:

f <- function(n, shape, rate, Ve, Ut, log = FALSE, verbose = FALSE) {
    # misc. args like log/verbose come last
    # ...
}

Generic fitting function

It might make the code easier to maintain if we wrap optim or stats4::mle with a generic fitting function designed to suit our models.

Doing so, we could create named arguments for each model parameter and write accessors for the returned fit to allow inspection and plotting.

Optionally return likelihood of MoM estimate

One major reason to use the method of moments estimators is to perform a profile/line search to find starting values for the likelihood functions.

This is a little awkward at the moment, but could be made easier by having the option to include the likelihood in the MoM results.

Manual usage section for fitting functions

Using the new fitting approach (#7) means the roxygen-style automatic documentation doesn't produce a properly formed \usage section.

We will need to override the defaults with @usage.

Error catching for distribution fitting functions

The fit_ma_* functions still throw errors, especially with wacky starting values or mutation sizes that would require a very large mutation rate. This is a pain when the functions are used in apply-family functions as part of simulations, because an error will kill the whole 'loop' and take any earlier results with it.

The short-term solution is to have these functions check for errors and restart a variable number of times before giving up and returning a non-result object. Longer term, it may make sense to try to pick better starting values based on the data.
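
A minimal sketch of the short-term approach is shown below. The wrapper name, its arguments and the shape of the returned list are hypothetical; `fit_fun` stands in for one of the fit_ma_* functions and `start_fun` for whatever generates fresh starting values.

```r
# Hypothetical retry wrapper; fit_fun and start_fun are supplied by the caller.
fit_with_retries <- function(fit_fun, start_fun, max_tries = 5, ...) {
  for (attempt in seq_len(max_tries)) {
    # A failed attempt yields NULL instead of stopping the surrounding loop
    result <- tryCatch(fit_fun(start = start_fun(), ...), error = function(e) NULL)
    if (!is.null(result)) {
      return(list(fit = result, converged = TRUE, attempts = attempt))
    }
  }
  # Non-result object: callers in sapply()/lapply() can test `converged`
  list(fit = NULL, converged = FALSE, attempts = max_tries)
}

# Toy fitting function that rejects negative starting values, for demonstration only.
toy_fit <- function(start) {
  if (start < 0) stop("bad starting value")
  sqrt(start)
}

failed <- fit_with_retries(toy_fit, start_fun = function() -1, max_tries = 3)
worked <- fit_with_retries(toy_fit, start_fun = function() 4)
```

Returning a list with the same fields in both cases means results collected by lapply() keep a uniform structure, so failed runs can be filtered out afterwards rather than losing the whole simulation.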
