gotellilab / EcoSimR

Repository for EcoSimR, by Gotelli, N.J., E.M. Hart, and A.M. Ellison. 2014. EcoSimR 0.1.0.

Home Page: http://ecosimr.org

License: Other

R 100.00%

ecosimr's People

Contributors

adamtclark, davharris, emhart, ngotelli

ecosimr's Issues

Fix Summary Functions

Summary functions print out the internal name used in the package, not the one the user inputs. We should change this, e.g.

Elapsed Time:  0.19 secs 
Metric:  var_ratio 
Algorithm:  uniform_size 

Should be:

Elapsed Time:  0.19 secs 
Metric:  Var.Ratio 
Algorithm:  Uniform.Size
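
A hedged sketch of one way to map internal names back to user-facing names (the lookup table below is illustrative, not the package's actual list):

```r
# Map internal function names to the display names the user typed in.
display_name <- function(internal) {
  lookup <- c(var_ratio = "Var.Ratio", uniform_size = "Uniform.Size")
  out <- unname(lookup[internal])
  # pass unrecognized (e.g. user-supplied) names through unchanged
  out[is.na(out)] <- internal[is.na(out)]
  out
}
```

A pass-through for unknown names would keep the summary working when users supply their own algorithm functions.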

Added functionality

Are there any more algorithms to add? Otherwise I think we should be ready for a release after one more commit.

Error in plot type= "cooc"

Although the stats look correct for all of the different co-occurrence algorithms, this graph is not. It keeps showing the two data matrices as almost identical, no matter what algorithm is used. Note the contrast with the appearance of the simulated matrix in the Quick Start vignette.

errors on cran checking

Running R CMD check --as-cran throws a whole series of errors. This is why the travis-ci builds are failing.

I need to fix the following:

  • checking S3 generic/method consistency ... WARNING
  • no visible binding for global variable
  • no visible global function definition
  • Undocumented arguments in documentation object
  • Undocumented code objects:
  • Running examples in ‘EcoSimR-Ex.R’ failed

null_model_engine not finding algo functions

The following code works for setting up and getting into dev mode:

# Help system set up
# Code from Ted for updating EcoSimR help system
#30 July 2014


# caution: next line appears to reinstall from 
# github and will probably wipe out changes 
# unless they were first committed and pushed
#------------------------------
# install_github("EcoSimR","gotellilab",ref="dev")
#------------------------------


# Add devtools and roxygen and pbapply libraries
library(devtools)
library(roxygen2)
library(pbapply)


# Set path for location of EcoSimR main folder
path <- "C:/Users/Administrator/Documents/GitHub/EcoSimR"

# enter develop mode ON
dev_mode()

# Rebuild files and reinstall library
roxygenize(path)
install(path)

# Open library
library(EcoSimR)

However, at this point, the example code in the help documentation for null_model_engine is failing:

d> data(macwarb)
Warning message:
In data(macwarb) : data set ‘macwarb’ not found

d> data(macwarblers) # but this works OK

d> null_model_engine(macwarb)
Error in parse(text = algo) : argument "algo" is missing, with no default

d> null_model_engine(macwarb,algo="RA1")
Error in eval(expr, envir, enclos) : object 'RA1' not found

d> null_model_engine(macwarblers)
Error in null_model_engine(macwarblers) : object 'macwarblers' not found

d> null_model_engine(macwarblers,algo="RA2")
Error in null_model_engine(macwarblers, algo = "RA2") : 
  object 'macwarblers' not found

Change algorithm and plugin structure

Currently we use a very simple way of running the simulations, via a call to replicate(). However, I don't think this is the right approach because it is very rigid. A better way is to call algorithms and metrics via a do.call() set-up. This means we'll have to ditch the replicate() call and move to using a for loop. A preallocated for loop appears to be just as fast; see: http://stackoverflow.com/questions/13412312/replicate-verses-a-for-loop

The advantage of this is that it allows us to easily add new functionality, and other users could add whatever functionality they want.
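
A minimal sketch of the proposed do.call() set-up (the function name run_null_model and its signature are my assumptions, not the package's actual engine):

```r
# Look up algorithm and metric functions by name, then run a
# preallocated for loop instead of replicate().
run_null_model <- function(speciesData, algo, metric, nReps = 1000) {
  algoF   <- get(algo)
  metricF <- get(metric)
  sim <- numeric(nReps)  # preallocate, so the loop is as fast as replicate()
  for (i in seq_len(nReps)) {
    sim[i] <- do.call(metricF, list(do.call(algoF, list(speciesData))))
  }
  sim
}
```

Because the functions are resolved by name at run time, users could register their own algorithms or metrics simply by defining a function and passing its name.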

RA1 returns 1D matrix with example data set

The code RA1(macwarb) returns a matrix that throws an error with the Pianka metric, because RA1 returns:

 [,1]     [,2]      [,3]      [,4]      [,5]      [,6]     [,7]      [,8]     [,9]     [,10]
[1,] 0.4076 0.633725 0.5044575 0.6635675 0.8789925 0.2179653 0.209956 0.7695814 0.220267 0.2328854
        [,11]     [,12]    [,13]     [,14]     [,15]     [,16]
[1,] 0.884197 0.9193044 0.574637 0.4272182 0.9468288 0.2616613

Seems like RA1 needs to be fixed.

Change naming convention

Naming conventions are currently a bit of a mish-mash of '.', '_', and camelCase. I need to make this consistent.

  • Functions will use '_'
  • Parameters will use camelCase

Degenerate matrix error handling

Hi @ngotelli, you mentioned that we need to check whether matrices are singular. Do you mean that when a matrix is simulated we need to make sure it is not singular? Or that a user's input isn't singular? I was thinking of using this: http://www.inside-r.org/packages/cran/matrixcalc/docs/is.singular.matrix. However, I do worry that checking every simulated matrix will have a computational cost. We'll have to see. Is there anything else we want to check?
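
One cheap alternative, assuming that for presence-absence null models "degenerate" mostly means an empty row or column (my interpretation, not a decision from the thread):

```r
# A degenerate matrix here is one with an all-zero row (species) or
# column (site); this is much cheaper than a full singularity test.
is_degenerate <- function(m) {
  any(rowSums(m) == 0) || any(colSums(m) == 0)
}
```

For square numeric matrices, a rank check such as `qr(m)$rank < ncol(m)` would also avoid the matrixcalc dependency.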

sim9 single

If I understand sim9.single, it will only do a swap on columns that sum to 1, which would mean that sites with only 1 species are the ones being swapped. A simple test is to compare a sample matrix against the simulation:

out <- rep(NA, 1000)
m <- matrix(rbinom(100, 1, 0.5), nrow = 10)
for (i in 1:1000) {
  # use abs(): a swap preserves row and column totals, so the raw sum of
  # differences would be zero even when swaps do occur
  out[i] <- sum(abs(m - sim9.single(m)))
}
sum(out)  # zero here means the matrix was never changed

I noticed this because plotting the simulated vs. real matrix produces the same plot. I'm not sure the swapping is working the way we want it to.

Consistent labels for sim9 and burnin

We have sim9.fast for the underlying algorithm, but users call simFast via the algo= argument. The algo argument should probably be named sim9 for consistency with all the other sims. @emhart I will let you fix that one so that nothing else that uses it breaks.

Also, we should change burnin to burn_in.

Complete DESCRIPTION file

We need to complete this file before pushing to CRAN. I'm happy to chime in on this first, @ngotelli, or I can contribute to what you write. We just need a short and a long description.

RA2 throws error

RA2 throws an error:

> RA2(species_data)
Error in `[<-.data.frame`(`*tmp*`, z, value = c(0.541982688708231, 0.0249929069541395,  : 
  new columns would leave holes after existing columns

Complete documentation

Just wanted to keep an issue open that we can close when all the documentation for metrics and algorithms is sufficient; then we'll know we've reached milestone 0.1.0.

Now using MASS library for gamma function in size overlap

I looked at my code for the gamma distribution and decided to use the MLE estimates of the shape and rate parameters produced by the fitdistr function in the MASS library. Aaron and I decided early on that we would try not to call any other libraries (e.g. ggplot), but MASS should be on all R distributions, so I don't think this will be an issue. However, the MASS library is currently loaded from within the function; you probably want to move that up so it is loaded as a dependency when EcoSimR is loaded.

That brings up another issue, which is that the structure of null_model_engine (at least as we originally built it) introduces a lot of unnecessary calculation inside the functions. Each algorithm has passed to it only an empirical data matrix, and sometimes additional data vectors for weights. All calculations are made inside the function to generate a random data matrix.

This means that the same parameters calculated from the data set are re-calculated every time the algorithm is called to produce a new random matrix. In this case, we are getting the MLE estimators from the same empirical data every time we generate a new randomized data set. But that really isn't necessary. Ideally we should calculate the needed parameters once at the start, and then pass those to the function along with the data. For example, in gamma_size, we pass only the speciesData vector. But the algorithm would run faster if we passed the data and 3 constants:

speciesData
n = length(speciesData)
a = shape  # shape parameter for gamma estimated from fitdistr applied to speciesData
b = rate     # rate parameter for gamma estimated from fitdistr applied to speciesData

Although our method is slower, we probably want to keep it the way it is. Otherwise, we would have to reprogram null_model_engine, which could be complicated. More importantly, doing all the parameter calculations within the function means it can be used outside of null_model_engine, making it a lot more portable.
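
The "compute once, pass in" idea could be sketched like this (the function name gamma_size_fast is illustrative; only fitdistr comes from the discussion above):

```r
library(MASS)

set.seed(1)
speciesData <- rgamma(50, shape = 2, rate = 1)

# estimate the gamma shape and rate ONCE from the empirical data
fit <- fitdistr(speciesData, "gamma")
a <- fit$estimate["shape"]
b <- fit$estimate["rate"]

# the algorithm then just draws from the fitted gamma, with no refitting
gamma_size_fast <- function(speciesData, a, b) {
  sort(rgamma(length(speciesData), shape = a, rate = b))
}
```

Each call to gamma_size_fast then costs one rgamma draw instead of a full maximum-likelihood fit.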

Reset plot parameters

Currently, if a plot is drawn with par(mfrow=c(1,2)) and then you make another plot, the device is still divided into 2 panels. So I just need to reset the plot area.
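
A minimal sketch of the reset, using on.exit() so the parameters are restored even if the plotting code errors (the wrapper function name is illustrative):

```r
two_panel_plot <- function(x, y) {
  opar <- par(no.readonly = TRUE)  # save the current settings
  on.exit(par(opar))               # restore them when the function exits
  par(mfrow = c(1, 2))
  plot(x)
  plot(y)
}
```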

Inequalities for extreme tail probabilities not correct with RA1

Need to check the code for the summary function in niche overlap with RA1 and make sure it handles extreme cases the same way for all algorithms:

# If observed index is smaller than any of the 1000 simulated values, 
# tail inequalities should read:
p(Obs <= Null) < 0.001
p(Obs >= Null) > 0.999
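
One sketch of bounded tail reporting (the function name and return format are my assumptions, not EcoSimR's actual code):

```r
tail_probs <- function(obs, sim) {
  n <- length(sim)
  p_le <- mean(sim <= obs)  # p(Obs <= Null)
  p_ge <- mean(sim >= obs)  # p(Obs >= Null)
  # when obs lies outside every simulated value, report a bounded
  # inequality instead of an exact 0 or 1
  le <- if (p_le == 0) paste("p(Obs <= Null) <", 1 / n) else
        paste("p(Obs <= Null) =", p_le)
  ge <- if (p_ge == 1) paste("p(Obs >= Null) >", (n - 1) / n) else
        paste("p(Obs >= Null) =", p_ge)
  c(le, ge)
}
```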

Add second plot to niche overlap.

Code:

Niche.Overlap.Plot <- function(Data = matrix(rpois(80, 1), nrow = 5),
                               Algorithm = "RA3",
                               Date.Stamp = date(),
                               Plot.Output = "screen") {
  opar <- par(no.readonly = TRUE)  # save settings so they can be restored
  if (Plot.Output == "file") par(mfrow = c(2, 1))

  # observed utilization matrix, rows rescaled to proportions
  Data <- Data / rowSums(Data)
  plot(rep(1:ncol(Data), times = nrow(Data)),
       rep(1:nrow(Data), each = ncol(Data)),
       xlab = "Resource Category", ylab = "Species",
       cex = 10 * sqrt(t(Data) / pi), col = "red3", lwd = 2,
       main = "Observed Utilization Matrix", col.main = "red3", cex.main = 1.5)
  if (Plot.Output == "file") mtext(as.character(Date.Stamp), side = 3, adj = 1, line = 3)

  # one simulated matrix from the chosen algorithm, for comparison
  Fun.Alg <- get(Algorithm)
  One.Null.Matrix <- Fun.Alg(Data)
  One.Null.Matrix <- One.Null.Matrix / rowSums(One.Null.Matrix)
  plot(rep(1:ncol(One.Null.Matrix), times = nrow(One.Null.Matrix)),
       rep(1:nrow(One.Null.Matrix), each = ncol(One.Null.Matrix)),
       xlab = "Resource Category", ylab = "Species",
       cex = 10 * sqrt(t(One.Null.Matrix) / pi), col = "royalblue3", lwd = 2,
       main = "Simulated Utilization Matrix", col.main = "royalblue3", cex.main = 1.5)
  par(opar)
}

Speed enhancements

Many of the algorithms could probably use a speed boost. @ngotelli identified the following:

  • Cscore
  • pianka
  • pianka.skew
  • pianka.var
  • czekanowski
  • czekanowski.skew
  • czekanowski.var

Preparing to update the documentation

Hi Ted:

Good talking to you the other day. I was able to use the code you sent to set up a batch file that roxygenizes and assembles the library after updating the documentation. I have cloned this repository and I have the following files in the R folder:

algorithms.R
cocurrence_null.R
EcoSimR-package.r
general_functions.R
graphics.R
metrics.R
niche_overlap_null.R
null_model_engine.R
sim9fast.R
sizeratio_null.R

Should I be adding documentation for the functions in each of the files? Aaron and I had originally intended for null_model_engine to run generically for most null models in which there is an algorithm, a metric, and a data input file. This structure works for niche overlap and for co-occurrence with all of the algorithms except Sim9. It should also work with size_overlap, although some of those algorithms have some additional inputs.

In this code, it looks like you are breaking out the null_model_engine steps into separate functions for each module. That's fine, but we may want to rename or re-organize things.

For now, unless you tell me otherwise, I will just add documentation to all functions in the files above and also check their performance and functionality.

Best,

Nick

min_ratio question

@ngotelli I was working on tests, and I'm not sure I understand the logic behind min_ratio().

Here's a sequence:

m <- c(2,3,4,5,6)

So on its face I'd expect the minimum ratio to be 2/3 ~= 0.66667.

But min_ratio() gives 0.1823, which is strange, because looking at the code it seems you would want to exponentiate to get the ratio in non-log terms. When I do that, I get 1.2 (which is 6/5). The reason is that when diff works on a sorted vector a, b, c, it takes b - a and c - b. Is that what you want? e.g.

> diff(c(4,2,4))
[1] -2  2

Just asking because, based on the description you wrote of the function, I would write it as:

min_ratio2 <- function(m=runif(20)) {
  m <- sort(log(m), decreasing=T)
  mr <- min(exp(diff(m)))
  return(mr)
}

m <- c(2,3,4,5,6)
min_ratio2(m)
[1] 0.6666667

This gives me what I think of as: "the minimum size ratio difference between adjacent, ordered values".

I just came across this while writing tests, where I need to get back expected values from a known vector.

change output names

I realize that one of the issues with the null model engine is that I have done a lot of work to create output that matches user inputs to outputs, e.g. "Uniform.Size" calls "uniform_size" but outputs the former name. This requires some annoying matching, and I realize it will break the null model engine when people input their own functions. So I'll just remove this and not worry about the output.

roxygenize not creating usable .Rd files

Hi Ted:

I went in to finish the documentation of existing functions in the library and ran into a problem. Briefly, I cloned the repository and initially found all the .Rd files that should be there in the man folder and everything else in order. Next, I ran through the following code that you sent me:

library(devtools)
library(roxygen2)

path <-  getwd()
dev_mode()
roxygenize(path)
install(path)
library(EcoSimR)

When the path was roxygenized, I noticed it was also deleting some files, and those were ones that used to be named in caps (RA1-RA4, Pianka, Czechanowski, Sim1-Sim8,Sim10). After that first pass through with roxygenize, those .Rd files are now missing, even though the functions are still present in the algorithms and metrics source files.

Next, I re-roxygenized the path and the renamed functions were restored with their .Rd files written. All good, and I never generated any errors or warnings with the R commands. But when I go to load anything in the EcoSimR help system from the library, I get the following sort of error:

error in fetch(key): cannot allocate memory block of size 2.2 GB

Possibly I screwed something up 5 months ago when I pushed my changes, but everything was working correctly in my cloned version then.

Thanks for your help. I would really like to get this documentation completed so we can move on to the next phase!

Best,

Nick

Documentation style look OK?

Hi Ted:

OK, I have pushed some documentation for the pianka niche overlap index. Please take a look and see if this is OK. I have tried to imitate the style of other R packages. The material I found said I could use LaTeX in the documentation, but when I tried, I could only generate very simple equations with no superscripts, subscripts, or summation. That's a shame, but it seems to be that way for other packages as well.

If this looks OK to you, I will move on to the rest of the documentation. It sure is a nuisance to have to quit R every time I want to inspect the help system, but at least now I have a working template.

Best,

Nick

Change set.seed behavior

@davharris pointed out that the way we call set.seed() inside the function reduces entropy. A better practice is to just set the seed at the beginning of the session. I'll have to give this some thought and possibly change it.
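
A sketch of the suggested practice: draw and record one random seed per session instead of calling set.seed() inside the null model function on every run.

```r
# pick a seed once, at the start of the session, and keep it
saved_seed <- sample.int(.Machine$integer.max, 1)
set.seed(saved_seed)

# ... run models; storing saved_seed with the output means any result
# can be reproduced later by calling set.seed(saved_seed) again
```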

Labels for data sets

macwarb, rodents, and wiFinches should be renamed dataMacWarb, dataRodents, and dataWiFinches so they are obvious and pop up together in the help system.

Write tests

There are many different algorithm combinations, and we should write tests for all of them to ensure that all of our code works.

  • Size Ratio Tests
  • Niche Tests
  • Co-Occurrence Tests
  • Null Model engine tests
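
One base-R sketch of the kind of invariant these tests could assert (ra3_like is a hypothetical stand-in for a niche overlap algorithm that reshuffles values within rows, so row sums must be preserved):

```r
# reshuffle each row independently; row totals should be unchanged
ra3_like <- function(m) t(apply(m, 1, sample))

m <- matrix(runif(20), nrow = 4)
stopifnot(isTRUE(all.equal(rowSums(ra3_like(m)), rowSums(m))))
```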

`niche_null_model` giving incorrect results for observed index

For the MacArthur's warblers data set, the observed value for the pianka index should be 0.5551383. This value is confirmed if you strip out the row names in the first column of the macwarb data frame and run the pianka function directly on the matrix:

z <- as.matrix(macwarb[,-1],nrow=5)
pianka(z)
[1] 0.5551383

In the printed output on the quickstart page, this is also what was calculated. However, if you now run this through niche_null_model you see the following

myWarblers <- niche_null_model(macwarb)
summary(myWarblers)

Time Stamp:  Fri Feb 20 21:25:52 2015 
Random Number Seed Saved:  TRUE 
Number of Replications:  1000 
Elapsed Time:  0.4 secs 
Metric:  Pianka 
Algorithm:  ra3 
Observed Index:  0.40673 
Mean Of Simulated Index:  0.38975 
Variance Of Simulated Index:  0.0022413 
Lower 95% (1-tail):  0.32403 
Upper 95% (1-tail):  0.47414 
Lower 95% (2-tail):  0.31484 
Upper 95% (2-tail):  0.4951 
P(Obs <= null) =  0.688 
P(Obs >= null) =  0.312 
P(Obs = null) =  0 
Standardized Effect Size (SES):  0.3587 

Moreover, the Observed Index value changes with each run of the model, which of course it should not be doing. The observed value should be significantly larger than expected by chance, but now it is effectively random. I am getting similar results with ra1, raw, ra4, and with Czekanowski. The simulated distributions look correct. I strongly suspect that the observed index is now being calculated from one of the simulated matrices, rather than from the original data matrix.
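
If that suspicion is right, the fix would be to compute the observed index from the original data once, before the simulation loop, so it can never be overwritten by a simulated matrix. A sketch (function names are illustrative, not the engine's actual code):

```r
run_engine <- function(speciesData, algoF, metricF, nReps = 1000) {
  obs <- metricF(speciesData)  # observed index: original data only
  sim <- numeric(nReps)
  for (i in seq_len(nReps)) {
    sim[i] <- metricF(algoF(speciesData))  # simulated values only here
  }
  list(obs = obs, sim = sim)
}
```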

Size ratio algorithms

Some of the size ratio algorithms require extra input. We need a nice way to handle this; I'm thinking it should be a list of control parameters. But this can be worked out pretty easily.

Organized EcoSim functions into modules

Hi Ted -

Here is a catalog of all the functions that are currently in EcoSimR, and how they should be organized into distinct modules. For co-occurrence, the Sim9 and Sim9.Fast modules do not work inside the original Null.Model.Engine because the replicate matrices are not independent of one another but are built consecutively. Also, some of the algorithms take special inputs, such as site or species weights (e.g., Sim10).

Let me know if you have any questions.

Best,

Nick

General Functions

  • Get.Params
  • Data.Read
  • Set.The.Seed
  • Null.Model.Engine
  • Null.Model.Summary
  • Output.Results

Co-occurrence Module

Metrics

  • Species.Combo
  • Checker
  • C.Score [DEFAULT]
  • C.Score.var
  • C.Score.skew
  • V.Ratio

Algorithms

  • Sim9
  • (Sim9.Single) [Single replicate of Sim9.Fast]
  • Sim9.Fast [DEFAULT]
  • (VectorSample) [used by other functions]
  • Sim1
  • Sim2
  • Sim3
  • Sim4
  • Sim5
  • Sim6
  • Sim7
  • Sim8
  • Sim10

Graphics

  • Null.Model.Plot [DEFAULT]
  • CoOccurrence.Plot [DEFAULT]
  • Burn.In.Plot [Useful for adjusting burn-in parameter for Sim9.Fast]

Sample Data

  • West Indies Finches.csv [DEFAULT]

Niche Overlap Module

Metrics

  • Pianka [DEFAULT]
  • Czekanowski
  • Pianka.var
  • Czekanowski.var
  • Pianka.skew
  • Czekanowski.skew

Algorithms

  • RA1
  • RA2
  • RA3 [DEFAULT]
  • RA4

Graphics

  • Null.Model.Plot [DEFAULT]
  • Niche.Overlap.Plot [DEFAULT]

Sample Data

  • MacArthur Warblers.csv [DEFAULT]

Size Ratio Module

Metrics

  • Min.Diff
  • Min.Ratio
  • Var.Diff
  • Var.Ratio [DEFAULT]

Algorithms

  • Uniform.Size [DEFAULT]
  • Uniform.Size.User
  • Source.Pool.Draw
  • Gamma.Size

Graphics

  • Null.Model.Plot [DEFAULT]
  • Size.Ratio.Plot [DEFAULT]

Sample Data

  • Desert Rodents.csv [DEFAULT]

Sim 9 Fast

Sim 9 Fast is a bit tricky because it doesn't fit in with the whole framework of the rest of the algorithms. I need to work out how it all fits in the new object for co-occurrence.

Suspicious results

sim2 in co-occurrence should give an unusually low score, not a high one, if species are reshuffled equiprobably among sites. Check with a plot of the simulated data.

The numbers for the observed and mean simulated indices from size overlap with algo=size_gamma look incorrect. Need to check against the plot.

Improved C-score and related metrics.

We should definitely implement @davharris's code for calculating the c_score, but restore line 369, which is the error check for empty rows.

Once this is done we should then modify the code for c_score.skew and c_score.var so that they call c_score. The error check needs to live only in c_score, not in these other functions.

@emhart how do I pull in @davharris's fork on GitHub to get the new function?
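
For reference, a hedged sketch of a vectorized C-score using the standard checkerboard-unit formula (this is not necessarily @davharris's exact code), with the empty-row check living only in c_score:

```r
c_score <- function(m) {
  m <- m[rowSums(m) > 0, , drop = FALSE]  # error check: drop empty rows
  r <- rowSums(m)
  shared <- tcrossprod(m)             # sites shared by each species pair
  cu <- (r - shared) * t(r - shared)  # checkerboard units (r_i - S)(r_j - S)
  mean(cu[upper.tri(cu)])             # average over all species pairs
}
```

c_score.skew and c_score.var could then call c_score rather than duplicating the check.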

Add Travis ci

We should add continuous integration badges for Travis CI and AppVeyor, and also a badge for test coverage.

Develop future S3 class structure

Develop a strategy for summarizing, visualizing, and writing model output. Currently the project has no roadmap for creating standardized output. Consider all desired model outputs and how best to organize them as classes. It would be helpful to get a list from someone else in the group.

Create S3 class outputs

Much of the current codebase tries to do everything in one function. Create a null model S3 class object whose methods replace what are now separate functions.
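
A sketch of what such an S3 class might look like (the class name and fields below are assumptions, not the package's final design):

```r
# constructor for a null model result object
null_model_result <- function(obs, sim, metric, algo) {
  structure(list(obs = obs, sim = sim, metric = metric, algo = algo),
            class = "nullmod")
}

# a method then replaces today's free-standing summary function
summary.nullmod <- function(object, ...) {
  cat("Metric: ", object$metric, "\n")
  cat("Algorithm: ", object$algo, "\n")
  cat("Observed Index: ", object$obs, "\n")
  cat("Mean Of Simulated Index: ", mean(object$sim), "\n")
  invisible(object)
}
```

plot.nullmod and print.nullmod methods could be added the same way, so users just call plot() and summary() on any result.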

Correct branch for last push/pull?

Hi Ted:

I just completed some annotations for the size overlap metrics. But when I went to push them, it failed because the branch was not up to date. So I pulled changes first, and then the push worked fine. My concern now is that I appear to have pulled from and then pushed to the master branch. But now I wonder whether I shouldn't have been on the dev branch? Sorry if I have botched this up, but hopefully you can repair it quickly if I have made a mistake. I am still wearing my "newbie hat" on GitHub.

Best,

Nick
