gotellilab / EcoSimR
Repository for EcoSimR, by Gotelli, N.J., Hart, E.M., and A.M. Ellison. 2014. EcoSimR 0.1.0
Home Page: http://ecosimr.org
License: Other
Summary functions print out the internal name used in the package, not the one the user inputs. We should change this, e.g.
Elapsed Time: 0.19 secs
Metric: var_ratio
Algorithm: uniform_size
Should be:
Elapsed Time: 0.19 secs
Metric: Var.Ratio
Algorithm: Uniform.Size
Build a webpage, possibly hosted on gh-pages, using http://staticdocs.had.co.nz/dev/
Are there any more algorithms to add? Otherwise I think we should be ready for a release after one more commit.
@gavinsimpson suggested that we look at some functions from vegan about how to simulate species data.
specifically:
Although the stats look correct for all of the different co-occurrence algorithms, this graph is not. It keeps showing the two data matrices as almost identical, no matter what algorithm is used. Note the contrast with the appearance of the simulated matrix in the Quick Start vignette.
Running R CMD check --as-cran throws a whole series of errors; this is why the travis-ci builds are failing.
I need to fix the following:
The following code works for setting up and getting into dev mode:
# Help system set up
# Code from Ted for updating EcoSimR help system
#30 July 2014
# caution: next line appears to reinstall from
# github and will probably wipe out changes
# unless they were first committed and pushed
#------------------------------
# install_github("EcoSimR","gotellilab",ref="dev")
#------------------------------
# Add devtools and roxygen and pbapply libraries
library(devtools)
library(roxygen2)
library(pbapply)
# Set path for location of EcoSimR main folder
path <- "C:/Users/Administrator/Documents/GitHub/EcoSimR"
# enter develop mode ON
dev_mode()
# Rebuild files and reinstall library
roxygenize(path)
install(path)
# Open library
library(EcoSimR)
However, at this point, the example code in the help documentation for null_model_engine is failing:
d> data(macwarb)
Warning message:
In data(macwarb) : data set ‘macwarb’ not found
d> data(macwarblers) # but this works OK
d> null_model_engine(macwarb)
Error in parse(text = algo) : argument "algo" is missing, with no default
d> null_model_engine(macwarb,algo="RA1")
Error in eval(expr, envir, enclos) : object 'RA1' not found
d> null_model_engine(macwarblers)
Error in null_model_engine(macwarblers) : object 'macwarblers' not found
d> null_model_engine(macwarblers,algo="RA2")
Error in null_model_engine(macwarblers, algo = "RA2") :
object 'macwarblers' not found
Currently we use a very simple way of running the simulations, via a call to replicate(). However, I don't think this is the right approach because it is very rigid. A better way is to call algorithms and metrics via a do.call() set-up. This means we'll have to ditch the replicate() call and move to a for loop; a preallocated for loop appears to be just as fast, see: http://stackoverflow.com/questions/13412312/replicate-verses-a-for-loop
The advantage is that it lets us easily add new functionality, and other users could add whatever functionality they want.
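A minimal sketch of what the do.call() set-up could look like (all names here are illustrative, not EcoSimR's actual API):

```r
# Hypothetical engine loop: algorithms and metrics are passed by name
# and dispatched with do.call(), inside a preallocated for loop.
run_null_model <- function(speciesData, algo, metric,
                           nReps = 1000, algoOpts = list()) {
  sim <- numeric(nReps)  # preallocate the result vector
  for (i in seq_len(nReps)) {
    randomMatrix <- do.call(algo, c(list(speciesData), algoOpts))
    sim[i] <- do.call(metric, list(randomMatrix))
  }
  sim
}

# Toy stand-ins for an algorithm and a metric, just to show the call:
shuffle_rows <- function(m) t(apply(m, 1, sample))
col_sum_var  <- function(m) var(colSums(m))

m <- matrix(rbinom(50, 1, 0.5), nrow = 5)
sim <- run_null_model(m, "shuffle_rows", "col_sum_var", nReps = 100)
```

Because the algorithm and metric are plain function names, a user-supplied function would work exactly like a built-in one.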
The code RA1(macwarb) returns a matrix that throws an error with the Pianka metric. This is because RA1 returns:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.4076 0.633725 0.5044575 0.6635675 0.8789925 0.2179653 0.209956 0.7695814 0.220267 0.2328854
[,11] [,12] [,13] [,14] [,15] [,16]
[1,] 0.884197 0.9193044 0.574637 0.4272182 0.9468288 0.2616613
Seems like RA1 needs to be fixed.
We should decide what to do with the code in the contributed folder. Do we want to wrap this into our first official release @ngotelli ?
Naming conventions are currently a bit mish-mash with '.', '_' and camelCase. I need to make this consistent.
Hi @ngotelli you mentioned that we need to check if matrices are singular or not. Do you mean to say that when a matrix is simulated that we need to make sure that it is not singular? Or that a user input isn't singular? I was thinking of using this: http://www.inside-r.org/packages/cran/matrixcalc/docs/is.singular.matrix. However I do worry that checking every simulated matrix will have a computational cost. We'll have to see. Is there anything else we want to check?
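If we do add a check, a base-R rank test would avoid pulling in matrixcalc as a dependency. This is just a sketch of the idea (is_singular is a hypothetical helper name):

```r
# A square matrix is singular when its rank is below its dimension.
is_singular <- function(m, tol = 1e-8) {
  if (nrow(m) != ncol(m)) stop("singularity is only defined for square matrices")
  qr(m, tol = tol)$rank < nrow(m)
}

is_singular(diag(3))                          # FALSE
is_singular(matrix(c(1, 2, 2, 4), nrow = 2))  # TRUE: second column is 2x the first
```

Running this on every simulated matrix would indeed add cost (a QR decomposition per replicate), so checking only the user's input matrix may be the better trade-off.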
If I understand sim9 single, it will only do a swap on columns that sum to 1, which would mean sites that have only 1 species will be swapped. A simple test is to just test a sample matrix vs the simulation.
out <- rep(NA, 1000)
m <- matrix(rbinom(100, 1, 0.5), nrow = 10)
for (i in 1:1000) {
  # use abs() to count changed cells; a plain sum() is always 0
  # because a swap preserves the matrix total
  out[i] <- sum(abs(m - sim9.single(m)))
}
sum(out)  # 0 here would mean no swap ever happened
I noticed this because the plotting of the simulated vs. real matrix reveals the same plot. I'm not sure if the swapping is working the way we want it to.
We have sim9.fast for the underlying algorithm, but users call simFast for the algo= argument. The algo argument should probably be named sim9 for consistency with all the other sims. @emhart I will let you fix that one so that nothing else breaks that is using it.
Also, we should change burnin to burn_in.
We need to complete this file before pushing to CRAN. I'm happy to chime in on this first @ngotelli or I can contribute to what you write. We just need a short and a long description.
RA2 throws an error:
> RA2(species_data)
> Error in `[<-.data.frame`(`*tmp*`, z, value = c(0.541982688708231, 0.0249929069541395, :
new columns would leave holes after existing columns
Just wanted to keep an issue open that we can close when all the documentation for metrics and algorithms is sufficient and we'll know that we've reached milestone 0.1.0
I looked at my code for the gamma distribution and decided to use the MLE estimators of the shape and rate parameters that are produced with the fitdistr function in the MASS library. Aaron and I decided early on that we would try not to call any other libraries (e.g. ggplot), but the MASS library should be in all R distributions, so I don't think this will be an issue. However, the MASS library is now loaded from within the function, and you probably want to move that up so it loads as a dependency when EcoSimR is loaded.
That brings up another issue, which is that the structure of null_model_engine (at least as we originally built it) introduces a lot of unnecessary calculation inside the functions. Each algorithm has passed to it only an empirical data matrix, and sometimes additional data vectors for weights. All calculations are made inside the function to generate a random data matrix.
This means that the same parameters calculated from the data set are re-calculated every time the algorithm is called to produce a new random matrix. In this case, we are getting the MLE estimates from the same empirical data every time we generate a new randomized data set. But that really isn't necessary. Ideally we should calculate the needed parameters once at the start, and then pass those to the function along with the data. For example, in gamma_size, we pass only the speciesData vector. But the algorithm would run faster if we passed the data and 3 constants:
speciesData
n = length(speciesData)
a = shape # shape parameter for gamma estimated from fitdistr applied to speciesData
b = rate # rate parameter for gamma estimated from fitdistr applied to speciesData
Although our method is slower, we probably want to keep it the way it is. Otherwise, we would have to reprogram the null_model_engine, which could be complicated. More importantly, by doing all the parameter calculations within the function, it can be used outside of null_model_engine, so it is a lot more portable.
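For reference, the precomputation idea would look roughly like this (a sketch only; gamma_size_fast is a hypothetical name, not a current EcoSimR function):

```r
library(MASS)  # for fitdistr()

set.seed(42)
speciesData <- rgamma(50, shape = 2, rate = 1)

# Estimate the gamma parameters once, outside the simulation loop
fit <- fitdistr(speciesData, "gamma")
a <- fit$estimate["shape"]
b <- fit$estimate["rate"]
n <- length(speciesData)

# Each replicate then just draws from the fixed constants
gamma_size_fast <- function(n, a, b) rgamma(n, shape = a, rate = b)
nullSizes <- gamma_size_fast(n, a, b)
```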
This should be something more intuitive like "reproducible" T/F
Currently, if a plot is drawn with par(mfrow=c(1,2)) set and you then want another plot, it is still divided into 2 panels. So I just need to reset the plot area.
Need to check the code for the summary function in niche overlap with ra1 and make sure it is handling extreme cases the same way for all algorithms:
# If observed index is smaller than any of the 1000 simulated values,
# tail inequalities should read:
p(Obs <= Null) < 0.001
p(Obs >= Null) > 0.999
Code here
Niche.Overlap.Plot <- function(Data = matrix(rpois(80, 1), nrow = 5),
                               Algorithm = "RA3",
                               Date.Stamp = date(),
                               Plot.Output = "screen") {
  opar <- par(no.readonly = TRUE)
  if (Plot.Output == "file") par(mfrow = c(2, 1))
  Data <- Data / rowSums(Data)
  plot(rep(1:ncol(Data), times = nrow(Data)),
       rep(1:nrow(Data), each = ncol(Data)),
       xlab = "Resource Category", ylab = "Species",
       cex = 10 * sqrt(t(Data) / pi), col = "red3", lwd = 2,
       main = "Observed Utilization Matrix", col.main = "red3", cex.main = 1.5)
  if (Plot.Output == "file") mtext(as.character(Date.Stamp), side = 3, adj = 1, line = 3)
  Fun.Alg <- get(Algorithm)
  One.Null.Matrix <- Fun.Alg(Data)
  One.Null.Matrix <- One.Null.Matrix / rowSums(One.Null.Matrix)
  plot(rep(1:ncol(One.Null.Matrix), times = nrow(One.Null.Matrix)),
       rep(1:nrow(One.Null.Matrix), each = ncol(One.Null.Matrix)),
       xlab = "Resource Category", ylab = "Species",
       cex = 10 * sqrt(t(One.Null.Matrix) / pi), col = "royalblue3", lwd = 2,
       main = "Simulated Utilization Matrix", col.main = "royalblue3", cex.main = 1.5)
  par(opar)
}
Both plot type="size" and type="hist" throw error in xy.coords
Many of the algorithms could probably use a speed-up. @ngotelli identified the following.
Hi Ted:
Good talking to you the other day. I was able to use the code you sent to set up a batch file that roxygenizes and assembles the library after updating the documentation. I have cloned this repository and I have the following files in the R folder:
algorithms.R
cocurrence_null.R
EcoSimR-package.r
general_functions.R
graphics.R
metrics.R
niche_overlap_null.R
null_model_engine.R
sim9fast.R
sizeratio_null.R
Should I be adding documentation for the functions in each of the files? Aaron and I had originally intended for null_model_engine to run generically for most null models in which there is an algorithm, a metric, and a data input file. This structure works for niche overlap and for co-occurrence with all of the algorithms except Sim9. It should also work with size_overlap, although some of those algorithms have some additional inputs.
In this code, it looks like you are breaking out the null_model_engine steps into separate functions for each module. That's fine, but we may want to rename or re-organize things.
For now, unless you tell me otherwise, I will just add documentation to all functions in the files above and also check their performance and functionality.
Best,
Nick
@ngotelli I was working on tests, and I'm not sure I understand the logic behind min_ratio().
Here's a sequence:
m <- c(2,3,4,5,6)
So on its face I'd expect the minimum ratio to be
2/3 ~= 0.66667
But min_ratio() gives 0.1823
Which is strange because looking at the code it seems that you would want to exponentiate to get the ratio in non-log terms. When I do that, I get 1.2 (which is 6/5). So the reason we get that is because when diff is working on a sorted vector of a,b,c, it takes b - a and c - b. Is that what you want?
e.g.
> diff(c(4,2,4))
[1] -2 2
Just ask because based on the description you wrote of the function I would write it as:
min_ratio2 <- function(m = runif(20)) {
  m <- sort(log(m), decreasing = TRUE)
  mr <- min(exp(diff(m)))
  return(mr)
}
m <- c(2,3,4,5,6)
min_ratio2(m)
[1] 0.6666667
This gives me what I think of as: "the minimum size ratio difference between adjacent, ordered values".
I just came across this while I was writing tests and need to get back the expected values from a known vector.
eval(parse()) is bad practice http://stackoverflow.com/questions/13649979/what-specifically-are-the-dangers-of-evalparse
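For dispatching a function chosen by name, base R already offers safer equivalents:

```r
# Instead of eval(parse(text = algo)), look the function up by name:
algo <- "mean"
f <- match.fun(algo)       # errors early if no such function exists
f(1:10)                    # 5.5

# Or call it by name in one step:
do.call(algo, list(1:10))  # 5.5
```

Unlike eval(parse(...)), these only resolve a single function name rather than evaluating an arbitrary expression.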
I realize that one of the issues with the null model engine is that I have done a lot of work to make the output match the user's inputs, e.g. "Uniform.Size" calls "uniform_size" and the output shows the former name. This requires some annoying matching, and I realize it will break the null model engine where people input their own functions. So I'll just remove this and not worry about the output.
Hi Ted:
I went in to finish the documentation of existing functions in the library and ran into a problem. Briefly, I cloned the repository and initially found all the .Rd files that should be there in the man folder and everything else in order. Next, I ran through the following code that you sent me:
library(devtools)
library(roxygen2)
path <- getwd()
dev_mode()
roxygenize(path)
install(path)
library(EcoSimR)
When the path was roxygenized, I noticed it was also deleting some files, namely the ones that used to be named in caps (RA1-RA4, Pianka, Czekanowski, Sim1-Sim8, Sim10). After that first pass through roxygenize, those .Rd files are now missing, even though the functions are still present in the algorithms and metrics source files.
Next, I re-roxygenized the path and the renamed functions were restored with their .Rd files written. All good, and I never generated any errors or warnings with the R commands. But when I go to load anything in the EcoSimR help system from the library, I get the following sort of error:
error in fetch(key): cannot allocate memory block of size 2.2 GB
Possibly I screwed something up 5 months ago when I pushed my changes, but everything was working correctly in my cloned version then.
Thanks for your help. I would really like to get this documentation completed so we can move on to the next phase!
Best,
Nick
Hi Ted:
OK, I have pushed some documentation for the pianka niche overlap index. Please take a look and see if this is OK. I have tried to imitate the style in other R packages. The material I found said I could use LaTeX in the documentation, but when I tried I could only generate very simple equations with no superscripts, subscripts, or summation. That's a shame, but it seems to be that way for other packages as well.
If this looks OK to you, I will move on to the rest of the documentation. It sure is a nuisance to have to quit R every time I want to inspect the help system, but at least now I have a working template.
Best,
Nick
@davharris pointed out that the way we have set.seed() in the function reduces entropy. A better practice is to set the seed just once at the beginning of the session. I'll have to think about this and possibly change it.
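The session-level pattern would look something like this (illustrative only):

```r
# Set the seed once per session and keep it, rather than calling
# set.seed() inside every simulation function.
mySeed <- 12345
set.seed(mySeed)
firstRun <- runif(5)

# The saved seed is enough to reproduce the whole run later:
set.seed(mySeed)
secondRun <- runif(5)
identical(firstRun, secondRun)  # TRUE
```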
macwarb, rodents, and wiFinches should be renamed dataMacWarb, dataRodents, and dataWiFinches so they are obvious and pop up together in the help system.
There are many different algorithm and metric combinations, and we should write tests for all of them to ensure that all of our code works.
For the MacArthur's warblers data set, the observed value for the pianka index should be 0.5551383. This value is confirmed if you strip out the row names in the first column of the macwarb data frame and run the pianka function directly on the matrix:
z <- as.matrix(macwarb[,-1],nrow=5)
pianka(z)
> [1] 0.5551383
In the printed output on the quickstart page, this is also what was calculated. However, if you now run this through niche_null_model you see the following:
myWarblers <- niche_null_model(macwarb)
summary(myWarblers)
Time Stamp: Fri Feb 20 21:25:52 2015
Random Number Seed Saved: TRUE
Number of Replications: 1000
Elapsed Time: 0.4 secs
Metric: Pianka
Algorithm: ra3
Observed Index: 0.40673
Mean Of Simulated Index: 0.38975
Variance Of Simulated Index: 0.0022413
Lower 95% (1-tail): 0.32403
Upper 95% (1-tail): 0.47414
Lower 95% (2-tail): 0.31484
Upper 95% (2-tail): 0.4951
P(Obs <= null) = 0.688
P(Obs >= null) = 0.312
P(Obs = null) = 0
Standardized Effect Size (SES): 0.3587
Moreover, the Observed Index value changes with each run of the model, which of course it should not be doing. The observed value should be significantly larger than expected by chance, but now it is effectively random. I am getting similar results with ra1, raw, ra4, and with Czekanowski. The simulated distributions look correct. I strongly suspect that the observed index is now being calculated from one of the simulated matrices, rather than from the original data matrix.
Some of the size ratio algorithms require extra input. We need a nice way to do this. I'm thinking it should be a list of control parameters. This can be worked out pretty easily.
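One possible shape for this (names are hypothetical, just to sketch the idea): a constructor that returns a list of control parameters with defaults, which the engine splices into the algorithm call.

```r
# Bundle algorithm-specific settings into one list argument
sizeRatioOpts <- function(minSize = NULL, maxSize = NULL,
                          sourcePool = NULL) {
  list(minSize = minSize, maxSize = maxSize, sourcePool = sourcePool)
}

# The engine could then forward the options generically, e.g.:
# do.call(algo, c(list(speciesData), algoOpts))
opts <- sizeRatioOpts(minSize = 1, maxSize = 10)
```

Algorithms that need no extra input simply receive an empty list.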
Hi Ted -
Here is a catalog of all the functions that are currently in EcoSimR, and how they should be organized into distinct modules. For co-occurrence, the Sim9 and Sim9.Fast modules do not work inside of the original Null.Model.Engine because the replicate matrices are not independent of one another, but are built consecutively. Also, some of the algorithms contain special inputs, such as site or species weights (e.g., Sim10).
Let me know if you have any questions.
Best,
Nick
General functions:
Get.Params
Data.Read
Set.The.Seed
Null.Model.Engine
Null.Model.Summary
Output.Results
Co-occurrence metrics:
Species.Combo
Checker
C.Score [DEFAULT]
C.Score.var
C.Score.skew
V.Ratio
Co-occurrence algorithms:
Sim9
Sim9.Single [single replicate of Sim9.Fast]
Sim9.Fast [DEFAULT]
VectorSample [used by other functions]
Sim1
Sim2
Sim3
Sim4
Sim5
Sim6
Sim7
Sim8
Sim10
Co-occurrence graphics and data:
Null.Model.Plot [DEFAULT]
CoOccurrence.Plot [DEFAULT]
Burn.In.Plot [useful for adjusting the burn-in parameter for Sim9.Fast]
West Indies Finches.csv [DEFAULT]
Niche overlap metrics:
Pianka [DEFAULT]
Czekanowski
Pianka.var
Czekanowski.var
Pianka.skew
Czekanowski.skew
Niche overlap algorithms:
RA1
RA2
RA3 [DEFAULT]
RA4
Niche overlap graphics and data:
Null.Model.Plot [DEFAULT]
Niche.Overlap.Plot [DEFAULT]
MacArthur Warblers.csv [DEFAULT]
Size ratio metrics:
Min.Diff
Min.Ratio
Var.Diff
Var.Ratio [DEFAULT]
Size ratio algorithms:
Uniform.Size [DEFAULT]
Uniform.Size.User
Source.Pool.Draw
Gamma.Size
Size ratio graphics and data:
Null.Model.Plot [DEFAULT]
Size.Ratio.Plot [DEFAULT]
Desert Rodents.csv [DEFAULT]
Sim9.Fast is a bit tricky because it doesn't fit in with the whole framework of the rest of the algorithms. I need to work out how it all fits in the new object for co-occurrence.
Update c_score code in _var and _skew to be more like @davharris PR for c_score.
sim2 in co-occurrence should be giving an unusually low score, not high, if species are reshuffled equiprobably among sites. Check with a plot of the simulated data.
Numbers for observed and mean of simulated from size overlap with algo=size_gamma look incorrect. Need to check against the plot.
Currently niche overlap has several specialized summary and plot methods. Convert these to work on S3 null model objects. e.g.
plot(null_model_obj)
summary(null_model_obj)
We should definitely implement @davharris code for calculating the c_score, but restore line 369, which is the error check for empty rows.
Once this is done we should then modify the code for c_score.skew and c_score.var so that they call c_score. The error check needs to live only in c_score, not in these other functions.
@emhart how do I pull in @davharris's fork on GitHub to get the new function?
We should add continuous integration badges for Travis CI and AppVeyor, as well as a badge for test coverage.
Develop a strategy for summarizing, visualizing and writing model output. Currently the project has no roadmap to create a standardized output. Consider all desired model outputs and how best to organize them as classes. Would be helpful to get a list from someone else in the group.
Need to add default row weights for sim10.
Convert the tutorials in use_tutorials to knitr vignettes.
When I changed summary output to match user inputs, I broke some of the plotting functions. I need to fix this.
Much of the current codebase tries to do everything in the function. Create a null model s3 class object that can have methods that are now separate functions.
Hi Ted:
I just completed some annotations for the size overlap metrics. But when I went to push them, it failed because the branch was not up to date. So, I pulled changes first, and then the push worked fine. My concern now is that I appeared to have pulled and then pushed to the master branch. But now I wonder if I shouldn't have been on the dev branch? Sorry if I have botched this up, but hopefully you can repair it quickly if I have made a mistake. I am still wearing my "newbie hat" on GitHub.
Best,
Nick