Summary

The R package SDMSelect supports species distribution modelling (SDM). It uses forward model selection with cross-validation to select the covariates and model specifications that give the best predictions. The selection procedure itself is not specific to SDM. Results can then be used to map the predicted species distribution and the uncertainty of prediction.

Distribution of probability of presence with SDMSelect library on Rstat

How to cite

Please give credit where credit is due and cite R and R packages when you use them for data analysis. To cite SDMSelect properly, call the R built-in command citation("SDMSelect"):

Sébastien Rochette. (2019, June 25). statnmap/SDMSelect: SDMSelect: A R-package for cross-validation model selection and species distribution mapping (Version v0.1.5). Zenodo. http://doi.org/10.5281/zenodo.3256536

Download and Install

Before installing SDMSelect, be sure to have updated versions of mgcv (>= 1.8-19), dplyr (>= 0.7) and corrplot (>= 0.82).
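The version requirements above can be checked before installation. The following is a minimal sketch in base R; the helper name `check_min_versions` is hypothetical and not part of SDMSelect.

```r
# Minimum versions quoted in this README
required <- c(mgcv = "1.8-19", dplyr = "0.7", corrplot = "0.82")

# Hypothetical helper: compare installed versions with the required minimum
check_min_versions <- function(required) {
  installed <- vapply(names(required), function(pkg) {
    # system.file() returns "" when the package is not installed
    if (nzchar(system.file(package = pkg))) {
      as.character(utils::packageVersion(pkg))
    } else {
      NA_character_
    }
  }, character(1))
  ok <- mapply(function(have, need) {
    !is.na(have) && package_version(have) >= package_version(need)
  }, installed, required)
  data.frame(
    package = names(required),
    required = unname(required),
    installed = unname(installed),
    ok = unname(ok),
    stringsAsFactors = FALSE
  )
}

status <- check_min_versions(required)
print(status)

# Packages to install or update before installing SDMSelect
to_update <- status$package[!status$ok]
```

Any package listed in `to_update` can then be installed or updated with `install.packages()`.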

To download the development version of the SDMSelect package, type the following at the R command line:

install.packages("devtools")  
devtools::install_github("statnmap/SDMSelect")

Build with vignettes:

devtools::install_github("statnmap/SDMSelect", build_opts = c("--no-resave-data", "--no-manual"))

To be able to test the vignettes, you first need to install knitr, rmarkdown, dismo and rasterVis. Vignettes are available on the {pkgdown} website for covariate selection and for species distribution model selection.

Note that spatial libraries like rgdal and sp may require additional software to be installed on your computer if you work on Mac or Linux. Look up how to install proj4, geos and gdal on your system.

Issues

namespace 'mgcv' 1.xx is being loaded, but >= 1.8.19 is required. This means that you need to update the mgcv library: install.packages("mgcv", force = TRUE).
Error when building vignette: win.asp is not graphical parameter. This means you did not install the latest version of corrplot: install.packages("corrplot", force = TRUE).
there is no package called 'dismo' ... cannot create a RasterLayer object from this file (file does not exist). This means you need to install the dismo library: install.packages("dismo").

Examples

Vignettes have been created to show how to use the library for covariate selection on simple cases (vignette(package = "SDMSelect")).

The first case is the covariate selection procedure for a classical dataset (neither geographical data nor species distribution data): vignette("Covar_Selection", package = "SDMSelect"). See the vignette on the {pkgdown} website: https://statnmap.github.io/SDMSelect/articles/Covar_Selection.html
The second case is for spatial data of species occurrence, to produce predicted species distribution maps and maps of uncertainties: vignette("SDM_Selection", package = "SDMSelect"). See the vignette on the {pkgdown} website: https://statnmap.github.io/SDMSelect/articles/SDM_Selection.html

Note that most figures of the vignettes are saved in “inst” so that model selection is not run during vignette building. However, the code in the vignettes can be run on your own computer and should return the same outputs. You can find the path of the complete vignettes to be run on your computer with: system.file("Covar_Selection", "Covar_Selection.Rmd", package = "SDMSelect") and system.file("SDM_Selection", "SDM_Selection.Rmd", package = "SDMSelect"). Open the file and click the knit button if you are on RStudio. This may require 5-10 minutes depending on the number of cores of your system.
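Outside RStudio, the complete vignettes can also be knitted from the console. This is a minimal sketch, assuming that rmarkdown is installed and that the installed package folder is writable for intermediate files; the wrapper name `render_full_vignette` is hypothetical, only `system.file()` and `rmarkdown::render()` are existing functions.

```r
# Hypothetical wrapper: locate a complete vignette and knit it
render_full_vignette <- function(name = c("Covar_Selection", "SDM_Selection"),
                                 output_dir = tempdir()) {
  name <- match.arg(name)
  # Same path as given in this README; "" if SDMSelect is not installed
  rmd <- system.file(name, paste0(name, ".Rmd"), package = "SDMSelect")
  if (!nzchar(rmd)) {
    message("Vignette source not found: is SDMSelect installed?")
    return(invisible(NULL))
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("Package 'rmarkdown' is needed to knit the vignette.")
  }
  # Knitting runs the full model selection and may take several minutes
  rmarkdown::render(rmd, output_dir = output_dir)
}
```

Calling `render_full_vignette("Covar_Selection")` would then write the knitted document to a temporary folder.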
Main functions are listed in the library general help: ?SDMSelect.

Description

I decided to turn into an R package some R scripts I have been using for years in my studies on species distribution modelling. I know it could be improved in many ways, but it works as is.
This library was originally created for covariate selection to predict species distribution (biomass, density or presence/absence). Its final aim is thus to produce maps of predicted distributions (see vignette SDM_Selection). However, the core of the library is an N-times k-fold cross-validation selection procedure that can be applied to any kind of data, provided that model parameters are well defined (see vignette Covar_Selection).

This works with data frames, SpatialPointsDataFrame objects and tibbles.
It has been designed to run in parallel on multi-core computers, using the parallel library. The only steps missing in this library are data exploration and data cleaning. These are important parts of modelling and should be carried out prior to the model selection procedure.

Selection method

The model selection procedure tests different combinations of covariates with LM, GLM, GLM with natural splines and GAM models, with different distributions (Gaussian, Gamma, Log-Normal, Tweedie; Binomial) and with different maximum degrees of freedom for GLM with polynomials or natural splines. The multiple k-fold cross-validation is carried out on the same folds for each model/submodel type, which allows their comparison using cross-validated RMSE or AUC. Correlation between covariates can be tested prior to the selection procedure to avoid fitting models with correlated covariates. However, the cross-validation procedure is coupled with a forward stepwise procedure. This means that covariates are added to the best models selected at the previous step. A new covariate that adds no predictive power to a model is therefore not chosen, which avoids selecting correlated covariates in the same model.
Models are ordered according to RMSE (or AUC for presence-absence data). The cross-validation produces a distribution of N*k RMSE values for each model at each step, and models are ordered according to their mean RMSE. The RMSE distributions of models 2 to n are then statistically compared to that of the best model. Models not statistically worse than the best model are retained for the next step. After the stepwise procedure, the best model among all is selected with the same method. Models not statistically worse are also retained.
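The forward stepwise idea can be illustrated with a minimal sketch in base R. It is not the SDMSelect implementation: it uses plain linear models on mtcars, a single k-fold split, and keeps only the single best model at each step, whereas the package keeps every model that is not statistically worse. All object names are chosen for illustration.

```r
set.seed(123)
data <- mtcars
response <- "mpg"
covariates <- c("wt", "hp", "qsec", "drat", "disp")
k <- 4

# One fold assignment, reused for every candidate model
fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))

# Mean cross-validated RMSE of a linear model using the covariates 'covs'
cv_rmse <- function(covs) {
  form <- reformulate(covs, response = response)
  mean(vapply(seq_len(k), function(j) {
    fit <- lm(form, data = data[fold_id != j, ])
    pred <- predict(fit, newdata = data[fold_id == j, ])
    sqrt(mean((data[fold_id == j, response] - pred)^2))
  }, numeric(1)))
}

selected <- character(0)
best_rmse <- Inf
repeat {
  remaining <- setdiff(covariates, selected)
  if (length(remaining) == 0) break
  # Try adding each remaining covariate to the current best model
  scores <- vapply(remaining, function(v) cv_rmse(c(selected, v)), numeric(1))
  # Stop when no covariate improves the cross-validated RMSE
  if (min(scores) >= best_rmse) break
  selected <- c(selected, names(which.min(scores)))
  best_rmse <- min(scores)
}

print(selected)
print(best_rmse)
```

A covariate that is strongly correlated with one already in the model brings little extra predictive power, so it tends not to lower the cross-validated RMSE and is not added.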
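The comparison of RMSE distributions can be sketched as follows in base R. This is an illustration under stated assumptions, not the package code: the candidate models are simple linear models on mtcars, and the paired Wilcoxon test with a 5% level is an assumption, since this README does not name the statistical test used.

```r
set.seed(42)
n_rep <- 5  # N repetitions
k <- 4      # k folds
data <- mtcars

# The same folds are reused by every model so that RMSE values are paired
folds <- lapply(seq_len(n_rep), function(i) {
  sample(rep(seq_len(k), length.out = nrow(data)))
})

candidates <- list(wt = mpg ~ wt, hp = mpg ~ hp, qsec = mpg ~ qsec)

# Returns the N*k cross-validated RMSE values of one model
cv_rmse <- function(formula, data, folds, k) {
  y <- all.vars(formula)[1]
  unlist(lapply(folds, function(fold_id) {
    vapply(seq_len(k), function(j) {
      fit <- lm(formula, data = data[fold_id != j, ])
      pred <- predict(fit, newdata = data[fold_id == j, ])
      sqrt(mean((data[fold_id == j, y] - pred)^2))
    }, numeric(1))
  }))
}

# Matrix with N*k rows and one column per model
rmse <- sapply(candidates, cv_rmse, data = data, folds = folds, k = k)

# Order models by mean RMSE and compare each one to the best
mean_rmse <- sort(colMeans(rmse))
best <- names(mean_rmse)[1]
others <- setdiff(names(mean_rmse), best)
p_values <- vapply(others, function(m) {
  wilcox.test(rmse[, m], rmse[, best], paired = TRUE,
              alternative = "greater")$p.value
}, numeric(1))

# Models not statistically worse than the best one are retained
retained <- c(best, others[p_values > 0.05])
print(mean_rmse)
print(retained)
```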

Models and covariates in models not statistically different from the best one. Covariates are ordered based on occurrence in the models selected.

Outputs

Outputs of the selection procedure are numerous, allowing for a summary of the model selection and the comparison of the different models all together (GAM, GLM; Gaussian, Gamma, Log-Normal, Tweedie; Binomial). All output files and figures are stored in a common folder and do not show up in the R session directly. The final model retained by the user can then be analysed (residual analysis, variance analysis, effect of covariates). For spatial data, maps of prediction can be produced.

Uncertainties

Particular attention has been given to the assessment of uncertainty. Each prediction of a model is given with an associated standard error. These standard errors are used to estimate possible minimum and maximum distributions of species (through the estimation of quantiles).
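As an illustration of how quantiles can be derived from standard errors, here is a minimal sketch in base R. It assumes a binomial GLM fitted on simulated data and a normal approximation on the link scale; the data, the covariate name `temp` and the approach are assumptions for illustration and may differ from what SDMSelect does internally.

```r
set.seed(1)
n <- 200
temp <- runif(n, 5, 25)  # simulated covariate
presence <- rbinom(n, 1, plogis(-4 + 0.3 * temp))
fit <- glm(presence ~ temp, family = binomial)

# Predictions and standard errors on the link (logit) scale
new <- data.frame(temp = seq(5, 25, by = 5))
pr <- predict(fit, newdata = new, type = "link", se.fit = TRUE)

# Mean prediction and 5% / 95% quantiles, back-transformed to probabilities
new$p_mean <- plogis(pr$fit)
new$p_q05 <- plogis(pr$fit + qnorm(0.05) * pr$se.fit)
new$p_q95 <- plogis(pr$fit + qnorm(0.95) * pr$se.fit)
print(new)
```

Mapping `p_q05` and `p_q95` over a prediction grid gives the kind of minimum and maximum distribution maps described above.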

Distribution of minimum (quantile 5%) and maximum (quantile 95%) of probabilities of presence with SDMSelect library on Rstat

For presence-absence data, estimating the probability of presence is not enough. The balance between presences and absences in the data may lead to biased predictions. The best threshold value to classify a probability of presence as presence or absence is therefore explored in more detail. A map of the probability of being over the threshold value is calculated.
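A minimal sketch of the two ideas in base R, on simulated data. The threshold criterion used here (maximising sensitivity + specificity - 1) and the normal approximation on the link scale are assumptions for illustration; this README does not state which criterion or approximation the package uses.

```r
set.seed(1)
n <- 200
temp <- runif(n, 5, 25)  # simulated covariate
presence <- rbinom(n, 1, plogis(-4 + 0.3 * temp))
fit <- glm(presence ~ temp, family = binomial)
p_hat <- fitted(fit)

# Explore candidate thresholds and keep the one maximising
# sensitivity + specificity - 1 (assumed criterion)
thresholds <- seq(0.05, 0.95, by = 0.05)
score <- vapply(thresholds, function(thd) {
  pred <- p_hat >= thd
  sens <- sum(pred & presence == 1) / sum(presence == 1)
  spec <- sum(!pred & presence == 0) / sum(presence == 0)
  sens + spec - 1
}, numeric(1))
best_thd <- thresholds[which.max(score)]

# Probability that the true probability of presence exceeds the threshold,
# using the standard error of the prediction on the logit scale
pr <- predict(fit, type = "link", se.fit = TRUE)
p_over <- 1 - pnorm((qlogis(best_thd) - pr$fit) / pr$se.fit)

print(best_thd)
summary(p_over)
```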

Comparison of predictions against observations in a presence-absence data model and thresholds values with SDMSelect library on Rstat

Distribution of probabilities to be over the best threshold value separating presences from absences with SDMSelect library on Rstat

Caution

This library relies on a lot of other R packages, which means that any modification of those may prevent this library from working correctly. As I continue to use it regularly, I should notice when an update breaks some of my functions and will try to fix them.
Recent libraries like dplyr, tidyr and ggplot2 did not exist when I first created these R scripts. I now try to integrate them from time to time.

Remark

This library is not as complete as caret (library(caret)) in terms of model types available. However, as far as I know, the caret model selection procedure is designed for a single combination of model/distribution. Here, the cross-validation outputs of each model/distribution are kept and can then be compared all together with paired comparisons. Not only the model with the best mean RMSE (or AUC) is retained, but also all other models giving predictions statistically as good as the best one. For biological purposes, knowing all models with the same predictive power may change the interpretation of outputs. Another difference is that the SDMSelect package is designed for species distribution models, which means that some outputs are maps of species distributions as well as all maps of uncertainty that can be deduced from model outputs.

Collaboration

For your work on covariate selection and species distribution modelling, you can contact me. The vignettes will give you a good starting point. I’ll be happy to participate in scientific collaborations based on this R package. More information on my website https://statnmap.com.

If you want to participate in improving this library, please have a look at my todo list (in Vignettes directory or in the pkgdown site) and feel free to clone, modify and provide pull requests. Please run the two Rmd files in inst folder before submitting a pull request. These are the complete versions of the two vignettes.

See the full documentation, built with {pkgdown}, at https://statnmap.github.io/SDMSelect/

License

This package is free and open source software, licensed under GPL.

Bug with `Param_corr()`

I have come across an error when running the Param_corr function. The error appears with my own data but also with the example data used in the vignettes (mtcars and bradypus), even when the data preparation is followed thoroughly. The error persists after I force updates on all the package dependencies. It is to this day the only function that doesn’t work, which prevents me from going further in the use of the package.

Here is a simple version of the code that leads to this error (taken from the covariate selection vignette):

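library(SDMSelect)

# Temporary directory where the outputs are saved
tmpdir <- file.path(tempdir(), "out_CovarSelection")
dir.create(tmpdir, showWarnings = FALSE)

# Turn columns 8 to 11 (vs, am, gear, carb) into character covariates
data <- dplyr::mutate_at(mtcars, 8:11, as.character)

# Column 1 (mpg) is the response, all other columns are covariates
data.new <- Prepare_dataset(
    x = data, var = 1, cov = 2:ncol(data),
    datatype = "Cont", na.rm = TRUE
)

# Spearman correlation between covariates: this is the call that fails
corSpearman <- Param_corr(
    x = data.new, rm = 1, thd = 0.7, visual = FALSE,
    plot = TRUE, saveWD = tmpdir, img.size = 5)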

The error message is the following:

Error: Can't subset columns that don't exist.
x Column `rowname` doesn't exist.

Running traceback() takes me back to this particular call: tidyr:::gather.data.frame(., key = "Var2", value = "Corr", -rowname),
which seems to require a rowname column that does not exist.

For more information, here is a quick look at data.new, which seems fine after being prepared:

 head(data.new)
                  dataY cyl disp  hp drat    wt  qsec factor_vs factor_am factor_gear factor_carb
Mazda RX4          21.0   6  160 110 3.90 2.620 16.46         0         1           4           4
Mazda RX4 Wag      21.0   6  160 110 3.90 2.875 17.02         0         1           4           4
Datsun 710         22.8   4  108  93 3.85 2.320 18.61         1         1           4           1
Hornet 4 Drive     21.4   6  258 110 3.08 3.215 19.44         1         0           3           1
Hornet Sportabout  18.7   8  360 175 3.15 3.440 17.02         0         0           3           2
Valiant            18.1   6  225 105 2.76 3.460 20.22         1         0           3           1

And the column types:

str(data.new)
'data.frame':	32 obs. of  11 variables:
 $ dataY      : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl        : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp       : num  160 160 108 258 360 ...
 $ hp         : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat       : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt         : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec       : num  16.5 17 18.6 19.4 17 ...
 $ factor_vs  : chr  "0" "0" "1" "1" ...
 $ factor_am  : chr  "1" "1" "1" "0" ...
 $ factor_gear: chr  "4" "4" "4" "3" ...
 $ factor_carb: chr  "4" "4" "1" "1" ...

I am running the latest version of R on macOS Mojave 10.14.6.
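While the error is not resolved, the correlation between numeric covariates can still be inspected outside the package. The following is a minimal sketch in base R, not the code of Param_corr: it computes Spearman correlations on the numeric columns of mtcars and reshapes them to long format from the matrix dimnames, so no `rowname` column is needed. The 0.7 threshold mirrors the `thd` value used above; the column names Var1, Var2 and Corr are chosen for illustration.

```r
# Numeric covariates of mtcars (response mpg excluded)
num <- mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec")]

# Spearman correlation matrix
cor_mat <- cor(num, method = "spearman")

# Long format built from the upper triangle of the matrix
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_long <- data.frame(
  Var1 = rownames(cor_mat)[idx[, "row"]],
  Var2 = colnames(cor_mat)[idx[, "col"]],
  Corr = cor_mat[idx],
  stringsAsFactors = FALSE
)

# Pairs of covariates whose absolute correlation exceeds the threshold
thd <- 0.7
high <- cor_long[abs(cor_long$Corr) > thd, ]
print(high[order(-abs(high$Corr)), ])
```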
