cjcarlson / embarcadero

🌲🌉 Species distribution models with Bayesian additive regression trees

R 100.00%

species-distribution-modelling ecological-niche-modelling bayesian-additive-regression-trees

embarcadero's Introduction

Hi! I'm Colin. 👨‍🎤

🎓 I'm currently an Assistant Research Professor at Georgetown University. Check out what our lab works on here. Our group is moving to Yale University in summer 2024. See you there!

🦠 I'm also director of the Verena Institute, a team of researchers using artificial intelligence to predict the where, when, and how of viral emergence.

🔢 On my Github, you'll find work that uses machine learning and other kinds of data analytics to predict species distributions (like the embarcadero R package), identify the signal of climate change impacts on infectious diseases, or otherwise solve complex problems at the interface between ecology and global health.

embarcadero's Issues

predict() gives completely flat prediction at 0.5

I am having some issues with the predict function, and I'm not sure whether they are a bug or me doing something wrong.

I ran a model in a cluster, saved the result of bart.step() as an RDS file, and then opened it locally. Everything seems good, this is how the loaded object looks:

> class(calbor.sdm)
[1] "bart"
> summary(calbor.sdm)
Call:  bart all.cov[, step.model] all.cov[, "pres"] TRUE 
 
Predictor list: 
 bati chla_var logchla_lag3 sal sst sst_grad 
 
Area under the receiver-operator curve 
AUC = 0.8937647 
 
Recommended threshold (maximizes true skill statistic) 
Cutoff =  0.519118 
TSS =  0.6287537 
Resulting type I error rate:  0.16072 
Resulting type II error rate:  0.2105263

which I don't hate. [Plot of the model output not shown.]

However, when I try to predict using

CB_prediction <- embarcadero::predict2.bart(object = calbor.sdm2,   # make sure I'm getting embarcadero's predict
                         x.layers = predictors_original,
                         quantiles = c(0.025, 0.975),
                         # splitby = 20,   # doesn't work with or without this
                         quiet = FALSE)

I get a stack of rasters with all cells == 0.5, see:

> CB_prediction
class      : RasterStack 
dimensions : 266, 242, 64372, 3  (nrow, ncol, ncell, nlayers)
resolution : 0.4986111, 0.4986111  (x, y)
extent     : -64.84722, 55.81667, -52.825, 79.80556  (xmin, xmax, ymin, ymax)
crs        : NA 
names      : layer.1, layer.2, layer.3 
min values :     0.5,     0.5,     0.5 
max values :     0.5,     0.5,     0.5

[Plot of the flat prediction raster not shown.]

I've tried tweaking the options of the predict function, but nothing seemed to work.

The cluster (where I ran the model) runs R version 3.6, while my laptop (Windows x64, where I'm predicting and plotting) runs R version 4.1.1.

Thanks for the help!
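
One thing that may be worth ruling out here, offered as a hedged sketch rather than a confirmed diagnosis: the dbarts documentation notes that the sampler state of a model fitted with keeptrees = TRUE is only serialized if it is "touched" before saving. The helper below is hypothetical (save_bart is not part of embarcadero or dbarts) and is demonstrated on a stand-in list so that it runs without a fitted model.

```r
# Hypothetical helper: touch the sampler state, then serialize the model.
save_bart <- function(model, path) {
  invisible(model$fit$state)  # evaluate the state so it is stored with the object
  saveRDS(model, path)
  invisible(path)
}

# Stand-in object with the same nesting as a fitted model (assumption:
# a real model exposes its sampler as model$fit).
toy_model <- list(fit = list(state = "trees"))
rds_path <- tempfile(fileext = ".rds")
save_bart(toy_model, rds_path)
reloaded <- readRDS(rds_path)
```

With a real model, the call would be made on the cluster right after fitting, before the RDS file is copied to the laptop.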

"velox" was removed from CRAN

My initial attempt to install failed because the package velox was removed from CRAN and is only available from the archive.

velox aggregate as bigstack()

Error in varimp.diag "can't find `dropnames`"

Hi,

I just encountered an error when running the varimp.diag function on my data, so I decided to test it with the vignette and was able to reproduce the error.

When you run the line

varimp.diag(occ.df[,xnames], occ.df[,'Observed'], iter=50, quiet=TRUE)

it throws up the error

Error in select(., -dropnames) : unused argument (-dropnames)

Trying to figure out what was going on, I ran lines 31-33 from the varimp.diag code

quietly(model.0 <- bart.flex(x.data = x.data, y.data = y.data, 
                           ri.data = ri.data,
                           n.trees = 200))

It turns out that

dput(unlist(attr(model.0$fit$data@x,"drop")) )

returns

c(x1 = FALSE, x2 = FALSE, x3 = FALSE, x4 = FALSE, x5 = FALSE, x6 = FALSE, x7 = FALSE, x8 = FALSE)

which is why the next line (#35)

 dropnames <- colnames(x.data)[!(colnames(x.data) %in% names(which(unlist(attr(model.0$fit$data@x,"drop"))==FALSE)))]

doesn't assign anything to dropnames and then the function can't find the object.

Can't figure out anything beyond that, but then I tried variable.step which uses similar code, and it works! I hope this is helpful to find the issue!
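
The situation described above can be reproduced in base R without fitting a model. This is only a sketch under assumptions: the drop flags are typed in by hand from the dput() output, the toy x.data is made up, and the guarded base-R subsetting is one possible way to avoid calling select() with an empty set of names, not the package's actual fix.

```r
# Drop flags as reported by dput() in this issue (no variable was dropped).
drop_flags <- c(x1 = FALSE, x2 = FALSE, x3 = FALSE, x4 = FALSE,
                x5 = FALSE, x6 = FALSE, x7 = FALSE, x8 = FALSE)

# Toy covariate table with matching column names (made-up values).
x.data <- as.data.frame(matrix(0, nrow = 2, ncol = 8,
                               dimnames = list(NULL, names(drop_flags))))

# Same expression as line 35 of the function: names that were dropped.
dropnames <- colnames(x.data)[!(colnames(x.data) %in%
                                  names(which(drop_flags == FALSE)))]

# Guarded base-R subsetting: only remove columns when there is something to remove.
x.kept <- if (length(dropnames) > 0) {
  x.data[, !(colnames(x.data) %in% dropnames), drop = FALSE]
} else {
  x.data
}
```

Here dropnames ends up as an empty character vector, so the guard leaves all eight columns in place.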

Session info:

- Session info ------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 4.0.2 (2020-06-22)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  Spanish_Spain.1252          
 ctype    Spanish_Spain.1252          
 tz       Europe/Paris                
 date     2020-08-21

add warning partial() doesn't work with rbart objects (yet)

Running partial() on a model of class rbart fails, with error message

Error in pdbart(model, xind = x.vars, levs = lev, pl = FALSE) :
x.train must be a matrix, data.frame, formula, fitted bart model, or dbartsSampler

Can confirm that pdbart() doesn't recognize an rbart object as a fitted BART model; and the documentation for rbart_vi doesn't mention partials, so this is on some level a dbarts thing.

Categorical variables not recognized

I am starting out with the embarcadero package, as I would like to compare Bayesian SDMs with standard SDMs. However, I really need to include a categorical raster map (geology in my case) to model a plant species. The species is very likely confined to special substrates, and this would be important to consider.

However, the bart function does not accept a factor (categorical) predictor. Are there any known workarounds?
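
One generic workaround, sketched here as an assumption and not confirmed to work with embarcadero's raster-based functions, is to expand the categorical predictor into 0/1 indicator columns before modelling. The geology classes and the data frame below are made up for illustration.

```r
# Made-up occurrence table with one categorical predictor.
occ <- data.frame(
  pres    = c(1, 0, 1, 0),
  geology = factor(c("limestone", "granite", "limestone", "basalt"))
)

# One indicator column per geology class (no intercept, so no class is dropped).
dummies <- model.matrix(~ geology - 1, data = occ)

# Numeric-only covariate table that could be passed on as x.data.
x.data <- as.data.frame(dummies)
```

For prediction, each indicator would also have to exist as its own raster layer with the same name, which is an extra step not shown here.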

Spatial covariates

This is more a question than an issue, but I wanted to know how the algorithm handles spatial covariates. With INLA, for example, you would use SPDE to create a spatial smoother. Does BART have a way of incorporating that kind of spatial covariate to deal with spatial autocorrelation?

Saved RI model objects may not retain saved trees

Fitted an RI model with keepTrees = TRUE, saved it, then loaded it in a fresh R session and used it for prediction: the output is all value = 0.5 for every grid cell. Suspect, based on our chat, that trees aren't being retained in the saved R object.

redo vignette with bart.map, other new names

bart.auc doesn't handle NAs

avian cbot showed that bart.auc doesn't work even though var selection one does
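
A minimal sketch of NA-safe AUC, assuming the problem is missing values in the fitted probabilities or the observations. The function name auc_complete is hypothetical and the rank-based (Mann-Whitney) formula is swapped in for whatever bart.auc does internally.

```r
# Hypothetical helper: drop incomplete pairs, then compute AUC from ranks.
auc_complete <- function(probs, obs) {
  keep  <- complete.cases(probs, obs)   # TRUE where both values are present
  probs <- probs[keep]
  obs   <- obs[keep]
  n1 <- sum(obs == 1)                   # number of presences
  n0 <- sum(obs == 0)                   # number of absences
  r  <- rank(probs)                     # average ranks handle ties
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

probs <- c(0.9, NA, 0.2, 0.8, 0.1)
obs   <- c(1, 1, 0, 1, 0)
auc   <- auc_complete(probs, obs)
```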

Conflicted "select" function between raster and dplyr.

Hi Colin!

Just dropping a quick note about an error when running bart.step in which the "select" function is not properly defined as coming from either the raster or dplyr libraries:

library(conflicted)
sdm <- bart.step(y.data=dat[,"occurrence"], x.data=dat[,xnames], full=TRUE, quiet=TRUE)

Error: [conflicted] select found in 2 packages.
Either pick the one you want with ::

raster::select
dplyr::select
Or declare a preference with conflict_prefer()
conflict_prefer("select", "raster")
conflict_prefer("select", "dplyr")

two-model one-partial

life is hell. need to do dark blue/dark red, on the same plot

bart.flex doesn't allow prior adjusting

Can't find k even if you pass it k? Needs to be adjusted. Could be a cores issue? Unclear

cutoff value?

Super quick question! (hope that's ok)

I'm going to have to run several models and their corresponding predictions in a loop. I'd like to plot a binary map for each, but I haven't found a way to obtain a cutoff value from the object returned by the bart function. It appears in the console when running summary(model), but I haven't found it anywhere inside the model object. Any idea where to find it, or whether it's even possible? I don't want to do each one by hand, printing the summary on screen and typing the cutoff value myself. I'm sure it must be tucked away inside the huge list somewhere, but I just haven't found it.
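
Until the cutoff is exposed by the package, one option is to recompute it. The sketch below finds the threshold that maximizes the true skill statistic from fitted probabilities and observed 0/1 values in base R. The function name tss_cutoff is hypothetical, and the result may differ slightly from the value printed by summary() if that uses a different threshold convention.

```r
# Hypothetical helper: threshold maximizing TSS = sensitivity + specificity - 1.
tss_cutoff <- function(probs, obs) {
  cuts <- sort(unique(probs))           # candidate thresholds
  tss <- vapply(cuts, function(ct) {
    pred <- probs >= ct                 # predicted presence at this threshold
    sens <- sum(pred & obs == 1) / sum(obs == 1)
    spec <- sum(!pred & obs == 0) / sum(obs == 0)
    sens + spec - 1
  }, numeric(1))
  best <- which.max(tss)                # first maximum if there are ties
  c(cutoff = cuts[best], tss = tss[best])
}

# Toy data; for a real model the probabilities would come from the fitted
# object (assumption: e.g. colMeans(pnorm(model$yhat.train)) for a dbarts fit).
probs <- c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9)
obs   <- c(0, 0, 0, 1, 1, 1)
res   <- tss_cutoff(probs, obs)
```

Inside a loop over models, res[["cutoff"]] can then be used to binarize each prediction raster.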

Conservative partial plot extent

Create option in partial() to limit xlim to 100% or 90% range of training data
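
A possible shape for that option, as a sketch only: compute the x-axis limits from the central quantiles of the training values. The function name partial_xlim and the coverage argument are hypothetical and not part of partial().

```r
# Hypothetical helper: limits covering the central `coverage` share of the data.
partial_xlim <- function(x, coverage = 0.9) {
  stopifnot(coverage > 0, coverage <= 1)
  tail_prob <- (1 - coverage) / 2       # probability mass cut from each tail
  unname(quantile(x, probs = c(tail_prob, 1 - tail_prob), na.rm = TRUE))
}

train_x  <- 0:100                       # made-up training values
lim_full <- partial_xlim(train_x, coverage = 1)    # 100% range
lim_90   <- partial_xlim(train_x, coverage = 0.9)  # central 90% range
```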

Allow data frame y.data instead of vector

Responding to closed pull request from @DidDrog11 - would be nice to allow y.data in non-vector format. Maybe worth a broader think about how other classes of data are handled e.g., tibbles.
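
One way this could look, sketched with a hypothetical helper (as_response_vector is not an existing embarcadero function): accept either a vector or a one-column data frame and always hand a plain vector to the model. Because it uses [[1]], the same code path should also cover tibbles.

```r
# Hypothetical helper: coerce y.data to a plain vector.
as_response_vector <- function(y.data) {
  if (is.data.frame(y.data)) {
    if (ncol(y.data) != 1) stop("y.data must have exactly one column")
    y.data <- y.data[[1]]               # extract the single column
  }
  as.vector(y.data)
}

y_df  <- data.frame(pres = c(1, 0, 1))
y_vec <- as_response_vector(y_df)
```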

open thread: small vignette

known issues

NLMR is not on CRAN
neither is embarcadero

velox dependency

Is there a way to install/use embarcadero without needing the velox dependency? I can't seem to install it with either method suggested in the readme; I get errors saying "Rterm.exe - Entry Point Not Found". Looking at the issues in the velox GitHub repository, it looks like it isn't maintained and hasn't been updated for R v4.X.

I tried a fresh install of R (v4.0.2) and had a different error but still can't install velox. The error looks similar to one you mention in your possible Mac fixes but I have no idea how to translate that into a fix for Windows.

Error: package or namespace load failed for 'velox' in .doLoadActions(where, attach):
 error in load action .__A__.1 for package velox: Rcpp::loadModule(module = "BOOSTGEOM", what = TRUE, env = ns, : Unable to load module "BOOSTGEOM": cannot allocate vector of length 1739883848

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] ps_1.3.4          fansi_0.4.1       prettyunits_1.1.1 rprojroot_1.3-2   withr_2.2.0      
 [6] crayon_1.3.4      assertthat_0.2.1  R6_2.4.1          backports_1.1.8   cli_2.0.2        
[11] curl_4.3          remotes_2.1.1     rstudioapi_0.11   callr_3.4.3       tools_4.0.2      
[16] glue_1.4.1        yaml_2.2.1        compiler_4.0.2    processx_3.4.3    pkgbuild_1.0.8

Error from `bart.step()` related to `select()` call

bart.step() throws an error

unable to find an inherited method for function ‘select’ for signature ‘"data.frame"’

I think this is due to the call to select() in the function varimp.diag() (line 29 in my copy-paste of the code from the terminal), which doesn't specify the dplyr namespace.

Required Number of Covariates

Hiya,

just partook in a workshop organised by the IBS and am pretty stoked on your package here. Wanting to run an example as simple as possible during the workshop, I opted for a model with only two environmental covariates to my binary outcome of species-presence/absence.

Unfortunately, that caused a few issues, in particular with the variable.step function, which seems to accommodate only models with three or more covariates. I propose adding a check to this function that alerts users to this fact.

Cheers,
Erik
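
The proposed check could be as small as the sketch below. The function name check_min_covariates is hypothetical, and the default of three covariates is taken from the behaviour described in this issue, not from the package source.

```r
# Hypothetical input check for variable selection functions.
check_min_covariates <- function(x.data, min.vars = 3) {
  n.vars <- ncol(as.data.frame(x.data))
  if (n.vars < min.vars) {
    stop("variable selection needs at least ", min.vars,
         " covariates, but x.data has ", n.vars, call. = FALSE)
  }
  invisible(TRUE)
}

# A two-covariate table triggers the informative error.
two_vars <- data.frame(a = 1:3, b = 4:6)
msg <- tryCatch(check_min_covariates(two_vars),
                error = function(e) conditionMessage(e))
```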

don't autoplot predict.dbart.whatever - should be plots=FALSE default option

pnorm the partials

categorical variables mishandled by `varimp`, `variable.step`, `bart.step`

When the predictors include categorical variables, dbarts::bart includes them but embarcadero removes them. This appears to be because embarcadero removes variables based on unlist(attr(model$fit$data@x, "drop")), where the categorical variables are actually split and renamed to reflect their categories. This leads to an error in varimp (which fails for models that include categorical predictors) and to categorical variables being automatically excluded a priori by variable.step and bart.step, with a message unfairly blaming dbarts. Here is some reproducible code:

# generate some data as in ?bart examples:

f <- function(x) {
  10 * sin(pi * x[,1] * x[,2]) + 20 * (x[,3] - 0.5)^2 +
    10 * x[,4] + 5 * x[,5]
}

set.seed(99)
sigma <- 1.0
n     <- 100

x  <- matrix(runif(n * 10), n, 10)
Ey <- f(x)
y  <- rnorm(n, Ey, sigma)


# make 'y' binary:
y <- ifelse(y > mean(y), 1, 0)

# make one of the x variables categorical:
x <- data.frame(x)
x[,1] <- ifelse(x[,1] > mean(x[,1]), "high", "low")
head(x)


# fit a bart model:
set.seed(99)
bartFit <- bart(x, y, keeptrees = TRUE)

summary(bartFit)  # notice 10 variables (i.e. including the categorical one) in predictor list

bartFit$fit$data
unlist(attr(bartFit$fit$data@x, "drop"))  # notice X1 (categorical variable) named here as X11 and X12 (one for each category)
# X11 X12  X2  X3  X4  X5  X6  X7  X8  X9 X10 
#  52  48   0   0   0   0   0   0   0   0   0 

# attempt to compute variable importance with 'embarcadero':
varimp(bartFit)  # Error in data.frame(names, varimps) : arguments imply differing number of rows: 9, 10

# but the variable importance info is there, including for the categorical variable (though it's also renamed here):
rel_imp <- bartFit$varcount / rowSums(bartFit$varcount)
colnames(rel_imp)
# [1] "X1.low"   "X2"     "X3"     "X4"     "X5"     "X6"     "X7"     "X8"     "X9"     "X10"

# attempt to simplify the model with 'embarcadero':
variable.step(x, y)  # X1 (categorical variable) said to be dropped by 'dbarts', but it wasn't really -- it was dropped by 'embarcadero' when expecting unlist(attr(bartFit$fit$data@x, "drop")) to have the original variables' names
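
One possible direction for a fix, sketched under assumptions: map each expanded name back to the original variable it came from before deciding what was dropped. The helper map_to_original is hypothetical, and prefix matching is a heuristic that could be ambiguous for other naming schemes.

```r
# Hypothetical helper: match expanded dummy names back to original variables.
map_to_original <- function(expanded, original) {
  vapply(expanded, function(nm) {
    if (nm %in% original) return(nm)    # unchanged (numeric) variable
    # original names that are a prefix of the expanded name
    hits <- original[substring(nm, 1, nchar(original)) == original]
    if (length(hits) == 0) return(NA_character_)
    hits[which.max(nchar(hits))]        # prefer the longest matching prefix
  }, character(1), USE.NAMES = FALSE)
}

expanded <- c("X11", "X12", "X2", "X10")  # names as in the "drop" attribute above
original <- paste0("X", 1:10)             # column names of the input data frame
mapped   <- map_to_original(expanded, original)
```

In this example X11 and X12 both map back to X1, while X10 is matched exactly and is not mistaken for a level of X1.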

Maybe rotate text on variable plots

R version 4.0.5

I've just downloaded R v4.0.5 and attempted to install embarcadero, but I get a warning message saying that the package isn't available for this version of R. Is this a known issue?

Cheers

Harry

velox library

predict to data frame rather than raster stack

Thanks for the great package and vignette. I'm a bit puzzled that input data (including species occurrences and predictor variables) for bart are a data frame (occ.df), but then for predict you need a RasterStack of the predictor variables. Any chance of allowing 'climate' to be a data frame just like occ.df?

what if we did spatial projections of partial dependence plots

find a way to retain variable names

tmean being wrong for plague probably - then, issue gonna be a Thing

add spartials to vignette

Vignettes need to use data() instead of Github paths

haha

Dropped variable problem

Sometimes, 'dbarts' entirely drops a variable from the model without even including it in the variable splits with "0" splits - both can happen in the same model, if dimensionality is high. This causes issues with varimp() and then some downstream issues.

I started writing a solution for varimp() but it doesn't work yet, and a lot of things are broken downstream because of it.

plot.mcmc does not work with masked raster layers

The plot.mcmc() function works fine for the example provided, but if the raster input dataset is not a simulated square raster layer but a masked raster layer (e.g. South America), the output of the plot.mcmc() function is meaningless.
It probably has something to do with the matrix conversion at the beginning of the function.
Here is a reproducible example of the problem:

library(dismo)
library(dplyr)        # %>%, select(), bind_rows()
library(embarcadero)  # bart(), plot.mcmc()

file <- paste0(system.file(package="dismo"), "/ex/bradypus.csv")
file
bradypus <- read.table(file,  header=TRUE,  sep=",") %>% dplyr::select(-species)
bradypus$presence <- 1

files <- list.files(path=paste(system.file(package="dismo"), '/ex',
                               sep=''),  pattern='grd',  full.names=TRUE )
mask <- raster(files[1])
set.seed(1963)
bg <- randomPoints(mask, 500)
bg <- as.data.frame(bg)
colnames(bg) <- c("lon", "lat")
bg$presence <- 0

abspres <- bind_rows(bradypus, bg)

path <- file.path(system.file(package="dismo"), 'ex')

files <- list.files(path, pattern='grd$', full.names=TRUE )
files
predictors <- stack(files)
predictors

xnames <- names(predictors)

plot(predictors[[1]])
occ <- SpatialPoints(abspres[,c('lon','lat')])
occ.df <- cbind(abspres,
                raster::extract(predictors, occ))
occ.df <- occ.df[,-c(1:2)]  # drop lon and lat; keep presence and the predictors

### The actual example 

sdm <- bart(y.train=occ.df[,'presence'],
            x.train=occ.df[,xnames],
            keeptrees = TRUE)

plot.mcmc(sdm, predictors, iter=50)