benjamin-allevius / scanstatistics
An R package for space-time anomaly detection using scan statistics.
License: GNU General Public License v3.0
I've been digging into this package, and I noticed that you're using the formula `sum((y - m) / w) / sum(m / w)` to calculate "hotspot" scores for the `scan_eb_negbin` function, but Tango et al. (2011) use the formula `sum((y - m) / w) / sqrt(sum(m / w))`. Was there a deliberate reason for this change, or is this a bug?
Thanks,
-Paul
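To make the difference between the two expressions concrete, here is a minimal R sketch. The vectors `y`, `m` and `w` are made-up illustration values, and `score_ratio` and `score_sqrt` are hypothetical helper names; this is not the package's implementation, only the two formulas exactly as quoted in the question.

```r
# Hypothetical illustration values (not taken from the package or the paper):
# observed counts, expected counts, and per-observation weights.
y <- c(12, 7, 30, 4)
m <- c(10, 8, 20, 5)
w <- c(1.5, 1.2, 2.0, 1.1)

# Expression the question attributes to the package.
score_ratio <- function(y, m, w) sum((y - m) / w) / sum(m / w)

# Expression the question attributes to Tango et al. (2011).
score_sqrt <- function(y, m, w) sum((y - m) / w) / sqrt(sum(m / w))

s_ratio <- score_ratio(y, m, w)
s_sqrt  <- score_sqrt(y, m, w)

# The two scores differ by exactly a factor of sqrt(sum(m / w)).
all.equal(s_sqrt, s_ratio * sqrt(sum(m / w)))
```

Because that factor depends on the baselines and weights of the zone being scored, the two expressions are not just rescaled versions of each other and can rank zones differently.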
Hi Benjamin,
I was trying to use both of these functions from the scanstatistics package in a Jupyter R notebook. scan_eb_poisson crashed the kernel (small sample based on simulated data and NMgeo), and scan_permutations ran for a very long time. Any ideas? My machine has 16 GB RAM and a 4-core CPU, running Windows 10.
Warning message in seq_len(nrow(x)):
"first element used of 'length.out' argument"
Error in seq_len(nrow(x)): argument must be coercible to non-negative integer
Traceback:
Dear Ben,
I'm pretty new to scanstatistics. When I was trying to run functions in scan statistics, I got this error:
error: Mat::elem(): index out of bounds
Error in (function (counts, baselines, zones, zone_lengths, store_everything, :
Mat::elem(): index out of bounds
Could you please help me out?
Thanks.
Best,
Lusi
Hi Benjamin,
I used scan_pb_poisson to run a space-time analysis on my dataset, and I found that its results and the results given by the SaTScan software were quite different.
My dataset contains daily disease counts from 2020/12/31 to 2021/4/14 for 10 locations, each with latitude and longitude.
For SaTScan, here are the settings:
[Input]
Time precision : Day
Coordinates : Lat/Long
[Analysis]
Type of Analysis : Space-Time
Probability Model : Poisson
Scan For Area With : High rates
And for scan_pb_poisson, here is my code:
`library(scanstatistics)  # df_to_matrix, dist_to_knn, knn_zones, scan_pb_poisson, top_clusters, get_zone
library(dplyr)           # %>% and select
library(sp)              # spDists

# Reshape the long-format data into time x location matrices
counts = SZ_counts %>%
  df_to_matrix(time_col = "time", location_col = "region", value_col = "count")
population = SZ_counts %>%
  df_to_matrix(time_col = "time", location_col = "region", value_col = "population")

# Build zones from the 4 nearest neighbours of each location
zones = SZ_geo %>%
  select(long, lat) %>%
  as.matrix %>%
  spDists(x = ., y = ., longlat = TRUE) %>%
  dist_to_knn(k = 4) %>%
  knn_zones
regions = as.character(SZ_geo$region)

result = data.frame()
newcounts = counts
newpopulation = population
poisson_result = scan_pb_poisson(counts = newcounts,
                                 zones = zones,
                                 population = newpopulation,
                                 n_mcsim = 999)
topclusters = top_clusters(poisson_result, zones, k = 10, overlapping = FALSE)

# Translate zone numbers into comma-separated region names
top_regions = topclusters$zone %>%
  purrr::map(get_zone, zones = zones) %>%
  purrr::map(function(x) regions[x])
new_top_regions = c()
for (j in seq_along(top_regions)) {
  new_top_regions[j] = paste(top_regions[[j]], collapse = ',')
}
topclusters$zonename = new_top_regions
topclusters$endtime = rownames(population)[53]
result = rbind(result, topclusters)`
For the same dataset, SaTScan gave the following results:
1.Location IDs included.: 5
Coordinates / radius..: (22.726017 N, 114.254455 E) / 0 km
Time frame............: 2021/2/21 to 2021/4/13
Population............: 2508600
Number of cases.......: 198
Expected cases........: 43.90
Annual cases / 100000.: 55.4
Observed / expected...: 4.51
Relative risk.........: 6.85
Log likelihood ratio..: 174.120739
P-value...............: < 0.00000000000000001
2.Location IDs included.: 4, 6, 1
Coordinates / radius..: (22.754466 N, 113.942560 E) / 22.26 km
Time frame............: 2021/2/10 to 2021/4/2
Population............: 6369300
Number of cases.......: 22
Expected cases........: 111.46
Annual cases / 100000.: 2.4
Observed / expected...: 0.20
Relative risk.........: 0.16
Log likelihood ratio..: 63.471627
P-value...............: < 0.00000000000000001
3.Location IDs included.: 3, 7, 8
Coordinates / radius..: (22.528466 N, 114.061547 E) / 12.88 km
Time frame............: 2021/1/1 to 2021/2/20
Population............: 4265300
Number of cases.......: 14
Expected cases........: 73.21
Annual cases / 100000.: 2.4
Observed / expected...: 0.19
Relative risk.........: 0.17
Log likelihood ratio..: 40.022434
P-value...............: 0.000000000000092
While scan_pb_poisson gave the following results:
zone duration       score relrisk_in relrisk_out Gumbel_pvalue zonename   endtime
  15      104 392.4982441   4.248194   0.2996086     0.0000000        5 2021/2/28
  13      104 329.5112571   3.428604   0.2993484     0.0000000      4,5 2021/2/28
The 2 zones given by scan_pb_poisson were totally different from the 3 clusters given by SaTScan. Why is that?
In addition, SaTScan gave only one relative risk, but scan_pb_poisson gives two: relrisk_in and relrisk_out. How can I match these results?
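As a rough aid for comparing the two outputs, the R sketch below redoes the arithmetic on the numbers quoted above. It assumes SaTScan's relative risk is the observed/expected ratio inside the cluster divided by the same ratio outside it, and it assumes relrisk_in and relrisk_out are each measured relative to the baseline, so that their ratio is the nearest counterpart to SaTScan's single figure. Both are assumptions about the tools, and `satscan_rr` and its `total_cases` argument are hypothetical names; the dataset's total case count is not given in the post.

```r
# Observed and expected cases for the three SaTScan clusters quoted above.
observed <- c(198, 22, 14)
expected <- c(43.90, 111.46, 73.21)

# Reproduces the "Observed / expected" lines of the SaTScan output
# (4.51, 0.20, 0.19).
obs_over_exp <- observed / expected
round(obs_over_exp, 2)

# Assumed SaTScan-style relative risk: risk inside the cluster divided by
# risk outside it. total_cases is the number of cases in the whole dataset.
satscan_rr <- function(cases_in, expected_in, total_cases) {
  (cases_in / expected_in) /
    ((total_cases - cases_in) / (total_cases - expected_in))
}

# Inside-versus-outside ratio built from the first scan_pb_poisson row above
# (assumes both columns are relative to the baseline).
relrisk_ratio <- 4.248194 / 0.2996086
relrisk_ratio
```

Calling `satscan_rr(198, 43.90, total_cases)` with the real total would show whether the assumed formula reproduces the reported relative risk of 6.85 for the first cluster.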
Hi Benjamin,
I could not find a function named mscan_fss() in your R package. Could you please check? Thanks!
Is there some reason why these routines don't work for results of scan_bayes_negbin?
I noticed that this package has dropped off of CRAN. Do you know if anyone is maintaining the package at the moment?
If not, I'd be interested in taking over maintainer duties for the package and working to get it back on CRAN.
I've found it very useful in my work, and would like to keep it easily accessible.
Thanks,
-Paul
Hi Ben, earlier I tried to get the top clusters using this syntax:
top10 <- top_clusters(res, zones, k = 10, overlapping = FALSE)
top10
but in the result (top10) all clusters have a Gumbel p-value of 0, and although I set overlapping = FALSE, the clusters still overlap. Then, after reading your updates to top_clusters and its documentation, I got results that differ from the first syntax, and all of the cluster p-values are 0.01. In addition, when I use the syntax to show the subregions in the top 10 clusters with flexible zones, I get this error:
Error: object of type 'closure' is not subsettable
What should I do? Thank you very much.
Here is the first syntax:
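One common way to hit this error in R, sketched below, is subsetting a name that resolves to a function rather than to a data frame. If no object called `data` exists in the session, `data` refers to the built-in function `utils::data`, and `data$id` fails. Whether this is what happened in the post is an assumption; the data frame in the sketch is made up.

```r
# With no user object named `data`, the name refers to utils::data, a function
# (a "closure"). Subsetting a function with `$` raises an error.
err_msg <- tryCatch(
  utils::data$id,
  error = function(e) conditionMessage(e)
)
err_msg

# Once `data` is an actual data frame, the same expression works.
data <- data.frame(id = c("a", "b", "a"))  # hypothetical data
ok <- unique(data$id)
ok
```

Checking `class(data)` just before the failing line is a quick way to confirm whether the data frame was actually loaded.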
knn_mat <- coords_to_knn(unique(data[,6:7]), 12)
zones <- knn_zones(knn_mat)
t<-length(unique(data$year))
m<-length(unique(data$subregion))
counts<-matrix(data$case,nrow=t, ncol=m)
View(counts)
population<-matrix(data$population,nrow=t, ncol=m)
res <- scan_pb_poisson(counts = counts,
zones = zones,
population = population,
n_mcsim = 99,
max_only = FALSE)
res$MLC
hotspot<-unique(data$id)[res$MLC$locations]
hotspot
#TOP Cluster
top10 <- top_clusters(res, zones, k = 10, overlapping = FALSE)
top10
# Show the subregions in each of the top 10 clusters
j <- 1
clustersubregion <- list()
for (i in top10$zone) {
  clustersubregion[[j]] <- unique(data$id)[zones[[i]]]
  j <- j + 1
}
clustersubregion
Second Syntax
knn_mat <- coords_to_knn(unique(data[,6:7]), 12)
zones <- knn_zones(knn_mat)
t<-length(unique(data$year))
m<-length(unique(data$subregion))
counts<-matrix(data$case,nrow=t, ncol=m)
#View(counts)#
population<-matrix(data$population,nrow=t, ncol=m)
res <- scan_pb_poisson(counts = counts,
zones = zones,
population = population,
n_mcsim = 99,
max_only = FALSE)
res$MLC
hotspot<-unique(data$id)[res$MLC$locations]
hotspot
# Top cluster p-value
mc_pvalue <- function(observed, replicates) {
  if (length(replicates) == 0) {
    return(NULL)
  } else {
    f <- Vectorize(
      function(y) {
        (1 + sum(replicates > y)) / (1 + length(replicates))
      }
    )
    return(f(observed))
  }
}
# gum.fit() and pgumbel() are not base R; they must come from an attached
# package (for example ismev and reliaR).
gumbel_pvalue <- function(observed, replicates, method = "ML", ...) {
  if (length(replicates) < 2) {
    stop("Need at least 2 observations to fit Gumbel distribution.")
  }
  gumbel_mu <- NA
  gumbel_sigma <- NA
  if (method == "ML") {
    # Maximum likelihood fit
    gum_fit <- gum.fit(replicates, show = FALSE, ...)
    gumbel_mu <- gum_fit$mle[1]
    gumbel_sigma <- gum_fit$mle[2]
  } else {
    # Method of moments fit
    gumbel_sigma <- sqrt(6 * var(replicates) / pi^2)
    gumbel_mu <- mean(replicates) + digamma(1) * gumbel_sigma
  }
  pvalue <- pgumbel(observed, gumbel_mu, gumbel_sigma, lower.tail = FALSE)
  return(list(pvalue = pvalue,
              gumbel_mu = gumbel_mu,
              gumbel_sigma = gumbel_sigma))
}
mtop_clusters <- function(x, zones, k = 10, overlapping = FALSE, gumbel = FALSE,
                          alpha = NULL, ...) {
  k <- min(k, nrow(x$observed))
  if (overlapping) {
    return(x$observed[seq_len(k), ])
  } else {
    row_idx <- c(1L, integer(k - 1))
    seen_locations <- zones[[x$observed[1,]$zone]]
    n_added <- 1L
    i <- 2L
    while (n_added < k && i <= nrow(x$observed)) {
      zone <- x$observed[i, ]$zone
      if (zone != x$observed[i-1, ]$zone &&
          length(intersect(seen_locations, zones[[zone]])) == 0) {
        seen_locations <- c(seen_locations, zones[[zone]])
        n_added <- n_added + 1L
        row_idx[n_added] <- i
      }
      i <- i + 1L
    }
    res <- x$observed[row_idx[row_idx > 0], ]
    if (nrow(x$replicates) > 0) {
      res$MC_pvalue <- mc_pvalue(res$score, x$replicates$score)
      if (gumbel) {
        res$Gumbel_pvalue <- gumbel_pvalue(res$score,
                                           x$replicates$score)$pvalue
      }
      if (!is.null(alpha) && alpha >= 0 && alpha <= 1) {
        res$critical_value <- quantile(x$replicates$score, 1 - alpha)
      }
    }
    return(res)
  }
}
top10 <- mtop_clusters(res, zones, k = 10, overlapping = FALSE, gumbel = FALSE, alpha = 0.05)
top10
# Show the subregions in each of the top 10 clusters
j <- 1
clustersubregion <- list()
for (i in top10$zone) {
  clustersubregion[[j]] <- unique(data$id)[zones[[i]]]
  j <- j + 1
}
clustersubregion
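A p-value of 0.01 for every cluster is what the Monte Carlo formula in mc_pvalue() above returns whenever the observed score exceeds all replicate scores and n_mcsim = 99. The sketch below shows this arithmetic with made-up replicate scores; `mc_p` is a hypothetical stand-in for the non-vectorized core of that function, and whether this is the full explanation for the posted results is an assumption.

```r
# Same formula as the inner function of mc_pvalue() in the post above.
mc_p <- function(observed, replicates) {
  (1 + sum(replicates > observed)) / (1 + length(replicates))
}

set.seed(1)
replicates <- rexp(99)              # hypothetical replicate scores, n_mcsim = 99
observed   <- max(replicates) + 1   # an observed score above every replicate

# No replicate exceeds the observed score, so this is (1 + 0) / (1 + 99) = 0.01,
# the smallest value the formula can return with 99 replicates.
mc_p(observed, replicates)
```

With this formula the smallest attainable p-value is 1 / (1 + n_mcsim), so distinguishing between very strong clusters requires more replicates.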
Set the default argument baselines = NULL
scanstatistics/R/scan_eb_negbin.R.
You added functionality after the 1.0.1 release, for us most relevantly in
but have not yet made a release that includes it. Is one planned?