benjamin-allevius / scanstatistics

An R package for space-time anomaly detection using scan statistics.

License: GNU General Public License v3.0


scan-statistics statistics r cluster rcpp rcpparmadillo spatial spatio-temporal anomaly-detection

scanstatistics's Issues

Missing square root in scores for scan_eb_negbin?

I've been digging into this package and noticed that scan_eb_negbin computes its "hotspot" scores as sum((y - m) / w) / sum(m / w), whereas Tango et al. (2011) use sum((y - m) / w) / sqrt(sum(m / w)). Was this change deliberate, or is it a bug?

Thanks,
-Paul
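
To make the difference concrete, here is a minimal sketch that evaluates both expressions quoted above on the same inputs. The vectors y, m and w are made-up toy values (not package output), and the variable names score_pkg and score_tango are introduced here only for illustration.

```r
# Hypothetical toy inputs: observed counts, expected counts, overdispersion weights
y <- c(12, 7, 9)
m <- c(8, 6, 10)
w <- c(1.5, 1.2, 2.0)

numerator <- sum((y - m) / w)

# Expression reported as used in scan_eb_negbin (no square root)
score_pkg <- numerator / sum(m / w)

# Expression attributed to Tango et al. (2011) in the report above
score_tango <- numerator / sqrt(sum(m / w))

# The two differ by a factor of sqrt(sum(m / w)), which varies by zone and duration
print(c(pkg = score_pkg, tango = score_tango))
```

Because the factor sqrt(sum(m / w)) changes from one zone and duration to the next, the two versions can rank candidate clusters differently, not just scale the scores.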

scan_permutation and scan_eb_poisson

Hi Benjamin,

I was trying to use both of these functions from the scanstatistics package in a Jupyter R notebook. scan_eb_poisson crashed the kernel (small sample based on simulated data and NMgeo), and scan_permutation ran for a long time before failing with the error below. Any ideas? My machine has 16 GB RAM, a 4-core CPU, and Windows 10.
```
Warning message in seq_len(nrow(x)):
"first element used of 'length.out' argument"

Error in seq_len(nrow(x)): argument must be coercible to non-negative integer
Traceback:

scan_permutation(counts = counts2, zones = zones, population = NULL,
 .     n_mcsim = 1, max_only = TRUE)
flipud(population)
rev(seq_len(nrow(x)))

#' @keywords internal
flipud <- function(x) {
  x[rev(seq_len(nrow(x))), , drop = FALSE]
}
```

counts2.zip
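
The traceback above suggests that flipud() received population = NULL. The sketch below reproduces that failure with the flipud() definition quoted in the report and adds a guard. flipud_checked is a hypothetical helper written for this illustration, not a function in scanstatistics, and the exact error text depends on the R version.

```r
# Definition as quoted in the report
flipud <- function(x) {
  x[rev(seq_len(nrow(x))), , drop = FALSE]
}

# nrow(NULL) is NULL, so seq_len() has no valid length to work with and errors
failed <- inherits(try(flipud(NULL), silent = TRUE), "try-error")

# Hypothetical guard (not part of scanstatistics): fail early with a clear message
flipud_checked <- function(x) {
  if (!is.matrix(x)) {
    stop("expected a matrix, got an object of class ", class(x)[1])
  }
  flipud(x)
}

m <- matrix(1:6, nrow = 3)
flipped <- flipud_checked(m)  # rows of m in reverse order
```

If this is indeed the cause, passing an explicit population matrix with the same dimensions as counts would avoid the NULL reaching flipud().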

Index out of bounds in Scan statistics

Dear Ben,

I'm pretty new to scanstatistics. When I tried to run functions from the package, I got this error:
error: Mat::elem(): index out of bounds
Error in (function (counts, baselines, zones, zone_lengths, store_everything, :
Mat::elem(): index out of bounds

Could you please help me out?

Thanks.

Best,
Lusi
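
The report does not include the call that failed, so the cause is unknown. One thing worth ruling out (an assumption, not a diagnosis) is a zone that refers to a location index outside the columns of the counts matrix. The sketch below uses a hypothetical helper, check_zones, and toy data to show such a check.

```r
# Hypothetical helper (not part of scanstatistics): verify that every location
# index used in `zones` refers to an existing column of `counts`
check_zones <- function(counts, zones) {
  idx <- unlist(zones)
  if (!is.matrix(counts)) {
    stop("counts must be a matrix with one column per location")
  }
  if (any(idx < 1) || any(idx > ncol(counts))) {
    stop("zones refer to location indices outside 1..", ncol(counts))
  }
  invisible(TRUE)
}

# Toy data: 4 time points, 3 locations
counts <- matrix(rpois(12, lambda = 5), nrow = 4, ncol = 3)
good_zones <- list(1L, 2L, 3L, c(1L, 2L))
bad_zones  <- list(1L, c(2L, 4L))   # location 4 does not exist

ok <- check_zones(counts, good_zones)
bad <- inherits(try(check_zones(counts, bad_zones), silent = TRUE), "try-error")
```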

Why did the function scan_pb_poisson give different results than SatScan?

Hi Benjamin,
I used scan_pb_poisson to conduct a space-time analysis of my dataset and found that its results were quite different from those given by the SaTScan software.
My dataset contains daily disease counts from 2020/12/31 to 2021/4/14 for 10 locations with latitude and longitude.
For SaTScan, here are the settings:
[Input]
Time precision : Day
Coordinates : Lat/Long
[Analysis]
Type of Analysis : Space-Time
Probability Model : Poisson
Scan For Area With : High rates
And for scan_pb_poisson, here is my code:
```r
library(scanstatistics)
library(dplyr)  # %>% and select()
library(sp)     # spDists()

# Time-by-location matrices of counts and populations
counts = SZ_counts %>%
  df_to_matrix(time_col = "time", location_col = "region", value_col = "count")
population = SZ_counts %>%
  df_to_matrix(time_col = "time", location_col = "region", value_col = "population")

# Zones built from each location's 4 nearest neighbours
zones = SZ_geo %>%
  select(long, lat) %>%
  as.matrix %>%
  spDists(x = ., y = ., longlat = TRUE) %>%
  dist_to_knn(k = 4) %>%
  knn_zones
regions = as.character(SZ_geo$region)
result = data.frame()
newcounts = counts
newpopulation = population
poisson_result = scan_pb_poisson(counts = newcounts,
                                 zones = zones,
                                 population = newpopulation,
                                 n_mcsim = 999)
topclusters = top_clusters(poisson_result, zones, k = 10, overlapping = FALSE)

# Map each top zone to the names of the regions it contains
top_regions = topclusters$zone %>%
  purrr::map(get_zone, zones = zones) %>%
  purrr::map(function(x) regions[x])

new_top_regions = c()
for (j in seq_along(top_regions)) {
  new_top_regions[j] = paste(top_regions[[j]], collapse = ',')
}

topclusters$zonename = new_top_regions
topclusters$endtime = rownames(population)[53]
result = rbind(result, topclusters)
```

For the same dataset, SaTScan gave the following results:
              1.Location IDs included.: 5
                Coordinates / radius..: (22.726017 N, 114.254455 E) / 0 km
                Time frame............: 2021/2/21 to 2021/4/13
                Population............: 2508600
                Number of cases.......: 198
                Expected cases........: 43.90
                Annual cases / 100000.: 55.4
                Observed / expected...: 4.51
                Relative risk.........: 6.85
                Log likelihood ratio..: 174.120739
                P-value...............: < 0.00000000000000001
              
              2.Location IDs included.: 4, 6, 1
                Coordinates / radius..: (22.754466 N, 113.942560 E) / 22.26 km
                Time frame............: 2021/2/10 to 2021/4/2
                Population............: 6369300
                Number of cases.......: 22
                Expected cases........: 111.46
                Annual cases / 100000.: 2.4
                Observed / expected...: 0.20
                Relative risk.........: 0.16
                Log likelihood ratio..: 63.471627
                P-value...............: < 0.00000000000000001
              
              3.Location IDs included.: 3, 7, 8
                Coordinates / radius..: (22.528466 N, 114.061547 E) / 12.88 km
                Time frame............: 2021/1/1 to 2021/2/20
                Population............: 4265300
                Number of cases.......: 14
                Expected cases........: 73.21
                Annual cases / 100000.: 2.4
                Observed / expected...: 0.19
                Relative risk.........: 0.17
                Log likelihood ratio..: 40.022434
                P-value...............: 0.000000000000092

While scan_pb_poisson gave the following results:

    zone  duration  score        relrisk_in  relrisk_out  Gumbel_pvalue  zonename  endtime
    15    104       392.4982441  4.248194    0.2996086    0.0000000      5         2021/2/28
    13    104       329.5112571  3.428604    0.2993484    0.0000000      4,5       2021/2/28

The 2 zones given by scan_pb_poisson were totally different from the 3 clusters given by SaTScan. Why is that?

In addition, SaTScan gave only one relative risk, but scan_pb_poisson gave two: relrisk_in and relrisk_out. How can I match these results?
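
On the relative-risk question, a small sketch may help. SaTScan reports relative risk as the risk inside the cluster divided by the risk outside it, (c / e) / ((C - c) / (C - e)), where c and e are the observed and expected cases in the cluster and C is the total number of cases. The function name satscan_relrisk is made up for this illustration, and the total of 495 cases is an assumption back-calculated from the reported figures; the report itself does not state it.

```r
# SaTScan-style relative risk: risk inside the cluster over risk outside it
satscan_relrisk <- function(cases_in, expected_in, total_cases) {
  risk_in  <- cases_in / expected_in
  risk_out <- (total_cases - cases_in) / (total_cases - expected_in)
  risk_in / risk_out
}

# cases_in and expected_in come from cluster 1 of the SaTScan output above.
# total_cases = 495 is HYPOTHETICAL: it is not given in the report and was
# chosen only because it is consistent with the reported relative risks.
rr <- satscan_relrisk(cases_in = 198, expected_in = 43.90, total_cases = 495)
print(round(rr, 2))
```

If relrisk_in and relrisk_out are the observed-to-expected ratios inside and outside a zone (an assumption about the package output that should be checked against its documentation), then relrisk_in / relrisk_out would be the quantity to compare with SaTScan's single relative risk, and only for clusters covering the same locations and time window.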

fail to find function named mscan_fss()

Hi Benjamin,

I could not find a function named mscan_fss() in your R package. Could you please check? Thanks!
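
A quick way to check whether an installed package exports a given function is sketched below. has_export is a hypothetical helper written for this illustration; the stats package is used only so that the example runs without scanstatistics installed.

```r
# Hypothetical helper: TRUE if package `pkg` is installed and exports `fn`
has_export <- function(pkg, fn) {
  requireNamespace(pkg, quietly = TRUE) && fn %in% getNamespaceExports(pkg)
}

# Demonstration with a package that ships with R
found   <- has_export("stats", "median")
missing <- has_export("stats", "mscan_fss")

# With scanstatistics installed, the same check would be:
# has_export("scanstatistics", "mscan_fss")
```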

score_locations and top_clusters

Is there some reason why these routines don't work for results of scan_bayes_negbin?

No longer on CRAN

I noticed that this package has dropped off of CRAN. Do you know if anyone is maintaining the package at the moment?
If not, I'd be interested in taking over maintainer duties for the package and working to get it back on CRAN.
I've found it very useful in my work, and would like to keep it easily accessible.

Thanks,
-Paul

top clusters

Hi Ben, earlier I tried to get the top clusters using this syntax:
top10 <- top_clusters(res, zones, k = 10, overlapping = FALSE)
top10

In the result (top10), all clusters have a Gumbel p-value of 0, and although I set overlapping = FALSE, the clusters still overlap. Then, after reading your updates to top_clusters and its documentation, I tried a second syntax: its top clusters differ from those of the first syntax, and the MLC p-values are all 0.01. Besides that, when I use the code that shows the subregions in the top 10 clusters in Flexible Zones, I get this error:
Error: object of type 'closure' is not subsettable
What should I do? Thank you very much.

Here is the first syntax:
knn_mat <- coords_to_knn(unique(data[, 6:7]), 12)
zones <- knn_zones(knn_mat)

t <- length(unique(data$year))
m <- length(unique(data$subregion))
counts <- matrix(data$case, nrow = t, ncol = m)
View(counts)
population <- matrix(data$population, nrow = t, ncol = m)

res <- scan_pb_poisson(counts = counts,
                       zones = zones,
                       population = population,
                       n_mcsim = 99,
                       max_only = FALSE)

res$MLC

hotspot <- unique(data$id)[res$MLC$locations]
hotspot

# Top clusters
top10 <- top_clusters(res, zones, k = 10, overlapping = FALSE)
top10

# Show subregions in the top 10 clusters
j <- 1
clustersubregion <- list()
for (i in top10$zone) {
  clustersubregion[[j]] <- unique(data$id)[zones[[i]]]
  j <- j + 1
}
clustersubregion
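
The error "object of type 'closure' is not subsettable" means that something being indexed with $ or [[ is a function rather than data. One possible explanation (an assumption, since the session state is not shown) is that a name such as data was not assigned in the current session, so R falls back to the built-in function utils::data(). The sketch below reproduces the message and shows the zone-to-subregion lookup on toy inputs; ids, zones and top_zone are made-up values.

```r
# Subsetting a function triggers the error from the report
msg <- tryCatch(utils::data$id, error = function(e) conditionMessage(e))
print(msg)

# Toy stand-ins (hypothetical values, not the reporter's data)
ids <- c("A", "B", "C", "D")                   # subregion identifiers
zones <- list(1L, 2L, c(1L, 2L), c(3L, 4L))    # zones as location indices
top_zone <- c(3, 4)                            # zone numbers of the top clusters

# Same lookup as the loop above, written with lapply()
clustersubregion <- lapply(top_zone, function(i) ids[zones[[i]]])
```

Checking is.function(data) and is.list(zones) before running the loop would show quickly which object is the function.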

Here is the second syntax:
knn_mat <- coords_to_knn(unique(data[, 6:7]), 12)
zones <- knn_zones(knn_mat)

t <- length(unique(data$year))
m <- length(unique(data$subregion))
counts <- matrix(data$case, nrow = t, ncol = m)
# View(counts)
population <- matrix(data$population, nrow = t, ncol = m)

res <- scan_pb_poisson(counts = counts,
                       zones = zones,
                       population = population,
                       n_mcsim = 99,
                       max_only = FALSE)

res$MLC

hotspot <- unique(data$id)[res$MLC$locations]
hotspot

# Top cluster p-value
mc_pvalue <- function(observed, replicates) {
  if (length(replicates) == 0) {
    return(NULL)
  } else {
    f <- Vectorize(
      function(y) {
        (1 + sum(replicates > y)) / (1 + length(replicates))
      }
    )

    return(f(observed))
  }
}

gumbel_pvalue <- function(observed, replicates, method = "ML", ...) {
  if (length(replicates) < 2) {
    stop("Need at least 2 observations to fit Gumbel distribution.")
  }

  # Fit Gumbel distribution to Monte Carlo replicates
  gumbel_mu <- NA
  gumbel_sigma <- NA
  if (method == "ML") {
    gum_fit <- gum.fit(replicates, show = FALSE, ...)
    gumbel_mu <- gum_fit$mle[1]
    gumbel_sigma <- gum_fit$mle[2]
  } else {
    gumbel_sigma <- sqrt(6 * var(replicates) / pi^2)
    gumbel_mu <- mean(replicates) + digamma(1) * gumbel_sigma
  }

  pvalue <- pgumbel(observed, gumbel_mu, gumbel_sigma, lower.tail = FALSE)

  return(list(pvalue = pvalue,
              gumbel_mu = gumbel_mu,
              gumbel_sigma = gumbel_sigma))
}

mtop_clusters <- function(x, zones, k = 10, overlapping = FALSE, gumbel = FALSE,
                          alpha = NULL, ...) {
  k <- min(k, nrow(x$observed))
  if (overlapping) {
    return(x$observed[seq_len(k), ])
  } else {
    row_idx <- c(1L, integer(k - 1))
    seen_locations <- zones[[x$observed[1, ]$zone]]
    n_added <- 1L
    i <- 2L
    while (n_added < k && i <= nrow(x$observed)) {
      zone <- x$observed[i, ]$zone
      if (zone != x$observed[i - 1, ]$zone &&
          length(intersect(seen_locations, zones[[zone]])) == 0) {
        seen_locations <- c(seen_locations, zones[[zone]])
        n_added <- n_added + 1L
        row_idx[n_added] <- i
      }
      i <- i + 1L
    }
    res <- x$observed[row_idx[row_idx > 0], ]

    if (nrow(x$replicates) > 0) {
      res$MC_pvalue <- mc_pvalue(res$score, x$replicates$score)

      if (gumbel) {
        res$Gumbel_pvalue <- gumbel_pvalue(res$score,
                                           x$replicates$score)$pvalue
      }
      if (!is.null(alpha) && alpha >= 0 && alpha <= 1) {
        res$critical_value <- quantile(x$replicates$score, 1 - alpha)
      }
    }
    return(res)
  }
}

top10 <- mtop_clusters(res, zones, k = 10, overlapping = FALSE, gumbel = FALSE, alpha = 0.05)
top10

# Show subregions in the top 10 clusters
j <- 1
clustersubregion <- list()
for (i in top10$zone) {
  clustersubregion[[j]] <- unique(data$id)[zones[[i]]]
  j <- j + 1
}
clustersubregion
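
The two p-values described in this report can be illustrated with a short base-R sketch. It uses the Monte Carlo formula from the code above and a method-of-moments Gumbel fit (the non-"ML" branch of the posted gumbel_pvalue), with the Gumbel tail written out by hand so that no extra packages are needed. The replicate and observed scores are simulated, hypothetical values, and gumbel_mom_pvalue is a name introduced only for this example.

```r
# Monte Carlo p-value as in the post: (1 + #{replicates > observed}) / (1 + R)
mc_pvalue <- function(observed, replicates) {
  vapply(observed,
         function(y) (1 + sum(replicates > y)) / (1 + length(replicates)),
         numeric(1))
}

# Gumbel upper-tail probability using a method-of-moments fit (base R only)
gumbel_mom_pvalue <- function(observed, replicates) {
  sigma <- sqrt(6 * var(replicates)) / pi
  mu <- mean(replicates) + digamma(1) * sigma   # digamma(1) is minus Euler's constant
  -expm1(-exp(-(observed - mu) / sigma))        # 1 - exp(-exp(-z)), computed stably
}

set.seed(1)
replicates <- rexp(99)        # hypothetical null scores, standing in for n_mcsim = 99
observed <- c(50, 40, 0.5)    # hypothetical observed scores

mc <- mc_pvalue(observed, replicates)
gum <- gumbel_mom_pvalue(observed, replicates)
print(mc)
print(gum)
```

With 99 replicates the Monte Carlo p-value cannot go below 1 / 100 = 0.01, so every cluster whose score exceeds all replicates gets exactly 0.01. The Gumbel p-value has no such floor, and for scores far above the replicates it becomes so small that it prints as 0.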

baselines = NULL default in scan_eb_negbin

Set the default argument baselines = NULL in scanstatistics/R/scan_eb_negbin.R.

1.0.2 release?

You added functionality after the 1.0.1 release, most relevantly for us in commit 30b424c, but have not yet made a release that includes it. Is one planned?
