benjamin-allevius / scanstatistics


An R package for space-time anomaly detection using scan statistics.

License: GNU General Public License v3.0

R 19.47% TeX 0.43% C++ 7.25% C 1.21% HTML 71.64%
scan-statistics statistics r cluster rcpp rcpparmadillo spatial spatio-temporal anomaly-detection

scanstatistics's Issues

Missing square root in scores for scan_eb_negbin?

I've been digging into this package, and I noticed that you're using the formula sum((y - m) / w) / sum(m / w) to calculate "hotspot" scores for the scan_eb_negbin function, but Tango et al. (2011) use the formula sum((y-m) / w) / sqrt(sum(m / w)). Was there a deliberate reason for this change or is this a bug?

Thanks,
-Paul
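
To make the difference concrete, here is a minimal sketch of the two candidate scores for a single zone. The vectors y (observed counts), m (expected counts) and w (overdispersion weights) hold made-up illustrative values, and score_package and score_tango are hypothetical helper names, not functions from scanstatistics.

```r
# Hypothetical values for the locations in one zone (not real data).
y <- c(12, 7, 9)        # observed counts
m <- c(8, 6, 5)         # expected counts
w <- c(1.5, 1.2, 1.1)   # overdispersion weights

# Score as described in the issue for scan_eb_negbin: no square root.
score_package <- function(y, m, w) sum((y - m) / w) / sum(m / w)

# Score as attributed to Tango et al. (2011) in the issue:
# square root in the denominator.
score_tango <- function(y, m, w) sum((y - m) / w) / sqrt(sum(m / w))

s_pkg   <- score_package(y, m, w)
s_tango <- score_tango(y, m, w)

# The two scores differ by a factor of sqrt(sum(m / w)).
ratio <- s_tango / s_pkg
```

Because that factor depends on the zone, the two formulas can rank zones differently rather than merely rescale the scores.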

scan_permutation and scan_eb_poisson

Hi Benjamin,

I was trying to use both of these functions from the scanstatistics package in a Jupyter R notebook. scan_eb_poisson crashed the kernel (small sample based on simulated data and NMgeo), and scan_permutation ran for a very long time. Any ideas? My machine has 16 GB RAM and a 4-core CPU, running Windows 10.
```
Warning message in seq_len(nrow(x)):
"first element used of 'length.out' argument"

Error in seq_len(nrow(x)): argument must be coercible to non-negative integer
Traceback:

  1. scan_permutation(counts = counts2, zones = zones, population = NULL,
    . n_mcsim = 1, max_only = TRUE)
  2. flipud(population)
  3. rev(seq_len(nrow(x)))
```

```
#' @keywords internal
flipud <- function(x) {
  x[rev(seq_len(nrow(x))), , drop = FALSE]
}
```

counts2.zip
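
The traceback shows flipud(population) being called while population = NULL was passed. Since nrow(NULL) is NULL, seq_len() has no valid length to work with. The sketch below reproduces that failure in isolation. safe_flipud is a hypothetical guard added for illustration and is not part of scanstatistics, and whether scan_permutation is meant to accept a missing population at all is not confirmed here.

```r
# flipud as quoted in the issue: reverse the row order of a matrix.
flipud <- function(x) {
  x[rev(seq_len(nrow(x))), , drop = FALSE]
}

# nrow(NULL) is NULL, so seq_len() fails when x is NULL.
null_result <- tryCatch(flipud(NULL), error = function(e) "error")

# Hypothetical guard (an assumption, not package code): pass NULL through.
safe_flipud <- function(x) {
  if (is.null(x)) return(NULL)
  flipud(x)
}

mat <- matrix(1:6, nrow = 3)
flipped <- safe_flipud(mat)   # same matrix with rows in reverse order
```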

Index out of bounds in Scan statistics

Dear Ben,

I'm pretty new to scanstatistics. When I tried to run functions from the package, I got this error:
error: Mat::elem(): index out of bounds
Error in (function (counts, baselines, zones, zone_lengths, store_everything, :
Mat::elem(): index out of bounds

Could you please help me out?

Thanks.

Best,
Lusi
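
Armadillo raises `Mat::elem(): index out of bounds` when an index vector points outside a matrix. One possible cause, offered only as a guess since the issue gives no reproducible example, is a zones list that refers to location indices larger than the number of columns in counts. check_zones below is a hypothetical helper, not a scanstatistics function.

```r
# Hypothetical helper: verify that every zone only refers to
# locations (columns) that exist in the counts matrix.
check_zones <- function(counts, zones) {
  idx <- unlist(zones)
  all(idx >= 1 & idx <= ncol(counts))
}

counts <- matrix(0L, nrow = 5, ncol = 3)   # 5 time points, 3 locations
ok_zones  <- list(1L, 2L, c(1L, 2L), 3L)
bad_zones <- list(1L, c(2L, 4L))           # location 4 does not exist

zones_ok  <- check_zones(counts, ok_zones)
zones_bad <- check_zones(counts, bad_zones)
```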

Why did the function scan_pb_poisson give different results than SaTScan?

Hi Benjamin,
I used scan_pb_poisson to run a space-time analysis on my dataset, and the results were quite different from those given by SaTScan.
My dataset contains daily disease counts from 2020/12/31 to 2021/4/14 for 10 locations with latitude and longitude.
For SaTScan, here are the settings:
[Input]
Time precision : Day
Coordinates : Lat/Long
[Analysis]
Type of Analysis : Space-Time
Probability Model : Poisson
Scan For Area With : High rates
And for scan_pb_poisson, here is my code:
```
library(scanstatistics)
library(dplyr)  # select() and %>%
library(sp)     # spDists()

counts = SZ_counts %>%
  df_to_matrix(time_col = "time", location_col = "region", value_col = "count")
population = SZ_counts %>%
  df_to_matrix(time_col = "time", location_col = "region", value_col = "population")
zones = SZ_geo %>%
  select(long, lat) %>%
  as.matrix %>%
  spDists(x = ., y = ., longlat = TRUE) %>%
  dist_to_knn(k = 4) %>%
  knn_zones
regions = as.character(SZ_geo$region)
result = data.frame()
newcounts = counts
newpopulation = population
poisson_result = scan_pb_poisson(counts = newcounts,
                                 zones = zones,
                                 population = newpopulation,
                                 n_mcsim = 999)
topclusters = top_clusters(poisson_result, zones, k = 10, overlapping = FALSE)

# Translate the location indices of each top zone into region names
top_regions = topclusters$zone %>%
  purrr::map(get_zone, zones = zones) %>%
  purrr::map(function(x) regions[x])

new_top_regions = c()
for (j in 1:length(top_regions)) {
  new_top_regions[j] = paste(top_regions[[j]], collapse = ',')
}

topclusters$zonename = new_top_regions
topclusters$endtime = rownames(population)[53]
result = rbind(result, topclusters)
```

For the same dataset, SaTScan gave the following results:
              1.Location IDs included.: 5
                Coordinates / radius..: (22.726017 N, 114.254455 E) / 0 km
                Time frame............: 2021/2/21 to 2021/4/13
                Population............: 2508600
                Number of cases.......: 198
                Expected cases........: 43.90
                Annual cases / 100000.: 55.4
                Observed / expected...: 4.51
                Relative risk.........: 6.85
                Log likelihood ratio..: 174.120739
                P-value...............: < 0.00000000000000001
              
              2.Location IDs included.: 4, 6, 1
                Coordinates / radius..: (22.754466 N, 113.942560 E) / 22.26 km
                Time frame............: 2021/2/10 to 2021/4/2
                Population............: 6369300
                Number of cases.......: 22
                Expected cases........: 111.46
                Annual cases / 100000.: 2.4
                Observed / expected...: 0.20
                Relative risk.........: 0.16
                Log likelihood ratio..: 63.471627
                P-value...............: < 0.00000000000000001
              
              3.Location IDs included.: 3, 7, 8
                Coordinates / radius..: (22.528466 N, 114.061547 E) / 12.88 km
                Time frame............: 2021/1/1 to 2021/2/20
                Population............: 4265300
                Number of cases.......: 14
                Expected cases........: 73.21
                Annual cases / 100000.: 2.4
                Observed / expected...: 0.19
                Relative risk.........: 0.17
                Log likelihood ratio..: 40.022434
                P-value...............: 0.000000000000092

While scan_pb_poisson gave the following results:

zone  duration  score        relrisk_in  relrisk_out  Gumbel_pvalue  zonename  endtime
15    104       392.4982441  4.248194    0.2996086    0.0000000      5         2021/2/28
13    104       329.5112571  3.428604    0.2993484    0.0000000      4,5       2021/2/28

The 2 zones given by scan_pb_poisson are completely different from the 3 clusters given by SaTScan. Why is that?

In addition, SaTScan reports only one relative risk, whereas scan_pb_poisson reports two: relrisk_in and relrisk_out. How can I match these results?
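
As a rough guide to comparing the two outputs, SaTScan's "Observed / expected" line can be reproduced directly from the reported case counts. The sketch below also forms the ratio of relrisk_in to relrisk_out, on the assumption (not confirmed against the package documentation here) that these are observed-to-expected ratios inside and outside the zone. Under that assumption the quotient is the quantity closest in spirit to SaTScan's single relative risk, which compares risk inside the cluster with risk outside it.

```r
# Observed and expected cases for the three SaTScan clusters quoted in the issue.
observed <- c(198, 22, 14)
expected <- c(43.90, 111.46, 73.21)
obs_over_exp <- round(observed / expected, 2)

# Values reported by scan_pb_poisson in the issue.
relrisk_in  <- c(4.248194, 3.428604)
relrisk_out <- c(0.2996086, 0.2993484)

# Assumption: both are observed/expected ratios, so their quotient
# compares risk inside the zone with risk outside it.
in_vs_out <- relrisk_in / relrisk_out
```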

No longer on CRAN

I noticed that this package has dropped off of CRAN. Do you know if anyone is maintaining the package at the moment?
If not, I'd be interested in taking over maintainer duties for the package and working to get it back on CRAN.
I've found it very useful in my work, and would like to keep it easily accessible.

Thanks,
-Paul

top clusters

Hi Ben, earlier I tried to get the top clusters using this syntax:
top10 <- top_clusters(res, zones, k = 10, overlapping = FALSE)
top10

In the result (top10), all clusters have a Gumbel p-value of 0, and although I set overlapping = FALSE, the clusters still overlap. Then, after reading your updates to top_clusters and its documentation, I got top clusters that differ from those of the first syntax, and every MLC p-value is 0.01. In addition, when I use the syntax for showing the subregions in the top 10 clusters with flexible zones, I get this error:
Error: object of type 'closure' is not subsettable
What should I do? Thank you very much.

Here is the first syntax:
knn_mat <- coords_to_knn(unique(data[, 6:7]), 12)
zones <- knn_zones(knn_mat)

t <- length(unique(data$year))
m <- length(unique(data$subregion))
counts <- matrix(data$case, nrow = t, ncol = m)
View(counts)
population <- matrix(data$population, nrow = t, ncol = m)

res <- scan_pb_poisson(counts = counts,
                       zones = zones,
                       population = population,
                       n_mcsim = 99,
                       max_only = FALSE)

res$MLC

hotspot <- unique(data$id)[res$MLC$locations]
hotspot

# Top clusters
top10 <- top_clusters(res, zones, k = 10, overlapping = FALSE)
top10

# Show subregions in the top 10 clusters
j <- 1
clustersubregion <- list()
for (i in top10$zone) {
  clustersubregion[[j]] <- unique(data$id)[zones[[i]]]
  j <- j + 1
}
clustersubregion
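
The error `object of type 'closure' is not subsettable` means that something being indexed with $ or [ is a function. A common way to hit it with code like this is that no data frame named data exists in the session, so data resolves to the built-in function utils::data. Whether that is what happened here is an assumption. The sketch shows the failure and a variant that uses a distinct name (case_data, a hypothetical data frame) together with lapply instead of a manual counter.

```r
# With no user object called `data`, the name refers to utils::data,
# which is a function, so `$` fails.
closure_error <- tryCatch(utils::data$id,
                          error = function(e) conditionMessage(e))

# Hypothetical stand-ins for the objects in the issue.
case_data <- data.frame(id = c("A", "B", "C", "A", "B", "C"),
                        stringsAsFactors = FALSE)
zones <- list(1L, 2L, c(1L, 2L), 3L)
top_zone_idx <- c(3L, 4L)   # stands in for top10$zone

# Look up the subregion ids belonging to each top zone.
ids <- unique(case_data$id)
clustersubregion <- lapply(top_zone_idx, function(i) ids[zones[[i]]])
```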

Second syntax:
knn_mat <- coords_to_knn(unique(data[, 6:7]), 12)
zones <- knn_zones(knn_mat)

t <- length(unique(data$year))
m <- length(unique(data$subregion))
counts <- matrix(data$case, nrow = t, ncol = m)
# View(counts)
population <- matrix(data$population, nrow = t, ncol = m)

res <- scan_pb_poisson(counts = counts,
                       zones = zones,
                       population = population,
                       n_mcsim = 99,
                       max_only = FALSE)

res$MLC

hotspot <- unique(data$id)[res$MLC$locations]
hotspot

# Top cluster p-values
mc_pvalue <- function(observed, replicates) {
  if (length(replicates) == 0) {
    return(NULL)
  } else {
    f <- Vectorize(
      function(y) {
        (1 + sum(replicates > y)) / (1 + length(replicates))
      }
    )
    return(f(observed))
  }
}

# gum.fit() comes from the ismev package; pgumbel() is assumed to come from reliaR.
gumbel_pvalue <- function(observed, replicates, method = "ML", ...) {
  if (length(replicates) < 2) {
    stop("Need at least 2 observations to fit Gumbel distribution.")
  }

  # Fit Gumbel distribution to Monte Carlo replicates
  gumbel_mu <- NA
  gumbel_sigma <- NA
  if (method == "ML") {
    gum_fit <- gum.fit(replicates, show = FALSE, ...)
    gumbel_mu <- gum_fit$mle[1]
    gumbel_sigma <- gum_fit$mle[2]
  } else {
    gumbel_sigma <- sqrt(6 * var(replicates) / pi^2)
    gumbel_mu <- mean(replicates) + digamma(1) * gumbel_sigma
  }

  pvalue <- pgumbel(observed, gumbel_mu, gumbel_sigma, lower.tail = FALSE)

  return(list(pvalue = pvalue,
              gumbel_mu = gumbel_mu,
              gumbel_sigma = gumbel_sigma))
}

mtop_clusters <- function(x, zones, k = 10, overlapping = FALSE, gumbel = FALSE,
                          alpha = NULL, ...) {
  k <- min(k, nrow(x$observed))
  if (overlapping) {
    return(x$observed[seq_len(k), ])
  } else {
    row_idx <- c(1L, integer(k - 1))
    seen_locations <- zones[[x$observed[1, ]$zone]]
    n_added <- 1L
    i <- 2L
    while (n_added < k && i <= nrow(x$observed)) {
      zone <- x$observed[i, ]$zone
      if (zone != x$observed[i - 1, ]$zone &&
          length(intersect(seen_locations, zones[[zone]])) == 0) {
        seen_locations <- c(seen_locations, zones[[zone]])
        n_added <- n_added + 1L
        row_idx[n_added] <- i
      }
      i <- i + 1L
    }
    res <- x$observed[row_idx[row_idx > 0], ]

    if (nrow(x$replicates) > 0) {
      res$MC_pvalue <- mc_pvalue(res$score, x$replicates$score)

      if (gumbel) {
        res$Gumbel_pvalue <- gumbel_pvalue(res$score,
                                           x$replicates$score)$pvalue
      }
      if (!is.null(alpha) && alpha >= 0 && alpha <= 1) {
        res$critical_value <- quantile(x$replicates$score, 1 - alpha)
      }
    }
    return(res)
  }
}

top10 <- mtop_clusters(res, zones, k = 10, overlapping = FALSE, gumbel = FALSE, alpha = 0.05)
top10

# Show subregions in the top 10 clusters
j <- 1
clustersubregion <- list()
for (i in top10$zone) {
  clustersubregion[[j]] <- unique(data$id)[zones[[i]]]
  j <- j + 1
}
clustersubregion
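
The constant p-value of 0.01 follows from the Monte Carlo formula in mc_pvalue: with n_mcsim = 99 replicates, the smallest attainable value is (1 + 0) / (1 + 99) = 0.01, reached whenever the observed score exceeds every replicate. A minimal sketch with made-up scores (the replicate and observed values are illustrative, not taken from the issue):

```r
# Same Monte Carlo p-value formula as in the issue, written with vapply.
mc_pvalue <- function(observed, replicates) {
  vapply(observed,
         function(y) (1 + sum(replicates > y)) / (1 + length(replicates)),
         numeric(1))
}

set.seed(1)
replicates <- runif(99, min = 0, max = 5)   # made-up replicate scores
observed   <- c(50, 40, 30)                 # all larger than any replicate

# No replicate exceeds any observed score, so each p-value is 1 / 100.
p <- mc_pvalue(observed, replicates)
```

Increasing n_mcsim lowers this floor; for example, 999 replicates give a minimum of 0.001.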

1.0.2 release?

You added functionality after the 1.0.1 release, most relevantly for us in

30b424c

but have not yet made a release that includes it. Is one planned?
