michaeldorman / nngeo
k-Nearest Neighbor Join for Spatial Data
License: Other
Hi
Very nice package, thanks! I was using distance from point to polygons, and have a few questions:
Is st_dist() used on the full dataset? Interestingly, st_contains() seems much faster. Could it be used to narrow the search: use st_contains() to get the inner points and, if n_inner < k, compute distances only for the outer ones?
Thanks!
Hi,
I used the function st_nn to get the distance vector (class: units) to a variable/column, as follows.
listings$distance_subway <- (st_nn(listings, subway_nyc, returnDist = TRUE))$dist
I noticed that the above code stopped working in nngeo v0.3.0.
I tried the following, but it stored the vector as a list instead of units:
distance_subway <- st_nn(listings, subway_nyc, returnDist = TRUE)
listings$distance_subway <- distance_subway[[2]]
How to proceed?
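One possible way to get back to a plain vector is sketched below. It assumes the result has the shape shown elsewhere on this page (a list with nn and dist components, each a list with one element per feature) and that k = 1. The res object is a hand-made stand-in, not real st_nn output, and treating the distances as meters is an assumption.

```r
# Stand-in for the result of st_nn(..., returnDist = TRUE); values are illustrative only
res <- list(
  nn   = list(3L, 2L, 2L),
  dist = list(22833.09, 1372.235, 2777.558)
)

# With k = 1 each element holds one distance; take the first value of each.
# An empty element (no neighbour found) becomes NA.
dist_vec <- vapply(res$dist, function(d) as.numeric(d[1]), numeric(1))

# Optionally re-attach units (assumes the distances are in meters)
if (requireNamespace("units", quietly = TRUE)) {
  dist_units <- units::set_units(dist_vec, "m", mode = "standard")
}
```

With real data, dist_vec could then be assigned to listings$distance_subway.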
The warnings on CRAN are coming from sp (or sf if st_connect uses st_sample) about *sample being used on unprojected data:
The warnings seem to appear in nngeo::st_connect() in the lapply() local function starting at line 97 in nngeo/R/st_connect.R, and come from sp:::sample.SpatialLines, sp/R/spsample.R line 177:
if (isTRUE(!is.projected(x)))
warning("working under the assumption of projected data!")
In fact, as far as I can see, the warning should be issued in all cases: nngeo is sampling from a line on the ellipsoid but assuming planar geometry. The package could also switch from sp to sf for the spatial sampling:
# x_sp = as(x[i], "Spatial")
# start_pool = sp::spsample(x_sp, type = "regular", n = n_x[i])
# start_pool = st_as_sfc(start_pool)
start_pool = st_sample(x[i], type = "regular", size = n_x[i])
and
# y_sp = as(y[j], "Spatial")
# end_pool = sp::spsample(y_sp, type = "regular", n = n_y[j])
# end_pool = st_as_sfc(end_pool)
end_pool = st_sample(y[j], type = "regular", size = n_y[j])
in nngeo/R/st_connect.R: with sf 0.8-1, the vignette gets three:
#> although coordinates are longitude/latitude, st_sample assumes that they are planar
messages.
So the underlying problem is the misuse of sp::spsample or sf::st_sample on unprojected data.
I have a spatial data frame of 12 million GPS locations, and I'm trying to find the nearest 4 neighbours to each event from a source of 2480 line segments.
Running this crashes my machine after about 15 minutes. This is the code:
nngeo::st_nn(events, segments, k = k, returnDist = TRUE, maxdist = 30, parallel = 10)
Are there limitations on st_nn, or am I doing something wrong?
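One workaround that is sometimes tried for very large inputs is to process the rows in batches so that only one batch is in flight at a time. The helper below is a minimal sketch of that idea. chunk_apply is a hypothetical name, not part of nngeo, and whether batching avoids the crash described here is untested.

```r
# Apply `fun` to consecutive index chunks of 1..n and concatenate the list results.
# `fun` receives an integer vector of row indices and must return a list.
chunk_apply <- function(n, chunk_size, fun) {
  if (n < 1) return(list())
  starts <- seq(1, n, by = chunk_size)
  out <- lapply(starts, function(s) fun(s:min(s + chunk_size - 1, n)))
  do.call(c, out)
}

# Toy usage: each chunk returns one list element per index
doubled <- chunk_apply(10, 3, function(idx) as.list(idx * 2))

# With the objects from the question, the call might look like this (not run):
# nn <- chunk_apply(nrow(events), 1e5, function(idx)
#   nngeo::st_nn(events[idx, ], segments, k = 4, maxdist = 30, progress = FALSE))
```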
Hi,
Thanks for making a great package! I'm using version 0.3.4.
The sparse argument in st_nn has an unfinished sentence in the help file.
The progress argument is missing a bracket and maybe more words - not sure.
Only very minor things, but thought I would point them out.
Hi! I have been using the st_remove_holes function and absolutely loving it, but I noticed a small side effect: it renames the "geometry" column to "geom".
This is easily caught and fixed, but I figured I could reach out and see whether the renaming could be made optional (perhaps with the default set to 'true' in case later nngeo functions expect a "geom" column by name).
This example is not reproducible because I can't share the shapefile I used as "data1" but I hope it illustrates what I mean nonetheless.
Regardless, thanks for making a really useful function!
library(tidyverse); library(nngeo)
#> Loading required package: sf
#> Linking to GEOS 3.10.1, GDAL 3.4.0, PROJ 8.2.0; sf_use_s2() is TRUE
str(data1)
#> Classes ‘sf’ and 'data.frame': 15 obs. of 2 variables:
#> $ uniqueID: chr "UMR_I080.2M" "UMR_SG16.2C" "UMR_CH00.1M" "UMR_CN00.1M" ...
#> $ geometry:sfc_GEOMETRY of length 15; first list element: List of 1
#> ..$ : num [1:1907, 1:2] -90.5 -90.5 -90.5 -90.5 -90.5 ...
#> ..- attr(, "class")= chr [1:3] "XY" "POLYGON" "sfg"
#> - attr(, "sf_column")= chr "geometry"
#> - attr(, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
#> ..- attr(, "names")= chr "uniqueID"
data2 <- data1 %>%
  group_by(uniqueID) %>%
  nngeo::st_remove_holes()
str(data2)
#> Classes ‘sf’ and 'data.frame': 15 obs. of 2 variables:
#> $ uniqueID: chr "UMR_I080.2M" "UMR_SG16.2C" "UMR_CH00.1M" "UMR_CN00.1M" ...
#> $ geom :sfc_GEOMETRY of length 15; first list element: List of 1
#> ..$ : num [1:1907, 1:2] -90.5 -90.5 -90.5 -90.5 -90.5 ...
#> ..- attr(, "class")= chr [1:3] "XY" "POLYGON" "sfg"
#> - attr(, "sf_column")= chr "geom"
#> - attr(, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
#> ..- attr(, "names")= chr "uniqueID"
sf has several functions with mixed argument behavior. For example, st_union and st_intersection accept either one or two spatial inputs.
I think the same behavior in st_nn would be of great benefit.
# Example
library(nngeo)
data(towns)
head(st_nn(towns, k = 2, maxdist = 10e3))
# [[1]]
# [1] 93 5
#
# [[2]]
# [1] 49
#
# [[3]]
# [1] 42 8
#
# [[4]]
# integer(0)
#
# [[5]]
# [1] 12 93
#
# [[6]]
# [1] 20 13
The latest versions have removed functions such as raster_extract. What is the best practice for replacing them?
Hi,
Is there a way to define the distance calculation method, i.e. "Great Circle", "Euclidean", etc., similar to the which argument of sf::st_distance?
Hi, I've been using st_nn for sf points and I noticed that it is much slower when using parallel processing.
I had a look at the source code and I think the way you structure the function is causing a lot of unnecessary copying of data.
For example with 10,000 points I get:
Single Core:
> system.time(r1 <- nngeo::st_nn(x, y, k = 50, parallel = 1))
user system elapsed
23.33 0.41 24.25
4 Core: Much slower
> system.time(r1 <- nngeo::st_nn(x, y, k = 50, parallel = 4))
user system elapsed
6.34 0.27 65.10
Tweaked code: Better but not 4x faster
system.time({
  cl = parallel::makeCluster(4)
  x_split = split(x, 1:4)
  parallel::clusterExport(
    cl = cl,
    varlist = c("y"),
    envir = environment()
  )
  result = parallel::parLapply(
    cl,
    x_split,
    function(i) nngeo:::.st_nn_pnt_proj(i, y, k = 50, maxdist = Inf, progress = FALSE)
  )
  parallel::stopCluster(cl)
  ids = lapply(result, `[[`, 1)
  ids = unlist(ids, recursive = FALSE, use.names = FALSE)
  r2 = ids
})
user system elapsed
0.55 0.06 10.86
> identical(r1, r2)
[1] TRUE
This might be a Windows-only issue, as parallelisation works very differently on Linux.
Seems like st_nn() may have an issue handling projections in US Survey feet/ftUS. When trying to use it for an sf object in EPSG:2248 (NAD83/Maryland), I get this error:
x cannot convert ft into us
Did you try to supply a value in a context where a bare expression was expected?
It works fine when I transform to EPSG:4326 (WGS84 longitude/latitude, which is in degrees rather than feet).
New to R and to submitting issue reports, so let me know if you need more details, but it seems similar to this issue: r-spatial/sf#504
Hi, I noticed that if we use lon-lat data and disable the progress bar, we still get output:
ret <- nngeo::st_nn(
  points,
  nodes,
  returnDist = TRUE,
  k = k,
  progress = FALSE,
  parallel = cores
)
lon-lat points
I think the best solution would be a way to disable that "lon-lat" message, maybe through a new parameter. Alternatively, since disabling progress likely means we want quiet output, the message could be suppressed whenever progress is FALSE.
Thx!
Hi, I found that the following fails; it is also related to r-spatial/sf#2299.
points <- sf::st_sfc(
  sf::st_point(c(0, 1)),
  sf::st_point(c(0, 2)),
  crs = 'LOCAL_CS["planar", UNIT["METER",1]]'
)
points2 <- sf::st_sfc(
  sf::st_point(c(0, 1)),
  sf::st_point(c(0, 2)),
  crs = 'LOCAL_CS["planar", UNIT["METER",1]]'
)
nngeo::st_nn(points, points2)
projected points
Error in if (!is.na(crs_units) & crs_units != "m") { :
argument is of length zero
Here the line:
Line 61 in fcfde8e
For some reason, sf is not setting the units for this CRS; maybe it only sets them for CRSs that have a numeric shortcut, like 4326 for WGS84.
Thx.
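The error message suggests that crs_units has length zero (for example NULL) when the condition is evaluated. A length-safe version of that check might look like the sketch below. The function name is hypothetical and this is not the package's actual code or fix, only an illustration of guarding the condition.

```r
# Reproduce the failure: a zero-length value inside if() raises an error
crs_units <- NULL
err <- tryCatch(
  if (!is.na(crs_units) & crs_units != "m") "non-metric",
  error = function(e) conditionMessage(e)
)

# Length-safe variant: TRUE only for a single, non-missing, non-meter unit.
# `&&` short-circuits, so later comparisons never see a zero-length value.
is_non_metric_unit <- function(crs_units) {
  length(crs_units) == 1 && !is.na(crs_units) && crs_units != "m"
}

is_non_metric_unit(NULL)     # FALSE: units unknown, no error raised
is_non_metric_unit("m")      # FALSE
is_non_metric_unit("us-ft")  # TRUE
```

Whether unknown units should be treated as metric, as this sketch does, or reported to the user is a design choice left open here.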
Thank you very much for the amazing package!
I just have a quick question on the distance result of st_nn after setting the parallel parameter larger than 1. It seems that the result in $dist is the same as $nn.
I first tried st_nn with the example data (cities and water) to calculate distances without parallel processing. Below is the code.
library(nngeo)
library(parallel)
nn = st_nn(cities, water, returnDist = TRUE, progress = TRUE)
[[1]]
[1] 3

[[2]]
[1] 2

[[3]]
[1] 2

nn$dist
[[1]]
[1] 22833.09

[[2]]
[1] 1372.235

[[3]]
[1] 2777.558
However, the distance result changes when I add parallel = 4:
nn_p2 = st_nn(cities, water, returnDist = TRUE, progress = TRUE, parallel = 4)
nn_p2$nn
[[1]]
[1] 3

[[2]]
[1] 2

[[3]]
[1] 2

nn_p2$dist
[[1]]
[1] 3

[[2]]
[1] 2

[[3]]
[1] 2
I hope this question has not already been covered. Could you let me know if I am missing anything here?
Hi Michael
This package would be ideal for a problem I am trying to solve.
I hope you progress it and add additional functionality/articulation of methods. For example, I would like to include more fields from the original data set, and have the columns labelled by their origin.
My issue is that I get an error when running your code example. Running > nn = st_nn(cities, towns, progress = TRUE) produces this error message:
projected points
Error in if (!is.na(crs_units) & crs_units != "m") { :
argument is of length zero
I'd welcome your advice because having your example and its variants working would provide much needed insight. The proj string for those two data sets does not have a units comment.
Thank you once again for a great package!
This might be beyond the scope of the package and not of interest to you, but I thought I should mention it anyway. I think it would be valuable to be able to find how many neighbours there are within a specified distance. I assume this could be achieved by setting a very high k and then counting the neighbours returned for each observation, and there are of course other ways to calculate this. But it would be very convenient to have an additional function in nngeo that adds the number of neighbours within distance d as a new variable in the data frame. It might be just my field, but this type of calculation is very common there.
But there might be problems with this idea that I don't grasp.
All the best, Richard
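The counting step described in this request can be sketched in base R for planar coordinates. The function name and the coordinate-matrix inputs are assumptions made for illustration, not part of nngeo. For sf objects, lengths(sf::st_is_within_distance(x, y, dist = d)) expresses a similar idea.

```r
# Count, for each row of x, how many rows of y lie within distance d.
# x and y are numeric matrices of projected (planar) coordinates, one point per row.
count_within <- function(x, y, d) {
  vapply(seq_len(nrow(x)), function(i) {
    dx <- y[, 1] - x[i, 1]
    dy <- y[, 2] - x[i, 2]
    sum(sqrt(dx^2 + dy^2) <= d)
  }, integer(1))
}

# Toy data: two query points and four candidate neighbours
x <- rbind(c(0, 0), c(10, 0))
y <- rbind(c(1, 0), c(0, 2), c(9, 0), c(50, 50))
n_within <- count_within(x, y, d = 2.5)
```

The result has one count per row of x, so it can be added directly as a new column of the data frame that holds x.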
Hi,
Thanks for a great package. I've noticed that st_nn is very slow when getting the nearest lines to a point, and looking at the code it seems that it just defaults to using sf::st_distance from each point to all the lines.
I've been working on my own function to get around this problem, by measuring the nearest 10 centroids of the lines (which is fast), and then performing the st_nn on just those 10 lines rather than the whole dataset.
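nn_line <- function(point, lines){
  # Pre-filter: the 10 lines whose centroids are nearest to each point (fast)
  cents <- sf::st_centroid(lines)
  nn <- nngeo::st_nn(point, cents, k = 10)
  res <- list()
  for(i in seq_len(nrow(point))){
    nnsub <- nn[[i]]
    # Exact search restricted to the candidate lines
    sub <- unlist(nngeo::st_nn(point$geometry[i], lines$geometry[nnsub], progress = FALSE))
    # Map the position within the subset back to an index into lines
    res[[i]] <- nnsub[sub]
  }
  return(res)
}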
This function is about 143x faster on my data, so I thought it might be useful to share more generally.
I am unable to install nngeo_0.4.7 (either from CRAN or github) in R 4.3.0 in ubuntu 22.04, getting the following error:
*** caught segfault ***
address 0x55582a37ddb6, cause 'memory not mapped'
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault (core dumped)
I have installed Rcpp and RcppEigen via apt install (rather than install.packages() in R) from the c2d4u repository. Perhaps this is causing a conflict?
For large amounts of data, many of this package's functions are quite slow. Is it possible to pass a flag to the underlying C++ to split the task up and utilize all available CPU cores? Right now it runs single threaded.
The lines that st_connect returns do not have a CRS:
library(nngeo) # version 0.1.8
x <- data.frame(x = runif(1), y = runif(1))
x <- st_as_sf(x, coords = c("x", "y"), crs = 4326)
y <- data.frame(x = runif(5), y = runif(5))
y <- st_as_sf(y, coords = c("x", "y"), crs = 4326)
lines <- st_connect(x, y, st_nn(x, y))
print(st_crs(lines))
# Coordinate Reference System: NA
stopifnot(st_crs(lines) == st_crs(x))
# Error: st_crs(lines) == st_crs(x) is not TRUE
I have some longitude/latitude point data and found that st_nn() is very slow on this kind of data. Is there something I can do to make it more efficient?
I know I cannot use RANN::nn2() directly on this kind of data, but is there a way to convert EPSG:4326 to a projection that is suitable for RANN::nn2()?
Thanks.
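Besides reprojecting with sf::st_transform() to a projected CRS that suits the study area, one common trick is to convert longitude/latitude to 3D Cartesian coordinates on a unit sphere. Straight-line (chord) distance in 3D grows monotonically with great-circle distance, so Euclidean nearest neighbours in that space match the nearest neighbours on the sphere. The sketch below assumes a spherical Earth (not the ellipsoid) with a 6371 km radius, uses hypothetical function names, and uses a brute-force search as a stand-in for a k-d tree call such as RANN::nn2(q_xyz, p_xyz, k = 1).

```r
# Convert degrees of longitude/latitude to unit-sphere Cartesian coordinates
lonlat_to_xyz <- function(lon, lat) {
  lam <- lon * pi / 180
  phi <- lat * pi / 180
  cbind(x = cos(phi) * cos(lam), y = cos(phi) * sin(lam), z = sin(phi))
}

# Brute-force Euclidean nearest neighbour of each row of p among the rows of q
nearest_xyz <- function(p, q) {
  vapply(seq_len(nrow(p)), function(i) {
    which.min(colSums((t(q) - p[i, ])^2))
  }, integer(1))
}

# Convert a chord length on the unit sphere back to a great-circle distance in km
chord_to_km <- function(chord, radius_km = 6371) {
  2 * asin(pmin(chord / 2, 1)) * radius_km
}

# Toy data: Paris as the query; London, Berlin, Madrid as candidates
p <- lonlat_to_xyz(2.35, 48.86)
q <- lonlat_to_xyz(c(-0.13, 13.40, -3.70), c(51.51, 52.52, 40.42))
idx <- nearest_xyz(p, q)                                # index of the nearest candidate
km <- chord_to_km(sqrt(sum((q[idx, ] - p[1, ])^2)))     # its great-circle distance
```

Because the sphere is only an approximation, distances can differ slightly from ellipsoidal ones, and near-ties between candidates may resolve differently.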