ipeagit / gtfs2gps

Convert GTFS data into a data.table with GPS-like records in R

Home Page: https://ipeagit.github.io/gtfs2gps/

License: Other

R 95.62% C++ 4.38%

gtfs transport public-transport rspatial gtfs-format gps-format r

gtfs2gps's Introduction

gtfs2gps: Converting public transport data from GTFS format to GPS-like records

gtfs2gps is an R package that converts public transportation data in GTFS format to GPS-like records in a data.frame/data.table, which can then be used in various applications such as running transport simulations or scenario analyses.

The core function of the package takes a GTFS.zip file and interpolates the space-time position of each vehicle in each trip considering the network distance and average speed between stops. The output is a data.table where each row represents the timestamp of each vehicle at a given spatial resolution. The package also has some functions to subset GTFS data in time and space and to convert both representations to simple feature format. More information about the methods used in the package can be found in this preprint.
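As a rough illustration of the interpolation idea described above (not the package's actual implementation), the sketch below assigns timestamps to points sampled along a shape between two stops, assuming constant speed between them. All values and variable names are made up for the example.

```r
# Sketch: interpolate timestamps at points along a shape between two stops.
# All values are made-up illustration data.
stop_cumdist <- c(0, 300)        # metres along the shape at the two stops
stop_time    <- c(25200, 25260)  # departure times in seconds since midnight

# Points sampled along the shape every 75 m
point_cumdist <- seq(0, 300, by = 75)

# Constant speed between the stops means time is linear in distance
point_time <- approx(x = stop_cumdist, y = stop_time, xout = point_cumdist)$y

gps_like <- data.frame(cumdist = point_cumdist, timestamp = point_time)
print(gps_like)
```

Each row of `gps_like` plays the role of one GPS-like record: a position along the route and the time at which the vehicle is assumed to be there.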

Installation

Please install the gtfs2gps package from CRAN to get the stable version.

install.packages("gtfs2gps")
library(gtfs2gps)

Vignette

Please see our vignette:

gtfs2gps: Converting GTFS data to GPS format

Credits

The gtfs2gps package is developed by a team at the Institute for Applied Economic Research (Ipea) with collaboration from the National Institute for Space Research (INPE), both from Brazil. You can cite this package as:

Pereira, R. H. M., Andrade, P. R., & Vieira, J. P. B. (2022). Exploring the time geography of public transport networks with the gtfs2gps package. Journal of Geographical Systems. https://doi.org/10.1007/s10109-022-00400-x

gtfs2gps's People

Contributors

Stargazers

Watchers

Forkers

pedro-andrade-inpe joaobazzo davan690 symbolixau stmarcin jimsforks mvpsaraiva abrac dhersz freyja-bt olivroy

gtfs2gps's Issues

Create Hex sticker

It would be nice to have a hex sticker for the package. I've created a draft proposal so we can have a starting point to discuss. Here is draft version 1:

I know, it's not great. I wanted to convey the output in data.table format but I recognized it should also have an element related to public transport that is still missing from the logo. Perhaps the acronym gtfs in the package name would suffice, or perhaps you have other suggestions.

The logo was created using the script ./man/figures/gtfs2gps_hexsticker.R, so please feel free to tweak the code and make your suggestions.

Parallelization using all CPUs

We have two tasks that use parallel computing: (1) some data.table operations, including fwrite and fread, and (2) the processing of multiple shape_ids in gtfs2gps.

The data.table operations are already running with all CPUs available because we have set

.onLoad = function(lib, pkg) {
  # Use GForce Optimisations in data.table operations
  # details > https://jangorecki.gitlab.io/data.cube/library/data.table/html/datatable-optimize.html
  options(datatable.optimize = Inf) # nocov
  
  # set number of threads used in data.table 
  data.table::setDTthreads(percent = 100) # nocov
}

I was wondering if we could also set the core argument of gtfs2gps default to use all logical CPUs available.

core = getDTthreads(verbose)

What do you think?
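One way such a default could be resolved is sketched below. The helper name `default_cores` is hypothetical and only uses the base `parallel` package; it falls back to a single core when the number of CPUs cannot be detected.

```r
# Hypothetical helper: pick a default number of cores.
default_cores <- function(requested = NULL) {
  available <- parallel::detectCores(logical = TRUE)
  if (is.na(available)) available <- 1L  # detectCores() can return NA
  if (is.null(requested)) return(as.integer(available))
  # never use fewer than 1 or more than the available cores
  as.integer(max(1L, min(requested, available)))
}

default_cores()   # all logical CPUs detected on this machine
default_cores(1)  # an explicit request is respected
```

Capping an explicit request at the detected number of CPUs is a design choice of this sketch, not something the issue asks for.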

better solution for midnight trips

Implement a better solution for the midnight trips. Currently they are removed (see #9).

check update_dt function

for single gtfs files

filter by agency_id

Similar to how filter_by_shape_id works, could you implement a way to filter by agency_id? Obviously there's a roundabout way to get all the shape_ids for an agency, but that can lead to extremely long vectors, as I've experienced with the Berlin GTFS, where there are over 3000 ids for the agencies I was interested in.
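A minimal sketch of what such a filter might do, assuming the usual GTFS links agency -> routes -> trips. The helper `filter_agency_sketch` and the toy tables are hypothetical, and a real implementation would also need to subset stop_times, shapes and the other tables.

```r
# Toy GTFS tables (illustration data, not a real feed)
gtfs <- list(
  routes = data.frame(route_id  = c("r1", "r2", "r3"),
                      agency_id = c("A", "A", "B"),
                      stringsAsFactors = FALSE),
  trips  = data.frame(trip_id  = c("t1", "t2", "t3", "t4"),
                      route_id = c("r1", "r2", "r3", "r3"),
                      shape_id = c("s1", "s2", "s3", "s3"),
                      stringsAsFactors = FALSE)
)

# Hypothetical helper: keep only the routes and trips of the given agencies
filter_agency_sketch <- function(gtfs, agency_ids) {
  gtfs$routes <- gtfs$routes[gtfs$routes$agency_id %in% agency_ids, , drop = FALSE]
  gtfs$trips  <- gtfs$trips[gtfs$trips$route_id %in% gtfs$routes$route_id, , drop = FALSE]
  gtfs
}

only_a <- filter_agency_sketch(gtfs, "A")
unique(only_a$trips$shape_id)
```

This avoids building a long vector of shape_ids by hand, because the agency is resolved to trips inside the function.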

Small issues with data.table::as.ITime

The as.ITime function seems to have an issue with non-integer values

> data.table::as.ITime(0.5)
[1] "00:00:00"
 > data.table::as.ITime(0.89)
[1] "00:00:00"
> data.table::as.ITime(0.99)
[1] "00:00:00"
> data.table::as.ITime(2.90)
[1] "00:00:02"

Perhaps we should wrap the value in round() before calling as.ITime in all scripts (such as gtfs2gps_dt_parallel.R)

> data.table::as.ITime(round(0.5)) # still with problems
[1] "00:00:00"
> data.table::as.ITime(round(0.99))
[1] "00:00:01"
> data.table::as.ITime(round(2.8))
[1] "00:00:03"
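The outputs above are consistent with as.ITime() dropping the fractional part of a numeric input, and with R's round() rounding halves to the even neighbour, which is why round(0.5) is still 0. The base-R sketch below compares the three behaviours on plain numbers; using floor(x + 0.5) is only one possible workaround and is an assumption of this example.

```r
x <- c(0.5, 0.89, 0.99, 2.5, 2.9)

truncated <- as.integer(x)               # drops the fractional part
half_even <- round(x)                    # halves go to the even neighbour
half_up   <- as.integer(floor(x + 0.5))  # halves always go up

data.frame(x, truncated, half_even, half_up)
```

With half-up rounding, a value of 0.5 seconds becomes 1 second instead of 0.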

Issues on Fortaleza dataset

I've been running tests on the gtfs_for_etufor_2019-10.zip file.
Specifically, for shape_id shape073-V, the stop_times differ across trip_ids, whereas our script assumed the same stop sequence for every trip_id.

 # get all trips linked to that route
 >     trips_temp <- gtfs_data$trips[shape_id == "shape073-V" & route_id == routeid, ]
 >     nrow(trips_temp)
 # [1] 197
 >     all_tripids <- unique(trips_temp$trip_id)
 >     length(all_tripids)
 # [1] 197

If we check the number of valid stop_times per trip_id, there will be different values during the day

 >     nstop <- sapply(seq_along(all_tripids),function(i){nrow(gtfs_data$stop_times[trip_id == all_tripids[i],])}) 
 >     nstop
   # [1]  2  2  2  2  2  2  2  2 43  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 43 43 43 43 43 43 43 43  2  2  2  2  2  2  2  2
  [41] 43 43 43 43 43 43 43 43  2  2  2  2  2  2  2  2 43 43 43 43 43 43 43 43  2  2  2  2  2  2 43 43 43 43 43 43 43 43 43 43
  [81]  2  2  2  2  2 43 43 43 43 43  2  2  2  2  2 43 43 43 43 43 43  2  2  2  2 43 43 43 43 43 43  2  2  2  2  2 43 43 43 43
 [121] 43 43  2  2  2  2  2  2 43 43 43 43 43 43  2  2  2  2  2 43 43 43 43 43 43 43  2  2  2  2  2  2  2 43 43 43 43 43 43 43
 [161] 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43
 > table(nstop)
 nstop
   2  43 
  82 115

One idea is to consider the maximum number of stop_times per trip_id to get the correct stop sequence to that route, which means

instead of using simply

  stops_seq <- gtfs_data$stop_times[trip_id == all_tripids[1], .(stop_id, stop_sequence)]

we would do

 nstop <- sapply(seq_along(all_tripids),function(i){nrow(gtfs_data$stop_times[trip_id == all_tripids[i],])}) 
 stops_seq <- gtfs_data$stop_times[trip_id == all_tripids[which.max(nstop)], .(stop_id, stop_sequence)]

Checking gtfs_for_etufor_2019-10.zip, there are actually 105 shape_ids out of 675 whose trip_ids have different numbers of stop_times.

Data outputs larger than RAM memory

I anticipate that we will soon be using our package on GTFS data sets that generate outputs larger than the available RAM. I believe we can address this issue after the first version of the package is published on CRAN.

For now, I just would like to flag that the disk.frame package sounds like a promising approach to deal with this issue.

Fix progress bar in gtfs2gps_dt_parallel.R

improve time filter in `filter_day_period`

As of today, the function filter_day_period.R only works with frequency-based GTFS feeds when the interval between start_period and end_period is a whole number of hours (1 hour, 2 hours, 3 hours, etc.). If the interval is '01:20', for example, the function will only keep the trip sections within the first full hour.

One way to solve this would be to round down start_period and round up end_period to full hours. It seems the best approach would be to convert x_period and departure_time to numeric and run all time operations on numeric vectors. We would only convert departure_time back to ITime() in the final output of the functions.

P.S. This approach might improve the package's efficiency, because the as.ITime() conversion takes quite some time.
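The rounding idea can be sketched with plain numeric seconds, as below. The helper `to_seconds` and the sample times are hypothetical; this only illustrates the arithmetic, not the package's filter.

```r
# Hypothetical helper: "HH:MM" or "HH:MM:SS" to seconds since midnight
to_seconds <- function(x) {
  parts <- lapply(strsplit(x, ":", fixed = TRUE), as.numeric)
  vapply(parts, function(p) {
    p <- c(p, rep(0, 3 - length(p)))  # pad missing minutes/seconds with 0
    p[1] * 3600 + p[2] * 60 + p[3]
  }, numeric(1))
}

# Round the start down and the end up to full hours
period_start <- floor(to_seconds("07:10") / 3600) * 3600
period_end   <- ceiling(to_seconds("08:30") / 3600) * 3600

# All comparisons run on numeric vectors
departure <- to_seconds(c("06:59:59", "07:00:00", "08:45:00", "09:00:01"))
keep <- departure >= period_start & departure <= period_end
```

Only the rows that survive the filter would need to be converted back to a time class at the end.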

Remove dependency to tidytransit?

Should we remove the dependency to tidytransit and use only data.tables in the package?

Filtering for time of day

I'm unsure whether this is the same as what is mentioned in #44, but an option to filter for a specific period during the day could be interesting.

code coverage

Implement code coverage using covr.

Display messages when converting gtfs to gps in function gtfs2gps

As the conversion from GTFS to GPS may take a long time to complete, it would be nice to display messages reporting which step is currently running.

strange bug when running check

There is a strange error when running check. For poa data, ID "R10-2", two points are not being created. When only the tests are run it works perfectly. See the figure below. The only two missing points become more than a hundred points (in these two locations), as gtfs stores all the trips individually.

new parameter file to gtfs2gps

Add a parameter file to gtfs2gps() in order to save the output into a file instead of returning a data.table. This will be useful for very large datasets.

Make time operations using data.table::as.ITime

Time operations are currently performed using fasttime::fastPOSIXct. Replacing this with data.table::as.ITime should make the code faster and reduce package dependencies.

Distance between interpolated points

Although our script sets a spatial resolution of 15 m, our interpolated stop times table displays different values for dist

> new_stoptimes
       shape_id  id route_type shape_pt_lon shape_pt_lat stop_id stop_sequence      dist
  1: shape804-I   1          3    -38.47761    -3.726532    2649             1 11.218464
  2: shape804-I   2          3    -38.47767    -3.726615      NA            NA 11.218464
  3: shape804-I   3          3    -38.47773    -3.726697      NA            NA 13.953559
  4: shape804-I   4          3    -38.47762    -3.726766      NA            NA 13.953559
  5: shape804-I   5          3    -38.47752    -3.726835      NA            NA 13.953559
 ---                                                                                    
363: shape804-I 363          3    -38.48439    -3.738244      NA            NA 13.422701
364: shape804-I 364          3    -38.48444    -3.738353      NA            NA 13.422701
365: shape804-I 365          3    -38.48450    -3.738462      NA            NA  7.889898
366: shape804-I 366          3    -38.48453    -3.738523      NA            NA  7.889898
367: shape804-I 367          3    -38.48457    -3.738583    6079            13       NaN

I've tried changing the coordinate system to UTM before estimating the distances, but the results were the same. Also, calling st_distance() on consecutive points gives similar values. I think it has something to do with the st_cast function, but so far I couldn't figure it out.
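One plausible explanation for the repeated values in the table above, offered here only as an assumption, is that segmentizing splits each original shape segment into equal pieces no longer than the resolution, so every segment ends up with its own step length. The sketch below reproduces that arithmetic with made-up segment lengths; `piece_length` is a hypothetical helper.

```r
# Hypothetical helper: split a segment of length `len` (metres) into equal
# pieces that are no longer than `max_len`
piece_length <- function(len, max_len) {
  n_pieces <- ceiling(len / max_len)
  len / n_pieces
}

# Made-up lengths of four original shape segments, in metres
segment_lengths <- c(22.44, 41.86, 40.27, 15.78)

steps <- piece_length(segment_lengths, max_len = 15)
steps
```

Under this reading, 15 m acts as an upper bound on the spacing rather than the exact distance between consecutive points.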

gtfs2gps_dt_parallel() NaN dist/cumdist

There is a small bug in gtfs2gps_dt_parallel() that produces NaN values in dist and cumdist.

require(gtfs2gps)
poa <- system.file("extdata/poa.zip", package="gtfs2gps")

poa_gps <- gtfs2gps_dt_parallel(poa)
poa_gps

See the last line of the output below:

        stop_id stop_sequence      dist shape_id      cumdist
     1:      NA            NA  7.725909     T2-1     7.725909
     2:      NA            NA  7.725909     T2-1    15.451819
     3:      NA            NA 12.940440     T2-1    28.392258
     4:      NA            NA 12.940440     T2-1    41.332698
     5:      NA            NA 12.940440     T2-1    54.273138
    ---                                                      
309473:      NA            NA 14.539346    R10-2 26711.946687
309474:      NA            NA 14.539346    R10-2 26726.486033
309475:      NA            NA 14.539346    R10-2 26741.025379
309476:     433            40 14.539346    R10-2 26755.564725
309477:      NA            NA       NaN    R10-2          NaN

> which(is.nan(poa_gps$cumdist)) %>% length()
[1] 194

Identifying week days

Depending on the GTFS feed, the way to identify week days differs. BH (Belo Horizonte) is an example that does not have the calendar file, so the week days need to be extracted from calendar_dates.
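A minimal sketch of how week days could be derived from calendar_dates when there is no calendar file. The toy table is illustration data; in GTFS, the date column uses the YYYYMMDD format and exception_type 1 means the service was added on that date.

```r
# Toy calendar_dates table (illustration data)
calendar_dates <- data.frame(
  service_id     = c("wk", "wk", "sat"),
  date           = c("20191007", "20191008", "20191012"),
  exception_type = c(1L, 1L, 1L),
  stringsAsFactors = FALSE
)

dates <- as.Date(calendar_dates$date, format = "%Y%m%d")

# POSIXlt wday: 0 = Sunday ... 6 = Saturday (independent of the locale)
calendar_dates$wday <- as.POSIXlt(dates)$wday
calendar_dates$is_weekday <- calendar_dates$wday %in% 1:5

# Services that run on at least one Monday-to-Friday date
weekday_services <- unique(
  calendar_dates$service_id[calendar_dates$is_weekday &
                              calendar_dates$exception_type == 1L]
)
weekday_services
```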

Argument `spatial_resolution` not working

It seems that the spatial_resolution argument is not working.

# gtfs zip
poa <- system.file("extdata/poa.zip", package="gtfs2gps" )

# running gtfs2gps
poa15 <- gtfs2gps(poa, progress = T, spatial_resolution = 15, cores = getDTthreads())
poa30 <- gtfs2gps(poa, progress = T, spatial_resolution = 30, cores = getDTthreads())
poa60 <- gtfs2gps(poa, progress = T, spatial_resolution = 60, cores = getDTthreads())

# test
nrow(poa15) == nrow(poa30) 
> TRUE
nrow(poa30) == nrow(poa60)
> TRUE

Departure time as an argument

Calculate a single route between a start and an end station, departing at or after a specified time. This might be useful.

Suggestions for gtfs2gps_dt_single()

I have two suggestions for gtfs2gps_dt_single():

1. Rename it to simply gtfs2gps(), as we now will have only one function.
2. Remove the argument week_days. It might be interesting in order to work with large data to allow the user to execute this kind of function before running gtfs2gps() and save it separately. We could for instance have a progress bar for this and other filter functions as well.

Removing lwgeom dependency

I've noticed we are not using lwgeom::st_make_valid anywhere in the package. I guess it's OK to remove the lwgeom dependency, right? Or did you guys have some use in mind?

consider using geodist for distance calculations

Consider importing the geodist package for the distance calculations here.

Calculate average speed between stops

Calculate average speed between stops, instead of average speed for the whole trip. This could give more accurate estimates.
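The difference between the two approaches can be sketched with base R as follows. The distances and times are made-up, and the factor 3.6 converts m/s to km/h, as in the speed expression used elsewhere in the package code.

```r
# Made-up stop records for one trip
cumdist_m  <- c(0, 500, 1500, 1800)          # metres along the shape
depart_sec <- c(25200, 25290, 25410, 25500)  # seconds since midnight

# Average speed for each pair of consecutive stops, in km/h
segment_kmh <- 3.6 * diff(cumdist_m) / diff(depart_sec)

# Single average speed for the whole trip, in km/h
trip_kmh <- 3.6 * (max(cumdist_m) - min(cumdist_m)) /
  (max(depart_sec) - min(depart_sec))

segment_kmh
trip_kmh
```

In this toy trip the per-segment speeds vary widely around the trip-wide average, which is the detail a whole-trip average hides.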

bug: not skipping

Apparently, the gtfs2gps function is not skipping shape_ids with missing routes/trips. Here is a reproducible example of the problem, which occurs when it hits the shape_id 8700-21-0:

library(gtfs2gps)
library(magrittr)
library(data.table)
library(sf)

# local GTFS.zip
spo_zip <- system.file("extdata/saopaulo.zip", package="gtfs2gps" )

# read gtfs
spo_gtfs <- gtfs2gps::read_gtfs(spo_zip)

# subset time interval
spo_gtfs_f <- gtfs2gps::filter_day_period(spo_gtfs, period_start = "07:00:", period_end = "07:30")
  
# Convert GTFS data into a data.table with GPS-like records
spo_gps <- gtfs2gps2(spo_gtfs_f, spatial_resolution = 15, progress = T, cores = 1 )

The following Trip_ids have been ignored due to missing data in original gtfs.zip: 8700-21-0
Error in FUN(X[[i]], ...) : object 'departure_time' not found

More Berlin gtfs2gps function issues

After you made some updates I again dove into the Berlin gtfs file.

The new filter_day_period function seems to work as intended and is a welcome addition. Sadly, though, I'm still getting errors.

berlin <- read_gtfs("berlin.zip")

berlin <- filter_valid_stop_times(berlin)

berlin_s <- filter_by_agency_id(berlin, 1)

berlin_5 <- filter_day_period(berlin_s, period_start = "04:00", period_end = "05:00")

berlin_gps <- gtfs2gps(berlin_5, progress = TRUE, spatial_resolution = 150)

22. stop(err$message, call. = FALSE)
21. .checkTypos(e, names_x)
20. value[[3L]](cond)
19. tryCatchOne(expr, names, parentenv, handlers[[1L]])
18. tryCatchList(expr, classes, parentenv, handlers)
17. tryCatch(eval(.massagei(isub), x, ienv), error = function(e) .checkTypos(e, names_x))
16. `[.data.table`(new_stoptimes, a:b, `:=`(speed, 3.6 * (data.table::last(cumdist) - data.table::first(cumdist))/(data.table::last(departure_time) - data.table::first(departure_time))))
15. new_stoptimes[a:b, `:=`(speed, 3.6 * (data.table::last(cumdist) - data.table::first(cumdist))/(data.table::last(departure_time) - data.table::first(departure_time)))]
14. FUN(X[[i]], ...)
13. lapply(X = 1:(L - 1), FUN = update_speeds)
12. FUN(X[[i]], ...)
11. lapply(X = all_tripids, FUN = update_dt, new_stoptimes, gtfs_data, all_tripids)
10. eval(lhs, parent, parent)
9. eval(lhs, parent, parent)
8. lapply(X = all_tripids, FUN = update_dt, new_stoptimes, gtfs_data, all_tripids) %>% data.table::rbindlist()
7. FUN(X[[i]], ...)
6. lapply(X[Split[[i]]], FUN, ...)
5. pbapply::pblapply(X = all_shapeids, FUN = corefun)
4. eval(lhs, parent, parent)
3. eval(lhs, parent, parent)
2. pbapply::pblapply(X = all_shapeids, FUN = corefun) %>% data.table::rbindlist()
1. gtfs2gps(berlin_5, progress = TRUE, spatial_resolution = 150)

Changing filter_day_period(berlin_s, period_start = "04:00", period_end = "05:00") to filter_day_period(berlin_s, period_start = "03:00", period_end = "04:00") produces no error messages, though I can't tell you why that is.

What to do with BH data?

Belo Horizonte data does not have shape, but it has stops with lat/long. Should we ignore it? Or could we allow loading the data? Currently it stops with an error. We can add an argument to read_gtfs():

#' @param stopMissingFile Stop if find a missing file? The default value is TRUE.
#' Note that if this value is false, the resulting data might not work properly
#' in the other functions of the package.

Simplify update_freq()?

@Joaobazzo, @rafapereirabr , is it possible to simplify update_freq() to use update_dt() or am I missing something? The idea would be to have a code as follows:

update_freq <- function(tripid, new_stoptimes, gtfs_data){
  new_stoptimes <- update_dt(tripid, new_stoptimes, gtfs_data) # several lines become one

  #  Get freq info for that trip
  # tripid <- "148L-10-0"
  freq_temp <- subset(gtfs_data$frequencies, trip_id== tripid)

tests fail in R CMD check but pass in devtools::test()

Not sure why this is happening. Perhaps this could give us a clue to the solution. In the gtfs2gps.Rcheck\tests_i386\testthat.Rout.fail there is this error message: DLL 'sf' not found: maybe not installed for this architecture?

function to check consistency of GTFS file

Implement a function to check if a given GTFS is consistent. It could show for instance if there are some trips that do not have a valid shape_id.
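One of the checks mentioned above could look like the sketch below, which lists trips whose shape_id is missing or absent from the shapes table. The helper `check_trip_shapes` and the toy tables are hypothetical.

```r
# Toy GTFS tables (illustration data)
gtfs <- list(
  trips  = data.frame(trip_id  = c("t1", "t2", "t3"),
                      shape_id = c("s1", "s9", NA),
                      stringsAsFactors = FALSE),
  shapes = data.frame(shape_id          = c("s1", "s1", "s2"),
                      shape_pt_sequence = c(1, 2, 1),
                      stringsAsFactors = FALSE)
)

# Hypothetical helper: return the trip_ids without a valid shape_id
check_trip_shapes <- function(gtfs) {
  ok <- !is.na(gtfs$trips$shape_id) &
    gtfs$trips$shape_id %in% gtfs$shapes$shape_id
  gtfs$trips$trip_id[!ok]
}

bad_trips <- check_trip_shapes(gtfs)
bad_trips
```

Returning the offending ids, rather than stopping with an error, lets the caller decide whether to drop those trips or fix the feed.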

fix package parallelization

I'm moving the parallelization to use the furrr package.

Improving speed of `gtfs_shapes_as_sf`

I've made some tests with the new sfheaders package and it looks like it could more than halve the computation time of our gtfs_shapes_as_sf function. Quick benchmark below:

library(gtfs2gps)
library(sfheaders)
library(data.table)
library(tidytransit)
library(magrittr)


### read data set

small_shape <- data.table::fread("https://raw.githubusercontent.com/rafapereirabr/data_dump/master/shapes_small.csv")
# poa <- read_gtfs(system.file("extdata/poa.zip", package="gtfs2gps"))
# small_shape <- poa$shapes

### prepare functions
dt2sf <- function(shapes, crs = 4326){
  # sort data
  temp_shapes <- setDT(shapes)[order(shape_id, shape_pt_sequence)]
  
  # convert to sf
  temp_shapes <- temp_shapes[,
                             {
                               geometry <- sf::st_linestring(x = matrix(c(shape_pt_lon, shape_pt_lat), ncol = 2))
                               geometry <- sf::st_sfc(geometry)
                               geometry <- sf::st_sf(geometry = geometry)
                             }
                             , by = shape_id
                             ]
  
  temp_shapes <- sf::st_as_sf(temp_shapes, crs = crs)
  
  # calculate distances
  data.table::setDT(temp_shapes)[, length := sf::st_length(geometry) %>% units::set_units("km") ] 
  
  # back to sf
  temp_shapes <- sf::st_sf(temp_shapes)
  return(temp_shapes)
}



sfheaders_test <- function(shapes, crs = 4326){
  # sort data
  temp_shapes <- setDT(shapes)[order(shape_id, shape_pt_sequence)]
  
  # convert to sf
  temp_shapes <- sfheaders::sf_linestring(temp_shapes, x = "shape_pt_lon" , y = "shape_pt_lat", linestring_id = "shape_id")
  
  # add projection
  st_crs(temp_shapes) <- crs
  
  # calculate distances
  data.table::setDT(temp_shapes)[, length := sf::st_length(geometry) %>% units::set_units("km") ] 
  
  # back to sf
  temp_shapes <- sf::st_sf(temp_shapes)
  
}



### Benchmark
mbm <- microbenchmark::microbenchmark(times = 20,
                                      
                                      'dt' = { dt2sf(small_shape, crs = 4326) },
                                      
                                      'sfheaders' = { sfheaders_test(small_shape, crs = 4326) }
                                      )

ggplot2::autoplot(mbm)

dplyr dependency

The DESCRIPTION file imports dplyr. As far as I remember, we do not need this, right? Shall we remove it?

optimize st_snap_points

The stops data are not spatially ordered according to their respective shapes. The code below shows that neither (1) their row order nor (2) their stop_id guarantees such order. I will then just reimplement st_snap_points in C to speed up the execution.

require(dplyr)
require(gtfs2gps)
require(sf)

gtfs <- read_gtfs(system.file("extdata/poa.zip", package="gtfs2gps")) %>%
  filter_by_shape_id("T2-1")

shapes <- gtfs_shapes_as_sf(gtfs)
stops <- gtfs_stops_as_sf(gtfs)
plot(st_geometry(shapes))
plot(st_geometry(stops), add=T)

gtfs$stops <- gtfs$stops %>% dplyr::filter(stop_id < 2000) # by stop_id
#gtfs$stops <- gtfs$stops[1:20, ] # by rows

stops <- gtfs_stops_as_sf(gtfs)
plot(st_geometry(stops), add=T, col="red", pch=16)

How to deal with routes that are closed circuits represented on a single line?

This is the case of the bus route with shape_id==52936 in the Sao Paulo data set. Here is the shape:

It looks like a simple route but it's actually a circuit. This causes a serious problem when we try to create a new stop_times table with high resolution, because it messes up the stop sequence, like this:

    shape_id id route_type shape_pt_lon shape_pt_lat   stop_id stop_sequence
 1:    52936  1          3    -46.57452    -23.48741 940003734             1
 2:    52936  2          3    -46.57441    -23.48737        NA            NA
 3:    52936  3          3    -46.57430    -23.48732        NA            NA
 4:    52936  4          3    -46.57419    -23.48727        NA            NA
 5:    52936  5          3    -46.57411    -23.48723        NA            NA
 6:    52936  6          3    -46.57403    -23.48719        NA            NA
 7:    52936  7          3    -46.57392    -23.48715        NA            NA
 8:    52936  8          3    -46.57382    -23.48711        NA            NA
 9:    52936  9          3    -46.57371    -23.48707        NA            NA
10:    52936 10          3    -46.57366    -23.48705        NA            NA
11:    52936 11          3    -46.57356    -23.48701        NA            NA
12:    52936 12          3    -46.57345    -23.48696        NA            NA
13:    52936 13          3    -46.57332    -23.48694        NA            NA
14:    52936 14          3    -46.57320    -23.48691        NA            NA
15:    52936 15          3    -46.57310    -23.48689        NA            NA
16:    52936 16          3    -46.57304    -23.48683        NA            NA
17:    52936 17          3    -46.57299    -23.48677        NA            NA
18:    52936 18          3    -46.57291    -23.48671        NA            NA
19:    52936 19          3    -46.57282    -23.48665        NA            NA
20:    52936 20          3    -46.57272    -23.48659        NA            NA
21:    52936 21          3    -46.57261    -23.48653        NA            NA
22:    52936 22          3    -46.57254    -23.48649        NA            NA
23:    52936 23          3    -46.57247    -23.48646        NA            NA
24:    52936 24          3    -46.57239    -23.48642        NA            NA
25:    52936 25          3    -46.57230    -23.48638        NA            NA
26:    52936 26          3    -46.57217    -23.48633        NA            NA
27:    52936 27          3    -46.57223    -23.48621        NA            NA
28:    52936 28          3    -46.57230    -23.48609        NA            NA
29:    52936 29          3    -46.57234    -23.48599        NA            NA
30:    52936 30          3    -46.57239    -23.48590   9412609            32
31:    52936 31          3    -46.57244    -23.48580   9412610             2
32:    52936 32          3    -46.57249    -23.48571        NA            NA
33:    52936 33          3    -46.57254    -23.48561        NA            NA
34:    52936 34          3    -46.57260    -23.48550        NA            NA
35:    52936 35          3    -46.57265    -23.48541        NA            NA
36:    52936 36          3    -46.57270    -23.48533        NA            NA
37:    52936 37          3    -46.57275    -23.48524        NA            NA
38:    52936 38          3    -46.57278    -23.48519        NA            NA
39:    52936 39          3    -46.57282    -23.48511        NA            NA
40:    52936 40          3    -46.57286    -23.48503        NA            NA
41:    52936 41          3    -46.57291    -23.48493        NA            NA
42:    52936 42          3    -46.57296    -23.48486        NA            NA
43:    52936 43          3    -46.57301    -23.48476        NA            NA
44:    52936 44          3    -46.57307    -23.48465        NA            NA
45:    52936 45          3    -46.57313    -23.48455        NA            NA
46:    52936 46          3    -46.57319    -23.48444        NA            NA
47:    52936 47          3    -46.57324    -23.48435   9412651            31
48:    52936 48          3    -46.57329    -23.48426        NA            NA
49:    52936 49          3    -46.57334    -23.48417        NA            NA
50:    52936 50          3    -46.57338    -23.48409        NA            NA
51:    52936 51          3    -46.57343    -23.48400        NA            NA
52:    52936 52          3    -46.57348    -23.48392        NA            NA
53:    52936 53          3    -46.57354    -23.48381        NA            NA
54:    52936 54          3    -46.57360    -23.48370        NA            NA
55:    52936 55          3    -46.57366    -23.48359        NA            NA
56:    52936 56          3    -46.57372    -23.48349        NA            NA
57:    52936 57          3    -46.57378    -23.48338        NA            NA
58:    52936 58          3    -46.57384    -23.48327        NA            NA
59:    52936 59          3    -46.57390    -23.48316        NA            NA
60:    52936 60          3    -46.57396    -23.48306        NA            NA
61:    52936 61          3    -46.57401    -23.48296        NA            NA
62:    52936 62          3    -46.57407    -23.48285        NA            NA
63:    52936 63          3    -46.57413    -23.48275        NA            NA
64:    52936 64          3    -46.57419    -23.48265        NA            NA
65:    52936 65          3    -46.57424    -23.48254        NA            NA
66:    52936 66          3    -46.57429    -23.48245        NA            NA
67:    52936 67          3    -46.57434    -23.48236 940003982            30
68:    52936 68          3    -46.57439    -23.48227        NA            NA
69:    52936 69          3    -46.57444    -23.48218        NA            NA
70:    52936 70          3    -46.57450    -23.48206        NA            NA
71:    52936 71          3    -46.57457    -23.48195 940003983             4
72:    52936 72          3    -46.57463    -23.48184        NA            NA

Probably a bug in filter_day_period

In theory, the result should be identical whether (1) we filter a day period of a gtfs dataset before converting it to a gps-like data.table, or (2) we convert a gtfs to a gps-like data.table and then filter a day period. I've tried the test below and the results are very different. This should give us an idea of where and why filter_day_period is not working properly.

library(gtfs2gps)
library(data.table)


# read local GTFS.zip
gtfs_zip <- system.file("extdata/fortaleza.zip", package="gtfs2gps" )
gtfs_dt <- gtfs2gps::read_gtfs(gtfs_zip)


### 1) Filter before gtfs2gps
  
  # subset time interval
  gtfs_filtered <- gtfs2gps::filter_day_period(gtfs_dt, period_start = "07:00:", period_end = "07:59")
  
  # Convert GTFS data into a data.table with GPS-like records
  gps_before <- gtfs2gps::gtfs2gps(gtfs_filtered, spatial_resolution = 15, progress = T, parallel = T )

  
### 2) Filter after gtfs2gps
  
  # Convert GTFS data into a data.table with GPS-like records
  gps <- gtfs2gps::gtfs2gps(gtfs_dt, spatial_resolution = 15, progress = T, parallel = T )

  # subset time interval
  gps_after <- gps[ between(departure_time, as.ITime("07:00:"), as.ITime("07:59"))]
  
  
  nrow(gps_before)
  nrow(gps_after)

Create function to filter gtfs by period of the day

bounding box problem while converting gtfs data to sf format

When converting GTFS data to sf format, the bounding box is wrong (therefore plotting does not work properly). This might indicate a problem in sf, but requires further investigation. For now, the code will convert to spatial and then to sf to fix this problem. See the code below to reproduce the problem. When the converted data is saved into a shp and then loaded again, the bounding box is correct.

require(magrittr)

gtfs <- gtfs2gps::read_gtfs(system.file("extdata/poa.zip", package="gtfs2gps"))
crs = 4326

temp_shapes <- gtfs$shapes[,
                           {
                             geometry <- sf::st_linestring(x = matrix(c(shape_pt_lon, shape_pt_lat), ncol = 2))
                             geometry <- sf::st_sfc(geometry)
                             geometry <- sf::st_sf(geometry = geometry)
                           }
                           , by = shape_id
                           ]

temp_shapes[, length := sf::st_length(geometry) %>% units::set_units("km"), by = shape_id]

result <- sf::st_as_sf(temp_shapes, crs = crs)

sf::st_write(result, "myresult.shp")
shp <- sf::st_read("myresult.shp")

sf::st_bbox(result)
sf::st_bbox(shp)

result %>% sf::as_Spatial() %>% sf::st_as_sf() %>% sf::st_bbox()

merge two GTFS files

Implement a function to merge two GTFS files. Be careful not to have duplicated IDs.
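One simple way to avoid clashing IDs, shown here only as a sketch, is to prefix every `*_id` column with a feed-specific tag before binding the tables. The helper `prefix_ids` and the toy trips tables are hypothetical.

```r
# Hypothetical helper: prefix every column whose name ends in "_id"
prefix_ids <- function(tbl, prefix) {
  id_cols <- grep("_id$", names(tbl), value = TRUE)
  for (col in id_cols) tbl[[col]] <- paste0(prefix, tbl[[col]])
  tbl
}

# Toy trips tables from two feeds that reuse trip_id "1"
trips_a <- data.frame(trip_id = c("1", "2"), route_id = c("10", "10"),
                      stringsAsFactors = FALSE)
trips_b <- data.frame(trip_id = c("1", "3"), route_id = c("10", "20"),
                      stringsAsFactors = FALSE)

merged <- rbind(prefix_ids(trips_a, "a_"), prefix_ids(trips_b, "b_"))
anyDuplicated(merged$trip_id)  # 0 means no duplicated trip_id
```

The same prefix would have to be applied consistently to every table of a feed so that references such as route_id and shape_id still match after merging.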

Improve package efficiency

Here is a list of some of the most time-consuming lines of code in the package so we can prioritize:

script gtfs2gps.R (line 88):

sf::st_cast("LINESTRING") # I already know how to improve this. Fixing this one soon

script mod_updates.R (lines 108-115)

  new_stoptimes[speed_valid,speed := {lapply(ranges, function(x){
    new_stoptimes[x,][,speed := {
      dt = data.table::last(departure_time) - data.table::first(departure_time)
      ds = data.table::last(cumdist) - data.table::first(cumdist)
      v = 3.6 * ds / dt
      list(v = v)
    }][-nrow(new_stoptimes[x,]), "speed"] 
  }) %>% data.table::rbindlist()}]

script mod_updates.R (line 136)

new_stoptimes[, departure_time := data.table::as.ITime(round(departure_time[1L] + lag(cumtime, 1, 0)))]

I still need to profvis the update_freq function

Package not passing tests

I know, my last commit #a902a00 broke the core function of the package. Sorry :]

It seems to be related to this change I did between lines 80 and 91, but this needs to be investigated further.

    # # old (slower) version
    # new_shape <- subset(shapes_sf, shape_id == shapeid) %>%
    #   sf::st_segmentize(spatial_resolution) %>%
    #   sf::st_cast("LINESTRING") %>%
    #   sf::st_cast("POINT", warn = FALSE)  %>% 
    #   sf::st_sf()

    # new faster version using sfheaders
    new_shape <- subset(shapes_sf, shape_id == shapeid) %>%
      sf::st_segmentize(spatial_resolution) %>%
      sfheaders::sf_to_df(fill = T) %>%
      sfheaders::sf_point( x = "x", y="y", keep = T)

Berlin gtfs2gps function issues

I saw your package mentioned on Twitter and wanted to try it out with the Berlin data available here: https://openmobilitydata.org/p/verkehrsverbund-berlin-brandenburg/213

To me it looked like the data was fine, but sadly I wasn't able to make it work. There were two issues I found.

The first one leads to an empty file.

berlin <- read_gtfs("gtfs.zip")

berlin_small_1 <- filter_by_shape_id(berlin, c(622))

berlin_gps_empty <- gtfs2gps(berlin_small_1, progress = TRUE, cores = 1, spatial_resolution = 15)

The second one brings up an error.

berlin <- read_gtfs("gtfs.zip")

berlin_small_2 <- filter_by_shape_id(berlin, c(2655))

berlin_gps_error <- gtfs2gps(berlin_small_2, progress = TRUE, cores = 1, spatial_resolution = 15)


Error in head(newstop_t0, 1):(tail(newstop_t0, 1) - 1) : argument of length 0

13. update_dt(tripid, new_stoptimes, gtfs_data)
12. FUN(X[[i]], ...)
11. lapply(X = all_tripids, FUN = update_freq, new_stoptimes, gtfs_data)
10. eval(lhs, parent, parent)
9. eval(lhs, parent, parent)
8. lapply(X = all_tripids, FUN = update_freq, new_stoptimes, gtfs_data) %>% data.table::rbindlist()
7. FUN(X[[i]], ...)
6. lapply(X[Split[[i]]], FUN, ...)
5. pbapply::pblapply(X = all_shapeids, FUN = corefun)
4. eval(lhs, parent, parent)
3. eval(lhs, parent, parent)
2. pbapply::pblapply(X = all_shapeids, FUN = corefun) %>% data.table::rbindlist()
1. gtfs2gps(berlin_small_2, progress = TRUE, cores = 1, spatial_resolution = 15)

This happens even though, to me, the gtfs looked more or less the same as the São Paulo one.

warning and parallel execution of gtfs2gps

When running gtfs2gps() in parallel, all warnings related to inconsistencies in the data are suppressed. Maybe we could have a log file to store them.
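A log file of this kind could be built with base R's condition handling, as in the sketch below. The wrapper `run_with_warning_log` and the `noisy` function are hypothetical, and the example runs sequentially; with several workers appending to the same file, lines from different workers may interleave.

```r
# Hypothetical wrapper: run fun(...), append any warnings to a log file,
# and keep them from being printed
run_with_warning_log <- function(fun, ..., log_file) {
  withCallingHandlers(
    fun(...),
    warning = function(w) {
      cat(conditionMessage(w), "\n", sep = "", file = log_file, append = TRUE)
      invokeRestart("muffleWarning")
    }
  )
}

# Toy function that warns on negative input
noisy <- function(x) {
  if (x < 0) warning("negative value: ", x)
  abs(x)
}

log_file <- tempfile(fileext = ".log")
res <- vapply(c(1, -2, -3),
              function(x) run_with_warning_log(noisy, x, log_file = log_file),
              numeric(1))

logged <- readLines(log_file)
logged
```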

two columns shape_id in gps_as_sf

gps_as_sf() returns a sf with two columns named shape_id. Remove one.

    poa <- system.file("extdata/poa.zip", package="gtfs2gps")

    poa_gps <- read_gtfs(poa) %>%
      filter_week_days() %>%
      gtfs2gps(parallel = FALSE)

    poa_sf <- gps_as_sf(poa_gps)
    names(poa_sf)

publish v1.0 on CRAN

Merge `gtfs2gps_dt` scripts into a single function

The scripts gtfs2gps_dt_parallel.R and gtfs2gps_dt_parallel_freq.R basically do the same thing. Both convert a GTFS feed to a "GPS format". The difference is that the former works on a standard GTFS feed, while the latter works on a frequency-based GTFS. Ideally, both functions should be combined into a single function.

alert: this would require a method to automatically identify the type of GTFS passed as input to the function.
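A simple detection rule, sketched below under the assumption that a frequency-based feed is one with a non-empty frequencies table, could look like this. The helper `is_frequency_based` is hypothetical, and feeds that mix both kinds of trips would need a per-trip check instead.

```r
# Hypothetical helper: TRUE when the feed has a non-empty frequencies table
is_frequency_based <- function(gtfs) {
  freq <- gtfs$frequencies
  !is.null(freq) && nrow(freq) > 0
}

# Toy feeds (illustration data)
standard_feed  <- list(trips = data.frame(trip_id = "t1"))
frequency_feed <- list(
  trips       = data.frame(trip_id = "t1"),
  frequencies = data.frame(trip_id = "t1", start_time = "06:00:00",
                           end_time = "09:00:00", headway_secs = 600)
)

is_frequency_based(standard_feed)
is_frequency_based(frequency_feed)
```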

cpp_snap_points limit recursive calls

The code of cpp_snap_points() might enter into an infinite loop when used with wrong data. Limit the number of recursive calls in order to stop and show an error message to the user when this happens.
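The guard itself can be sketched in R as below. This is a one-dimensional toy, not the algorithm used by cpp_snap_points(): it assumes a routine that retries with a wider search radius whenever some point cannot be snapped, and the names `snap_with_limit` and `max_depth` are hypothetical.

```r
# Toy 1-D snapping: retry with a doubled radius, but stop after max_depth tries
snap_with_limit <- function(points, candidates, radius,
                            max_depth = 10, depth = 1) {
  if (depth > max_depth) {
    stop("could not snap all points after ", max_depth,
         " attempts; check the input data")
  }
  nearest <- vapply(points, function(p) min(abs(candidates - p)), numeric(1))
  if (all(nearest <= radius)) {
    # every point has a candidate within the radius: return the closest one
    return(vapply(points,
                  function(p) candidates[which.min(abs(candidates - p))],
                  numeric(1)))
  }
  snap_with_limit(points, candidates, radius * 2, max_depth, depth + 1)
}

snapped <- snap_with_limit(c(1, 10), candidates = c(0, 12), radius = 0.5)
snapped
```

With a candidate that is far from every point, the function stops with an error message instead of recursing indefinitely.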

spatial resolution as an argument

I'm noticing that the processing takes a long time and the output is rather large. Ideally, we would like to include a spatial resolution argument in gtfs2gps_single_dt.R. Using lower spatial resolutions (say 500 meters or 1 km) could make the function more efficient at the expense of spatial detail, but some people might be fine with this trade-off.

However, I don't know whether it would be possible to include a spatial resolution argument in our gtfs2gps_single_dt.R as it stands. What would happen if we used a spatial resolution of, say, 1 km but there is a pair of sequential stops that are closer than that?

I just wanted to leave this issue registered here, but we can try to address it later on, after we have a robust version of the package running.