tidyverse / vroom

Fast reading of delimited files

Home Page: https://vroom.r-lib.org

License: Other

R 9.65% C++ 87.26% CMake 1.39% Shell 0.14% C 1.36% Makefile 0.18% Python 0.02%

r csv csv-parser tsv tsv-parser fixed-width-text

vroom's Introduction

🏎💨vroom

The fastest delimited reader for R, 1.23 GB/sec.

But that’s impossible! How can it be so fast?

vroom doesn’t stop to actually read all of your data; it simply indexes where each record is located so it can be read later. The vectors returned use the Altrep framework to lazily load the data on demand when it is accessed, so you only pay for what you use. This lazy access is done automatically, so no changes to your R data-manipulation code are needed.

vroom also uses multiple threads for indexing, for materializing non-character columns, and for writing, to further improve performance.
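A minimal sketch of this workflow, assuming a vroom version that has the show_col_types argument. The file and column names are made up for illustration. Note that, per the environment variables section, only character columns are lazy by default.

```r
library(vroom)

# Write a small illustrative file to a temporary location
path <- tempfile(fileext = ".tsv")
df_in <- data.frame(id = 1:3, name = c("a", "b", "c"))
vroom_write(df_in, path, delim = "\t")

# vroom() indexes the file; lazy columns are parsed when they are accessed
df <- vroom(path, delim = "\t", show_col_types = FALSE)

# Ordinary R code works unchanged; accessing a value triggers parsing
first_name <- df$name[[1]]
```

The result is used like any other tibble, so existing data-manipulation code does not need to know whether a column has been materialized yet.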

package	version	time (sec)	speedup	throughput
vroom	1.5.1	1.36	53.30	1.23 GB/sec
data.table	1.14.0	5.83	12.40	281.65 MB/sec
readr	1.4.0	37.30	1.94	44.02 MB/sec
read.delim	4.1.0	72.31	1.00	22.71 MB/sec

Features

vroom has nearly all of the parsing features of readr for delimited and fixed width files, including

delimiter guessing*
custom delimiters (including multi-byte* and Unicode* delimiters)
specification of column types (including type guessing)
- numeric types (double, integer, big integer*, number)
- logical types
- datetime types (datetime, date, time)
- categorical types (characters, factors)
column selection, like dplyr::select()*
skipping headers, comments and blank lines
quoted fields
double and backslashed escapes
whitespace trimming
windows newlines
reading from multiple files or connections*
embedded newlines in headers and fields**
writing delimited files with as-needed quoting.
robust to invalid inputs (vroom has been extensively tested with the afl fuzz tester)*.

* these are additional features not in readr.

** requires num_threads = 1.
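As a hedged illustration of two of the features above (delimiter guessing and column selection), the following sketch reads the mtcars.csv example file that ships with vroom. The chosen columns are arbitrary.

```r
library(vroom)

# Path to an example file bundled with the package
path <- vroom_example("mtcars.csv")

# No `delim` is given, so vroom guesses the delimiter from the file.
# `col_select` keeps only the named columns, like dplyr::select().
cars <- vroom(path, col_select = c(model, mpg, cyl), show_col_types = FALSE)

names(cars)
```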

Installation

Install vroom from CRAN with:

install.packages("vroom")

Alternatively, if you need the development version from GitHub, install it with:

# install.packages("pak")
pak::pak("tidyverse/vroom")

Usage

See getting started to jump start your use of vroom!

vroom uses the same interface as readr to specify column types.

vroom::vroom("mtcars.tsv",
  col_types = list(cyl = "i", gear = "f", hp = "i", disp = "_",
                   drat = "_", vs = "l", am = "l", carb = "i")
)
#> # A tibble: 32 × 10
#>   model           mpg   cyl    hp    wt  qsec vs    am    gear   carb
#>   <chr>         <dbl> <int> <int> <dbl> <dbl> <lgl> <lgl> <fct> <int>
#> 1 Mazda RX4      21       6   110  2.62  16.5 FALSE TRUE  4         4
#> 2 Mazda RX4 Wag  21       6   110  2.88  17.0 FALSE TRUE  4         4
#> 3 Datsun 710     22.8     4    93  2.32  18.6 TRUE  TRUE  4         1
#> # ℹ 29 more rows

Reading multiple files

vroom natively supports reading from multiple files (or even multiple connections!).

First we generate some files to read by splitting the nycflights dataset by airline. For the sake of the example, we’ll just write the first 2 rows for each airline.

library(nycflights13)
# Write the first two rows of each carrier's flights to its own TSV file
purrr::iwalk(
  split(flights, flights$carrier),
  ~ vroom::vroom_write(head(.x, 2), glue::glue("flights_{.y}.tsv"), delim = "\t")
)

Then we can efficiently read them into one tibble by passing the filenames directly to vroom. The id argument can be used to request a column that reveals the filename that each row originated from.

files <- fs::dir_ls(glob = "flights*tsv")
files
#> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv 
#> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv 
#> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv 
#> flights_YV.tsv
vroom::vroom(files, id = "source")
#> Rows: 32 Columns: 20
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr   (4): carrier, tailnum, origin, dest
#> dbl  (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, ...
#> dttm  (1): time_hour
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 32 × 20
#>   source          year month   day dep_time sched_dep_time dep_delay arr_time
#>   <chr>          <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#> 1 flights_9E.tsv  2013     1     1      810            810         0     1048
#> 2 flights_9E.tsv  2013     1     1     1451           1500        -9     1634
#> 3 flights_AA.tsv  2013     1     1      542            540         2      923
#> # ℹ 29 more rows
#> # ℹ 12 more variables: sched_arr_time <dbl>, arr_delay <dbl>, carrier <chr>,
#> #   flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Learning more

Getting started with vroom
📽 vroom: Because Life is too short to read slow - Presentation at UseR!2019 (slides)
📹 vroom: Read and write rectangular data quickly - a video tour of the vroom features.

Benchmarks

The speed quoted above is from a real 1.53G dataset with 14,388,451 rows and 11 columns. See the benchmark article for full details of the dataset, and bench/ for the code used to retrieve the data and perform the benchmarks.

Environment variables

In addition to the arguments to the vroom() function, you can control the behavior of vroom with a few environment variables. Generally these will not need to be set by most users.

VROOM_TEMP_PATH - Path to the directory used to store temporary files when reading from an R connection. If unset, defaults to the R session’s temporary directory (tempdir()).
VROOM_THREADS - The number of processor threads to use when indexing and parsing. If unset, defaults to parallel::detectCores().
VROOM_SHOW_PROGRESS - Whether to show the progress bar when indexing. Regardless of this setting the progress bar is disabled in non-interactive settings, R notebooks, when running tests with testthat and when knitting documents.
VROOM_CONNECTION_SIZE - The size (in bytes) of the connection buffer when reading from connections (default is 128 KiB).
VROOM_WRITE_BUFFER_LINES - The number of lines to use for each buffer when writing files (default: 1000).

There is also a family of variables to control use of the Altrep framework. For versions of R where the Altrep framework is unavailable (R < 3.5.0) they are automatically turned off and the variables have no effect. The variables can take one of true, false, TRUE, FALSE, 1, or 0.

VROOM_USE_ALTREP_NUMERICS - If set, use Altrep for all numeric types (default false).

There are also individual variables for each type. Currently only VROOM_USE_ALTREP_CHR defaults to true.

VROOM_USE_ALTREP_CHR
VROOM_USE_ALTREP_FCT
VROOM_USE_ALTREP_INT
VROOM_USE_ALTREP_BIG_INT
VROOM_USE_ALTREP_DBL
VROOM_USE_ALTREP_NUM
VROOM_USE_ALTREP_LGL
VROOM_USE_ALTREP_DTTM
VROOM_USE_ALTREP_DATE
VROOM_USE_ALTREP_TIME
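These variables can be set from within R before vroom() is called. The sketch below uses only base R; the values chosen are illustrative, and whether a change takes effect for a session that has already loaded vroom is not covered here.

```r
# Environment variables are strings; set them before calling vroom()
Sys.setenv(VROOM_THREADS = "2", VROOM_USE_ALTREP_NUMERICS = "true")

# Inspect the current settings
current <- Sys.getenv(c("VROOM_THREADS", "VROOM_USE_ALTREP_NUMERICS"))
print(current)

# Remove them again to fall back to the defaults
Sys.unsetenv(c("VROOM_THREADS", "VROOM_USE_ALTREP_NUMERICS"))
```

They can equally be set outside R, for example in a shell profile or an .Renviron file.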

RStudio caveats

RStudio’s environment pane calls object.size() when it refreshes the pane, which for Altrep objects can be extremely slow. RStudio 1.2.1335+ includes the fixes (RStudio#4210, RStudio#4292) for this issue, so it is recommended you use at least that version.

Thanks

Gabe Becker, Luke Tierney and Tomas Kalibera for conceiving, implementing and maintaining the Altrep framework.
Romain François, whose Altrepisode package and related blog posts were a great guide for creating new Altrep objects in C++.
Matt Dowle and the rest of the Rdatatable team; data.table::fread() is blazing fast and was great motivation to see how fast we could go!

vroom's Issues

Whitespace trimming

Dates, times, datetimes

Probably use cctz

Normal
- datetimes
- dates
- times
Altrep
- datetimes
- dates
- times

Reading from connections

Will require a decent amount of refactoring, as ideally we would stream this to the indexer and save the file to a temporary location as we go.
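A rough base R sketch of the idea, not the package's implementation: read the connection in fixed-size chunks and append each chunk to a temporary file. The function name spool_connection and the chunk size are hypothetical; a real version would also pass each chunk to the indexer.

```r
# Copy everything readable from `con` into a temporary file, chunk by chunk
spool_connection <- function(con, chunk_size = 1024L) {
  tmp <- tempfile()
  out <- file(tmp, "wb")
  on.exit(close(out))
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    # A real implementation would also hand `chunk` to the indexer here
    writeBin(chunk, out)
  }
  tmp
}

# Usage: spool a small file opened as a binary connection
src <- tempfile()
writeLines(c("a,b", "1,2"), src)
con <- file(src, "rb")
spooled <- spool_connection(con, chunk_size = 4L)
close(con)
spooled_lines <- readLines(spooled)
```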

Quoted fields

Windows newlines

strtod

crash if a header row with no data

Progress bars

Look into fuzz testing

Maybe just by passing inputs to vroom::index() directly?

https://github.com/google/oss-fuzz

add tests for multiple files

Connection indexing wrong

We need to adjust the indexes after the first batch is read by the sizes of the previous batches.
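To illustrate the adjustment in isolation, here is a small base R sketch (the function index_batches is hypothetical, not vroom's code). Newline positions found within each batch are shifted by the total number of bytes in the batches that came before it.

```r
# Return the byte positions of newlines across a list of raw batches
index_batches <- function(batches) {
  offset <- 0L
  idx <- integer(0)
  for (b in batches) {
    # 1-based positions of "\n" within this batch only
    local <- which(b == as.raw(10L))
    # Shift by the size of all previous batches
    idx <- c(idx, local + offset)
    offset <- offset + length(b)
  }
  idx
}

raw_data <- charToRaw("a,b\n1,2\n3,4\n")

# Split the 12 bytes into batches of at most 5 bytes
groups <- split(seq_along(raw_data), ceiling(seq_along(raw_data) / 5))
batches <- lapply(groups, function(i) raw_data[i])

batch_index <- index_batches(batches)
whole_index <- which(raw_data == as.raw(10L))
```

Here batch_index and whole_index agree; without the offset, the positions from the second and third batches would point into the wrong place.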

add tests for logical parsing (0 and 1s in particular)

Is the sky (or 2.5 GB) the limit?

I've tried to read a 2.5GB .csv file with vroom and while everything seems to go through very nicely, at the end the R session just crashes.
Unfortunately, I can't share this file because it's sensitive, but the code was as follows:

Test <- vroom::vroom(
  "./tests/Big_2-5GB_file.csv",
  delim = ",",
  col_names = FALSE,
  col_types = readr::cols(
    "X1" = readr::col_double(),
    "X2" = readr::col_double(),
    "X3" = readr::col_skip(),
    "X4" = readr::col_double()
  ),
  skip = 1
)

I tried in RStudio and the session just crashes without any message. I read about some issues with RStudio in such situations, so I moved to the OS X terminal to start R and test, and I got the following more informative error message:

indexing Big_2-5GB_file.csv [============================-] 407.75MB/s, eta: 0s
*** caught segfault ***
address 0x1b5d35a4a, cause 'memory not mapped'

Traceback:
1: vroom_(file, delim = delim, col_names = col_names, col_types = col_types, id = id, skip = skip, na = na, quote = quote, trim_ws = trim_ws, escape_double = escape_double, escape_backslash = escape_backslash, comment = comment, locale = locale, use_altrep = getRversion() > "3.5.0" && as.logical(getOption("vroom.use_altrep", TRUE)), num_threads = num_threads, progress = progress)
2: vroom::vroom("./tests/Big_2-5GB_file.csv", delim = ",", col_names = FALSE, col_types = readr::cols(X1 = readr::col_double(), X2 = readr::col_double(), X3 = readr::col_skip(), X4 = readr::col_double()), skip = 1)

What's the biggest file you've tried? Anyone else with the same issue?

Edit: Just noticed you've documented the issue with RStudio, so that's great, but since I also tried with R outside of it, the problem is still valid.

Non-UTF-8 locales

Marking materialized altvectors as sorted

- string
- numeric

Compilation failed

Hey,
Thanks for the package, it's awesome!
I had vroom installed and it worked, but then I tried to update it and got this error.
Any idea what the problem is here?
Thanks :-)

devtools::install_github("jimhester/vroom")

	‘/private/var/folders/rm/y0lpngln0250x9h8cztnn4wc0000gn/T/RtmpthnfIY/downloaded_packages’
✔  checking for file ‘/private/var/folders/rm/y0lpngln0250x9h8cztnn4wc0000gn/T/RtmpthnfIY/remotes5fb8758ed9d4/jimhester-vroom-46cfdc5/DESCRIPTION’ ...
─  preparing ‘vroom’:
✔  checking DESCRIPTION meta-information ...
─  cleaning src
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘vroom_0.0.0.9000.tar.gz’
   
* installing *source* package ‘vroom’ ...
** libs
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c RcppExports.cpp -o RcppExports.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c altrep.cc -o altrep.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c index.cc -o index.o
In file included from index.cc:1:
./index.h:13:10: fatal error: 'multi_progress.h' file not found
#include "multi_progress.h"
         ^~~~~~~~~~~~~~~~~~
1 error generated.
make: *** [index.o] Error 1
ERROR: compilation failed for package ‘vroom’
* removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/vroom’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/vroom’
installation of package ‘/var/folders/rm/y0lpngln0250x9h8cztnn4wc0000gn/T//RtmpthnfIY/file5fb82c072fdb/vroom_0.0.0.9000.tar.gz’ had non-zero exit status

Provide way to include the filename as a column

As a character vector (or maybe a factor?)

Re-use memory when materializing vector

Currently get_trimmed_val() returns a new string object, so memory needs to be dynamically allocated for each one. We could instead reuse the same block of memory for each value.

richer delimiters

Multi-character ASCII and unicode

Handle case when `character(0)` is passed as the filenames

add tests for connection creation

fixed width files

skipping comments and blank lines in data

would require fairly significant changes to the index storage format, not sure it is worth it.

Should the index be in memory or a mmap?

mmap would make us less memory constrained, but a bit slower for the first read.

Instability with gzip connections

run_n <- function(n) {
  set.seed(42)
  # n <- 5544
  dat <- data.frame(
    a = sample(letters, n, replace = TRUE),
    b = sample(letters, n, replace = TRUE),
    c = rnorm(n)
  )
  fname <- "test.tsv.gz"
  readr::write_tsv(dat, fname)
  d <- vroom::vroom(fname)
  x <- dplyr::count(d, b)
  d <- readr::read_tsv(fname, col_types = readr::cols())
  y <- dplyr::count(d, b)
  print(all.equal(x, y))
  print(x)
  print(y)
}

run_n(5544)
#> [1] TRUE
#> # A tibble: 26 x 2
#>    b         n
#>    <chr> <int>
#>  1 a       234
#>  2 b       218
#>  3 c       235
#>  4 d       226
#>  5 e       235
#>  6 f       202
#>  7 g       177
#>  8 h       205
#>  9 i       209
#> 10 j       217
#> # … with 16 more rows
#> # A tibble: 26 x 2
#>    b         n
#>    <chr> <int>
#>  1 a       234
#>  2 b       218
#>  3 c       235
#>  4 d       226
#>  5 e       235
#>  6 f       202
#>  7 g       177
#>  8 h       205
#>  9 i       209
#> 10 j       217
#> # … with 16 more rows
run_n(5545)
#> [1] "Rows in x but not y: 19, 3. Rows in y but not x: 19, 3. "
#> # A tibble: 26 x 2
#>    b         n
#>    <chr> <int>
#>  1 a       234
#>  2 b       218
#>  3 c       236
#>  4 d       226
#>  5 e       235
#>  6 f       202
#>  7 g       176
#>  8 h       205
#>  9 i       209
#> 10 j       217
#> # … with 16 more rows
#> # A tibble: 26 x 2
#>    b         n
#>    <chr> <int>
#>  1 a       234
#>  2 b       218
#>  3 c       235
#>  4 d       226
#>  5 e       235
#>  6 f       202
#>  7 g       176
#>  8 h       205
#>  9 i       209
#> 10 j       217
#> # … with 16 more rows
run_n(5546)
#> [1] "Different number of rows"
#> # A tibble: 27 x 2
#>    b                      n
#>    <chr>              <int>
#>  1 2.0603492951994604     1
#>  2 a                    234
#>  3 b                    218
#>  4 c                    236
#>  5 d                    226
#>  6 e                    235
#>  7 f                    202
#>  8 g                    175
#>  9 h                    205
#> 10 i                    209
#> # … with 17 more rows
#> # A tibble: 26 x 2
#>    b         n
#>    <chr> <int>
#>  1 a       234
#>  2 b       218
#>  3 c       235
#>  4 d       226
#>  5 e       235
#>  6 f       202
#>  7 g       175
#>  8 h       205
#>  9 i       209
#> 10 j       218
#> # … with 16 more rows
run_n(5547)
#> [1] "Different number of rows"
#> # A tibble: 28 x 2
#>    b                       n
#>    <chr>               <int>
#>  1 -1.2790250179841622     1
#>  2 1.2409316387649358      1
#>  3 a                     234
#>  4 b                     218
#>  5 c                     236
#>  6 d                     226
#>  7 e                     235
#>  8 f                     202
#>  9 g                     175
#> 10 h                     205
#> # … with 18 more rows
#> # A tibble: 26 x 2
#>    b         n
#>    <chr> <int>
#>  1 a       234
#>  2 b       218
#>  3 c       235
#>  4 d       226
#>  5 e       235
#>  6 f       202
#>  7 g       175
#>  8 h       205
#>  9 i       209
#> 10 j       218
#> # … with 16 more rows

Created on 2019-02-25 by the reprex package (v0.2.1)

Session info

devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  os       macOS  10.14.2              
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2019-02-25                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                         
#>  assertthat    0.2.0      2017-04-11 [1] CRAN (R 3.5.0)                 
#>  backports     1.1.3      2018-12-14 [1] CRAN (R 3.5.0)                 
#>  callr         3.1.1.9000 2019-01-22 [1] Github (r-lib/callr@6413af8)   
#>  cli           1.0.1.9000 2019-01-22 [1] Github (r-lib/cli@94e2fc5)     
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)                 
#>  desc          1.2.0      2019-01-22 [1] Github (r-lib/desc@42b9578)    
#>  devtools      2.0.1.9000 2019-01-22 [1] local                          
#>  digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.0)                 
#>  dplyr         0.8.0.1    2019-02-15 [1] CRAN (R 3.5.2)                 
#>  evaluate      0.13       2019-02-12 [1] CRAN (R 3.5.2)                 
#>  fansi         0.4.0      2018-11-08 [1] Github (brodieG/fansi@ab11e9c) 
#>  fs            1.2.6      2018-08-23 [1] CRAN (R 3.5.0)                 
#>  glue          1.3.0.9000 2019-02-22 [1] Github (tidyverse/glue@8188cea)
#>  highr         0.7        2018-06-09 [1] CRAN (R 3.5.0)                 
#>  hms           0.4.2.9001 2019-02-22 [1] Github (tidyverse/hms@16ff76e) 
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.0)                 
#>  knitr         1.21       2018-12-10 [1] CRAN (R 3.5.1)                 
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.0)                 
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)                 
#>  pillar        1.3.1.9000 2019-01-22 [1] Github (r-lib/pillar@3a54b8d)  
#>  pkgbuild      1.0.2      2018-10-16 [1] CRAN (R 3.5.0)                 
#>  pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.0)                 
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.0)                 
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)                 
#>  processx      3.2.1      2018-12-05 [1] CRAN (R 3.5.0)                 
#>  ps            1.3.0.9000 2019-01-10 [1] Github (r-lib/ps@7d17711)      
#>  purrr         0.3.0      2019-01-27 [1] CRAN (R 3.5.2)                 
#>  R6            2.4.0      2019-02-14 [1] CRAN (R 3.5.2)                 
#>  Rcpp          1.0.0      2018-11-07 [1] CRAN (R 3.5.0)                 
#>  readr         1.3.1      2018-12-21 [1] CRAN (R 3.5.0)                 
#>  remotes       2.0.2.9000 2019-01-19 [1] Github (r-lib/remotes@cb69654) 
#>  rlang         0.3.1      2019-01-08 [1] CRAN (R 3.5.2)                 
#>  rmarkdown     1.11       2018-12-08 [1] CRAN (R 3.5.0)                 
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)                 
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.0)                 
#>  stringi       1.3.1      2019-02-13 [1] CRAN (R 3.5.2)                 
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.5.2)                 
#>  testthat      2.0.1      2018-10-11 [1] local                          
#>  tibble        2.0.1      2019-01-12 [1] CRAN (R 3.5.2)                 
#>  tidyselect    0.2.5      2018-10-11 [1] CRAN (R 3.5.0)                 
#>  usethis       1.4.0.9000 2019-02-11 [1] Github (r-lib/usethis@8e3c151) 
#>  utf8          1.1.4      2018-05-24 [1] CRAN (R 3.5.0)                 
#>  vroom         0.0.0.9000 2019-02-26 [1] local                          
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)                 
#>  xfun          0.5        2019-02-20 [1] CRAN (R 3.5.2)                 
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.0)                 
#> 
#> [1] /Users/jhester/Library/R/3.5/library
#> [2] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.5 (Santiago)

Matrix products: default
BLAS: /apps/lib-osver/R/3.4.0/lib64/R/lib/libRblas.so
LAPACK: /apps/lib-osver/R/3.4.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] readr_1.3.1           data.table_1.10.4-3   vroom_0.0.0.9000
[4] dplyr_0.7.6           tidyr_0.7.2           BiocParallel_1.12.0
[7] QuaternaryProd_1.17.0 devtools_1.13.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0         bindr_0.1.1        magrittr_1.5       hms_0.4.2
 [5] uuid_0.1-2         tidyselect_0.2.4   R6_2.4.0           rlang_0.3.1
 [9] fansi_0.4.0        tools_3.4.0        parallel_3.4.0     utf8_1.1.4
[13] cli_1.0.1          withr_2.1.2.9000   yaml_2.2.0         assertthat_0.2.0
[17] digest_0.6.15      tibble_2.0.1       crayon_1.3.4       bindrcpp_0.2.2
[21] purrr_0.2.5        memoise_1.1.0.9000 glue_1.3.0         compiler_3.4.0
[25] pillar_1.3.1       pkgconfig_2.0.2

Figure out gcc issue with index_connection

Comments

vroom_list to import from a vector of files?

Please consider adding a function to import from a vector of file names, similar to import_list from {rio}.
That way, e.g., a whole folder of .csv files could be vroom'ed into a single data frame without materializing the data.
Great stuff overall! This brings performance to a whole new level!

Connection speed

Reading from connections is already somewhat fast, but I am pretty sure it could be faster if we do the reading from the connection and writing to the temporary file asynchronously from the parsing.

User supplied Factor levels

They are slightly complex

If we have the levels a priori we can treat it like a integer vector.
If we don't know the levels we can still read it in parallel
- Make independent maps for each thread
- Then when combining add the max value from previous mapping to the values we are appending
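The combining step can be sketched in base R as follows. This is only an illustration of the idea under stated assumptions: the helper names encode_chunk and combine_chunks are hypothetical, the "threads" are simulated by two chunks, and the final de-duplication of levels seen by both chunks is an addition that the notes above do not spell out.

```r
# Each "thread" builds its own level map for its chunk
encode_chunk <- function(x) {
  lv <- unique(x)
  list(levels = lv, codes = match(x, lv))
}

# Combine two encoded chunks into one set of levels and codes
combine_chunks <- function(a, b) {
  # Offset b's codes by the number of levels already assigned in a
  codes <- c(a$codes, b$codes + length(a$levels))
  levels <- c(a$levels, b$levels)
  # Assumption: levels seen by both chunks are collapsed to one entry
  uniq <- unique(levels)
  remap <- match(levels, uniq)
  list(levels = uniq, codes = remap[codes])
}

x <- c("a", "b", "a", "c", "b", "d")
combined <- combine_chunks(encode_chunk(x[1:3]), encode_chunk(x[4:6]))

# Rebuild the values from the combined levels and codes
roundtrip <- combined$levels[combined$codes]
```

When the levels are known in advance, the first step is unnecessary: every chunk can be matched against the same level vector directly, which is why that case behaves like an integer vector.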

delimiter guessing

installation fails: undefined symbol: _Z21force_materializationP7SEXPREC

Maybe you are already aware of this as the same error also occurs for travis (build #27), but nevertheless I note it here:

** testing if installed package can be loaded
Error: package or namespace load failed for ‘vroom’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/kapper/R/x86_64-pc-linux-gnu-library/3.5/vroom/libs/vroom.so':
  /home/kapper/R/x86_64-pc-linux-gnu-library/3.5/vroom/libs/vroom.so: undefined symbol: _Z21force_materializationP7SEXPREC
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/home/kapper/R/x86_64-pc-linux-gnu-library/3.5/vroom’
Error in i.p(...) : 
  (converted from warning) installation of package ‘/tmp/Rtmphhmsgf/file2044577eab40/vroom_0.0.0.9000.tar.gz’ had non-zero exit status

Missing data / NA values

When reading from connections make sure full first line is in buffer

Sys.setenv("VROOM_CONNECTION_SIZE" = "32")
vroom::vroom(file(vroom::vroom_example("mtcars.csv"), ""))

readr's flexible number parser

Support for R versions before 3.5

Will need to ifdef out the altrep code and fallback to non-altrep codepaths.

Specify col_types

Thanks for the great work Jim!
It would be great if vroom could take the col_types argument just like readr, so that configurations like the one below wouldn't result in a wrong column type.

For the files I read, vroom is 5 times faster than readr but because a lot of columns types are wrong, I lose all the benefits when correcting it. And no, I can't get rid of those missing values in some of my columns :(

In this example, having an NA value in the first row causes the column type to change from dbl to chr.

mtcarsBis <- mtcars
mtcarsBis$vs[1] <- NA
readr::write_tsv(mtcarsBis, "mtcars.tsv")
mtcarsBack <- vroom::vroom("mtcars.tsv")
dplyr::glimpse(mtcars)
#> Observations: 32
#> Variables: 11
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
#> $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
#> $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
#> $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
#> $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
#> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
dplyr::glimpse(mtcarsBack)
#> Observations: 32
#> Variables: 11
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ cyl  <int> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
#> $ hp   <int> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
#> $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
#> $ vs   <chr> "NA", "0", "1", "1", "0", "1", "0", "1", "1", "1", "1", "...
#> $ am   <int> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
#> $ gear <int> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
#> $ carb <int> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...

tests

some would be nice ;)

Conversion / Materialization performance

library(vroom)
df <- vroom(filename)
# force materialization of all cols
system.time(for (i in seq_along(df)) force_materialization(df[[i]]))

# Should be ~ the same time as
df <- data.table::fread(filename)

crash if file does not have a trailing newline

Support embedded newlines in fields

Requires a different parsing strategy or disabling multi-threading

Byte order marks

Performance regressions with dplyr

This issue now is tracking dplyr performance regressions.

Use C++ traits for zero-runtime-cost features

Currently we just use runtime conditionals, but it should be possible to use C++ traits for at least some of the features so there is no runtime cost during parsing.

Consider using threads for index parsing

https://stackoverflow.com/a/38097463/2055486 has a good example we could try.

compilation failed on macOS

Hi Jim,

Always excited about speed!

But I'm afraid I can't get it to build on my Mac.

Do you have any tips?

devtools::install_github("jimhester/vroom")

✔  checking for file ‘/private/var/folders/k5/ynx90ngs6_7f7pcdkb77tp088qtqn4/T/RtmpxHHZSF/remotes2964396d37e3/jimhester-vroom-20adf2d/DESCRIPTION’ ...
─  preparing ‘vroom’:
✔  checking DESCRIPTION meta-information ...
─  cleaning src
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘vroom_0.0.0.9000.tar.gz’
   Warning: invalid uid value replaced by that for user 'nobody'
   Warning: invalid gid value replaced by that for user 'nobody'

* installing *source* package ‘vroom’ ...
** libs
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c Iconv.cpp -o Iconv.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c LocaleInfo.cpp -o LocaleInfo.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c RcppExports.cpp -o RcppExports.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c altrep.cc -o altrep.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c index.cc -o index.o
In file included from index.cc:1:
In file included from ./index.h:13:
./multi_progress.h:20:13: error: no matching constructor for initialization of 'RProgress::RProgress'
      : pb_(RProgress::RProgress(
            ^
/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include/RProgress.h:42:3: note: candidate constructor not viable: no known conversion
      from 'const char' to 'const char *' for 4th argument; take the address of the argument with &
  RProgress(std::string format = "[:bar] :percent",
  ^
/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include/RProgress.h:38:7: note: candidate constructor (the implicit copy constructor) not
      viable: requires 1 argument, but 7 were provided
class RProgress {
      ^
1 error generated.
make: *** [index.o] Error 1
ERROR: compilation failed for package ‘vroom’
* removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/vroom’
Error in i.p(...) :
  (converted from warning) installation of package ‘/var/folders/k5/ynx90ngs6_7f7pcdkb77tp088qtqn4/T//RtmpxHHZSF/file29644ed9568b/vroom_0.0.0.9000.tar.gz’ had non-zero exit status

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0         compiler_3.5.1     prettyunits_1.0.2  remotes_2.0.2
 [5] tools_3.5.1        testthat_2.0.1     digest_0.6.18      pkgbuild_1.0.2
 [9] pkgload_1.0.2      evaluate_0.12      memoise_1.1.0.9000 rlang_0.3.1.9000
[13] reprex_0.2.1       cli_1.0.1          curl_3.3           xfun_0.4
[17] withr_2.1.2        knitr_1.21         desc_1.2.0         fs_1.2.6
[21] devtools_2.0.1     rprojroot_1.3-2    glue_1.3.0         R6_2.4.0
[25] processx_3.2.1     tcltk_3.5.1        rmarkdown_1.11     sessioninfo_1.1.1
[29] callr_3.1.1        clipr_0.5.0        magrittr_1.5       whisker_0.3-2
[33] backports_1.1.3    ps_1.3.0           htmltools_0.3.6    usethis_1.4.0
[37] assertthat_0.2.0   crayon_1.3.4

Test-driving performance issues

I'm testing the package after seeing the Twitter post. I expected to drop it in as a replacement for readr and magically speed things up. On my Windows 10 computer, that did not seem to be the case (I guess this is mostly an FYI issue).

I have a rather large server-log CSV file containing a combination of dates, strings, logicals, and numerics. Of fread, read_csv, and vroom, vroom is the slowest when I include a command after the load (I'd be happy to send you the file):

> file_name <- "test.csv"
> system.time({
+   fread_df <- fread(file_name)
+   attr(fread_df$requested_date, "tzone") <- "America/New_York"
+ })
|--------------------------------------------------|
|==================================================|
   user  system elapsed 
   9.63    1.03    6.00 
> system.time({
+   vroom_df <- vroom(file_name,delim = ",")
+   attr(vroom_df$requested_date, "tzone") <- "America/New_York"
+ })
   user  system elapsed 
  69.32    1.41   14.08 
> system.time({
+   readr_df <- read_csv(file_name, progress = FALSE)
+   attr(readr_df$requested_date, "tzone") <- "America/New_York"
+ })
Parsed with column specification:
cols(
  .default = col_logical(),
  service = col_character(),
  html_method = col_character(),
  html_protocol = col_double(),
  access = col_double(),
  agencycd = col_character(),
  enddt = col_character(),
  format = col_character(),
  huc = col_character(),
  modifiedsince = col_character(),
  parametercds = col_character(),
  period = col_character(),
  sites = col_character(),
  startdt = col_character(),
  requested_date = col_datetime(format = ""),
  http_code = col_double(),
  bytes = col_double(),
  user_agent = col_character()
)
See spec(...) for full column specifications.
Warning: 659499 parsing failures.
  row        col           expected                                        actual       file
11895 bbox       1/0/T/F/TRUE/FALSE -93.8012695,29.3941408,-89.4506836,31.5153373 'test.csv'
11895 sitestatus 1/0/T/F/TRUE/FALSE active                                        'test.csv'
11895 sitetype   1/0/T/F/TRUE/FALSE oc,oc-co,es,lk,st,st-ca,st-dch,st-ts          'test.csv'
12694 indent     1/0/T/F/TRUE/FALSE on                                            'test.csv'
12695 indent     1/0/T/F/TRUE/FALSE on                                            'test.csv'
..... .......... .................. ............................................. ..........
See problems(...) for more details.

   user  system elapsed 
  11.06    0.32   11.37
> system.time({
+   vroom_df <- vroom(file_name,delim = ",")
+ })
   user  system elapsed 
  68.83    1.17   11.99 
> system.time({
+   attr(vroom_df$requested_date, "tzone") <- "America/New_York"
+ })
   user  system elapsed 
   3.96    0.02    3.97 
> system.time({
+   attr(vroom_df$requested_date, "tzone") <- "America/New_York"
+ })
   user  system elapsed 
   0.02    0.00    0.01
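
The last two timings above suggest that part of the cost is paid on first access to a column rather than in the vroom() call itself, which matches the lazy loading described in the README. A minimal sketch for timing the three steps separately follows. The file, its size, and its column names are stand-ins for the reporter's test.csv (assumptions, not the real data), and the explicit col_types are only there to keep the column types predictable.

```r
library(vroom)

# Hypothetical stand-in for "test.csv": 100,000 rows with one datetime column.
path <- tempfile(fileext = ".csv")
n <- 1e5
df <- data.frame(
  id = seq_len(n),
  requested_date = format(as.POSIXct("2019-01-01", tz = "UTC") + seq_len(n)),
  bytes = runif(n)
)
write.csv(df, path, row.names = FALSE)

# Step 1: the read itself (for lazy columns this is mostly indexing).
t_read <- system.time(
  x <- vroom(
    path,
    delim = ",",
    col_types = cols(
      id = col_integer(),
      requested_date = col_datetime(),
      bytes = col_double()
    )
  )
)

# Step 2: first touch of the column, which may force the values to be parsed.
t_first <- system.time(attr(x$requested_date, "tzone") <- "America/New_York")

# Step 3: the same operation again, once the values are already in memory.
t_second <- system.time(attr(x$requested_date, "tzone") <- "America/New_York")

print(c(
  read          = t_read[["elapsed"]],
  first_access  = t_first[["elapsed"]],
  second_access = t_second[["elapsed"]]
))
```

Comparing readers on the sum of the read and the first access is probably fairer than comparing the read calls alone, since fread and read_csv parse everything up front.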

This happened once...but I wasn't able to reproduce:

My details are:

devtools::session_info()
- Session info -------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.2 (2018-12-20)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       America/Chicago             
 date     2019-02-26                  

- Packages -----------------------------------------------------------------------------
 package     * version    date       lib source                          
 assertthat    0.2.0      2017-04-11 [1] CRAN (R 3.5.2)                  
 backports     1.1.3      2018-12-14 [1] CRAN (R 3.5.1)                  
 callr         3.1.1      2018-12-21 [1] CRAN (R 3.5.1)                  
 cli           1.0.1      2018-09-25 [1] CRAN (R 3.5.2)                  
 crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.2)                  
 data.table  * 1.12.0     2019-01-13 [1] CRAN (R 3.5.2)                  
 desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.1)                  
 devtools      2.0.1      2018-10-26 [1] CRAN (R 3.5.1)                  
 digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.2)                  
 fs            1.2.6      2018-08-23 [1] CRAN (R 3.5.1)                  
 glue          1.3.0      2018-07-17 [1] CRAN (R 3.5.2)                  
 hms           0.4.2      2018-03-10 [1] CRAN (R 3.5.2)                  
 magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.2)                  
 memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.1)                  
 packrat       0.5.0      2018-11-14 [1] CRAN (R 3.5.1)                  
 pillar        1.3.1      2018-12-15 [1] CRAN (R 3.5.2)                  
 pkgbuild      1.0.2      2018-10-16 [1] CRAN (R 3.5.1)                  
 pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.2)                  
 pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.2)                  
 prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.1)                  
 processx      3.2.1      2018-12-05 [1] CRAN (R 3.5.1)                  
 ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.1)                  
 R6            2.4.0      2019-02-14 [1] CRAN (R 3.5.2)                  
 Rcpp          1.0.0      2018-11-07 [1] CRAN (R 3.5.2)                  
 readr       * 1.3.1      2018-12-21 [1] CRAN (R 3.5.2)                  
 remotes       2.0.2      2018-10-30 [1] CRAN (R 3.5.1)                  
 rlang         0.3.1      2019-01-08 [1] CRAN (R 3.5.2)                  
 rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.1)                  
 rstudioapi    0.9.0      2019-01-09 [1] CRAN (R 3.5.2)                  
 sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.1)                  
 testthat      2.0.1      2018-10-13 [1] CRAN (R 3.5.2)                  
 tibble        2.0.1      2019-01-12 [1] CRAN (R 3.5.2)                  
 usethis       1.4.0      2018-08-14 [1] CRAN (R 3.5.1)                  
 vroom       * 0.0.0.9000 2019-02-26 [1] Github (jimhester/vroom@ab69db7)
 withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.2)                  

[1] C:/Users/ldecicco/Documents/R/win-library/3.5
[2] C:/Program Files/R/R-3.5.2/library

Performance parsing doubles

We should be able to go faster than we are.

create_df <- function(rows, cols) {
  as.data.frame(setNames(
    replicate(cols, runif(rows, 1, 100), simplify = FALSE),
    rep_len(c("x", letters), cols)))
}

df <- create_df(1000000, 10)
readr::write_csv(df, "long.csv")
bench::system_time(data.table::fread("long.csv"))
#> process    real
#>  931ms   161ms
bench::system_time(vroom::vroom("long.csv"))
#> process    real
#>  820ms   134ms

Something is wrong when parsing tables with fewer than 3 columns

vroom::vroom("a\n1", delim = ',')
#> Error in vroom_(file, delim = delim, col_names = col_names, skip = skip, : basic_string

vroom::vroom("a,b\n1,2", delim = ',')
#> Error in vroom_(file, delim = delim, col_names = col_names, skip = skip, : basic_string

vroom::vroom("a,b,c\n1,2,3", delim = ',')
#> # A tibble: 1 x 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     2     3

Created on 2019-01-24 by the reprex package (v0.2.1)
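
The three cases above could be turned into a small regression check along these lines. This is only a sketch: count_cols() is a hypothetical helper, and the inputs are written to temporary files (which adds a trailing newline) rather than passed as literal strings, so it does not depend on how a given vroom version treats inline data.

```r
library(vroom)

# Hypothetical helper: write CSV text to a temp file and count the parsed columns.
count_cols <- function(text) {
  path <- tempfile(fileext = ".csv")
  writeLines(text, path)
  ncol(vroom(path, delim = ","))
}

# The same inputs as in the report: one, two, and three columns.
inputs <- c("a\n1", "a,b\n1,2", "a,b,c\n1,2,3")
widths <- vapply(inputs, count_cols, integer(1), USE.NAMES = FALSE)

# Each input should parse without error and yield as many columns as its header has.
stopifnot(identical(widths, 1:3))
```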

.csv loads too slowly into the environment

I read a .csv file with 2MO rows and 9 columns (1.2 GB). vroom reads it in around 4 seconds (vs. 7.7 seconds for fread), but it then takes around 1 minute to load into the RStudio environment (vs. 2 seconds for fread).

I'm using RStudio version 1.2.1237 and R version 3.5.2.
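
One plausible explanation, not confirmed in this report, is that displaying the object forces the lazily indexed columns to be read in full. A rough way to check this outside RStudio is to compare the default read with an eager one. The sketch below assumes a vroom release that has the altrep argument (newer than the development version this report was filed against) and uses a small generated file in place of the 1.2 GB one.

```r
library(vroom)

# Small generated file standing in for the reporter's data (an assumption).
path <- tempfile(fileext = ".csv")
n <- 1e5
write.csv(
  data.frame(x = runif(n), y = sample(letters, n, replace = TRUE)),
  path,
  row.names = FALSE
)

# Default read: columns may come back as lazy ALTREP vectors.
t_lazy <- system.time(lazy <- vroom(path, delim = ","))

# Eager read: altrep = FALSE asks vroom to materialize every column up front.
t_eager <- system.time(eager <- vroom(path, delim = ",", altrep = FALSE))

print(c(lazy = t_lazy[["elapsed"]], eager = t_eager[["elapsed"]]))

# Both reads should hold the same data; only the point at which it is parsed differs.
stopifnot(identical(dim(lazy), dim(eager)))
stopifnot(isTRUE(all.equal(lazy$x, eager$x)))
```

If the eager read accounts for most of the one-minute delay, the time is going into materialization rather than into RStudio itself.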