tidyverse / vroom

Fast reading of delimited files

Home Page: https://vroom.r-lib.org

License: Other

R 9.65% C++ 87.26% CMake 1.39% Shell 0.14% C 1.36% Makefile 0.18% Python 0.02%
r csv csv-parser tsv tsv-parser fixed-width-text

vroom's Introduction

🏎💨vroom

The fastest delimited reader for R, 1.23 GB/sec.

But that’s impossible! How can it be so fast?

vroom doesn’t stop to actually read all of your data; it simply indexes where each record is located so it can be read later. The vectors returned use the Altrep framework to lazily load the data on demand when it is accessed, so you only pay for what you use. This lazy access is done automatically, so no changes to your R data-manipulation code are needed.
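As a minimal sketch of what "no changes to your code" means in practice, the snippet below writes a small file, reads it back, and then uses the result like any other data frame. The temporary file and the object names are illustrative, and `show_col_types` is assumed to be available (it appears in the messages quoted later in this README).

```r
library(vroom)

# Write a small tab-delimited file to a temporary location (illustrative data).
path <- tempfile(fileext = ".tsv")
vroom_write(mtcars, path, delim = "\t")

# vroom() indexes the file; column values are parsed when they are accessed.
cars <- vroom(path, show_col_types = FALSE)

# Ordinary R code works unchanged on the returned tibble.
mean_mpg <- mean(cars$mpg)
heavy <- cars[cars$wt > 4, ]
```

Nothing in the calling code refers to Altrep; the lazy loading happens inside the returned vectors.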

vroom also uses multiple threads for indexing, for materializing non-character columns, and for writing, which further improves performance.

package     version  time (sec)  speedup  throughput
vroom       1.5.1          1.36    53.30  1.23 GB/sec
data.table  1.14.0         5.83    12.40  281.65 MB/sec
readr       1.4.0         37.30     1.94  44.02 MB/sec
read.delim  4.1.0         72.31     1.00  22.71 MB/sec

Features

vroom has nearly all of the parsing features of readr for delimited and fixed width files, including

  • delimiter guessing*
  • custom delimiters (including multi-byte* and Unicode* delimiters)
  • specification of column types (including type guessing)
    • numeric types (double, integer, big integer*, number)
    • logical types
    • datetime types (datetime, date, time)
    • categorical types (characters, factors)
  • column selection, like dplyr::select()*
  • skipping headers, comments and blank lines
  • quoted fields
  • double and backslashed escapes
  • whitespace trimming
  • Windows newlines
  • reading from multiple files or connections*
  • embedded newlines in headers and fields**
  • writing delimited files with as-needed quoting.
  • robust to invalid inputs (vroom has been extensively tested with the afl fuzz tester)*.

* these are additional features not in readr.

** requires num_threads = 1.
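A short sketch of three of the features listed above: delimiter guessing, column selection, and embedded newlines read with a single thread. The file contents and object names are made up for illustration.

```r
library(vroom)

# A comma-delimited file; no `delim` is supplied, so vroom has to guess it.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,alice,3.5", "2,bob,4.0"), csv_path)
guessed <- vroom(csv_path, show_col_types = FALSE)

# Column selection in the style of dplyr::select(): keep only two columns.
picked <- vroom(csv_path, col_select = c(id, score), show_col_types = FALSE)

# A quoted field containing a newline, read with one thread as the
# footnote above requires.
nl_path <- tempfile(fileext = ".csv")
writeLines(c("id,note", "1,\"line one", "line two\"", "2,plain"), nl_path)
multiline <- vroom(nl_path, delim = ",", num_threads = 1, show_col_types = FALSE)
```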

Installation

Install vroom from CRAN with:

install.packages("vroom")

Alternatively, if you need the development version from GitHub, install it with:

# install.packages("pak")
pak::pak("tidyverse/vroom")

Usage

See getting started to jump start your use of vroom!

vroom uses the same interface as readr to specify column types.

vroom::vroom("mtcars.tsv",
  col_types = list(cyl = "i", gear = "f", hp = "i", disp = "_",
                   drat = "_", vs = "l", am = "l", carb = "i")
)
#> # A tibble: 32 × 10
#>   model           mpg   cyl    hp    wt  qsec vs    am    gear   carb
#>   <chr>         <dbl> <int> <int> <dbl> <dbl> <lgl> <lgl> <fct> <int>
#> 1 Mazda RX4      21       6   110  2.62  16.5 FALSE TRUE  4         4
#> 2 Mazda RX4 Wag  21       6   110  2.88  17.0 FALSE TRUE  4         4
#> 3 Datsun 710     22.8     4    93  2.32  18.6 TRUE  TRUE  4         1
#> # ℹ 29 more rows
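The compact codes above ("i", "f", "l", "_") can also be spelled out with column constructor functions. The sketch below is a self-contained variant of the same idea; it builds its own temporary `mtcars` file, and the object names are illustrative.

```r
library(vroom)

# Build a small file that has a `model` column, similar to the example above.
path <- tempfile(fileext = ".tsv")
cars <- cbind(model = rownames(mtcars), mtcars)
vroom_write(cars, path, delim = "\t")

# Long-form column specification; columns not named here are guessed.
spec <- cols(
  cyl = col_integer(),
  gear = col_factor(),
  disp = col_skip(),
  vs = col_logical()
)

typed <- vroom(path, col_types = spec)
```

Skipped columns (`disp` here) are dropped from the result entirely rather than being read and discarded later.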

Reading multiple files

vroom natively supports reading from multiple files (or even multiple connections!).

First we generate some files to read by splitting the nycflights dataset by airline. For the sake of the example, we’ll just take the first 2 lines of each file.

library(nycflights13)
purrr::iwalk(
  split(flights, flights$carrier),
  ~ vroom::vroom_write(head(.x, 2), glue::glue("flights_{.y}.tsv"), delim = "\t")
)

Then we can efficiently read them into one tibble by passing the filenames directly to vroom. The id argument can be used to request a column that reveals the filename that each row originated from.

files <- fs::dir_ls(glob = "flights*tsv")
files
#> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv 
#> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv 
#> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv 
#> flights_YV.tsv
vroom::vroom(files, id = "source")
#> Rows: 32 Columns: 20
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr   (4): carrier, tailnum, origin, dest
#> dbl  (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, ...
#> dttm  (1): time_hour
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 32 × 20
#>   source          year month   day dep_time sched_dep_time dep_delay arr_time
#>   <chr>          <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#> 1 flights_9E.tsv  2013     1     1      810            810         0     1048
#> 2 flights_9E.tsv  2013     1     1     1451           1500        -9     1634
#> 3 flights_AA.tsv  2013     1     1      542            540         2      923
#> # ℹ 29 more rows
#> # ℹ 12 more variables: sched_arr_time <dbl>, arr_delay <dbl>, carrier <chr>,
#> #   flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
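If nycflights13, purrr, glue, and fs are not installed, the same pattern can be tried with base R and a built-in dataset. This is only a sketch: the directory, file names, and the choice of `mtcars` split by `cyl` are illustrative.

```r
library(vroom)

# Write one file per cylinder count into a fresh temporary directory.
dir <- tempfile("vroom_multi_")
dir.create(dir)
parts <- split(mtcars, mtcars$cyl)
for (cyl in names(parts)) {
  out <- file.path(dir, paste0("cyl_", cyl, ".tsv"))
  vroom_write(parts[[cyl]], out, delim = "\t")
}

# Pass the vector of paths directly; `id` adds a column naming the source file.
files <- list.files(dir, pattern = "\\.tsv$", full.names = TRUE)
combined <- vroom(files, id = "source", show_col_types = FALSE)

# Rows per originating file.
table(basename(combined$source))
```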

Learning more

Benchmarks

The speed quoted above is from a real 1.53G dataset with 14,388,451 rows and 11 columns. See the benchmark article for full details of the dataset, and bench/ for the code used to retrieve the data and perform the benchmarks.

Environment variables

In addition to the arguments to the vroom() function, you can control the behavior of vroom with a few environment variables. Generally these will not need to be set by most users.

  • VROOM_TEMP_PATH - Path to the directory used to store temporary files when reading from an R connection. If unset, defaults to the R session’s temporary directory (tempdir()).
  • VROOM_THREADS - The number of processor threads to use when indexing and parsing. If unset, defaults to parallel::detectCores().
  • VROOM_SHOW_PROGRESS - Whether to show the progress bar when indexing. Regardless of this setting, the progress bar is disabled in non-interactive settings, in R notebooks, when running tests with testthat, and when knitting documents.
  • VROOM_CONNECTION_SIZE - The size (in bytes) of the connection buffer when reading from connections (default is 128 KiB).
  • VROOM_WRITE_BUFFER_LINES - The number of lines to use for each buffer when writing files (default: 1000).

There is also a family of variables to control use of the Altrep framework. For versions of R where the Altrep framework is unavailable (R < 3.5.0), they are automatically turned off and the variables have no effect. The variables can take one of true, false, TRUE, FALSE, 1, or 0.

  • VROOM_USE_ALTREP_NUMERICS - If set, use Altrep for all numeric types (default false).

There are also individual variables for each type. Currently only VROOM_USE_ALTREP_CHR defaults to true.

  • VROOM_USE_ALTREP_CHR
  • VROOM_USE_ALTREP_FCT
  • VROOM_USE_ALTREP_INT
  • VROOM_USE_ALTREP_BIG_INT
  • VROOM_USE_ALTREP_DBL
  • VROOM_USE_ALTREP_NUM
  • VROOM_USE_ALTREP_LGL
  • VROOM_USE_ALTREP_DTTM
  • VROOM_USE_ALTREP_DATE
  • VROOM_USE_ALTREP_TIME
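The variables in this section can be set from within R using Sys.setenv() before calling vroom(). The sketch below assumes the variables are consulted when vroom() is called, and that explicit arguments to vroom() may take precedence over them; both points are assumptions rather than something this page states. The values chosen are illustrative.

```r
# Set a few of the variables described above for the current R session.
Sys.setenv(
  VROOM_THREADS = "2",                # threads for indexing and parsing
  VROOM_SHOW_PROGRESS = "false",      # suppress the progress bar
  VROOM_USE_ALTREP_NUMERICS = "true"  # request Altrep for numeric types
)

library(vroom)

path <- tempfile(fileext = ".tsv")
vroom_write(mtcars, path, delim = "\t")
cars <- vroom(path, show_col_types = FALSE)

# Restore the defaults so later code is unaffected.
Sys.unsetenv(c("VROOM_THREADS", "VROOM_SHOW_PROGRESS", "VROOM_USE_ALTREP_NUMERICS"))
```

For settings that should persist across sessions, the same variables could instead be placed in an `.Renviron` file.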

RStudio caveats

RStudio’s environment pane calls object.size() when it refreshes the pane, which for Altrep objects can be extremely slow. RStudio 1.2.1335+ includes the fixes (RStudio#4210, RStudio#4292) for this issue, so it is recommended you use at least that version.

Thanks

vroom's People

Contributors

ab-kent, andrie, anirban166, bairdj, bart1, batpigandme, criscelylp, davisvaughan, edzer, frm1789, hadley, jennybc, jeroen, jimhester, jrf1111, lionel-, lwjohnst86, maurolepore, meta00, michaelchirico, mikejohnpage, philaris, r3myg, sbearrows, wlattner

vroom's Issues

Reading from connections

Will require a decent amount of refactoring, as ideally we would stream this to the indexer and save the file to a temporary location as we go.

Connection indexing wrong

We need to adjust the indexes after the first batch is read by the sizes of the previous batches.
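A minimal sketch of the adjustment described here, in plain R rather than vroom's internal C++ code: newline positions found within each batch are shifted by the total byte size of the batches read before it. The batch contents and helper name are hypothetical.

```r
# Hypothetical batches as they might arrive from a connection.
batches <- c("a,b\n1,2\n", "3,4\n5,", "6\n")

# Byte positions of newlines within one piece of text (empty if none).
newline_positions <- function(text) {
  hits <- gregexpr("\n", text, fixed = TRUE, useBytes = TRUE)[[1]]
  as.integer(hits[hits > 0])
}

# Positions relative to the start of each batch.
local_index <- lapply(batches, newline_positions)

# Offset of each batch = total size of all batches before it.
sizes <- nchar(batches, type = "bytes")
offsets <- c(0L, head(cumsum(sizes), -1))

# Shift each batch's positions into whole-stream coordinates.
global_index <- unlist(Map(function(pos, off) pos + off, local_index, offsets))
```

The result matches what indexing the concatenated input in one pass would produce, which is the property the fix needs.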

Is the sky (or 2.5 GB) the limit?

I've tried to read a 2.5GB .csv file with vroom, and while everything seems to go through very nicely, at the end the R session just crashes.
Unfortunately, I can't share this file because it's sensitive, but the code was as follows:

Test <- vroom::vroom(
  "./tests/Big_2-5GB_file.csv",
  delim = ",",
  col_names = FALSE,
  col_types = readr::cols(
    "X1" = readr::col_double(),
    "X2" = readr::col_double(),
    "X3" = readr::col_skip(),
    "X4" = readr::col_double()
  ),
  skip = 1
)

I've tried in RStudio and the session just crashes without any message. I read about some issues with RStudio in such situations, so I moved to the OSX terminal to start R and test, and I got the following, more informative error message:

indexing Big_2-5GB_file.csv [============================-] 407.75MB/s, eta: 0s
*** caught segfault ***
address 0x1b5d35a4a, cause 'memory not mapped'

Traceback:
1: vroom_(file, delim = delim, col_names = col_names, col_types = col_types, id = id, skip = skip, na = na, quote = quote, trim_ws = trim_ws, escape_double = escape_double, escape_backslash = escape_backslash, comment = comment, locale = locale, use_altrep = getRversion() > "3.5.0" && as.logical(getOption("vroom.use_altrep", TRUE)), num_threads = num_threads, progress = progress)
2: vroom::vroom("./tests/Big_2-5GB_file.csv", delim = ",", col_names = FALSE, col_types = readr::cols(X1 = readr::col_double(), X2 = readr::col_double(), X3 = readr::col_skip(), X4 = readr::col_double()), skip = 1)

What's the biggest file you've tried? Anyone else with the same issue?

Edit: Just noticed you've documented the issue with RStudio, so that's great, but since I've tried with R outside of it, the problem is still valid.

Compilation failed

Hey,
thanks for the package, it's awesome!
I had vroom installed and it worked, but then I tried to update it and I got this error.
Any idea what the problem is here?
Thanks :-)

devtools::install_github("jimhester/vroom")

	‘/private/var/folders/rm/y0lpngln0250x9h8cztnn4wc0000gn/T/RtmpthnfIY/downloaded_packages’
✔  checking for file ‘/private/var/folders/rm/y0lpngln0250x9h8cztnn4wc0000gn/T/RtmpthnfIY/remotes5fb8758ed9d4/jimhester-vroom-46cfdc5/DESCRIPTION’ ...
─  preparing ‘vroom’:
✔  checking DESCRIPTION meta-information ...
─  cleaning src
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘vroom_0.0.0.9000.tar.gz’

* installing *source* package ‘vroom’ ...
** libs
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c RcppExports.cpp -o RcppExports.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c altrep.cc -o altrep.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c index.cc -o index.o
In file included from index.cc:1:
./index.h:13:10: fatal error: 'multi_progress.h' file not found
#include "multi_progress.h"
         ^~~~~~~~~~~~~~~~~~
1 error generated.
make: *** [index.o] Error 1
ERROR: compilation failed for package ‘vroom’
* removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/vroom’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/vroom’
installation of package ‘/var/folders/rm/y0lpngln0250x9h8cztnn4wc0000gn/T//RtmpthnfIY/file5fb82c072fdb/vroom_0.0.0.9000.tar.gz’ had non-zero exit status

Re-use memory when materializing vector

Currently get_trimmed_val() returns a new string object, so memory needs to be dynamically allocated for each one. We could instead reuse the same block of memory for each value.

fixed width files

  • basic reading
  • column types
  • column skips
  • locales
  • nas
  • trim_ws
  • guess_max
  • comment
  • header skip
  • n_max
  • num_threads (materializing)
  • num_threads (indexing) - maybe won't do
  • progress
  • reading from connections
  • tests from readr

Instability with gzip connections

run_n <- function(n) {
  set.seed(42)
  # n <- 5544
  dat <- data.frame(
    a = sample(letters, n, replace = TRUE),
    b = sample(letters, n, replace = TRUE),
    c = rnorm(n)
  )
  fname <- "test.tsv.gz"
  readr::write_tsv(dat, fname)
  d <- vroom::vroom(fname)
  x <- dplyr::count(d, b)
  d <- readr::read_tsv(fname, col_types = readr::cols())
  y <- dplyr::count(d, b)
  print(all.equal(x, y))
  print(x)
  print(y)
}

run_n(5544)
#> [1] TRUE
#> # A tibble: 26 x 2
#>    b         n
#>    <chr> <int>
#>  1 a       234
#>  2 b       218
#>  3 c       235
#>  4 d       226
#>  5 e       235
#>  6 f       202
#>  7 g       177
#>  8 h       205
#>  9 i       209
#> 10 j       217
#> # … with 16 more rows
#> # A tibble: 26 x 2
#>    b         n
#>    <chr> <int>
#>  1 a       234
#>  2 b       218
#>  3 c       235
#>  4 d       226
#>  5 e       235
#>  6 f       202
#>  7 g       177
#>  8 h       205
#>  9 i       209
#> 10 j       217
#> # … with 16 more rows
run_n(5545)
#> [1] "Rows in x but not y: 19, 3. Rows in y but not x: 19, 3. "
#> # A tibble: 26 x 2
#>    b         n
#>    <chr> <int>
#>  1 a       234
#>  2 b       218
#>  3 c       236
#>  4 d       226
#>  5 e       235
#>  6 f       202
#>  7 g       176
#>  8 h       205
#>  9 i       209
#> 10 j       217
#> # … with 16 more rows
#> # A tibble: 26 x 2
#>    b         n
#>    <chr> <int>
#>  1 a       234
#>  2 b       218
#>  3 c       235
#>  4 d       226
#>  5 e       235
#>  6 f       202
#>  7 g       176
#>  8 h       205
#>  9 i       209
#> 10 j       217
#> # … with 16 more rows
run_n(5546)
#> [1] "Different number of rows"
#> # A tibble: 27 x 2
#>    b                      n
#>    <chr>              <int>
#>  1 2.0603492951994604     1
#>  2 a                    234
#>  3 b                    218
#>  4 c                    236
#>  5 d                    226
#>  6 e                    235
#>  7 f                    202
#>  8 g                    175
#>  9 h                    205
#> 10 i                    209
#> # … with 17 more rows
#> # A tibble: 26 x 2
#>    b         n
#>    <chr> <int>
#>  1 a       234
#>  2 b       218
#>  3 c       235
#>  4 d       226
#>  5 e       235
#>  6 f       202
#>  7 g       175
#>  8 h       205
#>  9 i       209
#> 10 j       218
#> # … with 16 more rows
run_n(5547)
#> [1] "Different number of rows"
#> # A tibble: 28 x 2
#>    b                       n
#>    <chr>               <int>
#>  1 -1.2790250179841622     1
#>  2 1.2409316387649358      1
#>  3 a                     234
#>  4 b                     218
#>  5 c                     236
#>  6 d                     226
#>  7 e                     235
#>  8 f                     202
#>  9 g                     175
#> 10 h                     205
#> # … with 18 more rows
#> # A tibble: 26 x 2
#>    b         n
#>    <chr> <int>
#>  1 a       234
#>  2 b       218
#>  3 c       235
#>  4 d       226
#>  5 e       235
#>  6 f       202
#>  7 g       175
#>  8 h       205
#>  9 i       209
#> 10 j       218
#> # … with 16 more rows

Created on 2019-02-25 by the reprex package (v0.2.1)

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  os       macOS  10.14.2              
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2019-02-25                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                         
#>  assertthat    0.2.0      2017-04-11 [1] CRAN (R 3.5.0)                 
#>  backports     1.1.3      2018-12-14 [1] CRAN (R 3.5.0)                 
#>  callr         3.1.1.9000 2019-01-22 [1] Github (r-lib/callr@6413af8)   
#>  cli           1.0.1.9000 2019-01-22 [1] Github (r-lib/cli@94e2fc5)     
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)                 
#>  desc          1.2.0      2019-01-22 [1] Github (r-lib/desc@42b9578)    
#>  devtools      2.0.1.9000 2019-01-22 [1] local                          
#>  digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.0)                 
#>  dplyr         0.8.0.1    2019-02-15 [1] CRAN (R 3.5.2)                 
#>  evaluate      0.13       2019-02-12 [1] CRAN (R 3.5.2)                 
#>  fansi         0.4.0      2018-11-08 [1] Github (brodieG/fansi@ab11e9c) 
#>  fs            1.2.6      2018-08-23 [1] CRAN (R 3.5.0)                 
#>  glue          1.3.0.9000 2019-02-22 [1] Github (tidyverse/glue@8188cea)
#>  highr         0.7        2018-06-09 [1] CRAN (R 3.5.0)                 
#>  hms           0.4.2.9001 2019-02-22 [1] Github (tidyverse/hms@16ff76e) 
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.0)                 
#>  knitr         1.21       2018-12-10 [1] CRAN (R 3.5.1)                 
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.0)                 
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)                 
#>  pillar        1.3.1.9000 2019-01-22 [1] Github (r-lib/pillar@3a54b8d)  
#>  pkgbuild      1.0.2      2018-10-16 [1] CRAN (R 3.5.0)                 
#>  pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.0)                 
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.0)                 
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)                 
#>  processx      3.2.1      2018-12-05 [1] CRAN (R 3.5.0)                 
#>  ps            1.3.0.9000 2019-01-10 [1] Github (r-lib/ps@7d17711)      
#>  purrr         0.3.0      2019-01-27 [1] CRAN (R 3.5.2)                 
#>  R6            2.4.0      2019-02-14 [1] CRAN (R 3.5.2)                 
#>  Rcpp          1.0.0      2018-11-07 [1] CRAN (R 3.5.0)                 
#>  readr         1.3.1      2018-12-21 [1] CRAN (R 3.5.0)                 
#>  remotes       2.0.2.9000 2019-01-19 [1] Github (r-lib/remotes@cb69654) 
#>  rlang         0.3.1      2019-01-08 [1] CRAN (R 3.5.2)                 
#>  rmarkdown     1.11       2018-12-08 [1] CRAN (R 3.5.0)                 
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)                 
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.0)                 
#>  stringi       1.3.1      2019-02-13 [1] CRAN (R 3.5.2)                 
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.5.2)                 
#>  testthat      2.0.1      2018-10-11 [1] local                          
#>  tibble        2.0.1      2019-01-12 [1] CRAN (R 3.5.2)                 
#>  tidyselect    0.2.5      2018-10-11 [1] CRAN (R 3.5.0)                 
#>  usethis       1.4.0.9000 2019-02-11 [1] Github (r-lib/usethis@8e3c151) 
#>  utf8          1.1.4      2018-05-24 [1] CRAN (R 3.5.0)                 
#>  vroom         0.0.0.9000 2019-02-26 [1] local                          
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)                 
#>  xfun          0.5        2019-02-20 [1] CRAN (R 3.5.2)                 
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.0)                 
#> 
#> [1] /Users/jhester/Library/R/3.5/library
#> [2] /Library/Frameworks/R.framework/Versions/3.5/Resources/library
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.5 (Santiago)

Matrix products: default
BLAS: /apps/lib-osver/R/3.4.0/lib64/R/lib/libRblas.so
LAPACK: /apps/lib-osver/R/3.4.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] readr_1.3.1           data.table_1.10.4-3   vroom_0.0.0.9000
[4] dplyr_0.7.6           tidyr_0.7.2           BiocParallel_1.12.0
[7] QuaternaryProd_1.17.0 devtools_1.13.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0         bindr_0.1.1        magrittr_1.5       hms_0.4.2
 [5] uuid_0.1-2         tidyselect_0.2.4   R6_2.4.0           rlang_0.3.1
 [9] fansi_0.4.0        tools_3.4.0        parallel_3.4.0     utf8_1.1.4
[13] cli_1.0.1          withr_2.1.2.9000   yaml_2.2.0         assertthat_0.2.0
[17] digest_0.6.15      tibble_2.0.1       crayon_1.3.4       bindrcpp_0.2.2
[21] purrr_0.2.5        memoise_1.1.0.9000 glue_1.3.0         compiler_3.4.0
[25] pillar_1.3.1       pkgconfig_2.0.2

vroom_list to import from a vector of files?

Please consider adding a function to import from a vector of file names. Similar to import_list from {rio}.
So e.g. whole folder of .csv files can be vroom'ed in single data frame without materializing the stuff.
Great stuff overall! This brings the performance to the whole new level!

Connection speed

Reading from connections is already somewhat fast, but I am pretty sure it could be faster if we do the reading from the connection and writing to the temporary file asynchronously from the parsing.

User supplied Factor levels

They are slightly complex.

  • If we have the levels a priori, we can treat it like an integer vector.
  • If we don't know the levels, we can still read it in parallel:
    • Make independent maps for each thread.
    • Then, when combining, add the max value from the previous mapping to the values we are appending.
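The per-thread idea can be sketched in plain R. Note that this is a variant of the plan above, not vroom's implementation: instead of only adding an offset, each chunk's local codes are remapped through a merged level table, because separate chunks may contain the same level. The input values and the two-way chunking are illustrative.

```r
# Illustrative input, split into two chunks as two threads might see it.
values <- c("b", "a", "b", "c", "a", "d", "c", "c")
chunks <- split(values, rep(1:2, each = 4))

# Each chunk builds its own level map independently.
local_maps <- lapply(chunks, function(x) {
  lv <- unique(x)
  list(levels = lv, codes = match(x, lv))
})

# Merge the level tables in chunk order, keeping first appearances.
merged_levels <- unique(unlist(lapply(local_maps, `[[`, "levels")))

# Translate each chunk's local codes into codes for the merged table.
codes <- unlist(
  lapply(local_maps, function(m) match(m$levels, merged_levels)[m$codes]),
  use.names = FALSE
)

result <- structure(codes, levels = merged_levels, class = "factor")
```

When the chunks share no levels, the remapping reduces to adding the number of levels seen so far, which is the offset scheme described in the list.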

installation fails: undefined symbol: _Z21force_materializationP7SEXPREC

Maybe you are already aware of this as the same error also occurs for travis (build #27), but nevertheless I note it here:

** testing if installed package can be loaded
Error: package or namespace load failed for ‘vroom’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/kapper/R/x86_64-pc-linux-gnu-library/3.5/vroom/libs/vroom.so':
  /home/kapper/R/x86_64-pc-linux-gnu-library/3.5/vroom/libs/vroom.so: undefined symbol: _Z21force_materializationP7SEXPREC
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/home/kapper/R/x86_64-pc-linux-gnu-library/3.5/vroom’
Error in i.p(...) : 
  (converted from warning) installation of package ‘/tmp/Rtmphhmsgf/file2044577eab40/vroom_0.0.0.9000.tar.gz’ had non-zero exit status

Specify col_types

Thanks for the great work Jim!
It would be great if vroom could take the col_types argument just like readr, so that configurations like the one below wouldn't result in a wrong column type.

For the files I read, vroom is 5 times faster than readr but because a lot of columns types are wrong, I lose all the benefits when correcting it. And no, I can't get rid of those missing values in some of my columns :(

In this example, having an NA value on the first row causes the column type to change from dbl to chr.

mtcarsBis <- mtcars
mtcarsBis$vs[1] <- NA
readr::write_tsv(mtcarsBis, "mtcars.tsv")
mtcarsBack <- vroom::vroom("mtcars.tsv")
dplyr::glimpse(mtcars)
#> Observations: 32
#> Variables: 11
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
#> $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
#> $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
#> $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
#> $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
#> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
dplyr::glimpse(mtcarsBack)
#> Observations: 32
#> Variables: 11
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ cyl  <int> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
#> $ hp   <int> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
#> $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
#> $ vs   <chr> "NA", "0", "1", "1", "0", "1", "0", "1", "1", "1", "1", "...
#> $ am   <int> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
#> $ gear <int> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
#> $ carb <int> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
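For comparison, here is a sketch of the same scenario using the col_types interface shown in the Usage section earlier on this page. It assumes a vroom version that accepts col_types, and it uses vroom_write() instead of readr::write_tsv() only to keep the example to one package; the object names are illustrative.

```r
library(vroom)

# Recreate the situation from the report: NA in the first row of `vs`.
cars <- mtcars
cars$vs[1] <- NA
path <- tempfile(fileext = ".tsv")
vroom_write(cars, path, delim = "\t")

# Declare `vs` as double so its type does not depend on guessing.
back <- vroom(path, col_types = list(vs = "d"))

is.double(back$vs)
is.na(back$vs[1])
```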

tests

some would be nice ;)

Conversion / Materialization performance

library(vroom)
df <- vroom(filename)
# force materialization of all cols
system.time(for (i in seq_along(df)) force_materialization(df[[i]]))

# Should be ~ the same time as
df <- data.table::fread(filename)

compilation failed on macOS

Hi Jim,

Always excited about speed!

But I'm afraid I can't get it to build on my Mac.

Do you have any tips?

devtools::install_github("jimhester/vroom")
✔  checking for file ‘/private/var/folders/k5/ynx90ngs6_7f7pcdkb77tp088qtqn4/T/RtmpxHHZSF/remotes2964396d37e3/jimhester-vroom-20adf2d/DESCRIPTION’ ...
─  preparing ‘vroom’:
✔  checking DESCRIPTION meta-information ...
─  cleaning src
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘vroom_0.0.0.9000.tar.gz’
   Warning: invalid uid value replaced by that for user 'nobody'
   Warning: invalid gid value replaced by that for user 'nobody'

* installing *source* package ‘vroom’ ...
** libs
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c Iconv.cpp -o Iconv.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c LocaleInfo.cpp -o LocaleInfo.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c RcppExports.cpp -o RcppExports.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c altrep.cc -o altrep.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include" -I/usr/local/include  -Imio/include -DWIN32_LEAN_AND_MEAN -fPIC  -Wall -g -O2 -c index.cc -o index.o
In file included from index.cc:1:
In file included from ./index.h:13:
./multi_progress.h:20:13: error: no matching constructor for initialization of 'RProgress::RProgress'
      : pb_(RProgress::RProgress(
            ^
/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include/RProgress.h:42:3: note: candidate constructor not viable: no known conversion
      from 'const char' to 'const char *' for 4th argument; take the address of the argument with &
  RProgress(std::string format = "[:bar] :percent",
  ^
/Library/Frameworks/R.framework/Versions/3.5/Resources/library/progress/include/RProgress.h:38:7: note: candidate constructor (the implicit copy constructor) not
      viable: requires 1 argument, but 7 were provided
class RProgress {
      ^
1 error generated.
make: *** [index.o] Error 1
ERROR: compilation failed for package ‘vroom’
* removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/vroom’
Error in i.p(...) :
  (converted from warning) installation of package ‘/var/folders/k5/ynx90ngs6_7f7pcdkb77tp088qtqn4/T//RtmpxHHZSF/file29644ed9568b/vroom_0.0.0.9000.tar.gz’ had non-zero exit status
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0         compiler_3.5.1     prettyunits_1.0.2  remotes_2.0.2
 [5] tools_3.5.1        testthat_2.0.1     digest_0.6.18      pkgbuild_1.0.2
 [9] pkgload_1.0.2      evaluate_0.12      memoise_1.1.0.9000 rlang_0.3.1.9000
[13] reprex_0.2.1       cli_1.0.1          curl_3.3           xfun_0.4
[17] withr_2.1.2        knitr_1.21         desc_1.2.0         fs_1.2.6
[21] devtools_2.0.1     rprojroot_1.3-2    glue_1.3.0         R6_2.4.0
[25] processx_3.2.1     tcltk_3.5.1        rmarkdown_1.11     sessioninfo_1.1.1
[29] callr_3.1.1        clipr_0.5.0        magrittr_1.5       whisker_0.3-2
[33] backports_1.1.3    ps_1.3.0           htmltools_0.3.6    usethis_1.4.0
[37] assertthat_0.2.0   crayon_1.3.4

Test-driving performance issues

I'm testing the package after seeing the Twitter post. I expected to drop it in as a replacement for readr and get an immediate speedup. On my Windows 10 computer, that did not seem to be the case (I guess this is mostly an FYI issue...).

I have a rather large server-log CSV file with a mix of dates, strings, logicals, and numerics. Of fread, read_csv, and vroom, vroom is the slowest when I run a command on the data after the load (I'd be happy to send you the file):

> file_name <- "test.csv"
> system.time({
+   fread_df <- fread(file_name)
+   attr(fread_df$requested_date, "tzone") <- "America/New_York"
+ })
|--------------------------------------------------|
|==================================================|
   user  system elapsed 
   9.63    1.03    6.00 
> system.time({
+   vroom_df <- vroom(file_name,delim = ",")
+   attr(vroom_df$requested_date, "tzone") <- "America/New_York"
+ })
   user  system elapsed 
  69.32    1.41   14.08 
> system.time({
+   readr_df <- read_csv(file_name, progress = FALSE)
+   attr(readr_df$requested_date, "tzone") <- "America/New_York"
+ })
Parsed with column specification:
cols(
  .default = col_logical(),
  service = col_character(),
  html_method = col_character(),
  html_protocol = col_double(),
  access = col_double(),
  agencycd = col_character(),
  enddt = col_character(),
  format = col_character(),
  huc = col_character(),
  modifiedsince = col_character(),
  parametercds = col_character(),
  period = col_character(),
  sites = col_character(),
  startdt = col_character(),
  requested_date = col_datetime(format = ""),
  http_code = col_double(),
  bytes = col_double(),
  user_agent = col_character()
)
See spec(...) for full column specifications.
Warning: 659499 parsing failures.
  row        col           expected                                        actual       file
11895 bbox       1/0/T/F/TRUE/FALSE -93.8012695,29.3941408,-89.4506836,31.5153373 'test.csv'
11895 sitestatus 1/0/T/F/TRUE/FALSE active                                        'test.csv'
11895 sitetype   1/0/T/F/TRUE/FALSE oc,oc-co,es,lk,st,st-ca,st-dch,st-ts          'test.csv'
12694 indent     1/0/T/F/TRUE/FALSE on                                            'test.csv'
12695 indent     1/0/T/F/TRUE/FALSE on                                            'test.csv'
..... .......... .................. ............................................. ..........
See problems(...) for more details.

   user  system elapsed 
  11.06    0.32   11.37
> system.time({
+   vroom_df <- vroom(file_name,delim = ",")
+ })
   user  system elapsed 
  68.83    1.17   11.99 
> system.time({
+   attr(vroom_df$requested_date, "tzone") <- "America/New_York"
+ })
   user  system elapsed 
   3.96    0.02    3.97 
> system.time({
+   attr(vroom_df$requested_date, "tzone") <- "America/New_York"
+ })
   user  system elapsed 
   0.02    0.00    0.01 
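
The last two timings fit the lazy loading described in the README: the first touch of `requested_date` pays the parsing cost (3.97 s elapsed) and the repeat is nearly free. A minimal base-R sketch of that access pattern follows. It uses `delayedAssign()` purely as an analogy and is not how vroom's Altrep vectors are implemented; `slow_parse()` is a hypothetical placeholder for the deferred work.

```r
# Hypothetical stand-in for the parsing work that is deferred
# until a column is first touched.
slow_parse <- function(n) {
  Sys.sleep(0.2)   # simulate the cost of materializing the column
  seq_len(n)
}

# delayedAssign() binds a promise: slow_parse() has not run yet.
delayedAssign("requested_date", slow_parse(5))

# The first access forces the promise and pays the full cost.
first <- system.time(requested_date)[["elapsed"]]

# The second access reuses the cached value.
second <- system.time(requested_date)[["elapsed"]]
```

Timing the read and the first access separately, as in the transcript above, is what shows how the total cost is split between the two steps.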

This happened once, but I wasn't able to reproduce it:
[image]

My details are:

devtools::session_info()
- Session info -------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.2 (2018-12-20)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       America/Chicago             
 date     2019-02-26                  

- Packages -----------------------------------------------------------------------------
 package     * version    date       lib source                          
 assertthat    0.2.0      2017-04-11 [1] CRAN (R 3.5.2)                  
 backports     1.1.3      2018-12-14 [1] CRAN (R 3.5.1)                  
 callr         3.1.1      2018-12-21 [1] CRAN (R 3.5.1)                  
 cli           1.0.1      2018-09-25 [1] CRAN (R 3.5.2)                  
 crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.2)                  
 data.table  * 1.12.0     2019-01-13 [1] CRAN (R 3.5.2)                  
 desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.1)                  
 devtools      2.0.1      2018-10-26 [1] CRAN (R 3.5.1)                  
 digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.2)                  
 fs            1.2.6      2018-08-23 [1] CRAN (R 3.5.1)                  
 glue          1.3.0      2018-07-17 [1] CRAN (R 3.5.2)                  
 hms           0.4.2      2018-03-10 [1] CRAN (R 3.5.2)                  
 magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.2)                  
 memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.1)                  
 packrat       0.5.0      2018-11-14 [1] CRAN (R 3.5.1)                  
 pillar        1.3.1      2018-12-15 [1] CRAN (R 3.5.2)                  
 pkgbuild      1.0.2      2018-10-16 [1] CRAN (R 3.5.1)                  
 pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.2)                  
 pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.2)                  
 prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.1)                  
 processx      3.2.1      2018-12-05 [1] CRAN (R 3.5.1)                  
 ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.1)                  
 R6            2.4.0      2019-02-14 [1] CRAN (R 3.5.2)                  
 Rcpp          1.0.0      2018-11-07 [1] CRAN (R 3.5.2)                  
 readr       * 1.3.1      2018-12-21 [1] CRAN (R 3.5.2)                  
 remotes       2.0.2      2018-10-30 [1] CRAN (R 3.5.1)                  
 rlang         0.3.1      2019-01-08 [1] CRAN (R 3.5.2)                  
 rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.1)                  
 rstudioapi    0.9.0      2019-01-09 [1] CRAN (R 3.5.2)                  
 sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.1)                  
 testthat      2.0.1      2018-10-13 [1] CRAN (R 3.5.2)                  
 tibble        2.0.1      2019-01-12 [1] CRAN (R 3.5.2)                  
 usethis       1.4.0      2018-08-14 [1] CRAN (R 3.5.1)                  
 vroom       * 0.0.0.9000 2019-02-26 [1] Github (jimhester/vroom@ab69db7)
 withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.2)                  

[1] C:/Users/ldecicco/Documents/R/win-library/3.5
[2] C:/Program Files/R/R-3.5.2/library

Performance parsing doubles

We should be able to go faster than we are.

create_df <- function(rows, cols) {
  as.data.frame(setNames(
    replicate(cols, runif(rows, 1, 100), simplify = FALSE),
    rep_len(c("x", letters), cols)))
}

df <- create_df(1000000, 10)
readr::write_csv(df, "long.csv")
bench::system_time(data.table::fread("long.csv"))
#> process    real
#>  931ms   161ms
bench::system_time(vroom::vroom("long.csv"))
#> process    real
#>  820ms   134ms
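
For reference, the `create_df()` helper above fills every column with uniform random doubles between 1 and 100 and names the columns by cycling through `"x"` followed by the lowercase letters. A small call makes the shape easy to inspect (the 3 x 2 size and the name `small` are illustrative only, not part of the benchmark):

```r
# Same helper as in the report: `cols` columns of `rows` random doubles.
create_df <- function(rows, cols) {
  as.data.frame(setNames(
    replicate(cols, runif(rows, 1, 100), simplify = FALSE),
    rep_len(c("x", letters), cols)))
}

# A tiny instance: 3 rows, 2 columns named "x" and "a".
small <- create_df(3, 2)
str(small)
```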

Something is wrong when parsing tables with fewer than 3 columns

vroom::vroom("a\n1", delim = ',')
#> Error in vroom_(file, delim = delim, col_names = col_names, skip = skip, : basic_string

vroom::vroom("a,b\n1,2", delim = ',')
#> Error in vroom_(file, delim = delim, col_names = col_names, skip = skip, : basic_string

vroom::vroom("a,b,c\n1,2,3", delim = ',')
#> # A tibble: 1 x 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     2     3

Created on 2019-01-24 by the reprex package (v0.2.1)

.csv loads too slowly into the environment.

I read a .csv file with 2M rows and 9 columns (1.2 GB). vroom reads it in around 4 seconds (vs 7.7 seconds for fread), but it then takes around 1 minute to load into the RStudio environment (vs 2 seconds for fread).

I'm using RStudio version 1.2.1237 and R version 3.5.2.
