GithubHelp home page GithubHelp logo

michaelchirico / vroom Goto Github PK

View Code? Open in Web Editor NEW

This project forked from tidyverse/vroom

0.0 2.0 0.0 16.29 MB

Fast reading of delimited files

Home Page: https://vroom.r-lib.org

License: Other

Shell 0.14% C++ 87.43% Python 0.02% C 1.33% R 9.51% Makefile 0.18% CMake 1.39%

vroom's Introduction

πŸŽπŸ’¨vroom

R-CMD-check Codecov test coverage CRAN status Lifecycle: stable

The fastest delimited reader for R, 1.23 GB/sec.

But that’s impossible! How can it be so fast?

vroom doesn’t stop to actually read all of your data, it simply indexes where each record is located so it can be read later. The vectors returned use the Altrep framework to lazily load the data on-demand when it is accessed, so you only pay for what you use. This lazy access is done automatically, so no changes to your R data-manipulation code are needed.

vroom also uses multiple threads for indexing, materializing non-character columns, and when writing to further improve performance.

package version time (sec) speedup throughput
vroom 1.5.1 1.36 53.30 1.23 GB/sec
data.table 1.14.0 5.83 12.40 281.65 MB/sec
readr 1.4.0 37.30 1.94 44.02 MB/sec
read.delim 4.1.0 72.31 1.00 22.71 MB/sec

Features

vroom has nearly all of the parsing features of readr for delimited and fixed width files, including

  • delimiter guessing*
  • custom delimiters (including multi-byte* and Unicode* delimiters)
  • specification of column types (including type guessing)
    • numeric types (double, integer, big integer*, number)
    • logical types
    • datetime types (datetime, date, time)
    • categorical types (characters, factors)
  • column selection, like dplyr::select()*
  • skipping headers, comments and blank lines
  • quoted fields
  • double and backslashed escapes
  • whitespace trimming
  • windows newlines
  • reading from multiple files or connections*
  • embedded newlines in headers and fields**
  • writing delimited files with as-needed quoting.
  • robust to invalid inputs (vroom has been extensively tested with the afl fuzz tester)*.

* these are additional features not in readr.

** requires num_threads = 1.

Installation

Install vroom from CRAN with:

install.packages("vroom")

Alternatively, if you need the development version from GitHub install it with:

# install.packages("devtools")
devtools::install_dev("vroom")

Usage

See getting started to jump start your use of vroom!

vroom uses the same interface as readr to specify column types.

vroom::vroom("mtcars.tsv",
  col_types = list(cyl = "i", gear = "f",hp = "i", disp = "_",
                   drat = "_", vs = "l", am = "l", carb = "i")
)
#> # A tibble: 32 Γ— 10
#>   model           mpg   cyl    hp    wt  qsec vs    am    gear   carb
#>   <chr>         <dbl> <int> <int> <dbl> <dbl> <lgl> <lgl> <fct> <int>
#> 1 Mazda RX4      21       6   110  2.62  16.5 FALSE TRUE  4         4
#> 2 Mazda RX4 Wag  21       6   110  2.88  17.0 FALSE TRUE  4         4
#> 3 Datsun 710     22.8     4    93  2.32  18.6 TRUE  TRUE  4         1
#> # … with 29 more rows

Reading multiple files

vroom natively supports reading from multiple files (or even multiple connections!).

First we generate some files to read by splitting the nycflights dataset by airline.

library(nycflights13)
purrr::iwalk(
  split(flights, flights$carrier),
  ~ { .x$carrier[[1]]; vroom::vroom_write(.x, glue::glue("flights_{.y}.tsv"), delim = "\t") }
)

Then we can efficiently read them into one tibble by passing the filenames directly to vroom.

files <- fs::dir_ls(glob = "flights*tsv")
files
#> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv 
#> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv 
#> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv 
#> flights_YV.tsv
vroom::vroom(files)
#> Rows: 336776 Columns: 19
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr   (4): carrier, tailnum, origin, dest
#> dbl  (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, ...
#> dttm  (1): time_hour
#> 
#> β„Ή Use `spec()` to retrieve the full column specification for this data.
#> β„Ή Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 336,776 Γ— 19
#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁡ carrier
#>   <dbl> <dbl> <dbl>    <dbl>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  
#> 1  2013     1     1      810         810       0    1048    1037      11 9E     
#> 2  2013     1     1     1451        1500      -9    1634    1636      -2 9E     
#> 3  2013     1     1     1452        1455      -3    1637    1639      -2 9E     
#> # … with 336,773 more rows, 9 more variables: flight <dbl>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁡​arr_delay

Learning more

Benchmarks

The speed quoted above is from a real 1.53G dataset with 14,388,451 rows and 11 columns, see the benchmark article for full details of the dataset and bench/ for the code used to retrieve the data and perform the benchmarks.

Environment variables

In addition to the arguments to the vroom() function, you can control the behavior of vroom with a few environment variables. Generally these will not need to be set by most users.

  • VROOM_TEMP_PATH - Path to the directory used to store temporary files when reading from a R connection. If unset defaults to the R session’s temporary directory (tempdir()).
  • VROOM_THREADS - The number of processor threads to use when indexing and parsing. If unset defaults to parallel::detectCores().
  • VROOM_SHOW_PROGRESS - Whether to show the progress bar when indexing. Regardless of this setting the progress bar is disabled in non-interactive settings, R notebooks, when running tests with testthat and when knitting documents.
  • VROOM_CONNECTION_SIZE - The size (in bytes) of the connection buffer when reading from connections (default is 128 KiB).
  • VROOM_WRITE_BUFFER_LINES - The number of lines to use for each buffer when writing files (default: 1000).

There are also a family of variables to control use of the Altrep framework. For versions of R where the Altrep framework is unavailable (R < 3.5.0) they are automatically turned off and the variables have no effect. The variables can take one of true, false, TRUE, FALSE, 1, or 0.

  • VROOM_USE_ALTREP_NUMERICS - If set use Altrep for all numeric types (default false).

There are also individual variables for each type. Currently only VROOM_USE_ALTREP_CHR defaults to true.

  • VROOM_USE_ALTREP_CHR
  • VROOM_USE_ALTREP_FCT
  • VROOM_USE_ALTREP_INT
  • VROOM_USE_ALTREP_BIG_INT
  • VROOM_USE_ALTREP_DBL
  • VROOM_USE_ALTREP_NUM
  • VROOM_USE_ALTREP_LGL
  • VROOM_USE_ALTREP_DTTM
  • VROOM_USE_ALTREP_DATE
  • VROOM_USE_ALTREP_TIME

RStudio caveats

RStudio’s environment pane calls object.size() when it refreshes the pane, which for Altrep objects can be extremely slow. RStudio 1.2.1335+ includes the fixes (RStudio#4210, RStudio#4292) for this issue, so it is recommended you use at least that version.

Thanks

vroom's People

Contributors

ab-kent avatar andrie avatar anirban166 avatar batpigandme avatar criscelylp avatar davisvaughan avatar edzer avatar frm1789 avatar hadley avatar jennybc avatar jimhester avatar jrf1111 avatar lionel- avatar lwjohnst86 avatar maurolepore avatar meta00 avatar mikejohnpage avatar r3myg avatar sbearrows avatar wlattner avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.