
csvy's Introduction

Import and Export CSV Data With a YAML Metadata Header

CSVY is a file format that combines the simplicity of CSV (comma-separated values) with the metadata support of other plain-text and binary formats (JSON, XML, Stata, etc.). The CSVY file specification is simple: place a YAML header on top of a regular CSV. The YAML header is formatted according to the Table Schema of a Tabular Data Package.

A CSVY file looks like this:

#---
#profile: tabular-data-resource
#name: my-dataset
#path: https://raw.githubusercontent.com/csvy/csvy.github.io/master/examples/example.csvy
#title: Example file of csvy 
#description: Show a csvy sample file.
#format: csvy
#mediatype: text/vnd.yaml
#encoding: utf-8
#schema:
#  fields:
#  - name: var1
#    type: string
#  - name: var2
#    type: integer
#  - name: var3
#    type: number
#dialect:
#  csvddfVersion: 1.0
#  delimiter: ","
#  doubleQuote: false
#  lineTerminator: "\r\n"
#  quoteChar: "\""
#  skipInitialSpace: true
#  header: true
#sources:
#- title: The csvy specifications
#  path: http://csvy.org/
#  email: ''
#licenses:
#- name: CC-BY-4.0
#  title: Creative Commons Attribution 4.0
#  path: https://creativecommons.org/licenses/by/4.0/
#---
var1,var2,var3
A,1,2.0
B,3,4.3

Which we can read into R like this:

library("csvy")
str(read_csvy(system.file("examples", "example1.csvy", package = "csvy")))
## 'data.frame':	2 obs. of  3 variables:
##  $ var1: chr  "A" "B"
##  $ var2: int  1 3
##  $ var3: num  2 4.3
##  - attr(*, "profile")= chr "tabular-data-resource"
##  - attr(*, "title")= chr "Example file of csvy"
##  - attr(*, "description")= chr "Show a csvy sample file."
##  - attr(*, "name")= chr "my-dataset"
##  - attr(*, "format")= chr "csvy"
##  - attr(*, "sources")=List of 1
##   ..$ :List of 3
##   .. ..$ name : chr "CC-BY-4.0"
##   .. ..$ title: chr "Creative Commons Attribution 4.0"
##   .. ..$ path : chr "https://creativecommons.org/licenses/by/4.0/"

Optional comment characters on the YAML lines make the data readable with any standard CSV parser while retaining the ability to import and export variable- and file-level metadata. The CSVY specification does not use these, but the csvy package for R does, so that you (and other users) can continue to rely on utils::read.csv() or readr::read_csv() as usual. The import() function in rio supports CSVY natively.
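To see why the comment characters matter, here is a self-contained sketch (the file contents are a stripped-down version of the example above): a plain CSV reader skips the commented header entirely.

```r
# Build a minimal CSVY file with a commented YAML header, then read it
# with a plain CSV parser: the "#" prefix makes the header invisible.
f <- tempfile(fileext = ".csvy")
writeLines(c(
  "#---",
  "#name: demo",
  "#---",
  "var1,var2",
  "A,1",
  "B,2"
), f)
d <- utils::read.csv(f, comment.char = "#")
d
#>   var1 var2
#> 1    A    1
#> 2    B    2
```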

Export

To create a CSVY file from R, just do:

library("csvy")
library("datasets")
write_csvy(iris, "iris.csvy")

It is also possible to export the metadata to a separate YAML or JSON file (and then to import from those separate files) by specifying the metadata argument in write_csvy() and read_csvy().
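A sketch of that workflow (the file names here are placeholders):

```r
library("csvy")
library("datasets")
# Write the data as plain CSV and the metadata to a separate YAML file.
write_csvy(iris, "iris.csv", metadata = "iris-metadata.yaml")
# Re-import, pointing read_csvy() at the external metadata file.
d <- read_csvy("iris.csv", metadata = "iris-metadata.yaml")
```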

Import

To read a CSVY into R, just do:

d1 <- read_csvy("iris.csvy")
str(d1)
## 'data.frame':	150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
##   ..- attr(*, "levels")= chr  "setosa" "versicolor" "virginica"
##  - attr(*, "profile")= chr "tabular-data-package"
##  - attr(*, "name")= chr "iris"

or use any other appropriate data import function to ignore the YAML metadata:

d2 <- utils::read.table("iris.csvy", sep = ",", header = TRUE)
str(d2)
## 'data.frame':	150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Package Installation

The package is available on CRAN and can be installed directly in R using:

install.packages("csvy")

The latest development version on GitHub can be installed using remotes:

if (!require("remotes")) {
    install.packages("remotes")
}
remotes::install_github("leeper/csvy")


csvy's People

Contributors

ashiklom, jonocarroll, leeper


csvy's Issues

Support for csvy in packages

This would require:

  • Some advice/tooling for how to place directly in data/ (i.e. accompanied by script that called read_csvy())

  • Some tool to turn metadata into a nice Rd file (via roxygen or otherwise)

Address column mismatches

Copied from gesistsa/rio#110 (@billdenney):

My method for generating .csvy files is via Perl, and the header output order may not match the file output order exactly.

It would be helpful if the fields were matched by column name against the fields' 'name' values rather than simply by order.

I think this would just be a change to the following code in .import.rio_csvy (lines 122-124 of import_method.R, currently):

for (i in seq_along(y$fields)) {
    attributes(out[, i]) <- y$fields[[i]]
}

becomes

already.matched <- rep(FALSE, ncol(out))
for (i in seq_along(y$fields)) {
  idx.match <- (1:ncol(out))[names(out) %in% y$fields[[i]]$name]
  if (length(idx.match) == 0) {
    warning("Field name ", y$fields[[i]]$name, " is not found in the input file; please check your YAML header.")
    next
  } else if (length(idx.match) > 1) {
    warning("Field name ", y$fields[[i]]$name, " is found more than once in the input file; please check your .csv header.")
    next
  } else if (already.matched[idx.match]) {
    warning("Column ", idx.match, " already has a field name match; please check your YAML header.")
    next
  }
  attributes(out[, idx.match]) <- y$fields[[i]]
  already.matched[idx.match] <- TRUE
}

Performance improvements

Currently read_csvy() reads the complete file using readLines(), which means it will be slow for large files. I'd recommend (and can possibly help with) writing a C/C++ read_yaml_header() function that would parse from the first --- to the next ---. This metadata could then be used to generate the column specification that's passed to read.csv(), read_csv(), and fread(). (It will probably still need some additional cleanup afterwards.)
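The idea can be prototyped in pure R before reaching for C/C++; this sketch (function name and details are hypothetical, not part of the package) reads line by line and stops at the closing delimiter, so the data body is never loaded just to get the metadata:

```r
library(yaml)

# Hypothetical helper: read only up to the closing "---" of the YAML
# header and parse it, leaving the data body untouched on disk.
read_yaml_header <- function(file, comment = "#") {
  con <- file(file, "r")
  on.exit(close(con))
  header <- character()
  seen_open <- FALSE
  repeat {
    line <- readLines(con, n = 1L)
    if (length(line) == 0L) break               # EOF before delimiter
    line <- sub(paste0("^", comment), "", line) # strip comment prefix
    if (grepl("^---\\s*$", line)) {
      if (seen_open) break                      # closing delimiter
      seen_open <- TRUE
      next
    }
    if (seen_open) header <- c(header, line)
  }
  yaml::yaml.load(paste(header, collapse = "\n"))
}
```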

Write YAML/JSON metadata in a separate function?

Currently it's possible to write CSVY metadata to a separate file instead of at the top of the CSV file with write_csvy(..., metadata = "path/to/metadata.yaml"). This must be done in conjunction with writing a CSV file, though: write_csvy() doesn't work unless you specify an output file for the CSV. It might be helpful to have a function like write_metadata() that only writes the metadata without writing the CSV too.

For instance, I've run into an issue where I want to use readr::write_csv() to write a CSV because of its NA handling options, speed, tidyverseability, etc., but I also want to include CSVY YAML metadata as a separate file to distribute along with it. My current solution is to write two CSV files—one with write_csv() and one with write_csvy(..., metadata = "blah.yaml")—and then delete the CSV created with write_csvy(). But that's super roundabout and inefficient.
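A minimal sketch of what such a write_metadata() could look like (the function, its signature, and the type mapping are hypothetical, not part of csvy):

```r
library(yaml)

# Hypothetical helper: write Table Schema-style metadata for a data
# frame to a YAML file without writing the CSV itself.
write_metadata <- function(data, file, name = deparse(substitute(data))) {
  fields <- lapply(names(data), function(nm) {
    type <- switch(class(data[[nm]])[1],
                   numeric = "number", integer = "integer",
                   logical = "boolean", factor = "string",
                   "string")  # fallback for everything else
    list(name = nm, type = type)
  })
  meta <- list(profile = "tabular-data-package", name = name,
               fields = fields)
  writeLines(yaml::as.yaml(meta), file)
  invisible(meta)
}
```

Pairing something like this with readr::write_csv() would give the NA handling and speed mentioned above while keeping the metadata alongside the data.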

ok to add columnType as valid field descriptor option?

I'm strongly considering csvy for a project I'm working on in which I want to indicate meaning "hints" for columns in csv files.

It looks like the frictionless data folks have specified an interesting "columnType" option for their Fiscal Data Package Schema (https://specs.frictionlessdata.io/fiscal-data-package/#columntypes).

In an effort to "keep things simple" I don't see implementing their taxonomy and other parts of this scheme. Rather I would just modify the internal "add_variable_metadata" function to copy the value of columnType as yet another variable attribute (similar to the current label support).

Another approach would be to add a new logical "strict" parameter for read_csvy with a default of TRUE. If set to FALSE, it would allow any field variable to be copied as an attribute in the returned R table.

Is this a reasonable thing to do as part of the csvy package? If so, which approach would you prefer? Thank you.

Column attributes are dropped

Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request

The current check failure is due to the label attribute not being preserved on a roundtrip. This appears to be due to two issues: first, the attribute is never read

attr(data[[i]], "label") <- fields_this_col[["label"]]

(the options at this point are 'name', 'title', and 'type').

Secondly, the as.X calls here:

csvy/R/read_csvy.R

Lines 200 to 208 in 01a2f9d

} else if (fields_this_col[["type"]] == "date") {
try(data[[i]] <- as.Date(data[[i]]))
} else if (fields_this_col[["type"]] == "datetime") {
try(data[[i]] <- as.POSIXct(data[[i]]))
} else if (fields_this_col[["type"]] == "boolean") {
try(data[[i]] <- as.logical(data[[i]]))
} else if (fields_this_col[["type"]] == "number") {
try(data[[i]] <- as.numeric(data[[i]]))
}

MRE:

iris2 <- iris
attr(iris2$Sepal.Length, "label") <- "Sepal Length"
attr(iris2$Sepal.Length, "label")
#> [1] "Sepal Length"
iris2$Sepal.Length <- as.numeric(iris2$Sepal.Length)
attr(iris2$Sepal.Length, "label")
#> NULL

This can be remedied by storing then re-applying the attributes on either side of the conversion. PR to follow.
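The fix the PR describes amounts to a few lines: stash the attributes, convert, then restore them.

```r
iris2 <- iris
attr(iris2$Sepal.Length, "label") <- "Sepal Length"

# Save attributes, coerce, then re-apply them so the label survives.
a <- attributes(iris2$Sepal.Length)
iris2$Sepal.Length <- as.numeric(iris2$Sepal.Length)
attributes(iris2$Sepal.Length) <- a
attr(iris2$Sepal.Length, "label")
#> [1] "Sepal Length"
```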

Ideas/plans for additional types

Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request

I like the idea of csvy. I didn't want to invent my own metadata format, so I've been using the yaml file, even though I'm writing the data with different packages (readr, sparklyr, arrow).

However, I realized that the limitation of the column types can get me into trouble. If I have an integer that can't be represented as a 32-bit integer, in R I'll need to store it as a type that can (likely using bit64 or arrow). Those will save out and read back in as strings, which is better than the alternatives that would end up with mangled numbers, but it would be nice to have a better way to deal with them. I imagine that a csvy file with a 64-bit integer saved from Python would call that an integer... so from the point of view of having this be a format that is easy to exchange, it's not ideal.

Has there been any thought about how to handle the extended numeric types for additional precision that we more commonly have to deal with now?

Subsetting (and similar operations) lose attributes

Since the default underlying data structure is a data.frame (or data table), when I subset my data, it loses all its attributes that .csvy provides.

When generating the structure, could you perhaps change the class so that attributes are preserved with something like what is done here: http://stackoverflow.com/questions/10404224/how-to-delete-a-row-from-a-data-frame-without-losing-the-attributes

Ideally, the class would only be changed if attributes were needed. For instance, if there are no file-level attributes, it would stay as a normal data.frame, and if there are no column-level attributes, it would stay as a normal vector.
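A sketch of the approach from that Stack Overflow thread, using a hypothetical csvy_df class whose `[` method re-applies the extra attributes after subsetting:

```r
# Hypothetical subclass: "[" subsets as usual, then copies back any
# attributes beyond the structural ones a data.frame must manage itself.
`[.csvy_df` <- function(x, ...) {
  a <- attributes(x)
  out <- NextMethod()
  keep <- setdiff(names(a), c("names", "row.names", "dim", "dimnames"))
  attributes(out)[keep] <- a[keep]
  out
}

d <- data.frame(x = 1:3)
attr(d, "title") <- "demo"
class(d) <- c("csvy_df", "data.frame")
attr(d[1:2, , drop = FALSE], "title")
#> [1] "demo"
```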

Incorrect values when reading csvy

Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request

Labeling numeric variable with expss package results in incorrect values when reading csvy:

library(dplyr)
library(expss)
library(csvy)

# creating labeled df
cars.labeled  <- mtcars %>% 
  mutate(cyl = as.numeric(cyl),
         disp = as.numeric(disp),
         vs = recode_factor(vs,
                            "0" = "No",
                            "1" = "Yes")
         ) %>% 
  expss::apply_labels(
    cyl = "How many cylinders") %>% 
  write_csvy("cars.labeled.csv")

# reading csvy file, setting "stringsAsFactors" as TRUE because I want to treat them as factors
 cars.imported <-  read_csvy("cars.labeled.csv", stringsAsFactors = T)

# in the labeled df values are fine:
cars.labeled$cyl %>% summary()

# however, in imported df, values do not match to labeled df
cars.imported$cyl %>% summary()


## session info:
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.9

loaded via a namespace (and not attached):
 [1] pillar_1.8.0      compiler_4.2.1    remotes_2.4.2     tools_4.2.1       digest_0.6.29     googledrive_2.0.0 jsonlite_1.8.0    evaluate_0.15    
 [9] lifecycle_1.0.1   gargle_1.2.0      tibble_3.1.8      pkgconfig_2.0.3   rlang_1.0.4       csvy_0.3.0        DBI_1.1.3         cli_3.3.0        
[17] rstudioapi_0.13   yaml_2.3.5        curl_4.3.2        xfun_0.31         fastmap_1.1.0     stringr_1.4.0     knitr_1.39        withr_2.5.0      
[25] httr_1.4.3        generics_0.1.3    fs_1.5.2          vctrs_0.4.1       askpass_1.1       hms_1.1.1         rappdirs_0.3.3    tidyselect_1.1.2 
[33] data.table_1.14.2 glue_1.6.2        R6_2.5.1          fansi_1.0.3       rmarkdown_2.14    tzdb_0.3.0        readr_2.1.2       purrr_0.3.4      
[41] tidyr_1.2.0       magrittr_2.0.3    ellipsis_0.3.2    htmltools_0.5.3   MASS_7.3-57       assertthat_0.2.1  utf8_1.2.2        stringi_1.7.8    
[49] openssl_2.0.2    

Bug report from CRAN

Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request

Put your code here:

in function write_csvy() you have

     if (isTRUE(comment_header)) {
         m <- readLines(textConnection(y))


but you never close the connection.

Rather write:

     if (isTRUE(comment_header)) {
         con <- textConnection(y)
         on.exit(close(con))
         m <- readLines(con)
...


There may be more connections left open in other parts of the package, I 
just report the first found problem.

Please correct before 2018-08-08 to safely retain your package on CRAN.

Dropped factors?

Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request

This is shown in the readme, but is it intended behaviour for factor variables to be converted to text in a roundtrip?

library(csvy)
csvy::write_csvy(iris, "iris.csvy")
all.equal(iris, csvy::read_csvy("iris.csvy"))
#> [1] "Attributes: < Names: 1 string mismatch >"                            
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"   
#> [3] "Attributes: < Component 2: Modes: numeric, character >"              
#> [4] "Attributes: < Component 2: Lengths: 150, 1 >"                        
#> [5] "Attributes: < Component 2: target is numeric, current is character >"
#> [6] "Component \"Species\": 'current' is not a factor"

str(iris)
#> 'data.frame':    150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
str(csvy::read_csvy("iris.csvy"))
#> 'data.frame':    150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
#>   ..- attr(*, "levels")= chr  "setosa" "versicolor" "virginica"
#>  - attr(*, "profile")= chr "tabular-data-package"
#>  - attr(*, "name")= chr "iris"
Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                         
#>  version  R version 3.6.2 (2019-12-12)  
#>  os       elementary OS 5.1 Hera        
#>  system   x86_64, linux-gnu             
#>  ui       X11                           
#>  language en_US                         
#>  collate  en_US.UTF-8                   
#>  ctype    en_US.UTF-8                   
#>  tz       America/Argentina/Buenos_Aires
#>  date     2019-12-16                    
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                         
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.6.0)                 
#>  backports     1.1.5      2019-10-02 [1] CRAN (R 3.6.1)                 
#>  callr         3.3.2      2019-09-22 [1] CRAN (R 3.6.1)                 
#>  cli           1.1.0      2019-03-19 [1] CRAN (R 3.6.0)                 
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.6.0)                 
#>  csvy        * 0.3.0      2019-12-16 [1] Github (leeper/csvy@af0aa8d)   
#>  data.table    1.12.6     2019-10-18 [1] CRAN (R 3.6.1)                 
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 3.6.0)                 
#>  devtools      2.2.0.9000 2019-09-17 [1] Github (r-lib/devtools@2765fbe)
#>  digest        0.6.23     2019-11-23 [1] CRAN (R 3.6.1)                 
#>  ellipsis      0.3.0      2019-09-20 [1] CRAN (R 3.6.1)                 
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 3.6.0)                 
#>  fs            1.3.1      2019-05-06 [1] CRAN (R 3.6.1)                 
#>  glue          1.3.1.9000 2019-09-17 [1] Github (tidyverse/glue@71eeddf)
#>  highr         0.8        2019-03-20 [1] CRAN (R 3.6.0)                 
#>  htmltools     0.4.0      2019-10-04 [1] CRAN (R 3.6.1)                 
#>  jsonlite      1.6        2018-12-07 [1] CRAN (R 3.6.0)                 
#>  knitr         1.25       2019-09-18 [1] CRAN (R 3.6.1)                 
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.6.0)                 
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.6.0)                 
#>  pkgbuild      1.0.6      2019-10-09 [1] CRAN (R 3.6.1)                 
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.6.0)                 
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.6.0)                 
#>  processx      3.4.1      2019-07-18 [1] CRAN (R 3.6.1)                 
#>  ps            1.3.0      2018-12-21 [1] CRAN (R 3.6.0)                 
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 3.6.1)                 
#>  Rcpp          1.0.3      2019-11-08 [1] CRAN (R 3.6.1)                 
#>  remotes       2.1.0      2019-06-24 [1] CRAN (R 3.6.1)                 
#>  rlang         0.4.1.9000 2019-11-12 [1] Github (r-lib/rlang@5a0b80a)   
#>  rmarkdown     1.16       2019-10-01 [1] CRAN (R 3.6.1)                 
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.6.0)                 
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.6.0)                 
#>  stringi       1.4.3      2019-03-12 [1] CRAN (R 3.6.0)                 
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.6.0)                 
#>  testthat      2.3.0      2019-11-05 [1] CRAN (R 3.6.1)                 
#>  usethis       1.5.1      2019-07-04 [1] CRAN (R 3.6.1)                 
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.6.0)                 
#>  xfun          0.11       2019-11-12 [1] CRAN (R 3.6.1)                 
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.6.0)                 
#> 
#> [1] /home/elio/R/x86_64-pc-linux-gnu-library/3.6
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library

The relevant yaml section has this:

#- name: Species
#  type: string
#  levels:
#  - setosa
#  - versicolor
#  - virginica

Which is not really correct, as the type is actually factor. I'm not at all familiar with the csvy spec, so "factor" might not be a possible type. In any case, this applies not only to factors: it seems that write/read_csvy drops class attributes.

For example,

library(csvy)

iris$col <- 1
class(iris$col) <- c("custom_cass", "numeric")
csvy::write_csvy(iris, "iris.csvy")
str(iris)
#> 'data.frame':    150 obs. of  6 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ col         : 'custom_cass' num  1 1 1 1 1 1 1 1 1 1 ...
str(csvy::read_csvy("iris.csvy"))
#> 'data.frame':    150 obs. of  6 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
#>   ..- attr(*, "levels")= chr  "setosa" "versicolor" "virginica"
#>  $ col         : chr  "1" "1" "1" "1" ...
#>  - attr(*, "profile")= chr "tabular-data-package"
#>  - attr(*, "name")= chr "iris"

[feature request] YAML header as a list

Hi,

This is a great package!

I was pretty surprised to find that csvy::get_yaml_header gives a character vector and not a list. It might be a good idea to provide a list output to do something with it. (or a different function altogether say csvy::get_header?)

pacman::p_load("magrittr")
csvy::write_csvy(iris, "iris.csvy")

# this gives a character vector, not readable!
csvy::get_yaml_header("iris.csvy")
#>  [1] "profile: tabular-data-package" "name: iris"                   
#>  [3] "fields:"                       "- name: Sepal.Length"         
#>  [5] "  type: number"                "- name: Sepal.Width"          
#>  [7] "  type: number"                "- name: Petal.Length"         
#>  [9] "  type: number"                "- name: Petal.Width"          
#> [11] "  type: number"                "- name: Species"              
#> [13] "  type: string"                "  levels:"                    
#> [15] "  - setosa"                    "  - versicolor"               
#> [17] "  - virginica"                 "--- "

# metadata is a recursive structure, so a list might be better
metadata_list <- csvy::get_yaml_header("iris.csvy") %>% 
  textConnection() %>% 
  yaml::read_yaml()

metadata_list
#> $profile
#> [1] "tabular-data-package"
#> 
#> $name
#> [1] "iris"
#> 
#> $fields
#> $fields[[1]]
#> $fields[[1]]$name
#> [1] "Sepal.Length"
#> 
#> $fields[[1]]$type
#> [1] "number"
#> 
#> 
#> $fields[[2]]
#> $fields[[2]]$name
#> [1] "Sepal.Width"
#> 
#> $fields[[2]]$type
#> [1] "number"
#> 
#> 
#> $fields[[3]]
#> $fields[[3]]$name
#> [1] "Petal.Length"
#> 
#> $fields[[3]]$type
#> [1] "number"
#> 
#> 
#> $fields[[4]]
#> $fields[[4]]$name
#> [1] "Petal.Width"
#> 
#> $fields[[4]]$type
#> [1] "number"
#> 
#> 
#> $fields[[5]]
#> $fields[[5]]$name
#> [1] "Species"
#> 
#> $fields[[5]]$type
#> [1] "string"
#> 
#> $fields[[5]]$levels
#> [1] "setosa"     "versicolor" "virginica"

Created on 2018-12-18 by the reprex package (v0.2.0).

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       Asia/Kolkata                
#>  date     2018-12-18
#> Packages -----------------------------------------------------------------
#>  package    * version date       source                      
#>  backports    1.1.3   2018-12-14 cran (@1.1.3)               
#>  base       * 3.5.1   2018-07-05 local                       
#>  compiler     3.5.1   2018-07-05 local                       
#>  csvy         0.3.0   2018-12-18 Github (leeper/csvy@af0aa8d)
#>  data.table   1.11.8  2018-09-30 cran (@1.11.8)              
#>  datasets   * 3.5.1   2018-07-05 local                       
#>  devtools     1.13.6  2018-06-27 CRAN (R 3.5.0)              
#>  digest       0.6.18  2018-10-10 cran (@0.6.18)              
#>  evaluate     0.11    2018-07-17 CRAN (R 3.5.0)              
#>  graphics   * 3.5.1   2018-07-05 local                       
#>  grDevices  * 3.5.1   2018-07-05 local                       
#>  htmltools    0.3.6   2017-04-28 CRAN (R 3.5.0)              
#>  jsonlite     1.5     2017-06-01 CRAN (R 3.5.0)              
#>  knitr        1.20    2018-02-20 CRAN (R 3.5.0)              
#>  magrittr   * 1.5     2014-11-22 CRAN (R 3.5.0)              
#>  memoise      1.1.0   2017-04-21 CRAN (R 3.5.0)              
#>  methods    * 3.5.1   2018-07-05 local                       
#>  pacman       0.4.6   2017-05-14 CRAN (R 3.5.0)              
#>  Rcpp         0.12.19 2018-10-01 cran (@0.12.19)             
#>  rmarkdown    1.10    2018-06-11 CRAN (R 3.5.0)              
#>  rprojroot    1.3-2   2018-01-03 CRAN (R 3.5.0)              
#>  stats      * 3.5.1   2018-07-05 local                       
#>  stringi      1.2.4   2018-07-20 CRAN (R 3.5.0)              
#>  stringr      1.3.1   2018-05-10 CRAN (R 3.5.0)              
#>  tools        3.5.1   2018-07-05 local                       
#>  utils      * 3.5.1   2018-07-05 local                       
#>  withr        2.1.2   2018-03-15 CRAN (R 3.5.0)              
#>  yaml         2.2.0   2018-07-25 CRAN (R 3.5.0)

read_csvy doesn't read the primary example on csvy.org

Not sure if the schema is canonical and has changed, but read_csvy assumes there's a top-level fields key, which is now nested in resources[[1]]$schema$fields.

Copy/paste of the file currently on the front page of csvy.org:

---
name: my-dataset
resources:
- order: 1
  schema:
    fields:
    - name: var1
      type: string
    - name: var2
      type: integer
    - name: var3
      type: number
  dialect:
    csvddfVersion: 1.0
    delimiter: ","
    doubleQuote: false
    lineTerminator: "\r\n"
    quoteChar: "\""
    skipInitialSpace: true
    header: true
---
var1,var2,var3
A,1,2.0
B,3,4.3

Current output of read_csvy (CRAN):

read_csvy('~/data.table/inst/tests/test.csvy')
#   var1 var2 var3
# 1    A    1  2.0
# 2    B    3  4.3

Warning message:
In check_metadata(y, out) :
Metadata is missing for variables listed in data: var1, var2, var3
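A tolerant reader could look in both places. This sketch (the helper is hypothetical; y stands for the parsed YAML header as a list) falls back to the nested layout when the flat one is absent:

```r
# Accept both the flat layout used by the R package and the nested
# resources[[1]]$schema$fields layout shown on csvy.org.
get_fields <- function(y) {
  if (!is.null(y$fields)) return(y$fields)
  if (!is.null(y$schema$fields)) return(y$schema$fields)
  if (!is.null(y$resources)) return(y$resources[[1]]$schema$fields)
  NULL
}
```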

utils::read.csv support

Currently the code is just using data.table, but it would be useful to also allow utils::read.csv() support and possibly readr::read_csv() as well.

Nonstandard Column Names Don't Match Fields

When I have nonstandard column names (including characters like spaces, parentheses, etc.), they don't match the field names because read.csv changes them. As an example, "test 2" becomes "test.2". I think related to this, while loading the attached file I get a warning that "Data is missing for variable listed in frontmatter", which doesn't make sense to me.

(.csvy isn't supported by GitHub, so I'm pasting the file as code below.)


---
fields:
  -
    name: 'test 2'
    description: "My test field 2"
    parameter_type: omega
    type: number
generationtime: 'Mon Jul 11 19:37:14 2016'

---
test1,test 2
a,2
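The mangling comes from base R's check.names behaviour, which can be reproduced directly; passing check.names = FALSE keeps the header verbatim so names can be matched against the YAML fields as written:

```r
# make.names() is what read.csv applies to headers by default.
make.names("test 2")
#> [1] "test.2"

# With check.names = FALSE the original header survives intact.
d <- read.csv(text = "test1,test 2\na,2", check.names = FALSE)
names(d)
#> [1] "test1"  "test 2"
```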

Date handling

From an email:

I could not make the Date datatype work as expected. After modifying line 80 in "R/read_csvy.R"

    ## attributes(out[, i]) <- fields_this_col
    attributes(out[, i]) <- append( attributes(out[,i]), fields_this_col)

Date -datatype works as I would expect.

Documentation on https://stat.ethz.ch/R-manual/R-devel/library/base/html/attributes.html says

"Assigning attributes first removes all attributes, then sets any dim attribute and then the remaining attributes
in the order given: this ensures that setting a dim attribute always precedes the dimnames attribute."

Is it possible that the original version removes class -attribute, which never gets set anymore?
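A quick check confirms that wholesale attributes() assignment does clobber the class:

```r
d <- as.Date("1990-01-01")
class(d)
#> [1] "Date"

# Replacing all attributes at once removes "class" unless the
# replacement list includes it, so d reverts to its underlying type.
attributes(d) <- list(myatt = "attribute value")
class(d)
#> [1] "numeric"
```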

I have made a small test, which fails when run against the version downloaded with "git clone https://github.com/leeper/csvy" and passes after the modification. The test is attached below.

test_that( "data-types", {
    context("CSVY imports/exports/data-types")
    test_that( "csvy_write/csvy_read",  {
        attrKey <- "myatt"
        attrValue <- "attribute value"
        df <- data.frame(
            d =c("1990-01-01", "1990-01-02"),
            n =c(1,2.5),
            i =c(1L,2L)            
        )
        df$d <- as.Date( df$d )
        attr( df$d, attrKey ) <- attrValue

        ## Start df
        expect_is( df, "data.frame" )
        expect_is( df$d, "Date" )
        expect_is( df$n, "numeric" )
        expect_is( df$i, "integer" )        
        expect_equal( attr( df$d, attrKey), attrValue )

        ## write/read
        filePath = file.path( "tmp", "csvy-data-types.csvy" )
        ret <- write_csvy( df, filePath  )
        df2 <- read_csvy( filePath )        

        expect_is( df2, "data.frame" )
        expect_is( df2$n, "numeric" )
        expect_is( df2$i, "integer" )        
        expect_is( df2$d, "Date" )
        expect_equal( attr( df2$d, attrKey ), attrValue )  
    })
})

Multiple tables in one file

To keep the metadata and the data together in one file so that they don't get separated, I'd like to keep multiple tables in one TSV file. See an example below. The function read_csvy_tables would read a file and return a list of data frames. How do you feel about this feature, and is it something you'd be interested in implementing, or would you accept a pull request?

Background: Over at jennybc/sanesheets#3 we're discussing a tidy file format that provides some of the benefits of a spreadsheet without all the ick of a spreadsheet. For one, we'd like a plain text file format. One of the nice features of a spreadsheet is the ability to keep multiple related sheets in one file.

sanesheet.tsvy


---
name: sheet1

---
A   B
1   X
2   Y

---
name: sheet2

---
C   D   E
3   4   5
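A rough sketch of what read_csvy_tables() could do (everything here, including the function name, is hypothetical): split the file on the YAML delimiters and read each chunk into its own data frame.

```r
library(yaml)

# Hypothetical reader: a file of alternating YAML headers and tables
# (as in the example above) becomes a named list of data frames.
read_csvy_tables <- function(file, sep = "\t") {
  lines <- readLines(file)
  delim <- grep("^---\\s*$", lines)
  stopifnot(length(delim) %% 2 == 0)   # headers must open and close
  starts <- delim[seq(1, length(delim), by = 2)]
  ends   <- delim[seq(2, length(delim), by = 2)]
  out <- list()
  for (k in seq_along(starts)) {
    meta <- yaml::yaml.load(
      paste(lines[(starts[k] + 1):(ends[k] - 1)], collapse = "\n"))
    last <- if (k < length(starts)) starts[k + 1] - 1 else length(lines)
    body <- lines[(ends[k] + 1):last]
    body <- body[nzchar(body)]         # drop blank separator lines
    out[[meta$name]] <- read.table(text = paste(body, collapse = "\n"),
                                   sep = sep, header = TRUE)
  }
  out
}
```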

Tabs / TSVY?

One feature of the CSV ecosystem is the use of either tabs (TSV files) or, alternatively, the ability to specify an arbitrary custom delimiter.

Could you see any affordances for TSV files in CSVY -- i.e. extending the specification ("or YAML+TSV") and perhaps extending toolchains to deal with that case?
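The Table Schema dialect block already has a slot for this, so one low-cost accommodation would be to set the delimiter in the header. A sketch (this is an illustration, not part of the current spec text):

```yaml
#---
#dialect:
#  csvddfVersion: 1.0
#  delimiter: "\t"
#  header: true
#---
```

A reader that honours dialect$delimiter would then handle TSVY files with no further changes to the toolchain.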

Character vectors lose leading zeroes

Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request

Writing and reading seem to mess up character vectors with leading zeroes.

data <- data.frame(x = 1, y = c("10", "05"))
file <- tempfile()
csvy::write_csvy(data, file, name = "abc")
csvy::read_csvy(file)
#>   x  y
#> 1 1 10
#> 2 1  5
str(csvy::read_csvy(file))
#> 'data.frame':    2 obs. of  2 variables:
#>  $ x: num  1 1
#>  $ y: chr  "10" "5"
#>   ..- attr(*, "levels")= chr  "05" "10"
#>  - attr(*, "profile")= chr "tabular-data-package"
#>  - attr(*, "name")= chr "abc"

Created on 2019-12-16 by the reprex package (v0.3.0)

Maintainership offer

Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request
  • offering to take on some responsibility

As per Twitter, I'm willing to lend a hand with this package. It looks to be in good condition and the code-burden looks to be quite light. I have an interest in using this format for storing structured data. Of course, if you have anyone more suitable in mind then I defer to them.

My preference would be a transition period in which I tackle some issues and triage whatever I can to get accustomed to the codebase.
