tidyxl

tidyxl imports non-tabular data from Excel files into R. It exposes cell content, position, formatting and comments in a tidy structure for further manipulation, especially by the unpivotr package. It supports the XML-based file formats '.xlsx' and '.xlsm' via the embedded RapidXML C++ library. It does not support the binary file formats '.xlsb' or '.xls'.

Mailing list

For bugs and/or issues, create a new issue on GitHub. For other questions or comments, please subscribe to the tidyxl-devel mailing list. You must be a member to post messages, but anyone can read the archived discussions.

Installation

devtools::install_github("nacnudus/tidyxl")

Examples

The package includes a spreadsheet, 'titanic.xlsx', which contains the following pivot table:

ftable(Titanic, row.vars = 1:2)
#>              Age      Child     Adult    
#>              Survived    No Yes    No Yes
#> Class Sex                                
#> 1st   Male                0   5   118  57
#>       Female              0   1     4 140
#> 2nd   Male                0  11   154  14
#>       Female              0  13    13  80
#> 3rd   Male               35  13   387  75
#>       Female             17  14    89  76
#> Crew  Male                0   0   670 192
#>       Female              0   0     3  20

The multi-row column headers make this difficult to import. A popular package for importing spreadsheets coerces the pivot table into a data frame, treating the second header row as though it were an observation.

titanic <- system.file("extdata/titanic.xlsx", package = "tidyxl")
readxl::read_excel(titanic)
#> # A tibble: 10 × 7
#>       ``     ``      Age Child  `NA` Adult  `NA`
#>    <chr>  <chr>    <chr> <chr> <chr> <chr> <chr>
#> 1   <NA>   <NA> Survived    No   Yes    No   Yes
#> 2  Class    Sex     <NA>  <NA>  <NA>  <NA>  <NA>
#> 3    1st   Male     <NA>     0     5   118    57
#> 4   <NA> Female     <NA>     0     1     4   140
#> 5    2nd   Male     <NA>     0    11   154    14
#> 6   <NA> Female     <NA>     0    13    13    80
#> 7    3rd   Male     <NA>    35    13   387    75
#> 8   <NA> Female     <NA>    17    14    89    76
#> 9   Crew   Male     <NA>     0     0   670   192
#> 10  <NA> Female     <NA>     0     0     3    20

tidyxl doesn't coerce the pivot table into a data frame. Instead, it represents each cell in its own row, where it describes the cell's address, value and other properties.

library(tidyxl)
x <- tidy_xlsx(titanic)$data$Sheet1
# Specific sheets can be requested using `tidy_xlsx(file, sheet)`
str(x)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    60 obs. of  20 variables:
#>  $ address        : chr  "C1" "D1" "E1" "F1" ...
#>  $ row            : int  1 1 1 1 1 2 2 2 2 2 ...
#>  $ col            : int  3 4 5 6 7 3 4 5 6 7 ...
#>  $ content        : chr  "0" "1" NA "2" ...
#>  $ formula        : chr  NA NA NA NA ...
#>  $ formula_type   : chr  NA NA NA NA ...
#>  $ formula_ref    : chr  NA NA NA NA ...
#>  $ formula_group  : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ type           : chr  "s" "s" NA "s" ...
#>  $ data_type      : chr  "character" "character" "blank" "character" ...
#>  $ error          : chr  NA NA NA NA ...
#>  $ logical        : logi  NA NA NA NA NA NA ...
#>  $ numeric        : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ date           : POSIXct, format: NA NA ...
#>  $ character      : chr  "Age" "Child" NA "Adult" ...
#>  $ comment        : chr  NA NA NA NA ...
#>  $ height         : num  15 15 15 15 15 15 15 15 15 15 ...
#>  $ width          : num  8.38 8.38 8.38 8.38 8.38 8.38 8.38 8.38 8.38 8.38 ...
#>  $ style_format   : chr  "Normal" "Normal" "Normal" "Normal" ...
#>  $ local_format_id: int  2 3 3 3 3 2 3 3 3 3 ...

In this structure, the cells can be found by filtering.

x[x$data_type == "character", c("address", "character")]
#> # A tibble: 22 × 2
#>    address character
#>      <chr>     <chr>
#> 1       C1       Age
#> 2       D1     Child
#> 3       F1     Adult
#> 4       C2  Survived
#> 5       D2        No
#> 6       E2       Yes
#> 7       F2        No
#> 8       G2       Yes
#> 9       A3     Class
#> 10      B3       Sex
#> # ... with 12 more rows
x[x$row == 4, c("address", "character", "numeric")]
#> # A tibble: 6 × 3
#>   address character numeric
#>     <chr>     <chr>   <dbl>
#> 1      A4       1st      NA
#> 2      B4      Male      NA
#> 3      D4      <NA>       0
#> 4      E4      <NA>       5
#> 5      F4      <NA>     118
#> 6      G4      <NA>      57

Formatting

The original spreadsheet has formatting applied to the cells. This can also be retrieved using tidyxl.

[Screenshot of the formatted spreadsheet]

Formatting is available by using the columns local_format_id and style_format as indexes into a separate list-of-lists structure. 'Local' formatting is the most common kind, applied to individual cells. 'Style' formatting is usually applied to blocks of cells, and defines several formats at once. Here is a screenshot of the styles buttons in Excel.

[Screenshot of the styles buttons in Excel]

Formatting can be looked up as follows.

# Bold
formats <- tidy_xlsx(titanic)$formats
formats$local$font$bold
#> [1] FALSE  TRUE FALSE FALSE
x[x$local_format_id %in% which(formats$local$font$bold),
  c("address", "character")]
#> # A tibble: 4 × 2
#>   address character
#>     <chr>     <chr>
#> 1      C1       Age
#> 2      C2  Survived
#> 3      A3     Class
#> 4      B3       Sex

# Yellow fill
formats$local$fill$patternFill$fgColor$rgb
#> [1] NA         NA         NA         "FFFFFF00"
x[x$local_format_id %in%
  which(formats$local$fill$patternFill$fgColor$rgb == "FFFFFF00"),
  c("address", "numeric")]
#> # A tibble: 2 × 2
#>   address numeric
#>     <chr>   <dbl>
#> 1     F11       3
#> 2     G11      20

# Styles by name
formats$style$font$name["Normal"]
#>    Normal 
#> "Calibri"
head(x[x$style_format == "Normal", c("address", "character")])
#> # A tibble: 6 × 2
#>   address character
#>     <chr>     <chr>
#> 1      C1       Age
#> 2      D1     Child
#> 3      E1      <NA>
#> 4      F1     Adult
#> 5      G1      <NA>
#> 6      C2  Survived

To see all the available kinds of formats, use str(formats).

Comments

Comments are available alongside cell values.

x[!is.na(x$comment), c("address", "comment")]
#> # A tibble: 1 × 2
#>   address                                                     comment
#>     <chr>                                                       <chr>
#> 1     G11 All women in the crew worked in the victualling department.

Formulas

Formulas are available, but with a few quirks.

options(width = 120)
y <- tidy_xlsx(system.file("extdata/examples.xlsx", package = "tidyxl"),
               "Sheet1")$data[[1]]
y[!is.na(y$formula),
  c("address", "formula", "formula_type", "formula_ref", "formula_group",
    "error", "logical", "numeric", "date", "character")]
#> # A tibble: 14 × 10
#>    address              formula formula_type formula_ref formula_group   error logical numeric       date     character
#>      <chr>                <chr>        <chr>       <chr>         <int>   <chr>   <lgl>   <dbl>     <dttm>         <chr>
#> 1       A1                  1/0         <NA>        <NA>            NA #DIV/0!      NA      NA       <NA>          <NA>
#> 2      A14                  1=1         <NA>        <NA>            NA    <NA>    TRUE      NA       <NA>          <NA>
#> 3      A15                 A4+1         <NA>        <NA>            NA    <NA>      NA    1338       <NA>          <NA>
#> 4      A16      DATE(2017,1,18)         <NA>        <NA>            NA    <NA>      NA      NA 2017-01-18          <NA>
#> 5      A17      "Hello, World!"         <NA>        <NA>            NA    <NA>      NA      NA       <NA> Hello, World!
#> 6      A19                A18+1         <NA>        <NA>            NA    <NA>      NA       2       <NA>          <NA>
#> 7      B19                A18+2         <NA>        <NA>            NA    <NA>      NA       3       <NA>          <NA>
#> 8      A20                A19+1       shared     A20:A21             0    <NA>      NA       3       <NA>          <NA>
#> 9      B20                A19+2       shared     B20:B21             1    <NA>      NA       4       <NA>          <NA>
#> 10     A21                            shared        <NA>             0    <NA>      NA       4       <NA>          <NA>
#> 11     B21                            shared        <NA>             1    <NA>      NA       5       <NA>          <NA>
#> 12     A22 SUM(A19:A21*B19:B21)        array         A22            NA    <NA>      NA      38       <NA>          <NA>
#> 13     A23      A19:A20*B19:B20        array     A23:A24            NA    <NA>      NA       6       <NA>          <NA>
#> 14     A25       [1]Sheet1!$A$1         <NA>        <NA>            NA    <NA>      NA      NA       <NA>        normal

The top five cells show that the results of formulas are available as usual in the columns error, logical, numeric, date, and character.

Cells A20 and A21 share a formula definition. The formula is given against cell A20, and assigned to formula_group 0, which spans the cells given by the formula_ref, A20:A21. A spreadsheet application would infer that cell A21 had the formula A20+1. Cells B20 and B21 are similar. The roadmap for tidyxl includes de-normalising shared formulas. If you can suggest how to tokenize Excel formulas, then please contact me.
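
As a rough illustration of what de-normalising could look like, the base R sketch below fills in the blank formulas of shared-formula cells by shifting the references of the group's first formula. It is not part of tidyxl: the helpers col_to_num, num_to_col and shift_formula are hypothetical, the cells data frame is typed by hand from the output above, and the regular expression handles only plain relative A1-style references (no "$", sheet names, or function names that look like cell addresses).

```r
# Hypothetical helpers, not part of tidyxl.

# Convert column letters to a number: "A" -> 1, "Z" -> 26, "AA" -> 27
col_to_num <- function(letters_chr) {
  digits <- utf8ToInt(letters_chr) - utf8ToInt("A") + 1
  sum(digits * 26^(rev(seq_along(digits)) - 1))
}

# Convert a column number back to letters: 1 -> "A", 27 -> "AA"
num_to_col <- function(n) {
  out <- character(0)
  while (n > 0) {
    out <- c(LETTERS[(n - 1) %% 26 + 1], out)
    n <- (n - 1) %/% 26
  }
  paste(out, collapse = "")
}

# Shift every plain relative reference in `formula` by drow rows and dcol columns
shift_formula <- function(formula, drow, dcol) {
  m <- gregexpr("[A-Z]+[0-9]+", formula)
  refs <- regmatches(formula, m)[[1]]
  shifted <- vapply(refs, function(ref) {
    col <- col_to_num(sub("[0-9]+$", "", ref))
    row <- as.integer(sub("^[A-Z]+", "", ref))
    paste0(num_to_col(col + dcol), row + drow)
  }, character(1), USE.NAMES = FALSE)
  regmatches(formula, m) <- list(shifted)
  formula
}

# Hand-typed copy of the four shared-formula cells shown above
cells <- data.frame(
  address       = c("A20", "B20", "A21", "B21"),
  row           = c(20L, 20L, 21L, 21L),
  col           = c(1L, 2L, 1L, 2L),
  formula       = c("A19+1", "A19+2", "", ""),
  formula_group = c(0L, 1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# The cell that carries the formula text for each group
masters <- cells[which(cells$formula != ""), ]

# Fill each blank formula from its group's master, offset by position
for (i in which(cells$formula == "")) {
  master <- masters[which(masters$formula_group == cells$formula_group[i]), ]
  cells$formula[i] <- shift_formula(master$formula,
                                    cells$row[i] - master$row,
                                    cells$col[i] - master$col)
}

cells[, c("address", "formula")]
```

With this input, A21 receives A20+1 and B21 receives A20+2. A real implementation would need a proper formula tokenizer to deal with absolute references, strings and sheet names.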

Cell A22 contains an array formula, which, in a spreadsheet application, would appear with curly braces {SUM(A19:A21*B19:B21)}. Cells A23 and A24 contain a single multi-cell array formula (single formula, multi-cell result), indicated by the formula_ref, but unlike cells A20:A21 and B20:B21, the formula for A24 is NA rather than blank (""), and it doesn't have a formula_group.

Cell A25 contains a formula that refers to another file. The [1] is an index into a table of files. The roadmap for tidyxl includes de-referencing such numbers.

tidyxl imports the same table into a format suitable for non-tabular processing (see e.g. the unpivotr package in 'Similar projects' below).

Philosophy

Information in many spreadsheets cannot be easily imported into R. Why?

Most R packages that import spreadsheets have difficulty unless the layout of the spreadsheet conforms to a strict definition of a 'table', e.g.:

  • observations in rows
  • variables in columns
  • a single header row
  • all information represented by characters, whether textual, logical, or numeric

These rules are designed to eliminate ambiguity in the interpretation of the information. But most spreadsheet software relaxes these rules, accepting ambiguity in exchange for expressing information through other media:

  • proximity (other than headers, i.e. other than being the first value at the top of a column)
  • formatting (colours and borders)

Humans can usually resolve the ambiguities with contextual knowledge, but computers are limited by their ignorance. Programmers are hampered by:

  • their language's expressiveness
  • loss of information in transfer from spreadsheet to programming library

Information is lost when software discards it in order to force the data into tabular form. Sometimes date formatting is retained, but mostly formatting is lost, and position has to be inferred again.

tidyxl addresses the programmer's problems by not discarding information. It imports the content, position and formatting of cells, leaving it up to the user to associate the different forms of information, and to re-encode them in tabular form without loss. The unpivotr package has been developed to assist with that step.
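
To make the re-encoding step concrete, here is a minimal base R sketch that uses cell positions to attach the Titanic headers to each data cell. It is an illustration under assumptions, not how unpivotr works: the cells data frame is a hand-typed subset (first class only) of the cells shown in the examples, and the helper nearest is hypothetical.

```r
# Hand-typed subset of the Titanic cells (first class only), using the same
# column names that tidyxl returns.
cells <- data.frame(
  row       = c(1, 1, 2, 2, 2, 2, 4, 4, 5, 4, 4, 4, 4, 5, 5, 5, 5),
  col       = c(4, 6, 4, 5, 6, 7, 1, 2, 2, 4, 5, 6, 7, 4, 5, 6, 7),
  character = c("Child", "Adult", "No", "Yes", "No", "Yes",
                "1st", "Male", "Female", rep(NA, 8)),
  numeric   = c(rep(NA, 9), 0, 5, 118, 57, 0, 1, 4, 140),
  stringsAsFactors = FALSE
)

# Hypothetical helper: value of the closest header at or before position `pos`.
# This covers headers that span several rows or columns but are stored only in
# the first cell of the span (e.g. "Child" in D1 covering D and E).
nearest <- function(pos, header_pos, header_value) {
  candidates <- header_pos <= pos
  header_value[candidates][which.max(header_pos[candidates])]
}

# Split the cells by role, using position alone
data_cells   <- cells[!is.na(cells$numeric), ]
age_hdr      <- cells[cells$row == 1, ]  # spans two columns each
survived_hdr <- cells[cells$row == 2, ]  # one per data column
class_hdr    <- cells[cells$col == 1, ]  # spans two rows each
sex_hdr      <- cells[cells$col == 2, ]  # one per data row

# One row per observation, with every header as its own variable
tidy <- data.frame(
  Class    = vapply(data_cells$row, nearest, character(1),
                    header_pos = class_hdr$row,
                    header_value = class_hdr$character),
  Sex      = sex_hdr$character[match(data_cells$row, sex_hdr$row)],
  Age      = vapply(data_cells$col, nearest, character(1),
                    header_pos = age_hdr$col,
                    header_value = age_hdr$character),
  Survived = survived_hdr$character[match(data_cells$col, survived_hdr$col)],
  n        = data_cells$numeric,
  stringsAsFactors = FALSE
)

tidy
```

Each of the eight data cells becomes one row with its Class, Sex, Age and Survived values, so nothing from the header rows is lost or mistaken for an observation.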

Similar projects

tidyxl was originally derived from readxl and still contains some of the same code, hence it inherits the GPL-3 licence. readxl is intended for importing tabular data with a single row of column headers, whereas tidyxl is more general, and less magic.

The rsheets project of several R packages is in the early stages of importing spreadsheet information from Excel and Google Sheets into R, manipulating it, and potentially parsing and processing formulas and writing out to spreadsheet files. In particular, jailbreaker attempts to extract non-tabular data from spreadsheets into tabular structures automatically via some clever algorithms.

tidyxl differs from rsheets in scope (tidyxl will never import charts, for example), and implementation (tidyxl is implemented mainly in C++ and is quite fast, only a little slower than readxl). unpivotr is a package related to tidyxl that provides tools for unpivoting complex and non-tabular data layouts using I not AI (intelligence, not artificial intelligence). In this way it corresponds to jailbreaker, but with a different philosophy.

Roadmap

  • Parse shared formulas and propagate to all associated cells.
  • Propagate array formulas to all associated cells.
  • Parse dates.
  • Detect cell types (date, boolean, string, number).
  • Implement formatting import in C++ for speed.
  • Write more tests.
