tidyxl

tidyxl imports non-tabular data from Excel files into R. It exposes cell content, position, formatting and comments in a tidy structure for further manipulation, especially by the unpivotr package. It supports the XML-based file formats '.xlsx' and '.xlsm' via the embedded RapidXML C++ library. It does not support the binary file formats '.xlsb' or '.xls'.

Mailing list

For bugs and/or issues, create a new issue on GitHub. For other questions or comments, please subscribe to the tidyxl-devel mailing list. You must be a member to post messages, but anyone can read the archived discussions.

Installation

devtools::install_github("nacnudus/tidyxl")

Examples

The package includes a spreadsheet, 'titanic.xlsx', which contains the following pivot table:

ftable(Titanic, row.vars = 1:2)
#>              Age      Child     Adult    
#>              Survived    No Yes    No Yes
#> Class Sex                                
#> 1st   Male                0   5   118  57
#>       Female              0   1     4 140
#> 2nd   Male                0  11   154  14
#>       Female              0  13    13  80
#> 3rd   Male               35  13   387  75
#>       Female             17  14    89  76
#> Crew  Male                0   0   670 192
#>       Female              0   0     3  20

The multi-row column headers make this difficult to import. A popular package for importing spreadsheets coerces the pivot table into a data frame. It treats the second header row as though it were observations.

titanic <- system.file("extdata/titanic.xlsx", package = "tidyxl")
readxl::read_excel(titanic)
#> # A tibble: 10 × 7
#>       ``     ``      Age Child  `NA` Adult  `NA`
#>    <chr>  <chr>    <chr> <chr> <chr> <chr> <chr>
#> 1   <NA>   <NA> Survived    No   Yes    No   Yes
#> 2  Class    Sex     <NA>  <NA>  <NA>  <NA>  <NA>
#> 3    1st   Male     <NA>     0     5   118    57
#> 4   <NA> Female     <NA>     0     1     4   140
#> 5    2nd   Male     <NA>     0    11   154    14
#> 6   <NA> Female     <NA>     0    13    13    80
#> 7    3rd   Male     <NA>    35    13   387    75
#> 8   <NA> Female     <NA>    17    14    89    76
#> 9   Crew   Male     <NA>     0     0   670   192
#> 10  <NA> Female     <NA>     0     0     3    20

tidyxl doesn't coerce the pivot table into a data frame. Instead, it represents each cell in its own row, where it describes the cell's address, value and other properties.

library(tidyxl)
x <- tidy_xlsx(titanic)$data$Sheet1
# Specific sheets can be requested using `tidy_xlsx(file, sheet)`
str(x)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    60 obs. of  20 variables:
#>  $ address        : chr  "C1" "D1" "E1" "F1" ...
#>  $ row            : int  1 1 1 1 1 2 2 2 2 2 ...
#>  $ col            : int  3 4 5 6 7 3 4 5 6 7 ...
#>  $ content        : chr  "0" "1" NA "2" ...
#>  $ formula        : chr  NA NA NA NA ...
#>  $ formula_type   : chr  NA NA NA NA ...
#>  $ formula_ref    : chr  NA NA NA NA ...
#>  $ formula_group  : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ type           : chr  "s" "s" NA "s" ...
#>  $ data_type      : chr  "character" "character" "blank" "character" ...
#>  $ error          : chr  NA NA NA NA ...
#>  $ logical        : logi  NA NA NA NA NA NA ...
#>  $ numeric        : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ date           : POSIXct, format: NA NA ...
#>  $ character      : chr  "Age" "Child" NA "Adult" ...
#>  $ comment        : chr  NA NA NA NA ...
#>  $ height         : num  15 15 15 15 15 15 15 15 15 15 ...
#>  $ width          : num  8.38 8.38 8.38 8.38 8.38 8.38 8.38 8.38 8.38 8.38 ...
#>  $ style_format   : chr  "Normal" "Normal" "Normal" "Normal" ...
#>  $ local_format_id: int  2 3 3 3 3 2 3 3 3 3 ...

In this structure, the cells can be found by filtering.

x[x$data_type == "character", c("address", "character")]
#> # A tibble: 22 × 2
#>    address character
#>      <chr>     <chr>
#> 1       C1       Age
#> 2       D1     Child
#> 3       F1     Adult
#> 4       C2  Survived
#> 5       D2        No
#> 6       E2       Yes
#> 7       F2        No
#> 8       G2       Yes
#> 9       A3     Class
#> 10      B3       Sex
#> # ... with 12 more rows
x[x$row == 4, c("address", "character", "numeric")]
#> # A tibble: 6 × 3
#>   address character numeric
#>     <chr>     <chr>   <dbl>
#> 1      A4       1st      NA
#> 2      B4      Male      NA
#> 3      D4      <NA>       0
#> 4      E4      <NA>       5
#> 5      F4      <NA>     118
#> 6      G4      <NA>      57

Formatting

The original spreadsheet has formatting applied to the cells. This can also be retrieved using tidyxl.

Formatting is available by using the columns local_format_id and style_format as indexes into a separate list-of-lists structure. 'Local' formatting is the most common kind, applied to individual cells. 'Style' formatting is usually applied to blocks of cells, and defines several formats at once. Here is a screenshot of the styles buttons in Excel.

Formatting can be looked up as follows.

# Bold
formats <- tidy_xlsx(titanic)$formats
formats$local$font$bold
#> [1] FALSE  TRUE FALSE FALSE
x[x$local_format_id %in% which(formats$local$font$bold),
  c("address", "character")]
#> # A tibble: 4 × 2
#>   address character
#>     <chr>     <chr>
#> 1      C1       Age
#> 2      C2  Survived
#> 3      A3     Class
#> 4      B3       Sex

# Yellow fill
formats$local$fill$patternFill$fgColor$rgb
#> [1] NA         NA         NA         "FFFFFF00"
x[x$local_format_id %in%
  which(formats$local$fill$patternFill$fgColor$rgb == "FFFFFF00"),
  c("address", "numeric")]
#> # A tibble: 2 × 2
#>   address numeric
#>     <chr>   <dbl>
#> 1     F11       3
#> 2     G11      20

# Styles by name
formats$style$font$name["Normal"]
#>    Normal 
#> "Calibri"
head(x[x$style_format == "Normal", c("address", "character")])
#> # A tibble: 6 × 2
#>   address character
#>     <chr>     <chr>
#> 1      C1       Age
#> 2      D1     Child
#> 3      E1      <NA>
#> 4      F1     Adult
#> 5      G1      <NA>
#> 6      C2  Survived

To see all the available kinds of formats, use str(formats).

Comments

Comments are available alongside cell values.

x[!is.na(x$comment), c("address", "comment")]
#> # A tibble: 1 × 2
#>   address                                                     comment
#>     <chr>                                                       <chr>
#> 1     G11 All women in the crew worked in the victualling department.

Formulas

Formulas are available, but with a few quirks.

options(width = 120)
y <- tidy_xlsx(system.file("extdata/examples.xlsx", package = "tidyxl"),
               "Sheet1")$data[[1]]
y[!is.na(y$formula),
  c("address", "formula", "formula_type", "formula_ref", "formula_group",
    "error", "logical", "numeric", "date", "character")]
#> # A tibble: 14 × 10
#>    address              formula formula_type formula_ref formula_group   error logical numeric       date     character
#>      <chr>                <chr>        <chr>       <chr>         <int>   <chr>   <lgl>   <dbl>     <dttm>         <chr>
#> 1       A1                  1/0         <NA>        <NA>            NA #DIV/0!      NA      NA       <NA>          <NA>
#> 2      A14                  1=1         <NA>        <NA>            NA    <NA>    TRUE      NA       <NA>          <NA>
#> 3      A15                 A4+1         <NA>        <NA>            NA    <NA>      NA    1338       <NA>          <NA>
#> 4      A16      DATE(2017,1,18)         <NA>        <NA>            NA    <NA>      NA      NA 2017-01-18          <NA>
#> 5      A17      "Hello, World!"         <NA>        <NA>            NA    <NA>      NA      NA       <NA> Hello, World!
#> 6      A19                A18+1         <NA>        <NA>            NA    <NA>      NA       2       <NA>          <NA>
#> 7      B19                A18+2         <NA>        <NA>            NA    <NA>      NA       3       <NA>          <NA>
#> 8      A20                A19+1       shared     A20:A21             0    <NA>      NA       3       <NA>          <NA>
#> 9      B20                A19+2       shared     B20:B21             1    <NA>      NA       4       <NA>          <NA>
#> 10     A21                            shared        <NA>             0    <NA>      NA       4       <NA>          <NA>
#> 11     B21                            shared        <NA>             1    <NA>      NA       5       <NA>          <NA>
#> 12     A22 SUM(A19:A21*B19:B21)        array         A22            NA    <NA>      NA      38       <NA>          <NA>
#> 13     A23      A19:A20*B19:B20        array     A23:A24            NA    <NA>      NA       6       <NA>          <NA>
#> 14     A25       [1]Sheet1!$A$1         <NA>        <NA>            NA    <NA>      NA      NA       <NA>        normal

The top five cells show that the results of formulas are available as usual in the columns error, logical, numeric, date, and character.

Cells A20 and A21 share a formula definition. The formula is given against cell A20 and assigned to formula_group 0, which spans the cells given by formula_ref, A20:A21. A spreadsheet application would infer that cell A21 has the formula A20+1. Cells B20 and B21 are similar. The roadmap for tidyxl includes de-normalising shared formulas. If you can suggest how to tokenize Excel formulas, then please contact me.
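Until shared formulas are de-normalised, the grouping can still be used to look up which formula a cell inherits. The following is only a minimal sketch in base R, not part of tidyxl: it uses a small hand-built data frame (named cells here purely for illustration) that mimics the relevant columns of y, and it copies each group's defining formula text onto every cell in the same formula_group. It deliberately does not adjust relative references, so A21 receives the text A19+1 rather than A20+1, because shifting references would require tokenizing the formula.

```r
# Hand-built stand-in for a few columns of `y` (illustrative, not real output).
cells <- data.frame(
  address       = c("A20", "B20", "A21", "B21"),
  formula       = c("A19+1", "A19+2", "", ""),
  formula_type  = "shared",
  formula_ref   = c("A20:A21", "B20:B21", NA, NA),
  formula_group = c(0L, 1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# The defining cell of each group is the one that carries a formula_ref.
masters <- cells[!is.na(cells$formula_ref), c("formula_group", "formula")]

# Look up the defining formula for every cell via its group id.
cells$master_formula <- masters$formula[match(cells$formula_group,
                                              masters$formula_group)]

cells[, c("address", "formula", "master_formula")]
```

The extra column keeps the original formula column intact, so the unadjusted text is never mistaken for the formula a spreadsheet application would display.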

Cell A22 contains an array formula, which, in a spreadsheet application, would appear with curly braces {SUM(A19:A21*B19:B21)}. Cells A23 and A24 contain a single multi-cell array formula (single formula, multi-cell result), indicated by the formula_ref, but unlike cells A20:A21 and B20:B21, the formula for A24 is NA rather than blank (""), and it doesn't have a formula_group.
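To find every cell covered by a formula_ref such as A23:A24, the range can be expanded into individual addresses. The helpers below (col_to_num, num_to_col and expand_ref) are hypothetical names written for this sketch in base R; they are not tidyxl functions. They assume a plain reference with upper-case column letters, no sheet name and no $ markers.

```r
# Convert column letters ("A", "AB") to a column number (1, 28).
col_to_num <- function(letters_chr) {
  digits <- utf8ToInt(letters_chr) - utf8ToInt("A") + 1L
  sum(digits * 26^(rev(seq_along(digits)) - 1L))
}

# Convert a column number back to letters (28 -> "AB").
num_to_col <- function(n) {
  out <- character(0)
  while (n > 0) {
    out <- c(LETTERS[(n - 1) %% 26 + 1], out)
    n <- (n - 1) %/% 26
  }
  paste(out, collapse = "")
}

# Expand a reference like "A23:A24" into the addresses it covers.
expand_ref <- function(ref) {
  ends <- strsplit(ref, ":", fixed = TRUE)[[1]]
  if (length(ends) == 1) ends <- rep(ends, 2)  # single-cell ref such as "A22"
  cols <- vapply(gsub("[0-9]", "", ends), col_to_num, numeric(1))
  rows <- as.integer(gsub("[A-Z]", "", ends))
  grid <- expand.grid(col = seq(cols[1], cols[2]),
                      row = seq(rows[1], rows[2]))
  paste0(vapply(grid$col, num_to_col, character(1)), grid$row)
}

expand_ref("A23:A24")
```

The resulting addresses could then be matched against the address column to attach the array formula to A24 as well as A23.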

Cell A25 contains a formula that refers to another file. The [1] is an index into a table of files. The roadmap for tidyxl includes de-referencing such numbers.

tidyxl imports the same table into a format suitable for non-tabular processing (see e.g. the unpivotr package in 'Similar projects' below).

Philosophy

Information in many spreadsheets cannot be easily imported into R. Why?

Most R packages that import spreadsheets have difficulty unless the layout of the spreadsheet conforms to a strict definition of a 'table', e.g.:

observations in rows
variables in columns
a single header row
all information represented by characters, whether textual, logical, or numeric

These rules are designed to eliminate ambiguity in the interpretation of the information. But most spreadsheeting software relaxes these rules in a trade of ambiguity for expression via other media:

proximity (other than headers, i.e. other than being the first value at the top of a column)
formatting (colours and borders)

Humans can usually resolve the ambiguities with contextual knowledge, but computers are limited by their ignorance. Programmers are hampered by:

their language's expressiveness
loss of information in transfer from spreadsheet to programming library

Information is lost when software discards it in order to force the data into tabular form. Sometimes date formatting is retained, but mostly formatting is lost, and position has to be inferred again.

tidyxl addresses the programmer's problems by not discarding information. It imports the content, position and formatting of cells, leaving it up to the user to associate the different forms of information, and to re-encode them in tabular form without loss. The unpivotr package has been developed to assist with that step.
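As a rough illustration of that re-encoding step, the sketch below uses only base R and a hand-built data frame that mimics the row, col, character and numeric columns for part of the Titanic sheet (rows 1, 2, 4 and 5). It is not the unpivotr API, and nearest_before is a hypothetical helper written for this example. Each data cell is associated with the nearest header at or before its position: above it for the column headers, and to its left for the row headers.

```r
# Hand-built stand-in for part of the tidy cell data (illustrative only).
cells <- data.frame(
  row = c(1, 1, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5),
  col = c(4, 6, 4, 5, 6, 7, 1, 2, 4, 5, 6, 7, 2, 4, 5, 6, 7),
  character = c("Child", "Adult", "No", "Yes", "No", "Yes", "1st", "Male",
                NA, NA, NA, NA, "Female", NA, NA, NA, NA),
  numeric = c(NA, NA, NA, NA, NA, NA, NA, NA,
              0, 5, 118, 57, NA, 0, 1, 4, 140),
  stringsAsFactors = FALSE
)

# For each target position, pick the header whose position is the closest
# one at or before it.
nearest_before <- function(target, header_pos, header_value) {
  vapply(target, function(p) {
    candidates <- header_pos <= p
    header_value[candidates][which.max(header_pos[candidates])]
  }, character(1))
}

data_cells   <- cells[!is.na(cells$numeric), ]
age_hdr      <- cells[cells$row == 1, ]
survived_hdr <- cells[cells$row == 2, ]
class_hdr    <- cells[cells$col == 1 & !is.na(cells$character), ]
sex_hdr      <- cells[cells$col == 2 & !is.na(cells$character), ]

tidy <- data.frame(
  Class    = nearest_before(data_cells$row, class_hdr$row, class_hdr$character),
  Sex      = nearest_before(data_cells$row, sex_hdr$row, sex_hdr$character),
  Age      = nearest_before(data_cells$col, age_hdr$col, age_hdr$character),
  Survived = nearest_before(data_cells$col, survived_hdr$col,
                            survived_hdr$character),
  n        = data_cells$numeric,
  stringsAsFactors = FALSE
)
tidy
```

Because position is kept for every cell, the association between headers and values is stated explicitly by the user rather than guessed by the import step.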

Similar projects

tidyxl was originally derived from readxl and still contains some of the same code, hence it inherits the GPL-3 licence. readxl is intended for importing tabular data with a single row of column headers, whereas tidyxl is more general, and less magic.

The rsheets project of several R packages is in the early stages of importing spreadsheet information from Excel and Google Sheets into R, manipulating it, and potentially parsing and processing formulas and writing out to spreadsheet files. In particular, jailbreaker attempts to extract non-tabular data from spreadsheets into tabular structures automatically via some clever algorithms.

tidyxl differs from rsheets in scope (tidyxl will never import charts, for example), and implementation (tidyxl is implemented mainly in C++ and is quite fast, only a little slower than readxl). unpivotr is a package related to tidyxl that provides tools for unpivoting complex and non-tabular data layouts using I not AI (intelligence, not artificial intelligence). In this way it corresponds to jailbreaker, but with a different philosophy.

Roadmap

Parse shared formulas and propagate to all associated cells.
Propagate array formulas to all associated cells.
Parse dates
Detect cell types (date, boolean, string, number)
Implement formatting import in C++ for speed.
Write more tests
