ropensci / skimr
A frictionless, pipeable approach to dealing with summary statistics
Home Page: https://docs.ropensci.org/skimr
When skim encounters an unknown data type, it attempts to coerce it to numeric and apply the default set of numeric functions.
However, some data types, such as lists, cannot be coerced to numeric. In that case the following error is returned:
Error: (list) object cannot be coerced to type 'double'
Perhaps it would be good to use character as a fallback rather than numeric.
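A minimal sketch of the character-fallback idea, using base R only. The helper name summarize_unknown() and the statistics it returns are hypothetical and not part of skimr; it only illustrates trying numeric coercion first and falling back to character when that fails.

```r
# Hypothetical helper (not a skimr function): try numeric coercion first,
# and fall back to a character summary when coercion raises an error.
summarize_unknown <- function(x) {
  as_num <- tryCatch(
    suppressWarnings(as.numeric(x)),
    error = function(e) NULL
  )
  if (!is.null(as_num)) {
    return(list(type = "numeric", mean = mean(as_num, na.rm = TRUE)))
  }
  # Collapse each element to a single string so that any object can be counted
  as_chr <- vapply(
    x,
    function(el) paste(format(el), collapse = " "),
    character(1)
  )
  list(type = "character", n_unique = length(unique(as_chr)))
}

# A list whose elements have length > 1 cannot be coerced to double
res <- summarize_unknown(list(1:3, "a", 1:3))
res$type
res$n_unique
```

Whether a character summary is actually informative for a given class is a separate question; the sketch only shows that the fallback avoids the hard error.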
Sometimes it is useful to report statistics by group, i.e., what is the recovery rate in the experimental group compared to the control group?
In dplyr/tidyverse, building groups is left to group_by. However, it appears that this feature is not (yet) supported by skimr.
I would expect that this code yields a grouped dataframe, as other tidyverse-code does:
mtcars %>% group_by(cyl) %>% skim
However, the code does not split up the results in groups.
It would be great to have that feature. Many thanks for the great work! 👍
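For reference, the kind of per-group result being asked for can be sketched in base R. This is only an illustration of the expected shape (one row per group, grouping variable in the leading column); it is not how skimr computes anything, and the column names are assumptions.

```r
# Per-group summary of mpg by cyl, base R only.
by_cyl <- split(mtcars$mpg, mtcars$cyl)

grouped <- data.frame(
  cyl  = names(by_cyl),                     # grouping variable first
  n    = vapply(by_cyl, length, integer(1)),
  mean = vapply(by_cyl, mean, numeric(1)),
  sd   = vapply(by_cyl, sd, numeric(1)),
  row.names = NULL
)

grouped$n  # 11 7 14
```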
Big fan of the package. However, it is awkward to use when doing analyses in RMarkdown.
Rather than appearing as a single concise output, like glimpse, the results manifest as multiple separate outputs: a console that is blank other than Numeric Variables and Character Variables, and an html tbl_df output for every variable type in the data_frame.
Furthermore, the tables often don't show all of the variables at once, which makes using skim difficult as well.
There's an open question of what the skim output should be for grouped data frames. In my view we should match the dplyr::summarize() behaviour and display the grouping variables in the leading columns and preserve the grouping values in the skim_df. Currently I have the function behaving like this:
mtcars %>%
group_by(cyl, gear) %>%
skim() %>%
.[1:10,] %>%
knitr::kable()
cyl | gear | var | type | stat | level | value |
---|---|---|---|---|---|---|
6 | 4 | mpg | numeric | missing | .all | 0.000000 |
6 | 4 | mpg | numeric | complete | .all | 4.000000 |
6 | 4 | mpg | numeric | n | .all | 4.000000 |
6 | 4 | mpg | numeric | mean | .all | 19.750000 |
6 | 4 | mpg | numeric | sd | .all | 1.552418 |
6 | 4 | mpg | numeric | min | .all | 17.800000 |
6 | 4 | mpg | numeric | median | .all | 20.100000 |
6 | 4 | mpg | numeric | quantile | 25% | 18.850000 |
6 | 4 | mpg | numeric | quantile | 75% | 21.000000 |
6 | 4 | mpg | numeric | max | .all | 21.000000 |
We need a simple function to return the current list of functions for a type, both because people want to know it without reading the code and because it helps with selectively dropping functions.
Hello,
I cannot get the histogram to show up in my console (RStudio) when I run some of the example code on the GitHub page:
The following code:
# install.packages("devtools")
devtools::install_github("hadley/colformat")
devtools::install_github("ropenscilabs/skimr")
library(tidyverse)
library(colformat)
library(skimr)
skim(mtcars) %>% filter(stat=="hist")
The following are the results:
# A tibble: 11 x 5
var type stat level value
<chr> <chr> <chr> <chr> <dbl>
1 mpg numeric hist <U+2582><U+2585><U+2587><U+2587><U+2587><U+2583><U+2581><U+2581><U+2582><U+2582> 0
2 cyl numeric hist <U+2586><U+2581><U+2581><U+2581><U+2583><U+2581><U+2581><U+2581><U+2581><U+2587> 0
3 disp numeric hist <U+2587><U+2587><U+2585><U+2581><U+2581><U+2587><U+2583><U+2582><U+2581><U+2583> 0
4 hp numeric hist <U+2586><U+2586><U+2587><U+2582><U+2587><U+2582><U+2583><U+2581><U+2581><U+2581> 0
5 drat numeric hist <U+2583><U+2587><U+2582><U+2582><U+2583><U+2586><U+2585><U+2581><U+2581><U+2581> 0
6 wt numeric hist <U+2582><U+2582><U+2582><U+2582><U+2587><U+2586><U+2581><U+2581><U+2581><U+2582> 0
7 qsec numeric hist <U+2582><U+2583><U+2587><U+2587><U+2587><U+2585><U+2585><U+2581><U+2581><U+2581> 0
8 vs numeric hist <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2586> 0
9 am numeric hist <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2586> 0
10 gear numeric hist <U+2587><U+2581><U+2581><U+2581><U+2586><U+2581><U+2581><U+2581><U+2581><U+2582> 0
11 carb numeric hist <U+2586><U+2587><U+2582><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581> 0
I get similar results for the other examples on the GitHub page.
Session Info
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.5.0 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3 tibble_1.3.3 ggplot2_2.2.1 tidyverse_1.1.1 skimr_1.0 colformat_0.0.0.9000
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 cellranger_1.1.0 plyr_1.8.4 forcats_0.2.0 tools_3.3.3 digest_0.6.12 lubridate_1.6.0 jsonlite_1.4 memoise_1.1.0 nlme_3.1-131
[11] gtable_0.2.0 lattice_0.20-35 rlang_0.1.1 psych_1.7.5 DBI_0.6-1 rstudioapi_0.6 parallel_3.3.3 haven_1.0.0 xml2_1.1.1 httr_1.2.1
[21] withr_1.0.2 stringr_1.2.0 hms_0.3 devtools_1.13.1 grid_3.3.3 R6_2.2.1 readxl_1.0.0 foreign_0.8-68 modelr_0.1.0 reshape2_1.4.2
[31] magrittr_1.5 scales_0.4.1.9000 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5 colorspace_1.3-2 stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2
[41] crayon_1.3.2.9000
Right now, skim(mtcars$mpg) fails with Error in UseMethod("skim") : no applicable method for 'skim' applied to an object of class "c('double', 'numeric')". skim_v() solves the issue but we should do something better by default. Better error message? Use skim_v()?
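One possible default, sketched in base R: wrap a bare vector in a one-column data frame so it can go down the data frame path, and give a clearer error otherwise. The helper name as_skimmable() and the default column name are hypothetical; this is not the behaviour of any skimr release.

```r
# Hypothetical helper: accept a data frame or a bare atomic vector.
as_skimmable <- function(x, name = "value") {
  if (is.data.frame(x)) {
    return(x)
  }
  if (!is.atomic(x)) {
    stop("Expected a data frame or an atomic vector, got class: ", class(x)[1])
  }
  out <- data.frame(x, stringsAsFactors = FALSE)
  names(out) <- name
  out
}

df <- as_skimmable(mtcars$mpg, name = "mpg")
dim(df)  # 32 1
```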
Here's a simple reproducible example:
library(dplyr)
library(skimr)
skim_with(numeric=list(mn=purrr::partial(mean, na.rm=TRUE)), append=FALSE)
iris %>% skim
yields:
Error in enc2utf8(col_names(col_labels, sep = sep)) :
argument is not a character vector
Something in skim_print.R seems to be interfering with this working properly. Does format_num rely on the default function being there?
Hi Elinw -
skimr working great! (Rstudio / Ubuntu Linux 32 bits).
An easy to implement suggestion
to save precious screen real estate
and make the skimr output
more readable in smaller screens.
In the top title line of skim,
please shorten the names of some of the title text...
Specifically:
25% quantile to simply: Q1
75% quantile to simply: Q3
missing to simply: miss (or NA)
complete to simply: compl.
Just these 4 easy text changes,
will avoid the "wrap around" of each variable line
in smaller monitor screens.
= much easier to read (every var is contained in one line...).
Values for each var
will then fit much better within a single screen line...
Thanks Elinw :-)
Really appreciate your effort!
The example from precis is organized a lot like the output of str() or dplyr::glimpse(). We don't have to adhere to this format if we don't want to.
If we have lots of variables, do we want grouped output from print.skim()? Here is one suggestion:
# Skim of a My data frame
# META Stats Nvariables N obs
## Numeric Variables
#> x missing: mean: median: sd: ...
## Categorical (factors or character vectors? Separate?)
#> c missing: level_a: level_b: ...
get_funs() doesn't work when there are multiple classes.
The function works fine if you use type[1] directly but if I use skim it throws
Error in .summary_functions[[type]] :
wrong arguments for subsetting an environment
I think it's working like a hash and maybe needs %in% ?
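A sketch of one way around this, assuming the function lists live in an environment keyed by class name. The registry below is a stand-in with made-up contents, and lookup_funs() is a hypothetical name; the idea is simply to walk class(x) and return the first match instead of indexing the environment with the whole class vector.

```r
# Stand-in registry; the real one in the package may be organised differently.
summary_functions <- new.env()
summary_functions$factor  <- list(n_unique = function(x) length(unique(x)))
summary_functions$numeric <- list(mean = function(x) mean(x, na.rm = TRUE))

# Hypothetical lookup: try each class in order, most specific first.
lookup_funs <- function(x, registry = summary_functions) {
  for (cls in class(x)) {
    if (exists(cls, envir = registry, inherits = FALSE)) {
      return(get(cls, envir = registry, inherits = FALSE))
    }
  }
  NULL  # no class matched; the caller decides on a fallback
}

ord <- factor(c("a", "b", "a"), ordered = TRUE)
class(ord)                # "ordered" "factor"
names(lookup_funs(ord))   # "n_unique"
```

Because `[[` on an environment expects a single name, a class vector of length two is what triggers the error quoted above; looping avoids that.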
The documentation for n_unique() says it returns the number of unique values but currently it returns the vector of unique values.
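For comparison, a version that matches the documented behaviour could look like the sketch below. Dropping NA before counting is an assumption here, not something the documentation is known to specify.

```r
# Sketch: return the number of unique values, not the values themselves.
# The na.rm argument and its default are assumptions.
n_unique <- function(x, na.rm = TRUE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  length(unique(x))
}

n_unique(c("a", "b", "a", NA))  # 2
```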
Skim is designed to provide the most useful defaults to a user, given a set of data types. We've mentioned the possibility of allowing users to provide their own sets of summary functions. This would be a stretch version of our work.
> skim(iris)
Error: .onLoad failed in loadNamespace() for 'crayon', details:
call: NULL
error: 'hasColorConsole' is not an exported object from 'namespace:rstudioapi'
Latest daily build of Rstudio, latest R, other packages. OS X 10.11.6
Hi,
I get this error.
> kimr(mtcars)
Error in kimr(mtcars) : could not find function "kimr"
> skim(mtcars)
Error: .onLoad failed in loadNamespace() for 'crayon', details:
call: NULL
error: 'hasColorConsole' is not an exported object from 'namespace:rstudioapi'
>
R version
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
thanks
sk_print currently only handles numeric, character and factor. As a result ordered factors, dates and complex are not printing.
It might be worth considering if a list-column might be slightly more flexible.
tribble(
~ var, ~ summary, ~ value,
"cyl", "mean", 3.5,
"cyl", "median", 3,
"cyl", "sd", 2.75
)
# vs
tribble(
~ var, ~ summary,
"cyl", list(mean = 3.5, median = 3, sd = 2.75)
)
mtcars %>%
filter(cyl == 8) %>%
skim()
Produces the following error:
Error in cut.default(x, 10) : 'breaks' are not unique
Looks like it's caused by line 33 in Stats.R, and is caused by the vs variable, which is all zeroes.
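A sketch of a guard for the constant-column case, in base R. The function name bin_counts() is hypothetical and this is not the code in Stats.R; it just shows handling a zero-width range before calling cut() with explicit breaks.

```r
# Hypothetical binning helper that tolerates constant vectors.
bin_counts <- function(x, n_bins = 10) {
  x <- x[!is.na(x)]
  if (length(x) == 0) {
    return(integer(n_bins))
  }
  rng <- range(x)
  if (rng[1] == rng[2]) {
    # All values identical: put everything in the first bin
    counts <- integer(n_bins)
    counts[1] <- length(x)
    return(counts)
  }
  breaks <- seq(rng[1], rng[2], length.out = n_bins + 1)
  as.integer(table(cut(x, breaks = breaks, include.lowest = TRUE)))
}

bin_counts(rep(0, 14))        # 14 0 0 0 0 0 0 0 0 0
sum(bin_counts(mtcars$mpg))   # 32
```

Putting a constant column in the first bin is an arbitrary choice; a centred bin or an empty histogram would be equally defensible.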
Hello,
I love the layout of the skimr output. I have totally replaced the use of the summary function with skim(). That said, since the spark-histograms don't render properly on Windows, is there any way to add an option that can be set to FALSE so that they are not printed? That would be great, and I think it would run faster. I am an R user, not an R programmer; otherwise, I would have submitted a pull request :).
Thank you,
Alfredo
Shouldn't "colformats" be "colformat" (twice) in README.md?
skimr chokes on ordered factors.
library(tidyverse)
library(skimr)
df <- data_frame(x = rnorm(100),
y = rnorm(100),
z = factor(sample(LETTERS[1:5], 100, replace = TRUE)))
skim(df)
#> Numeric Variables
#> # A tibble: 2 x 13
#> var type missing complete n mean sd min
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 x numeric 0 100 100 -0.01725644 1.065178 -2.477188
#> 2 y numeric 0 100 100 -0.02650740 1.041577 -2.259213
#> # ... with 5 more variables: `25% quantile` <dbl>, median <dbl>, `75%
#> # quantile` <dbl>, max <dbl>, hist <chr>
#>
#> Factor Variables
#> # A tibble: 1 x 7
#> var type complete missing n n_unique
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 z factor 100 0 100 5
#> # ... with 1 more variables: stat <chr>
df1 <- df %>%
mutate(z = factor(z, ordered=TRUE))
skim(df1)
#> Error in .summary_functions[[type]]: wrong arguments for subsetting an environment
Getting this error:
Error in nchar(x) : invalid multibyte string, element 11
Is it possible to return the name of the column that is causing the issue?
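In the meantime, a rough way to locate the offending column by hand, assuming the data is meant to be UTF-8. The helper name find_invalid_utf8() is hypothetical; validUTF8() is a base R function.

```r
# Hypothetical helper: names of character columns holding invalid UTF-8.
find_invalid_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  bad <- vapply(
    df[is_chr],
    function(col) any(!validUTF8(col)),
    logical(1)
  )
  names(df)[is_chr][bad]
}

# Build a string with a stray Latin-1 byte (0xE9), which is not valid UTF-8
broken <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
df <- data.frame(ok = c("a", "b"), bad = c("x", broken),
                 stringsAsFactors = FALSE)

find_invalid_utf8(df)  # "bad"
```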
Installed colformat and skimr pkgs,
as indicated in GitHub pg. Ok!
skim(chickwts) # or any other data frame (ie: mtcars, iris...)
returns this message:
"Error in overscope_eval_next(overscope, expr) : object 'level' not found"
Using:
Thanks for any guidance-
...skimr looks VERY useful, eager to use it in Rstudio / Linux :-)
What summary statistics do people who work with complex numbers want? Is mean() meaningful?
Although the following code displays the histograms properly when I run the chunk in Rmd, they turn into symbols in the rendered HTML or md.
library(tidyverse)
library(skimr)
Sys.setlocale("LC_CTYPE", "Chinese")
skim(storms) %>% filter(stat=="hist")
# A tibble: 10 x 5
var type stat level value
<chr> <chr> <chr> <chr> <dbl>
1 year numeric hist ¨z¨z¨z¨}¨~¨}¨~¨~¨~¨} 0
2 month numeric hist ¨x¨x¨x¨x¨x¨y¨|¨~¨z¨x 0
3 day integer hist ¨~¨}¨}¨}¨}¨}¨}¨}¨}¨} 0
4 hour numeric hist ¨~¨x¨~¨x¨x¨~¨x¨~¨x¨x 0
5 lat numeric hist ¨y¨~¨~¨}¨~¨~¨|¨y¨x¨x 0
6 long numeric hist ¨x¨|¨~¨~¨~¨}¨}¨z¨x¨x 0
7 wind integer hist ¨y¨~¨|¨z¨y¨y¨x¨x¨x¨x 0
8 pressure integer hist ¨x¨x¨x¨x¨x¨x¨y¨z¨~¨y 0
9 ts_diameter numeric hist ¨~¨~¨|¨y¨x¨x¨x¨x¨x¨x 0
10 hu_diameter numeric hist ¨~¨x¨x¨x¨x¨x¨x¨x¨x¨x 0
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 skimr_0.9000 dplyr_0.7.3 purrr_0.2.3 readr_1.1.1 tidyr_0.7.1 tibble_1.3.4
[8] ggplot2_2.2.1 tidyverse_1.1.1
loaded via a namespace (and not attached):
[1] colformat_0.0.0.9000 tidyselect_0.2.0 reshape2_1.4.2 haven_1.0.0 lattice_0.20-35
[6] colorspace_1.3-2 htmltools_0.3.6 yaml_2.1.14 rlang_0.1.2 foreign_0.8-67
[11] glue_1.1.1 modelr_0.1.0 readxl_1.0.0 bindr_0.1 plyr_1.8.4
[16] stringr_1.2.0 munsell_0.4.3 blogdown_0.0.42 gtable_0.2.0 cellranger_1.1.0
[21] rvest_0.3.2 psych_1.7.5 evaluate_0.10 knitr_1.16 forcats_0.2.0
[26] gapminder_0.2.0 parallel_3.4.0 broom_0.4.2 Rcpp_0.12.12 scales_0.4.1
[31] backports_1.1.0 jsonlite_1.5 mnormt_1.5-5 hms_0.3 digest_0.6.12
[36] stringi_1.1.5 bookdown_0.4 grid_3.4.0 rprojroot_1.2 tools_3.4.0
[41] magrittr_1.5 lazyeval_0.2.0 crayon_1.3.2.9000 pkgconfig_2.0.1 rsconnect_0.8
[46] xml2_1.1.1 lubridate_1.6.0 assertthat_0.2.0 rmarkdown_1.6 httr_1.2.1
[51] rstudioapi_0.6 R6_2.2.2 nlme_3.1-131 compiler_3.4.0
Good reminder https://twitter.com/EdwardTufte/status/871049024048115713
We should think about how to handle digits and also when integers should be returned.
Ideally the decimals should line up in a way similar to that in lucid.
https://cran.r-project.org/web/packages/lucid/vignettes/lucid_printing.html
Using skim_with() someone can make multiple statistics with the same names. Should we prevent that?
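If we do want to prevent it, the check itself is small. A sketch, with check_unique_names() as a hypothetical helper that would run on the user-supplied list before it is stored:

```r
# Hypothetical validation step for a named list of summary functions.
check_unique_names <- function(funs) {
  nms <- names(funs)
  if (is.null(nms) || any(nms == "")) {
    stop("Every summary function needs a name.")
  }
  dupes <- unique(nms[duplicated(nms)])
  if (length(dupes) > 0) {
    stop("Duplicate statistic names: ", paste(dupes, collapse = ", "))
  }
  invisible(funs)
}

check_unique_names(list(mn = mean, med = median))   # passes silently
# check_unique_names(list(mn = mean, mn = median))  # would stop with an error
```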
Hello there,
thanks for this promising package. I wonder if you plan to add support for time based variables such as dates, timestamps, etc. The same way Pandas does it: that is showing minimum/maximum date, frequency, etc.
That would be extremely useful!
Thanks!
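To make the request concrete, here is a base R sketch of what a date summary might contain. The helper name summarize_dates() and the choice of statistics are assumptions, not an existing skimr feature.

```r
# Hypothetical summary for Date vectors: missing count, range, distinct values.
summarize_dates <- function(x) {
  stopifnot(inherits(x, "Date"))
  observed <- x[!is.na(x)]
  list(
    missing  = sum(is.na(x)),
    min      = min(observed),
    max      = max(observed),
    n_unique = length(unique(observed))
  )
}

d <- as.Date(c("2017-01-01", "2017-03-15", NA, "2017-01-01"))
res <- summarize_dates(d)
res$min  # "2017-01-01"
res$max  # "2017-03-15"
```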
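skim_tee <- function(x) {
  # Print the skim as a side effect
  print(skim(x))
  # Return the input invisibly so the pipeline can continue with the data
  invisible(x)
}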
So you can verify the distribution multiple times inside a pipeline.
(This should be a separate function not an argument in order to be type stable)
We could use some good vignettes of both simple and advanced use.
as discussed. and mentioned on twitter.
How should we handle summarizing lists in columns?
https://github.com/ropenscilabs/skimr/blob/master/R/functions.R#L198
The documentation is the same as get_funs().
We have pretty good coverage so far
https://codecov.io/gh/ropenscilabs/skimr/tree/master/R
but it would be great to get to 100% (or close to it). Most of them are easy tests for the individual functions in the stats.R file.
We could also use more tests of things like a column that is entirely NA.
To reproduce:
nycflights13::weather %>% skim()
nycflights13::flights %>% skim()
I get this error:
Error in .summary_functions[[type]] :
wrong arguments for subsetting an environment
#Callstack
13. get_funs(FUNS) at skim_v.R#19
12. .f(.x[[i]], ...)
11. purrr::map(.data, skim_v) at skim.R#16
10. skim.data.frame(.) at skim.R#10
9. skim(.)
8. function_list[[k]](value)
7. withVisible(function_list[[k]](value))
6. freduce(value, `_function_list`)
5. `_fseq`(`_lhs`)
4. eval(expr, envir, enclos)
3. eval(quote(`_fseq`(`_lhs`)), env, env)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1. nycflights13::weather %>% skim()
For "" or " ", notify users that they might need to do some data cleaning before summarizing.
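Detecting these cases is straightforward in base R. A sketch, with count_blank() as a hypothetical helper name; how the notification is surfaced (message, warning, extra column) is left open.

```r
# Hypothetical helper: count strings that are empty or whitespace-only.
count_blank <- function(x) {
  stopifnot(is.character(x))
  sum(!is.na(x) & trimws(x) == "")
}

x <- c("a", "", " ", NA, "b ")
count_blank(x)  # 2

if (count_blank(x) > 0) {
  message(count_blank(x), " empty or whitespace-only strings found; ",
          "consider cleaning the data before summarizing.")
}
```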
In LaTeX output, "hist" is coming out as boxes (screenshots of output).
Hi
I was unable to install skimr on mac using below command.
install_github("ropenscilabs/skimr")
Thanks
The sf package is the R implementation of Simple Features and is becoming a new standard for working with spatial data in R. More information at https://github.com/edzer/sfr and http://robinlovelace.net/geocompr/spatial-class.html.
The most important element of this package is the sf class. It is a simple data.frame with one additional list-column, which stores the geometry of the data.
I think it would be useful to add the ability to create a summary of sf objects. A summary of the geometry column could return some basic information, such as projection, geometry type, etc.
library(sf)
library(skimr)
nc = st_read(system.file("shape/nc.shp", package="sf"))
nc
nc %>% skim()
Error in .f(.x[[i]], ...) :
(list) object cannot be coerced to type 'double'
In addition: Warning message:
Skim does not know how to summarize of vector of class: sfc_MULTIPOLYGON. Coercing to numeric
Skim does not know how to summarize of vector of class: sfc. Coercing to numeric
Time series data, unbalanced
str(data_complete_raw)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2448 obs. of 37 variables:
$ iso : chr "AUS" "AUS" "AUS" "AUS" ...
$ Country : chr "Australia" "Australia" "Australia" "Australia" ...
$ Year : POSIXct, format: "1870-01-01" "1871-01-01" "1872-01-01" "1873-01-01" ...
$ before.indepence : num 0 0 0 0 0 0 0 0 0 0 ...
$ currency.crises : num 0 0 0 0 0 0 0 0 0 0 ...
$ inflation.crises : num 0 0 0 0 0 0 0 0 0 0 ...
$ stock.crash : num 0 0 0 0 0 0 0 0 0 0 ...
$ sov.debt.crises.dom: num 0 0 0 0 0 0 0 0 0 0 ...
One approach to writing skim pipelines keeps us away from having to reimplement dplyr tools. For example:
select(mtcars, cyl) %>%
skim()
Alternatively, we might be interested in allowing for column selection within skim().
skim(mtcars, cyl)
The latter approach gets us closer to the API listed in Amelia's original issue.
The test-skim.R test is failing because there is no row for the inline histogram. But I'm not sure how to add that in the correct listing. Tried a few ideas but no success.
The package cannot be installed anymore because the colformat repo no longer exists; it has been replaced by https://github.com/hadley/pillar.
See this commit: r-lib/pillar@831aade
@haozhu233 and I did a bit of benchmarking of skim() and it looks like there are some performance issues with drawing the histogram. This is evident on large grouped data frames. We might want to allow the user to skip drawing the histograms if they are interested in speedier skimming.
How should we skim the object produced by skim?
Most of our functions don't have examples in the documentation.