Data Visualizations

Cristian E. Nuno April 16, 2018

- Visualizing Categorical Counts
- Multiple Elements
- Reshaping iris into Tidy Format

Visualizing Categorical Counts

This week, I spent some time learning about the data.table package. A lot of folks on Stack Overflow recommended I look into it to speed up my processing time.

At first, I wasn't sure whether the package was overhyped. After experimenting, I found the hype is real: data.table performs operations extremely quickly.

The trade-off for this speed is readability, at least initially. But the folks at data.table provide great documentation to help newbies like myself become more familiar with the syntax.
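For readers new to that syntax, here is a minimal sketch of the general `DT[i, j, by]` form on the built-in mtcars data. The object names (`dt`, `high.mpg`, and so on) and the `Model` column name are my own choices for illustration.

```r
library( data.table )

# convert mtcars to a data.table, keeping car names in a `Model` column
dt <- as.data.table( mtcars, keep.rownames = "Model" )

# i: keep only the rows where mpg is above 25
high.mpg <- dt[ mpg > 25 ]

# j: compute a summary over all rows
avg.mpg <- dt[, .( Avg_MPG = mean( mpg ) ) ]

# by: repeat the j expression within each group
avg.by.cyl <- dt[, .( Avg_MPG = mean( mpg ) ), by = cyl ]

# i, j, and by combined: count heavier cars within each gear group
heavy.by.gear <- dt[ wt > 3, .( Cars = .N ), by = gear ]
```

Reading the brackets as "take these rows, compute this, grouped by that" made the syntax much easier for me to follow.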

# load necessary packages
library( data.table )
library( ggplot2 )

# load necessary data
df <- as.data.table( mtcars )

# print object size 
object.size( df )

## 5608 bytes

# expand the rows in df
# from 32 to 16 million
df <- df[ rep( x = 1:nrow( df ), times = 500000), ]

# check dim
dim( df )

## [1] 16000000       11

# now size of df is about 1.4 GB
object.size( df )

## 1408002792 bytes

# count the number of rows for each unique value
# that appears in the `cyl` column
cyl.counts <-
  df[, j = .( Count = .N)
     , by = .(Cylinders = cyl ) ][ order( Cylinders ) ]


# visualize results
ggplot( cyl.counts, aes( x = factor( Cylinders )
                         , y = Count ) ) +
  geom_bar( stat = "identity"
            , aes( fill = factor( Cylinders ) )
            , position = "dodge" ) + 
  labs( title = "Counting Cylinders"
        , subtitle = "There were fewer cars with 6 cylinders than those with 4 or 8 cylinders."
        , caption = "Source: 1974 Motor Trend Car Road Tests"
        , fill = "Cylinder"
        , x = "Cylinder" )
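
As a sanity check on the counts, here is a minimal sketch that compares the data.table result against base R's `table()` and times both. It assumes a smaller replica (32,000 rows, an arbitrary size I picked so it runs quickly), and the names `small`, `dt.counts`, and `base.counts` are mine. Timings will vary by machine, so I don't list any.

```r
library( data.table )

# build a smaller replica than the 16 million rows used earlier
small <- as.data.table( mtcars )[ rep( x = 1:nrow( mtcars ), times = 1000 ), ]

# count rows per cylinder value with data.table
dt.counts <-
  small[, j = .( Count = .N )
        , by = .( Cylinders = cyl ) ][ order( Cylinders ) ]

# count the same thing with base R
base.counts <- table( small$cyl )

# both approaches should agree
all( dt.counts$Count == as.vector( base.counts ) )

# time each approach; results depend on hardware and data size
system.time( small[, .N, by = cyl ] )
system.time( table( small$cyl ) )
```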

Multiple Elements

This is a great example of not knowing what story to tell with my data.

# load necessary data
df <- 
  data.frame( Grade = factor( x = rep( x = c( "PK", "K", 1:12 )
                                       , times = 30 )
                              , levels = c( "PK", "K", 1:12) )
              , Count = sample( x = 1000:30000, size = 420, replace = FALSE)
              , School_Year = do.call( what = "c"
                                       , args = lapply( X = 2009:2018, FUN = rep, times = 42 ) )
              , File_Type = factor( x = rep( x = c( rep( x = "20th Day", times = 14), rep( x = "Mid Year", times = 14), rep( x = "End of Year", times = 14 ) ), times = 10 )
                                     , levels = c("20th Day", "Mid Year", "End of Year" ) )
              , stringsAsFactors = TRUE )

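# plot Count directly: mean( Count ) inside aes() is a single
# overall average, which would give every bar the same height
ggplot( data = df
        , aes( x = School_Year, y = Count ) ) +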
  geom_bar( aes( fill = Grade )
            , position = "dodge"
            , stat = "identity" ) + 
  facet_grid( facets = Grade ~ File_Type )
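
One way to narrow the story is to summarise before plotting. The sketch below assumes the question is "how does the average count per grade change across school years for each file type?", which is my own framing and not something the data demands. It rebuilds a data frame with the same columns using `expand.grid()`, and the names `avg.counts` and `avg.plot` are hypothetical.

```r
library( ggplot2 )

# rebuild a data frame with the same columns as above
# (counts are random, so values differ between runs)
df <- expand.grid( Grade = factor( x = c( "PK", "K", 1:12 )
                                   , levels = c( "PK", "K", 1:12 ) )
                   , File_Type = factor( x = c( "20th Day", "Mid Year", "End of Year" )
                                         , levels = c( "20th Day", "Mid Year", "End of Year" ) )
                   , School_Year = 2009:2018 )
df$Count <- sample( x = 1000:30000, size = nrow( df ), replace = FALSE )

# average Count across grades for each school year and file type
avg.counts <- aggregate( Count ~ School_Year + File_Type
                         , data = df
                         , FUN = mean )

# one line per file type instead of 42 facets
avg.plot <-
  ggplot( data = avg.counts
          , aes( x = School_Year, y = Count, col = File_Type ) ) +
  geom_line() +
  geom_point() +
  scale_x_continuous( breaks = 2009:2018 ) +
  labs( title = "Average Count per Grade by School Year"
        , x = "School Year"
        , y = "Average Count"
        , col = "File Type" )

avg.plot
```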

Reshaping `iris` into Tidy Format

Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. - Hadley Wickham

Using the concepts laid out in Tidy Data, I reshape Edgar Anderson's iris Data.

# load library
library( tidyverse )

## ── Attaching packages ───────────────────── tidyverse 1.2.1 ──

## ✔ tibble  1.4.2     ✔ purrr   0.2.4
## ✔ tidyr   0.8.0     ✔ dplyr   0.7.4
## ✔ readr   1.1.1     ✔ stringr 1.3.0
## ✔ tibble  1.4.2     ✔ forcats 0.3.0

## ── Conflicts ──────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()   masks data.table::between()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::first()     masks data.table::first()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::last()      masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()

# view structure of iris
str( datasets::iris )

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

iris contains 150 rows by 5 columns. Each row holds four measurement values for a single flower.

Below demonstrates how to transform iris so that each row contains one measurement value for each observation.

# reshape iris
# to long format
# where each row represents a measurement
# of type (Length, Width)
# by each part (Sepal, Petal)
# for each Species (setosa, virginica, versicolor)
iris.tidy <-
  iris %>%
  gather( key = "Measure"
          , value = "Value"
          , -Species ) %>% 
  separate( col = Measure
            , into = c("Part", "Measure")
            , sep = "\\." )

# view results
str( iris.tidy )

## 'data.frame':    600 obs. of  4 variables:
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Part   : chr  "Sepal" "Sepal" "Sepal" "Sepal" ...
##  $ Measure: chr  "Length" "Length" "Length" "Length" ...
##  $ Value  : num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

iris.tidy contains 600 rows by 4 columns.

The number of rows in iris.tidy grew by a factor of 4 (from 150 to 600) because gather() stacked the Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width values into the newly created Value column.

At the same time, iris.tidy retains two important distinctions:

Measurement type - length or width - in the newly created Measure column; and
Flower part - sepal or petal - in the newly created Part column.

Each row in iris.tidy now follows the tenets of tidy data, since each row contains one measurement value for each observation.
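
In newer versions of tidyr (1.0.0 and later), gather() is superseded by pivot_longer(). Here is a minimal sketch of the same reshape in a single call, assuming a newer tidyr than the 0.8.0 attached above. The name `iris.long` is mine, and the rows come back in a different order than with gather().

```r
library( tidyr )

# reshape iris to long format in one step:
# split each column name at the "." into Part and Measure
# and stack the four measurement columns into Value
iris.long <-
  pivot_longer( data = datasets::iris
                , cols = -Species
                , names_to = c( "Part", "Measure" )
                , names_sep = "\\."
                , values_to = "Value" )

# 150 rows x 4 measurement columns = 600 rows
dim( iris.long )
```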

# visualize results
ggplot( data = iris.tidy
        , aes( x = Measure, y = Value, col = Species ) ) +
  geom_jitter() +
  facet_grid( facets = . ~ Species ) +
  labs( title = "Length and Width Values by Flower Species and Measurement Type"
       , subtitle = "Setosa's sepals tend to be larger than their petals."
       , caption = "Source: Edgar Anderson's Iris Data" )
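
Because the plot does not separate sepals from petals, here is a minimal sketch that checks the subtitle's claim numerically, using base R on the original iris data. The names `setosa` and `part.means` are my own.

```r
# keep only the setosa flowers
setosa <- subset( x = datasets::iris, subset = Species == "setosa" )

# average each of the four measurements
part.means <- colMeans( setosa[, c( "Sepal.Length", "Sepal.Width"
                                    , "Petal.Length", "Petal.Width" ) ] )

# compare sepal averages against petal averages
round( part.means, digits = 3 )
```

For setosa, the sepal averages come out larger than the petal averages for both length and width.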
