Data Visualizations

Cristian E. Nuno April 16, 2018

Visualizing Categorical Counts

This week, I spent some time learning about the data.table package. A lot of folks on Stack Overflow recommended I look into it to speed up my processing time.

At first, I wasn't sure whether the package was overhyped. After experimenting, I can say the hype is real: data.table performs operations extremely quickly.

The trade-off for this speed, at least initially, is readability. Fortunately, the folks at data.table provide great documentation to help newbies like me become familiar with the syntax.
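
As a minimal sketch of that syntax, data.table calls take the general form DT[ i, j, by ]: pick rows with i, compute with j, and group with by. The example below uses the built-in mtcars data. The object names (cars, mpg.by.cyl), the Mean_MPG column, and the weight cut-off are illustrative choices, not part of the original analysis.

```r
# load data.table and convert the built-in mtcars data
library( data.table )
cars <- as.data.table( mtcars )

# DT[ i, j, by ]:
#   i  = keep cars weighing under 3,500 lbs (wt is in 1000s of lbs)
#   j  = compute the mean miles per gallon
#   by = do so separately for each cylinder count
mpg.by.cyl <-
  cars[ wt < 3.5
        , j = .( Mean_MPG = mean( mpg ) )
        , by = .( Cylinders = cyl ) ][ order( Cylinders ) ]

# one row per cylinder count
print( mpg.by.cyl )
```

Chaining a second set of brackets, as in [ order( Cylinders ) ], applies another operation to the result of the first.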

# load necessary packages
library( data.table )
library( ggplot2 )

# load necessary data
df <- as.data.table( mtcars )

# print object size 
object.size( df )
## 5608 bytes
# expand the rows in df
# from 32 to 16 million
df <- df[ rep( x = 1:nrow( df ), times = 500000), ]

# check dim
dim( df )
## [1] 16000000       11
# df now occupies about 1.4 GB
object.size( df )
## 1408002792 bytes
# count the number of rows
# for each unique value in the `cyl` column
cyl.counts <-
  df[, j = .( Count = .N)
     , by = .(Cylinders = cyl ) ][ order( Cylinders ) ]


# visualize results
ggplot( cyl.counts, aes( x = factor( Cylinders )
                         , y = Count ) ) +
  geom_bar( stat = "identity"
            , aes( fill = factor( Cylinders ) )
            , position = "dodge" ) + 
  labs( title = "Counting Cylinders"
        , subtitle = "There were fewer cars with 6 cylinders than those with 4 or 8 cylinders."
        , caption = "Source: 1974 Motor Trend Car Road Tests"
        , fill = "Cylinder"
        , x = "Cylinder" )
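
One way to sanity-check these counts, sketched here under the assumption that every original row was repeated exactly 500,000 times, is to count cylinders in the 32-row mtcars data and scale up. The names small.counts and Expected are hypothetical.

```r
library( data.table )

# count cylinders in the original 32-row mtcars data
small.counts <-
  as.data.table( mtcars )[, j = .( Count = .N )
                          , by = .( Cylinders = cyl ) ][ order( Cylinders ) ]

# each original row was repeated 500,000 times,
# so each count scales by the same factor
small.counts[, Expected := Count * 500000 ]

print( small.counts )
```

The Expected column should sum to 16,000,000, matching the row count reported by dim( df ).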

Multiple Elements

This is a great example of not knowing what story to tell with my data.

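# create example data
# Count is drawn at random, so set a seed
# (the value 1 is arbitrary) to make the results reproducible
set.seed( 1 )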
df <- 
  data.frame( Grade = factor( x = rep( x = c( "PK", "K", 1:12 )
                                       , times = 30 )
                              , levels = c( "PK", "K", 1:12) )
              , Count = sample( x = 1000:30000, size = 420, replace = FALSE)
              , School_Year = do.call( what = "c"
                                       , args = lapply( X = 2009:2018, FUN = rep, times = 42 ) )
              , File_Type = factor( x = rep( x = c( rep( x = "20th Day", times = 14), rep( x = "Mid Year", times = 14), rep( x = "End of Year", times = 14 ) ), times = 10 )
                                     , levels = c("20th Day", "Mid Year", "End of Year" ) )
              , stringsAsFactors = TRUE )

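# plot each Count directly: every Grade, File_Type, and School_Year
# combination has exactly one row, whereas mean( Count ) inside aes()
# would give every bar the same height
ggplot( data = df
        , aes( x = School_Year, y = Count ) ) +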
  geom_bar( aes( fill = Grade )
            , position = "dodge"
            , stat = "identity" ) + 
  facet_grid( facets = Grade ~ File_Type )

Reshaping iris into Tidy Format

Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. - Hadley Wickham

Using the concepts laid out in Tidy Data, I reshape Edgar Anderson's iris Data.

# load library
library( tidyverse )
## ── Attaching packages ───────────────────── tidyverse 1.2.1 ──

## ✔ tibble  1.4.2     ✔ purrr   0.2.4
## ✔ tidyr   0.8.0     ✔ dplyr   0.7.4
## ✔ readr   1.1.1     ✔ stringr 1.3.0
## ✔ tibble  1.4.2     ✔ forcats 0.3.0

## ── Conflicts ──────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()   masks data.table::between()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::first()     masks data.table::first()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::last()      masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()
# view structure of iris
str( datasets::iris )
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

iris contains 150 rows and 5 columns. Each row holds four measurement values for a single flower.

The code below demonstrates how to transform iris so that each row contains one measurement value per observation.

# reshape iris
# to long format
# where each row represents a measurement
# of type (Length, Width)
# by each part (Sepal, Petal)
# for each Species (setosa, virginica, versicolor)
iris.tidy <-
  iris %>%
  gather( key = "Measure"
          , value = "Value"
          , -Species ) %>% 
  separate( col = Measure
            , into = c("Part", "Measure")
            , sep = "\\." )

# view results
str( iris.tidy )
## 'data.frame':    600 obs. of  4 variables:
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Part   : chr  "Sepal" "Sepal" "Sepal" "Sepal" ...
##  $ Measure: chr  "Length" "Length" "Length" "Length" ...
##  $ Value  : num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

iris.tidy contains 600 rows by 4 columns.

The number of rows in iris.tidy grew by a factor of 4 (from 150 to 600) because gather() stacked the Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width values into the newly created Value column.

At the same time, iris.tidy retains two important distinctions:

  1. Measurement type - length or width - in the newly created Measure column; and

  2. Flower part - sepal or petal - in the newly created Part column.

Each row in iris.tidy now follows the tenets of tidy data, since each row contains one measurement value per observation.
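
For readers on a newer version of tidyr (1.0.0 or later), the same reshape can be sketched in a single step with pivot_longer(), which supersedes gather(). This is an alternative to the code above, not the method used in this analysis, and the name iris.long is hypothetical.

```r
library( tidyr )

# reshape iris to long format in one call:
# names_sep splits each column name (e.g. "Sepal.Length")
# at the period into the Part and Measure columns
iris.long <-
  pivot_longer( data = datasets::iris
                , cols = -Species
                , names_to = c( "Part", "Measure" )
                , names_sep = "\\."
                , values_to = "Value" )

# check the dimensions of the result
dim( iris.long )
```

The result is a tibble rather than a data.frame, and its rows are ordered by flower rather than by measurement, but it holds the same 600 values in the same four columns.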

# visualize results
ggplot( data = iris.tidy
        , aes( x = Measure, y = Value, col = Species ) ) +
  geom_jitter() +
  facet_grid( facets = . ~ Species ) +
  labs( title = "Length and Width Values by Flower Species and Measurement Type"
       , subtitle = "Setosa's sepals tend to be larger than their petals."
       , caption = "Source: Edgar Anderson's Iris Data" )
