Data Wrangling and Visualization in R

This tutorial introduces basic techniques for data wrangling and visualization in R. We first cover some basic tools using out-of-the-box R commands, then introduce the powerful "tidyverse" framework for both wrangling and visualizing data, and finally look at the philosophy behind this framework to set up deeper exploration of our data. Throughout, we use a publicly available dataset of Airbnb listings.

This was originally presented as part of a month-long course in Software Tools for Optimization and Analytics given by the Operations Research Center at MIT. Here is a link to some of the other sessions and notes, if you're interested.

Pre-assignment 1: Installation

  1. Before beginning, ensure you have both R and RStudio installed. RStudio provides a graphical user interface (GUI) or integrated development environment (IDE) for programming in R, and is free at the RStudio site. RStudio does not bundle R itself, so if you do not already have R (with its base libraries), install it from CRAN first.

  2. Next, make sure you have the course materials. The easiest way to get all the materials for this class is to clone this repository. On a Mac, you can open a Terminal, navigate to a directory of your choice, and run

$ git clone https://github.com/stmorse/intro-tidyverse.git

Another way is to simply download all the course material as a .zip file (in the "Clone or download" dropdown menu).

These materials are summarized in an easy-to-digest form in the online session notes.

The materials consist of a script (script.R) and corresponding exercises for each section (exercises.R). Probably the best way to self-teach this material is to open the session notes above in a browser window and the two R scripts in RStudio, then work your way through them, typing and running the code yourself and flipping back and forth as necessary.

The R scripts with all code filled in are also provided, in script_full.R and exercises_solved.R.

(The master.Rmd and master.html files creating the online session notes can be ignored.)

The data is publicly available at Kaggle as the Boston Airbnb dataset, but we also provide it in this repository for convenience.

Pre-assignment 2: Installing libraries

We will use three libraries for this session: tidyr, dplyr, and ggplot2. Before beginning, ensure that you install them, and are able to load them into an R session in RStudio. You can install them by executing the following commands in the RStudio console:

install.packages('dplyr')
install.packages('tidyr')
install.packages('ggplot2')

You should test that the libraries will load by then running

library(dplyr)
library(tidyr)
library(ggplot2)
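
If you rerun the setup often, the install step above can be limited to packages that are not already present. The following is only a minimal sketch: the helper name missing_packages is hypothetical (it is not part of the course scripts), and it relies on base R's requireNamespace to detect what is installed.

```r
# Packages this session relies on (the three named above).
pkgs <- c("dplyr", "tidyr", "ggplot2")

# Hypothetical helper (not part of the course materials):
# returns the subset of `pkgs` that cannot be found on this machine.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only what is missing; does nothing if everything is present.
to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

After this runs, the library() calls above should load without error.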

Then test that dplyr/tidyr work by executing the command:

data.frame(name=c('Ann', 'Bob'), number=c(3.141, 2.718)) %>% gather(type, favorite, -name)

which should output something like this

      name   type favorite
    1  Ann number    3.141
    2  Bob number    2.718
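
As a side note, recent versions of tidyr (1.0.0 and later) also provide pivot_longer, which covers the same reshaping as gather. The sketch below is an equivalent check under the assumption that your installed tidyr is new enough; the variable names people and long are just for illustration. The printed format may differ slightly from the output above.

```r
library(tidyr)

# Same toy data as in the gather() test above.
people <- data.frame(name = c("Ann", "Bob"), number = c(3.141, 2.718))

# Reshape every column except `name` into (type, favorite) pairs.
long <- pivot_longer(people,
                     cols = -name,
                     names_to = "type",
                     values_to = "favorite")
print(long)
```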

Finally, test that ggplot2 works by executing the command

data.frame(x=rnorm(1000), y=rnorm(1000)) %>% ggplot(aes(x,y)) + geom_point()

which should produce a cloud of points centered around the origin.
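
If you would like the same picture every time, one option is to fix the random seed and keep the plot in a variable before printing it. This is only a sketch of the same test; the seed value and the names points and p are arbitrary choices, not something the course scripts require.

```r
library(ggplot2)

# Fix the seed so the random cloud is identical on every run.
set.seed(42)
points <- data.frame(x = rnorm(1000), y = rnorm(1000))

# Build the plot object first, then draw it explicitly.
p <- ggplot(points, aes(x, y)) + geom_point()
print(p)

# Both column means should be close to zero, matching the
# "centered around the origin" description.
colMeans(points)
```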

Now you're ready to begin!

Additional Resources

dplyr and tidyr are well-established packages within the R community, and there are many resources to use for reference and further learning. Some of our favorites are below.

Some of the infinitude of visualization subjects we did not cover are: heatmaps and 2D histograms, statistical functions, plot insets, ... And even within the Tidyverse, don't feel you need to limit yourself to ggplot2. Here's a good overview of some 2D histogram techniques, a discussion on overlaying a normal curve over a histogram, and a workaround to fit multiple plots in one giant chart.

For other datasets and applications, one place to start is data hosting and competition websites like Kaggle, and there are many areas like sports analytics, political forecasting, historical analysis, and countless others that have clean, open, and interesting data just waiting for you to read.csv.
