Getting and Cleaning Data

The purpose of this project is to demonstrate the ability to collect, work with, and clean a data set. The goal is to prepare tidy data that can be used for later analysis.

One of the most exciting areas in all of data science right now is wearable computing - see for example this article. Companies like Fitbit, Nike, and Jawbone Up are racing to develop the most advanced algorithms to attract new users. The data used in this project was collected from the accelerometers of the Samsung Galaxy S smartphone.

A full description is available at the site where the data was obtained.

This repo contains the run_analysis.R script that does the following:

  1. Merges the training and the test sets to create one data set.
  2. Extracts only the measurements on the mean and standard deviation for each measurement.
  3. Uses descriptive activity names to name the activities in the data set.
  4. Appropriately labels the data set with descriptive variable names.
  5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.
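
The five steps above can be sketched on a toy data set as follows. This is a minimal base R illustration, not the contents of run_analysis.R: the data frames `train`, `test` and `activity_labels`, the `ActivityId` column and all values are hypothetical stand-ins for the real files.

```r
# Toy stand-ins for the training and test sets (all values are made up).
train <- data.frame(
  Subject = c(1, 1, 3),
  ActivityId = c(1, 2, 1),
  `tBodyAcc-mean()-X` = c(0.28, 0.27, 0.30),
  `tBodyAcc-std()-X` = c(-0.99, -0.98, -0.97),
  `tBodyAcc-max()-X` = c(0.5, 0.6, 0.7),
  check.names = FALSE
)
test <- data.frame(
  Subject = c(2, 2),
  ActivityId = c(1, 1),
  `tBodyAcc-mean()-X` = c(0.20, 0.30),
  `tBodyAcc-std()-X` = c(-0.90, -0.80),
  `tBodyAcc-max()-X` = c(0.4, 0.5),
  check.names = FALSE
)
activity_labels <- data.frame(
  ActivityId = c(1, 2),
  Activity = c("WALKING", "SITTING")
)

# Step 1: merge the training and test sets by stacking their rows.
merged <- rbind(train, test)

# Step 2: keep the id columns plus the mean() and std() measurements.
keep <- grepl("^(Subject|ActivityId)$|-(mean|std)\\(\\)", names(merged))
selected <- merged[, keep]

# Step 3: replace the activity ids with descriptive activity names.
selected$Activity <- factor(
  selected$ActivityId,
  levels = activity_labels$ActivityId,
  labels = activity_labels$Activity
)
selected$ActivityId <- NULL

# Step 4: descriptive variable names; the toy columns already carry them.
measure_cols <- grep("-(mean|std)\\(\\)", names(selected), value = TRUE)

# Step 5: average of each variable for each activity and each subject.
tidy <- aggregate(
  selected[measure_cols],
  by = list(Subject = selected$Subject, Activity = selected$Activity),
  FUN = mean
)
```

Here `tidy` ends up with one row per subject and activity pair, and the `max()` column is dropped in step 2.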

Unless specified otherwise, all commands were run from within an R CLI session (on GNU/Linux).

All data transformations are documented in the form of comments in run_analysis.R.

Note: All data files are kept in ./data.

CodeBook.md describes the variables, data and any transformations or work that were performed to clean up the data for this project.

Save/Extract Raw Data

The following steps were used to download and extract the raw data:

> download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip", destfile="./data/getdata_projectfiles_UCI_HAR_Dataset.zip", method="curl")
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 59.6M  100 59.6M    0     0   339k      0  0:03:00  0:03:00 --:--:--  378k
>
> unzip ("./data/getdata_projectfiles_UCI_HAR_Dataset.zip", exdir = "./data")
> list.files("./data")
[1] "getdata_projectfiles_UCI_HAR_Dataset.zip"
[2] "UCI HAR Dataset"

The raw data file is fairly large (about 60 MB), so an exclusion entry was added to the project's .gitignore file to avoid pushing it to GitHub.
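
The download and extraction commands above could also be wrapped in a small helper that skips work already done. The sketch below is only an illustration: `fetch_raw_data` and its parameters are hypothetical names, not part of run_analysis.R, and the file names are taken from the console output above.

```r
# Hypothetical helper: download and extract the archive only when needed.
fetch_raw_data <- function(url, data_dir = "./data",
                           zip_name = "getdata_projectfiles_UCI_HAR_Dataset.zip",
                           extracted_name = "UCI HAR Dataset") {
  zip_path <- file.path(data_dir, zip_name)
  extracted_path <- file.path(data_dir, extracted_name)

  # Nothing to do when the extracted data set is already present.
  if (dir.exists(extracted_path)) {
    return(extracted_path)
  }
  # Create the data directory if it does not exist yet.
  if (!dir.exists(data_dir)) {
    dir.create(data_dir, recursive = TRUE)
  }
  # Skip the download when the archive is already on disk.
  if (!file.exists(zip_path)) {
    download.file(url, destfile = zip_path, mode = "wb")
  }
  unzip(zip_path, exdir = data_dir)
  extracted_path
}

# Usage (performs a network download on the first run):
# fetch_raw_data("https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip")
```

Returning the path of the extracted directory lets later code build file paths from it instead of repeating the directory name.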

Running the Analysis

To run the analysis you simply need the run_analysis.R script and the extracted data (see Save/Extract Raw Data) in the ./data directory.

From the working directory where you saved the script (and assuming you extracted the data in ./data), open an R console session and run:

source("run_analysis.R")

This will take a while as the script performs the data processing and analysis set out in steps 1-5 above. You will most likely see the following package startup messages in the console; they can be safely ignored:

Attaching package: ‘dplyr’

The following object is masked from ‘package:MASS’:

    select

The following object is masked from ‘package:stats’:

    filter

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
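
These notices only say that dplyr functions now take precedence over same-named functions from other packages. If they are distracting, one option is to attach the package quietly. The sketch below assumes the script attaches dplyr with `library(dplyr)`, which is not confirmed here; the `requireNamespace` guard is added only so the example also runs where dplyr is not installed.

```r
# Attach dplyr without printing the masking notices (only if it is installed).
if (requireNamespace("dplyr", quietly = TRUE)) {
  suppressPackageStartupMessages(library(dplyr))
}

# A masked function is still reachable through its namespace prefix.
# Here stats::filter computes a centred 3-point moving average.
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
```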

Once the script finishes, the tidy data set for steps 1-4 is stored in the totalData data frame:

> str(totalData)
'data.frame':	10299 obs. of  69 variables:
 $ Subject                    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ DataType                   : Factor w/ 2 levels "Training","Test": 1 1 1 1 1 1 1 1 1 1 ...
 $ Activity                   : Factor w/ 6 levels "LAYING","SITTING",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ tBodyAcc-mean()-X          : num  0.289 0.278 0.28 0.279 0.277 ...
 $ tBodyAcc-mean()-Y          : num  -0.0203 -0.0164 -0.0195 -0.0262 -0.0166 ...
 $ tBodyAcc-mean()-Z          : num  -0.133 -0.124 -0.113 -0.123 -0.115 ...
 $ tBodyAcc-std()-X           : num  -0.995 -0.998 -0.995 -0.996 -0.998 ...
 $ tBodyAcc-std()-Y           : num  -0.983 -0.975 -0.967 -0.983 -0.981 ...
 $ tBodyAcc-std()-Z           : num  -0.914 -0.96 -0.979 -0.991 -0.99 ...
 $ tGravityAcc-mean()-X       : num  0.963 0.967 0.967 0.968 0.968 ...
 $ tGravityAcc-mean()-Y       : num  -0.141 -0.142 -0.142 -0.144 -0.149 ...
 $ tGravityAcc-mean()-Z       : num  0.1154 0.1094 0.1019 0.0999 0.0945 ...
 $ tGravityAcc-std()-X        : num  -0.985 -0.997 -1 -0.997 -0.998 ...
 $ tGravityAcc-std()-Y        : num  -0.982 -0.989 -0.993 -0.981 -0.988 ...
 $ tGravityAcc-std()-Z        : num  -0.878 -0.932 -0.993 -0.978 -0.979 ...
 $ tBodyAccJerk-mean()-X      : num  0.078 0.074 0.0736 0.0773 0.0734 ...
 $ tBodyAccJerk-mean()-Y      : num  0.005 0.00577 0.0031 0.02006 0.01912 ...
 $ tBodyAccJerk-mean()-Z      : num  -0.06783 0.02938 -0.00905 -0.00986 0.01678 ...
 $ tBodyAccJerk-std()-X       : num  -0.994 -0.996 -0.991 -0.993 -0.996 ...
 $ tBodyAccJerk-std()-Y       : num  -0.988 -0.981 -0.981 -0.988 -0.988 ...
 $ tBodyAccJerk-std()-Z       : num  -0.994 -0.992 -0.99 -0.993 -0.992 ...
 $ tBodyGyro-mean()-X         : num  -0.0061 -0.0161 -0.0317 -0.0434 -0.034 ...
 $ tBodyGyro-mean()-Y         : num  -0.0314 -0.0839 -0.1023 -0.0914 -0.0747 ...
 $ tBodyGyro-mean()-Z         : num  0.1077 0.1006 0.0961 0.0855 0.0774 ...
 $ tBodyGyro-std()-X          : num  -0.985 -0.983 -0.976 -0.991 -0.985 ...
 $ tBodyGyro-std()-Y          : num  -0.977 -0.989 -0.994 -0.992 -0.992 ...
 $ tBodyGyro-std()-Z          : num  -0.992 -0.989 -0.986 -0.988 -0.987 ...
 $ tBodyGyroJerk-mean()-X     : num  -0.0992 -0.1105 -0.1085 -0.0912 -0.0908 ...
 $ tBodyGyroJerk-mean()-Y     : num  -0.0555 -0.0448 -0.0424 -0.0363 -0.0376 ...
 $ tBodyGyroJerk-mean()-Z     : num  -0.062 -0.0592 -0.0558 -0.0605 -0.0583 ...
 $ tBodyGyroJerk-std()-X      : num  -0.992 -0.99 -0.988 -0.991 -0.991 ...
 $ tBodyGyroJerk-std()-Y      : num  -0.993 -0.997 -0.996 -0.997 -0.996 ...
 $ tBodyGyroJerk-std()-Z      : num  -0.992 -0.994 -0.992 -0.993 -0.995 ...
 $ tBodyAccMag-mean()         : num  -0.959 -0.979 -0.984 -0.987 -0.993 ...
 $ tBodyAccMag-std()          : num  -0.951 -0.976 -0.988 -0.986 -0.991 ...
 $ tGravityAccMag-mean()      : num  -0.959 -0.979 -0.984 -0.987 -0.993 ...
 $ tGravityAccMag-std()       : num  -0.951 -0.976 -0.988 -0.986 -0.991 ...
 $ tBodyAccJerkMag-mean()     : num  -0.993 -0.991 -0.989 -0.993 -0.993 ...
 $ tBodyAccJerkMag-std()      : num  -0.994 -0.992 -0.99 -0.993 -0.996 ...
 $ tBodyGyroMag-mean()        : num  -0.969 -0.981 -0.976 -0.982 -0.985 ...
 $ tBodyGyroMag-std()         : num  -0.964 -0.984 -0.986 -0.987 -0.989 ...
 $ tBodyGyroJerkMag-mean()    : num  -0.994 -0.995 -0.993 -0.996 -0.996 ...
 $ tBodyGyroJerkMag-std()     : num  -0.991 -0.996 -0.995 -0.995 -0.995 ...
 $ fBodyAcc-mean()-X          : num  -0.995 -0.997 -0.994 -0.995 -0.997 ...
 $ fBodyAcc-mean()-Y          : num  -0.983 -0.977 -0.973 -0.984 -0.982 ...
 $ fBodyAcc-mean()-Z          : num  -0.939 -0.974 -0.983 -0.991 -0.988 ...
 $ fBodyAcc-std()-X           : num  -0.995 -0.999 -0.996 -0.996 -0.999 ...
 $ fBodyAcc-std()-Y           : num  -0.983 -0.975 -0.966 -0.983 -0.98 ...
 $ fBodyAcc-std()-Z           : num  -0.906 -0.955 -0.977 -0.99 -0.992 ...
 $ fBodyAccJerk-mean()-X      : num  -0.992 -0.995 -0.991 -0.994 -0.996 ...
 $ fBodyAccJerk-mean()-Y      : num  -0.987 -0.981 -0.982 -0.989 -0.989 ...
 $ fBodyAccJerk-mean()-Z      : num  -0.99 -0.99 -0.988 -0.991 -0.991 ...
 $ fBodyAccJerk-std()-X       : num  -0.996 -0.997 -0.991 -0.991 -0.997 ...
 $ fBodyAccJerk-std()-Y       : num  -0.991 -0.982 -0.981 -0.987 -0.989 ...
 $ fBodyAccJerk-std()-Z       : num  -0.997 -0.993 -0.99 -0.994 -0.993 ...
 $ fBodyGyro-mean()-X         : num  -0.987 -0.977 -0.975 -0.987 -0.982 ...
 $ fBodyGyro-mean()-Y         : num  -0.982 -0.993 -0.994 -0.994 -0.993 ...
 $ fBodyGyro-mean()-Z         : num  -0.99 -0.99 -0.987 -0.987 -0.989 ...
 $ fBodyGyro-std()-X          : num  -0.985 -0.985 -0.977 -0.993 -0.986 ...
 $ fBodyGyro-std()-Y          : num  -0.974 -0.987 -0.993 -0.992 -0.992 ...
 $ fBodyGyro-std()-Z          : num  -0.994 -0.99 -0.987 -0.989 -0.988 ...
 $ fBodyAccMag-mean()         : num  -0.952 -0.981 -0.988 -0.988 -0.994 ...
 $ fBodyAccMag-std()          : num  -0.956 -0.976 -0.989 -0.987 -0.99 ...
 $ fBodyBodyAccJerkMag-mean() : num  -0.994 -0.99 -0.989 -0.993 -0.996 ...
 $ fBodyBodyAccJerkMag-std()  : num  -0.994 -0.992 -0.991 -0.992 -0.994 ...
 $ fBodyBodyGyroMag-mean()    : num  -0.98 -0.988 -0.989 -0.989 -0.991 ...
 $ fBodyBodyGyroMag-std()     : num  -0.961 -0.983 -0.986 -0.988 -0.989 ...
 $ fBodyBodyGyroJerkMag-mean(): num  -0.992 -0.996 -0.995 -0.995 -0.995 ...
 $ fBodyBodyGyroJerkMag-std() : num  -0.991 -0.996 -0.995 -0.995 -0.995 ...
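
Because the measurement columns keep their original names, which contain `-` and `()`, they are not syntactic R names and need quoting when accessed. The following sketch uses a toy data frame called `toy` (a hypothetical stand-in with made-up values, not the real totalData) to show two ways of reading such a column and one possible way of deriving syntactic names.

```r
# Toy data frame with the same style of column names (values are made up).
toy <- data.frame(
  Subject = c(1L, 1L, 2L),
  Activity = factor(c("STANDING", "STANDING", "LAYING")),
  `tBodyAcc-mean()-X` = c(0.289, 0.278, 0.280),
  check.names = FALSE
)

# Quote non-syntactic names with backticks, or use [[ ]] with a string.
x_backtick <- toy$`tBodyAcc-mean()-X`
x_bracket <- toy[["tBodyAcc-mean()-X"]]

# Optional: derive syntactic names by collapsing runs of punctuation into "_"
# and dropping any trailing "_", e.g. "tBodyAcc_mean_X".
clean_names <- gsub("_+$", "", gsub("[^A-Za-z0-9]+", "_", names(toy)))
```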

The tidy data set for step 5 can be found in the summary variable:

> str(summary)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':	40 obs. of  69 variables:
 $ Subject                    : int  1 2 3 4 4 5 6 7 7 8 ...
 $ Activity                   : Factor w/ 6 levels "LAYING","SITTING",..: 3 3 3 2 3 3 3 2 3 2 ...
 $ DataType                   : chr  "Training" "Test" "Training" "Test" ...
 $ tBodyAcc-mean()-X          : num  0.266 0.273 0.273 0.273 0.275 ...
 $ tBodyAcc-mean()-Y          : num  -0.0183 -0.0191 -0.0179 -0.0196 -0.013 ...
 $ tBodyAcc-mean()-Z          : num  -0.108 -0.116 -0.106 -0.113 -0.105 ...
 $ tBodyAcc-std()-X           : num  -0.546 -0.606 -0.623 -0.282 -0.727 ...
 $ tBodyAcc-std()-Y           : num  -0.368 -0.429 -0.48 -0.176 -0.636 ...
 $ tBodyAcc-std()-Z           : num  -0.503 -0.589 -0.654 -0.549 -0.77 ...
 $ tGravityAcc-mean()-X       : num  0.745 0.661 0.708 0.764 0.685 ...
 $ tGravityAcc-mean()-Y       : num  -0.0826 -0.1472 -0.0261 0.0443 0.1384 ...
 $ tGravityAcc-mean()-Z       : num  0.0723 0.1349 0.0481 0.1255 0.1788 ...
 $ tGravityAcc-std()-X        : num  -0.96 -0.963 -0.966 -0.957 -0.965 ...
 $ tGravityAcc-std()-Y        : num  -0.951 -0.96 -0.945 -0.939 -0.942 ...
 $ tGravityAcc-std()-Z        : num  -0.926 -0.945 -0.927 -0.949 -0.938 ...
 $ tBodyAccJerk-mean()-X      : num  0.0771 0.0785 0.0702 0.0776 0.0794 ...
 $ tBodyAccJerk-mean()-Y      : num  0.01659 0.00709 0.01447 0.01741 -0.00174 ...
 $ tBodyAccJerk-mean()-Z      : num  -0.009108 0.000756 -0.000527 -0.003608 -0.008798 ...
 $ tBodyAccJerk-std()-X       : num  -0.525 -0.558 -0.635 -0.337 -0.743 ...
 $ tBodyAccJerk-std()-Y       : num  -0.47 -0.492 -0.557 -0.251 -0.71 ...
 $ tBodyAccJerk-std()-Z       : num  -0.717 -0.742 -0.796 -0.754 -0.877 ...
 $ tBodyGyro-mean()-X         : num  -0.0209 -0.0517 -0.0248 -0.0233 -0.0311 ...
 $ tBodyGyro-mean()-Y         : num  -0.0881 -0.0568 -0.0744 -0.084 -0.0767 ...
 $ tBodyGyro-mean()-Z         : num  0.0863 0.0873 0.0867 0.0869 0.0991 ...
 $ tBodyGyro-std()-X          : num  -0.687 -0.711 -0.699 -0.483 -0.784 ...
 $ tBodyGyro-std()-Y          : num  -0.451 -0.723 -0.763 -0.668 -0.848 ...
 $ tBodyGyro-std()-Z          : num  -0.597 -0.635 -0.709 -0.603 -0.773 ...
 $ tBodyGyroJerk-mean()-X     : num  -0.0971 -0.0876 -0.0992 -0.1162 -0.1047 ...
 $ tBodyGyroJerk-mean()-Y     : num  -0.0417 -0.0434 -0.0402 -0.0359 -0.0416 ...
 $ tBodyGyroJerk-mean()-Z     : num  -0.0471 -0.0558 -0.0521 -0.0493 -0.061 ...
 $ tBodyGyroJerk-std()-X      : num  -0.638 -0.672 -0.689 -0.503 -0.807 ...
 $ tBodyGyroJerk-std()-Y      : num  -0.634 -0.784 -0.843 -0.85 -0.922 ...
 $ tBodyGyroJerk-std()-Z      : num  -0.665 -0.675 -0.743 -0.566 -0.817 ...
 $ tBodyAccMag-mean()         : num  -0.454 -0.535 -0.563 -0.242 -0.683 ...
 $ tBodyAccMag-std()          : num  -0.497 -0.553 -0.591 -0.345 -0.705 ...
 $ tGravityAccMag-mean()      : num  -0.454 -0.535 -0.563 -0.242 -0.683 ...
 $ tGravityAccMag-std()       : num  -0.497 -0.553 -0.591 -0.345 -0.705 ...
 $ tBodyAccJerkMag-mean()     : num  -0.545 -0.588 -0.65 -0.383 -0.76 ...
 $ tBodyAccJerkMag-std()      : num  -0.516 -0.512 -0.608 -0.384 -0.747 ...
 $ tBodyGyroMag-mean()        : num  -0.475 -0.615 -0.643 -0.433 -0.741 ...
 $ tBodyGyroMag-std()         : num  -0.5 -0.681 -0.674 -0.526 -0.775 ...
 $ tBodyGyroJerkMag-mean()    : num  -0.64 -0.747 -0.784 -0.686 -0.869 ...
 $ tBodyGyroJerkMag-std()     : num  -0.652 -0.74 -0.804 -0.733 -0.886 ...
 $ fBodyAcc-mean()-X          : num  -0.532 -0.574 -0.626 -0.335 -0.74 ...
 $ fBodyAcc-mean()-Y          : num  -0.406 -0.433 -0.502 -0.184 -0.655 ...
 $ fBodyAcc-mean()-Z          : num  -0.596 -0.63 -0.7 -0.612 -0.809 ...
 $ fBodyAcc-std()-X           : num  -0.553 -0.62 -0.624 -0.264 -0.724 ...
 $ fBodyAcc-std()-Y           : num  -0.39 -0.465 -0.503 -0.228 -0.651 ...
 $ fBodyAcc-std()-Z           : num  -0.499 -0.601 -0.657 -0.553 -0.768 ...
 $ fBodyAccJerk-mean()-X      : num  -0.547 -0.562 -0.646 -0.373 -0.758 ...
 $ fBodyAccJerk-mean()-Y      : num  -0.507 -0.509 -0.583 -0.285 -0.721 ...
 $ fBodyAccJerk-mean()-Z      : num  -0.695 -0.716 -0.78 -0.721 -0.865 ...
 $ fBodyAccJerk-std()-X       : num  -0.544 -0.595 -0.658 -0.36 -0.752 ...
 $ fBodyAccJerk-std()-Y       : num  -0.466 -0.509 -0.56 -0.266 -0.718 ...
 $ fBodyAccJerk-std()-Z       : num  -0.738 -0.767 -0.811 -0.786 -0.889 ...
 $ fBodyGyro-mean()-X         : num  -0.623 -0.639 -0.642 -0.366 -0.746 ...
 $ fBodyGyro-mean()-Y         : num  -0.505 -0.722 -0.775 -0.733 -0.869 ...
 $ fBodyGyro-mean()-Z         : num  -0.554 -0.602 -0.671 -0.515 -0.755 ...
 $ fBodyGyro-std()-X          : num  -0.708 -0.735 -0.719 -0.522 -0.797 ...
 $ fBodyGyro-std()-Y          : num  -0.43 -0.727 -0.759 -0.638 -0.838 ...
 $ fBodyGyro-std()-Z          : num  -0.65 -0.683 -0.751 -0.676 -0.802 ...
 $ fBodyAccMag-mean()         : num  -0.478 -0.515 -0.579 -0.327 -0.706 ...
 $ fBodyAccMag-std()          : num  -0.59 -0.647 -0.663 -0.461 -0.753 ...
 $ fBodyBodyAccJerkMag-mean() : num  -0.499 -0.51 -0.605 -0.357 -0.74 ...
 $ fBodyBodyAccJerkMag-std()  : num  -0.542 -0.519 -0.616 -0.426 -0.758 ...
 $ fBodyBodyGyroMag-mean()    : num  -0.535 -0.7 -0.717 -0.577 -0.809 ...
 $ fBodyBodyGyroMag-std()     : num  -0.567 -0.725 -0.704 -0.575 -0.792 ...
 $ fBodyBodyGyroJerkMag-mean(): num  -0.646 -0.752 -0.81 -0.72 -0.884 ...
 $ fBodyBodyGyroJerkMag-std() : num  -0.686 -0.744 -0.81 -0.772 -0.898 ...
 - attr(*, "vars")=List of 1
  ..$ : symbol Subject
 - attr(*, "drop")= logi TRUE
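
Step 5 asks for one average per activity and subject, so a quick sanity check on any such summary is to compare its row count with the number of distinct subject and activity pairs in the input. The sketch below shows the idea on toy data; `obs`, `per_pair` and `n_pairs` are hypothetical names and the values are made up.

```r
# Toy input: one row per observation (values are made up).
obs <- data.frame(
  Subject = c(1, 1, 1, 2, 2),
  Activity = c("WALKING", "WALKING", "SITTING", "WALKING", "WALKING"),
  value = c(0.2, 0.4, 0.1, 0.3, 0.5)
)

# One row per (Subject, Activity) pair, holding the mean of the measurement.
per_pair <- aggregate(value ~ Subject + Activity, data = obs, FUN = mean)

# The summary should have exactly as many rows as there are distinct pairs.
n_pairs <- nrow(unique(obs[, c("Subject", "Activity")]))
stopifnot(nrow(per_pair) == n_pairs)
```

The same comparison can be run against the real data by substituting the merged data frame and its summary for the toy objects.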
