
GettingAndCleaningData

Getting And Cleaning Data - Course Project - getdata-007

Pre-requirements

This R script requires the following libraries to be installed prior to execution:

  • dplyr: a grammar of data manipulation - version 0.2 or higher
  • reshape2: Flexibly reshape data: a reboot of the reshape package.

How to use the run_analysis.R script

To run the run_analysis.R script, source the file in R or RStudio and then call the main function:

> source('run_analysis.R')
> main()
[1] "Starting run_analysis..."

Once you have sourced the script and called main, the following functions will be executed in order:

  1. downloadAndExtractZipFile(): this function creates a data folder locally if it doesn't exist and downloads the data zip file from https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip. After downloading, it extracts all files. Alternatively, if you already have the zip file and don't want to download it again, make sure you have the following folder structure - this prevents the function from downloading and unzipping the file:
data
     |-UCI HAR Dataset
        |-activity_labels.txt
        |-features.txt
        |-features_info.txt
        |-README.txt
        |-test
        |   |-Inertial Signals (content ignored in this script)
        |   |-subject_test.txt
        |   |-X_test.txt
        |   |-y_test.txt
        |-train
            |-Inertial Signals (content ignored in this script)
            |-subject_train.txt
            |-X_train.txt
            |-y_train.txt
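
A minimal sketch of this step, assuming the URL and folder names above (the local zip file name dataset.zip is an assumption; the actual script may name it differently):

```r
# Sketch of downloadAndExtractZipFile(): creates data/ if needed and
# skips the download when the extracted folder is already present.
downloadAndExtractZipFile <- function() {
  zipUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
  zipFile <- file.path("data", "dataset.zip")  # local zip name: hypothetical
  extractedDir <- file.path("data", "UCI HAR Dataset")
  if (!dir.exists("data")) dir.create("data")
  if (!dir.exists(extractedDir)) {
    download.file(zipUrl, destfile = zipFile, mode = "wb")
    unzip(zipFile, exdir = "data")
  }
  invisible(extractedDir)
}
```
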
  2. readAndMergeTrainingAndTestSets(): this function reads the following files:
  • X_train.txt - contains the training set - into trainX
  • y_train.txt - contains the training labels - into trainY
  • subject_train.txt - contains the training subjects per row - into trainSubject
  • X_test.txt - contains the test set - into testX
  • y_test.txt - contains the test labels - into testY
  • subject_test.txt - contains the test subjects per row - into testSubject

After reading the files, the function assigns subject as the column name for the datasets trainSubject and testSubject, and assigns label as the column name for the datasets trainY and testY.

It then column-binds the three training datasets into trainDS and the three test datasets into testDS:

trainDS <- cbind(trainX, trainY, trainSubject)  
testDS <- cbind(testX, testY, testSubject)

It then row-binds trainDS and testDS to merge the training and test data:

mergedDS <- rbind(trainDS, testDS)

The mergedDS will be returned.
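
The binding logic can be illustrated with toy stand-ins for the six datasets (only 2 measure columns here; the real X files have 561):

```r
# Toy stand-ins for the data read from the six files.
trainX <- data.frame(V1 = c(0.1, 0.2), V2 = c(0.3, 0.4))
trainY <- data.frame(label = c(1, 2))
trainSubject <- data.frame(subject = c(1, 1))
testX <- data.frame(V1 = 0.5, V2 = 0.6)
testY <- data.frame(label = 3)
testSubject <- data.frame(subject = 2)

# Column-bind each group of three, then row-bind train and test.
trainDS <- cbind(trainX, trainY, trainSubject)
testDS  <- cbind(testX, testY, testSubject)
mergedDS <- rbind(trainDS, testDS)
```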

  3. extractMeanAndSTD(dataSet): this function receives as input the merged dataset created by the previous function. It reads the features.txt file into features. From this features data, the function filters only the mean and standard deviation measures, using the grepl function with the patterns "-mean()" or "-std()":
meanAndStd <- features[grepl("-(mean|std)\\(\\)", features$V2, perl = TRUE), ]

The input dataset has 563 columns (561 measure columns plus subject and label). Using the meanAndStd dataset, the script reduces the input dataset to 68 columns (66 measure columns for mean and std plus subject and label). The new dataset is called reducedDS. After the reduction, the script takes the variable names for the 66 columns from meanAndStd$V2, updates them to be descriptive, and then updates the reducedDS column names with those. The rules for a descriptive name are:

  • "-" was removed
  • "()" was removed
  • starting "t" -> "time"
  • starting "f" -> "frequency"
  • "BodyBody" -> "body"
  • "Acc" -> "accelerometer"
  • "Gyro" -> "gyroscope"
  • "Mag" -> "magnitude"
  • characters were converted to lowercase

The function returns reducedDS with the updated column names.
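
Applied to a few real feature names from features.txt, the filter and renaming rules behave as follows (a sketch; the script applies the same substitutions to all 66 selected names):

```r
# A few raw feature names as they appear in features.txt.
features <- data.frame(
  V1 = 1:3,
  V2 = c("tBodyAcc-mean()-X", "tBodyAcc-meanFreq()-X", "fBodyBodyGyroMag-std()"),
  stringsAsFactors = FALSE
)

# Keep only -mean() and -std() measures; -meanFreq() is excluded.
meanAndStd <- features[grepl("-(mean|std)\\(\\)", features$V2, perl = TRUE), ]

# Apply the renaming rules in order.
descriptive <- meanAndStd$V2
descriptive <- gsub("[-()]", "", descriptive)        # drop "-" and "()"
descriptive <- gsub("^t", "time", descriptive)       # leading t -> time
descriptive <- gsub("^f", "frequency", descriptive)  # leading f -> frequency
descriptive <- gsub("BodyBody", "body", descriptive)
descriptive <- gsub("Acc", "accelerometer", descriptive)
descriptive <- gsub("Gyro", "gyroscope", descriptive)
descriptive <- gsub("Mag", "magnitude", descriptive)
descriptive <- tolower(descriptive)
```

The first name becomes timebodyaccelerometermeanx, matching the sample output shown at the end of this README.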

  4. updateActivityNames(dataSet): this function receives the reduced dataset returned by the previous function and merges it with the dataset containing the content of the activity_labels.txt file. For the merge, the script uses label as the key from the input dataset and V1 as the key from the activity_labels.txt dataset. Then a new column called activity is created with the content of V2, and the columns V2 and label are dropped from the final, reduced dataset.
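
A sketch of this merge on toy data, using the column and key names described above:

```r
# Toy activity_labels.txt content: numeric code -> activity name.
activityLabels <- data.frame(V1 = 1:2, V2 = c("WALKING", "LAYING"),
                             stringsAsFactors = FALSE)
# Toy reduced dataset with one measure column.
reducedDS <- data.frame(subject = c(1, 1, 2), label = c(1, 2, 2),
                        timebodyaccelerometermeanx = c(0.27, 0.22, 0.28))

# Merge on the numeric code, add the descriptive activity column,
# then drop the merge keys.
merged <- merge(reducedDS, activityLabels, by.x = "label", by.y = "V1")
merged$activity <- merged$V2
merged$V2 <- NULL
merged$label <- NULL
```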

  5. tidyDataSet(dataSet): this function receives as input the dataset generated by the previous function and melts it using subject and activity as ids, with feature as the variable name. It then groups by subject, activity and feature, and finishes with a call to the summarise function with mean(value). The function returns the narrow tidy data.
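
The script uses reshape2::melt with dplyr's group_by and summarise; the same melt-then-average logic can be sketched in base R on a toy dataset:

```r
# Toy wide dataset: one measure column, two subjects.
wide <- data.frame(subject  = c(1, 1, 2),
                   activity = c("WALKING", "WALKING", "LAYING"),
                   timebodyaccelerometermeanx = c(0.2, 0.4, 0.3))

# "Melt": one row per (subject, activity, feature, value).
measureCols <- setdiff(names(wide), c("subject", "activity"))
long <- do.call(rbind, lapply(measureCols, function(col) {
  data.frame(subject = wide$subject, activity = wide$activity,
             feature = col, value = wide[[col]])
}))

# Group by subject, activity and feature, then average the values.
tidy <- aggregate(value ~ subject + activity + feature, data = long, FUN = mean)
```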

  6. writeTidyDataFile(tidyData, fileName): this function receives two input parameters:

  • tidyData: the narrow tidy dataset to be written to the file
  • fileName: the name of the file that will be created with the tidy data

The function checks whether the file already exists locally and deletes it if so. It then writes the tidy dataset using write.table with row.names = FALSE, using fileName as the name of the output file.
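
A minimal sketch of this function, assuming the behaviour described above:

```r
# Sketch of writeTidyDataFile(): delete any existing copy, then write
# the tidy dataset without row names.
writeTidyDataFile <- function(tidyData, fileName) {
  if (file.exists(fileName)) file.remove(fileName)
  write.table(tidyData, file = fileName, row.names = FALSE)
}
```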

Auxiliary Functions

  • readDataFile(fileName, ...): this function checks whether the file exists; if it does, it reads the file using the read.table function, passing any extra parameters ... straight through, and returns the resulting dataset. If the file doesn't exist, the function stops with a "file doesn't exist" message.
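
A sketch of readDataFile based on the behaviour described (the exact stop message in the script may differ):

```r
# Sketch of readDataFile(): fail fast when the file is missing, otherwise
# delegate to read.table, forwarding any extra arguments via ...
readDataFile <- function(fileName, ...) {
  if (!file.exists(fileName)) stop("file doesn't exist")
  read.table(fileName, ...)
}
```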

  • readTidyDataFile(): this function calls the readDataFile function with the tidy data filename and the flag header = TRUE.

How to read the tidy data file generated by this script

Assuming the file is in the same folder as the run_analysis.R script and has the original name tidyData.txt, you can call the function readTidyDataFile() to load the file:

> tidyData <- readTidyDataFile()
[1] "Reading file  tidyData.txt"
> dim(tidyData)
[1] 11880     4
> head(tidyData, n = 2)
  subject activity                    feature mean.value.
1       1   LAYING timebodyaccelerometermeanx  0.22159824
2       1   LAYING timebodyaccelerometermeany -0.04051395

September 20, 2014
