
Getting and Cleaning Data Course Project

Objective

This repository contains the Course Project for the Getting and Cleaning Data course of the Coursera Data Science Specialization.
The goal of the project is to create an R script that imports, cleans, and summarizes a dataset spread across several files, and exports the resulting dataset to a file.
The raw data to process can be obtained from the following link: Raw dataset

The files must be downloaded and extracted into the working directory from which the R script will be run.
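The download and extraction step can be sketched from R itself. The function below is a hypothetical helper, not part of the project script, and the URL argument is a placeholder for the raw-dataset link given above:

```r
# Sketch only: 'url' must be the raw-dataset link given above (placeholder here).
fetchHARData <- function(url, dest = "dataset.zip") {
  if (!dir.exists("UCI HAR Dataset")) {
    download.file(url, destfile = dest, mode = "wb")
    unzip(dest)  # extracts the 'UCI HAR Dataset' directory into the working dir
  }
}
```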

The Raw Data

After the extraction, the directory structure must be:

  • UCI HAR Dataset
    • test
      • Inertial Signals
      • subject_test.txt
      • X_test.txt
      • y_test.txt
    • train
      • Inertial Signals
      • subject_train.txt
      • X_train.txt
      • y_train.txt
    • activity_labels.txt
    • features.txt
    • features_info.txt
    • README.txt

More information about the raw data can be obtained in the CodeBook.

Desired Datasets

The R script is expected to create two datasets by combining the data stored in the different raw data files.

Merged dataset

This table merges the training and the test sets into a single table, adding one column for the subject who performed the activity and another for the activity performed.
Only the measurements on the mean and standard deviation of each signal are extracted to build the new table.
The structure of this table is described below:

subject             activity      sensor data [1..66]
subject_train.txt   y_train.txt   X_train.txt
subject_test.txt    y_test.txt    X_test.txt

More information about the variables in the table can be found in the CodeBook.
The resulting dataset is a table with 10299 observations of 68 variables.
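The selection criterion can be illustrated with a name-based sketch (the feature names below are toy examples, not the real features.txt content): the 66 retained columns are exactly those whose names contain mean() or std(), while variants such as meanFreq() are excluded.

```r
# Toy feature names illustrating which columns qualify for extraction.
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
              "tBodyAcc-meanFreq()-X", "angle(X,gravityMean)")
keep <- grep("-(mean|std)\\(\\)", features)  # only literal mean()/std() match
features[keep]  # "tBodyAcc-mean()-X" "tBodyAcc-std()-X"
```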

Summarized dataset

This table is a summarization of the previous one: it holds the mean of each variable for each subject and activity.
The structure of this table is described below:

subject     activity     sensor measure 1   sensor measure 2   ...   sensor measure n
subject 1   activity 1   mean measure 1     mean measure 2     ...   mean measure n
subject 2   activity 1   mean measure 1     mean measure 2     ...   mean measure n
...         activity 1   mean measure 1     mean measure 2     ...   mean measure n
subject n   activity 1   mean measure 1     mean measure 2     ...   mean measure n
subject 1   activity 2   mean measure 1     mean measure 2     ...   mean measure n
...         ...          mean measure 1     mean measure 2     ...   mean measure n
subject n   activity n   mean measure 1     mean measure 2     ...   mean measure n

The resulting dataset is a table with 180 observations of 68 variables.
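The row count follows from the study design: the UCI HAR data covers 30 subjects and 6 activities, and the summarized table has one mean row per (subject, activity) pair.

```r
subjects   <- 30  # subjects in the UCI HAR study
activities <- 6   # WALKING, WALKING_UPSTAIRS, ..., LAYING
subjects * activities  # 180 rows, matching the 180 observations above
```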

The R Script

Only one script has been written to obtain the two requested datasets. The script, named run_analysis.R, must be executed from a working directory at the same level as the UCI HAR Dataset directory.
The script returns the summarized dataset as output, and both datasets (the merged and the summarized one) remain available in the environment after the script has finished. In addition, two comma-separated text files containing the two datasets are created in the working directory. Their names are tidyData.txt and summaryData.txt.

Imported Libraries

  • data.table: loaded to use the 'data.table' data type.
  • reshape2: loaded by the script; note, however, that the 'aggregate' function it uses actually comes from base R (the stats package), not from reshape2.
  • stringr: loaded to use the 'str_replace' function.

The Script Variables

  • columns: stores the feature numbers that are extracted from the raw datasets. Removed before the script finishes.
  • trainSubject: stores the subject variable for each observation in the 'train' dataset, loaded from the ./UCI HAR Dataset/train/subject_train.txt file. Removed before the script finishes.
  • trainActivity: stores the activity variable for each observation in the 'train' dataset, loaded from the ./UCI HAR Dataset/train/y_train.txt file. Removed before the script finishes.
  • trainData: stores all the observations for each variable in the 'train' dataset, loaded from the ./UCI HAR Dataset/train/X_train.txt file. It is then combined with the trainSubject and trainActivity variables to form a single table. Removed before the script finishes.
  • testSubject: stores the subject variable for each observation in the 'test' dataset, loaded from the ./UCI HAR Dataset/test/subject_test.txt file. Removed before the script finishes.
  • testActivity: stores the activity variable for each observation in the 'test' dataset, loaded from the ./UCI HAR Dataset/test/y_test.txt file. Removed before the script finishes.
  • testData: stores all the observations for each variable in the 'test' dataset, loaded from the ./UCI HAR Dataset/test/X_test.txt file. It is then combined with the testSubject and testActivity variables to form a single table. Removed before the script finishes.
  • tidyData: merges the trainData and testData variables into a single dataset. Remains available in the R environment after the script has finished.
  • colNames: stores the descriptions of the dataset variables, loaded from the ./UCI HAR Dataset/features.txt file. Removed before the script finishes.
  • charToRemove: stores a vector with the characters to be removed from the original variable labels in order to make them more readable. Removed before the script finishes.
  • activityNames: stores a vector with the activity descriptions, loaded from the ./UCI HAR Dataset/activity_labels.txt file, used to turn the numeric activity codes into descriptive values. Removed before the script finishes.
  • summaryData: summarizes tidyData with the mean of each variable grouped by the 'subject' and 'activity' variables. Remains available in the R environment after the script has finished.

At the end of the script, only the tidyData and summaryData variables remain available in the R environment; the rest are deleted before the script finishes.

Cleaning and Sorting process

  1. In order to subset the given datasets to the relevant sensor values, an array with the desired column numbers is created, based on the variable descriptions given in the features.txt file.

    columns <- c(1:6, 41:46, 81:86, 121:126, 161:166, 201:202, 214:215, 227:228, 240:241, 253:254, 266:271, 345:350, 424:429, 503:504, 516:517, 529:530, 542:543)
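A quick sanity check (a sketch, not part of the script) confirms the vector selects 66 sensor columns, which together with the subject and activity columns give the 68 variables mentioned above:

```r
columns <- c(1:6, 41:46, 81:86, 121:126, 161:166, 201:202, 214:215,
             227:228, 240:241, 253:254, 266:271, 345:350, 424:429,
             503:504, 516:517, 529:530, 542:543)
length(columns)  # 66 sensor columns; + subject + activity = 68 variables
```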

  2. After this, the different files are loaded into memory using the read.table function on each relevant file described above:

    trainData <- read.table("./UCI HAR Dataset/train/X_train.txt")
    trainSubject <- read.table("./UCI HAR Dataset/train/subject_train.txt")
    trainActivity <- read.table("./UCI HAR Dataset/train/y_train.txt")
    testData <- read.table("./UCI HAR Dataset/test/X_test.txt")
    testSubject <- read.table("./UCI HAR Dataset/test/subject_test.txt")
    testActivity <- read.table("./UCI HAR Dataset/test/y_test.txt")

  3. The data files are subset using the columns variable previously declared:

    trainData <- trainData[,columns]
    testData <- testData[,columns]

  4. Then the subject and activity columns are added to the corresponding dataset using the cbind function:

    trainData <- cbind(trainSubject, trainActivity, trainData)
    testData <- cbind(testSubject, testActivity, testData)

  5. Then the rows of both tables are stacked one over the other using the rbind function:

    tidyData <- rbind(trainData, testData)
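The stacking can be illustrated on toy data (hypothetical values, one measurement column instead of 66):

```r
# Two tiny stand-ins for the train and test tables; columns must match.
trainPart <- data.frame(subject = 1, activity = 1, m1 = 0.1)
testPart  <- data.frame(subject = 2, activity = 3, m1 = 0.5)
combined  <- rbind(trainPart, testPart)  # rows of testPart appended below
nrow(combined)  # 2  (for the real data: 7352 train + 2947 test = 10299)
```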

  6. Now it's time to label the dataset with descriptive variable names. To do this, the variable names are loaded from the features.txt file, and the 'subject' and 'activity' names are manually added for the first and second columns:

    colNames <- read.table("./UCI HAR Dataset/features.txt")
    colNames <- c("subject", "activity", as.character(colNames[columns,2]))

  7. In order to obtain better variable names, special characters are removed from the colNames variable. First, a vector holding the special characters to remove is created (the parentheses must be escaped because gsub interprets its pattern as a regular expression); then each character is removed with gsub:

    charToRemove <- c("-", "\\(", "\\)") ## Characters to remove (escaped for the regex)
    for (k in charToRemove) colNames <- gsub(k, "", colNames)
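Because gsub treats its pattern as a regular expression, the parentheses need escaping; an alternative sketch (toy name below) avoids escaping altogether by passing fixed = TRUE, which makes gsub match the characters literally:

```r
name <- "tBodyAcc-mean()-X"
for (k in c("-", "(", ")")) name <- gsub(k, "", name, fixed = TRUE)
name  # "tBodyAccmeanX"
```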

  8. To obtain more homogeneous variable names, some strings are changed in the colNames variable using the str_replace function. This way, all the names in colNames follow the pattern myVariableNameMeanX, with a capital letter marking the beginning of each new word except the first:

    colNames <- str_replace(colNames, "mean", "Mean")
    colNames <- str_replace(colNames, "std", "Std")
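For illustration (toy names below), base R's sub is used here as a stand-in for stringr::str_replace; both replace only the first match in each string:

```r
n <- c("tBodyAccmeanX", "tBodyAccstdX")
n <- sub("mean", "Mean", n)  # capitalize 'mean' where it appears
n <- sub("std", "Std", n)    # capitalize 'std' where it appears
n  # "tBodyAccMeanX" "tBodyAccStdX"
```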

  9. And now, names are assigned to the data set columns using the colnames function:

    colnames(tidyData) <- colNames

  10. The last step to obtain the requested dataset is to use descriptive activity names for the observations in the activity variable. To do this, the activity names are loaded from the activity_labels.txt file, and the activity variable is converted into a factor whose labels are those descriptive values:

    activityNames <- read.table("./UCI HAR Dataset/activity_labels.txt")
    tidyData$activity <- factor(tidyData$activity, levels = 1:6, labels = activityNames[,2])
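A small self-contained example of this numeric-to-label conversion (toy values, using two of the six activity labels):

```r
act <- c(1, 2, 1)  # numeric activity codes as read from the raw file
act <- factor(act, levels = 1:2, labels = c("WALKING", "WALKING_UPSTAIRS"))
as.character(act)  # "WALKING" "WALKING_UPSTAIRS" "WALKING"
```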

  11. The dataset is ready and is exported to a comma-separated text file in the current directory with the write.table function:

    write.table(tidyData, "./tidyData.txt", sep = ",", row.names = FALSE)

In order to create the second requested dataset, the observations of each variable are collapsed to their mean, grouped by the subject and activity variables.

  1. To do this, a new dataset is created from the previous tidy dataset using the aggregate function. This function builds a new table by applying a function (here, the mean) to each column within the groups defined by certain columns. The labels of the new grouping columns are specified in the function call.

    summaryData <- aggregate(tidyData[,3:ncol(tidyData)], by = list(subject=tidyData$subject, activity=tidyData$activity), FUN = "mean")
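How aggregate collapses rows to group means can be seen on toy data (hypothetical values, one measurement column):

```r
toy <- data.frame(subject  = c(1, 1, 2, 2),
                  activity = c("WALKING", "WALKING", "WALKING", "WALKING"),
                  m1       = c(0.2, 0.4, 0.6, 0.8))
agg <- aggregate(toy["m1"],
                 by = list(subject = toy$subject, activity = toy$activity),
                 FUN = mean)
agg$m1  # one mean per (subject, activity) pair: 0.3 and 0.7
```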

  2. To finish, the obtained data set is exported to a comma separated text file in the current directory with the write.table function:

    write.table(summaryData, "./summaryData.txt", sep = ",", row.names = FALSE)
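The exported files can be read back with read.table using the same separator; a self-contained round-trip sketch with a temporary file:

```r
df <- data.frame(subject = 1:2, activity = c("WALKING", "SITTING"))
f <- tempfile(fileext = ".txt")
write.table(df, f, sep = ",", row.names = FALSE)  # same options as the script
back <- read.table(f, sep = ",", header = TRUE)   # header = TRUE restores names
identical(dim(back), dim(df))  # TRUE: the same 2 x 2 table comes back
```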

In order to keep the environment tidy, the temporary variables are deleted with the rm function. When the script finishes, only the tidyData and summaryData datasets remain available in the environment.

Contributors

antoniomr