Human-Resources-Analytics
1052 NCCU Data Science Course - Final Project
Introduction
Human Resources Analytics is a Kaggle dataset that is well suited to exploration. Our goal is to understand why our best and most experienced employees are leaving the company prematurely. The dataset has ten variables and about fifteen thousand observations. Our challenge is to identify the reasons behind their departure and to predict which valuable employees will leave next.
Goal
- Understand the data and its variables
- Perform exploratory analysis by visualizing variables of interest
- Perform predictive analysis based on variables
This project uses R to analyze the dataset and a Shiny app to visualize the results.
- human_resources_analytics.R contains the model prediction and performance evaluation code.
- app.R contains the Shiny app.
Data Information
Data Source: Human Resources Analytics (from Kaggle)
This dataset contains 10 variables and 15K rows. Each row corresponds to an employee.
The variables are described below:
Variable Name | Description |
---|---|
satisfaction_level | Level of satisfaction (0-1) |
last_evaluation | Last evaluation |
number_project | Number of projects completed while at work |
average_montly_hours | Average monthly hours at workplace |
time_spend_company | Number of years spent in the company |
Work_accident | Whether the employee had a workplace accident |
left | Whether the employee left the workplace or not (1 or 0) Factor |
promotion_last_5years | Whether the employee was promoted in the last five years |
sales(String) | Department in which the employee works |
salary(String) | Relative level of salary (low, medium, high) |
Data Analysis
- Total number of employees: 14999
- Number of employees who left the company: 3571
- Number of employees who did not leave the company: 11428
- Proportion of employees who left: 0.24
# read data
hrdata <- read.csv('HR_comma_sep.csv', header = TRUE)
# summary of the data
head(hrdata)
summary(hrdata)
# check numbers of missing values
sum(is.na(hrdata))
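# encode salary as an ordered numeric scale (1 = low, 2 = medium, 3 = high);
# set the level order explicitly, since the default order is alphabetical
hrdata$salary <- as.numeric(factor(hrdata$salary, levels = c("low", "medium", "high")))
# store the department as a factor (read.csv returns strings in R >= 4.0)
hrdata$sales <- as.factor(hrdata$sales)
# treat the target as a categorical variable
hrdata$left <- as.factor(hrdata$left)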
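The headline figures listed under Data Analysis can be reproduced with `table()` and `prop.table()`. The sketch below is only an illustration: `toy_hr` is a small made-up data frame, not the Kaggle data, so that the snippet runs on its own. Substituting `hrdata` should give the counts reported above.

```r
# Hypothetical stand-in for hrdata: 10 employees, 3 of whom left
toy_hr <- data.frame(left = factor(c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0)))

# Count employees per class (0 = stayed, 1 = left)
left_counts <- table(toy_hr$left)

# Convert the counts to proportions
left_props <- prop.table(left_counts)

print(left_counts)
print(round(left_props, 2))
```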
Model Prediction
Use four different models to predict results, and compare their performance with multiple evaluation methods.
Model
# split data into training (70%) and testing (30%) sets, stratified on the target
library(caret)
trainIndex <- createDataPartition(hrdata$left, p = 0.7, list = FALSE, times = 1)
trainData <- hrdata[trainIndex,]
testData <- hrdata[-trainIndex,]
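When the outcome is a factor, `createDataPartition` samples within each class, so the training and testing sets keep roughly the same share of leavers. The base-R sketch below illustrates that idea only; `stratified_index` is a hypothetical helper and not caret's implementation, and the labels are made up.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class
stratified_index <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # index into idx so that a class with a single row is handled correctly
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)  # arbitrary seed, used only to make this sketch reproducible
y <- factor(rep(c(0, 1), times = c(80, 20)))  # made-up labels, 20% positives
train_idx <- stratified_index(y, p = 0.7)

# Class proportions in the training subset match those of the full data
print(prop.table(table(y[train_idx])))
```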
- Logistic Regression
model_glm <- glm(left ~., data = trainData, family = 'binomial')
# predict output of testing data
prediction_glm <- predict(model_glm, testData, type = 'response')
prediction_glm <- ifelse(prediction_glm > 0.5,1,0)
# get confusion matrix
cm_glm <- table(Truth = testData$left, Pred = prediction_glm)
table_glm <- getPerformanceTable("Logistic Regression", cm_glm)
# accuracy
print(paste("Logistic Regression Accuracy: ", round(mean(prediction_glm == testData$left), digits = 2)))
- Decision Tree
library(rpart)
model_dt <- rpart(left ~., data = trainData, method="class", minbucket = 25)
prediction_dt <- predict(model_dt, testData, type = "class")
cm_dt <- table(Truth = testData$left, Pred = prediction_dt)
table_dt <- getPerformanceTable("Decision Tree", cm_dt)
print(paste("Decision Tree Accuracy: ", round(mean(prediction_dt == testData$left), digits = 2)))
- Random Forest
library(randomForest)
# nodesize sets the minimum size of terminal nodes
model_rf <- randomForest(as.factor(left) ~., data = trainData, nodesize = 20, ntree = 200)
prediction_rf <- predict(model_rf, testData)
cm_rf <- table(Truth = testData$left, Pred = prediction_rf)
table_rf <- getPerformanceTable("Random Forest", cm_rf)
print(paste("Random Forest Accuracy: ", round(mean(prediction_rf == testData$left), digits = 2)))
- Support Vector Machine (SVM)
library(e1071)
model_svm <- svm(left ~ ., data = trainData, gamma = 0.25, cost = 10)
prediction_svm <- predict(model_svm, testData)
cm_svm <- table(Truth = testData$left, Pred = prediction_svm)
table_svm <- getPerformanceTable("SVM", cm_svm)
print(paste("SVM Accuracy: ", round(mean(prediction_svm == testData$left), digits = 2)))
Evaluation Performance
Model | Sensitivity | Specificity | Precision | Recall | F1 | AUC |
---|---|---|---|---|---|---|
Logistic Regression | 0.10 | 0.74 | 0.61 | 0.10 | 0.17 | 0.82 |
Decision Tree | 0.23 | 0.58 | 0.94 | 0.23 | 0.37 | 0.97 |
Random Forest | 0.23 | 0.77 | 0.99 | 0.23 | 0.37 | 0.99 |
SVM | 0.23 | 0.46 | 0.93 | 0.23 | 0.37 | 0.96 |
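The metrics in this table can all be derived from a confusion matrix laid out like the ones built above (rows = truth, columns = prediction), plus a score per observation for AUC. The sketch below is a minimal illustration under those assumptions. `metrics_from_cm` and `auc_from_scores` are hypothetical helpers, not the project's own `getPerformanceTable`, and the data are made up. Sensitivity and recall are the same quantity, which is why the two columns in the table agree.

```r
# Hypothetical helper: metrics from a 2x2 table with rows = truth, cols = prediction
metrics_from_cm <- function(cm, positive = "1") {
  negative <- setdiff(rownames(cm), positive)
  tp <- cm[positive, positive]
  fn <- cm[positive, negative]
  fp <- cm[negative, positive]
  tn <- cm[negative, negative]
  sensitivity <- tp / (tp + fn)  # identical to recall
  specificity <- tn / (tn + fp)
  precision <- tp / (tp + fp)
  f1 <- 2 * precision * sensitivity / (precision + sensitivity)
  c(sensitivity = sensitivity, specificity = specificity,
    precision = precision, recall = sensitivity, f1 = f1)
}

# Hypothetical helper: rank-based AUC, i.e. the probability that a random
# positive receives a higher score than a random negative
auc_from_scores <- function(truth, score, positive = "1") {
  is_pos <- truth == positive
  r <- rank(score)  # average ranks, so tied scores count as half
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up labels and scores, thresholded at 0.5 as in the logistic regression step
truth <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0))
score <- c(0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.2, 0.1, 0.1, 0.05)
pred <- factor(ifelse(score > 0.5, 1, 0), levels = c(0, 1))
cm <- table(Truth = truth, Pred = pred)

m <- metrics_from_cm(cm)
auc <- auc_from_scores(truth, score)
print(round(m, 2))
print(round(auc, 2))
```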
Data Visualization
Use the plotly and ggplot2 packages in R for data visualization, and present the graphs in the Shiny app.
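As an illustration of the kind of chart this involves, the sketch below compares satisfaction levels of employees who left and who stayed. The choice of plot is an assumption, not necessarily what app.R draws, and `toy_hr` is a made-up data frame whose column names follow the variable table above.

```r
library(ggplot2)

# Made-up stand-in for hrdata; column names follow the variable table
toy_hr <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.92, 0.89),
  left = factor(c(1, 0, 1, 0, 1, 1, 0, 0))
)

# Density of satisfaction, split by whether the employee left
p <- ggplot(toy_hr, aes(x = satisfaction_level, fill = left)) +
  geom_density(alpha = 0.5) +
  labs(x = "Satisfaction level", y = "Density", fill = "Left")
```

Calling `print(p)` renders the static chart, and `plotly::ggplotly(p)` converts the same object into an interactive one.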
Shiny App
Human Resources Analytics Shiny App
app.R contains the code for this Shiny app.