Anomaly Detection Models Using Keystroke Dynamics (Behavioral Biometrics)

Project Outline
1. Brief Introduction: Anomaly Detection Models Using Keystroke Dynamics
2. Problem Statement
3. Dataset Introduction
4. Data Exploration
5. Model Building
6. Multinomial Logistic Regression
7. LDA
8. PCA+LDA
9. Random Forest, Bagging, and Boosting
10. Support Vector Machine
11. Model Performance Comparison
12. Recommended Model
13. Conclusion

Introduction

How does a system identify users based on their typing rhythms? Keystroke dynamics is the analysis of typing rhythms to discriminate among users, and it has been proposed for detecting impostors (both insiders and external attackers). Typing biometrics belongs to a larger class of biometrics known as "behavioral biometrics", which uses an individual's typing pattern to authenticate that person's identity.

The case study uses a dataset of passcode typing-speed information for a group of 51 different users, recorded at different time intervals. We build a classifier that recognizes the typing patterns of the different users, test the classifier on 5 different datasets containing typing-speed information for 5 different users, and identify the right user.

Problem Statement

The passcode dataset consists of passcode-related information for 51 different users. This is a multi-class classification problem with 51 classes. We will be looking at two sets of data:

Known.csv
Set of 12 files with unknown users

(The set of 12 files is used to test the prediction accuracy of the final classifier selected to predict the impostor. These files were given at the time of presentation by the professor, Dr. Christopher Saunders at South Dakota State University. Hence, these files will not be made available in the repository.)

We then perform the following tasks:

Task 1: Design a model.
Task 2: Predict the individual user.
Task 3: Report the method's accuracy and the reasons for selecting particular methods.

Dataset Introduction

The dataset consists of 1777 observations across 35 columns, representing the typing speed of 51 different individuals who have access to a passcode. Only one user can use the system at a time.

Passcode used: .tie5Roanl

Variables Information:

Subject
sessionIndex
Reps
Hold time
Up-Down time
Down-Down time (DD time = Hold time + Up-Down time)
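The relationship between the three timing variables can be illustrated with a small sketch. It is written in Python purely for illustration and is not the project's own code. The function name, the timestamps, and the reading of Hold as key-down to key-up and Up-Down as key-up to the next key-down are assumptions based on the usual keystroke-dynamics definitions.

```python
# Hypothetical helper: derive the three timing features for one key pair
# from raw press/release timestamps (in seconds).
def keystroke_features(press_1, release_1, press_2):
    hold = release_1 - press_1       # Hold time: key 1 down to key 1 up
    up_down = press_2 - release_1    # Up-Down time: key 1 up to key 2 down
    down_down = press_2 - press_1    # Down-Down time: key 1 down to key 2 down
    return hold, up_down, down_down

# Made-up timestamps for two consecutive keys of the passcode.
hold, up_down, down_down = keystroke_features(0.000, 0.125, 0.375)

# The identity from the variable list: DD time = Hold time + Up-Down time.
print(hold, up_down, down_down, abs(down_down - (hold + up_down)) < 1e-12)
```

Because Down-Down time is fully determined by the other two, it carries no information beyond Hold and Up-Down time.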

Understanding Keystroke Dynamics Used for Biometrics

Applications:

Biometric Identification for user authentication.
Prevent Identity theft
Just as handwriting is used to identify the author of a written text, typing patterns can be used to distinguish people in a computer-based crime.
Access-control and authentication mechanism

Advantages:

Improved security benefits.
More cost-effective.
Continuous authentication strategies.

Dataset Exploration

1. Subjects: a brief look at the number of reps in each class.

2. Subjects' frequency by sessionIndex: all users participate in session 8, but some users do not participate in session 7.

Hypothesis testing

Does sessionIndex make any difference in the passcode measurements for a user?

We performed an independence test:

Null hypothesis: A user's true mean in session 7 and in session 8 is the same.

Alternative hypothesis: A user's true mean in session 7 and in session 8 is different.

We found that most of the p-values are greater than 0.05. Below are the results:

Conclusion: We fail to reject the null hypothesis at the 0.05 significance level. Hence we conclude that sessionIndex does not make a meaningful difference in an individual user's passcode typing measurements.
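As an illustration of this kind of comparison, here is a minimal Python sketch of a two-sided permutation test on the difference in means between two sessions for one user. This is a swapped-in technique chosen so the example needs only the standard library. It is not necessarily the test the project ran, and the function name and sample values are made up.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative)."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Randomly reassign the pooled measurements to the two sessions.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # The +1 correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)

# Made-up hold times (seconds) for one user in sessions 7 and 8.
session_7 = [0.101, 0.095, 0.110, 0.098, 0.104, 0.099]
session_8 = [0.100, 0.097, 0.108, 0.102, 0.096, 0.105]

p = permutation_p_value(session_7, session_8)
print("reject H0" if p < 0.05 else "fail to reject H0", p)
```

Running one such test per user and per variable gives a collection of p-values that can be compared against the 0.05 threshold.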

Results:

Distribution Graph

Let us look at the distribution of hold time for 3 users: s003, s004, s005.

Let us look at the distribution of all users' hold times for typing the passcode.

Scatter plot:

Hold time and Up-Down time

Box-plots

Correlation Matrix

The correlation matrix shows that the UD and DD variables are highly correlated.
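A small Python sketch shows why this correlation is expected. Since DD time = Hold time + Up-Down time, DD tracks UD closely whenever hold times vary little. The numbers are made up for illustration, and the Pearson coefficient is computed by hand to keep the example dependency-free.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Made-up timings (seconds): hold times vary little, UD times vary a lot.
hold = [0.10, 0.11, 0.09, 0.10, 0.12]
ud = [0.20, 0.35, 0.15, 0.40, 0.25]
dd = [h + u for h, u in zip(hold, ud)]  # DD = Hold + UD

print(pearson(ud, dd))
```

With data shaped like this the coefficient is close to 1, which is one reason to keep only one variable of each UD/DD pair.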

Histogram

Variable Selection

Variables Removed:

X, Rep, SessionIndex, and the DD variables (highly correlated with other variables and highly skewed)

Models

Data Partition

We use the 'createDataPartition' method, which creates balanced splits of the data. Random sampling occurs within each class, which preserves the overall class distribution of the data.
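The idea of sampling within each class can be sketched in a few lines of Python. This is an illustration of stratified splitting in general, not the 'createDataPartition' implementation. The function name, the 80/20 split, the seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=42):
    """Split row indices so each class keeps roughly the same train share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random sampling happens within each class
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Toy labels: three users with unequal numbers of observations.
labels = ["s003"] * 10 + ["s004"] * 10 + ["s005"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Every user appears in both the training and the test set in proportion to their share of the data, which matters in a 51-class problem where a purely random split could leave some users under-represented.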