PHBS_MLF_2018

Arianna MISERICORDIA 1802010234

Ewa GERUS 1802010260

Noam PELEG 1802010275

Lillah BENARD 1802010264

Prediction of Automotive Accident Severity

MOTIVATION:

The motivation behind our research is to understand the specific conditions that affect the severity of an automotive accident. The purpose of this project is to identify the variables that matter most when operating a vehicle, in order to improve accident prevention.

GOAL:

We classify the severity of U.K. road accidents recorded from 2014 to 2016 using the methods covered in class (logistic regression, neural network, KNN, decision tree). The classification shows how strongly certain features affect accident severity.

DATA SOURCE:

The data come from the U.K. government, which compiled traffic data from police reports. Our analysis covers U.K. road accidents from 2014 to 2016.

Accidents are recorded according to these features:

  • Reference Number
  • Grid Ref: Easting
  • Grid Ref: Northing
  • Expr1
  • Severity
  • Day of the week
  • Time (24hr)
  • 1st Road Class
  • Road surface
  • Accident date
  • Weather condition
  • Lighting conditions
  • Number of vehicles
  • Casualty class
  • Sex of casualty
  • Age of casualty
  • Type of vehicle

DATASET SOURCE:

https://data.gov.uk/dataset/6efe5505-941f-45bf-b576-4c1e09b579a1/road-traffic-accidents

METHODOLOGY:

I. Data preprocessing:

[Figure: data preprocessing]

  • Merging datasets

  • Dropping columns containing references (Reference number, Grid Ref: Easting, Grid Ref: Northing) and correlated variables (Lighting conditions, Accident Date).

  • Handling missing data by deleting observations that contain NaN values.

  • Grouping variables into categories:

    • Time (24hr): Day-time, Night-time
    • Weather conditions: Fine, Snowing, Raining, Fog, Other
    • Type of Vehicle: Car, Bus, Goods vehicles, Motorcycle, Other
    • Day: Weekday, Weekend
    • Casualty class: Passenger, Pedestrian, Driver
  • Creating dummy variables from the categorical variables and dropping the redundant dummies that carry the same information (Sex of casualty_Female, Day_Weekday, Time (24hr)_Day-time).

  • Resampling unbalanced data

    • Original class counts: Slight: 6739, Serious: 957, Fatal: 48

    • After undersampling Slight down to the Serious count: Slight: 957, Serious: 957, Fatal: 48

    • After oversampling Fatal up to the Serious count: Slight: 957, Serious: 957, Fatal: 957
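
The preprocessing and resampling steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the data frame is synthetic, only the class counts come from this README, and the column names and the use of `sklearn.utils.resample` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic stand-in for the merged dataset (assumption); only the class
# counts below are taken from the README.
rng = np.random.default_rng(0)
counts = {"Slight": 6739, "Serious": 957, "Fatal": 48}
n = sum(counts.values())
df = pd.DataFrame({
    "Severity": np.repeat(list(counts), list(counts.values())),
    "Casualty Class": rng.choice(["Passenger", "Pedestrian", "Driver"], size=n),
    "Day": rng.choice(["Weekday", "Weekend"], size=n),
    "Number of Vehicles": rng.integers(1, 5, size=n),
})

# Delete observations that contain NaN values.
df = df.dropna()

# One-hot encode the categorical variables, then drop a redundant dummy:
# Day_Weekday carries the same information as Day_Weekend.
encoded = pd.get_dummies(df, columns=["Casualty Class", "Day"])
encoded = encoded.drop(columns=["Day_Weekday"])

target_size = counts["Serious"]
slight = encoded[encoded["Severity"] == "Slight"]
serious = encoded[encoded["Severity"] == "Serious"]
fatal = encoded[encoded["Severity"] == "Fatal"]

# Undersample the majority class without replacement.
slight_down = resample(slight, replace=False, n_samples=target_size, random_state=0)
# Oversample the minority class with replacement.
fatal_up = resample(fatal, replace=True, n_samples=target_size, random_state=0)

# Combine the three classes and shuffle the rows.
balanced = pd.concat([slight_down, serious, fatal_up]).sample(frac=1, random_state=0)
print(balanced["Severity"].value_counts().to_dict())
```

Oversampling with replacement duplicates rows of the Fatal class, so the balanced set contains repeated observations.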

II. Standardization and PCA:

  • Standardization

  • PCA

[Figure: PCA]

We keep the first 12 principal components, which together explain approximately 90% of the variance.
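
One way to choose the number of components from a variance threshold is sketched below. The data are synthetic, so the number of components found here will not be the 12 reported above; the 90% threshold is the only value taken from this README.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed feature matrix (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
# Make the last ten columns noisy copies of the first ten to induce correlation.
X[:, 10:] = X[:, :10] + 0.1 * rng.normal(size=(500, 10))

# Standardize each feature to zero mean and unit variance before PCA.
X_std = StandardScaler().fit_transform(X)

# Fit a full PCA and accumulate the explained variance ratios.
pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 90%.
n_components = int(np.searchsorted(cumulative, 0.90) + 1)

X_reduced = PCA(n_components=n_components).fit_transform(X_std)
print(n_components, X_reduced.shape)
```

scikit-learn also accepts a fraction directly, as in `PCA(n_components=0.90)`, which keeps enough components to exceed that share of the variance.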

III. Prediction:

The accuracy of each of the following methods was examined to choose the best classifier for our goal. The methods were implemented with scikit-learn and Keras.

To guard against overfitting, we used K-fold cross-validation with ten splits.
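
As an illustration of ten-split K-fold cross-validation, the sketch below trains and scores a classifier once per fold and reports the mean and standard deviation of the fold accuracies. The dataset is generated with `make_classification` and the shuffling and random seeds are assumptions; the README does not state them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class dataset standing in for the accident data (assumption).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
fold_accuracies = []
for train_idx, test_idx in kfold.split(X):
    # Fit on nine folds, evaluate on the held-out fold.
    clf = DecisionTreeClassifier(max_depth=6, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    predictions = clf.predict(X[test_idx])
    fold_accuracies.append(accuracy_score(y[test_idx], predictions))

fold_accuracies = np.array(fold_accuracies)
print(f"mean accuracy: {fold_accuracies.mean():.4f} (std {fold_accuracies.std():.4f})")
```

The same result can be obtained more compactly with `cross_val_score(clf, X, y, cv=kfold)`.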

Decision Tree

  • The graph below shows the tree depth that returns the best accuracy for the features in our dataset.

[Figure: K-fold accuracy of the decision tree by depth]

  • The best K-fold mean accuracy is 73.95% (standard deviation 2.64%), obtained with a tree depth of six.

[Figure: decision tree]

  • The three most important features in the decision tree model are: Casualty Class_Pedestrian, Road Surface_Dry, Road Surface_Wet or Damp.
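
A depth search of this kind, followed by a ranking of feature importances, might look like the sketch below. The data and the `feature_i` column names are synthetic placeholders, and the depth range of 1 to 10 is an assumption, so the selected depth and top features will differ from those reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded accident features (assumption).
X_arr, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(X_arr.shape[1])])

cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Cross-validated mean and standard deviation of accuracy for each depth.
results = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")
    results[depth] = (scores.mean(), scores.std())

best_depth = max(results, key=lambda d: results[d][0])

# Refit at the best depth and rank features by impurity-based importance.
best_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X, y)
importances = pd.Series(best_tree.feature_importances_, index=X.columns)
top3 = importances.sort_values(ascending=False).head(3)

print(best_depth, results[best_depth])
print(top3)
```

The importances returned by `feature_importances_` are impurity-based and sum to one, so they rank features relative to each other within this one tree.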

Random Forest

  • The mean accuracy is equal to 67.15% (standard deviation 3.98%).

Neural Network

  • Using preprocessed, standardized data followed by PCA.

  • Two hidden layers each containing 24 nodes.

  • The mean accuracy is equal to 72.66% (standard deviation 2.95%).
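
The project used Keras for the network. As a lightweight stand-in, the sketch below uses scikit-learn's `MLPClassifier` with the same shape (two hidden layers of 24 nodes) on standardized, PCA-reduced inputs. The swap to `MLPClassifier`, the synthetic data, and settings such as `max_iter` are assumptions, not the authors' configuration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class dataset standing in for the accident data (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Standardize, keep 12 principal components, then train the network.
# MLPClassifier replaces the Keras model used in the project (assumption).
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=12),
    MLPClassifier(hidden_layer_sizes=(24, 24), max_iter=500, random_state=0),
)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```

Wrapping the scaler and PCA in a pipeline means they are refit on the training folds only, so the held-out fold does not influence the transformation.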

KNN

  • Using preprocessed, standardized data followed by PCA.

  • The graph below shows the number of neighbors that returns the best accuracy for the features in our dataset.

[Figure: K-fold accuracy of KNN by number of neighbors]

  • The best K-fold mean accuracy is 71.37% (standard deviation 2.77%), obtained with five neighbors.
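
A search over the number of neighbors can be sketched with `GridSearchCV`. The data are synthetic and the candidate range of 1 to 20 neighbors is an assumption, so the selected value will not necessarily be five.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class dataset standing in for the accident data (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=12)),
    ("knn", KNeighborsClassifier()),
])

# Evaluate each candidate number of neighbors with ten-split K-fold CV.
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(1, 21))},
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
best_mean = search.cv_results_["mean_test_score"][search.best_index_]
best_std = search.cv_results_["std_test_score"][search.best_index_]
print(f"best k: {best_k}, mean accuracy: {best_mean:.4f} (std {best_std:.4f})")
```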

Logistic Regression

Using PCA

  • The mean accuracy is equal to 53.12% (standard deviation 2.32%).

Without PCA

  • Dropping reference variables.

  • Dropping 'Weather Condition' variable due to its high correlation with 'Road Surface'.

  • The mean accuracy is equal to 54.09% (standard deviation 3.03%).

[Figure: logistic regression]
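
The comparison between the two logistic regression variants can be sketched as below. The data are synthetic, and standardizing the inputs in the variant without PCA is an assumption; the README does not say whether that was done.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class dataset standing in for the accident data (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=0
)

models = {
    "with PCA": make_pipeline(
        StandardScaler(), PCA(n_components=12), LogisticRegression(max_iter=1000)
    ),
    # Scaling without PCA is an assumption made for this sketch.
    "without PCA": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy {scores.mean():.4f} (std {scores.std():.4f})")
```

Using the same `KFold` object for both variants means they are scored on identical splits, which keeps the comparison fair.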

CONCLUSION:


The best model is the decision tree with a mean accuracy of 73.95%.

We can conclude that the three most important features that affect the severity of an automotive accident are: Casualty Class_Pedestrian, Road Surface_Dry, Road Surface_Wet or Damp.
