Git repository to manage machine learning project 3 for DSE511.

Python 2.17% Jupyter Notebook 97.83%
machine-learning accidents-analysis classification python3 numpy matplotlib pandas

dse511-project-3-code-repo's Introduction

Traffic Accident Prediction with Machine Learning

In this project, we work with an accidents dataset whose features describe location, weather conditions, and road infrastructure, and we use those features to predict accident severity. The project is also part of the final project for the DSE511 course in the Data Science and Engineering program at the Bredesen Center (University of Tennessee, Knoxville).

Project Introduction

This project aims to model vehicle accident severity based on weather and road conditions. Once a final model is selected, we plan to perform exploratory factor analysis (or a similar methodology) to identify which variables contribute the most to accident severity. Additionally, we explore whether accident severity varies significantly across major cities in the United States. This topic matters because vehicular accidents account for approximately 38,000 deaths in the United States each year and cause about 4.4 million hospitalizations.

Data

We have selected the dataset titled: “US Accidents (updated) A Countrywide Traffic Accident Dataset (2016 - 2020).” The dataset can be found here: https://www.kaggle.com/sobhanmoosavi/us-accidents

The dataset consists of 1.5 million observations, each with 47 features. Each observation represents an accident that occurred in the United States between 2016 and 2020.

Accessing Data:

  1. The full raw data carrying 1.5 million observations can be downloaded from here.

  2. We build our analysis on a downsampled dataset covering six cities: Phoenix, Los Angeles, New York, Philadelphia, Houston, and Chicago. This dataset is stored at /data/raw/accident_data.csv. To regenerate it, execute the following command.

python main.py data

Note that the above command also creates an imputed dataset, which we use for further downstream work. It is saved at /data/processed/imputed.pkl.
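The city downsampling described in step 2 could look roughly like the sketch below. It assumes the raw data has a `City` column (as listed among the dataset's variables) and uses a tiny made-up frame in place of the real CSV. The function name `keep_selected_cities` is hypothetical and is not taken from the repository.

```python
import pandas as pd

# The six cities named in this README.
CITIES = ["Phoenix", "Los Angeles", "New York", "Philadelphia", "Houston", "Chicago"]


def keep_selected_cities(df: pd.DataFrame, cities=CITIES) -> pd.DataFrame:
    """Return only the rows whose City is one of the selected cities.

    Hypothetical helper for illustration; the actual logic behind
    `python main.py data` may differ.
    """
    return df[df["City"].isin(cities)].reset_index(drop=True)


# Made-up rows standing in for the raw accidents data.
raw = pd.DataFrame(
    {
        "City": ["Phoenix", "Knoxville", "Chicago", "Houston", "Boston"],
        "Severity": [2, 3, 2, 4, 1],
    }
)
subset = keep_selected_cities(raw)
print(subset)
```

On the real data, the same filter would be applied to the frame read from the downloaded CSV before any imputation.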

Generate Results

  1. To generate the results from our selected models, execute the following command. Make sure the necessary pickle file exists at /data/processed/imputed.pkl. See the "Accessing Data" section to learn how to create this file if it does not exist.

python main.py results

  2. To see the results from hyperparameter tuning, execute the following command. Again, make sure the necessary pickle file exists at /data/processed/imputed.pkl; see the "Accessing Data" section if it does not. Be warned that this run takes hours to complete.

python main.py tune
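For readers curious how a single entry point can serve the `data`, `results`, and `tune` commands, here is a minimal sketch using `argparse`. The three stage functions are hypothetical placeholders; the real main.py may be organized differently.

```python
import argparse


def build_dataset():
    # Hypothetical placeholder for the work behind `python main.py data`.
    return "data"


def generate_results():
    # Hypothetical placeholder for `python main.py results`.
    return "results"


def tune_models():
    # Hypothetical placeholder for `python main.py tune`.
    return "tune"


# Map each command-line word to the function that handles it.
COMMANDS = {"data": build_dataset, "results": generate_results, "tune": tune_models}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run one stage of the project.")
    parser.add_argument("command", choices=sorted(COMMANDS), help="stage to run")
    args = parser.parse_args(argv)
    return COMMANDS[args.command]()


# Passing the arguments explicitly mimics `python main.py data`.
print(main(["data"]))
```

Restricting `command` to a fixed set of choices means a mistyped stage name fails immediately with a usage message instead of silently doing nothing.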

Methods

We employ ANOVA to test whether accident severity differs significantly between cities. For the classification task, we use the following methods:

  1. Logistic Regression (Baseline)
  2. Multinomial Naive Bayes

Ensemble methods that we use are:

  1. Random Forest
  2. XGBoost
  3. AdaBoost
  4. Gradient Boosting
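
As a rough illustration of comparing the baseline against an ensemble method, the sketch below cross-validates logistic regression and a random forest with scikit-learn. It runs on synthetic data from `make_classification`, not on the accidents dataset, and the choice of accuracy as the metric and of three folds are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic three-class problem standing in for the severity labels.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean cross-validated accuracy per model (metric chosen for illustration only).
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The other listed methods could be added to the `models` dictionary in the same way, so every model is scored under identical folds.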

GitHub Workflow

We manage code development with one main branch and three development branches, one assigned to each team member. Tasks are divided into issues, and the issues are grouped under milestones. Finally, we track all the issues in the project section. Find the issues and the project at the following location:

Project Board, Current Issues

Repository Structure

├── README.md          <- The top-level README carrying the project description and organization.
├── data
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original data that we use for further processing.
│
├── docs
│   ├── images         <- Folder saving generated images for report and presentation.
│   └── reports        <- Folder carrying reports and presentations submitted during the project.
│
├── notebooks          <- Folder carrying Jupyter notebooks.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code used in this project.
│   │
│   ├── data           <- Scripts to download or generate data.
│   ├── models         <- Scripts to train models and then use trained models to make
│   │                     predictions.
│   ├── preprocessing  <- Scripts to perform train/test split, feature selection, feature
│   │                     extraction, etc.
│   ├── results        <- Folder carrying the script to generate results from our tuned models.
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations.
│
└── main.py            <- Script that will run all the necessary code to generate the results.

Team Members

Name            GitHub Handle
Sanjeev Singh   @isanjeevsingh
Russ Limber     @Russtyhub
EonYeon Jo      @EYJo1

dse511-project-3-code-repo's Issues

EDA: Mean Duration

  • Plot mean duration of the accident by severity
  • Plot mean duration of the accident by severity for each city
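
A minimal pandas sketch of the aggregation behind these plots follows. It assumes that duration means `End_Time` minus `Start_Time` (both columns are listed among the dataset's variables) and uses made-up rows; `Duration_min` is a hypothetical column name.

```python
import pandas as pd

# Made-up rows; the real data would come from the accidents file.
df = pd.DataFrame(
    {
        "City": ["Phoenix", "Phoenix", "Chicago", "Chicago"],
        "Severity": [2, 3, 2, 3],
        "Start_Time": pd.to_datetime(
            ["2019-01-07 08:00", "2019-01-07 09:00", "2019-01-07 07:00", "2019-01-07 12:00"]
        ),
        "End_Time": pd.to_datetime(
            ["2019-01-07 08:30", "2019-01-07 10:00", "2019-01-07 07:50", "2019-01-07 14:00"]
        ),
    }
)

# Duration in minutes (assumed definition: End_Time - Start_Time).
df["Duration_min"] = (df["End_Time"] - df["Start_Time"]).dt.total_seconds() / 60

# Mean duration per severity level, then per city and severity level.
mean_by_severity = df.groupby("Severity")["Duration_min"].mean()
mean_by_city = df.groupby(["City", "Severity"])["Duration_min"].mean().unstack()

print(mean_by_severity)
print(mean_by_city)
```

Calling `.plot(kind="bar")` on either result would then produce the bar charts with matplotlib.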

Factor Analysis

Finding the factors that carry the greatest significance.

EDA: By month and weekday

  • Distribution of accidents by month and weekday split by severity. We can also provide the stats by cities.
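
The counts behind such a distribution could be tabulated as in the sketch below, which assumes month and weekday are derived from `Start_Time` and uses made-up rows. The `Month` and `Weekday` column names are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the accidents data.
df = pd.DataFrame(
    {
        "Start_Time": pd.to_datetime(
            ["2019-01-07 08:00", "2019-01-08 09:00", "2019-02-04 17:30", "2019-02-09 12:00"]
        ),
        "Severity": [2, 3, 2, 2],
    }
)

# Derive calendar features from the accident start time.
df["Month"] = df["Start_Time"].dt.month
df["Weekday"] = df["Start_Time"].dt.day_name()

# Accident counts per month and per weekday, split by severity.
by_month = pd.crosstab(df["Month"], df["Severity"])
by_weekday = pd.crosstab(df["Weekday"], df["Severity"])

print(by_month)
print(by_weekday)
```

Filtering `df` to a single city before building the tables would give the per-city statistics.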

Preprocess Location + Basic Variables

In this issue, we plan to preprocess the following variables:

'ID', 'Severity', 'Start_Time', 'End_Time', 'Distance(mi)', 'Description', 'Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng', 'Number', 'Street', 'Side', 'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone', 'Airport_Code'

The goal is to impute the missing values and to drop the irrelevant columns.
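
One possible shape for this step is sketched below. Which columns count as irrelevant and which imputation rule applies are assumptions made for illustration: here `ID` and `Country` are dropped, numeric gaps are filled with the median, and other gaps with the most frequent value. The helper name `drop_and_impute` is hypothetical.

```python
import pandas as pd


def drop_and_impute(df: pd.DataFrame, drop_cols) -> pd.DataFrame:
    """Drop the given columns, then fill missing values column by column.

    Illustrative rule only: median for numeric columns, mode otherwise.
    A column that is entirely missing would need separate handling.
    """
    out = df.drop(columns=[c for c in drop_cols if c in df.columns])
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Made-up rows using a few of the variable names listed above.
raw = pd.DataFrame(
    {
        "ID": ["A-1", "A-2", "A-3"],
        "Distance(mi)": [0.5, None, 1.5],
        "Side": ["R", None, "R"],
        "Country": ["US", "US", "US"],
    }
)
clean = drop_and_impute(raw, drop_cols=["ID", "Country"])
print(clean)
```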

Create EDA Folder under Docs

Create a new folder called "EDA" under the docs folder, which will carry all the images that we generated for our EDA.

Chi Square Test

Conduct an analysis of variance by city to determine whether accident rates differ across cities.
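
A one-way ANOVA across cities can be sketched with `scipy.stats.f_oneway`, as below. The example treats the severity level as a numeric response, which is an assumption, and the rows are made up rather than drawn from the dataset.

```python
import pandas as pd
from scipy.stats import f_oneway

# Made-up severity values for three cities.
df = pd.DataFrame(
    {
        "City": ["Phoenix"] * 4 + ["Chicago"] * 4 + ["Houston"] * 4,
        "Severity": [2, 2, 3, 2, 3, 3, 4, 3, 2, 3, 2, 2],
    }
)

# One array of severity values per city.
groups = [g["Severity"].to_numpy() for _, g in df.groupby("City")]

# Null hypothesis: all cities share the same mean severity.
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```

A small p-value would suggest that at least one city's mean severity differs from the others, but it does not say which one; a post hoc comparison would be needed for that.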

EDA: Side accident

  • Plot the distribution of accidents by side of street the accident occurred (left or right)

Post Modeling Analysis

  • Generate feature importance plots to understand what features were important for predicting the severity of the accidents.
  • Generate learning curves.
  • Generate probability distribution plots for prediction probabilities.
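
The first two items could be prototyped as in the sketch below, which uses a random forest's `feature_importances_` attribute and scikit-learn's `learning_curve`. The data is synthetic and the feature names are hypothetical; the project's tuned models and real features would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic binary problem standing in for the accidents features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Impurity-based feature importances from a fitted forest, largest first.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = sorted(
    zip(feature_names, model.feature_importances_), key=lambda kv: kv[1], reverse=True
)
for name, value in importances:
    print(f"{name}: {value:.3f}")

# Training and validation scores at three training-set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    cv=3,
    train_sizes=[0.3, 0.6, 1.0],
)
print(train_sizes, val_scores.mean(axis=1))
```

Plotting the row means of `train_scores` and `val_scores` against `train_sizes` gives the learning curve.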

Preprocess Infrastructure

In this issue, we plan to preprocess the following infrastructure variables:

'Traffic_Signal', 'Crossing', 'Station', 'Amenity', 'Bump', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Stop', 'Traffic_Calming', 'Turning_Loop'.

Preprocess the dataset, compress it, upload to GitHub

Russ is on this!

I will have the final result either by the end of today or tomorrow morning, and we can go over the changes I made and how to work with the compressed version. Our goal is to trim it down substantially, from 1.3 million observations to around 75 thousand. We will only use data from certain US cities.
