The fraud-detection repository from billvanderlugt

Fraud Detection Case Study

In this project, we were tasked with predicting fraud for an event management company. Given more than 14,000 past events booked through the company, we worked to classify each booking as fraudulent or not, based on the previously determined account type.

We trained a Random Forest model on the provided data to make predictions on unseen future data. The model gives promising results and is accessible through a web app. Our deployed model achieves an F1 score of 0.89. We also provide a scalable framework for incorporating natural language processing (NLP) into future models.

The data is proprietary and as such, some details are excluded.

Figure 1: Website homepage

Feature Importance

For a minimum viable product, we first focused on the numerical columns only. The most important were:

body_length (length of the event description)
sale_duration2 (days posted)
user_age (days between user sign-up and event post)
name_length (length of event host's name)
payee_ind (computed from the payee_name field; 0 if no payee_name provided)
user_type (integers between 0 and 3; meaning unknown)
fb_published (Facebook published; 0 or 1)
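
As a rough illustration of how some of the derived features above could be computed from a raw event record, here is a minimal sketch. The raw field names (`description`, `name`, `payee_name`, `user_created`, `event_created`) and the use of Unix timestamps are assumptions made for illustration; the schema of the proprietary data is not shown here.

```python
SECONDS_PER_DAY = 86_400


def derive_features(event):
    """Build a few of the numerical features from one raw event record.

    Every key read from `event` is a hypothetical field name.
    """
    description = event.get("description") or ""
    host_name = event.get("name") or ""
    payee_name = (event.get("payee_name") or "").strip()
    return {
        # Length of the event description.
        "body_length": len(description),
        # Length of the event host's name.
        "name_length": len(host_name),
        # 0 if no payee_name was provided, 1 otherwise.
        "payee_ind": 1 if payee_name else 0,
        # Whole days between user sign-up and event post.
        "user_age": (event["event_created"] - event["user_created"]) // SECONDS_PER_DAY,
    }


# A made-up record, not taken from the project data.
example = {
    "description": "Join us for a night of live music.",
    "name": "Live Music Night",
    "payee_name": "",
    "user_created": 1_600_000_000,
    "event_created": 1_600_000_000 + 3 * SECONDS_PER_DAY,
}
features = derive_features(example)
```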

Model

The deployed model (pure_rf_model.pkl) used for our fraud predictions is a Random Forest that uses only the 7 numerical features listed above. Cross-validation yielded an F1 score of 0.89 and an accuracy of 0.91.
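
For readers less familiar with the two metrics, the sketch below shows how accuracy and F1 are derived from the counts of a confusion matrix. The counts are invented for illustration and are not the project's results; the function name `accuracy_and_f1` is likewise hypothetical.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) for the positive (fraud) class.

    tp/fp/fn/tn are the true-positive, false-positive,
    false-negative and true-negative counts.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Invented counts: 60 fraudulent and 140 legitimate events.
accuracy, f1 = accuracy_and_f1(tp=40, fp=10, fn=20, tn=130)
```

Because F1 ignores true negatives, it tends to be the more demanding of the two numbers when fraud is the minority class.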

We expected that analyzing the text data (namely the event description) could improve our predictions, so we began implementing NLP (TF-IDF feeding an SVC) to produce an initial fraud probability that is fed as a feature to our Random Forest. This increased our cross-validated F1 score to 0.92 and accuracy to 0.94, which is promising. However, we ran into trouble when predicting on new data whose event descriptions contain words not in our trained vocabulary. The model incorporating NLP is model.pkl, but it is not yet ready for deployment.
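
One common way to handle words that were not seen during training is to freeze the vocabulary at fit time and skip unknown tokens at prediction time. The sketch below illustrates that idea in plain Python. It is not the pipeline behind model.pkl; the class name `FixedVocabTfidf`, the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
from collections import Counter


class FixedVocabTfidf:
    """Tiny TF-IDF vectorizer whose vocabulary is frozen after fit()."""

    def fit(self, documents):
        tokenized = [doc.lower().split() for doc in documents]
        # Document frequency: in how many documents each token appears.
        doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
        self.vocabulary = {tok: i for i, tok in enumerate(sorted(doc_freq))}
        n_docs = len(tokenized)
        # Smoothed inverse document frequency (an assumed formula).
        self.idf = {
            tok: math.log((1 + n_docs) / (1 + df)) + 1
            for tok, df in doc_freq.items()
        }
        return self

    def transform(self, document):
        # Tokens outside the training vocabulary are dropped, so the
        # output always has the same length as at training time.
        counts = Counter(
            tok for tok in document.lower().split() if tok in self.vocabulary
        )
        vector = [0.0] * len(self.vocabulary)
        for tok, count in counts.items():
            vector[self.vocabulary[tok]] = count * self.idf[tok]
        return vector


# Made-up training descriptions and a new one with an unseen word.
vectorizer = FixedVocabTfidf().fit(
    ["free tickets win now", "annual charity gala tickets"]
)
new_vector = vectorizer.transform("win free cryptocurrency tickets")
```

With the vocabulary fixed this way, the downstream classifier always receives the same number of columns, whatever words appear in a new description.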

Web App

Our model can be used to predict on new data through a Flask web application hosted on an EC2 instance on AWS. The website provides an overview of the problem and our process, along with two ways to make a prediction with our model:

Ping an existing server for a random data point.
Manually input values for the numerical fields of our model.
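
For the manual-input path, the submitted values have to be converted into a numeric row in the column order the model was trained on. A minimal sketch, assuming the form delivers a dictionary of strings keyed by the feature names listed earlier; the function name `parse_manual_input`, the column order, and the specific range checks are hypothetical.

```python
# Assumed column order; the order used in training is not documented here.
FEATURE_ORDER = [
    "body_length", "sale_duration2", "user_age", "name_length",
    "payee_ind", "user_type", "fb_published",
]
BINARY_FEATURES = {"payee_ind", "fb_published"}


def parse_manual_input(form):
    """Turn a dict of form strings into an ordered list of floats."""
    row = []
    for name in FEATURE_ORDER:
        raw = form.get(name, "").strip()
        if not raw:
            raise ValueError(f"missing value for {name}")
        value = float(raw)  # raises ValueError for non-numeric input
        if name in BINARY_FEATURES and value not in (0.0, 1.0):
            raise ValueError(f"{name} must be 0 or 1")
        if name == "user_type" and value not in (0.0, 1.0, 2.0, 3.0):
            raise ValueError("user_type must be an integer between 0 and 3")
        row.append(value)
    return row


# Made-up form submission.
form = {
    "body_length": "350", "sale_duration2": "30", "user_age": "12",
    "name_length": "24", "payee_ind": "1", "user_type": "3",
    "fb_published": "0",
}
row = parse_manual_input(form)
```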

These data points and predictions are then stored in a PostgreSQL database on the instance.
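
A minimal sketch of storing a data point together with its prediction. To stay self-contained, it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for PostgreSQL; the table name `predictions`, its columns, and the function name `save_prediction` are hypothetical.

```python
import json
import sqlite3


def save_prediction(conn, features, fraud_probability):
    """Insert one data point and its prediction; return the new row id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions ("
        "id INTEGER PRIMARY KEY, "
        "features TEXT NOT NULL, "
        "fraud_probability REAL NOT NULL)"
    )
    # Parameterized query: values are bound, never formatted into the SQL.
    cursor = conn.execute(
        "INSERT INTO predictions (features, fraud_probability) VALUES (?, ?)",
        (json.dumps(features), fraud_probability),
    )
    conn.commit()
    return cursor.lastrowid


# In-memory database and a made-up data point.
conn = sqlite3.connect(":memory:")
row_id = save_prediction(conn, {"body_length": 120, "payee_ind": 0}, 0.73)
stored = conn.execute(
    "SELECT features, fraud_probability FROM predictions WHERE id = ?",
    (row_id,),
).fetchone()
```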

The Team

Our product was deployed by a team of four in under 16 hours. Some readability and tuning were sacrificed to get our app deployed and prove our MVP.

Meet our team of talented data scientists brought together by the Denver Galvanize Data Science Immersive:
