fraud-detection's Introduction

Fraud Detection Case Study

In this project, we were tasked with predicting fraud for an event management company. Given over 14,000 past events booked through the company, we classified each booking as fraudulent or not, based on the previously determined account type.

We trained a Random Forest model on the provided data so that it can predict on future unseen data. The model gives promising results and is accessible via a web app. Our deployed model achieves an F1 score of 0.89. We also provide a scalable framework for incorporating natural language processing (NLP) into future models.

The data is proprietary, so some details are excluded.

Website Screenshot

Figure 1: Website homepage

Feature Importance

For a minimum viable product, we decided to first focus on only the numerical columns. The ones of most importance were:

  • body_length (length of the event description)
  • sale_duration2 (days posted)
  • user_age (days between user sign-up and event post)
  • name_length (length of event host's name)
  • payee_ind (computed from the payee_name field; 0 if no payee_name provided)
  • user_type (integers between 0 and 3; meaning unknown)
  • fb_published (Facebook published; 0 or 1)
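As a rough illustration of how the derived features above could be computed from one raw event record, here is a minimal sketch. The raw field names (description, name, payee_name, user_created, event_created) and the use of Unix timestamps are assumptions made for the example, since the data is proprietary. The value 1 for "payee_name provided" is also an assumption; the list above only states that the indicator is 0 when no payee_name is given.

```python
SECONDS_PER_DAY = 86400


def derive_features(event):
    """Build the seven numerical features from one raw event record.

    All raw field names read from `event` are hypothetical; timestamps are
    assumed to be Unix seconds.
    """
    return {
        # Length of the event description text.
        "body_length": len(event["description"]),
        # Days posted; assumed to be present in the raw record already.
        "sale_duration2": event["sale_duration2"],
        # Whole days between user sign-up and event post.
        "user_age": (event["event_created"] - event["user_created"]) // SECONDS_PER_DAY,
        # Length of the event host's name.
        "name_length": len(event["name"]),
        # 0 if no payee_name was provided, otherwise 1 (assumed encoding).
        "payee_ind": 0 if not event["payee_name"] else 1,
        "user_type": event["user_type"],
        "fb_published": event["fb_published"],
    }


example_event = {
    "description": "Live music all night",
    "name": "Jane Doe",
    "payee_name": "",
    "user_created": 1_400_000_000,
    "event_created": 1_400_000_000 + 3 * SECONDS_PER_DAY,
    "sale_duration2": 30,
    "user_type": 1,
    "fb_published": 0,
}
features = derive_features(example_event)
```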

Model

The deployed model (pure_rf_model.pkl) used for our fraud predictions is a Random Forest that uses only the 7 numerical features listed above. Cross-validation yielded an F1 score of 0.89 and an accuracy of 0.91.
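For readers unfamiliar with the two metrics reported above, the sketch below shows how accuracy and F1 are computed from true and predicted labels. It assumes fraud is encoded as the positive class (1), which the write-up does not state, and the labels are made up for illustration rather than taken from the project data.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly cleared

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Illustrative labels only (1 = fraud, 0 = not fraud).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
```

Because F1 ignores true negatives, it can sit below accuracy when the non-fraud class is easier to predict, which is consistent with the gap between the two scores reported here.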

We expected that analyzing the text data (namely the event description) would improve our predictions, so we began implementing NLP (TF-IDF feeding an SVC) to produce an initial fraud probability that is fed as a feature to our Random Forest. This increased our cross-validated F1 score to 0.92 and accuracy to 0.94, which is promising. However, we ran into trouble when predicting on new data whose event descriptions contain words not in our trained vocabulary. The model incorporating NLP is model.pkl, but it is not yet ready for deployment.
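The vocabulary issue mentioned above can be illustrated with a small, dependency-free TF-IDF sketch. This is not the project's code, and the write-up does not say exactly how the failure showed up. The sketch only demonstrates one common pattern: fix the vocabulary at training time, then skip unseen words at prediction time so that every new description still maps to a vector of the trained length. The function names and example texts are hypothetical.

```python
import math
from collections import Counter


def fit_vocabulary(docs):
    """Learn a word -> column index map and smoothed IDF weights from training text."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    vocab = {word: i for i, word in enumerate(sorted(doc_freq))}
    n_docs = len(docs)
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}
    return vocab, idf


def transform(doc, vocab, idf):
    """Map text to a fixed-length TF-IDF vector; words outside the vocabulary are skipped."""
    vector = [0.0] * len(vocab)
    unseen = []
    for word, count in Counter(doc.lower().split()).items():
        if word in vocab:
            vector[vocab[word]] = count * idf[word]
        else:
            unseen.append(word)
    return vector, unseen


vocab, idf = fit_vocabulary(["live music festival", "free tickets wire money"])

# A description made only of unseen words still yields a vector of the trained length.
new_vector, new_unseen = transform("crypto giveaway", vocab, idf)
known_vector, known_unseen = transform("music tickets", vocab, idf)
```

A description consisting entirely of unseen words produces an all-zero vector, so a downstream classifier has no text signal for that event. Tracking the unseen words, as done here, is one way to notice when that happens.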

Web App

Our model can be used to predict on new data through a convenient Flask web application hosted on an EC2 instance on AWS. The website provides an overview of the problem and our process along with two separate ways to make a prediction via our model.

  1. Ping an existing server for a random data point.
  2. Manually input values for the numerical fields of our model.
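For the manual-input path, the submitted values have to be turned into a numeric feature row before the model can score them. The sketch below shows one way to do that, assuming the form fields are named after the seven features and that the model expects them in the order listed earlier. Both are assumptions, and parse_form is a hypothetical helper rather than part of the deployed app.

```python
# Assumed feature order; the real model may expect a different one.
FEATURES = [
    "body_length",
    "sale_duration2",
    "user_age",
    "name_length",
    "payee_ind",
    "user_type",
    "fb_published",
]


def parse_form(form):
    """Convert submitted form strings into a list of floats in FEATURES order.

    Raises ValueError if a field is missing, blank, or not numeric.
    """
    row = []
    for name in FEATURES:
        raw = str(form.get(name, "")).strip()
        if raw == "":
            raise ValueError(f"missing value for {name}")
        try:
            row.append(float(raw))
        except ValueError:
            raise ValueError(f"{name} must be numeric, got {raw!r}") from None
    return row


submitted = {
    "body_length": "1200",
    "sale_duration2": "30",
    "user_age": "45",
    "name_length": "18",
    "payee_ind": "1",
    "user_type": "3",
    "fb_published": "0",
}
row = parse_form(submitted)
```

Rejecting bad input before it reaches the model keeps malformed requests from surfacing as opaque prediction errors.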

These data points and predictions are then stored in a PostgreSQL database on the instance.
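The storage step can be sketched as follows. To keep the example runnable without a database server, it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table name, columns, and store_prediction helper are hypothetical, and a PostgreSQL version would use a different driver and placeholder syntax.

```python
import json
import sqlite3


def store_prediction(conn, features, fraud_probability):
    """Insert one data point and its prediction; return the new row id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions ("
        "id INTEGER PRIMARY KEY, "
        "features TEXT NOT NULL, "
        "fraud_probability REAL NOT NULL)"
    )
    # Parameterized query: values are passed separately from the SQL text.
    cursor = conn.execute(
        "INSERT INTO predictions (features, fraud_probability) VALUES (?, ?)",
        (json.dumps(features), fraud_probability),
    )
    conn.commit()
    return cursor.lastrowid


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
row_id = store_prediction(conn, {"body_length": 1200, "user_age": 45}, 0.83)
stored = conn.execute(
    "SELECT features, fraud_probability FROM predictions WHERE id = ?", (row_id,)
).fetchone()
```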

The Team

Our product was deployed by a team of four in under 16 hours. Some readability and tuning were sacrificed to get our app deployed and prove our MVP.

Meet our team of talented data scientists brought together by the Denver Galvanize Data Science Immersive:

fraud-detection's People

Contributors

sjh226, gioravioli, kykiefer, tilla232, cwschupp, jyt109, billvanderlugt, jonathandinu, michaeljancsy, d43, rkw0k, cmmyers, lemurey, mnghuang, rsepassi, tammyclee

Watchers

James Cloos
