
FDB: Fraud Dataset Benchmark

By Prince Grover, Zheng Li, Jianbo Liu, Jakub Zablocki, Hao Zhou, Julia Xu and Anqi Cheng

The Fraud Dataset Benchmark (FDB) is a compilation of publicly available datasets relevant to fraud detection (arXiv link). FDB aims to cover a wide variety of fraud detection tasks, from card-not-present transaction fraud and bot attacks to malicious traffic, loan risk, and content moderation. The Python-based data loaders in FDB provide dataset loading, standardized train-test splits, and performance evaluation metrics. Our goal is to give researchers working in fraud and abuse detection a standardized set of benchmarking datasets and evaluation tools for their experiments. Using the FDB tools, we evaluate four AutoML pipelines (AutoGluon, H2O, Amazon Fraud Detector, and Auto-sklearn) across nine fraud detection datasets and discuss the results.

Datasets used in FDB

Brief summary of the datasets used in FDB. Each dataset is described in detail in the Data Sources section.

| # | Dataset name | Dataset key | Fraud category | #Train | #Test | Class ratio (train) | #Feats | #Cat | #Num | #Text | #Enrichable |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | IEEE-CIS Fraud Detection | ieeecis | Card Not Present Transactions Fraud | 561,013 | 28,527 | 3.50% | 67 | 6 | 61 | 0 | 0 |
| 2 | Credit Card Fraud Detection | ccfraud | Card Not Present Transactions Fraud | 227,845 | 56,962 | 0.18% | 28 | 0 | 28 | 0 | 0 |
| 3 | Fraud ecommerce | fraudecom | Card Not Present Transactions Fraud | 120,889 | 30,223 | 10.60% | 6 | 2 | 3 | 0 | 1 |
| 4 | Simulated Credit Card Transactions generated using Sparkov | sparknov | Card Not Present Transactions Fraud | 1,296,675 | 20,000 | 5.70% | 17 | 10 | 6 | 1 | 0 |
| 5 | Twitter Bots Accounts | twitterbot | Bot Attacks | 29,950 | 7,488 | 33.10% | 16 | 6 | 6 | 4 | 0 |
| 6 | Malicious URLs dataset | malurl | Malicious Traffic | 586,072 | 65,119 | 34.20% | 2 | 0 | 1 | 1 | 0 |
| 7 | Fake Job Posting Prediction | fakejob | Content Moderation | 14,304 | 3,576 | 4.70% | 16 | 10 | 1 | 5 | 0 |
| 8 | Vehicle Loan Default Prediction | vehicleloan | Credit Risk | 186,523 | 46,631 | 21.60% | 38 | 13 | 22 | 3 | 0 |
| 9 | IP Blocklist | ipblock | Malicious Traffic | 172,000 | 43,000 | 7% | 1 | 0 | 0 | 0 | 1 |
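The "Class ratio (train)" column is the fraction of fraud rows in the training split. For any dataset loaded through FDB it can be recomputed from the standardized EVENT_LABEL column; a minimal sketch (the toy frame below stands in for obj.train and is not real FDB data):

```python
import pandas as pd

# Toy stand-in for obj.train; real FDB frames carry the same
# standardized EVENT_LABEL column (1 = fraud, 0 = legitimate).
train = pd.DataFrame({"EVENT_LABEL": [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]})

class_ratio = train["EVENT_LABEL"].mean()          # fraction of fraud rows
print(f"Class ratio (train): {class_ratio:.2%}")   # -> 20.00%
```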

Installation

Requirements

  • Kaggle account

    • Important: the ieeecis dataset requires you to join the IEEE-CIS competition from your Kaggle account before you can call the fdb API; otherwise you will get ApiException: (403).
  • AWS account

  • Python 3.7+

  • Python requirements

```
autogluon==0.4.2
h2o==3.36.1.2
boto3==1.20.21
click==8.0.3
click-plugins==1.1.1
Faker==4.14.2
joblib==1.0.0
kaggle==1.5.12
numpy==1.19.5
pandas==1.1.2
regex==2020.7.14
scikit-learn==0.22.1
scipy==1.5.4
auto-sklearn==0.14.7
dask==2022.8.1
```

Step 1: Setup Kaggle CLI

The FraudDatasetBenchmark object is going to load datasets from the source (which in most of the cases is Kaggle), and then it will modify/standardize on the fly, and provide train-test splits. So, the first step is to setup Kaggle CLI in the machine being used to run Python.

Follow the instructions in the How to Use Kaggle guide. In short: download the authentication token from "My Account" on Kaggle and save it at ~/.kaggle/kaggle.json on Linux and macOS, or at C:\Users\<Windows-username>\.kaggle\kaggle.json on Windows. If the token is not at that location, an error will be raised, so move it there from your Downloads folder after downloading.

Step 1.2. Join the IEEE-CIS competition from your Kaggle account before calling fdb.datasets with ieeecis; otherwise you will get ApiException: (403).

Step 2: Clone Repo

Once the Kaggle CLI is set up, clone the repo with git clone https://github.com/amazon-research/fraud-dataset-benchmark.git (HTTPS) or git clone git@github.com:amazon-research/fraud-dataset-benchmark.git (SSH).

Step 3: Install

Once the repo is cloned, cd into it from your terminal and run pip install ., which installs the required classes and methods.

FraudDatasetBenchmark Usage

The usage is straightforward: you create a FraudDatasetBenchmark object for a dataset and extract useful goodies like train/test splits and eval_metrics.

Important note: if you run multiple experiments that reload the dataframes repeatedly, the default setting (download from Kaggle on every load) will exceed the account-level API limits. Instead, persist the downloaded dataset and load from the persisted copy: on the first call to FraudDatasetBenchmark(), use load_pre_downloaded=False, delete_downloaded=False; on subsequent calls, use load_pre_downloaded=True, delete_downloaded=False. The default is load_pre_downloaded=False, delete_downloaded=True.

```python
from fdb.datasets import FraudDatasetBenchmark

# all_keys = ['fakejob', 'vehicleloan', 'malurl', 'ieeecis', 'ccfraud', 'fraudecom', 'sparknov', 'twitterbot', 'ipblock']
key = 'ipblock'

obj = FraudDatasetBenchmark(
    key=key,
    load_pre_downloaded=False,  # default
    delete_downloaded=True,  # default
    add_random_values_if_real_na={
        "EVENT_TIMESTAMP": True,
        "LABEL_TIMESTAMP": True,
        "ENTITY_ID": True,
        "ENTITY_TYPE": True,
        "EVENT_ID": True
    }  # default
)
print(obj.key)

print('Train set: ')
display(obj.train.head())
print(len(obj.train.columns))
print(obj.train.shape)

print('Test set: ')
display(obj.test.head())
print(obj.test.shape)

print('Test scores')
display(obj.test_labels.head())
print(obj.test_labels['EVENT_LABEL'].value_counts())
print(obj.train['EVENT_LABEL'].value_counts(normalize=True))
print('=========')
```
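The caching advice above boils down to three flag combinations. As a sketch, they can be kept as keyword-argument dicts and splatted into the constructor, e.g. FraudDatasetBenchmark(key=key, **FIRST_CALL):

```python
# Flag combinations from the caching note above; pass as
# FraudDatasetBenchmark(key=key, **FIRST_CALL), etc.
FIRST_CALL = {"load_pre_downloaded": False, "delete_downloaded": False}  # download once, keep files
LATER_CALLS = {"load_pre_downloaded": True, "delete_downloaded": False}  # reuse persisted files
DEFAULTS = {"load_pre_downloaded": False, "delete_downloaded": True}     # download, then clean up
```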

A notebook template that loads datasets with the FDB data loader is available at scripts/examples/Test_FDB_Loader.ipynb.

Reproducibility

Reproducibility scripts are available under scripts/reproducibility/ in respective folders for afd, autogluon and h2o. Each folder also has a README with steps to reproduce the results.

Benchmark Results

All values are AUC-ROC on the test split.

| Dataset key | AFD OFI | AFD TFI | AutoGluon | H2O | Auto-sklearn |
|---|---|---|---|---|---|
| ccfraud | 0.985 | 0.99 | 0.99 | 0.992 | 0.988 |
| fakejob | 0.987 | - | 0.998 | 0.99 | 0.983 |
| fraudecom | 0.519 | 0.636 | 0.522 | 0.518 | 0.515 |
| ieeecis | 0.938 | 0.94 | 0.855 | 0.89 | 0.932 |
| malurl | 0.985 | - | 0.998 | Training failure | 0.5 |
| sparknov | 0.998 | - | 0.997 | 0.997 | 0.995 |
| twitterbot | 0.934 | - | 0.943 | 0.938 | 0.936 |
| vehicleloan | 0.673 | - | 0.669 | 0.67 | 0.664 |
| ipblock | 0.937 | - | 0.804 | Training failure | 0.5 |
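Every number in the table is AUC-ROC on the fixed test split. With a trained model's scores and the labels from obj.test_labels, the metric is the standard scikit-learn computation; a sketch on toy data (not FDB output):

```python
from sklearn.metrics import roc_auc_score

# Toy ground truth and scores; with FDB you would pass
# obj.test_labels['EVENT_LABEL'] and your model's fraud probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)
print(f"AUC-ROC: {auc:.3f}")  # -> 0.750
```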

ROC Curves

The numbers in the legend represent AUC-ROC from different models from our baseline evaluations on AutoML.
[Figure: ROC curves across datasets and models]

Data Sources

  1. IEEE-CIS Fraud Detection

  2. Credit Card Fraud Detection

  3. Fraud ecommerce

  4. Simulated Credit Card Transactions generated using Sparkov

  5. Twitter Bots Accounts

  6. Malicious URLs dataset

  7. Real / Fake Job Posting Prediction

  8. Vehicle Loan Default Prediction

  9. IP Blocklist

Citation

@misc{grover2022fdb,
      title={FDB: Fraud Dataset Benchmark}, 
      author={Prince Grover and Zheng Li and Jianbo Liu and Jakub Zablocki and Hao Zhou and Julia Xu and Anqi Cheng},
      year={2022},
      eprint={2208.14417},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

This project is licensed under the MIT-0 License.

Acknowledgement

We thank the creators of all datasets used in the benchmark and the organizations that have helped host the datasets and make them widely available for research purposes.
