This project forked from fastforwardlabs/deepad

Deep Learning for Anomaly Detection

Home Page: https://ff12.fastforwardlabs.com/

Deep Learning for Anomaly Detection

This repo contains code for experiments we have run at Cloudera Fast Forward on deep learning for anomaly detection use cases.

We include implementations of several neural networks (Autoencoder, Variational Autoencoder, Bidirectional GAN, Sequence Models) in TensorFlow 2.0 and two other baselines (One Class SVM, PCA). The technical details of each approach are covered in our online report, and an interactive visualization of some results is also available.

Anomalies, often referred to as outliers, abnormalities, rare events, or deviants, are data points or patterns in data that do not conform to a notion of normal behavior. Anomaly detection, then, is the task of finding those patterns in data that do not adhere to expected norms, given previous observations. The capability to recognize or detect anomalous behavior can provide highly useful insights across industries. Flagging unusual cases or enacting a planned response when they occur can save businesses time, costs, and customers. Hence, anomaly detection has found diverse applications in a variety of domains, including IT analytics, network intrusion analytics, medical diagnostics, financial fraud protection, manufacturing quality control, marketing and social media analytics, and more.

How Anomaly Detection Works

The underlying strategy for most approaches to anomaly detection is to first model normal behavior, and then exploit this knowledge to identify deviations (anomalies). In this repo, we accomplish this in three steps:

  • Build a model of normal behavior using available data. Typically the model is trained on normal data only, or on data assumed to contain only a small fraction of abnormal samples.
  • Based on this model, assign an anomaly score to each data point that represents a measure of deviation from normal behavior. The models in this repo use a reconstruction error approach, where the model attempts to reconstruct a sample at test time and uses the reconstruction error as an anomaly score. The intuition is that normal samples will be reconstructed with almost no error, while abnormal or unusual samples will be reconstructed with larger error.
  • Apply a threshold on the anomaly score to determine which samples are anomalies.

As an illustrative example, an autoencoder model is trained on normal samples, where the task is to reconstruct the input. At test time, we can use the reconstruction error (mean squared error) of each sample as its anomaly score.
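The three steps above can be sketched in a few lines. This is not the repo's code: to stay dependency-light it uses a linear PCA projection (one of the baselines mentioned earlier) as a stand-in for the autoencoder, runs on synthetic data, and picks the threshold as the 99th percentile of training scores. The function names `fit_pca` and `anomaly_scores`, the data shapes, and the percentile are all assumptions made for illustration.

```python
import numpy as np

def fit_pca(X, n_components):
    """Step 1: model normal behavior as a mean plus the top principal axes."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def anomaly_scores(X, mean, components):
    """Step 2: per-sample mean squared reconstruction error."""
    Z = (X - mean) @ components.T      # encode into the low-dimensional space
    X_hat = Z @ components + mean      # decode back to the input space
    return np.mean((X - X_hat) ** 2, axis=1)

rng = np.random.default_rng(0)

# Synthetic "normal" data: 18 features lying close to a 2-D subspace.
basis = rng.normal(size=(2, 18))
normal_train = rng.normal(size=(1000, 2)) @ basis + 0.05 * rng.normal(size=(1000, 18))
normal_test = rng.normal(size=(200, 2)) @ basis + 0.05 * rng.normal(size=(200, 18))
# Synthetic "abnormal" data: does not follow the subspace structure.
abnormal_test = 2.0 * rng.normal(size=(50, 18))

mean, components = fit_pca(normal_train, n_components=2)

# Step 3: threshold the score (99th percentile of training scores, an assumption).
threshold = np.percentile(anomaly_scores(normal_train, mean, components), 99)
normal_scores = anomaly_scores(normal_test, mean, components)
abnormal_scores = anomaly_scores(abnormal_test, mean, components)

print("flagged normal:  ", np.mean(normal_scores > threshold))
print("flagged abnormal:", np.mean(abnormal_scores > threshold))
```

An autoencoder follows the same pattern, with the encode and decode steps replaced by the learned nonlinear encoder and decoder.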

Structure of Repo

├── data
│   ├── kdd
│   │   ├── all.csv
│   │   ├── train.csv
│   │   ├── test.csv
├── cml
├── metrics
├── models
│   ├── ae  
│   ├── bigan
│   ├── ocsvm
│   ├── vae
├── utils
├── train.py 

The data directory contains the dataset (KDD network intrusion) used in the experiments. It contains a script (data_gen.py) that downloads the data and constructs train and test sets separated into inliers and outliers. The models directory contains code that specifies the parameters of each model, along with methods for training and for computing an anomaly score. train.py contains code to train each model and then evaluate it (generating a histogram of the anomaly scores assigned by the model, and a ROC curve to assess model skill on the anomaly detection task).

python3 train.py

The script above does the following:

  • Downloads the KDD dataset if not already downloaded
  • Trains all of the models (Autoencoder, Variational Autoencoder, Bidirectional GAN, Sequence Models, PCA, OCSVM)
  • Evaluates the models on a test split (8000 inliers, 2000 outliers) and generates charts on model performance: histogram of anomaly scores, ROC curve, and general metrics (f1, f2, precision, recall, accuracy).
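As a rough illustration of the kind of split described above (training on inliers only, testing on 8000 inliers plus 2000 outliers), here is a minimal sketch. It does not reproduce data_gen.py: the function `make_splits`, its arguments, and the synthetic array standing in for the KDD features are hypothetical.

```python
import numpy as np

def make_splits(X, is_outlier, n_test_inliers, n_test_outliers, seed=0):
    """Hold out a labeled test set; train only on the remaining inliers."""
    rng = np.random.default_rng(seed)
    inlier_idx = rng.permutation(np.flatnonzero(~is_outlier))
    outlier_idx = rng.permutation(np.flatnonzero(is_outlier))

    test_in = inlier_idx[:n_test_inliers]
    test_out = outlier_idx[:n_test_outliers]
    train_idx = inlier_idx[n_test_inliers:]   # no outliers in the training set

    X_train = X[train_idx]
    X_test = np.concatenate([X[test_in], X[test_out]])
    # Label convention (an assumption): 0 = inlier, 1 = outlier.
    y_test = np.concatenate([
        np.zeros(len(test_in), dtype=int),
        np.ones(len(test_out), dtype=int),
    ])
    return X_train, X_test, y_test

# Synthetic stand-in: 20000 rows, 18 features, the first 3000 marked as outliers.
rng = np.random.default_rng(1)
X = rng.normal(size=(20000, 18))
is_outlier = np.zeros(20000, dtype=bool)
is_outlier[:3000] = True

X_train, X_test, y_test = make_splits(X, is_outlier, 8000, 2000)
print(X_train.shape, X_test.shape, int(y_test.sum()))
```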

Summary of Results

For each model, we use labeled test data to first select a threshold that yields the best accuracy, and then report metrics such as f1, f2, precision, and recall at that threshold. We also report the area under the ROC curve to evaluate the overall skill of each model. Given that the dataset we use is not extremely complex (18 features), most models perform relatively well. Deep models (BiGAN, AE) are more robust (in terms of precision, recall, and ROC AUC) than PCA and OCSVM. The sequence-to-sequence model is not particularly competitive, given that the data is not temporal. On a more complex dataset (e.g., images), we expect to see more pronounced advantages from using a deep learning model, in line with existing research.
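The evaluation procedure described above can be sketched as follows. This is an illustrative NumPy-only version, not the code in the metrics directory: the helper names, the synthetic scores, and the label convention (1 = outlier, higher score = more anomalous) are assumptions.

```python
import numpy as np

def best_accuracy_threshold(scores, labels):
    """Try every observed score as a threshold; keep the most accurate one."""
    best_t, best_acc = None, -1.0
    for t in np.unique(scores):
        acc = np.mean((scores >= t).astype(int) == labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

def precision_recall_fbeta(scores, labels, threshold, beta=1.0):
    """Precision, recall, and F-beta for the outlier class at a threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * precision + recall
    fbeta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta

def roc_auc(scores, labels):
    """AUC as P(random outlier scores above random inlier); ties count half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Synthetic anomaly scores: 800 inliers (label 0) and 200 outliers (label 1).
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.2, 0.1, 800), rng.normal(0.8, 0.2, 200)])
labels = np.concatenate([np.zeros(800, dtype=int), np.ones(200, dtype=int)])

threshold, accuracy = best_accuracy_threshold(scores, labels)
precision, recall, f1 = precision_recall_fbeta(scores, labels, threshold, beta=1.0)
_, _, f2 = precision_recall_fbeta(scores, labels, threshold, beta=2.0)
auc = roc_auc(scores, labels)
print(f"acc={accuracy:.3f} p={precision:.3f} r={recall:.3f} "
      f"f1={f1:.3f} f2={f2:.3f} auc={auc:.3f}")
```

Note that selecting the threshold on labeled test data, as described above, gives an optimistic view of each model; f2 weights recall more heavily than precision, which suits settings where missing an anomaly costs more than a false alarm.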

For additional details on each model, see our report. Note that models implemented here are optimized for tabular data. For example, extending this to work with image data will usually require the use of convolutional layers (as opposed to dense layers) within the neural network models to get good results.

How to Decide on a Modeling Approach?

Given the differences between the deep learning methods discussed above (and their variants), it can be challenging to decide on the right model. When data contains sequences with temporal dependencies, a sequence-to-sequence model (or an architecture with LSTM layers) can model these relationships, yielding better results. For scenarios requiring principled estimates of uncertainty, generative models such as VAE- and GAN-based approaches are suitable. For image data, AEs, VAEs, and GANs designed with convolutional layers are suitable. The following table highlights the pros and cons of the different types of models, to provide guidance on when each is a good fit.

AutoEncoder
  Pros
  • Flexible approach to modeling complex non-linear patterns in data
  Cons
  • Does not support variational inference (estimates of uncertainty)
  • Requires a large dataset for training

Variational AutoEncoder
  Pros
  • Supports variational inference (probabilistic measure of uncertainty)
  Cons
  • Requires a large amount of training data; training can take a while

GAN (BiGAN)
  Pros
  • Supports variational inference (probabilistic measure of uncertainty)
  • Use of the discriminator signal allows better learning of the data manifold (Mihaela Rosca (2018), "Issues with VAEs and GANs", CVPR18), which is useful for high-dimensional image data
  • GANs trained in semi-supervised learning mode have shown great promise, even with very little labeled data (Raghavendra Chalapathy et al. (2019), "Deep Learning for Anomaly Detection: A Survey", https://arxiv.org/abs/1901.03407)
  Cons
  • Requires a large amount of training data and longer training time (epochs) to arrive at stable results (Tim Salimans et al. (2016), "Improved Techniques for Training GANs", NeurIPS 2016, https://arxiv.org/abs/1606.03498)
  • Training can be unstable (GAN mode collapse)

Sequence-to-Sequence Model
  Pros
  • Well suited for data with temporal components (e.g., discretized time series data)
  • Supports variational inference (probabilistic measure of uncertainty)
  Cons
  • Slow inference (compute scales with sequence length, which needs to be fixed)
  • Training can be slow
  • Limited accuracy when data contains features with no temporal dependence

One Class SVM
  Pros
  • Does not require a large amount of data
  • Fast to train
  • Fast inference time
  Cons
  • Limited capacity for capturing complex relationships within data
  • Requires careful selection and tuning of parameters (kernel, nu, gamma)
  • Does not model a probability distribution, so estimates of confidence are harder to compute

Deploying on Cloudera Machine Learning (CML)

Users interested in deploying this application on Cloudera Machine Learning can build the project and auto-deploy it with the cml_build.py script.
