
jadelhelm / automated-anomaly-detection-preprocessing-pipeline


Automate preprocessing of tabular data for anomaly detection methods. This pipeline handles data cleaning, normalization, and transformation, making your anomaly detection process efficient and accurate.



Automated (Unsupervised) Anomaly Detection Preprocessing Pipeline


I used scikit-learn's Pipeline and Transformer concepts to create this preprocessing pipeline.
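As a rough illustration of that concept (not the actual DQPipeline internals), the sketch below chains a custom transformer with a built-in scaler. `ConstantColumnDropper` is a hypothetical name invented for this example; `Pipeline`, `BaseEstimator`, `TransformerMixin`, and `RobustScaler` are the real scikit-learn building blocks.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler


class ConstantColumnDropper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: drops columns that have no variance."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Remember which columns actually vary in the training data.
        self.keep_mask_ = X.std(axis=0) > 0
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Keep only the columns selected during fit.
        return X[:, self.keep_mask_]


# Each step is a (name, transformer) pair; steps run in order.
pipe = Pipeline([
    ("drop_constant", ConstantColumnDropper()),
    ("scale", RobustScaler()),
])

# Toy data: the middle column is constant and should be removed.
X = np.array([[1.0, 5.0, 10.0],
              [2.0, 5.0, 20.0],
              [3.0, 5.0, 30.0]])
X_out = pipe.fit_transform(X)
print(X_out.shape)
```

Any object with `fit` and `transform` methods can be a step, which is what makes it possible to assemble cleaning, encoding, and scaling into one reusable object.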


How to use the pipeline

import numpy as np
import pandas as pd
from dataqualitypipeline import initialize_autoencoder, initialize_autoencoder_modified
from pyod.models.iforest import IForest
from pyod.models.lof import LOF

df_data = pd.read_csv("./HOWTO/players_20.csv")
clf_lof = LOF(n_jobs=-1)

# Init Preprocessing Pipeline
from dataqualitypipeline import DQPipeline
dq_pipe = DQPipeline(
    nominal_columns=["player_tags","preferred_foot",
                     "work_rate","team_position","loaned_from"],

    exclude_columns=["player_url","body_type","short_name", "long_name", 
                     "team_jersey_number","joined","contract_valid_until",
                     "real_face","nation_position","player_positions","nationality","club"],

    time_column_names=["dob"],
    deactivate_pattern_recognition=True,
    remove_columns_with_no_variance=True,
)


# Run the preprocessing pipeline (named dq_pipe)
X_output = dq_pipe.run_pipeline(
    X_train=df_data.iloc[:, 0:37],
    # Add the anomaly detection model (clf)
    clf=clf_lof,
    dump_model=False,
)

X_output.head(40)

  • Check out the how_to.ipynb notebook to see how to use this pipeline.
    • It includes an example that uses only training data (unsupervised).

Highlights ⭐

📌 BinaryEncoder instead of OneHotEncoder for nominal columns / Big Data and Performance

Recent research shows that binary encoding achieves results similar to one-hot encoding for nominal columns while using significantly fewer dimensions.

  • (John T. Hancock and Taghi M. Khoshgoftaar. "Survey on categorical data for neural networks." In: Journal of Big Data 7.1 (2020), pp. 1–41.)
    • Tables 2, 4
  • (Diogo Seca and João Mendes-Moreira. "Benchmark of Encoders of Nominal Features for Regression." In: World Conference on Information Systems and Technologies. 2021, pp. 146–155.)
    • P. 151
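To make the dimensionality argument concrete, here is a minimal hand-rolled sketch of binary encoding next to pandas' one-hot encoding. The `binary_encode` helper and the sample values are assumptions made up for this example; this is not the encoder implementation used inside the pipeline.

```python
import pandas as pd


def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: category -> ordinal code -> binary digits."""
    # factorize assigns 0..k-1 to the k distinct categories (-1 for NaN);
    # shifting by one reserves the all-zero pattern for missing values.
    codes, uniques = pd.factorize(series)
    codes = codes + 1
    # Number of bits needed to write the largest code.
    n_bits = max(1, len(uniques).bit_length())
    bits = {
        f"{series.name}_{i}": (codes >> (n_bits - 1 - i)) & 1
        for i in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index)


# Made-up sample with 8 distinct categories.
positions = pd.Series(
    ["GK", "CB", "ST", "CM", "LW", "RW", "LB", "RB", "GK"],
    name="team_position",
)

encoded = binary_encode(positions)
one_hot = pd.get_dummies(positions)

print(encoded.shape[1], "binary columns vs", one_hot.shape[1], "one-hot columns")
```

The number of binary columns grows roughly with the logarithm of the number of categories, while one-hot encoding grows linearly, which matters for high-cardinality columns.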

📌 Implementation of univariate methods / Detection of univariate anomalies

Both methods (the modified Z-score and Tukey's method) are robust to outliers: they are built on the median and quartiles rather than the mean and standard deviation, so the location estimate is not biased by extreme values. They also help multivariate anomaly detection algorithms identify univariate anomalies.
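A minimal sketch of both univariate checks follows. The function names, the 3.5 cut-off for the modified Z-score, and the 1.5 multiplier for Tukey's fences are common textbook defaults assumed for this example; the thresholds used inside the pipeline may differ.

```python
import numpy as np


def modified_z_scores(x):
    """Modified Z-score based on the median and the median absolute deviation."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        # Simplification for this sketch: no spread means nothing is scored.
        return np.zeros_like(x)
    return 0.6745 * (x - median) / mad


def tukey_outliers(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


# Made-up values with one obvious univariate anomaly at the end.
values = [10, 12, 11, 13, 12, 11, 95]

z_flags = np.abs(modified_z_scores(values)) > 3.5
tukey_flags = tukey_outliers(values)

print(z_flags)
print(tukey_flags)
```

Because neither statistic uses the mean, the extreme value cannot pull the reference point toward itself and hide its own deviation.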

📌 Transformation of time series data and standardization of data with RobustScaler / Normalization for better prediction results
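One possible way to read this step is sketched below: a date column is split into numeric parts and everything is then scaled with scikit-learn's `RobustScaler`. How the pipeline actually transforms time columns is not documented here, so the year/month/day split, the column names, and the sample values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Made-up rows; "dob" plays the role of a time column.
df = pd.DataFrame({
    "dob": ["1987-06-24", "1985-02-05", "1992-02-05", "1993-01-07", "1991-01-07"],
    "value": [95.5, 58.5, 105.5, 77.5, 90.0],
})

# Assumed transformation: expand the date into numeric components.
dob = pd.to_datetime(df["dob"])
features = pd.DataFrame({
    "dob_year": dob.dt.year,
    "dob_month": dob.dt.month,
    "dob_day": dob.dt.day,
    "value": df["value"],
})

# RobustScaler subtracts the median and divides by the interquartile range,
# so a few extreme rows do not dominate the scale of a column.
scaled = RobustScaler().fit_transform(features)

print(np.median(scaled, axis=0))
```

After scaling, every column is centered on its median, which keeps distance-based detectors such as LOF from being dominated by columns with large raw ranges.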

📌 Labeling of NaN values in an extra column instead of removing them / No loss of information
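The idea can be sketched in plain pandas as follows. The helper name `add_nan_indicators`, the `_is_nan` suffix, and the sample data are assumptions for this example and may not match the column names the pipeline produces.

```python
import numpy as np
import pandas as pd


def add_nan_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add a 0/1 flag column for each column with NaNs."""
    out = df.copy()
    for col in df.columns:
        if df[col].isna().any():
            # 1 marks a row where the original value was missing.
            out[f"{col}_is_nan"] = df[col].isna().astype(int)
    return out


# Made-up data with missing entries in two columns.
df = pd.DataFrame({
    "loaned_from": ["Club A", None, "Club B", None],
    "wage": [10.0, np.nan, 30.0, 20.0],
    "age": [21, 25, 30, 19],
})

labeled = add_nan_indicators(df)
print(labeled.columns.tolist())
```

All rows are kept, and the fact that a value was missing becomes a feature of its own. scikit-learn's `SimpleImputer` offers a comparable `add_indicator` option when imputation and flagging should happen in one step.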


Abstract View - Project

[Image: abstract view of the project]


Decision rules of the pipeline

[Image: decision rules of the pipeline]


Feel free to contribute 🙂

