
Online Learning Deployment for Streaming Applications

This repository is the official implementation of the paper Online Learning Deployment for Streaming Applications in the Banking Sector (Barry, Montiel, Bifet, Chiky, Shakman, Manchev, Wadkar, El Baroudi, Tran, KDD 2021). The resources can be used to set up and deploy instances of online machine learning models, to generate predictions, and to update model weights on streaming data.

Motivations

Our goal is to propose a platform that provides a seamless bridge between data-science-centric activities and data engineering activities, in a way that satisfies both the imposed production constraints in terms of scalability and the streaming application requirements in terms of online learning. Examples of potential use cases are anomaly and fraud detection on time-evolving data streams, or real-time classification of user activities or IT/log events. This can be a real accelerator for gaining proactivity in solving real-world problems.

Tools used: River, Kafka & the Domino platform on AWS

River [1] is an open-source online machine learning library written in Python whose main focus is instance-incremental learning, meaning that every component (estimators, transformers, performance metrics, etc.) is designed to be updated one sample at a time. We used River to continuously train and update online learning models from the latest data streams. Kafka is a state-of-the-art open-source distributed event streaming platform; we used a managed, hosted Kafka (Confluent) as a data stream generator.
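As a minimal illustration of this instance-incremental paradigm, assuming River's documented API (the toy stream and feature names below are ours, not from the repository):

```python
from river import metrics, tree

model = tree.HoeffdingTreeClassifier(max_depth=10)
metric = metrics.ROCAUC()

# Illustrative two-sample stream: each event is a (features dict, label) pair.
stream = [
    ({"amount": 12.5, "hour": 3}, 0),
    ({"amount": 980.0, "hour": 4}, 1),
]

for x, y in stream:
    score = model.predict_proba_one(x).get(1, 0.0)  # score before learning (test-then-train)
    metric.update(y, score)                         # incremental ROCAUC update
    model.learn_one(x, y)                           # update the model with this one sample

print(metric)
```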

The Domino platform is implemented on top of Kubernetes, where it spins up containers on demand for running user workloads. The containers are based on Docker images, which are fully customizable. We used Domino to host the models and run scalability tests on high-velocity data generated as streams.

[Figure: technologies used (River, Kafka, Domino)]

Repository Files Description

The files in the repository are the ones used to run the different data processing steps in the stream learning pipeline we set up. They are the source code to reproduce the experiments mentioned in the paper, and the major steps are:

  • Set up stream generators using Kafka producers and consumers to process data as streams and create the required features for model input and output. See confluent_producer.py and KafkaConsumer.py.

  • Set up model training and serving on multiple Domino instances (or Docker) to benchmark scalability. The models are deployed on parallel processing operators (independent instances of the Domino platform), each of which listens to a mutually exclusive set of Apache Kafka topic partitions (8 partitions). For each sample, the data features (X) and the corresponding truth value (Y) are published to the Kafka topic.

  • Compute and store model operationalization metrics incrementally to assess the operational performance of the solution: latency, throughput, model size on disk, and ROCAUC. This is detailed in confluent_compute_statistics.py.

  • Online learning to continuously train and update model weights: all models were built and continuously trained using River, processing each Kafka event incrementally. The files _HoeffdingTreeClassifier_.py and _HalfSpaceTrees_.py implement all the steps mentioned above in the ModelInference class to deliver the desired output and the metrics reported in the results table below for both models. A condensed sketch of this loop is shown right after this list.
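The following is a condensed, hypothetical sketch of that consume-predict-learn loop, assuming JSON-encoded events; the broker address, group id, and topic names are placeholders, and the repository's ModelInference class remains the reference implementation:

```python
import json

from confluent_kafka import Consumer, Producer
from river import metrics, tree

# Placeholder connection settings, not values from the repository.
consumer = Consumer({
    "bootstrap.servers": "<broker>:9092",
    "group.id": "online-learning",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "<broker>:9092"})

consumer.subscribe(["transactions"])  # hypothetical input topic

model = tree.HoeffdingTreeClassifier(max_depth=10)
metric = metrics.ROCAUC()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())                 # assumes events shaped {"x": {...}, "y": 0 or 1}
    x, y = event["x"], event["y"]
    score = model.predict_proba_one(x).get(1, 0.0)  # predict before learning (test-then-train)
    metric.update(y, score)                         # incremental ROCAUC update
    model.learn_one(x, y)                           # then update the model weights
    producer.produce(
        "inferences",                               # hypothetical output topic
        value=json.dumps({"score": score, "rocauc": metric.get()}),
    )
    producer.poll(0)                                # serve producer delivery callbacks
```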

Future work should address upgrading the solution to include model persistence, and investigate the trade-off between parallelizing model serving to reduce latency and improve throughput versus using distributed methods.

Model Deployment: Steps to set up the online learning pipeline

Here we list the high-level steps to set up, deploy, run, and evaluate online learning experiments:

  1. Deploy infrastructure to host experiments (for our purposes this was done on the AWS cloud).

  2. Set up Kafka to generate data streams (this can be done with a managed Kafka service such as Confluent Kafka, although any Kafka solution should work). You will need both topics to read data from (we expect 8 partitions on the data stream) and topics where you can write results.

  3. [Optional] Set up or gain access to a Domino environment (alternatives can be set up using other solutions).

  4. Connect your compute/model instances to your Kafka cluster. An example configuration is shown in https://github.com/dominodatalab/streaming-online-learning/blob/main/src/hostedkafka/KafkaConsumer.py. Model instances pull data from designated topics and write results back to separate topics.

  5. Set up the Kafka producer stream on the Kafka end. This is the stream from which the model instances will pull data for inference and learning. A producer configuration is demonstrated in https://github.com/dominodatalab/streaming-online-learning/blob/main/src/hostedkafka/confluent_producer.py; a condensed sketch is also shown after this list.

  6. Utilize an appropriate Docker image or virtual environment (or a compute environment if using Domino) with the necessary dependencies, including River and the Kafka client. Install River version 0.1.0 from GitHub as well as confluent-kafka version 1.6.0 via pip. All required dependencies are included in the provided Docker image: https://quay.io/repository/katieshakman/streaming-online-learning-20200208.

  7. Configure the models with appropriate settings. In our benchmarking tests, the HoeffdingTreeClassifier model was configured with all defaults except for max_depth: tree.HoeffdingTreeClassifier(max_depth=10). The HalfSpaceTrees model was configured with all defaults except for its seed value: anomaly.HalfSpaceTrees(seed=42). Both configurations appear in the second sketch after this list.

  8. Collect performance metrics for the deployed models. Predictive performance can be measured incrementally using the ROCAUC metric available in River; its incremental usage is also included in the second sketch after this list. Metrics can be sent along with inferences (predictions) to the "inferences" Kafka topic created above. They can also be written to a file that is persisted to a blob store or other convenient storage. When analyzing the results on Domino, confluent_compute_statistics.py (included in the repository) can be run to persist the results to a file and generate summary statistics.
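The following is a hypothetical producer-side sketch for step 5: it streams (X, Y) samples as JSON events to a Kafka topic with confluent-kafka. The broker address and topic name are placeholders; the repository's confluent_producer.py is the reference implementation.

```python
import json

from confluent_kafka import Producer
from river import datasets

# Placeholder broker address; for Confluent Cloud you would also set
# security.protocol, sasl.mechanisms, sasl.username and sasl.password.
producer = Producer({"bootstrap.servers": "<broker>:9092"})

# Stream the CreditCard dataset sample by sample; keying by a rotating id
# spreads events across the topic's partitions.
for i, (x, y) in enumerate(datasets.CreditCard()):
    producer.produce(
        "transactions",                 # hypothetical input topic (8 partitions)
        key=str(i % 8),
        value=json.dumps({"x": x, "y": y}),
    )
    producer.poll(0)

producer.flush()
```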
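And a second sketch covering steps 7 and 8: the two model configurations quoted above, plus incremental ROCAUC usage per the River API. The (y_true, y_score) pairs fed to the metric here are illustrative only.

```python
from river import anomaly, metrics, tree

# Step 7: the exact configurations used in our benchmarking tests.
classifier = tree.HoeffdingTreeClassifier(max_depth=10)  # all other settings default
detector = anomaly.HalfSpaceTrees(seed=42)               # all other settings default

# Step 8: ROCAUC is updated one sample at a time; metric.get() returns the
# running value, which can be published to the "inferences" topic.
metric = metrics.ROCAUC()
metric.update(1, 0.9)   # illustrative (y_true, y_score) pair
metric.update(0, 0.2)
print(metric.get())
```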

Experiments and Results

We set up online learning models (the supervised HoeffdingTree classifier [2] and the unsupervised anomaly detector HalfSpaceTrees [3]) to incrementally learn and update from stream events. The datasets used are publicly available in River: credit card transactions made in September 2013 by European cardholders, and electricity prices in New South Wales (a loading snippet is shown after the figure). The pipeline was hosted on the AWS cloud using the Domino data science platform connected to a managed Kafka cluster to process streaming data. The workflow of the experimental setup is shown below, and details are provided in the paper.

[Figure: experiment pipeline with Kafka and Domino]
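Both datasets ship with River (they are downloaded on first use), so a minimal loading sketch looks like this:

```python
from river import datasets

credit_cards = datasets.CreditCard()   # credit card transactions (Sept. 2013, European cardholders)
electricity = datasets.Elec2()         # electricity prices in New South Wales

x, y = next(iter(credit_cards))        # each sample is a (features dict, label) pair
```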

A series of experiments was conducted, with the main objectives being the functional verification of the streaming architecture proposed in the paper and a scalability exercise. The results table is shown below:

[Table: experiment results]

We demonstrate that the proposed system can successfully ingest and process high-velocity streaming data and that online learning models can be horizontally scaled. More details are available in the paper.

Acknowledgements

We thank Nikolay Manchev for his contributions and key role in building this repository.

References

[1] Jacob Montiel, Max Halford, Saulo Martiello Mastelini, Geoffrey Bolmier, Raphael Sourty, Robin Vaysse, Adil Zouitine, Heitor Murilo Gomes, Jesse Read, Talel Abdessalem, and Albert Bifet. 2020. River: machine learning for streaming data in Python. arXiv:2012.04740 [cs.LG]

[2] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In KDD’01, pages 97–106, San Francisco, CA, 2001. ACM Press.

[3] Swee Chuan Tan, Kai Ming Ting, and Tony Fei Liu. 2011. Fast Anomaly Detection for Streaming Data. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two (Barcelona, Catalonia, Spain) (IJCAI'11). AAAI Press, 1511–1516.

Contributors

mariambarry, sameerwadkar, katieshakman
