dstoolkit-mlops-base's Introduction


MLOps Solution Accelerator

This repository contains the basic repository structure for machine learning projects based on Azure technologies (Azure Machine Learning and Azure DevOps). The folder names and files are chosen based on our experience. You can find the principles and ideas behind the structure, which we recommend following when customizing your own project and MLOps process, in the documentation. We also expect users to be familiar with Azure Machine Learning (AML) concepts and how to use the technology.

Prerequisites

To successfully complete your solution, you will need access to, or to have provisioned, the following:

  • Access to an Azure subscription
  • Access to an Azure DevOps subscription
  • Service Principal

Getting Started

Follow the steps below to set up the project in your subscription.

  1. Setting up the Azure infrastructure:

  2. Creating your CI/CD pipeline in Azure DevOps. In the folder ./azure-pipelines you will find the YAML files to set up your CI/CD pipeline in Azure DevOps (ADO). To do so, have a look at 'Azure DevOps Setup'.

If you have managed to run the entire example, well done! You can now adapt the same code to your own use case with the exact same infrastructure and CI/CD pipeline. To do so, follow these steps:

  1. Add your AML-related variables (model, dataset name, experiment name, pipeline name ...) in the configuration file configuration-aml.variables.yml.

  2. Add your infra-related environment variables (Azure environment, ...) in configuration-infra-*.variables.yml in the ./configuration folder. By default, the template provides two YAML files, for the DEV and PROD environments.

  3. Add your core machine learning code (feature engineering, training, scoring, etc.) in ./src. We provide the structure of the core scripts; fill them in with your own functionality (a minimal core-script sketch follows this list).

  4. If needed, adapt the ML operation scripts that orchestrate the core scripts (e.g. sending the training script to a compute target, registering a model, creating an Azure ML pipeline, etc.) in ./mlops. We provide some examples to easily set up your experiments and Azure Machine Learning pipelines.
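
To make step 3 concrete, here is a minimal sketch of what a core script in ./src could look like. The file name, dataset columns, and argument names below are illustrative assumptions, not part of the template:

# src/train.py -- minimal sketch of a core script (names and columns are hypothetical).
# All variables arrive as arguments; constants live as defaults in the parser, not in the code.
import argparse

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

def main(dataset_path: str, model_path: str):
    df = pd.read_csv(dataset_path)                        # dataset location is injected
    X, y = df.drop(columns=["target"]), df["target"]      # "target" is an assumed label column
    model = LogisticRegression().fit(X, y)
    joblib.dump(model, model_path)                        # output path is injected, never hardcoded

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-path", default="data/train.csv")   # defaults act as constants
    parser.add_argument("--model-path", default="outputs/model.pkl")
    args = parser.parse_args()
    main(args.dataset_path, args.model_path)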

The project folders are structured to support moving rapidly from notebook experimentation to refactored code ready for deployment.

Core MLOps Principles

  1. Continuous Integration: testing ML systems comes down to testing feature engineering scripts, validating data schemas, testing the model, and validating the ML infrastructure (access permissions, model registries, inference services, ...); a schema-test sketch follows this list.

  2. Continuous Delivery: CD in the context of ML is the capacity to automatically deliver artefacts to different environments (i.e. DEV/STAGE/PROD). ML artefacts consist of a feature engineering pipeline, a model, and an automated retraining pipeline, depending on the use case.

  3. Continuous Monitoring: it is mandatory to provide a consistent feedback loop from model prediction results in production. The only real test of a model is in production, where it is fed live data. Hence, not having a monitoring system in place that enables ML practitioners to review model predictions may have catastrophic consequences.

  4. Continuous Training: to attain a high level of ML autonomy, ML systems ought to be able to automatically detect data drift, or be triggered by business rules, to retrain models in production. This principle, however, can only be applied if a monitoring system is running to ensure that retraining is activated under pre-defined conditions.
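
As an illustration of the Continuous Integration principle, a data-schema check could look like the following pytest sketch; the file path, column names, and dtypes are assumptions, not part of the template:

# mlops/tests/test_schema.py -- illustrative schema test (columns and dtypes are hypothetical).
import pandas as pd

EXPECTED_SCHEMA = {"age": "int64", "income": "float64", "target": "int64"}

def test_dataset_schema():
    df = pd.read_csv("data/train.csv")
    for column, dtype in EXPECTED_SCHEMA.items():
        assert column in df.columns, f"missing column: {column}"
        assert str(df[column].dtype) == dtype, f"wrong dtype for column: {column}"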

General Coding Guidelines

For more details on the coding guidelines and explanation on the folder structure, please go to docs/how-to.

  1. Core scripts should receive parameters/config variables only via code arguments and must not contain any hardcoded variables (like dataset names, model names, input/output paths, ...). If you want to provide constant values in those scripts, set default values in the argument parser (see the core-script sketch under Getting Started).

  2. Variable values must be stored in configuration/configuration.yml. These files are used by the execution scripts (AzureML Python SDK or azure-cli) to extract the variables and run the core scripts.

  3. Two distinct configuration files for environment creation:

    • (A) for local dev/experimentation: may be stored in the project root folder (requirements.txt or environment.yml). It is required to install the project environment on a different laptop, a DevOps agent, etc.
    • (B) for remote compute: stored in configuration/environments; contains only the packages that need to be installed on remote compute targets or AKS.
  4. There are only two core secrets to handle: the AzureML workspace authentication key and a service principal. Depending on your use case or constraints, these secrets may be required in the core scripts or the execution scripts. We provide the logic to retrieve them in an aml_utils file/module in both src and mlops (a minimal retrieval sketch follows this list).
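
As a sketch of guideline 4, a helper in the spirit of aml_utils might retrieve the workspace like this, preferring service principal credentials when they are available; the environment variable names are assumptions:

# aml_utils-style helper (sketch; environment variable names are assumptions).
import os

from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

def get_workspace() -> Workspace:
    # Use the service principal when its credentials are present (e.g. on a DevOps agent) ...
    auth = None
    if os.environ.get("SP_APP_ID"):
        auth = ServicePrincipalAuthentication(
            tenant_id=os.environ["TENANT_ID"],
            service_principal_id=os.environ["SP_APP_ID"],
            service_principal_password=os.environ["SP_APP_SECRET"],
        )
    # ... otherwise fall back to the default interactive/CLI authentication.
    return Workspace.get(
        name=os.environ["AML_WORKSPACE_NAME"],
        subscription_id=os.environ["SUBSCRIPTION_ID"],
        resource_group=os.environ["RESOURCE_GROUP"],
        auth=auth,
    )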

Default Directory Structure

├── azure-pipelines       # all the Azure DevOps pipelines for CI/CD
│   └── templates         # any YAML template files
├── configuration         # all configuration files
│   ├── compute           # definitions of computes used for training and inference
│   └── environments      # definitions of environments used for training and inference
├── docs                  # documentation folder
│   ├── how-to            # documents on how to use this template and how to set up the environment
│   ├── media             # images, videos, etc. needed for docs
│   └── references        # external resources relevant to the project
├── notebooks             # experimentation folder with notebooks, code and other files; not part of the operationalized flow
├── mlops                 # all the code to orchestrate machine learning operations
│   └── tests             # for testing your code, data, and outputs
├── src                   # data science code adapted to the particular use case
├── .gitignore
├── README.md
└── requirements.txt

Contribution

We welcome any contribution that improves the accelerator. For more information, please go through the contribution guidelines.

Support & FAQ

If you have any questions or new ideas you would like to discuss, you can start a new conversation in the Discussions tab.

Frequently asked questions can be found in docs/how-to.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

dstoolkit-mlops-base's People

Contributors

davidefornelli, farah-saab, florianpydde, kyoro1, lianan, mariamedp, microsoftopensource, rdzotz, tranguyen221, yvonnebarthp

dstoolkit-mlops-base's Issues

The latest version of az cli (2.30.0) breaks while running az commands

azure-cli 2.30.0 throws ERROR: {'Error': TypeError("__init__() got an unexpected keyword argument 'async_persist'",)} while running the pipeline: dstoolkit-mlops-base/invoke-aml-pipeline.template.yml at main · microsoft/dstoolkit-mlops-base (github.com)

We found that az cli 2.30.0 installs azure-cli-ml version 1.5.0 by default:
{
"experimental": false,
"extensionType": "whl",
"name": "azure-cli-ml",
"path": "/opt/az/azcliextensions/azure-cli-ml",
"preview": false,
"version": "1.5.0"
}
As a workaround, downgrade to azure-cli 2.29.2, with which azure-cli-ml version 1.33.1 is installed correctly.

Feature Request - Provide High Level Deployment method to higher environments

Hi,

I've used this for demoing to my customer, and I think it would be great to show how the azure-pipelines can be used to deploy to higher environments using the recommended approach of "compile once, promote everywhere" off of the main branch.

As I am new to MLOps, I'm not sure of the recommended approach for deploying to higher environments.

Should the training be part of the "compile once" continuous integration/build phase,

and should these pieces

#########################################

be part of the continuous deployment/"promote everywhere" phase?

At a high level, what I'm trying to understand is how the batch inference and training pipelines should fit into this flow.


Handle Multiple Models in deployment

When training multiple models, the ADO pipelines should be able to deploy all trained models to other environments.

The changes need to be applied to:

Work-around for auto-testing after PR

I put up my PR today, and auto-testing failed because we cannot use ACI (Azure Container Instances) in a region.
Can we have a work-around for the above, such as re-trying the auto-testing?

FYI, I didn't change any Python code in the PR, so the processing flows should be unchanged.

Create Github Action Pipelines

We want to extend the DevOps pipeline to integrate GitHub Actions and infrastructure-as-code with Terraform scripts. The resulting DevOps repo may follow this structure:

devops-pipelines

  • .ado
    • pipeline-0 IaC with ARM templates (current)
    • pipeline-1
    • ....
  • .github
    • pipeline-0 IaC with Terraform templates
    • pipeline-1
    • ....

Data preparation step in training pipeline

Add data prep as the initial step in the training pipeline, where all feature engineering and train-test split work will be done. Providing the train and holdout test datasets by default will enforce good practices and avoid data leakage, thus accelerating model performance analysis and reporting.

The train sub-dataset should be redirected to the train step (2nd step in the pipeline), and the test sub-dataset to the evaluation step (3rd in the pipeline). As a result, the evaluation step should be modified to include the generation of evaluation metrics, while comparison with the current active model should be done later (as part of the register step? or in a compare step in between?).

The train step can still have its own data splitting mechanism inside, to do any type of cross-validation needed to select the best model from all the approaches tested out.
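
A rough sketch of how the proposed three-step pipeline could be wired with the AzureML SDK; the script names and compute target are assumptions:

# Sketch of the proposed prep -> train -> evaluate pipeline (script/compute names are assumptions).
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()
train_data = PipelineData("train_data", datastore=datastore)   # train sub-dataset
test_data = PipelineData("test_data", datastore=datastore)     # holdout test sub-dataset

prep = PythonScriptStep(script_name="prep.py", source_directory="src",
                        arguments=["--train-out", train_data, "--test-out", test_data],
                        outputs=[train_data, test_data], compute_target="cpu-cluster")
train = PythonScriptStep(script_name="train.py", source_directory="src",
                         arguments=["--train-in", train_data],
                         inputs=[train_data], compute_target="cpu-cluster")
evaluate = PythonScriptStep(script_name="evaluate.py", source_directory="src",
                            arguments=["--test-in", test_data],
                            inputs=[test_data], compute_target="cpu-cluster")

# Step ordering is inferred from the data dependencies declared above.
pipeline = Pipeline(workspace=ws, steps=[prep, train, evaluate])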

Create CONTRIBUTING file

We should have a CONTRIBUTING.md file to explain how to contribute to the repo, and remove the section from the README.
Example: https://github.com/microsoft/solution-accelerator-many-models/blob/master/CONTRIBUTING.md

It'd be nice to have detailed instructions (or links to official instructions, depending on the case) on how to:

  • Fork the repo or create a branch
  • Submit a bug
  • Create a pull request
  • Set up a test environment and pipelines so contributors can make sure everything works before submitting the PR.

To be adapted from the Contribution Guide that @FlorianPydde created in the internal wiki.

Artefact migration between AML workspaces

Currently the template reruns the scripts in different environments. Although this ensures that the automated retraining process works, the functionality should be defined as an integration test on a sample set rather than a means of promoting artefacts. The template needs to implement a process that downloads and re-uploads artefacts to the next AML workspace. This will lower cost and time to production.
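
A minimal sketch of such a promotion step, assuming a registered model and hypothetical workspace/model names:

# Sketch: promote a registered model between AML workspaces instead of retraining.
from azureml.core import Workspace
from azureml.core.model import Model

ws_dev = Workspace.get(name="aml-dev", subscription_id="<sub-id>", resource_group="<rg>")
ws_prod = Workspace.get(name="aml-prod", subscription_id="<sub-id>", resource_group="<rg>")

# Download the artefact from the DEV registry ...
model = Model(ws_dev, name="my-model")                      # latest version by default
path = model.download(target_dir="model_artifact", exist_ok=True)

# ... and re-register it in the PROD workspace.
Model.register(workspace=ws_prod, model_path=path, model_name="my-model",
               tags={"promoted_from": ws_dev.name})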

Endpoint Model Profiler

AML provides model profiling functionality which enables teams to assess their deployment services (memory consumption, latency, etc.): https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-profile-model?pivots=py-sdk

During model deployment to AKS (TEST/PROD), it may be useful to create the model profile and, for simplicity, upload it to the default blob storage. This file can be used to trigger actions based on specific metrics, or be displayed on an operations dashboard.
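
A sketch of what creating and persisting the profile might look like with the SDK's Model.profile; the model, environment, and dataset names are assumptions:

# Sketch: profile a model and save the recommendations for later use.
import json

from azureml.core import Dataset, Workspace
from azureml.core.model import InferenceConfig, Model

ws = Workspace.from_config()
model = Model(ws, name="my-model")
inference_config = InferenceConfig(entry_script="src/score.py",
                                   environment=ws.environments["scoring-env"])
sample_inputs = Dataset.get_by_name(ws, name="profiling-sample")   # small input sample

profile = Model.profile(ws, "my-model-profile", [model], inference_config,
                        input_dataset=sample_inputs)
profile.wait_for_completion(show_output=True)

# Persist CPU/memory/latency recommendations so a dashboard or gate can consume them.
with open("model_profile.json", "w") as f:
    json.dump(profile.get_details(), f)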

Model deployment with Docker image

The template currently relies on the AzureML SDK to natively deploy the model as a real-time webservice on a selected compute, using Model.deploy. A common request from clients is to provide a Docker image that a production team can deploy with a higher degree of flexibility (pod security, management, etc.).

The template needs to implement a second scenario which leverages the Model.package functionality to create a Docker image.

It would be nice to have a parameter in the deploy-model YAML template to choose which type of deployment the user wants:

  • Native deployment: the pipeline deploys the model in a webservice using AML, runs a smoke test, etc. (current behavior).
  • Docker image: the pipeline generates an artifact with this packaged model instead of deploying it as a webservice.

After the package has been created, a kubectl command may connect to the targeted AKS cluster and run the Docker image.
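
A sketch of the Docker-image scenario using Model.package; the model and environment names are assumptions:

# Sketch: build a Docker image for the model instead of deploying a webservice.
from azureml.core import Workspace
from azureml.core.model import InferenceConfig, Model

ws = Workspace.from_config()
model = Model(ws, name="my-model")
inference_config = InferenceConfig(entry_script="src/score.py",
                                   environment=ws.environments["scoring-env"])

# Build the image in the workspace container registry.
package = Model.package(ws, [model], inference_config)
package.wait_for_creation(show_output=True)
print(package.location)   # registry address of the image, ready for docker pull / kubectl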

Common service connection issue for invoke pipeline task

In the pipeline azure-pipelines/templates/utils/invoke-aml-pipeline.template.yml

- task: ms-air-aiagility.vss-services-azureml.azureml-restApi-task.MLPublishedPipelineRestAPITask@0

Reading online, and from my own experiences, the task will not run unless given a specific machine learning workspace service principal connection in DevOps. A typical service principal connection will not suffice.

No error message is available; hence, this issue needs to be documented.

Add input/output schema definition for webservice

Add input & output samples and decorators to generate Swagger schema in src/score.py. This will make it easier for teams using the prediction webservice to interact with it.

Use a simple dummy schema, as the template doesn't contemplate any particular dataset. Currently, we are doing the smoke test with the input defined in this file.
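
A sketch of what the decorated scoring functions in src/score.py might look like, using the inference-schema package; the model name and sample shapes are dummy values:

# Sketch of src/score.py with a dummy Swagger schema (model name and samples are illustrative).
import joblib
import numpy as np
from azureml.core.model import Model
from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType

def init():
    global model
    model = joblib.load(Model.get_model_path("my-model"))

@input_schema("data", NumpyParameterType(np.array([[0.0, 0.0, 0.0]])))   # dummy input sample
@output_schema(NumpyParameterType(np.array([0])))                        # dummy output sample
def run(data):
    return model.predict(data).tolist()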

Fix pipeline trigger

Pipelines are triggered when changes are made to documentation; such changes should be disregarded by the trigger.

Common functionalities wrapping

As you can see across our GitHub repositories, there are some similar functionalities, for example:

https://github.com/microsoft/dstoolkit-classification-solution-accelerator/blob/main/src/utils.py
and
https://github.com/microsoft/dstoolkit-mlops-base/blob/main/src/utils.py

They define common functions, like getting a Workspace, getting Datasets, etc.

The proposal is to come up with a PyPI package that provides these common functionalities (see the attached PPT).
common_function_dstoolkit.pptx

Add extra environment version with custom Docker image

Add an example of an environment with a custom Dockerfile as another folder inside configuration/environments/.

People will then just need to change AML_TRAINING_ENV_PATH / AML_BATCHINFERENCE_ENV_PATH in configuration/configuration-aml.variables.yml to this new path to use a custom Docker image for the AML pipelines. We should add instructions in docs/how-to on how to do this and how to configure the environment (links to the official AML docs?).
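
As a sketch, such an environment could be registered from a Dockerfile like this; the environment name and Dockerfile path are assumptions:

# Sketch: define and register an AML environment built from a custom Dockerfile.
from azureml.core import Environment, Workspace

ws = Workspace.from_config()

env = Environment(name="training-custom-docker")
env.docker.base_image = None                       # use our own Dockerfile, not a base image
with open("configuration/environments/custom-docker/Dockerfile") as f:
    env.docker.base_dockerfile = f.read()
env.python.user_managed_dependencies = True        # dependencies are baked into the image

env.register(workspace=ws)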

Change default auth method from pipelines to Service Principal

Pipeline execution will trigger AML InteractiveLoginAuthentication by default once we switch to Azure CLI 2.30 (see comment here). In preparation for that, we should change workspace authentication to use ServicePrincipalAuthentication as the default method when run from pipelines (right now it's using CLI credentials).

Add instructions on how to use self-hosted agents in Azure DevOps

The template currently uses Microsoft-hosted agents to run pipelines in Azure DevOps, which is the simplest way to run the jobs and very useful for setting up a quick MLOps demo/showcase. However, our customers usually have specific requirements, for example security configuration or dependent software, and these can be easier to meet with self-hosted agents, which give us more control. Private agents also have performance advantages, for example the ability to run incremental builds and to start jobs faster.

The documentation needs to have instructions on how to set up a self-hosted agent and how to modify the template pipelines to use it.

Support for local prediction

Currently, score.py is the only "src" file that has no main method and thus cannot be easily run locally.

It would be nice to have that to ease testing during development.
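
A minimal sketch of such a main block, appended to src/score.py; it assumes the default scoring contract where run receives a JSON string, and the sample payload is illustrative:

# Sketch: main block for src/score.py so the scoring logic can be exercised locally.
if __name__ == "__main__":
    import json

    init()                                          # load the model, as the webservice would
    sample = json.dumps({"data": [[0.0, 0.0, 0.0]]})
    print(run(sample))                              # score a dummy payload without deploying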
