azure / mlops-v2

Azure MLOps (v2) solution accelerators. Enterprise ready templates to deploy your machine learning models on the Azure Platform.

Home Page: https://learn.microsoft.com/en-us/azure/machine-learning/concept-model-management-and-deployment

License: MIT License

Shell 100.00%
azure azuremachinelearning azureml deep-learning devops machine-learning microsoft mlops mlops-workflow mlops-environment

mlops-v2's Introduction

Azure MLOps (v2) Solution Accelerator


Welcome to the MLOps (v2) solution accelerator repository! This project is intended to serve as the starting point for MLOps implementation in Azure.

MLOps is a set of repeatable, automated, and collaborative workflows with best practices that empower teams of ML professionals to quickly and easily get their machine learning models deployed into production. You can learn more about MLOps in the Azure Machine Learning documentation linked above.

Project overview

The solution accelerator provides a modular end-to-end approach for MLOps in Azure based on pattern architectures. As each organization is unique, solutions will often need to be customized to fit the organization's needs.

The solution accelerator goals are:

  • Simplicity
  • Modularity
  • Repeatability & Security
  • Collaboration
  • Enterprise readiness

It accomplishes these goals with a template-based approach for end-to-end data science, driving operational efficiency at each stage. You should be able to get up and running with the solution accelerator in a few hours.

Prerequisites

  1. An Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
    • Important - If you use a Free/Trial subscription, or a learning-oriented subscription such as Visual Studio Premium with MSDN, some provisioning tasks might not run as expected due to 'Usage + quotas' limits on your subscription. Specific instructions are provided before the provisioning steps throughout the guide; you are strongly advised to read them carefully.
  2. For Azure DevOps-based deployments and projects:
  3. For GitHub-based deployments and projects:
  4. Git bash, WSL, or another shell script editor on your local machine

Documentation

  1. Solution Accelerator Concepts and Structure - Philosophy and organization
  2. Architectural Patterns - Supported Machine Learning patterns
  3. Accelerator Deployment Guides - How to deploy and use the solution accelerator with Azure DevOps or GitHub
  4. Quickstarts - Pre-created project scenarios for demos/POCs. Azure DevOps ADO Quickstart.
  5. YouTube Videos: Deploy MLOps on Azure in Less Than an Hour and AI Show

Contributing

This project welcomes contributions and suggestions. To learn more, see CONTRIBUTING.md for details.

Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

mlops-v2's People

Contributors

akritim, azeltov, chrey-gh, cindyweng, drosevear, jawadaminmsft, lostmygithubaccount, manu-kanwarpal, mariamedp, microsoft-github-policy-service[bot], microsoftopensource, msteller-ai, ncostar, nicoleserafino, sdonohoo, setuc, strugdt, timschps


mlops-v2's Issues

[QuickStart.md] Error when running deploy-model-training-pipeline with enable_aml_secure_workspace: true

Hey,

After deploying the infrastructure with "tf-ado-deploy-infra.yml" and "enable_aml_secure_workspace: true", did you test "deploy-model-training-pipeline" in Azure DevOps?

Stack:

Screenshot 2022-09-19 at 12 40 58

  1. My conclusion is that the stack trace is expected because a public agent is used
    pool:
    vmImage: $(ap_vm_image)
    where ap_vm_image: ubuntu-20.04
    to perform actions in the Azure Machine Learning workspace (e.g. pipeline.publish(config['training_pipeline_name'])). If you are able to perform this action on your side, isn't it a security issue? Do you plan to add self-hosted agent deployment and configuration to the Terraform?

  2. Additionally, don't you need a service principal scoped to the Machine Learning workspace to perform AML workspace actions? Should the instructed Azure-ARM-Prod (subscription-level) service connection be able to perform any actions?

Model Promotion

I wasn't able to find anything in the code or in the documentation related to model promotion and data drift.

If I have a daily ingestion of new data and I run the training pipeline on a daily basis, at some point the daily model will probably have better metrics than the one deployed.

How can I replace the model ONLY if the metrics are better? Could you provide some guidance or links if they exist?
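The accelerator does not appear to ship such a gate out of the box; one common pattern is to compare the candidate run's metrics against those of the currently registered model and register only on improvement. A minimal, hypothetical sketch (the function name, the "accuracy" metric, and the metric dicts are illustrative; in practice the values would come from your MLflow runs or the AML model registry):

```python
# Hypothetical promotion gate: register/deploy the candidate model only when
# it beats the production model on the chosen metric. All names here are
# illustrative assumptions, not part of the mlops-v2 accelerator.

def should_promote(candidate_metrics, production_metrics,
                   metric="accuracy", higher_is_better=True):
    """Return True if the candidate model beats production on `metric`."""
    cand = candidate_metrics.get(metric)
    prod = production_metrics.get(metric)
    if cand is None:
        return False      # candidate was never evaluated on this metric
    if prod is None:
        return True       # nothing deployed yet, so promote
    return cand > prod if higher_is_better else cand < prod
```

A daily training pipeline could call this after evaluation and skip the registration/deployment stages when it returns False.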

Job naming issues with AML CLI training pipeline

Why?

Currently, when we deploy the training pipeline using the AML CLI, we run into the problem that job names cannot be validated.

image

How?

As the screenshot above shows, hyphens aren't allowed in the naming conventions for jobs, so we should consider replacing hyphens with underscores in the pipeline.yml file.

I could also create a PR if you consider this as a valid way of solving this issue.
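The proposed fix can be sketched as a small helper applied to the job names before they go into pipeline.yml. The exact AML naming rule is assumed from the screenshot (word characters only), so treat the regex as illustrative rather than the authoritative constraint:

```python
# Illustrative sketch of the proposed fix: replace hyphens (and any other
# character assumed to be disallowed) in AML job names with underscores.
import re

def sanitize_job_name(name: str) -> str:
    """Map a job name onto letters, digits, and underscores only."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name)
```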

Anything else?

[QUICKSTART.md] pipeline build failing when creating compute cluster.

Describe the bug or the issue that you are facing

Outer Loop: Deploying Infrastructure via Azure DevOps

Error: waiting for creation of Compute Cluster "cpu-cluster" in workspace "mlw-mlopsv2-0769dev" Code="BadRequest" Message=The specified subscription has a total vCPU quota of 0 and cannot accomodate for at least 1 requested managed compute node which maps to 2 vCPUs.

My guess is that, as BatchAI is deprecated, it cannot allocate the CPU cluster. I have also tried changing regions and VM sizes; none of them seems to work. Or it could be that my 'Azure for Students' subscription doesn't have the quota. If it is my subscription quota, is there a way around it?

Screenshot 2022-12-26 at 14 46 22

Steps/Code to Reproduce

Run the pipeline as usual in region eastus.

Expected Output

Demo running as intended

Versions

Azure DevOps
Terraform
Azure ML CLI

Which platform are you using for deploying your infrastructure?

Azure DevOps (ADO)

If you mentioned Others, please mention which platform you are using?

No response

What are you using for deploying your infrastructure?

Terraform

Are you using Azure ML CLI v2 or Azure ML Python SDK v2?

Azure ML CLI v2

Describe the example that you are trying to run?

Classical ML provided with the repo

Batch Endpoint - AutoML

Hi,

We opted for automated ML, ran the job, and registered the model in the dev workspace.
Now we want to deploy it in the test and prod workspaces. We have downloaded the .pkl file, scoring script, and conda env file, which are all required for deployment.

image

But it seems the generated scoring script is applicable only to online deployment. So is batch deployment not possible with automated ML?

Thanks

The operation was canceled

When I run the Outer Loop: Deploying Infrastructure via Azure DevOps pipeline, Azure DevOps gives me this error:

2022-08-04T07:59:48.3348777Z [command]/opt/hostedtoolcache/terraform/0.14.7/x64/terraform plan -var location=westus -var prefix=mlopsv2 -var postfix=1699 -var environment=prod -var enable_aml_computecluster=true -detailed-exitcode
2022-08-04T07:59:50.9124061Z Acquiring state lock. This may take a few moments...
2022-08-04T07:59:52.1053307Z var.client_secret
2022-08-04T07:59:52.1054837Z Service Principal Secret
2022-08-04T07:59:52.1055171Z
2022-08-04T08:59:23.3030923Z Enter a value:
2022-08-04T08:59:23.3501608Z ##[error]The operation was canceled.
2022-08-04T08:59:23.3508829Z ##[section]Finishing: Terraform plan

Correct Deployment of Dev Environments

Hello,

Thank you for this repo, and all your hard work maintaining it. I am trying to deploy the quickstart guide - I have successfully been able to deploy the outer loop prod environment using Azure DevOps.

However, my understanding was that I would end up with a rg-mlopsv2-XXXprod and an rg-mlopsv2-XXXdev - an area to do the dev work on the project, and an area to host the production model (in the future, I want to expand to also contain a UAT env). The pipelines would allow me to move models between the two. However, after running the tf-ado-deploy-infra.yml pipeline I only have the prod environment. I tried creating a new branch, called dev, and reran the pipeline. However, now, during the Terraform Init step I get an error:

##[error]Error: There was an error when attempting to execute the process '/opt/hostedtoolcache/terraform/0.14.7/x64/terraform'. This may indicate the process failed to start. Error: spawn /opt/hostedtoolcache/terraform/0.14.7/x64/terraform ENOENT

The guide always mentions staying on main and doesn't mention creating a dev environment. Have I conceptually misunderstood the architecture (i.e., there shouldn't be an rg-mlopsv2-XXXdev), or, if I do need to create this environment, how do I overcome my error?

How to achieve Continuous Integration

Hi,

As per the process, we are training and registering the model in the DEV workspace. But how do we bring/deploy that model to Stage?

It would be better if we had the steps between registering the model in the dev workspace and deploying it in the Stage workspace.

How to stop the instance in online deployment when not required

Hi,

In the online-deployment config file, we specify the instance type and instance count.

image

But where is the instance actually created? We tried to locate it in the compute section of the ML workspace but could not find it.

And the minimum value of instance count is 1 (we tried making it 0 and enabling scale settings, but we get an error). So the instance is always running, which incurs additional cost when not in use.

So please suggest a way to stop the instance/avoid the additional cost when not in use.

Thanks,
M.Murugeswari

Quickstart.md - Job: Create Bicep Deployment issue with the storage account name

Why?

While running the pipeline with the Create Bicep Deployment job, we get the error: {'code': 'StorageAccountAlreadyTaken', 'target': 'stmlopsv2819prod', 'message': 'The storage account named stmlopsv2819prod is already taken.'}

I believe storage account names must be globally unique, and since this was already executed before, I can't create a storage account with the same name.

How?

I believe in both files, config-infra-dev.yml and config-infra-prod.yml, this line needs to change: st$(namespace)$(postfix)$(environment)

and we might have to add some initials of our company, subscription, or even our name.

Thank you,
Carla
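One way to implement the suggestion: derive the name from the existing st$(namespace)$(postfix)$(environment) pattern plus a short hash of an organization-specific seed, while respecting Azure's storage account rules (3-24 characters, lowercase letters and digits only). A hypothetical sketch; the seed (company initials, subscription ID, etc.) is whatever your organization chooses:

```python
# Hypothetical helper for a globally unique, valid storage account name:
# lowercase alphanumeric, at most 24 characters. The `seed` argument is an
# assumption supplied by the user, e.g. a subscription ID.
import hashlib

def storage_account_name(namespace, postfix, environment, seed, max_len=24):
    base = f"st{namespace}{postfix}{environment}".lower()
    base = "".join(c for c in base if c.isalnum())
    # 4 hex chars derived from the seed make the name unique per organization
    suffix = hashlib.sha256(seed.encode()).hexdigest()[:4]
    return base[: max_len - len(suffix)] + suffix
```

The same idea could be expressed directly in the config-infra-*.yml variable, but computing it once in a script avoids collisions across teams reusing the template defaults.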

QUICKSTART.md: Deploy AML Workspace - Terraform Apply Failed - Resource Group Name Already Exists

When I run the ADO pipeline (Section: Outer Loop: Deploying Infrastructure via Azure DevOps, Step 8: Run the pipeline), I keep getting "resource group already exists" errors, even though I have changed the names to be unique several times (and each time deleted the corresponding resource groups in the Azure portal). Please help. Thanks a lot!

I put the error below:

Error: A resource with the ID "/subscriptions/7b60df94-2d4a-463d-9468-5ed1cce920b2/resourceGroups/rg-mlopsv2hez-0919prod" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_resource_group" for more information.

on modules/resource-group/main.tf line 1, in resource "azurerm_resource_group" "adl_rg":
1: resource "azurerm_resource_group" "adl_rg" {

Releasing state lock. This may take a few moments...
##[error]Error: The process '/opt/hostedtoolcache/terraform/0.14.7/x64/terraform' failed with exit code 1
Finishing: Terraform apply


Unable to register the model as mlflow model

Hi,

We ran the model training pipeline and registered the model as an mlflow model in the dev workspace.

image

We then used a release pipeline to register the model in the test workspace. Kindly find the command below:

image

But the model is registered as a custom model.

image

This was working fine before, but now it registers as a custom model only. Please let me know the solution.

[repo] Automate service principal setup?

Why?

Currently the QUICKSTART.md includes a lengthy GUI setup for creating a couple of Service Principals for later use. Couldn't this be automated with the Azure CLI? Then this section could simply run a script, or be included in the infrastructure setup.

How?

Probably a series of az ad sp commands. Should be idempotent.

Anything else?

A user also needs to ensure they're using the same names. All of this should go in a configuration file and be used throughout the project. @dkmiller may have some thoughts here too; perhaps that hydra tool would be good here?

[repo] ARM template within the bicep folders

Why?

Incorrect file location

How?

Currently there is a main.json ARM template at infrastructure/bicep/main.json. It should be removed (if it was only used to create the Bicep) or moved to infrastructure/arm, which is currently empty.

Getting error while creating Responsible Ai dashboard

I am trying to use different algorithms for training. The model training and model registration parts complete successfully, but the RAI insights dashboard constructor fails. I am using an xgboost model for training.

Traceback (most recent call last):
  File "create_rai_insights.py", line 184, in <module>
    main(args)
  File "create_rai_insights.py", line 128, in main
    model_estimator = load_mlflow_model(my_run.experiment.workspace, model_id=model_id)
  File "/mnt/azureml/cr/j/exe/wd/rai_component_utilities.py", line 79, in load_mlflow_model
    return mlflow.pyfunc.load_model(model_uri)._model_impl
  File "/azureml-envs/responsibleai-0.18/lib/python3.8/site-packages/mlflow/pyfunc/__init__.py", line 484, in load_model
    model_impl = importlib.import_module(conf[MAIN])._load_pyfunc(data_path)
  File "/azureml-envs/responsibleai-0.18/lib/python3.8/site-packages/mlflow/sklearn/__init__.py", line 494, in _load_pyfunc
    return _load_model_from_local_file(path=path, serialization_format=serialization_format)
  File "/azureml-envs/responsibleai-0.18/lib/python3.8/site-packages/mlflow/sklearn/__init__.py", line 452, in _load_model_from_local_file
    return cloudpickle.load(f)
ModuleNotFoundError: No module named 'xgboost'

Thank You

Authorisation problem when deploying training pipeline

Hello,

I have been following your quick start guide, and have got to the stage where I need to deploy the pipeline "deploy-model-training-pipeline.yml" on Azure DevOps.

When I run this, it goes as far as the Run pipeline in AML step in DevOps, then I get this error:

If there is an Authorization error, check your Azure KeyVault secret named kvmonitoringspkey. Terraform might put single quotation marks around the secret. Remove the single quotes and the secret should work.
.create table mlmonitoring (['Sno']: int, ['Age']: int, ['Sex']: string, ['Job']: int, ['Housing']: string, ['Saving accounts']: string, ['Checking account']: string, ['Credit amount']: int, ['Duration']: int, ['Purpose']: string, ['Risk']: string, ['timestamp']: datetime)
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.08368873596191406 seconds
Traceback (most recent call last):
  File "/azureml-envs/XXXXXX/lib/python3.7/site-packages/azure/kusto/data/security.py", line 68, in acquire_authorization_header
    return _get_header_from_dict(self.token_provider.get_token())
  File "/azureml-envs/XXXXXX/lib/python3.7/site-packages/azure/kusto/data/_token_providers.py", line 123, in get_token
    token = self._get_token_impl()
  File "/azureml-envs/XXXXXX/lib/python3.7/site-packages/azure/kusto/data/_token_providers.py", line 554, in _get_token_impl
    return self._valid_token_or_throw(token)
  File "/azureml-envs/XXXXXXX/lib/python3.7/site-packages/azure/kusto/data/_token_providers.py", line 201, in _valid_token_or_throw
    raise KustoClientError(message)
azure.kusto.data.exceptions.KustoClientError: ApplicationKeyTokenProvider - failed to obtain a token. 
invalid_client
AADSTS7000215: Invalid client secret provided. Ensure the secret being sent in the request is the client secret value, not the client secret ID, for a secret added to app 'XXXXX'.
Trace ID: XXXXXX
Correlation ID: XXXXXXX
Timestamp: 2022-11-01 15:46:31Z

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "prep.py", line 83, in <module>
    main()
  File "prep.py", line 80, in main
    log_training_data(df, args.table_name)
  File "prep.py", line 35, in log_training_data
    collector.batch_collect(df)
  File "/azureml-envs/XXXXX/lib/python3.7/site-packages/obs/collector.py", line 158, in batch_collect
    self.create_table_and_mapping()
  File "/azureml-envs/XXXXX/lib/python3.7/site-packages/obs/collector.py", line 132, in create_table_and_mapping
    self.kusto_client.execute_mgmt(self.database_name, CREATE_TABLE_COMMAND)
  File "/azureml-envs/XXXXX/lib/python3.7/site-packages/azure/kusto/data/client.py", line 891, in execute_mgmt
    return self._execute(self._mgmt_endpoint, database, query, None, self._mgmt_default_timeout, properties)
  File "/azureml-envs/XXXXX/lib/python3.7/site-packages/azure/kusto/data/client.py", line 959, in _execute
    request_headers["Authorization"] = self._aad_helper.acquire_authorization_header()
  File "/azureml-envs/XXXXX/lib/python3.7/site-packages/azure/kusto/data/security.py", line 72, in acquire_authorization_header
    raise KustoAuthenticationError(self.token_provider.name(), error, **kwargs)
azure.kusto.data.exceptions.KustoAuthenticationError: KustoAuthenticationError('ApplicationKeyTokenProvider', 'KustoClientError("ApplicationKeyTokenProvider - failed to obtain a token. \ninvalid_client\nAADSTS7000215: Invalid client secret provided. Ensure the secret being sent in the request is the client secret value, not the client secret ID, for a secret added to app 'XXXXXXX'.\r\nTrace ID: XXXXXXXX\r\nTimestamp: 2022-11-01 15:46:31Z")', '{'authority': 'XXXXXX', 'client_id': 'XXXXX', 'kusto_uri': 'https://adxmlopsv286309prod.uksouth.kusto.windows.net'}')

I have done some investigating:

  • It appears my Prod Service Principal is set up correctly. When I go to Project Settings > Service Connections > Azure-ARM-Prod > Edit, there is an option to verify the connection. It works here.
  • When I check my App Registrations and investigate the "Certificates and Secrets" tab of "Azure-ARM-Prod-mlops-sparse", there is indeed a secret in there. (The Terraform pipeline to create the Prod infrastructure works, so I am led to believe that my SPs are working correctly.)
  • When I go to my Prod Key Vault, there is a secret called kvmonitoringspkey - looking at this, though, the Secret Value is just $(CLIENT_SECRET) - is it meant to be this? If so, why? And where was it set to this?

Do you have any advice on how I can fix this error?

[repo] Deploy Batch Scoring pipeline - Test is failing

What?

When running the deployment of the batch inferencing pipeline from the template https://github.com/Azure/mlops-templates/blob/main/templates/aml-cli-v2/test-deployment.yml

the part where it tests the newly created endpoint fails on a missing argument:

https://github.com/Azure/mlops-templates/blob/84aff2f4e13c5bc0b86477118cf5279fac3cfaf9/templates/aml-cli-v2/test-deployment.yml#L7-L19

ERROR: the following arguments are required: --input

According to the docs, --input is mandatory - maybe a recent change (https://docs.microsoft.com/en-us/cli/azure/ml/batch-endpoint?view=azure-cli-latest#az-ml-batch-endpoint-invoke)

issues with mlops-v2/sparse_checkout.sh

I bumped into the following issues while running mlops-v2/sparse_checkout.sh. I worked around them because I knew what the commands were trying to accomplish.

Issue 1:

Workaround - used this command (not sure why the above doesn't work):
git clone https://github.com/azure/mlops-project-template --depth 1 --filter=blob:none MLOps711Demo

Issue 2:

  • git init -b main
    error: unknown switch `b'
    usage: git init [-q | --quiet] [--bare] [--template=] [--shared[=]] []

    --template
    directory from which templates will be used
    --bare create a bare repository
    --shared[=]
    specify that the git repository is to be shared amongst several users
    -q, --quiet be quiet
    --separate-git-dir
    separate git dir from working tree

Workaround (git init -b requires Git 2.28 or newer): first initialized the repo, then renamed the branch from master to main:

git init
git add . && git commit -m 'initial commit'
git branch -m master main

Quickstart.md - Job: Create Bicep Deployment issue with does not have authorization to perform action

WHY?
While running the pipeline with the Create Bicep Deployment job, we get the error:

ERROR: {'code': 'AuthorizationFailed', 'message': "The client '0ddf4906-d1f2-4d9f-a900-91af50dba0ec' with object id '0ddf4906-d1f2-4d9f-a900-91af50dba0ec' does not have authorization to perform action 'Microsoft.Resources/deployments/validate/action' over scope '/subscriptions/f1af7a1f-db16-4c85-b133-e7c3c1f094cc' or the scope is invalid. If access was recently granted, please refresh your credentials."}
##[error]Script failed with exit code: 1

image

Misleading Architecture Diagram & README.md

I think that the wording of the documentation and the architecture diagram are misleading.

"Continuous Deployment (CD) pipelines manage the promotion of the model and related assets through production"

In this accelerator you are not deploying the model between the different environments, you are promoting the training pipeline. Azure ML Studio doesn't currently support sharing resources across Workspaces (though this is being rectified in https://github.com/Azure/azureml-previews/tree/main/previews/registries).

Could the documentation be updated to make it clear that you are deploying the inner loop in each of the environments?

This confusion is seen in other issues such as #36

[QuickStart.md] Error when running deploy-model-training-pipeline due to container security

When running the deploy-model-training-pipeline.yml [QuickStart.md - Step "Inner Loop: Deploying Classical ML Model Development / Moving to Test Environment"] in DevOps I receive the following container security error:

##[warning]cv/aml-cli-v2/data-science/environment/Dockerfile - Container usage from external registry 'nvcr.io' found.
##[error]Container security analysis found 1 violations. This repo has one or more docker files having references to images from external registries. Please review https://aka.ms/containers-security-guidance to remove the reference of container images from external registries. Please reach out via teams (https://aka.ms/cssc-teams) or email ([email protected]) for any questions or clarifications.

I assume the problem is either this docker image here:
https://github.com/Azure/mlops-project-template/blob/62cd04cb283fb46580558e17117ed701f90dfcbe/classical/aml-cli-v2/mlops/azureml/train/train-env.yml#L3

Or this docker image:
https://github.com/Azure/mlops-project-template/blob/62cd04cb283fb46580558e17117ed701f90dfcbe/cv/aml-cli-v2/data-science/environment/Dockerfile#L2

When working with Conda, there are no checks on the environment packages

Describe the bug or the issue that you are facing

When creating environments with conda, no checks are performed to verify that the packages exist or that the pip dependencies are correctly installed.

Steps/Code to Reproduce

When creating the environments with conda, there are no checks performed when creating the environment to see if the packages exist or if the pip dependencies are also correctly installed.

az ml environment create --file .\train-conda.yml

train-conda.yml

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-image-plus-conda-example
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: environment.yml
description: Environment created from a Docker image plus Conda environment.

environment.yml

channels:
  - defaults
  - anaconda
  - conda-forge
dependencies:
  - python=3.7.5
  - pip
  - pip:
      - azureml-mlflow==1.38.0
      - azureml-sdk==1.38.0
      - sklearn==0.24.1

Expected Output

No error is thrown even though there is no such package as sklearn.

There should be a check to make sure that the environment has been successfully created before moving on to the next steps.
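Independent of the missing validation step, one reason this particular file can never resolve is the package name: on pip, scikit-learn is published as scikit-learn, not sklearn. A corrected environment.yml (versions kept from the issue, offered as a sketch rather than a verified fix) would be:

```yaml
channels:
  - defaults
  - anaconda
  - conda-forge
dependencies:
  - python=3.7.5
  - pip
  - pip:
      - azureml-mlflow==1.38.0
      - azureml-sdk==1.38.0
      - scikit-learn==0.24.1
```

Note this only fixes the name; it does not add the up-front existence check the issue asks for.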

Versions

running az -v
azure-cli 2.43.0

core 2.43.0
telemetry 1.0.8

Extensions:
account 0.2.5
ml 2.12.1

Dependencies:
msal 1.20.0
azure-mgmt-resource 21.1.0b1

Which platform are you using for deploying your infrastructure?

Azure DevOps (ADO)

If you mentioned Others, please mention which platform you are using?

No response

What are you using for deploying your infrastructure?

Bicep

Are you using Azure ML CLI v2 or Azure ML Python SDK v2?

Azure ML CLI v2

Describe the example that you are trying to run?

Tabular, but this affects all the environments.

[mlops-v2] Is the Quickstart tutorial limited to MS/Azure employees ?

Why?

This repo (mlops-v2) was recommended to us (an MS/Azure client) by our MS business contact to try out Azure MLOps. However, it looks like some steps require specific organization access.

How?

  • Doing the Quickstart
  • Creating the PAT in the Developer Settings of my GH account works fine
  • However, step 2.7 (pictured below) isn't available to us for pretty obvious reasons (we're not part of Azure):
    image

Is there a way to work around this? A public version of the tutorial/setup?

Looking forward to trying it out!
Thanks in advance

How can I create inference clusters?

Hi @setuc, I want to create an inference cluster, i.e. deploy the model to Azure Kubernetes Service for large-scale inferencing. How can I create a compute inference cluster?

Thank you

GitHub Actions instead of DevOps

Do you have a guide (similar to the Quickstart) on how to do everything in GitHub rather than Azure DevOps? Probably only the orchestration part is in Azure DevOps currently.

Moreover, referencing GitHub repos in an Azure DevOps pipeline is currently considered unsafe:
image

@msteller-Ai

[Quickstart.md] Deploying Infrastructure via Azure DevOps

Hi,
I am facing an issue with the deployment via Azure Pipeline. When I run the pipeline, the following error is thrown.

Screenshot_20221130_192350

It seemed like I had to change the resource repositories and connections. So I tried changing them to the following, and another error was thrown.

Screenshot_20221130_192212

Can anyone help me with the issue?

Thanks,
Saiham

Model tagging as Legacy

Hi,

The model which has been trained and registered is now showing as Legacy.

image

We tried creating a few new AutoML models, but those models were also created with the Legacy tag.

Can you please let us know on what basis the model is getting tagged as Legacy?

Thanks

Load testing of endpoint in Azure ML and data sanity tests

I wanted to load test a deployed model endpoint and add it to the MLOps v2 framework, so that newly deployed models go through load testing. I tried the Azure Load Testing resource for web apps, but I think it is just for static content. I also saw a Python package, locust, which seems promising. Is there any other approach?
Also, for data sanity and integration tests for deployed models, great_expectations is an available package. Is there an Azure service or recommended package for the same?
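Whichever tool generates the load (locust, or a plain Python loop hitting the endpoint), the pass/fail gate at the end usually reduces to percentile checks over the recorded latencies. A tool-agnostic sketch; the function names and the SLO threshold are illustrative assumptions, not recommendations:

```python
# Tool-agnostic sketch: summarize latency samples (in ms) collected by any
# load test and gate promotion on a p95 threshold. Threshold values are
# illustrative assumptions.
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def passes_slo(samples, p95_limit_ms=500.0):
    """True if 95% of requests completed within the latency limit."""
    return percentile(samples, 95) <= p95_limit_ms
```

A CD stage could run the load test, collect per-request latencies, and fail the pipeline when passes_slo returns False.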

Kubernetes issue - online deployment

Hi,

For the taxi mlflow model use case (online deployment), I tried using Kubernetes as the compute resource.

As per the Microsoft docs, I created AKS, configured the extensions, and attached it to the ML workspace. Then I tried online deployment via CLI v2, where I am facing the issue below: "InferencingClientCallFailed"

image

When deploying via the UI, for mlflow models too, it asks me to upload a scoring script and environment (which is not the case for managed instances).

image

So, do we need to manually add/configure the scoring script and environment details for mlflow models when we use Kubernetes?

Checkout mlops-project-template@cindy-test to s/mlops-project-template

Describe the bug or the issue that you are facing

Is it correct to use a branch different to main?
Starting: Checkout mlops-project-template@cindy-test to s/mlops-project-template

Steps/Code to Reproduce

Same as https://github.com/Azure/mlops-v2/blob/main/documentation/deployguides/deployguide_ado.md
Section: Set up source repository with Azure DevOps

Expected Output

I don't have an error per se, but it's weird to me that it's using a named branch (cindy-test) instead of main.

Versions

I am using: initialise-project.yml
from mlops-v2 repo

Which platform are you using for deploying your infrastructure?

Azure DevOps (ADO)

If you mentioned Others, please mention which platform you are using?

No response

What are you using for deploying your infrastructure?

Pre-Deployed

Are you using Azure ML CLI v2 or Azure ML Python SDK v2?

Azure ML Python SDK v2

Describe the example that you are trying to run?

I am just following the section:
Setup MLOps V2 and a New MLOps Project in Azure DevOps

from:
https://github.com/Azure/mlops-v2/blob/main/documentation/deployguides/deployguide_ado.md

We use ADO not GHA

Continuous Integration

I think Continuous Integration should be in the middle loop or the outer loop (depending on the definition of the loops). But currently Continuous Integration doesn't belong to any loop. Why?

[repo] Fix versions of all libraries

Why?

If you don't pin the version and there's a library upgrade, the pipeline may break.
Currently the install-CLI action does not pin the version, which means the DevOps agent will pick up the latest Azure ML CLI version while the job/pipeline definitions still follow the old version. This may break the pipeline.

How?
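One possible approach (a sketch, not the repo's actual install step; the version numbers below are illustrative) is to pin both the azure-cli package and the ml extension wherever the agent installs them:

```yaml
# Hypothetical pinned install step for an Azure DevOps pipeline.
# Pick versions matching what the job/pipeline definitions were written against.
- script: |
    pip install azure-cli==2.30.0
    az extension add --name ml --version 2.2.1
  displayName: Install pinned Azure CLI and ml extension
```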

Anything else?

"DeployTrainingPipeline" throws error at the "Connect to AML Workspace" step

To create the infrastructure, I ran an Azure Pipeline using infrastructure/pipelines/tf-ado-deploy-infra.yml, and it worked. Then I used mlops/devops-pipelines/deploy-model-training-pipeline.yml to run the training pipeline on the sample data provided, but it stops at the "Connect to AML Workspace" step with the following error:

(screenshot of the error, 2022-11-30)

Then I checked the workspace that was created automatically by the pipeline and noticed that I don't have access to the Jobs, along with an odd notification:

(screenshots of the workspace notification, 2022-11-30)

That is despite my having "Contributor" access to the subscription, and all those secrets, App registrations, etc. were created without any problem, following the QuickStart tutorial.

Enterprise readiness where Github is not the source control repository

The current Quickstart doesn't cater for organisations that don't use GitHub as their source control repository. These organisations need a way to get started on Azure quickly, with MLOps projects initialised in Azure DevOps itself based on 'MLOps version' and 'project type'.

Expand to add AzureRepos not just GitHub

Would it be possible to expand the example to use Azure Repos as well, not just GitHub? We don't use GitHub at all, but we are very interested in the MLOps accelerator.

Python SDK issue with azure-cli version 2.29

Hey together,

first of all, thanks a lot for the great tool; I am really enjoying version 2 of MLOps. It is very powerful and really helps speed up the MLOps process for customers.

There is a version problem I am currently experiencing with the Python SDK.


How to replicate the problem:

  1. Run sparse_check.sh to create a new classical ML project with the Python SDK
  2. Run the training pipeline

What causes the problem?

The new version of the az ml CLI no longer works with azure-cli version 2.29; it needs at least version 2.30.
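As an aside, if a pipeline wants to guard against this kind of mismatch, version strings need to be compared numerically rather than lexically (as plain strings, "2.9" would wrongly sort above "2.30"). A minimal self-contained sketch:

```python
# Compare dotted version strings numerically; a plain string compare
# would wrongly rank "2.9" above "2.30".
def parse_version(v: str) -> tuple:
    return tuple(int(part) for part in v.split("."))

def meets_minimum(installed: str, required: str = "2.30.0") -> bool:
    return parse_version(installed) >= parse_version(required)

print(meets_minimum("2.29.0"))  # False: too old for the new az ml extension
print(meets_minimum("2.30.0"))  # True
```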

Where is the problem located?

Version 2.29 of the azure-cli is currently hardcoded in the mlops-template file (install-az-cli.yaml).

What did I try to fix it?

I changed the version in the YAML file to the desired version 2.30, but then I ran into the next problem:
after increasing the version, the Workspace.from_config() method wanted me to use a DeviceLogin inside the pipeline.
That indicates that it somehow does not use the Azure CLI authentication.
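One workaround that may help (a sketch using the v1 azureml-core SDK, assuming that is what the pipeline step uses; it requires a config.json and an agent already logged in via az login) is to pass the CLI credentials to from_config() explicitly instead of letting it fall back to interactive authentication:

```python
# Hypothetical workaround (azureml-core, SDK v1): pass the Azure CLI
# credentials explicitly so Workspace.from_config() does not fall back
# to an interactive device login on the build agent.
from azureml.core import Workspace
from azureml.core.authentication import AzureCliAuthentication

cli_auth = AzureCliAuthentication()
ws = Workspace.from_config(auth=cli_auth)
```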

(screenshot after increasing the version)

Some help would be appreciated; I will also keep trying to fix it :)
