aws-samples / amazon-sagemaker-safe-deployment-pipeline

Safe blue/green deployment of Amazon SageMaker endpoints using AWS CodePipeline, CodeBuild and CodeDeploy.

Home Page: https://aws.amazon.com/blogs/machine-learning/safely-deploying-and-monitoring-amazon-sagemaker-endpoints-with-aws-codepipeline-and-aws-codedeploy/

License: MIT No Attribution

Languages: Python 33.02%, Jupyter Notebook 65.45%, Shell 1.53%
Topics: amazon-sagemaker, aws-codepipeline, aws-cloudformation

amazon-sagemaker-safe-deployment-pipeline's Introduction

Amazon SageMaker Safe Deployment Pipeline

Introduction

This is a sample solution to build a safe deployment pipeline for Amazon SageMaker. This example could be useful for any organization looking to operationalize machine learning with native AWS development tools such as AWS CodePipeline, AWS CodeBuild and AWS CodeDeploy.

This solution provides a blue/green (also known as a canary) deployment by creating an AWS Lambda API that calls into an Amazon SageMaker endpoint for real-time inference.

Architecture

In the following diagram, you can view the continuous delivery stages of AWS CodePipeline.

  1. Build Artifacts: Runs an AWS CodeBuild job to create AWS CloudFormation templates.
  2. Train: Runs an Amazon SageMaker training job and a baseline processing job, orchestrated by AWS Step Functions.
  3. Deploy Dev: Deploys a development Amazon SageMaker endpoint.
  4. Deploy Prod: Deploys an Amazon API Gateway endpoint and an AWS Lambda function in front of the Amazon SageMaker endpoints, using AWS CodeDeploy for blue/green deployment and rollback.

[Diagram: AWS CodePipeline stages]

Component Details

  • AWS CodePipeline – CodePipeline has various stages defined in CloudFormation that step through the actions required, in order, to go from source code to the creation of the production endpoint.
  • AWS CodeBuild – This solution uses AWS CodeBuild to build the source code from GitHub.
  • Amazon S3 – Artifacts created throughout the pipeline, as well as the data for the model, are stored in an Amazon Simple Storage Service (S3) bucket.
  • AWS CloudFormation – This solution uses the AWS CloudFormation Template language, in either YAML or JSON, to create each resource including a custom resource.
  • AWS Step Functions – This solution creates AWS Step Functions state machines to orchestrate Amazon SageMaker training and processing jobs.
  • Amazon SageMaker – This solution uses Amazon SageMaker to train and deploy the machine learning model.
  • AWS CodeDeploy – This solution uses AWS CodeDeploy to automate shifting traffic between two AWS Lambda functions.
  • Amazon API Gateway – This solution creates an HTTPS REST API endpoint for the AWS Lambda functions that invoke the deployed Amazon SageMaker endpoints.

Deployment Steps

The following is the list of steps required to get up and running with this sample.

Requirements

You need an AWS account with Amazon SageMaker Studio enabled; the steps below reference your Studio user's execution role.

Enable Amazon SageMaker Studio Project

  1. From the AWS console, navigate to Amazon SageMaker Studio, click on your Studio user name (do not open Studio yet), and copy the name of the execution role as shown below (similar to AmazonSageMaker-ExecutionRole-20210112T085906).

[Screenshot: copying the Studio execution role name]

  2. Click on the launch button below to set up the stack, paste the role name copied in step 1 as the value of the SageMakerStudioRoleName parameter as shown below, and click Create Stack.

[Screenshot: entering the SageMakerStudioRoleName parameter]

Alternatively, you can use the provided scripts/build.sh (which requires the AWS CLI installed with appropriate IAM permissions) as follows:

# bash scripts/build.sh S3_BUCKET_NAME STACK_NAME REGION STUDIO_ROLE_NAME
# REGION should match your default AWS CLI region
# STUDIO_ROLE_NAME is copied from step 1. Example:
bash scripts/build.sh example-studio example-pipeline us-east-1 AmazonSageMaker-ExecutionRole-20210112T085906
  3. From the AWS console, navigate to CloudFormation and wait until the stack STACK_NAME is ready (CREATE_COMPLETE).
  4. Go to SageMaker Studio and click Open Studio (refresh your browser if you are already in Studio). In the left-hand panel, click on the inverted triangle. As in the screenshot below, under Projects -> Create project -> Organization templates, you should see the added SageMaker Safe Deployment Pipeline. Click on the template name and Select project template.

[Screenshot: creating a project from the organization templates]

  5. Choose a name for the project. You can leave the rest of the fields at their default values (optionally, use your own email for SNS notifications), then click Create project.
  6. Once the project is created, you have the option to clone it locally from AWS CodeCommit with a single click. Click clone and it takes you directly to the project.
  7. Navigate to the code base and open notebook/mlops.ipynb.
  8. Choose a kernel from the prompt, such as Python 3 (Data Science).
  9. Assign your project name to the PROJECT_NAME placeholder in the first code cell of the mlops.ipynb notebook, as in the sketch after this list.
  10. You are now ready to work through the rest of the cells in notebook/mlops.ipynb.
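
For reference, the first cell boils down to a single assignment (the project name shown here is hypothetical):

# First code cell of notebook/mlops.ipynb; replace with your own project name.
PROJECT_NAME = "nyctaxi-safe-deployment"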

Start, Test and Approve the Deployment

Once the deployment is complete, there will be a new AWS CodePipeline created, with a Source stage that is linked to your source code repository. You will notice initially that it will be in a Failed state as it is waiting on an S3 data source.

[Screenshot: pipeline Source stage in a Failed state]

Once the notebook is running, you will be guided through a series of steps: downloading the New York City Taxi dataset and uploading it to the Amazon SageMaker S3 bucket along with the data source metadata, which triggers a new execution of the AWS CodePipeline.

[Screenshot: pipeline triggered by the S3 data source]
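
Under the hood, the trigger is simply an S3 upload that the pipeline's source stage watches. The following is a minimal boto3 sketch of what the notebook does; the bucket and key names are hypothetical, and the notebook derives the real values:

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and keys; the notebook computes the real values.
bucket = "sagemaker-us-east-1-111122223333"
s3.upload_file("train.csv", bucket, "nyctaxi/input/train.csv")

# Uploading the zipped data source manifest is what the CodePipeline
# S3 source action watches for, so this starts a new execution.
s3.upload_file("data-source.zip", bucket, "nyctaxi/data-source.zip")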

Once your pipeline is kicked off, it will run model training and deploy a development SageMaker endpoint.

There is a manual approval step, which you can action directly within the SageMaker notebook, to promote this to production, send some traffic to the live endpoint, and create a REST API.

[Screenshot: manual approval step]
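
If you prefer to approve outside the notebook, the equivalent boto3 call looks roughly like this; the pipeline, stage, and action names are hypothetical, and the approval token must first be fetched from the pipeline state:

import boto3

codepipeline = boto3.client("codepipeline")

# Hypothetical names and token; fetch the real token from
# codepipeline.get_pipeline_state(name="nyctaxi") before approving.
codepipeline.put_approval_result(
    pipelineName="nyctaxi",
    stageName="DeployPrd",
    actionName="ApproveDeploy",
    result={"summary": "Approved from the notebook", "status": "Approved"},
    token="example-approval-token",
)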

Subsequent deployments of the pipeline will use AWS CodeDeploy to perform a blue/green deployment, shifting traffic from the original to the replacement endpoint over a period of five minutes.

[Screenshot: CodeDeploy shifting traffic between endpoints]
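
You can watch the traffic shift with a simple boto3 poll; the deployment ID below is hypothetical (list_deployments can locate the active one):

import boto3

codedeploy = boto3.client("codedeploy")

# Hypothetical deployment ID.
info = codedeploy.get_deployment(deploymentId="d-EXAMPLE123")["deploymentInfo"]
print(info["status"], info.get("deploymentConfigName"))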

Finally, the SageMaker notebook shows how to retrieve the results from the Monitoring Schedule, which runs on the hour.
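
As a sketch, the latest monitoring results can be fetched with boto3; the schedule name below is hypothetical (the notebook derives it from the model name and training job ID):

import boto3

sm = boto3.client("sagemaker")

# Hypothetical schedule name.
executions = sm.list_monitoring_executions(
    MonitoringScheduleName="mlops-nyctaxi-pms-123456",
    SortBy="ScheduledTime",
    SortOrder="Descending",
    MaxResults=5,
)
for summary in executions["MonitoringExecutionSummaries"]:
    print(summary["ScheduledTime"], summary["MonitoringExecutionStatus"])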

Approximate Times:

The following is a list of approximate running times for the pipeline:

  • Full Pipeline: 35 minutes
  • Start Build: 2 minutes
  • Model Training and Baseline: 5 minutes
  • Launch Dev Endpoint: 10 minutes
  • Launch Prod Endpoint: 15 minutes
  • Monitoring Schedule: runs on the hour

Customizing for your own model

This project is written in Python and designed to be customized for your own model and API.

.
├── api
│   ├── __init__.py
│   ├── app.py
│   ├── post_traffic_hook.py
│   └── pre_traffic_hook.py
├── assets
│   ├── deploy-model-dev.yml
│   ├── deploy-model-prod.yml
│   ├── suggest-baseline.yml
│   └── training-job.yml
├── custom_resource
│   ├── __init__.py
│   ├── sagemaker_monitoring_schedule.py
│   ├── sagemaker_suggest_baseline.py
│   ├── sagemaker_training_job.py
│   └── sagemaker-custom-resource.yml
├── model
│   ├── buildspec.yml
│   ├── dashboard.json
│   ├── requirements.txt
│   └── run_pipeline.py
├── notebook
│   ├── dashboard.json
│   ├── workflow.ipynb
│   └── mlops.ipynb
├── scripts
│   ├── build.sh
│   ├── lint.sh
│   └── set_kernelspec.py
├── pipeline.yml
└── studio.yml

Edit the get_training_params method in the model/run_pipeline.py script, which runs as part of the AWS CodeBuild step, to add your own estimator or model definition.
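
For example, a custom model could be plugged in with a generic SageMaker estimator along these lines; this is a hypothetical sketch using the SageMaker Python SDK v2, and the image URI, role, and paths are placeholders:

from sagemaker.estimator import Estimator

# All values below are placeholders for your own container, role, and bucket.
estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-model:latest",
    role="arn:aws:iam::111122223333:role/my-sagemaker-execution-role",
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path="s3://my-bucket/my-model/output",
)
estimator.set_hyperparameters(max_depth=5, eta=0.2)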

Extend the AWS Lambda hooks in api/pre_traffic_hook.py and api/post_traffic_hook.py to add your own validation or inference against the deployed Amazon SageMaker endpoints. You can also edit the api/app.py Lambda function to add any enrichment or transformation to the request/response payload.
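
A pre-traffic hook typically runs a smoke test against the new endpoint and reports the result back to CodeDeploy. Below is a minimal sketch of that pattern, not the repository's exact code; the endpoint name and test payload are hypothetical:

import boto3

codedeploy = boto3.client("codedeploy")
sm_runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    # CodeDeploy passes these IDs so the hook can report a result back.
    deployment_id = event["DeploymentId"]
    execution_id = event["LifecycleEventHookExecutionId"]
    status = "Succeeded"
    try:
        # Hypothetical smoke test against the endpoint under test.
        sm_runtime.invoke_endpoint(
            EndpointName="mlops-nyctaxi-prd",  # placeholder name
            ContentType="text/csv",
            Body=b"1.0,2.0,3.0",
        )
    except Exception:
        status = "Failed"
    codedeploy.put_lifecycle_event_hook_execution_status(
        deploymentId=deployment_id,
        lifecycleEventHookExecutionId=execution_id,
        status=status,
    )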

Running Costs

This section outlines cost considerations for running the SageMaker Safe Deployment Pipeline. Completing the pipeline will deploy development and production SageMaker endpoints which will cost less than $10 per day. Further cost breakdowns are below.

  • CodeBuild – Charges per minute used. First 100 minutes each month come at no charge. For information on pricing beyond the first 100 minutes, see AWS CodeBuild Pricing.
  • CodeCommit – $1/month if you didn't opt to use your own GitHub repository.
  • CodeDeploy – No cost with AWS Lambda.
  • CodePipeline – CodePipeline costs $1 per active pipeline per month. Pipelines are free for the first 30 days after creation. More can be found at AWS CodePipeline Pricing.
  • CloudWatch - This template includes 1 dashboard and 3 alarms (2 for deployment and 1 for model drift) which costs less than $10 per month.
    • Dashboards cost $3/month.
    • Alarm metrics cost $0.10 per alarm.
  • CloudTrail – Low cost, $0.10 per 100,000 data events to enable S3 CloudWatch Events. For more information, see AWS CloudTrail Pricing.
  • KMS – $1/month for the Customer Managed CMK created.
  • API Gateway – Low cost, $1.29 for the first 300 million requests. For more information, see Amazon API Gateway Pricing.
  • Lambda – Low cost, $0.20 per 1 million requests. See AWS Lambda Pricing.
  • SageMaker – Prices vary based on EC2 instance usage for the Notebook Instances, Model Hosting, Model Training and Model Monitoring; each charged per hour of use. For more information, see Amazon SageMaker Pricing.
    • The ml.t3.medium instance notebook costs $0.0582 an hour.
    • The ml.m4.xlarge instance for the training job costs $0.28 an hour.
    • The ml.m5.xlarge instance for the monitoring baseline costs $0.269 an hour.
    • The ml.t2.medium instance for the dev hosting endpoint costs $0.065 an hour.
    • The two ml.m5.large instances for the production hosting endpoint cost 2 × $0.134 per hour.
    • The ml.m5.xlarge instance for the hourly scheduled monitoring job costs $0.269 an hour.
  • S3 – Prices will vary depending on the size of the model/artifacts stored. The first 50 TB each month will cost only $0.023 per GB stored. For more information, see Amazon S3 Pricing.

Cleaning Up

First, delete the stacks used as part of the pipeline for deployment, the training job, and the suggested baseline. For a model name of nyctaxi, these would be:

  • nyctaxi-deploy-prd
  • nyctaxi-deploy-dev
  • nyctaxi-workflow
  • sagemaker-custom-resource

Finally, delete the stack you created in AWS CloudFormation.
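
If you prefer to script the cleanup, here is a minimal boto3 sketch; the stack names assume a model named nyctaxi, plus whatever name you gave the pipeline stack itself:

import boto3

cfn = boto3.client("cloudformation")

# Stacks created by the pipeline for a model named "nyctaxi".
for stack in [
    "nyctaxi-deploy-prd",
    "nyctaxi-deploy-dev",
    "nyctaxi-workflow",
    "sagemaker-custom-resource",
]:
    cfn.delete_stack(StackName=stack)
    cfn.get_waiter("stack_delete_complete").wait(StackName=stack)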

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

amazon-sagemaker-safe-deployment-pipeline's People

Contributors

amazon-auto, brightsparc, chrisbarrantes, dalacan, ehsanmok, gmcorral, tom5610


amazon-sagemaker-safe-deployment-pipeline's Issues

CFN error

CFN fails to launch giving me the following error. I was able to launch it without an issue on Thursday.

[Screenshot: CloudFormation launch error]

feat: Trigger retraining on CloudWatch alarm drift detection

Alarms are currently created when drift is detected, but we need to add a CloudWatch event to start the pipeline.

This can be done by creating a role with permission to start the CodePipeline. Note there is no ARN attribute on the AWS::CodePipeline::Pipeline resource, so the ARN must be formatted from the pipeline name.

  CloudWatchEventRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub mlops-${ModelName}-cwe-role
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: Allow
            Principal:
              Service:
                - events.amazonaws.com
            Action: sts:AssumeRole
      Path: /
      Policies:
        -
          PolicyName: "mlops-cwe-pipeline-execution"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: Allow
                Action: codepipeline:StartPipelineExecution
                Resource: !Sub "arn:aws:codepipeline:${AWS::Region}:${AWS::AccountId}:${DeployPipeline}"

To start the pipeline, you just need to specify the target ARN:

  CloudWatchEventRule:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source:
          - aws.cloudwatch
        detail-type:
          - "CloudWatch Alarm State Change"
        detail:
          alarmName:
            - !Sub mlops-${ModelName}-metric-gt-threshold
          state:
            value: 
              - "ALARM"
      Targets:
        -
          Arn: !Sub "arn:aws:codepipeline:${AWS::Region}:${AWS::AccountId}:${DeployPipeline}"
          RoleArn: !GetAtt CloudWatchEventRole.Arn
          Id: !Sub codepipeline-${DeployPipeline}

see docs

feat: Add native CloudFormation support for MonitoringSchedule

Update the production deployment template to use the native MonitoringSchedule instead of the custom CloudFormation resource.

  SagemakerMonitoringSchedule:
    Type: AWS::SageMaker::MonitoringSchedule
    Properties:
      EndpointName: !GetAtt Endpoint.EndpointName
      MonitoringScheduleConfig:
        MonitoringJobDefinition:
          BaselineConfig:
            ConstraintsResource:
              S3Uri: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/monitoring/baseline/mlops-${ModelName}-pbl-${TrainJobId}/constraints.json
            StatisticsResource:
              S3Uri: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/monitoring/baseline/mlops-${ModelName}-pbl-${TrainJobId}/statistics.json
          MonitoringAppSpecification:
            ImageUri:
              !FindInMap [ModelAnalyzerMap, !Ref "AWS::Region", "ImageUri"]
          MonitoringInputs:
            - EndpointInput:
                EndpointName: !GetAtt Endpoint.EndpointName
                LocalPath: "/opt/ml/processing/endpointdata"
          MonitoringOutputConfig:
            MonitoringOutputs:
              - S3Output:
                  LocalPath: "/opt/ml/processing/localpath"
                  S3Uri: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/monitoring/reports
          MonitoringResources:
            ClusterConfig:
              InstanceCount: 1
              InstanceType: ml.m5.xlarge
              VolumeKmsKeyId: !Ref KmsKeyId
              VolumeSizeInGB: 30
          RoleArn: !Ref MLOpsRoleArn
          StoppingCondition:
            MaxRuntimeInSeconds: 1800
        ScheduleConfig:
          ScheduleExpression: "cron(0 * ? * * *)"
      MonitoringScheduleName: !Sub mlops-${ModelName}-pms-${TrainJobId}

This requires defining a region mapping (under the template's Mappings section) for the model analyzer image:

  ModelAnalyzerMap:
    "us-west-2":
      "ImageUri": "159807026194.dkr.ecr.us-west-2.amazonaws.com/sagemaker-model-monitor-analyzer:latest"
    "us-east-2":
      "ImageUri": "680080141114.dkr.ecr.us-east-2.amazonaws.com/sagemaker-model-monitor-analyzer:latest"
    "us-east-1":
      "ImageUri": "156813124566.dkr.ecr.us-east-1.amazonaws.com/sagemaker-model-monitor-analyzer:latest"
    "eu-west-1":
      "ImageUri": "890145073186.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-model-monitor-analyzer:latest"
    "ap-northeast-1":
      "ImageUri": "574779866223.dkr.ecr.ap-northeast-1.amazonaws.com/sagemaker-model-monitor-analyzer:latest"
    "ap-northeast-2":
      "ImageUri": "709848358524.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-model-monitor-analyzer:latest"
    "ap-southeast-2":
      "ImageUri": "563025443158.dkr.ecr.ap-southeast-2.amazonaws.com/sagemaker-model-monitor-analyzer:latest"
    "eu-central-1":
      "ImageUri": "048819808253.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-model-monitor-analyzer:latest"

nyctaxi-deploy-prd fails

The pipeline fails to create the prod stack at the SagemakerMonitoringSchedule resource with:

Resource handler returned message: "Error occurred during operation 'CREATE'." (RequestToken: 40af8897-76d2-abb5-6efc-ef8c6948d42b, HandlerErrorCode: GeneralServiceException)

Note that everything else succeeds, and the same stack works in us-east-1.

feat: Add support for StepFunctions to run Training and Baseline jobs.

The build currently outputs a custom CloudFormation template to run a training job and a baseline job.

This could be updated to use the new AWS Step Functions Data Science Python SDK to output the CloudFormation template that creates/updates a pipeline.

This template can then be executed via CodePipeline with the new StepFunctions action:

Name: ActionName
ActionTypeId:
  Category: Invoke
  Owner: AWS
  Version: 1
  Provider: StepFunctions
OutputArtifacts:
  - Name: myOutputArtifact
Configuration:
  StateMachineArn: arn:aws:states:us-east-1:111122223333:stateMachine:HelloWorld-StateMachine
  ExecutionNamePrefix: my-prefix

Custom resources are not deleted

When I try deleting the following CloudFormation stacks, custom resources are stuck in the DELETE_IN_PROGRESS state for a long time.

nyctaxi-training-job
nyctaxi-suggest-baseline

feat: Update prod deploy to enable DataCaptureConfig on create

CloudFormation now supports including DataCaptureConfig as part of EndpointConfig, which removes the need for an explicit step to enable data capture after an endpoint is created.

The assets/deploy-model-prod.yml template can be extended with the following configuration.

  EndpointConfig:
    Type: "AWS::SageMaker::EndpointConfig"
    Properties:
      DataCaptureConfig:
        CaptureContentTypeHeader:
          CsvContentTypes:
            - "text/csv"
          JsonContentTypes:
            - "application/json"
        CaptureOptions:
          - CaptureMode: Input
          - CaptureMode: Output
        DestinationS3Uri: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/datacapture
        EnableCapture: True
        InitialSamplingPercentage: 100
        KmsKeyId: !Ref KmsKeyId

See also DataCaptureConfig docs

enhancement: Add support for Version 2.x of SageMaker Python SDK

Update the build code and notebook to support the breaking changes coming in v2.0 of the SageMaker Python SDK.

Image URI Functions (e.g. get_image_uri)

import boto3

try:
    # SageMaker Python SDK v2: https://sagemaker.readthedocs.io/en/stable/v2.html
    from sagemaker import image_uris

    def get_training_image(region=None):
        region = region or boto3.Session().region_name
        return image_uris.retrieve(framework="xgboost", region=region, version="1.0-1")
except ImportError:
    # Fall back to the v1 API.
    from sagemaker.amazon.amazon_estimator import get_image_uri

    def get_training_image(region=None):
        region = region or boto3.Session().region_name
        return get_image_uri(region, "xgboost", "1.0-1")

XGBoost Predictor

try:
    # SageMaker Python SDK v2: https://sagemaker.readthedocs.io/en/stable/v2.html
    from sagemaker.predictor import Predictor
    from sagemaker.serializers import CSVSerializer

    def get_predictor(endpoint_name):
        xgb_predictor = Predictor(endpoint_name)
        xgb_predictor.serializer = CSVSerializer()
        return xgb_predictor
except ImportError:
    # Fall back to the v1 API.
    from sagemaker.predictor import RealTimePredictor, csv_serializer

    def get_predictor(endpoint_name):
        xgb_predictor = RealTimePredictor(endpoint_name)
        xgb_predictor.content_type = "text/csv"
        xgb_predictor.serializer = csv_serializer
        return xgb_predictor

AWS CloudFormation failing with 403 S3 Forbidden error

Hello developers,
After creating the initial template and launching a project with SageMaker Studio, project creation fails with a CloudFormation error: a 403 Forbidden S3 error. I have given the role full S3 access and admin access, but the issue persists. I changed the S3 bucket name and tried every solution I could think of, and even tested in a different VPC, all with the same error. Could someone please run the template on your end and help with a possible solution?

Why not use CDK to build the pipeline and workflow with Step Functions?

  • The pipeline and build in pure CloudFormation are not easy to follow.
  • The workflow/Step Functions generated from run_pipeline.py are also not easy to follow, because part of the infrastructure is created by CloudFormation and then, in run_pipeline, more infrastructure is created by the SDK. It is inconsistent and mixed.
  • Why not use CDK for everything, for consistency?

Build error in nyctaxi pipeline

I tried to prepare the data using the deployed Jupyter notebook named "mlops.ipynb". After uploading the zipped data source to S3, which triggers the "nyctaxi" pipeline, the last step in the Build stage failed after several minutes with:
"""
CreateWorkflow
AWS CloudFormation
Failed
"""
[Screenshot: failed CreateWorkflow action]

After inspecting events in CFn, the following error was reported:
"""
Resource handler returned message: "'arn:aws:iam::(aws-account):role/mlops-nyctaxi-sfn-execution-role' is not authorized to create managed-rule. (Service: AWSStepFunctions; Status Code: 400; Error Code: AccessDeniedException; Request ID: 7198988e-765b-4ffa-aca1-183dc957d581; Proxy: null)" (RequestToken: 8b3b9e75-16e5-89bc-2e1b-69340f289e41, HandlerErrorCode: AccessDenied)
"""

Is this a new issue? Any ideas how to fix it?

Thanks in advance.
