GithubHelp home page GithubHelp logo

chenliu0831 / ecs-machine-learning Goto Github PK

View Code? Open in Web Editor NEW

This project forked from aws-samples/ecs-machine-learning

0.0 2.0 0.0 28 KB

An EC2 Container Service Architecture which provides the infrastructure for Deep Learning

License: Apache License 2.0

ecs-machine-learning's Introduction

Orchestrating GPU-Accelerated Workloads on Amazon ECS

Background

AWS Solutions Architects are seeing an emerging type of application for ECS: GPU-accelerated workloads, or, more specifically, workloads that need to leverage large amounts of GPUs across many nodes. For example, at Amazon.com, the Amazon Personalization Team runs significant Machine Learning workloads that leverage many GPUs on Amazon ECS. Let’s take a look at how ECS enables GPU workloads.

Solution

In order to run GPU-enabled work on an ECS cluster, a Docker image configured with Nvidia CUDA drivers, which allow the container to communicate with the GPU hardware, is built and stored in EC2 Container Registry. An ECS Task Definition is used to point to the container image in ECR and specify configuration for the container at runtime, like how much CPU and memory each container should use, the command to run inside the container, if a data volume should be mounted in the container, where the source dataset lives in S3, and so on.

Once the ECS Tasks are run, the ECS scheduler finds a suitable place to run the containers by identifying an instance in the cluster with available resources. As shown in the below architecture diagram, ECS can place containers into the cluster of GPU instances (“GPU slaves” in the diagram)

Deploying the architecture

In this template, we spin up an ECS Cluster with a single GPU instance in an autoscaling group. You can, however, adjust the ASG desired capacity to run a larger cluster if you’d like. The instance is configured with all of the necessary software, like Nvidia drivers, that DSSTNE requires for interaction with the underlying GPU hardware. We also install some development tools, like Make and GCC, so that we can compile the DSSTNE library at boot time. We then build a Docker container with the DSSTNE library packaged up and upload it to EC2 Container Registry. We grab the URL of the resulting container image in ECR and build an ECS Task Definition that points to the container.

Once the CloudFormation template completes, take a look at the “Outputs” tab to get an idea of where to look for your new resources.

Prerequisistes

Network configuration

The instances launched will need to have access to the internet hence either be in a public subnet and provided a public IP or in a private subnet with acess to a NAT gateway.

Accepting terms

  1. Accept AWS Marketplace terms for Amazon Linux AMI with NVIDIA GRID GPU Driver by going to the MarketPlace page

  2. Click Continue on the right.

  3. Click on the Manual Launch tab and click on the Accept Software Terms button.

  4. Wait for an email confirmation that your marketplace subscription is active.

Launch the stack

  1. Choose Launch Stack to launch the template in the us-east-1 region in your account: Launch ECS Machine Learning into North Virginia with CloudFormation

(The template will build a DSSTNE container on the ECS cluster instance. Note this can take up to 25 minutes and the CloudFormation stack will not report completion until the entire build process is done.)

  1. Give a Stack Name and select your preferred key name. If you do not have a key available, see Amazon EC2 Key Pairs.

Run the model

  1. Find the name of the DSSTNE ECS Task Definition in CloudFormation stack outputs. It will start with "arn:aws[...]" and contain the CloudFormation template name right after "task-defintion/".

  2. Go to the ECS Console, click on Task Definitions (left column) and find the one you spotted in the step above.

  3. Tick the one revision you see. Click on the Actions drop-down menu and hit Run task. Make sure to select the ECS Cluster that was brought up by the CloudFormation template. By running this task, you are essentially running the DSSTNE sample modeling as described on the amazon-dsstne GitHub page.

  4. You can easily check that the GPU is being used by logging to the EC2 instance and running watch -n1 nvidia-smi

Collect predictions

  1. You should be able to find the name of the relevant CloudWatch Logs Group in CloudFormation stack outputs

  2. Look at the task logs for details and output from the task run and location of results file in S3

  3. Navigate to this S3 bucket via the S3 Console. This is where you will be able to access the results file and confirm that this GPU-enabled Machine Learning run was successful.

Bonus activities

  • Bonus activity #1: Repeat step 2 in the Run the model section, but change the config URL and training command by overriding the task definition environment variables to perform a benchmark.
  • Bonus activity #2: Modify the CloudFormation template (or launch a new stack) and use a g2.8xlarge instead of g2.2xlarge (you could also try a P2 instance...) Repeat step 2 in the Run the model section, but override the training command in the task definition environment variables to use MPI to take advantage of all 4 GPUs: add mpirun -np <NUMBER OF GPU> in front of the train command. For this you will have to create a new revision of the task definition.

In both bonus steps, look at the CloudWatch Logs to view the task logs (different training commands, taking advantage of multiple GPUs, etc.)

Conclusion

You should now have a good grasp on how to leverage ECS and GPU-optimized EC2 instances for your Machine Learning needs. Head on over to the AWS Big Data blog to learn more about how DSSTNE interacts with Apache Spark, trains models, generates predictions, and other fun Machine Learning concepts.

ecs-machine-learning's People

Contributors

ovalba avatar steckmey avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.