Distributed Batched Scoring Using Azure Batch AI

This recipe shows how to run distributed batch scoring using Azure Batch AI, and how to benchmark performance as the GPU cluster scales.

Image Data, Codebase and Models

The image dataset for scoring is provided by CaiCloud and VIP.com. It is hosted in the Azure Storage account octopustext under the Blob Container batchaisample, and contains 366,111 JPEG images totaling approximately 45 GB.

The main codebase and pretrained model files are also provided by CaiCloud, and are hosted in the same storage account under the File Share abc. The codebase contains all required dependencies to run batch scoring for image classification and cloth recognition. Pretrained VGG16 and Inception V3 models have also been uploaded to the File Share under the output directory.

The Storage account octopustext is in the East US data center. Please use Azure Storage Explorer or the Azure Portal to view the detailed directory structures in the Blob Container batchaisample and the File Share abc.

Batch Scoring Job Script

The main script used for the scoring job, dist_inference.py, is located in the root directory of the File Share abc. To view it, please download it from Azure Storage.

The input argument '--inference' specifies the scoring task: 'cloth_recognition' or 'classification'. The main idea is to shard the whole dataset into partitions based on the total number of workers; each worker then processes its assigned partition of images independently. There is no communication between workers, as the sketch below illustrates.
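A minimal sketch of this sharding scheme, assuming the workers are launched with OpenMPI so each process can read its rank and world size from the OMPI_COMM_WORLD_RANK / OMPI_COMM_WORLD_SIZE environment variables; score_image and IMAGE_DIR are hypothetical names for illustration, and the actual partitioning in dist_inference.py may differ:

```python
import os
from glob import glob

def score_image(path):
    # Hypothetical placeholder for the model-specific inference call
    # (VGG16 cloth recognition or Inception V3 classification).
    pass

# OpenMPI sets these environment variables for every process it launches via mpirun.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))

# Every worker builds the same sorted file list, then keeps a disjoint slice of it,
# so the dataset is partitioned without any inter-worker communication.
all_images = sorted(glob(os.path.join(os.environ.get("IMAGE_DIR", "."), "*.jpg")))
my_partition = all_images[rank::world_size]

for path in my_partition:
    score_image(path)

print("Worker %d processed %d images" % (rank, len(my_partition)))
```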

When the script completes, each worker outputs the total number of images it processed and how long it took, for example:

Worker 0 Processed 3124 images, took 0:10:43.983157

Please feel free to edit or optimize the logic of the script if needed.

Prerequisites

Please fill in configuration.json based on the template and place it in the same directory. It should include the Azure Batch AI authentication information and the credentials of the Storage Account octopustext.
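A minimal sketch of how the notebook might consume this file; the field names below (subscription_id, aad_client_id, aad_secret, aad_tenant, storage_account_name, storage_account_key) are assumptions for illustration, so check the actual template for the exact keys:

```python
import json

from azure.common.credentials import ServicePrincipalCredentials

# Field names are illustrative; use the ones defined in the template.
with open("configuration.json") as f:
    cfg = json.load(f)

# Service principal credentials used to authenticate the Batch AI management client.
creds = ServicePrincipalCredentials(
    client_id=cfg["aad_client_id"],
    secret=cfg["aad_secret"],
    tenant=cfg["aad_tenant"],
)

subscription_id = cfg["subscription_id"]
storage_account_name = cfg["storage_account_name"]
storage_account_key = cfg["storage_account_key"]
```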

Install Azure Batch AI management client using the following command:

pip install azure-mgmt-batchai

Since we may need to utilize APIs from other Azure products (e.g., Azure Storage, credentials), it is also required to install the full Azure Python SDK package:

pip install azure

Install Jupyter Notebook from https://jupyter.org/ or run

python -m pip install jupyter

Run the Batch Scoring Recipe

This Jupyter Notebook file contains information on how to run the batch scoring job on GPU nodes with Batch AI. You will be able to tune variables, including node_count and vm_size, to obtain different benchmark results; see the sketch below.
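A hedged sketch of how node_count and vm_size feed into cluster creation, reusing creds and subscription_id from the configuration sketch above. It assumes a pre-workspace version of azure-mgmt-batchai, where clusters.create takes (resource_group, cluster_name, parameters); newer SDK versions add a workspace argument. The resource group, cluster name, and admin credentials are hypothetical:

```python
import azure.mgmt.batchai as batchai
import azure.mgmt.batchai.models as models

# Benchmark knobs exposed by the notebook.
node_count = 8                # number of GPU nodes in the cluster
vm_size = "STANDARD_NC6"      # one K80 per node; e.g. STANDARD_NC6S_V2 for P100

client = batchai.BatchAIManagementClient(creds, subscription_id)

parameters = models.ClusterCreateParameters(
    location="eastus",
    vm_size=vm_size,
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=node_count)
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name="demo_user",          # hypothetical
        admin_user_password="demo_password",  # hypothetical
    ),
)

# clusters.create returns a poller; result() blocks until the cluster is provisioned.
cluster = client.clusters.create("my_resource_group", "nc6cluster", parameters).result()
```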

Note that, since there is no communication between workers, a parameter server is not required in this case. Therefore, we use customToolkitSettings in the Batch AI job definition (instead of TensorFlowSettings) and use OpenMPI to launch and monitor all workers more efficiently. The OpenMPI binary is installed in the container using a JobPreparation task, as sketched below.
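A hedged sketch of the corresponding job definition, reusing the models module, client, cluster, and node_count from the snippets above. The container image, file share mount name (afs), resource group, hostfile variable, and exact mpirun flags are illustrative assumptions, and newer SDK versions also require workspace/experiment arguments to jobs.create:

```python
job_params = models.JobCreateParameters(
    location="eastus",
    cluster=models.ResourceId(id=cluster.id),
    node_count=node_count,
    std_out_err_path_prefix="$AZ_BATCHAI_MOUNT_ROOT/afs",  # hypothetical mount name
    # Illustrative container image; the recipe's actual image may differ.
    container_settings=models.ContainerSettings(
        image_source_registry=models.ImageSourceRegistry(image="tensorflow/tensorflow:1.7.0-gpu")
    ),
    # JobPreparation installs the OpenMPI binaries inside the container before the job runs.
    job_preparation=models.JobPreparation(
        command_line="apt-get update && apt-get install -y --no-install-recommends openmpi-bin"
    ),
    # customToolkitSettings instead of TensorFlowSettings: mpirun launches and monitors
    # all workers directly, with no parameter server involved.
    custom_toolkit_settings=models.CustomToolkitSettings(
        command_line=(
            "mpirun --allow-run-as-root --hostfile $AZ_BATCHAI_MPI_HOST_FILE "
            "python $AZ_BATCHAI_MOUNT_ROOT/afs/dist_inference.py --inference cloth_recognition"
        )
    ),
)

job = client.jobs.create("my_resource_group", "scoring_job", job_params).result()
```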

Benchmark Results

The table below shows the elapsed time to label 100k images for the 'cloth_recognition' (VGG-16) task:

| GPU type | 1 GPU    | 8 GPUs  | 16 GPUs | 32 GPUs |
|----------|----------|---------|---------|---------|
| K80      | 741 mins | 99 mins | 49 mins | 25 mins |
| P100     | 255 mins | 32 mins | 19 mins | 10 mins |

Quasi-linear scaling can be observed as the number of GPUs increases.

The benchmark for the 'classification' (Inception-V3) task has not been done yet; the test code needs to be optimized to achieve higher GPU efficiency.

Reference

  • To transfer large amounts of data between a local device and Azure Storage, please use AzCopy or Blobxfer instead of the Portal/Storage Explorer.

  • A detailed reference for the Batch AI Python SDK can be found here.

  • If you prefer to use Azure CLI 2.0 instead of the Python SDK, please see the article for instructions.
