
Batch Orchestration for Bonsai Simulations

Overview

You've found the bonsai-batch framework: a set of tools to help you scale out simulators using Azure Batch. Note that while this was developed with Bonsai in mind, it is a general framework for running tasks on Azure Batch using its Python SDK.

⚠️ Disclaimer: This is not an official Microsoft product. This application is considered an experimental addition to Microsoft Project Bonsai's software toolchain. Its primary goal is to reduce barriers to entry for Project Bonsai's core Machine Teaching, and no warranties are provided for its use.

Prerequisites

  1. An Azure account.
  2. A Bonsai workspace. You can find instructions on provisioning a Bonsai workspace in the Microsoft Bonsai documentation.
  3. Anaconda or miniconda.
  4. A virtual environment with the required dependencies (described in the environment.yml file):
conda env create -f environment.yml
conda activate bonsai-batch

Quick Start

  • Create your resources: python batch_creation.py create_resources
  • Build your image: python batch_creation.py build_image
  • Run your tasks: python batch_containers.py run_tasks
  • Create your brain and start training: bonsai brain version start-training --name <brain-name>
  • Attach your simulators: bonsai simulator unmanaged connect -b <brain-name> -a Train -c <concept_name> --simulator-name <simulator-name>

Scaling Simulators Using Azure Batch and Azure Container Registry

There are two executable scripts in this repository:

  1. batch_creation.py -> creates the necessary resources on Azure to scale your simulations: Azure Batch, Azure Container Registry, and Azure Blob Storage, all within a single resource group.
    • NOTE: Resources may contain only lowercase alphanumeric characters, and must be between 3 and 25 characters in length.
  2. batch_containers.py -> executes a set of simulation jobs as a set of tasks on the Azure Batch account you created in step 1.

Both of these scripts use the fire package to expose their functions as a command-line interface. To see how to use them, view their arguments and documentation with the -h flag:

python batch_creation.py -h
NAME
    batch_creation.py

SYNOPSIS
    batch_creation.py GROUP | COMMAND

GROUPS
    GROUP is one of the following:

     configparser
       Configuration file parser.

     pathlib

     re
       Support for regular expressions (RE).

     Dict
       The central part of internal API.

     Union
       Internal indicator of special typing constructs. See _doc instance attribute for specific docs.

     fire
       The Python Fire module.

COMMANDS
    COMMAND is one of the following:

     get_default_cli

     azure_cli_run
       Run Azure CLI command

     AzCreateBatch

     AzExtract

     AcrBuild

     delete_resources
       Delete resource group

     write_azure_config

     create_resources
       Main function to create azure resources and write out credentials to config file

     build_image
       Build ACR image from a source directory containing a dockerfile and src files.


python batch_containers.py -h
NAME
    batch_containers.py

SYNOPSIS
    batch_containers.py GROUP | COMMAND

GROUPS
    GROUP is one of the following:

     configparser
       Configuration file parser.

     datetime
       Fast implementation of the datetime type.

     pathlib

     sys
       This module provides access to some objects used or maintained by the interpreter and to functions that interact strongly with the interpreter.

     time
       This module provides various functions to manipulate time values.

     List
       The central part of internal API.

     batch_auth

     batch

     batchmodels

     blobxfer

     fire
       The Python Fire module.

     xfer_utils
       Run and scale simulation experiments on Azure Batch.

COMMANDS
    COMMAND is one of the following:

     AzureBatchContainers

     run_tasks
       Run simulators in Azure Batch.

     stop_job

     upload_files
       Upload files into attached batch storage account.

While many functions are exposed (fire also lists imported modules under GROUPS), the most common usage relies on only two commands from batch_creation.py and one from batch_containers.py:

  1. python batch_creation.py create_resources
    • Creates the resources you need for orchestrating tasks on Azure Batch. If you already have some resources created (e.g., a resource group) you can pass their names directly: python batch_creation.py create_resources --rg=<existing-rg> --loc=<location-of-resources> (⚠️: if your resource group exists in a different location from your other resources, you'll need to pass its location separately: python batch_creation.py create_resources --rg=<existing-rg> --rg_loc=<rg_location> --loc=<all-other-resources-loc>).
  2. python batch_creation.py build_image --image-name <image-name>
    • build your Docker image on Azure Container Registry
    • ⚠️: if your image is very large you may need to increase the --timeout parameter to prevent the script from timing out while building the image. If you still encounter issues creating and pushing the image, you may prefer building the image locally and pushing it yourself with docker build, docker tag, and docker push.
  3. python batch_containers.py run_tasks
    • run your batch pool

Already Built Resources, Just Write them to the Config File 📝

If you already have a resource group, an ACR registry, a storage account, and a Batch account, you can use the write_azure_config function to write their credentials to a config file rather than creating new resources:

python batch_creation.py write_azure_config \
	--rg=my-beautiful-rg \
	--batch=has-a-batch-account \
	--acr=and-acr-registry \
	--store=plus-storage! \
	--loc=in-this-location
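The round-trip this performs can be sketched with Python's configparser. The section and key names below are illustrative assumptions, not the script's actual schema; check the generated newconf.ini for the real layout.

```python
# Hypothetical sketch of the config round-trip write_azure_config performs.
# Section/key names are assumptions for illustration only.
import configparser

def write_config(path, rg, batch, acr, store, loc):
    """Write resource names and credentials-adjacent settings to an INI file."""
    config = configparser.ConfigParser()
    config["GROUP"] = {"NAME": rg, "LOCATION": loc}
    config["BATCH"] = {"ACCOUNT_NAME": batch}
    config["ACR"] = {"SERVER": f"{acr}.azurecr.io"}
    config["STORAGE"] = {"ACCOUNT_NAME": store}
    with open(path, "w") as f:
        config.write(f)

def read_config(path):
    """Read the INI file back; downstream scripts would consume this."""
    config = configparser.ConfigParser()
    config.read(path)
    return config

write_config("newconf.ini", "my-beautiful-rg", "has-a-batch-account",
             "myacr", "mystore", "westus2")
conf = read_config("newconf.ini")
print(conf["ACR"]["SERVER"])  # -> myacr.azurecr.io
```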

Comments on Usage

The main advantage of this repository is it streamlines the process of scaling simulators using Azure Container Registry with Docker images. The only thing the user needs to do is write a Dockerfile containing their source code for running the simulator. In most cases, this is a very simple Docker image, and hence the Dockerfile is very concise. Building and running the image is done entirely using Azure Container Registry, which means you don't even need to install Docker locally!
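In most cases a simulator Dockerfile looks something like the following hypothetical sketch; the base image, file names, and entry point are illustrative assumptions, not a prescribed layout:

```dockerfile
# Hypothetical minimal simulator image -- adjust to your sim's needs.
FROM python:3.7-slim
WORKDIR /src
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# The task command can also be supplied at run_tasks time instead.
CMD ["python", "main.py"]
```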

Scaling Number Of Sims, Number Of Tasks And Number Of Instances In A Pool

To specify the number of nodes in the pool, define the following arguments:

python batch_containers.py run_tasks --dedicated-nodes=<#_of_dedicated nodes> --low-pri-nodes=<#_of_lo_pri_nodes>

The number of tasks per node will be automatically deduced as number_of_sims/(number_low_pri_nodes + number_dedicated_nodes). You can view this parameter by inspecting your pool's configuration and the value of Task slots per node.
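As a worked example of that arithmetic (assuming the result is rounded up; verify against your pool's actual Task slots per node value):

```python
import math

def task_slots_per_node(num_sims: int, low_pri_nodes: int, dedicated_nodes: int) -> int:
    """Tasks packed onto each node: number_of_sims / total node count.
    Ceiling division is an assumption here; check your pool's configuration."""
    total_nodes = low_pri_nodes + dedicated_nodes
    return math.ceil(num_sims / total_nodes)

# e.g., 100 simulators across 8 low-priority + 2 dedicated nodes
print(task_slots_per_node(100, 8, 2))  # -> 10
```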

How to Delete an Existing Pool

Note: deleting pools is the surest way to avoid incurring additional costs once brain training has completed.

To delete a pool from the command line, you have the following options:

  • Delete last created pool: python batch_containers.py delete_pool.
  • Delete specific pool: python batch_containers.py delete_pool --pool_name="pool-name".
  • Delete all pools within last created resource: python batch_containers.py delete_pool --delete-all=True.

If you want to view and/or manage your current pools through the Azure portal, follow these steps:

  1. Search for the resource group you selected when running python batch_creation.py create_resources.
  2. On the Overview tab, click the item with TYPE "Batch account" (by default: "<your_group_name>batch").
  3. In the left pane, under the 'Features' section, click 'Pools'.
  4. You can now see the list of previously created pools.

Note About Modifying An Existing Pool

If you want to resize an existing pool, use the resize_pool function:

python batch_containers.py resize_pool --low_pri_nodes <new-low-pri-node-count> --dedicated_nodes <new-dedicated-node-count>

If you would like to modify the pool altogether, perhaps so that it uses a new image, then delete the pool first and run new tasks on the new pool:

python batch_containers.py delete_pool --pool_name <pool-to-delete>
python batch_containers.py run_tasks

Building Windows Containers

The build_image function accepts a few arguments for specifying the platform, the image name, and the Dockerfile path. Here is an example specifying the Windows platform with a different Dockerfile location:

python batch_creation.py build_image \
  --docker_folder=examples/cs-house-energy \
  --dockerfile_path=Dockerfile-windows \
  --platform=windows --image_name=winhouse

After building, you can run your tasks with the specific image you've created:

python batch_containers.py run_tasks --image_name=winhouse

Mounting and Accessing Azure Fileshares

You can also mount an Azure Fileshare and access it from your container. This can be useful if you want to write logs to a persistent storage facility or need to access files located on an external storage system when running your containers.

If you want to use an Azure Fileshare, first ensure that create_fileshare=True during resource creation, i.e.,

python batch_creation.py create_resources --create_fileshare=True

Second, when running your containers you need to set log_iterations=True, i.e.,

python batch_containers.py run_tasks --log_iterations=True

This ensures your batch pool mounts the Azure Fileshare during pool creation. The fileshare will be mounted at the path azfiles under your mount directory. You can access it through the environment variable $AZ_BATCH_NODE_MOUNTS_DIR plus the azfiles path, i.e., $AZ_BATCH_NODE_MOUNTS_DIR/azfiles (see the Azure Batch documentation for more information on how directories are structured on your batch pool nodes).

To ensure your batch pools correctly expand the environment variable when running their tasks, you may want to add a shell script to your container, which you can then invoke through batch_containers.py run_tasks. For instance, add a file called batch-start-script.sh with the following contents:

#!/usr/bin/env bash
python main.py --log-iterations --log-path=$AZ_BATCH_NODE_MOUNTS_DIR/"azfiles/cartpole-logs"

Then supply the script as the task when run_tasks prompts you:

python batch_containers.py run_tasks --log_iterations=True
Please enter task to run from container (e.g., python main.py): bash batch-start-script.sh

This will log the iterations to the directory cartpole-logs in your Azure Fileshare.
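As a hedged sketch, a simulator's main.py entry point might parse those flags as follows; only the flag names come from the example above, and everything else is illustrative:

```python
# Illustrative sketch of a simulator entry point accepting the flags above.
# Flag names match the example; defaults and behavior are assumptions.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="simulator entry point")
    parser.add_argument("--log-iterations", action="store_true",
                        help="write per-iteration logs")
    parser.add_argument("--log-path", default="logs",
                        help="directory for log files (e.g., the mounted fileshare)")
    return parser.parse_args(argv)

args = parse_args(["--log-iterations",
                   "--log-path", "/mnt/batch/tasks/fsmounts/azfiles/cartpole-logs"])
print(args.log_iterations, args.log_path)
```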

For Linux, the value of $AZ_BATCH_NODE_MOUNTS_DIR is typically /mnt/batch/tasks/fsmounts/, so you can also specify the full path manually instead of using the environment variable (for instance, if you cannot expand the environment variable in your command).

⚠️ If you are logging from each of your containers, you should try to ensure the filename is unique for each container to avoid clashes between logs from different containers.
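One way to do this, sketched below, is to derive the filename from the AZ_BATCH_TASK_ID environment variable that Azure Batch sets inside each task, falling back to hostname plus timestamp elsewhere; the naming scheme itself is only an illustration:

```python
import os
import socket
import time

def unique_log_name(prefix: str = "cartpole") -> str:
    """Build a per-container log filename. Azure Batch sets AZ_BATCH_TASK_ID
    inside each task; outside Batch, fall back to hostname + timestamp.
    The prefix/extension scheme is illustrative, not prescribed."""
    task_id = os.environ.get("AZ_BATCH_TASK_ID")
    suffix = task_id if task_id else f"{socket.gethostname()}-{int(time.time())}"
    return f"{prefix}-{suffix}.csv"
```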

Installation

There is currently no up-to-date batch_orchestration package. The best way to use this repository is to install the bonsai-batch conda environment (see the Prerequisites section if you need to install conda):

conda update -n base -c defaults conda
conda env update -f environment.yml 
conda activate bonsai-preview

This provides the exact versions of the packages and Python environment we used to test this library, and therefore gives you the highest chance of success.

The first time you use this package you'll also need to log in to Azure and set your subscription appropriately:

az login
az account list -o table
az account set -s <subscription-id>

Testing Docker Images

The only caveat: if you need to debug your Docker image, you will need to install Docker locally (or write a batch script to run on ACR, which is an inefficient way to debug). For example, after running the batch_creation script above, you could test your image as follows:

docker login azhvacacr.azurecr.io
# your username and password are available in the newconf.ini file
docker pull azhvacacr.azurecr.io/hvac:1.0
docker run -it azhvacacr.azurecr.io/hvac:1.0 bash

Reasons Your Unmanaged Sims May Become Unregistered

Simulators may unregister from the Bonsai platform for any of the following reasons:

  • Software update to Bonsai platform
  • WaitForState timeout
  • WaitForAction timeout

When using managed simulators, the platform will automatically re-register and connect sims when they unregister. When using unmanaged simulators, such as with the bonsai-batch scripts, the user is responsible for registering the simulators again. To aid this effort, run reconnect.py with the following flags to repeatedly look for sims with an Unset purpose and connect those specific session IDs back to your brain:

python reconnect.py --simulator-name HouseEnergy --brain-name 20201116_he --brain-version 1 --concept-name SmartHouse --interval 1
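The polling pattern described above can be sketched as follows. This is not reconnect.py's actual implementation; the two callables are hypothetical stand-ins for the Bonsai service calls that list Unset sessions and reconnect a session to the brain:

```python
import time

def reconnect_loop(list_unset_sessions, connect_session,
                   interval_seconds=60, max_rounds=None):
    """Sketch of the reconnect polling pattern (reconnect.py's internals may
    differ). list_unset_sessions() returns session ids whose purpose is Unset;
    connect_session(session_id) reconnects one session to the brain. Both are
    hypothetical stand-ins for real Bonsai service calls."""
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for session_id in list_unset_sessions():
            connect_session(session_id)
        rounds += 1
        time.sleep(interval_seconds)  # poll at the configured interval
```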

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.


bonsai-batch's Issues

show_price = True default is causing the code to choke

Traceback (most recent call last):
  File "batch_containers.py", line 790, in <module>
    fire.Fire()
  File "C:\Users\JOMCDOWE\Miniconda3\envs\batch\lib\site-packages\fire\core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\JOMCDOWE\Miniconda3\envs\batch\lib\site-packages\fire\core.py", line 471, in _Fire
    target=component.__name__)
  File "C:\Users\JOMCDOWE\Miniconda3\envs\batch\lib\site-packages\fire\core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "batch_containers.py", line 691, in run_tasks
    show_price=show_price,
  File "batch_containers.py", line 467, in batch_main
    host_os=self.config["ACR"]["PLATFORM"],
  File "C:\Users\JOMCDOWE\OneDrive - Microsoft\Documents\bonsai\preview-repos\bonsai-batch\get_azure_data.py", line 125, in show_hourly_price
    hourly_price = (float(low_table.iloc[0, 3]) * low_pri_nodes) + (
  File "C:\Users\JOMCDOWE\Miniconda3\envs\batch\lib\site-packages\pandas\core\indexing.py", line 873, in __getitem__
    return self._getitem_tuple(key)
  File "C:\Users\JOMCDOWE\Miniconda3\envs\batch\lib\site-packages\pandas\core\indexing.py", line 1443, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "C:\Users\JOMCDOWE\Miniconda3\envs\batch\lib\site-packages\pandas\core\indexing.py", line 702, in _has_valid_tuple
    self._validate_key(k, i)
  File "C:\Users\JOMCDOWE\Miniconda3\envs\batch\lib\site-packages\pandas\core\indexing.py", line 1352, in _validate_key
    self._validate_integer(key, axis)
  File "C:\Users\JOMCDOWE\Miniconda3\envs\batch\lib\site-packages\pandas\core\indexing.py", line 1437, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bound

The specified nodeAgentSKUId does not exist (ubuntu 16.04)

10/19/21: I started receiving this issue after successfully running 3 other runs with the same image and no changes to bonsai-batch. Has the API changed drastically?

Code: InvalidPropertyValue
Message: {'additional_properties': {}, 'lang': 'en-US', 'value': 'The value provided for one of the properties in the request body is invalid.\nRequestId:06894e94-463d-4a41-9980-873e5d1515d7\nTime:2021-10-20T04:43:00.5940380Z'}
PropertyName: nodeAgentSKUId
PropertyValue: batch.node.ubuntu 16.04
Reason: The specified nodeAgentSKUId does not exist.

Using new VM Name/SKU fails when inputting Standard_D8s_v3 or None

(bonsai-preview) C:\Users\JOMCDOWE\Documents\bonsai\preview-repos\bonsai-batch>python batch_containers.py run_tasks
Please enter task to run from container (e.g., python __main__.py): python3 main.py
Number of simulators to run as tasks on Batch: 100
What VM Name / SKU do you want to use? (if you don't know say None): Standard_D8s_v3
root         : WARNING  Creating new pool named houseenergypool10
root         : WARNING  Creating new pool named houseenergypool10
Traceback (most recent call last):
  File "batch_containers.py", line 642, in <module>
    fire.Fire()
  File "C:\Users\JOMCDOWE\Miniconda3\envs\bonsai-preview\lib\site-packages\fire\core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\JOMCDOWE\Miniconda3\envs\bonsai-preview\lib\site-packages\fire\core.py", line 471, in _Fire
    target=component.__name__)
  File "C:\Users\JOMCDOWE\Miniconda3\envs\bonsai-preview\lib\site-packages\fire\core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "batch_containers.py", line 608, in run_tasks
    command=task_to_run, log_iterations=log_iterations, workdir=workdir,
  File "batch_containers.py", line 400, in batch_main
    self.create_pool(use_fileshare=log_iterations)
  File "batch_containers.py", line 244, in create_pool
    self.batch_client.pool.add(self.new_pool)
  File "C:\Users\JOMCDOWE\Miniconda3\envs\bonsai-preview\lib\site-packages\azure\batch\operations\_pool_operations.py", line 325, in add
    raise models.BatchErrorException(self._deserialize, response)
azure.batch.models._models_py3.BatchErrorException: Request encountered an exception.
Code: InvalidRequestBody
Message: {'additional_properties': {}, 'lang': 'en-US', 'value': 'The specified Request Body is not syntactically valid.\nRequestId:be254e54-cb23-4c02-b203-e3de66591681\nTime:2020-11-20T19:57:48.8929293Z'}
Reason: A null value was found with the expected type 'Microsoft.Azure.Batch.Protocol.Entities.MountConfiguration[Nullable=False]'. The expected type 'Microsoft.Azure.Batch.Protocol.Entities.MountConfiguration[Nullable=False]' does not allow null values.

using absolute paths to sim files in build_image can cause failures due to too long of a name

(bonsai-preview) C:\Users\JOMCDOWE\Documents\bonsai\preview-repos\bonsai-batch>python batch_creation.py build_image
Directory of Dockerfile and source files: C:\Users\JOMCDOWE\Documents\bonsai\preview-repos\microsoft-bonsai-api\Python\samples\house-energy
az_command_data_logger: INFO     command args: acr build --image {} --registry {} --file {} {} --platform {}
Packing source code into tar to upload...
Uploading archived source code from 'C:\Users\JOMCDOWE\AppData\Local\Temp\build_archive_f83aa4c18a974bb5aaea11df7cd19f00.tar.gz'...
azure.storage.common.storageclient: INFO     Client-Request-ID=88dc0b76-2b69-11eb-8101-ba31b58b277f Outgoing request: Method=PUT, Path=/f51bb67e90d44c908953ad20a3c05ea8-n5kw6plecp/source/202011200000/c9a5f5a0-ea98-4433-954b-aebd65e80e48.tar.gz, Query={'timeout': None, 'sv': '2018-03-28', 'sr': 'b', 'sig': 'REDACTED', 'se': '2020-11-20T20:49:42Z', 'sp': 'cw'}, Headers={'x-ms-blob-type': 'BlockBlob', 'x-ms-lease-id': None, 'If-Modified-Since': None, 'If-Unmodified-Since': None, 'If-Match': None, 'If-None-Match': None, 'Content-Length': '307083', 'x-ms-version': '2018-03-28', 'User-Agent': 'Azure-Storage/1.4.2-1.5.0 (Python CPython 3.7.7; Windows 10)', 'x-ms-client-request-id': '88dc0b76-2b69-11eb-8101-ba31b58b277f', 'x-ms-date': 'Fri, 20 Nov 2020 19:49:43 GMT'}.
azure.storage.common.storageclient: INFO     Client-Request-ID=88dc0b76-2b69-11eb-8101-ba31b58b277f Receiving Response: Server-Timestamp=Fri, 20 Nov 2020 19:49:42 GMT, Server-Request-ID=12ca1bfc-d01e-0064-0b76-bfd625000000, HTTP Status Code=201, Message=Created, Headers={'content-length': '0', 'content-md5': 'SEFKjfpJvMR4DwpCTR2OSA==', 'last-modified': 'Fri, 20 Nov 2020 19:49:42 GMT', 'etag': '"0x8D88D8D6CD0EB8C"', 'server': 'Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0', 'x-ms-request-id': '12ca1bfc-d01e-0064-0b76-bfd625000000', 'x-ms-version': '2018-03-28', 'x-ms-request-server-encrypted': 'true', 'date': 'Fri, 20 Nov 2020 19:49:42 GMT'}.
Sending context (299.886 KiB) to registry: jmanrgacr...
cli.azure.cli.core.util: ERROR    InvalidInputValue - Some of the properties of 'runRequest' are invalid.. InnerErrors: DockerBuildRequest:Image name 'CUsersJOMCDOWEDocumentsbonsaipreviewreposmicrosoftbonsaiapiPythonsampleshouseenergy:latest' is invalid.
The acceptable patterns are 'repository' or 'repository:tag'.The repository and tag names follow the standardized docker repository and tag naming conventions. Tag templates are supported within '{{' and '}}'.
Please refer to docs for more information on using image name and tag:  https://aka.ms/az/acr/az-acr-build
InvalidInputValue - Some of the properties of 'runRequest' are invalid.. InnerErrors: DockerBuildRequest:Image name 'CUsersJOMCDOWEDocumentsbonsaipreviewreposmicrosoftbonsaiapiPythonsampleshouseenergy:latest' is invalid.
The acceptable patterns are 'repository' or 'repository:tag'.The repository and tag names follow the standardized docker repository and tag naming conventions. Tag templates are supported within '{{' and '}}'.
Please refer to docs for more information on using image name and tag:  https://aka.ms/az/acr/az-acr-build

Need to create logs/ folder

(bonsai-preview) C:\Users\JOMCDOWE\Documents\bonsai\preview-repos\bonsai-batch>python batch_creation.py create_resources
Traceback (most recent call last):
  File "batch_creation.py", line 30, in <module>
    filename=os.path.join("logs", "batch_creation.log"), mode="a"
  File "C:\Users\JOMCDOWE\Miniconda3\envs\bonsai-preview\lib\logging\__init__.py", line 1087, in __init__
    StreamHandler.__init__(self, self._open())
  File "C:\Users\JOMCDOWE\Miniconda3\envs\bonsai-preview\lib\logging\__init__.py", line 1116, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\JOMCDOWE\\Documents\\bonsai\\preview-repos\\bonsai-batch\\logs\\batch_creation.log'
