
Switch-based Training Acceleration for Machine Learning (SwitchML)

License: MIT License

C++ 15.05% Makefile 0.45% Shell 0.83% P4 32.01% Python 51.67%

switchml's Introduction

Switch-Based Training Acceleration for Machine Learning (SwitchML)

This is an implementation of SwitchML in P4_16, with support for large packets and RDMA via RoCE.

Status

This is a work in progress.

Features:

  • 256 byte (64 entry) or 1024 byte (256 entry) packets
  • Up to 4 8-bit exponents per message
  • Up to 32 workers in a job

Limitations:

  • Only the first pipeline's front-panel ports (dev_ports 0 through 63) are currently supported, in order to support 1024 byte packets
  • When used with RDMA, ICRC checking on the NIC must be disabled. See README.md in RDMAExampleClient for more info.
  • The tests are not fully functional right now.

Requirements

The P4 code requires SDE 9.1.0 or above.

The control plane requires Python 2.7 with Scapy and the other SDE dependencies. These should already be installed on any machine or switch that has the SDE installed.

For use with RDMA, additional dependencies are required. See RDMAExampleClient/README.md for more details.

Instructions

  • Clone the switchml repo.
  • Build the P4 code with a command like p4_build.sh p4/switchml.p4 or the equivalent.
  • Run the control plane with a command like python py/switchml.py.
    • Set the switch MAC and IP with the --switch_mac and --switch_ip arguments.
    • To specify ports and MAC addresses, either edit py/switchml.py or make a version of py/prometheus-fib.yml and load it using the --ports argument.
  • For RDMA, job configuration is done via gRPC.
  • For SwitchML-UDP, job configuration can be done by loading a file like py/prometheus-fib.yml with the --job argument or the worker_file command in the CLI.

For use with Daiet, ensure Daiet is configured with num_updates = 64 or num_updates = 256.

For use with the RDMA example client, follow the directions in README.md in RDMAExampleClient.

Testing

  • Build with SWITCHML_TEST set to minimize register size for speedier model initialization: p4_build.sh p4/switchml.p4 -DSWITCHML_TEST=1
  • After starting the model and switchd, run the tests with bash run_tests.sh

Glossary

Pool: a collection of aggregator slots

Slot: register storage for one packet's worth of data

Set: each pool element is divided into two sets: odd and even. Pool sizes should always be multiples of two, so that we have storage for both sets.

Consume: adding values into registers

Harvest: reading aggregated values out of registers
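
The glossary terms can be tied together with a small software model. This is only an illustrative sketch, not the P4 implementation: the class names, the `2 * element + set_bit` slot numbering, and the reset-on-harvest behavior are assumptions made for the example.

```python
from typing import List


class Pool:
    """Toy model of a pool: a collection of aggregator slots.

    Assumption (not from the SwitchML source): pool element `e` owns two
    slots, one per set, stored at index 2*e (even set) and 2*e + 1 (odd set).
    """

    def __init__(self, num_slots: int, entries: int) -> None:
        if num_slots % 2 != 0:
            # Pool sizes must be multiples of two so both sets have storage.
            raise ValueError("pool size must be a multiple of two")
        self.entries = entries
        # Each slot holds one packet's worth of values.
        self.slots: List[List[int]] = [[0] * entries for _ in range(num_slots)]

    def _slot(self, element: int, set_bit: int) -> int:
        if set_bit not in (0, 1):
            raise ValueError("set_bit must be 0 or 1")
        return 2 * element + set_bit

    def consume(self, element: int, set_bit: int, values: List[int]) -> None:
        """Consume: add one packet's values into the slot's registers."""
        if len(values) != self.entries:
            raise ValueError("packet must carry exactly one slot's worth of data")
        idx = self._slot(element, set_bit)
        self.slots[idx] = [a + b for a, b in zip(self.slots[idx], values)]

    def harvest(self, element: int, set_bit: int) -> List[int]:
        """Harvest: read the aggregated values out and clear the slot."""
        idx = self._slot(element, set_bit)
        result = self.slots[idx]
        self.slots[idx] = [0] * self.entries
        return result
```

For instance, two workers that each consume into element 0 of the odd set would later harvest the element-wise sum of their two packets from that slot.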

SwitchML packet formats

This code supports two packet formats: the original UDP format, and RoCE v2.

For SwitchML-UDP, the packet is laid out like this:

  • Ethernet
  • IP
  • UDP (base port 0xbee0)
  • SwitchML header
  • SwitchML exponent header
  • SwitchML significands (either 256 bytes or 1024 bytes)
  • Ethernet FCS
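
As a rough illustration of the significand payload sizes, the sketch below packs 64 or 256 entries into 256 or 1024 bytes, which works out to 4 bytes per entry. The function names are hypothetical, and the use of signed 32-bit integers in network byte order is an assumption; the actual encoding is defined by the P4 code and the client library. The SwitchML header and exponent header are deliberately left out because their fields are not described here.

```python
import struct
from typing import List, Sequence

# Entry counts taken from the feature list: 64 entries (256 bytes)
# or 256 entries (1024 bytes).
VALID_ENTRY_COUNTS = (64, 256)


def pack_significands(values: Sequence[int]) -> bytes:
    """Pack significands into a payload of 4 bytes per entry.

    Assumption: entries are signed 32-bit integers in network byte order.
    """
    if len(values) not in VALID_ENTRY_COUNTS:
        raise ValueError("payload must hold 64 or 256 entries")
    return struct.pack("!%di" % len(values), *values)


def unpack_significands(payload: bytes) -> List[int]:
    """Inverse of pack_significands, under the same assumptions."""
    if len(payload) not in (256, 1024):
        raise ValueError("payload must be 256 or 1024 bytes")
    return list(struct.unpack("!%di" % (len(payload) // 4), payload))
```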

For SwitchML-RDMA, the packet layout is slightly different depending on which part of a message a packet contains. A message with a single packet looks like this:

  • Ethernet
  • IP
  • UDP (dest port: RoCEv2 (4791))
  • IB BTH
  • IB RETH, with the following components:
    • Address: virtual address response should be directed to
    • rkey:
      • bits 31:16: currently unused
      • bits 15:1: pool index
      • bit 0: set bit
  • IB IMM: contains 4 8-bit exponents
  • Payload: significands, either 256 bytes or 1024 bytes
  • IB ICRC: ignored
  • Ethernet FCS
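
The rkey bit layout above can be sketched as a pair of helper functions, together with a helper that packs up to four 8-bit exponents into the 32-bit immediate. The function names are hypothetical. The byte order of the exponents within the immediate, the zero padding for unused exponents, and the treatment of exponents as raw unsigned bytes are assumptions for illustration only.

```python
from typing import Sequence, Tuple


def encode_rkey(pool_index: int, set_bit: int) -> int:
    """Build an rkey: bits 15:1 hold the pool index, bit 0 the set bit.

    Bits 31:16 are currently unused and are left as zero here.
    """
    if not 0 <= pool_index < (1 << 15):
        raise ValueError("pool index must fit in 15 bits")
    if set_bit not in (0, 1):
        raise ValueError("set bit must be 0 or 1")
    return (pool_index << 1) | set_bit


def decode_rkey(rkey: int) -> Tuple[int, int]:
    """Return (pool_index, set_bit), ignoring the unused upper 16 bits."""
    return (rkey >> 1) & 0x7FFF, rkey & 0x1


def pack_exponents(exponents: Sequence[int]) -> int:
    """Pack up to four 8-bit exponents into a 32-bit immediate value.

    Assumptions: the first exponent goes in the most significant byte,
    and unused positions are filled with zero.
    """
    if len(exponents) > 4:
        raise ValueError("at most 4 exponents per message")
    if any(not 0 <= e <= 0xFF for e in exponents):
        raise ValueError("each exponent must fit in 8 bits")
    padded = list(exponents) + [0] * (4 - len(exponents))
    return int.from_bytes(bytes(padded), "big")
```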

Design overview

TBD

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

switchml's People

Contributors

amedeosapio, nelsonje, nelsonje-msr


switchml's Issues

ACTION REQUIRED: Microsoft needs this private repository to complete compliance info

There are open compliance tasks that need to be reviewed for your SwitchML repo.

Action required: 4 compliance tasks

To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your microsoft GitHub organization.

Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/microsoft/repos/SwitchML/compliance

  • The GitHub AE (GitHub inside Microsoft) migration survey has not been completed for this private repository
  • No Service Tree mapping has been set for this repo. If this team does not use Service Tree, they can also opt-out of providing Service Tree data in the Compliance tab.
  • No repository maintainers are set. The Open Source Maintainers are the decision-makers and actionable owners of the repository, irrespective of administrator permission grants on GitHub.
  • Classification of the repository as production/non-production is missing in the Compliance tab.

You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.

If you no longer need this repository, it might be quickest to delete the repo, too.

GitHub inside Microsoft program information

More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]

FYI: current admins at Microsoft include @drkp, @nelsonje
