isabella232 / azhpc-diagnostics

This project forked from azure/azhpc-diagnostics

Scripts that run on Azure VMs and gather a variety of diagnostic information to debug common issues with VMs, GPUs, and Infiniband.

License: MIT License

Shell 89.88% PowerShell 10.12%

azhpc-diagnostics's Introduction

Build Status

VM Size | OS Version | Status Badge
HB60 | CentOS 8.1 | Build Status
HB60 | CentOS 7.6 | Build Status
HB60 | CentOS 7.7 | Build Status
H16r | CentOS 7.4 | Build Status
NC24rs_v3 | Ubuntu 18.04 | Build Status
ND40rs_v2 | Ubuntu 18.04 | Build Status
NV48s_v3 | Ubuntu 18.04 | Build Status

Overview

This repo holds a script that, when run on an Azure VM, gathers diagnostic information for troubleshooting common HPC, Infiniband, and GPU problems. It runs a suite of diagnostic tools ranging from built-in Linux tools like lscpu to vendor-specific CLIs like nvidia-smi. The results are packaged into a tarball that can be shared with support engineers to speed up troubleshooting.

If you are reading this, you are likely troubleshooting problems on an Azure HPC VM. We suggest that you contact support if you have not already, and run this tool on your VM so that you can provide the output to support engineers when prompted.

If you have special privacy requirements concerning logs leaving your VM, make sure to open up the tarball and redact any sensitive information before re-tarring it and handing it off to support engineers.
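The unpack, redact, and re-tar steps can be sketched as a small shell function. This is only an illustration: the function name `redact_tarball`, the `REDACTED` marker, and the example pattern are assumptions and not part of this repo, and the sketch assumes GNU grep and GNU sed. A pattern-based pass does not replace a manual review of the logs.

```shell
#!/bin/sh
# redact_tarball IN OUT PATTERN   (hypothetical helper, not part of this repo)
# Unpacks IN, replaces every match of the extended regex PATTERN with
# REDACTED in all text files, and packs the result into OUT.
# Assumes GNU grep and GNU sed. PATTERN must not contain '#', which is
# used as the sed delimiter below.
redact_tarball() {
    in_tar=$1
    out_tar=$2
    pattern=$3
    workdir=$(mktemp -d) || return 1

    tar -xzf "$in_tar" -C "$workdir" || return 1

    # -I skips binary files such as nvidia-debugdump.zip
    grep -rlIE -e "$pattern" "$workdir" | while IFS= read -r file; do
        sed -E -i "s#$pattern#REDACTED#g" "$file"
    done

    tar -czf "$out_tar" -C "$workdir" . || return 1
    rm -rf "$workdir"
}

# Example: mask anything that looks like an IPv4 address.
# redact_tarball in.tar.gz redacted.tar.gz '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
```

Writing the result to a new file keeps the original tarball intact, so the two can be compared before anything is sent to support.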

Warning

This tool is meant for diagnosing inactive systems. It runs benchmarks that stress system devices such as memory, GPU, and Infiniband, so it will degrade the performance of, or otherwise interfere with, any other active processes that use these resources. Do not run this tool on systems where other jobs are currently running.

To stop the tool while it is running, interrupt the process (for example, with Ctrl-C). This forces it to reset system state and terminate.

Install and Run

After cloning this repo, no further installation is required. To run the script, run the following command, replacing {repo-root} with the name of this repo's directory on your VM:

sudo bash {repo-root}/Linux/src/gather_azhpc_vm_diagnostics.sh

Usage

This section describes the output of the script and the configuration options available.

Options

Option (Short) | Option (Long) | Parameters | Description | Example | Example Description
-d | --dir | Directory Name | Specify custom output location | --dir=. | Put the tarball in the current directory
-V | --version | | Display version information and exit | --version | Outputs 0.0.1
-h | --help | | Display help text | -h | Outputs the help message
-v | --verbose | | Verbose output | --verbose | Enables more verbose terminal output
 | --gpu-level | 1 (default), 2, or 3 | GPU diagnostics run-level | --gpu-level=3 | Sets dcgmi run-level to 3
 | --mem-level | 0 (default) or 1 | Memory diagnostics run-level | --mem-level=1 | Enables stream benchmark test
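As a sketch of how the options in the table combine, the snippet below assembles one invocation and prints it for review rather than running it. The `REPO_ROOT` default is a placeholder for wherever the repo was cloned, and the variable names are illustrative only.

```shell
#!/bin/sh
# REPO_ROOT is a placeholder; point it at your clone of this repo.
REPO_ROOT="${REPO_ROOT:-$HOME/azhpc-diagnostics}"
SCRIPT="$REPO_ROOT/Linux/src/gather_azhpc_vm_diagnostics.sh"

# Put the tarball in the current directory, set the dcgmi run-level to 3,
# enable the stream benchmark test, and print more verbose terminal output.
OPTS="--dir=. --gpu-level=3 --mem-level=1 --verbose"

# Print the command for review; copy it into the terminal to run it.
CMD="sudo bash $SCRIPT $OPTS"
echo "$CMD"
```

Leaving out `--gpu-level` and `--mem-level` falls back to the defaults listed in the table (1 and 0).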

Tarball Structure

Note that not all of these files will be generated on every run. What appears below is the union of all files that could be generated; which ones actually appear depends on the script parameters and the VM size:

{vm-id}.{timestamp}.tar.gz
|-- general.log (logs for the tool itself)
|-- VM
|   -- dmesg.log
|   -- metadata.json
|   -- waagent.log
|   -- lspci.txt
|   -- lsvmbus.log
|   -- ipconfig.txt
|   -- sysctl.txt
|   -- uname.txt
|   -- dmidecode.txt
|   -- syslog
|-- CPU
|   -- lscpu.txt
|-- Memory
|   -- stream.txt
|-- Infiniband
|   -- ib-vmext.log
|   -- ibstat.txt
|   -- ibv_devinfo.txt
|   -- pkey0.txt
|   -- pkey1.txt
|-- Nvidia
    -- nvidia-vmext.log
    -- nvidia-smi.txt (human-readable)
    -- nvidia-debugdump.zip (only Nvidia can read)
    -- dcgm-diag-2.log
    -- dcgm-diag-3.log
    -- nvvs.log
    -- stats_*.json
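
Because the set of files varies from run to run, it can help to list what a given tarball contains before unpacking it. The following is a minimal sketch using standard tar options; the function names are hypothetical and the tarball name is the placeholder from the layout above.

```shell
#!/bin/sh
# List every file stored in a diagnostics tarball.
list_diag_files() {
    tar -tzf "$1"
}

# Print one member of the tarball to stdout without unpacking the rest.
# $2 must match a path exactly as printed by list_diag_files.
show_diag_file() {
    tar -xzOf "$1" "$2"
}

# Example (substitute the real tarball name and a path from the listing):
# list_diag_files "./{vm-id}.{timestamp}.tar.gz"
# show_diag_file  "./{vm-id}.{timestamp}.tar.gz" CPU/lscpu.txt
```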

Diagnostic Tools Table

Tool | Command | Output File(s) | Description | EULA
dmesg | dmesg | VM/dmesg.log | Dump of kernel ring buffer |
syslog | syslog | VM/syslog | Dump of system log |
Azure IMDS | curl http://169.254.169.254/metadata/... | VM/metadata.json | VM metadata (ID, region, OS image, etc.) |
Azure VM Agent | cp /var/log/waagent.log | VM/waagent.log | Logs from the Azure VM Agent |
lspci | lspci | VM/lspci.txt | Info on installed PCI devices |
lsvmbus | lsvmbus | VM/lsvmbus.log | Displays devices attached to the Hyper-V VMBus |
ipconfig | ipconfig | VM/ipconfig.txt | Checking TCP/IP configuration |
sysctl | sysctl | VM/sysctl.txt | Checking kernel parameters |
uname | uname | VM/uname.txt | Checking system information |
dmidecode | dmidecode | VM/dmidecode.txt | DMI table dump (info on hardware components) |
lscpu | lscpu | CPU/lscpu.txt | Information about the system CPU architecture |
stream | stream_zen_double | Memory/stream.txt | The stream benchmark suite (AMD only) | STREAM License
ibstat | ibstat | Infiniband/ibstat.txt | Mellanox OFED command for checking Infiniband status | MOFED End-User Agreement
ibv_devinfo | ibv_devinfo | Infiniband/ibv_devinfo.txt | Mellanox OFED command for checking Infiniband device info | MOFED End-User Agreement
Partition Key | cp /sys/.../pkeys/... | Infiniband/pkey0.txt, Infiniband/pkey1.txt | Checks the configured Infiniband partition key |
Infiniband Driver Extension Logs | cp /var/log/azure/ib-vmext-status | Infiniband/ib-vmext-status | Logs from the Infiniband Driver Extension |
NVIDIA System Management Interface | nvidia-smi | Nvidia/nvidia-smi.txt | Checks GPU health and configuration | CUDA EULA, GRID EULA
NVIDIA Debug Dump | nvidia-debugdump | Nvidia/nvidia-debugdump.zip | Generates a binary blob for use with Nvidia internal engineering tools | CUDA EULA, GRID EULA
NVIDIA Data Center GPU Manager | dcgmi | Nvidia/dcgm-diag-2.log, Nvidia/dcgm-diag-3.log, Nvidia/nvvs.log, Nvidia/stats_*.json | Health monitoring for GPUs in cluster environments | DCGM EULA
GPU Driver Extension Logs | cp /var/log/azure/nvidia-vmext-status | Nvidia/nvidia-vmext-status | Logs from the GPU Driver Extension |
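
Which of these tools are installed depends on the VM size and image. The sketch below reports which commands from the table are on the PATH of the current VM. The `check_tools` helper is hypothetical and is not how the script itself decides what to run.

```shell
#!/bin/sh
# Report which of the given commands are available on this VM.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
        fi
    done
}

# Command names taken from the Command column of the table above.
check_tools dmesg lspci lsvmbus sysctl uname dmidecode lscpu \
    ibstat ibv_devinfo nvidia-smi nvidia-debugdump dcgmi
```

A missing tool here is one likely reason the matching output file is absent from a tarball.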

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

azhpc-diagnostics's People

Contributors

jithinjosepkl, microsoftopensource, sakshamgupta006, tlcyr4
