mlcommons / power-dev

Dev repo for power measurement for the MLPerf™ benchmarks

Home Page: https://mlcommons.org/en/groups/best-practices-power

License: Apache License 2.0

Shell 2.95% Python 97.05%

power-dev's Introduction

Power

This repo contains the development branch of MLPerf™ power measurement code. Everything is Apache 2.0 code developed by MLCommons™. Access is available to anyone.

MLPerf™ uses the [SPEC PTDaemon] tool to measure power. Please see this README for details on how to use it.

This tutorial demonstrates how to set up power measurement and run MLPerf™ Inference benchmarks with power measurements using the MLCommons CK2/CM framework.

power-dev's People

Contributors

aatangulov, araghun, arjunsuresh, bitfort, dmiskovic-nv, guschmue, lilit1122, manuprasad07, morphine00, nathanw-mlc, nvpohanh, petermattson, pgmpablo157321, psyhtest, radmer, rakshithvasudev, s-idgunji, thekanter, trevor-cockrell, xzfc, ykurlaev

power-dev's Issues

[Manu] PTD fails on consecutive launches

This happens because the serial port is not closed. The problem occurs only with a serial-port connection.

How to reproduce: get the full workload to run once successfully. After the successful run, terminate the server script per the instructions and launch the script again. It will fail.
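
One possible contributing factor, assuming PTDaemon is started as a child process and holds the serial port until it exits, is that the child is still alive when the script is relaunched. The sketch below shows how a child process could be stopped and waited for before a new launch. The helper name `stop_process` and the stand-in child command are hypothetical, and whether this resolves the issue is not confirmed.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Ask a child process to exit, wait for it, and kill it if it hangs."""
    if proc.poll() is None:  # the child is still running
        proc.terminate()
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # last resort if the child ignores terminate()
            proc.wait()
    return proc.returncode


if __name__ == "__main__":
    # Stand-in for the real PTDaemon command line, which is site-specific.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("child exit code:", stop_process(child))
```

Waiting for the child to exit before returning means the operating system has released its file handles, including any serial port, by the time the next launch starts.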

[NTP sync] NTP sync is not working on the server side.

When the server script runs, the NTP sync step fails and always resets the computer's date to June 2015. I've looked at the commands in the common.py script: running them manually in a command window seems to work, but I have no idea why they fail when run as part of the script. I eventually had to run the server-side script without the NTP sync. NTP sync works fine on the client side.

Add test plan.

Test plan.

Client tests

Case 1: Check all client required configuration options.

  1. Run client.py:
    ./client.py
  2. Expected result:
usage: client.py [-h] -a ADDR [-p PORT] -o OUTDIR [-n ADDR] [-l LABEL] -w CMD [-s]
client.py: error: the following arguments are required: -a/--addr, -o/--output, -w/--run-workload

Case 2: Check client help.

  1. Run client --help
  2. Expected result:
python3.7 client.py --help
usage: client.py [-h] -a ADDR -w CMD -L INDIR -o OUTDIR [-p PORT] [-n ADDR]
                 [-l LABEL] [-s] [-f]

PTD client

required arguments:
  -a ADDR, --addr ADDR            server address
  -w CMD, --run-workload CMD      a shell command to run under power
                                  measurement
  -L INDIR, --loadgen-logs INDIR  collect loadgen logs from INDIR
  -o OUTDIR, --output OUTDIR      put logs into OUTDIR (copied from INDIR)

optional arguments:
  -h, --help                      show this help message and exit
  -p PORT, --port PORT            server port, defaults to 4950
  -n ADDR, --ntp ADDR             NTP server address, optional
  -l LABEL, --label LABEL         a label to include into the resulting
                                  directory name
  -s, --send-logs                 send loadgen logs to the server
  -f, --force                     force remove loadgen logs directory (INDIR)
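
The help text above could be produced by an argparse setup along these lines. This is a reconstruction from the Case 2 output only, not the project's actual parser; the function name `build_parser` is hypothetical, and the `int` type for the port is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the options listed in the help output."""
    parser = argparse.ArgumentParser(prog="client.py", description="PTD client")

    required = parser.add_argument_group("required arguments")
    required.add_argument("-a", "--addr", metavar="ADDR", required=True,
                          help="server address")
    required.add_argument("-w", "--run-workload", metavar="CMD", required=True,
                          help="a shell command to run under power measurement")
    required.add_argument("-L", "--loadgen-logs", metavar="INDIR", required=True,
                          help="collect loadgen logs from INDIR")
    required.add_argument("-o", "--output", metavar="OUTDIR", required=True,
                          help="put logs into OUTDIR (copied from INDIR)")

    parser.add_argument("-p", "--port", metavar="PORT", type=int, default=4950,
                        help="server port, defaults to 4950")
    parser.add_argument("-n", "--ntp", metavar="ADDR",
                        help="NTP server address, optional")
    parser.add_argument("-l", "--label", metavar="LABEL",
                        help="a label to include into the resulting directory name")
    parser.add_argument("-s", "--send-logs", action="store_true",
                        help="send loadgen logs to the server")
    parser.add_argument("-f", "--force", action="store_true",
                        help="force remove loadgen logs directory (INDIR)")
    return parser
```

With `required=True`, argparse itself prints the usage line and the "the following arguments are required" error and exits with status 2, which is the behavior Case 1 checks.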

Case 3: Check connection to the server (wrong IP address):

  1. Run client.py:
    ./client.py -w ' cd /home/user/inference/loadgen/mlperf_inference/vision/classification_and_detection && ./run_local.sh tf ssd-mobilenet cpu --scenario Offline' -a 192.168.104. -o ./dir
  2. Expected result: The connection error is handled.
python3 ./client.py -w ' cd /home/user/inference/loadgen/mlperf_inference/vision/classification_and_detection && ./run_local.sh tf ssd-mobilenet cpu --scenario Offline' -a 192.168.104. -o ./dir
client 2021-01-13 20:49:19,639 [CRITICAL] Could not connect to server 192.168.104.:4950 [Errno -2] Name or service not known
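
A minimal sketch of how such a connection failure can be caught and reported, assuming the client opens a plain TCP connection. The function name `connect` and the timeout value are hypothetical; only the default port 4950 and the shape of the log message come from the output above.

```python
import logging
import socket
from typing import Optional

log = logging.getLogger("client")


def connect(host: str, port: int = 4950, timeout: float = 5.0) -> Optional[socket.socket]:
    """Return a connected socket, or None after logging a critical error."""
    try:
        return socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        # OSError covers name-resolution errors (socket.gaierror),
        # refused connections, and timeouts.
        log.critical("Could not connect to server %s:%s %s", host, port, exc)
        return None
```

Catching `OSError` rather than a specific subclass keeps a malformed address such as `192.168.104.` and an unreachable server on the same error path.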

Case 4: Check that an existing output directory is rejected.

  1. Create the $PWD/out folder.
  2. Run:
./client.py \
        --ntp centos.xored.com \
        --send-logs \
        --label 'ssd-mobilenet-tf-offline' \
        --run-workload 'sleep 0.1s' \
        --output "$PWD/out" \
        -a 192.168.104.169
  3. Expected output:
client 2020-12-22 17:19:38,612 [CRITICAL] The output directory 'out' already exists.
client 2020-12-22 17:19:38,612 [CRITICAL] Please remove it or select another directory.
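
The check behind this case can be sketched as follows. The function name `check_output_dir` is hypothetical, and the messages only approximate the log lines above.

```python
import os
import sys


def check_output_dir(path: str) -> bool:
    """Return True if `path` does not exist yet; otherwise report and return False."""
    if os.path.exists(path):
        print(f"The output directory {path!r} already exists.", file=sys.stderr)
        print("Please remove it or select another directory.", file=sys.stderr)
        return False
    return True
```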

Case 5: The workload command fails:

  1. Run client:
sudo python3 ./client.py \
        --ntp centos.xored.com \
        --send-logs \
        --label 'ssd-mobilenet-tf-offline' \
        --run-workload 'sleep(0.1s)' \
        --output "$PWD/out" \
        -a 192.168.104.169
  2. Expected result: The failure is not handled. Useful information may appear in the log messages. Example:
sudo python3 ./client.py --ntp centos.xored.com --send-logs --label 'ssd-mobilenet-tf-offline' --run-workload 'sleep(0.1s)' --output "$PWD/out" -a 192.168.104.169
/bin/sh: 1: Syntax error: word unexpected (expecting ")")
Traceback (most recent call last):
  File "./client.py", line 20, in <module>
    client.main()
  File "/home/julia/project/power/ptd_client_server/lib/client.py", line 181, in main
    subprocess.run(args.run_workload, shell=True, check=True)
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'sleep(0.1s)' returned non-zero exit status 2.
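
The traceback comes from `subprocess.run(..., check=True)`, which raises `CalledProcessError` on a non-zero exit status. If the failure were to be handled instead of propagating, one option is sketched below. The function name `run_workload` and the message text are hypothetical, and this is not the project's current behavior.

```python
import subprocess


def run_workload(cmd: str) -> int:
    """Run a shell command and return its exit status instead of raising."""
    try:
        subprocess.run(cmd, shell=True, check=True)
    except subprocess.CalledProcessError as exc:
        print(f"Workload {cmd!r} failed with exit status {exc.returncode}")
        return exc.returncode
    return 0
```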

Case 6: Two clients try to connect to the server simultaneously.

  1. Start server: python server.py
  2. Start client:
python3 ./client.py \
        --send-logs \
        --label 'ssd-mobilenet-tf-offline' \
        --run-workload 'sleep 40s' \
        --output "$PWD/out" \
  3. Start a second client:
python3 ./client.py \
       --send-logs \
       --label 'ssd-mobilenet-tf-offline' \
       --run-workload 'sleep 40s' \
       --output "$PWD/out1" \
  4. Expected result: The second client waits until the first client finishes. The maximum server queue size is 5. If we start six clients, the sixth client fails with an error:
    client 2021-01-15 16:52:40,808 [CRITICAL] Could not connect to the server 192.168.104.169:4950 [Errno 111] Connection refused
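
The server traceback in the server tests shows that the server is built on Python's `socketserver` module. In that module, `TCPServer.request_queue_size` defaults to 5 and is passed to `listen()` as the backlog, which is consistent with the queue size mentioned above. The sketch below makes the value explicit; the class names `Handler` and `QueueLimitedServer` are hypothetical.

```python
import socketserver


class Handler(socketserver.BaseRequestHandler):
    def handle(self) -> None:
        # Placeholder handler: a real server would speak its own protocol here.
        self.request.sendall(b"ok\n")


class QueueLimitedServer(socketserver.TCPServer):
    # Number of pending connections the OS is asked to queue while one
    # request is being handled; 5 is also the socketserver default.
    request_queue_size = 5
```

The backlog is a hint to the operating system, so the exact number of clients that can wait before one is refused may differ between platforms.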

Server tests

Case 1: Handle absence of server.conf.

  1. Remove or do not create server.conf
  2. Run python server.py
  3. Expected result:
 ➜  ptd_client_server git:(handle_CalledProcessError) ✗ python3.7 server.py
ptd-server 2021-01-13 21:38:03,288 [CRITICAL] Configuration file 'server.conf' does not exist.
➜  ptd_client_server git:(handle_CalledProcessError) ✗ python3.7 server.py -c ololo
ptd-server 2021-01-13 21:38:09,778 [CRITICAL] Configuration file 'ololo' does not exist.

Case 2: Start two servers simultaneously with the same configuration.

  1. Start server.
    python3.7 server.py
  2. Start server again:
    python3.7 server.py
  3. Expected result: The second server will fail.
python3.7 server.py
ptd-server 2021-01-15 18:14:06,441 [WARNING] server.conf: There is no listen option. Server use 0.0.0.0:4950
ptd-server 2021-01-15 18:14:06,441 [WARNING] No NTP server configured. Skipping NTP sync.
Traceback (most recent call last):
  File "server.py", line 20, in <module>
    server.main()
  File "/home/julia/project/power/ptd_client_server/lib/server.py", line 594, in main
    server.handle_connection,
  File "/home/julia/project/power/ptd_client_server/lib/common.py", line 254, in run_server
    with Server((host, port), Handler) as server:
  File "/usr/lib/python3.7/socketserver.py", line 452, in __init__
    self.server_bind()
  File "/usr/lib/python3.7/socketserver.py", line 466, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

Case 3: A mandatory parameter is missing from the configuration file. (Check absence of ptdPort, ptdLogfile, ptdCommand and outDir.)

  1. Run server.py with the following configuration:
    python server.py
[server]
ntpCommand: echo ok
ntpServer:  centos.xored.com
ptdPort: 8888
ptdLogfile: D:\logs_ptdeamon.txt
outDir: D:\work\ptd_server_logs
  2. Expected result: The server doesn't start and prints the following console output. Checking the other options produces the same critical message with the missing option's name.
C:\cygwin64\home\alchi\3\power\ptd_client_server>python server.py
ptd-server 2021-01-13 17:58:00,808 [CRITICAL] server.conf: missing option: 'ptdCommand'

Case 4: An optional parameter is missing from the configuration file. (Check absence of ).

  1. Run server.py with the following configuration: python server.py. Checking the other options produces the same critical message with the missing option's name.
[server]
ptdPort: 8888
ptdCommand: D:\work\spec_ptd-main\PTD\ptd-windows-x86.exe -p 8888 -l D:\logs_ptdeamon.txt -e -y 49 C2PH13047V
ptdLogfile: D:\logs_ptdeamon.txt
outDir: D:\work\ptd_server_logs
  2. Expected result:
    The server starts with the following console output.
C:\cygwin64\home\alchi\3\power\ptd_client_server>python server.py
ptd-server 2021-01-13 18:33:38,657 [WARNING] server.conf: There is no listen option. Server use 0.0.0.0:4950
ptd-server 2021-01-13 18:33:38,657 [INFO] No NTP server configured. Skipping NTP sync.
...

[Test] Start server test

  1. Launch server
  2. Launch client a couple times
  3. Stop server
  4. Launch server with new outdir
  5. Launch client a couple times again
  6. Make sure server and client work fine

Tested. Works fine

Add a delay after workload

Add a 10-second delay after the workload, before stopping the measurement, to capture more tail-end data.
It should not be configurable.
It should apply only in the testing phase.
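
A minimal sketch of the requested behavior, assuming the caller knows whether it is in the testing phase and has a callable that stops the measurement. The names `run_workload_with_tail`, `stop_measurement`, and `testing_phase` are hypothetical; the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import subprocess
import time
from typing import Callable

POST_WORKLOAD_DELAY_S = 10  # fixed on purpose, not exposed as an option


def run_workload_with_tail(
    cmd: str,
    stop_measurement: Callable[[], None],
    testing_phase: bool,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run the workload, wait for tail-end samples, then stop measuring."""
    subprocess.run(cmd, shell=True, check=True)
    if testing_phase:
        # Keep the meter recording a little longer to capture tail-end data.
        sleep(POST_WORKLOAD_DELAY_S)
    stop_measurement()
```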

Send emergency termination message for client if we can not get log from PTD.

If --run-before cannot be executed.
Expected result: Send the message “emergency stop problem with run before” to the server before exiting.

If run-workload cannot be executed.
Expected result: Send the message “emergency stop problem with run workload” to the server before exiting.

If run-after cannot be executed.
Expected result: Send the message “emergency stop problem with run-after” to the server before exiting.
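
The three cases above share one pattern: run a stage, and on failure notify the server before exiting. A sketch under the assumption that each stage is a shell command and that `notify` is some callable that delivers a message to the server. The function names `run_stage` and `run_all` and the `notify` parameter are hypothetical; the message texts are taken from this issue.

```python
import subprocess
from typing import Callable, Optional


def run_stage(name: str, cmd: Optional[str], notify: Callable[[str], None]) -> bool:
    """Run one stage; on failure send an emergency message and return False."""
    if not cmd:
        return True  # the stage is not configured, nothing to run
    try:
        subprocess.run(cmd, shell=True, check=True)
    except (subprocess.CalledProcessError, OSError):
        notify(f"emergency stop problem with {name}")
        return False
    return True


def run_all(before: Optional[str], workload: str, after: Optional[str],
            notify: Callable[[str], None]) -> bool:
    """Run the stages in order and stop at the first failure."""
    stages = (("run before", before), ("run workload", workload), ("run-after", after))
    # all() short-circuits, so later stages are skipped after a failure.
    return all(run_stage(name, cmd, notify) for name, cmd in stages)
```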

Test disconnection with power meter

Test USB cable disconnection

Tests with USB connection.

Possible cases:

  • Pull out the USB cable; turn off the Yokogawa.
  • In all described cases we could not get information from PTD after the connection was restored or the Yokogawa was turned back on. The log contains the following messages:
Time,01-13-2021 15:41:30.548,ERROR,write to WT310 failed with error 0.
Time,01-13-2021 15:41:30.550,Watts,-1.000000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,notset 
Time,01-13-2021 15:41:31.547,WARNING,No valid watts samples for this measurement!
Time,01-13-2021 15:41:54.704,ERROR,write to WT310 failed with error 0.
Time,01-13-2021 15:41:54.705,Watts,-1.000000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,notset
Time,01-13-2021 15:41:55.716,ERROR,write to WT310 failed with error 0.
  • To get correct data from the Yokogawa again, we have to restart PTD.

Test RS-232 connection

  • Case 1. Pull out the RS-232 cable:
    When we pull out the RS-232 cable, we get the following messages in the log file:
Time,01-14-2021 16:13:41.449,Watts,-2.000000,Volts,-2.000000,Amps,-2.000000,PF,-2.000000,Mark,notset
Time,01-14-2021 16:13:44.453,Watts,-1.000000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,notset   
Time,01-14-2021 16:13:44.455,Watts,-2.000000,Volts,-2.000000,Amps,-2.000000,PF,-2.000000,Mark,notset   
Time,01-14-2021 16:13:44.457,Watts,-2.000000,Volts,-2.000000,Amps,-2.000000,PF,-2.000000,Mark,notset   
Time,01-14-2021 16:13:47.458,Watts,-1.000000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,notset   
Time,01-14-2021 16:13:47.460,Watts,-2.000000,Volts,-2.000000,Amps,-2.000000,PF,-2.000000,Mark,notset   
  • Messages in console:
01-14-2021 16:14:23.451: ERROR: read of WT310 failed with error 0.
01-14-2021 16:14:23.452: WARNING: Missed 2 samples
01-14-2021 16:14:26.336: timeout, 0 bytes read:!

  • If we plug the RS-232 cable back in, the connection recovers completely and PTD can get data from the Yokogawa again.

  • Case 2. Turn off Yokogawa.

  • Messages in log file:

Time,01-14-2021 16:26:45.374,Watts,-1.000000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,notset 
Time,01-14-2021 16:26:46.378,Watts,-1.000000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,notset 
Time,01-14-2021 16:26:47.372,Watts,-1.000000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,notset
  • Messages in console:
01-14-2021 16:26:54.375: ERROR: write failed with error 995.
01-14-2021 16:26:54.376: ERROR: write to WT310 failed with error 0.
  • To get correct data from the Yokogawa again, we have to restart PTD.

In cases when the Yokogawa is turned off or unavailable through USB, and we try to start PTD and connect to the Yokogawa over RS-232, we get the same error:


****************************************************************************
***********************************
SPEC PTDaemon Tool
Version 1.9.1-a2d19f26-20190717
***********************************
Licensed Materials - Property of SPEC
Copyright 2006-2019 Standard Performance Evaluation Corporation (SPEC)
All Rights Reserved.
For use with benchmark products from SPEC and authorized organizations only.
****************************************************************************
  Selected power meter 'Yokogawa WT310' from wt310.cpp
Redirecting data output to file D:\logs_ptdeamon.txt
Calculated PTD CRC: 0xa2d19f26, 2017792
01-14-2021 16:06:26.715: Attempting to connect to measurement device type 49...
01-14-2021 16:06:29.629: ERROR: Zero byte count from meter
01-14-2021 16:06:29.630: ERROR: write to Yokogawa WT310 failed with error 0.
01-14-2021 16:06:29.630: ERROR: Invalid device type specified.
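
The sample lines in the logs above follow a `Time,<timestamp>,Watts,<value>,Volts,<value>,...` layout, and the disconnected cases show -1 or -2 in every field. A sketch of a parser that flags such samples is given below. Treating a negative Watts value as "no valid reading" is an assumption drawn from these logs, not from PTDaemon documentation, and the names `parse_sample` and `is_valid` are hypothetical.

```python
from typing import Dict, Optional

FIELDS = ("Watts", "Volts", "Amps", "PF")


def parse_sample(line: str) -> Optional[Dict[str, float]]:
    """Parse one sample line; return None for ERROR/WARNING or other lines."""
    parts = [p.strip() for p in line.strip().split(",")]
    if len(parts) < 4 or parts[0] != "Time" or parts[2] != "Watts":
        return None
    # After the timestamp, the line alternates field names and values.
    pairs = dict(zip(parts[2::2], parts[3::2]))
    return {name: float(pairs[name]) for name in FIELDS if name in pairs}


def is_valid(sample: Dict[str, float]) -> bool:
    """Assumption: a negative Watts value means the meter gave no reading."""
    return sample.get("Watts", -1.0) >= 0
```

Counting consecutive invalid samples with these helpers would be one way to detect a disconnected meter early, instead of discovering it after the run.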

Drop client config file

After #8, client options can be passed both as CLI arguments and through the config file.
We can probably drop the config file now, keeping only the command-line options.
(closes #18)

Check parameters from server config.

If mandatory parameters are missing, we should print an error message and exit. If optional parameters are missing, we should log a warning message.
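
A sketch of such a check using `configparser`, which accepts the `key: value` syntax used in the server.conf examples in the test plan. The mandatory names come from server test case 3. The optional names, the `ConfigError` class, and the function name `load_server_config` are assumptions made for illustration.

```python
import configparser
import logging
from typing import Dict

log = logging.getLogger("ptd-server")

MANDATORY = ("ptdPort", "ptdLogfile", "ptdCommand", "outDir")
OPTIONAL = ("ntpServer", "listen")  # assumed set of optional options


class ConfigError(Exception):
    """Raised when the configuration cannot be used."""


def load_server_config(text: str) -> Dict[str, str]:
    # interpolation=None keeps '%' in Windows paths or commands literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(text)
    if not parser.has_section("server"):
        raise ConfigError("missing section: 'server'")
    section = parser["server"]
    for name in MANDATORY:
        if name not in section:
            raise ConfigError(f"missing option: {name!r}")
    for name in OPTIONAL:
        if name not in section:
            log.warning("There is no %s option.", name)
    return dict(section)
```

A caller would catch `ConfigError`, log it at the critical level, and exit, while the warnings for optional parameters let startup continue.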

Add documentation

  • Where logs are stored, how to configure it, etc. #45
  • Network failures and keep-alive behavior. #45
  • Sequence diagram. #47

Add ctrl+C handling

Expected result: The server and client should exit after receiving Ctrl+C without exceptions or extra log messages.

Test cases:

  • Start server. Ctrl+C.
  • Start server. Start client. Start PTD. Ctrl+C.
  • Start server. Start client. Wait for the message "SR,V,300". Ctrl+C.
  • Start server. Start client. Wait for the message "SR,A,{maxAmps}".
  • Ctrl+C during the "push-log" and "get-log" commands.
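
In Python, Ctrl+C arrives as a `KeyboardInterrupt` in the main thread. A sketch of a wrapper that turns it into a clean exit and always runs cleanup is shown below. The names `run_until_interrupted`, `serve`, and `cleanup` are hypothetical, and the exit status 130 follows the common shell convention for termination by SIGINT rather than anything specified in this issue.

```python
from typing import Callable


def run_until_interrupted(serve: Callable[[], None], cleanup: Callable[[], None]) -> int:
    """Run `serve` until it returns or Ctrl+C is pressed; always run `cleanup`."""
    try:
        serve()
    except KeyboardInterrupt:
        # No traceback and no extra log output: just report the interrupt
        # through the exit status.
        return 130
    finally:
        cleanup()  # e.g. stop PTD and close sockets
    return 0
```

A script would call `sys.exit(run_until_interrupted(...))` at the top level so that every test case above ends on the same path.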

Fix OSError: [Errno 98] Address already in use.

Steps to reproduce (on a Linux server):

  1. Change allow_reuse_address to False.
  2. Connect to server by telnet.
  3. Interrupt server by Ctrl+C.
  4. Try to start server again.

As a result we get [Errno 98] Address already in use, and the server does not start.


Steps to reproduce (on a Windows server):

  1. Do not change anything in the code.
  2. Start two servers on the same port.

Expected result: the second server should exit with an "address already in use" error message.
Actual result: two servers run simultaneously on the same port.


We need a solution that has none of the described issues on both Windows and Linux.
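
One commonly used approach, sketched here as a possibility and not as the fix the project adopted, is to choose the socket option per platform: `SO_REUSEADDR` on Linux, so a restart is not blocked by connections lingering from the previous run, and `SO_EXCLUSIVEADDRUSE` on Windows, so a second server cannot bind the same port. The class name `ExclusiveTCPServer` is hypothetical, and the Windows branch has not been verified here.

```python
import socket
import socketserver
import sys


class ExclusiveTCPServer(socketserver.TCPServer):
    # Disable the built-in handling; the option is set per platform below.
    allow_reuse_address = False

    def server_bind(self) -> None:
        if sys.platform == "win32":
            # On Windows, SO_REUSEADDR would let a second process bind the
            # same port, so request exclusive use instead.
            self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_EXCLUSIVEADDRUSE, 1)
        else:
            # On Linux, SO_REUSEADDR allows a quick restart but still
            # rejects a bind while another socket is listening on the port.
            self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        super().server_bind()
```

With this in place, starting a second server on the same port should fail with an "address already in use" error on both platforms, while restarting after Ctrl+C should succeed on Linux.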
