nordichpc / sonar

Tool to profile usage of HPC resources by regularly probing processes using ps.

License: GNU General Public License v3.0

Languages: Rust 92.71%, Shell 5.60%, Makefile 0.07%, C 1.62%
Topics: cluster, hpc, monitoring, profiling, usage

Contributors: bast, lars-t-hansen


sonar's Issues

Why support both file names and stdout?

I would like to write to either stdout or to files but not support both. Do we need both? I think we get simpler code and fewer surprises if we allow either/or. I am not too happy that "-" means stdout. I can do the code change but wanted to check before I go in with the wrecking ball.

Log fatal errors

See #51. We should detect unrecoverable error situations that prevent monitoring from working, and we should arguably log them to syslog or some other standard medium where they will be seen.
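
A rough sketch of what the logging side could look like, here simply forwarding the message to syslog via the standard logger(1) utility in addition to printing it; the log_fatal helper and the "sonar" tag are illustrative, not existing sonar code:

use std::process::Command;

// Hypothetical helper: forward an unrecoverable error to syslog via logger(1),
// in addition to whatever sonar prints itself.
fn log_fatal(msg: &str) {
    // -t sets the tag, -p the facility/priority.
    let _ = Command::new("logger")
        .args(["-t", "sonar", "-p", "user.err", msg])
        .status();
    eprintln!("sonar: fatal: {msg}");
}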

Log communications volume

The context is NAICNO/Jobanalyzer#6. For multi-node jobs we want analysis to be able to flag poor use of communication. The complexity of this task - which I suspect is considerable - can be pushed into the analysis code but sonar would need to capture data about communication performed by each job on each node: volume and conduit / interface. It's OK if these data are cumulative or since-last-probe, so long as we know which.

(This is not yet high priority.)

Make output fields self-identifying?

As we experiment with sonar, add more fields, and move it to new systems, there will come a time when some log data are older and some are newer, and data of different ages will have different columns. For a while it will be good enough to always add new columns at the end of the line, but in the long term this will likely become brittle. It may be good to add an identifier to each field to reduce the impact of this problem. The identifier can be a short name prefixed to the field; for example,

date=...,host=...,cores=...,user=...,job=...,cmd=...,cpu%=...,kib=...,gpus=...,gpu%=...,gpumem%=...,gpukib=...

would do it for now. Names (columns) could be removed if they turn out to be worthless but names would never be reused with a different meaning.
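
A rough sketch of the emitting side, using the field names from the example above; the format_record helper is illustrative only (values containing commas would still need the laundering discussed in the CSV issue further down):

// Hypothetical sketch: emit each field as name=value so columns can be
// added or removed without breaking older consumers.
fn format_record(fields: &[(&str, String)]) -> String {
    fields
        .iter()
        .map(|(name, value)| format!("{name}={value}"))
        .collect::<Vec<_>>()
        .join(",")
}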

Maintain monotonicity of timestamps even in the face of some failures

For the background, see NAICNO/Jobanalyzer#63, where timestamps reflected the time of reporting rather than the time of measurement, and where measurements ordered by timestamp were not ordered by true time. The root cause was a system problem (the program was hung waiting for disk, and many instances were queued up and ran out of order, generating out-of-order timestamps), but we can mitigate it.

It might be to our advantage to obtain the timestamp for the output records very early, so that if jobs are backed up for system reasons, an output record with a lower timestamp has older measurements than a record with a higher timestamp. (The order of the records in the output file does not matter, at least not to sonalyze, but internal consistency matters.) This probably means obtaining the timestamp in main() even before initializing the logger. This might still not be early enough if the system is stuck waiting for disk, but it'll likely be an improvement.
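
A rough sketch of the ordering, assuming the chrono crate (the timestamps in the log are RFC 3339); the details are illustrative:

use chrono::Utc;

fn main() {
    // Take the timestamp before doing anything else, so that even if this
    // invocation is delayed later (logger setup, disk waits), the recorded
    // time is as close as possible to the time of measurement.
    let timestamp = Utc::now().to_rfc3339();
    // ... initialize the logger, parse arguments, collect samples ...
    // and pass `timestamp` down to the code that writes output records.
    let _ = timestamp;
}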

Add GPU usage statistics

nvidia-smi and rocm-smi can display the compute/memory use of GPUs broken down by host system PIDs; this information can be meaningful for GPU-bound jobs (and we would want it for the JobAnalyzer). Possibly we could just add the GPU compute and memory to the output of the ps command, if those programs are present to run; we'd likely want to sum across all GPUs on the same node (discuss) and just have the two additional fields.

Read data directly from GPU APIs

Related to #86. Currently we run nvidia-smi and rocm-smi to obtain GPU data. This is bad for several reasons:

  • the output formats are idiosyncratic, not documented, not stable
  • sometimes we have to run commands multiple times to get all data we need

Much better would probably be to use the programmatic APIs for the cards.

On the other hand, needing to link against these C libraries adds to the complexity of sonar and creates a situation where the same sonar binary may not be usable on all systems. A compromise would be to create small (probably C) programs that wrap the programmatic APIs and are invoked from sonar. These would need to be run only once and would have a defined and compact output format.

Sonar is slow without --batchless because of how we get slurm job IDs

Running a sonar release build just now on a lightly-loaded ML node (ml7, a beefy AMD system), it runs in 0.27s real time with --batchless and in 2.5s real time without --batchless (about 10x). The difference is even more stark on my development system (a slightly older Xeon tower): 0.03s vs 1.63s (about 50x).

I run with --exclude-users=root --exclude-system-jobs --rollup to keep the amount of output to a minimum, so that we can know it's not output generation that's the main problem.

Running perf on this, it is clear that the problem is in get_slurm_job_id: every profiling hit in the first several pages of profiler output is in the pipeline run by that function to get the job ID. We can probably do much better here (and we'll need to).
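
A rough sketch of the kind of direct /proc read that could replace the pipeline; the helper is illustrative, not existing sonar code:

use std::fs;

// Illustrative sketch: read the slurm job ID straight from the process's
// cgroup file instead of spawning a cat | grep | head pipeline per process.
fn slurm_job_id(pid: u32) -> Option<String> {
    let cgroups = fs::read_to_string(format!("/proc/{pid}/cgroup")).ok()?;
    for line in cgroups.lines() {
        // Look for a path component of the form ".../job_<id>/..."
        if let Some(start) = line.find("job_") {
            let tail = &line[start + "job_".len()..];
            let end = tail.find('/').unwrap_or(tail.len());
            return Some(tail[..end].to_string());
        }
    }
    None
}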

Running `date` three times to compute the output file name is risky

The shell script here runs date three times to generate a file name. But if the first invocation happens on yyyy-12-31 and the second on the following yyyz-01-01, this will compute the wrong name; likewise if the day changes between the second and third invocation, yielding yyyy-mm-31 and yyyy-mn-01. Even though this is unlikely, date should ideally be run just once.

Ideas for "sonar analyze"

  • sum up by process and identify most used processes
  • map processes to actual codes that we recognize, taking the mappings in data/ as a starting point
  • later we will extend the mappings
  • allow users to run this to see a history of their own processes
  • the output can be like a ranking (top 10 or 20 most used codes)
  • for the most used codes/processes we can then look at how they are used (what memory and CPU footprint)
  • later: since we have the slurm job ID, we can also compare what the job asked for vs. what the job used (either the user wants to know on their own, or we want to know for jobs that consume lots of CPU/mem -> advanced user support)

big picture goals:

  • have data instead of anecdotes about how the system is used
  • input for procurement benchmarks
  • identify resource usage problems which will generate interesting support projects

Processes with "," in name break the CSV output

I created a program called "com,ma". This breaks the CSV output; note the last line:

[larstha@deathstar sonar]$ target/debug/sonar ps | grep larstha
2023-06-12T07:17:10.545212265+00:00,deathstar,16,larstha,0,Isolated,8.9,1539092
2023-06-12T07:17:10.545212265+00:00,deathstar,16,larstha,0,Web,2.4,38828
2023-06-12T07:17:10.545212265+00:00,deathstar,16,larstha,0,emacs,0.5,105808
2023-06-12T07:17:10.545212265+00:00,deathstar,16,larstha,0,WebExtensions,0.6,180132
2023-06-12T07:17:10.545212265+00:00,deathstar,16,larstha,0,firefox,10.4,699380
2023-06-12T07:17:10.545212265+00:00,deathstar,16,larstha,0,gnome-terminal-,0.6,69720
2023-06-12T07:17:10.545212265+00:00,deathstar,16,larstha,0,gnome-shell,4,217748
2023-06-12T07:17:10.545212265+00:00,deathstar,16,larstha,0,com,ma,0.5,340

In general, we should probably launder host names, user names, and process names to make sure this does not happen. Host names and user names are mostly controlled and mostly trustworthy, but we don't want process names to break the tool; you never know what might happen.
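
One possible laundering approach is CSV-style quoting; the csv_quote helper below is an illustrative sketch, not existing sonar code (simply replacing the offending characters would be another option):

// Illustrative sketch: quote a field if it contains a comma, a quote, or a
// newline, so that a process named "com,ma" cannot break the CSV record.
fn csv_quote(field: &str) -> String {
    if field.contains(',') || field.contains('"') || field.contains('\n') {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}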

Print raw cpu time

Currently sonar logs the pcpu field from the ps output, but this is a pretty tricky number. It is not a sample: it is the ratio of the process's consumed CPU time to the elapsed time since the process started, if I read the manual page correctly. As time moves forward, it will take ever greater changes in process behavior to move this number at all. (If a process sits around doing nothing for 24 hours and then runs at 100% for an hour, this number will only move from 0% to about 4%.) I think that what we instead want to log is the cputimes field, which is the consumed CPU time up to that point. Given two consecutive log records we can then say something about how busy the process was during the last sampling interval, and it will be meaningful to look at averages, trends, and so on. Some of these are core use cases for NAICNO/Jobanalyzer.
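
For illustration, this is the kind of derived value a consumer could compute from two consecutive records if the cumulative CPU time were logged; the names are illustrative:

// Illustrative: given two consecutive records for the same process, the
// consumer can compute a per-interval utilization (in %) from cumulative
// CPU time:  cpu_util = (cputime_now - cputime_prev) / (t_now - t_prev)
fn interval_cpu_util(cputime_prev_s: f64, cputime_now_s: f64,
                     t_prev_s: f64, t_now_s: f64) -> f64 {
    100.0 * (cputime_now_s - cputime_prev_s) / (t_now_s - t_prev_s)
}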

(Given enough precision in the output it may be possible to compute the desired value from the pcpu value, but ps prints only one digit after the decimal point and this is unlikely to be very good in practice.)

Once we have support for named fields in the sonar output (#41) we can easily add this field, and it shouldn't take any extra effort during sampling to generate the data.

Also see NAICNO/Jobanalyzer#27.

Record also Slurm job ID

It seems that this is possible if we know the pid, which we do.

$ cat /proc/12345/cgroup | grep -oP '(?<=job_).*?(?=/)' | head -n 1

Add an option to filter by command name

Looking at the logs produced by the new code, they look pretty good, but one thing stands out: because the CPU time of a terminated process is now accounted to its parent, it may be useful, for the sake of reducing log volume, to allow filtering by command name. The purpose would be to remove sshd / bash / zsh / systemd / tmux processes from the listing; they are not likely to be interesting even though, for the reason given, they pass the other filters.

Currently I run some tests with

--exclude-system-jobs --min-cpu-time=60 --batchless --rollup

and "shell" programs account for 2/3 of all output lines on ML8. This is probably not representative for a batch system but it still seems like it'd be useful to do the filtering.

Print the error when a command fails

I've seen some command failures on a local system, and it would be useful to print the error itself, not just a simple message that the command failed. This is related to #52 also.

Improve error reporting with more detail

See #92 for some background: there are error codes such as "Hung" (but really most of them) that can have multiple root causes. We should include these root causes in the error message for smoother diagnosis of problems, or we should have more error codes. (The latter complicates the API, though.)

Read data directly from /proc

Right now we're running ps but there are good reasons not to do so:

  • we need to run a subprocess, which has to parse its command line, figure out what to do, obtain the data, and generate textual output, which we then have to read and re-parse
  • we need a lot of data from the subprocess to generate a process tree, which makes the previous problem quite a lot worse

By reading data from /proc/PID/status (mostly) we would streamline this process and avoid unnecessary work.
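
A rough sketch of reading a couple of fields straight from /proc/<pid>/status (Name and VmRSS are real fields there; the helper itself is illustrative, not existing sonar code):

use std::fs;

// Illustrative sketch: pull the command name and resident set size straight
// from /proc/<pid>/status instead of parsing ps output.
fn name_and_rss_kib(pid: u32) -> Option<(String, u64)> {
    let status = fs::read_to_string(format!("/proc/{pid}/status")).ok()?;
    let mut name = None;
    let mut rss_kib = None;
    for line in status.lines() {
        if let Some(v) = line.strip_prefix("Name:") {
            name = Some(v.trim().to_string());
        } else if let Some(v) = line.strip_prefix("VmRSS:") {
            // The value looks like "  123456 kB".
            rss_kib = v.split_whitespace().next().and_then(|s| s.parse().ok());
        }
    }
    Some((name?, rss_kib?))
}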

Bump the version number for other things than file format changes

As we're getting closer to deploying sonar in earnest - testing so far proves that it has real utility - it will be helpful for us to update the version number more frequently, so that we can more easily track the version that's being used. For example, the recent upgrade to reading data from /proc warranted such a change (though we did not make it). I guess we'll have to figure out some sensible policies for when to bump the version.

Intermittent failures on busy systems - 2s timeout may be too low

On some of our heavily loaded ML nodes I see intermittent failures in running the nvidia-smi command. The error (here on ml8, which has been overly busy since the end of vacation) is

[2023-08-17T03:00:04Z ERROR sonar::ps] GPU (Nvidia) process listing failed: "Hung(\"COMMAND:\\nnvidia-smi pmon -c 1 -s mu\")"

which is either in response to a timeout or a SIGTERM exit. It would be useful to distinguish these, so that's one tweak to implement. But it is likely a timeout. It's possible that the timeout should be longer, or that the default 2s should be overridable on the command line to allow sonar to adapt more easily to its environment.
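
A rough sketch of an overridable timeout, assuming the clap derive API (clap is already a dependency, per the Rust version issue below); the flag name is hypothetical:

use clap::Parser;

#[derive(Parser)]
struct PsOptions {
    // Hypothetical flag: seconds to wait for a subprocess such as
    // nvidia-smi before giving up; keeps today's 2s as the default.
    #[arg(long, default_value_t = 2)]
    command_timeout: u64,
}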

Use exec, not shell, to run subprograms

See #86 and #87 also; full fixes for those make the following point moot.

Sonar uses the shell to run subprograms. This forks an intermediate shell, with all its baggage, that then runs the subprogram for us. This is not really necessary (it adds overhead); we could use fork+exec or similar functionality to invoke the subprograms directly.
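
A rough sketch using std::process::Command, which does the fork+exec for us; the nvidia-smi pmon invocation is the one quoted elsewhere in this list:

use std::process::Command;

// Run the subprogram directly instead of "sh -c '...'"; no intermediate shell.
fn run_nvidia_pmon() -> std::io::Result<std::process::Output> {
    Command::new("nvidia-smi")
        .args(["pmon", "-c", "1", "-s", "mu"])
        .output()
}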

For `gpus` implement "unknown"

The data are currently represented as a HashSet; the output is gpus= (an empty set) for "none", and "unknown" is not implemented.
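
One possible representation, as an illustrative sketch rather than existing code:

use std::collections::HashSet;

// Illustrative: distinguish "we could not determine the GPUs" from
// "the job uses no GPUs" and from a known set of devices.
enum GpuSet {
    Unknown,               // could be printed as gpus=unknown
    Empty,                 // printed as gpus= (empty), as today
    Devices(HashSet<u32>), // printed as a device list
}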

Update author information

Just to document that I know that this is outdated:

  • list of authors
  • SPDX identifiers on top of files
  • CITATION.cff file

Job filtering should be post-accumulation, and must be more subtle

At the moment, sonar filters processes and jobs early based on the raw ps output: if the cpu percentage of the process itself is below a threshold or its current memory usage is below a threshold, then the record is discarded. This is iffy for several reasons:

  • multiple processes can be accumulated into a job and in total they may create a load that is above the cutoffs, yet none of those processes will be considered
  • for --batchless in particular, but I suspect even in general, the root process of a job may not be doing much but it may well be that it should be present to represent the job as a whole
  • the cpu percentage is, as we've seen in #68, not a high-quality number (and in particular it is not a sample)
  • filtering, even when based on good samples such as mem_percentage, creates a log that has holes, which can be surprising to log consumers: jobs will appear to have stopped and then restarted, and there can be confusion about whether a job ID seen after such a hole is the "same" job or a "new" job

I suspect that we should do multiple things here:

  • filtering should be applied to the accumulated job records, if at all
  • filtering might be disabled altogether for --batchless
  • filtering by cpu% probably needs to be removed, but cputime is not much of a substitute since that is a cumulative value and only the consumer of the log can reconstruct a sample-ish value for it

It's possible this is a good time to discuss why we're doing filtering and what we're really interested in getting rid of. Looking at my own logs, I'd be more interested in getting rid of jobs by root and zabbix than user processes, for example.

pid is a unique key for results from the subprocesses

For ps, nvidia-smi, and rocm-smi (and in general for all future uses like these) the pid will be a unique key (for a given pid there is only ever a single user and a single command, at least per invocation of ps etc). In ps.rs we still use the tuple (user, pid, command) as the key for various processing steps. This is not necessary and obscures what happens. It's probably just an effect of an older design that has yet to be cleaned up completely.

Log GPU status

A reality of the GPUs is that they fail intermittently, either together or separately. The GPU modules of sonar can (to a point) detect this, perform a slightly deeper analysis at that point, and add that analysis to the sonar log; this will likely be helpful in running a continuous health check. The complexity for sonar as a whole is minimal: just a field devoted to GPU health, which can be absent by default, meaning "all's well".

Just now I have a card failing on our ML2 node, and as a result the GPU data are not collected because nvidia-smi in its default mode reports that one of the cards could not be reached and bails out. But digging slightly deeper we see it's just the one card

$ nvidia-smi -L
Unable to determine the device handle for gpu 0000:18:00.0: Unknown Error
GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-f0f3ae96-fe93-f5c8-4295-bf657e5a25da)
GPU 2: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-99c4f278-11da-052d-ac5c-3490b58ade04)
GPU 3: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-3c748ed6-c6b6-6ef0-ad7d-fe109cff16bc)

Packaging this information in the log somehow ("gpufail=") is probably useful. This is not high priority.

Large output is a reality, and killing the subprocess is not a workaround

Basically, running ps on "big enough" systems or asking for the unfiltered ps output will create enough output to make the pipe fill up and the process hang, in turn causing the subprocess to be killed by the timeout in command.rs: #50 (comment).

There are really two bugs here. One is that if the subprocess is killed there should be some report up the call chain so that whoever started the process can take evasive action, log an error, etc. (In general there are probably good reasons why we should try to catch errors high up and log them.) The other is that we need to fix the hangs caused by too-large output.
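
One way to avoid the hang is to drain the child's stdout on a helper thread while waiting, so the pipe can never fill up; this is an illustrative sketch, not the existing command.rs code, and the timeout handling is elided:

use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;

// Illustrative sketch: drain stdout on a helper thread so a large amount of
// output cannot fill the pipe and block the child; the caller still applies
// its own timeout around wait().
fn run_and_capture(mut cmd: Command) -> std::io::Result<(std::process::ExitStatus, String)> {
    let mut child = cmd.stdout(Stdio::piped()).spawn()?;
    let mut stdout = child.stdout.take().expect("stdout was piped");
    let reader = thread::spawn(move || {
        let mut buf = String::new();
        let _ = stdout.read_to_string(&mut buf);
        buf
    });
    let status = child.wait()?; // timeout handling elided
    let output = reader.join().unwrap_or_default();
    Ok((status, output))
}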

Define, implement, and document defaults for values where it makes sense

On that topic, a useful optimization for the log writer is the use of default values that are agreed between the logger and the ingestor. The default value for gpus would be "none", and in that case the field sensibly does not appear in the output at all. For several other fields a sensible default is zero, and if the value is zero we could (optionally, of course) omit the field. The ingestor will have to deal with missing, obsoleted, and newly-introduced fields anyway, so this adds very little complexity.

Following up from #66.

Process names containing spaces are truncated

Firefox creates a bunch of processes called "Isolated Web Process". sonar shows these as just "Isolated", which is unfortunate information loss. It would probably be good to fix this: the space could be shown literally (process names could be quoted), as _, or with some sort of escape. It looks like the underlying problem is the split_whitespace() call in ps.rs.
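
An illustrative sketch of a split that keeps the remainder of the line intact as the command, assuming the command is the last field of the ps line; the helper and the field count are hypothetical:

// Illustrative sketch: split off the first `n` whitespace-separated fields
// and keep the untouched remainder as the command, so a name like
// "Isolated Web Process" is not cut at the first space.
fn split_leading_fields(line: &str, n: usize) -> Option<(Vec<&str>, &str)> {
    let mut rest = line.trim_start();
    let mut fields = Vec::with_capacity(n);
    for _ in 0..n {
        let end = rest.find(char::is_whitespace)?;
        fields.push(&rest[..end]);
        rest = rest[end..].trim_start();
    }
    Some((fields, rest.trim_end()))
}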

Add `ps` test cases

The ps command is now almost perfectly parameterized: all it takes is for us to pass the output stream as a parameter instead of always going to stdout. With that in mind, it should be easy to construct synthetic test cases for the command so that bugs like #93 can be avoided in the future.
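
An illustrative sketch of the kind of test this enables once the output stream is a parameter; create_snapshot and the column count are hypothetical stand-ins for the real ps code:

use std::io::Write;

// Hypothetical signature once the output stream is a parameter.
fn create_snapshot<W: Write>(out: &mut W) -> std::io::Result<()> {
    // ... gather process data and write CSV records to `out` ...
    writeln!(out, "2023-06-12T07:17:10+00:00,host,16,user,0,some-cmd,0.5,340")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn output_is_well_formed_csv() {
        let mut buf: Vec<u8> = Vec::new();
        create_snapshot(&mut buf).unwrap();
        let text = String::from_utf8(buf).unwrap();
        for line in text.lines() {
            assert_eq!(line.split(',').count(), 8); // fixed column count in this sketch
        }
    }
}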

Unlikely peak CPU usage for jobs with --batchless and subjobs

Here's a funny one. The fifth field (26268) is peak CPU usage, corresponding to 262 cores going at 100%:

957985	joachipo  0d 0h35m  3284 	26268 	1    	1     	0    	0     	0       	0        	ml3   bash

This is on a system with 64 cores, so it's clearly wrong: the max peak is 6400.

The reason this happens is an artifact of how --batchless jobs are handled by sonar. In this case, processes below a session leader are taken to be root processes for individual jobs (which could be trees of processes); typically the shell is a session leader. So if somebody runs five (say) heavy python processes (or process trees) in sequence from the shell, each of those is a job, as desired. Because of how we collect the cputime_sec field as the sum of the process's own CPU time and that of all terminated children, however, the CPU time for all those jobs is eventually accumulated in the shell. It can thus happen that when a long-running job terminates the shell receives a tremendous increase in accumulated CPU time out of nowhere within the time window between two samples. Since the analysis code computes cpu utilization as (delta cpu usage)/(delta time), the usage can become very high, leading to the artifact shown above.

This is not a big deal at this time, as it pertains only to the shell and its relation to jobs below it, and really we don't care about the shells for analysis, so we can just sweep these artifacts under the carpet. But it really comes back to how our aggregating data collection leads us into tight corners. We chose to use the bsdtime field of the ps output along with the --cumulative switch to collect self+children time, and this is right for individual jobs, but it breaks down in this weird cross-job situation.

There are a couple of possible fixes, none of them urgent (details need to be worked out in both cases):

  • for processes that are session leaders, collect only self time, not self+children
  • do not aggregate data the way we do but instead present it in a way that lets the consumer aggregate according to need, for example, by a sonar record containing a list of processes with the self cpu time

Rust version requirement

We currently have rust 1.60 as the latest version on some of our HPC systems. clap 4.3.3 requires anstyle 1.0.0 which requires rust 1.64. I'll just hack around it for now but in the longer term we should probably figure out what version of rust is both conducive to development velocity and likely to be available where sonar must run.

job_id is possibly a unique key for job information, but "it's complicated"

Follow-on from #76.

Currently the key for the "job" table in ps.rs is (user, job_id, command). It's not completely clear why the user name or command are desirable here.

On the one hand, the job id should be a unique key for the job.

On the other hand: for a given job there could be many processes each with their own command, and with the command being part of the key each will have its own line in the sonar log. This may be what we want (discuss), since it gives postprocessing a chance to see this. (The alternative would be for the sonar log to be more complex and contain multiple commands.) More far-fetched, each of those processes could have its own effective UID, so there could be multiple users, too.

On the third hand: it's a little strange that sonar would roll up multiple python processes into one record but distinguish this from, say, a python3.9 process belonging to the same job. In this case, it would be more natural for sonar to produce a record for each process, all of them tagged with the same job number.

This may become a discussion about what sonar should do, and what the postprocessor should do.

Generalize the job ID

At the moment, sonar is tied to "slurm job ID" (which defaults to 0 if it can't be computed). Sonar should also be able to run on systems that don't have a job queue (and maybe on systems with a job queue other than slurm). The job ID idea could be generalized; the appropriate ID for the system could be computed by an appropriate function. It may be that the best way to control this is with a flag to sonar ps, or with a compile-time configuration thing, instead of trying to detect the system type (discuss).
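
One way to express the generalization would be a small provider trait with one implementation per environment; this is an illustrative sketch, not a proposal for concrete names:

// Illustrative sketch: one job-ID provider per kind of system, selected by a
// flag to `sonar ps` or at build time.
trait JobIdProvider {
    // Return the job ID for a process, or 0 if it cannot be determined.
    fn job_id_of(&self, pid: u32) -> u64;
}

struct SlurmJobs;     // reads the slurm job from /proc/<pid>/cgroup, as today
struct BatchlessJobs; // could, e.g., derive a synthetic ID from the process tree

impl JobIdProvider for SlurmJobs {
    fn job_id_of(&self, _pid: u32) -> u64 { 0 /* real lookup elided */ }
}

impl JobIdProvider for BatchlessJobs {
    fn job_id_of(&self, _pid: u32) -> u64 { 0 /* real lookup elided */ }
}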
