flux-framework / flux-docs
Documentation for the Flux-Framework
Home Page: https://flux-framework.readthedocs.io/
License: Other
There are a few sections in the contributing page where there should seemingly be bulleted lists, but there aren't.
We need the "elevator pitch" for Flux in our docs.
What is Flux? What problems is it trying to solve? Why is it better than the "competition"?
We have lots of research papers that we can probably copy/paste from.
We have several RFCs and command man-pages that document these specifics, but the information isn't all in one place. It would be nice to have a top-level doc page that gives the high-level details of Flux jobs, and then provides pointers to the RFCs/manpages for more nitty-gritty details.
We probably can't do this type of accumulation for everything within Flux, but jobs are an essential enough concept to warrant it.
Per our recent TOSS4 phone call, we will need a guide that documents how users can port their scripts from `sbatch` to `flux mini batch`.
CC @dongahn
Context: flux-framework/flux-core#3003
Good questions to bootstrap the page:

- How to set `rundir` so that the content store is not saved to `/tmp` but to a filesystem
- `flux mini submit`ting a script with `flux mini run`s will not result in Slurm-style jobsteps unless you stick `flux start` in front of the jobscript in `flux mini submit`
- The `-o mpi=` option when running/submitting jobs
- `flux mini` (e.g., `memory` or `storage`)?
- Once `flux run` is merged, we can point to that documentation

Just noticed while navigating the rtd site that a couple section titles could use improvement:
Are there any materials from the ECP AM tutorial that are not already in the tutorials or examples repos? Slides? We should gather them in this repo, probably on a page listing past events.
For long code-blocks, like our recommended `jsrun` line on CORAL systems, it is annoying to try to highlight the whole thing to copy-and-paste into a terminal. A button, like Github supports, would make this much easier. Seems this is already supported via a sphinx extension.
I can give this a try and add it to #102 if it is easy.
Caught several spelling mistakes in a recent PR despite the CI returning all green. We probably need to add an explicit github action that runs `make check`.
Depending on the system you are running on and the locale settings, the printing of Flux Job IDs in F58 format can result in weird boxes, underscores, mojibake, or complete corruption of the terminal state. For example:
```
$ flux mini submit hostname
_Kg1PZgns
```

or

```
$ flux mini submit hostname
Kg1PZgns
```
This issue appears on systems with no locale information set (i.e., `LC_ALL`, `LC_CTYPE`, and `LANG` are all unset and the values printed by `locale` are all implied). When no locale information is set, python 3.7+ automatically coerces the locale to `C.UTF-8`.
The preferred solution is to set `LC_ALL` to something (either `C`, `POSIX`, or `*.UTF-8`). Try adding the following to your shell profile script (e.g., `~/.profile`): `export LC_ALL="C.UTF-8"`. If your terminal does not support unicode, try just `export LC_ALL=C`. Ultimately anything that isn't empty should work (`C` will just prevent the use of unicode characters).
In scenarios where you cannot change your locale (i.e., it must be `*.UTF-8`), but you want the Flux IDs in F58 to be printed with the ascii `f`, you should set `FLUX_F58_FORCE_ASCII=1`.
Per @gonsie's request, a list of all the repos relevant to documentation
It would be good to have some documentation on the main modules and their interactions.
Including:
`job-manager` is the "hub", and "job-ingest", "depend", "sched", and "exec" are all spokes.
I put together some diagrams in PowerPoint that might be a good starting point:
Flux-Architectural-Diagrams.pptx
Flux-Architectural-Diagrams.pdf
For the system instance, we will need backfilling policies instead of FCFS, as well as other ways to tune the qmanager parameters. Minimal TOML support is there for qmanager, and we just need to document some key parameters in the admin guide.
After the most recent release (i.e., in current master), the flux module command lost the --rank argument, and the flux job list command moved from being a "porcelain" command to being a "plumbing" command. It has been replaced by "flux jobs".
The quickstart.rst should be updated to reflect these changes.
Without this, a user will get `flux-broker: zsecurity_comms_init: The directory '$HOME/.flux' does not exist. Have you run "flux keygen"?`
Our contributing document requires a `subsystem:` prefix in the git commit subjects for proposed PRs, but doesn't explain what that means. Furthermore, while we've adhered very well to including these prefixes in our work to date, as @chu11 noted in flux-framework/flux-core#2790, we've not been consistent in the naming used in these prefixes.
If we're going to be rigorous about requiring subsystem prefixes, we should probably document why we want them, as well as what an acceptable subsystem prefix is.
Here's my take, 100% good old fashioned opinion:
Since we're not using the commit subject prefix in any kind of automated way, I think of it as a hint at what parts of the code the commit touches, so that relevant developers can quickly determine if they will have the expertise for a code review and/or approval. In that case something like `job-info:` or `python:` seems fine, while `modules/job-info:` or `bindings/python:` are also ok, as long as those 8-9 characters are not needed to create a terse commit description.
Since concise commit subjects are highly valued, I try to pick the shortest name that fully qualifies the code being modified, unless the commit message still fits within 50 characters with the longer prefix. Therefore, instead of `cmd/flux-jobs:`, the subject could be just `flux-jobs:` since there is no other `modules/flux-jobs` or `libutil/flux-jobs`, etc. -- it is obvious what "subsystem" I am referencing.
Of course, I may be missing some other purpose for the subsystem prefix. In which case we can discuss that here and come up with some rules to place in our contributing doc to assist future contributors.
As discovered in flux-framework/flux-core#3375
Need to copy over our Lassen/Sierra specific notes from confluence into here.
Would also serve as a good starting point for Summit-specific docs.
`flux-sched` has a ton of doxygen-based in-line documentation designed to help developers extend our Fluxion graph scheduler. For a C++-based project (or even a high-level object-oriented language like Python), this style of documentation is fairly common.
As a test, I have a doxygen configuration for our fluxion `resource` infrastructure piece at https://github.com/flux-framework/flux-sched/blob/master/resource/doxygen/doxy_conf.txt. What would be a feasible step we could take to see if it is possible to incorporate this into the overall readthedocs?
If I manually build the doxygen document out of it, I get something like the following.
If this test goes well, I can add a top level doxygen configuration spanning both resource and qmanager components.
I remember @trws said this would be possible with some tweaks. And as I remember, he (and @SteVwonder) would have other in-line docs for the Python bindings.
Once flux-accounting `v0.10.0` is released, it would be a good idea for me to add some official documentation on building/installing flux-accounting and setting it up on a machine. This should include instructions on some of the major components of flux-accounting, which include:
and any other important notes while setting up the project.
Its location in the Flux docs perhaps fits best in the Admin Guide.
If you attempt to launch an MPI process using a wrapper, it may fail. For example, if you run `flux mini run my_script.py mpi_app.exe`, where `my_script.py` `Popen`s `mpi_app.exe`, the MPI init will fail, even if it is just a single-rank MPI application. Same thing for `flux mini run totalview mpi_app.exe`. The issue is that Python's Popen and Totalview close file descriptors before launching the child process. MPI uses the `PMI_FD` file descriptor to communicate with Flux in order to bootstrap.
If you are attempting to debug your MPI application with Totalview, follow these instructions: https://flux-framework.readthedocs.io/en/latest/debugging.html#parallel-debugging-using-totalview
If you are attempting to wrap your MPI application with a Python script and `Popen`, then make sure to pass `close_fds=False` to `Popen`: https://docs.python.org/3/library/subprocess.html#subprocess.Popen
Post-mortem of a recent coffee discussion
Problems:
Proposed Solution: The rfc repo would remain a submodule because it is not tagged/versioned, so we can freely update the submodule reference every time a PR is merged into rfc's master. flux-core and flux-sched would be RTD subprojects of flux-docs. That way, all of the various namespaces and search indexes are shared, but the various projects would still be able to host their own docs that are tagged in the same way the projects are. flux-docs would be a parent project that ties everything together, but could now be tagged independently of the other projects.
Open Questions: What is the best way to reference flux-core/flux-sched material from flux-docs? RTD's subprojects feature seems to take care of this with the `/projects/<name>` addition to the URL, but does this break sphinx? Is there an easy way to reference the latest tagged version of a subproject in RTD or RST (or is `latest` as close as we can get)?
I'm sure there are a bunch of other questions and details to work out that we haven't thought of yet.
Tagging @gonsie since our discussion last Friday spawned this whole side discussion.
Summoning the great and wonderful wizard, @gonsie.
Per flux-framework/flux-core#2915, job submission fails without pre-starting munge. We should make that note in the docker quick start instructions as a stop-gap until flux-framework/flux-core#2919 is closed.
Right now we have the user run `lstopo | grep -i coproc`, which is an OK test, but it is possible A) for the right binary to be in your PATH but the wrong library to be found by `ld`, or B) for the wrong binary to be in your PATH but the right library to be found by `ld`. The latter is the case on Lassen, where the module file only updated the LD_LIBRARY_PATH. We should just test the intended behavior directly with something like `flux start flux resource list`.
Try setting the FLUX_PMI_DEBUG=1 variable when bootstrapping Flux:

```
FLUX_PMI_DEBUG=1 srun flux start
```

If you are running on a Cray system, you may be required to set the path to the Slurm PMI library:

```
PMI_LIBRARY=/path/to/slurm/libpmi.so srun flux start
```
As @Larofeticus pointed out, most (all?) of our examples involve working with Flux in an interactive manner. In particular, they all use `salloc` to grab a set of nodes and then invoke Flux commands interactively. It would be instructive to have an example where we create a script that bootstraps Flux (and invokes a Flux initial program) and submit that script with `sbatch` to show how the whole workflow would work in batch mode.
Just wondering if the following virtual environment command may be preferable to the one in the README.md? I understand virtualenv is now built into python3 (as `venv`). On an Ubuntu 18.04 LTS system, `python` is actually python2, so running the commands in the README.md resulted in a python 2 environment. I figured that was going to end badly, and in backing up, discovered this alternate method:

```
$ python3.6 -m venv --system-site-packages env
```
Also, maybe it would be better to tell people to use `_env` or `_sphinx_env`, add it to a "clean" or "veryclean" make target, and to .gitignore to underscore that it's not part of the project?
Relevant github action in the marketplace: https://github.com/marketplace/actions/sphinx-build
See bottom of PR #7 for discussion. Could possibly be combined with the "Developer Guidelines" section in contributing.rst (which has content like "commit etiquette"), or it could be a standalone file.
As discussed at the 11AM meeting on 7/16, we need a high-level text on how to parallel debug a flux job. A good time will be after Perforce Software's support engineer has had a chance to poke at a Flux version on one of the LC TOSS clusters. I will coordinate.
In the meantime, should this text go into https://flux-framework.readthedocs.io/en/latest/quickstart.html or somewhere else?
Carryover from #32
Did some googling and it looks like RobotPy has a nice readthedocs site with a nice integration of various subprojects. We could probably emulate them. One key technology that they reference is the intersphinx module.
I just added the rfc repo as a readthedocs subproject: flux-framework.readthedocs.io/projects/flux-rfc/en/latest/README.html. I’m not sure about how to correctly link between the projects, but eventually I’d like to do something similar to the robotpy project for generating the sidebar.
There are a few places in the docs where example commands for launching flux under Slurm include a `--mpibind=off` flag. This flag refers to an LLNL-specific plugin and will lead to errors for users at other sites.
Examples:
https://flux-framework.readthedocs.io/en/latest/batch.html#launching-flux-in-slurm-s-batch-mode
https://flux-framework.readthedocs.io/en/latest/batch.html#fluxion-scheduler
- `cd t && make check TESTS=t0000-testname.t`
- `debug=t FLUX_TESTS_LOGFILE=t make check`
- `cd $REPO_ROOT && ./scripts/pylint && ./scripts/format`
- `$REPO_ROOT/scripts/check-format` as a git pre-commit hook

Any other useful tips to include?
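One way to wire up the pre-commit tip above (a sketch; it assumes `scripts/check-format` exists and is executable in your clone):

```shell
# Symlink the repo's check-format script as a git pre-commit hook so the
# format check runs automatically before every commit.
REPO_ROOT=$(git rev-parse --show-toplevel)
ln -sf "$REPO_ROOT/scripts/check-format" "$REPO_ROOT/.git/hooks/pre-commit"
```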
I think we'll need a guide for system administrators when flux is deployed as the system resource manager. We were talking about getting some feedback from sys admins on our current set of tools. Perhaps an admin guide would help with that, and could document known shortcomings in our early releases.
A straw man outline to start the discussion:
See discussion at bottom of issue #1. See https://github.com/mfem/mfem/blob/master/CONTRIBUTING.md#code-overview as example.
Caveats on what will work and what won't when using a version of python that isn't the one Flux was configured against
In a ☕ hour call, @garlick suggested that the complexity of the composite futures is a bit much to fully explain in a manpage and that it would be good to have a lengthier writeup on composite futures in ReST form, with inline code examples and the whole shebang.
For Summit (and Lassen, depending on if you use lalloc or not), the default behavior is to allocate compute nodes and a launch node for every job. The batch script is then run on the launch node, and any use of the system parallel launcher `jsrun` to execute Flux results in all of the Flux processes landing on the compute nodes and none on the launch node. If a user captures the `local://` FLUX_URI and tries to use that from the launch node, it will fail, since the Unix socket is on the compute nodes, not the launch node. This behavior is different from typical clusters, where launch nodes are not used.
As noted in #25, example output in the quickstart guide uses real usernames, and thus usernames may not match across all examples.
One idea would be to paste all example output from the docker image, in which case I think the username would end up being `flux` (if that's not confusing enough).
With the latest PMI fixes upstreamed to OpenMPI and then Spectrum, our Flux-specific Spectrum module should no longer be required.
This should include details about hwloc needing to be compiled against NVML and OpenCL, the check for whether GPU detection in Flux worked (#71), and that `CUDA_DEVICE_ORDER` should be set if launching with a system launcher like `srun` or `jsrun`.
Other project contributing documents to reference:
We have a CLI for poking at job-specific information stored within the KVS: `job info`. But @Larofeticus pointed out that having some documentation about where to find job-specific information within the KVS might also be helpful. To start, maybe we can make a table of various workflow/job-relevant information, where it is stored in the KVS, if/where it is stored in the environment, and maybe even where you used to be able to find it in Slurm, etc.
Data that comes to mind:
From a recent sched ops syncup meeting, @ryanday36 requested details about multiple queues be added to the sys admin guide.
Topics to include:
Flux will automatically set the CPU affinity and set `CUDA_VISIBLE_DEVICES` based on the cores and GPUs allocated to a job. If you are launching multiple tasks in a job, then you may be interested in the shell options "cpu-affinity" and "gpu-affinity".
If you launch 2 tasks with `flux mini run -n2 -N1` or `flux mini run -n2 -N1 -o cpu-affinity=on -o gpu-affinity=on`, both tasks/processes will see the same 2 cores and GPUs. If you launch 2 tasks with `flux mini run -n2 -o cpu-affinity=per-task -o gpu-affinity=per-task`, then each task will only see its own unique core and GPU. If you launch 2 tasks with `flux mini run -n2 -o cpu-affinity=off -o gpu-affinity=off`, then each task/process will see everything on the entire node.
Note: You can easily test and inspect the effects of various affinity policies using `lstopo --restrict binding` as the job task (e.g., `flux mini run -n2 -N1 -o cpu-affinity=per-task lstopo --restrict binding`).
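The effect of per-task binding can also be illustrated outside of Flux with plain Linux tools (a rough analogy only, not Flux itself): a process restricted to a single core reports only one available CPU.

```shell
# Rough analogy using util-linux's taskset (Linux only): bind a command to
# core 0 and observe that it "sees" one CPU, much like cpu-affinity=per-task.
taskset -c 0 nproc   # prints 1
```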