flux-framework / flux-docs
Documentation for the Flux-Framework
Home Page: https://flux-framework.readthedocs.io/
License: Other
There are a few sections in the contributing page where there should seemingly be bulleted lists, but there aren't.
We need the "elevator pitch" for Flux in our docs.
What is Flux? What problems is it trying to solve? Why is it better than the "competition"?
We have lots of research papers that we can probably copy/paste from.
We have several RFCs and command man-pages that document these specifics, but the information isn't all in one place. It would be nice to have a top-level doc page that gives the high-level details of Flux jobs, and then provides pointers to the RFCs/manpages for more nitty-gritty details.
We probably can't do this type of accumulation for everything within Flux, but jobs are an essential enough concept to warrant it.
Per our recent TOSS4 phone call, we will need a guide that documents how users can port their scripts from `sbatch` to `flux mini batch`.
CC @dongahn
Context: flux-framework/flux-core#3003
Good questions to bootstrap the page:

- How to set `rundir` so that the content store is not saved to `/tmp` but to a filesystem
- `flux mini submit`ting a script with `flux mini run`s will not result in Slurm-style jobsteps unless you stick `flux start` in front of the jobscript in `flux mini submit`
- The `-o mpi=` option when running/submitting jobs
- `flux mini` (e.g., `memory` or `storage`)?
- Once `flux run` is merged, we can point to that documentation

Just noticed while navigating the rtd site that a couple section titles could use improvement:
Are there any materials from the ECP AM tutorial that are not already in the tutorials or examples repos? Slides? We should gather them in this repo, probably on a page listing past events.
For long code-blocks, like our recommended `jsrun` line on CORAL systems, it is annoying to try to highlight the whole thing to copy-and-paste into a terminal. A button, like Github supports, would make this much easier. Seems this is already supported via a sphinx extension.
I can give this a try and add it to #102 if it is easy.
Caught several spelling mistakes in a recent PR despite the CI returning all green. We probably need to add an explicit github action that runs `make check`.
Depending on the system you are running on and the locale settings, the printing of Flux Job IDs in F58 format can result in weird boxes, underscores, mojibake, or complete corruption of the terminal state. For example:
```
$ flux mini submit hostname
_Kg1PZgns
```

or

```
$ flux mini submit hostname
Kg1PZgns
```
This issue appears on systems with no locale information set (i.e., `LC_ALL`, `LC_CTYPE`, and `LANG` are all unset and the values printed by `locale` are all implied). When no locale information is set, python 3.7+ automatically coerces the locale to `C.UTF-8`.
The preferred solution is to set `LC_ALL` to something (either `C`, `POSIX`, or `*.UTF-8`). Try adding the following to your shell profile script (e.g., `~/.profile`): `export LC_ALL="C.UTF-8"`. If your terminal does not support unicode, try just `export LC_ALL=C`. Ultimately anything that isn't empty should work (`C` will just prevent the use of unicode characters).
In scenarios where you cannot change your locale (i.e., it must be `*.UTF-8`), but you want the Flux IDs in F58 to be printed with the ascii `f`, you should set `FLUX_F58_FORCE_ASCII=1`.
Per @gonsie's request, a list of all the repos relevant to documentation
It would be good to have some documentation on the main modules and their interactions.
Including:
`job-manager` is the "hub", and "job-ingest", "depend", "sched", and "exec" are all spokes.
I put together some diagrams in PowerPoint that might be a good starting point:
Flux-Architectural-Diagrams.pptx
Flux-Architectural-Diagrams.pdf
For the system instance, we will need backfilling policies instead of FCFS, as well as other ways to tune the qmanager parameters. Minimal TOML support is there for qmanager, and we just need to document some key parameters in the admin guide.
After the most recent release (i.e., in current master), the flux module command lost the --rank argument, and the flux job list command moved from being a "porcelain" command to being a "plumbing" command. It has been replaced by "flux jobs".
The quickstart.rst should be updated to reflect these changes.
Without this, a user will get `flux-broker: zsecurity_comms_init: The directory '$HOME/.flux' does not exist. Have you run "flux keygen"?`
Our contributing document requires a `subsystem:` prefix in the git commit subjects for proposed PRs, but doesn't explain what that means. Furthermore, while we've adhered very well to including these prefixes in our work to date, as @chu11 noted in flux-framework/flux-core#2790, we've not been consistent in the naming used in these prefixes.
If we're going to be rigorous about requiring subsystem prefixes, we should probably document why we want them, as well as what an acceptable subsystem prefix is.
Here's my take, 100% good old fashioned opinion:
Since we're not using the commit subject prefix in any kind of automated way, I think of it as a hint at what parts of the code the commit touches, so that relevant developers can quickly determine if they will have the expertise for a code review and/or approval. In that case something like `job-info:` or `python:` seems fine, while `modules/job-info:` or `bindings/python:` are also ok, as long as those 8-9 characters are not needed to create a terse commit description.
Since concise commit subjects are highly valued, I try to pick the shortest name that fully qualifies the code being modified, unless the commit message still fits within 50 characters with the longer prefix. Therefore, instead of `cmd/flux-jobs:`, the subject could be just `flux-jobs:` since there is no other `modules/flux-jobs` or `libutil/flux-jobs`, etc. -- it is obvious what "subsystem" I am referencing.
Of course, I may be missing some other purpose for the subsystem prefix. In which case we can discuss that here and come up with some rules to place in our contributing doc to assist future contributors.
As discovered in flux-framework/flux-core#3375
Need to copy over our Lassen/Sierra specific notes from confluence into here.
Would also serve as a good starting point for Summit-specific docs.
`flux-sched` has a ton of doxygen-based in-line documentation designed to help developers extend our Fluxion graph scheduler. For a C++-based project (or even a high-level object-oriented language like Python), this style of documentation is fairly common.
As a test, I have a doxygen configuration for our fluxion `resource` infrastructure piece at https://github.com/flux-framework/flux-sched/blob/master/resource/doxygen/doxy_conf.txt. What would be a feasible step we could take to see if it is possible to incorporate this into the overall readthedocs?
If I manually build the doxygen document out of it, I get something like the following.
If this test goes well, I can add a top level doxygen configuration spanning both resource and qmanager components.
I remember @trws said this would be possible with some tweaks. And as I remember, he (and @SteVwonder) would have other in-line docs for the Python bindings.
Once flux-accounting `v0.10.0` is released, it would be a good idea for me to add some official documentation on building/installing flux-accounting and setting it up on a machine. This should include instructions on some of the major components of flux-accounting, which include:
and any other important notes while setting up the project.
Its location in the Flux docs perhaps fits best in the Admin Guide.
If you attempt to launch an MPI process using a wrapper, it may fail. For example, if you run `flux mini run my_script.py mpi_app.exe`, where `my_script.py` `Popen`s `mpi_app.exe`, the MPI init will fail, even if it is just a single-rank MPI application. Same thing for `flux mini run totalview mpi_app.exe`. The issue is that Python's Popen and Totalview close file descriptors before launching the child process. MPI uses the `PMI_FD` file descriptor to communicate with Flux in order to bootstrap.
If you are attempting to debug your MPI application with Totalview, follow these instructions: https://flux-framework.readthedocs.io/en/latest/debugging.html#parallel-debugging-using-totalview
If you are attempting to wrap your MPI application with a Python script and `Popen`, then make sure to pass `close_fds=False` to `Popen`: https://docs.python.org/3/library/subprocess.html#subprocess.Popen
Post-mortem of a recent coffee discussion
Problems:
Proposed Solution: The rfc repo would remain a submodule because it is not tagged/versioned, so we can freely update the submodule reference every time a PR is merged into rfc's master. flux-core and flux-sched would be RTD subprojects of flux-docs. That way, all of the various namespaces and search indexes are shared, but the various projects would still be able to host their own docs that are tagged in the same way the projects are. flux-docs would be a parent project that ties everything together, but could now be tagged independently of the other projects.
Open Questions: What is the best way to reference flux-core/flux-sched material from flux-docs? RTD's subprojects feature seems to take care of this with the `/projects/<name>` addition to the URL, but does this break sphinx? Is there an easy way to reference the latest tagged version of a subproject in RTD or RST (or is `latest` as close as we can get)?
I'm sure there are a bunch of other questions and details to work out that we haven't thought of yet.
Tagging @gonsie since our discussion last Friday spawned this whole side discussion.
Summoning the great and wonderful wizard, @gonsie.
Per flux-framework/flux-core#2915, job submission fails without pre-starting munge. We should make that note in the docker quick start instructions as a stop-gap until flux-framework/flux-core#2919 is closed.
Right now we have the user run `lstopo | grep -i coproc`, which is an OK test, but it is possible A) for the right binary to be in your PATH but the wrong library to be found by `ld`, or B) for the wrong binary to be in your PATH but the right library to be found by `ld`. The latter is the case on Lassen, where the module file only updated the LD_LIBRARY_PATH. We should just test the intended behavior directly with something like `flux start flux resource list`.
Try setting the FLUX_PMI_DEBUG=1 variable when bootstrapping Flux:

```
FLUX_PMI_DEBUG=1 srun flux start
```

If you are running on a Cray system, you may be required to set the path to the Slurm PMI library:

```
PMI_LIBRARY=/path/to/slurm/libpmi.so srun flux start
```
As @Larofeticus pointed out, most (all?) of our examples involve working with Flux in an interactive manner. In particular, they all use `salloc` to grab a set of nodes and then invoke Flux commands interactively. It would be instructive to have an example where we create a script that bootstraps Flux (and invokes a Flux initial program) and submit that script with `sbatch` to show how the whole workflow would work in batch mode.
Just wondering if the following virtual environment command may be preferable to the one in the README.md? I understand virtualenv is now built into python3 (as `venv`). On an Ubuntu 18.04 LTS system, `python` is actually python2, so running the commands in the README.md resulted in a python 2 environment. I figured that was going to end badly, and in backing up, discovered this alternate method:

```
$ python3.6 -m venv --system-site-packages env
```
Also, maybe it would be better to tell people to use `_env` or `_sphinx_env`, add it to a "clean" or "veryclean" make target, and to .gitignore to underscore that it's not part of the project?
Relevant github action in the marketplace: https://github.com/marketplace/actions/sphinx-build
See bottom of PR #7 for discussion. Could possibly be combined with the "Developer Guidelines" section in contributing.rst (which has content like "commit etiquette"), or it could be a standalone file.
As discussed at the 11AM meeting on 7/16, we need a high-level text on how to parallel debug a flux job. A good time will be after Perforce Software's support engineer has had a chance to poke at a Flux version on one of the LC TOSS clusters. I will coordinate.
In the meantime, should this text go into https://flux-framework.readthedocs.io/en/latest/quickstart.html or somewhere else?
Carryover from #32
Did some googling and it looks like RobotPy has a nice readthedocs site with a nice integration of various subprojects. We could probably emulate them. One key technology that they reference is the intersphinx module.
I just added the rfc repo as a readthedocs subproject: flux-framework.readthedocs.io/projects/flux-rfc/en/latest/README.html. I’m not sure about how to correctly link between the projects, but eventually I’d like to do something similar to the robotpy project for generating the sidebar.
There are a few places in the docs where example commands for launching flux under Slurm include a `--mpibind=off` flag. This flag refers to an LLNL-specific plugin and will lead to errors for users at other sites.
Examples:
https://flux-framework.readthedocs.io/en/latest/batch.html#launching-flux-in-slurm-s-batch-mode
https://flux-framework.readthedocs.io/en/latest/batch.html#fluxion-scheduler
- `cd t && make check TESTS=t0000-testname.t`
- `debug=t FLUX_TESTS_LOGFILE=t make check`
- `cd $REPO_ROOT && ./scripts/pylint && ./scripts/format`
- `$REPO_ROOT/scripts/check-format` as a git pre-commit hook

Any other useful tips to include?
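One way to wire up the pre-commit tip above (a sketch; it assumes `scripts/check-format` exists and is executable in your clone):

```shell
# Symlink the repo's check-format script as a git pre-commit hook so the
# format check runs automatically before every commit.
REPO_ROOT=$(git rev-parse --show-toplevel)
ln -sf "$REPO_ROOT/scripts/check-format" "$REPO_ROOT/.git/hooks/pre-commit"
```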
I think we'll need a guide for system administrators when flux is deployed as the system resource manager. We were talking about getting some feedback from sys admins on our current set of tools. Perhaps an admin guide would help with that, and could document known shortcomings in our early releases.
A straw man outline to start the discussion:
See discussion at bottom of issue #1. See https://github.com/mfem/mfem/blob/master/CONTRIBUTING.md#code-overview as example.
Caveats on what will work and what won't when using a version of python that isn't the one Flux was configured against
In a ☕ hour call, @garlick suggested that the complexity of the composite futures is a bit much to fully explain in a manpage and that it would be good to have a lengthier writeup on composite futures in ReST form, with inline code examples and the whole shebang.
For Summit (and Lassen, depending on if you use lalloc or not), the default behavior is to allocate compute nodes and a launch node for every job. The batch script is then run on the launch node, and any use of the system parallel launcher `jsrun` to execute Flux results in all of the Flux processes landing on the compute nodes and none on the launch node. If a user captures the `local://` FLUX_URI and tries to use that from the launch node, it will fail, since the Unix socket is on the compute nodes, not the launch node. This behavior is different from typical clusters, where launch nodes are not used.
As noted in #25, example output in the quickstart guide uses real usernames, and thus usernames may not match across all examples.
One idea would be to paste all example output from the docker image, in which case I think the username would end up being `flux` (if that's not confusing enough).
With the latest PMI fixes upstreamed to OpenMPI and then Spectrum, our Flux-specific Spectrum module should no longer be required.
This should include details about hwloc needing to be compiled against NVML and OpenCL, the check for whether GPU detection in Flux worked (#71), and that `CUDA_DEVICE_ORDER` should be set if launching with a system launcher like `srun` or `jsrun`.
Other project contributing documents to reference:
We have a CLI for poking at job-specific information stored within the KVS: `job info`. But @Larofeticus pointed out that having some documentation about where to find job-specific information within the KVS might also be helpful. To start, maybe we can make a table of various workflow/job-relevant information, where it is stored in the KVS, if/where it is stored in the environment, and maybe even where you used to be able to find it in Slurm, etc.
Data that comes to mind:
From a recent sched ops syncup meeting, @ryanday36 requested details about multiple queues be added to the sys admin guide.
Topics to include:
Flux will automatically set the CPU affinity and set `CUDA_VISIBLE_DEVICES` based on the cores and GPUs allocated to a job. If you are launching multiple tasks in a job, then you may be interested in the shell options "cpu-affinity" and "gpu-affinity".
If you launch 2 tasks with `flux mini run -n2 -N1` or `flux mini run -n2 -N1 -o cpu-affinity=on -o gpu-affinity=on`, both tasks/processes will see the same 2 cores and GPUs. If you launch 2 tasks with `flux mini run -n2 -o cpu-affinity=per-task -o gpu-affinity=per-task`, then each task will only see its own unique core and GPU. If you launch 2 tasks with `flux mini run -n2 -o cpu-affinity=off -o gpu-affinity=off`, then each task/process will see everything on the entire node.
Note: You can easily test and inspect the effects of various affinity policies using `lstopo --restrict binding` as the job task (e.g., `flux mini run -n2 -N1 -o cpu-affinity=per-task lstopo --restrict binding`).
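The effect of per-task binding can also be illustrated outside of Flux with plain Linux tools (a rough analogy only, not Flux itself): a process restricted to a single core reports only one available CPU.

```shell
# Rough analogy using util-linux's taskset (Linux only): bind a command to
# core 0 and observe that it "sees" one CPU, much like cpu-affinity=per-task.
taskset -c 0 nproc   # prints 1
```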