
flux-framework / flux-docs


Documentation for the Flux-Framework

Home Page: https://flux-framework.readthedocs.io/

License: Other

Makefile 0.78% Python 19.07% CSS 64.51% HTML 9.27% Dockerfile 1.34% Jupyter Notebook 5.03%

flux-docs's People

Contributors

chu11, cmoussa1, dongahn, garlick, garrettbslone, gonsie, grondo, hauten, jaeseungyeom, jameshcorbett, mergify[bot], stevwonder, suzannepaterno, vsoch, wickberg, wihobbs, xorjane


flux-docs's Issues

Background Section

We need the "elevator pitch" for Flux in our docs.

What is Flux? What problems is it trying to solve? Why is it better than the "competition"?

We have lots of research papers that we can probably copy/paste from.

create a top-level reference page for flux jobs (their lifecycle, states, events, etc)

We have several RFCs and command man-pages that document these specifics, but the information isn't all in one place. It would be nice to have a top-level doc page that gives the high-level details of Flux jobs, and then provides pointers to the RFCs/manpages for more nitty-gritty details.

We probably can't do this type of accumulation for everything within Flux, but jobs seem like an essential enough concept to warrant it.

Create an FAQ page

Good questions to bootstrap the page:

  • Why is Flux not discovering and managing all of the resources on the system/node?
    • Various bind flags passed to the parallel launcher that started Flux
    • CUDA enabled hwloc (for GPUs)
  • How do I efficiently launch a large number of jobs to Flux?
    • Pointer to our bulksubmit workflow example and flux-tree
  • Memory exhaustion on a node when running large ensembles with Flux
    • Set rundir so that the content store is saved to a disk-backed filesystem rather than /tmp (which is often an in-memory tmpfs)
  • Bring Your Own Python (BYOP)
    • Caveats on what will work and what won't when using a version of python that isn't the one Flux was configured against
  • How do I mimic Slurm's job step semantics? (could later be moved to a "Flux vs Slurm" or "Porting from Slurm" page)
    • Submitting a script with flux mini submit that contains flux mini run calls will not produce Slurm-style job steps unless you put flux start in front of the job script in the flux mini submit command
  • Flux fails to bootstrap a specific MPI implementation (e.g., OpenMPI, MPICH, Spectrum MPI):
    • Use the -o mpi= option when running/submitting jobs
  • Does Flux run on a Mac?
    • Not yet. We have an open issue on GitHub tracking the progress towards the goal of natively compiling on a Mac. In the meantime, you can use Docker (pointer to our docker install instructions).
  • How do I report a bug or ask a question not answered here?
    • Either a pointer to the GitHub issue tracker, or a pointer to a docs page that outlines how to get help
  • How do I request a resource for scheduling not listed as an argument to flux mini (e.g., memory or storage)?
    • Once flux run is merged, we can point to that documentation

ECP AM tutorial

Are there any materials from the ECP AM tutorial that are not already in the tutorials or examples repos? Slides? We should gather them in this repo, probably on a page listing past events.

FAQ: Flux Job IDs in F58 format print improperly

Depending on the system you are running on and the locale settings, the printing of Flux Job IDs in F58 format can result in weird boxes, underscores, mojibake, or complete corruption of the terminal state. For example:

$ flux mini submit hostname
_Kg1PZgns

or

$ flux mini submit hostname
Kg1PZgns

This issue appears on systems with no locale information set (i.e., LC_ALL, LC_CTYPE, and LANG are all unset and the values printed by locale are all implied). When no locale information is set, Python 3.7+ automatically coerces the locale to C.UTF-8.

The preferred solution is to set LC_ALL to something (either C, POSIX, or *.UTF-8). Try adding the following to your shell profile script (e.g., ~/.profile): export LC_ALL="C.UTF-8". If your terminal does not support Unicode, try just export LC_ALL=C. Ultimately, anything that isn't empty should work (C will just prevent the use of Unicode characters).

In scenarios where you cannot change your locale (i.e., it must be *.UTF-8), but you want the Flux IDs in F58 to be printed with the ASCII f, you should set FLUX_F58_FORCE_ASCII=1.
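Before changing anything, it can help to see what Python will actually use. The sketch below is illustrative only: it reports standard locale state and is not a Flux command.

```python
import locale
import os
import sys

# Report the variables that determine how F58 job IDs render.
# If all three are unset, Python 3.7+ may coerce the locale, and some
# terminals will garble the non-ASCII F58 prefix.
for var in ("LC_ALL", "LC_CTYPE", "LANG"):
    print(f"{var} = {os.environ.get(var, '(unset)')}")

enc = locale.getpreferredencoding()
print("stdout encoding:", sys.stdout.encoding)
print("preferred encoding:", enc)
```

If the preferred encoding is not a UTF-8 variant, the LC_ALL fix above (or FLUX_F58_FORCE_ASCII=1) is likely needed.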

Flux Architecture Diagrams

It would be good to have some documentation on the main modules and their interactions.

Including:

  • Our "hub and spoke" setup
    • Specifically job-manager is the "hub", and "job-ingest", "depend", "sched", and "exec" are all spokes
  • The difference in modules loaded on rank0 vs other ranks
  • The messages that pass between modules when A) submitting a job B) querying job statuses

I put together some diagrams in PowerPoint that might be a good starting point:
Flux-Architectural-Diagrams.pptx
Flux-Architectural-Diagrams.pdf

quickstart: update `flux module` and `flux job list` usage

After the most recent release (i.e., in current master), the flux module command lost the --rank argument, and the flux job list command moved from being a "porcelain" command to being a "plumbing" command. It has been replaced by "flux jobs".

The quickstart.rst should be updated to reflect these changes.

contributing.rst: explain "subsystem" prefix requirement in git commit subject

Our contributing document requires a subsystem: prefix in the git commit subjects for proposed PRs, but doesn't explain what that means. Furthermore, we've adhered very well to including these prefixes in our work to date, but as @chu11 noted in flux-framework/flux-core#2790, we've not been consistent in the naming used in these prefixes.

If we're going to be rigorous about requiring subsystem prefixes, we should probably document why we want them, as well as what an acceptable subsystem prefix is.

Here's my take, 100% good old fashioned opinion:

Since we're not using the commit subject prefix in any kind of automated way, I think of it as a hint at what parts of the code the commit touches, so that relevant developers can quickly determine if they will have expertise for a code review and/or approval. In that case something like job-info: or python: seems fine, while modules/job-info: or bindings/python: are also ok, as long as those 8-9 characters are not needed to create a terse commit description.

Since concise commit subjects are highly valued, I try to pick the shortest name that fully qualifies the code being modified, unless the commit message still fits within 50 characters with the longer prefix. Therefore, instead of cmd/flux-jobs:, the subject could be just flux-jobs:, since there is no modules/flux-jobs or libutil/flux-jobs, etc. -- it is obvious what "subsystem" I am referencing.

Of course, I may be missing some other purpose for the subsystem prefix. In which case we can discuss that here and come up with some rules to place in our contributing doc to assist future contributors.

[question] Incorporate doxygen-based in-line docs for Fluxion

flux-sched has a ton of Doxygen-based in-line documentation designed to help developers extend our Fluxion graph scheduler. For a C++-based project (or even a high-level, object-oriented one like Python), this style of documentation is common.

As a test, I have a Doxygen configuration for our Fluxion resource infrastructure piece at https://github.com/flux-framework/flux-sched/blob/master/resource/doxygen/doxy_conf.txt. What would be a feasible first step to see if it is possible to incorporate this into the overall ReadTheDocs site?

If I manually build the doxygen document out of it, I get something like the following.

If this test goes well, I can add a top level doxygen configuration spanning both resource and qmanager components.

I remember @trws said this would be possible with some tweaks. And as I remember, he (and @SteVwonder) would have other in-line docs for the Python bindings.

Screen Shot 2020-07-25 at 12 22 29 PM

flux-accounting: add docs for building, installing, and setting up flux-accounting

Once flux-accounting v0.10.0 is released, it would be a good idea for me to add some official documentation on building/installing flux-accounting and setting it up on a machine. This should include instructions on some of the major components of flux-accounting, which include:

  • setting up the DB
  • setting up the cron job for updating job usage values periodically
  • setting up the cron job for updating fairshare values periodically

and any other important notes while setting up the project.

Its location in the Flux docs perhaps fits best in the Admin Guide.

FAQ: MPI init failure when running wrapped MPI process

If you attempt to launch an MPI process using a wrapper, it may fail. For example, if you run flux mini run my_script.py mpi_app.exe, where my_script.py Popens mpi_app.exe, MPI initialization will fail, even if it is just a single-rank MPI application. The same happens for flux mini run totalview mpi_app.exe. The issue is that Python's Popen and TotalView close file descriptors before launching the child process, while MPI uses the PMI_FD file descriptor to communicate with Flux in order to bootstrap.

If you are attempting to debug your MPI application with Totalview, follow these instructions: https://flux-framework.readthedocs.io/en/latest/debugging.html#parallel-debugging-using-totalview

If you are attempting to wrap your MPI application with a Python script and Popen, then make sure to pass close_fds=False to Popen: https://docs.python.org/3/library/subprocess.html#subprocess.Popen
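As a stand-alone sketch of why close_fds=False matters, the snippet below passes a descriptor to a child process the same way Flux passes PMI_FD to an MPI process. The pipe and the FAKE_PMI_FD variable are invented for this illustration; the real descriptor and environment variable are set up by Flux.

```python
import os
import subprocess
import sys

# A pipe's read end stands in for the PMI_FD descriptor that Flux
# hands to MPI processes. Descriptors are non-inheritable by default
# in Python 3, so mark it inheritable explicitly.
r, w = os.pipe()
os.set_inheritable(r, True)
os.write(w, b"hello")

child = (
    "import os, sys; "
    "fd = int(os.environ['FAKE_PMI_FD']); "
    "sys.stdout.write(os.read(fd, 5).decode())"
)

# close_fds=False lets the child keep the inherited descriptor, just
# as a wrapped MPI process must keep PMI_FD to bootstrap under Flux.
proc = subprocess.run(
    [sys.executable, "-c", child],
    env=dict(os.environ, FAKE_PMI_FD=str(r)),
    close_fds=False,
    capture_output=True,
    text=True,
)
print(proc.stdout)  # the child read "hello" through the inherited fd
```

With the default close_fds=True, the child would fail with a bad file descriptor error, which mirrors the MPI init failure described above.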

Flux-core & Flux-sched as Subprojects?

Post-mortem of a recent coffee discussion

Problems:

  • How should the submodules for flux-core and flux-sched within flux-docs be updated?
  • How should flux-docs be tagged given the loose coupling of the various projects?

Proposed Solution: The rfc repo would remain a submodule because it is not tagged/versioned, so we can freely update the submodule reference every time a PR is merged into rfc's master. flux-core and flux-sched would be RTD subprojects of flux-docs. All of the various namespaces and search indexes would be shared, but each project would still be able to host its own docs, tagged the same way the projects are. flux-docs would be a parent project that ties everything together, but could now be tagged independently of the other projects.

Open Questions: What is the best way to reference flux-core/flux-sched material from flux-docs? RTD subprojects seem to take care of this with the /projects/<name> addition to the URL, but does this break Sphinx? Is there an easy way to reference the latest tagged version of a subproject in RTD or RST (or is latest as close as we can get)?

I'm sure there are a bunch of other questions and details to work out that we haven't thought of yet.

Tagging @gonsie since our discussion last Friday spawned this whole side discussion.

Update CORAL GPU section with better test for CUDA-enabled hwloc

Right now we have the user run lstopo | grep -i coproc, which is an OK test, but it is possible A) for the right binary to be in your PATH but the wrong library to be found by ld, or B) for the wrong binary to be in your PATH but the right library to be found by ld. The latter is the case on Lassen, where the module file only updates LD_LIBRARY_PATH. We should just test the intended behavior directly with something like flux start flux resource list.

Add examples of bootstrapping under sbatch

As @Larofeticus pointed out, most (all?) of our examples involve working with Flux in an interactive manner. In particular, they all use salloc to grab a set of nodes and then invoke Flux commands interactively. It would be instructive to have an example where we create a script that bootstraps Flux (and invokes a Flux initial program) and we submit that script with sbatch to show how the whole workflow would work in batch mode.
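A minimal batch script along these lines might look like the following. This is a sketch, not a tested recipe: the node count, time limit, workload script name (my_flux_workload.sh), and srun options are placeholders that will vary by site.

```shell
#!/bin/bash
#SBATCH -N 4
#SBATCH -t 30

# srun starts one Flux broker per allocated node; the script passed to
# flux start runs as the Flux initial program once the instance is up.
srun --mpi=none flux start ./my_flux_workload.sh
```

The script would then be submitted with sbatch (e.g., sbatch flux_batch.sh, filename hypothetical), running the whole bootstrap-and-launch workflow in batch mode.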

README.md: env questions

Just wondering if the following virtual environment command may be preferable to the one in the README.md? I understand virtualenv is now built into python3 (as venv). On an Ubuntu 18.04 LTS system, python is actually python2, so running the commands in the README.md resulted in a Python 2 environment. I figured that was going to end badly, and in backing up, discovered this alternate method:

$ python3.6 -m venv --system-site-packages env

Also, maybe it would be better to tell people to use _env or _sphinx_env, add it to a "clean" or "veryclean" make target, and to .gitignore to underscore that it's not part of the project?
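For what it's worth, the same environment can also be created through the standard-library venv module; this is just the programmatic equivalent of the command above (the _env directory name follows the suggestion here):

```python
import venv

# Equivalent of "python3 -m venv --system-site-packages _env".
# with_pip=False keeps the sketch fast and offline; drop it if you
# want pip installed into the environment.
venv.create("_env", system_site_packages=True, with_pip=False)
```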

Create PR checklist/template

See bottom of PR #7 for discussion. Could possibly be combined with the "Developer Guidelines" section in contributing.rst (which has content like "commit etiquette"), or it could be a standalone file.

Create unified sidebar for RTD subprojects

Carryover from #32

Did some googling and it looks like RobotPy has a nice readthedocs site with a nice integration of various subprojects. We could probably emulate them. One key technology that they reference is the intersphinx module.

I just added the rfc repo as a readthedocs subproject: flux-framework.readthedocs.io/projects/flux-rfc/en/latest/README.html. I’m not sure about how to correctly link between the projects, but eventually I’d like to do something similar to the robotpy project for generating the sidebar.

contributing.rst: add testsuite tips

  • How to run a single test within the test harness: cd t && make check TESTS=t0000-testname.t
  • How to get extra output from the tests: debug=t FLUX_TESTS_LOGFILE=t make check
  • Running style checks and formatters: cd $REPO_ROOT && ./scripts/pylint && ./scripts/format
  • You can use $REPO_ROOT/scripts/check-format as a git pre-commit hook

Any other useful tips to include?

flux system administration guide

I think we'll need a guide for system administrators when flux is deployed as the system resource manager. We were talking about getting some feedback from sys admins on our current set of tools. Perhaps an admin guide would help with that, and could document known shortcomings in our early releases.

A straw man outline to start the discussion:

  • Configuration
    • Installing flux packages
    • Security: setuid IMP, flux user, MUNGE, Curve keys
    • Overlay: selecting a network, TBON topology
    • Resource: hwloc, config via R TBD
    • Storage: content backing store, logs, job archive
  • Day to day administration
    • Changing the Flux configuration
    • Updating Flux software
    • Draining nodes
    • Stopping the flux queue
    • Expediting jobs
    • Canceling jobs
    • Dedicated Application Time
  • Troubleshooting
    • Logs: systemd journal, flux dmesg, KVS eventlogs
    • What happened to a job?
    • What happened to the system?
    • What happened to a node/resource?
    • Queue order?
    • Anticipated/common failure modes TBD

Using `flux-proxy` on a machine with launch nodes

For Summit (and Lassen, depending on whether you use lalloc), the default behavior is to allocate compute nodes plus a launch node for every job. The batch script then runs on the launch node, and any use of the system parallel launcher jsrun to execute Flux results in all of the Flux processes landing on the compute nodes and none on the launch node. If a user captures the local:// FLUX_URI and tries to use it from the launch node, it will fail, since the Unix socket is on the compute nodes, not the launch node. This behavior differs from typical clusters, where launch nodes are not used.

unify all example output under a common username

As noted in #25, example output in the quickstart guide uses real usernames, and thus usernames may not match across all examples.

One idea would be to paste all example output from the docker image, in which case I think the username would end up being flux (if that's not confusing enough).

Make "Scheduling GPUs" its own (sub-)section

This should include details about hwloc needing to be compiled against NVML and OpenCL, the check for whether GPU detection in Flux worked (#71), and the fact that CUDA_DEVICE_ORDER should be set if launching with a system launcher like srun or jsrun.

create an "atlas" or "cheat sheet" for accessing workflow-relevant information

We have a CLI for poking at job-specific information stored within the KVS (flux job info), but @Larofeticus pointed out that having some documentation about where to find job-specific information within the KVS might also be helpful. To start, maybe we can make a table of various workflow/job-relevant information: where it is stored in the KVS, if/where it is stored in the environment, and maybe even where you used to be able to find it in Slurm, etc.

Data that comes to mind:

  • jobid
  • job size (in nodes)
  • rank id
  • hostnames for all nodes in job
  • job endtime
  • cpu mask (i.e., allocated cores)
  • job account/bank
  • job working directory
  • job stdout/stderr location (either file or KVS stream)

Admin Guide: syntax highlighting in TOML examples

It looks like the admin guide uses TOML code blocks in the ReST code, and the syntax highlighting of those blocks works on GitHub but not on ReadTheDocs. Not sure if this reproduces locally, but it happens on the hosted RTD site.

GitHub Screenshot:
Screen Shot 2020-09-16 at 12 38 44 PM

RTD Screenshot:
Screen Shot 2020-09-16 at 12 38 52 PM

sys admin guide: multiple queues

From a recent sched ops syncup meeting, @ryanday36 requested details about multiple queues be added to the sys admin guide.

Topics to include:

  • How to configure queue parameters
  • How to configure resources under various queues
  • How to set queue limits and QoS

Running jobs: task affinity

Flux will automatically set the CPU affinity and set CUDA_VISIBLE_DEVICES based on the cores and GPUs allocated to a job. If you are launching multiple tasks in a job, then you may be interested in the shell options “cpu-affinity” and “gpu-affinity”.

If you launch 2 tasks with flux mini run -n2 -N1 or flux mini run -n2 -N1 -o cpu-affinity=on -o gpu-affinity=on, both tasks/processes will see the same 2 cores and GPUs. If you launch 2 tasks with flux mini run -n2 -o cpu-affinity=per-task -o gpu-affinity=per-task, then each task will only see its own unique core and GPU. If you launch 2 tasks with flux mini run -n2 -o cpu-affinity=off -o gpu-affinity=off, then each task/process will see everything on the entire node.

Note: You can easily test and inspect the effects of various affinity policies by using lstopo --restrict binding as the job task (e.g., flux mini run -n2 -N1 -o cpu-affinity=per-task lstopo --restrict binding).
