
hpc-carpentry / old-hpc-workflows

8 stars · 8 watchers · 2 forks · 10.08 MB

Scaling studies on high-performance clusters using Snakemake workflows

Home Page: https://www.hpc-carpentry.org/old-hpc-workflows/

License: Other

Languages: Python 86.23% · Makefile 6.96% · R 5.16% · Shell 1.04% · Ruby 0.61%
Topics: hpc-carpentry, parallel-computing, snakemake-workflows, carpentries-incubator, english, alpha

old-hpc-workflows's People

Contributors

abbycabs, andrewspiers, bkmgit, brandoncurtis, ccoulombe, dc23, erinbecker, evanwill, fmichonneau, gvwilson, ianlee1521, jduckles, jpallen, jsta, jstaf, katrinleinweber, mawds, maxim-belkin, mr-c, neon-ninja, pbanaszkiewicz, pipitone, reid-a, rgaiacs, synesthesiam, tkphd, tobyhodges, tracykteal, twitwi, wclose


old-hpc-workflows's Issues

Update `setup.md` for the HPC use case

  • Update the required data files (or do we only generate them instead?)
  • Decide on approach(es) to make the required software for the tutorial available (snakemake, amdahl, and probably a plotting tool). Possibilities include:
    • environment modules
    • pip
    • conda/mamba
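The options above could be sketched as shell commands. This is a hedged sketch, not tested setup instructions: the module names are entirely site-specific (hypothetical here), and it assumes snakemake and amdahl are pip/conda-installable as in the hpc-intro lesson.

```shell
# Option 1: environment modules (names vary per site; these are placeholders)
module load python snakemake

# Option 2: pip inside a virtual environment
python -m venv workflows
source workflows/bin/activate
pip install snakemake amdahl

# Option 3: conda/mamba (snakemake is distributed via the bioconda channel)
mamba create -n workflows -c conda-forge -c bioconda snakemake
mamba activate workflows
pip install amdahl
```

Whichever option we pick, setup.md should show only one path prominently and relegate the alternatives to callouts.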

Discuss limitations & alternatives to Snakemake

Snakemake may not be the best workflow manager for HPC; we're teaching it because it is broadly accessible, and the lessons can be broadly transferable.

Include an episode at the end to discuss the limitations of Snakemake, and introduce alternative tools (Parsl, Fireworks) for HPC.

Consider Parsl

NERSC's Snakemake docs list Snakemake's "cluster mode" as a disadvantage, since it submits each "rule" as a separate job, thereby spamming the scheduler with dependent tasks. The main Snakemake process also resides on the login node until all jobs have finished, occupying some resources.

NERSC specifically documents Parsl as the recommended alternative for multinode jobs. I was aware of Parsl as a Python extension for parallel programming, but had not recognized its ability to dispatch work directly on Slurm (and possibly other schedulers).

This synergy suggests Parsl as a viable alternative to Snakemake, since it (a) would integrate readily with the Python-based Amdahl code and (b) could form the basis of a Programming for HPC lesson with thematic callbacks to this prior lesson in the workshop.
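For concreteness, the idea might look like the sketch below. This is a configuration sketch rather than a tested example (it assumes the parsl package is installed and can only actually run against a Slurm system); the executor label, partition name, block counts, and walltime are placeholders, and Amdahl's formula stands in for real work.

```python
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

# Parsl owns the scheduler interaction: it requests one pilot job per
# "block" of resources, instead of one scheduler job per workflow rule.
config = Config(executors=[
    HighThroughputExecutor(
        label="slurm_htex",              # placeholder label
        provider=SlurmProvider(
            partition="debug",           # placeholder partition name
            nodes_per_block=1,
            init_blocks=1,
            max_blocks=4,
            walltime="00:10:00",
        ),
    )
])
parsl.load(config)

@python_app
def speedup(p, n):
    # Amdahl's Law as a stand-in workload
    return 1.0 / ((1.0 - p) + p / n)

futures = [speedup(0.9, n) for n in (1, 2, 4, 8)]
print([round(f.result(), 2) for f in futures])
```

Note how the workflow logic stays in ordinary Python, which is what makes the thematic link to the Amdahl code attractive.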

Location of final code products

Currently, the Snakefiles and other Snakemake code live in the compressed archives in the files directory, and furthermore inside a hidden .solutions directory within those archives. Would it be good to make them more accessible, e.g. by moving them to the code directory?

Review lesson objectives

To guide the development of this lesson, it would be a good idea to have a clear outline of which parts of the current lesson are important to be kept for the HPC Carpentry setting, which are non-essential but not harmful, and which need to be removed.

One way to do this would be to review the current list of objectives for the lesson and discuss them in the context above, perhaps dividing them into "must be kept", "could be kept", and "should be removed". We will then also need a list of new objectives to add that are not in the lesson in its current form.

Rework `index.md`

This can only really be tackled once we have a better idea of the whole lesson but will include

  • Introducing the data we will create/use
  • Identifying the prereqs and where they can be found (and I notice now that HPC Carpentry will not fulfil the Python prereq...so how will we tackle this?)

Snakemake best practices

Snakemake supports "profiles", as well as YAML configuration files, to control its interaction with clusters; these inform the best practices for running on an HPC system. This lesson should examine those best practices with a view to doing things the right way, aligned with the Snakemake community.
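As a starting point for discussion, a minimal cluster profile might look like the following. The layout and option names follow Snakemake's profile mechanism (long CLI option names, minus the leading dashes, as YAML keys); the sbatch string and the numbers are placeholders, not recommendations.

```yaml
# ~/.config/snakemake/cluster/config.yaml
# Used by invoking:  snakemake --profile cluster
jobs: 10                  # cap on concurrently submitted scheduler jobs
latency-wait: 60          # seconds to tolerate shared-filesystem latency
cluster: "sbatch --ntasks={threads} --time=00:10:00"   # placeholder sbatch string
```

A profile like this is also where repeated flags such as -c 1 could be pinned once instead of typed in every command.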

Locations where we may need to use templates

There are cases where we may need to provide different templates for information in the lessons so that information can be easily tweaked for different schedulers (and systems). The use of templating in https://github.com/carpentries-incubator/hpc-intro/blob/gh-pages/_config.yml is probably a good guide here.

At a glance there do not seem to be too many of these cases, but they will probably include:

  • Getting the computing environment needed to run the tutorial (module, pip, conda or whatever)
  • The scheduler
  • The cluster config cluster.yaml
  • The string for --cluster

All of these seem to be really relevant to https://github.com/carpentries-incubator/hpc-workflows/edit/gh-pages/_episodes/09-cluster.md

Inline Python -> Lesson 10

Snakemake allows for Python inside the Snakefile, which is a neat feature. It's not core to workflows, however, and does not map to other workflow tools. We should (as @ocaisa suggested) use gray-box Python code to plot etc., and move the Python-in-Snake material to the currently-sparse Lesson 10.
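Since a Snakefile is Python under the hood, a rule can embed arbitrary Python in a run: block. The sketch below is purely illustrative: the file names and JSON keys are hypothetical, not amdahl's actual output format.

```snakemake
rule summarize:
    input: "runs/amdahl_4.json"
    output: "runs/summary.txt"
    run:
        # Ordinary Python, executed by Snakemake itself
        import json
        with open(input[0]) as fh:
            data = json.load(fh)
        with open(output[0], "w") as fh:
            fh.write(f"cores: {data['nproc']}, time: {data['time_s']} s\n")
```

It is exactly this kind of material that belongs in Lesson 10 rather than the core workflow episodes.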

Plotting Amdahl's Law

Part of the workflow will be to plot Amdahl's Law. It would be nice if we could do this in the terminal, and (with prettier output) to an image file.

There's a tool, termplotlib, that probably accepts the same options as matplotlib and could be leveraged here.

We could probably also use gnuplot directly, but more contributors are likely to be familiar with matplotlib syntax.
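As a dependency-free fallback, Amdahl's Law (speedup S(n) = 1 / ((1 - p) + p/n) for parallel fraction p on n cores) can even be bar-plotted in the terminal with plain Python. A minimal sketch; the parallel fraction 0.8 and the core counts are arbitrary example values:

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: speedup on n cores with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

def terminal_plot(p=0.8, cores=(1, 2, 4, 8, 16), width=40):
    """Crude horizontal bar chart of speedup versus core count."""
    max_s = amdahl_speedup(p, max(cores))
    for n in cores:
        s = amdahl_speedup(p, n)
        bar = "#" * round(width * s / max_s)
        print(f"{n:>3} cores | {bar} {s:.2f}x")

terminal_plot()
```

For prettier output the same numbers could be fed to termplotlib or matplotlib and written to an image file.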

Defining the "common workflow" for our lesson

The current example is a set of books that are downloaded. How do we define our raw data? We effectively don't have any: what we are doing is taking measurements with amdahl, and those measurements become our raw data.

In 01-introduction.md we start off by creating a bash script describing the manual workflow. We will somehow need to replicate this. This will require:

  • Generating a set of data (which will require parsing of amdahl output, or perhaps adding a --terse option to amdahl, see #6). Redirecting the amdahl output to a file could work...or indeed using the output files from SLURM itself.
  • Plotting the result (both graphically...and perhaps in terminal)
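The manual workflow the bash script replaces might be sketched as below. This is a hedged illustration only: the --terse flag is the proposal from #6 (not yet implemented at the time of writing), and the srun invocation and file names are placeholders.

```shell
#!/usr/bin/env bash
# Manual scaling study: run amdahl on increasing core counts,
# redirecting each run's (proposed) terse output into a raw-data file.
for n in 1 2 4 8; do
    srun --ntasks="$n" amdahl --terse > "amdahl_np${n}.json"
done
```

A Snakefile then replaces this loop, one rule invocation per core count, with the plotting step depending on all of the output files.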

"Perfect" `snakemake` file for Amdahl's Law

Ultimately the lesson ends up with the "perfect" Snakefile for the example it uses (which can be found in the .solutions/completing_the_pipeline folder of workflow-engines-lesson.zip). We need to get to a similarly perfect Snakefile for our own use case (hopefully one that uses all the material in the lesson). Once we have that, we can work backwards to replace the existing example with our own.

Moore's law

Consider adding an exercise with Moore's law using historical processor data.
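One possible shape for such an exercise: project transistor counts from the commonly quoted doubling period of roughly two years, anchored to the Intel 4004's 2,300 transistors in 1971, and compare against real historical data. A sketch (the doubling period and baseline are the textbook idealization, not a fit to real data):

```python
def moore_transistors(year, base_year=1971, base_count=2300, doubling_years=2):
    """Projected transistor count under idealized Moore's Law."""
    return base_count * 2 ** ((year - base_year) / doubling_years)

for year in (1971, 1981, 1991, 2001):
    print(f"{year}: ~{moore_transistors(year):,.0f} transistors")
```

Learners could then plot the projection against a small table of real processors, reusing whatever plotting approach the Amdahl episodes establish.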

primer docs for sprinters

Update CONTRIBUTING to include the branching workflow and the need for PRs during the CarpentryCon Sprint, along with an overview of where the files to be edited reside (_episodes, mostly).

Flags on Amdahl

We have the --terse option on Amdahl to help make machine-digestible output. We should discuss the fact that some programs have different output modes, to help with human- or machine-parsing of their outputs. Hat tip @ocaisa
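The pattern is easy to demonstrate with a toy CLI. This is not amdahl's actual implementation, just a hypothetical illustration of a human-readable versus machine-readable output switch:

```python
import argparse
import json

def report(args_list):
    """Emit results in human-readable or machine-readable (terse) form."""
    parser = argparse.ArgumentParser(description="toy scaling report")
    parser.add_argument("--terse", action="store_true",
                        help="emit machine-readable JSON instead of prose")
    parser.add_argument("--nproc", type=int, default=4)
    args = parser.parse_args(args_list)
    # Hypothetical result: a fixed workload of 12.5 s of ideal parallel work
    result = {"nproc": args.nproc, "time_s": 12.5 / args.nproc}
    if args.terse:
        return json.dumps(result)   # easy for a downstream rule to parse
    return f"Ran on {result['nproc']} cores in {result['time_s']:.2f} s"

print(report(["--nproc", "4"]))
print(report(["--terse", "--nproc", "4"]))
```

The terse form is what a Snakemake rule would redirect to a file; the prose form is what a learner reads while exploring interactively.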

Develop Snakefile on a node

We can start developing the Snakefile on a compute node (srun --pty bash) using 1 core; then, when a user's Snakefile uses more than 1 core, the resource manager will kill their session for grabbing more than was allocated. This becomes an HPC & shared-resources lesson, and marks the turning point between live development of the script and launching it through the head node. Hat tip @tobyhodges

protect default branch before sprint

Fine to leave gh-pages unprotected while we prepare, but it should be protected against direct commits, and PRs should not be merged without an approving review.

Create a profile

Seeing -c 1 repeated in so many places made me wonder whether this parameter could go into a profile early on in the lesson, to save some keystrokes.

This would be a good learning objective. Configuring software so that learners become more productive is a common task.

Originally posted by @tobyhodges in carpentries-incubator/hpc-workflows#23 (review)
