
hpc-carpentry / old-hpc-workflows

8 stars · 8 watchers · 2 forks · 10.08 MB

Scaling studies on high-performance clusters using Snakemake workflows

Home Page: https://www.hpc-carpentry.org/old-hpc-workflows/

License: Other

Languages: Python 86.23% · Makefile 6.96% · R 5.16% · Shell 1.04% · Ruby 0.61%
Topics: hpc-carpentry, parallel-computing, snakemake-workflows, carpentries-incubator, english, alpha

old-hpc-workflows's People

Contributors

abbycabs, andrewspiers, bkmgit, brandoncurtis, ccoulombe, dc23, erinbecker, evanwill, fmichonneau, gvwilson, ianlee1521, jduckles, jpallen, jsta, jstaf, katrinleinweber, mawds, maxim-belkin, mr-c, neon-ninja, pbanaszkiewicz, pipitone, reid-a, rgaiacs, synesthesiam, tkphd, tobyhodges, tracykteal, twitwi, wclose


old-hpc-workflows's Issues

Update `setup.md` for the HPC use case

  • Update the required data files (or do we only generate them instead?)
  • Decide on approach(es) to make the required software for the tutorial available (snakemake, amdahl, and probably a plotting tool). Possibilities include:
    • environment modules
    • pip
    • conda/mamba
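The options above could be sketched as shell commands. This is a hedged sketch, not tested setup instructions: the module names are entirely site-specific (hypothetical here), and it assumes snakemake and amdahl are pip/conda-installable as in the hpc-intro lesson.

```shell
# Option 1: environment modules (names vary per site; these are placeholders)
module load python snakemake

# Option 2: pip inside a virtual environment
python -m venv workflows
source workflows/bin/activate
pip install snakemake amdahl

# Option 3: conda/mamba (snakemake is distributed via the bioconda channel)
mamba create -n workflows -c conda-forge -c bioconda snakemake
mamba activate workflows
pip install amdahl
```

Whichever option we pick, setup.md should show only one path prominently and relegate the alternatives to callouts.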

Discuss limitations & alternatives to Snakemake

Snakemake may not be the best workflow manager for HPC; we're teaching it because it is broadly accessible, and the lessons can be broadly transferable.

Include an episode at the end to discuss the limitations of Snakemake, and introduce alternative tools (Parsl, Fireworks) for HPC.

Consider Parsl

NERSC's Snakemake docs list Snakemake's "cluster mode" as a disadvantage, since it submits each "rule" as a separate job, thereby spamming the scheduler with dependent tasks. The main Snakemake process also resides on the login node until all jobs have finished, occupying some resources.

NERSC specifically documents Parsl as the recommended alternative for multinode jobs. I was aware of Parsl as a Python extension for parallel programming, but had not recognized its ability to dispatch work directly on Slurm (and possibly other schedulers).

This synergy suggests Parsl as a viable alternative to Snakemake, since it (a) would integrate readily with the Python-based Amdahl code and (b) could form the basis of a Programming for HPC lesson with thematic callbacks to this prior lesson in the workshop.
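For concreteness, the idea might look like the sketch below. This is a configuration sketch rather than a tested example (it assumes the parsl package is installed and can only actually run against a Slurm system); the executor label, partition name, block counts, and walltime are placeholders, and Amdahl's formula stands in for real work.

```python
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

# Parsl owns the scheduler interaction: it requests one pilot job per
# "block" of resources, instead of one scheduler job per workflow rule.
config = Config(executors=[
    HighThroughputExecutor(
        label="slurm_htex",              # placeholder label
        provider=SlurmProvider(
            partition="debug",           # placeholder partition name
            nodes_per_block=1,
            init_blocks=1,
            max_blocks=4,
            walltime="00:10:00",
        ),
    )
])
parsl.load(config)

@python_app
def speedup(p, n):
    # Amdahl's Law as a stand-in workload
    return 1.0 / ((1.0 - p) + p / n)

futures = [speedup(0.9, n) for n in (1, 2, 4, 8)]
print([round(f.result(), 2) for f in futures])
```

Note how the workflow logic stays in ordinary Python, which is what makes the thematic link to the Amdahl code attractive.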

Location of final code products

Currently, the Snakefiles and other Snakemake code live in the compressed archives in the files directory, and furthermore inside a hidden .solutions directory within those archives. Would it be good to make them more accessible, e.g. by moving them to the code directory?

Review lesson objectives

To guide the development of this lesson, it would be a good idea to have a clear outline of which parts of the current lesson are important to be kept for the HPC Carpentry setting, which are non-essential but not harmful, and which need to be removed.

One way to do this would be to review the current list of objectives for the lesson and discuss them in the context above, perhaps dividing them into "must be kept", "could be kept", and "should be removed". We will then also need a list of new objectives to add that are not in the lesson in its current form.

Rework `index.md`

This can only really be tackled once we have a better idea of the whole lesson but will include

  • Introducing the data we will create/use
  • Identifying the prereqs and where they can be found (and I notice now that HPC Carpentry will not fulfil the Python prereq...so how will we tackle this?)

Snakemake best practices

Snakemake supports "profiles", as well as YAML configuration files, to control its interaction with clusters; these inform the best practices for running on an HPC system. This lesson should examine those best practices with a view to doing things the right way, aligned with the Snakemake community.
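As a starting point for discussion, a minimal cluster profile might look like the following. The layout and option names follow Snakemake's profile mechanism (long CLI option names, minus the leading dashes, as YAML keys); the sbatch string and the numbers are placeholders, not recommendations.

```yaml
# ~/.config/snakemake/cluster/config.yaml
# Used by invoking:  snakemake --profile cluster
jobs: 10                  # cap on concurrently submitted scheduler jobs
latency-wait: 60          # seconds to tolerate shared-filesystem latency
cluster: "sbatch --ntasks={threads} --time=00:10:00"   # placeholder sbatch string
```

A profile like this is also where repeated flags such as -c 1 could be pinned once instead of typed in every command.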

Locations where we may need to use templates

There are cases where we may need to provide different templates for information in the lessons so that information can be easily tweaked for different schedulers (and systems). The use of templating in https://github.com/carpentries-incubator/hpc-intro/blob/gh-pages/_config.yml is probably a good guide here.

At a glance there do not seem to be too many of these cases, but they will probably include:

  • Getting the computing environment needed to run the tutorial (module, pip, conda or whatever)
  • The scheduler
  • The cluster config cluster.yaml
  • The string for --cluster

All of these seem to be really relevant to https://github.com/carpentries-incubator/hpc-workflows/edit/gh-pages/_episodes/09-cluster.md

Inline Python -> Lesson 10

Snakemake allows for Python inside the Snakefile, which is a neat feature. It's not core to workflows, however, and does not map to other workflow tools. We should (as @ocaisa suggested) use gray-box Python code to plot etc., and move the Python-in-Snake material to the currently-sparse Lesson 10.
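Since a Snakefile is Python under the hood, a rule can embed arbitrary Python in a run: block. The sketch below is purely illustrative: the file names and JSON keys are hypothetical, not amdahl's actual output format.

```snakemake
rule summarize:
    input: "runs/amdahl_4.json"
    output: "runs/summary.txt"
    run:
        # Ordinary Python, executed by Snakemake itself
        import json
        with open(input[0]) as fh:
            data = json.load(fh)
        with open(output[0], "w") as fh:
            fh.write(f"cores: {data['nproc']}, time: {data['time_s']} s\n")
```

It is exactly this kind of material that belongs in Lesson 10 rather than the core workflow episodes.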

Plotting Amdahl's Law

Part of the workflow will be to plot Amdahl's Law. It would be nice if we could do this in the terminal, and (with prettier output) to an image file.

There's a tool, termplotlib, that probably accepts the same options as matplotlib and could be leveraged here.

We could probably also use gnuplot directly, but more contributors are likely to be familiar with matplotlib syntax.
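As a dependency-free fallback, Amdahl's Law (speedup S(n) = 1 / ((1 - p) + p/n) for parallel fraction p on n cores) can even be bar-plotted in the terminal with plain Python. A minimal sketch; the parallel fraction 0.8 and the core counts are arbitrary example values:

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: speedup on n cores with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

def terminal_plot(p=0.8, cores=(1, 2, 4, 8, 16), width=40):
    """Crude horizontal bar chart of speedup versus core count."""
    max_s = amdahl_speedup(p, max(cores))
    for n in cores:
        s = amdahl_speedup(p, n)
        bar = "#" * round(width * s / max_s)
        print(f"{n:>3} cores | {bar} {s:.2f}x")

terminal_plot()
```

For prettier output the same numbers could be fed to termplotlib or matplotlib and written to an image file.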

Defining the "common workflow" for our lesson

The current example is a set of books that are downloaded. How do we define our raw data? We effectively don't have any: what we are doing is taking measurements with amdahl, and those measurements become our raw data.

In 01-introduction.md we start off by creating a bash script describing the manual workflow. We will somehow need to replicate this. This will require:

  • Generating a set of data (which will require parsing of amdahl output, or perhaps adding a --terse option to amdahl, see #6). Redirecting the amdahl output to a file could work...or indeed using the output files from SLURM itself.
  • Plotting the result (both graphically...and perhaps in terminal)
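The manual workflow the bash script replaces might be sketched as below. This is a hedged illustration only: the --terse flag is the proposal from #6 (not yet implemented at the time of writing), and the srun invocation and file names are placeholders.

```shell
#!/usr/bin/env bash
# Manual scaling study: run amdahl on increasing core counts,
# redirecting each run's (proposed) terse output into a raw-data file.
for n in 1 2 4 8; do
    srun --ntasks="$n" amdahl --terse > "amdahl_np${n}.json"
done
```

A Snakefile then replaces this loop, one rule invocation per core count, with the plotting step depending on all of the output files.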

"Perfect" `snakemake` file for Amdahl's Law

Ultimately the lesson ends up with the "perfect" Snakefile for the example it uses (which can be found in the .solutions/completing_the_pipeline folder of workflow-engines-lesson.zip). We need to get to a similarly perfect Snakefile for our own use case (hopefully one that uses all the material in the lesson). Once we have that, we can work backwards to replace the existing example with our own.

Moore's law

Consider adding an exercise with Moore's law using historical processor data.
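One possible shape for such an exercise: project transistor counts from the commonly quoted doubling period of roughly two years, anchored to the Intel 4004's 2,300 transistors in 1971, and compare against real historical data. A sketch (the doubling period and baseline are the textbook idealization, not a fit to real data):

```python
def moore_transistors(year, base_year=1971, base_count=2300, doubling_years=2):
    """Projected transistor count under idealized Moore's Law."""
    return base_count * 2 ** ((year - base_year) / doubling_years)

for year in (1971, 1981, 1991, 2001):
    print(f"{year}: ~{moore_transistors(year):,.0f} transistors")
```

Learners could then plot the projection against a small table of real processors, reusing whatever plotting approach the Amdahl episodes establish.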

primer docs for sprinters

Update CONTRIBUTING to include the branching workflow and the need for PRs during the CarpentryCon Sprint, along with an overview of where the files to be edited reside (_episodes, mostly).

Flags on Amdahl

We have the --terse option on Amdahl to help make machine-digestible output. We should discuss the fact that some programs have different output modes, to help with human- or machine-parsing of their outputs. Hat tip @ocaisa
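The pattern is easy to demonstrate with a toy CLI. This is not amdahl's actual implementation, just a hypothetical illustration of a human-readable versus machine-readable output switch:

```python
import argparse
import json

def report(args_list):
    """Emit results in human-readable or machine-readable (terse) form."""
    parser = argparse.ArgumentParser(description="toy scaling report")
    parser.add_argument("--terse", action="store_true",
                        help="emit machine-readable JSON instead of prose")
    parser.add_argument("--nproc", type=int, default=4)
    args = parser.parse_args(args_list)
    # Hypothetical result: a fixed workload of 12.5 s of ideal parallel work
    result = {"nproc": args.nproc, "time_s": 12.5 / args.nproc}
    if args.terse:
        return json.dumps(result)   # easy for a downstream rule to parse
    return f"Ran on {result['nproc']} cores in {result['time_s']:.2f} s"

print(report(["--nproc", "4"]))
print(report(["--terse", "--nproc", "4"]))
```

The terse form is what a Snakemake rule would redirect to a file; the prose form is what a learner reads while exploring interactively.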

Develop Snakefile on a node

We can start developing the Snakefile on a compute node (srun --pty bash) using 1 core; then, when a user's Snakefile uses more than 1 core, the resource manager will kill their session for grabbing more than was allocated. This becomes an HPC & shared-resources lesson, and marks the turning point between live development of the script and launching it through the head node. Hat tip @tobyhodges

protect default branch before sprint

Fine to leave gh-pages unprotected while we prepare, but it should be protected against direct commits, and PRs should not be merged without an approving review.

Create a profile

Seeing -c 1 repeated in so many places made me wonder whether this parameter could go into a profile early on in the lesson, to save some keystrokes.

This would be a good learning objective. Configuring software so that learners become more productive is a common task.

Originally posted by @tobyhodges in carpentries-incubator/hpc-workflows#23 (review)
