POSSUM pipelines

AusSRC contribution to the POSSUM data pre-processing pipelines. The pre-processing of POSSUM data involves downloading observations from CASDA, convolving the images to a common resolution, and tiling and reprojecting them onto a HEALPix (HPX) grid.

Complete HPX tiles are then mosaicked together and uploaded to CADC in a final step. The workflow can be applied to MFS images or full spectral cubes. This repository provides pipelines for:

  • Pre-processing of MFS images (mfs.nf)
  • Pre-processing of spectral cube images (main.nf)
  • Mosaicking to complete tile images (mosaic.nf)

Running Pipelines

To run a pipeline you need to specify a main script, a parameter file (or a list of parameters as command-line arguments) and a deployment. Currently, setonix is the only supported deployment.

The pipeline needs access to a CASDA credentials file casda.ini:

[CASDA]
username =
password =
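As a sanity check, the credentials file can be parsed with Python's standard configparser module. The section name and keys below follow the template above; the username and password values are placeholders:

```python
import configparser
import os
import tempfile

# Write a sample casda.ini matching the template above (placeholder credentials).
path = os.path.join(tempfile.mkdtemp(), "casda.ini")
with open(path, "w") as f:
    f.write("[CASDA]\nusername = jane.doe\npassword = s3cret\n")

# Read it back the way a pipeline task might.
config = configparser.ConfigParser()
config.read(path)
username = config["CASDA"]["username"]
password = config["CASDA"]["password"]
```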

Spectral cube images (main.nf)

#!/bin/bash
#SBATCH --account=<Pawsey account>
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=32G
#SBATCH --time=24:00:00

module load singularity/4.1.0-slurm
module load nextflow/23.10.0

export MPICH_OFI_STARTUP_CONNECT=1
export MPICH_OFI_VERBOSE=1

export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)

nextflow run main.nf -profile setonix --CASDA_CREDENTIALS=<path to CASDA credentials> --SBID <SBID>

Deploy

sbatch script.sh

MFS images (mfs.nf)

#!/bin/bash
#SBATCH --account=<Pawsey account>
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=32G
#SBATCH --time=24:00:00

module load singularity/4.1.0-slurm
module load nextflow/23.10.0

export MPICH_OFI_STARTUP_CONNECT=1
export MPICH_OFI_VERBOSE=1

export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)

nextflow run mfs.nf -profile setonix --CASDA_CREDENTIALS=<path to CASDA credentials> --SBID <SBID>

Deploy

sbatch script.sh

File structure

This section describes how the output files are organised. All outputs are stored under the location specified by the WORKDIR parameter. Here is the structure beneath WORKDIR:

.
├── ...
└── WORKDIR                             # Parent directory specified in params.WORKDIR
    ├── <SBID_1>
    ├── <SBID_2>
    ├── ...
    ├── <SBID_N>                        # A sub-folder for each SBID containing observation metadata
    │   ├── evaluation_files            # Downloaded evaluation files
    │   └── hpx_tile_map.csv            # Generated map for HPX pixels covered by image cube (map file)
    └── TILE_COMPONENT_OUTPUT_DIR       # HPX tile components for each SBID are stored here
        ├── <OBS_ID_1>
        ├── ...
        └── <OBS_ID_N>                  # All tiled images are separated by observation ID
            ├── i                       # Subdirectory for each Stokes parameter
            ├── ...
            └── q

Splitting

We use the CASA imregrid method for tiling and reprojection onto an HPX grid. CASA was not written to let us parallelise the tiling and reprojection over a number of nodes, and our worker nodes do not have enough memory to hold entire cubes (160 GB for band 1 images). We therefore need to split the cubes by frequency, run our program on each piece, then join the pieces at the end.

We do this twice in our full pre-processing pipeline: for convolution, to allow use of the robust method (which requires setting NaN values to zero), and for imregrid, to produce tiles as described earlier. The number of frequency splits is specified by the NAN_TO_ZERO_NSPLIT and NSPLIT parameters respectively. Depending on the size of the cube and of the worker nodes, users will have to tune these parameters to utilise computing resources optimally.
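The split/process/join pattern can be sketched with numpy arrays (illustrative only; the real pipeline operates on FITS cubes on disk, and the per-split work is convolution or regridding rather than the stand-in shown here):

```python
import numpy as np

# Toy "cube" with axes (frequency, y, x). Real POSSUM cubes are far larger,
# which is why they must be split across jobs.
cube = np.arange(4 * 3 * 3, dtype=float).reshape(4, 3, 3)

NSPLIT = 2  # number of frequency splits, analogous to the NSPLIT parameter

# Split along the frequency axis, process each subcube independently,
# then join the results back together in order.
subcubes = np.array_split(cube, NSPLIT, axis=0)
processed = [np.nan_to_num(sc) for sc in subcubes]  # stand-in for the real per-split step
rejoined = np.concatenate(processed, axis=0)
```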

Download NASA CDDIS data

The FRion predict step of the pipeline (main.nf only) requires you to download data from NASA CDDIS. To do this you will need to create an EarthData account, then create a .netrc file containing those credentials with the following content:

machine urs.earthdata.nasa.gov login <username> password <password>

Then restrict the file permissions and move it to your home directory on the cluster on which you intend to deploy the pipeline:

chmod 600 .netrc
mv .netrc ~/
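The resulting file can be sanity-checked with Python's standard netrc module. The machine name matches the entry above; the login and password values are placeholders:

```python
import netrc
import os
import tempfile

# Write a .netrc with the EarthData entry shown above (placeholder credentials).
path = os.path.join(tempfile.mkdtemp(), ".netrc")
with open(path, "w") as f:
    f.write("machine urs.earthdata.nasa.gov login jane.doe password s3cret\n")
os.chmod(path, 0o600)

# Parse it back; authenticators() returns a (login, account, password) tuple.
login, account, password = netrc.netrc(path).authenticators("urs.earthdata.nasa.gov")
```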

For more info: https://urs.earthdata.nasa.gov/documentation/for_users

possum_workflow's People

Contributors

axshen, davepallot


possum_workflow's Issues

Use `astropy` method instead of `robust` for `beamcon_3D`

We currently have to split the cube in frequency (the number of subcubes is specified by the NAN_TO_ZERO_NSPLIT parameter) to set NaN values to zero, which is required by the robust method of beamcon_3D. Without setting NaN to zero, a single NaN value anywhere in a frequency slice turns the entire slice to NaN.

Try using the astropy method instead. This may take longer but does not require splitting and re-joining of the split cubes.
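The NaN failure mode is easy to reproduce with numpy: a Fourier-space convolution sums over every input pixel, so a single NaN contaminates the entire plane, while zeroing NaNs first keeps the output finite. This is an illustrative sketch, not the beamcon_3D implementation:

```python
import numpy as np

# A frequency slice with a single NaN pixel.
plane = np.ones((8, 8))
plane[3, 4] = np.nan

# Identity (delta) kernel: FFT convolution should return the input unchanged.
kernel = np.zeros((8, 8))
kernel[0, 0] = 1.0

# One NaN poisons every Fourier coefficient, hence every output pixel.
blurred = np.fft.ifft2(np.fft.fft2(plane) * np.fft.fft2(kernel)).real

# Zeroing NaNs first (what the NAN_TO_ZERO step does) keeps the result finite.
clean = np.fft.ifft2(np.fft.fft2(np.nan_to_num(plane)) * np.fft.fft2(kernel)).real
```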

Store POSSUM temporary products (`components/`) in Acacia

Tile components are generated for each observation and are intermediate products of the POSSUM pipeline. They are mosaicked later in the pipeline when the user (a POSSUM team member) decides there are enough components to complete a tile.

There is no guarantee that the observations will be adjacent, so these temporary tile components may need to be stored for longer than the 30-day scratch limit. We will need to store them in Acacia and move them back when necessary. Consider updating the mosaicking pipeline code (mosaic.nf) to pull images from Acacia rather than the shared filesystem.

Filename convention

3D cubes require this filename convention

prefix.bandwidth.resolution.RADEC.TileID.Stokes.fits

(e.g. POSSUM.800-1088MHz.18asec.2148-5115.10978.i.fits) while MFS images require

POSSUM.944MHz.18asec.2148-5115.10978.t0.i.fits

where the central frequency of the MFS image replaces the bandwidth, and t0 or t1 is added for Taylor term 0 or 1. These filename updates require reading the FITS header of the output file, so they may need to be applied after the mosaicking step of the pipeline.
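A hypothetical helper (not part of this repository) shows how names under this convention could be assembled; the function and argument names are illustrative:

```python
def possum_filename(prefix, band_or_freq, resolution, radec, tile_id, stokes, taylor=None):
    """Build a POSSUM output filename following the convention above.

    For MFS images, pass the central frequency as band_or_freq and set
    taylor to 0 or 1; for 3D cubes, pass the bandwidth and leave taylor unset.
    """
    parts = [prefix, band_or_freq, resolution, radec, str(tile_id)]
    if taylor is not None:
        parts.append(f"t{taylor}")  # Taylor-term token, MFS images only
    parts.append(stokes)
    return ".".join(parts) + ".fits"

cube_name = possum_filename("POSSUM", "800-1088MHz", "18asec", "2148-5115", 10978, "i")
mfs_name = possum_filename("POSSUM", "944MHz", "18asec", "2148-5115", 10978, "i", taylor=0)
```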
