pangeo-forge / staged-recipes Goto Github PK

View Code? Open in Web Editor NEW

38.0 14.0 62.0 459 KB

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks

Home Page: https://pangeo-forge.readthedocs.io/en/latest/

License: Apache License 2.0

staged-recipes's Introduction

staged-recipes

This is the starting place for a new recipe, welcome!

Adding a new Recipe

To add a new recipe, you'll open a pull request containing:

A recipes/<Your feedstock name>/meta.yaml file, containing the metadata for your dataset
A recipes/<Your feedstock name>/recipe.py file, a Python module with recipe (or recipe dict-object) definition.

See below for help on writing those files.

Once your recipe is ready (or if you get stuck and need help), open a pull request from your fork of pangeo-forge/staged-recipes. A team of bots and pangeo-forge maintainers will help get your new recipe ready to go.

Developing a Recipe

New recipes can be developed locally or on Pangeo's binder. During the development of the recipe, you shouldn't need to download or process very large amounts of data, so either should be fine. If you want to work remotely on Pangeo's binder, click on this badge:

If you're working locally, you'll need to clone https://github.com/pangeo-forge/staged-recipes.

staged-recipes's People

Contributors

Stargazers

Watchers

staged-recipes's Issues

Proposed Recipes for Euro-Cordex

The following is transcribed from @larsbuntemeyer in pangeo-data/pangeo#862

Source Dataset

I wanted to drop some of my thoughts here on bringing WCRP EURO-CORDEX datasets to the cloud. That would be cmorized datasets on the European Cordex domain that are currently available only for access by download from, e.g., ESGF or the Coperniucs Climate Data Store.

Link to the website / online documentation for the data - https://www.euro-cordex.net/
The file format (e.g. netCDF, csv) - ???
How are the source files organized? (e.g. one file per day):
i guess i would be able to get support from DKRZ-ESGF where i usually work with the ensemble and where also an intake collection is maintaned. The ensemble contains about:
- up to 150 datasets on the EUR-11 Cordex domain each for a number of frequently requested variables,
- about 75 TB of data volume for the complete ensemble and all variables on the EUR-11 domain.
How are the source files accessed (e.g. FTP)
- a notebook that shows access to the 2m surface temperature EURO-CORDEX ensemble dataset at DKRZ.
Any special steps required to access the data (e.g. password required) - Raw data from ESGF or CDS. Copy at DKRZ.

Transformation / Alignment / Merging

???

Output Dataset

Zarr?

Licensing Question

I am especially wondering what license would be required for the data to be made available publicly and if you think that the Cordex terms of use would be a problem for distributing the data freely? Right now, on ESGF only CMIP5 and CMIP6 data are freely available while for Cordex you still have to register.

I would be interested in your thoughts on whether that data could be successively made available through PANGEO cloud storage. As i said, right now this is just an idea, but the Euro-Cordex General Assembly is coming up in the end of January 2022 and i wanted to bring that up and discuss it in the community. Thanks a lot!

Example pipeline for HRRR

Source Dataset

The High-Resolution Rapid Refresh (HRRR) forecast model is the highest resolution (3 km) met model from NOAA that covers the entire US. The forecast archive from 2014 to the present is available as part of the NOAA Big Data Program on AWS. We want the data for forecast hour 01.

link: https://noaa-hrrr-bdp-pds.s3.amazonaws.com/index.html
format: grib2
access: AWS s3

import fsspec
fs = fsspec.filesystem('s3', anon=True)
url = 'noaa-hrrr-bdp-pds'  # HRRR forecast archive

flist = fs.glob(url+'/hrrr.20190101/conus/hrrr.t*z.wrfnatf01.grib2')
flist

Transformation / Alignment / Merging

It would be great to form a best time series using the data from forecast hour T01.

It turns out that instead of reading the grib2 files with engine=cfgrib, it's faster to download the grib2, convert them to netcdf using wgrib2 and then load the resulting netcdf file into xarray.

wget https://noaa-hrrr-bdp-pds.s3.amazonaws.com/hrrr.20190101/conus/hrrr.t00z.wrfnatf01.grib2

Output Dataset

Zarr output chunked thusly: {'time':72, 'x':600, 'y':600}

Example pipeline for TRMM or GPM level 2 radar data

Source Dataset

Link to the website / online documentation for the data https://docserver.gesdisc.eosdis.nasa.gov/public/project/GPM/README.TRMM.pdf
The file format (e.g. netCDF, csv) HDF5
How are the source files organized? (e.g. one file per day) one file per orbit
How are the source files accessed (e.g. FTP) https
- provide an example link if possible https://disc2.gesdisc.eosdis.nasa.gov/data/TRMM_L2/GPM_2APR.06/2015/091/2A.TRMM.PR.V8-20180516.20150401-S102735-E115838.098987.V06A.HDF5
Any special steps required to access the data (e.g. password required) NASA Earthdata Login

Transformation / Alignment / Merging

Output Dataset

Proposed Recipes for Global HYCOM25

Source Dataset

Global surface dataset of HYCOM25.

The file format : netCDF
How are the source files organized? (to be confirmed)
How are the source files accessed : Globus

Transformation / Alignment / Merging

The files should be concatenated along the time dimension, ideally with a chunk size larger than one in time and a few chucks in the horizontal spatial dimension.

Output Dataset

Ideally as zarr formats.

Improving the contribution workflow + documentation

As proposed by @rabernat in #31 (comment), this issue will track ways to improve the contribution workflow and documentation. Here are a few items from #31 (comment) to begin with:

Change "example" to "proposed recipe" in the PR tracker
The word "pipeline" should be replaced with "recipe" throughout (contibutors no longer interact with the Prefect layer)
Improve contribution documentation and best practices, including recommended directory structure and use of Jupytext for recipe development

Feel free to add!

Example pipeline for GEFSv12

Source Dataset

These are “reforecasts” of the new GEFSv12 system. They are retrospective forecasts spanning the period 2000-2019.
These reforecasts are not as numerous as the real-time data; they were generated only once
per day, from 00 UTC initial conditions, and only 5 members were provided, with the following
exception. Once weekly, an 11-member reforecast was generated, and these extend in lead
time to +35 days.

Link to documentation
Link to data AWS bucket
The file format is grib2 - both the original and the cloud copy.
The files have multiple dimensions because the are forecast data (besides times/x/y/z they have lead time, ensemble members), they are organized as (from the PDF file linked above):
- the directory tree structure under GEFSv12/reforecast/ is by year;
- there are separate subdirectories for each yyyymmddhh, thus 2000010100 to 2000123100 for the year 2000.
- Under each yyyymmddhh subdirectory, there are subdirectories c00, p01, p02, p03, p04 for the five individual member forecasts. Once per week, 11 reforecast members were computed, and the directories for those days extend
  through p10.
Individual grib files have file names such as “variable_yyyymmddhh_member.grib2”
Currently I played with loading some part of the files (CONUS precip/temp) in the AWS pangeo deployment using xarray using the rasterio engine and concatenating them, but it's veeeeeery slow and I loose info from the grib files because I don't use a grib engine:

import s3fs
import xarray as xr
s3 = s3fs.S3FileSystem(anon=True)

def preprocessing_function_ptc(path_loop, im):
    prova = ['/'+ifg for ifg in path_loop.split('/')[1:]]
    ds = xr.open_rasterio('https://noaa-gefs-retrospective.s3.amazonaws.com'+''.join(prova),
                    chunks={'x':'200MB', 'band':-1})
    ds = ds.sel(y = slice(51,20), x = slice(229, 302))
    start_time = pd.to_datetime(str(path_loop.split('/')[4]),format='%Y%m%d%H')
    ds.coords['time'] = start_time
    ds = ds.expand_dims('time')
    ds.coords['member_id'] = im
    ds = ds.expand_dims('member_id')
    return ds

path_b = 's3://noaa-gefs-retrospective/GEFSv12/reforecast/*'
lglob = s3fs.S3FileSystem.glob(s3,path=path_b)
for ilyear in lglob[0:4]:
    print(ilyear)
    lglob2 = s3fs.S3FileSystem.glob(s3,path='s3://'+ilyear+'/*')
    ds_p_all = []
    ds_t_all = []
    
    for ilyear2 in lglob2[0:4]:
        print(ilyear2)
        ds_p_all1 = []
        ds_t_all1 = []
        for member in ['c00','p01','p02','p03','p04']:
            lglob3 = s3fs.S3FileSystem.glob(s3,path='s3://'+ilyear2+'/'+member+'/Days:1-10/*')
            for ilg3 in lglob3:
                if 'apcp' in ilg3:
                    ds_p = preprocessing_function_ptc(ilg3, member)
                    ds_p_all1.append(ds_p)
                elif 'tmp_2m' in ilg3:
                    ds_t = preprocessing_function_ptc(ilg3, member)
                    ds_t_all1.append(ds_t)                    
    temp_p = xr.concat(ds_p_all1, dim='member_id')
    temp_t = xr.concat(ds_t_all1, dim='member_id')
    ds_p_all.append(temp_p)
    ds_t_all.append(temp_t)
precip = xr.concat(ds_p_all, dim='time')
print(precip)
tmp_2m = xr.concat(ds_t_all, dim='time')

There is no password.

Transformation / Alignment / Merging

How to combine forecast data in a zarr format is not entirely clear to me yet - meaning what type of chunking should be used or merging. Often times analysis are carried out both as a function of start times and lead times, so I don't think we can have one structure that makes everyone happy. But I will figure out what is the structure that makes more sense.

Output Dataset

I am interested in transforming them in zarr format. It might be necessary to first download them and them opening them using a grib engine and then push them to the cloud, because as of now I couldn't use a grib engine.

Prefect registration action versioning.

As described in this comment the recipe-prefect-action controls the base image (and all of its corresponding library versioning) used to build and register flows for new recipe PRs in staged-recipes. Currently we are specifying pinned version tags of the action used in the registration workflow.

I'm considering that we should relax this tag and use the HEAD of main so that updating the recipe-prefect-action does not require a corresponding update to staged-recipes. I think we can safely assume that we would always like new PRs in staged-recipes to be registered using the most current image available.

Once we have decided on an approach here I'll write up an ADR detailing the thoughts in my roadmap comment. cc @rabernat @ciaranevans

Consolidate pre-commit tool configuration.

We have currently have several separate tool configs for pre-commit. Can we consolidate these all into the pre-commit config file itself?

Example pipeline for the High Resolution Rapid Refresh (HRRR) model

See: blaylockbk/Herbie#2 (comment)

@rabernat , I hope the info at the above URL suffices to describe the workflow.

@chiaral, note that this workflow starts with GRIB2 files as in #17.

After downloading the GRIB2 files, I convert them to NetCDF using wgrib2 (installed from conda-forge) before using rechunker. (I first tried using cfgrib but it was very slow so I went looking for other faster solutions).

pre-commit is not well documented and interactive feedback is not provided

In #74, the pre-commit workflow failed: https://github.com/pangeo-forge/staged-recipes/pull/74/checks?check_run_id=3558355719

How would a user know that they are supposed to style their code this way

This workflow is left over from @TomAugspurger's early draft (https://github.com/pangeo-forge/staged-recipes/blob/master/.github/workflows/pre-commit.yaml). I propose we just remove it for now until we have a better story for linting of recipes.

I like what Conda Forge does: a bot actually posts on the PR letting you know what needs to be changed.

Example pipeline for CMIP6 full grid files

Source Dataset

The CMIP6 archive does only provide a subset of grid metrics for the ocean output, which inhibits more complex analysis tasks (e.g. calculating spatial gradients). I am currently trying to collect the full grid files for many (hopefully all at some point) CMIP6 models here, but thought it might be helpful to start with one that is already available (big thanks to @adcroft for providing these files for the GFDL models!).

Link to the website / online documentation for the data:
The file format (e.g. netCDF, csv): netcdf
How are the source files organized? (e.g. one file per day): Just one file per model (note this includes two different model setups from GFDL, we might want to make the recipe strictly one per model)
How are the source files accessed (e.g. FTP): FTP
- provide an example link if possibleFiles are available on a public FTP here:
  - ftp://ftp.gfdl.noaa.gov/perm/Alistair.Adcroft/MOM6-testing/OM4_05/ocean_static.nc
  - ftp://ftp.gfdl.noaa.gov/perm/Alistair.Adcroft/MOM6-testing/OM4_025/ocean_static.nc
Any special steps required to access the data (e.g. password required): Nope

Transformation / Alignment / Merging

The only thing that I would consider doing here is renaming these files to directly work with cmip6_preprocessing processed data (e.g. rename xh to x etc). Otherwise nothing needs to be done.

Output Dataset

These are fairly small, so I think we can store each variable as a single chunk

cc @cisaacstern

I made a quick example how to download and open the files here

Pipeline for SODA 3.4.2

SODA 3.4.2 Dataset

SODA version 3.4.2 is an ocean reanalysis that is available on the native grid in netcdf format. The native interlaced horizonal velocity and conserved tracer (e.g. temperature and salinity) grids form a tripolar Arakawa-B grid, varying from 0.1°x0.25° at high latitude to 1/4°x1/4° in the tropics (quasi-isotropic grid spacing increases from ~11.7km at 65 latitude to ~28.0km at the Equator, 1440x1070 grid points). The (1.5Gb sized) topography map (created by Whit Anderson of GFDL) is here.

Link to the website, read me documentation, and download page
The file format is netCDF
The source files come in a complete set of roughly 2600 SODA3.4.2 ocean and sea ice files on the original approximately 1/4°x1/4° displaced pole non-Mercator horizontal grid at 50 z* vertical levels, every 5-days
Can download by
- wget -r -l1 --no-parent --progress=bar -nd -A 'soda3.4.2_5dy_ocean_*.nc' https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/ocean/
- wget -r -l1 --no-parent --progress=bar -nd -A 'soda3.4.2_5dy_ice_*.nc' https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/ice/
Password shouldn't be required

Transformation / Alignment / Merging

Output Dataset

Output of pipeline would ideally be stored as zarr files for optimal cloud computing performance.

Recipe for iHESP Global Datasets

Source Dataset

iHESP is focused on high-resolution, coupled climate simulations spanning the entire globe and regionally downscaled simulations of a region of interest (ex: Gulf of Mexico). Our global climate datasets have been generated using a high‐resolution configuration of the Community Earth System Model version 1.3 (CESM1.3), with a nominal horizontal resolution of 0.25° for the atmosphere and land models and 0.1° for the ocean and sea‐ice models. At these resolutions, the model permits tropical cyclones and ocean mesoscale eddies, allowing interactions between these synoptic and mesoscale phenomena with large‐scale circulations.

Link to the website / online documentation for the data:
The file format (e.g. netCDF, csv): netCDF
How are the source files organized? (e.g. one file per day): monthly output
How are the source files accessed (e.g. FTP): THREDDS server with HTTP and OPeNDAP endpoints
- https://datahub.geos.tamu.edu:8880/thredds/catalog/B.E.13.B1850C5.ne120_t12.sehires38.003.sunway_02/catalog.html
- https://datahub.geos.tamu.edu:8880/thredds/fileServer/B.E.13.B1850C5.ne120_t12.sehires38.003.sunway_02/ocn/SSH/B.E.13.B1850C5.ne120_t12.sehires38.003.sunway_02.pop.h.SSH.032001-032912.nc
Any special steps required to access the data (e.g. password required): No

Transformation / Alignment / Merging

The data should merge cleanly.

Output Dataset

Zarr with all time slice files unified into a single timeseries. Possibly merging of different variables as well.

cc @paigem, @abishekg7

Proposed Recipes for NOAA/NSIDC Climate Data Record of Passive Microwave Sea Ice Concentration

Source Dataset

This data set provides a Climate Data Record (CDR) of sea ice concentration from passive microwave data. The CDR algorithm output is a rule-based combination of ice concentration estimates from two well-established algorithms: the NASA Team (NT) algorithm (Cavalieri et al. 1984) and NASA Bootstrap (BT) algorithm (Comiso 1986). The CDR is a consistent, daily and monthly time series of sea ice concentrations from 25 October 1978 through the most recent processing for both the north and south polar regions. All data are on a 25 km x 25 km grid.

Link to the website / online documentation for the data: https://nsidc.org/data/G02202/versions/4
The file format (e.g. netCDF, csv): netCDF
How are the source files organized? (e.g. one file per day): not sure
How are the source files accessed (e.g. FTP): FTP - ftp://sidads.colorado.edu/DATASETS/NOAA/G02202_V4/
Any special steps required to access the data (e.g. password required): Registration required

Transformation / Alignment / Merging

None

Output Dataset

One Zarr each for northern and southern hemisphere

cc @stb2145 - This would be useful for your project

Proposed Recipes for ERA5

There are currently a few subset's of the ERA5 dataset on cloud storage (example but none are complete or updated regularly. It wont be a trivial recipe to implement with Pangeo-Forge but it would be a good stretch goal to support such a dataset.

Source Dataset

Link to the website / online documentation for the data: https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5, https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels?tab=overview
The file format: NetCDF/GRIB
How are the source files organized? One file per api request.
How are the source files accessed: Copernicus Climate Data Store, API Request (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels?tab=form)
Any special steps required to access the data: access through the cdsapi or cdstoolbox apis

Transformation / Alignment / Merging

Most likely, the best way to access and arrange the data is in 1-day chunks, concatenating along the time dimension. Given the large user pool for this dataset, I would suggest this recipe does as little data processing as possible.

Output Dataset

One (or more?) Zarr stores. Hourly data for all available variables, all pressure levels, etc.

Example pipeline for CM2.6

Source Dataset

CM2.6 is a high-resolution global climate model run by GFDL. There are two scenarios: a preindustrial control and a 1% CO2 increase.
We already have some CM2.6 data in google cloud: https://catalog.pangeo.io/browse/master/ocean/GFDL_CM2_6/
I created it manually.

https://www.gfdl.noaa.gov/cm2-6/
Format: netCDF4, one file per month, files grouped into different variable classes (e.g. surface, interior, etc.). File names look like 01800101.ocean.nc.
Access: The data are stored in two places:
- On the GFDL supercomputer (accessible to very few, with high security)
- On CyVerse (special permission required)
The files can be downloaded from CyVerse with IRODS. A download command looks like igetwild /iplant/home/shared/iclimate/control field_u.nc e &. Special access tokens must be configured.

Transformation / Alignment / Merging

In general, we want to concatenate the files in time. However, different variables in different files have different time resolutions (monthly, 5-day, daily).

Getting the files to concatenate cleanly required some manual tweaks (dropping variables and overwriting coordinates). There are weird glitches and inconsistencies between different files from the same output set. Some workflows are documented in this repo.

Output Dataset

I think we would like one zarr dataset for all variables with the same grid and temporal resolution. Chunked in time. For 3D data, we also need to chunk in space, probably the vertical dimension makes most sense.

Does the OSN storage support saving analysis data?

I've started to analyze the SWOT-AdAC data (#24 #26 #29 ) on a Google Cloud based Jupyterhub but does the OSN storage also support saving of analysis data?

GNSS data from UNAVCO: RINEX -> NetCDF4 -> Zarr

@timdittmann of @unavco and I had a great chat this morning about bringing UNAVCO's GNSS data archive to the cloud. UNAVCO's 30 year archive of GNSS data currently exists in RINEX format. Tim has already explored converting these files to NetCDF4 using geospace-code/georinex.

UNAVCO is an interesting position insofar as they are able to make choices about how their NetCDF archives will look, specifically to ease eventual conversion to Zarr (or other cloud optimized formats). Tim may chime in with further questions on that subject here, in which case @rabernat may have some informed perspectives to share.

Once Tim has a test NetCDF file or two completed, he indicated that he'd open an separate issue on https://github.com/pangeo-forge/staged-recipes/issues with details, from which I can try my hand at writing an example recipe for him using pangeo-forge/pangeo-forge-recipes.

Starting this issue as a place to discuss this subject further. Looking forward to this exciting new collaboration!

Finalize Github workflow logic for running recipes in feedstocks.

The proposed workflow logic for running recipes in feedstock repos is described here https://github.com/pangeo-forge/roadmap/blob/master/doc/adr/0001-github-workflows.md but we should probably review for a consensus decision and address some open questions around versioning, updating existing zarr archives.

The initial naive Github action will register and run the full recipe (replacing an existing zarr store if it exists) for each recipe in the repository when a new release tag is created. While functional, there are several issues with this approach which should be addressed.

The action should really work in a similar fashion to the existing staged-recipes PR testing workflow which only proceeds when the PR includes changes to a meta.yaml file. I'm unsure how diff should be used with a release tag in order to determine the listing of meta.yaml files that have been updated in a release. There is also the possibility that the underlying recipe code can change with no corresponding change to the meta.yaml so I would suggest the possibility of including a root level version property to the meta.yaml spec to facilitate flagging with diff.
The current approach will naively replace the entire existing zarr archive when the recipe is re-run. (Additionally, I would like clarification on how pangeo-forge-recipes will handle the case of writing to object storage when the target archive already exists.). There has been previous discussion about incrementally updating existing archives but this poses a few questions in relation to workflows. How should underlying changes to source data (notably from reprocessing efforts) be handled? How should intermediate cache files invalidation be handled for source data changes?
Should our zarr target structure patterns include some sort of versioning scheme which is tied to our release tagging? This simplifies many aspects of the workflow process but may be confusing for end users when they encounter multiple versions of the archive.
Another open question around feedstock workflows concerns how we should manage recipes for different chunking strategies for the same data. I believe it makes sense that we maintain single feedstocks for a source dataset and that different chunking strategies are treated as unique recipes within the feedstock. With this approach, users could submit PRs for new chunking strategies and using the diff method discussed above we can run only the recipes which were changed in the release.

As discussed in the coordination meeting today. I'll push forward with the naive approach initially so that we can close #58. Then @rabernat can schedule a sprint where we can address these open questions and update the workflows.

Example pipeline for gridMET

Source Dataset

gridMET is a dataset of 4km daily surface meteorological data covering the CONUS domain from 1979-yesterday.

http://www.climatologylab.org/gridmet.html
NetCDF
One file per year-variable
How are the source files accessed: HTTP download or OPENDaP
- HTTP: https://www.northwestknowledge.net/metdata/data/
- OPENDaP: http://thredds.northwestknowledge.net:8080/thredds/reacch_climate_MET_catalog.html
Any special steps required to access the data: No

Transformation / Alignment / Merging

Files should be concatenated along the time dimension and merged along the variable dimension

Output Dataset

1 Zarr store - chunks oriented for both time series and spatial analysis.

Example pipeline for GFS Archive

Source Dataset

Link to the website: https://rda.ucar.edu/datasets/ds084.1/ (full archive ~2015 - present). Another source is s3://noaa-gfs-bdp-pds/gfs.* although it's ~20210226 onwards
The file format: opendap / grib
How are the source files organized? one file per forecast hour
How are the source files accessed? pydap or download the grib files
provide an example link if possible:

import xarray as xr

url = "https://rda.ucar.edu/thredds/dodsC/files/g/ds084.1/2020/20200201/gfs.0p25.2020020100.f000.grib2"
ds = xr.open_dataset(url)

import requests
import cfgrib

login_url = "https://rda.ucar.edu/cgi-bin/login"
ret = requests.post(
    login_url,
    data={"email": EMAIL, "passwd": PASSWD, "action": "login"},
)
file = "https://rda.ucar.edu/data/ds084.1/2020/20200201/gfs.0p25.2020020100.f000.grib2"
req = requests.get(file, cookies=ret.cookies, allow_redirects=True)
open("gfs.0p25.2020020100.f000.grib2", "wb").write(req.content)
dss = cfgrib.open_datasets("gfs.0p25.2020020100.f000.grib2")

import s3fs
fs = s3fs.S3FileSystem()
fs.get("s3://noaa-gfs-bdp-pds/gfs.20210914/12/atmos/gfs.t12z.pgrb2.0p25.f000", "gfs.0p25.2021091412.f000.grib2")
dss = cfgrib.open_datasets("gfs.0p25.2021091412.f000.grib2")

Any special steps required to access the data (e.g. password required) - Yes have to sign up at https://rda.ucar.edu/. authentification can be done as shown here https://stackoverflow.com/a/66179413/6046019

Transformation / Alignment / Merging

Concat along reftime (init time) and time

Output Dataset

zarr store.

I imagine one giant zarr store would be crazy so could be stored for one init time and all forecast times. Ideally with init time an expanded dim so it can be concatenated later.

Merge our first recipe PR(s)

Note: This began as a conversation on #57, but I'm migrating it to its own issue so it doesn't get buried once that PR is merged. - Charles

@sharkinsspatial, do you know what will happen with these directories when they are merged? I believe, following conda-forge's example, they will become standalone repos within the pangeo-forge org? I feel like this is may be in an ADR somewhere, but not sure which one, and either way wanted to take the opportunity to start a conversation about merging some recipe PR's.

@cisaacstern The proposed Github actions flow is covered in this ADR. Your take is correct that upon merging, each meta.yaml will be used to generate a new feedstock repository. I agree that <Your feedstock name > may be a more appropriate terminology, especially as the meta.yaml may be referencing multiple recipes.

I had been holding on this next block of work as given the issues outlined in pangeo-forge/pangeo-forge-recipes#151, none of our PRs could pass the our recipe test workflow but with the 0.4.0 release we should be able to run tests for pruned recipes. @ciaranevans and I need to undertake the following...

Release a new bakery image using 0.4.0. ✅
Update pangeo-forge-prefect for the recipe class signature changes introduced by 0.4.0. ✅
Include the recipe pruning usage for testing described in pangeo-forge/pangeo-forge-recipes#139 (review) ✅
Build the template repo and Github action for spinning out a new feedstock repository when a PR is merged in `staged-recipes. ✅

I'd like to target having all this in place prior to our call next week (I'll be out traveling on Friday). I believe some of the existing PRs in staged-recipes may need modification to change their use of FilePattern but I'll let @TomAugspurger weigh in on this.

Originally posted by @sharkinsspatial in #57 (comment)

Example pipeline for SWOT-Xover

Source Dataset

SWOT-Xover is a subset of a few basin-scale model outputs with the resolution of ~1/50° surface hourly and interior daily data. The subsets will cover the cross-over regions of the SWOT fast-sampling phase.

Project description is given here
File format: zarr
Organization of file: one file for six-months of surface and interior data each (i.e. two files per model per region).
File access: automating the zarrification of datasets pulled from FTP servers.

Transformation / Alignment / Merging

Files should be concatenated along the time dimension.

Output Dataset

The zarrification of data should be automated via the pangeo-forge pipeline following the pangeo-forge recipe. In order to facilitate the automation, we would ask each modelling group to have the outputs in netcdf4 format and make it available via an ftp server.
A single monthly file of daily-averaged 3D data of u, v, w, T & S in one region is ~30Gb. With the four regions, six months and five models, this would sum up to ~3.6Tb in total on the cloud storage. The chunks of the zarr dataset will be on the order of {'time':30, 'z':5, 'y':100, 'x':100}.
For the surface, a single daily file of hourly averaged data of SST, SSS, SSH, wind stress & buoyancy fluxes in one region is ~380Mb. With the regions, months and models, this sums up to ~45Gb. The chunks of the zarr dataset will be on the order of {'time':100, 'y':100, 'x':100}

Example pipeline for CESM POP low-resolution (1 degree)

Source Dataset

This is the ocean post-processed data of a low-resolution (1 degree ocean and atmosphere) Community Earth System Model (CESM) run: v5_rel04_BC5_ne30_g16. This is the low-resolution counterpart to the CESM run hybrid_v5_rel04_BC5_ne120_t12_pop62 with 0.1 degree ocean/0.25 degree atmosphere that is already available in the Pangeo Cloud Data Catalog here. Data is output as daily averages for a total of 166 model years. The data current sit on the Climate Data Gateway at NCAR.

The website to download the data is here, but requires authentication to access (see below). A publicly accessible website that lists the variables can be found here.
The file format is netCDF.
There is one netCDF file per variable, with 14 variables total. Each of the 14 netCDF files are between 15GB and 20GB, which together sum to about 251GB.
Source files can be accessed via wget or curl
- Scripts to run wget or curl are provided after logging in from this page, selecting all of the listed files, and clicking "Download Options for Selections".
Authentication is required to download the data. I was able to access using my UCAR CIT account, but it appears that there are three authentication options on this page, after clicking "Download Options":
- use a UCAR CIT account
- use an OpenId account
- register for a guest account at Climate Data Gateway

Transformation / Alignment / Merging

The files should be combined into one dataset comprising all 14 variables, so it can be loaded in, e.g., as an Xarray Dataset.

Output Dataset

The files should be stored in the Zarr format.

Proposed Recipes for Global Drifter Program

Source Dataset

I'm trying to port a code that converts the hourly GDP dataset (~20'000 individual netCDF files) into a ragged array to a pangeo-forge recipe.

AOML website link.
The file format is netCDF.
One file per drifter.
Accessible through ftp
- ftp://ftp.aoml.noaa.gov/phod/pub/lumpkin/hourly/v2.00/netcdf/drifter_101143.nc

Transformation / Alignment / Merging

The files contain a few variables and metadata that should in fact be stored as variables. I have a draft recipe that cleans up the file, parses the date, converts metadata to variables. The files have two dimensions ['traj', 'obs'], where each file (one ['traj']) contains a different number of observations ['obs'] (this ranges from ~1000-20000+). More precisely, scalar variables['traj'] are: type of buoy, dimensions, launch date, drogue loss date, etc., and vector variables['obs'] are: lon, lat, ve, vn, time, err, etc.

To create the ragged array, scalar variables should be concatenated together, and the same goes for the various vector variables. My two issues are:

is it possible to concatenate on multiple dimensions?
is it possible to set a non-uniform chunk size? Chunks would be set to the number of observations per trajectory, which is needed to merge the nth chunk and allow for efficient calculations per trajectory.

Cheers,

Output Dataset

Single netCDF archive, or a Zarr folder.

Example pipeline for ERA5

Source Dataset

Link to the website / online documentation for the data
- https://www.ecmwf.int/en/forecasts/dataset/ecmwf-reanalysis-v5
The file format (e.g. netCDF, csv)
- netCDF, Grib2
How are the source files organized? (e.g. one file per day)
- Data is accessible by querying the MARS API from ECMWF (docs here).
- Data is accessible from the MARS API via a selection DSL (examples, syntax)
Any special steps required to access the data (e.g. password required)
- ECMWF keys are required (docs).
- Copernicus (CDS) provides fast access to ERA5.
- If data is not listed in CDS, ECMWF provides slow access to the data, which is stored on tape drives. Care needs to be taken to access data in this archival format. Consider reviewing the retrieval documentation, especially the "retrieval efficiency" section.
- ERA5 files can be quite large (~1GB in size per query). Downloading jobs should be partitioned (select smaller subsets of the overall data).

Transformation / Alignment / Merging

Output Dataset

Ideally, the datasets would be converted into a single Zarr with a 10-100MB chunk size.

Example pipeline for Bio-Formats

Following on from https://twitter.com/rabernat/status/1346849632484257800, feel free to ignore/close/delete as appropriate.

Source Dataset

Any number of bio-imaging filesets which can be parsed by github.com/ome/bioformats, which parses over 150 different file formats. A large number of format-organized files (1-2 TB) from our daily test suite can be found under https://downloads.openmicroscopy.org/images/, downloadable via HTTP.

Public, CC-BY, reference data (0.15 PB and growing) which it would be valuable to store publicly can be found under https://idr.openmicroscopy.org/. Each study organizes its data differently and would likely be captured by a separate pipeline. Download would currently be most efficient using Aspera but we will be migrating the input files to S3. Note, however, that the input for each single image may need to be a list of files, e.g.

def generate_zarrs(filesets: List[List[str]]) -> List[str]

Transformation / Alignment / Merging

bioformats2raw will turn any @bioformats-supported file into an OME-Zarr.
rechunking may be necessary for various interactive scenarios (2D vs 3D)

Output Dataset

Each OME-Zarr output (potentially multiple from a single set of inputs) follows the specification available under https://ngff.openmicroscopy.org/latest/

An initial example from Jan. 6. 2020 can be found in https://gist.github.com/joshmoore/31303ef65820b8af5728d92a4f2d9b51

Incorrect git diff syntax in run-recipe-test workflow.

The intent of run-recipe-test is to determine the location of the meta.yaml in the submitted PR and then use this meta.yaml to register the recipe as a Prefect flow and run it (using a pruned version with a smaller portion of the concat dimension).

To determine the location of the meta.yaml in the PR we have been using the following git command to obtain a list of altered files in the PR.

git diff --name-only origin/master HEAD

While this appeared to work correctly initially, it was only because no conflicting commits had been made to the master branch so the PR branch and master were aligned. Once additional commits have been made to master git diff will list the diffed files from both branches. The behavior we desire is to only list the files from the PR merge branch. In this case I believe the incantation we want is

git diff --name-only origin/master...HEAD

The man page states "git diff A...B" is equivalent to "git diff $(git merge-base A B) B" but that is a bit opaque for me. Can someone more Git knowledgeable verify that this change should provide the desired behavior?

Example pipeline for AMPS output stored at NCAR

Source Dataset

Retrieving output from AMPS archive on NCAR HPSS (tape archive quickly nearing its End-Of-Life) for public storage on Google Cloud Services and access by Pangeo and more generally xarray

The Antarctic Mesoscale Prediction System (AMPS) is a real-time weather forecasting tool primarily in support of the NSF's United States Antarctic Program (USAP). It consists of the assimilation of surface and upper air observations into the Weather Research and Forecasting (WRF) model - forced at its boundaries by the GFS model. There are two outer nested pan-Antarctic domains with several additional higher resolution domains over areas-of-interest.

https://www2.mmm.ucar.edu/rt/amps/information/information.html
NetCDF, GRIB
One WRF time slice per history file. Generally, there are two simulations per day (00z and 12z initialization), with varying model itegration lengths. History file period is generally 3 hours. Since AMPS is an operational product over the history of AMPS project, grid resolution, domain size/location, and model code, parameterizations/physics, and setup, have not been held consistent. This means that model output moved to Google Cloud Storage will be a balanced mix of these, with priority for immediate or planned regions or processes of interest.
Residing on the HPSS tape archive, first task is to pull them from tape to local 'online' GLADE storage, where they reside temporarily while awaiting transfer off-site. Options for tranfer include GLOBUS, bbcp, and scp/sftp.
- https://www2.mmm.ucar.edu/rt/amps/information/amps_archive_hpss_details.html
Data is on NCAR HPC, so an account and appropriate credentials required. Since there will be no computations involved, just data movement, suggest to request "Small" allocation account.

Transformation / Alignment / Merging

The raw WRF output files are NetCDF (and GRIB) format and usable by many software packages, however several common post-processing procedures can both make the data smaller and more usable (e.g. converting to pressure level data, subsetting fields, and de-staggering winds).

Output Dataset

In both cases, these raw or post-processed NetCDF files should be converted to a zarr object for optimization in cloud-based xarray routines. Ideally this conversion (e.g. using xarray method to_zarr) would occur either on the NCAR HPC or within Pangeo or similar cloud computing environment.

Proposed Recipes for Antarctic ice sheet paleo PISM ensemble

Source Dataset

Simulations of the Antarctic ice sheet over the last 20ka performed by @talbrecht using the Parallel Ice Sheet Model (PISM)

Albrecht, Torsten (2019): PISM parameter ensemble analysis of Antarctic Ice Sheet glacial cycle simulations. PANGAEA, https://doi.pangaea.de/10.1594/PANGAEA.909728

The file format: netcdf
How are the source files organized?
one directory per ensemble member, in each directory multiple netcdfs corresponding to different snapshots of the model state (e.g., snapshots_-15000.000.nc) and one containing timeseries of aggregations of the model state (timeseries.nc)
How are the source files accessed (e.g. FTP)
zip file freely downloadable from https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip.

Transformation / Alignment / Merging

All ensemble members and time snapshots should be combined in one xarray with dimensions corresponding to x, y, time, and four model parameters.
Also all 'timeseries.nc' files (each one corresponding to one ensemble member should be collated together into another single xarray.
This involves an unstack step to get the four parameters their own dimensions in the xarray, as discussed here.

Output Dataset

One zarr directory for each of xarrays described above (two in total).

Progress so far

Much of this work has been done using a larger version of the model output (with more timeslices, one every kyr instead of one every 5kyr):
-- all the timeslices and ensemble members were collated and unstacked'd into the correctly shaped xarray, then uploaded to GCS: https://github.com/ldeo-glaciology/pangeo-pismpaleo/blob/main/pism_paleo_nc_to_zarr.ipynb (note that this was done on the University of Potsdam's HPC and did NOT start with the zip file linked to above).
-- then we made an intake catalog, [here] (https://github.com/ldeo-glaciology/pangeo-pismpaleo/blob/48b16dca56d3b736b6f05acdb63ca83744c4f8d4/intake_catalog_setup.ipynb)

As described here, these data are now accessible from a google bucket, e.g.

cat = intake.open_catalog('https://raw.githubusercontent.com/ldeo-glaciology/pangeo-pismpaleo/main/paleopism.yaml')
snapshots1ka  = cat["snapshots1ka"].to_dask()
mask_score_time_series  = cat["mask_score_time_series"].to_dask()

These two zarrs are the result of collating all the timeseries.nc and the snapshots_*.nc, respectively (as described above). Additionally we have

vels5ka  = cat["vels5ka"].to_dask()
present  = cat["present"].to_dask()

which contain just the velocities at 5 kyr resolution and the present day state (t = 0 kyr BP) of the model, respectively.

Here is a notebook showing how to access these data in pangeo.

Question for @talbrecht and @rabernat: should we make this recipe with the smaller dataset contained in the zip, or do we want to use the larger dataset? I like the larger dataset because it is large enough to start really needing clusters and it is more useful for comparing to data when you have that higher time resolution. What do you think?

Make Terraclimate recipe

The docs now contain a full tutorial example for Terraclimate.

This can be used to rewrite the terraclimate feedstock with the new conventions.

Proposed Recipes for CHIRPS 2.0

Note: I became aware of this dataset and the potential recipe via this twitter thread with @alexgleith

Source Dataset

CHIRPS: Rainfall Estimates from Rain Gauge and Satellite Observations

Since 1999, USGS and CHC scientists—supported by funding from USAID, NASA, and NOAA—have developed techniques for producing rainfall maps, especially in areas where surface data is sparse.

Estimating rainfall variations in space and time is a key aspect of drought early warning and environmental monitoring. An evolving drier-than-normal season must be placed in a historical context so that the severity of rainfall deficits can be quickly evaluated. However, estimates derived from satellite data provide areal averages that suffer from biases due to complex terrain, which often underestimate the intensity of extreme precipitation events. Conversely, precipitation grids produced from station data suffer in more rural regions where there are less rain-gauge stations. CHIRPS was created in collaboration with scientists at the USGS Earth Resources Observation and Science (EROS) Center in order to deliver complete, reliable, up-to-date data sets for a number of early warning objectives, like trend analysis and seasonal drought monitoring.

Link to the website / online documentation for the data: https://www.chc.ucsb.edu/data/chirps
The file format (e.g. netCDF, csv): gziped TIF
How are the source files organized? (e.g. one file per day): daily
How are the source files accessed (e.g. FTP): http

See script at https://github.com/digitalearthafrica/deafrica-scripts/blob/main/deafrica/data/chirps.py for a great starting point

Transformation / Alignment / Merging

None

Output Dataset

The DE-Africa folks are converting from regular GeoTIFF to COG + STAC catalog. So this could be a useful test scenario for that sort of pipeline.

NCAR Climate Data Guide

I just learned about the NCAR climate data guide: https://climatedataguide.ucar.edu/

This is a great list of datasets we would potentially want to ingest in to pangeo forge!

Example pipeline for SMAP Seasurface Salinity

Source Dataset

Temperature and salinity are two fundamental variables in the ocean for many applications. While Sea Surface Temperatures (SST) has been observed by satellites for many decades (and is hopefully added to the cloud soon in #20 ), Sea Surface Salinity (SSS) has only 'recently' been added to the remote observations (starting with the European SMOS mission in 2009). Here I propose to create a pipeline for the currently active NASA-SMAP platform into ARCO storage.

Link to the website / online documentation for the data: SMAP data is processed in two time frequencies (8-day running mean, monthly mean) by two different data centers with different algorithms (JPL and RSS). All datasets are available via POODAC.
The file format (e.g. netCDF, csv): netcdf
How are the source files organized? (e.g. one file per day): The files are organized as one file per day (8-day running mean) and one file per month (monthly mean).
How are the source files accessed (e.g. FTP): The data is available via Opendap/Thredds (example for JPL 8-day running mean)
- provide an example link if possible: https://podaac-opendap.jpl.nasa.gov/opendap/allData/smap/L3/JPL/V5.0/8day_running/2015/121/SMAP_L3_SSS_20150505_8DAYS_V5.0.nc.html
Any special steps required to access the data (e.g. password required): nope

Transformation / Alignment / Merging

Single time steps for the JPL product are about 30MB, so chunking in time (maybe 3 time chunks) might make sense, but should not be strictly necessary.

Output Dataset

I think one zarr store per time frequency and alorithm would be the ideal way to access this data.

cc @cisaacstern @hscannell

Example pipeline for FIA

Source Dataset

The Forest Inventory and Analysis (FIA) dataset from the U.S. Forest Service is a collection of in situ observations of forest parameters.

https://apps.fs.usda.gov/fia/datamart/CSV/datamart_csv.html
CSV
One file per data stream - total about 30 CSVs.
Individual CSV or bulk Zip file
- Bulk zip file: https://apps.fs.usda.gov/fia/datamart/CSV/ENTIRE.zip
- Individual CSV: https://apps.fs.usda.gov/fia/datamart/CSV/PLOT.csv
No registration or authentication required

Transformation / Alignment / Merging

Just some clean up of the CSVs, data type parsing, etc. There isn't a natural way to concatenate this dataset.

Output Dataset

30 parquet datasets

Example pipeline for derived datasets from CMIP6 cloud data

Source Dataset

The source data would be the CMIP6 archive, which is already stored in zarr format on Google Cloud Storage

Transformation / Alignment / Merging

My motivation here would be to come up with a simple derived product from existing cloud data, like mean global surface data, which typically requires a long computation and is of general intersest for the community. Additionally to saving repeated computation, the resulting data are very small.

Ultimately I want to use the existing (and future) functionality of cmip6_preprocessing, to derive more complex transformations, but for now I would like to test and understand how to use prefect to automate these processes.

For the mean surface temperature I require two different datasets (Surface temperature + surface grid cell area), which can be combined using cmip6_preprocessing. I can then weight each dataset using the xarrayr.weighted functionality, average and save the results out into timeseries (one per member/model/grid_label).

Output Dataset

These outputs should be stored as zarr files. The easiest would probably be one file per model, with all of the members aggregated into one array.

Further Considerations

Id be happy to learn more about Pangeo-Forge and (if I can) help develop it. For now I will follow @rabernat 's suggestion and try out some simple examples and get accustomed to prefect.

Create pangeo-forge org bot account.

Currently many of the repo actions are running under a token created for my account. I would suggest creating pangeo-forge bot account and corresponding PAT for this account. These actions all reference the pangeo-forge organization level secret with the key ACTIONS_BOT_TOKEN which should be update with the newly created PAT.

Proposed Recipes for CESM2 Superparameterization Emulator

Source Dataset

Several years of high-frequency (15 min) GCM-timestep level output from the Superparameterized CESM2 isolating state variables "before" and "after" a key code region containing computationally intensive superparameterization calculations. For use in making a superparameterization emulator of explicitly resolved clouds + their radiation influence + turbulence, that can be used in a real-geography framework CESM2 framework, to side step the usual computational cost of SP. Similar in spirit to proof of concept in Rasp, Pritchard & Gentine (2018) and Mooers et al. (2021) but with new refinements by Mike Pritchard and Tom Beucler towards compatibility with operational, real-gegoraphy CESM2 (critically, isolating only tendencies up to surface coupling and including outputs relevant to CLM land model's expectations; see Mooers et al. concluding discussion)

Link to the website / online documentation for the data
N/A
The file format (e.g. netCDF, csv)
Raw model output is CESM2-formatted NetCDF history files
How are the source files organized? (e.g. one file per day)
One file per day across 8-10 sim-years each forced with the same annual cycle of SSTs.
How are the source files accessed (e.g. FTP)
Not publicly posted yet. Staged on a few XSEDE or NERSC clusters. Mike Pritchard's group at UC Irvine can help with access to these.
- provide an example link if possible
Any special steps required to access the data (e.g. password required)

Transformation / Alignment / Merging

**Apologies in advance if this is TMI. A starter core-dump from Mike Pritchard on a busy afternoon:

There are multiple pre-processing steps. The raw model output contains many more variables than one would want to analyze so there is trimming. But users may want to experiment with different inputs and outputs. So this trimming may be user-specific. Can provide guidance on specific variable names for inputs/outputs of published emulators worth competing with on request.

Important: A key subset of variables (surface fluxes) that probably everyone would want in their input vector will need to be time-shifted backward relative by one time step to avoid information leaks, having to due to which phase of the integration cycle these fluxes were saved at on history file vs. the emulated regions. Some users may want to make emulators that include memory of state variables from previous time steps in the input vector (e.g. as in Han et al., JAMES, 2020) in which case there is the same preprocessing issue of backwards time shifting made flexible to additional variables (physical caveat: likely no more than a few hours i.e. <= 10 temporal samples at most so never any reason to include contiguous temporal adjacency beyond that limit).

Many users may want to subsample lon,lat,time first to reduce data volume and promote independence of samples due to spatial and temporal autocorrelations riddled throughout the data. Other users may prefer to include all of these samples as fuel for ambitious architectures that require very data-rich limits to find good fits in. This sub-sampling is user-specific.

Many users wanting to make "one-size-fits-all" emulators (i.e. same NN for all grid cells) will want to flatten lon,lat,time into a generic "sample" dimension (retaining level variability) and shuffle that for ML and split into training/validation/test splits. Such users would also want to pre-normalize by means and range/stds defined independently for separate levs, but which lump together the flattened lon/lat/time statistics. Advanced users may want to train regional- or regime-specific emulators, which might then use regionally-aware normalizations, such that flexibility here would help.

Some users may want to convert the specific humidity and temperature input state variables into an equivalent relative humidity as an alternate input that is less prone to out of sample extrapolation when the emulator is tested prognostically. The RH conversion should use a fixed set of assumptions that are consistent with a f90 module for use in identical testing online; can provide python and f90 code when the time comes.

The surface pressure is vital information to make the vertical discretization physically relevant per CESM2's hybrid vertical eta coordinate. So should always be made available. The pressure mid points and pressure thickness of each vertical level can also be derived from this field but vary with lon,lat,time. Mass-weighting the outputs of vertically resolved variables like diabatic heating using the derived pressure thickness could be helpful to users wishing to prioritize the column influence of different samples as in Beucler et al. (2021, PRL).

Output Dataset

I am not qualified to assess the trade-offs of the options listed here but interested in learning.

Example folder out-of-date

The example directory has just 1 file in it - pipeline.py. My understanding is that we want a nice commented up template version of recipe.py/meta.yml instead.

@cisaacstern is the highly commented copy of those files from the sandbox something we could get in here?

Example use-case for Landsat data on AWS

Here is an example notebook showing a version of problem we are working to solve:

https://github.com/ldeo-glaciology/REMAWaterRouting/blob/b473a510bc57901119a85f3ba15051016984a943/landsat8_load.ipynb

Includes command-line magic for wget, instructions to manually download data from another external website, downloading zipped shape files and calculating intersections with image bounding boxes, downloading a list of images on S3 and the use of Beautiful Soup, some image processing with Numpy, and some other things as well.

This is a great example of toil, because although it appears to be working for the author, I was not able to get it to work. I think if Pangeo-forge can solve this type of use case scenario, we would be doing fantastic!

Example pipeline for CESM2-LE

CESM2-LE Dataset

The CESM2 Large Ensemble was generated in partnership with the IBS Center for Climate Physics in South Korea. When completed, the CESM2 Large Ensemble will consist of 100 members at 1 degree spatial resolution covering the period 1850-2100 under CMIP6 historical and SSP370 future radiative forcing scenarios. Data sets from this ensemble will be made available via the Climate Data Gateway on June 14th, 2021, with the data stored on the GLADE file system at NCAR

We are still waiting for the files to CMOR-ized, before uploading, but we can test this workflow on the non-CMORized datasets currently available.

Official CESM2-LENS Documentation
File format: netcdf
File organization: seperated by variable and frequency (ex. temperature, 6 hourly)
File access: stored on the GLADE file system at NCAR
Requires GLOBUS file transfer

Transformation / Alignment / Merging

These files should be combined into similar ensemble members (Ex. 1-10, 11-80, etc.) and by their respective variables and frequency, chunked in time. An example would be the CESM-LENS dataset

Here is an example of the potential experiment groupings

Output Dataset

The output dataset should be zarr format

Considerations for test flow run notifications in CI.

Currently we don't have a mechanism for receiving notifications that the flows triggered by a slash command dispatch on a PR have run successfully. There are some options for Prefect/Github Action integrations that may provide good solutions but there are some potential consequences that need to be considered.

Prefect is transitioning from their previous event model using cloud_hooks to the new automations model which seems relatively untested. After some discussions with the Prefect team I've decided to go with the cloud hooks approach for 2 reasons.

The documentation and prior art is better (marginally, the documentation around these features is lacking).
Automations are only available with Prefect Cloud. Though we are using Prefect Cloud for our initial prototyping there will likely be pangeo-forge implementations in the future using Prefect Server. To prepare for that, consistently using cloud-hooks seems like the safest route.

Notionally,

A flow registration and run is triggered by a PR comment slash command and a cloud hook is created for that flow's SUCCESS and FAILURE events.
When the Flow run completes, Prefect Cloud's triggers the cloud hook which invokes a Github webhook.
staged-recipes uses an action triggered by the webhook to update the original slash command comment with ✅ or ❌ reaction.

This is unfortunately convoluted but this asynchronous approach seems best to avoid long idle usage of CI minutes waiting for Prefect test flows to complete.

Example pipeline for [coawst_4]

Source Dataset

@rsignell-usgs will do the description later ;-p

Link to the website / online documentation for the data

https://geoport.usgs.esipfed.org/thredds/catalog/coawst_4/use/fmrc/catalog.html?dataset=coawst_4/use/fmrc/coawst_4_use_best.ncd

The file format (e.g. netCDF, csv)

netCDF

How are the source files organized? (e.g. one file per day)

Forecast model run collection

How are the source files accessed (e.g. FTP)
- provide an example link if possible
Any special steps required to access the data (e.g. password required)

netCDFSubet Service

https://geoport.usgs.esipfed.org/thredds/ncss/coawst_4/use/fmrc/coawst_4_use_best.ncd?var=Hwave&disableLLSubset=on&disableProjSubset=on&horizStride=1&time_start={{yyyy-mm-HHTMM}}%3A00%3A00Z&time_end={{yyyy-mm-HHTMM}}%3A00%3A00Z&timeStride=1&vertCoord=&accept=netcdf

Transformation / Alignment / Merging

Re-chunk time 1 to x

Output Dataset

zarr

Assistance on manually running pangeo-forge recipes

Hi there pangeo-forge team,

I’m interested in manually executing some of the staged-recipes, specifically the EOBS,
AGDC and Terraclimate recipes. I was wondering if someone, maybe @cisaacstern, had a bit of time to show me the ropes on how to manually execute these recipes or could point me in the right direction.

Cheers,

-Raphael

Example pipeline for IMERG

Source Dataset

IMERG is a dataset of 0.1° half-hourly precipitation estimates over the majority of the Earth's surface from 2000-present.

https://gpm.nasa.gov/data/imerg
HDF5
One file per half-hour
How are the source files accessed: HTTP download
- https://jsimpsonhttps.pps.eosdis.nasa.gov/imerg/late/200006/3B-HHR-L.MS.MRG.3IMERG.20000601-S000000-E002959.0000.V06B.RT-H5
Any special steps required to access the data: register for PPS data access

Transformation / Alignment / Merging

Files should be concatenated along the time dimension.

Output Dataset

1 Zarr store - chunks oriented for both time series and spatial analysis.

/run-recipe-test is hard coded to certain package versions

I think that the /run-recipe-test workflows hardcodes the docker image

staged-recipes/.github/workflows/run-recipe-test.yaml

Line 42 in 99989f2

uses: pangeo-forge/[email protected]_prefect-0.14.22_pangeoforgerecipes-0.5.0

However, ADR 2 specifies that the pangeo-forge-recipes version is given in meta.yaml.

Example pipeline for NOAA OISST

This is the dataset I have been using for all of my testing. It is also the dataset used for the tutorial

Source Dataset

NOAA Optimum Interpolation Sea Surface Temperature

Link to the website / online documentation for the data: https://www.ncdc.noaa.gov/oisst/optimum-interpolation-sea-surface-temperature-oisst-v21
The file format (e.g. netCDF, csv): netCDF
How are the source files organized? (e.g. one file per day): one file per day
How are the source files accessed (e.g. FTP): HTTP
- provide an example link if possible: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc
Any special steps required to access the data (e.g. password required)

Transformation / Alignment / Merging

The files should be concatenated along the time dimension.

Output Dataset

A single Zarr group with chunk size ~100 MB.

Example pipeline for [Dataset Name]

Source Dataset

Link to the website / online documentation for the data
The file format (e.g. netCDF, csv)
How are the source files organized? (e.g. one file per day)
Any special steps required to access the data (e.g. password required)

Transformation / Alignment / Merging

Output Dataset

recipe test action does not give feedback on the PR

On #74 I used the /run-recipe-test command. It ran and turned up an error: https://github.com/pangeo-forge/staged-recipes/runs/3558398362?check_suite_focus=true

However, there was no feedback posted on #74. So the user would have no way to know that the test failed unless they somehow knew to go look at the actions tab.

We need to provide some interactive feedback, like Conda Forge.

test out CMIP6 recipe with surface variables for SSP585

Following up to our CMIP6-in-the-cloud collaboration meeting last week, wanted to include some specs that it would useful to test out the CMIP6 recipe with:

member_id: r1i1p1f1 if available, otherwise analogous ensemble member (e.g. r2i1p1f1)
experiment_id: ssp585
variable_id: tasmax, tasmin, pr
table_id: day
activity_id: ScenarioMIP

Models: all available with the above specs

cc @cisaacstern @naomi-henderson

pangeo-forge / staged-recipes Goto Github PK

staged-recipes's Introduction

staged-recipes

Adding a new Recipe

Developing a Recipe

staged-recipes's People

Contributors

Stargazers

Watchers

Forkers

staged-recipes's Issues

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Licensing Question

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

SODA 3.4.2 Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Progress so far

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset

Transformation / Alignment / Merging

Output Dataset

Source Dataset