desihub / desibackup

Contains wrapper scripts for backing up DESI data using the hpsspy package.

License: BSD 3-Clause "New" or "Revised" License

Shell 61.32% HTML 38.68%

desibackup's People

Contributors

sbailey, weaverba137


desibackup's Issues

New data in spectro directory

Examine files in the spectro directory that are newer than the existing backups (a sketch of this check follows the list below).

  • spectro/redux/sjb/dogwood has new data, but even the new data are from 2015! Maybe just re-backup.
  • Are other directories in spectro/redux/sjb archival?
  • redux/oak1/bricks/3127p090/blat.h5: Just one file. Not sure who originally owned it.
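The "newer than the backups" check can be scripted. Below is a minimal sketch, assuming a placeholder root directory and an assumed last-backup date; neither value is taken from the actual backup records.

# Hypothetical sketch: list files under a spectro subdirectory whose
# modification time postdates the last backup.  The root path and the cutoff
# date below are placeholders, not the actual backup state.
import datetime
import os

root = "/project/projectdirs/desi/spectro/redux/sjb/dogwood"   # placeholder
cutoff = datetime.datetime(2016, 1, 1).timestamp()             # assumed backup date

newer = []
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        if os.path.getmtime(path) > cutoff:
            newer.append(path)

for path in sorted(newer):
    print(path)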

backup datachallenge/quicklook/review-19.1

Please add /project/projectdirs/desi/datachallenge/quicklook/review-19.1 to the backups, while excluding review-19.1/redux/preproc/ (114G of easily reproduced preprocessed data).
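A minimal sketch of the requested selection, assuming a simple path filter: the root and exclude paths come from the request above, but the filtering code is only an illustration, not desiBackup's actual mechanism (which works through the regexes in etc/desi.json).

# Hypothetical sketch: build a file list for review-19.1 while skipping the
# easily reproduced preproc files.  The root and exclude paths are from the
# request above; the filtering approach is illustrative only.
import os

root = "/project/projectdirs/desi/datachallenge/quicklook/review-19.1"
exclude = os.path.join(root, "redux", "preproc")

to_backup = []
for dirpath, dirnames, filenames in os.walk(root):
    # Prune the excluded subtree so os.walk never descends into it.
    dirnames[:] = [d for d in dirnames if os.path.join(dirpath, d) != exclude]
    for name in filenames:
        to_backup.append(os.path.join(dirpath, name))

print(f"{len(to_backup)} files selected for backup")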

add etc/desi.json validation tools

Add tools for validating the backup specifications in etc/desi.json, in particular:

  1. do any regexes not match any files?
  2. are any files in the directory tree not covered by a regex and not in the exclude list?
  3. are any files covered by more than one regex?
  4. would any tar specification result in a tarball > 1TB?
  5. a --manifest option (or some equivalent name) that would generate a manifest of what files would go into what tarballs. For example, it could point at a directory and create fake *.tar files with the same hierarchy and names that would end up on HPSS, but containing the listing of file paths (what tar -tf blat.tar would report) rather than the file contents themselves. This would allow a dry run to check that the results are as intended before actually sending TB of data to HPSS.

Note that this validation would need to be run at NERSC, since it needs access to the actual files; this is not just the JSON syntax check of the Travis test.
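A minimal sketch of checks 1–3 follows. It assumes (this is an assumption; the real hpsspy schema may differ) that each section of etc/desi.json maps regular-expression strings to HPSS tar-file names and that paths are matched relative to the DESI root; the exclude list mentioned in check 2 is not handled here.

# Hypothetical sketch of checks 1-3 above.  Assumes each section of
# etc/desi.json maps regular-expression strings to HPSS tar-file names and
# that paths are matched relative to the DESI root; the real hpsspy schema
# may differ, so treat this as an outline, not a working validator.
import json
import os
import re

config_file = "etc/desi.json"
section = "spectro"                          # hypothetical section to validate
disk_root = "/project/projectdirs/desi"      # must be run at NERSC

with open(config_file) as f:
    spec = json.load(f)

patterns = [re.compile(p) for p in spec[section]]

# Collect file paths relative to the DESI root.
paths = []
for dirpath, dirnames, filenames in os.walk(os.path.join(disk_root, section)):
    for name in filenames:
        paths.append(os.path.relpath(os.path.join(dirpath, name), disk_root))

# Check 1: regexes that match no files.
for pattern in patterns:
    if not any(pattern.match(p) for p in paths):
        print(f"No files match {pattern.pattern}")

# Checks 2 and 3: files matched by zero regexes or by more than one.
for p in paths:
    hits = [pat.pattern for pat in patterns if pat.match(p)]
    if not hits:
        print(f"Not covered by any regex: {p}")
    elif len(hits) > 1:
        print(f"Covered by {len(hits)} regexes: {p} -> {hits}")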

desi/spectro/data/QL

I can't recall now whether desi/spectro/data/QL is archival. If it is, it would not only need backup configuration, but a data model as well.

add more desi.json documentation

Motivated by #2, add more documentation about how to add a new entry to etc/desi.json, e.g.

  1. naming constraints on the tar files, and whether those are best-practices guidelines or actually required for desiBackup and restore to work properly.
  2. worked examples for using regex groups to map N>>1 directories into N>>1 tar files, and how much one should worry about the resulting size of the tar file (and how to tell).
  3. worked examples for how to map the files in a directory into a tarball without accidentally mapping files in subdirectories.

Since JSON files don't allow comments, the existing etc/desi.json is only partially useful for deriving these patterns.
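As a stopgap for item 2, here is a small illustration of how one regex with a capture group can fan many nightly directories out into many tar files. The pattern and substitution are examples written for this issue, not entries copied from etc/desi.json; the tar-file name just mirrors the desi_spectro_data_YYYYMMDD.tar convention seen on HPSS.

# Hypothetical illustration for item 2: one pattern with a capture group maps
# many nightly directories into many per-night tar files.  The pattern and the
# substitution template are examples, not entries copied from etc/desi.json.
import re

pattern = re.compile(r"spectro/data/(20[0-9]{6})/.*$")
template = r"desi_spectro_data_\1.tar"

for path in [
    "spectro/data/20191017/00012345/desi-00012345.fits.fz",
    "spectro/data/20191018/00012400/desi-00012400.fits.fz",
]:
    if pattern.match(path) is not None:
        print(path, "->", pattern.sub(template, path))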

Create backup monitor

Create a system for easily monitoring what has already been backed up, what needs to be backed up, what backups are stale, etc.
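A minimal sketch of the bookkeeping such a monitor could do, assuming the HPSS side is available as a mapping from tar-file name to modification time (obtained, for example, by parsing hsi ls -l output; that parsing is not shown):

# Hypothetical monitoring sketch: classify each configured backup as missing,
# stale, or up to date by comparing the newest file on disk against the
# corresponding tar file's timestamp on HPSS.  The hpss_mtimes mapping is a
# placeholder for whatever HPSS listing is actually used.
import os

def newest_mtime(directory):
    """Return the most recent modification time of any file under directory."""
    latest = 0.0
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in filenames:
            latest = max(latest, os.path.getmtime(os.path.join(dirpath, name)))
    return latest

def classify(disk_dirs, hpss_mtimes):
    """Map each backup name to 'missing', 'stale', or 'up to date'.

    disk_dirs: backup name -> directory on disk.
    hpss_mtimes: backup name -> modification time (seconds) of the tar on HPSS.
    """
    status = {}
    for name, directory in disk_dirs.items():
        if name not in hpss_mtimes:
            status[name] = "missing"
        elif newest_mtime(directory) > hpss_mtimes[name]:
            status[name] = "stale"
        else:
            status[name] = "up to date"
    return status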

backup daily/tiles/archive directories

Please add tape backup configurations for $DESI_ROOT/spectro/redux/daily/tiles/archive/TILEID/ARCHIVEDATE. These are the basis of the MTL ledger-update decisions and should be archived as part of the history of operations.

Unlike the rest of the daily prod, the tiles/archive/TILEID/ARCHIVEDATE directories are guaranteed to be frozen once written and thus are safe to back up. New TILEID/ARCHIVEDATE directories may appear (including for a TILEID that had previously been archived under an earlier ARCHIVEDATE), but the contents of the existing ones won't change once written.

These are ~5 GB each, which is a bit on the small side for htar files. If we need larger bundles, all tiles archived on a given ARCHIVEDATE could be put together. That is a little "unnatural" given the TILEID/ARCHIVEDATE organization (rather than ARCHIVEDATE/TILEID), but it isn't a blocking factor if bigger bundles are needed.
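A minimal sketch for sizing per-ARCHIVEDATE bundles, assuming the TILEID/ARCHIVEDATE layout described above and that $DESI_ROOT is set; the grouping is an estimate only, not how desiBackup would actually build the htar files.

# Hypothetical sketch: group TILEID/ARCHIVEDATE directories by ARCHIVEDATE to
# estimate how large each per-ARCHIVEDATE bundle would be.  Assumes the layout
# described above and that $DESI_ROOT is set in the environment.
import os
from collections import defaultdict

def dir_size(directory):
    """Total size in bytes of all files under directory."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

archive = os.path.join(os.environ["DESI_ROOT"],
                       "spectro", "redux", "daily", "tiles", "archive")
bundles = defaultdict(int)
for tileid in sorted(os.listdir(archive)):
    tile_dir = os.path.join(archive, tileid)
    if not os.path.isdir(tile_dir):
        continue
    for archivedate in os.listdir(tile_dir):
        date_dir = os.path.join(tile_dir, archivedate)
        if os.path.isdir(date_dir):
            bundles[archivedate] += dir_size(date_dir)

for archivedate, nbytes in sorted(bundles.items()):
    print(f"{archivedate}: {nbytes / 2**30:.1f} GiB")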

add fuji and guadalupe to desi.json

Please add the fuji and guadalupe reductions to desi.json and back them up to HPSS. They likely could follow the same structure as everest for how directories are split into individual htar files.

Configure backups for mocks directory

In case I don't get this done today...

Configure the mocks directory for tape backups. Note that there is a working area in the lya_forest directory that probably shouldn't be backed up.

And make sure the backups actually take place once they are configured.

Older backups lack precise timestamps

hsi ls -l shows older files with only day-level precision.

-rw-r-----    1 desi      desi      3332536320 Oct 18  2019 desi_spectro_data_20191017.tar

hpsspy deals with this by assigning an arbitrary time of day to the backup, but desiBackup assumes second-level precision when comparing files on disk to files on HPSS, and this leads to spurious warnings about files on disk being newer than files on HPSS.

  1. Find out if higher-precision timestamps are available from HPSS.
  2. If not, assume a time buffer when comparing older files on disk to older files on HPSS.
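For option 2, a minimal sketch of a grace buffer; the one-day window is an assumed value chosen to cover the day-level precision of the older HPSS listings, not something prescribed by hpsspy or HPSS.

# Hypothetical sketch for option 2: only warn that a disk file is newer than
# its HPSS copy if it is newer by more than a grace buffer, to absorb the
# day-level precision of older HPSS listings.  The one-day buffer is an
# assumed value.
import datetime

GRACE = datetime.timedelta(days=1)

def disk_is_newer(disk_mtime, hpss_mtime, grace=GRACE):
    """Return True only if the disk file postdates the HPSS file by more than grace."""
    return disk_mtime - hpss_mtime > grace

# Example: HPSS reports only "Oct 18 2019"; hpsspy fills in an arbitrary time of day.
hpss = datetime.datetime(2019, 10, 18, 0, 0, 0)
disk = datetime.datetime(2019, 10, 18, 14, 30, 0)
print(disk_is_newer(disk, hpss))   # False: within the grace window, no spurious warning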

How should we handle configuration of backups?

While preparing this code to back up the oak1 reduction, I was thinking about how to handle the configuration of the backups. The low-level backup system, in the hpsspy package, needs a JSON file which specifies how files on disk map to files on HPSS. desiBackup has the necessary file in etc/desi.json, and the shell script desiBackup.sh passes this file to the low-level backup system.

The question is, do we:

  • update this file in GitHub every time there is something new to back up, and cut a new tag?
  • update this file in GitHub, but don't bother tagging, i.e. install a git clone instead of a tagged version?
  • tag a stable version, but create copies of etc/desi.json as needed to back up new data?
  • do something else?

PS, while dusting off & testing desiBackup, I've already made a backup of oak1, but this could be deleted at will for further testing.

New data in datachallenge directory

Examine files in the datachallenge directory that are newer than existing backups. In some cases there is just one file that changed.

  • surveysim2018/weather/README; owner sjbailey
  • dc17a-twopct/spectro/redux/dc17a2/dc17a2_qa.json; owner sjbailey
  • dc17a-twopct/spectro/redux/dc17a2/exposures/NIGHT/EXPID/qa-SPECTROGRAPH-EXPID.yaml; owner sjbailey
  • dc17a-twopct/spectro/redux/dc17a2/exposures/NIGHT/EXPID/qa-(sky|flux)-SPECTROGRAPH-EXPID.png; owner sjbailey
  • dc17a-twopct/spectro/redux/dc17a2/calib2d/NIGHT/qa-z5-EXPID.yaml; owner sjbailey
  • reference_runs/18.2/survey/test-tiles.fits; owner mmagana

This is a nice one because once these issues are resolved, we should be able to go directly from red to green/Complete.

Assigning to @sbailey since he is the owner of most of these files.
