desihub / desibackup

Contains wrapper scripts for backing up DESI data using the hpsspy package.

License: BSD 3-Clause "New" or "Revised" License

Shell 61.32% HTML 38.68%

desibackup's People

Contributors

sbailey, weaverba137


desibackup's Issues

New data in spectro directory

Examine files in the spectro directory that are newer than the existing backups (a sketch of this check follows the list below).

  • spectro/redux/sjb/dogwood has new data, but even the new data are from 2015! Maybe just re-backup.
  • Are other directories in spectro/redux/sjb archival?
  • redux/oak1/bricks/3127p090/blat.h5: Just one file. Not sure who originally owned it.
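The "newer than the backups" check can be scripted. Below is a minimal sketch, assuming a placeholder root directory and an assumed last-backup date; neither value is taken from the actual backup records.

# Hypothetical sketch: list files under a spectro subdirectory whose
# modification time postdates the last backup.  The root path and the cutoff
# date below are placeholders, not the actual backup state.
import datetime
import os

root = "/project/projectdirs/desi/spectro/redux/sjb/dogwood"   # placeholder
cutoff = datetime.datetime(2016, 1, 1).timestamp()             # assumed backup date

newer = []
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        if os.path.getmtime(path) > cutoff:
            newer.append(path)

for path in sorted(newer):
    print(path)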

backup datachallenge/quicklook/review-19.1

Please add /project/projectdirs/desi/datachallenge/quicklook/review-19.1 to the backups, while excluding review-19.1/redux/preproc/ (114G of easily reproduced preprocessed data).
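A minimal sketch of the requested selection, assuming a simple path filter: the root and exclude paths come from the request above, but the filtering code is only an illustration, not desiBackup's actual mechanism (which works through the regexes in etc/desi.json).

# Hypothetical sketch: build a file list for review-19.1 while skipping the
# easily reproduced preproc files.  The root and exclude paths are from the
# request above; the filtering approach is illustrative only.
import os

root = "/project/projectdirs/desi/datachallenge/quicklook/review-19.1"
exclude = os.path.join(root, "redux", "preproc")

to_backup = []
for dirpath, dirnames, filenames in os.walk(root):
    # Prune the excluded subtree so os.walk never descends into it.
    dirnames[:] = [d for d in dirnames if os.path.join(dirpath, d) != exclude]
    for name in filenames:
        to_backup.append(os.path.join(dirpath, name))

print(f"{len(to_backup)} files selected for backup")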

add etc/desi.json validation tools

Add tools for validating the backup specifications in etc/desi.json, in particular:

  1. do any regexes not match any files?
  2. are any files in the directory tree not covered by a regex and not in the exclude list?
  3. are any files covered by more than one regex?
  4. would any tar specification result in a tarball > 1TB?
  5. a --manifest option (or some equivalent name) that would generate a manifest of what files would go into what tarballs. For example, it could point at a directory and create fake *.tar files with the same hierarchy and names that would end up on HPSS, but containing the listing of file paths (what tar -tf blat.tar would report) rather than the file contents themselves. This would allow a dry run to check that the results are as intended before actually sending TB of data to HPSS.

Note that this validation would need to be run at NERSC, since it needs access to the actual files; this is not just the JSON syntax check of the Travis test.
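A minimal sketch of checks 1–3 follows. It assumes (this is an assumption; the real hpsspy schema may differ) that each section of etc/desi.json maps regular-expression strings to HPSS tar-file names and that paths are matched relative to the DESI root; the exclude list mentioned in check 2 is not handled here.

# Hypothetical sketch of checks 1-3 above.  Assumes each section of
# etc/desi.json maps regular-expression strings to HPSS tar-file names and
# that paths are matched relative to the DESI root; the real hpsspy schema
# may differ, so treat this as an outline, not a working validator.
import json
import os
import re

config_file = "etc/desi.json"
section = "spectro"                          # hypothetical section to validate
disk_root = "/project/projectdirs/desi"      # must be run at NERSC

with open(config_file) as f:
    spec = json.load(f)

patterns = [re.compile(p) for p in spec[section]]

# Collect file paths relative to the DESI root.
paths = []
for dirpath, dirnames, filenames in os.walk(os.path.join(disk_root, section)):
    for name in filenames:
        paths.append(os.path.relpath(os.path.join(dirpath, name), disk_root))

# Check 1: regexes that match no files.
for pattern in patterns:
    if not any(pattern.match(p) for p in paths):
        print(f"No files match {pattern.pattern}")

# Checks 2 and 3: files matched by zero regexes or by more than one.
for p in paths:
    hits = [pat.pattern for pat in patterns if pat.match(p)]
    if not hits:
        print(f"Not covered by any regex: {p}")
    elif len(hits) > 1:
        print(f"Covered by {len(hits)} regexes: {p} -> {hits}")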

desi/spectro/data/QL

I can't recall now whether desi/spectro/data/QL is archival. If it is, it would not only need backup configuration, but a data model as well.

add more desi.json documentation

Motivated by #2, add more documentation about how to add a new entry to etc/desi.json, e.g.

  1. naming constraints on the tar files, and whether those are best-practices guidelines or actually required for desiBackup and restore to work properly.
  2. worked examples for using regex groups to map N>>1 directories into N>>1 tar files, and how much one should worry about the resulting size of the tar file (and how to tell).
  3. worked examples for how to map the files in a directory into a tarball without accidentally mapping files in subdirectories.

Since JSON files don't allow comments, the existing etc/desi.json is only partially useful for deriving these patterns.
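As a stopgap for item 2, here is a small illustration of how one regex with a capture group can fan many nightly directories out into many tar files. The pattern and substitution are examples written for this issue, not entries copied from etc/desi.json; the tar-file name just mirrors the desi_spectro_data_YYYYMMDD.tar convention seen on HPSS.

# Hypothetical illustration for item 2: one pattern with a capture group maps
# many nightly directories into many per-night tar files.  The pattern and the
# substitution template are examples, not entries copied from etc/desi.json.
import re

pattern = re.compile(r"spectro/data/(20[0-9]{6})/.*$")
template = r"desi_spectro_data_\1.tar"

for path in [
    "spectro/data/20191017/00012345/desi-00012345.fits.fz",
    "spectro/data/20191018/00012400/desi-00012400.fits.fz",
]:
    if pattern.match(path) is not None:
        print(path, "->", pattern.sub(template, path))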

Create backup monitor

Create a system for easily monitoring what has already been backed up, what needs to be backed up, what backups are stale, etc.
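A minimal sketch of the bookkeeping such a monitor could do, assuming the HPSS side is available as a mapping from tar-file name to modification time (obtained, for example, by parsing hsi ls -l output; that parsing is not shown):

# Hypothetical monitoring sketch: classify each configured backup as missing,
# stale, or up to date by comparing the newest file on disk against the
# corresponding tar file's timestamp on HPSS.  The hpss_mtimes mapping is a
# placeholder for whatever HPSS listing is actually used.
import os

def newest_mtime(directory):
    """Return the most recent modification time of any file under directory."""
    latest = 0.0
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in filenames:
            latest = max(latest, os.path.getmtime(os.path.join(dirpath, name)))
    return latest

def classify(disk_dirs, hpss_mtimes):
    """Map each backup name to 'missing', 'stale', or 'up to date'.

    disk_dirs: backup name -> directory on disk.
    hpss_mtimes: backup name -> modification time (seconds) of the tar on HPSS.
    """
    status = {}
    for name, directory in disk_dirs.items():
        if name not in hpss_mtimes:
            status[name] = "missing"
        elif newest_mtime(directory) > hpss_mtimes[name]:
            status[name] = "stale"
        else:
            status[name] = "up to date"
    return status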

backup daily/tiles/archive directories

Please add tape backup configurations for $DESI_ROOT/spectro/redux/daily/tiles/archive/TILEID/ARCHIVEDATE. These are the basis of the MTL ledger-update decisions and should be archived as part of the history of operations.

Unlike the rest of the daily prod, the tiles/archive/TILEID/ARCHIVEDATE directories are guaranteed to be frozen once written and thus are safe to back up. New TILEID/ARCHIVEDATE directories may appear (including for a TILEID that had previously been archived under an earlier ARCHIVEDATE), but the contents of the existing ones won't change once written.

These are ~5 GB each, which is a bit on the small side for htar files. If we need larger bundles, all tiles archived on a given ARCHIVEDATE could be put together. That is a little "unnatural" given the TILEID/ARCHIVEDATE organization (rather than ARCHIVEDATE/TILEID), but it isn't a blocking factor if bigger bundles are needed.
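A minimal sketch for sizing per-ARCHIVEDATE bundles, assuming the TILEID/ARCHIVEDATE layout described above and that $DESI_ROOT is set; the grouping is an estimate only, not how desiBackup would actually build the htar files.

# Hypothetical sketch: group TILEID/ARCHIVEDATE directories by ARCHIVEDATE to
# estimate how large each per-ARCHIVEDATE bundle would be.  Assumes the layout
# described above and that $DESI_ROOT is set in the environment.
import os
from collections import defaultdict

def dir_size(directory):
    """Total size in bytes of all files under directory."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

archive = os.path.join(os.environ["DESI_ROOT"],
                       "spectro", "redux", "daily", "tiles", "archive")
bundles = defaultdict(int)
for tileid in sorted(os.listdir(archive)):
    tile_dir = os.path.join(archive, tileid)
    if not os.path.isdir(tile_dir):
        continue
    for archivedate in os.listdir(tile_dir):
        date_dir = os.path.join(tile_dir, archivedate)
        if os.path.isdir(date_dir):
            bundles[archivedate] += dir_size(date_dir)

for archivedate, nbytes in sorted(bundles.items()):
    print(f"{archivedate}: {nbytes / 2**30:.1f} GiB")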

add fuji and guadalupe to desi.json

Please add the fuji and guadalupe reductions to desi.json and back them up to HPSS. They likely could follow the same structure as everest for how directories are split into individual htar files.

Configure backups for mocks directory

In case I don't get this done today...

Configure the mocks directory for tape backups. Note that there is a working area in the lya_forest directory that probably shouldn't be backed up.

And make sure the backups actually take place once they are configured.

Older backups lack precise timestamps

hsi ls -l shows older files with only day-level precision.

-rw-r-----    1 desi      desi      3332536320 Oct 18  2019 desi_spectro_data_20191017.tar

hpsspy deals with this by assigning an arbitrary time of day to the backup, but desiBackup assumes second-level precision when comparing files on disk to files on HPSS, and this leads to spurious warnings about files on disk being newer than files on HPSS.

  1. Find out if higher-precision timestamps are available from HPSS.
  2. If not, assume a time buffer when comparing older files on disk to older files on HPSS.
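For option 2, a minimal sketch of a grace buffer; the one-day window is an assumed value chosen to cover the day-level precision of the older HPSS listings, not something prescribed by hpsspy or HPSS.

# Hypothetical sketch for option 2: only warn that a disk file is newer than
# its HPSS copy if it is newer by more than a grace buffer, to absorb the
# day-level precision of older HPSS listings.  The one-day buffer is an
# assumed value.
import datetime

GRACE = datetime.timedelta(days=1)

def disk_is_newer(disk_mtime, hpss_mtime, grace=GRACE):
    """Return True only if the disk file postdates the HPSS file by more than grace."""
    return disk_mtime - hpss_mtime > grace

# Example: HPSS reports only "Oct 18 2019"; hpsspy fills in an arbitrary time of day.
hpss = datetime.datetime(2019, 10, 18, 0, 0, 0)
disk = datetime.datetime(2019, 10, 18, 14, 30, 0)
print(disk_is_newer(disk, hpss))   # False: within the grace window, no spurious warning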

How should we handle configuration of backups?

While preparing this code to back up the oak1 reduction, I was thinking about how to handle the configuration of the backups. The low-level backup system, in the hpsspy package, needs a JSON file which specifies how files on disk map to files on HPSS. desiBackup has the necessary file in etc/desi.json, and the shell script desiBackup.sh passes this file to the low-level backup system.

The question is, do we:

  • update this file in GitHub every time there is something new to back up, and cut a new tag?
  • update this file in GitHub, but don't bother tagging, i.e. install a git clone instead of a tagged version?
  • tag a stable version, but create copies of etc/desi.json as needed to back up new data?
  • do something else?

PS, while dusting off & testing desiBackup, I've already made a backup of oak1, but this could be deleted at will for further testing.

New data in datachallenge directory

Examine files in the datachallenge directory that are newer than existing backups. In some cases there is just one file that changed.

  • surveysim2018/weather/README; owner sjbailey
  • dc17a-twopct/spectro/redux/dc17a2/dc17a2_qa.json; owner sjbailey
  • dc17a-twopct/spectro/redux/dc17a2/exposures/NIGHT/EXPID/qa-SPECTROGRAPH-EXPID.yaml; owner sjbailey
  • dc17a-twopct/spectro/redux/dc17a2/exposures/NIGHT/EXPID/qa-(sky|flux)-SPECTROGRAPH-EXPID.png; owner sjbailey
  • dc17a-twopct/spectro/redux/dc17a2/calib2d/NIGHT/qa-z5-EXPID.yaml; owner sjbailey
  • reference_runs/18.2/survey/test-tiles.fits; owner mmagana

This is a nice one because once these issues are resolved, we should be able to go directly from red to green/Complete.

Assigning to @sbailey since he is the owner of most of these files.
