
desidatamodel's Introduction

desidatamodel


Introduction

This product defines the Data Model for DESI in the doc/ subdirectory.

Adding a new file

When you add a new file, you also need to add it to a "toctree" (Table of Contents Tree) of one of the index.rst files at that level or higher.
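
For example, if a hypothetical doc/DESI_SPECTRO_REDUX/newfile.rst were added, the index.rst in that directory would need a toctree entry along these lines:

.. toctree::
   :maxdepth: 1

   newfile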

Full Documentation

Please visit desidatamodel on Read the Docs

License

desidatamodel is free software licensed under a 3-clause BSD-style license. For details see the LICENSE.rst file.

desidatamodel's People

Contributors

akremin, araichoor, ashleyjross, aureliocarnero, crockosi, dkirkby, duanyutong, dylanagreen, forero, geordie666, julienguy, moustakas, parfa30, rkehoe, schlafly, srheft, weaverba137, zkdtc


desidatamodel's Issues

review cross file datamodel consistency

Review the end-to-end dataflow for data model consistency. The same concept should have the same name if it appears in multiple files, e.g. EXTNAME=WAVE vs. WAVELENGTH, or SN vs SNR vs S_N. Identify mismatches and propose necessary changes.
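
A crude starting point, assuming the EXTNAME and column names have already been collected per file (names_by_file is a hypothetical dict), would be to flag near-duplicates automatically; borderline cases like WAVE vs. WAVELENGTH would still need manual review:

import difflib

def flag_similar_names(names_by_file):
    """Report names that differ only slightly between files.

    names_by_file is a hypothetical dict mapping a file path to the set of
    EXTNAME / column names found in that file.
    """
    all_names = sorted({n for names in names_by_file.values() for n in names})
    for name in all_names:
        close = difflib.get_close_matches(name, all_names, n=5, cutoff=0.8)
        close = [c for c in close if c != name]
        if close:
            print(f"{name} is suspiciously similar to {', '.join(close)}")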

Milky Way Survey outputs datamodel

Hi,

We currently have a working version of the datamodel for some parts of the MWS output: https://github.com/segasai/desidatamodel/tree/mwbranch
Currently I put our output files in DESI_SPECTRO_REDUX/SPECPROD/spectra-NSIDE/PIXGROUP/PIXNUM/, but presumably we want to place them in some other folder like DESI_MWS_REDUX.
What would be the steps necessary to start merging our branch with the main desidatamodel? At least all the Travis tests are passing. (I also assume that we do want our datamodel merged.)

Sergey

Also CC to @callendeprieto

collapsible sections for header keyword tables

It would be helpful if the datamodel could include optional collapsible sections. E.g. we inherit hundreds of keywords from raw data + fiberassign, and it would be nice to show those only upon request, so that the webpage can focus on the data portions of the files without scrolling through pages and pages of keywords first.

Add surveysim files to datamodel

The files owned by surveysim that probably need documented data models are:

  • Survey per-tile and per-night statistics written by surveysim.stats.SurveyStatistics.save
  • Simulated exposure metadata written by surveysim.exposures.ExposureList.save

(copied from desihub/surveysim#60)

Can we remove "brick" models?

There are still references to brick-based reduction files, especially in the DESI_SPECTRO_REDUX section. Can these be removed? In general, what brick-based models need to be retained, and what can be removed?

Review datamodel for 17.12 consistency

Update datamodel as needed to match the results of software release 17.12 in /project/projectdirs/desi/datachallenge/reference_runs/17.12.

In some cases where they differ, the datamodel may reflect what we really want. Make sure that the datamodel is clear that this isn't yet implemented, and make sure that there is a ticket for getting the code to implement the desired datamodel.

update datamodel for software release 18.11

Update the data model for software release 18.11 based on files in /project/projectdirs/desi/datachallenge/reference_runs/18.11/

Lots of formats changed, in particular:

  • target catalogs
  • fiber assign output
  • FIBERMAP HDU of fibermap, frame, cframe, and spectra files

Add desisurvey files to datamodel

The files owned by desisurvey that probably need documented data models are:

  • Ephemerides written by desisurvey.ephem.Ephemerides
  • Design HAs written by desisurvey.scripts.surveyinit.calculate_initial_plan
  • Afternoon planning state written by desisurvey.plan.Planner.save
  • Tile scheduler state written by desisurvey.scheduler.Scheduler.save

(copied from desihub/desisurvey#91)

Update Travis configuration and other infrastructure.

  • Merge #23 and clean up old branches.
  • Update the Travis configuration to roughly match that of desiutil.
  • Make sure that documentation builds with --warning-is-error.
  • Update desiutil version (in requirements.txt).
  • Check dependencies needed to build on ReadTheDocs.

Data model for qso afterburner files

Please review the data model for the qso_mgii and qso_qn files in both healpix and tiles directories. The structural description of these files should be complete, but descriptions of the HDUs, header keywords and columns need work. If necessary it is possible for files to refer to other files, or for HDUs to refer to other HDUs, to reduce duplication of effort.

Quicklook to Nightwatch

How similar are Nightwatch outputs to Quicklook? That is, could we get away with renaming the current Quicklook directory and tweaking some of the files? Also, does this transition make #49 moot?

Data model for top-level exposures and tiles files

Please review the data model for the top-level exposures-SPECPROD and tiles-SPECPROD files. The structural description of these files should be complete, but descriptions of the HDUs, header keywords and columns need work.

Compare by key, not by list.

  • Compare HDUs by EXTNAME, not by order in the FITS or model file.
  • Compare HDU keywords by keyword, not by order.
  • Compare column names by column name? Can we at least specify the order of columns in a table?
  • This may also make it easier to support required and optional keywords. (A sketch of the by-key comparison follows this list.)
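
A minimal sketch of the by-key comparison, assuming the data model and the FITS file have already been reduced to dicts mapping EXTNAME to header keywords (that extraction step is not shown):

def compare_hdus(model_hdus, file_hdus):
    """Compare HDUs by EXTNAME and keywords by name, ignoring order.

    Both arguments are hypothetical dicts mapping EXTNAME to a dict of
    header keyword -> value.
    """
    for extname in sorted(set(model_hdus) | set(file_hdus)):
        if extname not in file_hdus:
            print(f"HDU {extname} in model but not in file")
            continue
        if extname not in model_hdus:
            print(f"HDU {extname} in file but not in model")
            continue
        model_keys = set(model_hdus[extname])
        file_keys = set(file_hdus[extname])
        for key in sorted(model_keys - file_keys):
            print(f"{extname}: keyword {key} in model but not in file")
        for key in sorted(file_keys - model_keys):
            print(f"{extname}: keyword {key} in file but not in model")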

Optional HDUs and table columns

Some files may have entire optional HDUs, and, in some situations, data tables may have missing/optional columns. We can add human-readable documentation describing situations and files where HDUs or columns may be absent. However, there also needs to be a machine-readable mechanism for marking optional items.

One suggestion: create an RST comment that is not rendered into the final document, but would nevertheless be machine readable. For example:

.. Optional Columns - COLUMNA, COLUMNB, COLUMNC
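
A minimal sketch of how such a comment could be parsed, assuming exactly the syntax shown above (the function name is hypothetical):

import re

OPTIONAL_RE = re.compile(r'^\.\.\s+Optional Columns\s*-\s*(?P<columns>.+)$')

def optional_columns(rst_path):
    """Return the set of column names flagged as optional in an RST model file."""
    optional = set()
    with open(rst_path) as f:
        for line in f:
            m = OPTIONAL_RE.match(line.strip())
            if m:
                optional.update(c.strip() for c in m.group('columns').split(','))
    return optional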

tool to check if data model is current

We originally dreamed of having the desidatamodel formats be both human and machine readable such that code could verify if the datamodel files were correct. We found it hard to make something that was human-friendly while also being strict and complete enough for computer parsing. And in the meantime our datamodel has gotten out of sync with our data files. I suggest a pragmatic middle ground:

Write a script to run on a production directory and verify that:

  • every fits file has a corresponding datamodel file
    • in the future this could be expanded to hdf5, yaml, etc. too
  • every HDU in the fits file has a description in the datamodel file (based on EXTNAME)
  • for binary tables, confirm that the columns in the file match the columns in the datamodel

Print a brief report for discrepancies, e.g.

blat/foo/bar/quat.fits missing from datamodel
blat/foo/bar/quiz.fits datamodel missing description of HDU EXTNAME=BIZBAT

Once a file of type X has been checked, all other files of that type could be skipped, i.e. we don't need to verify 300000 frame files all have the same format. I think all of our files have the form directory/prefix-*.*, i.e. you can parse and key off of "prefix" since we don't reuse the same prefix for different kinds of files in different locations.

Even keeping the files and datamodel in sync at that level would be very useful. Checking header keywords, float32 vs. float64, etc. could come later.

The interface might look something like

checkmodel /project/projectdirs/desi/spectro/redux/dc3 --root DESI_SPECTRO_REDUX/PRODNAME
checkmodel /project/projectdirs/desi/target --root DESI_TARGET

I hope this is viable. Actually trying to implement it would reveal any gotchas.
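
A rough sketch of the proposed scan, keying off the file-name prefix as described above (find_model_file and model_extnames are hypothetical helpers that map a prefix to its .rst model and extract the documented EXTNAMEs; real path handling would need more care):

import os
from astropy.io import fits

def check_production(prod_dir, model_dir):
    """Walk a production directory and compare each new file type to its data model."""
    seen_prefixes = set()
    for dirpath, _, filenames in os.walk(prod_dir):
        for filename in filenames:
            if not filename.endswith('.fits'):
                continue
            prefix = filename.split('-')[0]
            if prefix in seen_prefixes:
                # One check per file type: no need to verify 300000 frame files.
                continue
            seen_prefixes.add(prefix)
            filepath = os.path.join(dirpath, filename)
            model = find_model_file(model_dir, prefix)      # hypothetical lookup
            if model is None:
                print(f"{filepath} missing from datamodel")
                continue
            with fits.open(filepath) as hdus:
                extnames = {hdu.header.get('EXTNAME') for hdu in hdus}
            for extname in sorted(extnames - model_extnames(model), key=str):  # hypothetical parse
                print(f"{filepath} datamodel missing description of HDU EXTNAME={extname}")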

PROTODESI_DATA files

I noticed a few problems with the PROTODESI_DATA files that should be addressed before that branch is merged.

  1. The naming conventions for the various files have a three-digit exposure ID and a five-digit MJD. Are we absolutely certain that protoDESI will never produce more than 999 exposures, or is the exposure ID reset every day (which seems dangerous)? Also, the rest of the pipeline uses an eight-digit YYYYMMDD instead of MJD. Is it possible to use YYYYMMDD for consistency?
  2. The main DESI pipeline has largely abandoned mixed-case filenames in favor of all lowercase.
  3. The files describe FITS files that have binary tables in HDU0, which is not allowed. It appears there is some confusion about the difference between Header Keywords and Column Names.
  4. The index.rst file has incorrect links in the toctree section. The links should simply be e.g. pdFVC not pdFVC/index.

Human-readable directories and filenames from everest

We will likely need to expand the set of human-readable directory and file names. For reference, these are things like NIGHT and EXPID, which stand in for harder-to-read regular expressions. The existing set is here.

It looks like we no longer use NSIDE as of everest, but I'm more interested in what to call:

  1. (sv1|sv2|sv3|main) = SURVEY?
  2. (backup|bright|dark|other) = CONDITIONS?
  3. (cumulative|perexp|pernight) = TILETYPE?
  4. [14]x_depth = TILETYPE or does that have such a different data model that it requires a separate section?

To think about this another way, do we need this structure in the data model:

  • DESI_SPECTRO_REDUX
    • SPECPROD
      • tiles
        • 1x_depth
          • TILEID
            • SPECTROGRAPH
              • various files...
        • 4x_depth
          • TILEID
            • SPECTROGRAPH...
        • cumulative...
        • perexp...
        • pernight....

or can we get away with this:

  • DESI_SPECTRO_REDUX
    • SPECPROD
      • tiles
        • TILETYPE
          • TILEID
            • NIGHT
              • various files....

And similar question for the healpix directory.
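
Whatever names are chosen, the mapping could be as simple as a small table of regular expressions; the names below are just the candidates from this issue, not a decision:

import re

# Candidate human-readable names -> regular expressions (names are tentative).
PLACEHOLDERS = {
    'SURVEY': re.compile(r'(sv1|sv2|sv3|main)'),
    'CONDITIONS': re.compile(r'(backup|bright|dark|other)'),
    'TILETYPE': re.compile(r'(cumulative|perexp|pernight)'),
    # [14]x_depth may need its own name or a separate section; see point 4 above.
}

def name_for(directory):
    """Return the placeholder name matching a directory component, if any."""
    for name, pattern in PLACEHOLDERS.items():
        if pattern.fullmatch(directory):
            return name
    return None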

different visual style for header keyword tables vs. data tables

Currently both header keyword tables and data tables use the same CSS styling. If Sphinx can support it, it would be helpful for the UX to make them visually distinguishable, e.g. with different background colors, so that when scrolling you can quickly tell whether you are looking at a data table or a header keyword table.

HDU 0 EXTNAME policy

DESI FITS file EXTNAME requirements:

  • FITS extensions shall have an EXTNAME set
  • Pipeline code shall access FITS extensions by name not by number
  • User code should access FITS extensions by name not by number

i.e. we reserve the right to re-order or insert HDUs; accessing them by name won't break but accessing them by number might.

For discussion: a potential special case is blank HDU 0, where the "real" data are in HDU 1 (with an EXTNAME!) because they are a binary table that the FITS standard doesn't allow in HDU 0, e.g. the targets catalog. In formal FITS nomenclature I don't even think that HDU 0 is an "extension" but the standard doesn't forbid giving it an EXTNAME anyway (thankfully). I suggest our HDU 0 EXTNAME policy should be:

  1. if HDU 0 has blank data and no non-standard header keywords, it should not have an EXTNAME
  2. if HDU 0 has blank data but meaningful header keywords, HDU 0 should have EXTNAME=PRIMARY (note: "should", not "shall")
  3. if HDU 0 has meaningful data, it shall have an EXTNAME set that should be something more meaningful than "PRIMARY"

I suggest that check_model and unit tests enforce (3) (the only "shall") and make a best-effort check of (1) and (2), but if that gets too irritating or there are messy corner cases, don't worry about it.

Comments in PR #68 have a list of known files with HDU 0 that have no EXTNAME.
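
A best-effort sketch of what such a check might look like, assuming astropy is used to read the headers (the keyword ignore list below is illustrative only, not a definitive rule):

from astropy.io import fits

# Keywords that do not by themselves make HDU 0 "meaningful" (incomplete, illustrative).
IGNORED_KEYWORDS = {'SIMPLE', 'BITPIX', 'NAXIS', 'EXTEND', 'EXTNAME',
                    'COMMENT', 'HISTORY', 'CHECKSUM', 'DATASUM', ''}

def check_hdu0(filename):
    """Best-effort check of the proposed HDU 0 EXTNAME policy."""
    with fits.open(filename) as hdus:
        hdu0 = hdus[0]
        extname = hdu0.header.get('EXTNAME')
        has_data = hdu0.data is not None
        extra = set(hdu0.header.keys()) - IGNORED_KEYWORDS
        if has_data and extname is None:
            print(f"{filename}: HDU 0 has data but no EXTNAME (rule 3, shall)")
        elif not has_data and extra and extname != 'PRIMARY':
            print(f"{filename}: HDU 0 has keywords but EXTNAME != PRIMARY (rule 2, should)")
        elif not has_data and not extra and extname is not None:
            print(f"{filename}: blank HDU 0 does not need an EXTNAME (rule 1, should not)")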

targets.dat should really be a YAML file.

The data file data/targets/targets.dat should really be a YAML file. As far as I can tell it is already perfectly valid YAML.

This is especially important because the format of targets.dat is significantly different from all other *.dat files in that directory.
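
That is easy to confirm with a quick check, assuming PyYAML is installed and the path is relative to the package root:

import yaml

with open('data/targets/targets.dat') as f:
    targets = yaml.safe_load(f)  # raises yaml.YAMLError if the file is not valid YAML
print(type(targets))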

Update data model for tiles product

The data model path DESI_TARGET/fiberassign is outdated and should be updated:

  • The ultimate source for fiberassign files is DESI_SPECTRO_DATA/NIGHT/EXPID/fiberassign-EXPID.rst. After those files appear in the raw data, they are checked into the tiles product, so the fiberassign data model in that area can refer to the raw data version.
  • The equivalent path to the actual data tree is DESI_TARGET/tiles/TILES_VERSION/TILEXX/fiberassign-TILEID.rst. DESI_TARGET/tiles is simply an svn checkout of the full tiles product.
  • There may be other files in the tiles product to document.

Ensure EXTNAME is always set

It is always a good thing to set EXTNAME in FITS file HDUs, so that they can be referred to by name. However, we haven't been very careful about enforcing this. Therefore I propose:

  1. All data models should have an EXTNAME recorded for every HDU. Stern warning if not.
  2. When generating data models from FITS files, complain if EXTNAME is not set.
  3. When validating FITS files against the data model, complain if EXTNAME is not set in one or the other or both.

'electron' is not a valid unit in the FITS standard

I'm trying to make sure that all units defined in FITS files (BUNIT and TUNIT keywords) conform to the FITS standard. Unfortunately 'electron' is not valid.

See doc/DESI_SPECTRO_REDUX/SPECPROD/exposures/NIGHT/EXPID/calib-CAMERA-EXPID.rst for an example. Would adu be an acceptable substitute?
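
One way to check any BUNIT/TUNIT value against the FITS standard is astropy's unit parser; a minimal sketch:

from astropy import units as u

def is_fits_unit(value):
    """Return True if value is a legal unit string under the FITS standard."""
    try:
        u.Unit(value, format='fits')
    except ValueError:
        return False
    return True

print(is_fits_unit('electron'))  # False: not in the FITS standard
print(is_fits_unit('adu'))       # True with astropy's FITS parser (worth confirming against the standard)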

Remove imaging directory

The top level index.rst file explicitly says that the imaging data model is not included in this package, so there is no point in keeping the doc/imaging directory. Sphinx is complaining about not being able to find a link to this directory.

Allow data model to express sets of very similar HDUs

The DESI raw data files will have a set of 30 HDUs that are more-or-less identical except for the EXTNAME (B0, B1, B2, ..., R1, R2, ..., Z8, Z9). Allow all of the data model metadata to be expressed by a single HDU description, rather than repeating the same HDU description 30 times.

Note that this has nothing to do with the INHERIT keyword in raw data HDUs. Treat that keyword as an ordinary, though non-trivial, required HDU keyword.
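
One possible mechanism (a sketch, not a proposal for the actual implementation) is to let a single model entry carry an EXTNAME pattern that the checker expands:

import re

# Hypothetical: one model entry declares an EXTNAME pattern for the per-camera HDUs.
CAMERA_EXTNAME = re.compile(r'^[BRZ][0-9]$')

def matches_camera_hdu(extname):
    """True for the 30 per-camera HDUs B0..B9, R0..R9, Z0..Z9."""
    return bool(CAMERA_EXTNAME.match(extname))

print([c for c in ('B0', 'R5', 'Z9', 'PRIMARY') if matches_camera_hdu(c)])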

Deduplicate fiberassign/fibermap tables

Many data files contain fiberassign/fibermap table HDUs. To the extent possible, and leveraging the optional table and HDU cross-reference capability implemented in #102, reduce the number of separate descriptions of these tables.

list of scalars being tested

In desidatamodel/doc/QUICKLOOK/*.rst, we need information about which scalar is being tested against *_NORMAL_RANGE and *_WARN_RANGE when *_STATUS is created. This info was available in older versions of the datamodel.

data model issues to check

This is a meta ticket to capture a list of questions leftover from PR #78.

DESISURVEY_OUTPUT

  • No model files for exposures.fits or test-tiles.fits.

DESI_SPECTRO_DATA

  • The datachallenge files are not a useful comparison to the model.
  • See also comments on fibermap files below.

DESI_SPECTRO_REDUX

  • Many file types are not generated by the datachallenge.
  • I'm very worried that passing around huge fibermap/target tables is going to
    result in lots of inconsistencies between files as columns are added or
    removed.

DESI_SPECTRO_SIM

  • No comparison possible for pix and simpix files.
  • Only routine header keyword and column name cleanup are needed.
  • Fibermap files no longer have HDU2, TARGETS.
  • Simspec TRUTH table no longer has CONTAM_TARGET.
  • Why are NUMOBS_INIT and PRIORITY_INIT int64, when NUMOBS_MORE is int32?
    And in any case, why are we planning for 2**32 or 2**64 exposures?
  • There are a number of type changes in simspec, and some columns
    that used to have units no longer do. See the OBSCONDITIONS table especially.

DESI_TARGET

  • Is there any way to more accurately represent the "official" pipeline layout?
    For example, is the position of fiberassign under DESI_TARGET reflective of
    actual operations?
  • The fiberassign data model file is named tile-TILEID-FIELDNUM.rst but the files
    are named tile-TILEID.fits.
  • What happened to the standards files?
  • Why did skies.rst get named that? It's called sky.fits.
    Are these really meant to be the same? There are a lot of changes.
  • mtl files have no units on any column. Some of those columns have to have units.
    In fact, units are missing from basically every DESI_TARGET file.
  • mtl files have NUMOBS_INIT as int64, but it is int32 in other files.
  • targets files do not contain PRIORITY?
  • There are several minor differences among truth, sky, mtl that should be
    checked in detail.
  • Why are ALLMASK_[GRZ] columns float32? The description is pretty clear
    that they should be integers.
  • SEED keyword in truth table looks dubious. Why not an integer with a comment?
  • Layout of tile file FIBERASSIGN table is radically different from other
    files.
  • Does every HDU in tile have to have the same keywords?

String types not being written out with size.

Recent runs of generate_model are printing char stream instead of e.g. char[8] when describing string-valued columns in binary tables. This may be an artifact of the Numpy version used.
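
For reference, the width is recoverable from either the FITS TFORM or the numpy dtype; a sketch of the intended output (not the actual generate_model code):

import numpy as np

def char_type_from_tform(tform):
    """Return e.g. 'char[8]' for a FITS binary-table TFORM like '8A'."""
    return f"char[{int(tform.rstrip('A') or 1)}]"

def char_type_from_dtype(dtype):
    """Return e.g. 'char[8]' for a numpy string dtype like 'S8' or '<U8'."""
    width = dtype.itemsize // (4 if dtype.kind == 'U' else 1)
    return f"char[{width}]"

print(char_type_from_tform('8A'))             # char[8]
print(char_type_from_dtype(np.dtype('U8')))   # char[8]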

tilemeasure.rst file needs to be fixed

The file doc/DESI_TARGET/fiberassign/tilemeasure.rst was added on June 13 to master by @forero (I think...), but it causes Travis tests to fail with a couple of warnings/errors (see below). Please fix or move to a branch while in development!

% sphinx-build doc junk
[snip]
building [html]: targets for 109 source files that are out of date
updating environment: 109 added, 0 changed, 0 removed
reading sources... [100%] index                                                                               
/Users/ioannis/repos/desihub/desidatamodel/doc/DESI_TARGET/fiberassign/tilemeasure.rst:27: ERROR: Unknown target name: "hdu2".
looking for now-outdated files... none found
pickling environment... done
checking consistency... /Users/ioannis/repos/desihub/desidatamodel/doc/DESI_TARGET/fiberassign/tilemeasure.rst:: WARNING: document isn't included in any toctree
done
preparing documents... done
writing output... [100%] index                                                                                
generating indices... genindex py-modindex
highlighting module code... [100%] desidatamodel.check                                                        
writing additional pages... search
copying static files... done
copying extra files... done
dumping search index in English (code: en) ... done
dumping object inventory... done
build succeeded, 2 warnings.

Data model for calibration files in exposures and calibnight directories

Please review the descriptions in the data model for these files:

  • In DESI_SPECTRO_REDUX/SPECPROD/exposures/NIGHT/EXPID
    • cframe
    • exposure-qa
    • fiberflat
    • fiberflatexp
    • fit-psf
    • fit-psf-before-listed-fix
    • fit-psf-fixed-listed
    • fluxcalib
    • frame
    • psf
    • sframe
    • shifted-input-psf
    • sky
    • stdstars
  • In DESI_SPECTRO_REDUX/SPECPROD/preproc/NIGHT/EXPID
    • preproc
  • In DESI_SPECTRO_REDUX/SPECPROD/calibnight/NIGHT
    • biasnight
    • biasnighttest
    • fiberflatnight
    • psfnight

Note that Issue #106 concerns how to support the somewhat complex structure of the psfnight files in the desidatamodel automation, but is not related to the descriptions of HDUs, header keywords or columns within those files.

add desimodel files to datamodel

The desimodel files are lightly documented in the desimodel/doc directory and in DESI-0847, but we should add more detailed data model documentation to desidatamodel. This would be a new top-level DESIMODEL directory, at the same level as DESI_SPECTRO_REDUX, etc.
