As raised at pydata/xarray#1092, a potential way to resolve the issue where different `Case`s have common dimensions but different coordinates (e.g., comparing a model with a 2 deg grid against one with a 1 deg grid) would be to lean on the "Group" idiom implemented as part of the NetCDF CDM. The data model defines a Group such that:
A Group is a container for Attributes, Dimensions, EnumTypedefs, Variables, and nested Groups. The Groups in a Dataset form a hierarchical tree, like directories on a disk. There is always at least one Group in a Dataset, the root Group, whose name is the empty string.
Ignoring the "meta"-ness of this idiom (and its similarity to the on-disk organization that experiment encourages), it enables something very powerful: variables organized in different groups can have different dimensions. A major issue raised in pydata/xarray#1092 is how this sort of functionality could be layered on top of existing xarray data structures. But selection semantics aren't necessarily the tough part of this; that's just a matter of building a sensible interface.
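To make the idiom concrete, here is a minimal, library-free sketch of a CDM-style Group tree. The `Group` class and its methods are hypothetical stand-ins (not netCDF4's or experiment's actual API); the point is only that sibling groups can reuse the same dimension *names* at different sizes.

```python
# A minimal sketch of the CDM Group idiom: each group is a container with its
# own dimensions, so sibling groups can reuse the same dimension names ("x",
# "y") at different lengths. All class/method names here are hypothetical.

class Group:
    def __init__(self, name, dims=None):
        self.name = name              # "" for the root group, per the CDM
        self.dims = dict(dims or {})  # dimension name -> length, local to this group
        self.children = {}

    def create_group(self, name, dims=None):
        child = Group(name, dims)
        self.children[name] = child
        return child

    def path(self, *names):
        """Walk the tree like a directory path on disk."""
        node = self
        for n in names:
            node = node.children[n]
        return node

# Two model runs share dimension names but not sizes (2 deg vs 1 deg grid).
root = Group("")
root.create_group("run_2deg", dims={"time": 10, "x": 180, "y": 90})
root.create_group("run_1deg", dims={"time": 10, "x": 360, "y": 180})

assert root.path("run_2deg").dims["x"] != root.path("run_1deg").dims["x"]
```

In a real NetCDF4 file this is exactly what `createGroup`/`createDimension` allow: each group defines its own dimensions, so the 2 deg and 1 deg runs can coexist in one file.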
The broader idea is to encapsulate an entire collection of numerical model output in the same common framework, to help automate the generation of complex analyses. Using the sample data in experiment and the idiom I use now, that might look something like this:
<xarray.Dataset>
Dimensions: (time: 10, x: 5, y: 5, param1: 3, param2: 3, param3: 2)
Coordinates:
* param3 (param3) <U5 'alpha' 'beta'
* param2 (param2) int64 1 2 3
* param1 (param1) <U1 'a' 'b' 'c'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
* x (x) float64 0.0 2.5 5.0 7.5 10.0
* y (y) float64 0.0 2.5 5.0 7.5 10.0
numbers (time) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
precip (time, x, y) float64 -1.072 -0.124 -0.05287 -0.9504 -1.527 ...
So this corresponds to a dense, 6-D array constructed from the (3 × 3 × 2) NetCDF files on disk. But it would be equivalent to a NetCDF file with hierarchical groups, which could potentially look something like this:
<experiment.Experiment>
Dimensions: (time: 10, x: *, y: *)
Groups: param1: 3 // param2: 3 // param3: 2
Cases:
+ param3 "Parameter 3" <U5 'alpha' 'beta'
+ param2 "Parameter 2" int64 1 2 3
+ param1 "Parameter 1" <U1 'a' 'b' 'c'
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
* x (x) float64 [param1, ]
* y (y) float64 [param1, ]
numbers (time) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
precip (time, x, y) float64 {min: -10.0, max: 10.0}
What's going on here:
- The `Case`s which comprise this `Experiment` are recorded, similar to how they are currently.
- We've set a hierarchy for the `Case`s to match a Group hierarchy. The only invariant here is that Cases or Groups containing datasets whose coordinate values may differ (e.g. the 2 deg vs 1 deg grid) should appear first in the hierarchy... although it could also be last; I need to think through the implications of either choice.
- Every single dataset has the same set of Dimensions, but their coordinate values might differ. We indicate that in the Experiment readout in two places: under "Dimensions" and then under the "Coordinates" entries.
- To accommodate the fact that the underlying arrays may have inconsistent shapes, we don't print preview values under "Data variables"; instead we show a summary statistic or just some key attributes.
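The `x: *` wildcard in the readout above could be computed mechanically. Here is a hedged sketch of that logic; `summarize_dims` and its input layout are assumptions for illustration, not experiment's real internals.

```python
# Hypothetical helper for the readout above: given per-case coordinate values,
# report a fixed length for dimensions whose coordinates agree across all
# cases, and "*" for dimensions where they differ (the 2 deg vs 1 deg case).

def summarize_dims(case_coords):
    """case_coords: {case_key: {dim_name: tuple_of_coord_values}}."""
    summary = {}
    all_dims = {d for coords in case_coords.values() for d in coords}
    for dim in sorted(all_dims):
        values = {coords.get(dim) for coords in case_coords.values()}
        if len(values) == 1:
            summary[dim] = len(next(iter(values)))  # consistent -> real length
        else:
            summary[dim] = "*"                      # case-dependent -> wildcard
    return summary

coords_2deg = {"time": tuple(range(10)), "x": (0.0, 2.0, 4.0), "y": (0.0, 2.0)}
coords_1deg = {"time": tuple(range(10)),
               "x": (0.0, 1.0, 2.0, 3.0, 4.0), "y": (0.0, 1.0, 2.0)}
print(summarize_dims({"2deg": coords_2deg, "1deg": coords_1deg}))
# -> {'time': 10, 'x': '*', 'y': '*'}
```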
Immediately, this gives us a way to serialize to disk: the entire data structure translates directly to a NetCDF4 file with Groups. Groups can of course carry metadata, which we would specify as part of each Case's instantiation. Furthermore, the order/layout of Cases becomes important when you initialize an Experiment, since it dictates the structure of the translated NetCDF dataset.
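One way to see why Case order matters: each combination of case values maps to one group path in the serialized file, with the first Case forming the outermost level of the tree. A sketch (the `name=value` path scheme is an assumption, chosen only for illustration):

```python
# Sketch: an ordered list of Cases dictates the Group layout of the serialized
# NetCDF4 file. Case names/values mirror the Experiment readout above.

from itertools import product

cases = [                        # order matters: first case = outermost groups
    ("param1", ["a", "b", "c"]),
    ("param2", [1, 2, 3]),
    ("param3", ["alpha", "beta"]),
]

group_paths = [
    "/" + "/".join(f"{name}={value}" for (name, _), value in zip(cases, combo))
    for combo in product(*(values for _, values in cases))
]

print(len(group_paths))   # 3 * 3 * 2 = 18 groups, one per member dataset
print(group_paths[0])     # /param1=a/param2=1/param3=alpha
```

Reordering `cases` produces a different (but equally valid) tree, which is exactly why the layout has to be fixed at `Experiment` initialization.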
Under the hood, I don't think anything in Experiment has to change. We still use a multi-keyed dictionary to access the different Datasets that get loaded. But the idea of a "master dataset" becomes irrelevant: as long as all the Datasets are loaded lazily via dask, selecting out of the underlying dictionary and performing arithmetic just defers to xarray. The only open question is how to apply calculations directly... it seems like we don't gain anything by looping over that dictionary, but we can always apply the calculations asynchronously in parallel via Joblib and reserve placeholders using the futures we get back. After all, xarray itself loops over the Variables in a Dataset... right?
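The "apply in parallel and keep placeholders" idea might look like the following sketch. It uses the standard library's `concurrent.futures` as a stand-in for Joblib, and plain lists as stand-ins for xarray Datasets; the dictionary layout mirrors the multi-keyed storage described above, but none of this is experiment's actual code.

```python
# Sketch: submit one calculation per member dataset and keep the returned
# futures as placeholders, resolving them only when results are needed.

from concurrent.futures import ThreadPoolExecutor

# (param1, param2, param3) -> dataset; lists stand in for lazy xarray Datasets
datasets = {
    ("a", 1, "alpha"): [1.0, 2.0, 3.0],
    ("a", 1, "beta"): [4.0, 5.0, 6.0],
}

def analysis(ds):
    """Stand-in for a per-case calculation (e.g. a time mean)."""
    return sum(ds) / len(ds)

with ThreadPoolExecutor() as pool:
    # Submit every case at once; the futures are the placeholders we keep.
    futures = {key: pool.submit(analysis, ds) for key, ds in datasets.items()}

results = {key: fut.result() for key, fut in futures.items()}
print(results[("a", 1, "alpha")])  # 2.0
```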
The other key point is that this interface implicitly encodes when dimension mismatches will occur, since we are recording which Groups hold data whose dimensions do not agree. That means we can build in functionality to resolve those inconsistencies automatically. Similar to the alignment semantics in pandas, we could incorporate crude functionality to automatically re-grid data so that normal array operations work. For the case of bilinear resampling on simple rectangular coordinate systems, that's super easy. The framework would then strongly motivate the broader pangeo-data goal of universal re-gridding tools.
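For the rectangular-grid case, the whole resampler fits in a few lines. This is a plain-Python sketch for illustration; a real implementation would vectorize it with numpy/xarray, and the function name and grid layout are my own choices.

```python
# Bilinear resampling on a regular rectangular grid, written out by hand.

def bilinear(xs, ys, grid, x, y):
    """Interpolate grid[j][i] (defined on regular coords xs, ys) at (x, y)."""
    dx, dy = xs[1] - xs[0], ys[1] - ys[0]
    i = min(int((x - xs[0]) / dx), len(xs) - 2)   # cell containing (x, y)
    j = min(int((y - ys[0]) / dy), len(ys) - 2)
    tx = (x - xs[i]) / dx                          # fractional position in cell
    ty = (y - ys[j]) / dy
    return ((1 - tx) * (1 - ty) * grid[j][i] + tx * (1 - ty) * grid[j][i + 1]
            + (1 - tx) * ty * grid[j + 1][i] + tx * ty * grid[j + 1][i + 1])

# Sample a coarse 2-deg field at points of a finer 1-deg target grid.
xs = [0.0, 2.0, 4.0]
ys = [0.0, 2.0]
grid = [[0.0, 2.0, 4.0],   # values chosen so the field is f(x, y) = x + y
        [2.0, 4.0, 6.0]]

print(bilinear(xs, ys, grid, 1.0, 1.0))  # 2.0 (exact for a linear field)
```

Interpolating each off-grid point of the target grid this way is all that "automatic re-gridding" needs for the rectangular case; curvilinear and conservative re-gridding are where the universal tools become hard.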
These are just some early thoughts to consider.