As raised at pydata/xarray#1092, a potential way to resolve the issue where different `Case`s have common dimensions but different coordinates (e.g., comparing a model with a 2 deg grid against one with a 1 deg grid) would be to lean on the "Group" idiom implemented as part of the NetCDF CDM. The data model defines a Group such that:
A Group is a container for Attributes, Dimensions, EnumTypedefs, Variables, and nested Groups. The Groups in a Dataset form a hierarchical tree, like directories on a disk. There is always at least one Group in a Dataset, the root Group, whose name is the empty string.
Ignoring the "meta"-ness of this idiom (and its similarity to the on-disk organization that experiment encourages), it enables something very powerful: variables organized in different groups can have different dimensions. A major issue raised in pydata/xarray#1092 is how this sort of functionality could be layered on top of existing xarray data structures. But selection semantics aren't necessarily the tough part of this; that's just a matter of building a sensible interface.
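To make the idiom concrete, here is a minimal, library-free sketch of a CDM-style Group tree. The `Group` class and its methods are hypothetical stand-ins (not netCDF4's or experiment's actual API); the point is only that sibling groups can reuse the same dimension *names* at different sizes.

```python
# A minimal sketch of the CDM Group idiom: each group is a container with its
# own dimensions, so sibling groups can reuse the same dimension names ("x",
# "y") at different lengths. All class/method names here are hypothetical.

class Group:
    def __init__(self, name, dims=None):
        self.name = name              # "" for the root group, per the CDM
        self.dims = dict(dims or {})  # dimension name -> length, local to this group
        self.children = {}

    def create_group(self, name, dims=None):
        child = Group(name, dims)
        self.children[name] = child
        return child

    def path(self, *names):
        """Walk the tree like a directory path on disk."""
        node = self
        for n in names:
            node = node.children[n]
        return node

# Two model runs share dimension names but not sizes (2 deg vs 1 deg grid).
root = Group("")
root.create_group("run_2deg", dims={"time": 10, "x": 180, "y": 90})
root.create_group("run_1deg", dims={"time": 10, "x": 360, "y": 180})

assert root.path("run_2deg").dims["x"] != root.path("run_1deg").dims["x"]
```

In a real NetCDF4 file this is exactly what `createGroup`/`createDimension` allow: each group defines its own dimensions, so the 2 deg and 1 deg runs can coexist in one file.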
The broader idea is to encapsulate an entire collection of numerical model output in the same common framework, to help automate the generation of complex analyses. Using the sample data in experiment and the idiom I use now, that might look something like this:
<xarray.Dataset>
Dimensions: (time: 10, x: 5, y: 5, param1: 3, param2: 3, param3: 2)
Coordinates:
* param3 (param3) <U5 'alpha' 'beta'
* param2 (param2) int64 1 2 3
* param1 (param1) <U1 'a' 'b' 'c'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
* x (x) float64 0.0 2.5 5.0 7.5 10.0
* y (y) float64 0.0 2.5 5.0 7.5 10.0
numbers (time) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
precip (time, x, y) float64 -1.072 -0.124 -0.05287 -0.9504 -1.527 ...
So this corresponds to a dense, 6-D array constructed from the (3 × 3 × 2) NetCDF files on disk. But it would be equivalent to a NetCDF file with hierarchical groups, which could potentially look something like this:
<experiment.Experiment>
Dimensions: (time: 10, x: *, y: *)
Groups: param1: 3 // param2: 3 // param3: 2
Cases:
+ param3 "Parameter 3" <U5 'alpha' 'beta'
+ param2 "Parameter 2" int64 1 2 3
+ param1 "Parameter 1" <U1 'a' 'b' 'c'
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
* x (x) float64 [param1, ]
* y (y) float64 [param1, ]
numbers (time) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
precip (time, x, y) float64 {min: -10.0, max: 10.0}
What's going on here:
- The `Case`s which comprise this `Experiment` are recorded, similar to how they are currently.
- We've set a hierarchy for the `Case`s to match a Group hierarchy. The only invariant here is that Cases or Groups containing datasets whose coordinate values may differ (e.g. the 2 deg vs 1 deg grid) should appear first in the hierarchy... although it could also be last; I need to think through the implications of either choice.
- Every single dataset has the same set of Dimensions, but their coordinate values might differ. We indicate that in the Experiment readout in two places: under "Dimensions" and then under the "Coordinates" entries.
- To accommodate the fact that the underlying arrays may have inconsistent shapes, we don't print preview values under "Data variables"; instead we show a summary statistic or just some key attributes.
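The `x: *` wildcard in the readout above could be computed mechanically. Here is a hedged sketch of that logic; `summarize_dims` and its input layout are assumptions for illustration, not experiment's real internals.

```python
# Hypothetical helper for the readout above: given per-case coordinate values,
# report a fixed length for dimensions whose coordinates agree across all
# cases, and "*" for dimensions where they differ (the 2 deg vs 1 deg case).

def summarize_dims(case_coords):
    """case_coords: {case_key: {dim_name: tuple_of_coord_values}}."""
    summary = {}
    all_dims = {d for coords in case_coords.values() for d in coords}
    for dim in sorted(all_dims):
        values = {coords.get(dim) for coords in case_coords.values()}
        if len(values) == 1:
            summary[dim] = len(next(iter(values)))  # consistent -> real length
        else:
            summary[dim] = "*"                      # case-dependent -> wildcard
    return summary

coords_2deg = {"time": tuple(range(10)), "x": (0.0, 2.0, 4.0), "y": (0.0, 2.0)}
coords_1deg = {"time": tuple(range(10)),
               "x": (0.0, 1.0, 2.0, 3.0, 4.0), "y": (0.0, 1.0, 2.0)}
print(summarize_dims({"2deg": coords_2deg, "1deg": coords_1deg}))
# -> {'time': 10, 'x': '*', 'y': '*'}
```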
Immediately, this gives us a way to serialize to disk: the entire data structure translates directly to a NetCDF4 file with Groups. Groups can of course carry metadata, which we would specify as part of each Case's instantiation. Furthermore, the order/layout of Cases becomes important when you initialize an Experiment, since it dictates the structure of the translated NetCDF dataset.
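One way to see why Case order matters: each combination of case values maps to one group path in the serialized file, with the first Case forming the outermost level of the tree. A sketch (the `name=value` path scheme is an assumption, chosen only for illustration):

```python
# Sketch: an ordered list of Cases dictates the Group layout of the serialized
# NetCDF4 file. Case names/values mirror the Experiment readout above.

from itertools import product

cases = [                        # order matters: first case = outermost groups
    ("param1", ["a", "b", "c"]),
    ("param2", [1, 2, 3]),
    ("param3", ["alpha", "beta"]),
]

group_paths = [
    "/" + "/".join(f"{name}={value}" for (name, _), value in zip(cases, combo))
    for combo in product(*(values for _, values in cases))
]

print(len(group_paths))   # 3 * 3 * 2 = 18 groups, one per member dataset
print(group_paths[0])     # /param1=a/param2=1/param3=alpha
```

Reordering `cases` produces a different (but equally valid) tree, which is exactly why the layout has to be fixed at `Experiment` initialization.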
Under the hood, I don't think anything in Experiment has to change. We still use a multi-keyed dictionary to access the different Datasets that get loaded. But the idea of a "master dataset" becomes irrelevant: as long as all the Datasets are loaded lazily via dask, selecting out of the underlying dictionary and performing arithmetic just defers to xarray. The only open question is how to apply calculations directly... it seems like we don't gain anything by looping over that dictionary, but we can always apply the calculations asynchronously in parallel via Joblib and reserve placeholders using the futures we get back. After all, xarray itself loops over the Variables in a Dataset... right?
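The "apply in parallel and keep placeholders" idea might look like the following sketch. It uses the standard library's `concurrent.futures` as a stand-in for Joblib, and plain lists as stand-ins for xarray Datasets; the dictionary layout mirrors the multi-keyed storage described above, but none of this is experiment's actual code.

```python
# Sketch: submit one calculation per member dataset and keep the returned
# futures as placeholders, resolving them only when results are needed.

from concurrent.futures import ThreadPoolExecutor

# (param1, param2, param3) -> dataset; lists stand in for lazy xarray Datasets
datasets = {
    ("a", 1, "alpha"): [1.0, 2.0, 3.0],
    ("a", 1, "beta"): [4.0, 5.0, 6.0],
}

def analysis(ds):
    """Stand-in for a per-case calculation (e.g. a time mean)."""
    return sum(ds) / len(ds)

with ThreadPoolExecutor() as pool:
    # Submit every case at once; the futures are the placeholders we keep.
    futures = {key: pool.submit(analysis, ds) for key, ds in datasets.items()}

results = {key: fut.result() for key, fut in futures.items()}
print(results[("a", 1, "alpha")])  # 2.0
```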
The other key point is that this interface implicitly encodes when dimension mismatches will occur, since we are recording which Groups hold data whose dimensions do not agree. That means we can build in functionality to resolve those inconsistencies automatically. Similar to the alignment semantics in pandas, we could incorporate crude functionality to automatically re-grid data so that normal array operations work. For the case of bilinear resampling on simple rectangular coordinate systems, that's super easy. The framework would then strongly motivate the broader pangeo-data goal of universal re-gridding tools.
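For the rectangular-grid case, the whole resampler fits in a few lines. This is a plain-Python sketch for illustration; a real implementation would vectorize it with numpy/xarray, and the function name and grid layout are my own choices.

```python
# Bilinear resampling on a regular rectangular grid, written out by hand.

def bilinear(xs, ys, grid, x, y):
    """Interpolate grid[j][i] (defined on regular coords xs, ys) at (x, y)."""
    dx, dy = xs[1] - xs[0], ys[1] - ys[0]
    i = min(int((x - xs[0]) / dx), len(xs) - 2)   # cell containing (x, y)
    j = min(int((y - ys[0]) / dy), len(ys) - 2)
    tx = (x - xs[i]) / dx                          # fractional position in cell
    ty = (y - ys[j]) / dy
    return ((1 - tx) * (1 - ty) * grid[j][i] + tx * (1 - ty) * grid[j][i + 1]
            + (1 - tx) * ty * grid[j + 1][i] + tx * ty * grid[j + 1][i + 1])

# Sample a coarse 2-deg field at points of a finer 1-deg target grid.
xs = [0.0, 2.0, 4.0]
ys = [0.0, 2.0]
grid = [[0.0, 2.0, 4.0],   # values chosen so the field is f(x, y) = x + y
        [2.0, 4.0, 6.0]]

print(bilinear(xs, ys, grid, 1.0, 1.0))  # 2.0 (exact for a linear field)
```

Interpolating each off-grid point of the target grid this way is all that "automatic re-gridding" needs for the rectangular case; curvilinear and conservative re-gridding are where the universal tools become hard.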
These are just some early thoughts to consider.