
experiment's Introduction

Hello!

I'm Daniel Rothenberg. In my professional life I wear a lot of different hats:

  • ๐ŸŒ I'm an atmospheric/climate scientist with specialization in aerosol-cloud-climate processes and machine learning / AI applied to a variety weather/climate problems.
  • โ›ˆ I'm a meteorologist who buils AI/ML weather forecasting systems
  • ๐Ÿ› work adjacent to science-for-policy and policy-for-science applications and am interested in building a robust weather research enterprise in the United States.
  • ๐Ÿ‘จ๐Ÿปโ€๐Ÿ’ป I write a lot of code and develop software for many different problem domains.
  • ๐Ÿš— I help(ed) develop the capability for autonomous vehicles to handle nearly all non-winter weather conditions.

Please feel free to get in touch via @danrothenberg or daniel at danielrothenberg dot com!

experiment's People

Contributors

francbartoli


experiment's Issues

Compatibility/crossover/potential synergy with experi?

Apologies if you're already aware of this, but there is a package called experi which aims to provide some related functionality to this package.

Experi uses a yaml file to specify different combinations of options for an experiment which scans over various parameters, then helps you submit those as separate simulation jobs in different directories. What would be awesome is if experiment were able to read those same yaml files to specify the different cases. Then you could use experi to automatically submit the jobs and experiment to automatically analyse them, while keeping a global record of the options chosen for each case. It's almost like the two packages have tried to implement two halves of the same workflow...

A lot of the internal logic for parsing the yaml files is similar, so you might be interested in that too. I'm not sure how hard this would be to implement, but maybe it's possible to have an alternative yaml parser in experiment which can read the kind of files experi produces? I don't think merging the two packages would make sense - experi is more general in that it doesn't use xarray at all, and experiment is possibly more flexible with its file structures - but explicit support between the two could be great.
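For illustration only, here is what such an alternative parser might look like; the top-level variables: mapping is just a guess at experi's schema, and the Case import path follows the signature used in the examples further down this page.

import yaml  # PyYAML

from experiment import Case  # assumed import path; Case(name, description, values)

def cases_from_experi(path):
    """Sketch: build experiment Cases from an experi-style input file,
    assuming (without checking experi's docs) that scanned parameters
    live under a top-level 'variables' mapping of name -> list of values."""
    with open(path) as f:
        spec = yaml.safe_load(f)
    return [Case(name, name, list(values))
            for name, values in spec["variables"].items()]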

I think experi is in an earlier stage of development than experiment, but I just thought I would bring this to your attention.

Let me (and experi's developer @malramsay64) know what you think!

Generalize time post-processing

Trying to use this package with Fernando's large ensemble fails because he didn't include traditional time indices in his post-processed output - he simply has a "date" dimension, which stores each date as a YYYYMMDD integer.

All of the date-processing logic could probably happen in a pre-processing function passed by the user. But a more general abstraction for this functionality - metadata post-processing in general - would be useful.
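As a minimal sketch of the pre-processing route, assuming the raw files carry an integer "date" dimension in YYYYMMDD form (the function name is illustrative):

import pandas as pd
import xarray as xr

def decode_yyyymmdd(ds: xr.Dataset) -> xr.Dataset:
    """Convert an integer YYYYMMDD 'date' dimension into a datetime64
    'time' index, e.g. 20100131 -> 2010-01-31."""
    times = pd.to_datetime(ds["date"].values.astype(str), format="%Y%m%d")
    return ds.assign_coords(date=times).rename({"date": "time"})

# ...which could then be passed as a hook when opening the ensemble, e.g.:
# ds = xr.open_mfdataset("output_*.nc", preprocess=decode_yyyymmdd)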

Load subset of cases

Currently you can load specific, single cases via the .load() function. It would also be useful to fix one specific value for a given case while loading all values of the others. E.g., with the cases

emis = Case("emis", "Emissions Scenario", ['low', 'high'])
param = Case("param", "Tuning Parameter", ['x', 'y', 'z'])
other = Case("abc", "other factor", ['a', 'b', 'c'])

calling .load({var}, abc='a') would load all the low/high and x/y/z cases but select only 'a' along this last dimension. This would presumably speed up load times.
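A rough sketch of how this could work internally; everything named here (exp.cases, exp.case_path) is hypothetical rather than experiment's actual API:

import itertools

import xarray as xr

def load_subset(exp, var, **fixed):
    """Load `var` for every case combination, but pin any case passed as
    a keyword (e.g. abc='a') to that single value, so only the needed
    files are opened at all."""
    names = [c.name for c in exp.cases]
    value_lists = [[fixed[c.name]] if c.name in fixed else c.values
                   for c in exp.cases]
    pieces = []
    for combo in itertools.product(*value_lists):
        ds = xr.open_dataset(exp.case_path(var, **dict(zip(names, combo))))
        # Tag each piece with its case values so the hypercube can be rebuilt
        pieces.append(ds.expand_dims({n: [v] for n, v in zip(names, combo)}))
    return xr.combine_by_coords(pieces)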

Future Implementation based on NetCDF Groups

As raised at pydata/xarray#1092, a potential way to resolve the issue where different Cases have common dimensions but different coordinates (e.g., comparing a model with a 2 deg grid vs one with a 1 deg grid) would be to lean on the "Group" idiom implemented as part of the NetCDF CDM. The data model defines a Group such that:

A Group is a container for Attributes, Dimensions, EnumTypedefs, Variables, and nested Groups. The Groups in a Dataset form a hierarchical tree, like directories on a disk. There is always at least one Group in a Dataset, the root Group, whose name is the empty string.

Ignoring the "meta"-ness of this idiom (and its similarity to the on-disk organization that experiment encourages), it enables something very powerful, because variables organized in different groups can have different dimensions. A major issue raised in pydata/xarray#1092 is how this sort of functionality could be extended on top of existing xarray data structures. But selection semantics aren't necessarily the tough part of this; that's just an issue of building a sensible interface.

The broader idea is to encapsulate an entire collection of numerical model output in the same common framework, to help automate the generation of complex analyses. Using the sample data in experiment and the idiom I use now, that might look something like this:

<xarray.Dataset>
Dimensions:  (time: 10, x: 5, y: 5, param1: 3, param2: 3, param3: 2)
Coordinates:
  * param3   (param3) <U5 'alpha' 'beta'
  * param2   (param2) int64 1 2 3
  * param1   (param1) <U1 'a' 'b' 'c'
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
  * x        (x) float64 0.0 2.5 5.0 7.5 10.0
  * y        (y) float64 0.0 2.5 5.0 7.5 10.0
    numbers  (time) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
    precip   (param3, param2, param1, time, x, y) float64 -1.072 -0.124 -0.05287 ...

So this corresponds to a dense, 6D tensor or array which is constructed from 3 × 3 × 2 = 18 NetCDF files on disk. But it would be equivalent to a NetCDF file with hierarchical groups, which could potentially look something like this:

<experiment.Experiment>
Dimensions:  (time: 10, x: *, y: *)
Groups: param1: 3 // param2: 3 // param3: 2
Cases:
  + param3   "Parameter 3" <U5 'alpha' 'beta'
  + param2   "Parameter 2" int64 1 2 3
  + param1   "Parameter 1" <U1 'a' 'b' 'c'
Coordinates:
  * time   (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
  * x        (x) float64 [param1, ]
  * y        (y) float64 [param1, ]
    numbers  (time) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
    precip   (time, x, y) float64 {min: -10.0, max: 10.0}

What's going on here:

  1. The Cases which comprise this Experiment are recorded, similar to how they are currently
  2. We've set a hierarchy for the Cases to match a Group hierarchy. The only invariant here is that Cases or Groups containing datasets whose coordinate values may differ (e.g. the 2 deg vs 1 deg grid) should appear first in the hierarchy... although they could also be last; I need to think through the implications of either choice.
  3. Every single dataset has the same set of Dimensions, but their coordinate values might differ. We indicate that in the readout of the Experiment in two places: under "Dimensions" and then under the "Coordinates" values.
  4. To accommodate the fact that the underlying arrays may have inconsistent shapes, we do not print preview values under "Data variables", but instead show a summary statistic or some key attributes.

Immediately, this gives a way to serialize to disk - this entire data structure directly translates to a NetCDF4 file with Groups. Groups can of course have metadata, which we would specify as part of their Case instantiation. Furthermore, the order/layout of Cases would become important when you initialize an Experiment, since that dictates the structure of the translated netCDF dataset.
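That serialization could lean directly on the group support already in xarray's to_netcdf; a sketch, with the Experiment internals (exp.cases, exp.data) assumed rather than real:

import itertools

def to_netcdf_groups(exp, path):
    """Write each case combination's Dataset to its own group of a single
    NetCDF4 file, the group path mirroring the Case hierarchy
    (e.g. '/alpha/1/a')."""
    first = True
    for combo in itertools.product(*(c.values for c in exp.cases)):
        ds = exp.data[combo]                      # multi-keyed dict of Datasets
        group = "/".join(str(v) for v in combo)   # e.g. 'alpha/1/a'
        ds.to_netcdf(path, group=group, engine="netcdf4",
                     mode="w" if first else "a")  # create file, then append
        first = False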

Under the hood, I don't think anything in Experiment has to change. We still use a multi-keyed dictionary to access the different Datasets that get loaded. But the idea of a "master dataset" becomes irrelevant - as long as all the different Datasets are loaded lazily via dask, selecting out of the underlying dictionary and performing arithmetic just defers to xarray. The only question is how to apply calculations directly... it seems like we don't gain anything by looping over that dictionary, but we can always apply the calculations in parallel, asynchronously, via joblib and reserve placeholders using the futures we get back. After all, xarray itself loops over the Variables in a Dataset... right?
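That joblib pattern might look roughly like the following, with case_data standing in for the multi-keyed dictionary:

from joblib import Parallel, delayed

def apply_to_cases(case_data, func, n_jobs=-1):
    """Apply `func` to every case's Dataset in parallel and return a new
    dictionary with the same keys."""
    keys = list(case_data)
    results = Parallel(n_jobs=n_jobs)(
        delayed(func)(case_data[key]) for key in keys
    )
    return dict(zip(keys, results))

# e.g., a monthly-mean calculation across every case at once:
# monthly = apply_to_cases(exp.data, lambda ds: ds.resample(time="1M").mean())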

The other key issue is that this interface implicitly encodes where dimension mismatches will occur, since we record which Groups hold data with dimensions that do not agree. This means we can build in functionality to resolve those inconsistencies automatically. Similar to the alignment semantics in pandas, we could incorporate crude functionality to re-grid data so that normal array operations work. For the case of bilinear resampling on simple rectangular coordinate systems, that's super easy (see the sketch below). The framework would then strongly motivate the broader pangeo-data goal of universal re-gridding tools.
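For that simple rectangular case, a crude alignment step could just wrap the (multi)linear interpolation xarray already ships; a minimal sketch:

import xarray as xr

def align_to(coarse: xr.Dataset, fine: xr.Dataset) -> xr.Dataset:
    """Resample `coarse` onto `fine`'s rectangular (x, y) grid so that
    normal array operations between the two line up; interp() performs
    bilinear interpolation on a regular 2D grid."""
    return coarse.interp(x=fine["x"], y=fine["y"])

# e.g. diff = fine["precip"] - align_to(coarse, fine)["precip"]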

These are just some early thoughts to consider.
