
fastjmd95's Introduction

xgcm: General Circulation Model Postprocessing with xarray


Binder Examples

Link     Provider       Description
Binder   mybinder.org   Basic self-contained example
Binder   Pangeo Binder  More complex examples integrated with other Pangeo tools (dask, zarr, etc.)

Description

xgcm is a Python package for working with the datasets produced by numerical General Circulation Models (GCMs) and similar gridded datasets that are amenable to finite volume analysis. In these datasets, different variables are located at different positions with respect to a volume or area element (e.g. cell center, cell face, etc.). xgcm solves the problem of how to interpolate and difference these variables from one position to another.
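For concreteness, here is a minimal sketch of the kind of operation xgcm handles; the coordinate names, sizes, and periodic setting below are illustrative assumptions rather than values from any particular model:

import numpy as np
import xarray as xr
from xgcm import Grid

# A 1D periodic axis with 100 cell centers ("x_c") and 100 left cell faces ("x_g").
# xgcm discovers the axis from the "axis" and "c_grid_axis_shift" attributes.
ds = xr.Dataset(
    coords={
        "x_c": (["x_c"], np.arange(0.5, 100), {"axis": "X"}),
        "x_g": (["x_g"], np.arange(0, 100), {"axis": "X", "c_grid_axis_shift": -0.5}),
    }
)
grid = Grid(ds, periodic=["X"])

temp_c = xr.DataArray(np.random.rand(100), dims=["x_c"], coords={"x_c": ds.x_c})
temp_g = grid.interp(temp_c, "X")   # interpolate from cell centers to cell faces
dtemp_g = grid.diff(temp_c, "X")    # finite difference, also landing on the faces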

xgcm consumes and produces xarray data structures, which are coordinate and metadata-rich representations of multidimensional array data. xarray is ideal for analyzing GCM data in many ways, providing convenient indexing and grouping, coordinate-aware data transformations, and (via dask) parallel, out-of-core array computation. On top of this, xgcm adds an understanding of the finite volume Arakawa Grids commonly used in ocean and atmospheric models and differential and integral operators suited to these grids.

xgcm was motivated by the rapid growth in the numerical resolution of ocean, atmosphere, and climate models. While highly parallel supercomputers can now easily generate tera- and petascale datasets, common post-processing workflows struggle with these volumes. Furthermore, we believe that a flexible, evolving, open-source, python-based framework for GCM analysis will enhance the productivity of the field as a whole, accelerating the rate of discovery in climate science. xgcm is part of the Pangeo initiative.

Getting Started

To learn how to install and use xgcm for your dataset, visit the xgcm documentation.

fastjmd95's People

Contributors

jbusecke, rabernat

fastjmd95's Issues

Performance issues with `xr.apply_ufunc()` and `jmd95`

I was trying to compute potential density (sigma_2) from SOSE model data with fastjmd95, but ran into some performance issues, particularly with this line:

# d(rho)/d(theta) at reference pressure pref, applied lazily over the dask chunks
drhodt = xr.apply_ufunc(
    jmd95numba.drhodt, ds.SALT, ds.THETA, pref,
    output_dtypes=[ds.THETA.dtype],
    dask='parallelized',
).reset_coords(drop=True)

Workers (30 at most) kept dying for no obvious reason. This led me to run a matrix of computations to isolate the underlying problem: is xr.apply_ufunc() itself having difficulty, or is it the combination of apply_ufunc() and jmd95?

Please see my notebook for the full picture of the issue and the run times: https://nbviewer.jupyter.org/github/ocean-transport/WMT-project/blob/master/SOSE-budgets/optimization-computing-issue.ipynb

The test matrix varied three things:

  • wrapper: xr.apply_ufunc(), dsa.map_blocks(), or xr.map_blocks() (a sketch of the dsa.map_blocks() call appears after the run times below)
  • function: fastjmd95 or a dummy function (a simple .sum(), to check whether fastjmd95 itself is having issues)
  • data: SOSE model data or randomized data (to check whether the problem is also rooted in the model data)

Run times:

  wrapper            function   data             run time
  xr.apply_ufunc()   fastjmd95  model data       4min 4s
  xr.apply_ufunc()   fastjmd95  randomized data  all tasks go to one worker and it never executes
  xr.apply_ufunc()   sum()      model data       29.7 s
  xr.apply_ufunc()   sum()      randomized data  15.6 s
  dsa.map_blocks()   fastjmd95  model data       51.2 s
  dsa.map_blocks()   fastjmd95  randomized data  1min 53s
  dsa.map_blocks()   sum()      model data       27.7 s
  dsa.map_blocks()   sum()      randomized data  13.3 s
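
For reference, the dsa.map_blocks() path was set up roughly like this; this is a sketch, assuming ds.SALT and ds.THETA are dask-backed with matching chunks and pref is a scalar reference pressure, and the exact call in the notebook may differ:

import dask.array as dsa

# Apply the numba ufunc blockwise on the underlying dask arrays.
drhodt_blocks = dsa.map_blocks(
    jmd95numba.drhodt, ds.SALT.data, ds.THETA.data, pref,
    dtype=ds.THETA.dtype,
)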

Any help figuring out what's going on would be much appreciated.

'DUFunc' object has no attribute 'signature' issue

I am running option 2 from the tutorial on LEAP hub:

import xarray as xr
from fastjmd95 import jmd95numba
import fastjmd95

print("xarry:"  + xr.__version__)
print("fastjmd95:" + fastjmd95.__version__)

from intake import open_catalog
cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean.yaml")
ds  = cat["SOSE"].to_dask()

# load data into memory
th0 = ds.THETA[0].compute()
slt0 = ds.SALT[0].compute()

rho_xr = jmd95numba.rho(slt0, th0, 0)

and getting the error:

/srv/conda/envs/notebook/lib/python3.11/site-packages/intake_xarray/base.py:21: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
  'dims': dict(self._ds.dims),
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[41], line 16
     13 th0 = ds.THETA[0].compute()
     14 slt0 = ds.SALT[0].compute()
---> 16 rho_xr = jmd95numba.rho(slt0, th0, 0)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/numba/np/ufunc/dufunc.py:186, in DUFunc.__call__(self, *args, **kws)
    184 for arg in args + tuple(kws.values()):
    185     if getattr(type(arg), "__array_ufunc__", default) is not default:
--> 186         output = arg.__array_ufunc__(self, "__call__", *args, **kws)
    187         if output is not NotImplemented:
    188             return output

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/arithmetic.py:56, in SupportsArithmetic.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
     51     if not is_duck_array(x) and not isinstance(
     52         x, self._HANDLED_TYPES + (SupportsArithmetic,)
     53     ):
     54         return NotImplemented
---> 56 if ufunc.signature is not None:
     57     raise NotImplementedError(
     58         f"{ufunc} not supported: xarray objects do not directly implement "
     59         "generalized ufuncs. Instead, use xarray.apply_ufunc or "
     60         "explicitly convert to xarray objects to NumPy arrays "
     61         "(e.g., with `.values`)."
     62     )
     64 if method != "__call__":
     65     # TODO: support other methods, e.g., reduce and accumulate.

AttributeError: 'DUFunc' object has no attribute 'signature'

My relevant package version numbers are:
xarray: 2024.1.0
fastjmd95: 0.2.1
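
For what it's worth, here is a minimal workaround sketch, assuming the failure sits in the hand-off between xarray's __array_ufunc__ and the numba DUFunc rather than in fastjmd95 itself (not a confirmed fix):

# Either pass plain NumPy arrays to the numba ufunc...
rho_np = jmd95numba.rho(slt0.values, th0.values, 0)

# ...or let xarray dispatch it explicitly, which keeps the xarray metadata.
rho_xr = xr.apply_ufunc(jmd95numba.rho, slt0, th0, 0)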

DOI for referencing this work.

I am writing up a manuscript and will likely cite this package (unless the stupid citation limit comes into play). Does this already have a DOI?

Performance issues on google cloud (and beyond)

I am using fastjmd95 to infer potential density from CMIP6 models. I have recently experienced performance issues in a complicated workflow, but I think I can trace some of them back to the step involving fastjmd95.

Here is a small example that reproduces the issue:

# Load a single model from the CMIP archive
import xarray as xr
import gcsfs
from fastjmd95 import jmd95numba

gcs = gcsfs.GCSFileSystem(token='anon')
so = xr.open_zarr(gcs.get_mapper('gs://cmip6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Omon/so/gn/'), consolidated=True).so
thetao = xr.open_zarr(gcs.get_mapper('gs://cmip6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Omon/thetao/gn/'), consolidated=True).thetao

# calculate sigma0 based on the instruction notebook (https://nbviewer.jupyter.org/github/xgcm/fastjmd95/blob/master/doc/fastjmd95_tutorial.ipynb)
sigma_0 = xr.apply_ufunc(
    jmd95numba.rho, so, thetao, 0, dask='parallelized', output_dtypes=[so.dtype]
) - 1000

I then performed some tests on the Google Cloud deployment (a dask cluster with 5 workers).
When I trigger a computation on a variable that is simply read from storage (e.g. so.mean().load()), everything works fine: the memory load is low and the task stream is dense.

But when I try the same with the derived variable (sigma_0.mean().load()), things look really ugly: the memory fills up almost immediately and spilling to disk starts. From the Progress pane it seems like dask is trying to load a large chunk of the dataset into memory before the rho calculation is applied.
[dask dashboard screenshot omitted]

To me it seems like the scheduler is going wide on the task graph rather than deep, whereas going deep first could free up some memory sooner?
I am really not good enough at dask to diagnose what is going on, but any tips would be much appreciated.
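
In case it helps narrow this down, one thing worth checking (a diagnostic sketch; the dimension names and chunk sizes below are illustrative, not taken from the actual stores) is whether the so and thetao stores share the same chunking, since mismatched chunks can force dask to materialize large intermediate blocks before jmd95numba.rho can run blockwise:

# Compare the chunking of the two inputs as opened from zarr.
print(so.chunks)
print(thetao.chunks)

# If they differ, rechunking them identically (sizes here are made up) keeps
# the rho computation purely blockwise.
chunks = {"time": 120, "lev": 20}
sigma_0 = xr.apply_ufunc(
    jmd95numba.rho, so.chunk(chunks), thetao.chunk(chunks), 0,
    dask="parallelized", output_dtypes=[so.dtype],
) - 1000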
