
fastjmd95's Introduction

xgcm: General Circulation Model Postprocessing with xarray


Binder Examples

Link     Provider       Description
Binder   mybinder.org   Basic self-contained example
Binder   Pangeo Binder  More complex examples integrated with other Pangeo tools (dask, zarr, etc.)

Description

xgcm is a Python package for working with the datasets produced by numerical General Circulation Models (GCMs) and similar gridded datasets that are amenable to finite volume analysis. In these datasets, different variables are located at different positions with respect to a volume or area element (e.g. cell center, cell face, etc.). xgcm solves the problem of how to interpolate and difference these variables from one position to another.
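For concreteness, here is a minimal sketch of the kind of operation xgcm handles; the coordinate names, sizes, and periodic setting below are illustrative assumptions rather than values from any particular model:

import numpy as np
import xarray as xr
from xgcm import Grid

# A 1D periodic axis with 100 cell centers ("x_c") and 100 left cell faces ("x_g").
# xgcm discovers the axis from the "axis" and "c_grid_axis_shift" attributes.
ds = xr.Dataset(
    coords={
        "x_c": (["x_c"], np.arange(0.5, 100), {"axis": "X"}),
        "x_g": (["x_g"], np.arange(0, 100), {"axis": "X", "c_grid_axis_shift": -0.5}),
    }
)
grid = Grid(ds, periodic=["X"])

temp_c = xr.DataArray(np.random.rand(100), dims=["x_c"], coords={"x_c": ds.x_c})
temp_g = grid.interp(temp_c, "X")   # interpolate from cell centers to cell faces
dtemp_g = grid.diff(temp_c, "X")    # finite difference, also landing on the faces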

xgcm consumes and produces xarray data structures, which are coordinate and metadata-rich representations of multidimensional array data. xarray is ideal for analyzing GCM data in many ways, providing convenient indexing and grouping, coordinate-aware data transformations, and (via dask) parallel, out-of-core array computation. On top of this, xgcm adds an understanding of the finite volume Arakawa Grids commonly used in ocean and atmospheric models and differential and integral operators suited to these grids.

xgcm was motivated by the rapid growth in the numerical resolution of ocean, atmosphere, and climate models. While highly parallel supercomputers can now easily generate tera- and petascale datasets, common post-processing workflows struggle with these volumes. Furthermore, we believe that a flexible, evolving, open-source, python-based framework for GCM analysis will enhance the productivity of the field as a whole, accelerating the rate of discovery in climate science. xgcm is part of the Pangeo initiative.

Getting Started

To learn how to install and use xgcm for your dataset, visit the xgcm documentation.

fastjmd95's People

Contributors

jbusecke, rabernat

fastjmd95's Issues

Performance issues with `xr.apply_ufunc()` and `jmd95`

I was trying to compute potential density (sigma_2) from SOSE model data with fastjmd95, but ran into some performance issues, particularly with this line:

# d(rho)/d(theta) at reference pressure pref, applied lazily over the dask chunks
drhodt = xr.apply_ufunc(
    jmd95numba.drhodt, ds.SALT, ds.THETA, pref,
    output_dtypes=[ds.THETA.dtype],
    dask='parallelized',
).reset_coords(drop=True)

Workers (30 at most) kept dying for no obvious reason. This led me to run a matrix of computations to isolate the underlying problem: is xr.apply_ufunc() itself having difficulty, or is it the combination of apply_ufunc() and jmd95?

Please see my notebook for the full picture of the issue and the run times: https://nbviewer.jupyter.org/github/ocean-transport/WMT-project/blob/master/SOSE-budgets/optimization-computing-issue.ipynb

The test matrix varied three things:

  • wrapper: xr.apply_ufunc(), dsa.map_blocks(), or xr.map_blocks() (a sketch of the dsa.map_blocks() call appears after the run times below)
  • function: fastjmd95 or a dummy function (a simple .sum(), to check whether fastjmd95 itself is having issues)
  • data: SOSE model data or randomized data (to check whether the problem is also rooted in the model data)

Run times:

  wrapper            function   data             run time
  xr.apply_ufunc()   fastjmd95  model data       4min 4s
  xr.apply_ufunc()   fastjmd95  randomized data  all tasks go to one worker and it never executes
  xr.apply_ufunc()   sum()      model data       29.7 s
  xr.apply_ufunc()   sum()      randomized data  15.6 s
  dsa.map_blocks()   fastjmd95  model data       51.2 s
  dsa.map_blocks()   fastjmd95  randomized data  1min 53s
  dsa.map_blocks()   sum()      model data       27.7 s
  dsa.map_blocks()   sum()      randomized data  13.3 s
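
For reference, the dsa.map_blocks() path was set up roughly like this; this is a sketch, assuming ds.SALT and ds.THETA are dask-backed with matching chunks and pref is a scalar reference pressure, and the exact call in the notebook may differ:

import dask.array as dsa

# Apply the numba ufunc blockwise on the underlying dask arrays.
drhodt_blocks = dsa.map_blocks(
    jmd95numba.drhodt, ds.SALT.data, ds.THETA.data, pref,
    dtype=ds.THETA.dtype,
)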

Any help figuring out what's going on would be much appreciated.

'DUFunc' object has no attribute 'signature' issue

I am running option 2 from the tutorial on LEAP hub:

import xarray as xr
from fastjmd95 import jmd95numba
import fastjmd95

print("xarry:"  + xr.__version__)
print("fastjmd95:" + fastjmd95.__version__)

from intake import open_catalog
cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean.yaml")
ds  = cat["SOSE"].to_dask()

# load data into memory
th0 = ds.THETA[0].compute()
slt0 = ds.SALT[0].compute()

rho_xr = jmd95numba.rho(slt0, th0, 0)

and getting the error:

/srv/conda/envs/notebook/lib/python3.11/site-packages/intake_xarray/base.py:21: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
  'dims': dict(self._ds.dims),
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[41], line 16
     13 th0 = ds.THETA[0].compute()
     14 slt0 = ds.SALT[0].compute()
---> 16 rho_xr = jmd95numba.rho(slt0, th0, 0)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/numba/np/ufunc/dufunc.py:186, in DUFunc.__call__(self, *args, **kws)
    184 for arg in args + tuple(kws.values()):
    185     if getattr(type(arg), "__array_ufunc__", default) is not default:
--> 186         output = arg.__array_ufunc__(self, "__call__", *args, **kws)
    187         if output is not NotImplemented:
    188             return output

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/arithmetic.py:56, in SupportsArithmetic.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
     51     if not is_duck_array(x) and not isinstance(
     52         x, self._HANDLED_TYPES + (SupportsArithmetic,)
     53     ):
     54         return NotImplemented
---> 56 if ufunc.signature is not None:
     57     raise NotImplementedError(
     58         f"{ufunc} not supported: xarray objects do not directly implement "
     59         "generalized ufuncs. Instead, use xarray.apply_ufunc or "
     60         "explicitly convert to xarray objects to NumPy arrays "
     61         "(e.g., with `.values`)."
     62     )
     64 if method != "__call__":
     65     # TODO: support other methods, e.g., reduce and accumulate.

AttributeError: 'DUFunc' object has no attribute 'signature'

My relevant package version numbers are:
xarray: 2024.1.0
fastjmd95: 0.2.1
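
For what it's worth, here is a minimal workaround sketch, assuming the failure sits in the hand-off between xarray's __array_ufunc__ and the numba DUFunc rather than in fastjmd95 itself (not a confirmed fix):

# Either pass plain NumPy arrays to the numba ufunc...
rho_np = jmd95numba.rho(slt0.values, th0.values, 0)

# ...or let xarray dispatch it explicitly, which keeps the xarray metadata.
rho_xr = xr.apply_ufunc(jmd95numba.rho, slt0, th0, 0)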

DOI for referencing this work.

I am writing up a manuscript and will likely cite this package (unless the stupid citation limit comes into play). Does this already have a DOI?

Performance issues on google cloud (and beyond)

I am using fastjmd95 to infer potential density from CMIP6 models. I have recently experienced performance issues in a complicated workflow, but I think I can trace some of them back to the step involving fastjmd95.

Here is a small example that reproduces the issue:

# Load a single model from the CMIP archive
import xarray as xr
import gcsfs
from fastjmd95 import jmd95numba

gcs = gcsfs.GCSFileSystem(token='anon')
so = xr.open_zarr(gcs.get_mapper('gs://cmip6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Omon/so/gn/'), consolidated=True).so
thetao = xr.open_zarr(gcs.get_mapper('gs://cmip6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Omon/thetao/gn/'), consolidated=True).thetao

# calculate sigma0 based on the instruction notebook (https://nbviewer.jupyter.org/github/xgcm/fastjmd95/blob/master/doc/fastjmd95_tutorial.ipynb)
sigma_0 = xr.apply_ufunc(
    jmd95numba.rho, so, thetao, 0, dask='parallelized', output_dtypes=[so.dtype]
) - 1000

I then performed some tests on the Google Cloud deployment (a dask cluster with 5 workers).
When I trigger a computation on a variable that is simply read from storage (e.g. so.mean().load()), everything works fine: the memory load is low and the task stream is dense.

But when I try the same with the derived variable (sigma_0.mean().load()), things look really ugly: the memory fills up almost immediately and spilling to disk starts. From the Progress pane it seems like dask is trying to load a large chunk of the dataset into memory before the rho calculation is applied.
[dask dashboard screenshot omitted]

To me it seems like the scheduler is going wide on the task graph rather than deep, whereas going deep first could free up some memory sooner?
I am really not good enough at dask to diagnose what is going on, but any tips would be much appreciated.
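
In case it helps narrow this down, one thing worth checking (a diagnostic sketch; the dimension names and chunk sizes below are illustrative, not taken from the actual stores) is whether the so and thetao stores share the same chunking, since mismatched chunks can force dask to materialize large intermediate blocks before jmd95numba.rho can run blockwise:

# Compare the chunking of the two inputs as opened from zarr.
print(so.chunks)
print(thetao.chunks)

# If they differ, rechunking them identically (sizes here are made up) keeps
# the rho computation purely blockwise.
chunks = {"time": 120, "lev": 20}
sigma_0 = xr.apply_ufunc(
    jmd95numba.rho, so.chunk(chunks), thetao.chunk(chunks), 0,
    dask="parallelized", output_dtypes=[so.dtype],
) - 1000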
