roocs / clisops
Climate Simulation Operations
Home Page: https://clisops.readthedocs.io/en/latest/
License: Other
I noticed that the C-library requirements now include libudunits2-dev or udunits (for macOS). This is a breaking change that needs to be communicated upstream and to users. Part of the issue is that documentation is not updated alongside changes to code, to ensure that "surprises" don't occur in production.
I can see a few steps needed to address this:
I'm sorry if I seem frustrated, but clisops will be going into Ouranos's production instance of xclim on our servers next month, so we need to follow better protocols.
The dataset c3s-cmip5.output1.NCC.NorESM1-ME.rcp60.mon.seaIce.OImon.r1i1p1.tsice.v20120614 has an irregular grid with dims i and j.
The test clisops/tests/ops/test_subset.py (Lines 391 to 423 in d9a7ef0) shows that longitude is not subsetted correctly for this example, but latitude is.
The core.subset module is missing the implementation for "crossing the 0 degree meridian":
clisops/clisops/core/subset.py (Line 114 in c493bc5)
[This issue is a migration of Ouranosinc/xclim#422]
The goal would be to compute spatial averages over a polygon. It would need to account for non-uniform grid-cell areas, partial overlap, and holes in polygons.
The best approach would be to compute a weight mask for the array, representing the area of each gridcell covered by the polygon (i.e. the fractional area).
As mentioned on the original thread, a first way to do this would be to generate a grid of higher resolution and use the existing create_mask.
Or, we could iterate over all gridcells and generate Polygons for them, either using provided lat_bnds and lon_bnds or inferring them. Then, shapely's methods could be used to compute the intersection of each grid polygon and the target polygon.
From some tests I made of both methods, the second can be quite fast and relatively easy to implement; a rough sketch follows.
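A rough sketch of the second method, assuming 1D lat_bnds/lon_bnds variables are available and treating cells as planar boxes (a real implementation would need spherical cell areas and bounds inference; the function name is illustrative):

import numpy as np
import xarray as xr
from shapely.geometry import box


def polygon_weight_mask(ds, poly):
    """Fraction of each grid cell covered by `poly` (planar approximation)."""
    lat_bnds = ds["lat_bnds"].values  # shape (nlat, 2)
    lon_bnds = ds["lon_bnds"].values  # shape (nlon, 2)
    weights = np.zeros((lat_bnds.shape[0], lon_bnds.shape[0]))
    for i, (lat0, lat1) in enumerate(lat_bnds):
        for j, (lon0, lon1) in enumerate(lon_bnds):
            cell = box(lon0, lat0, lon1, lat1)
            # Intersection area handles partial overlap and polygon holes.
            weights[i, j] = cell.intersection(poly).area / cell.area
    return xr.DataArray(weights, dims=("lat", "lon"))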
Issue in ESMValTool: some coordinate points vary between different files of this dataset (for different time ranges). Their fix removes these inaccuracies by rounding the coordinates; it can be found here: https://github.com/ESMValGroup/ESMValCore/blob/master/esmvalcore/cmor/_fixes/cmip5/noresm1_me.py
This seems to have been corrected - I can't find this problem in any of the NCC/NorESM1-ME Amon tas datasets.
@ellesmith88: I was just comparing the documentation on the dask pages with our implementation at:
http://xarray.pydata.org/en/stable/dask.html
Their example is:
delayed_obj = ds.to_netcdf("manipulated-example-data.nc", compute=False)
results = delayed_obj.compute()
Our code is:
with dask.config.set(scheduler="synchronous"):
    chunked_ds.to_netcdf(output_path, compute=False)
    chunked_ds.compute()
I think our code should be:
with dask.config.set(scheduler="synchronous"):
    delayed_obj = chunked_ds.to_netcdf(output_path, compute=False)
    delayed_obj.compute()
Please check which is the correct implementation, remembering that compute=False was writing files but filling them with empty arrays. Thanks.
The Windows build is presently set as an allowed_fail build, but this isn't really the case, as it is generally stable and presently used by Windows users. Since this build uses Anaconda explicitly, to get around the nightmare of installing/compiling C-libraries, it's important to note that changes to the requirements (in setup.py or requirements.txt) need to be mirrored in the environment.yml; otherwise, Windows-installed pip will try to install dependencies, and we'll get failure stack traces that make no sense at all.
If no one disagrees, I'm going to re-align the dependencies and set the Windows build to be a required passing build.
Discuss with Ouranos how we should implement this. Maybe inside clisops.core.subset.
DKRZ are loading CMIP6 into Zarr. Here are some of their experiences with xarray.open_mfdataset:
One problem with the following line:
ds = xarray.open_mfdataset(catvar.df["path"].to_list(), use_cftime=True, combine="by_coords")
Xarray does not interpret the bounds keyword, so the corresponding lat and lon bounds are listed as data variables. That might not cause any problem, but on top of that, xarray adds a time dimension to those variables:
lat_bnds (time, lat, bnds) float64 dask.array<chunksize=(1826, 192, 2), meta=np.ndarray>
lon_bnds (time, lon, bnds) float64 dask.array<chunksize=(1826, 384, 2), meta=np.ndarray>
DKRZ used:
xarray.open_mfdataset(
    catvar.df["path"].to_list(),
    decode_cf=True,
    concat_dim="time",
    data_vars='minimal',
    coords='minimal',
    compat='override',
)
This is from the xarray tutorial, so that there is no longer a time dimension on the bnds variables. They had not included use_cftime, which might cause other problems, as I saw when reconverting to netCDF.
Noticed this error occurring on the ReadTheDocs build. While __init__.py isn't needed for packages in Python 3, it's terribly convenient when it comes to ensuring that the version of the library is available when imported. My suggestion would be to simply add a single __init__.py at the top level of the library, or to add a __version__ object to the main module to set the version.
This would ultimately be managed by bumpversion.
@sol1105: noted this one, might be relevant when we look at masks:
Create a couple of unit tests that check that auxiliary variables and coordinate variables are provided in the output Dataset from clisops:
- coordinate variables, such as latitude and longitude when the original data is on an irregular grid.
- auxiliary variables, such as a related variable that happens to exist in the file but is not a coordinate (a sketch of such a test follows the list).
What we want to understand is:
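Whatever the answers turn out to be, a minimal first test could look like this - a sketch using a tiny synthetic dataset, since clisops' actual behaviour here is exactly what we want to probe (variable names are illustrative):

import numpy as np
import xarray as xr
from clisops.core import subset


def test_aux_and_coord_variables_survive_subset():
    lat = np.arange(-10.0, 15.0, 5.0)
    lon = np.arange(0.0, 25.0, 5.0)
    tas = xr.DataArray(
        np.random.rand(lat.size, lon.size),
        coords={"lat": lat, "lon": lon},
        dims=("lat", "lon"),
    )
    # An auxiliary variable that exists in the file but is not a coordinate.
    aux = xr.DataArray(np.arange(lat.size, dtype=float), dims=("lat",))
    ds = xr.Dataset({"tas": tas, "aux_var": aux})

    result = subset.subset_bbox(ds, lat_bnds=[-5, 5], lon_bnds=[5, 15])
    assert "lat" in result.coords and "lon" in result.coords
    assert "aux_var" in result.data_vars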
The checks in subset_time can raise an exception in certain cases.
I have tested:
However, if start_date or end_date are within the time range BUT are not exactly aligned with a time step, then we get an exception, e.g.:
Traceback (most recent call last):
File "test_start_datetime.py", line 7, in <module>
res = subset_time(ds, start_date=ts[0], end_date=ts[1])
File "/home/users/astephen/roocs/clisops/clisops/core/subset.py", line 78, in func_checker
> da.time.sel(time=kwargs["end_date"]).max()
File "/home/users/astephen/roocs/venv/lib/python3.7/site-packages/xarray/core/common.py", line 46, in wrapped_func
return self.reduce(func, dim, axis, skipna=skipna, **kwargs)
File "/home/users/astephen/roocs/venv/lib/python3.7/site-packages/xarray/core/dataarray.py", line 2338, in reduce
var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
File "/home/users/astephen/roocs/venv/lib/python3.7/site-packages/xarray/core/variable.py", line 1591, in reduce
data = func(input_data, **kwargs)
File "/home/users/astephen/roocs/venv/lib/python3.7/site-packages/xarray/core/duck_array_ops.py", line 324, in f
return func(values, axis=axis, **kwargs)
File "/home/users/astephen/roocs/venv/lib/python3.7/site-packages/xarray/core/nanops.py", line 86, in nanmax
return _nan_minmax_object("max", dtypes.get_neg_infinity(a.dtype), a, axis)
File "/home/users/astephen/roocs/venv/lib/python3.7/site-packages/xarray/core/nanops.py", line 67, in _nan_minmax_object
data = getattr(np, func)(filled_value, axis=axis, **kwargs)
File "<__array_function__ internals>", line 6, in amax
File "/home/users/astephen/roocs/venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2706, in amax
keepdims=keepdims, initial=initial, where=where)
File "/home/users/astephen/roocs/venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
Here is a suggested rewrite of the check_start_end_dates decorator that uses slicing to nudge start_date and end_date, e.g.:
import warnings
from functools import wraps


def check_start_end_dates(func):
    @wraps(func)
    def func_checker(*args, **kwargs):
        da = args[0]
        kwargs.setdefault("start_date", None)
        kwargs.setdefault("end_date", None)
        # Nudge the start date to the first time step at or after it.
        nudged_start = da.time.sel(time=slice(kwargs["start_date"], None)).values[0].isoformat()
        if nudged_start != kwargs["start_date"]:
            warnings.warn(
                '"start_date" not found within input date time range. Value has been nudged to nearest'
                ' valid time step in xarray object.',
                UserWarning,
                stacklevel=2,
            )
            kwargs["start_date"] = nudged_start
        # Nudge the end date to the last time step at or before it.
        nudged_end = da.time.sel(time=slice(None, kwargs["end_date"])).values[-1].isoformat()
        if nudged_end != kwargs["end_date"]:
            warnings.warn(
                '"end_date" not found within input date time range. Value has been nudged to nearest'
                ' valid time step in xarray object.',
                UserWarning,
                stacklevel=2,
            )
            kwargs["end_date"] = nudged_end
        return func(*args, **kwargs)

    return func_checker
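For context, the decorator would wrap subset_time roughly like this (a simplified sketch, not the real signature in clisops/core/subset.py):

@check_start_end_dates
def subset_time(da, start_date=None, end_date=None):
    # With the decorator above, out-of-grid dates are nudged (with a warning)
    # before this selection runs, so the slice is never empty.
    return da.sel(time=slice(start_date, end_date))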
Check that:
... tests on Windows and macOS are currently disabled in .travis.yml (#8).
The current clisops implementation calculates time chunks without taking Area and Level selections into account.
See:
https://github.com/roocs/clisops/blob/master/clisops/ops/subset.py#L71
However, due to lazy evaluation, the call to clisops.core.subset will not touch the actual data. Hence, we can do the subset on the entire object, then calculate the time slices on that. This will simplify the whole calculation of size and time... probably means we can delete code!
We need something here that finds nbytes and n_time_steps, and uses our config memory limit in "clisops:read" to decide on a sensible chunk_length:

var_id = get_main_variable(ds)
da = ds[var_id].chunk({'time': chunk_length})
da.unify_chunks()
ds.to_netcdf(tmp_file)
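A sketch of what that calculation might look like, assuming the byte limit has already been read from the "clisops:read" config (calculate_chunk_length and memory_limit_bytes are illustrative names; get_main_variable is as used above):

def calculate_chunk_length(ds, memory_limit_bytes):
    """Choose a 'time' chunk length so that one chunk fits in the memory budget."""
    var_id = get_main_variable(ds)
    da = ds[var_id]
    n_time_steps = da.time.size
    bytes_per_time_step = da.nbytes / n_time_steps
    chunk_length = int(memory_limit_bytes // bytes_per_time_step)
    return max(1, min(n_time_steps, chunk_length))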
Also set environment variables in roocs_utils/etc/roocs.ini and set them in clisops/__init__.py:

import os

print('[WARNING] Heeding warning about Dask environment variables: '
      'https://docs.dask.org/en/latest/array-best-practices.html#avoid-oversubscribing-threads')
print("""export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
""")
for key in 'OMP_NUM_THREADS MKL_NUM_THREADS OPENBLAS_NUM_THREADS'.split():
    os.environ[key] = '1'
The docstring examples still refer to calls using xclim. These should be revised to reflect the accepted call behaviour for users (and eventually be tested in Travis CI).
Tag as version 3 and make a pull request to Xclim.
We can see from clisops.ops.subset::subset that most of the workflow will be independent of the actual operation:
Line 53 in ffeb599
The differences are:
clisops.core.*
Would it make sense to create a class that worked through the stages?
The latter, get_outputs, could in turn be a class that brings together:
The clisops.ops.subset could then be a class or function with decorators:
@validate_ds
def subset(ds, .....):
    mapped_args = utils.map_params(...)
    ds = clisops.core.subset(ds, ...)
    output_handler = OutputHandler(ds, ...)
    return output_handler.get_outputs()
NOTE: we also need to simplify, or find a way to remove: clisops.ops.subset::_subset
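As a sketch, the @validate_ds decorator used above might do no more than this (its exact responsibilities are an open question, so this is an assumption, not existing clisops code):

from functools import wraps

import xarray as xr


def validate_ds(func):
    """Reject anything that is not an xarray object before the operation runs."""
    @wraps(func)
    def wrapper(ds, *args, **kwargs):
        if not isinstance(ds, (xr.Dataset, xr.DataArray)):
            raise TypeError(f"Expected an xarray Dataset/DataArray, got {type(ds)}")
        return func(ds, *args, **kwargs)
    return wrapper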
The average function will allow averaging across multiple dimensions. But what do we call them?
Here is a suggestion to be discussed:
time - should match our standard parameter name for subsetting over time
level - should match our standard parameter name for subsetting over level
latitude - allowed (but might not always exist)
longitude - allowed (but might not always exist)
y - if the y-axis is not really latitude
x - if the x-axis is not really longitude
Main issue: the user doesn't know what the options are - so we need to map them (see the sketch below).
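One possible shape for that mapping, resolving user-facing names via coordinate attributes rather than hard-coded dimension names ("x" and "y" would be fallbacks for non-lat/lon grids; the matching rules below are assumptions to be agreed on):

# User-facing name -> (CF "axis" attribute, CF "standard_name").
AXIS_MAP = {
    "time": ("T", "time"),
    "level": ("Z", None),
    "latitude": ("Y", "latitude"),
    "longitude": ("X", "longitude"),
}


def resolve_dim(ds, name):
    """Find the dataset dimension matching a user-facing axis name."""
    axis, std_name = AXIS_MAP[name]
    for dim in ds.dims:
        coord = ds.coords.get(dim)
        if coord is None:
            continue
        if coord.attrs.get("axis") == axis:
            return dim
        if std_name and coord.attrs.get("standard_name") == std_name:
            return dim
    raise KeyError(f"No dimension matching '{name}' found in the dataset.")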
The images in the Subsetting Utilities notebooks in the documentation are not showing. This might be something to do with the path/symlink between the two notebook directories.
The Xarray mean method is:
http://xarray.pydata.org/en/stable/generated/xarray.DataArray.mean.html
It includes two optional arguments:
skipna - skip missing values (or not)
keep_attrs - keep variable attributes (or not)
Create some unit tests to help us understand the behaviour of each argument:
def test_xarray_da_mean_skipna_true():
- create a simple 1D xarray.DataArray with 10 values of [10., 10., 10., 10., 10., nan, nan, nan, nan, nan]
- test that the average is 10. if you use `skipna=True` (the NaN values are skipped)
def test_xarray_da_mean_skipna_false():
- create a simple 1D xarray.DataArray with 10 values of [10., 10., 10., 10., 10., nan, nan, nan, nan, nan]
- test that the average is NaN if you use `skipna=False` (the NaN values propagate)
If the results are not as above, we need to investigate more (see the runnable version below).
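A runnable version of the skipna checks, based on xarray's documented behaviour (skipna=True ignores NaNs; skipna=False propagates them):

import numpy as np
import xarray as xr


def test_xarray_da_mean_skipna():
    da = xr.DataArray([10.0] * 5 + [np.nan] * 5)
    assert float(da.mean(skipna=True)) == 10.0     # NaNs are skipped
    assert np.isnan(float(da.mean(skipna=False)))  # NaNs propagate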
def test_xarray_da_mean_keep_attrs_true():
- read a variable from our mini-esgf-cache
- average it with `mean` method across the time axis, with `keep_attrs=True`
- assert the original attributes match the new attributes
def test_xarray_da_mean_keep_attrs_false():
- read a variable from our mini-esgf-cache
- average it with `mean` method across the time axis, with `keep_attrs=False`
- examine the attributes of the resulting average DataArray
- assert those values when you know them
Discuss with team whether we want to:
1. Keep attrs
2. Lose attrs
3. Modify attrs (which might be: keep some then remove/edit/add others).
Keep these unit tests in our codebase anyway.
We need to provide an average function in clisops, with an associated daops and clisops method. This is a high-level GitHub issue regarding that function; it breaks down into more issues.
Overall plan:
Our average uses xarray directly, as documented here:
http://xarray.pydata.org/en/stable/generated/xarray.DataArray.mean.html
We are not going to support weighted means or complex averages.
This function only supports a simple and complete average across one or more dimensions of a hypercube.
Issues to explore:
Can we make an initial release 0.1.0 of the current clisops? This can be referenced by daops. After that we can merge the initial xclim subset module ... and make it a 0.2.0 release.
I ran a test on daops that used subset but didn't pass any arguments, and got the error: UnboundLocalError: local variable 'result' referenced before assignment
This is from clisops/ops/subset.py: https://github.com/roocs/clisops/blob/master/clisops/ops/subset.py#L20-L46
Do we want to be able to use subset with no arguments? If not, then we should update this so that the error message is more meaningful (see the sketch below).
@agstephens What do you think?
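If we decide subset must receive at least one argument, a minimal guard could look like this (the parameter names are assumptions based on the subset call signature):

def _validate_subset_args(time=None, area=None, level=None):
    """Fail early with a clear message instead of an UnboundLocalError downstream."""
    if time is None and area is None and level is None:
        raise ValueError(
            "subset() requires at least one of 'time', 'area' or 'level'."
        )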
PRs could benefit from a template of questions that the contributor could answer. Some identifiers like "What does this PR change?" and "Are there breaking changes?" are things I've seen across GitHub and xclim has a few as well. I can add that at some point soon.
with Ag and Elle
Travis CI is not active yet. We can adapt the xclim configuration.
Should we move root dirs, basic xarray utils etc. into a new repo (roocs-utils)?
FOR NOW: move those dependencies into clisops - but later we might move them out.
black style check tests are run by Travis. Fix complaints.
Sooner or later a user makes a "larger than memory" request.
We need to implement an appropriate level of Dask chunking in open dataset operations so that we avoid memory errors.
This needs some thought, but this example may be of use:
import xarray as xr
import os


def _setup_env():
    """
    export OMP_NUM_THREADS=1
    export MKL_NUM_THREADS=1
    export OPENBLAS_NUM_THREADS=1
    """
    env = os.environ
    env['OMP_NUM_THREADS'] = '1'
    env['MKL_NUM_THREADS'] = '1'
    env['OPENBLAS_NUM_THREADS'] = '1'


_setup_env()


def main():
    dr = '/badc/cmip6/data/CMIP6/HighResMIP/MOHC/HadGEM3-GC31-HH/control-1950/r1i1p1f1/day/ta/gn/v20180927'
    print(f'[INFO] Working on: {dr}')
    ds = xr.open_mfdataset(f'{dr}/*.nc')  # , parallel=False)

    chunk_rule = {'time': 4}
    chunked_ds = ds.chunk(chunk_rule)
    ds['ta'].unify_chunks()
    print(f'[INFO] Chunk rule: {chunk_rule}')

    OUTPUT_DIR = '/gws/nopw/j04/cedaproc/astephen/ag-zarr-test'
    output_path = f'{OUTPUT_DIR}/test.zarr'
    chunked_ds.to_zarr(output_path)  # Although we won't use Zarr in clisops! - NC should work fine.
    print(f'[INFO] Wrote: {output_path}')


if __name__ == '__main__':
    main()
Things to do:
When the simple file namer is used with split_method = 'time:auto', the output files are all named output_001.nc.
See tests/test_file_namers.py::test_SimpleFileNamer_with_chunking on the implement-split-outputs branch:
clisops/tests/test_file_namers.py (Line 33 in 8d8eea3)
Also make it available on PyPI and conda-forge.
clisops has "heavy" dependencies ... for installation, a conda package is more convenient.
Improve checking inside clisops?
This is an issue to identify a few problems:
pre-commit, a library that has made standardized development of xclim a breeze, is not implemented here. This library automatically catches PEP8 and black errors before they are committed to the shared code base, with almost no effort or extra steps needed from the contributor. For more information: https://pre-commit.com/.
I would like to contribute a fix for these issues.
Hello!
I just wanted to open an issue about some things that would make compatibility much easier to maintain for xclim. As of now, we have clisops installed in our Travis CI load-out for the following versions and builds:
Clisops is built into xclim such that, when importing xclim.subset with clisops in the installation environment, xclim exposes all the processes listed in __all__ of clisops.core.subset. It would be good to ensure that new functions are listed in __all__ and tested. Presently, these are:
__all__ = [
"create_mask",
"create_mask_vectorize",
"distance",
"subset_bbox",
"subset_gridpoint",
"subset_shape",
"subset_time",
]
I understand that there may eventually be decisions that change the way clisops is run, so in order to ensure that these changes don't take us by surprise, it would be good to see some of the following:
DeprecationWarnings and FutureDeprecationWarnings on major functions.
I am always available to help clarify any of the practices we use and can help with implementing standards as needed. Just let me know where I can help out.
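A sketch of how such a warning could be attached to a function being renamed or moved (the decorator itself is an assumption, not existing clisops code):

import warnings
from functools import wraps


def deprecated(message):
    """Emit a DeprecationWarning each time the wrapped function is called."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator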
Use bumpversion for releases. Fix config and update docs for bumpversion. Check usage of version and compare with xclim.
Enable ReadTheDocs build for documentation. Docs can include notebook examples using nbsphinx. See xclim/finch.
Just curious, but I was wondering if there was interest in installing slack integration to clisops (and possibly other repositories)? I'm not sure if the roocs team uses slack, but it's useful in that the slack integrations allows you to watch issues and PRs on slack with build checks updated in real-time.
We presently use it for some Bird-house/PAVICS repositories (xclim included). Seems like a good fit for coordinating efforts/conversing. For more info: https://github.com/integrations/slack
Instead of looking for "lat" or "latitude", clisops should use:
is_longitude(dim)
is_latitude(dim)
etc. (see the sketch below)
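A sketch of what these helpers might check, using CF metadata rather than name matching (the exact rules would need to be agreed):

def is_latitude(coord):
    """True if a coordinate looks like latitude, judged by CF metadata."""
    return (
        coord.attrs.get("standard_name") == "latitude"
        or coord.attrs.get("axis") == "Y"
        or coord.attrs.get("units") in ("degrees_north", "degree_north", "degreeN")
    )


def is_longitude(coord):
    """True if a coordinate looks like longitude, judged by CF metadata."""
    return (
        coord.attrs.get("standard_name") == "longitude"
        or coord.attrs.get("axis") == "X"
        or coord.attrs.get("units") in ("degrees_east", "degree_east", "degreeE")
    )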
Read standardized model output into xarray DataSets (from disk, also via intake-esm?) -> Not required, assume data has been read into an xarray dataset
Hi @huard @Zeitsperre, as part of getting ready for our first formal release of clisops, we would like to:
- move <repo>/docs/notebooks to <repo>/notebooks
- document the clisops.ops.subset module and connect it to the sphinx documentation
Are you happy for us to do this?
I have marked in PR #8 some subset tests with pytest.xfail which are failing in a tox environment but are working with a conda environment.
Packages like rioxarray or hvplot provide an xarray extension so their methods can be called directly on the dataset. Would that be wanted with clisops?
Example: instead of
from clisops import subset
subset.subset_bbox(ds, lat_bnds=[45, 50], lon_bnds=[-60, -55])
one could use:
import clisops.xarray
ds.cso.subset_bbox(lat_bnds=[45, 50], lon_bnds=[-60, -55])
Where "cso" is the xarray extension added by clisops
.
Personally, I like this approach as it looks more elegant and xarray-esque. Moreover, it could allow for dataset-related lookups like crs info in metadata or using something like rioxarray's ds.rio.set_spatial_dims
to solve the problem of #32.
Implementation-wise, it shouldn't be complicated and wouldn't change the rest of the api, simply add another access mechanism.
And, I believe it would make clisops more attractive to xarray users!
As a heavy user of almost-extinct xclim.subset
, I can offer some time on this implementation, it it is wanted.
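This could build on xarray's accessor registration; roughly what a clisops.xarray module could contain (a sketch: the "cso" name comes from the example above, and wrapping only subset_bbox keeps it minimal):

import xarray as xr

from clisops.core import subset


@xr.register_dataset_accessor("cso")
class ClisopsAccessor:
    """Expose clisops.core.subset operations directly on a Dataset."""

    def __init__(self, ds):
        self._ds = ds

    def subset_bbox(self, **kwargs):
        # Delegate to the existing functional API.
        return subset.subset_bbox(self._ds, **kwargs)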
This test demonstrates that, at present, clisops does NOT reverse lat limits - but DOES NOT RAISE AN EXCEPTION:

from clisops.core import subset
import xarray as xr


def test_lat_lon_reversal_empty_ds():
    data = 'CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp126.Amon.gn'  # dataset id, for reference
    coll = '/badc/cmip6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/*.nc'
    ds = xr.open_mfdataset(coll, decode_times=False, combine='by_coords', use_cftime=True)

    # Reversed latitude bounds: silently returns an empty 'lat' dimension.
    ds1 = subset.subset_bbox(ds, lat_bnds=[70, 35])
    # Ascending latitude bounds: works as expected.
    ds2 = subset.subset_bbox(ds, lat_bnds=[35, 70])

    assert ds1.tas.shape == (1032, 0, 384)
    assert ds1.dims['lat'] == 0
    assert ds2.tas.shape == (1032, 38, 384)
    assert ds2.dims['lat'] == 38