
resample fails on HadCRUT dataset (esmlab) · open · 8 comments

ncar commented on July 18, 2024
resample fails on HadCRUT dataset

Comments (8)

bradyrx commented on July 18, 2024

Note that native xarray just drops variables that can't be resampled in the requested way. It would be a nice fix to avoid that behavior, although you might want to throw a warning that certain variables weren't resampled as requested due to their data type.

import numpy as np
import pandas as pd
import xarray as xr
A = xr.DataArray(np.random.rand(100,), dims='time')
A.name = 'ints'
B = xr.DataArray(['str']*100, dims='time')
B.name = 'strs'
ds = xr.merge([A, B])
ds['time'] = pd.date_range('1900', '2000', freq='A')
print(ds)

<xarray.Dataset>
Dimensions:  (time: 100)
Coordinates:
  * time     (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1999-12-31
Data variables:
    ints     (time) float64 0.8798 0.4835 0.6354 0.354 ... 0.1584 0.7635 0.8525
    strs     (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'

ds.resample(time='A').mean()

<xarray.Dataset>
Dimensions:  (time: 100)
Coordinates:
  * time     (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1999-12-31
Data variables:
    ints     (time) float64 0.8798 0.4835 0.6354 0.354 ... 0.1584 0.7635 0.8525
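
For what it's worth, a warning along those lines could be emitted by comparing the variable sets before and after the resample. A minimal sketch, using a hypothetical helper name rather than any existing xarray or esmlab behavior:

import warnings

def warn_on_dropped_vars(original, resampled):
    # Hypothetical helper: warn about variables xarray silently dropped
    # (e.g. string variables) during the resample.
    dropped = set(original.data_vars) - set(resampled.data_vars)
    if dropped:
        warnings.warn(
            f"Variables {sorted(dropped)} were not resampled due to their data type"
        )

warn_on_dropped_vars(ds, ds.resample(time='A').mean())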

andersy005 commented on July 18, 2024

@brianpm, I am going to look into this and will get back to you.

andersy005 commented on July 18, 2024
  • Open the dataset with decode_times=True

The issue is caused by the data type of the field_status variable, which is a string. esmlab tries to apply resample to all data variables, including field_status, and it fails because of field_status's data type:

In [14]: tanomfil
Out[14]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 2030)
Coordinates:
  * latitude             (latitude) float32 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float32 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-01-16T12:00:00 ... 2019-02-15
Data variables:
    temperature_anomaly  (time, latitude, longitude) float32 ...
    field_status         (time) |S1 ...

One solution is to drop this variable:

In [16]: ds = tanomfil.drop('field_status')

In [17]: esmlab.resample(ds, 'ann',  time_coord_name='time')
/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/xarray/core/nanops.py:159: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)
Out[17]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 170)
Coordinates:
  * latitude             (latitude) float64 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float64 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-07-01T17:00:00 ... 2019-01-31T06:00:00
Data variables:
    temperature_anomaly  (time, latitude, longitude) float64 nan nan ... nan nan
Attributes:
    history:  \n2019-05-01 13:14:37.171125 esmlab.resample(<DATASET>, freq="a...

Another solution is to set field_status as a coordinate variable and then call resample:

In [27]: tanomfil = tanomfil.set_coords(['field_status'])

In [28]: tanomfil
Out[28]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 2030)
Coordinates:
  * latitude             (latitude) float32 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float32 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-01-16T12:00:00 ... 2019-02-15
    field_status         (time) |S1 b'f' b'f' b'f' b'f' ... b'p' b'p' b'p' b'p'
Data variables:
    temperature_anomaly  (time, latitude, longitude) float32 ...
Attributes:
    title:                  HadCRUT4 near-surface temperature ensemble data -...
    institution:            Met Office Hadley Centre / Climatic Research Unit...
    history:                Updated at 25/03/2019 16:03:56
    source:                 CRUTEM.4.6.0.0, HadSST.3.1.1.0
    comment:
    reference:              Morice, C. P., J. J. Kennedy, N. A. Rayner, and P...
    version:                HadCRUT.4.6.0.0
    Conventions:            CF-1.0
    ensemble_members:       100
    ensemble_member_index:  0

In [29]: esmlab.resample(tanomfil, 'ann',  time_coord_name='time')
/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/xarray/core/nanops.py:159: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)
Out[29]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 170)
Coordinates:
  * latitude             (latitude) float64 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float64 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-07-01T17:00:00 ... 2019-01-31T06:00:00
Data variables:
    temperature_anomaly  (time, latitude, longitude) float64 nan nan ... nan nan
Attributes:
    history:  \n2019-05-01 13:20:25.645865 esmlab.resample(<DATASET>, freq="a...

brianpm commented on July 18, 2024

That makes sense.

A nice alternative would be to be able to apply resample to just a DataArray instead of the whole dataset. Is that possible, or does time_bnds then become super problematic?

andersy005 commented on July 18, 2024

A nice alternative would be to be able to apply resample to just a DataArray instead of the whole dataset.

One easy solution for this would be to allow a user to pass a list of the variables they would like to apply the resample to. For example:

esmlab.resample(tanomfil, 'ann',  time_coord_name='time', data_vars=['temperature_anomaly'])
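
A rough sketch of how that could be wrapped around the current API (a hypothetical helper; the data_vars keyword does not exist in esmlab.resample yet):

import esmlab

def resample_vars(dset, freq, time_coord_name='time', data_vars=None):
    # Hypothetical wrapper: hand only the requested variables to esmlab,
    # dropping the rest before resampling.
    if data_vars is None:
        return esmlab.resample(dset, freq, time_coord_name=time_coord_name)
    other_vars = [v for v in dset.data_vars if v not in data_vars]
    return esmlab.resample(dset.drop(other_vars), freq, time_coord_name=time_coord_name)

resample_vars(tanomfil, 'ann', time_coord_name='time', data_vars=['temperature_anomaly'])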

@matt-long, any thoughts?

bradyrx commented on July 18, 2024

I imagine you could also do a quick check for variable type when you loop through data_vars. If you encounter a str, datetime, etc., you don't apply the resample there.
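
Something like np.issubdtype could drive that check; a minimal sketch (standalone, using a generic ds, not esmlab internals):

import numpy as np

# Sketch: keep only the variables with a numeric dtype for the resample step;
# strings, datetimes, etc. fail np.issubdtype(..., np.number) and are skipped.
numeric_vars = [
    name for name, da in ds.data_vars.items()
    if np.issubdtype(da.dtype, np.number)
]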

andersy005 commented on July 18, 2024

@bradyrx, we are using dset.apply() internally, and I believe this is where the problem is coming from:

esmlab/esmlab/core.py

Lines 488 to 505 in 8ced4d6

def weighted_mean_arr(darr, wgts=None):
    # if NaN are present, we need to use individual weights
    cond = darr.isnull()
    ones = xr.where(cond, 0.0, 1.0)
    mask = (
        darr.resample({self.time_coord_name: 'A'}).mean(dim=self.time_coord_name).notnull()
    )
    da_sum = (
        (darr * wgts).resample({self.time_coord_name: 'A'}).sum(dim=self.time_coord_name)
    )
    ones_out = (
        (ones * wgts).resample({self.time_coord_name: 'A'}).sum(dim=self.time_coord_name)
    )
    ones_out = ones_out.where(ones_out > 0.0)
    da_weighted_mean = da_sum / ones_out
    return da_weighted_mean.where(mask)

ds_resample_mean = dset.apply(weighted_mean_arr, wgts=wgts)

I am curious to know how one could add a check before calling .apply().

bradyrx commented on July 18, 2024

Here's a naive solution off the top of my head (using a slightly modified weighted mean function here):

def weighted_mean_arr(darr, wgts=1, time_coord_name='time'): 
    # if NaN are present, we need to use individual weights 
    cond = darr.isnull() 
    ones = xr.where(cond, 0.0, 1.0) 
    mask = ( 
        darr.resample({time_coord_name: 'A'}).mean(dim=time_coord_name).notnull() 
    ) 
    da_sum = ( 
        (darr * wgts).resample({time_coord_name: 'A'}).sum(dim=time_coord_name) 
    ) 
    ones_out = ( 
        (ones * wgts).resample({time_coord_name: 'A'}).sum(dim=time_coord_name) 
    ) 
    ones_out = ones_out.where(ones_out > 0.0) 
    da_weighted_mean = da_sum / ones_out 
    return da_weighted_mean.where(mask)

import numpy as np
import pandas as pd
import xarray as xr

# Generate monthly dummy data with two string variables
A = xr.DataArray(np.random.rand(120,), dims='time')
A.name = 'ints'
B = xr.DataArray(['str']*120, dims='time')
B.name = 'strs'
C = xr.DataArray(['foo']*120, dims='time')
C.name = 'strs2'
ds = xr.merge([A, B, C])
ds['time'] = pd.date_range('1900-01', '1910-01', freq='M')
print(ds)

<xarray.Dataset>
Dimensions:  (time: 120)
Coordinates:
  * time     (time) datetime64[ns] 1900-01-31 1900-02-28 ... 1909-12-31
Data variables:
    ints     (time) float64 0.3267 0.4233 0.4777 0.6476 ... 0.3189 0.9488 0.9251
    strs     (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'
    strs2    (time) <U3 'foo' 'foo' 'foo' 'foo' ... 'foo' 'foo' 'foo' 'foo'

# this list probably isn't exhaustive
acceptable_types = [np.dtype('float'), np.dtype('int'), np.dtype('double')]
drop_vars = []
for var in ds.data_vars:
    if np.asarray(ds[var]).dtype not in acceptable_types:
        drop_vars.append(var)

# put the variables that can't be resampled into a separate dataset
no_apply = ds[drop_vars]
# apply function only to non-string types
resampled = ds.drop(drop_vars).apply(weighted_mean_arr)

print(resampled)

<xarray.Dataset>
Dimensions:  (time: 10)
Coordinates:
  * time     (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1909-12-31
Data variables:
    ints     (time) float64 0.4236 0.5849 0.5722 0.4475 ... 0.4421 0.5669 0.5151

print(no_apply)

<xarray.Dataset>
Dimensions:  (time: 120)
Coordinates:
  * time     (time) datetime64[ns] 1900-01-31 1900-02-28 ... 1909-12-31
Data variables:
    strs     (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'
    strs2    (time) <U3 'foo' 'foo' 'foo' 'foo' ... 'foo' 'foo' 'foo' 'foo'

As in, you split the Dataset into two separate Datasets, apply the function, and then merge them at the end. The issue is that you now have two time dimensions with different coordinates. At this point, you'd have to rename one of them to something like TIME (which LANL does for some of their netCDF files with differing temporal resolutions). I don't really like this solution, though, as it gets confusing.
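
For illustration, that merge step might look like this (a sketch building on the resampled and no_apply datasets above; TIME is just an example name):

# Sketch: keep the unresampled variables on their own, renamed time axis
# and merge them back with the annually resampled variables.
combined = xr.merge([resampled, no_apply.rename({'time': 'TIME'})])
print(combined)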

In the end, should you really be maintaining a Dataset with different resolutions on the time axis? Maybe xarray has it right to just drop any non-quantitative variables. (I.e., just ditch no_apply in this case).
