
resample fails on HadCRUT dataset (esmlab) · open · 8 comments

ncar commented on July 18, 2024
resample fails on HadCRUT dataset

Comments (8)

bradyrx commented on July 18, 2024

Note that native xarray just drops variables that can't be resampled in the requested way. It would be a nice fix to avoid that behavior, although you might want to throw a warning that certain variables weren't resampled as requested due to their data type.

import numpy as np
import pandas as pd
import xarray as xr
A = xr.DataArray(np.random.rand(100,), dims='time')
A.name = 'ints'
B = xr.DataArray(['str']*100, dims='time')
B.name = 'strs'
ds = xr.merge([A, B])
ds['time'] = pd.date_range('1900', '2000', freq='A')
print(ds)

<xarray.Dataset>
Dimensions:  (time: 100)
Coordinates:
  * time     (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1999-12-31
Data variables:
    ints     (time) float64 0.8798 0.4835 0.6354 0.354 ... 0.1584 0.7635 0.8525
    strs     (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'

ds.resample(time='A').mean()

<xarray.Dataset>
Dimensions:  (time: 100)
Coordinates:
  * time     (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1999-12-31
Data variables:
    ints     (time) float64 0.8798 0.4835 0.6354 0.354 ... 0.1584 0.7635 0.8525
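
For what it's worth, a warning along those lines could be emitted by comparing the variable sets before and after the resample. A minimal sketch, using a hypothetical helper name rather than any existing xarray or esmlab behavior:

import warnings

def warn_on_dropped_vars(original, resampled):
    # Hypothetical helper: warn about variables xarray silently dropped
    # (e.g. string variables) during the resample.
    dropped = set(original.data_vars) - set(resampled.data_vars)
    if dropped:
        warnings.warn(
            f"Variables {sorted(dropped)} were not resampled due to their data type"
        )

warn_on_dropped_vars(ds, ds.resample(time='A').mean())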

andersy005 commented on July 18, 2024

@brianpm, I am going to look into this and will get back to you.

andersy005 commented on July 18, 2024
  • Open the dataset with decode_times=True

The issue is caused by the data type of the field_status variable, which is a string. esmlab tries to apply resample to all data variables, including field_status, and it fails because of field_status's data type:

In [14]: tanomfil
Out[14]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 2030)
Coordinates:
  * latitude             (latitude) float32 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float32 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-01-16T12:00:00 ... 2019-02-15
Data variables:
    temperature_anomaly  (time, latitude, longitude) float32 ...
    field_status         (time) |S1 ...

One solution is to drop this variable:

In [16]: ds = tanomfil.drop('field_status')

In [17]: esmlab.resample(ds, 'ann',  time_coord_name='time')
/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/xarray/core/nanops.py:159: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)
Out[17]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 170)
Coordinates:
  * latitude             (latitude) float64 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float64 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-07-01T17:00:00 ... 2019-01-31T06:00:00
Data variables:
    temperature_anomaly  (time, latitude, longitude) float64 nan nan ... nan nan
Attributes:
    history:  \n2019-05-01 13:14:37.171125 esmlab.resample(<DATASET>, freq="a...

Another solution is to set field_status as a coordinate variable and then call resample:

In [27]: tanomfil = tanomfil.set_coords(['field_status'])

In [28]: tanomfil
Out[28]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 2030)
Coordinates:
  * latitude             (latitude) float32 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float32 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-01-16T12:00:00 ... 2019-02-15
    field_status         (time) |S1 b'f' b'f' b'f' b'f' ... b'p' b'p' b'p' b'p'
Data variables:
    temperature_anomaly  (time, latitude, longitude) float32 ...
Attributes:
    title:                  HadCRUT4 near-surface temperature ensemble data -...
    institution:            Met Office Hadley Centre / Climatic Research Unit...
    history:                Updated at 25/03/2019 16:03:56
    source:                 CRUTEM.4.6.0.0, HadSST.3.1.1.0
    comment:
    reference:              Morice, C. P., J. J. Kennedy, N. A. Rayner, and P...
    version:                HadCRUT.4.6.0.0
    Conventions:            CF-1.0
    ensemble_members:       100
    ensemble_member_index:  0

In [29]: esmlab.resample(tanomfil, 'ann',  time_coord_name='time')
/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/xarray/core/nanops.py:159: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)
Out[29]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 170)
Coordinates:
  * latitude             (latitude) float64 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float64 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-07-01T17:00:00 ... 2019-01-31T06:00:00
Data variables:
    temperature_anomaly  (time, latitude, longitude) float64 nan nan ... nan nan
Attributes:
    history:  \n2019-05-01 13:20:25.645865 esmlab.resample(<DATASET>, freq="a...

brianpm commented on July 18, 2024

That makes sense.

A nice alternative would be to be able to apply resample to just a DataArray instead of the whole dataset. Is that possible, or does time_bnds then become super problematic?

andersy005 commented on July 18, 2024

A nice alternative would be to be able to apply resample to just a DataArray instead of the whole dataset.

One easy solution for this would be to allow a user to pass a list of the variables they would like to apply the resample to. For example:

esmlab.resample(tanomfil, 'ann',  time_coord_name='time', data_vars=['temperature_anomaly'])
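
A rough sketch of how that could be wrapped around the current API (a hypothetical helper; the data_vars keyword does not exist in esmlab.resample yet):

import esmlab

def resample_vars(dset, freq, time_coord_name='time', data_vars=None):
    # Hypothetical wrapper: hand only the requested variables to esmlab,
    # dropping the rest before resampling.
    if data_vars is None:
        return esmlab.resample(dset, freq, time_coord_name=time_coord_name)
    other_vars = [v for v in dset.data_vars if v not in data_vars]
    return esmlab.resample(dset.drop(other_vars), freq, time_coord_name=time_coord_name)

resample_vars(tanomfil, 'ann', time_coord_name='time', data_vars=['temperature_anomaly'])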

@matt-long, any thoughts?

bradyrx commented on July 18, 2024

I imagine you could also do a quick check for variable type when you loop through data_vars. If you encounter a str, datetime, etc., you don't apply the resample there.
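
Something like np.issubdtype could drive that check; a minimal sketch (standalone, using a generic ds, not esmlab internals):

import numpy as np

# Sketch: keep only the variables with a numeric dtype for the resample step;
# strings, datetimes, etc. fail np.issubdtype(..., np.number) and are skipped.
numeric_vars = [
    name for name, da in ds.data_vars.items()
    if np.issubdtype(da.dtype, np.number)
]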

andersy005 commented on July 18, 2024

@bradyrx, we are using dset.apply() internally, and I believe this is where the problem is coming from:

esmlab/esmlab/core.py

Lines 488 to 505 in 8ced4d6

def weighted_mean_arr(darr, wgts=None):
    # if NaN are present, we need to use individual weights
    cond = darr.isnull()
    ones = xr.where(cond, 0.0, 1.0)
    mask = (
        darr.resample({self.time_coord_name: 'A'}).mean(dim=self.time_coord_name).notnull()
    )
    da_sum = (
        (darr * wgts).resample({self.time_coord_name: 'A'}).sum(dim=self.time_coord_name)
    )
    ones_out = (
        (ones * wgts).resample({self.time_coord_name: 'A'}).sum(dim=self.time_coord_name)
    )
    ones_out = ones_out.where(ones_out > 0.0)
    da_weighted_mean = da_sum / ones_out
    return da_weighted_mean.where(mask)

ds_resample_mean = dset.apply(weighted_mean_arr, wgts=wgts)

I am curious to know how one could add a check before calling .apply().

bradyrx commented on July 18, 2024

Here's a naive solution off the top of my head (using a slightly modified weighted mean function here):

def weighted_mean_arr(darr, wgts=1, time_coord_name='time'): 
    # if NaN are present, we need to use individual weights 
    cond = darr.isnull() 
    ones = xr.where(cond, 0.0, 1.0) 
    mask = ( 
        darr.resample({time_coord_name: 'A'}).mean(dim=time_coord_name).notnull() 
    ) 
    da_sum = ( 
        (darr * wgts).resample({time_coord_name: 'A'}).sum(dim=time_coord_name) 
    ) 
    ones_out = ( 
        (ones * wgts).resample({time_coord_name: 'A'}).sum(dim=time_coord_name) 
    ) 
    ones_out = ones_out.where(ones_out > 0.0) 
    da_weighted_mean = da_sum / ones_out 
    return da_weighted_mean.where(mask)

import numpy as np
import pandas as pd
import xarray as xr

# Generate monthly dummy data with two string variables
A = xr.DataArray(np.random.rand(120,), dims='time')
A.name = 'ints'
B = xr.DataArray(['str']*120, dims='time')
B.name = 'strs'
C = xr.DataArray(['foo']*120, dims='time')
C.name = 'strs2'
ds = xr.merge([A, B, C])
ds['time'] = pd.date_range('1900-01', '1910-01', freq='M')
print(ds)

<xarray.Dataset>
Dimensions:  (time: 120)
Coordinates:
  * time     (time) datetime64[ns] 1900-01-31 1900-02-28 ... 1909-12-31
Data variables:
    ints     (time) float64 0.3267 0.4233 0.4777 0.6476 ... 0.3189 0.9488 0.9251
    strs     (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'
    strs2    (time) <U3 'foo' 'foo' 'foo' 'foo' ... 'foo' 'foo' 'foo' 'foo'

# this list probably isn't exhaustive
acceptable_types = [np.dtype('float'), np.dtype('int'), np.dtype('double')]
drop_vars = []
for var in ds.data_vars:
    if np.asarray(ds[var]).dtype not in acceptable_types:
        drop_vars.append(var)

# put the variables that can't be resampled into a separate dataset
no_apply = ds[drop_vars]
# apply function only to non-string types
resampled = ds.drop(drop_vars).apply(weighted_mean_arr)

print(resampled)

<xarray.Dataset>
Dimensions:  (time: 10)
Coordinates:
  * time     (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1909-12-31
Data variables:
    ints     (time) float64 0.4236 0.5849 0.5722 0.4475 ... 0.4421 0.5669 0.5151

print(no_apply)

<xarray.Dataset>
Dimensions:  (time: 120)
Coordinates:
  * time     (time) datetime64[ns] 1900-01-31 1900-02-28 ... 1909-12-31
Data variables:
    strs     (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'
    strs2    (time) <U3 'foo' 'foo' 'foo' 'foo' ... 'foo' 'foo' 'foo' 'foo'

As in, you split the Dataset into two separate Datasets, apply the function, and then merge them at the end. The issue is that you now have two time dimensions with different coordinates. At this point, you'd have to rename one of them to something like TIME (which LANL does for some of their netCDF files with differing temporal resolutions). I don't really like this solution, though, as it gets confusing.
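
For illustration, that merge step might look like this (a sketch building on the resampled and no_apply datasets above; TIME is just an example name):

# Sketch: keep the unresampled variables on their own, renamed time axis
# and merge them back with the annually resampled variables.
combined = xr.merge([resampled, no_apply.rename({'time': 'TIME'})])
print(combined)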

In the end, should you really be maintaining a Dataset with different resolutions on the time axis? Maybe xarray has it right to just drop any non-quantitative variables. (I.e., just ditch no_apply in this case).
