Okay, turns out I seriously misunderstood what `*by` does... I was assuming I would need to pass a `DataArray`, but it seems that's only true for `isbin=False`. For `isbin=True` it instead takes a variable name. This is really confusing, though.

I guess something like this might be more intuitive?
# by as a mapping, to avoid shadowing variable names
flox.xarray.xarray_reduce(arr, by={"time": flox.Bins(...), "depth": flox.Bins(...)}, func="mean", ...)

# by as *args
flox.xarray.xarray_reduce(
    arr,
    flox.Bins(variable="time", ...),
    flox.Bins(variable="depth", ...),
    func="mean",
    ...,
)
---
Thanks for trying it out, Justus!

One reason for all this confusion is that I always expected `xarray_reduce` to be temporary while I ported it to Xarray, but that is taking a lot of effort :( So now flox is stuck with some bad API choices.
> I frequently hit the "Needs better message" error from
oops yeah. The issue is that I wanted to enable the really simple line:
xarray_reduce(ds, "x", expected_groups=["a", "b", "c"])
So for multiple groupers `expected_groups` must be a tuple. The error message should be improved ;). This would be a nice PR.
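The improved error message could come from a small normalizer along these lines; this is only a hypothetical sketch (the function name and messages are made up, not flox API):

```python
def normalize_expected_groups(expected, n_by):
    """Normalize ``expected_groups`` to a tuple with one entry per grouper.

    Hypothetical helper, not part of flox: a bare list is only unambiguous
    for a single grouper, so wrap it; for multiple groupers, require a tuple.
    """
    if not isinstance(expected, tuple):
        if n_by > 1:
            raise TypeError(
                f"with {n_by} groupers, pass expected_groups as a tuple "
                f"with one entry per grouper, e.g. (['a', 'b'], None)"
            )
        expected = (expected,)
    if len(expected) != n_by:
        raise ValueError(f"expected {n_by} entries, got {len(expected)}")
    return expected


# single grouper: a bare list is fine and gets wrapped
normalize_expected_groups(["a", "b", "c"], n_by=1)  # → (["a", "b", "c"],)
```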
> Another option might be to just use an interval index.
So `flox.core.groupby_reduce` does actually handle `IntervalIndex` (or any pandas `Index`). And looking at the code, `xarray_reduce` should also handle it. Here we convert everything to an `Index`:

Line 318 in 51fb6e9

And that function does handle `pd.Index` objects. I think we should update the typing and docstring. This would be a helpful PR!
Edit: ah, now I see, we'll need to remove the `isbin` argument for this to work properly.
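As a plain-pandas illustration (not flox code) of why `isbin` becomes redundant: an `IntervalIndex` already encodes "these are bins" in its type.

```python
import numpy as np
import pandas as pd

# An IntervalIndex carries the bin edges and closed-ness in a single object,
# so its type alone signals "binning" -- no separate isbin flag is required.
intervals = pd.IntervalIndex.from_breaks(np.array([-0.5, 0.5, 1.5, 2.5]))

# pd.cut accepts the IntervalIndex directly when assigning values to bins
binned = pd.cut([0.0, 1.0, 2.0], bins=intervals)
```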
> turns out I seriously misunderstood what `*by` does... I was assuming I would need to pass a `DataArray`, but it seems that's only true for `isbin=False`.
This sounds like a bug, but I'm surprised: `by` can be `Hashable` or `DataArray`.

Line 232 in 51fb6e9

and the tests:

Lines 111 to 127 in 51fb6e9

A reproducible example would help.
> `by={"time": flox.Bins(...), "depth": flox.Bins(...)}`
Yeah, I've actually been considering the `dict` approach. It seems a lot nicer since there's potentially a lot of information (for example `cut_kwargs` in xarray, which gets passed to `pd.cut` and can contain whether the closed edge is left or right), and it's hard to keep track of order in 3-4 different kwargs. One point of ugliness would be grouping by a categorical variable where you want flox to autodetect group members. That would then look like

by = {"labels": None}

It also means you can't do

xarray_reduce(ds, "x", func="mean")

An alternative is to use `pd.IntervalIndex` instead of a new `flox.Bins` object. It can wrap a lot of info and will allow us to remove `isbin` (see `flox.core.groupby_reduce`).
---
Here's the example I tested this with (although I just realized I could have used `.cf.bounds_to_vertices` instead of writing a new function):
In [1]: import xarray as xr
   ...: import cf_xarray
   ...: import numpy as np
   ...: import flox.xarray
   ...:
   ...:
   ...: def add_vertices(ds, bounds_dim="bounds"):
   ...:     new_names = {
   ...:         name: f"{name.removesuffix(bounds_dim).rstrip('_')}_vertices"
   ...:         for name, coord in ds.variables.items()
   ...:         if bounds_dim in coord.dims
   ...:     }
   ...:     new_coords = {
   ...:         new_name: cf_xarray.bounds_to_vertices(ds[name], bounds_dim=bounds_dim)
   ...:         for name, new_name in new_names.items()
   ...:     }
   ...:     return ds.assign_coords(new_coords)
   ...:
   ...:
   ...: categories = list("abcefghi")
   ...:
   ...: coords = (
   ...:     xr.Dataset(coords={"x": np.arange(10), "y": ("y", categories)})
   ...:     .cf.add_bounds(["x"])
   ...:     .pipe(add_vertices)
   ...: )
   ...: coords
Out[1]:
<xarray.Dataset>
Dimensions: (x: 10, y: 8, bounds: 2, x_vertices: 11)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) <U1 'a' 'b' 'c' 'e' 'f' 'g' 'h' 'i'
x_bounds (x, bounds) float64 -0.5 0.5 0.5 1.5 1.5 ... 7.5 7.5 8.5 8.5 9.5
* x_vertices (x_vertices) float64 -0.5 0.5 1.5 2.5 3.5 ... 6.5 7.5 8.5 9.5
Dimensions without coordinates: bounds
Data variables:
*empty*
In [2]: data = xr.Dataset(
   ...:     {"a": ("x", np.arange(200))},
   ...:     coords={
   ...:         "x": np.linspace(-0.5, 9.5, 200),
   ...:         "y": ("x", np.random.choice(categories, size=200)),
   ...:     },
   ...: )
   ...: data
Out[2]:
<xarray.Dataset>
Dimensions: (x: 200)
Coordinates:
* x (x) float64 -0.5 -0.4497 -0.3995 -0.3492 ... 9.349 9.399 9.45 9.5
y (x) <U1 'e' 'g' 'b' 'b' 'h' 'a' 'g' ... 'a' 'c' 'c' 'h' 'c' 'b' 'h'
Data variables:
a (x) int64 0 1 2 3 4 5 6 7 8 ... 191 192 193 194 195 196 197 198 199
In [3]: flox.xarray.xarray_reduce(
   ...:     data["a"],
   ...:     coords["x"],
   ...:     expected_groups=(coords["x_vertices"],),
   ...:     isbin=[True] * 1,
   ...:     func="mean",
   ...: )
Out[3]:
<xarray.DataArray 'a' (x_bins: 10)>
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
Coordinates:
* x_bins (x_bins) object (-0.5, 0.5] (0.5, 1.5] ... (7.5, 8.5] (8.5, 9.5]
In [4]: flox.xarray.xarray_reduce(
   ...:     data["a"],
   ...:     "x",
   ...:     expected_groups=(coords["x_vertices"],),
   ...:     isbin=[True] * 1,
   ...:     func="mean",
   ...: )
Out[4]:
<xarray.DataArray 'a' (x_bins: 10)>
array([ 10. , 29.5, 49.5, 69.5, 89.5, 109.5, 129.5, 149.5, 169.5,
189.5])
Coordinates:
* x_bins (x_bins) object (-0.5, 0.5] (0.5, 1.5] ... (7.5, 8.5] (8.5, 9.5]
the first call passes the dimension coordinate and does something weird, while the second call succeeds.
---
> So for multiple groupers `expected_groups` must be a tuple. The error message should be improved ;).
Something else I noticed is that when passing a single-element list containing a `DataArray` (`[x]`) to `expected_groups`, it would somehow convert that to `[[x]]` and then fail later.
> It also means you can't do
We could get around that restriction by renaming `*by` to `*variable_names` (or something) and having a separate keyword-only argument named `by`. The only issue with that would be deciding what to do when it receives both the positional args and the `dict` (raise an error, probably). I'd actually also restrict the positional arguments to just the variable names.

`pd.IntervalIndex` works, although we need a good way to convert between the different bounds conventions: cf-xarray does provide conversion functions for bounds and vertices, but converting to / from intervals is not as easy (`pd.IntervalIndex.from_breaks` only helps in one direction). And for `IntervalIndex` we definitely need the `dict` interface.

Edit: at the moment, the `IntervalIndex` coordinates flox adds to the output are not really useful: they can't really be used for selecting / aligning (though that might change with a custom index), and they also can't be written to disk, so I usually replace them with the interval centers and bounds.
---
> We could get around that restriction by renaming `*by` to `*variable_names`
This is interesting but also messy. Presumably with the dict, everything else (`expected_groups`, `isbin`) should be `None`?
> `pd.IntervalIndex.from_breaks` only helps in one direction

> the `IntervalIndex` coordinates flox adds to the output are not really useful:
It does what Xarray does at the moment (see the output of `groupby_bins`). But yes, sticking `IntervalIndex` in Xarray is something we should fix upstream :). Then we could add an accessor that provides `.left`, `.right`, and `.mid` (these already exist on the Index itself).
> And for `IntervalIndex` we definitely need the `dict` interface.
Can you clarify? In `core.groupby_reduce` you provide `IntervalIndex` in `expected_groups`.
---
> the first call passes the dimension coordinate and does something weird
Ah, I don't think this is a bug. `data["x"]` and `coords["x"]` are not the same, so there's a reindexing step that happens as part of alignment.

I think I'll check for exact alignment.
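The exact-alignment check could rely on `xr.align(..., join="exact")`, which raises when indexes differ; a minimal sketch (assuming this is the mechanism used, which the comment doesn't specify):

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(3), dims="x", coords={"x": [0, 1, 2]})
b = xr.DataArray(np.arange(3), dims="x", coords={"x": [0, 1, 3]})

# join="exact" refuses silent reindexing instead of introducing NaNs
try:
    xr.align(a, b, join="exact")
    aligned = True
except ValueError:
    aligned = False
```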
---
> Ah I don't think this is a bug.
Yeah, that was a user error. After reading the example from #189, I finally understand that `*by` should contain the variables / variable names to group over (e.g. coordinates of the data) and not the group labels. Not sure how I got that idea.

So with that, I now think the `dict` suggestion does not make too much sense, because you can pass non-hashable objects (`DataArray`).
Which means we're left with the custom object suggestion:
flox.xarray.xarray_reduce(
    data,
    flox.Grouper("x", bins=coords.x_vertices),
    flox.Grouper(data.y, values=["a", "b", "c"]),
)
which basically does the same thing as the combination of `*by`, `expected_groups`, and `isbin`, but the information stays in a single place. I guess with that, `expected_groups` and `isbin` could stay as they are, and when mixing both, `expected_groups` and `isbin` should be `None` for the `Grouper` value.

But yes, the example helps quite a bit, so the custom object would not improve the API as much as I had thought.
Edit: but I still would like to have a convenient way to convert between different bounds conventions (bounds, vertices, interval index).
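For contiguous cells the three conventions round-trip in a few lines of numpy/pandas; a sketch (assuming contiguity, which CF bounds don't guarantee in general):

```python
import numpy as np
import pandas as pd

# CF-style (n, 2) bounds of contiguous cells
bounds = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])

# bounds -> vertices (n + 1 edges)
vertices = np.concatenate([bounds[:, 0], bounds[-1:, 1]])

# vertices -> intervals (this is the easy direction), and intervals -> bounds
intervals = pd.IntervalIndex.from_breaks(vertices)
roundtrip = np.stack([intervals.left, intervals.right], axis=-1)
```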
---
> but I still would like to have a convenient way to convert between different bounds conventions (bounds, vertices, interval index)
I think this is up to xarray / cf-xarray since it is a generally useful thing. Once Xarray can stick an `IntervalIndex` into Xarray objects, I think flox should just do that.

`pd.cut` does support a `labels` argument though, which might partially handle what you want.
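For example (plain pandas):

```python
import pandas as pd

# labels= replaces the interval objects in the result with custom group labels
cut = pd.cut([0.2, 1.7, 0.9], bins=[0, 1, 2], labels=["low", "high"])
```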
---
> So with that, I now think the dict suggestion does not make too much sense, because you can pass non-hashable objects (DataArray).
Oh great point!
> Which means we're left with the custom object suggestion:
`flox.Grouper` sounds good to me, actually.
---
I did some experiments with the grouper object this afternoon. I'd imagine something like this (I'm not particularly fond of the bounds-type detection, but of course we can just require `pd.IntervalIndex` objects):
import attrs


@attrs.define
class Grouper:
    """Grouper for use with `flox.xarray.xarray_reduce`.

    Parameters
    ----------
    over : hashable or DataArray
        The coordinate to group over. If a hashable, has to be a variable on the
        data. If a `DataArray`, its dimensions have to be a subset of the data's.
    values : list-like, array-like, or DataArray, optional
        The expected group labels. Mutually exclusive with `bins`.
    bins : DataArray or IntervalIndex, optional
        The bins used to group the data. Can either be an `IntervalIndex`, a
        `DataArray` of `n + 1` vertices, or a `DataArray` of `(n, 2)` bounds.
        Mutually exclusive with `values`.
    bounds_dim : hashable, optional
        The bounds dimension if the bins were passed as bounds.
    """

    over = attrs.field()
    bins = attrs.field(kw_only=True, default=None)
    values = attrs.field(kw_only=True, default=None)
    labels = attrs.field(init=False, default=None)
    bounds_dim = attrs.field(kw_only=True, default=None, repr=False)

    def __attrs_post_init__(self):
        if self.bins is not None and self.values is not None:
            raise TypeError("cannot specify both bins and group labels")

        if self.bins is not None:
            # `to_intervals` (not shown here) converts vertices / bounds /
            # IntervalIndex to a common interval representation
            self.labels = to_intervals(self.bins, self.bounds_dim)
        elif self.values is not None:
            self.labels = self.values

    @property
    def is_bin(self):
        if self.labels is None:
            return None
        return self.bins is not None
import more_itertools

import flox.xarray


def merge_lists(*lists):
    def merge_elements(elements):
        filtered = [element for element in elements if element is not None]
        return more_itertools.first(filtered, default=None)

    return [merge_elements(elements) for elements in zip(*lists)]


def groupby(obj, *by, **flox_kwargs):
    orig_expected_groups = flox_kwargs.get("expected_groups", [None] * len(by))
    orig_isbin = flox_kwargs.get("isbin", [None] * len(by))

    extracted = ((grouper.over, grouper.labels, grouper.is_bin) for grouper in by)
    by_, expected_groups, isbin = (list(_) for _ in more_itertools.unzip(extracted))

    flox_kwargs["expected_groups"] = tuple(
        merge_lists(orig_expected_groups, expected_groups)
    )
    flox_kwargs["isbin"] = merge_lists(orig_isbin, isbin)

    return flox.xarray.xarray_reduce(obj, *by_, **flox_kwargs)
flox.xarray.xarray_reduce(
    data,
    Grouper(over="x", bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),
    Grouper(over=data.y, values=["a", "b", "c"]),
    func="mean",
)
---
`pd.cut` uses `labels` to specify output labels for binning. So we could have `values` and `bins` at the same time (though I prefer `labels` over `values` and `expected_groups`).

The nice thing is that we could also support `pd.Grouper` to do resampling and other grouping at the same time.
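In pandas, `pd.Grouper` already bundles the "what" and the "how" of grouping, e.g. a resampling frequency; a plain-pandas illustration of the shape of that API:

```python
import numpy as np
import pandas as pd

s = pd.Series(
    np.arange(4.0),
    index=pd.to_datetime(
        ["2021-01-01 00:00", "2021-01-01 12:00", "2021-01-02 00:00", "2021-01-02 12:00"]
    ),
)

# key / level / freq all live on the Grouper object itself,
# analogous to the Grouper proposal in this thread
daily = s.groupby(pd.Grouper(freq="1D")).mean()
```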
I really like this `Grouper` idea. On a minor note, I think I prefer `by` (or `key` from `pd.Grouper`) instead of `over` (also see pydata/xarray#324).
---
Well, I'm not really attached to the names, so that sounds good to me!