
improving the API for binned groupby · flox · 12 comments · OPEN

keewis commented on June 12, 2024
improving the API for binned groupby


Comments (12)

keewis commented on June 12, 2024

okay, turns out I seriously misunderstood what *by does... I was assuming I would need to pass a DataArray, but it seems that's only true for isbin=False. For isbin=True it instead takes a variable name. This is really confusing, though.

I guess something like this might be more intuitive?

# by as a mapping, to avoid shadowing variable names
flox.xarray.xarray_reduce(arr, by={"time": flox.Bins(...), "depth": flox.Bins(...)}, func="mean", ...)
# by as *args
flox.xarray.xarray_reduce(
    arr,
    flox.Bins(variable="time", ...),
    flox.Bins(variable="depth", ...),
    func="mean",
    ...
)

dcherian commented on June 12, 2024

Thanks for trying it out, Justus!

One reason for all this confusion is that I always expected xarray_reduce to be temporary while I ported it to Xarray, but that is taking a lot of effort :( So now flox is stuck with some bad API choices.

I frequently hit the Needs better message error from

oops yeah. The issue is that I wanted to enable the really simple line:

xarray_reduce(ds, "x", expected_groups=["a", "b", "c"])

So for multiple groupers expected_groups must be a tuple. The error message should be improved ;). This would be a nice PR.
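For illustration (the second grouping variable "y" here is made up), the multi-grouper form would then be:

# one entry per grouper, so expected_groups becomes a tuple of lists
xarray_reduce(ds, "x", "y", expected_groups=(["a", "b", "c"], [0, 1, 2]), func="mean")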

Another option might be to just use an interval index.

So flox.core.groupby_reduce does actually handle IntervalIndex (or any pandas Index). And looking at the code, xarray_reduce should also handle it. Here we convert everything to an Index:

expect_index = _convert_expected_groups_to_index((expect_,), (isbin_,), sort=sort)[0]

And that function does handle pd.Index objects. I think we should update the typing and docstring. This would be a helpful PR!

Edit: ah, now I see, we'll need to remove the isbin argument for this to work properly.
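For reference, a minimal sketch (with synthetic NumPy inputs) of the core path that already accepts a pd.IntervalIndex via expected_groups:

import numpy as np
import pandas as pd
from flox import groupby_reduce

array = np.arange(12.0)
by = np.linspace(0.1, 2.9, 12)  # values to bin
bins = pd.IntervalIndex.from_breaks([0.0, 1.0, 2.0, 3.0])

# the returned groups are the intervals themselves
result, groups = groupby_reduce(array, by, func="mean", expected_groups=bins)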

turns out I seriously misunderstood what *by does... I was assuming I would need to pass a DataArray, but it seems that's only true for isbin=False.

This sounds like a bug but I'm surprised. by can be Hashable or DataArray.

by_da = tuple(obj[g] if isinstance(g, Hashable) else g for g in by)

and the tests:

flox/tests/test_xarray.py

Lines 111 to 127 in 51fb6e9

with raise_if_dask_computes():
    actual = xarray_reduce(da, da.labels, da.labels2, **kwargs)
xr.testing.assert_identical(expected, actual)

with raise_if_dask_computes():
    actual = xarray_reduce(da, "labels", da.labels2, **kwargs)
xr.testing.assert_identical(expected, actual)

with raise_if_dask_computes():
    actual = xarray_reduce(da, "labels", "labels2", **kwargs)
xr.testing.assert_identical(expected, actual)

if pass_expected_groups:
    kwargs["expected_groups"] = (expected.labels2.data, expected.labels.data)

with raise_if_dask_computes():
    actual = xarray_reduce(da, "labels2", "labels", **kwargs)
xr.testing.assert_identical(expected.transpose("z", "labels2", "labels"), actual)

A reproducible example would help.

by={"time": flox.Bins(...), "depth": flox.Bins(...)}

Yeah, I've actually been considering the dict approach. It seems a lot nicer since there's potentially a lot of information (for example cut_kwargs in xarray, which gets passed to pd.cut and can contain whether the closed edge is left or right), and it's hard to keep track of order in 3-4 different kwargs. One point of ugliness would be grouping by a categorical variable where you want flox to autodetect group members. That would then look like

by = {"labels": None}

It also means you can't do

xarray_reduce(ds, "x", func="mean")

An alternative is to use pd.IntervalIndex instead of a new flox.Bins object. It can wrap a lot of info and would allow us to remove isbin (see flox.core.groupby_reduce).
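Combining that with the dict idea, the call could look roughly like this (just a sketch of the proposal, not an existing flox API; arr and time_edges are placeholders):

flox.xarray.xarray_reduce(
    arr,
    by={
        "time": pd.IntervalIndex.from_breaks(time_edges, closed="left"),
        "labels": None,  # categorical variable: autodetect group members
    },
    func="mean",
)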

keewis commented on June 12, 2024

here's the example I tested this with (although I just realized I could have used .cf.bounds_to_vertices instead of writing a new function):

In [1]: import xarray as xr
   ...: import cf_xarray
   ...: import numpy as np
   ...: import flox.xarray
   ...: 
   ...: 
   ...: def add_vertices(ds, bounds_dim="bounds"):
   ...:     new_names = {
   ...:         name: f"{name.removesuffix(bounds_dim).rstrip('_')}_vertices"
   ...:         for name, coord in ds.variables.items()
   ...:         if bounds_dim in coord.dims
   ...:     }
   ...:     new_coords = {
   ...:         new_name: cf_xarray.bounds_to_vertices(ds[name], bounds_dim=bounds_dim)
   ...:         for name, new_name in new_names.items()
   ...:     }
   ...:     return ds.assign_coords(new_coords)
   ...: 
   ...: 
   ...: categories = list("abcefghi")
   ...: 
   ...: coords = (
   ...:     xr.Dataset(coords={"x": np.arange(10), "y": ("y", categories)})
   ...:     .cf.add_bounds(["x"])
   ...:     .pipe(add_vertices)
   ...: )
   ...: coords
Out[1]: 
<xarray.Dataset>
Dimensions:     (x: 10, y: 8, bounds: 2, x_vertices: 11)
Coordinates:
  * x           (x) int64 0 1 2 3 4 5 6 7 8 9
  * y           (y) <U1 'a' 'b' 'c' 'e' 'f' 'g' 'h' 'i'
    x_bounds    (x, bounds) float64 -0.5 0.5 0.5 1.5 1.5 ... 7.5 7.5 8.5 8.5 9.5
  * x_vertices  (x_vertices) float64 -0.5 0.5 1.5 2.5 3.5 ... 6.5 7.5 8.5 9.5
Dimensions without coordinates: bounds
Data variables:
    *empty*

In [2]: data = xr.Dataset(
   ...:     {"a": ("x", np.arange(200))},
   ...:     coords={
   ...:         "x": np.linspace(-0.5, 9.5, 200),
   ...:         "y": ("x", np.random.choice(categories, size=200)),
   ...:     },
   ...: )
   ...: data
Out[2]: 
<xarray.Dataset>
Dimensions:  (x: 200)
Coordinates:
  * x        (x) float64 -0.5 -0.4497 -0.3995 -0.3492 ... 9.349 9.399 9.45 9.5
    y        (x) <U1 'e' 'g' 'b' 'b' 'h' 'a' 'g' ... 'a' 'c' 'c' 'h' 'c' 'b' 'h'
Data variables:
    a        (x) int64 0 1 2 3 4 5 6 7 8 ... 191 192 193 194 195 196 197 198 199

In [3]: flox.xarray.xarray_reduce(
   ...:     data["a"],
   ...:     coords["x"],
   ...:     expected_groups=(coords["x_vertices"],),
   ...:     isbin=[True] * 1,
   ...:     func="mean",
   ...: )
Out[3]: 
<xarray.DataArray 'a' (x_bins: 10)>
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
Coordinates:
  * x_bins   (x_bins) object (-0.5, 0.5] (0.5, 1.5] ... (7.5, 8.5] (8.5, 9.5]

In [4]: flox.xarray.xarray_reduce(
   ...:     data["a"],
   ...:     "x",
   ...:     expected_groups=(coords["x_vertices"],),
   ...:     isbin=[True] * 1,
   ...:     func="mean",
   ...: )
Out[4]: 
<xarray.DataArray 'a' (x_bins: 10)>
array([ 10. ,  29.5,  49.5,  69.5,  89.5, 109.5, 129.5, 149.5, 169.5,
       189.5])
Coordinates:
  * x_bins   (x_bins) object (-0.5, 0.5] (0.5, 1.5] ... (7.5, 8.5] (8.5, 9.5]

the first call passes the dimension coordinate and does something weird, while the second call succeeds.

keewis commented on June 12, 2024

So for multiple groupers expected_groups must be a tuple. The error message should be improved ;).

Something else I noticed is that when passing a single-element list containing a DataArray ([x]) to expected_groups, it somehow gets converted to [[x]] and fails later.

It also means you can't do

We could get around that restriction by renaming *by to *variable_names (or something) and having a separate keyword-only argument named by. The only issue with that would be to decide what to do when it receives both the positional args and the dict (raise an error, probably). I'd actually also restrict the positional arguments to just the variable names.
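Roughly (the names here are just placeholders to illustrate the idea):

def xarray_reduce(obj, *variable_names, by=None, func, **kwargs):
    if variable_names and by is not None:
        raise TypeError("pass either positional variable names or by, not both")
    ...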

pd.IntervalIndex works, although we need a good way to convert between the different bounds conventions: cf-xarray does provide conversion functions between bounds and vertices, but converting to / from intervals is not as easy (pd.IntervalIndex.from_breaks only helps in one direction). And for IntervalIndex we definitely need the dict interface.

Edit: at the moment, the IntervalIndex coordinates flox adds to the output are not really useful: they can't really be used for selecting / aligning (though that might change with a custom index), and they also can't be written to disk, so I usually replace them with the interval centers and bounds.
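Roughly, that conversion in plain pandas / NumPy (a sketch with made-up vertices):

import numpy as np
import pandas as pd

vertices = np.array([-0.5, 0.5, 1.5, 2.5])
index = pd.IntervalIndex.from_breaks(vertices)  # vertices -> intervals is the easy direction

# intervals -> centers and (n, 2) bounds, e.g. for writing the result to disk
centers = index.mid.to_numpy()
bounds = np.stack([index.left.to_numpy(), index.right.to_numpy()], axis=-1)
recovered_vertices = np.append(index.left.to_numpy(), index.right.to_numpy()[-1])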

dcherian commented on June 12, 2024

We could get around that restriction by renaming *by to *variable_names

This is interesting but also messy. Presumably with the dict, everything else (expected_groups, isbin) should be None?

pd.IntervalIndex.from_breaks only helps in one direction
the IntervalIndex coordinates flox adds to the output are not really useful:

It does what Xarray does at the moment (see the output of groupby_bins). But yes, sticking IntervalIndex into Xarray objects is something we should fix upstream :). Then we could add an accessor that provides .left, .right, and .mid (these already exist on the Index itself).

And for IntervalIndex we definitely need the dict interface.

Can you clarify? In core.groupby_reduce you provide IntervalIndex in expected_groups.

dcherian commented on June 12, 2024

the first call passes the dimension coordinate and does something weird

Ah, I don't think this is a bug. data["x"] and coords["x"] are not the same, so there's a reindexing step that happens as part of alignment.

I think I'll check for exact alignment.

keewis commented on June 12, 2024

Ah I don't think this is a bug.

Yeah, that was a user error. After reading the example from #189, I finally understand that *by should contain the variables / variable names to group over (e.g. coordinates of the data) and not the group labels. Not sure how I got that idea.

So with that, I now think the dict suggestion does not make too much sense, because you can pass non-hashable objects (DataArray).

Which means we're left with the custom object suggestion:

flox.xarray.xarray_reduce(
    data,
    flox.Grouper("x", bins=coords.x_vertices),
    flox.Grouper(data.y, values=["a", "b", "c"]),
)

which basically does the same thing as the combination of *by, expected_groups, and isbin, but the information stays in a single place. I guess with that, expected_groups and isbin could stay as they are, and if we mix both, the expected_groups and isbin entries corresponding to a Grouper should be None.

But yes, the example helps quite a bit, so the custom object would not improve the API as much as I had thought.

Edit: but I still would like to have a convenient way to convert between different bounds conventions (bounds, vertices, interval index)

dcherian commented on June 12, 2024

but I still would like to have a convenient way to convert between different bounds conventions (bounds, vertices, interval index)

I think this is up to xarray/cf-xarray since it is a generally useful thing. Once xarray can stick an IntervalIndex into Xarray objects, I think flox should just do that.

pd.cut does support a labels argument, though, which might partially handle what you want.
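For example (a minimal sketch):

import numpy as np
import pandas as pd

x = np.array([0.2, 0.7, 1.4, 2.9])
# labels replaces the IntervalIndex-style bin labels with custom ones
binned = pd.cut(x, bins=[0, 1, 2, 3], labels=["low", "mid", "high"])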

dcherian commented on June 12, 2024

So with that, I now think the dict suggestion does not make too much sense, because you can pass non-hashable objects (DataArray).

Oh great point!

Which means we're left with the custom object suggestion:

flox.Grouper sounds good to me actually.

keewis commented on June 12, 2024

I did some experiments with the grouper object this afternoon. I'd imagine something like this (I'm not particularly fond of the bounds-type detection, but of course we can just require pd.IntervalIndex objects):

import attrs
import more_itertools
import pandas as pd

import flox.xarray


@attrs.define
class Grouper:
    """grouper for use with `flox.xarray.xarray_reduce`

    Parameters
    ----------
    over : hashable or DataArray
        The coordinate to group over. If a hashable, has to be a variable on the
        data. If a `DataArray`, its dimensions have to be a subset of the data's.
    values : list-like, array-like, or DataArray, optional
        The expected group labels. Mutually exclusive with `bins`.
    bins : DataArray or IntervalIndex, optional
        The bins used to group the data. Can either be an `IntervalIndex`, a `DataArray`
        of `n + 1` vertices, or a `DataArray` of `(n, 2)` bounds.

        Mutually exclusive with `values`.
    bounds_dim : hashable, optional
        The bounds dimension if the bins were passed as bounds.
    """

    over = attrs.field()

    bins = attrs.field(kw_only=True, default=None)
    values = attrs.field(kw_only=True, default=None)
    labels = attrs.field(init=False, default=None)

    bounds_dim = attrs.field(kw_only=True, default=None, repr=False)

    def __attrs_post_init__(self):
        if self.bins is not None and self.values is not None:
            raise TypeError("cannot specify both bins and group labels")

        if self.bins is not None:
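            # to_intervals (the bounds-type detection, not shown here) would convert
            # vertices / (n, 2) bounds / an existing IntervalIndex into a pd.IntervalIndex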
            self.labels = to_intervals(self.bins, self.bounds_dim)
        elif self.values is not None:
            self.labels = self.values

    @property
    def is_bin(self):
        if self.labels is None:
            return None

        return self.bins is not None

def merge_lists(*lists):
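    # element-wise merge of parallel lists: for each position, keep the first non-None entry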
    def merge_elements(elements):
        filtered = [element for element in elements if element is not None]

        return more_itertools.first(filtered, default=None)

    return [merge_elements(elements) for elements in zip(*lists)]


def groupby(obj, *by, **flox_kwargs):
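    """Translate Grouper objects into the *by / expected_groups / isbin arguments
    of xarray_reduce, merging with any explicitly passed expected_groups / isbin."""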
    orig_expected_groups = flox_kwargs.get("expected_groups", [None] * len(by))
    orig_isbin = flox_kwargs.get("isbin", [None] * len(by))

    extracted = ((grouper.over, grouper.labels, grouper.is_bin) for grouper in by)
    by_, expected_groups, isbin = (list(_) for _ in more_itertools.unzip(extracted))
    flox_kwargs["expected_groups"] = tuple(
        merge_lists(orig_expected_groups, expected_groups)
    )
    flox_kwargs["isbin"] = merge_lists(orig_isbin, isbin)

    return flox.xarray.xarray_reduce(obj, *by_, **flox_kwargs)

# using the groupby wrapper defined above
groupby(
    data,
    Grouper(over="x", bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),
    Grouper(over=data.y, values=["a", "b", "c"]),
    func="mean",
)

dcherian commented on June 12, 2024

pd.cut uses labels to specify output labels for binning. So we could have values and bins at the same time (though I prefer labels over values and expected_groups).

The nice thing is that we could also support pd.Grouper to do resampling, and other grouping at the same time.

I really like this Grouper idea. On a minor note, I think I prefer by (or key, as in pd.Grouper) instead of over (also see pydata/xarray#324).

keewis commented on June 12, 2024

well, I'm not really attached to the names, so that sounds good to me?
