xarray-contrib / flox

Fast & furious GroupBy operations for dask.array

Home Page: https://flox.readthedocs.io

License: Apache License 2.0

Python 100.00%
dask xarray map-reduce

flox's Introduction


flox

This project explores strategies for fast GroupBy reductions with dask.array. It used to be called dask_groupby, and was motivated by:

  1. Dask Dataframe GroupBy blogpost
  2. numpy_groupies in Xarray issue

(See a presentation about this package, from the Pangeo Showcase).

Acknowledgements

This work was funded in part by

  1. NASA-ACCESS 80NSSC18M0156 "Community tools for analysis of NASA Earth Observing System Data in the Cloud" (PI J. Hamman, NCAR),
  2. NASA-OSTFL 80NSSC22K0345 "Enhancing analysis of NASA data with the open-source Python Xarray Library" (PIs Scott Henderson, University of Washington; Deepak Cherian, NCAR; Jessica Scheick, University of New Hampshire), and
  3. NCAR's Earth System Data Science Initiative.

It was also motivated by many, many discussions in the Pangeo community.

API

There are two main functions:

  1. flox.groupby_reduce(dask_array, by_dask_array, "mean") "pure" dask array interface
  2. flox.xarray.xarray_reduce(xarray_object, by_dataarray, "mean") "pure" xarray interface; though work is ongoing to integrate this package into xarray.
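
A minimal usage sketch of both interfaces; the toy data is illustrative, and the (result, groups) return order is an assumption to check against your installed version:

import dask.array as da
import numpy as np
import xarray as xr

import flox
import flox.xarray

array = da.ones((12,), chunks=4)
by = da.from_array(np.repeat([0, 1, 2], 4), chunks=4)

# "pure" dask interface; assumed to return the reduced array and the group labels
result, groups = flox.groupby_reduce(array, by, func="mean")

# "pure" xarray interface, grouping a DataArray by a non-dimension coordinate
arr = xr.DataArray(
    np.arange(12),
    dims="time",
    coords={"labels": ("time", np.repeat([0, 1, 2], 4))},
)
out = flox.xarray.xarray_reduce(arr, arr.labels, func="mean")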

Implementation

See the documentation for details on the implementation.

Custom reductions

flox implements all common reductions provided by numpy_groupies in aggregations.py. It also allows you to specify a custom Aggregation (again inspired by dask.dataframe), though this might not be fully functional at the moment. See aggregations.py for examples.

import numpy as np
from flox.aggregations import Aggregation

mean = Aggregation(
    # name used for dask tasks
    name="mean",
    # operation to use for pure-numpy inputs
    numpy="mean",
    # blockwise reduction
    chunk=("sum", "count"),
    # combine intermediate results: sum the sums, sum the counts
    combine=("sum", "sum"),
    # generate final result as sum / count
    finalize=lambda sum_, count: sum_ / count,
    # Used when "reindexing" at combine-time
    fill_value=0,
    # Used when any member of `expected_groups` is not found
    final_fill_value=np.nan,
)
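
A custom Aggregation like this can then be passed in place of a reduction name (a sketch using the names from the API section above; assumes func accepts Aggregation objects, per the description):

result, groups = flox.groupby_reduce(dask_array, by_dask_array, func=mean)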

flox's People

Contributors

andersy005, aulemahal, avalentino, dcherian, dependabot[bot], eendebakpt, illviljan, keewis, mathause, pre-commit-ci[bot], sebastic, tomnicholas, tomwhite


flox's Issues

cache cohorts somehow

Sometimes users want to compute, say, both max and mean for the same groupby. In that case we'll end up computing cohorts twice; for large numbers of groups, this could get slow.

I wonder if lru_cache makes sense here
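
functools.lru_cache won't work directly because dask arrays aren't hashable, so one sketch is to memoize on dask's deterministic tokens instead; memoize_by_token is hypothetical, not flox API:

from dask.base import tokenize

def memoize_by_token(func):
    """Memoize a function of dask collections, keyed on dask.base.tokenize.

    Hypothetical sketch: tokens are deterministic for identical graphs, so
    repeated cohort detection on the same labels/chunking hits the cache.
    """
    cache = {}

    def wrapper(*args, **kwargs):
        key = tokenize(*args, **kwargs)
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]

    return wrapper

# usage (illustrative): find_group_cohorts = memoize_by_token(find_group_cohorts)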

Using `resample_reduce` on a `Dataset` with more than one data variable produces incorrect results

Forgive me if I'm too eager to use this package -- I'm excited about the concepts behind it! I noticed some unusual behavior when using resample_reduce. For some reason it seems like it uses the values of the last data variable for all data variables in the resultant Dataset.

See the following minimal example:

>>> import pandas as pd; import xarray as xr; from dask_groupby.xarray import resample_reduce
>>>
>>> times = pd.date_range("2000", periods=5)
>>> foo = xr.DataArray(range(5), dims=["time"], coords=[times], name="foo")
>>> bar = xr.DataArray(range(1, 6), dims=["time"], coords=[times], name="bar")
>>> ds = xr.merge([foo, bar]).chunk({"time": 4})
>>> resampler = ds.resample(time="4D")
>>> expected = resampler.mean().compute()
>>> result = resample_reduce(resampler, "mean").compute()
>>>
>>>
>>> expected
<xarray.Dataset>
Dimensions:  (time: 2)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-05
Data variables:
    foo      (time) float64 1.5 4.0
    bar      (time) float64 2.5 5.0
>>> result
<xarray.Dataset>
Dimensions:  (time: 2)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-05
Data variables:
    foo      (time) float64 2.5 5.0
    bar      (time) float64 2.5 5.0

return tuples instead of dict

I think we should return (tuple_of_group_labels, result), but only after figuring out the multiple-variable groupby.

support expected_groups=int

We could support unknown groups in the xarray interface if the user provides a size for "expected groups". The returned xarray object will have an unindexed "groups" coordinate and a dask-backed non-dimensional "group_labels" coordinate.

As long as the provided size > total number of discovered groups, this should work OK.

rethink reindexing to expected_groups

Right now we reindex to expected_groups in chunk_reduce. This helps in cases like the following where each block has a different number of groups.

| 1 group  | 2 groups |
| 3 groups | 4 groups |

Right now we always reindex to

| 12 groups | 12 groups | 
| 12 groups | 12 groups |

in chunk_reduce (assuming there are 12 expected_groups). We could instead reindex to 4 groups in _npg_combine here: https://github.com/dcherian/dask_groupby/blob/86a1edda60b0daefebbe18304a506493fa598ec4/dask_groupby/core.py#L314-L321

| 4 groups | 4 groups | 
| 4 groups | 4 groups |

This would be 3x smaller blocks and a major memory improvement.
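
For illustration, here is a minimal numpy sketch of reindexing one block's per-group results onto a target label set with a fill value (names are illustrative, not flox internals):

import numpy as np

def reindex_block(values, block_groups, target_groups, fill_value=0):
    """Place one block's per-group results at their positions in
    target_groups, filling groups the block never saw."""
    out = np.full(len(target_groups), fill_value, dtype=values.dtype)
    # assumes target_groups is sorted, e.g. np.arange(n_expected_groups)
    out[np.searchsorted(target_groups, block_groups)] = values
    return out

# a block that saw only groups 2 and 5, reindexed to 12 expected groups
reindex_block(np.array([10.0, 20.0]), np.array([2, 5]), np.arange(12))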

Fix typing

There are a bunch of errors if I run mypy in flox/

A lot seem easy and some seem like bugs =)

flox/core.py:69: error: Value of type "Optional[Any]" is not indexable  [index]
flox/core.py:69: error: Argument "isbin" to "_convert_expected_groups_to_index" has incompatible type "Tuple[bool]"; expected "bool"  [arg-type]
flox/core.py:447: error: Value of type "Optional[ndarray[Any, Any]]" is not indexable  [index]
flox/core.py:550: error: Incompatible default for argument "axis" (default has type "None", argument has type "Union[int, Sequence[int]]")  [assignment]
flox/core.py:605: error: Incompatible types in assignment (expression has type "Tuple[None]", variable has type "Optional[Mapping[Union[str, Callable[..., Any]], Any]]")  [assignment]
flox/core.py:661: error: No overload variant of "zip" matches argument types "Union[Sequence[str], Sequence[Callable[..., Any]]]", "None", "Any", "Any"  [call-overload]
flox/core.py:661: note: Possible overload variants:
flox/core.py:661: note:     def [_T_co, _T1] zip(cls, Iterable[_T1]) -> zip[Tuple[_T1]]
flox/core.py:661: note:     def [_T_co, _T1, _T2] zip(cls, Iterable[_T1], Iterable[_T2]) -> zip[Tuple[_T1, _T2]]
flox/core.py:661: note:     def [_T_co, _T1, _T2, _T3] zip(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3]) -> zip[Tuple[_T1, _T2, _T3]]
flox/core.py:661: note:     def [_T_co, _T1, _T2, _T3, _T4] zip(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3], Iterable[_T4]) -> zip[Tuple[_T1, _T2, _T3, _T4]]
flox/core.py:661: note:     def [_T_co, _T1, _T2, _T3, _T4, _T5] zip(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3], Iterable[_T4], Iterable[_T5]) -> zip[Tuple[_T1, _T2, _T3, _T4, _T5]]
flox/core.py:661: note:     def [_T_co] zip(cls, Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], *iterables: Iterable[Any]) -> zip[Tuple[Any, ...]]
flox/core.py:665: error: Argument 1 to "is_nanlen" has incompatible type "None"; expected "Union[str, Callable[..., Any]]"  [arg-type]
flox/core.py:802: error: Unsupported left operand type for + ("Sequence[Any]")  [operator]
flox/core.py:804: error: Unsupported left operand type for + ("Sequence[Any]")  [operator]
flox/core.py:809: error: Incompatible return value type (got "Dict[str, Any]", expected "Dict[Union[str, Callable[..., Any]], Any]")  [return-value]
flox/core.py:809: note: Perhaps you need a type annotation for "results"? Suggestion: "Dict[Union[str, Callable[..., Any]], Any]"
flox/core.py:895: error: Incompatible types in assignment (expression has type "Dict[Union[str, Callable[..., Any]], Any]", variable has type "Dict[str, object]")  [assignment]
flox/core.py:911: error: "object" has no attribute "append"  [attr-defined]
flox/core.py:914: error: "object" has no attribute "append"  [attr-defined]
flox/core.py:921: error: Argument "fill_value" to "chunk_reduce" has incompatible type "Tuple[int]"; expected "Optional[Mapping[Union[str, Callable[..., Any]], Any]]"  [arg-type]
flox/core.py:937: error: "object" has no attribute "append"  [attr-defined]
flox/core.py:950: error: Argument "fill_value" to "chunk_reduce" has incompatible type "Tuple[Any]"; expected "Optional[Mapping[Union[str, Callable[..., Any]], Any]]"  [arg-type]
flox/core.py:955: error: "object" has no attribute "append"  [attr-defined]
flox/core.py:957: error: Incompatible return value type (got "Dict[str, object]", expected "Dict[Union[str, Callable[..., Any]], Any]")  [return-value]
flox/core.py:957: note: Perhaps you need a type annotation for "results"? Suggestion: "Dict[Union[str, Callable[..., Any]], Any]"
flox/core.py:1175: error: Argument 1 to "partial" has incompatible type "object"; expected "Callable[..., <nothing>]"  [arg-type]
flox/core.py:1228: error: Item "None" of "Optional[Any]" has no attribute "values"  [union-attr]
flox/core.py:1290: error: No overload variant of "zip" matches argument types "Tuple[Any, ...]", "bool"  [call-overload]
flox/core.py:1290: note: Possible overload variants:
flox/core.py:1290: note:     def [_T_co, _T1] zip(cls, Iterable[_T1]) -> zip[Tuple[_T1]]
flox/core.py:1290: note:     def [_T_co, _T1, _T2] zip(cls, Iterable[_T1], Iterable[_T2]) -> zip[Tuple[_T1, _T2]]
flox/core.py:1290: note:     def [_T_co, _T1, _T2, _T3] zip(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3]) -> zip[Tuple[_T1, _T2, _T3]]
flox/core.py:1290: note:     def [_T_co, _T1, _T2, _T3, _T4] zip(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3], Iterable[_T4]) -> zip[Tuple[_T1, _T2, _T3, _T4]]
flox/core.py:1290: note:     def [_T_co, _T1, _T2, _T3, _T4, _T5] zip(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3], Iterable[_T4], Iterable[_T5]) -> zip[Tuple[_T1, _T2, _T3, _T4, _T5]]
flox/core.py:1290: note:     def [_T_co] zip(cls, Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], *iterables: Iterable[Any]) -> zip[Tuple[Any, ...]]
flox/core.py:1457: error: Argument 1 to "_validate_reindex" has incompatible type "Optional[bool]"; expected "bool"  [arg-type]
flox/core.py:1465: error: Incompatible types in assignment (expression has type "Tuple[bool, ...]", variable has type "bool")  [assignment]
flox/core.py:1478: error: Argument 1 to "_convert_expected_groups_to_index" has incompatible type "Union[Tuple[None, ...], Sequence[Any], ndarray[Any, Any]]"; expected "Tuple[Any, ...]"  [arg-type]
flox/core.py:1490: error: Value of type "Optional[Any]" is not indexable  [index]
flox/core.py:1543: error: Argument 4 to "_initialize_aggregation" has incompatible type "Optional[int]"; expected "int"  [arg-type]
flox/core.py:1567: error: Item "ndarray[Any, Any]" of "Union[ndarray[Any, Any], Any]" has no attribute "chunks"  [union-attr]
flox/core.py:1580: error: Item "ndarray[Any, Any]" of "Union[ndarray[Any, Any], Any]" has no attribute "chunks"  [union-attr]
flox/core.py:1608: error: Incompatible types in assignment (expression has type "List[Union[ndarray[Any, Any], Any]]", variable has type "Tuple[Any]")  [assignment]
flox/xarray.py:19: error: Module "xarray" has no attribute "Resample"  [attr-defined]
flox/xarray.py:231: error: Unsupported right operand type for in ("Optional[Hashable]")  [operator]
flox/xarray.py:237: error: "Hashable" has no attribute "__iter__" (not iterable)  [attr-defined]
flox/xarray.py:240: error: "Hashable" has no attribute "__iter__" (not iterable)  [attr-defined]
flox/xarray.py:248: error: Incompatible types in assignment (expression has type "Union[str, Aggregation]", variable has type "str")  [assignment]
flox/xarray.py:256: error: Argument 1 to "len" has incompatible type "Hashable"; expected "Sized"  [arg-type]
flox/xarray.py:277: error: Argument 1 to "_convert_expected_groups_to_index" has incompatible type "List[Any]"; expected "Tuple[Any, ...]"  [arg-type]
flox/xarray.py:277: error: Argument 2 to "_convert_expected_groups_to_index" has incompatible type "Sequence[bool]"; expected "bool"  [arg-type]
flox/xarray.py:278: error: Incompatible types in assignment (expression has type "Tuple[int, ...]", variable has type "List[None]")  [assignment]
flox/xarray.py:278: error: Item "None" of "Optional[Any]" has no attribute "__iter__" (not iterable)  [union-attr]
flox/xarray.py:301: error: "Hashable" has no attribute "__iter__" (not iterable)  [attr-defined]
flox/xarray.py:314: error: Argument 1 to "set" has incompatible type "Hashable"; expected "Iterable[Any]"  [arg-type]
flox/xarray.py:330: error: Argument 1 to "tuple" has incompatible type "Optional[Any]"; expected "Iterable[Any]"  [arg-type]
flox/xarray.py:339: error: "Hashable" has no attribute "__iter__" (not iterable)  [attr-defined]
flox/xarray.py:342: error: Argument 2 to "zip" has incompatible type "Optional[Any]"; expected "Iterable[Any]"  [arg-type]
flox/visualize.py:4: error: Skipping analyzing "matplotlib": module is installed, but missing library stubs or py.typed marker  [import]
flox/visualize.py:5: error: Skipping analyzing "matplotlib.pyplot": module is installed, but missing library stubs or py.typed marker  [import]

Originally posted by @dcherian in #92 (comment)

Allow chunk=None with method='blockwise'

We check for chunk=None and raise an error quite early if passed a dask array.

But we could easily make this work for method="blockwise", where we just apply a function designed for numpy arrays blockwise.

This would be particularly convenient for custom reductions.
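
A sketch of the idea with dask.array.map_blocks, assuming each chunk contains complete groups (the helper is illustrative, not flox API):

import dask.array as da
import numpy as np

def per_block_sum(data, by):
    # groups are assumed complete within each block, so a plain numpy
    # groupby-sum over the block's own labels is already the final answer
    _, codes = np.unique(by, return_inverse=True)
    return np.bincount(codes, weights=data)

data = da.arange(8, chunks=4, dtype=float)
by = da.from_array(np.array([0, 0, 1, 1, 2, 2, 3, 3]), chunks=4)

# each 4-element chunk holds two complete groups -> two results per block
result = da.map_blocks(per_block_sum, data, by, chunks=(2,), dtype=float)
result.compute()  # array([ 1.,  5.,  9., 13.])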

Improvement suggestion for sum_of_squares

Hi Deepak,

thanks for using numpy_groupies in flox. Out of curiosity I was reading through the codebase and noticed this line:

https://github.com/dcherian/flox/blob/227ce041e9b78f8432c62370c9b2c46dd91fa657/flox/aggregate_npg.py#L17

It creates an intermediate array for the squares, which then gets summed. Allocating that intermediate array is likely a relevant time sink, while it could all be done in one go with a simple derived class based on

https://github.com/ml31415/numpy-groupies/blob/c11987005cccea1b3e990aba02ece482f3a00b1d/numpy_groupies/aggregate_numba.py#L240

class SumOfSquares(Sum):
    @staticmethod
    def _inner(ri, val, ret, counter, mean):
        counter[ri] = 0
        ret[ri] += val * val

This should do the trick: it avoids the extra array and memory allocation and should run in roughly the same time as a plain grouped sum.

use "hash split" with split_out

We should copy the "hash splitting" strategy from dask.dataframe to avoid the expected_groups is not None restriction and memory complications associated with #10

make dask optional

I think the core code is now dask-optional but changing the tests looks painful :/

Minimize dependencies

numpy_groupies should be the only requirement.

It should be possible to use this package without dask or xarray.

We need environments that test all possible combinations.

The real painful bit is dealing with dask's absence in the tests (#31)

support groupby_bins

There might be a clever solution where we provide pd.cut instead of pd.factorize through a kwarg
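
Sketching the parallel: pd.factorize and pd.cut both yield integer codes (with -1 for missing or out-of-bounds values), so the downstream machinery could be shared:

import numpy as np
import pandas as pd

# discrete labels -> integer codes
codes, uniques = pd.factorize(np.array(["a", "b", "a", "c", "b"]))
# codes == array([0, 1, 0, 2, 1])

# continuous values -> bin codes; -1 marks values outside the bins
binned = pd.cut(np.array([0.1, 0.4, 1.2, 2.7, 5.0]), bins=[0, 1, 2, 3])
bin_codes = binned.codes  # array([ 0,  0,  1,  2, -1], dtype=int8)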

Weird behaviour of `split-reduce` with weird dask chunking

I don't yet understand flox well enough to see what's going on here, but I think there is a bug when using "split-reduce" on data that has a strange chunking.

See:

from itertools import product
import xarray as xr
from flox.xarray import xarray_reduce
 
# The datetime are not important here (I think) but it gives something to group on.
t = xr.DataArray(xr.date_range('2000-01-01', '2000-12-31', use_cftime=False), dims=('time',), name='time')
data = t.time.dt.dayofyear
 
for chunking, method in product([None, 366, 365, 15], ['split-reduce', 'cohorts']):
    if chunking is not None:
        dat = data.chunk({'time': chunking})
    else:
        dat = data
    out = xarray_reduce(dat, dat.time.dt.month, func='mean', method=method).load()
    out_dec = out.sel(month=12).item()
    if out_dec != 351:
        print(f'What is this! Got {out_dec=} for {chunking=} and {method=}')

You'll see that the last element of the output (the average of the doy over December) is 351 in all cases, except when the chunking cuts a group in two and the first chunk spans multiple groups. In that case, with method "split-reduce", it is 10881 (which is exactly 31 * 351, coincidence?).

I'm not sure of my diagnosis, but after trying some combinations, it seems to happen when a group is cut in two by chunks, at least one of which covers many groups. For example, with chunking=190, the split happens on July 8th, and the result is 6138 instead of 198. And 6138 is 31 * 198.

Haha, this is all a mystery to me... It doesn't seem related to the dtype: here we have integers as input, and casting to float doesn't change the behaviour.

For context, the real use case where this happened to me (in xclim) was performing a groupby after a rolling operation; the latter leaves a small chunk at the end of the array, and this bug appeared.

Potential bug in `xarray_reduce`

When I try using the xarray_reduce interface with the following

xarray_reduce(ds.Tair, ds.time.dt.month, func='mean')

I get an UnboundLocalError:

In [19]: xarray_reduce(ds.Tair, ds.time.dt.month, func='mean')
<xarray.DataArray 'month' (time: 36)>
array([ 9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,
        2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,
        7,  8])
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-19-b8f0e0a131da> in <module>
----> 1 xarray_reduce(ds.Tair, ds.time.dt.month, func='mean')

~/devel/pydata/dask_groupby/dask_groupby/xarray.py in xarray_reduce(obj, func, expected_groups, bins, dim, split_out, fill_value, *by)
     48 
     49     group_names = tuple(g.name for g in by)
---> 50     group_sizes = dict(zip(group_names, group_shape))
     51     indims = tuple(obj.dims)
     52     otherdims = tuple(d for d in indims if d not in dim)

UnboundLocalError: local variable 'group_shape' referenced before assignment

It appears that group_shape is only defined when using multiple groupers:

https://github.com/dcherian/dask_groupby/blob/cab3b5bc3355e7934fe60029fc1063b6ed2f0a2f/dask_groupby/xarray.py#L39-L41

@dcherian, Am I missing something?

fix Ellipsis reductions

This means supporting reductions along dimensions not present in by, which doesn't seem to work in some cases.

check dtype promotion

On pangeo cloud, I'm seeing float32 being upcast to float64. This should not be needed.

Support cumsum, cumprod

Supporting just numpy should be relatively easy. This will also work for method="blockwise" by default.

We may want to rename groupby_reduce to groupby_agg?

For dask proper, we'll need to use dask.array.cumreduction instead of dask.array.blockwise + dask.array.reductions._tree_reduce
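
For the numpy path, a grouped cumsum can be built from a stable sort plus per-group offsets; a minimal sketch (illustrative, not flox's implementation):

import numpy as np

def group_cumsum(data, by):
    """Cumulative sum within each group, preserving original element order."""
    order = np.argsort(by, kind="stable")
    inverse = np.empty_like(order)
    inverse[order] = np.arange(len(order))

    sorted_data = data[order]
    sorted_by = by[order]
    csum = np.cumsum(sorted_data)

    # total accumulated *before* each group starts, repeated over the group
    first = np.r_[True, sorted_by[1:] != sorted_by[:-1]]
    starts = np.flatnonzero(first)
    counts = np.diff(np.r_[starts, len(by)])
    offsets = np.repeat(csum[starts] - sorted_data[starts], counts)
    return (csum - offsets)[inverse]

group_cumsum(np.array([1, 2, 3, 4, 5]), np.array([0, 1, 0, 1, 0]))
# array([1, 2, 4, 6, 9])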

performance benchmarking

bincount-ed sum is 6x slower than sum, and bincount-ed count is 2x slower than count, so there are definitely cases where it makes sense to split the dataset early instead of using flox. I'm seeing this on Pangeo Cloud using the GODAS dataset.

import numpy as np

array = np.zeros((10 ** 5,), dtype=int)
by = array

%timeit np.bincount(by)  # count
%timeit np.bincount(by, weights=array)  # sum
%timeit np.sum(~np.isnan(by)) # count
%timeit array.sum() # sum
264 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
441 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
117 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
65.9 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

implement shuffle-then-reduce strategy

Proof-of-concept here. This will help with time grouping in particular, where a group's elements occur at repeated intervals but at some distance from each other. When the repeats are "far" relative to chunk size, it can take a while before we actually reduce anything, leading to a memory spike at the beginning of the computation.

https://gist.github.com/dcherian/6ccd76d2a6eaadb7844d61d197a8b3db

If we can generalize this, we can solve the biggest fail case of the map-reduce approach used in this package.

Allow grouping noisy arrays using tolerances

I find grouping data can be a little more intuitive when using tolerances.
This can be done for example like this:

def groupby_isclose_argsort(arr, atol=0, rtol=0.1):
    """
    Return a group idx of a noisy array.

    TODO: argsort not available in dask.

    Examples
    --------
    reps = 2
    y = np.array([72, 72, 100, 100, 300, 300, 500, 500])
    y = np.stack(reps*[y], 0)
    noise = lambda y : 1 + 0.1 * (np.random.rand(*y.shape) - 0.5)
    y = y * noise(y)
    groupby_isclose_argsort(y)
    array([[0, 0, 1, 1, 2, 2, 3, 3],
           [0, 0, 1, 1, 2, 2, 3, 3]], dtype=int32)
    """
    # Sort values to make sure values are monotonically increasing:
    a = arr.ravel()
    i = np.argsort(a)
    i_rev = np.empty(i.shape, dtype=np.intp, like=a)
    i_rev[i] = np.arange(len(a), like=a)
    a = a[i]

    # Calculate a monotonically increasing index that increase when the
    # difference between current and previous value changes:
    b = np.roll(a, 1)
    b[0] = a[0]
    by = np.cumsum(~np.isclose(a, b, atol=atol, rtol=rtol))
    # tolerance = atol + rtol * b
    # by = np.cumsum(np.abs(a - b) > tolerance)

    return by[i_rev].reshape(arr.shape)

So on a dataset, the API would look something like ds.groupby("arr", atol=0, rtol=0.1).

Optimize cohorts by ignoring NaNs when factorize_early is True

By default find_group_cohorts ignores NaNs.

When we factorize_early, this optimization is lost, because NaN groups are now relabeled to -1 (and other group labels are now positive integers).

I think the only fix is a little messy: optionally tell find_group_cohorts to drop -1 labels.

Summation on booleans returns OR instead of sum

As explained in pydata/xarray#6615, it seems the sum aggregations implemented by flox actually return an OR reduction when given boolean inputs. While this could be seen as mathematically/technically correct, it is a different behaviour from numpy's.

However, testing with flox directly, I see that the numpy engine actually raises an error with boolean types. I'm guessing the flox one should too? But then, I'm not sure how xarray should handle it... A special case for sum here (xarray_reduce) or there?

Culprit line:
https://github.com/dcherian/flox/blob/19543801fb28f2557591aecd559e52b66c8051b3/flox/aggregate_flox.py#L51

MWE:

import flox
import numpy as np

groups = np.array([1, 1, 1])
data = np.array([True, True, False])

np.sum(data) # 2, what I expect

flox.groupby_reduce(data, groups, func='sum', engine='flox')  # (True, 1) , the result of an OR agg.

flox.groupby_reduce(data, groups, func='sum', engine='numpy')   # Fails within numpy_groupies, same with numba
#  TypeError: function sum requires a more complex datatype than bool
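
Until the behaviour is settled, one workaround (a suggestion, not an endorsed fix) is to cast the booleans up front:

flox.groupby_reduce(data.astype(int), groups, func='sum', engine='flox')  # (2, 1), matching np.sum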
