xarray-contrib / flox

Fast & furious GroupBy operations for dask.array

Home Page: https://flox.readthedocs.io

License: Apache License 2.0

Python 100.00%
dask xarray map-reduce

flox's Introduction


flox

This project explores strategies for fast GroupBy reductions with dask.array. It used to be called dask_groupby, and was motivated by:

  1. Dask Dataframe GroupBy blogpost
  2. numpy_groupies in Xarray issue

(See a presentation about this package, from the Pangeo Showcase).

Acknowledgements

This work was funded in part by

  1. NASA-ACCESS 80NSSC18M0156 "Community tools for analysis of NASA Earth Observing System Data in the Cloud" (PI J. Hamman, NCAR),
  2. NASA-OSTFL 80NSSC22K0345 "Enhancing analysis of NASA data with the open-source Python Xarray Library" (PIs Scott Henderson, University of Washington; Deepak Cherian, NCAR; Jessica Scheick, University of New Hampshire), and
  3. NCAR's Earth System Data Science Initiative.

It was also motivated by many, many discussions in the Pangeo community.

API

There are two main functions:

  1. flox.groupby_reduce(dask_array, by_dask_array, "mean") "pure" dask array interface
  2. flox.xarray.xarray_reduce(xarray_object, by_dataarray, "mean") "pure" xarray interface; though work is ongoing to integrate this package into xarray.
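
A minimal usage sketch of both interfaces; the toy data is illustrative, and the (result, groups) return order is an assumption to check against your installed version:

import dask.array as da
import numpy as np
import xarray as xr

import flox
import flox.xarray

array = da.ones((12,), chunks=4)
by = da.from_array(np.repeat([0, 1, 2], 4), chunks=4)

# "pure" dask interface; assumed to return the reduced array and the group labels
result, groups = flox.groupby_reduce(array, by, func="mean")

# "pure" xarray interface, grouping a DataArray by a non-dimension coordinate
arr = xr.DataArray(
    np.arange(12),
    dims="time",
    coords={"labels": ("time", np.repeat([0, 1, 2], 4))},
)
out = flox.xarray.xarray_reduce(arr, arr.labels, func="mean")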

Implementation

See the documentation for details on the implementation.

Custom reductions

flox implements all common reductions provided by numpy_groupies in aggregations.py. It also allows you to specify a custom Aggregation (again inspired by dask.dataframe), though this might not be fully functional at the moment. See aggregations.py for examples.

import numpy as np
from flox.aggregations import Aggregation

mean = Aggregation(
    # name used for dask tasks
    name="mean",
    # operation to use for pure-numpy inputs
    numpy="mean",
    # blockwise reduction
    chunk=("sum", "count"),
    # combine intermediate results: sum the sums, sum the counts
    combine=("sum", "sum"),
    # generate final result as sum / count
    finalize=lambda sum_, count: sum_ / count,
    # Used when "reindexing" at combine-time
    fill_value=0,
    # Used when any member of `expected_groups` is not found
    final_fill_value=np.nan,
)
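
A custom Aggregation like this can then be passed in place of a reduction name (a sketch using the names from the API section above; assumes func accepts Aggregation objects, per the description):

result, groups = flox.groupby_reduce(dask_array, by_dask_array, func=mean)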

flox's People

Contributors

andersy005, aulemahal, avalentino, dcherian, dependabot[bot], eendebakpt, illviljan, keewis, mathause, pre-commit-ci[bot], sebastic, tomnicholas, tomwhite


flox's Issues

cache cohorts somehow

Sometimes users want to compute, say, both max and mean for the same groupby. In that case we'll end up computing cohorts twice; for large numbers of groups, this could get slow.

I wonder if lru_cache makes sense here
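
functools.lru_cache won't work directly because dask arrays aren't hashable, so one sketch is to memoize on dask's deterministic tokens instead; memoize_by_token is hypothetical, not flox API:

from dask.base import tokenize

def memoize_by_token(func):
    """Memoize a function of dask collections, keyed on dask.base.tokenize.

    Hypothetical sketch: tokens are deterministic for identical graphs, so
    repeated cohort detection on the same labels/chunking hits the cache.
    """
    cache = {}

    def wrapper(*args, **kwargs):
        key = tokenize(*args, **kwargs)
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]

    return wrapper

# usage (illustrative): find_group_cohorts = memoize_by_token(find_group_cohorts)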

Using `resample_reduce` on a `Dataset` with more than one data variable produces incorrect results

Forgive me if I'm too eager to use this package -- I'm excited about the concepts behind it! I noticed some unusual behavior when using resample_reduce. For some reason it seems like it uses the values of the last data variable for all data variables in the resultant Dataset.

See the following minimal example:

>>> import pandas as pd; import xarray as xr; from dask_groupby.xarray import resample_reduce
>>>
>>> times = pd.date_range("2000", periods=5)
>>> foo = xr.DataArray(range(5), dims=["time"], coords=[times], name="foo")
>>> bar = xr.DataArray(range(1, 6), dims=["time"], coords=[times], name="bar")
>>> ds = xr.merge([foo, bar]).chunk({"time": 4})
>>> resampler = ds.resample(time="4D")
>>> expected = resampler.mean().compute()
>>> result = resample_reduce(resampler, "mean").compute()
>>>
>>>
>>> expected
<xarray.Dataset>
Dimensions:  (time: 2)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-05
Data variables:
    foo      (time) float64 1.5 4.0
    bar      (time) float64 2.5 5.0
>>> result
<xarray.Dataset>
Dimensions:  (time: 2)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-05
Data variables:
    foo      (time) float64 2.5 5.0
    bar      (time) float64 2.5 5.0

return tuples instead of dict

I think we should return (tuple_of_group_labels, result), but only after figuring out the multiple-variable groupby.

support expected_groups=int

We could support unknown groups in the xarray interface if the user provides a size for "expected groups". The returned xarray object will have an unindexed "groups" coordinate and a dask-backed non-dimensional "group_labels" coordinate.

As long as the provided size > total number of discovered groups, this should work OK.

rethink reindexing to expected_groups

Right now we reindex to expected_groups in chunk_reduce. This helps in cases like the following where each block has a different number of groups.

| 1 group  | 2 groups |
| 3 groups | 4 groups |

Right now we always reindex to

| 12 groups | 12 groups | 
| 12 groups | 12 groups |

in chunk_reduce (assuming there are 12 expected_groups). We could instead reindex to 4 groups in _npg_combine here: https://github.com/dcherian/dask_groupby/blob/86a1edda60b0daefebbe18304a506493fa598ec4/dask_groupby/core.py#L314-L321

| 4 groups | 4 groups | 
| 4 groups | 4 groups |

This would be 3x smaller blocks and a major memory improvement.
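
For illustration, here is a minimal numpy sketch of reindexing one block's per-group results onto a target label set with a fill value (names are illustrative, not flox internals):

import numpy as np

def reindex_block(values, block_groups, target_groups, fill_value=0):
    """Place one block's per-group results at their positions in
    target_groups, filling groups the block never saw."""
    out = np.full(len(target_groups), fill_value, dtype=values.dtype)
    # assumes target_groups is sorted, e.g. np.arange(n_expected_groups)
    out[np.searchsorted(target_groups, block_groups)] = values
    return out

# a block that saw only groups 2 and 5, reindexed to 12 expected groups
reindex_block(np.array([10.0, 20.0]), np.array([2, 5]), np.arange(12))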

Fix typing

There are a bunch of errors if I run mypy in flox/

A lot seem easy and some seem like bugs =)

flox/core.py:69: error: Value of type "Optional[Any]" is not indexable  [index]
flox/core.py:69: error: Argument "isbin" to "_convert_expected_groups_to_index" has incompatible type "Tuple[bool]"; expected "bool"  [arg-type]
flox/core.py:447: error: Value of type "Optional[ndarray[Any, Any]]" is not indexable  [index]
flox/core.py:550: error: Incompatible default for argument "axis" (default has type "None", argument has type "Union[int, Sequence[int]]")  [assignment]
flox/core.py:605: error: Incompatible types in assignment (expression has type "Tuple[None]", variable has type "Optional[Mapping[Union[str, Callable[..., Any]], Any]]")  [assignment]
flox/core.py:661: error: No overload variant of "zip" matches argument types "Union[Sequence[str], Sequence[Callable[..., Any]]]", "None", "Any", "Any"  [call-overload]
flox/core.py:661: note: Possible overload variants:
flox/core.py:661: note:     def [_T_co, _T1] zip(cls, Iterable[_T1]) -> zip[Tuple[_T1]]
flox/core.py:661: note:     def [_T_co, _T1, _T2] zip(cls, Iterable[_T1], Iterable[_T2]) -> zip[Tuple[_T1, _T2]]
flox/core.py:661: note:     def [_T_co, _T1, _T2, _T3] zip(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3]) -> zip[Tuple[_T1, _T2, _T3]]
flox/core.py:661: note:     def [_T_co, _T1, _T2, _T3, _T4] zip(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3], Iterable[_T4]) -> zip[Tuple[_T1, _T2, _T3, _T4]]
flox/core.py:661: note:     def [_T_co, _T1, _T2, _T3, _T4, _T5] zip(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3], Iterable[_T4], Iterable[_T5]) -> zip[Tuple[_T1, _T2, _T3, _T4, _T5]]
flox/core.py:661: note:     def [_T_co] zip(cls, Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], *iterables: Iterable[Any]) -> zip[Tuple[Any, ...]]
flox/core.py:665: error: Argument 1 to "is_nanlen" has incompatible type "None"; expected "Union[str, Callable[..., Any]]"  [arg-type]
flox/core.py:802: error: Unsupported left operand type for + ("Sequence[Any]")  [operator]
flox/core.py:804: error: Unsupported left operand type for + ("Sequence[Any]")  [operator]
flox/core.py:809: error: Incompatible return value type (got "Dict[str, Any]", expected "Dict[Union[str, Callable[..., Any]], Any]")  [return-value]
flox/core.py:809: note: Perhaps you need a type annotation for "results"? Suggestion: "Dict[Union[str, Callable[..., Any]], Any]"
flox/core.py:895: error: Incompatible types in assignment (expression has type "Dict[Union[str, Callable[..., Any]], Any]", variable has type "Dict[str, object]")  [assignment]
flox/core.py:911: error: "object" has no attribute "append"  [attr-defined]
flox/core.py:914: error: "object" has no attribute "append"  [attr-defined]
flox/core.py:921: error: Argument "fill_value" to "chunk_reduce" has incompatible type "Tuple[int]"; expected "Optional[Mapping[Union[str, Callable[..., Any]], Any]]"  [arg-type]
flox/core.py:937: error: "object" has no attribute "append"  [attr-defined]
flox/core.py:950: error: Argument "fill_value" to "chunk_reduce" has incompatible type "Tuple[Any]"; expected "Optional[Mapping[Union[str, Callable[..., Any]], Any]]"  [arg-type]
flox/core.py:955: error: "object" has no attribute "append"  [attr-defined]
flox/core.py:957: error: Incompatible return value type (got "Dict[str, object]", expected "Dict[Union[str, Callable[..., Any]], Any]")  [return-value]
flox/core.py:957: note: Perhaps you need a type annotation for "results"? Suggestion: "Dict[Union[str, Callable[..., Any]], Any]"
flox/core.py:1175: error: Argument 1 to "partial" has incompatible type "object"; expected "Callable[..., <nothing>]"  [arg-type]
flox/core.py:1228: error: Item "None" of "Optional[Any]" has no attribute "values"  [union-attr]
flox/core.py:1290: error: No overload variant of "zip" matches argument types "Tuple[Any, ...]", "bool"  [call-overload]
flox/core.py:1290: note: Possible overload variants:
flox/core.py:1290: note:     def [_T_co, _T1] zip(cls, Iterable[_T1]) -> zip[Tuple[_T1]]
flox/core.py:1290: note:     def [_T_co, _T1, _T2] zip(cls, Iterable[_T1], Iterable[_T2]) -> zip[Tuple[_T1, _T2]]
flox/core.py:1290: note:     def [_T_co, _T1, _T2, _T3] zip(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3]) -> zip[Tuple[_T1, _T2, _T3]]
flox/core.py:1290: note:     def [_T_co, _T1, _T2, _T3, _T4] zip(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3], Iterable[_T4]) -> zip[Tuple[_T1, _T2, _T3, _T4]]
flox/core.py:1290: note:     def [_T_co, _T1, _T2, _T3, _T4, _T5] zip(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3], Iterable[_T4], Iterable[_T5]) -> zip[Tuple[_T1, _T2, _T3, _T4, _T5]]
flox/core.py:1290: note:     def [_T_co] zip(cls, Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], *iterables: Iterable[Any]) -> zip[Tuple[Any, ...]]
flox/core.py:1457: error: Argument 1 to "_validate_reindex" has incompatible type "Optional[bool]"; expected "bool"  [arg-type]
flox/core.py:1465: error: Incompatible types in assignment (expression has type "Tuple[bool, ...]", variable has type "bool")  [assignment]
flox/core.py:1478: error: Argument 1 to "_convert_expected_groups_to_index" has incompatible type "Union[Tuple[None, ...], Sequence[Any], ndarray[Any, Any]]"; expected "Tuple[Any, ...]"  [arg-type]
flox/core.py:1490: error: Value of type "Optional[Any]" is not indexable  [index]
flox/core.py:1543: error: Argument 4 to "_initialize_aggregation" has incompatible type "Optional[int]"; expected "int"  [arg-type]
flox/core.py:1567: error: Item "ndarray[Any, Any]" of "Union[ndarray[Any, Any], Any]" has no attribute "chunks"  [union-attr]
flox/core.py:1580: error: Item "ndarray[Any, Any]" of "Union[ndarray[Any, Any], Any]" has no attribute "chunks"  [union-attr]
flox/core.py:1608: error: Incompatible types in assignment (expression has type "List[Union[ndarray[Any, Any], Any]]", variable has type "Tuple[Any]")  [assignment]
flox/xarray.py:19: error: Module "xarray" has no attribute "Resample"  [attr-defined]
flox/xarray.py:231: error: Unsupported right operand type for in ("Optional[Hashable]")  [operator]
flox/xarray.py:237: error: "Hashable" has no attribute "__iter__" (not iterable)  [attr-defined]
flox/xarray.py:240: error: "Hashable" has no attribute "__iter__" (not iterable)  [attr-defined]
flox/xarray.py:248: error: Incompatible types in assignment (expression has type "Union[str, Aggregation]", variable has type "str")  [assignment]
flox/xarray.py:256: error: Argument 1 to "len" has incompatible type "Hashable"; expected "Sized"  [arg-type]
flox/xarray.py:277: error: Argument 1 to "_convert_expected_groups_to_index" has incompatible type "List[Any]"; expected "Tuple[Any, ...]"  [arg-type]
flox/xarray.py:277: error: Argument 2 to "_convert_expected_groups_to_index" has incompatible type "Sequence[bool]"; expected "bool"  [arg-type]
flox/xarray.py:278: error: Incompatible types in assignment (expression has type "Tuple[int, ...]", variable has type "List[None]")  [assignment]
flox/xarray.py:278: error: Item "None" of "Optional[Any]" has no attribute "__iter__" (not iterable)  [union-attr]
flox/xarray.py:301: error: "Hashable" has no attribute "__iter__" (not iterable)  [attr-defined]
flox/xarray.py:314: error: Argument 1 to "set" has incompatible type "Hashable"; expected "Iterable[Any]"  [arg-type]
flox/xarray.py:330: error: Argument 1 to "tuple" has incompatible type "Optional[Any]"; expected "Iterable[Any]"  [arg-type]
flox/xarray.py:339: error: "Hashable" has no attribute "__iter__" (not iterable)  [attr-defined]
flox/xarray.py:342: error: Argument 2 to "zip" has incompatible type "Optional[Any]"; expected "Iterable[Any]"  [arg-type]
flox/visualize.py:4: error: Skipping analyzing "matplotlib": module is installed, but missing library stubs or py.typed marker  [import]
flox/visualize.py:5: error: Skipping analyzing "matplotlib.pyplot": module is installed, but missing library stubs or py.typed marker  [import]

Originally posted by @dcherian in #92 (comment)

Allow chunk=None with method='blockwise'

We check for chunk=None and raise an error quite early if passed a dask array.

But we could easily make this work for method="blockwise", where we just apply a function designed for numpy arrays blockwise.

This would be particularly convenient for custom reductions.
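
A sketch of the idea with dask.array.map_blocks, assuming each chunk contains complete groups (the helper is illustrative, not flox API):

import dask.array as da
import numpy as np

def per_block_sum(data, by):
    # groups are assumed complete within each block, so a plain numpy
    # groupby-sum over the block's own labels is already the final answer
    _, codes = np.unique(by, return_inverse=True)
    return np.bincount(codes, weights=data)

data = da.arange(8, chunks=4, dtype=float)
by = da.from_array(np.array([0, 0, 1, 1, 2, 2, 3, 3]), chunks=4)

# each 4-element chunk holds two complete groups -> two results per block
result = da.map_blocks(per_block_sum, data, by, chunks=(2,), dtype=float)
result.compute()  # array([ 1.,  5.,  9., 13.])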

Improvement suggestion for sum_of_squares

Hi Deepak,

thanks for using numpy_groupies in flox. Out of curiosity I was reading through the codebase and noticed this line:

https://github.com/dcherian/flox/blob/227ce041e9b78f8432c62370c9b2c46dd91fa657/flox/aggregate_npg.py#L17

It creates an intermediate array for the squares, which then gets summed. Allocating that intermediate array is likely a relevant time sink, while it could all be done in one go with a simple derived class based on

https://github.com/ml31415/numpy-groupies/blob/c11987005cccea1b3e990aba02ece482f3a00b1d/numpy_groupies/aggregate_numba.py#L240

class SumOfSquares(Sum):
    @staticmethod
    def _inner(ri, val, ret, counter, mean):
        counter[ri] = 0
        ret[ri] += val * val

This should do the trick: it avoids the extra array and memory allocation and should run in roughly the same time as a plain grouped sum.

use "hash split" with split_out

We should copy the "hash splitting" strategy from dask.dataframe to avoid the expected_groups is not None restriction and memory complications associated with #10

make dask optional

I think the core code is now dask-optional but changing the tests looks painful :/

Minimize dependencies

numpy_groupies should be the only requirement.

It should be possible to use this package without dask or xarray.

We need environments that test all possible combinations.

The real painful bit is dealing with dask's absence in the tests (#31)

support groupby_bins

There might be a clever solution where we provide pd.cut instead of pd.factorize through a kwarg
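
Sketching the parallel: pd.factorize and pd.cut both yield integer codes (with -1 for missing or out-of-bounds values), so the downstream machinery could be shared:

import numpy as np
import pandas as pd

# discrete labels -> integer codes
codes, uniques = pd.factorize(np.array(["a", "b", "a", "c", "b"]))
# codes == array([0, 1, 0, 2, 1])

# continuous values -> bin codes; -1 marks values outside the bins
binned = pd.cut(np.array([0.1, 0.4, 1.2, 2.7, 5.0]), bins=[0, 1, 2, 3])
bin_codes = binned.codes  # array([ 0,  0,  1,  2, -1], dtype=int8)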

Weird behaviour of `split-reduce` with weird dask chunking

I don't yet understand flox well enough to see what's going on here, but I think there is a bug when using "split-reduce" on data that has a strange chunking.

See:

from itertools import product
import xarray as xr
from flox.xarray import xarray_reduce
 
# The datetime are not important here (I think) but it gives something to group on.
t = xr.DataArray(xr.date_range('2000-01-01', '2000-12-31', use_cftime=False), dims=('time',), name='time')
data = t.time.dt.dayofyear
 
for chunking, method in product([None, 366, 365, 15], ['split-reduce', 'cohorts']):
    if chunking is not None:
        dat = data.chunk({'time': chunking})
    else:
        dat = data
    out = xarray_reduce(dat, dat.time.dt.month, func='mean', method=method).load()
    out_dec = out.sel(month=12).item()
    if out_dec != 351:
        print(f'What is this! Got {out_dec=} for {chunking=} and {method=}')

You'll see that the last element of the output (the average of the doy over December) is 351 in all cases, except when the chunking cuts a group in two and the first chunk spans multiple groups. In that case, with method "split-reduce", it is 10881 (which is exactly 31 * 351, coincidence?).

I'm not sure of my diagnosis, but after trying some combinations, it seems to happen when a group is cut in two by chunks, at least one of which covers many groups. For example, with chunking=190, the split happens on July 8th, and the result is 6138 instead of 198. And 6138 is 31 * 198.

Haha, this is all a mystery to me... It doesn't seem related to the dtype: here we have integers as input, and casting to float doesn't change the behaviour.

For context, the real use case where this happened to me (in xclim) was performing a groupby after a rolling operation; the latter leaves a small chunk at the end of the array, and this bug appeared.

Potential bug in `xarray_reduce`

When I try using the xarray_reduce interface with the following

xarray_reduce(ds.Tair, ds.time.dt.month, func='mean')

I get an UnboundLocalError:

In [19]: xarray_reduce(ds.Tair, ds.time.dt.month, func='mean')
<xarray.DataArray 'month' (time: 36)>
array([ 9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,
        2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,
        7,  8])
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-19-b8f0e0a131da> in <module>
----> 1 xarray_reduce(ds.Tair, ds.time.dt.month, func='mean')

~/devel/pydata/dask_groupby/dask_groupby/xarray.py in xarray_reduce(obj, func, expected_groups, bins, dim, split_out, fill_value, *by)
     48 
     49     group_names = tuple(g.name for g in by)
---> 50     group_sizes = dict(zip(group_names, group_shape))
     51     indims = tuple(obj.dims)
     52     otherdims = tuple(d for d in indims if d not in dim)

UnboundLocalError: local variable 'group_shape' referenced before assignment

It appears that group_shape is only defined when using multiple groupers:

https://github.com/dcherian/dask_groupby/blob/cab3b5bc3355e7934fe60029fc1063b6ed2f0a2f/dask_groupby/xarray.py#L39-L41

@dcherian, Am I missing something?

fix Ellipsis reductions

This means supporting reductions along dimensions not present in by, which doesn't seem to work in some cases.

check dtype promotion

On pangeo cloud, I'm seeing float32 being upcast to float64. This should not be needed.

Support cumsum, cumprod

Supporting just numpy should be relatively easy. This will also work for method="blockwise" by default.

We may want to rename groupby_reduce to groupby_agg?

For dask proper, we'll need to use dask.array.cumreduction instead of dask.array.blockwise + dask.array.reductions._tree_reduce
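
For the numpy path, a grouped cumsum can be built from a stable sort plus per-group offsets; a minimal sketch (illustrative, not flox's implementation):

import numpy as np

def group_cumsum(data, by):
    """Cumulative sum within each group, preserving original element order."""
    order = np.argsort(by, kind="stable")
    inverse = np.empty_like(order)
    inverse[order] = np.arange(len(order))

    sorted_data = data[order]
    sorted_by = by[order]
    csum = np.cumsum(sorted_data)

    # total accumulated *before* each group starts, repeated over the group
    first = np.r_[True, sorted_by[1:] != sorted_by[:-1]]
    starts = np.flatnonzero(first)
    counts = np.diff(np.r_[starts, len(by)])
    offsets = np.repeat(csum[starts] - sorted_data[starts], counts)
    return (csum - offsets)[inverse]

group_cumsum(np.array([1, 2, 3, 4, 5]), np.array([0, 1, 0, 1, 0]))
# array([1, 2, 4, 6, 9])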

performance benchmarking

bincount-ed sum is 6x slower than sum, and bincount-ed count is 2x slower than count, so there are definitely cases where it makes sense to split the dataset early instead of using flox. I'm seeing this on Pangeo Cloud using the GODAS dataset.

import numpy as np

array = np.zeros((10 ** 5,), dtype=int)
by = array

%timeit np.bincount(by)  # count
%timeit np.bincount(by, weights=array)  # sum
%timeit np.sum(~np.isnan(by)) # count
%timeit array.sum() # sum
264 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
441 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
117 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
65.9 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

implement shuffle-then-reduce strategy

Proof-of-concept here. This will help with time grouping in particular, where a group's elements occur at repeated intervals but at some distance from each other. When the repeats are "far" relative to chunk size, it can take a while before we actually reduce anything, leading to a memory spike at the beginning of the computation.

https://gist.github.com/dcherian/6ccd76d2a6eaadb7844d61d197a8b3db

If we can generalize this, we can solve the biggest fail case of the map-reduce approach used in this package.

Allow grouping noisy arrays using tolerances

I find grouping data can be a little more intuitive when using tolerances.
This can be done for example like this:

def groupby_isclose_argsort(arr, atol=0, rtol=0.1):
    """
    Return a group idx of a noisy array.

    TODO: argsort not available in dask.

    Examples
    --------
    reps = 2
    y = np.array([72, 72, 100, 100, 300, 300, 500, 500])
    y = np.stack(reps*[y], 0)
    noise = lambda y : 1 + 0.1 * (np.random.rand(*y.shape) - 0.5)
    y = y * noise(y)
    groupby_isclose_argsort(y)
    array([[0, 0, 1, 1, 2, 2, 3, 3],
           [0, 0, 1, 1, 2, 2, 3, 3]], dtype=int32)
    """
    # Sort values to make sure values are monotonically increasing:
    a = arr.ravel()
    i = np.argsort(a)
    i_rev = np.empty(i.shape, dtype=np.intp, like=a)
    i_rev[i] = np.arange(len(a), like=a)
    a = a[i]

    # Calculate a monotonically increasing index that increase when the
    # difference between current and previous value changes:
    b = np.roll(a, 1)
    b[0] = a[0]
    by = np.cumsum(~np.isclose(a, b, atol=atol, rtol=rtol))
    # tolerance = atol + rtol * b
    # by = np.cumsum(np.abs(a - b) > tolerance)

    return by[i_rev].reshape(arr.shape)

So on a dataset, the API would look something like ds.groupby("arr", atol=0, rtol=0.1).

Optimize cohorts by ignoring NaNs when factorize_early is True

By default find_group_cohorts ignores NaNs.

When we factorize_early, this optimization is lost, because NaN groups are now relabeled to -1 (and other group labels are now positive integers).

I think the only fix is a little messy: optionally tell find_group_cohorts to drop -1 labels.

Summation on booleans returns OR instead of sum

As explained in pydata/xarray#6615, it seems the sum aggregations implemented by flox actually return an OR reduction when given boolean inputs. While this could be seen as mathematically/technically correct, it is a different behaviour from numpy's.

However, testing with flox directly, I see that the numpy engine actually raises an error with boolean types. I'm guessing the flox one should too? But then, I'm not sure how xarray should handle it... A special case for sum here (xarray_reduce) or there?

Culprit line:
https://github.com/dcherian/flox/blob/19543801fb28f2557591aecd559e52b66c8051b3/flox/aggregate_flox.py#L51

MWE:

import flox
import numpy as np

groups = np.array([1, 1, 1])
data = np.array([True, True, False])

np.sum(data) # 2, what I expect

flox.groupby_reduce(data, groups, func='sum', engine='flox')  # (True, 1) , the result of an OR agg.

flox.groupby_reduce(data, groups, func='sum', engine='numpy')   # Fails within numpy_groupies, same with numba
#  TypeError: function sum requires a more complex datatype than bool
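
Until the behaviour is settled, one workaround (a suggestion, not an endorsed fix) is to cast the booleans up front:

flox.groupby_reduce(data.astype(int), groups, func='sum', engine='flox')  # (2, 1), matching np.sum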
