I'm frequently trying to apply multiple reductions to the same groups (e.g. <code clas

applying multiple reductions to the same groups (flox, CLOSED)

keewis commented on June 12, 2024

applying multiple reductions to the same groups

Comments (2)

keewis commented on June 12, 2024

(1) sounds good to me, actually, so it's fine to not do this here (and I can just create a PR for GroupBy.agg or something similar).

Edit: basically, let's make use of pydata/xarray#7206

dcherian commented on June 12, 2024

It would be a decent bit of complexity to add, and I'm not inclined to add it.

There would be two advantages:

1. The data are only factorized once, and the integer codes are reused.
2. We could drastically reduce the number of tasks in the dask graph, at the cost of more complicated code. The task count drops because a single task can calculate the intermediates needed by all of the reductions.
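To make advantage (1) concrete, here is a minimal NumPy sketch of "factorize once, reuse the integer codes". It is not flox's implementation; the sample `labels` and `values` and the choice of `np.unique` as the factorizer are assumptions made for illustration.

```python
import numpy as np

# Illustrative data (assumed, not from the issue).
labels = np.array(["a", "b", "a", "c", "b", "a"])
values = np.array([1.0, 5.0, 3.0, 2.0, 7.0, 4.0])

# Factorize once: `groups` holds the sorted unique labels and `codes`
# maps every element to the position of its label in `groups`.
groups, codes = np.unique(labels, return_inverse=True)
ngroups = len(groups)

# Reduction 1: grouped sum, driven by the integer codes.
sums = np.bincount(codes, weights=values, minlength=ngroups)

# Reduction 2: grouped max, reusing the same codes (no second factorization).
maxes = np.full(ngroups, -np.inf)
np.maximum.at(maxes, codes, values)
```

Both reductions only ever see `codes`, so the comparatively expensive label matching happens a single time.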

I'm not sure (1) is worth it, at least for xarray: after pydata/xarray#7206, we will get this for free by calling each individual method on a saved GroupBy object.
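A rough sketch of the "saved GroupBy object" pattern in xarray, with made-up data and an assumed coordinate name `label`. Whether the factorization is actually shared between the two calls depends on the xarray version in use; the code only shows the calling pattern.

```python
import numpy as np
import xarray as xr

# Illustrative data (assumed): values along "x" with a grouping coordinate.
da = xr.DataArray(
    np.array([1.0, 5.0, 3.0, 2.0, 7.0, 4.0]),
    dims="x",
    coords={"label": ("x", ["a", "b", "a", "c", "b", "a"])},
)

# Build the GroupBy object once and keep a reference to it.
gb = da.groupby("label")

# Call each reduction on the saved object instead of re-grouping.
means = gb.mean()
maxes = gb.max()
```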

I'm not sure (2) is worth it for a couple of cases:

It will also mean that to calculate max only you will calculate every other reduction and then discard it.
If you're writing the output to zarr for example, you lose parallelism again.
It could be an advantage to compute count only once and reuse it for both count and mean, but I'm not sure it's worth it. We could get this advantage instead by breaking up the current algorithm to compute count and sum separately for mean. The dask optimizer would then handle the shared count computation for us.
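The count/sum split can be sketched in eager NumPy as follows. The helper names `group_count` and `group_sum` are hypothetical, and the `codes` are assumed to come from an earlier factorization step. In a dask graph the shared count would be a single task feeding both outputs, which this eager version only imitates by reusing a variable.

```python
import numpy as np

# Assumed inputs: integer group codes and the values to reduce.
codes = np.array([0, 1, 0, 2, 1, 0])
values = np.array([1.0, 5.0, 3.0, 2.0, 7.0, 4.0])
ngroups = 3


def group_count(codes, ngroups):
    # Number of elements that fall into each group.
    return np.bincount(codes, minlength=ngroups)


def group_sum(codes, values, ngroups):
    # Sum of the values that fall into each group.
    return np.bincount(codes, weights=values, minlength=ngroups)


# The count is computed once ...
count = group_count(codes, ngroups)
total = group_sum(codes, values, ngroups)

# ... and reused: it is returned as the "count" result and also
# serves as the denominator of the "mean" result.
mean = total / count
```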