Comments (8)
Nice example.
I believe this is #222.
I ran the reduction and took a mental note of the timings for the "blocks" tasks while specifying the "engine" kwarg. This approximately reflects the cost of running the first groupby-reduction on every block of data (usually the most computationally intensive piece):
|---------+-------------|
| engine  | approx time |
|---------+-------------|
| numpy   | 1s          |
| flox    | 500ms       |
| numbagg | 300ms       |
|---------+-------------|
So installing "numbagg" or specifying engine="flox" might bring it back to parity.
I'll note that flox's real innovation is making more things possible (e.g. this post) that would just straight up fail otherwise.
The default Xarray strategy does work well for a few chunking schemes (indeed this observation inspired "cohorts"), but it's hard to predict if you haven't deeply thought about groupby.
EDIT: I love that "cohorts" (the automatic choice) is 2x faster than "map-reduce".
from flox.
@dcherian - thanks for these comments (and for all the helpful tools!)
I'll note that flox's real innovation is making more things possible (e.g. this post) that would just straight up fail otherwise.
I do really appreciate this important point, even if I currently lack the understanding to write a simple example that shows this for climatological aggregations over one dimension. I did try to push the size of the array farther to reach a point where "not-flox" failed and "flox" completed. But in this simple case I couldn't seem to do that with the array size changes I was making? Given that my real-world problem is applying climatological aggregations to 11TB arrays, "making more things possible" is the gold star for a cluster of a given size, and is why flox is so welcome.
re: numbagg - my very ignorant understanding was that this only helps with NaN-excluding calculations? .... ahh, but this highlights that the default for xr.mean() is skipna = True. Even though we don't have any NaNs here I suppose we are actually running nanmean() not mean()?
I'll try to apply some of your comments here . . .
..... specifying engine="flox" might bring it back to parity.
Something else I'm clearly not understanding - I thought that current xarray will automatically use flox if:
- it's installed
- import flox
- xr.options shows Option: use_flox, Value: True
In this case, how does adding engine="flox" change things?
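For anyone wanting to check those conditions, here is a small sketch. It only reports whether flox is importable and what xarray's use_flox option is currently set to; it does not show which engine flox then picks internally.

```python
import importlib.util

import xarray as xr

# Is the flox package importable in this environment?
flox_installed = importlib.util.find_spec("flox") is not None

# xarray exposes its current options as a read-only mapping.
use_flox_option = xr.get_options()["use_flox"]

print(f"flox installed: {flox_installed}")
print(f"xarray use_flox option: {use_flox_option}")
```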
OK - yes . . . engine="flox" significantly speeds up .mean() - I didn't try numbagg, but skipna = False speeds up all flavours of the calculation (regardless of the lack of NaNs in the array = my ignorance). engine="flox" does bring it closer to parity: 20.9 s (no-flox & skipna=False) vs 36.4 s (flox, cohorts, & skipna=False) ... but as above comment I'm a bit unclear on the syntax in xr.groupby.mean thinking that flox would have been automatic?
but as above comment I'm a bit unclear on the syntax in xr.groupby.mean thinking that flox would have been automatic?
Yes unclear syntax. See https://flox.readthedocs.io/en/latest/engines.html. Basically there are two levels to flox:
(1) vectorized groupby algos for numpy arrays
(2) optimized graphs for dask arrays.
engine controls strategy for (1).
method controls strategy for (2).
By setting engine="flox" you're opting in to flox's internal vectorized algo. This is a super great idea when your groups are sorted. However I have now realized that my current heuristic for choosing this assumes numpy arrays; we can do better for dask arrays (like the one you're working with).
This is something to fix, so thanks for taking the time to write this up :)
Installing numbagg should then get you faster than default, though there is a (small) cost to compiling.
Setting skipna=False if you don't have NaNs is always a good idea. It avoids some extra memory copies.
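A quick way to convince yourself that skipna=False is safe on NaN-free data is to compare the two results. This is a minimal sketch on a small made-up DataArray (the name, shape and dates are all hypothetical); it checks values only and says nothing about speed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical NaN-free data: two years of daily values at 20 points.
time = pd.date_range("2001-01-01", periods=730, freq="D")
demo = xr.DataArray(
    np.random.default_rng(0).random((time.size, 20)),
    dims=("time", "x"),
    coords={"time": time},
    name="demo",
)

# Default behaviour: NaNs are skipped for float data (a nanmean-style reduction).
default_mean = demo.groupby("time.month").mean()

# Explicitly ask for a plain mean with no NaN handling.
plain_mean = demo.groupby("time.month").mean(skipna=False)

# With no NaNs present, the two must agree.
xr.testing.assert_allclose(default_mean, plain_mean)
```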
I did try to push the size of the array farther to reach a point where "not-flox" failed and "flox" completed. But in this simple case I couldn't seem to do that with the array size changes I was making?
Nice. One of the "challenges" is that dask tends to improve with time, so this envelope keeps shifting (and sometimes regresses hehe).
I'll note that my major goal here is to get decent perf with 0 thinking :)
Hence my excitement that we are automatically choosing method="cohorts", so you needn't set that.
Clearly, I need to think more about how to set engine so that this pain here goes away.
You might try a daily climatology or an hourly climatology to see how things shape up
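For reference, a sketch of what those two groupings look like in xarray, using a small made-up hourly series (two non-leap years, 4 points). The variable names are hypothetical, and a real test would use the dask-backed array from the original example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical hourly data covering 2001 and 2002 (both non-leap years).
time = pd.date_range("2001-01-01", periods=2 * 365 * 24, freq=pd.Timedelta(hours=1))
hourly_data = xr.DataArray(
    np.random.default_rng(0).random((time.size, 4)),
    dims=("time", "x"),
    coords={"time": time},
    name="demo",
)

# Daily climatology: one group per day of year (365 groups here).
daily_clim = hourly_data.groupby("time.dayofyear").mean()

# Hourly climatology: one group per hour of day (24 groups).
hourly_clim = hourly_data.groupby("time.hour").mean()

print(daily_clim.sizes, hourly_clim.sizes)
```

These groupings have many more groups than a monthly climatology, and the labels repeat at a different period relative to the chunks.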