Comments (6)
Ah, now this is interesting!
The general solution is hard and approximate (I think). You'll need to implement something like a count-min sketch (I'm sure there are others).
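For readers unfamiliar with the idea, here is a minimal, standard-library sketch of a count-min sketch. It is only an illustration under stated assumptions: the class name `CountMinSketch`, the `width` and `depth` defaults, and the land-cover-style labels are all hypothetical, and nothing here is part of flox. The property that matters for chunked data is that two sketches of the same shape merge by elementwise addition.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates never undercount, but may overcount."""

    def __init__(self, width: int = 256, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One cell per row, chosen by a row-salted hash of the item.
        for row in range(self.depth):
            digest = hashlib.blake2b(repr(item).encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, item, count: int = 1) -> None:
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item) -> int:
        # Collisions only inflate cells, so the minimum is the tightest bound.
        return min(self.table[row][col] for row, col in self._cells(item))

    def merge(self, other: "CountMinSketch") -> "CountMinSketch":
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same shape")
        merged = CountMinSketch(self.width, self.depth)
        merged.table = [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.table, other.table)
        ]
        return merged


# Two "chunks" are sketched independently, then merged.
chunk_a = CountMinSketch()
chunk_b = CountMinSketch()
for label in ["forest", "water", "forest"]:
    chunk_a.add(label)
for label in ["forest", "urban"]:
    chunk_b.add(label)

merged = chunk_a.merge(chunk_b)
candidates = ["forest", "water", "urban"]
most_common = max(candidates, key=merged.estimate)
```

A sketch only answers "roughly how often did this item occur?", so picking the most common item still needs a list of candidates to query.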
The easier way: You'll have to decompose the problem and compute each unique item and associated count in `chunk` (like you already do), merge those intermediates together in `combine`, and then pick the top-most in `finalize`.
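The three steps can be sketched in plain NumPy for a single group. This is not flox's API: the function names `chunk_counts`, `combine_counts`, and `finalize_most_common` are hypothetical, and the sketch ignores the group-by dimension entirely. It only shows that per-block (labels, counts) pairs can be merged exactly before picking a winner.

```python
import numpy as np


def chunk_counts(block: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Chunk step: unique labels and their counts within one block."""
    return np.unique(block, return_counts=True)


def combine_counts(parts):
    """Combine step: merge (labels, counts) pairs from several blocks."""
    labels = np.concatenate([p[0] for p in parts])
    counts = np.concatenate([p[1] for p in parts])
    merged_labels, inverse = np.unique(labels, return_inverse=True)
    merged_counts = np.zeros(merged_labels.size, dtype=np.int64)
    # Sum the counts of labels that appear in more than one block.
    np.add.at(merged_counts, inverse, counts)
    return merged_labels, merged_counts


def finalize_most_common(labels: np.ndarray, counts: np.ndarray):
    """Finalize step: the label with the largest merged count."""
    return labels[np.argmax(counts)]


data = np.array([3, 1, 3, 2, 2, 2, 3, 3, 1])
blocks = np.array_split(data, 3)  # stand-in for dask chunks

parts = [chunk_counts(block) for block in blocks]
labels, counts = combine_counts(parts)
result = finalize_most_common(labels, counts)
```

Note that the middle block alone would report 2 as its most common value, while the merged counts give 3, which is why the counts have to be carried through the combine step rather than the per-block winners.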
Are there useful properties for the field you're collapsing or the field you're grouping by (example plots of both these fields would be useful)? For example:
- are there only a small number of unique labels?
method="blockwise"
only works
from flox.
Hi Deepak, thank you for your reply.
The easier way: You'll have to decompose the problem and compute each unique item and associated count in `chunk` (like you already do), merge those intermediates together in `combine`, and then pick the top-most in `finalize`.
I attempted to do this, but I am unsure exactly how to make it work. The code is below; I am not sure how to wrap the individual functions so that they are compatible with Aggregation.
I could not make much sense of the documentation/docstrings beyond the very simple examples that are available. I also received errors about setting an array element with a sequence (as each chunk returns an array of unique values instead of a single value).
Code
import numpy as np
from flox.aggregations import Aggregation


def unique_labels(a: np.ndarray) -> np.ndarray:
    # Unique labels present in this chunk (np.unique flattens and sorts).
    labels = np.unique(a)
    return labels


def unique_counts(a: np.ndarray) -> np.ndarray:
    # Count of each unique label, in the same order as unique_labels(a).
    _, counts = np.unique(a, return_counts=True)
    return counts


def most_common_chunked(multi_values: np.ndarray, multi_counts: np.ndarray, **kwargs):
    # Merge the per-chunk labels and counts, then return the label with the largest total.
    all_values, index = np.unique(multi_values, return_inverse=True)
    all_counts = np.zeros(all_values.size, np.int64)
    np.add.at(all_counts, index.ravel(), multi_counts.ravel())  # in-place accumulation
    return all_values[all_counts.argmax()]


# _custom_grouped_reduction and wrap_stack are defined elsewhere (not shown here).
most_common = Aggregation(
    name="most_common",
    numpy=_custom_grouped_reduction,
    chunk=(unique_labels, unique_counts),  # first compute blockwise
    combine=(wrap_stack, wrap_stack),  # stack these intermediate results
    finalize=most_common_chunked,  # get most common value from the combined result
    fill_value=0,
)
Are there useful properties for the field you're collapsing or the field you're grouping by (example plots of both these fields would be useful)? For example are there only a small number of unique labels?
Our specific use case is a high resolution land cover dataset, which we want to be able to regrid easily. Of course once the "most common" strategy works, knowing the 2nd (or n-th) most common is also interesting.
Due to the nature of our dataset it will have a limited number of unique labels.
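With a small, known label set, one option is a dense count table with one row per group and one column per label. The following is a minimal NumPy sketch under loud assumptions: labels are integers in `0..n_labels-1`, group ids are integers in `0..n_groups-1`, and the helper names `count_table` and `nth_most_common` are hypothetical rather than anything flox provides. Tables computed on separate chunks simply add, and the n-th most common label falls out of a sort.

```python
import numpy as np


def count_table(groups, labels, n_groups, n_labels):
    """Count occurrences of each label within each group."""
    table = np.zeros((n_groups, n_labels), dtype=np.int64)
    np.add.at(table, (groups, labels), 1)
    return table


def nth_most_common(table, n=1):
    """Label with the n-th largest count in each group (n=1 is the mode).

    A stable sort on negated counts resolves ties toward the smaller label.
    If a group has fewer than n distinct labels, the result has zero count.
    """
    order = np.argsort(-table, axis=1, kind="stable")
    return order[:, n - 1]


groups = np.array([0, 0, 0, 1, 1, 1, 1, 0])
labels = np.array([2, 2, 1, 0, 0, 3, 0, 2])

# Count each half separately, as if they were two chunks, then add.
table = (
    count_table(groups[:4], labels[:4], n_groups=2, n_labels=4)
    + count_table(groups[4:], labels[4:], n_groups=2, n_labels=4)
)

first = nth_most_common(table, n=1)
second = nth_most_common(table, n=2)
```

The memory cost is `n_groups * n_labels` integers, so this only makes sense when the number of labels really is small.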
from flox.
Unfortunately, this will need some major thinking. You'll have to handle the unique and count intermediates together, similar to how the argreduction is run. This is not trivial.
Is it possible to rechunk so that a blockwise solution works? Can you describe your problem precisely? A reproducible example would be best...
Due to the nature of our dataset it will have a limited number of unique labels.
OK that's good for the exact nature of this solution.
from flox.
Also np.unique flattens the array. Is that OK for your purposes?
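A small illustration of the flattening, using a made-up 2-D `tile` array: without an `axis` argument, `np.unique` treats the input as one flat sequence of values, so the spatial layout is lost in the result.

```python
import numpy as np

tile = np.array([[1, 2, 2],
                 [3, 2, 1]])

# Default behaviour: the input is flattened, so the outputs are 1-D.
values, counts = np.unique(tile, return_counts=True)
mode = values[np.argmax(counts)]

# With axis=0, whole rows are compared as units instead of single values.
rows = np.unique(tile, axis=0)
```

For a "most common value in this window" reduction the flattening is usually what is wanted, since the position of each pixel inside the window does not matter.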
from flox.
#269 should fix your blockwise problems. I'm thinking I can just add support for mode applied blockwise. Would that solve your use case?
from flox.
Hi Deepak, thank you for your replies and the modifications to the code. I am currently too busy with other projects to focus on this, but I will get back to it and try it out as soon as I can make some time!
Also np.unique flattens the array. Is that OK for your purposes?
Yes that is fine.
A reproducible example would be best...
I do have a demo notebook on our repository where I show a basic use case. It is not easily reproducible as it requires a very large dataset at the moment, but it might give you a better idea of what we are trying to achieve.
from flox.