Hi! Thanks for developing Flox, it has been quite useful in our workflows.

"most common" Aggregator with Dask

BSchilperoort commented on June 12, 2024

Comments (6)

dcherian commented on June 12, 2024

Ah, now this is interesting!

The general solution is hard and approximate (I think). You'll need to implement something like a count-min sketch (I'm sure there are other options).
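To make the count-min sketch idea concrete, here is a minimal NumPy sketch. It is not part of flox; the class name `CountMinSketch`, its parameters, and the hashing scheme are all assumptions chosen for illustration, and it assumes integer labels in the range [0, 2**31). Each chunk builds its own sketch, and sketches with the same shape and seed can be merged by adding their tables.

```python
import numpy as np


class CountMinSketch:
    """Approximate counter for integer labels in [0, 2**31) (illustrative only)."""

    PRIME = 2_147_483_647  # 2**31 - 1

    def __init__(self, width=64, depth=4, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        # One (a, b) pair per row defines h(x) = ((a * x + b) mod p) mod width.
        self.a = rng.integers(1, self.PRIME, size=depth, dtype=np.int64)
        self.b = rng.integers(0, self.PRIME, size=depth, dtype=np.int64)
        self.table = np.zeros((depth, width), dtype=np.int64)

    def _columns(self, items):
        items = np.asarray(items, dtype=np.int64).ravel()
        hashed = (self.a[:, None] * items[None, :] + self.b[:, None]) % self.PRIME
        return hashed % self.width  # shape (depth, n_items)

    def add(self, items):
        columns = self._columns(items)
        for row, cols in enumerate(columns):
            np.add.at(self.table[row], cols, 1)  # np.add.at handles repeated columns

    def estimate(self, items):
        columns = self._columns(items)
        rows = np.arange(columns.shape[0])[:, None]
        # The minimum over rows is the least-inflated count for each item.
        return self.table[rows, columns].min(axis=0)

    def merge(self, other):
        # Only valid for sketches built with the same width, depth and seed.
        self.table += other.table
        return self


chunk_a = np.array([3, 3, 7, 1, 3])
chunk_b = np.array([7, 3, 1, 1, 3])

sketch_a, sketch_b = CountMinSketch(), CountMinSketch()
sketch_a.add(chunk_a)
sketch_b.add(chunk_b)
merged = sketch_a.merge(sketch_b)

candidates = np.array([1, 3, 7])
estimates = merged.estimate(candidates)  # never below the true counts (3, 5, 2)
```

Note that the sketch only stores counts, so the candidate labels have to be tracked separately, and hash collisions can inflate an estimate but never reduce it.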

The easier way: You'll have to decompose the problem and compute each unique item and associated count in chunk (like you already do), merge those intermediates together in combine and then pick the top-most in finalize.
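The three stages described above can be sketched in plain NumPy, independent of flox. The function names `chunk_counts`, `combine_counts` and `finalize_mode` are hypothetical and the example ignores grouping; it only shows how the intermediates flow from one stage to the next.

```python
import numpy as np


def chunk_counts(block):
    """Chunk step: unique values and their counts within one block."""
    values, counts = np.unique(block, return_counts=True)
    return values, counts


def combine_counts(intermediates):
    """Combine step: merge per-block (values, counts) pairs into one pair."""
    all_values = np.concatenate([values for values, _ in intermediates])
    all_counts = np.concatenate([counts for _, counts in intermediates])
    merged_values, inverse = np.unique(all_values, return_inverse=True)
    merged_counts = np.zeros(merged_values.size, dtype=np.int64)
    np.add.at(merged_counts, inverse, all_counts)  # sum counts of equal values
    return merged_values, merged_counts


def finalize_mode(values, counts):
    """Finalize step: pick the value with the largest total count."""
    return values[counts.argmax()]


blocks = [np.array([2, 2, 5, 9]), np.array([5, 5, 9, 9]), np.array([5, 2])]
merged_values, merged_counts = combine_counts([chunk_counts(b) for b in blocks])
mode = finalize_mode(merged_values, merged_counts)
print(merged_values, merged_counts, mode)  # [2 5 9] [3 4 3] 5
```

The first block on its own would report 2 as its most common value, which is why the counts have to be carried through to the final step instead of reducing each block to a single value.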

Are there useful properties for the field you're collapsing or the field you're grouping by (example plots of both these fields would be useful)? For example:

are there only a small number of unique labels?
method="blockwise" only works


BSchilperoort commented on June 12, 2024

Hi Deepak, thank you for your reply.

The easier way: You'll have to decompose the problem and compute each unique item and associated count in chunk (like you already do), merge those intermediates together in combine and then pick the top-most in finalize.

I attempted to do this, however I am unsure of how to exactly make this work. Below is the code, but I am not sure of how to wrap the individual functions in such a way that they're compatible with Aggregation.

I could not make much sense of the documentation/docstrings beyond the very simple examples that are available. I also received some errors about setting an array element with a sequence (as each chunk returns an array of unique values instead of a single value).

Code

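import numpy as np
from flox.aggregations import Aggregation


def unique_labels(a: np.ndarray) -> np.ndarray:
    # Sorted unique values in this chunk (np.unique flattens its input).
    labels = np.unique(a)
    return labels


def unique_counts(a: np.ndarray) -> np.ndarray:
    # Occurrence counts, in the same order as unique_labels(a).
    _, counts = np.unique(a, return_counts=True)
    return counts


def most_common_chunked(multi_values: np.ndarray, multi_counts: np.ndarray, **kwargs):
    # Merge the stacked per-chunk values, then sum the counts of equal values.
    all_values, index = np.unique(multi_values, return_inverse=True)
    all_counts = np.zeros(all_values.size, np.int64)
    np.add.at(all_counts, index, multi_counts.ravel())  # inplace
    return all_values[all_counts.argmax()]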


# Note: _custom_grouped_reduction and wrap_stack are not defined in this snippet.
most_common = Aggregation(
    name="most_common",
    numpy=_custom_grouped_reduction,
    chunk=(unique_labels, unique_counts),  # first compute blockwise
    combine=(wrap_stack, wrap_stack),  # stack these intermediate results
    finalize=most_common_chunked,  # get most common value from the combined result
    fill_value=0,
)

Are there useful properties for the field you're collapsing or the field you're grouping by (example plots of both these fields would be useful)? For example are there only a small number of unique labels?

Our specific use case is a high resolution land cover dataset, which we want to be able to regrid easily. Of course once the "most common" strategy works, knowing the 2nd (or n-th) most common is also interesting.

Due to the nature of our dataset it will have a limited number of unique labels.
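With a small, known set of labels, one possible exact approach is to keep a fixed-size count table per group, which sidesteps the variable-length intermediates. The sketch below is plain NumPy rather than flox code; `chunk_count_table` and `nth_most_common` are hypothetical names, and it assumes both land cover labels and group ids are integers starting at 0.

```python
import numpy as np


def chunk_count_table(labels, groups, n_groups, n_labels):
    """Per-chunk table: table[g, k] counts how often label k occurs in group g."""
    table = np.zeros((n_groups, n_labels), dtype=np.int64)
    np.add.at(table, (np.ravel(groups), np.ravel(labels)), 1)
    return table


def nth_most_common(table, n=1):
    """Label with the n-th highest count per group (ties go to the smaller label)."""
    order = np.argsort(-table, axis=1, kind="stable")
    return order[:, n - 1]


# Two chunks of the same dataset, each with its own labels and group ids.
labels_a, groups_a = np.array([0, 1, 1, 2]), np.array([0, 0, 1, 1])
labels_b, groups_b = np.array([1, 1, 2, 2]), np.array([0, 1, 1, 1])

# Tables have a fixed shape, so combining chunks is a plain sum.
table = (
    chunk_count_table(labels_a, groups_a, n_groups=2, n_labels=3)
    + chunk_count_table(labels_b, groups_b, n_groups=2, n_labels=3)
)
print(table.tolist())             # [[1, 2, 0], [0, 2, 3]]
print(nth_most_common(table, 1))  # [1 2]
print(nth_most_common(table, 2))  # [0 1]
```

A group with no data ends up with an all-zero row and would report label 0, so empty groups need separate handling (for example a fill value).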


dcherian commented on June 12, 2024

Unfortunately, this will need some major thinking. You'll have to handle the unique and count intermediates together, similar to how the argreduction is run. This is not trivial.

Is it possible to rechunk so that a blockwise solution works? Can you describe your problem precisely? A reproducible example would be best...
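As a rough illustration of when a blockwise solution is possible: it requires that all members of a group sit in the same block, so no cross-block merging is needed. The helper below is a hypothetical check in plain NumPy (not a flox function) that takes the group ids of each block and reports whether any group straddles a block boundary.

```python
import numpy as np


def groups_are_block_local(group_blocks):
    """Return True if no group id appears in more than one block."""
    seen = set()
    for block in group_blocks:
        present = set(np.unique(block).tolist())
        if seen & present:  # a group id already seen in an earlier block
            return False
        seen |= present
    return True


# Group 1 straddles the block boundary, so a blockwise reduction would be wrong.
original = [np.array([0, 0, 1]), np.array([1, 2, 2])]
# After rechunking along group boundaries, every group lives in one block.
rechunked = [np.array([0, 0]), np.array([1, 1]), np.array([2, 2])]

print(groups_are_block_local(original))   # False
print(groups_are_block_local(rechunked))  # True
```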

Due to the nature of our dataset it will have a limited number of unique labels.

OK that's good for the exact nature of this solution.


dcherian commented on June 12, 2024

Also np.unique flattens the array. Is that OK for your purposes?
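A small example of the flattening behaviour, using a made-up 2-D array: without an `axis` argument, `np.unique` treats the input as one flat sequence, so any per-row structure has to be handled explicitly.

```python
import numpy as np

tile = np.array([[1, 2, 2],
                 [3, 2, 1]])

# No axis given: the 2-D input is flattened before finding unique values.
values, counts = np.unique(tile, return_counts=True)
print(values, counts)  # [1 2 3] [2 3 1]

# Per-row unique values need an explicit loop over the rows.
per_row = [np.unique(row).tolist() for row in tile]
print(per_row)  # [[1, 2], [1, 2, 3]]
```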


dcherian commented on June 12, 2024

#269 should fix your blockwise problems. I'm thinking I can just add support for mode applied blockwise. Would that solve your use case?


BSchilperoort commented on June 12, 2024

Hi Deepak, thank you for your replies and the modifications to the code. I am currently too busy with other projects to focus on this, but I will get back to it and try it out as soon as I can make some time!

Also np.unique flattens the array. Is that OK for your purposes?

Yes that is fine.

A reproducible example would be best...

I do have a demo notebook on our repository where I show a basic use case. It is not easily reproducible as it requires a very large dataset at the moment, but it might give you a better idea of what we are trying to achieve.
