Comments (6)
At the combine stage, we allow a preprocessor argument to modify datasets. We could have something similar for the individual files; having said that, loading the produced JSON and removing the offending tags should not be too difficult.
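A minimal sketch of that per-file cleanup (the drop_attr helper and the sample refs dict are hypothetical, standing in for the JSON a single-file scanner produces; json is used here, but ujson works the same way):

```python
import json

def drop_attr(refs, variable, attr):
    """Remove one attribute from a variable's .zattrs entry in a
    kerchunk references dict (the {"version": ..., "refs": {...}} layout)."""
    key = f"{variable}/.zattrs"
    zattrs = json.loads(refs["refs"][key])
    zattrs.pop(attr, None)  # no-op if the attribute is absent
    refs["refs"][key] = json.dumps(zattrs)
    return refs

# hypothetical single-file reference set containing an offending tag
refs = {
    "version": 1,
    "refs": {"temp/.zattrs": json.dumps({"_FillValue": -999, "units": "K"})},
}
cleaned = drop_attr(refs, "temp", "_FillValue")
```

The other attributes of the variable are left untouched, so this can be run after translate() without disturbing the chunk references.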
from kerchunk.
sure, postprocessing the individual files is not too difficult, and I'll do that for now.
However, since the files I investigated are from multiple sources (I think?), it might be good to also catch this here. Am I correct in assuming that these values should not be out of sync?
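To check that assumption mechanically, a small scan can report the variables whose .zattrs _FillValue disagrees with the .zarray fill_value (the helper name and the sample dict below are hypothetical; this assumes the standard {"version": ..., "refs": {...}} layout):

```python
import json

def find_fill_value_mismatches(refs):
    """Return the variables in a kerchunk references dict whose
    .zattrs _FillValue disagrees with their .zarray fill_value."""
    mismatched = []
    store = refs["refs"]
    for key, value in store.items():
        if not key.endswith("/.zattrs"):
            continue
        name = key[: -len("/.zattrs")]
        zattrs = json.loads(value)
        if "_FillValue" not in zattrs:
            continue
        zarray = json.loads(store[f"{name}/.zarray"])
        if zarray["fill_value"] != zattrs["_FillValue"]:
            mismatched.append(name)
    return mismatched

# hypothetical refs dict with one inconsistent variable
refs = {
    "version": 1,
    "refs": {
        "temp/.zattrs": json.dumps({"_FillValue": -999}),
        "temp/.zarray": json.dumps({"fill_value": 0}),
        "salt/.zattrs": json.dumps({"_FillValue": -1}),
        "salt/.zarray": json.dumps({"fill_value": -1}),
    },
}
```

Running the scan over all single-file reference sets before combining would surface exactly the out-of-sync cases discussed here.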
@cgentemann this seems similar to the problem you were having a while back with the fill values, did you ever figure that out?
I have a feeling that, in general, there can be many things that are wrong/inconsistent in original data files! I don't know that we can cover them all, but perhaps we can auto-correct common issues. The ability to add custom processing is more powerful, however, and I suspect that many datasets will require some form of custom processing. In fact, I think that kerchunk workflows (in pangeo-forge or not) are an opportunity to apply those things so that users don't have to.
right, that makes sense. I was thinking that rather than postprocessing the output of .translate(), it would be much easier to postprocess with the variable's zarr object, but of course the details of hooks like that are tricky to get right.
For reference, here's my hacky postprocessing function:
import itertools

import ujson


def correct_fill_values(data):
    # Make each variable's .zarray "fill_value" agree with the
    # "_FillValue" attribute declared in its .zattrs.
    def fix_variable(values):
        zattrs = values[".zattrs"]
        if "_FillValue" not in zattrs:
            return values
        fill_value = zattrs["_FillValue"]
        if values[".zarray"]["fill_value"] != fill_value:
            values[".zarray"]["fill_value"] = fill_value
        return values

    refs = data["refs"]
    # Split "variable/entry" keys into (variable, entry) tuples.
    prepared = (
        (tuple(key.split("/")), value) for key, value in refs.items() if "/" in key
    )
    # Keep only the metadata entries and parse their JSON bodies.
    filtered = (
        (key, ujson.loads(value))
        for key, value in prepared
        if key[1] in (".zattrs", ".zarray")
    )
    # Group the metadata by variable name: {".zattrs": {...}, ".zarray": {...}}.
    by_name = lambda item: item[0][0]
    grouped = (
        (name, {n[1]: v for n, v in group})
        for name, group in itertools.groupby(sorted(filtered, key=by_name), key=by_name)
    )
    fixed = ((name, fix_variable(var)) for name, var in grouped)
    # Flatten back to "variable/entry" keys with JSON-encoded values.
    flattened = {
        f"{name}/{entry}": ujson.dumps(metadata, indent=4)
        for name, var in fixed
        for entry, metadata in var.items()
    }
    data["refs"] = dict(sorted((refs | flattened).items()))
    return data
it would be much easier to postprocess with the variable's zarr object
Yes, totally agree with this too. We could have a bunch of optional processing functions. As you say, it's tricky, because (for instance) processing the zarr object might not be quite the same as processing the xarray view of the same thing.