Comments (6)
At the combine stage, we allow a preprocessor argument to modify datasets. We could have something similar for the individual files; having said that, loading the produced JSON and removing the offending tags should not be too difficult.
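A minimal sketch of that per-file cleanup (the drop_attr helper and the sample refs dict are hypothetical, standing in for the JSON a single-file scanner produces; json is used here, but ujson works the same way):

```python
import json

def drop_attr(refs, variable, attr):
    """Remove one attribute from a variable's .zattrs entry in a
    kerchunk references dict (the {"version": ..., "refs": {...}} layout)."""
    key = f"{variable}/.zattrs"
    zattrs = json.loads(refs["refs"][key])
    zattrs.pop(attr, None)  # no-op if the attribute is absent
    refs["refs"][key] = json.dumps(zattrs)
    return refs

# hypothetical single-file reference set containing an offending tag
refs = {
    "version": 1,
    "refs": {"temp/.zattrs": json.dumps({"_FillValue": -999, "units": "K"})},
}
cleaned = drop_attr(refs, "temp", "_FillValue")
```

The other attributes of the variable are left untouched, so this can be run after translate() without disturbing the chunk references.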
from kerchunk.
sure, postprocessing the individual files is not too difficult, and I'll do that for now.
However, since the files I investigated are from multiple sources (I think?), it might be good to also catch this here. Am I correct in assuming that these values should not be out of sync?
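To check that assumption mechanically, a small scan can report the variables whose .zattrs _FillValue disagrees with the .zarray fill_value (the helper name and the sample dict below are hypothetical; this assumes the standard {"version": ..., "refs": {...}} layout):

```python
import json

def find_fill_value_mismatches(refs):
    """Return the variables in a kerchunk references dict whose
    .zattrs _FillValue disagrees with their .zarray fill_value."""
    mismatched = []
    store = refs["refs"]
    for key, value in store.items():
        if not key.endswith("/.zattrs"):
            continue
        name = key[: -len("/.zattrs")]
        zattrs = json.loads(value)
        if "_FillValue" not in zattrs:
            continue
        zarray = json.loads(store[f"{name}/.zarray"])
        if zarray["fill_value"] != zattrs["_FillValue"]:
            mismatched.append(name)
    return mismatched

# hypothetical refs dict with one inconsistent variable
refs = {
    "version": 1,
    "refs": {
        "temp/.zattrs": json.dumps({"_FillValue": -999}),
        "temp/.zarray": json.dumps({"fill_value": 0}),
        "salt/.zattrs": json.dumps({"_FillValue": -1}),
        "salt/.zarray": json.dumps({"fill_value": -1}),
    },
}
```

Running the scan over all single-file reference sets before combining would surface exactly the out-of-sync cases discussed here.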
@cgentemann this seems similar to the problem you were having a while back with the fill values, did you ever figure that out?
I have a feeling that, in general, there can be many things that are wrong/inconsistent in original data files! I don't know that we can cover them all, but perhaps we can auto-correct common issues. The ability to add custom processing is more powerful, however, and I suspect that many datasets will require some form of custom processing. In fact, I think that kerchunk workflows (in pangeo-forge or not) are an opportunity to apply those things so that users don't have to.
right, that makes sense. I was thinking that rather than postprocessing the output of .translate(), it would be much easier to postprocess with the variable's zarr object, but of course the details of hooks like that are tricky to get right.
For reference, here's my hacky postprocessing function:
import itertools

import ujson


def correct_fill_values(data):
    # Make each variable's .zarray "fill_value" agree with the
    # "_FillValue" attribute declared in its .zattrs.
    def fix_variable(values):
        zattrs = values[".zattrs"]
        if "_FillValue" not in zattrs:
            return values
        fill_value = zattrs["_FillValue"]
        if values[".zarray"]["fill_value"] != fill_value:
            values[".zarray"]["fill_value"] = fill_value
        return values

    refs = data["refs"]
    # Split "variable/entry" keys into (variable, entry) tuples.
    prepared = (
        (tuple(key.split("/")), value) for key, value in refs.items() if "/" in key
    )
    # Keep only the metadata entries and parse their JSON bodies.
    filtered = (
        (key, ujson.loads(value))
        for key, value in prepared
        if key[1] in (".zattrs", ".zarray")
    )
    # Group the metadata by variable name: {".zattrs": {...}, ".zarray": {...}}.
    by_name = lambda item: item[0][0]
    grouped = (
        (name, {n[1]: v for n, v in group})
        for name, group in itertools.groupby(sorted(filtered, key=by_name), key=by_name)
    )
    fixed = ((name, fix_variable(var)) for name, var in grouped)
    # Flatten back to "variable/entry" keys with JSON-encoded values.
    flattened = {
        f"{name}/{entry}": ujson.dumps(metadata, indent=4)
        for name, var in fixed
        for entry, metadata in var.items()
    }
    data["refs"] = dict(sorted((refs | flattened).items()))
    return data
it would be much easier to postprocess with the variable's zarr object
Yes, totally agree with this too. We could have a bunch of optional processing functions. As you say, it's tricky, because (for instance) processing the zarr object might not be quite the same as processing the xarray view of the same thing.