Comments (27)
OK, so the task is to find out why _get_array_dims failed in this case. Perhaps this is because the file isn't one netCDF, but several netCDFs stored in a hierarchy - I think this is the first such example. I would set a breakpoint in _get_array_dims to figure out why ["phony_dim_0", "phony_dim_1"] are not being found.
I think you are checking for the case when there are dimensions (i.e., a non-empty shape), but _get_array_dims doesn't populate any names at all.
To investigate, you might want to start with fs.ls(...) to figure out what files you can see, and for the metadata ones (".zarray", ...) look at their contents. Maybe check that you can access some of the data paths.
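(For context, fs1 in the replies below is presumably a reference filesystem built from the kerchunk JSON; a minimal sketch, with the file name and options as assumptions:)
import fsspec

# Open the kerchunk-generated reference JSON as a filesystem so the
# zarr-style keys can be listed; "VNP14A1.json" is a placeholder name.
fs1 = fsspec.filesystem(
    "reference",
    fo="VNP14A1.json",
    remote_protocol="az",  # protocol of the store holding the original HDF5
)
print(fs1.ls("HDFEOS/GRIDS/VNP14A1_Grid/Data Fields"))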
I'm able to see all the files stored for a variable (FireMask) in the HDFEOS/GRIDS/VNP14A1_Grid/Data Fields group:
fs1.ls("HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask")
[{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/.zarray',
'type': 'file',
'size': 269},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/.zattrs',
'type': 'file',
'size': 345},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/0.0',
'type': 'file',
'size': 5206},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/1.0',
'type': 'file',
'size': 4805},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/2.0',
'type': 'file',
'size': 3385},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/3.0',
'type': 'file',
'size': 2918},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/4.0',
'type': 'file',
'size': 3302},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/5.0',
'type': 'file',
'size': 2643},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/6.0',
'type': 'file',
'size': 2558},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/7.0',
'type': 'file',
'size': 4857},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/8.0',
'type': 'file',
'size': 5784},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/9.0',
'type': 'file',
'size': 5196},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/10.0',
'type': 'file',
'size': 7293},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/11.0',
'type': 'file',
'size': 6792},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/12.0',
'type': 'file',
'size': 5104},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/13.0',
'type': 'file',
'size': 3153},
{'name': 'HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/14.0',
'type': 'file',
'size': 5435}]
Looking at the contents of .zarray, I get:
{'chunks': [80, 1200],
'compressor': {'id': 'zlib', 'level': 4},
'dtype': '|u1',
'fill_value': 0,
'filters': None,
'order': 'C',
'shape': [1200, 1200],
'zarr_format': 2}
So I don't see any dimension info, but I'm also not sure what's supposed to be in this metadata and what isn't.
This is an HDF-EOS5 file (https://earthdata.nasa.gov/esdis/eso/standards-and-references/hdf-eos5) and its content is not necessarily compatible with xarray's data model. Have you already used such files with xarray before?
Not extensively, but I was able to open the file locally with xarray, and as long as I specified the group the data variables were in, it seemed to work fine.
(btw: the dimensions info is in the .zattrs file)
I notice you are giving a remote protocol (az) but the URLs are to a local file.
What URL are you referring to? The JSON specifies az://modis-006/VNP14A1/08/04/2020001/VNP14A1.A2020001.h08v04.001.2020003132203.h5.
I included a local file to show that xarray can open the file fine locally. I think I might be misunderstanding something?
Does it work with protocol "file"?
If I give remote_protocol="file" I get the same dimension error.
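(For reference, the kind of call under discussion, as a sketch: the JSON file name is an assumption, and remote_protocol="file" can only resolve the chunk references if the URLs in the JSON point at a local copy.)
import xarray as xr

# Open the reference set through xarray's zarr engine; "VNP14A1.json"
# is a placeholder for the actual reference JSON.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "storage_options": {"fo": "VNP14A1.json", "remote_protocol": "file"},
        "consolidated": False,
    },
)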
> (btw: the dimensions info is in the .zattrs file)
fs1.cat("HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask/.zattrs")
yields:
{'_ARRAY_DIMENSIONS': [],
'legend': 'Classes:\n0 missing input data\n1 not processed (trim)\n2 not processed (obsolete)\n3 non-fire water\n4 cloud\n5 non-fire land\n6 unknown\n7 fire (low confidence)\n8 fire (nominal confidence)\n9 fire (high confidence)',
'long_name': 'fire mask',
'valid_range': [0, 9]}
'_ARRAY_DIMENSIONS': [] - this is clearly wrong! Can you find the same variable in the original and see what the dimensions ought to be?
Haha, yeah, that's the problem. The dimensions of the variable are (1200, 1200). Opening the file directly with xarray yields:
print(xr.open_dataset("./VNP14A1.A2020001.h08v04.001.2020003132203.h5", group="HDFEOS/GRIDS/VNP14A1_Grid/Data Fields"))
<xarray.Dataset>
Dimensions: (phony_dim_0: 1200, phony_dim_1: 1200)
Dimensions without coordinates: phony_dim_0, phony_dim_1
Data variables:
FireMask (phony_dim_0, phony_dim_1) uint8 ...
MaxFRP (phony_dim_0, phony_dim_1) float64 ...
QA (phony_dim_0, phony_dim_1) uint8 ...
sample (phony_dim_0, phony_dim_1) float32 ...
The culprit is this line: https://github.com/intake/fsspec-reference-maker/blob/67ccf7111709707d2643458ddc47872ab6d768c4/fsspec_reference_maker/hdf.py#L223
While it does correctly get rank=2 earlier, since dset.shape returns (1200, 1200), len(dset.dims) returns 0; it seems the dims are not iterable here.
Part of the problem, I think, is that HDF5 files do not have named dimensions. Some engines like netCDF4 will name the dimensions phony_dim_x, but it doesn't appear that's happening here.
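(A small h5py sketch of the inspection being described, using the local file path from earlier:)
import h5py

# Look at the dataset kerchunk trips over: the shape gives rank 2, but
# no dimension scales are attached, so no names can be read from .dims.
with h5py.File("./VNP14A1.A2020001.h08v04.001.2020003132203.h5", "r") as f:
    dset = f["HDFEOS/GRIDS/VNP14A1_Grid/Data Fields/FireMask"]
    print(dset.shape)      # (1200, 1200)
    print(len(dset.dims))  # reported as 0 in this thread
    # dset.dims[0][0] raises RuntimeError (see the traceback below)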
Well, I should say the culprit is not really just that line. Even if it did return num_scales=2, it would still fail on https://github.com/intake/fsspec-reference-maker/blob/67ccf7111709707d2643458ddc47872ab6d768c4/fsspec_reference_maker/hdf.py#L225, since dset.dims[0][0] raises:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/Users/lucass/anaconda3/lib/python3.8/site-packages/h5py/_hl/dims.py", line 74, in __getitem__
h5ds.iterate(self._id, self._dimension, scales.append, 0)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5ds.pyx", line 167, in h5py.h5ds.iterate
File "h5py/defs.pyx", line 4300, in h5py.defs.H5DSiterate_scales
RuntimeError: Unspecified error in H5DSiterate_scales (return value <0)
@ajelenak, perhaps you have an opinion on this too?
Can you please share the original data file on anywhere-but-azure?
The HDF5 file does not have dimension scales, which is the mechanism used to generate the array dimension info in the reference JSON. That's why the h5py error: somehow the code was forced to execute on an HDF5 dataset without dimension scales.
The quick fix is to add the ability to generate phony dims so xarray can handle this file. However, information about coordinates is available in HDF-EOS5 files like this one, but I don't know if that is currently supported by xarray.
HDF created a command-line tool that makes HDF-EOS5 files netCDF-friendly, but this may not be a workable option here.
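(For what it's worth, the HDF-EOS5 coordinate and grid definitions are stored as ODL text in a StructMetadata dataset; a sketch of how to inspect it with h5py, assuming the standard HDF-EOS5 layout:)
import h5py

# Dump the HDF-EOS5 structural metadata, which describes the grids and
# their projection/extent; xarray does not interpret this on its own.
with h5py.File("./VNP14A1.A2020001.h08v04.001.2020003132203.h5", "r") as f:
    meta = f["HDFEOS INFORMATION/StructMetadata.0"][()]
    print(meta.decode() if isinstance(meta, bytes) else meta)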
Thanks, @ajelenak.
@lsterzinger, when xarray opened the original data, it was able to infer the coordinates here, right? It has a lot of hidden magic inside, of course, but maybe we can find out how that happens.
Like @ajelenak said, I believe the netCDF4 engine (which I think xarray uses by default) adds phony dims to an HDF5 file. If you do an ncdump -h on a file, those phony dims will also be there.
So can we just put '_ARRAY_DIMENSIONS': ["phony_dim_x", ...]? Perhaps you could edit the JSON to see if this allows xarray to proceed.
> the netCDF4 engine (which I think xarray uses by default) adds phony dims to an HDF5 file. If you do an ncdump -h on a file, those phony dims will also be there.
Just to clarify: what ncdump shows in the case of netCDF-4 (HDF5) files is a view of the file content, interpreted according to the netCDF data model. So those phony dims are not added to the file.
@ajelenak, I think that's what @lsterzinger means :)
Yeah, bad wording on my part. Nothing is added to the file, of course, but ncdump and the netCDF engine interpret the file according to the netCDF spec and return phony dims accordingly.
> So can we just put '_ARRAY_DIMENSIONS': ["phony_dim_x", ...]? Perhaps you could edit the JSON to see if this allows xarray to proceed.
@martindurant Yes, this works. I replaced all instances of '_ARRAY_DIMENSIONS' with the added phony dims (one for each variable) and I was able to open the remote file with xarray and engine='zarr'.
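(Roughly what that JSON edit looks like as a script; this is a sketch assuming a version-1 reference file with a top-level "refs" mapping, and the file names are placeholders:)
import json

# Fill each empty _ARRAY_DIMENSIONS with phony dims derived from the
# variable's rank, read from the sibling .zarray entry.
with open("VNP14A1.json") as f:
    spec = json.load(f)

refs = spec["refs"]
for key in list(refs):
    if key.endswith("/.zattrs"):
        attrs = json.loads(refs[key])
        if attrs.get("_ARRAY_DIMENSIONS") == []:
            zarray = json.loads(refs[key[:-len(".zattrs")] + ".zarray"])
            attrs["_ARRAY_DIMENSIONS"] = [
                "phony_dim_%d" % i for i in range(len(zarray["shape"]))
            ]
            refs[key] = json.dumps(attrs)

with open("VNP14A1_patched.json", "w") as f:
    json.dump(spec, f)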
Well, OK then!
So we need to pick these from a predefined list of, what, up to five labels (x, y, z, t, ...), in the case that the labels don't get generated directly by h5py.
I can attempt to have _get_array_dims add these dimensions manually, but I'm not sure of the best way for it to determine whether that's needed. The dset.dims are different between a netCDF and an HDF5 file, so I think checking that first might be the best way forward.
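(One possible shape for that, as a sketch - not the actual kerchunk code, and the scale-name handling is an assumption:)
# Fall back to phony names when an axis has no dimension scale attached,
# mirroring what the netCDF4 library does for plain HDF5 files.
def _get_array_dims(dset):
    dims = []
    for n in range(len(dset.shape)):  # use the rank; dset.dims may be
        try:                          # empty or unusable for this file
            dims.append(dset.dims[n][0].name.rsplit("/", 1)[-1])
        except (IndexError, RuntimeError):
            dims.append("phony_dim_%d" % n)  # no scale: synthesize a name
    return dims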
> So we need to pick these from a predefined list of, what, up to five labels (x, y, z, t, ...), in the case that the labels don't get generated directly by h5py.
This sounds like it would work well, but it would make assumptions about what is x, y, z, etc. Maybe just phony_dim_0/phony_dim_1, like what netCDF/ncdump currently do?
> Maybe just phony_dim_0/phony_dim_1
Sure.