Comments (6)
In [14]: fs = fsspec.filesystem("s3", anon=True)
In [17]: fs.head("s3://noaa-nwm-retrospective-2-1-pds/forcing/2007/2007010100.LDASIN_DOMAIN1", 8)
Out[17]: b'CDF\x01\x00\x00\x00\x00'
It appears to be the "classic netCDF CDF-1 format" (see here). That would need a separate conversion class; the file format itself is simpler, but I don't know whether the old CDF libraries will be as convenient. If the chunking remains the same, there would in principle be no problem combining the different file formats into a global kerchunked dataset.
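For reference, the netCDF flavour can be told apart from those leading bytes alone (as returned by `fs.head(path, 8)` above). A minimal sketch; the helper name is made up for illustration, not a kerchunk API:

```python
# Hypothetical helper (not part of kerchunk): classify a netCDF file
# from its first few bytes. The magic numbers come from the netCDF and
# HDF5 format specifications.
def sniff_netcdf_format(magic: bytes) -> str:
    """Return a rough format label from a file's leading bytes."""
    if magic.startswith(b"\x89HDF\r\n\x1a\n"):
        return "netCDF-4 (HDF5)"
    if magic.startswith(b"CDF\x01"):
        return "classic (CDF-1)"
    if magic.startswith(b"CDF\x02"):
        return "64-bit offset (CDF-2)"
    if magic.startswith(b"CDF\x05"):
        return "64-bit data (CDF-5)"
    return "unknown"

print(sniff_netcdf_format(b"CDF\x01\x00\x00\x00\x00"))  # classic (CDF-1)
```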
@rsignell-usgs , any idea why it looks like the data format became older in 2007? Or was this some sort of HDF5 (not CDF) -> CDF (not HDF) evolution?
from kerchunk.
Just adding some extra observations: all post-2007 NWM2.1 files are not only bigger (540 MB each), but also follow a slightly different naming convention (e.g., 10-digit 2007010100 vs. 12-digit 199602182000).
In any case, I'd appreciate it if the people in this forum could help me find a temporary solution. I've already spent several days converting pre-2007 files.
from kerchunk.
I don't anticipate having time to implement a netCDF<4 scanner in the near term, but perhaps someone else has? At a guess, the files are much larger because there is no compression; but maybe the chunking is still the same.
from kerchunk.
@dialuser, also note that the NWM2.1 data is already available in Zarr format from
https://registry.opendata.aws/nwm-archive/
Specifically:
aws s3 ls s3://noaa-nwm-retrospective-2-1-zarr-pds/ --no-sign-request
The rechunking-and-conversion-to-zarr was done by @jmccreight who would likely be able to answer these questions if necessary.
from kerchunk.
@rsignell-usgs Thanks for pinging me here.
@dialuser I changed jobs and had COVID, so your email fell through the cracks. I was looking for your email recently but could not find it. I had several inquiries on this exact topic, which also confused me.
Thanks for these questions. The answer is that no single conversion process or person produced all the LDASIN files here. I'm not fully up to speed on what was done, but I ran into some similar (though different) issues myself when processing the data on NCAR systems.
https://github.com/NCAR/rechunk_retro_nwm_v21/blob/da170bf2af462a4a117ceebc39f751d3ba91ea74/precip/symlink_aorc.py#L18
You can see there are essentially 3 different periods of data with different conventions. (At least it's finite, right?)
I can anecdotally confirm what @martindurant uncovered above
jamesmcc@casper-login2[1017]:/glade/p/cisl/nwc/nwm_forcings/AORC> for ff in $f1 $f2 $f3; do echo $ff: `ncdump -k $ff`; done
/glade/campaign/ral/hap/zhangyx/AORC.Forcing/2007/200702010000.LDASIN_DOMAIN1: netCDF-4
/glade/p/cisl/nwc/nwm_forcings/AORC/2007020101.LDASIN_DOMAIN1: classic
/glade/p/cisl/nwc/nwm_forcings/AORC/202002010100.LDASIN_DOMAIN1: netCDF-4
I believe that the file size difference is because no compression ("deflate level") is available for classic files (as @martindurant pointed out), while _DeflateLevel = 2 is applied to the other, netCDF-4 files (that I looked at). I was surprised to see that there is chunking in the netCDF-4 files: for (time, y, x), _ChunkSizes = 1, 768, 922. There appears to be no chunking in the classic files (as far as I can tell).
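A back-of-the-envelope check of what those chunk sizes imply, assuming float32 values (the dtype is an assumption here, not stated above):

```python
# One HDF5 chunk with _ChunkSizes = 1, 768, 922 at 4 bytes per value
# (float32 assumed) comes out to roughly 2.7 MiB uncompressed, which is
# why deflate makes such a difference to the overall file size.
chunk_shape = (1, 768, 922)
bytes_per_value = 4
chunk_bytes = chunk_shape[0] * chunk_shape[1] * chunk_shape[2] * bytes_per_value
print(chunk_bytes)                    # 2832384
print(round(chunk_bytes / 2**20, 2))  # 2.7 (MiB, before compression)
```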
I don't expect much can be done on the NCAR/NOAA end at this point except to take note that this is a problem. I will connect you with at least one other user who is interested in this data; perhaps you can collaborate on a solution (I may point them here). It would be nice to see. I honestly did not know that all this forcing data was part of the release; I thought that only the Zarr precip field that I processed was released.
from kerchunk.
Note on this point:
It appears there is no chunking in the classic (as far as I can tell)
If the blocks are not compressed then, from a kerchunk point of view, we can pick any chunking we like along the biggest dimension (and the second-biggest, if we choose a chunk size of 1 for the biggest), so it may still be possible to get consistency across the different file species.
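The idea above can be sketched as follows: for an uncompressed array stored contiguously, references along the first dimension are just evenly spaced byte ranges. All names and numbers here are illustrative (not taken from the real LDASIN files), and this ignores classic netCDF's interleaved record layout, where the stride between records would differ:

```python
# Illustrative sketch: generate kerchunk-style (url, offset, length)
# references for chunks of shape (1, ny, nx) carved out of one
# contiguously stored, uncompressed array.
def chunk_refs(url, var_offset, shape, itemsize=4):
    """Map each index along axis 0 to a (url, offset, length) reference."""
    nrec, ny, nx = shape
    rec_bytes = ny * nx * itemsize  # bytes in one (1, ny, nx) slab
    return {
        f"{i}.0.0": (url, var_offset + i * rec_bytes, rec_bytes)
        for i in range(nrec)
    }

refs = chunk_refs("s3://bucket/file.nc", var_offset=1024, shape=(3, 768, 922))
print(refs["1.0.0"])  # ('s3://bucket/file.nc', 2833408, 2832384)
```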
from kerchunk.