Comments (6)
I see there was some discussion about a library that could facilitate this kind of data access in #281.
If I can get some insight into how and where this should be implemented, I'd be happy to take a crack at it.
Perhaps it would be possible to generate the gzip index sidecar during scan_grib and save it as a metadata field, base64-encoded so it can be decoded later? And then somehow incorporate that info into dereference_archives?
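As a sketch of the base64 idea with stdlib only — note that `gzip_index` is a hypothetical key, not an existing kerchunk metadata field, and the index bytes here are a stand-in for whatever indexed_gzip would actually produce:

```python
import base64
import json

# Stand-in for the binary gzip index produced for one archive
# (in reality this would come from indexed_gzip, not a literal).
index_bytes = b"\x1f\x8b-index-checkpoint-data" * 100

# Encode the index so it can live inside a JSON references file.
# "gzip_index" is a hypothetical key, not an existing kerchunk field.
refs = {
    "version": 1,
    "gzip_index": base64.b64encode(index_bytes).decode("ascii"),
}

# Later, a reader can round-trip through JSON and recover the exact bytes.
recovered = base64.b64decode(json.loads(json.dumps(refs))["gzip_index"])
assert recovered == index_bytes
```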
Yes, you are completely on the right lines about the kind of work it would take to reference byte ranges within a compressed file. Tying the pieces together would probably not be that simple, though. You should be aware that the gzip version of indexing (as opposed to bzip2 or zstd) requires storing a rather large amount of data: 32kB per checkpoint. The current state of indexed_gzip doesn't allow you to pick your checkpoints, but we could generate many and keep only the ones we need in a two-step process.
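A back-of-the-envelope calculation shows why the 32kB-per-checkpoint cost matters. The file size and spacing below are arbitrary assumptions for illustration, not anything indexed_gzip mandates:

```python
# Rough index-size estimate: each checkpoint stores a ~32 kB window of
# decompressed data, so index size = (file size / spacing) * 32 kB.
file_size = 2 * 1024**3        # e.g. a 2 GiB uncompressed file (assumed)
spacing = 1024**2              # one checkpoint per 1 MiB (assumed)
checkpoint_bytes = 32 * 1024   # 32 kB window per checkpoint

n_checkpoints = file_size // spacing
index_size = n_checkpoints * checkpoint_bytes

print(n_checkpoints, index_size / 1024**2)  # 2048 checkpoints, 64.0 MiB
```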
Yeah, I don't imagine this will be simple in the slightest, but it would certainly be cool if it worked!
Right now I'm just trying to wrap my head around the indexed_gzip library and how kerchunk actually needs to interface with it. While I have some experience working with things like grib2/hdf5/netcdf at a low-ish level, I've never really worked with archives or had to think about how they're stored.
This might be naive or dumb, but I did notice this in the indexed_gzip documentation for IndexedGzipFile:
:arg auto_build: If ``True`` (the default), the index is
automatically built on calls to :meth:`seek`.
Does this imply that an index can be built based on the calls to seek? If so, maybe it would be possible to build the index for the seek points while scan_grib is decoding the grib2 messages with eccodes/cfgrib, which in turn could provide the bare minimum number of checkpoints needed to read arrays from their start bytes. My thought is that scan_grib is already reading the decompressed grib2 metadata, so the uncompressed byte ranges could be used to generate the index... but perhaps I'm misunderstanding some terminology here.
If I'm not totally out to lunch here, then the size of the sidecar file would scale with the number of grib2 messages/arrays in the file. Probably not ideal, but neither is storing gzip-compressed grib2 data in a cloud storage bucket. I'd prefer it if people would just make their data usable to begin with, but that's a dream that'll never come true :).
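To put a number on the per-message scaling: if one 32kB checkpoint is kept per grib2 message, the sidecar grows with message count rather than file size. The message count here is made up purely for illustration:

```python
# One retained checkpoint per grib2 message, 32 kB each.
n_messages = 500                  # hypothetical file with 500 messages
checkpoint_bytes = 32 * 1024

sidecar_size = n_messages * checkpoint_bytes
print(sidecar_size / 1024**2)     # 15.625 MiB for 500 messages
```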
Does this imply that an index can be built based off of the calls to seek?
No; when you seek forward in the file, indexed_gzip will write all of the checkpoints up to that point. This is because gzip must be streamed through in order to know where you are at the bit level. It should be possible to save only checkpoints of interest, but that would require editing the code in indexed_gzip. From the outside, probably the best we can do is to generate all the checkpoints to a local file at a reasonably small spacing. Next, take a second pass through and keep only the ones immediately before grib message offsets. Maybe that is small enough to inline into a references file, or maybe we store it as a separate sidecar (I am leaning toward the latter).
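The second pass could be a simple nearest-checkpoint-below search. A minimal sketch with made-up offsets — the real checkpoint positions would come from the index file indexed_gzip writes, whose binary format this does not parse:

```python
import bisect

# Uncompressed offsets of the checkpoints from the first, dense pass
# (one every 64 kB here, purely illustrative).
checkpoints = list(range(0, 10 * 1024**2, 64 * 1024))

# Uncompressed start offsets of the grib2 messages (made up).
message_offsets = [120_000, 3_500_000, 7_777_777]

def checkpoint_before(offset, points):
    """Return the last checkpoint at or before the given offset."""
    i = bisect.bisect_right(points, offset) - 1
    return points[max(i, 0)]

# Keep only the checkpoint immediately before each message.
keep = sorted({checkpoint_before(off, checkpoints) for off in message_offsets})
print(keep)  # [65536, 3473408, 7733248]
```

Deduplicating with a set matters because two messages close together can share the same preceding checkpoint.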
the size of the side-car file would scale with the number of grib2 messages/arrays in the file
Yes, 32kB per offset. Probably too big to store in a JSON file (where the blocks would need to be base64 encoded) once you have several messages per file and then combinations of potentially many files. Naturally, these 32kB blocks will not compress well.
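The base64 overhead makes inlining even less attractive: base64 expands every 3 bytes into 4 characters, so each 32kB checkpoint becomes roughly 43kB of JSON text before any message count multiplies it:

```python
import base64

# One 32 kB checkpoint's worth of data (contents don't matter for sizing).
checkpoint = bytes(32 * 1024)
encoded = base64.b64encode(checkpoint)

print(len(checkpoint), len(encoded))  # 32768 -> 43692 characters
```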