Comments (6)
I see there was some discussion about a library that could facilitate this kind of data access in #281.
If I can get some insight into how and where this should be implemented, I'd be happy to take a crack at it.
Perhaps it would be possible to generate the gzip index sidecar during scan_grib and save it as a metadata field, base64-encoded so it can be decoded later? And then somehow incorporate that info into dereference_archives?
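As a sketch of the base64 idea with stdlib only — note that `gzip_index` is a hypothetical key, not an existing kerchunk metadata field, and the index bytes here are a stand-in for whatever indexed_gzip would actually produce:

```python
import base64
import json

# Stand-in for the binary gzip index produced for one archive
# (in reality this would come from indexed_gzip, not a literal).
index_bytes = b"\x1f\x8b-index-checkpoint-data" * 100

# Encode the index so it can live inside a JSON references file.
# "gzip_index" is a hypothetical key, not an existing kerchunk field.
refs = {
    "version": 1,
    "gzip_index": base64.b64encode(index_bytes).decode("ascii"),
}

# Later, a reader can round-trip through JSON and recover the exact bytes.
recovered = base64.b64decode(json.loads(json.dumps(refs))["gzip_index"])
assert recovered == index_bytes
```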
Yes, you are completely on the right lines about the kind of work it would take to reference byte ranges within a compressed file. Tying the pieces together would probably not be that simple, though. You should be aware that the gzip version of indexing (as opposed to bzip2 or zstd) requires storing a rather large amount of data: 32kB per checkpoint. The current state of indexed_gzip doesn't allow you to pick your checkpoints, but we could generate many and keep only the ones we need in a two-step process.
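A back-of-the-envelope calculation shows why the 32kB-per-checkpoint cost matters. The file size and spacing below are arbitrary assumptions for illustration, not anything indexed_gzip mandates:

```python
# Rough index-size estimate: each checkpoint stores a ~32 kB window of
# decompressed data, so index size = (file size / spacing) * 32 kB.
file_size = 2 * 1024**3        # e.g. a 2 GiB uncompressed file (assumed)
spacing = 1024**2              # one checkpoint per 1 MiB (assumed)
checkpoint_bytes = 32 * 1024   # 32 kB window per checkpoint

n_checkpoints = file_size // spacing
index_size = n_checkpoints * checkpoint_bytes

print(n_checkpoints, index_size / 1024**2)  # 2048 checkpoints, 64.0 MiB
```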
Yeah, I don't imagine this will be simple in the slightest, but it would certainly be cool if it worked!
Right now I'm just trying to wrap my head around the indexed_gzip library and how kerchunk actually needs to interface with it. While I have some experience working with things like grib2/hdf5/netcdf at a low-ish level, I've never really worked with archives or had to think about how they're stored.
This might be naive or dumb, but I did notice this in the indexed_gzip documentation for IndexedGzipFile:
:arg auto_build: If ``True`` (the default), the index is
automatically built on calls to :meth:`seek`.
Does this imply that an index can be built based on the calls to seek? If so, maybe it would be possible to build the index for the seek points while scan_grib is decoding the grib2 messages with eccodes/cfgrib, which in turn could provide the bare minimum number of checkpoints needed to read arrays from their start bytes. My thought is that scan_grib is already reading the decompressed grib2 metadata, so the uncompressed byte ranges could be used to generate the index... but perhaps I'm misunderstanding some terminology here.
If I'm not totally out to lunch here, then the size of the sidecar file would scale with the number of grib2 messages/arrays in the file. Probably not ideal, but neither is storing gzip-compressed grib2 data in a cloud storage bucket. I'd prefer it if people would just make their data usable to begin with, but that's a dream that'll never come true :).
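To put a number on the per-message scaling: if one 32kB checkpoint is kept per grib2 message, the sidecar grows with message count rather than file size. The message count here is made up purely for illustration:

```python
# One retained checkpoint per grib2 message, 32 kB each.
n_messages = 500                  # hypothetical file with 500 messages
checkpoint_bytes = 32 * 1024

sidecar_size = n_messages * checkpoint_bytes
print(sidecar_size / 1024**2)     # 15.625 MiB for 500 messages
```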
Does this imply that an index can be built based off of the calls to seek?
No; when you seek forward in the file, indexed_gzip will write all of the checkpoints up to that point. This is because gzip must be streamed through in order to know where you are at the bit level. It should be possible to save only checkpoints of interest, but that would require editing the code in indexed_gzip. From the outside, probably the best we can do is to generate all the checkpoints to a local file at a reasonably small spacing. Next, take a second pass through and keep only the ones immediately before grib message offsets. Maybe that is small enough to inline into a references file, or maybe we store it as a separate sidecar (I am leaning toward the latter).
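The second pass could be a simple nearest-checkpoint-below search. A minimal sketch with made-up offsets — the real checkpoint positions would come from the index file indexed_gzip writes, whose binary format this does not parse:

```python
import bisect

# Uncompressed offsets of the checkpoints from the first, dense pass
# (one every 64 kB here, purely illustrative).
checkpoints = list(range(0, 10 * 1024**2, 64 * 1024))

# Uncompressed start offsets of the grib2 messages (made up).
message_offsets = [120_000, 3_500_000, 7_777_777]

def checkpoint_before(offset, points):
    """Return the last checkpoint at or before the given offset."""
    i = bisect.bisect_right(points, offset) - 1
    return points[max(i, 0)]

# Keep only the checkpoint immediately before each message.
keep = sorted({checkpoint_before(off, checkpoints) for off in message_offsets})
print(keep)  # [65536, 3473408, 7733248]
```

Deduplicating with a set matters because two messages close together can share the same preceding checkpoint.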
the size of the side-car file would scale with the number of grib2 messages/arrays in the file
Yes, 32kB per offset. Probably too big to store in a JSON file (where the blocks would need to be base64 encoded) once you have several messages per file and then combinations of potentially many files. Naturally, these 32kB blocks will not compress well.
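The base64 overhead makes inlining even less attractive: base64 expands every 3 bytes into 4 characters, so each 32kB checkpoint becomes roughly 43kB of JSON text before any message count multiplies it:

```python
import base64

# One 32 kB checkpoint's worth of data (contents don't matter for sizing).
checkpoint = bytes(32 * 1024)
encoded = base64.b64encode(checkpoint)

print(len(checkpoint), len(encoded))  # 32768 -> 43692 characters
```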