
Comments (6)

keltonhalbert commented on July 30, 2024

I see there was some discussion about a library that could facilitate this kind of data access in #281.

If I can get some insight into how and where this should be implemented, I'd be happy to take a crack at it.


keltonhalbert commented on July 30, 2024

Perhaps it would be possible to generate the gzip index sidecar during scan_grib and save it in a metadata field as a base64 string that can be decoded later? And then somehow incorporate that info into dereference_archives?
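For illustration, a minimal sketch of what that could look like. The build_full_index/export_index calls are indexed_gzip's own API, but the "gzip_index" key and the surrounding references layout are hypothetical, not an existing kerchunk convention:

    import base64
    import io

    import indexed_gzip as igzip

    def gzip_index_b64(path, spacing=1024 * 1024):
        """Build a full seek-point index for a .gz file and return it base64-encoded."""
        with igzip.IndexedGzipFile(path, spacing=spacing) as f:
            f.build_full_index()          # stream through once, creating checkpoints
            buf = io.BytesIO()
            f.export_index(fileobj=buf)   # serialise the index to bytes
        return base64.b64encode(buf.getvalue()).decode()

    # hypothetical: stash the index alongside the usual kerchunk references
    refs = {"version": 1, "refs": {}, "gzip_index": gzip_index_b64("data.grib2.gz")}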


martindurant commented on July 30, 2024

Yes, you are completely on the right lines for the kind of work it would take to reference byte ranges within a compressed file. Tying up the pieces would probably not be that simple... You should be aware that the gzip flavour of indexing (as opposed to bzip2 or zstd) requires storing a rather large amount of data, 32kB per checkpoint. The current state of indexed_gzip doesn't let you pick your checkpoints, but we could generate many and keep only the ones we need in a two-step process.
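To put rough numbers on that: with one checkpoint every `spacing` bytes of uncompressed data and a 32 KiB window stored per checkpoint, a back-of-envelope estimate looks like this (the 32 KiB figure is from the comment above; the spacing and file size are just example values):

    # index size ~ (uncompressed_size / spacing) * 32 KiB
    def index_size_estimate(uncompressed_size, spacing):
        n_checkpoints = uncompressed_size // spacing + 1
        return n_checkpoints * 32 * 1024

    # e.g. 2 GiB of uncompressed GRIB indexed every 1 MiB -> ~64 MiB of index
    print(index_size_estimate(2 * 1024**3, 1024**2) / 1024**2)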


keltonhalbert commented on July 30, 2024

Yeah, I don't imagine this will be simple in the slightest, but it would certainly be cool if it worked!

Right now I'm just trying to wrap my head around the indexed_gzip library and how kerchunk actually needs to interface with it. While I have some experience working with things like grib2/hdf5/netcdf at a low-ish level, I've never really worked with archives or had to think about how they're stored.

This might be naive or dumb, but I did notice this in the indexed_gzip documentation for IndexedGzipFile:

    :arg auto_build: If ``True`` (the default), the index is
                     automatically built on calls to :meth:`seek`.

Does this imply that an index can be built based off of the calls to seek? If so, maybe it would be possible to build the index for the seek points as scan_grib decodes the grib2 messages with eccodes/cfgrib, which in turn could provide the bare minimum number of checkpoints needed to read arrays from their start bytes? My thought is that scan_grib is already reading the decompressed grib2 metadata, so the uncompressed byte ranges could be used to generate the index... but perhaps I'm misunderstanding some terminology here.

If I'm not totally out to lunch here, then the size of the sidecar file would scale with the number of grib2 messages/arrays in the file. Probably not ideal, but neither is storing gzip-compressed grib2 data in a cloud storage bucket. I'd prefer it if people would just make their data usable to begin with, but that's a dream that'll never come true :).


martindurant commented on July 30, 2024

Does this imply that an index can be built based off of the calls to seek?

No; when you seek forward in the file, indexed_gzip will write all of the checkpoints up to that point. This is because gzip must be streamed through in order to know where you are up to at the bit level. It should be possible to only save checkpoints of interest, but that would require editing the code in indexed_gzip. From the outside, probably the best we can do is to generate all the checkpoints to a local file, at a reasonably small spacing. Next, take a second pass through and keep only the ones immediately before grib message offsets - maybe that is small enough to inline into a references file, or maybe we store this as a separate sidecar (I am leaning to the latter).
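A sketch of that two-pass idea, under the stated assumptions: the dense pass below uses indexed_gzip's real build_full_index/export_index/import_index machinery, while the pruning pass has no public API today and is only a placeholder:

    import indexed_gzip as igzip

    def build_dense_index(gz_path, index_path, spacing=256 * 1024):
        # pass 1: stream the whole file once, writing a checkpoint every
        # `spacing` bytes of uncompressed data, then save the index locally
        with igzip.IndexedGzipFile(gz_path, spacing=spacing) as f:
            f.build_full_index()
            f.export_index(index_path)

    def prune_index(index_path, grib_offsets):
        # pass 2 (hypothetical): keep only the checkpoints immediately before
        # each GRIB message offset -- this would mean parsing the exported
        # index format, which indexed_gzip does not currently expose
        raise NotImplementedError

    # a reader would later reuse the sidecar instead of rebuilding the index:
    #   f = igzip.IndexedGzipFile("data.grib2.gz", index_file="data.grib2.gz.gzidx")
    #   f.seek(message_offset); data = f.read(message_length)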


martindurant commented on July 30, 2024

the size of the side-car file would scale with the number of grib2 messages/arrays in the file

Yes, 32kB per offset. Maybe a bit big to store in a JSON file (where they would need to be base64-encoded), multiplied over several messages and then over potentially many files. Naturally, these 32kB blocks will not compress well.
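For a sense of scale, assuming the 32 kB-per-offset figure above and base64's roughly 4/3 expansion (the message and file counts are made-up examples):

    # rough size of inlining every checkpoint into a JSON references file
    def inlined_size(n_messages_per_file, n_files):
        raw = n_messages_per_file * 32 * 1024        # 32 kB per message offset
        return n_files * raw * 4 / 3                 # base64 expansion

    # e.g. 100 messages per file across 365 daily files -> ~1.6 GB of JSON
    print(inlined_size(100, 365) / 1e9)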

