
zip file format details · zipzap (closed, 2 comments)

mikeabdullah commented on August 19, 2024

Comments (2)

pixelglow commented on August 19, 2024

> The apparent duplication of LocalFileHeader and DataDescriptor strikes me as weird. From poking around zip archives I have here, it looks like LocalFileHeader generally reports file sizes to be 0. So I'm guessing unarchivers take that as a signal to consult the DataDescriptor for the true size of the file? Thus allowing a file to be written into the archive without yet knowing the size it will be when compressed.

Either:

  • the archiver knows, before writing the entry, what its size will be. Then the LocalFileHeader reports the true size and no DataDescriptor is written; or
  • the archiver doesn't know. Then the LocalFileHeader records a size of 0, and a DataDescriptor is written after the data with the true sizes.

The latter is often used in streaming scenarios. zipzap has both streaming and non-streaming scenarios, but for cleanliness of implementation I always treat them as streaming.
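The two cases above can be sketched from the local file header layout in the zip spec (a Python illustration, not zipzap's actual Objective-C code; the helper names are mine). In the streaming case, general-purpose flag bit 3 is set to tell unarchivers the real sizes live in a trailing DataDescriptor:

```python
import struct

LOCAL_FILE_HEADER_SIG = 0x04034B50
DATA_DESCRIPTOR_SIG = 0x08074B50
FLAG_USES_DATA_DESCRIPTOR = 1 << 3  # general-purpose flag bit 3

def local_file_header(name: bytes, crc32: int, csize: int, usize: int,
                      streaming: bool) -> bytes:
    """Build a 30-byte local file header plus filename. In streaming mode
    the CRC and sizes are zeroed and deferred to a data descriptor."""
    flags = FLAG_USES_DATA_DESCRIPTOR if streaming else 0
    if streaming:
        crc32 = csize = usize = 0  # unknown until the data has been written
    return struct.pack('<IHHHHHIIIHH',
                       LOCAL_FILE_HEADER_SIG,
                       20,        # version needed to extract
                       flags,
                       8,         # compression method: deflate
                       0, 0,      # mod time / date (stubbed out here)
                       crc32, csize, usize,
                       len(name), 0) + name

def data_descriptor(crc32: int, csize: int, usize: int) -> bytes:
    """Trailing record carrying the real sizes for a streamed entry."""
    return struct.pack('<IIII', DATA_DESCRIPTOR_SIG, crc32, csize, usize)

hdr = local_file_header(b'a.txt', 0, 0, 0, streaming=True)
flags = struct.unpack_from('<H', hdr, 6)[0]
print(bool(flags & FLAG_USES_DATA_DESCRIPTOR))  # → True: sizes deferred
```

An unarchiver reading this entry sees bit 3 set and zero sizes, and knows to trust the DataDescriptor (and ultimately the central directory) instead.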

> The bit I don't understand, though: how do unarchivers know how to find the individual files and the CentralFileHeader if each LocalFileHeader claims its data is of zero length?

This is tricky, and is really up to the quality of the unarchiver. I've based zipzap on the algorithm expressed in minizip, which has a fair amount of history to it.

  • Look for the EndOfCentralDirectory signature. Because the EndOfCentralDirectory is constrained to be at most 65557 bytes (22 fixed bytes plus a variable length comment of up to 65535 bytes at the end), we look for it in the last 65557 bytes of the file, searching backwards.
  • Having got a candidate EndOfCentralDirectory signature, we sanity-check as many of the EndOfCentralDirectory fields as possible. In particular, the comment length field has to be consistent with the actual discovered length of the EndOfCentralDirectory.

This search is complicated by the possibility that a perverse archiver could write a second (or third, etc.) EndOfCentralDirectory signature inside the variable length comment. Because of this (and for performance reasons), zipzap does a backward search and only uses the last header it finds. It could be argued that this is incorrect, since that is not a "real" header but part of the variable length comment. But I chose an algorithm that will work even with a perverse archiver, not necessarily one that will work correctly in such a situation.
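A minimal sketch of that backward search and sanity check (Python for illustration, not zipzap's actual code; the helper name is mine, with the 22-byte fixed record and 65535-byte comment limit taken from the zip spec):

```python
import struct

EOCD_SIG = b'PK\x05\x06'      # EndOfCentralDirectory signature
EOCD_FIXED = 22               # fixed-size portion of the record
MAX_COMMENT = 0xFFFF          # the comment length field is 16 bits

def find_eocd(data: bytes):
    """Backward search for the last EndOfCentralDirectory signature,
    sanity-checked against the comment length, in the spirit of the
    algorithm described above (illustrative sketch)."""
    # Only the last EOCD_FIXED + MAX_COMMENT bytes can hold the record.
    window_start = max(0, len(data) - EOCD_FIXED - MAX_COMMENT)
    pos = data.rfind(EOCD_SIG, window_start)
    if pos == -1 or pos + EOCD_FIXED > len(data):
        return None
    # Sanity check: the recorded comment length must make the record
    # run exactly to the end of the file.
    comment_len = struct.unpack_from('<H', data, pos + 20)[0]
    return pos if pos + EOCD_FIXED + comment_len == len(data) else None
```

Note how a signature planted inside the comment by a perverse archiver is the one this backward search finds first, which is exactly the trade-off discussed above.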

Some refinements I did consider:

  • Searching forward, but skipping inconsistent candidate EndOfCentralDirectory structures. This is probably more correct but performance-intensive, since you would always scan the last 64KB of any large zip file.
  • Searching backward, but continuing if the structure is inconsistent. This won't avoid the perverse archiver above, but might be slightly more robust, e.g. if the variable length comment happened to contain the signature.

Some other unarchivers search forward from the start of the zip. This has several issues:

  • it is I/O-intensive on large zip files, since you have to read a few bytes, skip ahead, and repeat (lather, rinse, repeat) to gather all the information you need.
  • it works 100% correctly in the non-streaming scenario, but you cannot guarantee that a zip file is non-streaming.
  • in the streaming scenario, you need to somehow limit the entry reading to the correct size without being told the size in the LocalFileHeader! That is possible for some formats, e.g. deflate compression, but not in general. It forces you to do expensive decompression just to discover metadata, or to stall on formats you cannot interpret.
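The deflate case works because a raw deflate stream marks its own final block, so a forward-scanning unarchiver can recover an entry's compressed size — at the cost of decompressing it — by noting where the stream ends. A sketch using Python's zlib (the helper name is mine):

```python
import zlib

def deflate_stream_end(data: bytes, start: int = 0) -> int:
    """Decompress a raw deflate stream just to find where it ends, as a
    forward-scanning unarchiver must when the LocalFileHeader reports a
    zero size (illustrative sketch)."""
    d = zlib.decompressobj(wbits=-15)   # raw deflate, no zlib wrapper
    d.decompress(data[start:])
    if not d.eof:
        raise ValueError('truncated deflate stream')
    # Whatever the final deflate block did not consume belongs to the
    # next structure in the file (e.g. the DataDescriptor).
    return len(data) - len(d.unused_data)
```

This is exactly the "expensive decompression just to discover metadata" above: the whole entry must be inflated even if the caller only wanted its size.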


mikeabdullah commented on August 19, 2024

Wow, it is quite tricky then!

It occurs to me that unarchivers can take different strategies depending on the task at hand. For example, if the end goal is to decompress an archive, then there's no harm in searching through the data, decompressing as you go, to find the data descriptors.

Also, for situations where compressed data sizes are already known, zipzap could potentially be kind to unarchivers and generate local file headers that contain the file size.

Thank you for filling me in.

