
zip file format details · zipzap (closed, 2 comments)

mikeabdullah commented on August 19, 2024

Comments (2)

pixelglow commented on August 19, 2024

> The apparent duplication of LocalFileHeader and DataDescriptor strikes me as weird. From poking around zip archives I have here, it looks like LocalFileHeader generally reports file sizes to be 0. So I'm guessing unarchivers take that as a signal to consult the DataDescriptor for the true size of the file? Thus allowing a file to be written into the archive without yet knowing the size it will be when compressed.

Either:

  • the archiver knows, before writing the entry, what its size will be. Then the LocalFileHeader reports the true size and no DataDescriptor is written; or
  • the archiver doesn't know. Then the LocalFileHeader records a size of 0, and a DataDescriptor is written after the data with the true sizes.

The latter is often used in streaming scenarios. zipzap has both streaming and non-streaming scenarios, but for cleanliness of implementation I always treat them as streaming.
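The two cases above can be sketched from the local file header layout in the zip spec (a Python illustration, not zipzap's actual Objective-C code; the helper names are mine). In the streaming case, general-purpose flag bit 3 is set to tell unarchivers the real sizes live in a trailing DataDescriptor:

```python
import struct

LOCAL_FILE_HEADER_SIG = 0x04034B50
DATA_DESCRIPTOR_SIG = 0x08074B50
FLAG_USES_DATA_DESCRIPTOR = 1 << 3  # general-purpose flag bit 3

def local_file_header(name: bytes, crc32: int, csize: int, usize: int,
                      streaming: bool) -> bytes:
    """Build a 30-byte local file header plus filename. In streaming mode
    the CRC and sizes are zeroed and deferred to a data descriptor."""
    flags = FLAG_USES_DATA_DESCRIPTOR if streaming else 0
    if streaming:
        crc32 = csize = usize = 0  # unknown until the data has been written
    return struct.pack('<IHHHHHIIIHH',
                       LOCAL_FILE_HEADER_SIG,
                       20,        # version needed to extract
                       flags,
                       8,         # compression method: deflate
                       0, 0,      # mod time / date (stubbed out here)
                       crc32, csize, usize,
                       len(name), 0) + name

def data_descriptor(crc32: int, csize: int, usize: int) -> bytes:
    """Trailing record carrying the real sizes for a streamed entry."""
    return struct.pack('<IIII', DATA_DESCRIPTOR_SIG, crc32, csize, usize)

hdr = local_file_header(b'a.txt', 0, 0, 0, streaming=True)
flags = struct.unpack_from('<H', hdr, 6)[0]
print(bool(flags & FLAG_USES_DATA_DESCRIPTOR))  # → True: sizes deferred
```

An unarchiver reading this entry sees bit 3 set and zero sizes, and knows to trust the DataDescriptor (and ultimately the central directory) instead.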

> The bit I don't understand, though: how do unarchivers know how to find the individual files and the CentralFileHeader if each LocalFileHeader claims its data is of zero length?

This is tricky, and is really up to the quality of the unarchiver. I've based zipzap on the algorithm expressed in minizip, which has a fair amount of history to it.

  • Look for the EndOfCentralDirectory signature. Because the EndOfCentralDirectory is constrained to be at most 65557 bytes (22 fixed bytes plus a variable length comment of up to 65535 bytes at the end), we look for it in the last 65557 bytes of the file, searching backwards.
  • Having got a candidate EndOfCentralDirectory signature, we sanity-check as many of the EndOfCentralDirectory fields as possible. In particular, the comment length field has to be consistent with the actual discovered length of the EndOfCentralDirectory.

This search is complicated by the possibility that a perverse archiver could write a second (or third, etc.) EndOfCentralDirectory signature inside the variable length comment. Because of this (and for performance reasons), zipzap does a backward search and only uses the last header it finds. It could be argued that this is incorrect, since that is not a "real" header but part of the variable length comment. But I chose an algorithm that will work even with a perverse archiver, not necessarily one that will work correctly in such a situation.
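A minimal sketch of that backward search and sanity check (Python for illustration, not zipzap's actual code; the helper name is mine, with the 22-byte fixed record and 65535-byte comment limit taken from the zip spec):

```python
import struct

EOCD_SIG = b'PK\x05\x06'      # EndOfCentralDirectory signature
EOCD_FIXED = 22               # fixed-size portion of the record
MAX_COMMENT = 0xFFFF          # the comment length field is 16 bits

def find_eocd(data: bytes):
    """Backward search for the last EndOfCentralDirectory signature,
    sanity-checked against the comment length, in the spirit of the
    algorithm described above (illustrative sketch)."""
    # Only the last EOCD_FIXED + MAX_COMMENT bytes can hold the record.
    window_start = max(0, len(data) - EOCD_FIXED - MAX_COMMENT)
    pos = data.rfind(EOCD_SIG, window_start)
    if pos == -1 or pos + EOCD_FIXED > len(data):
        return None
    # Sanity check: the recorded comment length must make the record
    # run exactly to the end of the file.
    comment_len = struct.unpack_from('<H', data, pos + 20)[0]
    return pos if pos + EOCD_FIXED + comment_len == len(data) else None
```

Note how a signature planted inside the comment by a perverse archiver is the one this backward search finds first, which is exactly the trade-off discussed above.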

Some refinements I did consider:

  • Searching forward, but skipping inconsistent candidate EndOfCentralDirectory structures. This is probably more correct but performance-intensive, since you would always scan the last 64KB of any large zip file.
  • Searching backward, but continuing if the structure is inconsistent. This won't avoid the perverse archiver above, but might be slightly more robust, e.g. if the variable length comment happened to contain the signature.

Some other unarchivers search forward from the start of the zip. This has several issues:

  • it is I/O-intensive on large zip files, since you have to read a few bytes, skip ahead, and repeat (lather, rinse, repeat) to gather all the information you need.
  • it works 100% correctly in the non-streaming scenario, but you cannot guarantee that a zip file is non-streaming.
  • in the streaming scenario, you need to somehow limit the entry reading to the correct size without being told the size in the LocalFileHeader! That is possible for some formats, e.g. deflate compression, but not in general. It forces you to do expensive decompression just to discover metadata, or to stall on formats you cannot interpret.
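The deflate case works because a raw deflate stream marks its own final block, so a forward-scanning unarchiver can recover an entry's compressed size — at the cost of decompressing it — by noting where the stream ends. A sketch using Python's zlib (the helper name is mine):

```python
import zlib

def deflate_stream_end(data: bytes, start: int = 0) -> int:
    """Decompress a raw deflate stream just to find where it ends, as a
    forward-scanning unarchiver must when the LocalFileHeader reports a
    zero size (illustrative sketch)."""
    d = zlib.decompressobj(wbits=-15)   # raw deflate, no zlib wrapper
    d.decompress(data[start:])
    if not d.eof:
        raise ValueError('truncated deflate stream')
    # Whatever the final deflate block did not consume belongs to the
    # next structure in the file (e.g. the DataDescriptor).
    return len(data) - len(d.unused_data)
```

This is exactly the "expensive decompression just to discover metadata" above: the whole entry must be inflated even if the caller only wanted its size.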


mikeabdullah commented on August 19, 2024

Wow, it is quite tricky then!

It occurs to me that unarchivers can take different strategies depending on the task at hand. For example, if the end goal is to decompress an archive, then there's no harm in searching through the data, decompressing as you go, to find the data descriptors.

Also, for situations where compressed data sizes are already known, zipzap could potentially be kind to unarchivers and generate local file headers that contain the file size.

Thank you for filling me in.

