gzip files can contain multiple concatenated gzips about fq HOT 8 CLOSED

TomiBelan commented on June 6, 2024

gzip files can contain multiple concatenated gzips

Comments (8)

wader commented on June 6, 2024

Huh, did not know, that's interesting. I wonder if this is the same as or similar to using inflate/deflate flush to encode boundaries. I ran into this for TLS compression, but in that case there is no header for the trailing inflates.

I can come up with three ways to model this:

Root is always an array. Maybe inconvenient?
Root can optionally be an array. Currently not possible API-wise.
Add a trailing field (an array etc.) with the trailing gzips.
Something else?

TomiBelan commented on June 6, 2024

I think "root is always an array" most precisely models the underlying format.
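
To make the "array of members" view concrete, here is a minimal Python sketch. It is written in Python rather than fq or jq, uses only the standard library, and is not taken from the fq code. Two independently compressed gzip members are concatenated, and a standard decompressor returns their joined payload.

```python
import gzip

# Two independent gzip members, each with its own header and footer.
first = gzip.compress(b"hello ")
second = gzip.compress(b"world")

# Concatenating the members byte for byte is still a valid gzip file.
combined = first + second

# gzip.decompress reads every member and joins the uncompressed data.
joined = gzip.decompress(combined)
print(joined)
```

The second member starts with its own gzip magic bytes right where the first one ends, which is why a list of members is a natural way to describe the file.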

wader commented on June 6, 2024

Have a look at #794. I think I agree, always an array is probably best.

wader commented on June 6, 2024

Yep, some of the test text was wrong. Fixed, thanks.

I wonder if it's bad that we won't provide the full concatenated uncompressed stream somehow? Also, the nested decoding should happen on the concatenation and not on each member's uncompressed data. So maybe the root should instead be a struct with a members array and a uncompressed raw bytes?

TomiBelan commented on June 6, 2024

I didn't realize fq performs nested decoding. I'm not sure what to do. In most cases it might be better to have "a struct with a members array and a uncompressed raw bytes". But today I was analyzing a corrupted gz file where zcat said CRC and size is wrong, and fq helped me to discover only the last member is corrupted and find out why. It was useful to see uncompressed of each member and check they're fine. But I know this is an unusual situation.
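
For readers without fq at hand, the "which member is broken" check can be approximated with a rough Python sketch. It assumes the standard zlib module, where `wbits=31` asks zlib to expect a gzip header and to verify the footer CRC and size. `check_members` is a hypothetical helper name, not something fq or zcat provides, and the corrupted input is constructed here purely for illustration.

```python
import gzip
import zlib


def check_members(data: bytes) -> list[tuple[int, bool]]:
    """Return (offset, ok) per member, stopping at the first bad one."""
    results = []
    offset = 0
    while offset < len(data):
        # wbits=31: gzip wrapper, so zlib checks the footer CRC and ISIZE.
        d = zlib.decompressobj(wbits=31)
        try:
            d.decompress(data[offset:])
        except zlib.error:
            results.append((offset, False))  # bad data, CRC, or size
            break
        if not d.eof:
            results.append((offset, False))  # member is truncated
            break
        results.append((offset, True))
        # Bytes left after the footer belong to the next member.
        offset = len(data) - len(d.unused_data)
    return results


good = gzip.compress(b"ok")
bad = bytearray(gzip.compress(b"broken"))
bad[-8] ^= 0xFF  # flip one byte of the footer CRC32
report = check_members(good + bytes(bad))
print(report)
```

The sketch stops at the first failing member because the end of a damaged member is no longer known reliably.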

I don't have a strong preference. I feel multi-member gz files are rare in practice, so either way is a decent choice.

Just for fun: This is how I used fq to analyze it. That was before I filed this issue, so I had to use gap0.

rm -f part* after*; cp original_input.gz after0.gz; i=0; while true; do o=$(./fq '.gap0|tobytesrange.start' after$i.gz) || break; [[ -z $o ]] && break; head -c$o after$i.gz > part$((i+1)).gz; tail -c+$((o+1)) after$i.gz > after$((i+1)).gz; ((i++)); done

wader commented on June 6, 2024

> I didn't realize fq performs nested decoding. I'm not sure what to do. In most cases it might be better to have "a struct with a members array and a uncompressed raw bytes". But today I was analyzing a corrupted gz file where zcat said CRC and size is wrong, and fq helped me to discover only the last member is corrupted and find out why. It was useful to see uncompressed of each member and check they're fine. But I know this is an unusual situation.

Yes, it does nested decoding by default, sometimes with options to disable it. This was added early in fq, as its roots are in debugging media containers and codecs, where it's common to have lots of nested subformats and muxers that slice up packets in various ways.

About each member's uncompressed data: in the PR I have now modelled it so that you have access to both each member's uncompressed data and a concatenation of them all.
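
The shape described here, per-member data plus one concatenation, can be sketched in Python as follows. This is an illustration under stated assumptions, not how the fq PR implements it: `Member`, `GzipModel`, and `decode_gzip` are hypothetical names, and the field names are chosen only to echo the discussion.

```python
import gzip
import zlib
from dataclasses import dataclass


@dataclass
class Member:
    offset: int            # where the member starts in the input
    compressed_size: int   # header + deflate data + footer, in bytes
    uncompressed: bytes    # this member's payload only


@dataclass
class GzipModel:
    members: list[Member]
    uncompressed: bytes    # concatenation of all member payloads


def decode_gzip(data: bytes) -> GzipModel:
    members = []
    offset = 0
    while offset < len(data):
        d = zlib.decompressobj(wbits=31)  # 31 = expect gzip framing
        payload = d.decompress(data[offset:]) + d.flush()
        if not d.eof:
            raise ValueError(f"truncated gzip member at offset {offset}")
        size = len(data) - offset - len(d.unused_data)
        members.append(Member(offset, size, payload))
        offset += size
    return GzipModel(members, b"".join(m.uncompressed for m in members))


model = decode_gzip(gzip.compress(b"hello ") + gzip.compress(b"world"))
print(len(model.members), model.uncompressed)
```

Keeping both views means a nested decoder can run on `uncompressed` while each entry in `members` stays available for inspection.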

> I don't have a strong preference. I feel multi-member gz files are rare in practice, so either way is a decent choice.

I think it makes sense, kind of the point of fq is to not hide details :)

Now I actually remember that Alpine packages use concatenated gzips.

> Just for fun: This is how I used fq to analyze it. That was before I filed this issue, so I had to use gap0.
>
> rm -f part* after*; cp original_input.gz after0.gz; i=0; while true; do o=$(./fq '.gap0|tobytesrange.start' after$i.gz) || break; [[ -z $o ]] && break; head -c$o after$i.gz > part$((i+1)).gz; tail -c+$((o+1)) after$i.gz > after$((i+1)).gz; ((i++)); done

Nice! You wanted to output each uncompressed member to a file? What was the o+1 thing, skipping one byte from the gap0 start?

fq is not great at outputting multiple files at the moment, and I'm not sure how it could be done without adding messy IO functions. But I have used a hack using tar, something like this:

Copy the to_tar snippet from https://github.com/wader/fq/wiki/snippets and put it in tar.jq, then do:

# -L . adds cwd to include path
# use include "tar" to include tar.jq
# iterate .members as {key: ..., value: ...} objects, as it's an array key will be 0,1,2,... and value the member itself
# to_tar(f) takes a function f as arg that outputs {filename: ..., data: ...} objects
$ fq -L . 'include "tar"; to_tar(.members | to_entries[] | {filename: "part\(.key)", data: .value.uncompressed})' format/gzip/testdata/multi_members.gz | tar tv
-rw-r--r--  0 user   group      11 Jan  1  1970 part0
-rw-r--r--  0 user   group      10 Jan  1  1970 part1
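
For anyone who would rather avoid jq, roughly the same idea can be sketched with Python's standard tarfile module. This is not the to_tar snippet from the wiki: `parts_to_tar` is a hypothetical helper, its input is a list of already-uncompressed payloads, and the part0, part1 naming only mirrors the example above.

```python
import io
import tarfile


def parts_to_tar(parts: list[bytes]) -> bytes:
    """Pack each payload into an in-memory tar as part0, part1, ..."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for index, payload in enumerate(parts):
            info = tarfile.TarInfo(name=f"part{index}")
            info.size = len(payload)  # must match the bytes written
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


archive = parts_to_tar([b"first member", b"second"])
print(len(archive))
```

Writing the returned bytes to stdout or to a file gives something `tar tv` can list, similar to the pipeline above.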

TomiBelan commented on June 6, 2024

> Nice! You wanted to output each uncompressed member to a file? What was the o+1 thing, skipping one byte from the gap0 start?

Right, I wanted to output each compressed member to a file, so I can look at them with zcat/fq/hexdump. $((o+1)) is just because tail counts from 1, e.g. "tail -c+9" discards first 8 bytes and starts printing from the 9th byte.
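
The off-by-one can be shown with a small Python sketch that mimics the two commands on a byte string. `head_c` and `tail_c_plus` are hypothetical helper names, and the offset 8 is just an example value standing in for what fq reported.

```python
def head_c(data: bytes, n: int) -> bytes:
    """Like `head -c N`: the first n bytes."""
    return data[:n]


def tail_c_plus(data: bytes, k: int) -> bytes:
    """Like `tail -c +K` for k >= 1: output starts at the k-th byte,
    counting from 1, so the first k - 1 bytes are discarded."""
    return data[k - 1:]


data = b"0123456789"
o = 8  # example 0-based offset where the next member starts

part = head_c(data, o)           # bytes before the offset
rest = tail_c_plus(data, o + 1)  # bytes from the offset onward
print(part, rest)
```

With a 0-based offset `o`, `head -c$o` and `tail -c+$((o+1))` split the file cleanly with no byte lost or repeated.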

Interesting tar snippet. To be honest I don't really like or understand the jq language, but maybe I'll learn one day.

By the way just for fun, this is not related to fq, but I solved the mystery of the corrupted gz file I mentioned: The uncompressed data looks OK and the footer is present, but the footer CRC and isize are wrong. What could've caused that?
It is generated by a Python program which opens it as with gzip.open(filename, "at") as f:. The solution is that it got a KeyboardInterrupt exception just after executing this line. The compressed data was written, but self.crc and self.size weren't updated. The with: statement called the close() method and wrote a gzip footer, but not the correct values.
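
As background for why such a file has several members at all, here is a small Python sketch of append mode. It does not reproduce the KeyboardInterrupt timing; it only shows, using the standard library, that each `gzip.open(..., "at")` session adds a new member whose footer covers that session's data alone. The file name is made up for the example.

```python
import gzip
import os
import struct
import tempfile
import zlib

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.gz")  # hypothetical file name

    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("first,")
    # Append mode starts a new gzip member after the existing one.
    with gzip.open(path, "at", encoding="utf-8") as f:
        f.write("second")

    with gzip.open(path, "rt", encoding="utf-8") as f:
        text = f.read()
    with open(path, "rb") as f:
        raw = f.read()

# The last 8 bytes are the final member's footer: CRC32 then ISIZE,
# both little-endian 32-bit values.
footer_crc, footer_isize = struct.unpack("<II", raw[-8:])
print(text, footer_isize, footer_crc == zlib.crc32(b"second"))
```

If the CRC and size bookkeeping is skipped for a write, the footer written on close would describe less data than the member actually holds, which matches the symptom described above.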

wader commented on June 6, 2024

> Interesting tar snippet. To be honest I don't really like or understand the jq language, but maybe I'll learn one day.

I can relate; it took quite a while to get my head around it, and now I love it. I think it fits very well for what I, at least, use fq for: doing lots of ad hoc queries to dig and poke around in half-broken and strange media and binary files. And I hope basic jq is easy enough for people to use... I've also noticed people use fq more or less just with d and -V etc. and then pipe to grep/less or whatnot :) Whatever works.

> By the way just for fun, this is not related to fq, but I solved the mystery of the corrupted gz file I mentioned: The uncompressed data looks OK and the footer is present, but the footer CRC and isize are wrong. What could've caused that? It is generated by a Python program which opens it as with gzip.open(filename, "at") as f:. The solution is that it got a KeyboardInterrupt exception just after executing this line. The compressed data was written, but self.crc and self.size weren't updated. The with: statement called the close() method and wrote a gzip footer, but not the correct values.

👍 Aha, tricky, glad you solved it! So was it just one odd gzip file, or something that happened regularly?
