Comments (8)

wader commented on June 6, 2024

Huh, I did not know that, that's interesting. I wonder if this is the same as, or similar to, using inflate/deflate flush to encode boundaries. I ran into this with TLS compression, but in that case there is no header for the trailing inflates.

I can come up with three ways to model this:

  • Root is always an array. Maybe inconvenient?
  • Root can optionally be an array. Currently not possible API-wise.
  • Add a trailing field (an array etc.) with the trailing gzips
  • Something else?

from fq.

TomiBelan commented on June 6, 2024

I think "root is always an array" most precisely models the underlying format.

wader commented on June 6, 2024

Have a look at #794. I think I agree, always an array is probably best.

wader commented on June 6, 2024

Yep, some of the text tests were wrong; fixed, thanks.

I wonder if it's bad that we won't provide the full concatenated uncompressed stream somehow? Also, the nested decoding should happen on the concatenation and not on each member's uncompressed data. So maybe the root should instead be a struct with a members array and a uncompressed raw bytes?
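
To make the "members array plus concatenated bytes" shape concrete, here is a minimal Python sketch. It is not how fq implements it; it only assumes the standard zlib module, and split_members is a hypothetical helper name. Each member is decoded on its own, and the concatenation is what a nested decoder would then look at.

```python
import gzip
import zlib


def split_members(blob: bytes) -> list[bytes]:
    """Return the uncompressed payload of each gzip member in blob."""
    members = []
    while blob:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(wbits=31)
        members.append(d.decompress(blob) + d.flush())
        if not d.eof:
            raise ValueError("truncated gzip member")
        # Bytes after this member's trailer belong to the next member.
        blob = d.unused_data
    return members


# Two members written back to back, as in a multi-member .gz file.
blob = gzip.compress(b"first member\n") + gzip.compress(b"second\n")
members = split_members(blob)       # per-member view
concatenated = b"".join(members)    # what zcat would print
```

Keeping both views costs little here: the per-member list preserves the boundaries, while the joined bytes match what tools like zcat produce.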

TomiBelan commented on June 6, 2024

I didn't realize fq performs nested decoding. I'm not sure what to do. In most cases it might be better to have "a struct with a members array and a uncompressed raw bytes". But today I was analyzing a corrupted gz file where zcat said CRC and size is wrong, and fq helped me to discover only the last member is corrupted and find out why. It was useful to see uncompressed of each member and check they're fine. But I know this is an unusual situation.

I don't have a strong preference. I feel multi-member gz files are rare in practice, so either way is a decent choice.

Just for fun: This is how I used fq to analyze it. That was before I filed this issue, so I had to use gap0.

rm -f part* after*; cp original_input.gz after0.gz; i=0; while true; do o=$(./fq '.gap0|tobytesrange.start' after$i.gz) || break; [[ -z $o ]] && break; head -c$o after$i.gz > part$((i+1)).gz; tail -c+$((o+1)) after$i.gz > after$((i+1)).gz; ((i++)); done

wader commented on June 6, 2024

I didn't realize fq performs nested decoding. I'm not sure what to do. In most cases it might be better to have "a struct with a members array and a uncompressed raw bytes". But today I was analyzing a corrupted gz file where zcat said CRC and size is wrong, and fq helped me to discover only the last member is corrupted and find out why. It was useful to see uncompressed of each member and check they're fine. But I know this is an unusual situation.

Yes, it does nested decoding by default, sometimes with options to disable it. This was added early on, as fq's roots are in debugging media containers and codecs, where it's common to have lots of nested subformats and muxers that slice up packets in various ways.

About each member's uncompressed data: in the PR I have now modelled it so that you have access to both each member's uncompressed data and a concatenation of them all.

I don't have a strong preference. I feel multi-member gz files are rare in practice, so either way is a decent choice.

I think it makes sense, kind of the point of fq is to not hide details :)

Now I actually remember that Alpine packages use concatenated gzips.

Just for fun: This is how I used fq to analyze it. That was before I filed this issue, so I had to use gap0.

rm -f part* after*; cp original_input.gz after0.gz; i=0; while true; do o=$(./fq '.gap0|tobytesrange.start' after$i.gz) || break; [[ -z $o ]] && break; head -c$o after$i.gz > part$((i+1)).gz; tail -c+$((o+1)) after$i.gz > after$((i+1)).gz; ((i++)); done

Nice! you wanted to output each uncompressed to a file? what was the o+1 thing, skip one byte from gap0 start?

fq is not great at outputting multiple files at the moment, and I'm not sure how it could be done without adding messy IO functions, hmm. But I have used a hack based on tar. So something like this:

Copy the to_tar snippet from https://github.com/wader/fq/wiki/snippets and put it in tar.jq, then do:

# -L . adds cwd to include path
# use include "tar" to include tar.jq
# iterate .members as {key: ..., value: ...} objects, as it's an array key will be 0,1,2,... and value the member itself
# to_tar(f) takes a function f as arg that outputs {filename: ..., data: ...} objects
$ fq -L . 'include "tar"; to_tar(.members | to_entries[] | {filename: "part\(.key)", data: .value.uncompressed})' format/gzip/testdata/multi_members.gz | tar tv
-rw-r--r--  0 user   group      11 Jan  1  1970 part0
-rw-r--r--  0 user   group      10 Jan  1  1970 part1

TomiBelan commented on June 6, 2024

Nice! you wanted to output each uncompressed to a file? what was the o+1 thing, skip one byte from gap0 start?

Right, I wanted to output each compressed member to a file, so I can look at them with zcat/fq/hexdump. $((o+1)) is just because tail counts from 1, e.g. "tail -c+9" discards first 8 bytes and starts printing from the 9th byte.
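
The offset arithmetic can be sketched in Python as follows. This is only an illustration: split_compressed is a hypothetical helper, and it finds each member boundary with zlib rather than with fq's gap0. Unlike the shell loop, it would raise zlib.error on a member whose footer CRC or size is wrong, so it only suits intact files.

```python
import gzip
import zlib


def split_compressed(blob: bytes) -> list[bytes]:
    """Split a multi-member gzip blob into its compressed members."""
    parts = []
    while blob:
        d = zlib.decompressobj(wbits=31)  # gzip header and trailer expected
        d.decompress(blob)
        if not d.eof:
            raise ValueError("truncated gzip member")
        # o is the zero-based offset where the next member starts.
        o = len(blob) - len(d.unused_data)
        # blob[:o] matches `head -c $o`; blob[o:] matches
        # `tail -c +$((o+1))`, because tail numbers bytes from 1.
        parts.append(blob[:o])
        blob = blob[o:]
    return parts


a = gzip.compress(b"hello\n")
b = gzip.compress(b"world\n")
parts = split_compressed(a + b)
```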

Interesting tar snippet. To be honest I don't really like or understand the jq language, but maybe I'll learn one day.

By the way just for fun, this is not related to fq, but I solved the mystery of the corrupted gz file I mentioned: The uncompressed data looks OK and the footer is present, but the footer CRC and isize are wrong. What could've caused that?
It is generated by a Python program which opens it as with gzip.open(filename, "at") as f:. The solution is that it got a KeyboardInterrupt exception just after executing this line. The compressed data was written, but self.crc and self.size weren't updated. The with: statement called the close() method and wrote a gzip footer, but not the correct values.
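
A footer mismatch like this can be checked per member with a short Python sketch. It parses the header according to RFC 1952 and inflates the raw deflate data, so a wrong footer is reported instead of raising an error. skip_header and check_members are hypothetical names, and the bad footer below is simulated by overwriting the last 8 bytes of a member, not by reproducing the interrupted write.

```python
import gzip
import struct
import zlib


def skip_header(buf: bytes) -> int:
    """Return the offset of the deflate data in a gzip member (RFC 1952)."""
    if buf[:3] != b"\x1f\x8b\x08":
        raise ValueError("not a gzip member")
    flg = buf[3]
    pos = 10  # fixed part: magic, CM, FLG, MTIME, XFL, OS
    if flg & 0x04:  # FEXTRA: 2-byte length, then that many bytes
        pos += 2 + struct.unpack_from("<H", buf, pos)[0]
    for bit in (0x08, 0x10):  # FNAME, FCOMMENT: zero-terminated strings
        if flg & bit:
            pos = buf.index(b"\x00", pos) + 1
    if flg & 0x02:  # FHCRC: 2-byte header checksum
        pos += 2
    return pos


def check_members(blob: bytes):
    """Yield (data, crc_ok, size_ok) for each gzip member in blob."""
    while blob:
        d = zlib.decompressobj(wbits=-15)  # raw deflate, no header/trailer
        data = d.decompress(blob[skip_header(blob):]) + d.flush()
        # The footer is CRC32 then ISIZE, both little-endian 32-bit.
        crc, isize = struct.unpack_from("<II", d.unused_data)
        crc_ok = crc == zlib.crc32(data)
        size_ok = isize == (len(data) & 0xFFFFFFFF)
        yield data, crc_ok, size_ok
        blob = d.unused_data[8:]


good = gzip.compress(b"hello\n")
# Simulated damage: valid deflate data followed by a zeroed footer.
bad = gzip.compress(b"world\n")[:-8] + struct.pack("<II", 0, 0)
report = [(data, crc_ok, size_ok)
          for data, crc_ok, size_ok in check_members(good + bad)]
```

In this sketch the second member still decompresses to its full payload, and only its footer checks fail, which is the same symptom as described above.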

wader commented on June 6, 2024

Interesting tar snippet. To be honest I don't really like or understand the jq language, but maybe I'll learn one day.

I can relate, and it took quite a while to get my head around it; now I love it. But I think it fits very well for what I, at least, use fq for: lots of ad hoc queries to dig and poke around in half-broken and strange media and binary files. And I hope basic jq is easy enough for people to use... I've also noticed people use fq more or less just with d and -V etc. and then pipe to grep/less or whatnot :) whatever works

By the way just for fun, this is not related to fq, but I solved the mystery of the corrupted gz file I mentioned: The uncompressed data looks OK and the footer is present, but the footer CRC and isize are wrong. What could've caused that? It is generated by a Python program which opens it as with gzip.open(filename, "at") as f:. The solution is that it got a KeyboardInterrupt exception just after executing this line. The compressed data was written, but self.crc and self.size weren't updated. The with: statement called the close() method and wrote a gzip footer, but not the correct values.

👍 Aha, tricky, glad you solved it! So was it just one odd gzip file, or something that happened regularly?
