circulosmeos / gztool

Extract random-positioned data from gzip files with no penalty, including gzip tailing like with 'tail -f'!

Home Page: https://circulosmeos.wordpress.com/2019/08/11/continuous-tailing-of-a-gzip-file-efficiently/

C 93.22% Makefile 0.09% M4 0.06% Roff 6.50% Shell 0.13%
bgzf bgzip compressed-files compression concatenate-files decompression gzip gzip-compression gzip-data gzip-decompression gzip-format gzip-stream gzipped-files indexing inflate zlib zlib-decompression-library

gztool's People

Contributors: circulosmeos, edwardbetts, skitt


gztool's Issues

Segmentation fault with -z -b 0

Hey, I'm reporting a crash: I want the block index, but I am doing my own line index, so -b 0 together with -z looked like the right thing.

Then this happens:

$ gztool -z ~/work/fumes/data/galaxy_1day.json.gz -I FOOIdx.gzi -b 0 >/dev/null
ACTION: Extract from byte = 0

Processing '/home/jim/work/fumes/data/galaxy_1day.json.gz' ...
Processing index to 'FOOIdx.gzi'...
Segmentation fault


The OS is a recent Gentoo, world-rebuilt about 2 months ago with gcc at -O2.

The gztool binary was built with clang-15: clang-15 -lz -lm -O3 -flto -o gztool gztool.c

It was plain old gcc before as well, with the same result.

I think I could possibly get a cheap build with zig (which fronts clang) to test different libcs, but I'd need to RTFM and follow up on that.

request: byte-aligned index blocks for low-tech zlib inflater

I am able to read the created index files using a JDK client, to prime the window as needed, and to set up the streams at the needed positions.

The wall I run into is the inflatePrime zlib function being absent from non-C libraries, which is true of at least 3 ports, including the official Oracle one.

The occurrence of non-zero bit offsets in the index is roughly 7 in 8, as shown below (screenshot omitted):

In gzindex the indexes are not stored to disk; it's just a minimal unit test of what gztool does. The point struct stores the first 2 offsets in bits, not bytes.

I modified the loop conditionals of gzindex as shown below to shrink the input stride to 1 byte and keep iterating the loop until arriving at a byte-aligned block boundary. I'm guessing this makes the block-boundary spacing slightly stochastic, up to an average of 4 bytes of variance. With gztool this isn't such a simple modification.

diff --git a/gzindex.c b/gzindex.c
--- a/gzindex.c	(revision f1b7696c1e4757a7201009a2f3e02ed9e3536a56)
+++ b/gzindex.c	(revision 662eb8434ed5c3d18e4673621aafd9e0feb415bf)
@@ -207,6 +207,8 @@
     unsigned char *out, *out2;
     z_stream strm;
     unsigned char in[16384];
+    size_t input_stride = sizeof(in);
+    unsigned out_alignment = 0; /* non-zero while hunting for a byte-aligned boundary */

     /* position input file */
     ret = fseeko(gz, offset, SEEK_SET);
@@ -273,7 +275,7 @@
         do {
             /* if needed, get more input data */
             if (strm.avail_in == 0) {
-                strm.avail_in = fread(in, 1, sizeof(in), gz);
+                strm.avail_in = fread(in, 1, input_stride, gz);
                 if (ferror(gz)) {
                     (void)inflateEnd(&strm);
                     free(list);
@@ -304,6 +306,12 @@
 
             /* if at a block boundary, note the location of the header */
             if (strm.data_type & 128) {
+                /* drop to 1-byte reads until the boundary lands byte-aligned */
+                out_alignment = (pos - strm.avail_in) & 7;
+                if (out_alignment)
+                    input_stride = 1;
+                else
+                    input_stride = sizeof(in);
                 head = ((pos - strm.avail_in) << 3) - (strm.data_type & 63);
                 last = strm.data_type & 64; /* true at end of last block */
             }
...
        } while (strm.avail_out != 0 && !last && out_alignment); /* keep reading 1 byte at end of block until alignment */
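For reference, once a block boundary is byte-aligned, the stock JDK inflater can resume there without any bit priming. Below is a minimal, self-contained sketch of that idea (this is not gztool's format: it fakes the situation with a raw deflate stream and a FULL_FLUSH, which both byte-aligns the boundary and resets the window; at a real index point, where history is not reset, a port would additionally have to supply the 32 KiB window via setDictionary):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ByteAlignedResume {
    public static void main(String[] args) throws Exception {
        byte[] buf = new byte[1024];

        // Raw (headerless) deflate stream; FULL_FLUSH between the two parts
        // ends the block on a byte boundary and resets the 32 KiB history.
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        def.setInput("part one ".getBytes(StandardCharsets.US_ASCII));
        int mark = def.deflate(buf, 0, buf.length, Deflater.FULL_FLUSH);
        def.setInput("part two".getBytes(StandardCharsets.US_ASCII));
        def.finish();
        int tail = def.deflate(buf, mark, buf.length - mark);
        def.end();

        // Resume at the byte-aligned offset 'mark' with a fresh raw Inflater:
        // no inflatePrime needed. (At a gztool point, which does not reset
        // history, you would also call inf.setDictionary(window) first.)
        Inflater inf = new Inflater(true);
        inf.setInput(buf, mark, tail);
        byte[] out = new byte[64];
        int n = inf.inflate(out);
        inf.end();
        System.out.println(new String(out, 0, n, StandardCharsets.US_ASCII));
    }
}
```

An index point whose stored bit offset is 0 is exactly this situation, minus the history reset.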

'Create index' action for a gzip from STDIN not working

Hello, thanks for maintaining this amazing tool :)

I need to use the reading-from-STDIN feature, but it is not working anymore in version 1.5.0.
I tried version 1.4.3 and it works fine.

Steps to reproduce:

$ git clone https://github.com/circulosmeos/gztool.git
$ git fetch
$ git checkout v1.5.0
$ automake --add-missing && autoreconf && ./configure && make check
$ cat tests/gplv3.txt.gz | ./gztool -I index.bin
ACTION: Create index for a gzip file

Processing STDIN ...
Processing index to 'index.bin'...
ERROR: Compressed data error in STDIN.
$ make clean
$ git checkout v1.4.3
$ automake --add-missing && autoreconf && ./configure && make check
$ cat tests/gplv3.txt.gz | ./gztool -I index.bin
ACTION: Create index for a gzip file

Index file 'index.bin' already exists and will be used.
Processing STDIN ...
Processing index to 'index.bin'...
$ ./gztool -ell index.bin
ACTION: Check & list info in index file

Checking index file 'index.bin' ...
	Size of index file (v1)  : 84.00 Bytes (84 Bytes)
	Number of index points   : 1
	Size of uncompressed file: 31.27 kiB (32024 Bytes)
	Number of lines          : 207
	Compression factor       : Not available
	List of points:
	#: @ compressed/uncompressed byte L#line_number (window data size in Bytes @window's beginning at index file), ...
#1: @ 20 / 0 L1 ( 0 @60 ), 

1 files processed

Thank you!

Feature request: Random access when using stdin

Hi Roberto,

Could you please implement a new argument, which informs gztool which compressed byte it's about to receive in stdin?

Something like:
gztool.exe -ecb 1501239288 -b 3410890441

Where -ecb stands for 'Expect compressed byte'

Background
The -b argument is great for telling gztool which uncompressed byte to start extracting from. That works really well when working with concrete files.

However, I'm writing a program which doesn't use concrete files, but rather in-memory content. In order to use the -b argument, I've used the technique you described here, which makes use of sparse files.

For example, if I would like to get uncompressed bytes 3,410,890,441 - 4,000,000,000, I first run gztool.exe -ll to work out which part of the compressed file contains that portion. In this case 1,501,239,288 - 1,630,240,888. I then create a sparse file and populate that section with the compressed data. (The sparse file in this example only takes up 123MB). Then finally I run gztool on the sparse file:
gztool -W -I "file.gzi" -b 3410890441 <sparseGzipFilename> and read from stdout.

That technique works perfectly. The only issue is that there is unnecessary IO (I have to populate a sparse file every time I want to decompress something). So using the -ecb argument would mean I could just send compressed data straight into gztool's stdin, and it would output decompressed data.
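For what it's worth, the sparse-file step itself is cheap to script. A hedged sketch (the class and method names are mine, not gztool's; whether the holes actually occupy no disk blocks depends on the filesystem):

```java
import java.io.File;
import java.io.RandomAccessFile;

public class SparseSlice {
    // Write a compressed slice at its original offset inside an otherwise
    // empty file of the full compressed size; on filesystems that support
    // holes, the untouched regions consume no disk blocks.
    static void writeSlice(String path, long fullSize, long offset, byte[] slice)
            throws Exception {
        try (RandomAccessFile f = new RandomAccessFile(path, "rw")) {
            f.setLength(fullSize);   // sets the logical size without writing data
            f.seek(offset);          // e.g. 1501239288 in the example above
            f.write(slice);
        }
    }

    public static void main(String[] args) throws Exception {
        File tmp = File.createTempFile("sparse", ".gz");
        tmp.deleteOnExit();
        writeSlice(tmp.getPath(), 1 << 20, 4096, new byte[]{42});
        System.out.println(tmp.length()); // full logical size, mostly holes
    }
}
```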

Cheers,
Fidel

an attempt at porting an index client

This C struct contains two plain ints:

struct access {
    uint64_t have;      /* number of list entries filled in */
    uint64_t size;      /* number of list entries allocated */
    uint64_t file_size; /* size of uncompressed file (useful for bgzip files) */
    struct point *list; /* allocated list */
    char *file_name;    /* path to index file */
    int index_complete; /* 1: index is complete; 0: index is (still) incomplete */
// index v1:
    int index_version;  /* 0: default; 1: index with line numbers */
    uint32_t line_number_format; /* 0: linux \r | windows \n\r; 1: mac \n */
    uint64_t number_of_lines; /* number of lines (only used with v1 index format) */
};

The code below is an attempt to port the index reading, and eventually to apply the inflater library to the points, but I'm hitting buffer underruns with the ByteBuffers, and I cannot rule out that sizeof(int) might be other than 32 bits.

package ed.fumes

import java.io.File
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.channels.FileChannel


object GzipIndexReader {
    data class Point(
        var out: Long,
        var `in`: Long,
        var bits: Int,
        var windowSize: Int,
        var window: ByteArray?,
        var windowBeginning: Long,
        var lineNumber: Long,
    )

    data class Access(
        var have: Long,
        var size: Long,
        var fileSize: Long,
        var list: MutableList<Point>,
        var fileName: String,
        var indexComplete: Int,
        var indexVersion: Int,
        var lineNumberFormat: Int,
        var numberOfLines: Long,
    )

    val GZIP_INDEX_HEADER_SIZE = 16
    val GZIP_INDEX_IDENTIFIER_STRING = "gzipindx"
    val GZIP_INDEX_IDENTIFIER_STRING_V1 = "gzipindX"

    fun ByteBuffer.readLongBE(): Long {
        order(ByteOrder.BIG_ENDIAN)
        return long
    }

    fun ByteBuffer.readIntBE(): Int {
        order(ByteOrder.BIG_ENDIAN)
        return int
    }

    fun ByteBuffer.readBytes(n: Int): ByteArray {
        val arr = ByteArray(n)
        get(arr)
        return arr
    }

    fun createEmptyIndex(): Access {
        return Access(
            have = 0L,
            size = 0L,
            fileSize = 0L,
            list = mutableListOf(),
            fileName = "",
            indexComplete = 0,
            indexVersion = 0,
            lineNumberFormat = 0,
            numberOfLines = 0L
        )
    }

    fun addPoint(index: Access, point: Point) {
        index.list.add(point)
        index.have++
        index.size++
    }

    fun deserializeIndexFromFile(indexFile: File, loadWindows: Boolean = false, gzFilename: String): Access? {
        indexFile.inputStream().use { inputStream ->
            val channel = inputStream.channel
            val header = ByteArray(GZIP_INDEX_HEADER_SIZE)
            inputStream.read(header)

            if (header.sliceArray(0 until 8)
                    .toString(Charsets.UTF_8) != "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000" ||
                !(header.sliceArray(8 until 16)
                    .toString(Charsets.UTF_8) == GZIP_INDEX_IDENTIFIER_STRING || header.sliceArray(8 until 16)
                    .toString(Charsets.UTF_8) == GZIP_INDEX_IDENTIFIER_STRING_V1)
            ) {
                println("ERROR: File is not a valid gzip index file.")
                return null
            }

            val indexVersion =
                if (header.sliceArray(8 until 16).toString(Charsets.UTF_8) == GZIP_INDEX_IDENTIFIER_STRING_V1) 1 else 0
            val index = createEmptyIndex()
            index.indexVersion = indexVersion

            if (indexVersion == 1) index.lineNumberFormat = channel.readIntBE()

            val indexHave = channel.readLongBE()
            val indexSize = channel.readLongBE()

            if (indexHave != indexSize) {
                println("Index file is incomplete.")
                index.indexComplete = 0
            } else index.indexComplete = 1

            for (i in 0 until indexSize) {
                val out = channel.readLongBE()
                val `in` = channel.readLongBE()
                val bits = channel.readIntBE()
                val windowSize = channel.readIntBE()

                val window: ByteArray? = if (loadWindows) {
                    val windowBytes = ByteBuffer.allocate(windowSize)
                    // NOTE: a single read() may return fewer bytes than requested
                    channel.read(windowBytes)
                    windowBytes.array()
                } else {
                    channel.position(channel.position() + windowSize)
                    null
                }

                val windowBeginning = channel.readLongBE()
                val lineNumber = if (indexVersion == 1) channel.readLongBE() else 0L

                val point = Point(
                    out = out,
                    `in` = `in`,
                    bits = bits,
                    windowSize = windowSize,
                    window = window,
                    windowBeginning = windowBeginning,
                    lineNumber = lineNumber
                )

                addPoint(index, point)
            }

            index.fileName = gzFilename
            index.numberOfLines = if (indexVersion == 1) channel.readLongBE() else 0L
            index.fileSize = indexFile.length()

            return index
        }
    }

    fun FileChannel.readIntBE(): Int {
        val buffer = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN)
        read(buffer)
        buffer.flip()
        return buffer.int
    }

    fun FileChannel.readLongBE(): Long {
        val buffer = ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN)
        read(buffer)
        buffer.flip()
        return buffer.long
    }

}

fun main(args: Array<String>) {
    val inputFile = args[0]
    val gzfile = args.getOrNull(1) ?: (inputFile.replace(".gzi$".toRegex(), ".gz"))

    val index = GzipIndexReader.deserializeIndexFromFile(File(inputFile), true, gzfile)

    if (index != null) {
        println("Index loaded successfully.")
        println("Points: ${index.list.size}")
    } else {
        println("Failed to load index.")
    }
}

The outcome is that on the second window read a buffer underflow is reported; however, the debugger shows what look entirely like misaligned values, in the millions and trillions, for windowSize. This overruns the index whether it is completely in memory or there is a short read from file. Any insight appreciated.

Exception in thread "main" java.nio.BufferUnderflowException
at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:710)
at java.base/java.nio.HeapByteBuffer.getLong(HeapByteBuffer.java:494)
at ed.fumes.GzipIndexReader.readLongBE(GzipIndexReader.kt:153)
at ed.fumes.GzipIndexReader.deserializeIndexFromFile(GzipIndexReader.kt:118)
at ed.fumes.GzipIndexReaderKt.main(GzipIndexReader.kt:162)
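One thing worth ruling out first: a single FileChannel.read() is allowed to return fewer bytes than the buffer asks for, and the readIntBE/readLongBE helpers above ignore that return value, which would produce exactly this garbage-then-underflow pattern. DataInputStream guarantees full-width big-endian reads (or throws EOFException). A sketch, with the field order and widths taken from the struct quoted in the issue rather than verified against gztool's serializer:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class IndexFieldReader {
    // Fixed-width big-endian fields of one (assumed) index point header.
    // readLong/readInt consume exactly 8/4 bytes or throw EOFException,
    // so truncation fails loudly instead of yielding misaligned values.
    static long[] readPointHeader(DataInput in) throws IOException {
        long out = in.readLong();        // uncompressed offset
        long cin = in.readLong();        // compressed offset
        int bits = in.readInt();         // bit offset into the compressed byte
        int windowSize = in.readInt();   // bytes of window data that follow
        return new long[] { out, cin, bits, windowSize };
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a synthetic header to check that the widths line up.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        dos.writeLong(32768L);
        dos.writeLong(1200L);
        dos.writeInt(0);
        dos.writeInt(32768);
        long[] f = readPointHeader(
                new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(f[0] + " " + f[1] + " " + f[2] + " " + f[3]);
    }
}
```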

Dead code when checking input

The code at line 3844:

    if ( span_between_points != SPAN &&
        ( action == ACT_COMPRESS_CHUNK || action == ACT_DECOMPRESS_CHUNK || action == ACT_LIST_INFO )
        ) {
        printToStderr( VERBOSITY_NORMAL, "ERROR: `-s` parameter does not apply to `-[lu]`.\n" );
        return EXIT_INVALID_OPTION;
    }

will never run, as this case is already covered by the code at lines 3809-3827:

    if ( ( action == ACT_COMPRESS_CHUNK || action == ACT_DECOMPRESS_CHUNK ) &&
        ( force_action == 1 || force_strict_order == 1 || write_index_to_disk == 0 ||
            span_between_points != SPAN || index_filename_indicated == 1 ||
            end_on_first_proper_gzip_eof == 1 || always_create_a_complete_index == 1 ||
            waiting_time != WAITING_TIME )
        ) {
        printToStderr( VERBOSITY_NORMAL, "ERROR: `-[aCEfFIsW]` does not apply to `-u`\n" );
        return EXIT_INVALID_OPTION;
    }

    if ( ( action == ACT_LIST_INFO ) &&
        ( force_action == 1 || force_strict_order == 1 || write_index_to_disk == 0 ||
            span_between_points != SPAN ||
            end_on_first_proper_gzip_eof == 1 || always_create_a_complete_index == 1 ||
            waiting_time != WAITING_TIME )
        ) {
        printToStderr( VERBOSITY_NORMAL, "ERROR: `-[aCEfFsW]` does not apply to `-l`\n" );
        return EXIT_INVALID_OPTION;
    }

I suggest removing the handling of the span_between_points != SPAN case from lines 3809-3827.

gzip tailing like tail -F

Hi and thank you for this wonderful tool.

I was wondering if it was possible to continuously tail a gzipped file like you would with tail -F (note the capital -F, not -f!).
This allows the seamless restart of the tail process if the file disappears (e.g. if it is rotated).

This would be very useful for gzipped log files that are quickly rotated.
Currently, my only option, as far as I understand it, is to do something along the lines of:

while true; do
  gztool -wWTP -a1 -v0 test-log.log
done

which is admittedly quite ugly. Do you see any other option?

Feature request: Multiple input files

Hi Roberto,

Clonezilla splits gzip files into 4GB segments. Would it be possible to enable gztool to accept multiple input files and treat them as one big input file?

eg.
gztool -I myindex.gzi "sda4.ntfs-ptcl-img.gz.aa" "sda4.ntfs-ptcl-img.gz.ab" "sda4.ntfs-ptcl-img.gz.ac"

I want to avoid using cat to combine the files,
eg.
cat *sda4* | gztool -I myindex.gzi
because I create the index over multiple gztool runs, so I want to avoid feeding in data which has already been indexed by gztool.

Thank you,
Fidel

Feature request: Resume index creation for stdin

Hi Roberto,

Merry Christmas! I hope you are well.

Could you please enable the -n argument to be used during index creation?

Something like:
gztool.exe -n 56343039741 -I "file.gzi"

Then I pass in compressed data (starting at byte 56343039741) into stdin.

Thank you,
Fidel

Background
I have in-memory compressed data, and I am currently creating an index using:
gztool.exe -I "file.gzi"
During index creation, the user may close my program without warning. I'd like to resume index creation when they run the program again. So I plan to use -ll to see what compressed byte we got up to last time, then use the -n argument to resume where we left off. This would avoid having to send all the data into stdin again.

Can we make it so I can write gzi files to a different directory?

I have a situation where I have given a user read-only access to a growing log file, but I don't want to give them write access to the logs directory/files. Is there a way to use gztool to reference an index in a DIFFERENT directory than the fully readable, growing gzip file? I want them to be able to 'tail -f' the file without write access alongside the gzip file.

Feature request: zsttool

Hi Roberto,

Going out on a limb here, but do you think you can make a tool to index Zstandard files?

automake --add-missing

Thanks for the recent release and the mention!

This brought me to clean up the old source tree and attempt something proper.

 jim@gentoo ~/work/gztool $ autoreconf && ./configure && make check
configure.ac:4: warning: The macro `AC_PROG_CC_C99' is obsolete.
configure.ac:4: You should run autoupdate.
./lib/autoconf/c.m4:1664: AC_PROG_CC_C99 is expanded from...
configure.ac:4: the top level
configure.ac:3: error: required file './compile' not found
configure.ac:3:   'automake --add-missing' can install 'compile'
[...]

Some fiddling led to:

jim@gentoo ~/work/gztool $ automake --add-missing
configure.ac:3: installing './compile'
configure.ac:2: installing './install-sh'
configure.ac:2: installing './missing'
Makefile.am: installing './depcomp'
parallel-tests: installing './test-driver'

the results changed after that.

jim@gentoo ~/work/gztool $ autoreconf && ./configure && make check
configure.ac:4: warning: The macro `AC_PROG_CC_C99' is obsolete.
configure.ac:4: You should run autoupdate.
./lib/autoconf/c.m4:1664: AC_PROG_CC_C99 is expanded from...
configure.ac:4: the top level
checking for a BSD-compatible install... /usr/bin/install -c

eventually the makefile was sane.

Just passing this along; I think I can retire my fork for a bit, until I'm ready to figure out an in-browser solution via range requests.

Functional question.

I would like to be able to decompress truncated data fetched with a range request, without using transfer encoding.
How do I raise the first index point so that I can make larger cuts?

Thanks : )

Question: line-oriented vs byte

Hi,

I work with large text files using commands like 'tail -n +40000 file.txt | head -n 1000', which retrieves lines 40000-40999. This feeds chunks of data to a program. The program needs to know which line numbers are being processed, and it operates on multiple files at once, getting the same line-number blocks from each even though they contain different data.

Is it possible to retrieve in line-oriented mode instead of by byte? I imagine the index could proxy bytes for line numbers: for example, determine the byte offset at each 1,000-line block of text when creating the index, so that it knows line 41000 is at byte #xyz. When creating the index you would specify a line-oriented block size, e.g. 1,000.
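The mechanism described above can be sketched independently of compression: record the byte offset of every 1,000th line while scanning the text once (class name and the fixed stride are illustrative only, not anything gztool provides):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class LineOffsetIndex {
    static final int STRIDE = 1000;

    // offsets.get(i) is the byte offset where line i*STRIDE + 1 starts.
    static List<Long> build(InputStream in) throws IOException {
        List<Long> offsets = new ArrayList<>();
        offsets.add(0L);                 // line 1 starts at byte 0
        long pos = 0, line = 1;
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            // after each newline, 'line' is the number of the line at 'pos'
            if (b == '\n' && ++line % STRIDE == 1) offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        // 2500 two-byte lines ("x\n"): expect marks for lines 1, 1001, 2001.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2500; i++) sb.append("x\n");
        System.out.println(build(new ByteArrayInputStream(sb.toString().getBytes())));
    }
}
```

To serve 'tail -n +40000 | head -n 1000' with such an index, you would seek to offsets.get(39) (the start of line 39001) and skip 999 newlines, instead of scanning from byte 0.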

This probably seems like a special application, but I think many, if not most, people work with text files in line-oriented mode rather than by byte count; otherwise lines are truncated, data is lost, and there is no control over which lines are retrieved. Thanks for your consideration and work on this project!

Only extract a subset of lines using gztool

Hi,

I've been searching for a little while for a way to perform seek and tell operations on gzip files, which I need for one of my projects, and finally found your work. Thank you so very much for this!

However, it seems like the extraction functionalities always output the whole file, starting from the given line / offset.
I would need a way to use gztool to, say, for instance, "Extract lines between 12 and 18", or "Extract content between offset 125 and offset 1500", but I don't seem to find any option allowing this?

I'm currently browsing through the source code trying to see if I can slightly alter it to fit my needs, but maybe I'm just overlooking an option that already exists?

Thanks,
Pierre
