linkedin / migz

Multithreaded, gzip-compatible compression and decompression, available as a platform-independent Java library and command-line utilities.

License: BSD 2-Clause "Simplified" License

Languages: Java 98.21%, Shell 1.79%

migz's Introduction

MiGz

Motivation

Compressing and decompressing files with standard compression utilities like gzip is a single-threaded affair. For large files on fast disks, that single thread becomes the bottleneck.

There are several utilities for multithreaded compression, including an extant Java library (https://github.com/shevek/parallelgzip), but no Java library (or GZip utility) also supports multithreaded decompression, which is especially important for large files that are read repeatedly. Hence, MiGz.

Benefits

MiGz uses the GZip format, which has widespread support and offers both reasonably fast speed and a good compression ratio.

MiGz'ed files are also entirely valid GZip files, and can be read (single-threaded) by any GZip utility/library, including GZipInputStream! Better still, MiGz'ed files can be multithreadedly decompressed by the MiGz decompressor.
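
For instance, here is a minimal sketch (the helper name and file path are illustrative, not part of MiGz) of reading a MiGz'ed file with Java's standard, single-threaded GZIPInputStream:

import java.io.*;
import java.util.zip.GZIPInputStream;

// Minimal sketch: a MiGz'ed file is a valid gzip file, so Java's built-in
// (single-threaded) GZIPInputStream can read it unmodified.
static void printMiGzFile(String path) throws IOException {
    try (GZIPInputStream gis = new GZIPInputStream(new FileInputStream(path))) {
        byte[] buf = new byte[8192];
        int n;
        while ((n = gis.read(buf)) != -1) {
            System.out.write(buf, 0, n);
        }
        System.out.flush();
    }
}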

On multicore machines, MiGz compression is much faster for any reasonably large file (tens of megabytes or more); 6x gains were seen on a MacBook with a large Wikipedia dump vs. the gzip command line utility (see Performance, below), with only ~1% increase in file size vs. gzip at max compression.

Decompression is also sped up for larger files (many tens of megabytes or more); for smaller files, it's about the same as Java's built-in single-threaded GZipInputStream. Decompression of the aforementioned Wikipedia data was over 3x faster.

Performance

Using default settings on a MacBook Pro (with an SSD) with four hyperthreaded physical cores (8 logical cores):

Shakespeare

The time to compress a 25.6MB collection of Shakespeare text was 25% that of GZip at max compression (~1.35s vs. ~6s), with MiGz's output being ~1% larger. However, the time to decompress, measured with the munzip command-line tool, was ~0.25s vs. GZip's ~0.09s, mostly attributable to Java overhead: the time to decompress in Java with GZip is a slightly faster ~0.23s.

Still, using the Java API in a tight loop, decompressing the same in-memory data 100 times and discarding the result, the decompression time per copy is ~0.019s vs. ~0.073s for GZipInputStream. We suspect that MiGz requires either some JIT-related warm-up or amortization of its extra class-loading cost (vs. GZipInputStream) before gains appear on smaller files.
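
A minimal sketch of that tight-loop measurement (the helper name and buffer size are illustrative):

import com.linkedin.migz.MiGzInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

// Minimal sketch: decompress the same in-memory MiGz data 100 times,
// discarding the output, to time per-copy decompression after warm-up.
static void timeDecompression(byte[] compressed) throws IOException {
    byte[] discard = new byte[64 * 1024];
    for (int i = 0; i < 100; i++) {
        long start = System.nanoTime();
        try (MiGzInputStream mis = new MiGzInputStream(new ByteArrayInputStream(compressed))) {
            while (mis.read(discard) != -1) { } // read and discard
        }
        System.out.printf("copy %d: %.3fs%n", i, (System.nanoTime() - start) / 1e9);
    }
}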

German Wikipedia

This is an 18GB XML dump of German Wikipedia articles. At maximum compression, MiGz compresses it in 198.2s, vs. 810.2s for GZip. Decompression is 15.6s for MiGz and 65.2s for GZip. Compressed file size is roughly equal: 5.74GB for MiGz and 5.70GB for GZip (a difference of less than 1%).

Using MiGz in Java and other JVM Languages

MiGz is used just like you would use GZipInputStream and GZipOutputStream, with the analogous MiGzInputStream and MiGzOutputStream classes. For example, decompression is as simple as:

InputStream is = ...
MiGzInputStream mis = new MiGzInputStream(is);
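
From there, the MiGzInputStream is read like any other InputStream; a fuller sketch (the helper name, file paths, and buffer size are illustrative):

import com.linkedin.migz.MiGzInputStream;
import java.io.*;

// Minimal sketch: decompress a MiGz (or any gzip) file to another file.
static void decompressFile(String inPath, String outPath) throws IOException {
    try (MiGzInputStream mis = new MiGzInputStream(new FileInputStream(inPath));
         OutputStream out = new FileOutputStream(outPath)) {
        byte[] buf = new byte[8192];
        int n;
        while ((n = mis.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }
}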

Compression is just as simple:

OutputStream os = ...
MiGzOutputStream mos = new MiGzOutputStream(os);
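
And the corresponding compression sketch (again, the helper name and paths are illustrative); closing the MiGzOutputStream finishes the compressed output:

import com.linkedin.migz.MiGzOutputStream;
import java.io.*;

// Minimal sketch: compress a file with MiGzOutputStream.
static void compressFile(String inPath, String outPath) throws IOException {
    try (InputStream in = new FileInputStream(inPath);
         MiGzOutputStream mos = new MiGzOutputStream(new FileOutputStream(outPath))) {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            mos.write(buf, 0, n);
        }
    } // close() waits for any pending blocks to be compressed and written
}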

Using MiGz from the Command-line

The MiGz project also comes with modules for two simple command-line tools; you may build these yourself or use our precompiled executables (for *nix platforms) or JARs (other platforms).

mzip

mzip uses MiGz to compress data from stdin and writes the compressed data to stdout. For example, to compress data.txt and write the result to data.gz, we can run:

mzip < data.txt > data.gz

munzip

munzip likewise uses MiGz to decompress data from stdin and output the original, uncompressed data to stdout. For example, to decompress data.gz back to data.txt:

munzip < data.gz > data.txt

The default block size is 512KB, which provides good speed (smaller block sizes allow better parallelization) on relatively "small" files (tens of MB) while still keeping file sizes very close to standard gzip's; you can reduce the block size to ~100KB before the difference becomes really noticeable.

The default thread count is either the number of logical cores on your machine (decompression) or twice that (compression). Extra threads are used for compression because MiGz uses them to effectively buffer the output without a dedicated writer thread. However, this may change in the future, and we recommend sticking with the default thread count as future-proofing.
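
Both settings can also be chosen programmatically. In the sketch below, the two-argument MiGzInputStream constructor matches the usage shown in the issues further down this page; the three-argument MiGzOutputStream constructor (output stream, threads, block size) is an assumption based on the defaults described above:

InputStream is = ...
OutputStream os = ...
int threads = Runtime.getRuntime().availableProcessors();

// Decompression: explicit thread count (the default is the number of logical cores).
MiGzInputStream mis = new MiGzInputStream(is, threads);

// Compression: twice the core count (the default) and a 512KB block size;
// this three-argument constructor is an assumption, not confirmed above.
MiGzOutputStream mos = new MiGzOutputStream(os, threads * 2, 512 * 1024);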

Building MiGz

To build the MiGz Java library, use the command gradle :migz:build. To build the munzip tool, use the command gradle :munzip:build. To build the mzip tool, use the command gradle :mzip:build.

migz's People

Contributors

jeffpasternack


migz's Issues

MiGz fails to decompress GZIP-compressed data

The following two pairs of data are, respectively, GZIP- and MiGz-compressed binary data:
GZIP1: 1f8b08000000000000003334323634320400ee129cde06000000
MiGz1: 1f8b080400000000020008004d5a0400080000003334323634320400ee129cde06000000
GZIP2: 1f8b08000000000000002b4b2c4b492c2b4b4c010071ec6fe909000000
MiGz2: 1f8b080400000000020008004d5a04000b0000002b4b2c4b492c2b4b4c010071ec6fe909000000

The MiGz-compressed data contains an extra 20008004d5a040008000 or 20008004d5a04000b000; can these 10 bytes be removed?
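
For reference, the extra bytes parse as a standard gzip FEXTRA field (RFC 1952): the FLG byte becomes 04 (FEXTRA set), and after the MTIME/XFL/OS bytes come a 2-byte extra-field length (0800 = 8), the subfield ID 4d5a ("MZ"), a 2-byte subfield length (0400 = 4), and a 4-byte value holding the compressed block length (08000000 = 8 and 0b000000 = 11 above, matching the deflate payloads that follow). This per-block length is presumably what lets the MiGz decompressor split work across threads (consistent with the "munzip fails on gzipped files" issue below), so stripping it would leave a file that only plain gzip readers could decompress.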

Deadlock seen in MiGzOutputStream#close:329 version 1.0.0

Hi guys, first of all major thanks for open sourcing this library! I've been able to significantly improve the I/O performance of my application with this.

I just came across an instance where calling MiGzOutputStream#close hangs indefinitely. A heap dump reveals the thread in question is waiting for the thread pool to shut down.


if (!_threadPool.awaitTermination(3650, TimeUnit.DAYS)) { // wait a long, long time

Is this a known issue? Is there any reason to think the ForkJoinPool introduced in 2.0-beta1 would prevent this behavior?

Inputs:

  • numThreads => 96 (48 core machine)
  • outputStream => FileOutputStream("/dev/null")
  • bytesIn => ~12GiB
  • bytesOut => ~3GiB

munzip fails on gzipped files

Hi guys,

Thanks for the great library! It's made some of my Java programs that were I/O bound much faster :)

However, it's currently hard to use MiGz unless you can guarantee that it will never try to unzip a file zipped by vanilla gzip. If it encounters one, it crashes:

% gzip junk.csv
% munzip < junk.csv.gz > junk.csv
Decompressing stdin using 40 threads
Exception in thread "main" java.lang.RuntimeException: java.util.zip.DataFormatException: invalid code lengths set
	at com.linkedin.migz.MiGzInputStream.decompressorThreadWithInflater(MiGzInputStream.java:208)
	at com.linkedin.migz.MiGzInputStream.decompressorThread(MiGzInputStream.java:114)
	at com.concurrentli.Interrupted.lambda$ignored$1(Interrupted.java:48)
	at java.base/java.lang.Thread.run(Thread.java:830)
Caused by: java.util.zip.DataFormatException: invalid code lengths set
	at java.base/java.util.zip.Inflater.inflateBytesBytes(Native Method)
	at java.base/java.util.zip.Inflater.inflate(Inflater.java:378)
	at java.base/java.util.zip.Inflater.inflate(Inflater.java:464)
	at com.linkedin.migz.MiGzInputStream.decompressorThreadWithInflater(MiGzInputStream.java:192)
	... 3 more

Would it be possible for MiGz to just decompress serially in this case? The opposite seems to work fine; I can decompress MiGz files with gunzip with no problem.

This is MiGz 1.0.1 running under Oracle OpenJDK 13.0.2 on RedHat 7.4.
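
In the meantime, a hedged workaround sketch (not part of MiGz): peek at the gzip FLG byte and fall back to the single-threaded GZIPInputStream when the FEXTRA bit that MiGz sets is absent. Note that other gzip writers can also set FEXTRA, so this check is only a heuristic:

import com.linkedin.migz.MiGzInputStream;
import java.io.*;
import java.util.zip.GZIPInputStream;

// Heuristic sketch: MiGz output sets the FEXTRA flag (bit 2 of the FLG byte,
// the fourth header byte, per RFC 1952); plain gzip output typically does not.
static InputStream openGzipOrMiGz(InputStream raw) throws IOException {
    PushbackInputStream in = new PushbackInputStream(raw, 4);
    byte[] header = new byte[4];
    int n = in.readNBytes(header, 0, 4);
    if (n > 0) {
        in.unread(header, 0, n);
    }
    boolean fextra = n == 4 && (header[3] & 0x04) != 0;
    return fextra ? new MiGzInputStream(in) : new GZIPInputStream(in);
}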

hang with certain block sizes and data sets

I have a case where MiGz hangs when I set the block size below ~120*1024 bytes. This happens for only a single data set I found, so it is input-specific. The code hangs inside os.write() with CPU usage at 0, so it seems nothing is being done; I suspect it hangs inside scheduleCurrentBlock(false).

Are there any constraints on the block size that can be used? or is this a bug?

Windows Command-Line Tools

Hi, could you provide binaries for Windows? I'd like to include MiGz in my benchmark roster using 512KB and bigger blocks. By the way, how big can the blocks be?

Your explanation of the acronym is not clear to me, please explain:
"... also supports multithreaded decompression, which is especially important for large files that are read repeatedly. Hence, MiGz."

What does the little 'i' stand for?

Allow me two more questions:

  • Why not in C?
  • Are you aware that MultIple decompressions are not bottlenecked by CPU in LzTurbo 29? I mean, your test results using 4 cores are even inferior to the single-threaded decompression of LzTurbo 29/39!
        Size  Ratio%     C.MB/s     D.MB/s   Compressor                       File
     1447249    12.6       0.50     471.90   brotli 11d29                     ftp.gnu.org_grep-3.3.tar
     1455165    12.6       2.10     137.62   lzma 9d29:fb273:mf=bt4           ftp.gnu.org_grep-3.3.tar
     1489213    12.9       0.24    1063.22   oodle 139 ‘Leviathan’            ftp.gnu.org_grep-3.3.tar
     1496718    13.0       0.18    1322.62   oodle 129 ‘Hydra’                ftp.gnu.org_grep-3.3.tar
     1496718    13.0       0.45    1322.01   oodle 89 ‘Kraken’                ftp.gnu.org_grep-3.3.tar
     1513749    13.1       2.24    1070.13   zstd 22d29                       ftp.gnu.org_grep-3.3.tar
     1517944    13.2       0.16     346.06   lzham 4fb258:x4:d29              ftp.gnu.org_grep-3.3.tar
     1521395    13.2       1.49    1552.77   lzturbo 39                       ftp.gnu.org_grep-3.3.tar
     1756302    15.2      39.10    1542.17   lzturbo 32                       ftp.gnu.org_grep-3.3.tar
     1774686    15.4      21.85    1145.93   zstd 12                          ftp.gnu.org_grep-3.3.tar
     1782164    15.5       1.52    1953.87   lzturbo 29                       ftp.gnu.org_grep-3.3.tar
     1875468    16.3       1.47    1581.11   lizard 49                        ftp.gnu.org_grep-3.3.tar
     2114046    18.4      54.64     886.77   oodle 132 ‘Leviathan’            ftp.gnu.org_grep-3.3.tar
     2163309                       1548      Nakamichi 'Ryuugan-ditto-1TB'    ! Outside TurboBench, Intel-v15.0-64bit-archSSE41 compile !
     2172516    18.9       2.42    3501.52   oodle 118 ‘Selkie’               ftp.gnu.org_grep-3.3.tar
     2172516    18.9       2.42    3495.15   oodle 116 ‘Selkie’               ftp.gnu.org_grep-3.3.tar
     2359093    20.5     306.55    1233.67   lzturbo 30                       ftp.gnu.org_grep-3.3.tar
     2404889    20.9      15.35     333.83   zlib 9                           ftp.gnu.org_grep-3.3.tar
     2406525    20.9      46.42    3756.11   oodle 114 ‘Selkie’               ftp.gnu.org_grep-3.3.tar

As you can see from the above (random) test, LzTurbo 30 decompresses 1233.67/333.83=3.69x and compresses 306.55/15.35=19.97x faster than zlib 9!

        Size  Ratio%     C.MB/s     D.MB/s   Compressor                       File
     6998037    17.4       0.90     849.88   lzturbo 39                       Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar
     7022279    17.4       0.37     321.86   brotli 11d29                     Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar
     7049563    17.5       1.38     684.89   zstd 22d29                       Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar
     7071491    17.5       0.29     644.84   oodle 139 ‘Leviathan’            Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar
     7103502    17.6       0.40     724.20   oodle 89 ‘Kraken’                Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar
     7105986    17.6       0.23     723.98   oodle 129 ‘Hydra’                Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar
     7125187    17.7      10.61      25.82   bzip2                            Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar
     7960854    19.8       0.91    1302.16   lzturbo 29                       Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar
     8061825    20.0       1.38     924.79   lizard 49                        Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar
     9041702                       1077      Nakamichi 'Ryuugan-ditto-1TB'    ! Outside TurboBench, Intel-v15.0-64bit-archSSE41 compile !
     9314676    23.1      37.27    1085.78   lzturbo 32                       Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar
     9759547    24.2       0.57    1812.03   oodle 116 ‘Selkie’               Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar
     9759547    24.2       0.57    1810.97   oodle 118 ‘Selkie’               Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar
    10771358    26.7       4.30     320.47   zlib 9                           Complete_works_of_Fyodor_Dostoyevsky_in_15_volumes_(Russian).tar

For the second random example, LzTurbo 29 is ~4x (1302.16/320.47) faster than zlib 9 in decompression, with no need for 4 cores.

IndexOutOfBoundsException

Hi!

root@wendigo:/data/snc_backups/full/temp# pv stream.gz | /home/ayengar/munzip | mbstream -x -v
Decompressing stdin using 32 threads
Exception in thread "main" java.lang.IndexOutOfBoundsException
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:344)
        at com.linkedin.migz.MiGzInputStream.readFromInputStream(MiGzInputStream.java:222)
        at com.linkedin.migz.MiGzInputStream.decompressorThreadWithInflater(MiGzInputStream.java:172)
        at com.linkedin.migz.MiGzInputStream.decompressorThread(MiGzInputStream.java:114)
        at com.concurrentli.Interrupted.lambda$ignored$1(Interrupted.java:48)
        at java.base/java.lang.Thread.run(Thread.java:829)
72.0KiB 0:00:00 [ 725KiB/s] [>                    ]  0%
root@wendigo:/data/snc_backups/full/temp#

'Nuff said?

failed with using MiGzInputStream

Hello, developers,
When I use MiGzInputStream with availableProcessors() threads, I get exceptions such as OOM or IndexOutOfBounds. My code is:

int threadCount = Runtime.getRuntime().availableProcessors();
ByteArrayInputStream in = new ByteArrayInputStream(compressedBytes);
MiGzInputStream miGzInputStream = new MiGzInputStream(in, threadCount);
ByteArrayOutputStream out = new ByteArrayOutputStream();
MiGzBuffer buffer;
while ((buffer = miGzInputStream.readBuffer()) != null) {
      out.write(buffer.getData(), 0, buffer.getLength());
}
out.close();

How can I resolve this? Thanks
