
pigz-bench for Python

Introduction

These simple Python scripts benchmark different zlib compression libraries for the pigz parallel compressor. Parallel compression can use the multiple cores available in modern computers to rapidly compress data. This technique can be combined with CloudFlare zlib or zlib-ng, which accelerate compression using other features of modern hardware.

By default, these scripts examine compression and decompression of the Silesia compression corpus. Some compression methods work particularly well for some datasets (for example, the Italian alphabet has only 21 letters, whereas some other languages have a larger set). The scripts can also install an optional corpus of brain imaging data in the NIfTI format. It is common for tools like AFNI and FSL to save NIfTI images using gzip compression (.nii.gz files). Modern MRI methods such as multi-band yield huge datasets, so considerable time is spent compressing these images. You can also choose to compress the files in any folder you wish, allowing you to create a custom corpus, or download and test the Canterbury corpus or the Calgary corpus.
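Creating a custom corpus amounts to copying the files you care about into the corpus folder. A minimal sketch (the folder name corpus follows the convention used by these scripts; the helper function is an illustration, not part of the repository):

```python
# Populate a custom corpus folder with the files you want to benchmark.
# The folder name "corpus" follows the convention described above.
import shutil
from pathlib import Path

def build_corpus(source_dir, corpus_dir="corpus"):
    """Copy every regular file from source_dir into corpus_dir."""
    corpus = Path(corpus_dir)
    corpus.mkdir(exist_ok=True)
    copied = []
    for path in Path(source_dir).iterdir():
        if path.is_file():
            copied.append(shutil.copy2(path, corpus / path.name))
    return copied
```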

The graph below shows the performance of pigz variants compressing the Silesia corpus with increasing numbers of threads (CPU cores) devoted to the compression. The test system was a 12-core (24 thread) AMD Ryzen 3900X:

[Figure: compression speed of pigz variants versus thread count]

The next graph shows each tool using its preferred number of threads to compress the Silesia corpus. All versions of pigz outperform the system's single-threaded gzip. One can see that the modern zstd format dominates the older and simpler gzip. gzip has been widely adopted in many fields (for example, in brain imaging it is used for NIfTI, NRRD and AFNI). The simplicity of the gzip format makes it easy for developers to support in their tools, so gzip fills an important niche in the community. However, this graph demonstrates that modern compression formats like zstd, which were designed for modern hardware and leverage new techniques, have inherent benefits.

[Figure: compression performance of each tool at its preferred thread count]

The script c_decompress.py allows us to compare decompression speed. Decompression is faster than compression. However, gzip decompression cannot leverage multiple threads and is generally slower than modern compression formats. On the other hand, the modern zstd is not tuned for the datatypes common in science, and the tests below illustrate that gz decompression remains competitive in this niche. In this test, all gz tools are decompressing the same data (addressing a concern by Sebastian Pop that different gzip compressors create different file sizes, and smaller files might be more complicated and therefore slower to extract). In contrast, bzip2 and zstd are decompressing data that was compressed to a smaller size. It is typical for more compact compression to use more complicated algorithms, so comparing between formats is challenging. Regardless, among gz tools, zlib-ng shows superior decompression performance:

| Speed (MB/s)  | pigz-CF | pigz-ng | pigz-Sys | gzip | pbzip2 | zstd |
|---------------|---------|---------|----------|------|--------|------|
| Decompression | 307     | 361     | 306      | 189  | 448    | 528  |
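The throughput figures above come from timing external compressors on the corpus. As a rough illustration of the measurement itself, here is a minimal sketch using Python's standard gzip module (the payload and sizes are invented for the example; the real scripts time compiled binaries):

```python
# Minimal sketch of measuring decompression throughput (MB/s) with the
# standard gzip module. The benchmark scripts time external binaries
# (pigz, zstd, pbzip2) instead; this only illustrates the arithmetic.
import gzip
import time

payload = b"The quick brown fox jumps over the lazy dog. " * 200_000  # ~9 MB
compressed = gzip.compress(payload, compresslevel=6)

start = time.perf_counter()
restored = gzip.decompress(compressed)
elapsed = time.perf_counter() - start

assert restored == payload  # the round trip must be lossless
mb_per_s = len(payload) / (1024 * 1024) / elapsed
print(f"decompressed {len(payload) / (1024 * 1024):.1f} MB at {mb_per_s:.0f} MB/s")
```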

Running the benchmark

Run the benchmark with a command like the following (you may need to use python instead of python3):

python3 a_compile.py
python3 b_speed_threads.py
python3 c_decompress.py
python3 d_speed_size.py
python3 f_speed_size_decompress.py

Dependencies

These scripts require Python 3.3 or later (for functions like shutil.which and os.cpu_count).
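A quick sanity check for these requirements, using the same standard-library calls the scripts depend on:

```python
# Verify the standard-library features the benchmark scripts rely on.
import os
import shutil
import sys

assert sys.version_info >= (3, 3), "Python 3.3+ is required"
print("CPU cores:", os.cpu_count())          # used to pick thread counts
print("gzip binary:", shutil.which("gzip"))  # None if gzip is not on the PATH
```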

The compile script requires that your system has a C compiler, CMake, and git installed. Installation depends on the operating system; on a Debian-based Linux system it might be sudo apt install build-essential cmake git.

You may also have to install some Python packages. You can install these with your favorite package manager; if you use pip3, the commands will look like this:

pip3 install seaborn
pip3 install psutil

This should be sufficient for most modern systems (since installing seaborn should install pandas, scipy, numpy). However, you may need to install additional dependencies (e.g. pip3 install Cython; pip3 install numpy).

The a_compile.py script builds variants of pigz. It will also test the gzip, zstd and pbzip2 compressors if they are installed. Installation varies between operating systems. For example, on Debian-based Linux distributions you could run sudo apt install pbzip2 to install pbzip2.

Running data on a server

These scripts will attempt to generate a line plot showing the performance of different versions of pigz. These plots require access to a graphical display. Some servers only provide text-based command line access, so in these cases the scripts will report Plot the results on a machine with a graphical display. You can then copy the generated result files and view them on a computer with a graphical display. This Python snippet shows how to view plots for results generated on a different computer:

import b_speed_threads
b_speed_threads.plot('speed_threadsAmpere.pkl')
import d_speed_size
d_speed_size.plot('speed_size.pkl')

The scripts

  1. a_compile.py will download and build copies of pigz using different zlib variants (system, CloudFlare, ng). It also downloads sample images to test compression, specifically the sample MRI scans which are copied to the folder corpus. You must run this script once first, before the other scripts. All the other scripts can be run independently of each other.
  2. b_speed_threads.py compares the speed of the different versions of pigz as the number of threads is increased. Each variant is timed compressing the files in the folder corpus. You can replace the files in the corpus folder with ones more representative of the files you hope to compress.
  3. c_decompress.py evaluates the decompression speed. In general, the gzip format is slow to compress but fast to decompress (particularly compared to formats developed at the same time). However, gzip decompression is slow relative to the modern zstd. Further, while gzip compression can benefit from parallel processing, decompression does not. An important feature of this script is that each variant of zlib contributes compressed files to the testing corpus, and then each tool is tested on this full corpus. This ensures we are comparing similar tasks, as some zlib compression methods might generate smaller files at the cost of creating files that are slower to decompress. The script also validates the compression and decompression of each datatype, ensuring the process is truly lossless.
  4. d_speed_size.py compares different variants of pigz to gzip, zstd and bzip2 for compressing the corpus. Each tool is tested at different compression levels, but always using its preferred number of threads.
  5. e_test_mgzip.py evaluates mgzip which creates gz format files that are both compressed and decompressed in parallel. The files created by this method can be decompressed by any gz compatible tool, but the faster parallel decompression requires using mgzip.
  6. f_speed_size_decompress.py combines c_decompress.py and d_speed_size.py into a single script. The strength of this script is that it is easy to extend: you can edit it to include additional compressors (for example, commented-out lines test lz4 and xz compression). It can be run with two optional arguments. The first sets the folder with files to compress (defaults to ./corpus). The second determines how many runs are computed (default 3). This script reports the fastest time across all the runs.
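The speed/size trade-off that the speed-versus-size script measures for external tools can be sketched with Python's built-in zlib (the sample data is invented; real runs use the files in the corpus folder):

```python
# Illustrate the speed/size trade-off across compression levels using
# the standard zlib module. The benchmark scripts do this with external
# binaries (pigz, zstd, pbzip2); this is only a toy analogue.
import time
import zlib

data = b"NIfTI voxels compress well when neighbouring values repeat. " * 100_000

sizes = {}
for level in (1, 6, 9):
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    sizes[level] = len(packed)
    ratio = len(data) / len(packed)
    print(f"level {level}: {elapsed * 1000:6.1f} ms, ratio {ratio:6.1f}:1")
```
Higher levels spend more time searching for matches, so they never produce a larger file on repetitive input, but the extra time grows faster than the savings.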

Testing custom versions of pigz

The script a_compile.py will compile 3 popular variants of pigz and copy these to the exe folder. The subsequent scripts will test all executables in this folder. Therefore, you can copy your own variation into this folder and compare your best effort against the competition. Issue 1 describes how to easily compile a custom variation without changing the base version.
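The discovery of candidate executables in the exe folder can be sketched as follows (the folder layout is assumed from the description above; the helper function is illustrative, not the repository's actual code):

```python
# List every executable in the exe folder, the same way the benchmark
# scripts pick up candidate pigz builds. The folder name "exe" is
# assumed from the README; adjust the path for your checkout.
import os

def find_executables(folder="exe"):
    """Return sorted paths of all executable files in folder (empty if absent)."""
    if not os.path.isdir(folder):
        return []
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
        and os.access(os.path.join(folder, name), os.X_OK)
    )

print(find_executables())
```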

Alternatives

  • Python users may want to examine mgzip. Like pigz, mgzip can compress files to gz format in parallel. However, it can also decompress files created with mgzip in parallel. The gz files created by mgzip are completely valid gzip files, so they can be decompressed by any gzip-compatible tool. However, these files require a tiny bit more disk space, which is what allows parallel blocked decompression (as long as you use mgzip to do the decompression). For optimal performance, one should set a blocksize that corresponds to the number of threads used for compression. This repository includes the e_test_mgzip.py script to evaluate this tool.
  • Python users can use indexed-gzip to generate an index file for any gzip file. This index file accelerates random access to a gzip file.
  • These Python scripts are ported from shell scripts. Some users may prefer the shell scripts, which have fewer dependencies.

pigzbench's People

Contributors: neurolabusc, nmoinvaz

pigzbench's Issues

python a_compile.py on an Apple M1 Mac results in an error

Attempting to run python a_compile.py on an Apple M1 Mac results in an error.

If it helps, building with cmake works as expected when using a similar project https://github.com/neurolabusc/pigz-bench

environment details

  System Version: macOS 12.6 (21G115)
  Kernel Version: Darwin 21.6.0
  Boot Volume: Macintosh HD
  Boot Mode: Normal
  Secure Virtual Memory: Enabled
  System Integrity Protection: Enabled
  Model Name: MacBook Pro
  Model Identifier: MacBookPro18,3
  Chip: Apple M1 Pro
  Total Number of Cores: 10 (8 performance and 2 efficiency)
  Memory: 16 GB
  System Firmware Version: 7459.141.1
  OS Loader Version: 7459.141.1

full console output below

  CMake Deprecation Warning at /Users/andrej/Documents/Projects/pigzbench-master/zlib-madler/CMakeLists.txt:1 (cmake_minimum_required):
   Compatibility with CMake < 2.8.12 will be removed from a future version of
   CMake.
 
   Update the VERSION argument <min> value or use a ...<max> suffix to tell
   CMake that the project does not need compatibility with older versions.
 
 
 -- Configuring done
 CMake Warning (dev):
   Policy CMP0042 is not set: MACOSX_RPATH is enabled by default.  Run "cmake
   --help-policy CMP0042" for policy details.  Use the cmake_policy command to
   set the policy and suppress this warning.
 
   MACOSX_RPATH is not specified for the following targets:
 
    zlib
 
 This warning is for project developers.  Use -Wno-dev to suppress it.
 
 -- Generating done
 -- Build files have been written to: /Users/andrej/Documents/Projects/pigzbench-master/pigz-madler/build
 Consolidate compiler generated dependencies of target zlibstatic
 [ 80%] Built target zlibstatic
 Consolidate compiler generated dependencies of target pigz
 [ 85%] Linking C executable pigz
 [100%] Built target pigz
 /Users/andrej/Documents/Projects/pigzbench-master/pigz-madler/build/pigz->/Users/andrej/Documents/Projects/pigzbench-master/exe/pigz-madler
 -- Configuring done
 -- Generating done
 -- Build files have been written to: /Users/andrej/Documents/Projects/pigzbench-master/pigz-cloudflare/build
 Consolidate compiler generated dependencies of target zlib
 [ 81%] Built target zlib
 Consolidate compiler generated dependencies of target pigz
 [ 86%] Linking C executable pigz
 Undefined symbols for architecture arm64:
   "_get_crc_table", referenced from:
       _main in pigz.c.o
 ld: symbol(s) not found for architecture arm64
 clang: error: linker command failed with exit code 1 (use -v to see invocation)
 make[2]: *** [pigz] Error 1
 make[1]: *** [CMakeFiles/pigz.dir/all] Error 2
 make: *** [all] Error 2
 Traceback (most recent call last):
   File "/Users/andrej/mambaforge/envs/python/lib/python3.10/shutil.py", line 815, in move
     os.rename(src, real_dst)
 FileNotFoundError: [Errno 2] No such file or directory: '/Users/andrej/Documents/Projects/pigzbench-master/pigz-cloudflare/build/Release/pigz' -> '/Users/andrej/Documents/Projects/pigzbench-master/exe/pigz-cloudflare'
 
 During handling of the above exception, another exception occurred:
 
 Traceback (most recent call last):
   File "/Users/andrej/Documents/Projects/pigzbench-master/a_compile.py", line 174, in <module>
     compile_pigz(args.rebuild)
   File "/Users/andrej/Documents/Projects/pigzbench-master/a_compile.py", line 164, in compile_pigz
     shutil.move(pigzexe, outnm)
   File "/Users/andrej/mambaforge/envs/python/lib/python3.10/shutil.py", line 835, in move
     copy_function(src, real_dst)
   File "/Users/andrej/mambaforge/envs/python/lib/python3.10/shutil.py", line 434, in copy2
     copyfile(src, dst, follow_symlinks=follow_symlinks)
   File "/Users/andrej/mambaforge/envs/python/lib/python3.10/shutil.py", line 254, in copyfile
     with open(src, 'rb') as fsrc:
 FileNotFoundError: [Errno 2] No such file or directory: '/Users/andrej/Documents/Projects/pigzbench-master/pigz-cloudflare/build/Release/pigz'
