piezoid / pugz
This project forked from ebiggers/libdeflate
Truly parallel gzip decompression
License: MIT License
Adding support wouldn't be so difficult.
Once a random-access thread reaches a block marked as the last block of a gzip part, it could parse the footer and the header of the next part and start a classic decompression from there.
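A minimal sketch of the footer/header skipping step described above, assuming the member layout from RFC 1952. The names gzip_header_size, next_member_payload and kGzipFooterSize are illustrative and are not part of pugz; the sketch only locates where the next deflate stream starts and does not verify the CRC32 or ISIZE fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Size of the gzip member trailer: CRC32 (4 bytes) + ISIZE (4 bytes), per RFC 1952.
constexpr std::size_t kGzipFooterSize = 8;

// Returns the size of the gzip member header starting at data[0],
// or 0 if the header is malformed or truncated (a valid header is >= 10 bytes).
inline std::size_t gzip_header_size(const std::uint8_t* data, std::size_t len)
{
    assert(data != nullptr || len == 0);
    const std::uint8_t FHCRC = 0x02, FEXTRA = 0x04, FNAME = 0x08, FCOMMENT = 0x10;

    // Fixed part: ID1, ID2, CM (8 = deflate), FLG, MTIME(4), XFL, OS.
    if (len < 10 || data[0] != 0x1f || data[1] != 0x8b || data[2] != 8) return 0;
    const std::uint8_t flags = data[3];
    std::size_t pos = 10;

    if (flags & FEXTRA) { // 2-byte little-endian length, then that many bytes
        if (len - pos < 2) return 0;
        const std::size_t xlen = data[pos] | (std::size_t(data[pos + 1]) << 8);
        pos += 2;
        if (len - pos < xlen) return 0;
        pos += xlen;
    }
    // FNAME and FCOMMENT are zero-terminated strings, in that order.
    const std::uint8_t string_flags[2] = { FNAME, FCOMMENT };
    for (std::uint8_t flag : string_flags) {
        if (!(flags & flag)) continue;
        while (pos < len && data[pos] != 0) ++pos;
        if (pos == len) return 0; // terminator not found
        ++pos;
    }
    if (flags & FHCRC) { // optional 2-byte header CRC
        if (len - pos < 2) return 0;
        pos += 2;
    }
    return pos;
}

// Given the byte offset just past the final deflate block of one member,
// returns the offset of the next member's deflate data, or 0 if there is none.
inline std::size_t next_member_payload(const std::uint8_t* data, std::size_t len,
                                       std::size_t deflate_end)
{
    if (deflate_end > len || len - deflate_end <= kGzipFooterSize) return 0;
    const std::size_t header = deflate_end + kGzipFooterSize;
    const std::size_t hsize = gzip_header_size(data + header, len - header);
    return hsize == 0 ? 0 : header + hsize;
}
```

A thread that hits the final block of a part would pass the byte-aligned end of that block as deflate_end and resume sequential decompression at the returned offset.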
Error:
/usr/bin/ld: /tmp/ccNbQPgH.ltrans0.ltrans.o: undefined reference to symbol 'pthread_create@@GLIBC_2.2.5'
/lib/x86_64-linux-gnu/libpthread.so.0: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
Makefile:192: recipe for target 'gunzip' failed
Cause:
g++-5 -o gunzip -std=c++14 -I. -Icommon -lpthread -Iexternal/type_safe/include -Iexternal/type_safe/external/debug_assert -Wall -Wundef -Wrestrict -Wnull-dereference -Wuseless-cast -Wshadow -Weffc++ -Wpedantic -Wvla -O4 -flto -march=native -mtune=native -g -D_POSIX_C_SOURCE=200809L -D_FILE_OFFSET_BITS=64 -DHAVE_CONFIG_H programs/gunzip.o programs/prog_util.o programs/tgetopt.o libdeflate.a -lrt
Reason:
The -lpthread is in the wrong location on this command line. All the -l options need to occur after the object files programs/gunzip.o programs/prog_util.o programs/tgetopt.o where the -lrt is currently found.
Somehow I am suddenly getting sporadic assertion failures, system_error exceptions, and deadlocks, though sometimes it works:
parallelization=64
fileSize=$(( parallelization * 512 * 1024 * 1024 ))
filePath="/dev/shm/base64.gz"
base64 /dev/urandom | head -c $fileSize | pigz > "$filePath"
for (( i = 0; i < 20; ++i )); do
time taskset --cpu-list 0-$(( parallelization - 1 )) pugz -t $parallelization -l "$filePath";
done
using 64 threads for decompression (experimental)
programs/../lib/deflate_decompress.hpp:945: Assertion '_state == state_t::FAIL' failed in 'std::pair<unique_span<unsigned char, lock_releaser<std::mutex> >, long unsigned int> DeflateThread::get_context()'.
Aborted
real 0m1.563s
user 0m39.937s
sys 0m5.495s
using 64 threads for decompression (experimental)
programs/../lib/deflate_decompress.hpp:945: Assertion '_state == state_t::FAIL' failed in 'std::pair<unique_span<unsigned char, lock_releaser<std::mutex> >, long unsigned int> DeflateThread::get_context()'.
Aborted
real 0m1.549s
user 0m44.777s
sys 0m7.197s
using 64 threads for decompression (experimental)
programs/../lib/deflate_decompress.hpp:945: Assertion '_state == state_t::FAIL' failed in 'std::pair<unique_span<unsigned char, lock_releaser<std::mutex> >, long unsigned int> DeflateThread::get_context()'.
Aborted
real 0m1.538s
user 0m47.619s
sys 0m3.533s
using 64 threads for decompression (experimental)
programs/../lib/deflate_decompress.hpp:945: Assertion '_state == state_t::FAIL' failed in 'std::pair<unique_span<unsigned char, lock_releaser<std::mutex> >, long unsigned int> DeflateThread::get_context()'.
Aborted
real 0m1.525s
user 0m44.756s
sys 0m6.804s
using 64 threads for decompression (experimental)
programs/../lib/deflate_decompress.hpp:945: Assertion '_state == state_t::FAIL' failed in 'std::pair<unique_span<unsigned char, lock_releaser<std::mutex> >, long unsigned int> DeflateThread::get_context()'.
Aborted
real 0m1.506s
user 0m47.683s
sys 0m4.834s
using 64 threads for decompression (experimental)
446230368
real 0m8.580s
user 8m3.151s
sys 0m16.146s
using 64 threads for decompression (experimental)
programs/../lib/deflate_decompress.hpp:945: Assertion '_state == state_t::FAIL' failed in 'std::pair<unique_span<unsigned char, lock_releaser<std::mutex> >, long unsigned int> DeflateThread::get_context()'.
Aborted
real 0m1.566s
user 0m47.136s
sys 0m6.704s
using 64 threads for decompression (experimental)
446230368
real 0m8.553s
user 8m3.348s
sys 0m14.164s
using 64 threads for decompression (experimental)
terminate called after throwing an instance of 'std::system_error'
what():
Aborted
real 0m5.413s
user 4m57.656s
sys 0m11.702s
using 64 threads for decompression (experimental)
programs/../lib/deflate_decompress.hpp:945: Assertion '_state == state_t::FAIL' failed in 'std::pair<unique_span<unsigned char, lock_releaser<std::mutex> >, long unsigned int> DeflateThread::get_context()'.
Aborted
real 0m1.524s
user 0m57.496s
sys 0m3.829s
using 64 threads for decompression (experimental)
programs/../lib/deflate_decompress.hpp:945: Assertion '_state == state_t::FAIL' failed in 'std::pair<unique_span<unsigned char, lock_releaser<std::mutex> >, long unsigned int> DeflateThread::get_context()'.
Aborted
real 0m1.531s
user 0m44.179s
sys 0m7.995s
using 64 threads for decompression (experimental)
programs/../lib/deflate_decompress.hpp:945: Assertion '_state == state_t::FAIL' failed in 'std::pair<unique_span<unsigned char, lock_releaser<std::mutex> >, long unsigned int> DeflateThread::get_context()'.
Aborted
real 0m1.542s
user 0m41.501s
sys 0m3.428s
using 64 threads for decompression (experimental)
programs/../lib/deflate_decompress.hpp:945: Assertion '_state == state_t::FAIL' failed in 'std::pair<unique_span<unsigned char, lock_releaser<std::mutex> >, long unsigned int> DeflateThread::get_context()'.
Aborted
real 0m1.505s
user 0m44.743s
sys 0m4.971s
using 64 threads for decompression (experimental)
^C
real 1m33.887s
user 0m35.798s
sys 0m2.809s
The last one deadlocked so that I had to interrupt it.
I cannot reproduce the problem when compressing with gzip or igzip as opposed to pigz.
When compressing with pigz --oneblock, the error messages change a little:
using 32 threads for decompression (experimental)
pugz: pthread_mutex_lock.c:94: ___pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted
using 32 threads for decompression (experimental)
pugz: pthread_mutex_lock.c:94: ___pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted
using 32 threads for decompression (experimental)
13944699
using 32 threads for decompression (experimental)
pugz: pthread_mutex_lock.c:94: ___pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted
using 32 threads for decompression (experimental)
pugz: pthread_mutex_lock.c:94: ___pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted
using 32 threads for decompression (experimental)
pugz: pthread_mutex_lock.c:94: ___pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted
using 32 threads for decompression (experimental)
terminate called after throwing an instance of 'gzip_error'
what(): Got a context from invalid position
Aborted
using 32 threads for decompression (experimental)
terminate called after throwing an instance of 'gzip_error'
what(): Got a context from invalid position
Aborted
using 32 threads for decompression (experimental)
programs/../lib/deflate_decompress.hpp:945: Assertion '_state == state_t::FAIL' failed in 'std::pair<unique_span<unsigned char, lock_releaser<std::mutex> >, long unsigned int> DeflateThread::get_context()'.
Aborted
using 32 threads for decompression (experimental)
pugz: pthread_mutex_lock.c:94: ___pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted
Using a larger --blocksize for pigz seems to alleviate the problems, but I guess they only become rarer, not impossible.
Add a table to the README with results (in MB/s) for, e.g., counting lines and outputting the whole text, versus gunzip, as the results from the paper will likely be improved.
Hi,
I just tried to build the latest commit
$ git rev-parse HEAD
42fb5b4f2ff825b2339a8e7b254ec400e822130c
but received the following error:
$ make
AR libdeflate.a
CXX programs/gunzip.o
In file included from programs/../lib/deflate_decompress.hpp:61:0,
from programs/../lib/gzip_decompress.hpp:40,
from programs/gunzip.cpp:28:
programs/../lib/input_stream.hpp: In member function ‘bool InputStream::set_position_bits(size_t)’:
programs/../lib/input_stream.hpp:233:16: warning: declaration of ‘bits’ shadows a member of ‘InputStream’ [-Wshadow]
size_t bits = bit_pos & 7;
^~~~
programs/../lib/input_stream.hpp:282:39: note: shadowed declaration is here
template<typename T = uint32_t> T bits(bitbuf_size_t n = 8 * sizeof(T)) const
^~~~
In file included from programs/../lib/deflate_decompress.hpp:58:0,
from programs/../lib/gzip_decompress.hpp:40,
from programs/gunzip.cpp:28:
programs/../lib/memory.hpp: In instantiation of ‘unique_span<T, D>::unique_span() [with T = unsigned char; D = lock_releaser<std::mutex>]’:
programs/../lib/deflate_decompress.hpp:946:42: required from here
programs/../lib/memory.hpp:227:21: error: no matching function for call to ‘lock_releaser<std::mutex>::lock_releaser()’
, _end(nullptr)
^
programs/../lib/memory.hpp:436:39: note: candidate: lock_releaser<std::mutex>::lock_releaser(std::unique_lock<std::mutex>::mutex_type&)
using std::unique_lock<Lockable>::unique_lock;
^~~~~~~~~~~
programs/../lib/memory.hpp:436:39: note: candidate expects 1 argument, 0 provided
programs/../lib/memory.hpp:436:39: note: candidate: lock_releaser<std::mutex>::lock_releaser(std::unique_lock<std::mutex>::mutex_type&, std::defer_lock_t)
programs/../lib/memory.hpp:436:39: note: candidate expects 2 arguments, 0 provided
programs/../lib/memory.hpp:436:39: note: candidate: lock_releaser<std::mutex>::lock_releaser(std::unique_lock<std::mutex>::mutex_type&, std::try_to_lock_t)
programs/../lib/memory.hpp:436:39: note: candidate expects 2 arguments, 0 provided
programs/../lib/memory.hpp:436:39: note: candidate: lock_releaser<std::mutex>::lock_releaser(std::unique_lock<std::mutex>::mutex_type&, std::adopt_lock_t)
programs/../lib/memory.hpp:436:39: note: candidate expects 2 arguments, 0 provided
programs/../lib/memory.hpp:436:39: note: candidate: template<class _Clock, class _Duration> lock_releaser<std::mutex>::lock_releaser(std::unique_lock<std::mutex>::mutex_type&, const std::chrono::time_point<_Clock, _Duration1>&)
programs/../lib/memory.hpp:436:39: note: template argument deduction/substitution failed:
programs/../lib/memory.hpp:227:21: note: candidate expects 2 arguments, 0 provided
, _end(nullptr)
^
programs/../lib/memory.hpp:436:39: note: candidate: template<class _Rep, class _Period> lock_releaser<std::mutex>::lock_releaser(std::unique_lock<std::mutex>::mutex_type&, const std::chrono::duration<_Rep1, _Period1>&)
using std::unique_lock<Lockable>::unique_lock;
^~~~~~~~~~~
programs/../lib/memory.hpp:436:39: note: template argument deduction/substitution failed:
programs/../lib/memory.hpp:227:21: note: candidate expects 2 arguments, 0 provided
, _end(nullptr)
^
programs/../lib/memory.hpp:438:5: note: candidate: lock_releaser<Lockable>::lock_releaser(std::unique_lock<_Mutex>&&) [with Lockable = std::mutex]
lock_releaser(std::unique_lock<Lockable>&& lock) noexcept
^~~~~~~~~~~~~
programs/../lib/memory.hpp:438:5: note: candidate expects 1 argument, 0 provided
programs/../lib/memory.hpp:434:49: note: candidate: lock_releaser<std::mutex>::lock_releaser(lock_releaser<std::mutex>&&)
template<typename Lockable = std::mutex> struct lock_releaser : private std::unique_lock<Lockable>
^~~~~~~~~~~~~
programs/../lib/memory.hpp:434:49: note: candidate expects 1 argument, 0 provided
make: *** [programs/gunzip.o] Error 1
My compiler:
$ g++ --version
g++ (GCC) 6.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Not sure what I missed here.
Thank you for your input on this issue!
Best,
Cedric
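One possible workaround, sketched under assumptions: the error log shows that unique_span's default constructor needs a default-constructible lock_releaser, while the user-declared constructor taking std::unique_lock&& suppresses the implicit default constructor and the using-declaration does not supply one on this compiler. Explicitly defaulting the constructor avoids depending on either. The class below is a simplified stand-in for the one in lib/memory.hpp; its operator() and the exposed owns_lock are guesses made for illustration only.

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// Simplified stand-in for lock_releaser in lib/memory.hpp (hypothetical body).
template<typename Lockable = std::mutex>
struct lock_releaser : private std::unique_lock<Lockable>
{
    // Inherits the constructors that take a mutex (and optional tag/timeout).
    using std::unique_lock<Lockable>::unique_lock;
    // Exposed here only so the state can be inspected in this sketch.
    using std::unique_lock<Lockable>::owns_lock;

    // Explicit default constructor: yields a releaser that owns no lock.
    lock_releaser() = default;

    // Takes over an already-held lock.
    lock_releaser(std::unique_lock<Lockable>&& lock) noexcept
      : std::unique_lock<Lockable>(std::move(lock))
    {}

    // Deleter-style call: "freeing" the guarded memory releases the lock.
    template<typename T> void operator()(T*) noexcept
    {
        if (this->owns_lock()) this->unlock();
    }
};
```

With the defaulted constructor in place, a default-constructed releaser simply holds no mutex, which matches what an empty span would need.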
Hi,
I saw your arXiv paper and was excited to try the lib. When I try to compile pugz, I get this:
(base) vini@mussismilia ~/code/pugz master make
AR libdeflate.a
CXX programs/gunzip.o
In file included from programs/../lib/deflate_decompress.hpp:58:0,
from programs/../lib/gzip_decompress.hpp:40,
from programs/gunzip.cpp:28:
programs/../lib/memory.hpp: In function 'malloc_span<T> alloc_huge(size_t)':
programs/../lib/memory.hpp:382:38: error: 'MADV_HUGEPAGE' was not declared in this scope
auto res = ::madvise(ptr, bytes, MADV_HUGEPAGE);
^~~~~~~~~~~~~
programs/../lib/memory.hpp:382:38: note: suggested alternative: 'MADV_MERGEABLE'
auto res = ::madvise(ptr, bytes, MADV_HUGEPAGE);
^~~~~~~~~~~~~
MADV_MERGEABLE
make: *** [Makefile:179: programs/gunzip.o] Error 1
I followed the compiler's suggestion and replaced those strings by running sed -i -- 's/MADV_HUGEPAGE/MADV_MERGEABLE/g' */*. This changed lib/memory.hpp and programs/prog_util.hpp.
Obviously, changing source code blindly like this is not recommended, but I was only experimenting.
I then would get this error:
(base) vini@mussismilia ~/code/pugz master ● make
CXX programs/gunzip.o
CXX programs/prog_util.o
CXX programs/tgetopt.o
CCLD gunzip
/home/vini/anaconda3/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: /tmp/ccLWiWLa.ltrans0.ltrans.o: undefined reference to symbol 'pthread_create@@GLIBC_2.2.5'
/home/vini/anaconda3/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: /home/vini/anaconda3/bin/../x86_64-conda_cos6-linux-gnu/sysroot/lib/libpthread.so.0: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make: *** [Makefile:187: gunzip] Error 1
This was resolved by adding -pthread to the compiler flags in the Makefile.
So now the program compiled without errors. But it does not work:
(base) vini@mussismilia ~/code/pugz master ● ./gunzip predicted_bkp.tar.gz
terminate called after throwing an instance of 'gzip_error'
what(): INVALID_LITERAL
[1] 41942 abort ./gunzip predicted_bkp.tar.gz
Any ideas?
Thank you for any assistance you can provide.
PS: I read through #8, but because I got a different error, I thought it would be good to create a separate issue.
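A note on the MADV_HUGEPAGE error above, with a hedged sketch. MADV_MERGEABLE is a different hint (it marks pages as candidates for kernel same-page merging), so substituting it changes what the kernel is asked to do rather than fixing the build. Presumably the toolchain's system headers predate MADV_HUGEPAGE. One alternative is to guard the call so that the hint is simply skipped when the constant is unavailable. The helper alloc_maybe_huge below is hypothetical and is not pugz's alloc_huge; the 2 MiB alignment is an assumption that matches the usual x86-64 huge page size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free
#include <sys/mman.h> // madvise, MADV_HUGEPAGE (when the headers provide it)

// Hypothetical helper: allocates `bytes` rounded up to a 2 MiB multiple and
// asks for transparent huge pages only if the headers define MADV_HUGEPAGE.
// Returns nullptr on failure or when bytes == 0; release with free().
inline void* alloc_maybe_huge(std::size_t bytes)
{
    const std::size_t kAlign = std::size_t(1) << 21; // 2 MiB (assumed page size)
    if (bytes == 0 || bytes > SIZE_MAX - kAlign) return nullptr;
    const std::size_t rounded = (bytes + kAlign - 1) / kAlign * kAlign;

    void* ptr = nullptr;
    if (::posix_memalign(&ptr, kAlign, rounded) != 0) return nullptr;

#ifdef MADV_HUGEPAGE
    // Purely advisory: a failure here does not affect correctness, so the
    // return value is deliberately ignored.
    (void)::madvise(ptr, rounded, MADV_HUGEPAGE);
#endif
    return ptr;
}
```

Without the hint the program should behave the same and only lose the potential TLB benefit on large buffers.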
Dear authors,
I used this tool to decompress a nanopore reads file in gz format, using a command like this: gunzip -t 8 1.pass.fastq.gz > 1.fq
However, it throws an error like the one below:
terminate called after throwing an instance of 'gzip_error'
what(): Got a context from invalid position
Aborted
I found that the output file contains some reads, but the command failed to continue. Could you help me solve it?
Most gzip libraries provide an easy to use API of the form:
decompress(void* state, const char* input, size_t in_len, char* output, size_t out_len)
It decompresses the input stream to the output buffer until it runs out of either input data or output buffer space. The user fills/empties the buffers and makes another call, until the end of the file is reached.
In a multi-threaded implementation this synchronous interface is no longer possible. The code consuming the decompressed stream must be called back from each decompressor thread when a buffer is ready, not in any particular order.
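To make the out-of-order callback model concrete, here is a minimal sketch of a consumer that restores stream order synchronously. The class OrderedSink and the idea of tagging each buffer with a sequence number are assumptions for illustration, not an existing pugz interface.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical consumer: accepts buffers from arbitrary threads in arbitrary
// order and forwards them to `emit` strictly in sequence order.
class OrderedSink
{
  public:
    using Emit = std::function<void(const std::string&)>;

    explicit OrderedSink(Emit emit) : _emit(std::move(emit)) {}

    // Callable from any decompressor thread; `seq` is the buffer's index in
    // the decompressed stream (0, 1, 2, ...).
    void on_chunk(std::size_t seq, std::string data)
    {
        std::lock_guard<std::mutex> guard(_mutex);
        _pending.emplace(seq, std::move(data));
        // Flush every buffer that is now contiguous with what was emitted.
        for (auto it = _pending.find(_next); it != _pending.end();
             it = _pending.find(_next)) {
            _emit(it->second);
            _pending.erase(it);
            ++_next;
        }
    }

  private:
    std::mutex _mutex;
    std::map<std::size_t, std::string> _pending; // buffers waiting for a gap to close
    std::size_t _next = 0;                       // next sequence number to emit
    Emit _emit;
};
```

Because emit runs while the mutex is held, producers block while output is written; that corresponds to the blocking style of reordering, and an asynchronous consumer would need a different design.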
A gzip file is sliced into sections, processed sequentially, which are themselves split into chunks, processed in parallel, one per thread. The first chunk is decompressed normally in CPU cache, and yields ~32KB decompressed buffers to the user callback. The other chunks are decompressed into larger buffers (~100MB) and require a second pass of "back-reference translation". To do this in a cache-friendly manner, we propose to translate the buffer in segments that fit in the L1 cache and invoke the callback after each cache fill.
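The segment-wise translation can be sketched as follows. The symbol encoding is an assumption chosen for illustration: each decoded symbol is 16 bits wide, values below 256 are literal bytes, and a value of 256 + i stands for byte i of the not-yet-known context window that precedes the chunk. The function translate_by_segments and the SegmentCallback type are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using SegmentCallback = std::function<void(const std::uint8_t*, std::size_t)>;

// Resolves placeholder symbols against `context` one segment at a time and
// hands each translated segment to `callback` while it is still hot in cache.
inline void translate_by_segments(const std::vector<std::uint16_t>& symbols,
                                  const std::vector<std::uint8_t>& context,
                                  std::size_t segment_size,
                                  const SegmentCallback& callback)
{
    assert(segment_size > 0);
    // One small output buffer is reused for every segment, so the working set
    // stays at segment_size bytes instead of the whole ~100MB chunk.
    std::vector<std::uint8_t> segment(segment_size);
    for (std::size_t begin = 0; begin < symbols.size(); begin += segment_size) {
        const std::size_t n = std::min(segment_size, symbols.size() - begin);
        for (std::size_t i = 0; i < n; ++i) {
            const std::uint16_t s = symbols[begin + i];
            // Literal bytes pass through; placeholders index the context
            // window (at() throws if the index is out of range).
            segment[i] = s < 256 ? static_cast<std::uint8_t>(s)
                                 : context.at(s - 256u);
        }
        callback(segment.data(), n);
    }
}
```

In practice segment_size would be picked so that the segment and the 32KB context window together fit in L1; the right value is hardware-dependent and is left as a parameter here.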
There are other design matters, such as the possibility of running decompression from your own custom thread pool, C FFI compatibility, etc. See the full discussion below.
Translating the back-references over the whole buffer before yielding it to the client is costly, since it triggers unnecessary TLB and last-level cache misses on a large memory region. We need a cache-efficient translation/consumption step.
The user callbacks will be called at arbitrary times from arbitrary threads with decompressed content from arbitrary positions in the stream. The user should be able to reorder them, either synchronously (e.g. blocking reordering for writing to stdout) or asynchronously (e.g. parser with rests).
Support different threading implementations (OpenMP, custom work stealing, etc.).
Static library with minimal headers. Smallish machine code blob. Support multiple languages, with C as the common denominator.
Able to generate and load custom indexes.
(#section, #chunk, [#window flush])
std::mutex + std::condition_variable), std::thread.
(stream position, content position, context) is enough to generate an index. (overlaps with 2.)
Trying to decompress or count lines of compressed tab-separated files:
~/git/pugz/gunzip -l file.tsv.gz
terminate called after throwing an instance of 'gzip_error'
what(): INVALID_LITERAL
Aborted (core dumped)