lemire / simdcomp

A simple C library for compressing lists of integers using binary packing

License: BSD 3-Clause "New" or "Revised" License

simdcomp's Introduction

The SIMDComp library

A simple C library for compressing lists of integers using binary packing and SIMD instructions. The assumption is either that you have a list of 32-bit integers where most of them are small, or a list of 32-bit integers where differences between successive integers are small. No software is able to reliably compress an array of 32-bit random numbers.

This library can decode at least 4 billion compressed integers per second on most desktop or laptop processors. That is, it can decompress data at a rate of 15 GB/s. This is significantly faster than generic codecs like gzip, LZO, Snappy or LZ4.

On a Skylake Intel processor, it can decode integers at a rate of 0.3 cycles per integer, which can easily translate into more than 8 billion decoded integers per second.

This library is part of the Awesome C list of C resources.

Contributors: Daniel Lemire, Nathan Kurz, Christoph Rupp, Anatol Belski, Nick White and others

What is it for?

This is a low-level library for fast integer compression. By design it does not define a compressed format. It is up to the (sophisticated) user to create a compressed format.

Requirements

Your processor should support SSE4.1 (supported by most Intel and AMD processors released since 2008).
It is possible to build the core part of the code if your processor supports SSE2 (Pentium 4 or better).
A C99-compliant compiler (GCC is assumed).
A Linux-like distribution is assumed by the makefile.

For a plain C version that does not use SIMD instructions, see https://github.com/lemire/LittleIntPacker

Usage

Compression works over blocks of 128 integers.

For a complete working example, see example.c (you can build it and run it with "make example; ./example").

Lists of integers in random order.

const uint32_t b = maxbits(datain); // computes the bit width
simdpackwithoutmask(datain, buffer, b); // compresses 128 32-bit integers into buffer, using b 128-bit words (b*16 bytes)
simdunpack(buffer, backbuffer, b); // uncompresses to backbuffer

While 128 32-bit integers are read, only b 128-bit words are written. Thus, the compression ratio is 32/b.
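
The arithmetic behind the 32/b ratio can be checked with a small scalar sketch. It does not call the library: bits_needed, block_bit_width and packed_block_bytes are illustrative names, and block_bit_width is only a plain-C stand-in for what the text above calls maxbits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_LEN 128 /* integers per block, as stated above */

/* Number of bits needed to represent v (0 when v == 0). */
static uint32_t bits_needed(uint32_t v) {
  uint32_t b = 0;
  while (v != 0) {
    ++b;
    v >>= 1;
  }
  return b;
}

/* Illustrative scalar stand-in for maxbits: OR-ing all values keeps the
   highest set bit of the largest value, so one bit count is enough. */
static uint32_t block_bit_width(const uint32_t *in) {
  uint32_t accumulator = 0;
  size_t i;
  for (i = 0; i < BLOCK_LEN; ++i) {
    accumulator |= in[i];
  }
  return bits_needed(accumulator);
}

/* Bytes used by one packed block: b words of 128 bits, i.e. 16 * b bytes. */
static size_t packed_block_bytes(uint32_t b) {
  assert(b <= 32);
  return (size_t)b * 16;
}
```

For a block holding the values 0 to 127 the width is 7 bits, so the block shrinks from 512 bytes to 112 bytes. A single value with its top bit set forces b = 32, and the block then takes the full 512 bytes.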

Sorted lists of integers.

We use differential coding: we store the difference between successive integers. For this purpose, we need an initial value (called offset).

uint32_t offset = 0;
uint32_t b1 = simdmaxbitsd1(offset, datain); // bit width of the differences
simdpackwithoutmaskd1(offset, datain, buffer, b1); // compresses 128 32-bit integers down to b1 128-bit words (b1*16 bytes)
simdunpackd1(offset, buffer, backbuffer, b1); // uncompresses to backbuffer
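
To make the role of offset concrete, here is a scalar sketch of differential coding on its own, without any bit packing. The names delta_encode and delta_decode are illustrative and are not library functions; the sketch assumes a sorted input whose first value is at least offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replace each value by its difference from the previous one.
   The first value is differenced against offset. */
static void delta_encode(uint32_t offset, const uint32_t *in, uint32_t *out,
                         size_t n) {
  uint32_t prev = offset;
  size_t i;
  for (i = 0; i < n; ++i) {
    assert(in[i] >= prev); /* assumption: input is sorted */
    out[i] = in[i] - prev;
    prev = in[i];
  }
}

/* Inverse transform: a running sum that starts from offset. */
static void delta_decode(uint32_t offset, const uint32_t *in, uint32_t *out,
                         size_t n) {
  uint32_t prev = offset;
  size_t i;
  for (i = 0; i < n; ++i) {
    prev += in[i];
    out[i] = prev;
  }
}
```

With offset 100 and the input 100, 103, 103, 110, 200, the stored differences are 0, 3, 0, 7, 90. These need 7 bits each, where the raw values need 8.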

General example for arrays of arbitrary length:

int compress_decompress_demo() {
  size_t k, N = 9999;
  __m128i * endofbuf;
  uint32_t * datain = malloc(N * sizeof(uint32_t));
  uint8_t * buffer;
  uint32_t * backbuffer = malloc(N * sizeof(uint32_t));
  uint32_t b;

  for (k = 0; k < N; ++k){        /* start with k=0, not k=1! */
    datain[k] = k;
  }

  b = maxbits_length(datain, N);
  buffer = malloc(simdpack_compressedbytes(N,b)); /* allocate just enough memory */
  endofbuf = simdpack_length(datain, N, (__m128i *)buffer, b);
  /* compressed data is stored between buffer and endofbuf using (endofbuf-(__m128i *)buffer)*sizeof(__m128i) bytes */
  /* would be safe to do : buffer = realloc(buffer,(endofbuf-(__m128i *)buffer)*sizeof(__m128i)); */
  simdunpack_length((const __m128i *)buffer, N, backbuffer, b);

  for (k = 0; k < N; ++k){
    if(datain[k] != backbuffer[k]) {
      printf("bug\n");
      free(datain);
      free(buffer);
      free(backbuffer);
      return -1;
    }
  }
  /* release the three buffers allocated above */
  free(datain);
  free(buffer);
  free(backbuffer);
  return 0;
}

Frame-of-Reference

We also have frame-of-reference (FOR) functions (see simdfor.h header). They work like the bit packing routines, but do not use differential coding so they allow faster search in some cases, at the expense of compression.
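
As a rough illustration of the idea, and not of the routines in simdfor.h, the scalar sketch below stores a block minimum once and keeps every value as a distance from it. The names for_encode and for_decode_at are hypothetical. Because there is no running sum, any single value can be recovered without decoding the ones before it, which is one reason searching can be faster than with differential coding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Subtract the block minimum (the "reference") from every value.
   Returns the reference, which must be stored next to the encoded block. */
static uint32_t for_encode(const uint32_t *in, uint32_t *out, size_t n) {
  uint32_t reference;
  size_t i;
  assert(n > 0);
  reference = in[0];
  for (i = 1; i < n; ++i) {
    if (in[i] < reference) {
      reference = in[i];
    }
  }
  for (i = 0; i < n; ++i) {
    out[i] = in[i] - reference;
  }
  return reference;
}

/* Random access: value i depends only on the reference and on entry i. */
static uint32_t for_decode_at(uint32_t reference, const uint32_t *encoded,
                              size_t i) {
  return reference + encoded[i];
}
```

For the input 1005, 1000, 1012, 1003 the reference is 1000 and the stored values are 5, 0, 12, 3. The bit width is set by the range of the block (12 here), not by the gaps between neighbours, so sorted data with a wide range may compress less well than with differential coding.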

Setup

make
make test

and if you are daring:

make install

Go

If you are a Go user, there is a "go" folder where you will find a simple demo.

Other libraries

Fast integer compression in Go: https://github.com/ronanh/intcomp
Fast Bitpacking algorithms: Rust port of simdcomp https://github.com/quickwit-oss/bitpacking
SIMDCompressionAndIntersection: A C++ library to compress and intersect sorted lists of integers using SIMD instructions https://github.com/lemire/SIMDCompressionAndIntersection
The FastPFOR C++ library : Fast integer compression https://github.com/lemire/FastPFor
High-performance dictionary coding https://github.com/lemire/dictionary
LittleIntPacker: C library to pack and unpack short arrays of integers as fast as possible https://github.com/lemire/LittleIntPacker
StreamVByte: Fast integer compression in C using the StreamVByte codec https://github.com/lemire/streamvbyte
MaskedVByte: Fast decoder for VByte-compressed integers https://github.com/lemire/MaskedVByte
CSharpFastPFOR: A C# integer compression library https://github.com/Genbox/CSharpFastPFOR
JavaFastPFOR: A java integer compression library https://github.com/lemire/JavaFastPFOR
Encoding: Integer Compression Libraries for Go https://github.com/zhenjl/encoding
FrameOfReference is a C++ library dedicated to frame-of-reference (FOR) compression: https://github.com/lemire/FrameOfReference
libvbyte: A fast implementation for varbyte 32bit/64bit integer compression https://github.com/cruppstahl/libvbyte
TurboPFor is a C library that offers lots of interesting optimizations. Well worth checking! (GPL license) https://github.com/powturbo/TurboPFor
Oroch is a C++ library that offers a usable API (MIT license) https://github.com/ademakov/Oroch

References

Daniel Lemire, Nathan Kurz, Christoph Rupp, Stream VByte: Faster Byte-Oriented Integer Compression, Information Processing Letters 130, February 2018, Pages 1-6. https://arxiv.org/abs/1709.08990
Jianguo Wang, Chunbin Lin, Yannis Papakonstantinou, Steven Swanson, An Experimental Study of Bitmap Compression vs. Inverted List Compression, SIGMOD 2017 http://db.ucsd.edu/wp-content/uploads/2017/03/sidm338-wangA.pdf
P. Damme, D. Habich, J. Hildebrandt, W. Lehner, Lightweight Data Compression Algorithms: An Experimental Survey (Experiments and Analyses), EDBT 2017 http://openproceedings.org/2017/conf/edbt/paper-146.pdf
P. Damme, D. Habich, J. Hildebrandt, W. Lehner, Insights into the Comparative Evaluation of Lightweight Data Compression Algorithms, EDBT 2017 http://openproceedings.org/2017/conf/edbt/paper-414.pdf
Daniel Lemire, Leonid Boytsov, Nathan Kurz, SIMD Compression and the Intersection of Sorted Integers, Software Practice & Experience 46 (6) 2016. http://arxiv.org/abs/1401.6399
Daniel Lemire and Leonid Boytsov, Decoding billions of integers per second through vectorization, Software Practice & Experience 45 (1), 2015. http://arxiv.org/abs/1209.2137 http://onlinelibrary.wiley.com/doi/10.1002/spe.2203/abstract
Jeff Plaisance, Nathan Kurz, Daniel Lemire, Vectorized VByte Decoding, International Symposium on Web Algorithms 2015, 2015. http://arxiv.org/abs/1503.07387
Wayne Xin Zhao, Xudong Zhang, Daniel Lemire, Dongdong Shan, Jian-Yun Nie, Hongfei Yan, Ji-Rong Wen, A General SIMD-based Approach to Accelerating Compression Algorithms, ACM Transactions on Information Systems 33 (3), 2015. http://arxiv.org/abs/1502.01916
T. D. Wu, Bitpacking techniques for indexing genomes: I. Hash tables, Algorithms for Molecular Biology 11 (5), 2016. http://almob.biomedcentral.com/articles/10.1186/s13015-016-0069-5

simdcomp's Issues

simdunpack doesn't decompress the values properly

Here's a simple piece of code to reproduce the issue:

size_t N = 9999;
uint32_t * datain = malloc(N * sizeof(uint32_t));
uint8_t * buffer = malloc(N * sizeof(uint32_t) + N / SIMDBlockSize);
uint32_t * backbuffer = malloc(N * sizeof(uint32_t));

for (k = 1; k < N; ++k){       
    datain[k] = k;
}

const uint32_t b = maxbits(datain);

simdpackwithoutmask(datain, buffer, b);

simdunpack(buffer, backbuffer, b);

for (k = 1; k < N; ++k){       
    printf ("%d\n", backbuffer[k]);
}

Not quite sure if this is an issue with the buffer that I'm initializing; basically I need to compress an array of random integers.

Add to Microsoft's VC++ Packaging Tool

Microsoft launched a package manager that should be supported by this library...

https://github.com/Microsoft/vcpkg

Add ARM64 / Aarch64 support

Would you add arm support? Say arm64 or aarch64.
I find DLTcollab/sse2neon might be helpful.

SSE2 compatibility should be warranted

See the actual discussion here 1a6ea48, but in short - while various SSE versions have useful features, SSE2 is the standard set. The suggestion is to retain the compatibility with SSE2 even while some features from the upper SSE versions are used.

This is done by the following steps

Retaining the emulation layer for the features that aren't available with SSE2.
Providing a legacy option in the makefiles, so that a fully SSE2-compatible version of the library can be compiled. This shouldn't involve any run-time feature replacement (by usage of cpuid, etc.).

Javascript version

It would be interesting to see if a javascript version of this library could be created using SIMD.js.

https://developer.mozilla.org/nl/docs/Web/JavaScript/Reference/Global_Objects/SIMD
https://github.com/tc39/ecmascript_simd
https://github.com/jonbrennecke/node-simd

C89 compatibility

@lemire I think it's finished now, including the strict gcc C89 compat. But actually I'm kind of disappointed with that as the source looks much uglier now, just as you said :) ... Here's the diff https://github.com/lemire/simdcomp/compare/c89_compat . The unit tests and example programs seem to pass, however, and there is no significant change in performance.

Not sure it should be merged; nevertheless one could keep that branch for people who possibly need C89-compatible code for whatever reason (mostly compiler-bound, probably). Or, one could merge it but omit the gcc commit. That would be a sort of broken C89 compatibility where the C source would partly follow it, but would still be able to use stdint.h, the always_inline attribute and maybe more. Note also that gcc and vc are not the only compilers around.

I'm thinking about real-world use, particularly about implementing a PHP extension. There one is not limited to just compressing data, but could also implement data stream compression, data serialization, and beyond. I would probably reuse the C89 branch for that since, as mentioned earlier, PHP sticks to C89 (though more in that broken C89 variant at the moment).

Thanks.

Unable to compress array of random numbers

The code is able to effectively compress an array of values with a fixed difference between them. I tried initializing the datain array with the rand() function and compressing that. The result was no compression: the size of the compressed data was originalsize+N. Is the code designed to compress only data with a fixed difference?

Could you please explain why a ternary operator is used instead of bitwise OR?

https://github.com/lemire/simdcomp/blob/master/src/avx512bitpacking.c#L10
https://github.com/lemire/simdcomp/blob/master/src/avx512bitpacking.c#L17
https://github.com/lemire/simdcomp/blob/master/src/avxbitpacking.c#L10

simdcomp vs roaringBitmap

@lemire , thanks a lot for these wonderful libraries.

Just curious about one thing: how do these two libraries (simdcomp vs roaringBitmap) compare in terms of encoding/decoding speed and compression performance when it comes to ordered numbers?

Though I have used roaringBitmaps, I never benchmarked or verified the compression ratio there.
But while glancing over simdcomp, I ended up running the go/test and realised that the compression ratio was ~32.

As a side note, it would be great to have a sample of variable-length array encoding/decoding for Go users in simdcomp.

thanks!

Get maxbits for compressed array

How do I retrieve the maxbits when I only have a compressed array? Is it possible, or do I have to store the value from maxbits_length?
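
The README states that the library does not define a compressed format, and as far as it describes, the packed output holds only the b-bit fields. Under that assumption the bit width has to be kept by the caller. The sketch below shows one hypothetical container layout that stores b in a one-byte header ahead of the payload; write_block, read_bit_width and block_payload are made-up names, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container layout (not defined by simdcomp):
     byte 0       bit width b, in 0..32
     bytes 1..    packed payload produced by the packing routine
   Returns the total number of bytes written to dst. */
static size_t write_block(uint8_t *dst, uint8_t b, const uint8_t *payload,
                          size_t payload_bytes) {
  assert(b <= 32);
  dst[0] = b;
  memcpy(dst + 1, payload, payload_bytes);
  return 1 + payload_bytes;
}

/* Recover the bit width that was stored with the block. */
static uint8_t read_bit_width(const uint8_t *src) {
  return src[0];
}

/* Start of the packed payload inside a stored block. */
static const uint8_t *block_payload(const uint8_t *src) {
  return src + 1;
}
```

A one-byte header leaves the payload at an odd address. If the unpacking routine you use needs an aligned pointer, either pad the header to 16 bytes or copy the payload into an aligned buffer first.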

Use this to compress file?

If I read a file, divide it into blocks where each block is an array of uint32_t, and then use the library to compress each uint32_t array, will that be efficient?

Also, what happens if I repeatedly compress the output? How far would I get before the output size >= the input size? I know there should be a limit.

Also, in the readme:

The assumption is either that you have a list of 32-bit integers where most of them are small

What is the limit of small?

Thank you!

Support signed differential coding

Would you have a recommendation for data like a price tick stream where the values are uint32 and there are small deltas but are unsorted? The next price should be a small delta from the previous but it would be signed.
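
One common technique for small signed differences, which the library is not stated to provide, is zigzag encoding: the differences -1, 1, -2, 2 map to the unsigned codes 1, 2, 3, 4, so small magnitudes of either sign stay small and can then be bit-packed. The sketch below is a scalar illustration with made-up function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map a wrapped 32-bit difference d (read as two's complement) to an
   unsigned code: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ... */
static uint32_t zigzag_encode(uint32_t d) {
  return (d << 1) ^ (0u - (d >> 31));
}

/* Inverse of zigzag_encode. */
static uint32_t zigzag_decode(uint32_t z) {
  return (z >> 1) ^ (0u - (z & 1u));
}

/* Difference against the previous value, then zigzag. Unsigned wraparound
   makes the subtraction exact for rising and falling values alike. */
static void signed_delta_encode(uint32_t offset, const uint32_t *in,
                                uint32_t *out, size_t n) {
  uint32_t prev = offset;
  size_t i;
  assert(in != NULL && out != NULL);
  for (i = 0; i < n; ++i) {
    out[i] = zigzag_encode(in[i] - prev);
    prev = in[i];
  }
}

/* Undo the zigzag, then rebuild the values with a running sum. */
static void signed_delta_decode(uint32_t offset, const uint32_t *in,
                                uint32_t *out, size_t n) {
  uint32_t prev = offset;
  size_t i;
  assert(in != NULL && out != NULL);
  for (i = 0; i < n; ++i) {
    prev += zigzag_decode(in[i]);
    out[i] = prev;
  }
}
```

For the ticks 1000, 1002, 999, 999, 1001 with offset 1000, the differences are 0, 2, -3, 0, 2 and the stored codes are 0, 4, 5, 0, 4, which fit in 3 bits each. The codes are plain unsigned integers, so they could be handed to the non-differential packing routines.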

/GS for release Windows builds

The Windows makefile enables /GS (buffer security checks) for release builds. I won't argue in favor of security over performance or the other way around, but is this intentional or should it be /GS-?

How to compress/decompress sorted array of arbitrary length?

I didn't manage to create a function to successfully decompress a sorted array of arbitrary length. How to do this? I tried several things, including adapting the simdpack_length/simdunpack_length functions to simply use simdpackd1/simdunpackd1. Is this not correct? The function works correctly for 128 integers.

Packing 2 bits values into a bigger

Hi @lemire,

Thank you for all your work and all the resources you had provided.

Is there any reference on how to efficiently pack 2-bit values into a larger data type (basically int64)?

A = 0000.........10(64 bit), B = 0000........11(64 bit), C = 0000........01, and so on, these are int64 values containing 2-bit values.

So, I want to pack them into a single int64 value (32 2-bit values):
Z = 101101...........(int64 value, containing packed values of A, B, C, D, .............. )

I want to perform this efficiently. So It would be helpful if you can help me with some references.

Thanks in advance!
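
A plain scalar version of this packing, independent of the library, can be sketched with shifts and masks. The layout follows the example above, with the first value in the two most significant bits; pack2 and unpack2 are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack up to 32 values, each in 0..3, into one 64-bit word.
   values[0] lands in the two most significant bits. */
static uint64_t pack2(const uint64_t *values, size_t n) {
  uint64_t word = 0;
  size_t i;
  assert(n <= 32);
  for (i = 0; i < n; ++i) {
    word |= (values[i] & 3u) << (62 - 2 * i);
  }
  return word;
}

/* Read back the i-th 2-bit value (i counted from the most significant end). */
static uint64_t unpack2(uint64_t word, size_t i) {
  assert(i < 32);
  return (word >> (62 - 2 * i)) & 3u;
}
```

With A = 2 (binary 10), B = 3 (binary 11) and C = 1 (binary 01), the packed word starts with the bits 101101, matching Z above. A loop like this is a baseline only; faster variants are a separate question.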

Comment on performance for uint64_t values

How is it expected to do on uint64_t values? Let's say I have an almost sorted array of uint64_t numbers.

How to get the usable bytes from a compressed array?

I understand I have to malloc enough to fit in the compressed array, but how do I retrieve the usable information? Sorry that I'm not a sophisticated user.

how can I compress any size of sorted uint32 array

From example.c, simdpackwithoutmaskd1 can compress an array of 128 uint32 values. But if my array's length is not a multiple of 128, I have no idea what the best practice is with simdcomp.
I would also like to know the difference between simdpackd1 and simdpackwithoutmaskd1.
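
One generic workaround, offered here as an assumption and not as the library's recommended approach, is to pad the array up to the next multiple of 128 by repeating the last value. For sorted data the padding then adds only zero differences. The helper names below are hypothetical, and the true length has to be stored separately so the padding can be dropped after decoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_LEN 128 /* integers per block */

/* Smallest multiple of BLOCK_LEN that is >= n. */
static size_t padded_length(size_t n) {
  return (n + BLOCK_LEN - 1) / BLOCK_LEN * BLOCK_LEN;
}

/* Copy n sorted values into out, which must hold padded_length(n) entries,
   and fill the tail with the last value. Returns the padded length. */
static size_t pad_sorted(const uint32_t *in, size_t n, uint32_t *out) {
  size_t total = padded_length(n);
  size_t i;
  assert(n > 0);
  for (i = 0; i < n; ++i) {
    out[i] = in[i];
  }
  for (i = n; i < total; ++i) {
    out[i] = in[n - 1];
  }
  return total;
}
```

An array of 130 values becomes two full blocks of 128, and each block can then be passed to the 128-integer routines in turn.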

Is it possible to speedup 128 block decoding via 256 bit instructions?

For various reasons I don't want to make the block size bigger than 128.
At the same time, it looks like the unpack function could use 256-bit registers to perform fewer loads, stores and instructions.

Am I wrong? Or would such an idea not provide a speedup?
Or was it simply not the goal?

Maybe it's not a good idea to mix different registers, I'm not sure.
But at least for blocks with an even bit width, it should be possible to use only 256-bit instructions/registers.

heap-buffer-overflow (detected by LibFuzzer)

Here is my function LLVMFuzzerTestOneInput :
test_input.zip

sizeof in is :  110
0 : 73
1 : 73
2 : 73
3 : 73
4 : 73
5 : 73
6 : 157
7 : 73
8 : 73
9 : 73
10 : 73
11 : 73
12 : 73
13 : 73
14 : 0
15 : 0
16 : 0
17 : 0
18 : 0
19 : 0
20 : 0
21 : 0
22 : 0
23 : 0
24 : 0
25 : 0
26 : 0
27 : 0
28 : 0
29 : 0
30 : 0
31 : 0
32 : 0
33 : 0
34 : 0
35 : 0
36 : 0
37 : 0
38 : 0
39 : 0
40 : 0
41 : 0
42 : 255
43 : 255
44 : 255
45 : 0
46 : 0
47 : 0
48 : 0
49 : 0
50 : 0
51 : 0
52 : 0
53 : 0
54 : 0
55 : 0
56 : 0
57 : 0
58 : 0
59 : 0
60 : 0
61 : 0
62 : 0
63 : 0
64 : 0
65 : 0
66 : 0
67 : 0
68 : 0
69 : 0
70 : 0
71 : 0
72 : 0
73 : 0
74 : 0
75 : 0
76 : 0
77 : 0
78 : 0
79 : 0
80 : 0
81 : 0
82 : 0
83 : 0
84 : 0
85 : 0
86 : 0
87 : 0
88 : 0
89 : 0
90 : 0
91 : 0
92 : 0
93 : 0
94 : 0
95 : 41
96 : 0
97 : 0
98 : 0
99 : 0
100 : 0
101 : 0
102 : 0
103 : 73
104 : 73
105 : 73
106 : 73
107 : 73
108 : 73
109 : 41
packing
unpacking
=================================================================
==2137==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60b000001600 at pc 0x00000068a431 bp 0x7ffe5a401c90 sp 0x7ffe5a401c88
READ of size 16 at 0x60b000001600 thread T0
    #0 0x68a430  (/my/simdcomp/test_input+0x68a430)
    #1 0x54ab28  (/my/simdcomp/test_input+0x54ab28)
    #2 0x42eac7  (/my/simdcomp/test_input+0x42eac7)
    #3 0x439334  (/my/simdcomp/test_input+0x439334)
    #4 0x43a99f  (/my/simdcomp/test_input+0x43a99f)
    #5 0x429d5c  (/my/simdcomp/test_input+0x429d5c)
    #6 0x41cc22  (/my/simdcomp/test_input+0x41cc22)
    #7 0x7f1c2cf39b96  (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
    #8 0x41cc99  (/my/simdcomp/test_input+0x41cc99)

0x60b000001600 is located 0 bytes to the right of 112-byte region [0x60b000001590,0x60b000001600)
allocated by thread T0 here:
    #0 0x5121b0  (/my/simdcomp/test_input+0x5121b0)
    #1 0x54aaeb  (/my/simdcomp/test_input+0x54aaeb)
    #2 0x42eac7  (/my/simdcomp/test_input+0x42eac7)
    #3 0x439334  (/my/simdcomp/test_input+0x439334)
    #4 0x43a99f  (/my/simdcomp/test_input+0x43a99f)
    #5 0x429d5c  (/my/simdcomp/test_input+0x429d5c)
    #6 0x41cc22  (/my/simdcomp/test_input+0x41cc22)
    #7 0x7f1c2cf39b96  (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)

SUMMARY: AddressSanitizer: heap-buffer-overflow (/my/simdcomp/test_input+0x68a430) 
Shadow bytes around the buggy address:
  0x0c167fff8270: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fa fa
  0x0c167fff8280: fa fa fa fa fa fa fd fd fd fd fd fd fd fd fd fd
  0x0c167fff8290: fd fd fd fd fa fa fa fa fa fa fa fa 00 00 00 00
  0x0c167fff82a0: 00 00 00 00 00 00 00 00 00 06 fa fa fa fa fa fa
  0x0c167fff82b0: fa fa 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c167fff82c0:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c167fff82d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c167fff82e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c167fff82f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c167fff8300: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c167fff8310: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==2137==ABORTING
MS: 2 InsertByte-InsertRepeatedBytes-; base unit: e327b8bb508e837ff0f4550c4494658769d8262a
0x49,0x49,0x49,0x49,0x49,0x49,0x9d,0x49,0x49,0x49,0x49,0x49,0x49,0x49,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0xff,0xff,0xff,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x29,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x49,0x49,0x49,0x49,0x49,0x49,0x29,
IIIIII\x9dIIIIIII\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00)\x00\x00\x00\x00\x00\x00\x00IIIIII)
artifact_prefix='./'; Test unit written to ./crash-4dc1b47ec0015baf910b70995c37d0b7231129c6
Base64: SUlJSUlJnUlJSUlJSUkAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA////AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAApAAAAAAAAAElJSUlJSSk=

Which code to use?

This project references others for compressing sorted arrays, like SIMDCompressionAndIntersection and TurboPFor. AFAICT the core functionality in those projects (i.e. for.c ) seems to be identical, whereas the code in this repo is quite different. What are the differences and why are they there?