
The FastPFOR C++ library: Fast integer compression

License: Apache License 2.0



The FastPFOR C++ library : Fast integer compression


What is this?

A research library with integer compression schemes. It is broadly applicable to the compression of arrays of 32-bit integers where most integers are small. The library seeks to exploit SIMD instructions (SSE) whenever possible.

This library can decode at least 4 billion compressed integers per second on most desktop or laptop processors. That is, it can decompress data at a rate of 15 GB/s. This is significantly faster than generic codecs such as gzip, LZO, Snappy or LZ4.

It is used by the zsearch engine as well as in GMAP and GSNAP. DuckDB derived some of its code from this library. The library has been ported to Java, C# and Go; the Java port is used by ClueWeb Tools.

Apache Lucene version 4.6.x uses a compression format derived from our FastPFOR scheme.

Python bindings

Myths

Myth: SIMD compression requires very large blocks of integers (1024 or more).

Fact: This is not true. Our fastest scheme (SIMDBinaryPacking) works over blocks of 128 integers. Another very fast scheme (Stream VByte) works over blocks of four integers.

Myth: SIMD compression means high speed but less compression.

Fact: This is wrong. Some schemes cannot easily be accelerated with SIMD instructions, but many schemes that do lend themselves to SIMD acceleration also compress very well.

Working with sorted lists of integers

If you are working primarily with sorted lists of integers, you might want to use differential coding. That is, you may want to compress the deltas instead of the integers themselves. The current library (fastpfor) is generic and was not optimized for this purpose. However, we have another library designed to compress sorted integer lists:

https://github.com/lemire/SIMDCompressionAndIntersection

This other library (SIMDCompressionAndIntersection) also comes complete with new SIMD-based intersection algorithms.

There is also a C library for differential coding (fast computation of deltas, and recovery from deltas):

https://github.com/lemire/FastDifferentialCoding
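Conceptually, differential coding is easy to sketch: store the differences between consecutive sorted values and recover the originals with a running sum. The helper functions below are illustrative only and are not part of FastPFor or the libraries above:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative helpers (not part of FastPFor): turn a sorted array into
// deltas, then recover the original values with a running (prefix) sum.
std::vector<uint32_t> to_deltas(const std::vector<uint32_t> &sorted) {
  std::vector<uint32_t> deltas(sorted.size());
  uint32_t prev = 0;
  for (std::size_t i = 0; i < sorted.size(); ++i) {
    deltas[i] = sorted[i] - prev; // small values when the list is dense
    prev = sorted[i];
  }
  return deltas;
}

std::vector<uint32_t> from_deltas(const std::vector<uint32_t> &deltas) {
  std::vector<uint32_t> out(deltas.size());
  uint32_t running = 0;
  for (std::size_t i = 0; i < deltas.size(); ++i) {
    running += deltas[i];
    out[i] = running;
  }
  return out;
}
```

Because the deltas of a dense sorted list are much smaller than the values themselves, they compress better with the schemes in this library.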

Other recommended libraries

Reference and documentation

For a simple example, please see

example.cpp

in the root directory of this project.

Please see:

This library was used by several papers including the following:

It has also inspired related work such as...

License

This code is licensed under Apache License, Version 2.0 (ASL2.0).

Software Requirements

This code requires a compiler supporting C++11. This was a design decision.

It builds under

  • clang++ 3.2 (LLVM 3.2) or better,
  • Intel icpc (ICC) 13.0.1 or better,
  • MinGW32 (x64-4.8.1-posix-seh-rev5)
  • Microsoft VS 2012 or better,
  • and GNU GCC 4.7 or better.

The code was tested under Windows, Linux and MacOS.

Hardware Requirements

We require an x64 platform.

To fully use the library, your processor should support SSSE3. This includes almost every Intel or AMD processor sold after 2006. (Note: the key schemes require merely SSE2.)

Some specific binaries will only run if your processor supports SSE4.1; however, these are used only for specific tests.

Building with CMake

You need cmake. On most Linux distributions, you can simply do the following:

  git clone https://github.com/lemire/FastPFor.git
  cd FastPFor
  mkdir build
  cd build
  cmake ..
  cmake --build .

It may be necessary to set the CXX variable. The project is installable (make install works).

To create project files for Microsoft Visual Studio, it might be useful to target 64-bit Windows (e.g., see http://www.cmake.org/cmake/help/v3.0/generator/Visual%20Studio%2012%202013.html).

Multithreaded context

You should not assume that our objects are thread safe. If you have several threads, each thread should have its own IntegerCODEC objects to ensure that there are no concurrency problems.
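The recommended pattern can be sketched as follows. Note that ToyCodec is a stand-in written for this illustration, not a FastPFor class; with the real library, each thread would construct its own IntegerCODEC in the same position:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for a codec (illustrative only, NOT a FastPFor class): the point
// is that each thread constructs its own instance instead of sharing one,
// because codec objects may carry per-call mutable state.
struct ToyCodec {
  uint64_t scratch = 0; // per-instance mutable state
  uint64_t process(const std::vector<uint32_t> &in) {
    scratch = 0;
    for (uint32_t v : in) scratch += v;
    return scratch;
  }
};

// Each worker owns its codec; only the aggregated result is shared.
uint64_t run_workers(int nthreads) {
  std::atomic<uint64_t> total{0};
  std::vector<std::thread> workers;
  for (int t = 0; t < nthreads; ++t) {
    workers.emplace_back([&total] {
      ToyCodec codec; // one codec per thread: no shared mutable codec state
      std::vector<uint32_t> data(1000, 1);
      total += codec.process(data);
    });
  }
  for (auto &w : workers) w.join();
  return total.load();
}
```

Sharing one codec object across threads, by contrast, would race on its internal state.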

Why C++11?

With minor changes, all schemes will compile fine under compilers that do not support C++11. And porting the code to C should not be a challenge.

In any case, we already support several major C++ compilers, so portability is not a major issue.

What if I prefer Java?

Many schemes cannot be efficiently ported to Java. However some have been. Please see:

https://github.com/lemire/JavaFastPFOR

What if I prefer C#?

See CSharpFastPFOR: A C# integer compression library https://github.com/Genbox/CSharpFastPFOR

What if I prefer Go?

See Encoding: Integer Compression Libraries for Go https://github.com/zhenjl/encoding

Testing

If you used CMake to generate the build files, the check target will run the unit tests. For example, if you generated Unix Makefiles,

make check

will do it.

Simple benchmark

make codecs
./codecs --clusterdynamic
./codecs --uniformdynamic

Optional : Snappy

Typing "make allallall" will build some testing binaries that depend on Google Snappy. If you want to build these, you need to install Google Snappy first. On a recent Ubuntu machine:

sudo apt-get install libsnappy-dev

Processing data files

Typing "make" will generate an "inmemorybenchmark" executable that can process data files.

You can use it to process arrays of integers stored on disk in the following 32-bit format: one unsigned 32-bit integer giving the array length, followed by the corresponding number of 32-bit integers; this pattern repeats for each array. It is assumed that the integers are sorted.
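As an illustration, such a file could be written and read back with helpers like these (hypothetical code, not shipped with the library; it assumes the file is produced and consumed on machines with the same endianness):

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helpers (not shipped with FastPFor) for the inmemorybenchmark
// file format: each array is stored as a 32-bit length followed by that many
// 32-bit integers, repeated until end of file.
void write_arrays(const std::string &path,
                  const std::vector<std::vector<uint32_t>> &arrays) {
  std::ofstream out(path, std::ios::binary);
  for (const auto &a : arrays) {
    const uint32_t len = static_cast<uint32_t>(a.size());
    out.write(reinterpret_cast<const char *>(&len), sizeof(len));
    out.write(reinterpret_cast<const char *>(a.data()),
              static_cast<std::streamsize>(a.size() * sizeof(uint32_t)));
  }
}

std::vector<std::vector<uint32_t>> read_arrays(const std::string &path) {
  std::vector<std::vector<uint32_t>> arrays;
  std::ifstream in(path, std::ios::binary);
  uint32_t len = 0;
  while (in.read(reinterpret_cast<char *>(&len), sizeof(len))) {
    std::vector<uint32_t> a(len);
    in.read(reinterpret_cast<char *>(a.data()),
            static_cast<std::streamsize>(len * sizeof(uint32_t)));
    arrays.push_back(std::move(a));
  }
  return arrays;
}
```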

Once you have such a binary file somefilename you can process it with our inmemorybenchmark:

./inmemorybenchmark --minlength 10000 somefilename

The "minlength" flag skips short arrays. (Warning: timings over short arrays are unreliable.)

Testing with the Gov2 and ClueWeb09 data sets

As of April 2014, we recommend getting our archive at

http://lemire.me/data/integercompression2014.html

It is the data that was used for the following paper:

Daniel Lemire, Leonid Boytsov, Nathan Kurz, SIMD Compression and the Intersection of Sorted Integers, arXiv: 1401.6399, 2014 http://arxiv.org/abs/1401.6399

I used your code and I get segmentation faults

Our code is thoroughly tested.

One common issue is that people do not provide large enough output buffers. Some schemes compress so poorly on some inputs that the compressed data can be much larger than the input data.
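As a rule of thumb, and consistent with the bundled example.cpp (which allocates N + 1024 words for N input integers), size the output buffer at least as large as the input plus slack. The helper below is a hypothetical illustration of that rule, not library code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sizing helper, not library code: allocate at least the input
// length plus generous slack, because incompressible inputs can expand.
// The bundled example.cpp uses N + 1024 words for N input integers.
std::vector<uint32_t> make_output_buffer(std::size_t input_length) {
  return std::vector<uint32_t>(input_length + 1024);
}
```

As example.cpp suggests, the encode call takes the buffer capacity by reference and updates it to the number of words actually written, so the buffer can be shrunk afterwards.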

Is any of this code subject to patents?

I (D. Lemire) did not patent anything.

However, we implemented varint-G8IU, which was patented by its authors. DO NOT use varint-G8IU if you want to avoid patents.

The rest of the library should be patent-free.

Funding

This work was supported by NSERC grant number 26143.

fastpfor's People

Contributors

amallia, cbsmith, elshize, galo2099, hurricane1026, kimikage, kou, kruus, lemire, maximecaron, michellemay, mpetri, ncave, orz--, pdamme, pps83, rayburgemeestre, romange, searchivarius, seb711, xcorail, xndai


fastpfor's Issues

Follow-up on #75

Hey,
thank you for your fast answer! I took a look at the SIMDCompressionAndIntersection library and at the papers. I will now call what I described as "Point-Based Access" Random Access, since this seems to be the common term. From what I see, the paper describes that Random Access is possible with some of the codecs. I also saw that you implemented functions like the insert function to insert new values into the compressed structure. I did, however, not find a clear example of how to use the library for Random Access. If one of your codecs does support Random Access, could you provide a concrete code example of how to use it? This would be immensely helpful to me!

running example.cpp

Hi,
I want to display some values of the compressed and decompressed vectors in the example.cpp so i added these lines:
for (uint32_t i = 0; i < 8; i++)
  std::cout << "compressed_vector " << i << " " << compressed_output.data()[i] << std::endl;
...........
for (uint32_t i = 0; i < 8; i++)
  std::cout << "Dompressed_vector " << i << " " << mydataback.data()[i] << std::endl;
but the result is
sofi@ubuntu:~/FastPFor$ ./example
compressed_vector 0 9984
compressed_vector 1 3
compressed_vector 2 0
compressed_vector 3 0
compressed_vector 4 183
compressed_vector 5 2517106944
compressed_vector 6 738787840
compressed_vector 7 167903426
You are using 0.294 bits per integer.
Dompressed_vector 0 0
Dompressed_vector 1 0
Dompressed_vector 2 0
Dompressed_vector 3 0
Dompressed_vector 4 0
Dompressed_vector 5 0
Dompressed_vector 6 0
Dompressed_vector 7 0
how can i get these values correctly? What is the problem ?
Thanks in advance.

Program received signal SIGILL, Illegal instruction

Program received signal SIGILL, Illegal instruction.
0x0000000000418128 in BP32::encodeArray(unsigned int const*, unsigned long, unsigned int*, unsigned long&) ()
(gdb) bt
#0 0x0000000000418128 in BP32::encodeArray(unsigned int const*, unsigned long, unsigned int*, unsigned long&) ()
#1 0x00000000004182ad in CompositeCodec<BP32, VariableByte>::encodeArray(unsigned int const*, unsigned long, unsigned int*, unsigned long&) ()
#2 0x0000000000423278 in void Delta::process<std::vector<unsigned int, AlignedSTLAllocator<unsigned int, 64ul> > >(std::vector<algostats, std::allocator >&, std::vector<std::vector<unsigned int, AlignedSTLAllocator<unsigned int, 64ul> >, std::allocator<std::vector<unsigned int, AlignedSTLAllocator<unsigned int, 64ul> > > > const&, processparameters&, std::string) ()
#3 0x0000000000406f87 in main ()

and my cpuinfo is

cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz
stepping : 11
microcode : 0xba
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dts tpr_shadow vnmi flexpriority
bogomips : 4640.08
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz
stepping : 11
microcode : 0xba
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dts tpr_shadow vnmi flexpriority
bogomips : 4640.08
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48

Is the compressed data portable?

Hello authors,

This is a great library!

I have a question: Is the compressed data portable, i.e. is it dependent on specific endianess, instruction set(cpu architecture) or anything else? If it is not portable, is there such a compression library that produces portable compressed data? If so, please give me a pointer.

Thank you!

Add support for 64-bit integers

Adding this as an issue, it is mentioned in the TODO.md file.

We would like to compress 64 bit unsigned integers.
Since the library now requires a 64 bit system (see README.md), 64 bit compression would be useful.

You tried to apply Simple16 to an incompatible set of integers

throw std::runtime_error(

so, there is that exception, but no explanation at all of what Simple16 considers incompatible.

Also, many codecs are plainly broken: they read/write random memory or crash. These are buggy/broken: simple16, simple8b_rle, simple9, simple9_rle, vsencoding, simdfastpfor256, fastpfor128, fastpfor256, simdfastpfor128 (the last reads random memory here and there).
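Based on the error message quoted elsewhere in this tracker, Simple16 only accepts values in [0, 2^28). A caller can screen inputs before selecting the codec; this precheck is an illustrative workaround, not part of the library:

```cpp
#include <cstdint>
#include <vector>

// Simple16 rejects values outside [0, 2^28), per the library's runtime_error
// message. This illustrative precheck lets a caller fall back to another
// codec for arrays that contain larger values.
bool fits_simple16(const std::vector<uint32_t> &data) {
  const uint32_t limit = uint32_t(1) << 28;
  for (uint32_t v : data) {
    if (v >= limit) return false;
  }
  return true;
}
```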

C++ and Java implementations of fastpfor + varbyte unable to decode each others output

I've been investigating using the FastPFor algorithm for integer columns in Druid, following up on the discussion and experimentation mentioned in this issue. Initially, I was solely using your Java port of this library, and it's the basis of a PR I currently have open. However, curiosity got the better of me and I wanted to compare performance with the c++ implementation, sooo, I wrote a JNI wrapper to allow me to call this library from java, resulting in this branch which is a sort of alternate implementation of the PR. That's the short of how I encountered the issue mentioned in the title - seeing what happened when I fed one into the other. I have not had a chance to dig in to determine if the issue is in fact in this library or the other one, but it can be replicated by modifying the example program to write encoded data to a file, then loading that into a bytebuffer in java and attempting to decode.

#include <iostream>
#include <fstream>
#include <cstdlib>
#include <ctime>

#include "codecfactory.h"
using namespace std;
using namespace FastPForLib;

int main()
{
  IntegerCODEC &codec = *CODECFactory::getFromName("simdfastpfor256");
  size_t N = 10000;
  std::vector<uint32_t> mydata(N);

  srand (time(NULL));

  for (uint32_t i = 0; i < N; i++)
    mydata[i] = rand() % 100;

  std::vector<uint32_t> compressed_output(N + 1024);
  size_t compressedsize = compressed_output.size();
  codec.encodeArray(mydata.data(), mydata.size(), compressed_output.data(),
                    compressedsize);
  ofstream out("numbers.bin", ios::out | ios::binary);
  if(!out) {
    cout << "Cannot open file.";
    return 1;
   }

  out.write((char *)compressed_output.data(), N * sizeof(uint32_t));

  out.close();

  cout << "length " << compressedsize << "\n";
  return 0;
}

and then adjust the encodedSize variable to match the output of the c++ program in the following java

import com.google.common.io.Files;
import me.lemire.integercompression.FastPFOR;
import me.lemire.integercompression.IntWrapper;
import me.lemire.integercompression.SkippableComposition;
import me.lemire.integercompression.SkippableIntegerCODEC;
import me.lemire.integercompression.VariableByte;

...
    int buffSize = 1 << 16;

    int numValues = 10000;
    int maxNumValues = buffSize >> 2;

    SkippableIntegerCODEC codec = new SkippableComposition(new FastPFOR(), new VariableByte());
    int encodedSize = 2214;

    ByteBuffer encodedValues = Files.map(new File("numbers.bin"));

    int[] valueArray = new int[numValues];
    int[] encodedValueArray = new int[encodedSize];

    // copy encoded buffer to int array
    for (int i = 0; i < encodedSize; i++) {
      encodedValueArray[i] = encodedValues.getInt();
    }

    // decode with java

    IntWrapper outPos = new IntWrapper(0);
    codec.headlessUncompress(encodedValueArray, new IntWrapper(0), encodedSize, valueArray, outPos, numValues);
    // explodes before we get here
    assert (numValues == outPos.get());

The exception in this case was

java.lang.ArrayIndexOutOfBoundsException: 2555904
	at me.lemire.integercompression.FastPFOR.decodePage(FastPFOR.java:239)
	at me.lemire.integercompression.FastPFOR.headlessUncompress(FastPFOR.java:229)
	at me.lemire.integercompression.SkippableComposition.headlessUncompress(SkippableComposition.java:55)

but during experimentation I recall seeing exceptions coming from the VariableByte implementation as well, so the exact error may be dependent on length and composition of the input. I don't have one of these exceptions handy unfortunately, I will attempt to trigger it again and update the ticket. The JNI wrapper version can correctly read and decode the file data generated by the example program.

Of potential interest: the size of output varies slightly between the implementations, with c++ being a handful of int32 sized values larger. The full set of compatibility tests I threw together can be found here with the c++ snippet above, here. These tests that pass for me are java -> java, jni -> jni, external native encoded file -> jni, and the failing tests are jni -> java, java -> jni, external native encoded file -> java.

Another thing, which might be potentially related and indicative of a bug here (or a bug somewhere in my code), I experience an assert failure at the end of decodeArray function when using simd versions of fastpfor that only popped up whenever I would attempt to decode arbitrary offsets of a bytebuffer (from a memory mapped file). I did not experience this issue during initial testing of my JNI wrapper which was using newly allocated buffers populated with data. I explored briefly; commenting out the assert statement in the header and rebuilding the library resulted in the output decoded without exploding, but the final value would be incorrect when I went to validate the output. Since I experienced this issue only when decoding the mapped buffer where an encoded chunk could be located at arbitrary offsets, I speculated it was an issue with alignment, and indeed copying the values to a 16 byte aligned chunk has caused the assert failure to not appear again. That said, I haven't had the opportunity to try to replicate this behavior purely in c++ yet, so it is possible this issue is one of my own rather than this library, but wanted to throw it out there in the event it helps isolate why the java and c++ outputs are incompatible.

In general I've tested my branches with both my mac os laptop with

Apple LLVM version 10.0.0 (clang-1000.10.25.5)
g++-8 (Homebrew GCC 8.2.0) 8.2.0

and ubuntu linux with

g++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609

but will double check that both issues happen on linux as soon as possible, and will attempt to dig into this deeper in general whenever I get the chance.

I'm quite excited to get this stuff worked into Druid, and using this version of the library is currently my preference since it's benchmarking faster and makes the implementation of the java part of my PR a bit more straightforward, but I think it's probably important to have the ability to fallback to a java only implementation to make the decision a little less permanent.

Synthetic data

void fillClustered(iterator begin, iterator end, uint32_t Min, uint32_t Max) {
  const uint32_t N = static_cast<uint32_t>(end - begin);
  const uint32_t range = Max - Min;
  if (range < N)
    throw std::runtime_error("can't generate that many in small interval.");
  assert(range >= N);
  if ((range == N) || (N < 10)) {
    fillUniform(begin, end, Min, Max);
    return;
  }
  const uint32_t cut = N / 2 + unidg.rand.getValue(range - N);
  assert(cut >= N / 2);
  assert(Max - Min - cut >= N - N / 2);
  const double p = unidg.rand.getDouble();
  assert(p <= 1);
  assert(p >= 0);
  if (p <= 0.25) {
    fillUniform(begin, begin + N / 2, Min, Min + cut);
    fillClustered(begin + N / 2, end, Min + cut, Max);
  } else if (p <= 0.5) {
    fillClustered(begin, begin + N / 2, Min, Min + cut);
    fillUniform(begin + N / 2, end, Min + cut, Max);
  } else {
    fillClustered(begin, begin + N / 2, Min, Min + cut);
    fillClustered(begin + N / 2, end, Min + cut, Max);
  }
}

The above is the synthetic data generator for a clustered series. The reference is "Vo Ngoc Anh and Alistair Moffat. 2010. Index compression using 64-bit words".

The original paper says the following:

The second family, ClusterData, has a structure similar to UniformData. However, the sequence is generated in a way that creates a clustered rather than uniform distribution, to more closely model the way that terms in an information retrieval system tend to be clustered across clumps of documents. A recursive process is used to set, in a clustered manner, f bits of the array A[l ... r] of bits, where f ≤ r − l + 1. If f is small (f < 10) then f locations in A[l ... r] are selected, and the corresponding bits are turned on. When f ≥ 10, the array A is randomly divided into two sub-arrays A[l ... m] and A[m+1 ... r] for some choice of m, and the task becomes that of turning on f/2 bits in each of the sub-arrays, with care taken so that the number of 1-bits does not exceed either of the two sub-array lengths.

In the source code the cut (which corresponds to m in the paper) is never used to split the vector, but just to adapt min/max of the recursive calls.

API design: user wants low-level control for block-based compression

Hi, one thing that I miss from this library is the ability to integrate into a bigger project where I handle the block encoding manually.
To be more specific, it would be great to have both encodeArray and encodeBlock, so the user can decide which to use.

If I want to encode blocks of 128 elements, I don't want to use encodeArray, because it will store the length of the block, which is redundant.

Packing 2 bits into a large data type(INT64)

Hi @lemire,

I was the one to ask you for the reference for packing 2 bits into a larger data type yesterday.

Basically, what is the difference between SIMDcomp and FastPFor? And for the 2 bits of packing and unpacking which library can be used among the two?

Thanks
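For background on the question, packing 2-bit values into a 64-bit word is mostly shifting and masking. The standalone sketch below (not taken from SIMDcomp or FastPFor) packs up to 32 two-bit values per uint64_t:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Standalone sketch (not taken from SIMDcomp or FastPFor): pack up to 32
// two-bit values, each in [0, 3], into a single 64-bit word.
uint64_t pack2(const std::vector<uint32_t> &vals) {
  assert(vals.size() <= 32);
  uint64_t word = 0;
  for (std::size_t i = 0; i < vals.size(); ++i) {
    assert(vals[i] < 4); // each value must fit in 2 bits
    word |= uint64_t(vals[i]) << (2 * i);
  }
  return word;
}

// Extract the i-th two-bit value from a packed word.
uint32_t unpack2(uint64_t word, std::size_t i) {
  return static_cast<uint32_t>(word >> (2 * i)) & 3u;
}
```

Both libraries offer far more heavily optimized (SIMD) versions of this idea; the sketch only shows the underlying arithmetic.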

Compile Problem

I tried to compile the library and got an error that C++11 was not enabled. I tried using g++ 4.7, 4.8 and 4.9 on Ubuntu 14.04.

This is the beginning of the output of make using g++ 4.8:

In file included from /usr/include/c++/4.8/chrono:35:0,
                 from /home/demian/Documents/FastPFor/headers/common.h:31,
                 from /home/demian/Documents/FastPFor/headers/bitpacking.h:9,
                 from /home/demian/Documents/FastPFor/src/bitpacking.cpp:1:
/usr/include/c++/4.8/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
 #error This file requires compiler and library support for the \
  ^
In file included from /home/demian/Documents/FastPFor/src/bitpacking.cpp:1:0:
/home/demian/Documents/FastPFor/headers/bitpacking.h:11:26: error: ‘uint32_t’ does not name a type
 void __fastunpack0(const uint32_t *  __restrict__ in, uint32_t *  __restrict__  out);
                          ^
/home/demian/Documents/FastPFor/headers/bitpacking.h:11:51: error: ISO C++ forbids declaration of ‘in’ with no type [-fpermissive]
 void __fastunpack0(const uint32_t *  __restrict__ in, uint32_t *  __restrict__  out);
                                                   ^
/home/demian/Documents/FastPFor/headers/bitpacking.h:11:55: error: ‘uint32_t’ has not been declared
 void __fastunpack0(const uint32_t *  __restrict__ in, uint32_t *  __restrict__  out);
                                                       ^
/home/demian/Documents/FastPFor/headers/bitpacking.h:12:26: error: ‘uint32_t’ does not name a type
 void __fastunpack1(const uint32_t *  __restrict__ in, uint32_t *  __restrict__  out);

This is the output of cmake:

-- No build type selected, default to Release
-- The CXX compiler identification is GNU 4.8.2
-- The C compiler identification is GNU 4.8.2
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- TEST TEST
-- Performing Test SUPPORT_SSE42
-- Performing Test SUPPORT_SSE42 - Success
-- Performing Test SUPPORT_AVX
-- Performing Test SUPPORT_AVX - Success
-- Performing Test SUPPORT_AVX2
-- Performing Test SUPPORT_AVX2 - Success
-- CMAKE_SIZEOF_VOID_P (should be 8): 8
-- Good. You appear to have a 64-bit system. 
-- CMAKE_CXX_COMPILER_ID: GNU
-- CMAKE_C_COMPILER: 4.8
-- CXX_COMPILER_VERSION: 4.8
-- SSE 4.2 support detected
-- Snappy was found. Building additional targets codecssnappy and inmemorybenchmarksnappy.
-- Configuring done
-- Generating done
-- Build files have been written to: /home/demian/Documents/FastPFor/build

I would really appreciate if you could have a look at this and help me compile your library

codecs --clusterdynamic runtime failure

The clusterdynamic benchmark consistently fails. The first part, 1024 arrays of size 32768, succeeds. The second part, 32 arrays of size 1048576, fails. Here is what is printed to stderr:

Input's out of range: 321214499
terminate called after throwing an instance of 'std::runtime_error'
  what():  You tried to apply Simple16 to an incompatible set of integers: they should be in [0,2^28).

And here is what is printed to stdout leading up to the failure (I doubt this is useful):

# found clusterdynamic
# dynamic clustered data generation...
# generated 1024 arrays
# their size is  32768
#BP32+VariableByte	@BP32+VariableByte	JustCopy	@JustCopy	FastBinaryPacking16+VariableByte	@FastBinaryPacking16+VariableByte	FastBinaryPacking32+VariableByte	@FastBinaryPacking32+VariableByte	FastBinaryPacking8+VariableByte	@FastBinaryPacking8+VariableByte	FastPFor128+VariableByte	@FastPFor128+VariableByte	FastPFor256+VariableByte	@FastPFor256+VariableByte	MaskedVByte	@MaskedVByte	NewPFor<4,Simple16>+VariableByte	@NewPFor<4,Simple16>+VariableByte	OPTPFor<4,Simple16>+VariableByte	@OPTPFor<4,Simple16>+VariableByte	PFor+VariableByte	@PFor+VariableByte	PFor2008+VariableByte	@PFor2008+VariableByte	SIMDBinaryPacking+VariableByte	@SIMDBinaryPacking+VariableByte	SIMDFastPFor128+VariableByte	@SIMDFastPFor128+VariableByte	SIMDFastPFor256+VariableByte	@SIMDFastPFor256+VariableByte	SIMDGroupSimple+VariableByte	@SIMDGroupSimple+VariableByte	SIMDGroupSimple_RingBuf+VariableByte	@SIMDGroupSimple_RingBuf+VariableByte	SIMDNewPFor<4,Simple16>+VariableByte	@SIMDNewPFor<4,Simple16>+VariableByte	SIMDOPTPFor<4,Simple16>+VariableByte	@SIMDOPTPFor<4,Simple16>+VariableByte	SIMDPFor+VariableByte	@SIMDPFor+VariableByte	SIMDSimplePFor+VariableByte	@SIMDSimplePFor+VariableByte	Simple16	@Simple16	Simple8b	@Simple8b	Simple8b_RLE	@Simple8b_RLE	Simple9	@Simple9	Simple9_RLE	@Simple9_RLE	SimplePFor+VariableByte	@SimplePFor+VariableByte	streamvbyte	@streamvbyte	VariableByte	@VariableByte	VarIntG8IU	@VarIntG8IU	varintgb	@varintgb	VByte	@VByte	VSEncoding	@VSEncoding	
# for each scheme we give compression speed (million int./s) decompression speed and bits per integer
14	792.7	821.6	15.11		883.1	1125	16.18		2213	2161	32		2612	2714	32		778.7	789.5	15.54		824.4	852.7	16.65		1013	1009	15.11		1105	1124	16.18		476.9	559.2	16.59		527.3	621.9	18		289.5	928.4	14.71		308.7	628.4	16.09		230.6	1063	14.71		385.6	1216	16.1		296.6	764.4	17.13		560.1	895	20.67		134.71151	15.58		134.3	1240	17.48		6.817	649.1	14.76		8.486	486.5	16.57		325.3	1113	17.65		346.9	943.6	19.23		310.3	1213	17.65		306.7	1093	19.24		1953	243615.53		2258	2958	16.51		344	1862	14.71		418.5	2397	16.09		396.2	2074	14.71		466.2	2499	16.11		280.8	1231	18.64		268.9	1345	22.96		289.9	1193	19.08259.9	1382	23.26		137.1	1760	15.58		140.6	2322	17.48		6.844	839.2	14.76		7.955	650.2	16.57		417.7	2185	17.65		441.5	2586	19.23		343.5	2035	14.7		419.22419	16.09		172.4	330.5	20.32		251.8	430.9	25.69		242.3	431.8	16.23		344.4	420.7	18.12		77.4	508.4	16.23		86.88	505.1	18.12		198	325.6	20.42		255.8	421.425.72		66.28	320.4	20.42		64.18	252.4	25.72		313.4	1071	14.7		355.2	1166	16.09		384.4	2234	17.19		461.7	2629	20.01		254.8	244.5	17.13		390.2	368.6	20.67161.2	1621	17.75		164.4	1847	22.61		440.9	770.7	17.19		475.3	809.4	20.01		253.9	263.7	17.13		389.6	409.3	20.67		11.93	693.8	16.44		14.22	791.4	18.08		
# generated 32 arrays
# their size is  1048576
#BP32+VariableByte	@BP32+VariableByte	JustCopy	@JustCopy	FastBinaryPacking16+VariableByte	@FastBinaryPacking16+VariableByte	FastBinaryPacking32+VariableByte	@FastBinaryPacking32+VariableByte	FastBinaryPacking8+VariableByte	@FastBinaryPacking8+VariableByte	FastPFor128+VariableByte	@FastPFor128+VariableByte	FastPFor256+VariableByte	@FastPFor256+VariableByte	MaskedVByte	@MaskedVByte	NewPFor<4,Simple16>+VariableByte	@NewPFor<4,Simple16>+VariableByte	OPTPFor<4,Simple16>+VariableByte	@OPTPFor<4,Simple16>+VariableByte	PFor+VariableByte	@PFor+VariableByte	PFor2008+VariableByte	@PFor2008+VariableByte	SIMDBinaryPacking+VariableByte	@SIMDBinaryPacking+VariableByte	SIMDFastPFor128+VariableByte	@SIMDFastPFor128+VariableByte	SIMDFastPFor256+VariableByte	@SIMDFastPFor256+VariableByte	SIMDGroupSimple+VariableByte	@SIMDGroupSimple+VariableByte	SIMDGroupSimple_RingBuf+VariableByte	@SIMDGroupSimple_RingBuf+VariableByte	SIMDNewPFor<4,Simple16>+VariableByte	@SIMDNewPFor<4,Simple16>+VariableByte	SIMDOPTPFor<4,Simple16>+VariableByte	@SIMDOPTPFor<4,Simple16>+VariableByte	SIMDPFor+VariableByte	@SIMDPFor+VariableByte	SIMDSimplePFor+VariableByte	@SIMDSimplePFor+VariableByte	Simple16	@Simple16	Simple8b	@Simple8b	Simple8b_RLE	@Simple8b_RLE	Simple9	@Simple9	Simple9_RLE	@Simple9_RLE	SimplePFor+VariableByte	@SimplePFor+VariableByte	streamvbyte	@streamvbyte	VariableByte	@VariableByte	VarIntG8IU	@VarIntG8IU	varintgb	@varintgb	VByte	@VByte	VSEncoding	@VSEncoding	
# for each scheme we give compression speed (million int./s) decompression speed and bits per integer

Multiple definition error on __builtin_clz() in MSVC

In MSVC, the header file util.h defines a function __builtin_clz().
Since it is not an inline function, a multiple definition error (LNK2005) can occur.

FastPFor/headers/util.h

Lines 117 to 126 in d873fe1

#ifdef _MSC_VER
// taken from
// http://stackoverflow.com/questions/355967/how-to-use-msvc-intrinsics-to-get-the-equivalent-of-this-gcc-code
uint32_t __builtin_clz(uint32_t x) {
  unsigned long r = 0;
  _BitScanReverse(&r, x);
  return (31 - r);
}
#endif

A possible solution is adding the inline to the function:

inline uint32_t __builtin_clz(uint32_t x) {

any restriction on short (16 bits) data ?

Hi FastPFor,

We recently tried the FastPFor on short typed data. It does not work.
The normal int (32 bits) works fine.

Any underlying restriction on that ?

Thanks for explaining on it.

Bests,
Bin

Friendly build system for integrating into larger projects.

Was looking at adding this and realized that the headers are not scoped to say

include/fpfor/*

for example


#include "common.h"
#include "codecs.h"
#include "vsencoding.h"
#include "util.h"
#include "simple16.h"
#include "simple9.h"
#include "simple9_rle.h"
#include "simple8b.h"
#include "simple8b_rle.h"
#include "newpfor.h"
#include "simdnewpfor.h"
#include "optpfor.h"
#include "simdoptpfor.h"
#include "fastpfor.h"
#include "simdfastpfor.h"
#include "variablebyte.h"
#include "compositecodec.h"
#include "blockpacking.h"
#include "pfor.h"
#include "simdpfor.h"
#include "pfor2008.h"
#include "VarIntG8IU.h"
#include "simdbinarypacking.h"
#include "snappydelta.h"
#include "varintgb.h"
#include "simdvariablebyte.h"
#include "streamvariablebyte.h"
#include "simdgroupsimple.h"


Are you OK with me submitting a change to

  1. CMake
  2. include fixes to the files (which are largely cosmetic)?

Or would you rather do it yourself?

At its core, it plays nicely with an invocation like

  cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CXX_FLAGS="-O3 -fPIC" \
      -DCMAKE_INSTALL_PREFIX:PATH='{{third_party_dir}}' \

for example (taken from my project).

It makes it easy to integrate with other projects.

Thoughts?

Understanding performance

I didn't find a mailing list, but I wanted to evaluate this encoding; see:

https://gist.github.com/db0db4996ed4584ee45ee692db1bac27

Benchmark                                             Time           CPU Iterations
------------------------------------------------------------------------------------
BM_fast_pfor_simd_encode/256/256                    827 ns        829 ns     845117
BM_fast_pfor_simd_encode/512/512                   1279 ns       1281 ns     551218
BM_fast_pfor_simd_encode/1024/1024                 2102 ns       2104 ns     333331
BM_fast_pfor_simd_encode/131072/131072           238223 ns     237819 ns       2938
BM_fast_pfor_simd_encode/134217728/134217728  260439758 ns  260120405 ns          3

BM_varint_encode/256/256                            172 ns        172 ns    4059510
BM_varint_encode/256/256                            174 ns        174 ns    4003933
BM_varint_encode/512/512                            360 ns        360 ns    1942246
BM_varint_encode/1024/1024                          730 ns        729 ns     940887
BM_varint_encode/131072/131072                   132649 ns     132549 ns       5286
BM_varint_encode/134217728/134217728          193145278 ns  193002870 ns          4


In this example, a simple varint is about 30% faster than FastPFor (the SIMD128 version).

Is this expected?

Optimization level: -O2.

Input length in Simple8b and Simple16 codecs

At the end of decoding in these codecs, checks verify that the pointer into the input data does not run past the expected end of the input:

assert(in64 <= finalin64);

ASSERT(in <= endin, in - endin);

Although this makes sense as a consistency check in debug mode, the nvalue parameter is enough to decode. So, if these checks serve no purpose beyond that, (1) perhaps the len parameter could be removed; or (2) the assertions could be modified to allow len = 0, which is useful when the data length is not known, e.g.:

assert(len == 0 || in64 <= finalin64)

Some of the functions leave gaps in output buffers

For my own purposes I added unit tests, and I got lots of failures because some of the codecs in FastPFor leave uninitialized gaps in output buffers when encoding. Namely, these are the codecs that fail in my tests:
simdfastpfor128, simdfastpfor256, varintg8iu, simdgroupsimple_ringbuf

This is the test I use:

TEST(Compression, testIntegerCodec)
{
    const size_t SZ = 12345;
    const uint32_t n = 1000;
    const int diff = 10;
    std::vector<uint32_t> in(SZ, n);
    for (auto& n : in)
        n += (rand32() % (diff * 2)) - diff;
    // in is an array of random numbers in the range of [990, 1010]

    for (auto& coder : coders::allCodecs())
    {
        if (coder.name() == "Null")
            continue; // was `return`, which skipped all remaining coders
        LOG("coder: %s", std::string(coder.name()).c_str());
        std::vector<uint8_t> buf8(4 * (SZ + SZ / 4), 0xaa);  // << == init to 0xaa
        std::vector<uint32_t> buf32a, buf32b;
        size_t a = coder.encode(in, buf32a);
        size_t b = coder.encode(&in[0], in.size(), buf32b);
        size_t c = coder.encode(&in[0], in.size(), &buf8[0], buf8.size());
        buf8.resize(c);
        ASSERT_EQ(a, b);
        ASSERT_EQ(a, c);
        ASSERT_EQ(buf32a, buf32b);
        ASSERT_EQ(0, memcmp(buf32a.data(), buf8.data(), c));
    }
}

Most codecs pass the test, but those that fail show this memory diff:

[screenshot: hex dump comparing the expected and actual output buffers]

As you can see, the failing codecs write the first 8 bytes, then leave the next 8 bytes untouched. IMO, these codecs should be fixed to set that untouched memory to something deterministic.
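
One workaround on the caller side, until the codecs themselves are fixed: zero-fill the output region before encoding so that any skipped words are at least deterministic. A sketch; `CopyCodec` is a hypothetical stand-in that mimics the `encodeArray` calling convention, and any real IntegerCODEC could be substituted for it.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in mimicking the IntegerCODEC calling convention
// encodeArray(in, len, out, nvalue); it just copies the input.
struct CopyCodec {
  void encodeArray(const uint32_t *in, size_t len, uint32_t *out,
                   size_t &nvalue) {
    std::copy(in, in + len, out);
    nvalue = len;
  }
};

// Zero-fill the output region before encoding so that any words a codec
// skips over have a deterministic value, making encoded buffers
// byte-comparable. This does not change anything the decoder reads.
template <typename Codec>
size_t encode_deterministic(Codec &codec, const std::vector<uint32_t> &in,
                            std::vector<uint32_t> &out) {
  std::fill(out.begin(), out.end(), 0u);
  size_t nvalue = out.size();
  codec.encodeArray(in.data(), in.size(), out.data(), nvalue);
  out.resize(nvalue);
  return nvalue;
}
```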

How to serialize/save and load/deserialize

Hi,

I have attempted to serialize/save using the following code:

```c++
std::ofstream outfile("outfile.dat", std::ofstream::binary);
outfile.write(reinterpret_cast<const char*>(compressed_output.data() /* or &v[0] pre-C++11 */), sizeof(uint32_t) * compressed_output.size());
outfile.close();
```

I am not sure this is the right way, because I end up with a file that is larger (~2x) than what is reported by:

```c++
  std::cout << std::setprecision(3);
  std::cout << "You are using "
            << 32.0 * static_cast<double>(compressed_output.size()) /
                   static_cast<double>(mydata.size())
            << " bits per integer. " << std::endl;
```

For instance, I get `You are using 9.11 bits per integer` with `96504` integers but my output file is `-rw-r--r--  1 root root 215K May 10 18:58 outfile.dat`

What am I doing wrong? Also, how would you load/deserialize from this file?

Thanks for your help.
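
For what it's worth, the writing code itself looks reasonable; a ~2x file usually means the vector was written out before being shrunk to the size reported by encodeArray. For loading, you also need to remember how many words were written. A minimal sketch (save_words/load_words are hypothetical helpers, not part of the library) that prefixes the stream with a word count:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helpers: prefix the compressed words with their count so the
// reader knows how many words to load before calling decodeArray. Make sure
// the vector was resized to the nvalue reported by encodeArray before saving.
bool save_words(const std::string &path, const std::vector<uint32_t> &words) {
  std::ofstream out(path, std::ofstream::binary);
  if (!out) return false;
  const uint64_t n = words.size();
  out.write(reinterpret_cast<const char *>(&n), sizeof n);
  out.write(reinterpret_cast<const char *>(words.data()),
            static_cast<std::streamsize>(sizeof(uint32_t) * n));
  return static_cast<bool>(out);
}

bool load_words(const std::string &path, std::vector<uint32_t> &words) {
  std::ifstream in(path, std::ifstream::binary);
  if (!in) return false;
  uint64_t n = 0;
  in.read(reinterpret_cast<char *>(&n), sizeof n);
  words.resize(n);
  in.read(reinterpret_cast<char *>(words.data()),
          static_cast<std::streamsize>(sizeof(uint32_t) * n));
  return static_cast<bool>(in);
}
```

After load_words, the loaded buffer and its size can be passed straight to decodeArray; you may also want to store the original (uncompressed) length in the header so the destination buffer can be sized correctly.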

Provide Windows CI script

We could possibly add AppVeyor CI testing with an .appveyor.yml file that looks like the following; we need to validate that CMake builds work under Windows.

version: '{build}'
branches:
  only:
  - master
image:
    - Visual Studio 2017
    - Visual Studio 2015
clone_folder: c:\projects

platform:
- x64

environment:
  matrix:
    - GENERATOR: "Visual Studio 15 2017" # x86 build
      AVXFLAG: "OFF"
    - GENERATOR: "Visual Studio 15 2017 Win64" # x64 build
      AVXFLAG: "OFF"
    - GENERATOR: "Visual Studio 15 2017 Win64" # x64 build
      AVXFLAG: "ON"
    - GENERATOR: "Visual Studio 14 2015" # x86 build
      AVXFLAG: "OFF"
    - GENERATOR: "Visual Studio 14 2015 Win64" # x64 build
      AVXFLAG: "OFF"
    - GENERATOR: "Visual Studio 14 2015 Win64" # x64 build
      AVXFLAG: "ON"


matrix:
    fast_finish: true
    exclude:
      - image:      Visual Studio 2015
        GENERATOR: "Visual Studio 14 2015 Win64" 
        AVXFLAG: "ON"
      - image:      Visual Studio 2015
        GENERATOR: "Visual Studio 15 2017" # x86 build
      - image:      Visual Studio 2015
        GENERATOR: "Visual Studio 15 2017 Win64" # x64 build
      - image:      Visual Studio 2017
        GENERATOR: "Visual Studio 14 2015" # x86 build
      - image:      Visual Studio 2017
        GENERATOR: "Visual Studio 14 2015 Win64" # x64 build

build_script:
  - mkdir build
  - cd build
  - ps: cmake -G "$env:GENERATOR" -DFORCE_AVX="$env:AVXFLAG" ..
  - cmake --build .
  - ctest --verbose

Question on compressing uint16_t

Hi, FastPFor is a wonderful project and has already been added to our database infinity as the posting-list codec for the built-in full-text search engine.

However, we have not enabled it yet, because we need the codec to be template-based so that both uint32_t and uint16_t can be compressed and decompressed correctly. The current FastPFor only provides interfaces for uint32_t. A naive solution is to cast between uint16_t and uint32_t one element at a time, but that is inefficient. How could a built-in codec for uint16_t be provided? Thank you ~
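
Until a template-based codec exists, the cast-based route can at least be factored into helpers. A sketch of the naive widening approach (widen/narrow are hypothetical names); note that because the bit-packing codecs size each block by the values' actual bit width, widening mostly costs time and memory rather than compressed size:

```cpp
#include <cstdint>
#include <vector>

// Widen 16-bit values so a 32-bit codec can encode them. Values <= 65535
// still pack into at most 16 bits per block, so the compression ratio of
// bit-width-based schemes is largely preserved; the overhead is the copy.
std::vector<uint32_t> widen(const std::vector<uint16_t> &in) {
  return std::vector<uint32_t>(in.begin(), in.end());
}

// Narrow decoded 32-bit values back to 16 bits after decompression.
std::vector<uint16_t> narrow(const std::vector<uint32_t> &in) {
  std::vector<uint16_t> out;
  out.reserve(in.size());
  for (uint32_t v : in) out.push_back(static_cast<uint16_t>(v));
  return out;
}
```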

Segmentation fault (Core dumped)

I have Ubuntu 13.0 in VMware Player. I installed the FastPFor library by executing:
cmake .
make
After that, I tried to run example.cpp, but this error occurred:
Segmentation fault (core dumped).
What is the problem and how can I correct it?

Looking for software for delta compression

Hi Lemire,

I found your software while searching for a solution for large binary data.

We are using files in an unpublished proprietary file format.
With hexdump I noticed that they consist of long data blocks which could be efficiently compressed with delta-compression techniques. The values are 4 bytes each; whether they are longs or floats (mantissa+exponent) I do not know yet.

Do you know of a compression program that could compress such a file? It would need to identify these block positions automatically. Could this be done with FastPFor?

Many thanks

Christoph

Hi, can you help verify this Simple8b? I am actually quite confused about two things.

  • The selector value 2 corresponds to b = 1. This allows us to store 60 integers having values in {0,1}, which are packed in the data bits.

  • The selector value 3 corresponds to b = 2 and allows one to pack 30 integers having values in [0, 4] in the data bits.

Shouldn't this be [0, 3] instead of [0, 4]? I mean, it should only allow the values 0, 1, 2, 3 for a bit length of 2, right?

  • Selector values 0 or 1 represent sequences containing 240 and 120 zeros, respectively. In this instance the 60 data bits are ignored.

I'm not sure how to use these 240 or 120 zeros. Are they for rare situations only?
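
Your reading is right: a 2-bit field can only hold 0 through 3, so [0, 3] is the correct closed range (the quoted [0, 4] is presumably either a typo or the half-open interval [0, 4)). A two-line check of the rule, plus a note on the zero selectors:

```cpp
#include <cstdint>

// For a field of b bits, the representable values are 0 .. (1 << b) - 1.
// So selector 3 (b = 2) packs 30 integers in [0, 3], not [0, 4].
constexpr uint32_t max_value_for_bits(uint32_t b) {
  return (b >= 32) ? 0xFFFFFFFFu : ((1u << b) - 1u);
}

// Selectors 0 and 1 are run-length escapes: the 60 data bits are unused and
// the entire 64-bit word stands for 240 or 120 consecutive zeros. They only
// pay off on inputs with long zero runs (e.g. sparse delta streams), so yes,
// they target a special, relatively rare situation.
```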

Function to compute or estimate compressed size

In "FastPFor/headers/codecs.h" there is virtual std::vector<uint32_t> compress(const std::vector<uint32_t> & data).
Is it possible to compute the size of the compressed output, or a close approximation, instead of allocating a big chunk and resizing it later?
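
For most of these schemes the exact output size is only known after encoding, so the usual pattern is a generous upper bound followed by a resize to the reported nvalue. A sketch of such a bound; worst_case_words is a hypothetical helper, and the 2x-plus-slack figure is my own deliberately loose assumption, sized to cover even a variable-byte worst case of 5 bytes per 32-bit integer:

```cpp
#include <cstddef>

// A deliberately loose upper bound (an assumption, not a library guarantee):
// variable-byte can expand a 32-bit integer to 5 bytes (1.25 words), so
// doubling the input length plus a small per-block slack is comfortably
// safe. Allocate this much, encode, then shrink to the reported nvalue.
constexpr size_t worst_case_words(size_t input_words,
                                  size_t slack_words = 1024) {
  return 2 * input_words + slack_words;
}
```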

Binary Packing

Without looking at the implementations it is hard to get whats the difference between bitpacking, bitpackingaligned, bitpackingunaligned and blockpacking. Also are there any reference to specific paper for those?

64-bit support for FastPFor

Hi,

I extended the current FastPFor library to support 64-bit integers, and I'd like to contribute it back. Is this something you'd like to take? @lemire

The basic idea is to make FastPFor a template class, which then supports different integer types. The fastpack/fastunpack routines need to be updated accordingly to support packing and unpacking beyond 32 bits.

The IntegerCODEC class will also add the interfaces below for 64-bit encoding:

virtual void encodeArray(const uint64_t *in, const size_t length,
                         uint32_t *out, size_t &nvalue,
                         EncodeMeta *meta = nullptr);

virtual const uint32_t *decodeArray(const uint32_t *in, const size_t length,
                                    uint64_t *out, size_t &nvalue,
                                    const EncodeMeta *meta = nullptr);

The 64-bit support is only added to CompositeCodec, FastPFor and VariableByte. For all the other codecs, the new interface would throw a not-implemented exception as the default behavior.

If you think this is a reasonable change, I can open a PR next week.

ARM64 build: it would be good at least to be buildable

Hi!

I've tried to just build on a Raspberry Pi 4B, on Ubuntu 20.04:

$ uname -a
Linux ubuntu 5.4.0-1032-raspi #35-Ubuntu SMP PREEMPT Fri Mar 19 20:52:40 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ cat /proc/cpuinfo
processor	: 0
BogoMIPS	: 108.00
Features	: fp asimd evtstrm crc32 cpuid
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0xd08
CPU revision	: 3
...

Hardware	: BCM2835
Revision	: c03112
Serial		: 100000007c08a3a6
Model		: Raspberry Pi 4 Model B Rev 1.2

with a simple mkdir build && cd build && cmake .. then make.

That naive first attempt immediately failed due to the absence of 'immintrin.h'.

I did a little digging and replaced the line in 'common.h' that includes immintrin.h with:

//#include <immintrin.h>
#  if defined(__ARM_NEON)
#    include <arm_neon.h>
#  elif defined(__WINDOWS__) || defined(__WINRT__)
/* Visual Studio doesn't define __ARM_ARCH, but _M_ARM (if set, always 7), and _M_ARM64 (if set, always 1). */
#    if defined(_M_ARM)
#      include <armintr.h>
#      include <arm_neon.h>
#      define __ARM_NEON 1 /* Set __ARM_NEON so that it can be used elsewhere, at compile time */
#    endif
#    if defined (_M_ARM64)
#      include <arm64intr.h>
#      include <arm64_neon.h>
#      define __ARM_NEON 1 /* Set __ARM_NEON so that it can be used elsewhere, at compile time */
#    endif
#  endif

This also fails, with a huge batch of errors about undeclared '_mm_storeu_si128', '_mm_loadu_si128', '__m128i', and many others.

So, there are some obvious questions:

  1. Is it possible to build on that platform? (AFAIK it is not only the RPi; the new Macs on M1 may also be targeted, which is more serious!)
  2. If it is possible, are any instructions available?
  3. If it is not possible with hardware intrinsics, is a fallback possible? It need not be fast, a pure 'stub' would do, but at least the library would be buildable so that its consumers can still use it.

Unique process of compression

Hi,
I succeeded in building the FastPFor package on my machine and compiling example.cpp. Then I changed the integers in the data vector:
std::vector<uint32_t> mydata(N);
mydata[0] = 4294967295;
mydata[1] = 4294967295;
I display the compressed and decompressed data:
std::cout<<"Compressed data " << compressed_output.data()<<std::endl;
codec.decodeArray(compressed_output.data(),
compressed_output.size(), mydataback.data(), recoveredsize);
std::cout<<"Decompressed data 1 " <<mydataback.data()[0]<<std::endl;
std::cout<<"Decompressed data 2 " <<mydataback.data()[1]<<std::endl;
The result of the first execution is:
Compressed data 0x1b99a80
You are using 0.109 bits per integer.
Decompressed data 1 4294967295
Decompressed data 2 4294967295
In the second execution:
Compressed data 0xd0da80
You are using 0.109 bits per integer.
Decompressed data 1 4294967295
Decompressed data 2 4294967295
---> I obtain different compressed data. Perhaps this is the address of the compressed data, so I added
std::cout<<"Compressed data " << compressed_output.data()[0]<<std::endl;
but the printed compressed data is then the same even when I change mydata[0].
How can I distinguish between two compression processes? Is the compressed data unique for each compression? What information makes a compression process unique and unchangeable?

Thanks in advance.
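
What the first std::cout prints is the address returned by data(), not the compressed payload; the address changes between runs (heap layout, ASLR) even when the payload is identical. To compare two compression runs, compare the buffer contents instead. A tiny sketch (dump_words is a hypothetical helper):

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: render the compressed words themselves. Two runs on
// the same input produce the same string, unlike the data() pointer, whose
// value depends on where the heap happens to place the buffer each run.
std::string dump_words(const std::vector<uint32_t> &buf) {
  std::ostringstream os;
  for (size_t i = 0; i < buf.size(); ++i) {
    if (i) os << ' ';
    os << buf[i];
  }
  return os.str();
}
```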

32-bit ARM support

Hey,

Is it possible to build and run the decompression on an ARM 32-bit system?
Also, how easy would it be to use 16-bit integers instead of 32 or 64?

Best

Point Based Access?

Hey,
I wanted to ask whether this library supports point-based access. By point-based access I mean something like this:

uint32_t single_decompressed_value = decoded.get(i); // only the number at the i-th position gets decompressed

If this is not the case: can you maybe give an insight into your considerations in this area? And since you seem to be quite the experts, do you know of other good libraries that support it?
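
For context, block-oriented codecs like those in this library decode a whole block (typically 128 integers) at a time, so point access is usually layered on top: record where each block starts in the compressed stream, then decode only the block containing position i. A structural sketch; BlockStore is hypothetical, and the "codec" here is a plain copy so the example stays self-contained, whereas a real version would call encodeArray/decodeArray per block:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical point-access layer: blocks of 128 values, with the start of
// each block recorded so that a single block can be decoded on demand.
struct BlockStore {
  static constexpr size_t kBlock = 128;
  std::vector<uint32_t> words;  // compressed stream (here: a verbatim copy)
  std::vector<size_t> offsets;  // start of each block within `words`

  void build(const std::vector<uint32_t> &in) {
    for (size_t i = 0; i < in.size(); i += kBlock) {
      offsets.push_back(words.size());
      size_t n = in.size() - i;
      if (n > kBlock) n = kBlock;
      // A real codec would encodeArray() these n values here.
      words.insert(words.end(), in.begin() + i, in.begin() + i + n);
    }
  }

  // Decode only the block containing position i, then pick one element;
  // the cost is one block decode (128 values), not the whole stream.
  uint32_t get(size_t i) const {
    const size_t block = i / kBlock;
    // A real codec would decodeArray() the block into a 128-entry scratch
    // buffer here; with the copy "codec" we can index directly.
    return words[offsets[block] + (i % kBlock)];
  }
};
```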
