GithubHelp home page GithubHelp logo

lyp-bobi / alp-varlen Goto Github PK

View Code? Open in Web Editor NEW

This project forked from cwida/alp

0.0 0.0 0.0 6.26 MB

Toward an ALP implement which does better with vectors with variable length (e.g. 500~5000 floating numbers)

License: MIT License

C++ 98.36% Python 0.13% C 0.87% CMake 0.35% Jupyter Notebook 0.30%

alp-varlen's Introduction

ALP: Adaptive Lossless Floating-Point Compression

Lossless floating-point compression algorithm for double/float data type. ALP significantly improves over all previous floating-point encodings in both speed and compression ratio (figure below; each dot represents a dataset).

ALP Benchmarks

  • High Speed: Scans 44x faster than Gorilla, 64x faster than Chimp, 31x faster than Zstd. Compresses 11x faster than Zstd, 138x faster than PDE and x10 faster than Chimp.
  • High Compression: 50% more compression than Gorillas. 24% more than Chimp128. On par with Zstd level 3.
  • Adaps to data: By using a two-stage algorithm that first samples row-groups and then vectors.
  • Scalar code: Auto-vectorizes thanks to FastLanes.
  • Lightweight Encoding: Compression and decompression occurs in blocks of 1024 values. Ideal for columnar databases.
  • Proven Effectiveness: Effectiveness and speed led to deprecating Chimp128 and Patas in DuckDB.
  • Works on difficult floats: Can losslessly compress even floats present as ML models parameters better than Zstd and all other encodings.

To rigorously benchmark ALP with your own data we provide our ALP primitives as a single C++ header file.

To quickly test ALP we recommend following our examples in the Quickstart guide or using it on DuckDB (note that ALP inside DuckDB is slower than using our primitives).

ALP details can be found in the publication.

Used By

DuckDB

Contents

ALP in a Nutshell

ALP has two compression schemes: ALP for doubles/floats which were once decimals, and ALP_RD for true double/floats (e.g. the ones which stem from many calculations, scientific data, ML weights).

ALP losslessly transforms doubles/floats to integer values with two multiplications to FOR+BitPack them into only the necessary bits. This is a strongly enhanced version of PseudoDecimals.

ALP_RD splits the doubles/floats bitwise representations into two parts (left and right). The left part is encoded with a Dictionary compression and the right part is Bitpacked to just the necessary bits.

Both encodings operate in vectors of 1024 values at a time (fit vectorized execution) and leverage in-vector commonalities to achieve higher compression ratios and be faster (by avoiding per-value adaptivity) than other methods.

Both encodings encode outliers as exceptions to achieve higher compression ratios.

Quickstart

Usage examples are available under the example directory. In here, we use a simple [de]compression API to store/read ALP data in/from memory.

Note that the [de]compression API used by these examples is only a naive wrapper of the real ALP core: the primitives.

Building and Running

Requirements:

  1. Clang++
  2. CMake 3.20 or higher

Building and running the simple compress example:

cmake -DALP_BUILD_EXAMPLE=ON .   # or set option in the CMakeLists.txt
cd example
make
./simple_compress

This will also generate the ALP Primitives.

ALP Primitives

You can make your own [de]compression API by using ALP primitives. An example of the usage of these can be found in our simple compression and decompression API. The decoding primitives of ALP are auto-vectorized thanks to FastLanes. For benchmarking purposes, we recommend you use these primitives.

You can use these by including our library in your code: #include "alp.hpp".

Check the full documentation of these on the PRIMITIVES.MD readme.

ALP in DuckDB

ALP replaced Chimp128 and Patas in DuckDB. In DuckDB, ALP is x2-4 times faster than Patas (at decompression) achieving twice as high compression ratios (sometimes even much more). DuckDB can be used to quickly test ALP on custom data, however, we advise against doing so if your purpose is to rigorously benchmark ALP against other algorithms.

Here you can find a basic example on how to load data in DuckDB forcing ALP to be used as compression method. These statements can be called using the Python API.

Please note: ALP inside DuckDB: i) Is slower than using our primitives presented here, and ii) compression ratios can be slightly worse due to the metadata needed to skip vectors and DuckDB storage layout.

Benchmarking (Replicating Paper Experiments)

In BENCHMARKING.md we detail how to replicate the experiments and benchmarks presented in our publication.

On the benchmarked datasets from our publication:

  • ALP achieves on average x3 compression ratios (sometimes much much higher).
  • ALP encodes on average 0.5 doubles per CPU cycle.
  • ALP decodes on average 2.6 doubles per CPU cycle.

On FCBench:

  • ALP achieves a compression ratio of 2.08 (beating all other compressors)

alp-varlen's People

Contributors

lyp-bobi avatar azimafroozeh avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.