eyalroz / libgiddy

Giddy - A lightweight GPU decompression library

License: BSD 3-Clause "New" or "Revised" License

Languages: CMake 13.62%, Shell 0.22%, Cuda 52.46%, C++ 32.31%, C 1.39%
Topics: compression-scheme, gpu, cuda, compressed-data, lightweight-compression, pcie, databases, run-length-encoding

libgiddy's Introduction

Giddy - A GPU lightweight decompression library

The code is now somewhat out-of-date. Contact me regarding an upcoming update.

(Originally presented in this mini-paper in the DaMoN 2017 workshop)

Table of contents
- Why lightweight compression for GPU work?
- What does this library comprise?
- Which compression schemes are supported?
- How to decompress data using Giddy?
- Performance
- Acknowledgements

For questions, requests or bug reports - either use the Issues page or email me.

Why lightweight compression for GPU work?

Discrete GPUs are powerful beasts, with numerous cores and high-bandwidth memory, often capable of 10x the data-crunching throughput achievable on a CPU. Perhaps the main obstacle to utilizing them, however, is that data usually resides in main system memory, close to the CPU; for the GPU to process it, we must send it over a PCIe bus. Thus a CPU can potentially process in-memory data at (typically, in 2017) 30-35 GB/sec, while a discrete GPU can receive it at no more than 12 GB/sec.

One way of counteracting this handicap is compression. The GPU can afford to expend more effort decompressing data arriving over the bus than the CPU can; thus, if the data is available a priori in system memory and is amenable to compression, using compression may increase the GPU's effective bandwidth more than it would the CPU's.

Compression schemes come in many shapes and sizes, but it is customary to distinguish "heavy-weight" schemes (such as those based on Lempel-Ziv) from "lightweight" schemes, which involve only a small amount of computation per element and few accesses to the compressed data to decompress any single element.

Giddy enables the use of lightweight compressed data on the GPU by providing decompressor implementations for a plethora of compression schemes.

What does this library comprise?

Giddy comprises:

  • CUDA kernel source code for decompressing data using each of the compression schemes listed below. The kernels are templated, and one may instantiate them for a variety of combinations of types and some compression scheme parameters which it would not be efficient to pass at run-time.
    • ... and source code for auxiliary kernels required for decompression (e.g. for the scattering of patch data).
  • A uniform mechanism for configuring launches of these kernels (grid dimensions, block dimensions and dynamic shared memory size).
  • A kernel wrapper abstraction class --- which is not specific to decompression work, but rather general --- and individual kernel wrappers for each decompression scheme (templated similarly to the kernels themselves). Instead of dealing directly with the kernels at the lower level, making CUDA API calls yourself, you can instead use the associated wrapper.
  • Each kernel wrapper class also registers itself in a factory, which you can use to instantiate wrappers without having compiled against their code. The factory provides instances of a common base class, whose virtual methods are used to pass scheme-specific arguments.

If this sounds a bit confusing, scroll down to the examples section.

Which compression schemes are supported?

The following compression schemes are currently included:

(Note: the Wiki pages for each of the schemes are just now being written.)

Additionally, two patching schemes are supported:

  • Naive patching
  • Compressed-Indices patching

As these are "a posteriori" patching schemes, you apply them by simply decompressing using some base scheme, then using one of the two kernels data_layout::scatter or data_layout::compressed_indices_scatter on the initial decompression result. You will not find specific kernels, kernel wrappers or factory entries for the "combined" patched scheme - only for its components.
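For illustration, the following host-side loop expresses the semantics of naive patching - on the device, this is essentially what the data_layout::scatter kernel computes (the patch array names here are illustrative, not the library's):

// Naive patching, in illustrative host-side form: after decompressing
// with the base scheme, overwrite the patched positions.
// patch_positions[i] is where patch_values[i] belongs in the output.
for (size_type i = 0; i < num_patches; i++) {
    decompressed[patch_positions[i]] = patch_values[i];
}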

How to decompress data using Giddy?

Note: The examples use the C++'ish CUDA API wrappers, making the host-side code somewhat clearer and shorter.

Suppose we're presented with compressed data with the following characteristics, which for simplicity is already in GPU memory:

Parameter                          Value
Decompression scheme               Frame of Reference
Width of size/index type           32 bits
Uncompressed data type             int32_t
Type of offsets from FOR value     int16_t
Segment length                     (runtime variable)
Total length of compressed data    (runtime variable)

In other words, we want to implement the following function:

using size_type         = uint32_t; // assuming less than 2^32 elements
using uncompressed_type = int32_t;
using compressed_type   = int16_t;
// Assumption for this sketch: with a constant ("frame") model, each segment's
// model coefficients reduce to a single reference value of the uncompressed type
using model_coefficients_type = uncompressed_type;

void decompress_on_device(
	uncompressed_type*              __restrict__  decompressed,
	const compressed_type*          __restrict__  compressed,
	const model_coefficients_type*  __restrict__  segment_model_coefficients,
	size_type                                     length,
	size_type                                     segment_length);

We can do this with Giddy in one of three ways.

1. Direct use of the kernel

The example code for this mode of use is found in examples/src/direct_use_of_kernel.cu.

In this mode, we:

  • Include the kernel source file; we now have a pointer to the kernel's device-side function.
  • Include the launch config resolution mechanism header.
  • Instantiate a launch configuration resolution parameters object, with the parameters specific to our launch.
  • Call the resolve_launch_configuration() function with the object we instantiated, obtaining a launch_configuration_t struct.
  • Perform a CUDA kernel launch, either using the API wrapper (which takes the device function pointer and a launch_configuration_t) or the plain vanilla way, extracting the fields of the launch_configuration_t.
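Schematically, this mode looks as follows. It is only a sketch: the header path, the parameters object's constructor, the kernel function's name and the configuration struct's field names are all assumptions here; the authoritative code is in examples/src/direct_use_of_kernel.cu.

#include "kernels/decompression/frame_of_reference.cuh"  // assumed header path

namespace for_kernel = cuda::kernels::decompression::frame_of_reference;
using model_type = cuda::functors::unary::parametric_model::constant<4u, int>;

// Steps 3-4: describe our launch, then resolve a concrete configuration
for_kernel::launch_config_resolution_params_t params(length, segment_length); // assumed ctor
launch_configuration_t config = resolve_launch_configuration(params);

// Step 5: the "plain vanilla" launch, unpacking the configuration's fields
for_kernel::decompress<4u, int, short, model_type>        // assumed kernel name
    <<<config.grid_dimensions, config.block_dimensions,
       config.dynamic_shared_memory_size>>>(
    decompressed, compressed, segment_model_coefficients, length, segment_length);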

2. Instantiation of the kernel wrapper

The example code for this mode of use is found in examples/src/instantiation_of_wrapper.cu.

Each decompression kernel has a corresponding thin wrapper class. An instance of the wrapper class has no state - no data members; we only use it for its vtable - its virtual methods, specific to the decompression scheme. Thus, in this mode of use, we:

  • Include the kernel's wrapper class definition.
  • Instantiate the wrapper class cuda::kernels::decompression::frame_of_reference::kernel_t.
  • Call the wrapper's resolve_launch_configuration() method with the appropriate parameters, obtaining a launch_configuration_t structure.
  • Call the freestanding function cuda::kernel::enqueue_launch() with our wrapper instance, the launch configuration, and the arguments we need to pass to the kernel.
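In sketch form (the wrapper's method parameter lists are assumptions; see examples/src/instantiation_of_wrapper.cu for the real thing):

#include "kernel_wrappers/decompression/frame_of_reference.h"  // assumed header path

// The wrapper is stateless; we only use it for its virtual methods
cuda::kernels::decompression::frame_of_reference::kernel_t<
    4u, int, short, cuda::functors::unary::parametric_model::constant<4u, int>
> wrapper;

// Resolve a launch configuration, then enqueue the launch through the wrapper
auto config = wrapper.resolve_launch_configuration(length, segment_length); // assumed args
cuda::kernel::enqueue_launch(wrapper, config,
    decompressed, compressed, segment_model_coefficients, length, segment_length);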

3. Use of the factory-provided, type-erased wrapper

The example code for this mode of use is found in examples/src/factory_provided_type_erased_wrapper.cu.

The kernel wrappers are intended to provide a uniform interface for launching kernels. This uniformity is achieved by type erasure: the wrappers' base-class virtual methods all take a map of boost::any objects, and it is up to the caller to pass the appropriate parameters in that map. Thus, in this mode, we:

  • Include just the common base class header for the kernel wrappers.
  • Use the cuda::registered::kernel_t class' static method produceSubclass() to instantiate the specific wrapper relevant to our scenario (named "decompression::frame_of_reference::kernel_t<4u, int, short, cuda::functors::unary::parametric_model::constant<4u, int> >"). What we actually hold is an std::unique_ptr to such an instance.
  • Prepare a type-erased map of parameters, and pass it to the resolve_launch_configuration() method of our instance, obtaining a launch_configuration_t structure.
  • Prepare a second type-erased map of parameters, and pass it to the enqueue_launch() method of our instance, along with the launch configuration structure we've just obtained.
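A sketch of this mode (the map key strings are illustrative assumptions; the real ones appear in examples/src/factory_provided_type_erased_wrapper.cu):

#include "kernel_wrappers/registered_wrapper.h"  // assumed header path

// The factory hands us an std::unique_ptr to the common base class
auto wrapper = cuda::registered::kernel_t::produceSubclass(
    "decompression::frame_of_reference::kernel_t<4u, int, short, "
    "cuda::functors::unary::parametric_model::constant<4u, int> >");

// First type-erased map: launch configuration resolution parameters
std::unordered_map<std::string, boost::any> config_args = {
    { "length",         length         },   // key names are assumptions
    { "segment_length", segment_length },
};
auto config = wrapper->resolve_launch_configuration(config_args);

// Second type-erased map: the kernel's actual arguments
std::unordered_map<std::string, boost::any> launch_args = {
    { "decompressed",               decompressed               },
    { "compressed",                 compressed                 },
    { "segment_model_coefficients", segment_model_coefficients },
    { "length",                     length                     },
    { "segment_length",             segment_length             },
};
wrapper->enqueue_launch(launch_args, config);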

No facility for compression!

No code is currently provided for compressing data - neither on the device nor on the host side. This is Issue #3.

Performance

Some of the decompressors are well-optimized; some need more work. The most recent (and only) performance analysis is in the mini-paper mentioned above. Step-by-step instructions for measuring performance (using well-known data sets) are forthcoming.

Acknowledgements

This endeavor was made possible with the help of:

  • CWI Amsterdam
  • Prof. Peter Boncz, co-author of the above-mentioned paper
  • The MonetDB DBMS project - which got me into DBMSes and GPUs in the first place, and which I (partially) use for performance testing


libgiddy's Issues

identify supported build environment

Hey, @eyalroz, love the project, can't wait to try it out.

I'm trying to build libgiddy and I'm running into a lot of problems. Most of them are just version mismatches e.g. my CUDA was too new, but my cmake was too old, and my Boost install was too new, etc. I'm still running into build problems but I have reason to believe they're versioning-related.

Could you provide a list of known good versions for libgiddy's requirements?

Performance is generally weak for short inputs

There's currently almost no attempt to optimize performance for short inputs - e.g. tweaking the serialization factor, or even the core algorithm, to allow for better parallelization in cases in which we cannot fill entire blocks the way we would like to.

This would require scrutiny of the kernels, and no less importantly the launch configuration resolution code (both the general mechanism and the kernel-specific parts).

Support sub-byte-resolution indices into dictionaries

The current implementation of dictionary compression assumes a dictionary entry index is some basic unsigned integral type; we can't have a dictionary of 128 entries, or 16, or 4.

We need to specialize for the case of sub-byte dictionary indices - at least for units of 1, 2 and 4 bits, if not for arbitrary bit lengths (perhaps even over 1 byte). The sketch below shows the kind of extraction involved.
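For a sense of what this involves, extracting a 2-bit index (i.e. into a 4-entry dictionary) from packed bytes might look like this sketch (not library code):

// Sketch: fetch the 2-bit dictionary index of element element_idx
// from a packed array of bytes, 4 indices per byte.
__device__ inline unsigned extract_2bit_index(
    const unsigned char* __restrict__ packed, unsigned element_idx)
{
    unsigned char byte = packed[element_idx / 4]; // the byte holding our index
    unsigned shift = (element_idx % 4) * 2;       // bit offset within that byte
    return (byte >> shift) & 0x3;                 // mask out the 2-bit index
}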

Reduce dependencies on general-purpose C++ code requiring linking

The codebase begins with an "inheritance" of a C++ utilities collection in the src/util directory. Some of that code requires linking, and is not header-only; I would rather the users of this library not be encumbered by those objects.

After getting rid of many dependencies before the initial commit, we're left with two:

  • poor_mans_reflection.cpp
  • stack_trace.cpp

The former should be easier to remove, I believe; the latter - well, mostly emotionally painful, I guess... I like having stack traces on my exceptions. But in the interest of simplicity we'll drop those until we can just depend on Polukhin's upcoming Boost.Stacktrace.

Support bit-resolution sizes for all relevant compression schemes

While slower to decompress, it is certainly useful to support data whose compressed form has sizes which are not always multiples of a byte. This is particularly useful for element sizes of 1...7 bits.

Bit-resolution element sizes should be supported for the decompressed type by all compression schemes which have a theoretically-unrestricted Uncompressed typename template.

As for the compressed input, bit-resolution sizes should be supported at least by the following schemes:

  • Dictionary (for when the number of distinct elements is small; this is Issue #65).
  • Discard Zero Bytes - Fixed (for when almost all elements are sized, say, 3+b bits to 16+b bits or so, for some non-negative b)
  • Frame of Reference (for when the discrepancies from the FOR model function are usually very small)
  • Run-Length Encoding (either for small-size data or for when run lengths are usually relatively short)
  • Delta (for when elements don't differ much from each other; if they differ consistently, you want DELTA-then-FOR compression)

And we should perhaps consider also:

  • Discard Zero Bytes - Variable, in a variant in which the sizes are in bits.

Make Delta decompression great^H^H^H^H^H typed again

Why did I switch to only using sizes for Delta decompression? Let's be more flexible than that. It's true that it's not strictly necessary in many/most cases, but it's reasonable to expect to be able to pass signed integers, at least, as differences.

Implement compression primitives for runtime-specified-length data

Our current implementations of compression schemes (see Issues #24, #25) all assume the uncompressed data type is known at compile time. In other words, there will be a separate instantiation for each type; this has its downsides even for smaller-sized types, but it's especially inopportune for fixed-length strings, whose length one cannot make assumptions about at compile time (it is uniform across the records of a DB column, but not across columns and schemata).

So, we need variants of at least some of those compression schemes which produce untyped bytes, and take the output length as a run-time parameter.
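For illustration only, such a variant's interface might look like the following hypothetical signature (nothing of the sort exists in the codebase yet):

// Hypothetical: an untyped decompression primitive, with the element
// size known only at run time (e.g. a fixed string length from a DB schema)
void decompress_on_device_untyped(
    void*       __restrict__  decompressed,
    const void* __restrict__  compressed,
    size_t                    element_size,   // in bytes, a run-time parameter
    size_t                    num_elements);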

Support changing dictionaries for the Dictionary compression scheme

It is often the case that, locally, data in a column has limited support, while globally the support is much larger. If we're very lucky, this is captured by a model function, in which case we can use a compression scheme like FOR, FORLIN etc. But in other cases we just have a jumble of values which change over time. Example: the postal codes of students at a school.

For these cases it is useful to extend the Dictionary scheme so that once in a while the Dictionary can be replaced. Thus instead of the following inputs:

name                 length   description
compressed_input     n        an index into the dictionary for each compressed element
dictionary_entries   m        an uncompressed value corresponding to each possible dictionary index
n                    1        number of compressed (and decompressed) elements
m                    1        number of dictionary entries

we'll have

name                           length     description
compressed_input               n          an index into the dictionary for each compressed element
dictionary_entries             (varies)   an uncompressed value corresponding to each possible dictionary index
dictionary_application_start   d          the first position in the input (and the output) at which each dictionary comes into effect (the next dictionary's start position is where the current one goes out of effect)
dictionary_lengths             d          the number of dictionary entries in each of the dictionaries
n                              1          number of compressed (and decompressed) elements
d                              1          number of dictionaries

This should be useful for data such as that of the USDT-Ontime benchmark.
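The selection logic would work as in this illustrative host-side sketch (on the GPU one would of course resolve the active dictionary per segment or block, not per element):

// Illustrative semantics of multi-dictionary decompression: each element
// is looked up in the dictionary in effect at its position.
size_type dict = 0;
const uncompressed_type* entries = dictionary_entries;
for (size_type i = 0; i < n; i++) {
    // advance to the next dictionary once its start position is reached
    while (dict + 1 < d && i >= dictionary_application_start[dict + 1]) {
        entries += dictionary_lengths[dict++];
    }
    decompressed[i] = entries[compressed_input[i]];
}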

Support segmentation and non-segmentation in more decompression kernels

(copied from Issue #163 in the kernel testbench)

At the moment, most of our decompressors can be used with segment anchors or without them - but not both:

Scheme   Segmented   Unsegmented
BITMAP   N/A         yes
DELTA    yes         no
DICT     no          yes
FOR      yes         no
MODEL    no          yes
NS       N/A         yes
NSV      yes         no
RPE      yes         no
RLE      yes         no

First, segmented execution is important even as a single option, so MODEL and DICT should definitely have it. Then, it would be nice to support, at least for the sake of completeness, the unsegmented versions of the other schemes - especially DELTA, for benchmarking purposes, and RPE, for cases where the overall support of the column is so small that segmentation is mostly a hassle.

Add an example program using the library

We currently have no executable binary using the library - neither as a test nor as an example. Now, testing we don't do in this repository - there's one on BitBucket for that; but we should definitely have some examples to illustrate how the library is used.

Add code for host-side compression and decompression

Even though this library is about doing work on the GPU, it wouldn't be a bad idea to have a utility which can compress and decompress data entirely on the host side, to facilitate playing around with the actual GPU-side code. At least initially, the code doesn't need to be fast, so it should not be that much of an effort to write.

Compression code can be lifted from my kernel testbench, simply taken out of the context of a test; decompression code I'll have to actually get down to writing. Of course, there will be a de/compression binary, with lots of command-line options, etc. etc. - and that will be most of the work.

Consider using dynamic parallelism in decompressing variable-run-length data

In our RLE decoding, we have to account for the same run of data possibly being very long, making us take as input an extra anchoring column with offsets inside the run.

If it were possible to branch out into multiple threads for very long runs, perhaps this could speed up the kernel; and it's conceivable this might be possible with dynamic parallelism.

... on the other hand, it could be that the overhead is just too high.

Use more uniform terminology for decompression schemes & kernels

(copied from here)

In the decompression code, we interchangeably use terms like "periods", "intervals" and "segments" to signify what is - extensionally if not intensionally - the same thing. The same goes for other terms. So - let's unify the terminology somewhat, to refer to "segments" and "anchors" (or "anchor values" etc.) throughout the code.

Consider optimizing IncidenceBitmaps dictionary accesses for very small dictionaries

Typically, the IncidenceBitmaps compression scheme would be used when there are very few distinct input values (beyond 32 distinct values there's no longer any storage-space benefit, although if one pre-filters and sends just a subset of the values, the scheme could be beneficial in other contexts). But even that is not very efficient, and one can expect this scheme to be used a lot with 2-, 3-, 4- or 5-value cases.

The case of 2 values is special, in that a single bitmap would be enough, so let's put it aside. But in the other cases of a small number of bitmaps, the dictionary is also very small - so small, in fact, that instead of looking it up in a shared-memory copy, one might be better served by a case statement over these values, or an if-else chain, which might incur predication but might still be faster than going to shared memory.

In these cases, and with the uncompressed data being not-too-large, it might be possible to fit the entire dictionary into each single thread's registers.
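As a sketch of the idea (not library code), a 4-entry dictionary held in per-thread values could be consulted with a select chain:

// Sketch: tiny-dictionary lookup via a select chain; d0..d3 are the
// dictionary's values, e.g. held in registers by every thread.
__device__ inline int lookup_tiny_dict(unsigned idx, int d0, int d1, int d2, int d3)
{
    // likely compiles to predicated selects rather than divergent branches
    return idx == 0 ? d0 : idx == 1 ? d1 : idx == 2 ? d2 : d3;
}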

Lackluster performance of the Incidence Bitmaps decompressor

We get the following results decompressing incidence bitmaps:

Both rows: hostname bricks02.scilens.private, timestamp 2017-02-14T12:41:31+01:00, db_name tpch-sf-10, table_name lineitem, test_adapter decompression::IncidenceBitmaps<unsigned int, unsigned char, true, 8u>.

column_name    length     compression_ratio   execution_time   bandwidth
l_returnflag   59986052   8/3                 1.493719 ms      40.158 GB/sec
l_linestatus   59986052   8/2                 1.47756 ms       40.598 GB/sec

while the GPU's theoretical memory bandwidth is 336 GB/sec, and with Model we've been able to get close to 300. It's true we should also count the bandwidth used for reading the compressed input data; but even if that's done with perfect efficiency (which I would think it isn't), it only gives us a factor of 11/8 or 5/4 (respectively) on top of the output bandwidth - writing the decompressed data plus reading input compressed by 8/3 or 8/2 means total traffic of 1 + 3/8 = 11/8 or 1 + 2/8 = 5/4 times the output size - which still leaves us at no more than about 1/7 or 1/8 of the maximum. I'm pretty sure this can be improved.
