burmanm / gorilla-tsc

Implementation of the time series compression method from Facebook's Gorilla paper

License: Apache License 2.0

Java 100.00%
timeseries time-series compression java

gorilla-tsc's Introduction

Time series compression library, based on Facebook's Gorilla paper


Introduction

This is a Java-based implementation of the compression methods described in the paper "Gorilla: A Fast, Scalable, In-Memory Time Series Database". For an explanation of how the compression methods work, read the excellent paper.

In comparison to the original paper, this implementation allows using both integer values (long) and floating point values (double), both 64 bits in length.

Versions 1.x and 2.x are not compatible with each other due to small differences in the stored array format. Version 2.x also supports reading and storing the older format; see usage for more details.

Usage

The included tests are a good source for examples.

Maven

    <dependency>
        <groupId>fi.iki.yak</groupId>
        <artifactId>compression-gorilla</artifactId>
    </dependency>

You can find the latest version on Maven Central.

Compressing

To compress in the older 1.x format, use the class Compressor. For 2.x, use GorillaCompressor (recommended). LongArrayOutput is also recommended over ByteBufferBitOutput for performance reasons. An alternative predictor can be supplied to the GorillaCompressor if required. One such implementation is included: DifferentialFCM, which provides a better compression ratio for some data patterns.

long now = LocalDateTime.now(ZoneOffset.UTC).truncatedTo(ChronoUnit.HOURS)
        .toInstant(ZoneOffset.UTC).toEpochMilli();

LongArrayOutput output = new LongArrayOutput();
GorillaCompressor c = new GorillaCompressor(now, output);

The compressor requires a block timestamp and an implementation of the BitOutput interface.

c.addValue(long, double);

Adds a new floating-point value to the time series. If you wish to store only long values, use c.addValue(long, long); however, do not mix the two in the same series.

After the block is ready, remember to call:

c.close();

which flushes the remaining data to the stream and writes the closing information.
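
Putting the pieces together, a complete compression round might look like the following. This is a minimal sketch based on the snippets above: the sample timestamps and values are made up, and the getLongArray() accessor on LongArrayOutput is an assumption to verify against the class itself.

long blockStart = LocalDateTime.now(ZoneOffset.UTC).truncatedTo(ChronoUnit.HOURS)
        .toInstant(ZoneOffset.UTC).toEpochMilli(); // block timestamp

LongArrayOutput output = new LongArrayOutput();
GorillaCompressor c = new GorillaCompressor(blockStart, output);

// Values must be added in increasing time order.
c.addValue(blockStart + 10, 1.0);
c.addValue(blockStart + 20, 1.5);
c.addValue(blockStart + 32, 1.5);

c.close(); // flush remaining bits and write the closing information

long[] compressed = output.getLongArray(); // assumed accessor for the compressed block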

Decompressing

To decompress from the older 1.x format, use the class Decompressor. For 2.x, use GorillaDecompressor (recommended). LongArrayInput is also recommended over ByteBufferBitInput for performance if the 2.x format was used to compress the time series. If the original compressor used a predictor other than LastValuePredictor, it must be given in the constructor.

LongArrayInput input = new LongArrayInput(byteBuffer);
GorillaDecompressor d = new GorillaDecompressor(input);

To decompress a stream of bytes, supply GorillaDecompressor with a suitable implementation of the BitInput interface. LongArrayInput can decompress from a long array or from an existing ByteBuffer representation with an 8-byte word length.

Pair pair = d.readPair();

Requesting the next pair with readPair() returns the next value in the series, or null once the series has been read completely. Pair is a simple placeholder object with getTimestamp() and getDoubleValue() or getLongValue(). A complete read loop is sketched below.
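
A minimal read loop for the block compressed above might look like this (a sketch; as noted, LongArrayInput also accepts a long array):

LongArrayInput input = new LongArrayInput(compressed);
GorillaDecompressor d = new GorillaDecompressor(input);

Pair pair;
while ((pair = d.readPair()) != null) { // null marks the end of the series
    System.out.println(pair.getTimestamp() + " -> " + pair.getDoubleValue());
}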

Performance

The following performance was reached in a Linux VM running on VMware Player on a Windows 8.1 host with an i7 2600K at 4 GHz. The benchmark used is the EncodingBenchmark. These results should not be directly compared to other implementations unless a similar dataset is used.

Results are in millions of datapoint (timestamp + value) pairs per second. The values in this benchmark are doubles (performance with longs is slightly higher, by around 2-3M/s).

Table 1. Compression

    GorillaCompressor (2.0.0): 83.5M/s (~1.34GB/s)
    Compressor (1.1.0):        31.2M/s (~499MB/s)

Table 2. Decompression

    GorillaDecompressor (2.0.0): 77.9M/s (~1.25GB/s)
    Decompressor (1.1.0):        51.4M/s (~822MB/s)

Most of the differences in compression/decompression speed between versions come from implementation changes rather than from the small changes to the output format.

Roadmap

There were a few things I wanted to get into 2.0.0 but had to decide against due to lack of time. I will implement these later, potentially with some breaking API changes:

  • Support timestamp-only compression (2.2.x)

  • Include ByteBufferLongOutput/ByteBufferLongInput in the package (2.2.x)

  • Move bit operations inside GorillaCompressor/GorillaDecompressor to allow easier usage with other allocators (2.2.x)

Internals

Differences to the original paper

  • The maximum number of leadingZeros is stored with 6 bits, allowing up to 63 leading zeros, which is necessary when storing long values. (>= 2.0.0)

  • Timestamp delta-of-deltas are stored by first turning them into positive integers with ZigZag encoding and then reducing them by one to fit in the necessary bits. In the decoding phase all values are incremented by one to recover the original value (see the sketch after this list). (>= 2.0.0)

  • The compressed blocks are created with a 27-bit delta header (unlike the original paper, which uses a 14-bit delta header). This allows block sizes of up to one day with millisecond precision. (>= 1.0.0)
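
The delta-of-delta transform from the second point can be sketched as below. This is illustrative code, not the library's internals, and it assumes (as in the original paper) that a zero delta-of-delta is stored separately with a single control bit, so only non-zero values take this path.

static long encodeDod(long dod) {
    long zigZagged = (dod << 1) ^ (dod >> 63); // ZigZag: signed -> non-negative
    return zigZagged - 1;                      // reduce by one before storing
}

static long decodeDod(long stored) {
    long zigZagged = stored + 1;                 // undo the "reduce by one"
    return (zigZagged >>> 1) ^ -(zigZagged & 1); // undo ZigZag
}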

Data structure

Values must be inserted in increasing time order; out-of-order insertions are not supported.

The included ByteBufferBitInput and ByteBufferBitOutput classes use big-endian byte order for the data.
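
For illustration, here is a sketch of serializing a compressed long array to bytes; big-endian is also ByteBuffer's default order, so no explicit order() call is needed:

static byte[] toBytes(long[] words) {
    ByteBuffer bb = ByteBuffer.allocate(words.length * Long.BYTES);
    for (long word : words) {
        bb.putLong(word); // written in big-endian order by default
    }
    return bb.array();
}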

Contributing

File an issue and/or send a pull request.

License

   Copyright 2016-2018 Michael Burman and/or other contributors.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

gorilla-tsc's People

Contributors

burmanm

gorilla-tsc's Issues

Why must values be inserted in the increasing time order

In the data structure section of the readme, it says "Values must be inserted in the increasing time order, out-of-order insertions are not supported". If I understand correctly, this requires the library user to insert data points in strict time order (no out-of-order data points). Can you elaborate on the reasoning behind this? There seems to be no such limitation in the original paper, and I did not see such logic in your code either.

Thanks!

Move benchmarks to test part

The benchmark parts still reside in the main/ section of the source code, while they should be in test/. This causes the JMH libraries and all their subdependencies to flow into implementing programs.

Support use of only timestamp or value compression

In some cases, using only timestamp compression or only value compression is preferred. One of these cases could be the use of high-precision timestamps, where this compression method is not at its best.

Make nextTimestamp() and nextValue() public and allow them to be used directly.

Don't write end stream

There should be support for not writing the end-of-stream marker, as this consumes several bytes per chunk. For chunks with a small number of datapoints this is a large overhead and completely avoidable. Perhaps make it configurable and, of course, backwards compatible with older chunks.

LongArrayOutput#flipByteWithoutExpandCheck runs into ArrayIndexOutOfBoundsException

private void checkAndFlipByte() {
    // Wish I could avoid this check in most cases...
    if(bitsLeft == 0) {
        flipByte();
    }
}

We only call flipByte() (and in turn expand the allocation) when bitsLeft is exactly 0, which might not always be the case. Also, when bits > bitsLeft, we call flipByteWithoutExpandCheck, and if position has already reached the long array size, we run into an ArrayIndexOutOfBoundsException.

Incorrect Maven dependency

The dependency group id is incorrect in the README file; here's the correct one.

<dependency>
	<groupId>fi.iki.yak</groupId>
	<artifactId>compression-gorilla</artifactId>
	<version>1.1.0</version>
</dependency>

Ability to use DFCM predictor

While the last value predictor used in the Gorilla library is often good for monitoring data, it's not a very good predictor for many other patterns. The predictor should be configurable, with the DFCM predictor implemented as one option (and the current predictor turned into LastValuePredictor).

Also, by allowing the predictor instance to be shared between blocks, it's possible to use larger prediction tables and learn from previously stored values (instead of always relearning after a new block is started). This of course requires rebuilding the predictor from multiple blocks (instead of keeping a single block independent), but in theory it could allow a better compression ratio. This is up to the user to decide; this library should not make such decisions.

Fix travis issues

Travis should not try to use all the maven build modules (such as gpg).

Reconsider ZigZag for storing longs

When storing long values, ZigZag encoding should be considered (as with timestamps). While this can be done manually outside the library, it could just as well be supported in the library.

This improves the compression ratio for series fluctuating between negative and positive values (such as -1, 1, -1, 1), as the sketch below shows.
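
A minimal sketch (not library code) of why this helps: the XOR-based value compression stores the significant bits of value XOR previousValue, and the two's-complement forms of -1 and 1 differ in almost every bit, while their ZigZag encodings do not.

static long zigZag(long v) {
    return (v << 1) ^ (v >> 63); // -1 -> 1, 1 -> 2
}

long rawXor = -1L ^ 1L;                // all but the lowest bit set
long zzXor = zigZag(-1L) ^ zigZag(1L); // 1 ^ 2 = 3

System.out.println(64 - Long.numberOfLeadingZeros(rawXor)); // 64 significant bits
System.out.println(64 - Long.numberOfLeadingZeros(zzXor));  // 2 significant bits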

Create copy constructors for LongArrayOutput and ByteBufferBitOutput

Provide a mechanism to get an exact copy of each BitOutput implementation.

If a user keeps a reference to the BitOutput implementation passed to a Compressor/GorillaCompressor, this will allow an exact copy to be made, finalized (as if Compressor.close() was called), and passed to a Decompressor/GorillaDecompressor without affecting the original Compressor/GorillaCompressor. A usage sketch follows below.
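
A usage sketch, with the copy constructor and its exact semantics hypothetical until the pull request lands:

// blockStart: block timestamp, as in the compression examples above
LongArrayOutput output = new LongArrayOutput();
GorillaCompressor c = new GorillaCompressor(blockStart, output);
c.addValue(blockStart + 10, 1.0);

// Hypothetical copy constructor: an exact snapshot of the current stream,
// finalized as if close() had been called, leaving the original untouched.
LongArrayOutput snapshot = new LongArrayOutput(output);
GorillaDecompressor d = new GorillaDecompressor(new LongArrayInput(snapshot.getLongArray()));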

(pull request to follow)

Enforce positive timestamps and maximal gaps

The timestamp value 0 is treated somewhat specially by the library: it is used as the initialiser for the field storedTimestamp in both compression and decompression.
This leads to a number of strange behaviours if some timestamp in the middle of a series is 0.

It might be a good idea to disallow all non-positive timestamps to protect users from those behaviours.

Also, the same could be applied for timestamps that are too far apart in time.

Still being maintained? And a few PRs if so

I notice this library is in use in a few OSS projects, but the issues here are quite old.

I had a few PRs to make, but seeing as there hasn't been activity here in quite a while, I was wondering @burmanm if you were interested in passing the reins for maintenance elsewhere.

PRs I wanted to open:

  • Support for 5-bit leading zeros instead of 6 (consistent with the original paper and used in some other libraries)
  • Explicit Float32 support with 1 less bit

Thanks for the effort to create this library it's been quite useful for us!
