Copied from #3610 (comment) (I accidentally commented in the wrong thread).
Here is our situation at scale: we do about 1 million compressions / decompressions per second to move data between netdata agents, using a few hundred compressors and decompressors. It is not one pipe, and it is not one stream. It is hundreds of streams, each sending thousands of very small messages (usually less than 100 bytes) per second that need to be transmitted on time (low latency).
We are using ZSTD in streaming mode, which we never reset, but we flush the output buffer of the compressor after every compressed message, since we need to send it asap.
The content itself has many variable components (values), but also a lot of fixed components (keywords). It is text.
We can take care of the transport required to move dictionaries or samples around, and also synchronize the compressors and decompressors to always use the same dictionaries (the versioning part you mentioned), so this is not a problem for us.
I don't know whether dictionaries would actually provide a benefit in our case. We do see a noticeable decrease in CPU consumption and an increase in compression ratio over time (e.g. after a day we gradually reach about 10% less CPU consumption), but I can't tell whether this is due to some change in the data, or to ZSTD learning to compress it more efficiently over time.
If dictionaries can provide a benefit in streaming mode, the compatibility of dictionaries across versions of ZSTD is crucial for us.
from zstd.
Different versions of ZSTD will still be compatible (compress with any version, decompress with any other version)?
Yes. The format is frozen in RFC8878.
share the samples, so that the compressor and the decompressor will train the dictionaries using the same samples.
Nope. That would imply that, given the same sample sets, the same dictionary is always generated. This is not guaranteed. It may turn out to be accidentally true across a limited set of versions and platforms used for testing, but that's not a safe guarantee to build upon.
train the dictionary on the compressor and share the binary dictionary with the decompressor.
Yes, that's the more common approach.
the binary dictionary needs to be compatible among different versions of the compressor and decompressor and even between different architectures (big/little endian).
Yes, the dictionary is also defined in RFC8878, so it's guaranteed to be interoperable.
Here is our situation at scale:
We are using ZSTD in streaming mode, which we never reset
It is hundreds of streams, each sending thousands of very small messages
OK, so you are currently using Streaming compression.
And you are considering Dictionary instead of, or on top of, Streaming.
Rule of thumb: except in specific corner cases, Streaming compression is expected to always compress more strongly than Dictionary compression.
Dictionary compression bridges the gap between independent frames and streaming.
So, if compression ratio is the only parameter that matters, then it will be difficult to beat Streaming compression with Dictionary. You could imagine a combination of Streaming + Dictionary, which should compress even better, but depending on the total size of each stream (the sum of all these little messages), the final benefit might be too small to be worth the complexity, since a dictionary is mostly effective during the first kilobytes of a stream.
There might be an additional saving in this scenario though, thanks to the header statistics computed and saved into the Dictionary: because individual messages are so small (~100 bytes) and constantly flushed, the stream may never get a chance to transmit more optimal header statistics on its own. So who knows, maybe the final savings could end up being significant in this case.
The downside of Streaming is the need to maintain a state, on both sides, for each stream.
If there are many streams (hundreds are mentioned), this can result in a significant memory budget.
If this memory budget is a problem, then dictionary compression (without Streaming) is a pretty good alternative.
It will probably not compress as well as streaming, but possibly close enough, while slashing the memory cost: now the system only needs as many states as there are active compression / decompression sessions in parallel, which is typically orders of magnitude lower (depending on the application).
Great! Thank you very much for answering this. We will run some experiments to see whether streaming with dictionaries improves compression ratio and memory usage.