Comments (4)
Is 100 MB of training data for a 100 KB dictionary strictly better than 10 MB of training data?
It's supposed to be better, though assuming good quality samples, the difference is expected to be relatively small.
The main issue at stake is the representativeness of the training set. A small training set is more likely to feature some "local" correlation, for example a similar timestamp or a similar region field, depending on how it was selected. Just by virtue of the law of large numbers, a "large" training set is "more likely" to be more general.
But really, the main issue is the representativeness of the corpus. And it's entirely possible for a small training set to be less biased than a large one.
A simple rule of thumb is that, if you have a very large collection to select from, ensure that the extracted training set is randomly selected. This should minimize local bias.
After that, assuming you are targeting a 100 KB dictionary size, the difference between a good-quality 10 MB training set and a 100 MB one is expected to be relatively small. That's because dictionaries are only effective if they contain relatively "common" strings. If they are common, they should already be present in the smaller 10 MB set. Hence, the 100 MB set is unlikely to expose new ones.
Unless, obviously, the smaller 10 MB set was biased.
If I train a dictionary with, say, 8-10 MB of data and see no meaningful improvement in compression ratio, is it plausible that increasing the size of the training set could produce a useful dictionary?
If a 100 KB dictionary created from a good-quality ~10 MB training set is unable to provide meaningful improvement to compression ratio, it's a strong hint that the scenario might not benefit from dictionary compression. Moving to a 100 MB training set is unlikely to radically change the picture, unless the initial training set was "bad" for whatever reason.
Or does that suggest that the format of the data is not conducive to dictionary compression in general?
There are many scenarios that don't benefit from dictionary compression.
First, the obvious ones: data that is not compressible (e.g. encrypted) or too compressible (e.g. mostly zeroes).
In both cases, dictionaries are unlikely to change the outcome meaningfully.
Then, highly numeric scenarios, for example series of 16-bit values with no reason for consecutive values to repeat exactly, are also a bad fit for dictionaries.
Another one: compressing "large" data with dictionaries is not going to provide meaningful savings (compared to no dictionary at all), because a large input already contains plenty of history for the compressor to reference.
So indeed, several classes of scenarios are not a good fit for dictionary compression.
In general, is the strategy to provide the largest number that you're comfortable storing/working with?
Yes, though there are diminishing returns, and the training process gets slower as the training set grows, so there are practical limits. Imho, 100 MB is a good upper limit for a 100 KB dictionary.
How was 100 KB selected as a "good" dictionary size recommendation?
This limit was vaguely determined through experimentation in the early days of Zstandard. Nowadays, I would tend to recommend even smaller dictionaries, like 16 KB for example, because they generally contain most of the benefits of the 100 KB one, in a more compact format.
Of course, this is just a "broad recommendation". Use cases vary. It really depends on your constraints associated with dictionary distribution.
For example, I've seen situations where "large" dictionaries of several MB were considered desirable because, when compressing the entire collection, they would still provide more compression benefit than their cost.
In other applications, I've seen dictionary sizes reduced to as low as 4 KB, on the grounds that a ton of them were in memory at any time, and memory occupation was a primary concern.
So, really, optimal dictionary size is a parameter dependent on your use case.
from zstd.
This is unbelievably helpful. Thank you so much for the detailed answer.
I continued to edit and add questions after the initial post, and the last one slipped through. Pasting it here and adding to it:
How does the zstd level mesh with the dictionary? The level is passed into the dictionary construction function. Why is that? I figure the provided level actually impacts construction, and thus says something about the efficacy of the resulting dictionary (and, by extension, construction time)?
In other words are zstd level and dictionary tightly coupled, or are they independent variables in terms of affecting the compression ratio?
Does it otherwise maintain all of the same implications of zstd level? (AKA: Compressing a sample using a dictionary trained with zstd level 10 will take significantly more cycles than zstd level 1)
When it comes to dictionary training, the compression level can be useful, but only provides a small optional advantage.
Selecting a compression level during training means that training will employ this level while building statistics. And statistics created while compressing at level 1 are different from level 9 or 19.
If one creates a dictionary with a compression level 10, it will still work if one compresses with it using level 1.
And the compression speed will still correspond to level 1.
It's just that the statistics will be slightly off, and therefore the compression ratio will be a bit less good.
To be fair, this impact is quite small, and shouldn't be considered critical.
You can still train a dictionary using, say, level 5, and then use that dictionary for any compression with any level from 1 to 19. It will work fine. Just a little less good than having a dictionary trained for each compression level.
Now, if your question relates to the compression level set while creating the ZSTD_CDict* object,
then yes, indeed, this compression level is now baked into the materialized dictionary.
And now, compressing with this dictionary means compressing using the level set into the dictionary.
That's because all search data structures have been built during the creation process, and are now immutable.
They are created for a specific compression strategy, which corresponds to the selected compression level.
If you nonetheless try to force the compression level to another value, the new value will be ignored.
This is documented in the code here: https://github.com/facebook/zstd/blob/v1.5.5/lib/zstd.h#L331
Thank you again for the prompt and detailed responses. This was very helpful, and I really appreciate it :)