Comments (4)
Is 100 MB of training data for a 100 KB dictionary strictly better than 10 MB of training data?
It's supposed to be better, though assuming good quality samples, the difference is expected to be relatively small.
The main issue at stake is the representativeness of the training set. A small training set is more likely to feature some "local" correlation, for example a similar timestamp or a similar region field, depending on how it was selected. Just by virtue of the law of large numbers, a "large" training set is "more likely" to be more general.
But really, the main issue is the representativeness of the corpus. And it's entirely possible for a small training set to be less biased than a large one.
A simple rule of thumb is that, if you have a very large collection to select from, ensure that the extracted training set is randomly selected. This should minimize local bias.
After that, assuming you are targeting a 100 KB dictionary size, the difference between a good-quality 10 MB training set and a 100 MB one is expected to be relatively small. That's because dictionaries are only effective if they contain relatively "common" strings. If they are common, they should already be present in the smaller 10 MB set. Hence, the 100 MB set is unlikely to expose new ones.
Unless, obviously, the smaller 10 MB set was biased.
If I train a dictionary with, say, 8-10 MB of data and see no meaningful improvement in compression ratio, is it plausible that increasing the size of the training set could produce a useful dictionary?
If a 100 KB dictionary created from a good-quality ~10 MB training set is unable to provide meaningful improvement to compression ratio, it's a strong hint that the scenario might not benefit from dictionary compression. Moving to a 100 MB training set is unlikely to radically change the picture, unless the initial training set was "bad" for whatever reason.
Or does that suggest that the format of the data is not conducive to dictionary compression in general?
There are many scenarios that don't benefit from dictionary compression.
First, the obvious ones: data that is not compressible (e.g. encrypted) or too compressible (e.g. mostly zeroes).
In both cases, dictionaries are unlikely to change the outcome meaningfully.
Then, highly numeric scenarios, for example series of 16-bit values with no reason for consecutive values to repeat exactly, are also a bad fit for dictionaries.
Another one: compressing "large" data with dictionaries is not going to provide meaningful savings (compared to no dictionary at all), because a large input already contains plenty of history for the compressor to reference.
So indeed, several classes of scenarios are not a good fit for dictionary compression.
In general, is the strategy to provide the largest number that you're comfortable storing/working with?
Yes, though there are diminishing returns, and the training process gets slower as the training set grows, so there are practical limits. Imho, 100 MB is a good upper limit for a 100 KB dictionary.
How was 100 KB selected as a "good" dictionary size recommendation?
This limit was vaguely determined through experimentation in the early days of Zstandard. Nowadays, I would tend to recommend even smaller dictionaries, like 16 KB for example, because they generally contain most of the benefits of the 100 KB one, in a more compact format.
Of course, this is just a "broad recommendation". Use cases vary. It really depends on your constraints associated with dictionary distribution.
For example, I've seen situations where "large" dictionaries of several MB were considered desirable because, when compressing the entire collection, they would still provide more compression benefit than their cost.
In other applications, I've seen dictionary sizes reduced to as low as 4 KB, on the grounds that a ton of them were in memory at any time, and memory occupation was a primary concern.
So, really, optimal dictionary size is a parameter dependent on your use case.
from zstd.
This is unbelievably helpful. Thank you so much for the detailed answer.
I continued to edit and add questions after the initial post, and the last one slipped through. Pasting it here and adding to it:
How does the zstd level mesh with the dictionary? The level is passed into the dictionary construction function. Why is that? I figure the provided level actually impacts construction, and thus says something about the efficacy of the resulting dictionary (and, by extension, construction time)?
In other words are zstd level and dictionary tightly coupled, or are they independent variables in terms of affecting the compression ratio?
Does it otherwise maintain all of the same implications of zstd level? (AKA: Compressing a sample using a dictionary trained with zstd level 10 will take significantly more cycles than zstd level 1)
When it comes to dictionary training, the compression level can be useful, but only provides a small optional advantage.
Selecting a compression level during training means that training will employ this level while building statistics. And statistics created while compressing at level 1 are different from level 9 or 19.
If one creates a dictionary with a compression level 10, it will still work if one compresses with it using level 1.
And the compression speed will still correspond to level 1.
It's just that the statistics will be slightly off, and therefore the compression ratio will be a bit less good.
To be fair, this impact is quite small, and shouldn't be considered critical.
You can still train a dictionary using, say, level 5, and then use that dictionary for any compression with any level from 1 to 19. It will work fine. Just a little less good than having a dictionary trained for each compression level.
Now, if your question relates to the compression level set while creating the ZSTD_CDict* object,
then yes, indeed, this compression level is now baked into the materialized dictionary.
And now, compressing with this dictionary means compressing using the level set into the dictionary.
That's because all search data structures have been built during the creation process, and are now immutable.
They are created for a specific compression strategy, which corresponds to the selected compression level.
If you nonetheless try to force the compression level to another value, the new value will be ignored.
This is documented in the code here: https://github.com/facebook/zstd/blob/v1.5.5/lib/zstd.h#L331
Thank you again for the prompt and detailed responses. This was very helpful, and I really appreciate it :)