Comments (16)
TLDR: This check is overly restrictive, and we can fix it easily.
I think this line is a holdover from when we were relying on dictionaries being valid. At this point, we ensure that compression never corrupts data, even when a dictionary is corrupted, invalid, or missing symbols in its statistics (as long as the same dictionary is used for compression & decompression). And we have fuzzers to verify this.
During dictionary loading, we validate the Huffman / FSE tables, and mark them as either `valid`, meaning they have non-zero probabilities for every possible symbol and can be reused for any source, or `check`, meaning that they have zero probability for some symbols, and can't be reused without histogramming the source.
For Huffman, you can see that logic here:
zstd/lib/compress/zstd_compress.c
Lines 4905 to 4911 in edb6e8f
We should be able to change the logic to: `if (!hasZeroWeights && maxSymbolValue == 255) bs->entropy.huf.repeatMode = HUF_repeat_valid;`
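That proposed rule can be sketched as a tiny helper (illustrative Go, not the actual C in zstd_compress.c; the type and function names here are made up):

```go
package main

import "fmt"

// repeatMode mirrors the idea behind zstd's HUF_repeat_valid / HUF_repeat_check
// (the real constants live in zstd's internal headers).
type repeatMode int

const (
	repeatCheck repeatMode = iota // table has zero-weight symbols: must verify input first
	repeatValid                   // table covers all 256 byte values: safe to reuse blindly
)

// classifyHuffTable implements the proposed rule: a dictionary Huffman table is
// unconditionally reusable only when no symbol has a zero weight AND the table
// spans the full byte range (maxSymbolValue == 255).
func classifyHuffTable(hasZeroWeights bool, maxSymbolValue int) repeatMode {
	if !hasZeroWeights && maxSymbolValue == 255 {
		return repeatValid
	}
	return repeatCheck
}

func main() {
	fmt.Println(classifyHuffTable(false, 255) == repeatValid) // true: full coverage
	fmt.Println(classifyHuffTable(false, 200) == repeatValid) // false: short table
}
```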
The 3 calls to FSE_dictNCountRepeat() in that function are where we validate the FSE tables:
zstd/lib/compress/zstd_compress.c
Line 4939 in edb6e8f
zstd/lib/compress/zstd_compress.c
Line 4953 in edb6e8f
zstd/lib/compress/zstd_compress.c
Lines 4963 to 4970 in edb6e8f
from zstd.
For both FSE & Huffman tables:
- For lower compression levels, or for very small data, if the table is marked as `valid`, we will always use the dictionary's (or previous block's) tables.
- For lower compression levels, if the table is marked as `check`, we won't reuse the tables, ever.
- For higher compression levels, if the table is marked as `check` or `valid`, we will compute whether it is better to reuse the dictionary's tables, write new tables, or use the default tables (for FSE).
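The policy above can be summarized in a few lines (a hypothetical Go sketch; the real logic is spread across zstd_compress.c and considers more factors, such as very small sources):

```go
package main

import "fmt"

// tableMode mirrors the valid/check classification described above.
type tableMode int

const (
	modeCheck tableMode = iota // zero probability for some symbols
	modeValid                  // non-zero probability for every symbol
)

// chooseTables sketches the reuse policy for dictionary entropy tables.
func chooseTables(lowLevel bool, mode tableMode) string {
	if lowLevel {
		if mode == modeValid {
			return "reuse dictionary (or previous block) tables"
		}
		return "never reuse: write new tables"
	}
	// Higher levels estimate the cost of every option and pick the cheapest.
	return "compare: reuse vs. new vs. default tables"
}

func main() {
	fmt.Println(chooseTables(true, modeValid))
	fmt.Println(chooseTables(true, modeCheck))
	fmt.Println(chooseTables(false, modeCheck))
}
```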
If you want your dictionary to play nice with lower compression levels and the reference encoder, you'd have to either:
- Ensure you have non-zero probabilities for all symbols
- Add a parameter to the compressor to allow overriding the default behavior, so lower levels can check whether it is better to re-use a dictionary or not. I can give you code pointers.
Also, @klauspost, please let us know about the results of your custom dictionary builder!
The baseline dictionary builder we generally use is ZDICT_optimizeTrainFromBuffer_cover()
with these parameters: `k=0`, `d=0`, `steps=256`, and `compressionLevel` set to the compression level you expect to use the dictionary with.
If your dictionary builder is competitive, or able to beat that, we'd love to try it out!
> The problem is not there: the problem is the necessity to validate that the input is compatible with the partial statistics. This verification necessarily costs cpu time. Depending on your use case, you might consider this cost as irrelevant or well worth it, or you might consider it too high for the benefit. This is a trade-off.

Sure. I see how you can avoid histogramming the input in these cases.

> Now, it doesn't mean that we can't support partial statistics.

It may be something internal, since I don't understand how it is fine to have a Huffman table where symbol 254 cannot be represented (weight=0), but symbol 255 must be - or at least why the table must produce 256 weight entries when decoded.

> Hence it can remain unnoticed for a long time. This seems a concern that should also be addressed.

I guess it is just a matter of implementation expectations. But I do see how being able to always use the literal dictionary compressor can be an advantage.
In my implementation, the dictionary is used the same as any tables from a previous block: 1) generate a histogram of the literals, 2) check if the previous table can be used, 3) check if a new table would be better, or even if the literals should be left uncompressed. I don't actually think this is too far from the logic in zstd.
Since I always (by default) check if a new table will be better, I am already generating the histogram, so the only extra cost is checking the histogram against the dictionary table.
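That extra cost - checking an already-computed histogram against the dictionary table - can be sketched like this (an illustrative Go helper, not an actual API from either implementation):

```go
package main

import "fmt"

// canReuseTable reports whether a dictionary entropy table with possibly
// partial statistics can encode this input: every symbol that actually occurs
// must have a non-zero weight in the table. hist is the 256-entry literal
// histogram the encoder has already computed; weights[i] == 0 marks a symbol
// the dictionary table cannot represent.
func canReuseTable(hist, weights *[256]uint32) bool {
	for s := 0; s < 256; s++ {
		if hist[s] > 0 && weights[s] == 0 {
			return false // input contains a symbol the table cannot encode
		}
	}
	return true
}

func main() {
	var hist, weights [256]uint32
	hist['a'], hist['b'] = 10, 3
	weights['a'], weights['b'] = 4, 2
	fmt.Println(canReuseTable(&hist, &weights)) // true: all used symbols covered
	hist['z'] = 1                               // a symbol absent from the table
	fmt.Println(canReuseTable(&hist, &weights)) // false: must build a new table
}
```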
Sidenote - maybe returning a more "correct" error than `zstd: error 11 : Allocation error : not enough memory` would be nice.
I am not asking you to change anything (other than maybe the error) - but I am trying to understand what is going on.
@terrelln Thanks - I will read through that. Much appreciated.
My current work is a different approach than yours. The backreference dictionary is built with a rather naïve chain of hashes that counts occurrences of 4-8 byte sequences and then, from the top list, tries to string them together into a set of longer strings until a hash exceeds a threshold - obviously emitting the most popular at the end.
The dictionaries beat zstd on "diverse input". For example, for a collection of Go source code or a collection of HTML files, it generates a better dictionary than zstd (even when compressing with zstd). For sparse, very similar content, like "github users", it creates a worse dictionary, since it splits it up too much. It currently does not produce "repeat-friendly" dicts, where there are several patterns separated by a few "literal" chars.
The 3 repeat codes are pretty bad ATM. Though it seems like zstd doesn't generate unique repeat codes, and it will maybe save 3-4 bytes at most. I will probably add some better scanning there.
FSE Tables are generated doing actual compression and normalizing the output. Huffman tables are generated using the literal output of the compression. Ran into a couple of cases where this didn't generate compressible tables, so I need a better secondary strategy.
It is still fairly experimental and can fail on certain input - for example if an FSE table is a single value (RLE).
@terrelln Just used regular `--fast-cover` to compare. But when compared against `zstd --train-fastcover="k=0,d=0,steps=256"` it is at least competitive:
HTML:
Default:
zstd: 1698 files compressed : 19.97% ( 95.4 MiB => 19.0 MiB)
kp: 1698 files compressed : 19.82% ( 95.4 MiB => 18.9 MiB)
Level 19:
zstd: 1698 files compressed : 16.73% ( 95.4 MiB => 16.0 MiB)
kp: 1698 files compressed : 16.80% ( 95.4 MiB => 16.0 MiB)
GO SOURCE:
Default:
zstd: 8912 files compressed : 21.86% ( 67.0 MiB => 14.6 MiB)
kp: 8912 files compressed : 22.17% ( 67.0 MiB => 14.8 MiB)
Level 19:
zstd: 8912 files compressed : 17.95% ( 67.0 MiB => 12.0 MiB)
kp: 8912 files compressed : 18.06% ( 67.0 MiB => 12.1 MiB)
Nothing groundbreaking, but I'll take my one win and keep tweaking :D
@terrelln Yes. That is pretty much what I refer to by "not repeat friendly". Especially considering how relatively often the lower levels check for repeats, that should also be a factor.
Also, zstd's ability to "splat" versions of similar data is missing. I will never emit a hash that has already been emitted as a substring of a higher-priority previous item. This is fine for "diverse" input, but not optimal for "sparse" input. So "github-users" only generates a small dictionary, whereas zstd can emit several "versions" to choose from.
This is a very old implementation limitation.
When `maxSymbolValue < 255`, that means that not every symbol (byte value) is present in the statistics. Consequently, if a missing symbol ends up being present in the input, and then in the literals section, it will not be possible to represent it. This can happen whenever the data to compress differs from the training set.
If encoding proceeded blindly, this would lead to corruption errors.
Initially, this choice was made for speed considerations: we want to proceed blindly, i.e. trust the table provided by the dictionary, so we must be sure it's going to be able to represent any data thrown at it. This results in major speed benefits.
Nowadays, we could use some stronger analysis, deciding to employ dictionary statistics selectively depending on their fit with local statistics. Such analysis would be able to detect that a dictionary statistic is incomplete and unable to represent a symbol to compress, and replace them with locally generated statistics. This is something we already do for inter-block correlation when streaming at higher compression levels.
However, for fastest levels, there is still a problem regarding time spent during analysis.
Hence, to remain compatible with speed objectives, we would be more likely to just always trust, or always distrust.
Generating new statistics embedded in the frame is of course detrimental to compression ratio and speed, on both the compression and decompression sides. And since we mostly use dictionary compression for high speed applications, this is a situation we want to avoid.
Hence, the reference dictionary builder always generates "complete" statistics.
(note: the same problem happens with statistics for other symbols, such as match length, literal length and offset).
Consequently, our reference decoder has only ever met this scenario, so we have so far not needed to deal with Zstandard dictionaries containing incomplete statistics (nor with mitigating their side effects).
You are right that incomplete statistics would likely compress better, since we would not waste some probability space for events that will never happen.
But the loss is also generally not that big, so this is just a (small) compression ratio optimization.
On the other hand, the processing required for the encoder to deal safely with incomplete statistics is unavoidable, and actually substantial for high-speed applications.
The only use case where I could imagine it making sense is in combination with high compression ratio modes, which are already very slow, and for which such an additional processing stage would probably be imperceptible.
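The size of that (small) ratio loss can be quantified: coding a source with smoothed, "complete" statistics costs the cross-entropy of the true distribution under the smoothed model, which is never below the true entropy. A toy sketch with illustrative numbers (not measurements from zstd):

```go
package main

import (
	"fmt"
	"math"
)

// bitsPerSymbol returns the average cost (bits/symbol) of coding a source
// distributed as p with a model q: the cross-entropy H(p, q).
func bitsPerSymbol(p, q []float64) float64 {
	var h float64
	for i, pi := range p {
		if pi > 0 {
			h -= pi * math.Log2(q[i])
		}
	}
	return h
}

func main() {
	// Toy source: only 4 of 8 symbols ever occur ("partial" statistics).
	p := []float64{0.4, 0.3, 0.2, 0.1, 0, 0, 0, 0}

	// "Complete" statistics: give every symbol some probability mass, as a
	// dictionary that must be able to represent any input has to.
	q := make([]float64, len(p))
	const eps = 0.01
	for i := range p {
		q[i] = p[i]*(1-float64(len(p))*eps) + eps // still sums to 1
	}

	fmt.Printf("partial:  %.3f bits/symbol\n", bitsPerSymbol(p, p))
	fmt.Printf("complete: %.3f bits/symbol\n", bitsPerSymbol(p, q))
}
```

By Gibbs' inequality the "complete" cost is always at least the "partial" cost; how much larger depends on how much probability mass the never-occurring symbols are given.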
Ah, so you are forcing it to use the table on lower levels.
But it only checks if 255 can be represented. Not if all others can. Is this a "missing" check, or is it safe to use?
I generate the Huffman table from the statistics of the actual literal bytes, and discard low (and obviously zero) probability outputs - with the assumption that it can switch to another if the table cannot represent it.
My experience from toying with deflate told me that adding zero-probability codes "just incase they were needed" was less efficient overall than just having these cases generate a fresh table. So I applied that learning here.
Do the FSE tables have the same limitations? I generate these in a similar fashion, which will also lead to "short" tables and some symbols having 0 probability.
I don't remember this part of the code base,
so it could very well be that something is incomplete or not done well enough.
But for information, this entry point is fuzzed 24/7. So I don't expect an adversarial dictionary to be able to crash the encoder or make it produce corrupted frames. There are additional checks in the pipeline, and cumulatively they seem to cover the situation well enough to survive heavy fuzzer tests, so no catastrophic scenario is expected. Worst case, the dictionary will be refused.
Nonetheless, some improvements might be possible, if only for clarity. So that can still be worth another look.
As mentioned in the previous message, sure, zeroing some non-probable symbols is likely going to improve compression ratio, if only by a little bit. The problem is not there: the problem is the necessity to validate that the input is compatible with the partial statistics. This verification necessarily costs cpu time. Depending on your use case, you might consider this cost as irrelevant or well worth it, or you might consider it too high for the benefit. This is a trade-off.
In our use cases, we lean towards better speed. And full statistics is a way towards better compression speed.
Now, it doesn't mean that we can't support partial statistics. It's just that there has been no such need so far, so the implementation doesn't support it.
Now let's assume the implementation is updated to support partial statistics in dictionary. A secondary issue is that checking input compatibility with partial statistics is going to cost additional cpu time, but since correctness is not affected, this is a form of silent degradation: nothing crashes, it just silently reduces speed. Hence it can remain unnoticed for a long time. This seems a concern that should also be addressed.
Agreed, the error message is not reflective of what's going on.
This must be improved.
> The 3 repeat codes are pretty bad ATM. Though it seems like zstd doesn't generate unique repeat codes and it will maybe save 3-4 bytes at most. I will probably add some better scanning there.
Yeah, we don't do anything with this currently. Maybe there is something to get here, but we haven't really found a use case for them.
However, we have found a really interesting pattern in our dictionaries w.r.t. repcodes. If I have the string:
`<common substr 1><random 4 bytes><common substr 2>`
And two dictionary snippets:
Dictionary A: `<common substr 1><common substr 2>`
Dictionary B: `<common substr 1>0000<common substr 2>`
Dictionary B will outperform Dictionary A, sometimes by a large margin, because the match finder will be able to use a repeat offset for `<common substr 2>`.
This is one of the advantages of the cover algorithm: It keeps segments together, even if there is an infrequent sub-segment within it.
Also CC @ot (the original author of the cover algorithm), you may find this discussion interesting.
@terrelln Yes, the algorithm was originally implemented for LZ4 use cases, so the relative position of substrings in the dictionary was irrelevant, the only optimization was to put the more frequent substrings at the end so they would be available for longer to the sliding window.
> Yes, the algorithm was originally implemented for LZ4 use cases, so the relative position of substrings in the dictionary was irrelevant
Yeah, I think we mostly got lucky, since it wasn't a design consideration going in. When the cover dictionary builder selected parameters, it often ended up selecting larger segment sizes than we expected, like a few hundred bytes.
We only fully realized that the larger segment size was important for utilizing repeat offsets in the dictionary content later, when we were removing that property and saw a regression.
I found the ZSTD_generateSequences() API very useful when debugging dictionary performance. It allowed me to build a heatmap of the dictionary by # of uses, and also to inspect things like which positions are referenced using repeat offsets.
If you wanted to be more repcode-aware, you could potentially concatenate all the sources, or random pairs of sources or something, then use ZSTD_generateSequences() to check for matches of the pattern `<match><literals><repeat 0>`. You could zero the literals, and treat that whole segment as one unit somehow.