Comments (2)
In terms of mental models, there are probably several ways to "see" this operation that can make sense; for reference, here is how I see it:
- Dictionaries represent a sort of "past". It's as if we had been compressing up to that point, and would resume from that point. Same thing for decompression.
- Choosing an "optimal" past for a myriad of (hopefully somewhat similar) inputs is an NP-hard problem. We have some rough approximation algorithms that try to get there, and they seem relatively effective, but there are probably other ways to do the same job, sometimes better, and certainly corner cases that would be better served by very specific hand-selected strategies. Overfitting is a thing for dictionaries, though.
- A CDict is a "digested" dictionary for compression; a DDict is the same for decompression. Both are developed from the same source, but into very different objects.
- The purpose of CDict and DDict is to be able to start compression (respectively decompression) without any extra initialization workload, because everything is already developed "ready to use" within these objects. This makes a large performance difference, especially at scale.
The digested decompression dictionary size is always equal to 27352 bytes. Is that to be expected?
When the content is not included (byRef mode), then yes, the DDict is expected to have a stable size.
Why it's ~27 KB I can't tell; I would need to go into the source code. It seems a bit higher than I remember, but not by that much.
The main purpose of the DDict, on top of referencing the dictionary content, is to develop the entropy tables. The FSE tables used for the Sequences should cost ~5 KB, while the Huffman tables could use anywhere between 2 KB and 16 KB, and probably always settle at 16 KB, because while that costs a bit more to build, it then runs faster at decoding time.
There are still a few KB to find to reach 27 KB; that's where scrubbing the source code would be useful.
Secondly, the digested compression dictionary size appears to be on average roughly half the size of the raw dictionary.
This likely depends on the compression level: I would expect the search structures, which account for the majority of the CDict's size, to differ depending on the compression level.
from zstd.
Thank you for the detailed explanation!