Comments (2)

Cyan4973 commented on April 28, 2024

In terms of mental models, there are probably several ways to "see" this operation that can make sense;
for reference, here is how I see it:

  • Dictionaries represent a sort of "past". It's as if we had been compressing up to that point, and would resume from that point. Same thing for decompression.
  • Choosing an "optimal" past for a myriad of (hopefully somewhat similar) inputs is an NP-hard problem. We have some rough approximation algorithms that try to get there, and that seem relatively effective, but there are probably other ways to do the same job, sometimes better, and certainly corner cases that would be better served by some very specific hand-selected strategies. Overfitting is a thing for dictionaries, though.
  • A CDict is a "digested" dictionary for compression. A DDict is the same for decompression. Both are developed from the same source, but developed into very different objects.
  • The purpose of CDict and DDict is to be able to start compression (respectively decompression) without any extra initialization workload, because everything is already developed "ready to use" within these objects. This makes a large performance difference, especially at scale (see the sketch just after this list).
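
For illustration, here is a minimal sketch of that usage pattern with zstd's one-shot dictionary API: the CDict and DDict are digested once and then reused for every input. The dictionary buffer and sizes below are placeholders, not something taken from this issue.

```c
#include <stdio.h>
#include <string.h>
#include <zstd.h>

int main(void)
{
    /* Placeholder for a trained dictionary; in real code this would be
       loaded from a file produced by `zstd --train` or ZDICT_trainFromBuffer(). */
    static char dictBuf[100 * 1024];
    memset(dictBuf, 0, sizeof(dictBuf));

    /* Digest the dictionary once, up front. */
    ZSTD_CDict* const cdict = ZSTD_createCDict(dictBuf, sizeof(dictBuf), 3);
    ZSTD_DDict* const ddict = ZSTD_createDDict(dictBuf, sizeof(dictBuf));

    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    ZSTD_DCtx* const dctx = ZSTD_createDCtx();

    /* Reuse the same CDict/DDict for every (small, similar) input:
       no per-call dictionary loading or table building. */
    char const src[] = "one of many small, similar inputs";
    char dst[256], back[256];

    size_t const cSize = ZSTD_compress_usingCDict(cctx, dst, sizeof(dst),
                                                  src, sizeof(src), cdict);
    if (ZSTD_isError(cSize)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(cSize)); return 1; }

    size_t const dSize = ZSTD_decompress_usingDDict(dctx, back, sizeof(back),
                                                    dst, cSize, ddict);
    if (ZSTD_isError(dSize)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(dSize)); return 1; }

    printf("roundtrip: %zu -> %zu -> %zu bytes\n", sizeof(src), cSize, dSize);

    ZSTD_freeCCtx(cctx);   ZSTD_freeDCtx(dctx);
    ZSTD_freeCDict(cdict); ZSTD_freeDDict(ddict);
    return 0;
}
```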

the digested decompression dictionary size is always equal to 27352 bytes.
Is that to be expected?

When the content is not included (byRef mode), then yes, the DDict is expected to have a stable size.
Why it's ~27 KB I can't tell; I would need to go into the source code.
It seems a bit higher than I remember, but not by that much.

The main purpose of the DDict, on top of referencing the dictionary content, is to develop the entropy tables. The FSE tables used for the Sequences should cost ~5 KB, while the Huffman tables could use anywhere between 2 KB and 16 KB, and probably always settle on 16 KB, because while it costs a bit more to build, it then runs faster at decoding time.
There are still a few KB to account for to reach 27 KB; that's where scrubbing the source code would be useful.
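
For completeness, here is a small sketch (again with a placeholder buffer, not from this issue) that asks zstd how large the digested DDict actually is, with and without a private copy of the content. ZSTD_sizeof_DDict() reports the object's current memory usage; the _byReference constructor sits behind ZSTD_STATIC_LINKING_ONLY.

```c
#define ZSTD_STATIC_LINKING_ONLY   /* for ZSTD_createDDict_byReference() */
#include <stdio.h>
#include <string.h>
#include <zstd.h>

int main(void)
{
    static char dict[112640];   /* illustrative ~110 KB raw dictionary buffer */
    memset(dict, 0, sizeof(dict));

    /* Default (byCopy): the DDict keeps its own copy of the content,
       so its size grows with the dictionary. */
    ZSTD_DDict* const byCopy = ZSTD_createDDict(dict, sizeof(dict));

    /* byRef: only a reference to the caller's buffer is kept, so what remains
       is essentially the fixed-size decoding state (entropy tables, headers). */
    ZSTD_DDict* const byRef = ZSTD_createDDict_byReference(dict, sizeof(dict));

    printf("DDict byCopy: %zu bytes\n", ZSTD_sizeof_DDict(byCopy));
    printf("DDict byRef : %zu bytes\n", ZSTD_sizeof_DDict(byRef));

    ZSTD_freeDDict(byCopy);
    ZSTD_freeDDict(byRef);
    return 0;
}
```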

Secondly, the digested compression dictionary size appears to be on average roughly half the size of the raw dictionary.

This is likely dependent on the compression level.
I would expect the search structures, which account for the majority of the CDict size, to differ depending on the compression level.
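
A quick way to check that (same caveat: the dictionary buffer is a placeholder) is to digest the same dictionary at several compression levels and compare the resulting CDict sizes via ZSTD_sizeof_CDict():

```c
#include <stdio.h>
#include <string.h>
#include <zstd.h>

int main(void)
{
    static char dict[112640];   /* illustrative raw dictionary buffer */
    memset(dict, 0, sizeof(dict));

    int const levels[] = { 1, 3, 9, 19 };
    for (size_t i = 0; i < sizeof(levels) / sizeof(levels[0]); i++) {
        ZSTD_CDict* const cdict = ZSTD_createCDict(dict, sizeof(dict), levels[i]);
        /* The match-finding structures (hash/chain/binary-tree tables) built
           here depend on the level, so the digested size should vary with it. */
        printf("level %2d: CDict is %zu bytes\n", levels[i], ZSTD_sizeof_CDict(cdict));
        ZSTD_freeCDict(cdict);
    }
    return 0;
}
```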

Tristan617 commented on April 28, 2024

Thank you for the detailed explanation!
