The docs say that dictionary is helpful for small data since there is no past. For

Question: how does dictionary achieve superior compression for small data? (zstd issue, closed)

tomerr90 commented on April 27, 2024

Question: how does dictionary achieve superior compression for small data?

from zstd.

Comments (6)

Cyan4973 commented on April 27, 2024

If you were to concatenate all the same small files into a single blob, and then compress the resulting blob with zstd, what compression ratio would you obtain?

tomerr90 commented on April 27, 2024

I haven't set up a test case of my own; I'm asking based on what I saw in the GitHub README section "The Case For Small Data".

Cyan4973 commented on April 27, 2024

When using different inputs, one logically obtains different compression performance.

If you want to compare small data with large data, use the same data.
Either concatenate the small ones to create large data (preferable), or split the large data to create small data.
Otherwise, it's not comparable.

Compressing large data should always win, though by how much depends on the data (incompressible data remain incompressible). Dictionary will help to close the gap, but typically cannot overtake the large data scenario, especially when the cost of the dictionary (its size) is taken into consideration.
It's likely possible to create a contrarian scenario where the above statement is false, by taking advantage of imperfections in the compression process, since it's using fast imperfect heuristics. That would be an exception though, not the general expectation.
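
A minimal sketch of the "use the same data" comparison described above. It is not from the thread: it uses Python's standard-library `zlib` as a stand-in for zstd, and the JSON records are synthetic and hypothetical, so the exact numbers will differ from anything zstd reports. Only the direction of the gap is the point.

```python
import json
import zlib

# Hypothetical sample data: 200 small JSON records sharing one structure.
records = [
    json.dumps(
        {"id": i, "login": f"user{i}", "type": "User", "site_admin": False}
    ).encode()
    for i in range(200)
]

raw_size = sum(len(r) for r in records)

# Small-data scenario: every record is compressed on its own,
# so the compressor starts with no history each time.
separate_size = sum(len(zlib.compress(r, 9)) for r in records)

# Large-data scenario: the same records concatenated into one blob.
blob_size = len(zlib.compress(b"".join(records), 9))

print(f"raw bytes:             {raw_size}")
print(f"compressed separately: {separate_size} (ratio {raw_size / separate_size:.2f})")
print(f"compressed as a blob:  {blob_size} (ratio {raw_size / blob_size:.2f})")
```

Because both scenarios are fed exactly the same bytes, the two ratios are directly comparable, which is what the comparison needs.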

tomerr90 commented on April 27, 2024

Thank you for your response!
That's not quite what I'm asking, though. You are explaining a general scenario; I'm asking about the results that are published on Zstd's GitHub page.
It makes sense to me that the ratio for big files should be the upper limit. However, it seems that for small data it is able to achieve much more. Is that just because the input used happens to be very compressible?
Meaning, if I were to take all the samples in the scenario described and concatenate them, would I achieve the same ratio (~10) without the dictionary as well?

Cyan4973 commented on April 27, 2024

Yes,
dictionary compression works best on structured data featuring a lot of redundancy across messages, though very little within the message.
This is what the GitHub collection sample achieves: it's just a bunch of JSON records, with very similar structure.
If they were compressed all concatenated together, the compression ratio would be greater than 10.

Dictionary compression is for scenarios where one cannot concatenate these similar records together, for example because the records must be sent immediately and can't wait inside a batch queue.
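
A rough sketch of that trade-off, under stated assumptions. It is not from the thread: it uses the preset-dictionary feature of Python's standard-library `zlib` (`zdict`) as a stand-in for zstd dictionaries, the records are synthetic and hypothetical, and the "dictionary" is a plain concatenation of sample records rather than a trained one.

```python
import json
import zlib


def make_record(i):
    # Hypothetical record shape: small JSON messages with a shared structure.
    return json.dumps(
        {"id": i, "login": f"user{i}", "type": "User", "site_admin": False}
    ).encode()


# Samples used to build the dictionary are disjoint from the messages sent.
samples = [make_record(i) for i in range(20)]
messages = [make_record(i) for i in range(1000, 1200)]

# Assumption: raw concatenated samples stand in for a trained dictionary.
dictionary = b"".join(samples)


def compress_with_dict(data, zdict):
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()


def decompress_with_dict(data, zdict):
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(data) + d.flush()


raw_size = sum(len(m) for m in messages)
# Each message alone, no dictionary: little redundancy within a message.
plain_size = sum(len(zlib.compress(m, 9)) for m in messages)
# Each message alone, but primed with the shared dictionary.
dict_size = sum(len(compress_with_dict(m, dictionary)) for m in messages)
# All messages batched into one blob, no dictionary.
blob_size = len(zlib.compress(b"".join(messages), 9))

print(f"raw: {raw_size}  plain: {plain_size}  dict: {dict_size}  blob: {blob_size}")
```

Each message compressed with the dictionary can still be sent and decoded on its own, as long as the receiver holds the same dictionary bytes; that is the property a batch queue cannot offer.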

tomerr90 commented on April 27, 2024

I see, now I understand, thank you!
