Comments (3)

ktsaou commented on April 28, 2024

Copied from #3610 (comment) (I accidentally commented in the wrong thread).

Here is our situation at scale: we do about 1 million compressions / decompressions per second to move data between netdata agents, using a few hundred compressors and decompressors. It is not one pipe, it is not one stream. It is hundreds of streams, each sending thousands of very small messages (usually less than 100 bytes) per second, that need to be transmitted on time (low latency).

We are using ZSTD in streaming mode, which we never reset, but we flush the output buffer of the compressor after every compressed message, since we need to send it asap.
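The pattern described above (one long-lived stream, flushed after every message) can be sketched as follows. This is a minimal illustration, not the netdata code: it uses Python's standard `zlib` module as a stand-in because it ships with the interpreter, and the message contents are hypothetical. In the zstd C API the corresponding step would be a `ZSTD_compressStream2()` call with the `ZSTD_e_flush` directive.

```python
import zlib

def make_stream_pair():
    """Create one long-lived compressor/decompressor pair that is never reset."""
    return zlib.compressobj(), zlib.decompressobj()

def send(compressor, message: bytes) -> bytes:
    # Compress one message, then flush so every byte of it is on the wire now.
    # Z_SYNC_FLUSH keeps the compression history, so later messages can still
    # reference earlier ones.
    return compressor.compress(message) + compressor.flush(zlib.Z_SYNC_FLUSH)

def receive(decompressor, chunk: bytes) -> bytes:
    # After a sync flush the receiver can decode the whole message immediately.
    return decompressor.decompress(chunk)

# Hypothetical small text messages with fixed keywords and variable values.
messages = [
    b"SET cpu.user 12.5\n",
    b"SET cpu.system 3.1\n",
    b"SET cpu.user 12.7\n",
]

comp, decomp = make_stream_pair()
for m in messages:
    wire = send(comp, m)
    assert receive(decomp, wire) == m
    print(len(m), "bytes in,", len(wire), "bytes on the wire")
```

Each flush costs a few bytes of framing, which is part of why very small, constantly flushed messages are a hard case for any streaming compressor.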

The content itself has many variable components (values), but also a lot of fixed components (keywords). It is text.

We can take care of the transport required to move dictionaries or samples around, and also synchronize the compressors and decompressors to always use the same dictionaries (the versioning part you mentioned), so this is not a problem for us.

I don't know whether dictionaries would deliver their usual benefits in our case. We see CPU consumption decrease and compression ratio increase over time (e.g. after a day we see about 10% less CPU consumption, reached more or less gradually), but I can't tell whether this is due to some change in the data or to ZSTD learning to compress it more efficiently over time.

If dictionaries can provide a benefit in streaming mode, the compatibility of dictionaries across versions of ZSTD is crucial for us.

Cyan4973 commented on April 28, 2024

Different versions of ZSTD will still be compatible (compress with any version, decompress with any other version)?

Yes. The format is frozen in RFC8878.

share the samples, so that the compressor and the decompressor will train the dictionaries using the same samples.

Nope. This assumes that the same sample set will always generate the same dictionary, which is not guaranteed. It may turn out to be accidentally true when testing with a limited set of versions and platforms, but that's not a safe guarantee to build upon.

train the dictionary on the compressor and share the binary dictionary with the decompressor.

Yes, that's the more common approach.

the binary dictionary needs to be compatible among different versions of the compressor and decompressor and even between different architectures (big/little endian).

Yes, the dictionary is also defined in RFC8878, so it's guaranteed to be interoperable.

Here is our situation at scale:
We are using ZSTD in streaming mode, which we never reset
It is hundreds of streams, each sending thousands of very small messages

OK, so you are currently using Streaming compression.
And you are considering Dictionary instead of, or on top of, Streaming.

Rule of thumb: except in specific corner cases, Streaming compression is expected to always achieve a better compression ratio than Dictionary compression.
Dictionary compression bridges the gap between independent frames and streaming.

So, if compression ratio is the only parameter that matters, then it will be difficult to beat Streaming compression with Dictionary. You could imagine a combination of Streaming + Dictionary, which should compress even better, but depending on the total size of each stream (the sum of all these little messages), the final benefit might be too small to be worth the complexity, since a dictionary is mostly effective during the first kilobytes of the stream.
There might be an additional saving in this scenario though, thanks to the header statistics computed and saved into the Dictionary: individual messages are so small (~100 bytes) and so constantly flushed that they may never allow transmission of more optimal header statistics. So who knows, maybe the final savings could end up being significant in this case.
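To make the Streaming + Dictionary combination concrete, here is a hedged sketch that primes each stream with a shared dictionary and compares per-message wire sizes against a stream started cold. It again uses Python's standard `zlib` as a stand-in; `SHARED_DICT` and the messages are hypothetical. In zstd the corresponding calls would be `ZSTD_CCtx_loadDictionary()` and `ZSTD_DCtx_loadDictionary()`. A zlib dictionary holds raw content only, so this sketch cannot show the header-statistics saving discussed above; it only shows the effect on the first messages of a stream.

```python
import zlib

# Hypothetical shared dictionary: keywords expected at the start of every stream.
SHARED_DICT = b"SET system.cpu.system = SET system.cpu.user = "

def open_stream(zdict=None):
    """Return a (compressor, decompressor) pair, optionally primed with zdict."""
    if zdict is None:
        return zlib.compressobj(), zlib.decompressobj()
    # Both sides must be given exactly the same dictionary bytes.
    return zlib.compressobj(zdict=zdict), zlib.decompressobj(zdict=zdict)

def send(compressor, message):
    # One message per flush, as in the low-latency setup described earlier.
    return compressor.compress(message) + compressor.flush(zlib.Z_SYNC_FLUSH)

def wire_sizes(messages, zdict=None):
    """Compress messages on one stream and return the size of each flushed chunk."""
    comp, decomp = open_stream(zdict)
    sizes = []
    for m in messages:
        chunk = send(comp, m)
        assert decomp.decompress(chunk) == m  # round trip must be lossless
        sizes.append(len(chunk))
    return sizes

MESSAGES = [
    b"SET system.cpu.user = 12.5\n",
    b"SET system.cpu.system = 3.1\n",
    b"SET system.cpu.user = 12.7\n",
]

print("cold stream:  ", wire_sizes(MESSAGES))
print("primed stream:", wire_sizes(MESSAGES, SHARED_DICT))
```

The gap between the two runs is expected to be largest on the first message and to shrink as the stream builds its own history, which matches the point that a dictionary mostly helps during the first kilobytes.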

The downside of Streaming is the need to maintain a state on both sides, for each stream.
If there are many streams (hundreds are mentioned), this can result in a significant memory budget.
If this memory budget is a problem, then dictionary compression (without Streaming) is a pretty good alternative.
It will probably not compress as well as streaming, but it may be close enough while slashing the memory cost, because the system now only needs as many states as there are compression / decompression sessions active in parallel, which is typically orders of magnitude fewer (depending on the application).
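The memory trade-off above is simple arithmetic, sketched here with made-up numbers. The stream count, the number of parallel sessions, and especially the per-context size are assumptions chosen for illustration; the real per-context cost depends on compression level and window size and should be measured.

```python
MIB = 1024 * 1024

def contexts_needed(streams, parallel_sessions, streaming):
    """Streaming keeps one state per stream; dictionary-only compression
    needs one state per session that is active at the same time."""
    if streaming:
        return streams
    return min(streams, parallel_sessions)

def memory_budget(contexts, bytes_per_context):
    return contexts * bytes_per_context

# Hypothetical figures, for one side of the connection.
STREAMS = 300                 # "hundreds of streams"
PARALLEL_SESSIONS = 8         # e.g. one per worker thread (assumption)
BYTES_PER_CONTEXT = 1 * MIB   # placeholder, not a measured zstd value

streaming_bytes = memory_budget(
    contexts_needed(STREAMS, PARALLEL_SESSIONS, streaming=True), BYTES_PER_CONTEXT)
dictionary_bytes = memory_budget(
    contexts_needed(STREAMS, PARALLEL_SESSIONS, streaming=False), BYTES_PER_CONTEXT)

print(streaming_bytes // MIB, "MiB vs", dictionary_bytes // MIB, "MiB")  # 300 MiB vs 8 MiB
```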

ktsaou commented on April 28, 2024

Great! Thank you very much for answering this. We will do some experiments to see if streaming with dictionaries improves the situation in compression and memory.
