Copied from #3610 (comment) (I accidentally commented in the wrong thread).
Here is our situation at scale: we do about 1 million compressions / decompressions per second to move data between netdata agents, using a few hundred compressors and decompressors. It is not one pipe, and it is not one stream. It is hundreds of streams, each sending thousands of very small messages (usually less than 100 bytes) per second that need to be transmitted on time (low latency).
We are using ZSTD in streaming mode, which we never reset, but we flush the output buffer of the compressor after every compressed message, since we need to send it asap.
The content itself has many variable components (values), but also a lot of fixed components (keywords). It is text.
We can take care of the transport required to move dictionaries or samples around, and also synchronize the compressors and decompressors to always use the same dictionaries (the versioning part you mentioned), so this is not a problem for us.
I don't know whether dictionaries would actually provide a benefit in our case. We do see a noticeable decrease in CPU consumption and an increase in compression ratio over time (e.g. after a day we gradually reach about 10% less CPU consumption), but I can't tell whether this is due to some change in the data, or to ZSTD learning to compress it more efficiently over time.
If dictionaries can provide a benefit in streaming mode, the compatibility of dictionaries across versions of ZSTD is crucial for us.
from zstd.
Different versions of ZSTD will still be compatible (compress with any version, decompress with any other version)?
Yes. The format is frozen in RFC8878.
share the samples, so that the compressor and the decompressor will train the dictionaries using the same samples.
Nope. That would imply that, given the same sample sets, the same dictionary is always generated. This is not guaranteed. It may turn out to be accidentally true across a limited set of versions and platforms used for testing, but that's not a safe guarantee to build upon.
train the dictionary on the compressor and share the binary dictionary with the decompressor.
Yes, that's the more common approach.
the binary dictionary needs to be compatible among different versions of the compressor and decompressor and even between different architectures (big/little endian).
Yes, the dictionary is also defined in RFC8878, so it's guaranteed to be interoperable.
Here is our situation at scale:
We are using ZSTD in streaming mode, which we never reset
It is hundreds of streams, each sending thousands of very small messages
OK, so you are currently using Streaming compression.
And you are considering Dictionary instead of, or on top of, Streaming.
Rule of thumb: except in specific corner cases, Streaming compression is expected to always compress more strongly than Dictionary compression.
Dictionary compression bridges the gap between independent frames and streaming.
So, if compression ratio is the only parameter that matters, then it will be difficult to beat Streaming compression with Dictionary. You could imagine a combination of Streaming + Dictionary, which should compress even better, but depending on the total size of each stream (the sum of all these little messages), the final benefit might be too small to be worth the complexity, since a dictionary is mostly effective during the first kilobytes of a stream.
There might be an additional saving in this scenario though, thanks to the header statistics computed and saved into the Dictionary: because individual messages are so small (~100 bytes) and constantly flushed, the stream may never get a chance to transmit more optimal header statistics on its own. So who knows, maybe the final savings could end up being significant in this case.
The downside of Streaming is the need to maintain a state, on both sides, for each stream.
If there are many streams (hundreds are mentioned), this can result in a significant memory budget.
If this memory budget is a problem, then dictionary compression (without Streaming) is a pretty good alternative.
It will probably not compress as well as streaming, but possibly close enough, while slashing the memory cost: now the system only needs as many states as there are active compression / decompression sessions in parallel, which is typically orders of magnitude lower (depending on the application).
Great! Thank you very much for answering this. We will run some experiments to see whether streaming with dictionaries improves compression ratio and memory usage.