Comments (6)
If you were to concatenate all the same small files into a single blob, and then compress the resulting blob with zstd, what compression ratio would you obtain?
from zstd.
I haven't set up a test case of my own; I'm asking based on what I saw in the GitHub README, "The Case For Small Data".
When using different inputs, one logically obtains different compression performance.
If you want to compare small data with large data, use the same data.
Either concatenate the small ones to create large data (preferable), or split the large data to create small data.
Otherwise, it's not comparable.
Compressing large data should always win, though by how much depends on the data (incompressible data remains incompressible). A dictionary will help close the gap, but typically cannot overtake the large-data scenario, especially when the cost of the dictionary (its size) is taken into consideration.
It's likely possible to construct a contrarian scenario where the above statement is false, by taking advantage of imperfections in the compression process, since it uses fast, imperfect heuristics. That would be an exception though, not the general expectation.
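A rough way to try the "same data, two ways" comparison locally is sketched below. It is only an illustration under stated assumptions: the records are synthetic (hypothetical JSON, not the sample set from the README), and Python's standard-library `zlib` stands in for zstd, so the absolute ratios will differ from what zstd reports. Only the direction of the gap is the point.

```python
import json
import zlib

# Hypothetical records: very similar structure across records,
# little redundancy inside any single record.
records = [
    (json.dumps({
        "id": i,
        "login": f"user{i}",
        "url": f"https://example.com/users/{i}",
        "site_admin": False,
    }) + "\n").encode()
    for i in range(1000)
]

blob = b"".join(records)   # the "large data" built from the same small inputs
raw_size = len(blob)

# Scenario A: each small record compressed on its own.
separate_size = sum(len(zlib.compress(r, 9)) for r in records)

# Scenario B: the concatenated blob compressed once.
concatenated_size = len(zlib.compress(blob, 9))

ratio_separate = raw_size / separate_size
ratio_concatenated = raw_size / concatenated_size
print(f"separate:     {ratio_separate:.2f}")
print(f"concatenated: {ratio_concatenated:.2f}")
```

Because both scenarios use exactly the same bytes, the two ratios are directly comparable, which is the condition described above.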
Thank you for your response!
That's not quite what I'm asking, though. You are explaining the general scenario; I'm asking about the results published on Zstd's GitHub page.
It makes sense to me that the ratio for big files should be the upper limit. However, for small data it seems able to achieve much more. Is that just because the input used happens to be very compressible?
Meaning, if I took all the samples in the scenario described and concatenated them, would I achieve the same ratio (~10) without the dictionary as well?
Yes,
dictionary compression works best on structured data featuring a lot of redundancy across messages, though very little within the message.
This is what the github collection sample achieves: it's just a bunch of json records, with very similar structure.
If they were compressed all concatenated together, the compression ratio would be greater than 10.
Dictionary compression is for scenarios where one cannot concatenate these similar records together, for example because the records must be sent immediately and can't wait inside a batch queue.
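To make the "records must be sent one at a time" case concrete, here is a minimal sketch. It again uses Python's standard-library `zlib` as a stand-in, relying on its preset-dictionary parameter (`zdict`) instead of a trained zstd dictionary, and the records are hypothetical. The dictionary here is simply a handful of earlier records concatenated together, which is cruder than what zstd's dictionary training produces, so treat the numbers as indicative only.

```python
import json
import zlib

def make_record(i):
    # Hypothetical JSON record; every record shares the same structure.
    return json.dumps({
        "id": i,
        "login": f"user{i}",
        "url": f"https://example.com/users/{i}",
        "site_admin": False,
    }).encode()

# Assumption: a few earlier records are available to both sender and receiver.
dictionary = b"".join(make_record(i) for i in range(50))

# Messages that must be compressed and sent individually.
messages = [make_record(i) for i in range(1000, 1100)]

def compress_with_dict(data):
    c = zlib.compressobj(level=9, zdict=dictionary)
    return c.compress(data) + c.flush()

def decompress_with_dict(data):
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(data) + d.flush()

raw_size = sum(len(m) for m in messages)
plain_size = sum(len(zlib.compress(m, 9)) for m in messages)
dict_size = sum(len(compress_with_dict(m)) for m in messages)

print(f"no dictionary:   {raw_size / plain_size:.2f}")
print(f"with dictionary: {raw_size / dict_size:.2f}")
```

Each message is still compressed independently, so it can be sent immediately; the shared dictionary supplies the cross-message redundancy that a batch would otherwise provide.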
I see, now I understand, thank you!