
Comments (86)

Cyan4973 commented on April 27, 2024

Thanks for detailed feedback @KrzysFR .

The goal would be:

  • For each new or modified document D, compress it AS IF we were compressing SAMPLES[cur_gen] + D.json, and only store the bits produced by the D.json part.
  • When reading document D, decompress it AS IF we had the complete compressed version of SAMPLES[D.gen] + D.compressed, and only keep the last decoded bits that make up D.

I believe the current dictionary API is compliant with this objective, in both direct and buffered mode.
Obviously, the selection and generation of SAMPLES[cur_gen] is completely outside of its scope, and is therefore managed directly by the calling program.

Your suggestion of a moving dictionary that changes gradually as compression seems to deteriorate looks good to me. Sure, there's a risk of instability, but that's part of the challenge.

Now, if your request is to get a tool similar to femtozip to build a dictionary, I'm afraid I currently have nothing to offer... yet. Although I have some ideas on how to proceed, such a tool remains a fairly complex object to create, which will require a lot of available time.
Considering all the other remaining tasks to get to zstd v1, it's unclear if this one gets top priority. I'll look for guidance.

In the meantime, things you could try:

  • A basic method which gives good results in spite of its simplicity is to sample the training data, taking a few stripes of 16-64 bytes randomly from it, and just sticking them together as a dictionary (see the sketch below). Strangely enough, it gives visible results.
  • An even simpler method: take a few random samples, and here is your dictionary.
  • Not sure what the output format of femtozip is, but if model.fzm is the raw dictionary content, ready for consumption by zlib, then it should work as-is for zstd too. Even if there is a bit of specific metadata, it should still work well enough. The important thing is that it should not be encoded in any specific way.
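
A minimal C sketch of that "random stripes" method, as a rough illustration (the function name and sizing choices are made up for this example, and it assumes the training buffer is much larger than 64 bytes):

#include <stdlib.h>
#include <string.h>

/* Illustrative only: concatenate random 16-64 byte stripes of training
   data into a dictionary buffer. Not part of any zstd API. */
size_t build_stripe_dict(const void* training, size_t trainingSize,
                         void* dict, size_t dictCapacity)
{
    const char* src = (const char*)training;
    char* dst = (char*)dict;
    size_t pos = 0;
    while (pos < dictCapacity) {
        size_t stripe = 16 + (size_t)(rand() % 49);          /* 16-64 bytes */
        if (stripe > dictCapacity - pos) stripe = dictCapacity - pos;
        size_t start = (size_t)rand() % (trainingSize - stripe);
        memcpy(dst + pos, src + start, stripe);
        pos += stripe;
    }
    return pos;   /* the dictionary size */
}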

Edit: I finally found the format described. There is some metadata at the beginning, but more importantly, there are some entropy tables encoded at the end which are not usable by anything other than fzip.
To avoid that, it's enough to pass --dictonly (as indicated by ./fzip -h) to get only the raw dictionary content.

As a side note, now that I've read this page, femtozip seems to do almost exactly what I had in mind for a dictionary builder. It's so close that I had a "meh" moment. I guess there are not that many ways to achieve the same objective.

Anyway, I'll have to think again about whether it's worth developing a new dictionary builder tool or rather improving this one. It already looks fairly good. The only major limitation is the size of the dictionary (< 64 KB), but it should prove enough for a large number of use cases.
The entropy part will also have to be entirely rebuilt for zstd.

Edit 2: well, according to this page, even the dictionary size can be selected (it's not mentioned in ./fzip -h). What else?

I don't know how having a dictionary would help with larger documents above 128KB or 256KB. Currently I'm only inserting the dictionary for the first block. Would I need to reuse the same dictionary for each 128KB block?

Currently, only the frame interface is exposed.
The block interface also exists internally, but it's not exposed yet. You probably have the skill and experience to find it and use it directly, so I'm now unsure at which level you are using zstd.

From a frame perspective, each frame is independent.
That means each frame must be re-initialized with a dictionary. The frame will internally cut the input into blocks of 128 KB if need be.

From a block perspective, each block can reference previous ones, up to the beginning of the frame.
So when you start a second block, it's able to look into the previous block AND into the dictionary beyond it.

The final limit is the window (search) size. It starts at 512 KB (zstd -1) but can grow up to 32 MB (for compatibility with 32-bit systems) and beyond (for 64-bit implementations only). This means that, in contrast with gzip, the dictionary can be almost unlimited in size.

What is the best size for this dictionary? 16KB? 64KB? 128KB?

In theory, the larger the dictionary size, the better the compression.
However :

  • There are some diminishing returns. Once you have caught the repetitive parts of a syntax, what remains are id elements that tend to have some common parts only by accident. It's less and less viable to use that as dictionary content.
    At which size is such a limit reached? Well, it depends... :) I would hazard that beyond a few tens of KB, most of the easy gains should already be there.
  • The larger the dictionary to load, the higher the memory consumption, and the longer it takes to initialize search structures.

The decompression side is much luckier here: it just needs to reference the location of the dictionary, without any extra work. A single dictionary is enough and can remain in place for multiple threads to work in parallel.

ZSTD_decompress_insertDictionary branches off into different implementations for lazy, greedy and so on. I'm not sure if all compression strategies can be used to produce such a dictionary?

Yes, all of them.

KrzysFR commented on April 27, 2024

Thanks for all these explanations.

Regarding femtozip, I used a .NET port by Ayende Rahien (https://github.com/ayende/Rhea.Compression) that may or may not differ from the actual C version, and only for R&D and prototyping. The dictionary builder was especially slow, but I remember that compression was decent and decompression very fast.

From a frame perspective, each frame is independent. That means each frame must be re-initialized with a dictionary. The frame will internally cut the input into blocks of 128 KB if need be.

I'm pretty sure documents would be compressed using a single frame, which should simplify things a lot (and most documents will be less than 128 KB in size). I expect that using a dictionary only matters when you are compressing very small things anyway.

This means that, in contrast with gzip, the dictionary can be almost unlimited in size.

That's something I didn't know. This can indeed simplify building the dictionary, because we would not have to make it fit into the window size, and there is no risk of having everything slide out of the window after the first hundred KB of compressed data.

Is there any point in trying to have very short offsets when compressing? If I had a dictionary of size 256 KB with the most frequently used token at the start of the dictionary, would this be less efficient than having that token at the end (closer to the data)?

The larger the dictionary to load, the higher the memory consumption, and the longer it takes to initialize search structures.

I've done some testing this afternoon, by creating a "hollow" JSON template from an existing dataset (merging every document into a single one and replacing values with null/empty).

Compressing each document with this initial dictionary is WAY slower than compressing without. By that I mean 2 milliseconds without versus 20 seconds with the dictionary, for one hundred 2 KB documents (×10,000!?).

I'm not exactly sure whether I'm doing something wrong or not. Regular compression is done by calling into ZSTD_compress, which is blazing fast, while with the dictionary I had to use the ZSTD_compressBegin/ZSTD_compressContinue combo, using a ZSTD_CCtx.

A quick profiling run does show that the bulk of the time is spent somewhere in msvcrt, probably in memory allocation/free: (profiler screenshot showing a wrong stacktrace removed)
(5.8 sec for 31 calls is an average of 190 ms per compression??)

The major difference I see is that ZSTD_compress has a context allocated on the stack, while ZSTD_createCCtx calls calloc (which must be in msvcrt120.dll). This is probably something silly, but I did not see it before because I've never used ZSTD_CCtx directly from .NET code.

Anyway, the compression ratios with this dictionary (about 2.2 KB of "hollow" JSON field names and structure) are:

  • RAW: 660 KB (sampling of 100 random documents)
  • zstd -1 of the whole batch: 51 KB (1:13, the target goal)
  • sum of zstd -1 of each individual document: 175 KB (1:3.7, that's already pretty good! gzip gave me 10% in earlier tests!)
  • sum of zstd -1 of each individual document + dictionary: 126 KB (1:5.2, better, but still 2.5× larger than batched).

This indicates to me that doing the extra work of building a better dictionary would still be a significant gain.

Cyan4973 commented on April 27, 2024

Is there any point in trying to have very short offsets when compressing?

Yes. Small distances are easier to describe than long ones, and tend to cost less.
So it's still useful to have the most common elements at the end of the dictionary, hence closer to the file to compress.

Compressing each document with this initial dictionary is WAY slower than compressing without.

Slower, surely, but not by as much as described.
So there must be a problem somewhere.
Let me investigate.

KrzysFR commented on April 27, 2024

Ahah, OK, I "fixed" the compression issue. Let's just say I forgot to add the compressionLevel parameter to the PInvoke signature of ZSTD_compressBegin, which would then be called with whatever was on the stack. I can understand why a compressionLevel of a hundred billion would take slightly longer.

Maybe by bad luck the value was between 0 and 20, but if not, it would mean that the compressionLevel is not properly validated by ZSTD_compressBegin?

Anyway, the compression with dictionary is now 140 KB instead of 126 KB, which is not much gain for a dictionary that should perfectly cover about a third of each document (2.2 KB dict for 6.6 KB docs).

Cyan4973 commented on April 27, 2024

Any compression level value > MAX is corrected into == MAX.
So it should be the same as requesting 20.

Which is basically enough to explain the vast difference in speed observed.

KrzysFR commented on April 27, 2024

I would have thought that passing a level of 0xDEADBEEF would return an error instead of silently consuming 20 seconds of CPU (versus the expected 2 ms)...

Also, since this had nothing to do with memory allocations, it means that I cannot trust my profiler at all regarding native call stacks :( oh well...

Cyan4973 commented on April 27, 2024

I would have thought that passing a level of 0xDEADBEEF would return an error instead of silently

Yes, it could; it's a matter of interface convention.

My initial reasoning was that the maximum level wasn't settled and could change from one version to another, so rather than forcing users to follow these variations, it looked simpler to default to the closest one.
By the same logic, a negative compression level is also translated into 1.

But of course, it could be argued that it should instead trigger an error.

If the scale of compression levels remains stable, then I guess the situation changes, and it sounds better to flag that as an error. Let's just wait a bit longer to ensure the scale is now really stable.

Cyan4973 commented on April 27, 2024

There is another side effect to consider:

Compression modes are selected by a range of options which can be managed directly through the _advanced() variants of the compression functions.
These options are pretty complex, so they have been simplified into a compression level, which is basically an index into a pre-calculated table.

However, there are several tables.
The reason is that the heuristics selected for large files absolutely suck on small data blocks.
So, instead, several sets of parameters were created, depending on input size.

Input size is a necessary ingredient within ZSTD_compress(), so it's automatically taken care of.
However, when using the streaming variant, required for shared dictionaries, nothing is provided by default. In such a case, zstd uses the "default" table, that is, the one for large files.

There is a work-around:
use the _advanced() variant, in association with ZSTD_getParams(), and provide the combined size of dictionary + data block as a source size hint. It should do a better job, selecting more efficient parameters, especially for small blocks.

KrzysFR commented on April 27, 2024

Looks like there is currently an issue with using ZSTD_getParams() through a DLL from a managed language: since ZSTD_parameters is a struct returned by value, I'm not exactly sure how I can define a valid method signature from the .NET side. It seems the struct is currently 32 bytes long, and the largest value type in .NET is 16 bytes. I could probably replicate the ZSTD_parameters C struct as an equivalent .NET struct, but I'm afraid of the C compiler optimizing the layout of the fields in some weird ways... I'm also not able to do sizeof(ZSTD_parameters), which could change at any time...

It would be easier if the method returned a ZSTD_parameters*, or took a pointer to user-allocated space and copied the struct there, but how to deal with the sizeof(..) issue? Looks like an API break :( I did not catch this issue before since I wasn't using the _advanced functions, and the ZSTD_CCtx/ZSTD_DCtx are not impacted by this.

There are some methods that already take a ZSTD_parameters* as argument. Is there a reason why ZSTD_getParams() returns a struct instead of a pointer? (A pointer could allow modifying the table directly?)

Cyan4973 commented on April 27, 2024

I'm afraid about the C compiler optimizing the layout of the fields in some weird ways...

It should not.
A C89/C99/C11 compiler cannot choose to "optimize" a structure layout; the layout is clearly described by the relevant standard, and it's strictly forbidden to play with it, as it is an essential part of the ABI.

I'm also not able to do sizeof(ZSTD_parameters) which could change at any time...

Sorry, I didn't pay attention to such a limitation. I guess it's related to using a DLL from .NET.
I suspect the only "good" way is to expose a create method.

Is there a reason why ZSTD_getParams() returns a struct instead of a pointer?

For convenience. Also because of old advice from John Carmack about making code easier to debug and maintain by borrowing some principles from functional programming.

But these objectives are secondary, and shouldn't prevent usage of the interface to begin with.

It would be easier if the method would return a ZSTD_parameters*

Yes, possibly.
Though, my initial suggestion was merely to feed the result directly to the init function, so that:

ZSTD_compressBegin(ctx, dst, dstSize, cLevel);

becomes

ZSTD_compressBegin_advanced(ctx, dst, dstSize, ZSTD_getParams(cLevel, dictSize+blockSize));

Not sure if that's possible from .NET, but it's a valid C construction.
In the above proposition, it's not necessary to access or allocate a ZSTD_parameters structure.
Or do you explicitly want to access the structure content?

(A pointer could allow modifying the table directly?)

This is a const memory region, so it can only be read.

There are some methods that already take a ZSTD_parameters* as argument.

Indeed, for example: void ZSTD_validateParams(ZSTD_parameters* params);

In this case, the objective is to modify params, in order to ensure it fits within authorized limits. Hence it was passed as a ptr.

Now, it's true that it's also possible to write it this way:
ZSTD_parameters ZSTD_validateParams(ZSTD_parameters params);
and it would look "more functional".
Maybe it should be changed to look this way, to keep the interface coherent.
It just felt more natural to pass a ptr in this case.

The other example is on the decompression side:
size_t ZSTD_getFrameParams(ZSTD_parameters* params, const void* src, size_t srcSize);

It was designed this way because both params and the function result are needed.
The function result can be an error code, so it must be size_t.
Hence there is no other solution than a ptr to provide params.

Looks like an API break :(

Note that anything within zstd_static.h should be considered breakable as long as we are in beta status (the v0.x series). After all, it's necessary to test and make mistakes in order to improve.

KrzysFR commented on April 27, 2024

A C89/C99/C11 compiler cannot choose to "optimize" a structure layout; the layout is clearly described by the relevant standard, and it's strictly forbidden to play with it, as it is an essential part of the ABI.

I guess I'm used to the JIT optimizing the layout at runtime in safe contexts (where you are not allowed to get offsets to members of types by default) in order to align pointers (required) and fill gaps. As an example: struct { int32 A; void* B; int32 C } can be laid out as { A, B, C } on 32 bits with no waste, but would be better laid out as { B, A, C } on 64 bits to remove gaps (since B must always be aligned on 64-bit addresses). Having the JIT automatically do that for you is nice... except when you have to respect the layout of a struct defined in another, non-managed API :) That's what LayoutKind.Explicit is made for.

I guess it's related to using a DLL from .NET.

I should stress that this is not specific to .NET but probably affects most type-safe JITed VMs (on Windows, probably?).

I guess we could say that when building a DLL, all the information in the .h gets erased, and only the pointers to the function entry points are kept. Hence I don't have access to #defines, structs, or static tables, only pointers to exported functions.

When you are consuming such a DLL from C/C++, you can include the original .h, and if you have the matching version, things should work OK.

But in .NET, for example, you cannot use a .h, so you have to provide your own version via PInvoke, where you substitute C types with equivalent .NET types.

That's why, in my wrapper, I have to use IntPtr to emulate pointers (since I don't have the original struct definitions and don't like void*) and UIntPtr to emulate size_t, because they are the only value types in .NET whose size changes automatically to 32 or 64 bits (at runtime, after JITing). All other types have a fixed size, and more complex struct-like types need special treatment by the VM, which must marshal them every time you cross the boundary. Hence it's faster to only rely on simple value types (which require no marshalling) or pointers (which with C libraries are most of the time opaque pointers, from the point of view of the app).

Edit: I forgot to say that most .NET assemblies are compiled for "AnyCPU", and are only specialized to 32 or 64 bits at runtime by the JIT. This means that when C# interop code is compiled, it must be platform neutral (can't use #if / #else) and rely on a loader to select the proper zstdlib_x##.dll variant. This is especially problematic for big-endian vs little-endian, but fortunately I haven't had to deal with that yet!

.NET supports "unsafe" structs with explicit layout, in order to interop with native libraries, so I think I would be able to make it work with the current code (at the cost of extra CPU for the marshalling of the struct between native and managed code, but this is only at init time).

Though, if you change the struct definition (to add or remove fields), then the wrapper code will break the stack at runtime. The .NET CLR should be able to catch this and throw an exception, but that's not pretty.

Cyan4973 commented on April 27, 2024

Though, if you change the struct definition (to add or remove fields), then the wrapper code will break the stack at runtime.

Indeed, static definitions are always a problem for DLLs.
I'll likely keep that part in zstd_static.h for this reason, but zstd.h will need a DLL-compatible equivalent, such as a create method.

When it's absolutely necessary to define structures in a .h, I tend to give them a bit of headroom, for potential future expansions (see the illustrative sketch below). This is the solution selected for LZ4F, and it has worked well so far, allowing several interface updates without breaking the ABI.
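
For illustration, such headroom could look like this (an illustrative struct only, not the actual LZ4F or zstd definition; field names echo the parameters seen elsewhere in this thread):

typedef struct {
    unsigned windowLog;
    unsigned contentLog;
    unsigned hashLog;
    unsigned searchLog;
    unsigned searchLength;
    unsigned strategy;
    unsigned long long srcSize;
    unsigned reserved[4];   /* must be 0 for now; future fields can go here without changing sizeof() */
} example_parameters;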

KrzysFR commented on April 27, 2024

So I was able to make ZSTD_getParams() work as intended, and I get the correct values in the fields. I also tweaked ZSTD_compress_advanced to be able to pass the dictionary buffer, so that I can reuse the existing compression code.

But now I'm getting access violations on a specific document (reproducible in both Release and Debug builds) somewhere inside ZSTD_compressBlock_fast_generic (mls=4) if I specify a dictionary, but it's OK with no dictionary. It works fine for a few dozen documents before this one. I'm in the process of installing the Windows SDK to be able to run it under windbg (oh joy!), because without it I can't tell much more.

Cyan4973 commented on April 27, 2024

But now I'm getting access violations on a specific document

Maybe there's a bug to fix.
If the document and dictionary can be shared, I can try to debug it on my side too.

KrzysFR commented on April 27, 2024

With windbg I get the access violation in BIT_addBits, because nbBits is equal to 0xCCCC (so it's an overflow in mask[], not in the source or destination).

This is because symbolTT in FSE_encodeSymbol (called by ZSTD_compressSequences) contains garbage. I also see that uninitialized variables on the stack have the value 0xCCCC... so I suspect that FSE_encodeSymbol is reading from an uninitialized part of the stack somewhere.

I've read that .NET clears newly allocated memory (including the stack?) with 0xCC bytes in Debug mode, which is consistent with what I see.

I would have to review the sample document to check that it does not contain sensitive data before sending it.

Cyan4973 commented on April 27, 2024

Sure!

Nice investigation;
understanding why FSE_encodeSymbol() is reading from uninitialized memory can become a fairly complex issue.

KrzysFR commented on April 27, 2024

And of course, running zstd.exe from the command line with the same input and dictionary does NOT fail, and gives me the expected result. I'm starting to think that this is not the document itself, but some combination of memory alignments and .NET PInvoke weirdness with passing a struct by value as argument.

Cyan4973 commented on April 27, 2024

Sounds reasonable.

Cyan4973 commented on April 27, 2024

For information, at some point during development of the streaming interface, the ZSTD_compressBegin() prototype looked like this:

size_t ZSTD_compressBegin(ZSTD_CCtx* cctx, void* dst, size_t maxDstSize, int compressionLevel, unsigned long long srcSizeHint);

Note the last parameter. Using srcSizeHint would provide information to make the algorithm select the appropriate table of parameters, and is basically what is achieved when doing: ZSTD_compressBegin_advanced(ctx, dst, dstSize, ZSTD_getParams(cLevel, dictSize+blockSize));

It was later removed in favor of ZSTD_compressBegin_advanced(), in order to reduce the number of functions exposed, hence the complexity of the interface.
So now, it's either simple (just a compression level) or advanced (complete control over all parameters).

While this looks good, the _advanced() version requires a structure to manipulate the numerous and potentially evolving parameters. However, I did not expect that a function with a struct return or argument type would cause so many problems when invoked from a different (managed) language.

Hopefully, it can be sorted out, but if not, or if too complex, maybe a different interface will be preferable.

KrzysFR commented on April 27, 2024

The way I see it, if you start adding features, like dictionaries and so on, you will need to add more and more arguments to the _advanced variant. Maybe, since the function takes in a ZSTD_CCtx*, we could simply call a bunch of ZSTD_context_setFoo(...) for each new feature (so right now the params and dictionaries) BEFORE calling ZSTD_compress_advanced; this would be forward compatible (see the sketch below). But right now the ctx is reset by ZSTD_begin..., which would erase all settings.
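
A purely hypothetical sketch of that setter-style flow (none of these ZSTD_context_set* functions exist in zstd; the names just illustrate the proposal):

ZSTD_CCtx* ctx = ZSTD_createCCtx();
ZSTD_context_setParams(ctx, &params);             /* hypothetical setter */
ZSTD_context_setDictionary(ctx, dict, dictSize);  /* hypothetical setter */
ZSTD_compress_advanced(ctx, dst, dstCapacity, src, srcSize);
ZSTD_freeCCtx(ctx);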

Regarding the access violation, I just realized that when invoked from the command line, the code goes through ZSTD_compressBlock_fast_extDict_generic, but when I call ZSTD_compress_advanced (changed to call ZSTD_compress_insertDictionary), the code goes through ZSTD_compressBlock_fast_generic instead, which calls a different set of functions.

My dictionary is allocated in a different buffer than the source, and even though they are probably close in memory (allocated in the same gen0 heap by the .NET GC), they should not be contiguous (I'm allocating buffers rounded up to the next power of 2).

KrzysFR commented on April 27, 2024

Changing the C code by adding fprintf everywhere does change the result, but now I'm seeing a buffer overflow inside ZSTD_compressBlock_fast_generic, in the main while (ip < ilimit) loop, with ip=0x00000001118D24EB and ilimit=0x00000001118D24EC. Meaning that ip ends up on the very last byte of the buffer, which will overflow in ZSTD_hashPtr or in MEM_read32(ip+1-offset_1) below.

I guess in this case the buffer is right on top of the stack and touches guard pages right after it, while in some other cases there is extra space after it (filled with 0xCCCC) and it continues with garbage input that breaks later functions (what I was seeing above).

I'm sorry I cannot give better details, because I really have no way of debugging managed and C code at the same time.

Cyan4973 commented on April 27, 2024

ZSTD_compress_advanced() is not compatible with streaming, nor with dictionaries.
It's only meant to compress a single memory segment, like its sibling ZSTD_compress().

The expectation for dictionary compression is to use the streaming interface only:
init by calling ZSTD_compressBegin_advanced(),
then ZSTD_compress_insertDictionary() to enable dictionary compression,
then loop on ZSTD_compressContinue() to compress content,
and finish with ZSTD_compressEnd().
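
A minimal C sketch of that sequence, using the v0.4-era prototypes quoted in this thread (error checks omitted; cctx, dst, dict and src are assumed to be set up by the caller):

ZSTD_parameters params = ZSTD_getParams(cLevel, dictSize + srcSize);
size_t pos = ZSTD_compressBegin_advanced(cctx, dst, dstCapacity, params);   /* writes the frame header */
ZSTD_compress_insertDictionary(cctx, dict, dictSize);
pos += ZSTD_compressContinue(cctx, (char*)dst + pos, dstCapacity - pos, src, srcSize);
pos += ZSTD_compressEnd(cctx, (char*)dst + pos, dstCapacity - pos);
/* pos is now the total compressed size */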

The way I see it, if you start adding features, like dictionaries and so on, you will need to add more and more arguments to the _advanced variant

Well, that's why ZSTD_parameters was created: to add new arguments without modifying function prototypes. But I really wasn't expecting that manipulating structures would be such an issue outside of C. After all, even zlib uses structures within its interface.

KrzysFR commented on April 27, 2024

I was doing that sequence of calls before, and it did work fine. I guess if ZSTD_compress_generic does not expect a dictionary, it could explain the weirdness I see. It seems to work most of the time, though.

I thought that calling a single C function that does everything would be quicker than calling multiple functions from .NET (which incurs a marshalling cost every time). If ZSTD_compress_advanced is not the correct method to call, then I could at least try to add a different function.

I guess we should see if this fixes my issue, before blaming the ZSTD_parameters struct.

Cyan4973 commented on April 27, 2024

OK, so I guess what you were expecting is:

  1. a kind of preparation function, to declare dictionary compression, generating a context which can be duplicated multiple times for the many documents to come

  2. a single function call to compress each document, like ZSTD_compress(), but for dictionary compression.

I suspect I can modify the interface to look this way.

KrzysFR commented on April 27, 2024

This is my current state of affairs after fprintfing ALL THE THINGS!

When called from zstd.exe (OK):
ZSTD_compressBlock_fast_extDict_generic(..., mls=4, lowLimit=0, dictLimit=2253)

When called from a custom method that calls ZSTD_compressBegin_advanced + ZSTD_compress_insertDictionary + ZSTD_compressContinue + ZSTD_compressEnd (FAIL with AVex):
ZSTD_compressBlock_fast_extDict_generic(..., mls=4, lowLimit=-32479, dictLimit=-27011)

I'm seeing NEGATIVE values for lowLimit and dictLimit in my case (-32479 and -27011). Is that possible if they are defined as U32? (fprintf shows them as negative because of %d, I guess.)

Looking at the code in ZSTD_compressContinue, the negative values come from the first if (src != zc->nextSrc), which sets lowLimit = 0 and dictLimit = -27011.

Logs: (1) is the start, (2) after the first IF, (3) after the second IF, (4) after the third IF, right before the call to ZSTD_compress_generic

> ZSTD_compressContinue(op=0x000000BC6C92ABF5 11520, src=0x000000BC62C00450 10916)
(1) ip=000000BC62C00450, base=000000BC62C00450, nextSrc=000000BC62BF9ACD, lowLimit=0, dictLimit=0
(2) ip=000000BC62C00450, base=000000BC62C06DD3, nextSrc=000000BC62BF9ACD, lowLimit=0, dictLimit=-27011
(3) lowLimit=0, dictLimit=-27011
(4) lowLimit=10916, dictLimit=-27011
...
ZSTD_compressBlock(..., lowlimit -32479 dictlmit -27011)

Cyan4973 commented on April 27, 2024

I'm seeing NEGATIVE values for lowLimit and dictLimit in my case (-32479 and -27011). Is that possible if they are defined as U32?

Nope. It's necessarily a bug.

the negative values come from the first if (src != zc->nextSrc)

This code detects that memory segments are not contiguous, which is normal since the first one is the dictionary, and the second is the input to compress.

In theory, after this if statement: dictLimit = (U32)(zc->nextSrc - zc->base) == sizeOfDictionary

Strangely enough, in your test, zc->base > zc->nextSrc, which is supposed to be impossible.

So maybe something is modifying zc->base (or zc->nextSrc) in an unexpected way.
It would happen before ZSTD_compressContinue(), hence probably within ZSTD_compress_insertDictionary().

Cyan4973 commented on April 27, 2024

Looking into ZSTD_compress_insertDictionary(), I see this line:

zc->base += ip - zc->nextSrc;

Both zc->base and zc->nextSrc are supposed to be == NULL at this stage.
So it should result in: zc->base = ip == src == startOfDictionary

Now, maybe zc->base and/or zc->nextSrc are not NULL...
It would not be expected, but it could explain the strange value of base observed later on, within ZSTD_compressContinue().

KrzysFR commented on April 27, 2024

I had to replace it with zc->base = (const BYTE*) (ip - zc->nextSrc); to fix a warning treated as an error by VS.

It does fix the issue with zc->lowLimit, which is now positive, but zc->dictLimit is still negative (-27011).

For reference, I have src ptr = 0x994FE62B80 (10916 bytes) and dict ptr = 0x994FE5B930 (2253 bytes) (parameter={W14 C14 H14 S1 L4 SZ10916})

raw log:

WithDictionary ctx=0x0000009958E2AC50 with dst=0x0000009959B6B0D0 11525, src=0x000000994FE62B80 10916, dict=0x000000994FE5B930 2253, prms={W14 C14 H14 S1 L4 SZ10916}
> Adding dictionary 0x000000994FE5B930 of size 2253
> ZSTD_compressContinue(op=0x0000009959B6B0D5 11520, src=0x000000994FE62B80 10916)
(1) ip=000000994FE62B80, base=000000994FE62B80, nextSrc=000000994FE5C1FD, lowLimit=0, dictLimit=0
(2) ip=000000994FE62B80, base=000000994FE69503, nextSrc=000000994FE5C1FD, lowLimit=0, dictLimit=-27011
(3) lowLimit=0, dictLimit=-27011
(4) lowLimit=10916, dictLimit=-27011
  > ZSTD_compress_generic ip 000000994FE62B80, op 0000009959B6B0D5, remaining 10916, maxDstSize 11520
ZSTD_compressBlock(0000009959B6B0D8 11517, 000000994FE62B80 10916, -32479 -27011)
  > ZSTD_compressBlock_fast_extDict 4
ZSTD_compressBlock_fast_extDict_generic(0000009959B6B0D8 11517, 000000994FE62B80 10916, seqtore 0000009958E2ACB0, mls 4, lowLimit=-32479, dictLimit=-27011)
> ip=000000994FE62B84, ilimit=000000994FE6561C h=13276, matchIndex=0, current=-27007, repIndex=-27010

KrzysFR commented on April 27, 2024

And here is my test function (with all the fprintf glory):

size_t ZSTD_compress_withDictionary(ZSTD_CCtx* ctx,
    void* dst, size_t maxDstSize,
    const void* src, size_t srcSize,
    const void* dict, size_t dictSize,
    ZSTD_parameters params)
{
    BYTE* const ostart = (BYTE*)dst;
    BYTE* op = ostart;
    size_t oSize;

    fprintf(stdout, "WithDictionary ctx=0x%p with dst=0x%p %llu, src=0x%p %llu, dict=0x%p %llu, prms={W%d C%d H%d S%d L%d SZ%llu}\r\n", ctx, dst, maxDstSize, src, srcSize, dict, dictSize, params.windowLog, params.contentLog, params.hashLog, params.searchLog, params.searchLength, params.srcSize);

    /* Header */
    oSize = ZSTD_compressBegin_advanced(ctx, dst, maxDstSize, params);
    if (ZSTD_isError(oSize)) return oSize;
    op += oSize;
    maxDstSize -= oSize;

    if (dict)
    {
        fprintf(stdout, "> Adding dictionary 0x%p of size %llu\r\n", dict, dictSize);
        oSize = ZSTD_compress_insertDictionary(ctx, dict, dictSize);
        if (ZSTD_isError(oSize)) return oSize;
    }

    /* body (compression) */
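    /* NOTE: the following assignment turns out to be the bug identified later in this thread */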
    ctx->base = (const BYTE*)src;
    fprintf(stdout, "> ZSTD_compressContinue(op=0x%p %llu, src=0x%p %llu)\r\n", op, maxDstSize, src, srcSize);
    oSize = ZSTD_compressContinue(ctx, op, maxDstSize, src, srcSize);
    if (ZSTD_isError(oSize)) return oSize;
    op += oSize;
    maxDstSize -= oSize;

    /* Close frame */
    fprintf(stdout, "> ZSTD_compressEnd(op=0x%p %llu)\r\n", op, maxDstSize);
    oSize = ZSTD_compressEnd(ctx, op, maxDstSize);
    if (ZSTD_isError(oSize)) return oSize;
    op += oSize;

    fprintf(stdout, "> done! %llu\r\n", op - ostart);

    return (op - ostart);
}

Cyan4973 commented on April 27, 2024

Ah, OK, now I understand.
I had not caught that you were building your own function in C, directly modifying internal state.

This line should be removed:
ctx->base = (const BYTE*)src;

This line was likely inherited from the original ZSTD_compress_advanced(), which is only meant to compress a single memory segment. In that specific case it makes sense, because it doesn't call ZSTD_compress_continue() but ZSTD_compress_generic() instead.

In a streaming scenario, base is handled within the streaming functions. It's a critical variable for tracking indexes and search limits, and it should not be manipulated directly.

Anyway, your sample is interesting, as it helps me understand what kind of interface you would like to use.

KrzysFR commented on April 27, 2024

This fixed it! Yes, I copy/pasted ZSTD_compress_advanced() because it looked like what I wanted. Sorry for the mix-up.

I'm now able to compress with a dictionary using a single method, and get the following results on a complete dataset:

  • RAW: 41.2 MB
  • zstd BATCH: 2.8 MB
  • zstd 1by1: 10.9 MB (in 235 ms)
  • zstd 1by1 with dict: 9.4 MB (in 204 ms)

Looking at the dataset, I see that there is a lot of stuff in the values that could be in the dictionary. I should also try the other simpler methods for generating a dummy dictionary to see what's best.

KrzysFR commented on April 27, 2024

Creating a dictionary by concatenating 5 random documents gives me a dictionary of 29 KiB and a compressed total of 8.2 MB (better than the 9.4 MB above). But if I take 10 documents I get 9.2 MB (only marginally better), and with 25 documents I get 9.5 MB (worse).

For the other technique of taking random stripes of 64 bytes from sample documents:

  • 32 KB dictionary: 8.9 MB
  • 64 KB dictionary: 8.8 MB
  • 128 KB dictionary: 8.7 MB
  • 256 KB dictionary: 9.0 MB

Taking larger stripes (128 bytes) gives 8.5 MB with a 128 KB dict.

All in all, the methods all seem to gravitate around the same compression ratio, which is not much better than no dictionary at all (25% better), but still ~3.5× worse than compressing the whole batch.

Cyan4973 commented on April 27, 2024

Could be worth trying femtozip with the --dictonly option...

Cyan4973 commented on April 27, 2024

Out of curiosity, what's the (average) size of each document?
From the sample provided, it looks as if they are ~5 KB each.

KrzysFR commented on April 27, 2024

The average document size is 6.6 KB, but it can vary from 2 KB to 20 KB.

OK, so fzip --dictonly on a sample of 100 docs gives me a 64 KB dictionary which, when used with zstd -1, gives a total size of 8.2 MB. Using fzip --compress with the same model gives a total size of 3.2 MB (against 2.8 MB for zstd -1 on the complete batch).

Seems like only having the dictionary is not enough for zstd, and I guess you'd need to capture more of the internal state of FSE/huff0 to achieve the same kind of results... That's a bummer :)

I tried using zstd -5 with the femtozip dictionary, and I get 7.6 MB (950 ms CPU time), and 7.4 MB with zstd -9.

Cyan4973 commented on April 27, 2024

Well, basically, yes.
That's, by the way, the last major feature I want to include in zstd before finalizing the format.

At the start of each compressed document, there are entropy table descriptors.
Block headers, basically.
The cost of these headers varies, but expect a few hundred bytes.

When using default block size of 128 KB, it doesn't matter much.
But for very small blocks, it does.
If it consumes, say, 200 bytes, that's about 20% of a 1 KB document (after compression).
There are also some "lost opportunities", since the algorithm will not even try compression as soon as it detects that header + compressed size is going to be worse than, or only marginally better than, the raw size.

So the idea is to create extremely compressed entropy descriptors, with diff maps from a model, which can be calculated while building the dictionary.

femtozip goes one step further by not generating any entropy table descriptor within the compressed document, using only the one generated with the dictionary. That's almost the same thing: it's a diff map full of zeros.

So yes, that part is the last one planned for zstd to reach v1 status.

KrzysFR commented on April 27, 2024

I've tried something else: compressing the dictionary as a first block, then compressing the data as a second block (so two calls to compressContinue), and then keeping only the last block. This gives me about the same total size as when calling _insertDictionary, which I guess is expected. You are probably not keeping any state from one block to the other, except for the window, which can see the previous block's source text.

I'm not going to mess with dictionaries just yet if this only gives me a 20-25% gain, and will probably have to patiently wait for v1 :)

Many thanks for taking the time to help me with this!

Cyan4973 commented on April 27, 2024

Thanks for pushing the limits @KrzysFR. It really helps to plan future directions.

KrzysFR commented on April 27, 2024

By the way, I think you can keep ZSTD_parameters as a struct; it seems to be working fine once I used the API correctly. It's currently 32 bytes on x64, which is still small enough, I'd say.

If the application treats it as an opaque struct, there is no issue with managed languages. The only problem is if you add or remove fields, changing the size of the struct; in that case, managed wrappers would need to update their code to stay in sync. If you don't change it often, or only with major versions, then it seems acceptable to me.

KrzysFR commented on April 27, 2024

Continuing with my dictionary tests, I've seen some weird things that may or may not be expected.

I have three tests:

  • typical JSON document (1 KB, with a mix of natural-text names, product SKUs, IP addresses, ...)
  • natural text (French song lyrics, 3 KB)
  • string of random digits (512 KB)

For each, I'm trying to compress with a naïve dictionary: a "hollow" JSON structure, a few generic French words from the song, and for the random digits the string "0123456789876543210" (expected to be ineffective).

I'm getting the following results:

  • JSON: raw = 1150, nodict = 702 (61%), dict = 483 (42%)
  • Song: raw = 3138, nodict = 1724 (55%), dict = 1858 (59%, worse!)
  • Digits: raw = 524293, nodict = 221212 (42%), dict = 221212 (42%)

Both the JSON and Digits tests look good. The digits still compress because they only use 10 symbols out of 256 possible. The dictionary does nothing for the digits, and compression is the same.

But I'm a bit surprised by the fact that the song lyrics actually compress worse with the naive dictionary.

I used my trusty ASCII art data visualization toolset to dump byte distributions for each compressed buffer, and I get the following result:

note: you may need to copy/paste this into a fixed-width text editor or something; best viewed as black text on a white background

## Generic JSON document
Raw    :  1,150 [                                $ @     :: :m~==ooo==;;==~m      =.==;+.:=~.=~=;=~+o=~-.--  +    m;=oM===o .o=ooo oooo..-=.. .     .                                     .                        ..                                                            ]
Zstd   :    702 [@oo=== =m=~~o=~=o=~==oo=o~~~o:: =~:==== =~o~ ===o=:=~==~=:=~=~=:~oM=o ===~=~~~ =o :~=~~=~= =~~:==M~=~:::====o:~=oo:===~ :=~~M=:=~o=:=::~m::o~ =~ :o:==~::~o=~=o=o:~:=~=~~= ~~ =: === m~==~==: = ==~ ::===o=:~:::~o~=~~=:~ ~:m=====:o=::o:=oo~=~ ===~~=:~= ~=== ~]
Zstdict:    483 [@m======m;==o;===o o;==;;; ;;= ======;=o==  =o=;=o===; o==;o==   m = =m= == =o=;o;o===;=; ;== ; =;o=  = =;;=;==; =o;;o;;=;;o    ==o;====o   ==o=;;==;;;;; o= =;=;=; ;= =   ;o;=;=o=o;=;;==;;;;;;oo==;;  =;===;;=o =;=;= =;=;====om; ;==  ;;  == ;;= =;== = ;;= ;]

## Song Lyrics
Raw    :  3,138 [          o  o                  @-;    =    =~=      .    -    : : +==.. -= =;~:=::=--;          m=oo$===m=~mommo=MMmm=-~=                                      ; .     ;o:   .     =    . .       o                                                            ]
Zstd   :  1,724 [@B+m=mom=~;om;o+=:m;==$=~m=+o=m+m=+=m+m:mm=Mom=o==m==o=:==o=====m+o=+mmo=~++mo=o+=oo;~~+moo=~+===oM+==o+:+=~++o:B~+mm~=+o==+~~; =m=mm+m=mm=m~==;++m~~++mm==o==;m:+=o;m~o+=o~;om;=+~=m=~o+=++;~+=moo++omm=o~:=m=~o+=o+oom=~++o~=++om=+;m;=-+;=~~:+;=;oo~;+~~=;= =]
Zstdict:  1,858 [MM:oooBMB~~oomB;M~o=;-==o+;~Mo:oo~o-oo+ooo=o=B++==M;M~=~B.m-~-=o+@+o;+;;;~:o+:=M;m~=~oo=+=oo~~~=-o=-+;~.+M:=;+~=oM-:M+=;=o~-o:o==o~M~o=~+o~=:o;:=:~~o=++-== +-=mM+~-;~M+==oB=~m==o=:o=~;=:+:o;=;Mom-mo=;~;=:-=:~o~-Moo+=-~=:~-=~m=o==--+o+B; -=-o++=+=.-=om;o;~.]

## Random string of digits...
Raw    :524,293 [                                                =@o=======                                                                                                                                                                                                      ]
Zstd   :221,212 [@#BB$mMm$MmoBoo=$$MMmoo=BMooo=o=$BBMM=M=m=o+m+o+BMMMm=o+m==~o+=~BoMo$oMoM=+~M==+mo=+o~+~o++~=~~~MmMoM=m=m=+~o++~oo++o~+;o+~~=~~;$MmoBomo#MooMo==MM+==+++MM===+~+moo==~=~o+~~+~~;mo===~~~=+~~+~~;BomoBoo=M=+~m=++o=+=+~~~o++~+;~;ooo==~=~=~~;+;~;o=~++;~;=+~~+;;;]
Zstdict:221,212 [@#BB$mMm$MmoBoo=$$MMmoo=BMooo=o=$BBMM=M=m=o+m+o+BMMMm=o+m==~o+=~BoMo$oMoM=+~M==+mo=+o~+~o++~=~~~MmMoM=m=m=+~o++~oo++o~+;o+~~=~~;$MmoBomo#MooMo==MM+==+++MM===+~+moo==~=~o+~~+~~;mo===~~~=+~~+~~;BomoBoo=M=+~m=++o=+=+~~~o++~+;~;ooo==~=~=~~;+;~;o=~++;~;=+~~+;;;]

How to read: each line has 256 characters representing the count of bytes with that value in the buffer, using the palette .-:;~+=omMB$#@ (not very precise, but helpful to spot patterns).

What I see immediately is that the compressed results for the JSON and song samples do not use all byte values and are clumped into multiple groups. The same effect can be seen for the random digits, where you see a sawtooth pattern with cycles 16 or 32 in length.

Is this normal? Maybe this is because the test documents are not long enough, and already compress well enough not to require the complete range of byte values? I would have expected that the random digits, once compressed, would have a close-to-even byte distribution, since the original text also has an even distribution, except for 1 and 2 for some reason (probably an artifact of the random generator used).

Cyan4973 commented on April 27, 2024

Lots of things to say about these interesting and fairly detailed tests.

Compression method

I suspect you compressed using -1, aka the "fast" compression method.
In that case, the selected parameters for small docs are these ones:

/*         W,  C,  H,  S,  L,  strat */
    {  0, 14, 14, 14,  1,  4, ZSTD_fast    },  /* level  1 */

which means it will be happy as soon as it finds a 4-byte match, and blindly accept it (since it's a fast method).
Good, but in some cases, it could be that the cost of offset + length is effectively larger than just 4 symbols directly compressed with Huffman. It could also be that the "small" match found at position P is not as good as the better one that would have been found at P+1.

Both these reasons can explain why "Song" sample ends up being worse with dictionary than without.

As a test, if you have direct control over ZSTD_parameters, you may want to increase searchLength to 5, in order to force matches to have a minimum decent length before being selected (see the sketch below). It could play favorably for this sample... and less well for others.
I'm not sure I have a good "overall solution" at this stage. Blind selection is a key characteristic of the fast algorithm; it's basically required for it to be fast. So the remaining possibility seems to be playing with heuristics.
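
For instance, a hedged sketch of that tweak through the _advanced() interface (v0.4-era API as discussed earlier in this thread; error checks omitted, and cctx/dst/sizes assumed to be set up by the caller):

ZSTD_parameters params = ZSTD_getParams(1, dictSize + srcSize);
params.searchLength = 5;        /* level 1 default is 4, per the table above */
ZSTD_validateParams(&params);   /* clamp into authorized limits */
ZSTD_compressBegin_advanced(cctx, dst, dstCapacity, params);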

Random digits

It's a specific distribution. Interesting, but pretty rare in "normal" documents.
In this case, the "match repeat sequence" part is basically useless.
So everything gets thrown at Huffman.

However, 10 evenly distributed characters don't make a clean power of 2.
So some characters will get 3 bits, others 4 bits (for 10 equiprobable symbols, Huffman assigns 3-bit codes to six of them and 4-bit codes to the other four, versus an ideal log2(10) ≈ 3.32 bits each).

Consequently, the combination of those 3-bit and 4-bit symbols will not be perfectly flat. 4-bit symbols appear more often than they should. Hence byte distributions will be affected, with some values being more common than others.

It's one of those rare cases for literals where FSE would do a noticeably better job. I suspect compressing "digits" with huff0 directly would produce the same layout you discovered, while using FSE would produce a much flatter byte distribution.

Now, for me the issue is that this sample is too specific; conclusions don't necessarily help to produce an algorithm with better characteristics for the more general case.

Small files

The JSON and Song samples are pretty small once compressed.
That means each byte has a visible impact on your statistics. Even with a noise source, it can be expected that some byte values are unused when the sample is < 500 bytes long.

It also means that headers have a non-negligible share of the compressed result.
I suspect that headers + Huffman-compressed literals dominate the distribution within the compressed file.

Huffman-compressed bytes may be subject to the same issue as described earlier. I suspect the impact should be small, although on small samples even a small impact can become visible.

Literals may also not be compressed at all. If their number is too small, it's not worth trying, due to the fixed cost of Huffman headers. In that case, their distribution should look like the original source.

Headers have a specific structure, which tries to be compact but also fast to encode/decode. The distribution of byte values within headers is not guaranteed to be perfectly flat. I wouldn't be surprised if some values were more common than others, but I haven't tested lately.

That being said, it's correct to say that a non-flat distribution is a hint that compression hasn't reached its upper limit. Quite clearly, yes.
But zstd is also about the trade-off between compression and speed, so it's not obvious where to draw the line between the two.

KrzysFR commented on April 27, 2024

You are correct: changing the search length to 5 does fix the compression for the Song test... but now compression of the random digits is worse (240 KB at -6 versus 220 KB at -1).

I used compression level 6 (because it has search length 5 both for < 16 KB and for large blocks):

## Generic JSON document
Raw    :  1,150 [                                $ @     :: :m~==ooo==;;==~m      =.==;+.:=~.=~=;=~+o=~-.--  +    m;=oM===o .o=ooo oooo..-=.. .     .                                     .                        ..                                                            ]
Zstd   :    691 [@o~=m~o: =~o~oo=m~~:o~==~=~~o~m:==~:m o~~oo~==~:=~om:m=== =~:~=:M=o==o=~:ooo~:~=oo=o~:o=~ =o::=~om=~=~o=:mm=m=o =~:===ooo~=~ =~=:o~~~:o~o= = ~=~o= o~=:::~m:~= o=o:::=oo:o~~ o=~o:o:oo:~:=:::o=o~ooom~=~~::= o~m:=o~:=:~==~~~: o=: o~ = o=o~~:~ :o~~::~ ::o~==~:]
Zstdict:    442 [@o=moo;=o=;=om=o===;m=;;oo=; o; moo;m=;;;;;=oo=o m==;=o= = m;;;= =o; ;=;om o;o= =o  ;o;;o;; =;  ===;;o= ;;;===;o;;=o;=o  = ;m;  o = ;;oo==;==;mo=;==o= ;;=;;==o=;o; ;;o o;= ; =moo=  = ;=o; ;=ooo==; ;=;; ;  ;; o  ;;o;= ==;=;; o; =; ;  ;===o;    = ;==;;;  o==]

## Song Lyrics
Raw    :  3,138 [          o  o                  @-;    =    =~=      .    -    : : +==.. -= =;~:=::=--;          m=oo$===m=~mommo=MMmm=-~=                                      ; .     ;o:   .     =    . .       o                                                            ]
Zstd   :  1,723 [$m$o+oo@Bo=o=~M=o~m+mmo+M~=~M;+o==:~oM+;o=m;-+;M=m+o=mm+=++;~==;mooM=:::oo~oB;+++m~o-=;.BMm:~~~o+moo:;o-o~m=+.-:o=;++:::+o+;o~=oooo;m~===;mM+:o+o:oo;+mo=~:o+=;~=B+=++o:=m;=B;:o~;;oMMo--o.:+m~;m+=+~M+:o=$oomoo;:+:o=o~;~;:+.;~o:o;~++~:;::~::+oo:=oo;;===.~oo-]
Zstdict:  1,701 [$B=M=~o~oMmo==;+m~m=oo=mooo~M~==moo=@o+:=o=+===o=m~mmm+;=B;;=~=om==o=~-:M==M=:-;o++=;m$=o=;=+o=;m+moom-Bo+o==:;o:;m==;;:=+mo~=++B~m~o+~=+=oo;~=+B+M~++~+=~===:m==o+++=+:=~===o;==o=o=o~=~=+:m=====~+ommom+ooo=+==~++=o==;:+:+~;:=;~o=~o:=o=o;~-o;m;=+M++m~~;o=~=]

## Random string of digits... (note: biased for 1 and 2!)
Raw    :524,288 [                                                =@o=======                                                                                                                                                                                                      ]
Zstd   :240,668 [@BMoMoomMmm=m==oMMmmo=momm==ooo=MmMoomm=mo==ooo+mmooo==+====o===MmMoMoo=mom=o==+o====+==o=======Mo==o===o==+===+o======+=====+=+MMmoMo==Mm==m===mo==o===m==+=+=+m===o===m==+=+=+=======+===+=+++Mom=ooo=o====+==o====+=+=+++=+=+o====+=+o==+===+o=+==+++==++++++]
Zstdict:240,670 [@BMmBmM=MmooooooMMmmo=mommo==oo=BMmmoom=moo==o==Mm==o===o===o==+$mBomooom=o=m=o=mo==o===o==+==++Mmmo===+oo=======o===+=+o+===+++BMmoB===mm======moo=o===oo=+o===m=o=m==+o==+===+o==========+=+++Moo=m===o===o==+o===o+==o+=+=+=+m======+===+==++o=+==+++==+==+++]

Also, the way the random digit string is constructed introduces a bias which should obey Benford's law: I'm appending random numbers (1, 123, 123456) to a string buffer, and not random characters, so 1 is more frequent than 2, than 3, etc... This is "visible" in the =@o======= distribution, where digit 1 is marked with @ (max), 2 with o (75th percentile) and the rest with = (median). This bias in the distribution is probably still visible in the result.

Random digits are of course not very common... except when you use 128-bit random UUIDs as keys for documents, in which case you have strings of 36 almost-random digits everywhere in your JSON or XML documents (e.g., "fb20205a-d823-4068-baf2-91d5cd5ad54d", "4dce2acc-660f-4064-9a7c-4c40d53c352c", "ab911aa0-0682-41e3-a7b6-d9ccb56fa591"). Each source byte has 4 bits of entropy, so these 36 chars are worth the original 128 bits, but those 128 bits are not really compressible.

My experience with JSON is that the bulk of the really unique data is GUIDs, high-precision real numbers (64-bit doubles like 3.141592653589793238) and timestamps with very high precision (2015-12-16T19:29:23.8649501Z), though timestamps usually start with a common prefix like "2015-12-..." or even "2015-12-16T19:..." if you are creating a lot of documents per hour.

It looks like the fields of ZSTD_parameters really impact (negatively or positively) the compression ratios, but each parameter is not exactly tied to the compressionLevel alone (currently there are 4 tables that, for the same "level", give different search lengths). This means that I cannot use a constant compressionLevel if I want very specific values for search length but don't care about the rest.

It could be useful to have a function that would probe various parameters and return the optimum set, given a specific sample document and a time budget. It would be nice to have when designing the compression stage, especially for "small" documents (where zstd would be used with data stores, or small REST HTTP queries with JSON-like payloads).

Cyan4973 commented on April 27, 2024

Random digits are of course not very common... except when you use 128-bit random UUIDs as keys for documents,

Very good point

This means that I cannot use a constant compressionLevel if I want very specific values for search length but don't care about the rest.

Indeed. compressionLevel should be considered a "default" set of parameters, trying to match a relative speed budget. But if you want full control, you need to use _advanced() variants, and set parameters as you want them.

It could be usefull to have a function that would probe various parameters and return the optimum set, given a specific sample document and time budget.

This function almost exists: a program called paramgrill tries to find the best possible parameter combinations for each compression level. That being said, it currently works on the full compression level range (from 1 to 20). A leaner version, which only targets one specific speed, could probably be derived from it.

Cyan4973 commented on April 27, 2024

There is a new version of zstd which is likely to help this use case:

zstd_static.h exposes new prototypes for dictionary compression: ZSTD_compress_usingDict() and ZSTD_decompress_usingDict(). It's just making use of dictionary content, no entropy transport yet. But one has to start somewhere...

Pay attention that, as a consequence, the scope of ZSTD_compress_advanced() has been extended and its prototype changed, hence it could break your link stage if you were using it previously.

Cyan4973 commented on April 27, 2024

There's a new branch of zstd which is intended to help this use case.
This branch is planned to become the foundation of v0.5.

The most visible addition is a dictionary builder tool, in the directory dictBuilder. It's a stand-alone command line tool, which receives a list of samples (generally through command line expansion using the * wildcard character) and generates a dictionary of selectable maximum size.

The process is as follows:

  • Create a dictionary file using dictBuilder:
    ./dictBuilder path/to/samples/* -o dictFileName
  • Compress using the command line zstd:
    ./zstd fileName -D dictFileName
  • Compress using the API: ZSTD_compress_usingDict() (see the sketch below)
  • Decompress using the API: ZSTD_decompress_usingDict()
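
A hedged sketch of the one-shot calls, assuming the new prototypes mirror ZSTD_compress()/ZSTD_decompress() with extra dict arguments (the exact signatures may differ in the branch):

size_t cSize = ZSTD_compress_usingDict(cctx, dst, dstCapacity,
                                       src, srcSize,
                                       dict, dictSize, 1 /* compression level */);
size_t rSize = ZSTD_decompress_usingDict(dctx, out, outCapacity,
                                         dst, cSize,
                                         dict, dictSize);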

If you have a lot of messages / blocks to compress using the same dictionary, it can be beneficial to load it only once, and then re-use it for all blocks.

For this, you need 2 contexts: one to load a reference, the other to "run" the compression/decompression operation:

ZSTD_CCtx* referenceCCtx;
ZSTD_CCtx* workCCtx;

ZSTD_compressBegin_usingDict (referenceCCtx, dictBuffer, dictSize);   /* load only once */

/* loop on each block */
ZSTD_copyCCtx(workCCtx, referenceCCtx);
ZSTD_compressContinue(workCCtx, ...);
(...)
ZSTD_compressEnd(workCCtx, dst, dstCapacity);

The process for decompression is similar, though slightly simpler if you do the decompression in a single step:

ZSTD_DCtx* referenceDCtx;
ZSTD_DCtx* workDCtx;

ZSTD_decompressBegin_usingDict (referenceDCtx, dictBuffer, dictSize);   /* load only once */

/* loop on each block */
ZSTD_decompress_usingPreparedDCtx(workDCtx, referenceDCtx, ...);

Preliminary tests show some great gains when compressing small JSON records.

Finally, if you are going to compress very small records into even smaller compressed messages, every byte matters. You can get rid of frame metadata, at the cost of losing compatibility with the zstd command line utility.
It's possible to access data directly at the block level, using the new block interface.
Currently, this saves 11 bytes, which can make a non-negligible difference when the compressed message is < 50 bytes. Note though that the caller will have to save or know the metadata (compressed size, original size, compression method) by other means.

The previous examples become :

Compression :

ZSTD_CCtx* referenceCCtx = ZSTD_createCCtx();
ZSTD_CCtx* workCCtx = ZSTD_createCCtx();

ZSTD_compressBegin_usingDict (referenceCCtx, dictBuffer, dictSize);   /* load only once */

/* loop on each block */
ZSTD_copyCCtx(workCCtx, referenceCCtx);
ZSTD_compressBlock(workCCtx, ...);

Decompression :

ZSTD_DCtx* referenceDCtx = ZSTD_createDCtx();
ZSTD_DCtx* workDCtx = ZSTD_createDCtx();

ZSTD_decompressBegin_usingDict (referenceDCtx, dictBuffer, dictSize);   /* load only once */

/* loop on each block */
ZSTD_copyDCtx(workDCtx, referenceDCtx);
ZSTD_decompressBlock(workDCtx, ...);

Note that blocks have a hard limit of 128 KB.

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

Wow, this looks very promising! Can't wait to test this tomorrow.

A quick scan over the code of dictBuilder makes me realize that there are a LOT of things under the hood. Do you plan to talk about this at some point, in a blog post maybe? It would be really interesting to understand how this thing works.

Also, from the previous discussion, are you currently only building the dictionary? Or is this also dealing with removing the entropy table descriptor at the start of the compressed output?

Regarding the use of the block API or not: in the past I needed to add my own prefix before compressed data to mark the codec type and original size, and it usually ended up taking 4 to 6 bytes... which means that I would only gain 5 or 6 bytes per document... Sticking with the regular format would probably be easier, I guess. I see the block API being very useful when compressing blocks with fixed size or files with a known granularity (like compressing pages of a memory dump aligned to 4 KB or 64 KB pages, etc.).

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

Do you plan to talk about this at some point, in a blog post maybe? It would be really interesting to understand how this thing works.

Good points, yes, I'll write a blog post about it.

Also, from the previous discussion, are you currently only building the dictionary? Or is this also dealing with removing the entropy table descriptor at the start of the compressed output?

It also deals with table descriptors.

I see the block API being very useful when compressing blocks with fixed size or files with a known granularity

Indeed, it's a good use case.

There are also situations where the size of records is not known but bounded by construction (ex : < 4 KB). And the size of the compressed blob is necessarily known, either because it is the size of the file, or because it can be derived from a jump table which is built for random access. As for the compression method, it can also be part of the jump table metadata.
In such cases, no additional information is required, so the block interface can be used.
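
A sketch of such a jump table (hypothetical layout, not part of zstd) :

/* record i occupies [offset[i], offset[i+1]) within the compressed blob */
typedef struct {
    unsigned nbRecords;
    size_t* offset;            /* nbRecords + 1 entries; compressed sizes derived by subtraction */
    unsigned char* method;     /* nbRecords entries : per-record compression method */
} JumpTable;                   /* original sizes need no storage when bounded by construction */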

Well, anyway, these are just optimizations, which could be attempted at a later stage.

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

Ok, so I reused the same dataset and training sample from my earlier tests with femtozip, and dictBuilder produced a ~35 KiB dictionary (compared to exactly 64 KiB for fzip).

The results I have:

  • RAW: 41.2 MB
  • zstd -1 BATCH: 2.8 MB
  • zstd -1 1by1 no dict: 10.9 MB (in 251 ms)
  • zstd -1 1by1 v0.5 dict: 5.0 MB (in 259 ms)

note: I'm not using the trick to copy the context yet, so the compression time is slower than it could be.

Here is a recap of this test and the previous ones:

  • raw dataset: 41.2 MB (1 : 1)
  • zstd 0.5.x no dict (worst case): 10.95 MB (1 : 3.76)
  • zstd 0.4.x w/ femtozip dict: 9.4 MB (1 : 4.4)
  • zstd 0.5.x w/ dictBuilder dict: 5.0 MB (1 : 8.2)
  • gzip -5 single batch (best case): 3.8 MB (1 : 10.8)
  • fzip: 3.2 MB (1 : 12.8)
  • zstd 0.5.x single batch (best case): 2.77 MB (1 : 14.9)

The 0.5.x branch gives a 2x improvement over the 0.4.x branch when using a dictionary, which gets us to about 50% of the best-case compression ratio (compressing everything in a single batch).

Femtozip still compresses better, maybe because it is producing a larger dictionary?

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

Thanks for this valuable feedback @KrzysFR.

Indeed, there are a number of factors that could help level the ground with fzip.

  • Dictionary size : yes it matters.
    There is a poorly documented feature within dictBuilder called the Selectivity Level.
    The idea being : the larger the number, the more elements get through the filter.
    (Sidenote : not even sure if the name is explicit enough. Could be changed though ...)
    Default level is 9.
    If you try higher values (-S10 for example), it should generate a larger dictionary (up to the specified limit).
  • Compression level : it seems you tested zstd at -1 level, which is its fastest.
    But fzip seems to be closer to gzip -9.
    So it sounds fair to test some higher zstd compression levels. Try -4/-5 for some middle ground, -16 for higher compression. It makes a large difference.
  • Once you have settled on a desirable compression level, you can tune the statistics in favor of this particular level. Try option -L# with # being the target level. The difference is pretty small though, so it's more of a finishing touch.
  • I also noted fzip's advantage when it comes to very small data packets.
    I then realized that most of this advantage comes from fzip's complete absence of header information. For small packets, it makes a sizable difference (about 20 bytes)!
    Hence the suggestion to use the block interface to get rid of frame headers.
    There are still ~6 bytes I could probably remove if the objective was to make a static format completely specific to this application. Not sure if that's the right direction for zstd though...

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

The results with various compression levels (same dictionary):

  • zstd -1 : 5.0 MB
  • zstd -5 : 4.2 MB
  • zstd -16: 3.9 MB
  • fzip: 3.2 MB

Below are the size distributions of the documents I used, before and after compression.

The smallest docs are ~150 bytes compressed, so an overhead of 10 bytes per doc (6,360 docs in the set) would only get me ~60 KB out of about 4 MB. Not very significant for my use case anyway.

ASCII histograms to the rescue!

RAW INPUT SIZE DISTRIBUTION:

  • Min/Max: 742 .. 20,126 , Average: 6,490
  • Median : 5,524 (+/-1,933), StdDev: 3,666
  • Distrib: (1,706.4) - 3,814.7 =[ 5,524.0 ]= 9,706.7 - (13,398.8)
   __________________________________________________________________________________________________________________________________
  |____[ Min , Max )____|___Count___|__Percent____________________________________________________|___Cumulative________|__Weight____|
  |      700 - 800      |        25 |   0.393% :                                                  |   0.393% `--------- |     18,750 |
  |      900 - 1,000    |        13 |   0.204% .                                                  |   0.597% [--------- |     12,350 |
  |    1,000 - 1,200    |         9 |   0.142% .                                                  |   0.739% [--------- |      9,900 |
  |    1,200 - 1,400    |       127 |   1.997% #                                                  |   2.736% (--------- |    165,100 |
  |    1,400 - 1,600    |        52 |   0.818% +                                                  |   3.553% (--------- |     78,000 |
  |    1,600 - 1,800    |       173 |   2.720% #+                                                 |   6.274% )--------- |    294,100 |
  |    1,800 - 2,000    |        72 |   1.132% x                                                  |   7.406% )--------- |    136,800 |
  |    2,000 - 2,500    |       263 |   4.135% ##.                                                |  11.541% -[-------- |    591,750 |
  |    2,500 - 3,000    |       170 |   2.673% #;                                                 |  14.214% -(-------- |    467,500 |
  |    3,000 - 3,500    |       275 |   4.324% ##:                                                |  18.538% -]-------- |    893,750 |
  |    3,500 - 4,000    |       653 |  10.267% #####.                                             |  28.805% --]------- |  2,448,750 |
  |    4,000 - 4,500    |       413 |   6.494% ###:                                               |  35.299% ---|------ |  1,755,250 |
  |    4,500 - 5,000    |       312 |   4.906% ##=                                                |  40.204% --->------ |  1,482,000 |
  |    5,000 - 6,000    |     1,189 |  18.695% #########;                                         |  58.899% -----]---- |  6,539,500 |
  |    6,000 - 7,000    |       437 |   6.871% ###+                                               |  65.770% ------)--- |  2,840,500 |
  |    7,000 - 8,000    |       223 |   3.506% #$                                                 |  69.277% ------]--- |  1,672,500 |
  |    8,000 - 9,000    |       123 |   1.934% #                                                  |  71.211% -------[-- |  1,045,500 |
  |    9,000 - 10,000   |       341 |   5.362% ##X                                                |  76.572% -------)-- |  3,239,500 |
  |   10,000 - 12,000   |       930 |  14.623% #######;                                           |  91.195% ---------[ | 10,230,000 |
  |   12,000 - 14,000   |       346 |   5.440% ##X                                                |  96.635% ---------) |  4,498,000 |
  |   14,000 - 16,000   |       141 |   2.217% #.                                                 |  98.852% ---------] |  2,115,000 |
  |   16,000 - 18,000   |        58 |   0.912% =                                                  |  99.764% ---------> |    986,000 |
  |   18,000 - 20,000   |        13 |   0.204% .                                                  |  99.969% ---------> |    247,000 |
  |   20,000 - 25,000   |         2 |   0.031% `                                                  | 100.000% ---------@ |     45,000 |
  `---------------------------------------------------------------------------------------------------------------------------------'

ZSTD -1 OUTPUT SIZE DISTRIBUTION:

  • Min/Max: 175 .. 2,030, Average: 791
  • Median : 695 (+/-201), StdDev: 344
  • Distrib: (337.4) - 535.7 =[ 694.6 ]= 1,032.8 - (1,450.2)
   __________________________________________________________________________________________________________________________________
  |____[ Min , Max )____|___Count___|__Percent____________________________________________________|___Cumulative________|__Weight____|
  |      160 - 180      |         9 |   0.142% .                                                  |   0.142% `--------- |      1,530 |
  |      180 - 200      |        16 |   0.252% .                                                  |   0.393% `--------- |      3,040 |
  |      200 - 250      |         8 |   0.126% .                                                  |   0.519% [--------- |      1,800 |
  |      250 - 300      |        29 |   0.456% :                                                  |   0.975% [--------- |      7,975 |
  |      300 - 350      |       342 |   5.377% ##X                                                |   6.352% )--------- |    111,150 |
  |      350 - 400      |       292 |   4.591% ##;                                                |  10.943% -[-------- |    109,500 |
  |      400 - 450      |       214 |   3.365% #X                                                 |  14.308% -(-------- |     90,950 |
  |      450 - 500      |       232 |   3.648% #$                                                 |  17.956% -]-------- |    110,200 |
  |      500 - 600      |     1,254 |  19.717% #########&                                         |  37.673% ---]------ |    689,700 |
  |      600 - 700      |       829 |  13.035% ######=                                            |  50.708% -----[---- |    538,850 |
  |      700 - 800      |       455 |   7.154% ###x                                               |  57.862% -----]---- |    341,250 |
  |      800 - 900      |       476 |   7.484% ###X                                               |  65.346% ------|--- |    404,600 |
  |      900 - 1,000    |       474 |   7.453% ###X                                               |  72.799% -------(-- |    450,300 |
  |    1,000 - 1,200    |       853 |  13.412% ######X                                            |  86.211% --------)- |    938,300 |
  |    1,200 - 1,400    |       495 |   7.783% ###&                                               |  93.994% ---------( |    643,500 |
  |    1,400 - 1,600    |       255 |   4.009% ##                                                 |  98.003% ---------] |    382,500 |
  |    1,600 - 1,800    |        89 |   1.399% X                                                  |  99.403% ---------] |    151,300 |
  |    1,800 - 2,000    |        35 |   0.550% ;                                                  |  99.953% ---------> |     66,500 |
  |    2,000 - 2,500    |         3 |   0.047% `                                                  | 100.000% ---------@ |      6,750 |
  `---------------------------------------------------------------------------------------------------------------------------------'

ZSTD -5 OUTPUT SIZE DISTRIBUTION:

  • Min/Max: 159 .. 1,789, Average: 672
  • Median : 589 (+/-164), StdDev: 290
  • Distrib: (302.7) - 452.2 =[ 589.4 ]= 873.7 - (1,234.1)
   __________________________________________________________________________________________________________________________________
  |____[ Min , Max )____|___Count___|__Percent____________________________________________________|___Cumulative________|__Weight____|
  |      140 - 160      |         1 |   0.016% `                                                  |   0.016% `--------- |        150 |
  |      160 - 180      |        24 |   0.377% :                                                  |   0.393% `--------- |      4,080 |
  |      200 - 250      |        25 |   0.393% :                                                  |   0.786% [--------- |      5,625 |
  |      250 - 300      |       246 |   3.868% #&                                                 |   4.654% |--------- |     67,650 |
  |      300 - 350      |       402 |   6.321% ###:                                               |  10.975% -[-------- |    130,650 |
  |      350 - 400      |       289 |   4.544% ##;                                                |  15.519% -)-------- |    108,375 |
  |      400 - 450      |       572 |   8.994% ####=                                              |  24.513% --|------- |    243,100 |
  |      450 - 500      |       697 |  10.959% #####=                                             |  35.472% ---|------ |    331,075 |
  |      500 - 600      |     1,033 |  16.242% ########.                                          |  51.714% -----[---- |    568,150 |
  |      600 - 700      |       527 |   8.286% ####.                                              |  60.000% ----->---- |    342,550 |
  |      700 - 800      |       470 |   7.390% ###X                                               |  67.390% ------)--- |    352,500 |
  |      800 - 900      |       657 |  10.330% #####:                                             |  77.720% -------]-- |    558,450 |
  |      900 - 1,000    |       504 |   7.925% ####                                               |  85.645% --------)- |    478,800 |
  |    1,000 - 1,200    |       550 |   8.648% ####;                                              |  94.292% ---------( |    605,000 |
  |    1,200 - 1,400    |       264 |   4.151% ##.                                                |  98.443% ---------] |    343,200 |
  |    1,400 - 1,600    |        77 |   1.211% x                                                  |  99.654% ---------> |    115,500 |
  |    1,600 - 1,800    |        22 |   0.346% :                                                  | 100.000% ---------@ |     37,400 |
  `---------------------------------------------------------------------------------------------------------------------------------'

ZSTD -16 OUTPUT SIZE DISTRIBUTION:

  • Min/Max: 153 .. 1,634, Average: 614
  • Median : 539 (+/-162), StdDev: 265
  • Distrib: (277.5) - 414.1 =[ 539.0 ]= 795.5 - (1,132.0)
   __________________________________________________________________________________________________________________________________
  |____[ Min , Max )____|___Count___|__Percent____________________________________________________|___Cumulative________|__Weight____|
  |      140 - 160      |        23 |   0.362% :                                                  |   0.362% `--------- |      3,450 |
  |      160 - 180      |         2 |   0.031% `                                                  |   0.393% `--------- |        340 |
  |      180 - 200      |         4 |   0.063% `                                                  |   0.456% `--------- |        760 |
  |      200 - 250      |        68 |   1.069% =                                                  |   1.525% [--------- |     15,300 |
  |      250 - 300      |       402 |   6.321% ###:                                               |   7.846% ]--------- |    110,550 |
  |      300 - 350      |       371 |   5.833% ##&                                                |  13.679% -(-------- |    120,575 |
  |      350 - 400      |       493 |   7.752% ###&                                               |  21.431% --[------- |    184,875 |
  |      400 - 450      |       806 |  12.673% ######;                                            |  34.104% ---(------ |    342,550 |
  |      450 - 500      |       718 |  11.289% #####x                                             |  45.393% ----|----- |    341,050 |
  |      500 - 600      |       752 |  11.824% #####&                                             |  57.217% -----)---- |    413,600 |
  |      600 - 700      |       426 |   6.698% ###;                                               |  63.915% ------(--- |    276,900 |
  |      700 - 800      |       738 |  11.604% #####$                                             |  75.519% -------)-- |    553,500 |
  |      800 - 900      |       586 |   9.214% ####x                                              |  84.733% --------|- |    498,100 |
  |      900 - 1,000    |       393 |   6.179% ###.                                               |  90.912% ---------[ |    373,350 |
  |    1,000 - 1,200    |       394 |   6.195% ###.                                               |  97.107% ---------) |    433,400 |
  |    1,200 - 1,400    |       143 |   2.248% #.                                                 |  99.355% ---------] |    185,900 |
  |    1,400 - 1,600    |        38 |   0.597% ;                                                  |  99.953% ---------> |     57,000 |
  |    1,600 - 1,800    |         3 |   0.047% `                                                  | 100.000% ---------@ |      5,100 |
  `---------------------------------------------------------------------------------------------------------------------------------'

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

I looked at the Selectivity Level (the -S# command arg) and it did not change the dictionary size at all. Looking at the code, it looks like you are shifting the number of files by this value. The only issue is that I was training the dictionary with a single file (that is the concatenation of a 10% sample of all the docs), and in this case nbFiles will be 1, which, once shifted by anything over 0, will always be 0...
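
In other words (assuming the threshold is indeed computed with a right shift, as the code suggests) :

unsigned nbFiles = 1;                           /* single concatenated training file */
unsigned selectivity = 10;                      /* e.g. -S10 */
unsigned threshold = nbFiles >> selectivity;    /* 1 >> k == 0 for any k >= 1 */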

I'm not sure how practical it would be to extract a sample set of docs and save each of them as a separate file into a folder, just to pass the list of all the names back to dictBuilder.

I guess this would be a non-issue with a library version of the builder where we could easily pass an array of buffer pointers (and not deal with files).

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

And a final set of tests: creating multiple dictionaries by passing -L# to dictBuilder for each compression level didn't change anything. The dictionaries produced only vary by a couple of bytes in size, and the compressed size does not change.

Wondering if this is because I'm only training with a single file or not.

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

I was training the dictionary with a single file (that is the concatenation of a 10% sample of all the docs)

Ah yes, indeed, in this case, selectivity level won't be useful at all.
(note to self : dictBuilder should deal better with this use case too, maybe trigger a warning).

2 ideas I can think of :

  • Change constant MINRATIO, from 4 to 3. It will let more elements get through the filter (source code change, requires a recompilation)
  • Complete the dictionary with random samples. It would require commenting out this line :
    https://github.com/Cyan4973/zstd/blob/v05x/dictBuilder/dictBuilder.c#L957 :
    changing if (shiftRatio==0) { into /*if (shiftRatio==0)*/ { and recompiling.
    The problem is, this method tends to produce dictionaries which are better for high compression, but worse for fast compression.

The dictionaries produced only vary by a couple of bytes in size

Dictionary file size doesn't change much, but compressing with different dictionaries should have a larger impact than a couple of bytes. That being said, the impact will remain small (~1%).

Another issue I'm thinking of :
because the training set was provided as a single file, the entropic training stage will be flawed, since it will believe there is only a single file to study, and so will only use the beginning of that file.

I don't see a good solution for this.
It could be possible to cut the input file into arbitrary block sizes, pretending there are many more samples. However, in such a case, the real beginnings would be lost, so statistics would be built from random beginnings.

Maybe it would nonetheless improve the situation compared to the current one, where training is limited to a single segment.

Bottom line : entropy statistics really prefer one file per sample. Use * wildcard expansion to provide them to dictBuilder : ./dictBuilder samplesDir/* -o testDict

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

Btw, can the training set be shared for study ?

[Edit] : although not completely satisfactory, in case no good solution to increase the zstd dictionary size is found in a reasonable timeframe, there remains the possibility of decreasing fzip's, in an attempt to "even" the comparison (option --maxdict)

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

The training set contains some unique public identifiers (IP or serial numbers) that would need to be anonymized first, but I guess it could be done. I'm not sure how to transfer the file, though...

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

To start a private discussion, there is this link :
http://www.emailmeform.com/builder/form/a70egol6fyVf15zd46jY88

The training set contains some unique public identifiers

Indeed, it's an issue. To anonymize samples, replace each ID with a repeating character of the same length (could be space, or any other character). The dictionary builder will detect that it's uninteresting stuff and remove it automatically. It can also detect "regular structures" and preserve them (that is, when an identifier is always 8 bytes for example, and headers/footers tend to be always the same, such as XML tags).

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

I have found a way to convert each document to remove some of the stuff (unique serial numbers), but still keep other parts like IP addresses, replacing them with a different IP produced from a cryptographic hash that still preserves a common prefix (I'd like to see the benefit of shared dictionaries when compressing IPv4 addresses that are in the same /16 or /24).

Once I have verified that the file is safe, I will send it to you. Each doc is a JSON object, so I will probably store them in a single text file, with one document per line, which should be easy to use and split as you like.

It will look like this (note: UTF-8 encoded):

{"Id":1,.....}\r\n
{"Id":2,.....}\r\n
{"Id":3,.....}\r\n
{"Id":4,.....}\r\n

The file contains a mix of natural text, keywords, common GUIDs shared by almost all docs, unique GUIDs that are less shared, etc... It should be a good sample of medium-sized JSON documents in a Document Store, I guess.

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

Email sent.

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

Thanks @KrzysFR.
These look like excellent samples.

As suggested, the file was cut into one sample per line using split. Then each document was compressed independently.

Strangely enough, I had no problem generating a dictionary of 110 KB (the default size limit).
No need to change any parameter.
Is it the same sample set you tested ? Could it be a side effect of the anonymizing strategy ?

Using the dictionary created by dictBuilder with default settings :
Original set : 39.46 MB, 6360 files
zstd -1 noDict : 10.34 MB
zstd -1 defaultDict : 4.21 MB
zstd -5 defaultDict : 3.42 MB
zstd -16 defaultDict : 3.03 MB
[Edit] finally got fzip numbers :
fzip -9, Dict64K : 2.82 MB

Just for reference, when grouping all files together :
gzip -1 singleBatch : 5.25 MB
gzip -9 singleBatch : 3.41 MB
zstd -1 singleBatch : 2.50 MB
zstd -16 singleBatch : 1.37 MB

Since the default dictionary size seems easily filled, I used Selectivity the other way : becoming more restrictive about what can get into the dictionary.
Got best result with -S8 -L16 :
zstd -16 optDict110K : 2.96 MB

Now it could be argued that, since the zstd dictionary is now 110 KB, it has an advantage over fzip's 64 KB. So I also tested with a dictionary limited to 64 KB :
./dictBuilder sampleDir/* --maxDict 64K -L16 -S8
zstd -16 optDict64K : 3.05 MB

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

The file I sent you is the complete set of all the documents. The sample set used for training was a random selection of 100 documents only from this set.

One difference is that my original dataset had a lot more unique GUIDs, which have been replaced by large runs of whitespace in the one I sent you. Could that help a lot with compression ?

All fields whose value starts with a * followed by only spaces were originally unique IDs. You could probably fill back some of the holes by replacing the spaces with random alphanumeric characters. Or I could regenerate a sample set by replacing the spaces with random text...

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

One difference is that my original dataset had a lot more unique GUIDs, which have been replaced by large runs of whitespace in the one I sent you. Could that help a lot with compression ?

Yes it does.
But what is surprising is the impact on filling the dictionary.

The sample set used for training was a random selection of 100 documents only from this set.

OK, that could be the reason.

100 documents is really a small amount. At 5.5 KB average doc size, that means just 550 KB for training. It's difficult to extract >10% of that as a dictionary.

I would recommend using something closer to 1000 documents. Generic rule of thumb : the dictionary should extract no more than 1% of the samples.

Note : I understand that fzip's longer build time may make it desirable to use fewer documents for training. Fortunately, zstd's dictBuilder runs faster, and can afford to analyze tens of MB of training samples in a reasonable timeframe.

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

Mmm... having to sample 1/6th of the complete dataset would make it difficult for my use case. I'm not looking to get the perfect dictionary, only one that is good enough for the type of documents I'm compressing:

All documents share a lot of the same JSON structure (typically I would be using one dictionary per type of Document Collection like "Users", "Products", "Contacts", ...), meaning that probably 50% of the tokens are common to all the documents (JSON field names, and artefacts of the structure, like runs of "}]],[[{"Timestamp":"2015-01-29T between arrays or arrays of objects, etc.).

Of the remaining 50%, there will be a lot of sharing when, for example, you are storing documents that all have the same "geographic site ID", or that all have a unique IP address but share the , "IpAddress":"192.168. prefix, or all have dates that start with ","LastModified":"2015-01- because they were all last modified in the last month, etc... Then there are all the common keywords (in my dataset, the vendor names and model names for hardware), or if this was a Contact database, all the @domain.com suffixes of the emails, or company names that are shared.

In this model, I'd like to periodically construct a small sample that will capture most of the redundant data, but do it in a very short time. Typically, if used with a Document Store, when storing/updating a document, I would decide with some heuristic that it is time to create a new dictionary generation. This means that the whole process of sampling and training the dictionary would need to happen very very fast (taking at most a few hundred milliseconds), and that I cannot spend too much time or memory on this step (without pausing the caller thread too long).

Now, in a different scenario where each document would be totally different, like say if you were compressing the lyrics of all French pop songs from the 90s, where there would be little structure in common between songs, I guess you would need a larger sample to produce a better dictionary.

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

I would decide with some heuristic that it is time to create a new dictionary generation. This means that the whole process of sampling and training the dictionary would need to happen very very fast

Does the dictionary generation need to be a blocking operation ? Could it be built in parallel, and made available later, when it's ready ?

I made a quick speed test on my laptop :
100 documents (00*) : 0.07 sec, 28 KB dictionary ; zstd -16 => 3.73 MB
636 documents (*0) : 0.57 sec, 110 KB dictionary ; zstd -16 => 3.10 MB

The difference looks quite impactful, for a half-second difference of construction time.

For comparison : fzip
100 documents (00*) : 8.65 sec, 64 KB dictionary ; fzip -9 => 3.43 MB

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

I've built dictBuilder for win32, so that I can easily invoke it from my tests with different numbers of samples and compression levels, in order to see the impact of the various parameters on the final result.

Unfortunately, the Windows command prompt does not support wildcard expansion :(

C:\Git\zstd\visual\2013\x64\Release>dictBuilder.exe c:\temp\zstd\training\samples\*.json
Warning : set contains only 1 files ...
Error 10 : impossible to open file c:\temp\zstd\training\samples\*.json

It seems that, at least on Windows, it is the responsibility of the application to do the expansion, which is rather cumbersome ...

I could generate a list of all the files and pass that via the command line to dictBuilder.exe, but I'm pretty sure I would hit the maximum command line length at some point. Maybe one way to work around that would be the ability to pass the list of files via stdin, or to specify the path to a text file with one input file per line?
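
On the dictBuilder side, the stdin variant could be as simple as this (a sketch of the suggestion, not an existing feature) :

#include <stdio.h>
#include <string.h>

/* read one sample filename per line from stdin (hypothetical input mode) */
char name[1024];
while (fgets(name, sizeof(name), stdin)) {
    name[strcspn(name, "\r\n")] = '\0';   /* strip the line ending */
    /* addSampleFile(name); */            /* hypothetical : load this file into the training set */
}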

Ultimately, the best way would be to have dictbuilder as a library that could accept byte buffers, without having to save the samples to disk and then read back the dictionary into memory...

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

There is a quick work-around :
link setargv library : https://msdn.microsoft.com/en-us/library/8bch7bkk.aspx

setargv.obj must be added to the dependency list :
Properties -> Linker -> Input -> Additional Dependencies

Long-term second solution :
it would also be good if the program accepted a directory as valid input, loading its content.
The problem is, I could not find any portable way to deal with directories using standard C.
It seems this is necessarily system-specific ...

Ultimately, the best way would be to have dictbuilder as a library that could accept byte buffers, without having to save the samples to disk and then read back the dictionary into memory...

Good point.
Do you have an idea of the kind of prototype you would like to use ?
Note that, for good results, we need more than the "concatenated content" of all files. It's much better to also get the beginning position of each sample.

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

Here is a proposed prototype for the byte buffer API :

/*! DiB_trainFromBuffer
    Train a dictionary from a memory buffer @samplesBuffer
    where @nbSamples samples have been stored concatenated.
    Each sample size is provided in an ordered table @sampleSizes.
    The resulting dictionary will be saved into @dictBuffer.
    @parameters is optional and can be provided with 0 values to mean "default".
    @result : size of dictionary stored into @dictBuffer (<= @dictSize),
              or an error code, which can be tested by DiB_isError().
*/
size_t DiB_trainFromBuffer(void* dictBuffer, size_t dictSize,
                           const void* samplesBuffer, const size_t* sampleSizes, unsigned nbSamples,
                           DiB_params_t parameters);
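
For illustration, a call might look like this (hypothetical, since the API is only a proposal at this stage) :

char dict[110 * 1024];                 /* room for the dictionary */
DiB_params_t params = { 0 };           /* 0 values mean "default" */

size_t dictSize = DiB_trainFromBuffer(dict, sizeof(dict),
                                      samplesBuffer, sampleSizes, nbSamples,
                                      params);
if (DiB_isError(dictSize)) { /* handle the error */ }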

For discussion

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

It looks fine. I suppose DiB_params_t would grow in the future with more arguments.

One issue I see with DiB_setNotificationLevel(..) and the DISPLAYLEVEL(..) macros is that they ultimately call fprintf. That's fine for a CLI, but not so much for a library hosted in a service or web process. For now, I'm setting the display level to -1 to disable it.

Another thing to consider is cancellation: since the training could potentially take some time (a few seconds, up to 10 seconds?), it could be useful to have a way to cancel a training in progress. For example, when called from .NET I would probably wrap the training operation into a Task<..> and have it run on a threadpool thread. But for long-running operations (either I/O or compute) it is good practice to also provide a way to cancel them (via CancellationTokens in .NET; there are probably equivalent APIs in other languages like Go or Java).

I'm not sure if having callbacks would be a good idea for cancellation (there is a cost to setting up marshalling of callbacks between managed VMs and native libraries), so maybe the good old flag in memory would be better? Maybe add an abort boolean to DiB_params_t (or a pointer to an abort flag that would be ignored if null) that would be checked from time to time by the training loop (e.g. once per sample file, or once every X bytes processed?). That way the caller could simply set the flag to true to request cancellation. It would not be immediate, but at least it would abort nicely after some time (and return an "operation aborted" error code).

from zstd.

t-mat avatar t-mat commented on April 27, 2024

I'm sorry for my interruption, but I'd like to +1 for "avoiding callback".
I suppose both of you already know about it, but since the meaning of context differs across languages, especially for exception handling (e.g. file not found, connection failed), it'd be nice to give the host language freedom over flow and resource control.
If we use a callback, the call stack will be "Host code(A) --(invoke)--> C(DiB) --(callback)--> Host code(B)". Since C can't recognize the host language's exceptions, this call stack is likely to cause a difficult (more precisely: annoying 😩) exception propagation situation from host code (B).
(Wikipedia article "Dispose pattern" is good introduction for this problem.)

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

I suppose DiB_params_t would grow in the future with more arguments.

Quite likely. It is still early days, so API updates are expected.

One issue I see with DiB_setNotificationLevel(..) and the DISPLAYLEVEL(..) macros is that they ultimately call fprintf. That's fine for a CLI, but not so much for a library hosted in a service or web process.

Totally correct. The transition from CLI to library status is not yet completed.
Maybe I should consider dividing the library into 2 parts, one for IO (which would keep the notifications), and the other one for the builder proper.

For now, I'm setting the display level to -1 to disable it.

It shouldn't be necessary.
By default, this display level is 0, which means "no notification".

Another thing to consider is cancellation: since the training could potentially take some time (a few seconds, up to 10 seconds?), it could be useful to have a way to cancel a training in progress.

I was considering 2 ideas to handle this situation :

  • Add a parameter (typically inside DiB_param_t),
    which tells how much time the algorithm is authorized to work.
    The idea is : if it knows when it must stop,
    the function can properly finish processing data,
    and nonetheless output a valid dictionary,
    with whatever processing it has been able to achieve in the allocated time frame.
  • Long run times are typically related to large training sets.
    The algorithm is very fast, I believe, up to 10 MB,
    and gets slower with every size increase of the training set.
    With a 100 MB training set, it's noticeably slower.
    One idea would be to automatically downsize the training set,
    by randomly selecting samples within the training set, up to some pre-defined limit
    (see the sketch after this list).
    Here also it would seem good practice to allow the caller to "opt out" of this scheme,
    typically by using another parameter.
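
A sketch of that random downsizing idea (illustrative only, not current dictBuilder behavior) :

#include <stdlib.h>

/* randomly mark samples to keep, until ~maxBytes of training data is selected */
static void downsizeTrainingSet(const size_t* sampleSizes, unsigned nbSamples,
                                size_t maxBytes, unsigned char* keep /* nbSamples flags */)
{
    size_t total = 0;
    unsigned selected = 0, n;
    for (n = 0; n < nbSamples; n++) keep[n] = 0;
    while (total < maxBytes && selected < nbSamples) {
        unsigned pick = (unsigned)(rand() % nbSamples);   /* assumes srand() was called */
        if (keep[pick]) continue;                         /* already selected */
        keep[pick] = 1;
        selected++;
        total += sampleSizes[pick];
    }
}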

I'm sorry for my interruption, but I'd like to +1 for "avoiding callback".

Heck, I was considering using a callback to deal with the "notification" issue.
I guess I will have to find something else ...
(maybe a hidden "internal" callback).

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

@t-mat I 100% agree with you on callbacks. I also had issues with exceptions from managed callbacks not flowing through the native C++ side and creating havoc. I know that sometimes you can't really get rid of them, but at least for cancellation it would be great to not have to mess with that.

Regarding a parameter to set the maximum execution time, I'm not sure this is ideal, especially when using the builder alongside a process that is doing a lot of other things with the CPU.

It could be that even if we set a time limit of 1 sec, the CPU would be overused by some other threads, and the builder would abort after having effectively worked for 100 ms or less, producing a subpar dictionary. This means that when doing benchmarks (or even in production), the quality of the compression could be affected by some random task happening while the dictionary was being built, with impacts lasting for days or weeks until a new dictionary gets built...

Would it be possible to have a scenario where we pass a sample set of documents, only one half of which is used to build the dictionary, but the whole set is used to test compression, stopping when a target compression ratio is met (or a dictionary size, or a number of rounds)? Maybe there would be a way to compute the cost/benefit ratio of each additional round of training, and stop when the gain from the previous round was marginal?

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

I've been doing tests with another type of document, in this case PCL-XL spool files (produced by printer drivers).

This format is interesting because it combines multiple types of data in the same file:

  • a PJL prolog, which is textual data with a ceremony of settings that are usually the same (page size, format, dpi, paper type, and so on). This can take a few KB of ceremonial text that is unique per file but almost the same across files, and I guess could greatly benefit from having a dictionary.
  • the PCL-XL format itself is a compact binary representation of the content of the pages. Think DrawLine, DrawText, SetColor(R,G,B), DrawBezierCurve, which will not compress very well: the bytes are unique to the "shape" of the printed page.
  • Images and rasters are usually encoded with a very poor RLE or DeltaRow encoding, which is usually only efficient with very simple line art.
  • TrueType soft-font uploads, where the individual glyphs for each letter or symbol of each font used in the document get uploaded as-is. These are tightly packed sequences of (x,y) coordinates. If a company is, for example, using Verdana 9pt in all its documents, then the same glyphs (for 'A', 'B', '1', '2', '!', '?' etc.) will probably be present in all the spool files.

I grabbed a set of spool documents of various sizes (spool files typically vary a lot in size, from a few KB to a few hundred MB), all produced by the same printer driver. Overall, I'm getting a compression ratio around 1:2 (+/- 0.5).

When using the full data set (1.4 GB in 276 files), I'm not getting much gain from building a dictionary, probably because the 110 KB dict has little impact when compressing files of a few MB or more. Typically, I would get a difference of about 500 KB with or without a dictionary, on 1.4 GB of spools compressed down to 750-800 MB (so about a 0.034% gain).

I did a second test keeping only the spools less than 2 MB in size (200 files), and again, the gain from having a dictionary was not very significant (about 0.05%).

For the last test, I only kept spool files of 256 KB or less (100 files, these would be simple documents of 1 or 2 pages, without much graphics), and only then did I see some gains: with a 110 KB dict, I get an 8-10% better compression ratio.

[image]
legend: '-D' means with a dictionary produced for the same compression level

In this scenario, I'm not sure if going to the trouble of maintaining a dictionary would be worth it. The only gain I see is that zstd -1 with a dictionary is faster and compresses a bit more than gzip -5. Using anything above zstd -9 is way too slow for not much gain with this type of file.

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

I only kept spool files of 256 KB or less (100 files, these would be simple documents of 1 or 2 pages, without much graphics), and only then did I see some gains:

Dictionary compression is only helpful for small files / messages.
It makes less sense for large files. Gains are pretty important in the first few KB, but then become less and less frequent.

As a rough guidance, I advise dictionary compression for files / records < 64 KB. After that, most of the gains from dictionary compression are achieved, and their relative share fades away proportionally to the size of the file.

Regarding giving an external process / thread the ability to stop training execution : if callbacks are not recommended, and providing a time limit is not a good option either, it seems the last option is to provide an address, read-only for the builder, where an external thread/process could write a specific value to mean "stop now".
But I don't like it.
The complete training process is a suite of steps. By knowing the maximum execution time, it's possible to forecast a good share for each of them. On the other hand, if a "stop order" can arrive at any time during any of these steps, the following steps will stand little chance of doing their job properly. The resulting dictionary will be in quite bad shape.

I believe the most effective way to limit training execution time is to limit the size of the training set, either from the client side, or directly within the library, by randomly sampling the submitted training set to extract a smaller subset.
10 MB should prove enough to build a 110 KB dictionary. And with such a quantity, the dictionary builder needs less than a second to complete its job.

[Edit] : for information, there is a --fast mode available on dictBuilder. It's there just in case someone needs to create a dictionary extremely fast, but it produces dictionaries of pretty bad quality.

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

When I'm talking about cancellation, I mean in the sense of "aborting", as an error condition. That can easily be done with a pointer to a shared memory location that is polled from time to time by the builder.

I'm not sure in what scenario it would be feasible for the host application to "know" when to tell the builder to stop training and return the result as-is, without having deep knowledge of the builder's internal state or being able to predict with high accuracy the time needed for any sample set... And that is in the case where there are no outside perturbations.

Also: would it be possible to parallelize the builder using multiple threads? Or is most of the process serialized by nature? If we could reduce the time down to ~150ms with 8 threads crunching on 10MB of data, then the runtime of the training could become a non-issue.

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

Last test: only selecting spools < 100 KB and using a sample of at most 100 files or 10 MB.

PS: I hate Excel!

Final compressed size is indeed a lot better with dictionaries:
[image]

Compression time (in ms)
[image]

Bytes saved / seconds (ie: (INPUT_SIZE - OUTPUT_SIZE) / TIME_SEC)
[image]

The conclusion seems to be that zstd -1 with a dictionary wins if one is interested in reducing the size but without spending too much time on it.

from zstd.

FrancescAlted avatar FrancescAlted commented on April 27, 2024

Hi. I have been following your discussion about dictionaries in zstd with the utmost interest. I am planning to add support for it in Blosc2, and the possibility of reaching higher compression ratios on small blocks by using dictionaries is very appealing.

Just out of curiosity, @KrzysFR, do you have a plot of the decompression times when using dictionaries vs. not using them? I expect the times to be very close, but just want to make sure.

Thanks!

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

@FrancescAlted to be honest I've only focused on compression time and not taken the time to measure the decompression. I'm assuming that it will be as fast (or even faster). I'll try to look into that tomorrow.

Also, please bear in mind that I'm measuring the time rather crudely, by timing a run of zstd from the command line for each file to compress, so I'm probably including the overhead of starting a process and reading the files from disk. This means the whole thing is probably a fancy way to measure the speed of my SSD drive. I'll try doing it with the buffered API in memory, to see just the overhead of compression/decompression.

from zstd.

t-mat avatar t-mat commented on April 27, 2024

I'm really sorry to continue bothering you, but if we keep sticking with C89/90/99, please avoid using raw shared memory (a volatile pointer) to communicate with other threads or languages (especially ones with thread support at the language level) without a proper library or OS-specific API.
It invites our long-time bad friend: undefined behaviour 😨
In short, to achieve this in a portable way, we need C11 (as a language), its standard library, and the header <stdatomic.h>. The old C/C++ memory model guarantees nothing about shared memory (note: C/C++'s volatile is designed for a completely different (orthogonal) purpose).


http://cbloomrants.blogspot.com/2009/01/01-25-09-low-level-threading-junk-part.html
(@t-mat's note: please notice this blogpost's date)

Basically C/C++ doesn't have a memory model. Modern C# and Java do which means that the language natively supports the ideas of memory access ordering and such.

http://preshing.com/20120515/memory-reordering-caught-in-the-act/

When writing lock-free code in C or C++, one must often take special care to enforce correct memory ordering. Otherwise, surprising things can happen.
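
For illustration, the portable C11 version of such a polled abort flag would look roughly like this (a sketch, assuming a C11 toolchain) :

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool abortRequested = ATOMIC_VAR_INIT(false);

/* controller thread : request cancellation */
/*     atomic_store(&abortRequested, true); */

/* training loop : poll periodically, e.g. once per sample */
if (atomic_load(&abortRequested)) {
    /* stop cleanly and return an "operation aborted" error code */
}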

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

Great charts @KrzysFR !

I'm assuming that it will be as fast (or even faster). I'll try to look into that tomorrow.

This is correct. In most circumstances, decompression speed should prove faster.
This assumes the caller is using ZSTD_decompress_usingPreparedDCtx() or equivalent, which allows loading the dictionary only once for a batch of blocks.

zstd includes an internal benchmark that can be used for quick mem-to-mem speed measurement.

Test one file :
./zstd -b filename

Test multiple files :
./zstd -b dirName/*

Test multiple files at level 9 :
./zstd -b9 dirName/*

Test multiple files at level 9 using a dictionary :
./zstd -b9 dirName/* -D dictionaryFilename

Test multiple files from level 1 to 16 using a dictionary :
./zstd -br16 dirName/* -D dictionaryFilename

@t-mat : I'm going to trust your expertise on this ;)

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

Results of running the bench mode with zstd -b#, limited to only 3 files, because it is too slow to run on 100+ files every time.

Test rig: Intel Skylake @ 4.0 GHz

// file                   Input        Output (ratio)  C.Speed     D.Speed
zstd -1
> 9019503_65_bw.spl :     39728 ->     23009 (1.727), 233.9 MB/s , 495.6 MB/s
> 2387335_66_bw.spl :     39778 ->     23035 (1.727), 233.4 MB/s , 493.9 MB/s
> 1153244_68_bw.spl :     42576 ->     24426 (1.743), 233.9 MB/s , 481.6 MB/s
zstd -1 -D
> 9019503_65_bw.spl :     39728 ->      9835 (4.039), 254.1 MB/s , 604.7 MB/s
> 2387335_66_bw.spl :     39778 ->      9861 (4.034), 254.2 MB/s , 602.7 MB/s
> 1153244_68_bw.spl :     42576 ->     10982 (3.877), 255.7 MB/s , 592.4 MB/s
zstd -3
> 9019503_65_bw.spl :     39728 ->     22385 (1.775), 196.1 MB/s , 443.2 MB/s
> 2387335_66_bw.spl :     39778 ->     22412 (1.775), 195.6 MB/s , 446.2 MB/s
> 1153244_68_bw.spl :     42576 ->     23788 (1.790), 194.6 MB/s , 438.9 MB/s
zstd -3 -D
> 9019503_65_bw.spl :     39728 ->      9589 (4.143), 102.2 MB/s , 535.0 MB/s
> 2387335_66_bw.spl :     39778 ->      9619 (4.135), 101.4 MB/s , 534.0 MB/s
> 1153244_68_bw.spl :     42576 ->     10633 (4.004), 102.2 MB/s , 503.5 MB/s
zstd -5
> 9019503_65_bw.spl :     39728 ->     21439 (1.853),  93.1 MB/s , 405.4 MB/s
> 2387335_66_bw.spl :     39778 ->     21461 (1.854),  93.2 MB/s , 406.5 MB/s
> 1153244_68_bw.spl :     42576 ->     22716 (1.874),  93.3 MB/s , 397.4 MB/s
zstd -5 -D
> 9019503_65_bw.spl :     39728 ->      8360 (4.752),  98.4 MB/s , 571.0 MB/s
> 2387335_66_bw.spl :     39778 ->      8386 (4.743),  98.3 MB/s , 570.6 MB/s
> 1153244_68_bw.spl :     42576 ->      9387 (4.536),  96.9 MB/s , 530.7 MB/s
zstd -7
> 9019503_65_bw.spl :     39728 ->     21272 (1.868),  71.4 MB/s , 418.3 MB/s
> 2387335_66_bw.spl :     39778 ->     21295 (1.868),  70.6 MB/s , 418.4 MB/s
> 1153244_68_bw.spl :     42576 ->     22493 (1.893),  67.9 MB/s , 413.4 MB/s
zstd -7 -D
> 9019503_65_bw.spl :     39728 ->      7815 (5.084),  75.6 MB/s , 635.6 MB/s
> 2387335_66_bw.spl :     39778 ->      7838 (5.075),  75.8 MB/s , 638.9 MB/s
> 1153244_68_bw.spl :     42576 ->      8740 (4.871),  74.9 MB/s , 615.6 MB/s
zstd -9
> 9019503_65_bw.spl :     39728 ->     21259 (1.869),  57.9 MB/s , 423.7 MB/s
> 2387335_66_bw.spl :     39778 ->     21282 (1.869),  57.6 MB/s , 425.6 MB/s
> 1153244_68_bw.spl :     42576 ->     22482 (1.894),  55.6 MB/s , 422.1 MB/s
zstd -9 -D
> 9019503_65_bw.spl :     39728 ->      8218 (4.834),  66.4 MB/s , 633.3 MB/s
> 2387335_66_bw.spl :     39778 ->      8243 (4.826),  66.1 MB/s , 627.3 MB/s
> 1153244_68_bw.spl :     42576 ->      9328 (4.564),  63.2 MB/s , 605.9 MB/s
zstd -11
> 9019503_65_bw.spl :     39728 ->     21251 (1.869),  49.3 MB/s , 428.0 MB/s
> 2387335_66_bw.spl :     39778 ->     21278 (1.869),  49.0 MB/s , 424.9 MB/s
> 1153244_68_bw.spl :     42576 ->     22484 (1.894),  47.3 MB/s , 420.8 MB/s
zstd -11 -D
> 9019503_65_bw.spl :     39728 ->      8344 (4.761),  49.7 MB/s , 636.1 MB/s
> 2387335_66_bw.spl :     39778 ->      8368 (4.754),  50.2 MB/s , 634.4 MB/s
> 1153244_68_bw.spl :     42576 ->      9421 (4.519),  48.5 MB/s , 613.0 MB/s
zstd -13
> 9019503_65_bw.spl :     39728 ->     21258 (1.869),  38.6 MB/s , 426.5 MB/s
> 2387335_66_bw.spl :     39778 ->     21285 (1.869),  38.7 MB/s , 424.5 MB/s
> 1153244_68_bw.spl :     42576 ->     22483 (1.894),  36.0 MB/s , 423.2 MB/s
zstd -13 -D
> 9019503_65_bw.spl :     39728 ->      7861 (5.054),  36.8 MB/s , 634.5 MB/s
> 2387335_66_bw.spl :     39778 ->      7883 (5.046),  36.7 MB/s , 636.9 MB/s
> 1153244_68_bw.spl :     42576 ->      8896 (4.786),  36.0 MB/s , 610.6 MB/s
zstd -16
> 9019503_65_bw.spl :     39728 ->     21258 (1.869),  38.8 MB/s , 425.8 MB/s
> 2387335_66_bw.spl :     39778 ->     21285 (1.869),  38.6 MB/s , 425.6 MB/s
> 1153244_68_bw.spl :     42576 ->     22483 (1.894),  37.3 MB/s , 423.9 MB/s
zstd -16 -D
> 9019503_65_bw.spl :     39728 ->      7880 (5.042),  29.2 MB/s , 653.3 MB/s
> 2387335_66_bw.spl :     39778 ->      7906 (5.031),  29.0 MB/s , 647.0 MB/s
> 1153244_68_bw.spl :     42576 ->      8906 (4.781),  28.3 MB/s , 627.9 MB/s

So it looks like the dictionary helps a lot with the compression ratio, which in turn gives a 10% compression speed improvement and 20% decompression speed improvement, EXCEPT for some levels like -3 where the compression speed is actually slower with the dictionary (but decompression is faster).

Then above -11 the trend seems to reverse, and compression gets slower, while decompression is still better (again, probably because there are fewer bytes to decompress).

The best ratio is with -16 -D, which gives 5.0 versus 1.86 for plain -16, but at 1/10th the speed of -1 -D, which still gives a 4.0 ratio.

Another look at the data:

Compression ratio vs C.Speed / D.Speed
xy plot of the data above, orange is decompression speed, blue is compression speed
[image]

We can clearly see that without dictionary (left side) the ratio does not change much. But once we add the dictionary, there is A LOT of variation in the ratio. It looks almost like the ratio could even be higher with a better dictionary...

Also, speed-wise, it again seems that zstd -1 -D is the best compromise, and the upward trend shows that decompression speed goes up when the ratio goes up (again, less stuff to decompress).

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

Another weird result I just realized:
[image]

Without dictionary, the decompression speed decreases with the compression ratio, but with dictionary the inverse is true and the decompression speed increases with the compression ratio .. ??

Maybe it is because, with a dictionary, there are more token matches (which can be decompressed by memcpy), while without a dictionary the same bytes would instead go through the Huffman or FSE stages (which are slower to decompress)?

Very interesting :)

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

Ok and one last graph and I'm done:

Efficiency (ie: number of bytes removed per second due to compression)
[image]

from zstd.

FrancescAlted avatar FrancescAlted commented on April 27, 2024

Cool plots! I am pretty impressed by the improvement in decompression speed when using dictionaries. I'd say that this could be related to the fact that it takes less time to transmit the compressed buffer into the caches for decompression. Anyway, dictionaries + zstd is worth a try indeed 👍

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

Great graphs once again @KrzysFR !

Without dictionary, the decompression speed decreases with the compression ratio, but with dictionary the inverse is true and the decompression speed increases with the compression ratio .. ??

Indeed. Without a dictionary, the compressor has little room left to improve the compression ratio, so it goes after very small matches. This strategy decreases decompression speed, since the decoder has to do more work to rebuild the same amount of information.
With a dictionary, the reverse happens : since it has a lot more choices, the compressor looks for better and longer matches, letting the decompressor do less work to rebuild the same amount of information.

We can clearly see that without dictionary (left side) the ratio does not change much. But once we add the dictionary, there is A LOT of variation in the ratio. It looks almost like the ratio could even be higher with a better dictionary...

Correct. You could be tempted to test larger dictionaries in some cases.

Also, it's difficult to reproduce a nice compression level curve with a dictionary. We are all accustomed to "higher level = more compression but slower". With a dictionary, it's a lot more chaotic.
I just made a change in buffer management that will help restore the shape of the compression level curve a bit, but it's still not completely regular.

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

Merged into "dev" branch

from zstd.

Cyan4973 avatar Cyan4973 commented on April 27, 2024

Released as v0.5

from zstd.

KrzysFR avatar KrzysFR commented on April 27, 2024

Thanks for doing this work! It will really help us with some perf issues we currently have (both CPU and disk footprint).

from zstd.
