
Replication of bug #3517 about zstd

AV-Coding commented on May 3, 2024
Replication of bug #3517

Comments (24)

AV-Coding commented on May 3, 2024

Thank you @Cyan4973 and @terrelln for your response.

We use ZSTD to compress blocks of data up to 256KiB in size. At this point, there are billions of compressed blocks in the world. Since we updated, there are probably hundreds of millions of blocks compressed with the new version. We have seen two sets of occurrences of the error (both with the new version), on two independent hardware configurations in separate parts of the world that have no relationship to each other. Each hit the problem twice while running a unique workload: workload A created two hits on different days for user A, and workload B created two hits on different days for user B.

We have CRC across the "decompressed" data. We will start working on a bit flip sandbox to see if we can get the data to recover. We also have neighboring compressed blocks that are similar in nature that do decompress. We will also work on adding a decompression-verification routine that can be enabled dynamically.

We also attempted to decompress the data with the older version with no success. If we continue to see occurrences, we may have to revert out of caution.

At this point, it's most likely a flaw in the newer ZSTD version we are now running or a flaw in our code that leverages the new version. It could perhaps still be hardware given both user A and B use the same brand/configuration of hardware components.

AV-Coding commented on May 3, 2024

It's multi-threaded, but it's always the same thread which calls the routine. What we posted above is also a streamlined version of what's actually being called. The context is saved per instance or handle and the handle is only accessed by a single thread at any given time.

AV-Coding commented on May 3, 2024

Compressed on 32-bit, attempting decompression on both 32-bit and 64-bit. The 32-bit system is big-endian.

AV-Coding commented on May 3, 2024

We have an important update on the problem we are hitting.
Similar blocks for the same workload have a large amount of zero data, so we were thinking perhaps these invalid offsets are referencing a zero match.
As shown in this log, we replaced invalid offset match data with zeros nine times for the compressed 256KiB block.
Not only does it decompress correctly, but the uncompressed CRC matches.

zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 239 (limit of 24) with 5 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 239 (limit of 45) with 5 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 239 (limit of 70) with 5 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 954 (limit of 83) with 33 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 3663 (limit of 1769) with 5 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 68470 (limit of 1775) with 4 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 68470 (limit of 1782) with 5 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 68470 (limit of 1799) with 9 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 68470 (limit of 1809) with 5 zeros

This zero rework helped with user A's two occurrences allowing both of them to be successfully decompressed.
But, it did not work for user B. We attempted to replace user B's match source with every value from 0x00 to 0xFF, with no luck.
The mystery match reference for user B's cases must be a pattern that isn't a single repeating byte.

We also attempted to compress the uncompressed block using 1.5.2 and 1.5.6. Both compress and decompress successfully using the zstd tool with the same compression level.
They also compress significantly better: the bad compressed block was about 33KiB in size, while the test compressed block is about 10KiB.
We are wondering if the 33KiB block has an excess set of literals.
We are thinking it may be some form of context corruption or invalid context reset, like it's using a dictionary from a previous block.

One question: the 32-bit application performing the compression uses a static libzstd.a library which is not built multi-threaded, yet the application itself is multi-threaded.
But, as mentioned before, the thread which does call the compression function is always the same thread. Though other threads exist, they do not call the ZSTD functions.
Do you think this may be a problem?

Cyan4973 commented on May 3, 2024

If I read it correctly, the case in #3517 requires:

  • Block Splitter is activated (i.e. high compression only)
  • It confuses specifically literals_length == 65536 with literals_length == 0.
  • It uses a repcode in the same sequence, and because it incorrectly believes that literals_length == 0, it picks a wrong index.
  • It splits at this specific position

So that's a pretty hard combination of circumstances to generate, even on purpose.

Looking at your trace:
decompress/zstd_decompress_block.c: seq: litL=24, matchL=5, offset=239
the source issue seems different, because litL=24, so that's incompatible with the above scenario.

It still doesn't explain why the data is wrong.
Starting the first sequence with offset=239 looks obviously incorrect, though it has nothing to do with repcode.

It could happen if a dictionary was used to compress this data, in which case this long offset is searching for a match into the dictionary.
I presume no dictionary is expected to have been used to compress this data, in which case the next logical explanation is corruption.
And at this point, all bets are off.

If the software is the cause of this corruption, it should be reproducible, meaning that the same input with the same compression settings should result in the same corruption event. This can be difficult to reproduce though if you don't have the original data anymore, or if you don't know what were the compression settings. It would also give a simple mitigation strategy: just revert to older (presumed working) version v1.2.0, and see if the corruption events stop (assuming this is observable, which can be difficult for very rare events).
Unfortunately, for very rare corruption events, we also can't rule out hardware issues. It's a measurable thing when you have a large enough fleet of servers and data set to look after. It can happen at network, storage and cpu levels, so there's a broad range of options. That's also why checksum is so useful.

At this point, what would help is some watermarking, that would trace the origin system that produced the data, when and how. If all corruption events come from the same system, for example, it's a pretty strong indication. Unfortunately, that's easier said than done. If such watermark was not in place at the time of the corruption event, there is now very little decompression can do to investigate or fix the problem after the fact.

That being said, you also mention that checksum was enabled, so it can now be used as a "validator".
You also mention that, for the specific case reproduced above, you believe that most of the data is correct, it's only the offset of the first sequence which would be incorrect.
In which case, it seems one could "just" try all possible valid values of "offset", and see which one regenerates the "right" checksum (this requires a capability to manually apply the wanted sequences, which is not trivial).
There's no guarantee of success though, because the basic premise is that only this first offset value is wrong, but who knows.

terrelln commented on May 3, 2024

decompress/zstd_decompress_block.c: seq: litL=24, matchL=5, offset=239

This is super suspicious. Like Yann said, as long as a dictionary wasn't used, this should be basically impossible for the compressor to generate. This corruption could not be caused by the issue fixed in #3517. I'd start by investigating bitflips. When we investigate issues like this internally, the root cause almost always ends up being bad hardware of some sort.

What is the order of magnitude of the number of compressions you are doing? Thousands, millions, billions, or trillions? Are all the corrupt blobs compressed by the same host? Are you sending the data over the network unencrypted, where bitflips could happen?

I'd first start by looking for bitflips in the frame where the first sequence is corrupted. The first block is 6672 bytes, so the bits for the first offset will be stored right near the end of the block, since they are read in reverse. So I would basically start flipping bits from the end of the block, for say 20 bytes to be safe. And see if any bitflips cause the checksum to succeed.
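
For reference, a minimal sketch of that bit-flip search, assuming the corrupt blob and a way to validate the decompressed output (such as the existing CRC) are available; checksum_ok() is a hypothetical placeholder for that validation, not part of zstd:

#include <stdlib.h>
#include <string.h>
#include <zstd.h>

/* Returns 1 if flipping a single bit in the last `region` bytes of the
 * compressed blob makes it decompress and pass validation, 0 otherwise. */
static int find_single_bitflip(const void* comp, size_t comp_size,
                               void* dst, size_t dst_cap, size_t region,
                               int (*checksum_ok)(const void* data, size_t len))
{
    unsigned char* const tmp = malloc(comp_size);
    if (tmp == NULL || region > comp_size) { free(tmp); return 0; }

    for (size_t byte = comp_size - region; byte < comp_size; byte++) {
        for (int bit = 0; bit < 8; bit++) {
            memcpy(tmp, comp, comp_size);
            tmp[byte] ^= (unsigned char)(1u << bit);        /* flip one bit */
            size_t const dsize = ZSTD_decompress(dst, dst_cap, tmp, comp_size);
            if (!ZSTD_isError(dsize) && checksum_ok(dst, dsize)) {
                free(tmp);
                return 1;       /* candidate single-bit corruption found */
            }
        }
    }
    free(tmp);
    return 0;
}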

Beyond that, I'll have to think a bit about how we can go about debugging this. Maybe revert to v1.2.0 temporarily to see if the issue goes away?

terrelln commented on May 3, 2024

If you think that the issue will eventually reproduce, you could add decompression verification directly after compression. Then when it fails, save both the original data, and the compressed data for debugging.

Then you can see if the issue reproduces, both on the same host, and on other hosts. This will rule out faulty hardware. If it is a deterministic issue that reproduces on another host, and you have the original input data, then I am 100% confident that we can work together to find & fix the issue, even though you can't share the data. At that point, just bisecting the issue to a commit would likely be enough to find the issue.
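
A minimal sketch of that verify-after-compress step, in C; the function shape and names are illustrative assumptions, not the application's actual code:

#include <stdlib.h>
#include <string.h>
#include <zstd.h>

/* Returns the compressed size on success, or 0 if compression failed or the
 * immediate round-trip did not reproduce the input byte-for-byte. */
static size_t compress_and_verify(ZSTD_CCtx* cctx,
                                  void* dst, size_t dst_cap,
                                  const void* src, size_t src_len, int level)
{
    size_t const csize = ZSTD_compressCCtx(cctx, dst, dst_cap, src, src_len, level);
    if (ZSTD_isError(csize)) return 0;

    void* const check = malloc(src_len);
    if (check == NULL) return 0;

    size_t const dsize = ZSTD_decompress(check, src_len, dst, csize);
    int const ok = !ZSTD_isError(dsize) && dsize == src_len
                && memcmp(check, src, src_len) == 0;
    free(check);

    /* On failure, the caller should save both src and dst for offline debugging. */
    return ok ? csize : 0;
}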

terrelln commented on May 3, 2024

If trying out bitflips doesn't work, I highly suggest running a decompression directly after the compression. Then log the original blob if it fails. If you have that blob, and can reproduce the issue & bisect, we will be able to fix the issue. It also has the additional property of validating that the blob is not corrupt.

In the meantime, I will do a bit of digging to see if I can think of anything that could cause this issue. I don't really expect to find anything without being able to reproduce it though.

I have a few questions to narrow my search:

  1. Can you share exactly how you call zstd compression, including all the parameters you set, and all the APIs you use?
  2. Can you print out these variables during the decompression? I will also put up a PR to add them to our debuglogs, for the future.

AV-Coding commented on May 3, 2024

    static ZSTD_CCtx *comp_context = NULL;

    if (comp_context == NULL) {
      // allocate a context to help speed up back to back compressions under the same thread
      comp_context = ZSTD_createCCtx();
      if (comp_context == NULL) {
        err("Failed to allocate ZSTD context\n");
        exit(-1);
      }
    }

    comp_param = 1;
    comp_size = ZSTD_compressCCtx(comp_context, out_buf_p, out_buf_rest, in_buf_p, in_len, comp_param);

zstd/lib//decompress/zstd_decompress_block.c:763: LLtype=2, OFtype=2, MLtype=2

sample.txt

terrelln commented on May 3, 2024
static ZSTD_CCtx *comp_context = NULL;

Are you using this code in a multi-threaded environment? ZSTD_CCtx is not thread-safe, and it looks like this would be using the same context in two different threads.
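
One defensive variant, assuming a C11 (or GCC/Clang) toolchain is available on the 32-bit platform: making the cached context thread-local guarantees each thread gets its own ZSTD_CCtx even if the call site were ever reached from more than one thread. This is a sketch, not the application's code:

#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

static _Thread_local ZSTD_CCtx* comp_context = NULL;   /* one context per thread */

static ZSTD_CCtx* get_thread_cctx(void)
{
    if (comp_context == NULL) {
        comp_context = ZSTD_createCCtx();   /* reused for every compression on this thread */
        if (comp_context == NULL) {
            fprintf(stderr, "Failed to allocate ZSTD context\n");
            exit(-1);
        }
    }
    return comp_context;
}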

terrelln commented on May 3, 2024

Is the system you're compressing on 32-bit or 64-bit?

embg commented on May 3, 2024

@AV-Coding, one note:

It could perhaps still be hardware given both user A and B use the same brand/configuration of hardware components.

Hardware corruption can happen with any pair of compression / decompression setups. Good machines can go bad over time as the hardware ages. Corruption can happen at any point from the compression machine, to various NICs and routers in between the hosts, to issues on the decompression machine.

In addition to @terrelln's suggestion (verify decompression immediately after compression), another way to rule out hardware issues is to add a checksum of the compressed data. If you discover a compressed blob where the compressed checksum matches, but the decompressed checksum fails, then you can rule out a large class of hardware errors.
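
A minimal sketch of that two-checksum classification, using XXH64 from the xxhash library bundled with zstd purely as an example (the thread indicates the real code uses a CRC); the enum and function names are illustrative:

#include <xxhash.h>
#include <zstd.h>

typedef enum { BLOB_OK, BLOB_CORRUPT_AFTER_COMPRESSION, BLOB_BAD_AT_COMPRESSION } blob_status;

static blob_status classify_blob(const void* comp, size_t comp_size,
                                 unsigned long long stored_comp_hash,
                                 void* dst, size_t dst_cap)
{
    /* Compressed bytes changed after they were produced: storage/transport corruption. */
    if (XXH64(comp, comp_size, 0) != stored_comp_hash)
        return BLOB_CORRUPT_AFTER_COMPRESSION;

    /* Compressed bytes are intact, so a decode failure points at compression time. */
    size_t const dsize = ZSTD_decompress(dst, dst_cap, comp, comp_size);
    return ZSTD_isError(dsize) ? BLOB_BAD_AT_COMPRESSION : BLOB_OK;
}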

AV-Coding commented on May 3, 2024

We agree. We have already implemented that second suggestion a couple of years ago by adding an extra CRC over the compressed data. Not all users have that newer level. Unfortunately, neither of the cases with corruption was running this newer level when the corruption was discovered. But one of the two users has recently upgraded, and we too believe it would help us narrow down when the corruption occurred if we get another hit.

The main reason we feel it likely isn't hardware is because each user hit the problem twice for a particular workload that was run at different times. In one case, it was the exact same step in a sequence of millions of data-creation steps where the compressed record became corrupted. This implies there is something about the data in that step that causes the issue. But, it still may be how our software reacts to that particular job and step.

We ran the bit flip test where we flipped the last 32 bits of the compressed block, with no success.
We also ran a test where we replaced the last three bytes with every value from 0x000000 to 0xFFFFFF. The decompression attempts made further progress for thousands of scenarios, getting as far as sequence number 220, but it still broke down. I'm assuming this means the corruption spans more than three bytes.

Perhaps the length being used to find the last byte of the sequence chain is invalid? Is there an eye catcher or pattern we could look at for the next block to see if it starts properly?

embg commented on May 3, 2024

I looked over the thread again, and I'm still unclear on which version produced the offending blobs: was it v1.2.0 or v1.5.2? I do understand that you have experienced the decoder-side issue with both v1.5.2 and v1.5.6, just clarifying what version the encoder was on.

Apologies if this is answered somewhere in the thread, and I simply missed it.

AV-Coding commented on May 3, 2024

@embg, the version where we ran into the offending blobs was 1.5.2. After being unable to decompress the data with v1.5.2, we attempted to decompress with v1.2.0 and v1.5.6, but with no success.

Cyan4973 commented on May 3, 2024

We are thinking it may be some form of context corruption or invalid context reset, like it's using a dictionary from a previous block.

Those were also my thoughts on reading your investigation results.

The 32bit application performing the compression uses a static libzstd.a library which is not multi-threaded yet the application itself is multi-threaded.
But, as mentioned before, the thread which does call the compression function is always the same thread. Though other threads exist, they do not call the ZSTD functions.

Even when the static library is not multi-threaded, it can still be used in a multi-threaded environment: the only restriction is that it's not possible to trigger multi-threaded compression of a single source, which is generally not a problem.

What really matters is that a given ZSTD_CCtx* compression context is only used to compress one session at a time.

Given that there is only one compression event at a time ever possible in the above design, it certainly reduces risks.
It's still possible to mis-manage the sharing of a context between 2 consecutive sessions (assuming a compression context is long-lived across multiple sessions). But the problem would be the same for v1.2.0 or v1.5.2, and it's strange that the issue only starts appearing after the update to v1.5.2, assuming it's not just bad coincidental timing.

AV-Coding commented on May 3, 2024

Here's an update of where we are:

We are still unable to find root cause of this issue, but we have made progress. After further analysis, we have determined that the invalid compressed block with invalid offsets is in fact referencing matches from the previous unrelated block that had used the same context. The invalid offset in the bad block does seem to have a pattern when compared to the previous block. For example, if the invalid offset is 12,288, the data it is attempting to reference is at absolute offset X+12288 within the previous block where X is some multiple of 4KiB.

In our use case, we have a large contiguous buffer broken up into 4KiB indexed segments. Data up to 256KiB arrives into this buffer prior to compression, using one or more of the 4KiB segments. The location depends on where the previous buffer left off, rounded up to the next 4KiB boundary, plus 32 bytes of meta-data not included in the compression payload. So, though back-to-back blocks are not exactly end-to-end, the end of the previous block and the start of the current block can be anywhere from 32 bytes to 4KiB apart. Or, if the end of the buffer is hit, the next block will arrive at the beginning, at an address lower than the previous block. Each block is requested to be compressed using the same context, and this always occurs serially on the same thread. This can go on for millions of blocks.

As of today, we do not call any explicit context reset functions between compression requests. Should we be calling such a function to reset the context?

An additional item we have noticed is that the decompression of the bad block attempts to use a SplitLiteral, while if we compress the same block in a testbed, it does not use a SplitLiteral.
We also see references to DEBUGLOG(6, "invalidating dictionary for current block (distance > windowSize)"); within the good testbed compression. There is a recent change in this area under Fix #3102.
if (blockEndIdx > loadedDictEnd + maxDist [ ADDED || loadedDictEnd != window->dictLimit])
Perhaps it is related?

Overall, it would appear we have some sort of race where the context is too sticky. We see that some days it works fine while the next day the exact same blocks (content) hit the issue, making us believe it might have something to do with the location of the blocks within our large 4KiB segmented buffer.

Cyan4973 commented on May 3, 2024

What you are trying to achieve is unclear to me.

Are you cutting some "large" input into independent blocks of 4 KB, compressed individually?

AV-Coding commented on May 3, 2024

No, blocks up to 256KiB exist within the buffer on 4KiB boundaries. Each block will use one or more 4KiB segments as needed, and the entire block (up to 256KiB) is sent to the compression API as a single request.

The details above are attempting to explain how it's a shared buffer and blocks (up to 256KiB) are located in this shared buffer end to end on 4KiB address boundaries.

Cyan4973 commented on May 3, 2024

So, you are compressing independent inputs of up to 256 KB, and sizes are always a multiple of 4 KB?
And each input is presented as a single continuous buffer?
And the buffer itself is shared across (single-threaded) compression sessions?

AV-Coding commented on May 3, 2024

Yes, though the length isn't always divisible by 4KiB. We just round up to the next 4KiB boundary for the next block. So, the gap from the end of the previous block to the start of the next block is < 4KiB.

Cyan4973 commented on May 3, 2024

This part is much less clear to me.

It seems you employ the word "block" to mean "independent inputs"?

AV-Coding commented on May 3, 2024

Each block is an independent input of contiguous data up to 256KiB in size. Multiple blocks reside in a contiguous address/memory space on 4KiB boundaries. They are each independently compressed sequentially. The context of the bad block is sometimes sticky and uses offsets into the previous block. Though each block (up to 256KiB in size) is requested to be compressed in a single request, the previous blocks likely still exist in valid memory space just ahead of the current block. It simply depends on timing and whether that portion of the buffer has been reused by the time we ask ZSTD to compress the next one.
[ block 1 ] [pad] [block 2] [pad] [block 3] [pad] [block 4] .... and so on.
ZSTD_compressCCtx( contextA, out_buf, out_buf_len, &block1, length_block1, comp_param);
// move out_buf contents elsewhere
// Update the buffer state to state the block 1 space can be reused. (4K to 256KiB of space).
ZSTD_compressCCtx( contextA, out_buf, out_buf_len, &block2, length_block2, comp_param);
// move out_buf contents elsewhere
// Update the buffer state to state the block 2 space can be reused. (4K to 256KiB of space).
ZSTD_compressCCtx( contextA, out_buf, out_buf_len, &block3, length_block3, comp_param); <----- bad compressed block that compresses half as well as it should and references data via invalid offsets from block2.

Cyan4973 commented on May 3, 2024

ZSTD_compressCCtx() is designed to be a fully complete operation,
meaning it will reset the compression state even if it was used in a previously unfinished operation,
and it will properly close the operation when successful.

When compression is unsuccessful, the resulting state of contextA is less clear, so it should be reset.
But just invoking ZSTD_compressCCtx() should be enough to clear the state and render it valid for the new compression job, because it starts with a reset operation.
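
For reference, a minimal sketch of such an explicit reset between sessions, using the public ZSTD_CCtx_reset() API available since v1.4.0; where exactly the application would call it is an assumption:

#include <zstd.h>

static void reset_between_blocks(ZSTD_CCtx* cctx)
{
    /* Drop any state left over from the previous compression, parameters included;
     * ZSTD_reset_session_only would keep the parameters. */
    size_t const r = ZSTD_CCtx_reset(cctx, ZSTD_reset_session_and_parameters);
    (void)r;   /* only fails if a compression is still in progress */
}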

So, indeed, invoking ZSTD_compressCCtx() repetitively on a set of input / output buffers, even overlapping ones, should be successful. It actually doesn't matter if these buffers are separate, or contiguous, or the same, or partially overlapping. All the guarantees still apply.

Therefore, it's weird that a call to ZSTD_compressCCtx() would reference data from a previous input completed using the same contextA state. If that's the case, there's something seriously wrong there.

Let's also be clear that this is a scenario for which ZSTD_compressCCtx() is already heavily tested, in multiple CI environments, and so far it has been working fine. So it would be surprising if some blatant lingering bug were still present there.

Unfortunately, it's hard to tell more without access to a reproducible test case. As soon as a scenario can reliably reproduce the problem, we'll be able to analyze it and create a fix for it.

One problem is that the bug is observed at decompression time, which can be much later than compression time, thus obscuring the conditions required to trigger it. One possibility would be to confirm each compression job by starting a validation decompression stage right after it, thus detecting the problem at the moment it's created.
Also, since the bug seems related to a sequence of compressions, it would be useful to log all compression events, as a way to rebuild the history and re-create the same sequence of compressions when the bug is detected.
Thanks to these traces and immediate corruption detection, it should be possible to recreate the exact same scenario, and observe if it fails reliably or not.
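
A minimal sketch of such a compression-event trace; the field choices (offset within the shared buffer, input length, an XXH64 fingerprint of the input, compressed size) are illustrative assumptions, and xxhash is the library bundled with zstd:

#include <stdio.h>
#include <xxhash.h>
#include <zstd.h>

/* Append one line per compression request so the exact sequence can be replayed later. */
static void log_compression_event(FILE* trace, const void* buf_base,
                                  const void* src, size_t src_len, size_t comp_result)
{
    fprintf(trace, "off=%zu len=%zu xxh64=%016llx csize=%zu err=%u\n",
            (size_t)((const char*)src - (const char*)buf_base),   /* position in shared buffer */
            src_len,
            (unsigned long long)XXH64(src, src_len, 0),           /* fingerprint of the input */
            comp_result,
            ZSTD_isError(comp_result));
}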

But if the bug tends to be "random", meaning it mostly works, but sometimes it fails unpredictably at a very low rate, and no scenario can reliably fail or succeed, then this symptom is also compatible with a hardware failure. This is rare, but not unheard of, and essentially guaranteed to happen once the fleet size becomes "large enough" (we are regularly confronted with hardware failure scenarios in our working environment, just due to its scale).
