Comments (4)

felixhandte commented on April 28, 2024

Is it possible that the input buffer is being modified during the compression operation? That would explain this symptom, I think. This could happen either if some other thread is touching the input buffer, or if the CCtx's internal buffers partly overlap the input buffer, so that Zstd accidentally mutates the input by writing into its CCtx.

Is the code snippet you provided literally the reproduction case you've discovered? Can you describe the environment a little more? Are there other threads running in your program?
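
One way to test the hypothesis that the input buffer changes during compression is to fingerprint the input before and after the suspect call. The following is a minimal sketch, not zstd code: `fnv1a64`, `operation_fn`, `input_unchanged_across`, and the two toy operations are all hypothetical names introduced for illustration. In a real program the callback would wrap the actual compression call. Note that this only detects a modification that is still visible after the call returns; a transient write by another thread that is later reverted would go unnoticed.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a hash, used here only as a cheap fingerprint of the input. */
static uint64_t fnv1a64(const void* data, size_t size)
{
    const unsigned char* p = (const unsigned char*)data;
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    for (size_t i = 0; i < size; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;              /* FNV prime */
    }
    return h;
}

/* Hypothetical callback type: wraps whatever call is under suspicion. */
typedef void (*operation_fn)(const void* src, size_t srcSize);

/* Returns 1 if src hashes the same before and after op runs, else 0. */
static int input_unchanged_across(operation_fn op, const void* src, size_t srcSize)
{
    uint64_t const before = fnv1a64(src, srcSize);
    op(src, srcSize);
    return fnv1a64(src, srcSize) == before;
}

/* Two toy operations, for illustration only. */
static void well_behaved_op(const void* src, size_t srcSize)
{
    (void)src; (void)srcSize;   /* reads nothing, writes nothing */
}

static void buggy_op(const void* src, size_t srcSize)
{
    /* Simulates an accidental write into the caller's input buffer. */
    if (srcSize > 0) ((unsigned char*)src)[0] ^= 0xFF;
}
```

If the check fails around the real compression call in a single-threaded run, overlapping buffers become the more likely explanation; if it only fails with other threads active, a data race is the first thing to look at.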

from zstd.

Cyan4973 commented on April 28, 2024

Thanks for the detailed debugging.

Similar scenarios are supposed to be abundantly tested and fuzzed, so I'm surprised such an obvious bug would be able to pass through. For example, the fact that memory is allocated with malloc, and therefore is not initialized, is not a bug: the initialization process is supposed to be cognizant of this fact, and adjust accordingly, zeroing some memory segments only when necessary.
In the case of the FSE tables, it should not be necessary: we expect these tables to be written to first, during the block statistics stage. So even if this memory contains garbage data, it should not matter.

Anyway, another important detail is that v1.5.2 is > 2 years old, and our code base has evolved since.

Would you be able to repeat the experiment using the current source code in the dev branch?

irelia97 commented on April 28, 2024

I don't know how, but it does happen in my environment, and I can't reproduce it on the dev branch (even in my other dev environment). It's strange.

A few hours ago I was able to narrow it down further: the zc->blockState.nextCBlock block is the problem. If I use memset to clear only this memory, rather than the whole workspace, the program also completes the compression normally.

// zstd_compress.c : 1910
zc->blockState.nextCBlock = (ZSTD_compressedBlockState_t*) ZSTD_cwksp_reserve_object(ws, sizeof(ZSTD_compressedBlockState_t));
RETURN_ERROR_IF(zc->blockState.nextCBlock == NULL, memory_allocation, "couldn't allocate nextCBlock");
//memset(zc->blockState.nextCBlock, 0, ZSTD_cwksp_align(sizeof(ZSTD_compressedBlockState_t), sizeof(void*)));

So I guess this junk value might be playing some toxic role in the ZSTD_buildSequencesStatistics -> ZSTD_buildCTable -> FSE_buildCTable_wksp call chain. But this code is so complicated that I can't understand what it does.

/UPDATE/
I have found the real problem now. In the last loop of FSE_buildCTable_wksp, we calculate deltaNbBits and deltaFindState for each element of the symbolTT array. But when normalizedCounter[s] == 0, the branch only calculates deltaNbBits, so deltaFindState may keep a garbage value. Then, when the program reaches the FSE_encodeSymbol function (in fse.h), it gets a segmentation fault at this line:
statePtr->value = stateTable[ (statePtr->value >> nbBitsOut) + symbolTT.deltaFindState];
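
To make the observation above concrete, here is a simplified, self-contained sketch modeled on the loop as described. It is not zstd's source: `SymbolTransform`, `highbit32`, and `build_symbol_transforms` are hypothetical names, and the exact constants are assumptions written from the description. It only shows that a zero-count symbol keeps whatever bytes were already in its `deltaFindState` field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for FSE's per-symbol transform entry. */
typedef struct {
    int      deltaFindState;
    uint32_t deltaNbBits;
} SymbolTransform;

/* Position of the highest set bit; v must be non-zero. */
static unsigned highbit32(uint32_t v)
{
    unsigned r = 0;
    assert(v != 0);
    while (v >>= 1) r++;
    return r;
}

/* Fills tt[0..maxSymbol] from normalized counts (expected to sum to
 * 1 << tableLog). A zero count writes deltaNbBits only, as described. */
static void build_symbol_transforms(SymbolTransform* tt, const short* norm,
                                    unsigned maxSymbol, unsigned tableLog)
{
    unsigned total = 0;
    for (unsigned s = 0; s <= maxSymbol; s++) {
        switch (norm[s]) {
        case 0:
            tt[s].deltaNbBits = ((tableLog + 1) << 16) - (1u << tableLog);
            /* deltaFindState is left untouched: stale memory survives here */
            break;
        case -1:
        case 1:
            tt[s].deltaNbBits = (tableLog << 16) - (1u << tableLog);
            tt[s].deltaFindState = (int)total - 1;
            total++;
            break;
        default: {
            uint32_t const count = (uint32_t)norm[s];
            uint32_t const maxBitsOut = tableLog - highbit32(count - 1);
            uint32_t const minStatePlus = count << maxBitsOut;
            tt[s].deltaNbBits = (maxBitsOut << 16) - minStatePlus;
            tt[s].deltaFindState = (int)total - (int)count;
            total += count;
            break;
        }
        }
    }
}
```

Pre-filling the table with a recognizable byte pattern before calling the builder makes the untouched field easy to spot in a debugger. Whether that stale value is ever read depends on whether a zero-count symbol can actually reach the encoder.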

Cyan4973 commented on April 28, 2024

> when normalizedCounter[s] == 0,

it means the symbol s should not be present at all, not even once.
If s is nonetheless found as part of the input, then indeed there will be a pretty big problem: it's a non-codable event.

But this should not happen, because prior to calculating the tables, the process starts by histogramming the whole input. So no symbol should be missing. Only symbols which are confirmed absent will receive a weight of 0.
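
The invariant described here (every symbol that appears in the input has a non-zero weight) can be checked directly while debugging. Below is a minimal sketch of such a check, assuming byte-sized symbols; `all_symbols_codable` is a hypothetical helper, not a zstd function.

```c
#include <stddef.h>

/* Hypothetical debugging helper, not part of zstd.
 * Returns 1 if every symbol in src is within [0, maxSymbol] and has a
 * non-zero normalized count, i.e. is codable; returns 0 otherwise. */
static int all_symbols_codable(const unsigned char* src, size_t srcSize,
                               const short* norm, unsigned maxSymbol)
{
    for (size_t i = 0; i < srcSize; i++) {
        if (src[i] > maxSymbol) return 0;   /* symbol outside the table */
        if (norm[src[i]] == 0) return 0;    /* present but weighted 0 */
    }
    return 1;
}
```

If a check like this fails just before encoding, the input seen by the encoder differs from the input that was histogrammed, which points back toward the buffer being modified or the statistics belonging to a different block.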

This is a very blatant issue, and the issue board would have witnessed mountains of segv and support requests if it was present in an earlier release such as v1.5.2.
