
isValidUtf8 is broken (bytestring, closed)

sol commented on June 11, 2024
isValidUtf8 is broken

from bytestring.

Comments (4)

Bodigrim commented on June 11, 2024

Yeah, that's my conclusion as well (I've been looking at aarch version). Seems wrapping lookahead manipulation in if(big_strides) should be enough.

The GHC team has asked for any version bumps to be included in GHC 9.4.8 to be ready by Friday: https://mail.haskell.org/pipermail/ghc-devs/2023-October/021420.html. We probably want to fix this and backport to bytestring-0.11.

@clyring if you get time to put up a PR, feel free to merge and release without my approval; I'll be AFK until next week.


sol commented on June 11, 2024

For the minimal case isValidUtf8 (drop 1 "\128\0"):

  • both is_valid_utf8_avx2 and is_valid_utf8_ssse3 are broken
  • while is_valid_utf8_sse2 and is_valid_utf8_fallback seem to work
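A rough sketch of why this input makes a good probe, assuming the literal denotes the two bytes 0x80 0x00: the one-byte slice left after drop 1 is valid UTF-8, but the byte just before it is a stray continuation byte. The validator below, utf8_valid_scalar, is a hypothetical stand-in written for illustration; it is not bytestring's is_valid_utf8_fallback.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar UTF-8 check, for illustration only.
   Returns 1 if p[0..len) is well-formed UTF-8, 0 otherwise. */
static int utf8_valid_scalar(const uint8_t *p, size_t len) {
  size_t i = 0;
  while (i < len) {
    uint8_t b = p[i];
    size_t n;          /* continuation bytes expected after the lead byte */
    uint32_t cp, min;  /* decoded code point and smallest non-overlong value */
    if (b < 0x80) { i++; continue; }
    else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
    else return 0;     /* stray continuation byte or invalid lead byte */
    if (len - i <= n) return 0;  /* sequence is cut off by the end of input */
    for (size_t k = 1; k <= n; k++) {
      uint8_t c = p[i + k];
      if ((c & 0xC0) != 0x80) return 0;
      cp = (cp << 6) | (c & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    i += n + 1;
  }
  return 1;
}
```

With buf = {0x80, 0x00}, utf8_valid_scalar(buf + 1, 1) accepts the slice, while utf8_valid_scalar(buf, 2) rejects it. A validator that starts reading one byte too early would therefore report the slice as invalid.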


clyring commented on June 11, 2024

I see the problem. Here is the relevant bit of the avx2 code:

  // 'Roll back' our pointer a little to prepare for a slow search of the rest.
  uint32_t tokens_blob = _mm256_extract_epi32(prev_input, 7);
  int8_t const *tokens = (int8_t const *)&tokens_blob;
  ptrdiff_t lookahead = 0;
  if (tokens[3] > (int8_t)0xBF) {
    lookahead = 1;
  } else if (tokens[2] > (int8_t)0xBF) {
    lookahead = 2;
  } else if (tokens[1] > (int8_t)0xBF) {
    lookahead = 3;
  }
  uint8_t const *const small_ptr = ptr - lookahead;
  size_t const small_len = remaining + lookahead;
  return is_valid_utf8_fallback(small_ptr, small_len);

When the input is too small for a 128-byte big stride, we reach this code with prev_input containing all zeros.

This lookahead fiddling is meant to ensure we do not call the fallback after just part of a multi-byte code point. But for some reason it looks for any non-continuation-byte in prev_input rather than specifically for valid first bytes of a multi-byte code point. And zero is not a continuation byte, so we end up checking if [start-1,end) is valid rather than [start,end) as intended.
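The rollback logic can be isolated from the SIMD code to make the failure concrete. The sketch below is not the actual patch. lookahead_quoted mirrors the comparisons in the snippet above; lookahead_guarded follows the suggestion of skipping the rollback when no big stride ran (the big_strides flag is assumed here); lookahead_lead_bytes is a hypothetical alternative that rolls back only when a lead byte starts a sequence that cannot finish within the last three bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Same comparisons as the quoted code: as int8_t, every byte that is not a
   continuation byte (0x80..0xBF) compares greater than (int8_t)0xBF. */
static ptrdiff_t lookahead_quoted(const int8_t tokens[4]) {
  if (tokens[3] > (int8_t)0xBF) return 1;
  if (tokens[2] > (int8_t)0xBF) return 2;
  if (tokens[1] > (int8_t)0xBF) return 3;
  return 0;
}

/* Assumed guard: roll back only if at least one big stride was processed,
   so that tokens really holds bytes preceding ptr. */
static ptrdiff_t lookahead_guarded(const int8_t tokens[4], int big_strides) {
  return big_strides ? lookahead_quoted(tokens) : 0;
}

/* Hypothetical alternative: roll back only onto a lead byte whose
   multi-byte sequence is still incomplete at the end of tokens. */
static ptrdiff_t lookahead_lead_bytes(const uint8_t tokens[4]) {
  if (tokens[3] >= 0xC0) return 1;  /* 2-, 3- or 4-byte lead in last position */
  if (tokens[2] >= 0xE0) return 2;  /* 3- or 4-byte lead, one byte seen after it */
  if (tokens[1] >= 0xF0) return 3;  /* 4-byte lead, two bytes seen after it */
  return 0;
}
```

For an all-zero tokens array, lookahead_quoted returns 1, which is the off-by-one described above, while the other two variants return 0. Both variants are assumptions for illustration; the fix that was actually released may differ.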

I'll put up a PR soon.


clyring commented on June 11, 2024

It seems I lack the authority to circumvent the approval requirement before merging to master.

Well, I pushed to bytestring-0.11 and cut a release anyway.
