Comments (8)

Bodigrim commented on June 5, 2024

This is actually a problem with UTF-8 decoding in bytestring, the issue with text is just a consequence:

import qualified Data.ByteString as BS

main :: IO ()
main = do
  -- This byte has the high bit set and thus is not valid UTF-8 by itself
  let simpleInvalidUtf8 = BS.pack [216]

  -- Correctly says False
  print $ BS.isValidUtf8 simpleInvalidUtf8

  -- Pad the invalid character with spaces
  let paddedInvalidUtf8 = BS.pack $ replicate 127 32 <> [216] <> replicate 128 32

  -- Incorrectly says True
  print $ BS.isValidUtf8 paddedInvalidUtf8

CC @kozross
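
To cross-check results like the ones above independently of the optimized code path, a slow byte-at-a-time validator can help. The following is a minimal sketch, not bytestring's implementation: the name `isValidUtf8Ref` is hypothetical, and it works on a plain list of bytes (for example the result of `BS.unpack`) using the well-formed byte ranges from the Unicode standard.

```haskell
import Data.Word (Word8)

-- | Slow reference check for well-formed UTF-8: rejects stray continuation
-- bytes, truncated sequences, overlong forms, surrogates and code points
-- above U+10FFFF. Hypothetical helper, intended only for cross-checking.
isValidUtf8Ref :: [Word8] -> Bool
isValidUtf8Ref [] = True
isValidUtf8Ref (b0 : rest)
  | b0 <= 0x7F = isValidUtf8Ref rest
  | b0 >= 0xC2 && b0 <= 0xDF = expect [(0x80, 0xBF)] rest
  | b0 == 0xE0 = expect [(0xA0, 0xBF), (0x80, 0xBF)] rest
  | b0 >= 0xE1 && b0 <= 0xEC = expect [(0x80, 0xBF), (0x80, 0xBF)] rest
  | b0 == 0xED = expect [(0x80, 0x9F), (0x80, 0xBF)] rest
  | b0 >= 0xEE && b0 <= 0xEF = expect [(0x80, 0xBF), (0x80, 0xBF)] rest
  | b0 == 0xF0 = expect [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)] rest
  | b0 >= 0xF1 && b0 <= 0xF3 = expect [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)] rest
  | b0 == 0xF4 = expect [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)] rest
  | otherwise = False
  where
    -- Each pair is the allowed inclusive range for the next trailing byte.
    expect :: [(Word8, Word8)] -> [Word8] -> Bool
    expect [] bs = isValidUtf8Ref bs
    expect ((lo, hi) : rs) (b : bs)
      | b >= lo && b <= hi = expect rs bs
    expect _ _ = False

-- | The padded input from the reproducer, as a list of bytes.
paddedInvalid :: [Word8]
paddedInvalid = replicate 127 32 <> [216] <> replicate 128 32
```

Here `isValidUtf8Ref [216]` and `isValidUtf8Ref paddedInvalid` both evaluate to `False`, which is the answer the optimized validator should also give.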

from bytestring.

kozross commented on June 5, 2024

@Bodigrim - that's my suspicion as well. I'll take a look when I can.


Bodigrim commented on June 5, 2024

@fatho I can reproduce your example, but the root cause is the same. When text parses a bytestring that is reported as invalid, it parses the first byte on its own (which succeeds) and then retries with the rest (which leaves us with 127+1+128 bytes, and that again succeeds due to the bug).


Bodigrim commented on June 5, 2024

I'm currently on ARM, so I'm referring to cbits/aarch64/is-valid-utf8.c, but the AVX2 implementation (which @fatho is running) has the same structure.

The input string is structured such that (a) the first chunk is a valid partial input, ending with the start of a multibyte sequence, and (b) the second chunk is entirely ASCII.

// Check if we have ASCII
if (is_ascii(inputs)) {
  // Prev_first_len cheaply.
  prev_first_len = vqtbl1q_u8(first_len_tbl, vshrq_n_u8(inputs[3], 4));
} else {

I don't quite comprehend this, but my suspicion is that we are so happy to consume a full chunk of ASCII bytes that we eagerly forget the decoding state at which the previous chunk has ended.
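
The suspicion above can be illustrated with a small pure model. This is only a sketch under stated assumptions, not the C code: it assumes 64-byte chunks, tracks nothing but the number of continuation bytes still pending (overlong forms and surrogates are ignored), and all names (`followers`, `step`, `validateChunked`) are hypothetical. The `Bool` argument selects whether the ASCII fast path checks the pending state or silently drops it.

```haskell
import Control.Monad (foldM)
import Data.Word (Word8)

-- | Continuation bytes requested by a lead byte, or Nothing if the byte
-- cannot start a sequence. Simplified on purpose.
followers :: Word8 -> Maybe Int
followers b
  | b < 0x80 = Just 0
  | b >= 0xC2 && b <= 0xDF = Just 1
  | b >= 0xE0 && b <= 0xEF = Just 2
  | b >= 0xF0 && b <= 0xF4 = Just 3
  | otherwise = Nothing

-- | Consume one byte, given how many continuation bytes are still pending.
step :: Int -> Word8 -> Maybe Int
step 0 b = followers b
step n b
  | b >= 0x80 && b <= 0xBF = Just (n - 1)
  | otherwise = Nothing

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- | Validate chunk by chunk. With checkPending = False, an all-ASCII chunk
-- resets the state without looking at it (the suspected bug).
validateChunked :: Bool -> Int -> [Word8] -> Bool
validateChunked checkPending size = go 0 . chunksOf size
  where
    go pending [] = pending == 0
    go pending (c : cs)
      | all (< 0x80) c =
          if checkPending && pending /= 0 then False else go 0 cs
      | otherwise = case foldM step pending c of
          Nothing -> False
          Just pending' -> go pending' cs

-- | Inputs from the thread: the invalid byte preceded by 127 or 128 spaces.
padded127, padded128 :: [Word8]
padded127 = replicate 127 32 <> [216] <> replicate 128 32
padded128 = replicate 128 32 <> [216] <> replicate 128 32
```

Under these assumptions, `validateChunked False 64 padded127` evaluates to `True` (the invalid byte is the last one in its chunk, and the following ASCII chunk discards the pending state), whereas `validateChunked True 64 padded127` evaluates to `False`. With `padded128` the invalid byte is followed by a space inside the same chunk, so both variants return `False`.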


fatho commented on June 5, 2024

Thanks for looking into it! I did some further experimentation, and it seems that there is something else going on, besides the chunk-boundary case you suspect in bytestring.

In both reproducers, I increased the number of leading padding spaces to 128.

Now, the following bytestring reproducer prints False twice, as expected (consistent with your observation).

import qualified Data.ByteString as BS

main :: IO ()
main = do
  -- This byte has the high bit set and thus is not valid UTF-8 by itself
  let simpleInvalidUtf8 = BS.pack [216]

  -- Correctly says False
  print $ BS.isValidUtf8 simpleInvalidUtf8

  -- Pad the invalid character with spaces
  let paddedInvalidUtf8 = BS.pack $ replicate 128 32 <> [216] <> replicate 128 32

  -- With 128 leading spaces, this now correctly says False
  print $ BS.isValidUtf8 paddedInvalidUtf8

However, this reproducer using the text library still claims to have decoded the second invalid string:

import Control.Exception
import qualified Data.Text.Encoding as Enc
import qualified Data.ByteString as BS

main :: IO ()
main = do
  -- This byte has the high bit set and thus is not valid UTF-8 by itself
  let simpleInvalidUtf8 = BS.pack [216]

  -- Correctly prints error
  testDecoding simpleInvalidUtf8

  -- Pad the invalid character with spaces
  let paddedInvalidUtf8 = BS.pack $ replicate 128 32 <> [216] <> replicate 128 32

  testDecoding paddedInvalidUtf8


testDecoding :: BS.ByteString -> IO ()
testDecoding input = do
  let
    decode = do
      result <- evaluate $ Enc.decodeUtf8 input
      putStrLn $ "Decoded " <> show input <> " to " <> show result

  catch decode $ \exc -> do
    putStrLn $ "Decoding " <> show input <> " failed with: " <> show (exc :: SomeException)

That behavior persists even with more padding, but the amount of padding in my initial example was the minimum needed to trigger this bug.

Is that something you can reproduce as well?


fatho commented on June 5, 2024

@Bodigrim Thanks for your explanation. The "parse one byte, then retry the rest" part sounded like an accidentally quadratic algorithm to me, so I wrote a reproducer for it and opened a separate issue so as not to derail this thread: haskell/text#495


Bodigrim commented on June 5, 2024

@kozross could you possibly take a look at this? I prepared regression tests in #578.


kozross commented on June 5, 2024

@Bodigrim - sorry for my lack of response: work's been busy. I've got approval for two full workdays to chase this, but I'm just stitching up something more urgent. I'll let you know once I'm available to deal with this.

Thanks for the regression tests - they'll help get this sorted.

