Comments (4)
Yeah, that's my conclusion as well (I've been looking at aarch version). Seems wrapping `lookahead` manipulation in `if(big_strides)` should be enough.
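A minimal sketch of what such a guard could look like, written as a standalone scalar helper rather than the actual bytestring source. The function name `rollback_bytes` and its signature are hypothetical, and `big_strides` is assumed here to be a count of the full SIMD strides that were actually processed:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the bytestring source.
 * tokens:      the last four bytes seen by the SIMD loop (all zeros if the
 *              loop never ran).
 * big_strides: assumed to be the number of full SIMD strides processed.
 * Returns how many bytes to roll back before calling the scalar fallback. */
ptrdiff_t rollback_bytes(const int8_t tokens[4], size_t big_strides) {
    ptrdiff_t lookahead = 0;
    if (big_strides > 0) {
        /* Only inspect the previous bytes when they are real input. */
        if (tokens[3] > (int8_t)0xBF) {
            lookahead = 1;
        } else if (tokens[2] > (int8_t)0xBF) {
            lookahead = 2;
        } else if (tokens[1] > (int8_t)0xBF) {
            lookahead = 3;
        }
    }
    return lookahead;
}
```

With all-zero tokens and no stride processed, this returns 0, so the fallback would start at the intended pointer instead of one byte before it.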
GHC team has asked for any version bumps to be included in GHC 9.4.8 to be ready by Friday: https://mail.haskell.org/pipermail/ghc-devs/2023-October/021420.html. We probably want to fix this and backport to `bytestring-0.11`.
@clyring if you get time to put up a PR, feel free to merge and release without my approval, I'll be AFK until next week.
from bytestring.
For the minimal case `isValidUtf8 (drop 1 "\128\0")`:
- both `is_valid_utf8_avx2` and `is_valid_utf8_ssse3` are broken
- while `is_valid_utf8_sse2` and `is_valid_utf8_fallback` seem to work
from bytestring.
I see the problem. Here is the relevant bit of the `avx2` code:
    // 'Roll back' our pointer a little to prepare for a slow search of the rest.
    uint32_t tokens_blob = _mm256_extract_epi32(prev_input, 7);
    int8_t const *tokens = (int8_t const *)&tokens_blob;
    ptrdiff_t lookahead = 0;
    if (tokens[3] > (int8_t)0xBF) {
      lookahead = 1;
    } else if (tokens[2] > (int8_t)0xBF) {
      lookahead = 2;
    } else if (tokens[1] > (int8_t)0xBF) {
      lookahead = 3;
    }
    uint8_t const *const small_ptr = ptr - lookahead;
    size_t const small_len = remaining + lookahead;
    return is_valid_utf8_fallback(small_ptr, small_len);
When the input is too small for a 128-byte big stride, we reach this code with `prev_input` containing all zeros.
This `lookahead` fiddling is meant to ensure we do not call the fallback after just part of a multi-byte code point. But for some reason it looks for any non-continuation-byte in `prev_input` rather than specifically for valid first bytes of a multi-byte code point. And zero is not a continuation byte, so we end up checking if [start-1,end) is valid rather than [start,end) as intended.
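To make the distinction concrete, here is a small scalar sketch that contrasts the byte test used in the snippet with a stricter "lead byte" test. It is an illustration only, not the fix that went into bytestring; the names `is_not_continuation`, `is_multibyte_lead`, `lookahead_with` and `check_minimal_case` are made up for this example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The test from the snippet: true for every byte outside 0x80..0xBF,
 * which includes ASCII and NUL. Relies on the usual two's complement
 * conversion to int8_t, as the original comparison does. */
bool is_not_continuation(uint8_t b) {
    return (int8_t)b > (int8_t)0xBF;
}

/* Hypothetical stricter test: true only for bytes that can begin a
 * multi-byte code point (valid lead bytes are 0xC2..0xF4). */
bool is_multibyte_lead(uint8_t b) {
    return b >= 0xC2 && b <= 0xF4;
}

typedef bool (*byte_test)(uint8_t);

/* Same if/else chain as the snippet, with the byte test made pluggable. */
ptrdiff_t lookahead_with(const uint8_t tokens[4], byte_test test) {
    if (test(tokens[3])) return 1;
    if (test(tokens[2])) return 2;
    if (test(tokens[1])) return 3;
    return 0;
}

/* Minimal case from the thread: the buffer is "\128\0" and validation
 * starts at offset 1, so the intended range is the single NUL byte. */
void check_minimal_case(void) {
    const uint8_t buf[2] = {0x80, 0x00};
    const uint8_t zero_tokens[4] = {0, 0, 0, 0}; /* prev_input never filled */
    const uint8_t *start = buf + 1;

    ptrdiff_t as_written = lookahead_with(zero_tokens, is_not_continuation);
    ptrdiff_t stricter = lookahead_with(zero_tokens, is_multibyte_lead);

    assert(as_written == 1);               /* rolls back to buf[0] */
    assert(*(start - as_written) == 0x80); /* a stray continuation byte */
    assert(stricter == 0);                 /* stays at the intended start */
}
```

With the test as written, the all-zero tokens produce a rollback of one byte, so the fallback sees the stray `0x80` at the front and rejects input that is valid over the intended range.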
I'll put up a PR soon.
from bytestring.
It seems I lack the authority to circumvent the approval requirement before merging to `master`.
Well, I pushed to `bytestring-0.11` and cut a release anyway.
from bytestring.