Comments (12)
Why would the gb18030 encoder algorithm ever hit the gb18030 ranges index for U+8000? It's in the gb18030 index.
from encoding.
Well then, here is another example of a problematic code point, and this time it doesn't appear in "gb18030 index": U+E5E5.
58853 ---> 19043, 0x9FA6 ---> 19043 + 58853 - 40870 ---> 37026
but:
37026 ---> 33550, 0xE865 ---> 59493 + 37026 - 33550 ---> 62969
and 62969 (U+F5F9) differs from 58853 (U+E5E5).
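The arithmetic above can be reproduced with a small sketch. It assumes a table truncated to the two "index gb18030 ranges" rows quoted in this comment (the full table has many more rows), and the function names are hypothetical. It also omits the other checks the real decoder performs, so treat it as an illustration of the round-trip mismatch rather than the standard's algorithm.

```python
# Only the two (pointer offset, code point offset) rows quoted in this
# thread; the real "index gb18030 ranges" table is much longer.
RANGES = [(19043, 0x9FA6), (33550, 0xE865)]


def ranges_pointer(code_point):
    """Encoder side: pick the row with the largest code point offset
    that is <= code_point, then add the distance from that offset."""
    rows = [row for row in RANGES if row[1] <= code_point]
    if not rows:
        raise ValueError("code point is below every row in the table")
    offset_pointer, offset_cp = max(rows, key=lambda row: row[1])
    return offset_pointer + code_point - offset_cp


def ranges_code_point(pointer):
    """Decoder side: pick the row with the largest pointer offset
    that is <= pointer, then add the distance from that offset."""
    rows = [row for row in RANGES if row[0] <= pointer]
    if not rows:
        raise ValueError("pointer is below every row in the table")
    offset_pointer, offset_cp = max(rows, key=lambda row: row[0])
    return offset_cp + pointer - offset_pointer


pointer = ranges_pointer(0xE5E5)          # 19043 + 58853 - 40870
round_trip = ranges_code_point(pointer)   # 59493 + 37026 - 33550
print(pointer, hex(round_trip))           # 37026 0xf5f9
```

The encoder selects a row by code point and the decoder selects a row by pointer, so a code point that falls in a gap of the table (as U+E5E5 does here) can land on a pointer that belongs to a different row on the way back.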
Hmm, not immediately sure why that would fail. I assume the decoder/encoder has the same problem here? Doesn't seem like the byte conversion part should matter much.
@r12a any ideas? It seems that browsers mostly pass your tests at http://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en.
I didn't test all of the BMP for gb18030, just selected ranges, so I missed that particular character. However, there are now two new tests that do cover it, and they find it (and only that one in the PUA) to be problematic in Firefox and Chrome (I didn't try any others).
I won't integrate these tests fully into the i18n test suite until we decide what needs to be done about this.
I don't have any clear idea why this fails, although Peter's suggestion seems plausible (note that the problem character is the first one in the PUA that isn't in the index).
(http://r12a.github.io/apps/encodings/ also shows the behaviour @peteroupc describes)
Also, if #22 does not change the mapping for 0x8135F437 from U+1E3F to U+E7C7, another code point would have the problem.
@r12a I'm confused by the outcome of your tests since it suggests browsers simply do not encode U+E5E5 at all despite gb18030 supposedly being a UTF (both Chrome and Firefox emit an "HTML entity").
For Firefox, this is intentional because some sites expected a space for gbk 0xA3A0.
https://bugzilla.mozilla.org/show_bug.cgi?id=131837
But this bug is archaic. I'm fine with changing back the mapping to U+E5E5 to align with GB18030 spec/IE/Edge.
Ah, that is the problem. "This matches the GB18030-2000 standard for code points encoded as two bytes, except for 0xA3 0xA0 which maps to U+3000 to be compatible with deployed content." Except we never took care of mapping U+E5E5 to an error in the encoder.
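A minimal sketch of the guard described here, assuming the same truncated two-row ranges table quoted earlier in this thread. `EncoderError`, `ranges_pointer`, and `four_byte_pointer` are hypothetical names for illustration, not the standard's wording.

```python
class EncoderError(ValueError):
    """Hypothetical error type for a code point the encoder rejects."""


# Two rows of "index gb18030 ranges" quoted earlier in this thread;
# the real table has many more rows.
RANGES = [(19043, 0x9FA6), (33550, 0xE865)]


def ranges_pointer(code_point):
    # Row with the largest code point offset that is <= code_point.
    offset_pointer, offset_cp = max(
        (row for row in RANGES if row[1] <= code_point),
        key=lambda row: row[1],
    )
    return offset_pointer + code_point - offset_cp


def four_byte_pointer(code_point):
    # The decoder maps 0xA3 0xA0 to U+3000, so U+E5E5 has no two-byte
    # entry. Without this guard it would fall through to the ranges
    # lookup and produce a pointer that decodes to a different code point.
    if code_point == 0xE5E5:
        raise EncoderError(f"U+{code_point:04X} is treated as unencodable")
    return ranges_pointer(code_point)
```

With the guard in place, `four_byte_pointer(0xE5E5)` raises instead of returning 37026, while code points that genuinely belong to a range row are unaffected.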
I don't agree with emitting an encoder error for the code point U+E5E5. Like "vyv03354", I would rather change the mapping for this code point back, if that is indeed what the GB18030 standard says.
Since most browsers emit an error, it seems safer to just do that. Especially since WebKit "fixed" this in 2008, six years after Gecko did. Seems likely it might still be problematic.
I created a PR for my proposal in #25. I would appreciate review before landing this.