Comments (12)

annevk commented on May 21, 2024

Why would the gb18030 encoder algorithm ever hit the gb18030 ranges index for U+8000? It's in the gb18030 index.

from encoding.

peteroupc commented on May 21, 2024

Well then, here is another example of a problematic code point, and this time it doesn't appear in "gb18030 index": U+E5E5.

58853 ---> 19043, 0x9FA6 ---> 19043 + 58853 - 40870 ---> 37026

but:

37026 ---> 33550, 0xE865 ---> 59493 + 37026 - 33550 ---> 62969

and 62969 (U+F5F9) differs from 58853 (U+E5E5).
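The arithmetic above can be replayed with a small sketch. It is not the spec's algorithm verbatim: `RANGES`, `ranges_pointer`, and `ranges_code_point` are hypothetical names, and the table holds only the two "gb18030 ranges" entries quoted in this comment, on the assumption that those are the ones that matter for U+E5E5.

```python
# Only the two "gb18030 ranges" entries quoted above, as (pointer, code point).
# Assumption: the real index has many more entries; these two suffice here.
RANGES = [(19043, 0x9FA6), (33550, 0xE865)]


def ranges_pointer(code_point):
    """Encoder side: use the last entry whose code point is <= code_point."""
    pointer_offset, cp_offset = [e for e in RANGES if e[1] <= code_point][-1]
    return pointer_offset + code_point - cp_offset


def ranges_code_point(pointer):
    """Decoder side: use the last entry whose pointer is <= pointer."""
    pointer_offset, cp_offset = [e for e in RANGES if e[0] <= pointer][-1]
    return cp_offset + pointer - pointer_offset


pointer = ranges_pointer(0xE5E5)
round_trip = ranges_code_point(pointer)
print(pointer, hex(round_trip))  # 37026 0xf5f9
```

The encoder picks the 0x9FA6 entry while the decoder picks the 0xE865 entry for the resulting pointer, which is why the round trip does not come back to U+E5E5.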

annevk commented on May 21, 2024

Hmm, not immediately sure why that would fail. I assume the decoder/encoder has the same problem here? Doesn't seem like the byte conversion part should matter much.

annevk commented on May 21, 2024

@r12a any ideas? It seems that for your tests at http://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en browsers mostly pass.

r12a commented on May 21, 2024

i didn't test all the BMP for gb18030, just selected ranges, so i missed that particular character. However, there are now two new tests that do test it, and find it (and only that one in the PUA) to be problematic on Firefox and Chrome (didn't try any others).

http://www.w3.org/International/tests/repo/encoding/legacy-mb-schinese/gb18030/gb18030-encode-form-other-pua.html

http://www.w3.org/International/tests/repo/encoding/legacy-mb-schinese/gb18030/gb18030-decode-other-pua.html

i won't integrate these tests into the i18n test suite fully until we decide what needs to be done about this.

i don't have any clear idea about why this fails, although peter's suggestion seems plausible (note that the problem character is the first one in the PUA that isn't in the index).

(http://r12a.github.io/apps/encodings/ also shows the behaviour @peteroupc describes)

vyv03354 commented on May 21, 2024

Also, if #22 does not change the mapping for 0x8135F437 from U+1E3F to U+E7C7, another code point would have the problem.

annevk commented on May 21, 2024

@r12a I'm confused by the outcome of your tests since it suggests browsers simply do not encode U+E5E5 at all despite gb18030 supposedly being a UTF (both Chrome and Firefox emit an "HTML entity").

vyv03354 commented on May 21, 2024

For Firefox, this is intentional because some sites expected a space for gbk 0xA3A0.
https://bugzilla.mozilla.org/show_bug.cgi?id=131837
But this bug is archaic. I'm fine with changing back the mapping to U+E5E5 to align with GB18030 spec/IE/Edge.

annevk commented on May 21, 2024

Ah, that is the problem. "This matches the GB18030-2000 standard for code points encoded as two bytes, except for 0xA3 0xA0 which maps to U+3000 to be compatible with deployed content." Except we never took care of mapping U+E5E5 to an error in the encoder.
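A rough sketch of what "mapping U+E5E5 to an error in the encoder" could look like, as I read it. Everything here is illustrative: `encode_via_ranges` is a hypothetical name, `None` stands in for an encoder error, and `RANGES` contains only the two ranges-index entries quoted earlier in this thread, so the sketch only handles code points from U+9FA6 upward.

```python
from typing import Optional

# Only the two ranges-index entries quoted earlier in this thread, as
# (pointer, code point). Assumption: the real index has many more entries.
RANGES = [(19043, 0x9FA6), (33550, 0xE865)]


def encode_via_ranges(code_point: int) -> Optional[int]:
    """Return a ranges pointer, or None to signal an encoder error."""
    # Guard: U+E5E5 is missing from the two-byte index because 0xA3 0xA0
    # decodes to U+3000, so refuse it rather than fall through to the ranges.
    if code_point == 0xE5E5:
        return None
    # Last entry whose code point is <= code_point (sketch: needs >= 0x9FA6).
    pointer_offset, cp_offset = [e for e in RANGES if e[1] <= code_point][-1]
    return pointer_offset + code_point - cp_offset


print(encode_via_ranges(0xE5E5))  # None
```

Without the guard, U+E5E5 would fall through to the ranges computation and produce a pointer that decodes to a different code point.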

peteroupc commented on May 21, 2024

I don't agree with emitting an encoder error for the code point U+E5E5; like "vyv03354", I favor changing the mapping back for this code point, if that is indeed what the GB18030 standard says.

annevk commented on May 21, 2024

Since most browsers emit an error, it seems safer to just do that. Especially since WebKit "fixed" this in 2008, six years after Gecko did. Seems likely it might still be problematic.

annevk commented on May 21, 2024

I created a PR for my proposal in #25. I would appreciate review before landing this.
