
Comments (3)

annevk commented on May 22, 2024

Note that this would require some special logic in the API since all the underlying algorithms operate on scalar values.

from encoding.

hsivonen commented on May 22, 2024

I think requiring the caller to pass data in chunks of valid UTF-16 is a feature and not a bug.

As you note, it allows us not to have a separate streaming mode on the encoder side (thanks to ISO-2022-JP not being supported on the TextEncoder side).
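To illustrate what passing "chunks of valid UTF-16" could look like on the caller side, here is a minimal TypeScript sketch. The helper names `wellFormedChunks` and `encodeInChunks` are hypothetical and not part of the Encoding Standard; the only platform API used is `TextEncoder.encode()`. The sketch assumes the input string as a whole is well-formed and only guards against cutting a surrogate pair at a chunk boundary.

```typescript
// Hypothetical helper: split a string into chunks of at most `size` UTF-16
// code units without ever cutting a surrogate pair in half.
function wellFormedChunks(input: string, size: number): string[] {
  if (size < 2) {
    // A surrogate pair needs two code units, so smaller sizes cannot progress.
    throw new RangeError("size must be at least 2");
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < input.length) {
    let end = Math.min(start + size, input.length);
    if (end < input.length) {
      const last = input.charCodeAt(end - 1);
      const next = input.charCodeAt(end);
      const lastIsHigh = last >= 0xd800 && last <= 0xdbff;
      const nextIsLow = next >= 0xdc00 && next <= 0xdfff;
      // Back off by one unit so the pair moves to the next chunk intact.
      if (lastIsHigh && nextIsLow) {
        end -= 1;
      }
    }
    chunks.push(input.slice(start, end));
    start = end;
  }
  return chunks;
}

// Hypothetical helper: encode chunk by chunk with a stateless TextEncoder
// and concatenate the resulting UTF-8 bytes.
function encodeInChunks(input: string, size: number): Uint8Array {
  const encoder = new TextEncoder();
  const parts: Uint8Array[] = [];
  let total = 0;
  for (const chunk of wellFormedChunks(input, size)) {
    const bytes = encoder.encode(chunk);
    parts.push(bytes);
    total += bytes.length;
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.length;
  }
  return out;
}
```

Because every chunk is self-contained, the encoder needs no state between calls, and the concatenated output matches encoding the whole string at once.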

Additionally, accommodating strings that are not self-contained valid UTF-16 strings would be a step backwards in terms of steering the Web Platform in a direction that would allow browsers to use UTF-8 strings internally (except in the JS engine when a program manipulates a string by 16-bit units). Some years ago when I argued for document.write to take 16-bit code units instead of valid UTF-16 strings, getting rid of UTF-16 as an internal representation seemed hopeless. However, Servo gives me hope that we might be able to fix the design error of using UTF-16 as the browser-internal memory representation and use UTF-8 in the future. The least we can do on the spec side is to avoid adding new places that expose the internal memory representation of Unicode strings.

Furthermore, I recently worked on a decoder that tries to fill char16_t output buffers fully, even if that means splitting an astral character across a buffer boundary, and on an encoder that tries to behave properly in the face of invalid input (as if unpaired surrogates in the input had been replaced with U+FFFD). After that work, I've come to especially appreciate Rust's approach of making UTF-8 validity guarantees part of the language's core notion of safety. To the extent we are stuck with UTF-16 as the browser-internal representation, I think we would benefit from enforcing UTF-16 validity at the boundary between the JS engine and the rest of the browser, so that code outside the JS engine can be written on the assumption that UTF-16 sequences are always valid (as opposed to sprinkling unpaired-surrogate handling all over the code base).
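As a rough illustration of enforcing validity at a boundary, the following TypeScript sketch replaces every unpaired surrogate with U+FFFD before the string is handed on. The function name `replaceUnpairedSurrogates` is hypothetical; it is a sketch of the idea, not how any browser implements it.

```typescript
// Hypothetical boundary check: return a copy of `input` in which every
// unpaired surrogate code unit is replaced with U+FFFD.
function replaceUnpairedSurrogates(input: string): string {
  let out = "";
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh && i + 1 < input.length) {
      const next = input.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Valid pair: copy both code units and skip the low surrogate.
        out += input.slice(i, i + 2);
        i++;
        continue;
      }
    }
    // Lone high or lone low surrogate becomes U+FFFD; anything else is copied.
    out += isHigh || isLow ? "\ufffd" : input.charAt(i);
  }
  return out;
}
```

`TextEncoder.encode()` takes a `USVString`, so it already applies this replacement to its argument; sanitizing first yields the same bytes. Newer JavaScript engines also ship `String.prototype.toWellFormed()`, which performs the same replacement.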

For these reasons, I think we should close this as "won't fix".


annevk commented on May 22, 2024

I agree and since @inexorabletash wasn't sure either, closing.
