whatwg / encoding Goto Github PK

View Code? Open in Web Editor NEW

266.0 60.0 71.0 6.1 MB

Encoding Standard

Home Page: https://encoding.spec.whatwg.org/

License: Other

Makefile 0.06% Python 0.96% CSS 0.20% HTML 98.78%

whatwg encoding standard utf-8

encoding's People

Contributors

Stargazers

Watchers

encoding's Issues

Treatment of U+2022 is incorrect in Shift_JIS, EUC-JP and ISO-2022-JP

The following commit was made to fix https://www.w3.org/Bugs/Public/show_bug.cgi?id=28661

The bug is about U+2212, but the commit dealt with U+2022. I don't know how 2212 is morphed into 2022 by typos :-)

a7ab97e

Creating test suites for each encoding.

For example:
test-big5.txt
test-big5.utf8.txt
And the test-big5.txt just contains all the valid characters in BIG5 encoding.

test-big5.invalid.txt
test-big5.invalid.utf8.txt
And test-big5.invalid.txt
contains invalid characters in text along with valid characters
The file test-big5.invalid.utf8.txt would contains the corresponding valid characters.

GB 18030 2000 vs 2005

This is the continuation of https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c11

I forgot to reply @annevk's question there:

Jungshik, do you mean you want to make the swap mentioned at the end of comment 5?

> GB 18030   -2005  -2000
> 0xA8BC     U+1E3F U+E7C7
> 0x8135F437 U+E7C7 U+1E3F

My answer would be yes. Chrome, Safari and Opera do that. Firefox and IE do not.

My goal is to minimize the number of PUA code points after decoding partly because there'll be NO font support for those PUA code points on platforms like Android, iOS (and even on Windows 10 when additional fonts are installed for legacy compatibility. That is, old fonts like Simsun support them, but newer fonts like Microsoft Yahei do not).

https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c1 lists them and I thought that there are a bunch of PUA code point mappings that are dropped in GB 18030:2005 in favor of the regular Unicode code points.

According to Masatoshi Kimura , it's only U+1E3F for 0xA8BC that moved out of PUA area in GB 18030:2005, which is a big disappointment. (I wish GB18030 had taken a similar step to what's taken by HKSCS when it comes to PUA).

Anyway, at least one code point (0xA8BC <=> U+1E3F) should be mapped to a regular Unicode code point per GB18030:2005 instead of 2000.

"URL parsing HTML form submission"

https://encoding.spec.whatwg.org/#output-encodings

URL parsing HTML form submission

URL parsing and HTML form submission?

ISO 2022-jp encoding/decoding support

Results for a series of tests for EUC-jp encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#iso2022jp

The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3199

The tests check whether:

the browser produces the expected byte sequences for all characters in the iso-2022-jp encoding when encoding bytes for a URL produced by a form, using the encoder steps in the specification.
the browser produces percent-escaped character references for a URL produced by a form when encoding miscellaneous characters that are not in the iso-2022-jp encoding. (tests for several ranges)
same two types of test when writing characters to an href value
the browser decodes all characters as expected from a file generated by encoding all pointers in the iso-2022-jp encoding per the encoder steps in the specification.
when decoding iso-2022-jp text, the browser uses replacement characters as described by the algorithm in the Encoding spec.

The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.

Notes:

Edge fails all href encode tests because characters are not converted to percent-escapes in the href attribute.
Firefox fails all href encode tests for characters not in the encoding because it converts characters to percent-escaped Unicode values instead.

Can we please investigate the failures to ascertain whether:

the browser needs to be changed
the spec needs to be changed
the test is at fault

The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/

Streaming should be a flag on the decoder instead of being a flag on the decode() call

Making the streaming mode a flag on the decode() call is rather strange API design. In Gecko's implementation, performing a streaming decode(), followed by a non-streaming decode() followed by a streaming decode() yields potentially surprising results. I don't blame the Gecko implementation: it's weird API design that the API makes this possible.

I'd have expected the streaming mode to be a flag on the TextDecoder object set at the time of instantiation--i.e. a flag passed to the constructor. Is there a chance to make it so without breaking the Web at this point? If not, the spec could use some text explicitly calling out this oddity and explaining its implications.

Shift-JIS encoding/decoding support

Results for a series of tests for Shift-JIS encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#shiftjis

The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3200

The tests check whether:

the browser produces the expected byte sequences for all characters in the shift_jis encoding after 0x9F when encoding bytes for a URL produced by a form, using the encoder steps in the specification.
the browser produces percent-escaped character references for a URL produced by a form when encoding miscellaneous characters that are not in the shift_jis encoding. (tests for several ranges)
same two types of test when writing characters to an href value
the browser decodes all characters as expected from a file generated by encoding all pointers in the shift_jis encoding per the shift_jis encoder steps in the specification.
the browser decodes characters that are not recognised from the shift_jis index as replacement characters.

Notes:

Edge fails all href encode tests because characters are not converted to percent-escapes in the href attribute.
Firefox fails all href encode tests for characters not in the encoding because it converts characters to percent-escaped Unicode values instead.

Can we please investigate the failures to ascertain whether:

the browser needs to be changed
the spec needs to be changed
the test is at fault

The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/

Please double-check the end of the EUDC range

In the Shift_JIS decoder, the inclusive end pointer 10528 looks suspicious, since it means only one possible trail byte (the lowest possible) is allowed for the lead byte F9. One would expect either the special case to run to the end of the pointers whose lead byte is F8 (making 10528 an exclusive bound) or run to the end of the pointers whose lead byte is F9.

Please double-check that the range is correct and, if it is, please add a note saying that the range is weird on purpose.

cc @vyv03354

"gb18030 ranges" have problematic definitions

It seems that for some code points, gb18030 ranges doesn't
have a round-trip mapping. Take the code point U+8000 for example.

When we apply the the "index gb18030 ranges pointer" we get:

32768 ---> 18962, 0x4DAF --> 18962 + 32768 - 19887 --> 31843

but when we apply the "index gb18030 ranges code point" from
31843 we get:

31843 ---> 19043, 0x9FA6 -->  40870 + 31843 - 19043 --> 53670

and that differs from our original 32768. I think the reason is that each range is poorly defined; it's not clear where each range starts and ends in "index-gb18030-ranges.txt".

Big5 encoding/decoding support

Results for a series of tests for Big5 encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#big5

The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3197

The tests check whether:

the browser produces the expected byte sequences for all characters in the big5 encoding after 0x9F when encoding bytes for a URL produced by a form, using the encoder steps in the specification.
the browser produces percent-escaped character references for a URL produced by a form when encoding miscellaneous characters that are not in the big5 encoding. (tests for several ranges)
same two types of test when writing characters to an href value
the browser decodes all characters as expected from a file generated by encoding all pointers in the big5 encoding per the encoder steps in the specification.
the browser decodes all characters as expected from a file generated by encoding all pointers less than 5024 in the big5 encoding per the encoder steps in the specification.
the browser decodes characters that are not recognised from the big5 encoding as replacement characters.

Notes:

Edge fails all href encode tests because characters are not converted to percent-escapes in the href attribute.
Firefox fails all href encode tests for characters not in the encoding because it converts characters to percent-escaped Unicode values instead.

Can we please investigate the failures to ascertain whether:

the browser needs to be changed
the spec needs to be changed
the test is at fault

The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/

UTf7 supports

UTf7 are used in email files, so add it in encoding as default, or provide way to add extensions for support other encodings

GBK encoding/decoding support

Results for a series of tests for GBK encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#gbk

The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3194

The test check whether:

the browser produces the expected byte sequences for all characters in the gbk encoding after 0x9F when encoding bytes for a URL produced by a form, using the encoder steps in the specification.
the browser produces percent-escaped character references for a URL produced by a form when encoding miscellaneous characters that are not in the gbk encoding (tests for several ranges).
same two types of test when writing characters to an href value
the browser decodes all characters as expected from a file generated by encoding all pointers in the gbk encoding per the encoder steps in the specification.
when decoding gbk text, the browser uses replacement characters as described by the algorithm in the Encoding spec.

Notes:

all href tests fail for Edge because characters are not converted to percent-escapes
Firefox consistently falls to produce expected results for href tests for character not in the gbk encoding

Can we please investigate the failures to ascertain whether:

the browser needs to be changed
the spec needs to be changed
the test is at fault

The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/

Editorial optimization: Move EUDC before index lookup

It looks like there are no jis0208 index entries in the EUDC pointer range. It is, therefore, useless to search the index before doing the EUDC check. Please move the EUDC check before the index search in the Shift_JIS decoder and maybe add some assertion into the index generation scripts to make sure the EUDC range stays unmapped in the index.

Editorial: Exclude index entries instead of just pointers

The definitions of "index Shift_JIS pointer" and " index Big5 pointer" talk about excluding pointers. This is ambiguous, since it might be read to interpret as the corresponding code points getting treated as unmapped. For clarity, please say "excluding entries whose pointers are in the range" instead.

iso-2022-jp encoder XSS risks

https://encoding.spec.whatwg.org/#iso-2022-jp-encoder

13.2.2 iso-2022-jp encoder

Suppose the input unicode string is A, B, ESC, $, B, 1, 2.

In this case, resulting encoded bytes will be A, B, ESC, $, B, 1, 2, according to the current spec.

The byte sequence can be, when decoded again, a unicode string A, B, 憶, which is apparently different from the original input unicode string.

The point I am making here is that, when encoding, ESC chars in the input unicode string should be specially handled (removed or replaced with something else). Or am I misunderstanding something?

Thanks,
Takeshi

"In equations, all numbers are integers, additio..."

https://encoding.spec.whatwg.org/#terminology

In equations, all numbers are integers, addition is represented by "+", subtraction by "−", multiplication by "×", division by "/", calculating the remainder of a division (also known as modulo) by "%", logical left shifts by "<<", logical right shifts by ">>", bitwise AND by "&", and bitwise OR by "|". floor(x) is the largest integer not greater than x.

"all numbers are integers" contradicts "floor(x) is the largest integer not greater than x". To be more specific, it implies that x is an integer and therefore floor(x) is a no-op.

I suggest defining "/" as integer division and dropping all mentions of floor() from the spec. floor(pointer / 10 / 126 / 10) should be changed to pointer / (10 * 126 * 10) to avoid loss of precision, and similarly with floor(pointer / 10 / 126).

Signaling EOF in the streaming case is unspecified (or too unclear)

AFAICT, Gecko's TextDecoder doesn't handle EOF at all in the streaming mode. When I look at the spec, it's not clear to me how EOF is supposed to be signaled in the API. By passing null as the input to decode() maybe?

Introduce "output encoding"

@hsivonen pointed out that when HTML ends up using e.g., the "replacement" encoding to encode something, we would want to label that as utf-8, not replacement. Therefore, we should introduce a new concept called "output encoding" that is the encoding itself except for replacement (and soon hopefully utf-16le/be).

Then we remove replacement's encoder. Make sure the encode hook uses the output encoding of an encoding. And make sure HTML and URL use the encode hook correctly.

Big5 decoder fails to prepend ASCII byte when pointer is non-null but code point is null

Consider decoding 0x81 0x40 as Big5.

First, 0x81 becomes big5lead.
Then, when byte is 0x40, pointer becomes non-null at step 2.
At step 4, there's a lookup from the index by pointer, but the index entry is missing, so code point becomes null.
Now pointer is still non-null!
Since pointer is non-null, step 5 does nothing and an ASCII byte (0x40) gets eaten.
Since code point is null, U+FFFD gets emitted in step 6.

Please adjust step 5 to perform the null test on code point instead of performing it on pointer.

TextEncoder#encode - write to existing Uint8Array

For binary protocols, such as msgpack, performance of writing/reading strings can be crucial, however those strings rarely (virtually never) exist on a binary buffer on their own. When decoding a string from a binary buffer of Uint8Array (obtained over, say, websockets) it's possible to cheaply create a subarray and pass it over to TextDecoder#decode, as such the decoding story is pretty straight forward.

The writing story is much more complicated.

In order to write a string into an existing buffer, one must take the result Uint8Array from TextEncoder#encode and TypedArray#set it at a desired offset into another buffer, which introduces unnecessary copying and, as result, is slower than plain JavaScript encoding implementations (at least for UTF-8).

A much better solution would be analog to the Node.js Buffer#write implementation where the encoder can write a string to a buffer at a desired offset. Example signature could be something like:

TextEncoder#write(DOMString source, Uint8Array target [, Number offset [, Number length]]) -> Number bytesWritten

Add more labels to the replacement encoding

As a defense-in-depth measure, it would be great to add the labels mentioned in https://www.w3.org/Bugs/Public/show_bug.cgi?id=21057#c16 to the replacement encoding. It's however unclear whether that is web-compatible, as sites might specify these labels and rely on them not being recognized.

Filing this issue so hopefully one day a browser can experiment and figure out if this is doable.

Clarify result of division in algorithms

where an algorithm tells you to do something like

Let lead be pointer / 157 + 0x81.

it's not immediately clear that you need to use the floor function, rather than rounding, to bring the result of the division back to an integer. It would be good to make this clearer in the text.

The title of the document needs to be more than one (commonly used) word

'Encoding' is used all over our discipline: of audio, video, images, and all sorts of transformations. The title tells us close to nothing of what this is about.

Please can we have a slightly more descriptive title (at least 'Text' included, and maybe the fact that this deals with the conversion and normalization of text)?

index-jis0208.txt should be JIS X 0208 and add another index file

The CCS(coded character set) of index-jis0208.txt is CCS of "CP932" or "Windows-31J". It is not JIS X 0208.
There are two differences:

CP932 has more characters than JIS X 0208.
All CCS in JIS X 0208 has its name, for mapping to UCS. There are some difference between it and index-jis0208.txt.

The important role of JIS X 0208 is restriction of character set. For example, some fonts in Japan have characters only in JIS X 0208, not CP932. So we need a strict character set to implement converter like Shift_JIS encoder. We need another index which is differ from index-jis0208.txt.

Another usecase: sometimes I want to convert Shift_JIS text into EPUB file. It's typical usecase of Shift_JIS decoder. When I do it, I want to convert EM DASH in Shift_JIS (0x815C, 1-1-29) to EM DASH in Unicode (U+2014). It's OK in JIS X 0208 (and JIS X 0213), but It's NG in CP932; the table of index-jis0208.txt maps EM DASH into HORIZONTAL BAR(U+2015). It's not what I want to do.

So my suggestions is:

index-jis0208.txt should be renamed, such as index-cp932 or index-windows31j,
and should add another index same as JIS X 0208.

I have machine-readable mapping table, "JIS X 0213:2004 8-bit code vs Unicode mapping table"
http://x0213.org/codetable/jisx0213-2004-8bit-std.txt. But It's JIS X 0213, not JIS X 0208.
I found the issue #31 and I also don't have a real usecase of SHIFT_JISX0213 nor Shift_JIS-2004 (when I want to use characters in JIS X 0213, I always use UTF-8). I need just only JIS X 0208, subset of jisx0213-2004-8bit-std.txt.

Unclear text in utf-8 decoder

I wrote a small application (http://rishida.io/apps/encodings/) to work with Encoding tests, and ran into some trouble with the utf-8 decoder. I tried to closely follow the algorithms in the spec, as a way of testing them, but when it came to:

"6. Increase utf-8 bytes seen by one and set utf-8 code point to utf-8 code point + (byte − 0x80) << (6 × (utf-8 bytes needed − utf-8 bytes seen)). "

i ended up with

u8cp = u8cp + (byte - 0x80) << (6 * (bytesneeded - bytesseen))

which gives a much too high number.

what's needed is

u8cp = u8cp + ((byte - 0x80) << (6 * (bytesneeded - bytesseen)))

u8cp += (byte - 0x80) << (6 * (bytesneeded - bytesseen))

the spec text would be clearer if a couple of extra brackets were introduced, ie.:

"set utf-8 code point to utf-8 code point + ((byte − 0x80) << (6 × (utf-8 bytes needed − utf-8 bytes seen))). "

to show that the shift takes place before adding to utf-8 code point.

BufferSource is not specified

https://encoding.spec.whatwg.org/#textdecoder -

USVString decode(optional BufferSource input, optional TextDecodeOptions options);

The BufferSource type is not specified anywhere.

ClosureCompiler insists it is Uint8Array -
http://closure-compiler.appspot.com/home#code%3D%252F%252F%2520%253D%253DClosureCompiler%253D%253D%250A%252F%252F%2520%2540output_file_name%2520default.js%250A%252F%252F%2520%2540compilation_level%2520ADVANCED_OPTIMIZATIONS%250A%252F%252F%2520%2540language%2520ECMASCRIPT6_STRICT%250A%252F%252F%2520%2540language_out%2520ECMASCRIPT5_STRICT%250A%252F%252F%2520%2540formatting%2520pretty_print%250A%252F%252F%2520%253D%253D%252FClosureCompiler%253D%253D%250A%2522use%2520strict%2522%253B%250Aclass%2520Foo%250A%257B%250A%2520constructor()%2520%257B%2520fetch(%2522url%2522).then(response%2520%253D%253E%2520response.arrayBuffer()).then(this.bar)%253B%2520%257D%250A%2520%252F**%2520%2540this%2520undefined%2520%2540param%2520%257BArrayBuffer%257D%2520arrayBuffer%2520*%252F%250A%2520bar(arrayBuffer)%2520%257B%2520console.log(new%2520TextDecoder(%2522utf-8%2522).decode(arrayBuffer))%2520%257D%250A%257D%250A%250Anew%2520Foo()%253B

MDN says it is an ArrayBufferView -
https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/decode#Parameters

This should be specified.

Perhaps http://heycam.github.io/webidl/#common-BufferSource (see whatwg/fetch#108).

EUC-jp encoding/decoding support

Results for a series of tests for EUC-jp encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#eucjp

The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3198

The tests check whether:

the browser produces the expected byte sequences for all characters in the euc-jp encoding after 0x9F when encoding bytes for a URL produced by a form, using the encoder steps in the specification.
the browser produces percent-escaped character references for a URL produced by a form when encoding miscellaneous characters that are not in the euc-jp encoding. (tests for several ranges)
same two types of test when writing characters to an href value
the browser decodes all characters as expected from a file generated by encoding all pointers in the euc-jp encoding per the encoder steps in the specification.
the browser decodes characters that are not recognised from the euc-jp encoding as replacement characters.

Notes:

Edge fails all href encode tests because characters are not converted to percent-escapes in the href attribute.
Firefox fails all href encode tests for characters not in the encoding because it converts characters to percent-escaped Unicode values instead.
eucjp-decode-index: Edge fails on all and only the JIS-X-0212 characters, because it doesn't recognise 0xAF as the first in a 3-byte sequence.

Can we please investigate the failures to ascertain whether:

the browser needs to be changed
the spec needs to be changed
the test is at fault

The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/

Add do not flush flag to encode API, accept DOMString

Filing this for tracking purposes - we may decide to immediately close. I thought we had an issue on this already.

The decode API has a do not flush flag (spelled stream) used when partial byte sequences might be encountered.

But the encode API does not. It consumes USVStrings, which implies that the content to be encoded will always be complete strings. This means the case where partial UTF-16 data might be seen is not supported; a string split in between UTF-16 surrogates would be corrupted (c/o the IDL binding for USVString, nothing explicit in the API definition).

We had support for this in earlier iterations of the API, but moved to USVStrings and thus a stream option was a no-op and dropped.

I don't have a use case for this, but if anyone does we could reconsider. cc: @hsivonen since he is actively poking at the spec and should be aware of this limitation.

If gb18030 is revised, consider aligning the Encoding Standard

Cause GB18030-2005 is already one-to-one mapping bettween Unicode & GBK18030 except
The 14 characters that still mapped into Unicode PUA that at 2005,
But nowadays, all the 14 characters have correlated mapping into Unicode,
So I suggest encoding standard mapping those characters to normal Unicode characters but PUA characters.

The following 80 characters are the GBK chracters that ever mapped to Unicode PUA, and
the corresponding Unicode non-PUA character

Han Character      GBK              Unicode PUA       Unicode non-PUA
                FE50                E815                2E81
                FE51                E816                20087
                FE52                E817                20089
                FE53                E818                200CC
                FE54                E819                2E84
                FE55                E81A                3473
                FE56                E81B                3447
                FE57                E81C                2E88
                FE58                E81D                2E8B
                FE59                E81E                9FB4
                FE5A                E81F                359E
                FE5B                E820                361A
                FE5C                E821                360E
                FE5D                E822                2E8C
                FE5E                E823                2E97
                FE5F                E824                396E
                FE60                E825                3918
                FE61                E826                9FB5
                FE62                E827                39CF
                FE63                E828                39DF
                FE64                E829                3A73
                FE65                E82A                39D0
                FE66                E82B                9FB6
                FE67                E82C                9FB7
                FE68                E82D                3B4E
                FE69                E82E                3C6E
                FE6A                E82F                3CE0
                FE6B                E830                2EA7
                FE6C                E831                215D7
                FE6D                E832                9FB8
                FE6E                E833                2EAA
                FE6F                E834                4056
                FE70                E835                415F
                FE71                E836                2EAE
                FE72                E837                4337
                FE73                E838                2EB3
                FE74                E839                2EB6
                FE75                E83A                2EB7
                FE76                E83B                2298F
                FE77                E83C                43B1
                FE78                E83D                43AC
                FE79                E83E                2EBB
                FE7A                E83F                43DD
                FE7B                E840                44D6
                FE7C                E841                4661
                FE7D                E842                464C
                FE7E                E843                9FB9
                FE80                E844                4723
                FE81                E845                4729
                FE82                E846                477C
                FE83                E847                478D
                FE84                E848                2ECA
                FE85                E849                4947
                FE86                E84A                497A
                FE87                E84B                497D
                FE88                E84C                4982
                FE89                E84D                4983
                FE8A                E84E                4985
                FE8B                E84F                4986
                FE8C                E850                499F
                FE8D                E851                499B
                FE8E                E852                49B7
                FE8F                E853                49B6
                FE90                E854                9FBA
                FE91                E855                241FE
                FE92                E856                4CA3
                FE93                E857                4C9F
                FE94                E858                4CA0
                FE95                E859                4CA1
                FE96                E85A                4C77
                FE97                E85B                4CA2
                FE98                E85C                4D13
                FE99                E85D                4D14
                FE9A                E85E                4D15
                FE9B                E85F                4D16
                FE9C                E860                4D17
                FE9D                E861                4D18
                FE9E                E862                4D19
                FE9F                E863                4DAE
                FEA0                E864                9FBB

The following 14 characters are the GB18030-2005 chracters that are still mapped to Unicode PUA, and
I suggest the encoding standard mapping those characters into Unicode non-PUA, cause we have no need
to waiting GB18030 to update it's spec just for those 14 chracters, and we could be sure those 14 chracters's
corresponding Unicode non-PUA characters are decided.

Han Character      GBK              Unicode PUA       Unicode non-PUA
                FE51                E816                20087
                FE52                E817                20089
                FE53                E818                200CC
                FE59                E81E                9FB4
                FE61                E826                9FB5
                FE66                E82B                9FB6
                FE67                E82C                9FB7
                FE6C                E831                215D7
                FE6D                E832                9FB8
                FE76                E83B                2298F
                FE7E                E843                9FB9
                FE90                E854                9FBA
                FE91                E855                241FE
                FEA0                E864                9FBB

And according to these, we can decode all GBK encoding family strings to non-PUA Unicode,
Besides these, we still have the need to convert all the historical Unicode PUA characters
to proper GBK(GB18030) characters.

Editorial: Clean up gb18030 math

Please clean up the math for 4-byte sequences in the gb18030 decoder to perform the three multiplications by constant expressions and then adding the results. This both helps readability and reduces data dependencies if the spec text is transcribed to code.

Please clean up the math for 4-byte sequence in the gb18030 encoder to use the % operator with constant expressions on the high-hand side. This helps readability, reduces data dependencies if the spec text is transcribed to code and reduces operations when / and % with the same operands can be compiled into a single instruction.

Group constants together in half-width math

In the EUC-JP and Shift_JIS decoders, there's this math expression: "0xFF61 + byte − 0xA1".

Please rearrange it as "0xFF61 − 0xA1 + byte" to make it easy to get correct results without the implementor rearranging the expression.

In a programming language that's aware of integer overflow to the extent that the compiler doesn't perform constant folding in a way that would change whether an expression has integer overflow, the former doesn't allow constant folding but the latter does.

Furthermore, if the math happens to be done on the 16-bit precision, for example as an artifact of trying to compute a UTF-16 code unit, the expression in the spec actually has an overflow with the relevant values of byte. (Which doesn't matter e.g. in Java but does matter in e.g. debug-mode Rust.)

gb18030 encoding/decoding support

Results for a series of tests for gb18030 encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#gb18030

The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3195

The test check whether:

the browser produces the expected byte sequences for all characters in the gb18030 index after 0x9F when encoding bytes for a URL produced by a form, using the encoder steps in the specification.
the browser produces the expected byte sequences for miscellaneous characters not in the gb18030 index when encoding bytes for a URL produced by a form, using the encoder steps in the specification. (tests for several ranges)
same two types of test when writing characters to an href value
the browser decodes all characters as expected from a file generated by encoding all pointers in the gb18030 index per the encoder steps in the specification.
the browser decodes all characters as expected from a file generated by encoding miscellaneous characters not in the gb18030 index per the encoder steps in the specification. (tests for several ranges)
when decoding gb18030 text, the browser uses replacement characters as described by the algorithm in the Encoding spec.

Notes:

all href tests fail for Edge because characters are not converted to percent-escapes

Can we please investigate the failures to ascertain whether:

the browser needs to be changed
the spec needs to be changed
the test is at fault

The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/

Use floor() more

In 929a3ff#commitcomment-15013806 @domenic points out some nesting of floor() might needed to preserve the original integer division semantics. I haven't been able to find a case where it gives a different outcome, however.

Consider making the DOM name the canonical name

It's pretty sad that this specification chose to invent a new case (the lower case) for canonical names when a historical casing (the IANA case implemented by WebKit/Blink and, except for gbk, by Gecko) already existed.

Imagine that you are writing a library for this stuff. You can't compute the DOM name from the name in this specification without hard-coding knowledge of the DOM names. However, you could compute the name in this specification by ASCII-lower-casing the DOM name.

Is there a good reason not to:

Change the names in this specification to the Compatibility names from the DOM spec.
Get rid of the special mapping in the DOM spec.
Define that .encoding of TextEncoder and TextDecoder returns name (i.e. now the mixed-case IANA/DOM name) ASCII-lowercased.

Clarify scope of document in preface

The Preface claims:

"… this specification … defines … the utf-8 encoding."

Isn't that formalised in ISO/IEC 10646:2014 and Unicode? I'm not suggesting that the contents of this document aren't useful, just that they don't define utf-8.

I suggest resolving this issue by:

Removing "(and defines)" from the preface.
Adding a reference to ISO/IEC 10646 to the References section and referring to it in the definition of utf-8.

Consider removing TextEncoder support for UTF-16

AFAICT, non-UTF-8 encoders are exposed in three places in the Web Platform:

TextEncoder
Query string parsing for URL and the associated APIs
Form submission

URL parsing and form submission use UTF-8 when the encoding of the document is either of the UTF-16 encodings, AFAICT, TextEncoder is the only place where the Web Platform exposes UTF-16 encoder functionality.

This is pretty sad, if TextEncoder support UTF-16 just because it can and the support doesn't have strong use cases.

Am I wrong and do proper use cases exist for UTF-16 in TextEncoder to a degree that sets UTF-16 apart from the other legacy encoding that aren't supported in TextEncoder?

Correct [Exposed] syntax

Exposed=Window,Worker should be Exposed=(Window,Worker).

Replacement encoding should handle empty input better

See https://twitter.com/hsivonen/status/691585344769282048 by @hsivonen and @zcorpan.

What's the correct choice for BIG5 F9FE?

According to:
https://lists.w3.org/Archives/Public/public-whatwg-archive/2012Apr/0095.html
The decoding for F9FE from different vendor are different,
And the whatwg encoding standard are choose
F9FE -> U+FFED, is that the correct behavior ?
Because from the vendor's point of view, U+2593 is a better option

> F9FE =>
> opera-hk: U+FFED
> firefox: U+2593
> chrome: U+2593
> firefox-hk: U+2593
> opera: U+2593
> chrome-hk: U+FFED
> internetexplorer: U+2593
> windows-os: U+2593
> hkscs-2008: <U+FFED>

Add UTF-7 to replacement encoding list? / Encoding sniffing

Maybe this was discussed before, but I couldn't find a bug on this. What do you think of treating UTF-7 the same way as ISO-2022-{KR,CN}, HZ-GB, etc?

When decoding, the whole input is replaced by U+FFFD. When encoding, use UTF-8.

Background: Blink began to use Compact Encoding Detector ( google/compact_enc_det ) when no encoding label is found (http, meta). When 7-bit encoding detection is on, it detects ISO-2022-{KR,CN}, HZ-GB AND UTF-7 in addition to ISO-2022-JP. 7-bit encoding detection is ON for ISO-2022-JP, but we want to suppress other 7-bit encodings. I think the best way to 'suppress' (unsupport) them is to turn the whole input to U+FFFD.

Benefits of "Legacy" Encodings – Byte Counter

Hi, I appreciate UTF family encodings enforcement in TextEncoder, but the "legacy" ones might be needed in some situations. I created this thread to demonstrate one and ask your opinion.

I have come across a problem which probably cannot be solved without "legacy" encodings support – byte counting of a text in its original encoding. I need to count bytes of a text in <textarea> as if it was in windows-1250, koi8-u... Do you happen to know how to count it, please?

Edititorial: Add a note about duplicates to "index Shift_JIS pointer"

Please add a note to "index Shift_JIS pointer" saying that there are duplicate code points in the index, so the exclusion of the earlier ones cause the later ones to be used.

Adding BRF as "legacy" single-byte encoding for braille

Hello,

BRF is a charset that permits to encode braille. http://brl.thefreecat.org/test.php is an example of text file encoded in the BRF charset, holding 3 braille patterns. BRF is the standard way of providing documents ready for braille embossing, all official documents available on the web ready for embossing are using it (books, courses, tax forms, income declaration, etc.), as required per section 508 in the US for instance. There is currently no other really-used standard way of shipping them using UTF-8 (the PEF format is still at very early stage), you will never see BRF documents encoded in utf-8.

For now, browsers ignore the "charset=brf" content-type qualifier, and show the file as if it was ascii, i.e. they print "A B C". They should recognize the BRF charset like other charsets (and for instance use iconv for converting it to unicode, and then display it just like any unicode text file), and thus print "⠁⠀⠃⠀⠉" instead of "A B C".

The BRF format defines bytes 0x00-0x1F as 1-to-1 equivalents to ascii 0x00-0x1F, and 0x20-0x5f as equivalents of 6-dot braille patterns of U+2800-U+283f.

BRF got added to IANA's list of charsets in 2006, see http://www.iana.org/assignments/charset-reg/BRF

BRF got added to glibc's iconv around the same period, see https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=localedata/charmaps/BRF;hb=HEAD

Regards,
Samuel

Add "replacement" as a label for the replacement encoding

I've written "The name of an encoding is also one of its labels, except in the case of the replacement encoding whose name is not one of its labels." inconveniently many times when either documenting code that implements the Encoding Standard or when otherwise explaining the concepts.

Also, I've written code that does something like "if input string is 'replacement', don't run get an encoding, otherwise, run get an encoding" when working with interfaces that were designed (four links there, but GitHub's styling makes it unobvious) before there was clarity of what strings are labels and what strings are names used where an enum value or a reference to a singleton object representing an encoding would be more appropriate software design (when those interfaces potentially have callers in add-ons that I can't fix).

All this would become simpler if get an encoding for the name of an encoding always returned the encoding itself. However, it's kinda sad to expose another Web-exposed label just to make implementing and explaining stuff easier, so I'm not sure if I should request this.

But at least this deserves some discussion.

SHIFT_JISX0213 are not supported.

utf-8 decoder lead byte switch branches 0xE0 to 0xEF, 0xF0 to 0xF4 need 'return continue'

#38 changed how the lead byte is processed. It removed "Then (byte is in the range 0xC2 to 0xF4, inclusive) set UTF-8 code point to UTF-8 code point << (6 × UTF-8 bytes needed) and return continue."

The first two and final switch branches (0x00 to 0x7F, 0xC2 to 0xDF, and Otherwse) have returns, but the remaining two (0xE0 to 0xEF, 0xF0 to 0xF4) do not, so fall through to the trail byte processing.

Looks like they just need additional "Return continue" substeps added

EUK-kr encoding/decoding support

Results for a series of tests for EUK-kr encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#euckr

The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3201

The tests check whether:

the browser produces the expected byte sequences for all characters in the euc-kr encoding after 0x9F when encoding bytes for a URL produced by a form, using the encoder steps in the specification.
the browser produces percent-escaped character references for a URL produced by a form when encoding miscellaneous characters that are not in the euc-kr encoding. (tests for two ranges)
same two types of test when writing characters to an href value
the browser decodes all characters as expected from a file generated by encoding all pointers in the euc-kr encoding per the encoder steps in the specification.
the browser decodes characters that are not recognised from the euc-kr encoding as replacement characters.

Notes:

Edge fails all href encode tests because characters are not converted to percent-escapes in the href attribute.
Firefox fails all href encode tests for characters not in the encoding because it converts characters to percent-escaped Unicode values instead.

Can we please investigate the failures to ascertain whether:

the browser needs to be changed
the spec needs to be changed
the test is at fault

The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/

Big5 encoder treats a code point as error when both an HKSCS and non-HKSCS pointer exists for the code point

The Big5 encoder first does an index lookup and then discards the code point as an error if the Big5 lead for the pointer is less than 0xA1. This makes the encoder discard code points that have two mappings: one whose Big5 lead is less than 0xA1 and another whose Big5 lead is greater or equal to 0xA1.

The following code points have such double pointer mappings:
7BB8
7C06
7CCE
7DD2
7E1D
8005
8028
83C1
84A8
840F
89A6
8D77
90FD
92B9
96B6
975C
97FF
9F16
5159
515B
515D
515E
7479
6D67
799B
9097
5B28
732A
7201
77D7
7E87
99D6
91D4
60DE
6FB6
8F36
4FBB
71DF
9104
9DF0
83CF
5C10
79E3
5A67
8F0B
7B51
62D0
5605
5ED0
6062
75F9
6C4A
9B2E
50ED
62CE
60A4
7162

When testing Gecko's old Big5 encoder, at least the first of these is a non-error: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/3610 (test case violates the Same Origin Policy in Blink due to different treatment of data: origins). That is, Gecko's old encoder encodes U+7BB8 as 0xBA, 0xE6.

I believe that instead of checking whether lead is less than 0xA1 after the lead computation, the spec should say that when looking up pointers from the index when encoding, pointers below (0xA1 - 0x81) * 157 should be ignored, i.e. search the index from pointer (0xA1 - 0x81) * 157 onwards.

Document buffer tricks?

See http://logs.glob.uno/?c=freenode%23whatwg&s=18%20Apr%202016&e=18%20Apr%202016#c992642 for a clever one by @hsivonen. We could add a note somewhere or even an implementation considerations section.

Serializing internal TextDecoder state?

Was talking to @bsittler about using TextDecoder and he mentioned that he needs to be able to serialize the current decoding state to IndexedDB in order to survive e.g. service worker upgrades. TextDecoder definitely has internal state (a few bytes left in its internal "stream" variable, basically) which can get lost in this manner. It would be great to expose this, and allow construction of a new TextDecoder primed with that state.

whatwg / encoding Goto Github PK

encoding's People

Contributors

Stargazers

Watchers

Forkers

encoding's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs