whatwg / encoding
Encoding Standard
Home Page: https://encoding.spec.whatwg.org/
License: Other
The following commit was made to fix https://www.w3.org/Bugs/Public/show_bug.cgi?id=28661
The bug is about U+2212, but the commit dealt with U+2022. I don't know how U+2212 morphed into U+2022 through typos :-)
For example:
test-big5.txt
test-big5.utf8.txt
The file test-big5.txt contains all the valid characters in the Big5 encoding.
test-big5.invalid.txt
test-big5.invalid.utf8.txt
The file test-big5.invalid.txt contains invalid byte sequences mixed in with valid characters, and test-big5.invalid.utf8.txt contains the corresponding valid characters.
This is the continuation of https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c11
I forgot to reply to @annevk's question there:
Jungshik, do you mean you want to make the swap mentioned at the end of comment 5?
> GB 18030 -2005 -2000
> 0xA8BC U+1E3F U+E7C7
> 0x8135F437 U+E7C7 U+1E3F
My answer would be yes. Chrome, Safari and Opera do that. Firefox and IE do not.
My goal is to minimize the number of PUA code points after decoding, partly because there will be NO font support for those PUA code points on platforms like Android and iOS (and even on Windows 10, only when additional fonts are installed for legacy compatibility: old fonts like SimSun support them, but newer fonts like Microsoft YaHei do not).
https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c1 lists them, and I thought that a bunch of PUA code point mappings were dropped in GB 18030:2005 in favor of the regular Unicode code points.
According to Masatoshi Kimura, it's only U+1E3F for 0xA8BC that moved out of the PUA area in GB 18030:2005, which is a big disappointment. (I wish GB18030 had taken a step similar to what HKSCS did when it comes to the PUA.)
Anyway, at least one code point (0xA8BC <=> U+1E3F) should be mapped to a regular Unicode code point per GB18030:2005 instead of 2000.
https://encoding.spec.whatwg.org/#output-encodings
The text reads "URL parsing HTML form submission". Should that be "URL parsing and HTML form submission"?
Results for a series of tests for iso-2022-jp encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#iso2022jp
The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3199
The tests check whether:
The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.
Notes:
Can we please investigate the failures to ascertain whether:
The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/
Making the streaming mode a flag on the decode() call is rather strange API design. In Gecko's implementation, performing a streaming decode(), followed by a non-streaming decode(), followed by a streaming decode() yields potentially surprising results. I don't blame the Gecko implementation: it's weird that the API design makes this possible.
I'd have expected the streaming mode to be a flag on the TextDecoder object set at the time of instantiation, i.e. a flag passed to the constructor. Is there a chance to make it so without breaking the Web at this point? If not, the spec could use some text explicitly calling out this oddity and explaining its implications.
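To make the oddity concrete, here's a minimal sketch of such a call sequence using only the standard API (nothing Gecko-specific), feeding "€" (0xE2 0x82 0xAC in UTF-8) in pieces. The point of interest is that the non-streaming call inherits the buffered state from the preceding streaming call:

```javascript
// Sketch of the mixed streaming/non-streaming call sequence in question.
const decoder = new TextDecoder();

// Streaming decode: the incomplete sequence is buffered; nothing is emitted.
const part1 = decoder.decode(new Uint8Array([0xE2, 0x82]), { stream: true });

// Non-streaming decode on the SAME object: it continues from the buffered
// state (completing U+20AC) and then flushes at the end.
const part2 = decoder.decode(new Uint8Array([0xAC]));

// A later streaming decode simply starts over on the same object.
const part3 = decoder.decode(new Uint8Array([0xE2, 0x82]), { stream: true });

console.log(JSON.stringify(part1), JSON.stringify(part2), JSON.stringify(part3));
```

Whether this mixing is intended or an accident of the design is exactly the question raised above.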
Results for a series of tests for Shift-JIS encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#shiftjis
The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3200
The tests check whether:
The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.
Notes:
Can we please investigate the failures to ascertain whether:
The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/
In the Shift_JIS decoder, the inclusive end pointer 10528 looks suspicious, since it means only one possible trail byte (the lowest possible) is allowed for the lead byte F9. One would expect either the special case to run to the end of the pointers whose lead byte is F8 (making 10528 an exclusive bound) or run to the end of the pointers whose lead byte is F9.
Please double-check that the range is correct and, if it is, please add a note saying that the range is weird on purpose.
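For reference, a quick sketch recovering the bytes for pointer 10528 with the spec's Shift_JIS pointer arithmetic, confirming it corresponds to lead 0xF9 with the lowest possible trail:

```javascript
// Recover the Shift_JIS lead/trail bytes for a pointer, per the spec's
// encoder arithmetic (lead offset 0x81 below 0x1F, else 0xC1; trail offset
// 0x40 below 0x3F, else 0x41).
function shiftJisBytes(pointer) {
  const lead = Math.floor(pointer / 188);
  const leadOffset = lead < 0x1F ? 0x81 : 0xC1;
  const trail = pointer % 188;
  const trailOffset = trail < 0x3F ? 0x40 : 0x41;
  return [lead + leadOffset, trail + trailOffset];
}

// Pointer 10528 is lead 0xF9 with the lowest trail byte 0x40, so an
// inclusive end at 10528 admits exactly one F9-lead pointer.
console.log(shiftJisBytes(10528).map((b) => b.toString(16))); // [ 'f9', '40' ]
```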
cc @vyv03354
It seems that for some code points, the gb18030 ranges don't have a round-trip mapping. Take the code point U+8000 (32768) as an example.
When we apply "index gb18030 ranges pointer" we get:
32768 → entry (18962, 0x4DAF = 19887) → 18962 + 32768 − 19887 = 31843
but when we apply "index gb18030 ranges code point" to 31843 we get:
31843 → entry (19043, 0x9FA6 = 40870) → 40870 + 31843 − 19043 = 53670
which differs from our original 32768. I think the reason is that each range is poorly defined; it's not clear where each range starts and ends in "index-gb18030-ranges.txt".
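The arithmetic above can be reproduced with a small sketch. The two (pointer, code point) entries below are just the ones cited in this example, not the full index-gb18030-ranges.txt, and the lookups follow the spec's "last matching entry" style:

```javascript
// Toy reproduction of the asymmetric gb18030 ranges lookup from the example.
const ranges = [
  [18962, 0x4DAF], // entry matched by "index gb18030 ranges pointer" for U+8000
  [19043, 0x9FA6], // entry matched by "index gb18030 ranges code point" for 31843
];

// "index gb18030 ranges pointer": last entry whose code point <= the input.
function codePointToPointer(cp) {
  const [p, c] = ranges.filter(([, c]) => c <= cp).pop();
  return p + cp - c;
}

// "index gb18030 ranges code point": last entry whose pointer <= the input.
function pointerToCodePoint(pointer) {
  const [p, c] = ranges.filter(([p]) => p <= pointer).pop();
  return c + pointer - p;
}

const pointer = codePointToPointer(0x8000);    // 18962 + 32768 - 19887 = 31843
const roundTrip = pointerToCodePoint(pointer); // 40870 + 31843 - 19043 = 53670
console.log(pointer, roundTrip); // 31843 53670, not the original 32768
```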
Results for a series of tests for Big5 encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#big5
The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3197
The tests check whether:
The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.
Notes:
Can we please investigate the failures to ascertain whether:
The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/
UTF-7 is used in email files, so add it to the Encoding Standard by default, or provide a way to add extensions to support other encodings.
Results for a series of tests for GBK encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#gbk
The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3194
The tests check whether:
The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.
Notes:
Can we please investigate the failures to ascertain whether:
The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/
It looks like there are no jis0208 index entries in the EUDC pointer range. It is, therefore, useless to search the index before doing the EUDC check. Please move the EUDC check before the index search in the Shift_JIS decoder and maybe add some assertion into the index generation scripts to make sure the EUDC range stays unmapped in the index.
The definitions of "index Shift_JIS pointer" and "index Big5 pointer" talk about excluding pointers. This is ambiguous, since it might be interpreted as the corresponding code points being treated as unmapped. For clarity, please say "excluding entries whose pointers are in the range" instead.
https://encoding.spec.whatwg.org/#iso-2022-jp-encoder
13.2.2 iso-2022-jp encoder
Suppose the input Unicode string is A, B, ESC, $, B, 1, 2.
In this case, the resulting encoded bytes will be A, B, ESC, $, B, 1, 2, according to the current spec.
When decoded again, that byte sequence can become the Unicode string A, B, 憶, which is clearly different from the original input Unicode string.
The point I am making here is that, when encoding, ESC characters in the input Unicode string should be specially handled (removed or replaced with something else). Or am I misunderstanding something?
Thanks,
Takeshi
https://encoding.spec.whatwg.org/#terminology
In equations, all numbers are integers, addition is represented by "+", subtraction by "−", multiplication by "×", division by "/", calculating the remainder of a division (also known as modulo) by "%", logical left shifts by "<<", logical right shifts by ">>", bitwise AND by "&", and bitwise OR by "|". floor(x) is the largest integer not greater than x.
"all numbers are integers" contradicts "floor(x) is the largest integer not greater than x". To be more specific, it implies that x is an integer and therefore floor(x) is a no-op.
I suggest defining "/" as integer division and dropping all mentions of floor() from the spec. floor(pointer / 10 / 126 / 10) should be changed to pointer / (10 * 126 * 10) to avoid loss of precision, and similarly with floor(pointer / 10 / 126).
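For what it's worth, the rewrite is safe because, for non-negative integers, floor(floor(x / a) / b) equals floor(x / (a × b)). A quick sketch checking that identity over the gb18030 4-byte pointer space:

```javascript
// Check that stepwise truncating division equals a single division by the
// product, which is what lets floor(pointer / 10 / 126 / 10) become
// pointer / (10 * 126 * 10) under integer division.
const idiv = (a, b) => Math.trunc(a / b);

// 1587600 = 126 * 10 * 126 * 10, the number of gb18030 4-byte pointers.
for (let pointer = 0; pointer < 1587600; pointer += 997) {
  const stepwise = idiv(idiv(idiv(pointer, 10), 126), 10);
  const direct = idiv(pointer, 10 * 126 * 10);
  if (stepwise !== direct) throw new Error(`mismatch at ${pointer}`);
}
console.log('stepwise and direct integer division agree');
```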
AFAICT, Gecko's TextDecoder doesn't handle EOF at all in the streaming mode. When I look at the spec, it's not clear to me how EOF is supposed to be signaled in the API. By passing null as the input to decode(), maybe?
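For what it's worth, a sketch of how EOF ends up being signaled with the shipped API (assuming current spec behavior, where any decode() call without { stream: true }, including one with no arguments, acts as end-of-stream):

```javascript
// Sketch of signaling EOF to a streaming TextDecoder.
const decoder = new TextDecoder(); // UTF-8, non-fatal

// Feed an incomplete 3-byte sequence (first two bytes of "€", 0xE2 0x82 0xAC).
const chunk = decoder.decode(new Uint8Array([0xE2, 0x82]), { stream: true });

// Signal EOF by calling decode() with no input: the pending bytes can never
// be completed, so they come out as a single U+FFFD.
const flushed = decoder.decode();

console.log(JSON.stringify(chunk), JSON.stringify(flushed)); // "" "\ufffd"
```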
@hsivonen pointed out that when HTML ends up using e.g. the "replacement" encoding to encode something, we would want to label that as utf-8, not replacement. Therefore, we should introduce a new concept called "output encoding" that is the encoding itself, except for replacement (and soon hopefully utf-16le/be).
Then we remove replacement's encoder, make sure the encode hook uses the output encoding of an encoding, and make sure HTML and URL use the encode hook correctly.
Consider decoding 0x81 0x40 as Big5.
Please adjust step 5 to perform the null test on code point instead of performing it on pointer.
For binary protocols, such as msgpack, the performance of writing/reading strings can be crucial; however, those strings rarely (virtually never) exist on a binary buffer on their own. When decoding a string from a binary buffer of Uint8Array (obtained over, say, WebSockets), it's possible to cheaply create a subarray and pass it over to TextDecoder#decode, so the decoding story is pretty straightforward.
The writing story is much more complicated. In order to write a string into an existing buffer, one must take the resulting Uint8Array from TextEncoder#encode and TypedArray#set it at a desired offset into another buffer, which introduces unnecessary copying and, as a result, is slower than plain JavaScript encoding implementations (at least for UTF-8).
A much better solution would be an analog of the Node.js Buffer#write implementation, where the encoder can write a string into a buffer at a desired offset. An example signature could be something like:
TextEncoder#write(DOMString source, Uint8Array target [, Number offset [, Number length]]) -> Number bytesWritten
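To illustrate the proposed shape (the name and semantics are hypothetical, not part of any spec), here's a sketch layered on top of the existing encode(). A native implementation would write directly into target and skip the intermediate copy that this sketch still pays for:

```javascript
// Hypothetical illustration of the proposed write() signature, built on the
// existing TextEncoder#encode; for demonstration only.
function write(encoder, source, target, offset = 0, length) {
  let bytes = encoder.encode(source);
  if (length !== undefined && bytes.length > length) {
    bytes = bytes.subarray(0, length); // NB: may split a multi-byte sequence
  }
  if (bytes.length > target.length - offset) {
    bytes = bytes.subarray(0, target.length - offset);
  }
  target.set(bytes, offset);
  return bytes.length; // bytesWritten
}

const buf = new Uint8Array(16);
const written = write(new TextEncoder(), 'héllo', buf, 4);
console.log(written); // 6, since "é" takes two bytes in UTF-8
```

Note that truncating mid-sequence and out-of-space behavior are exactly the edge cases a real API definition would have to pin down.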
As a defense-in-depth measure, it would be great to add the labels mentioned in https://www.w3.org/Bugs/Public/show_bug.cgi?id=21057#c16 to the replacement encoding. It's however unclear whether that is web-compatible, as sites might specify these labels and rely on them not being recognized.
Filing this issue so hopefully one day a browser can experiment and figure out if this is doable.
where an algorithm tells you to do something like
Let lead be pointer / 157 + 0x81.
it's not immediately clear that you need to use the floor function, rather than rounding, to bring the result of the division back to an integer. It would be good to make this clearer in the text.
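A concrete sketch of the trap, using the Big5 lead computation with a pointer where flooring and rounding disagree:

```javascript
// "Let lead be pointer / 157 + 0x81" must floor, not round:
// 5000 / 157 = 31.84..., so the two readings give different lead bytes.
const pointer = 5000;

const floored = Math.floor(pointer / 157) + 0x81; // 31 + 0x81 = 0xA0
const rounded = Math.round(pointer / 157) + 0x81; // 32 + 0x81 = 0xA1, wrong

console.log(floored.toString(16), rounded.toString(16)); // a0 a1
```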
'Encoding' is used all over our discipline: of audio, video, images, and all sorts of transformations. The title tells us close to nothing of what this is about.
Please can we have a slightly more descriptive title (at least 'Text' included, and maybe the fact that this deals with the conversion and normalization of text)?
The CCS (coded character set) of index-jis0208.txt is the CCS of "CP932" or "Windows-31J". It is not JIS X 0208.
There are two differences:
The important role of JIS X 0208 is restriction of the character set. For example, some fonts in Japan only have the characters in JIS X 0208, not CP932. So we need a strict character set to implement a converter like a Shift_JIS encoder. We need another index, which differs from index-jis0208.txt.
Another use case: sometimes I want to convert Shift_JIS text into an EPUB file. It's a typical use case of a Shift_JIS decoder. When I do it, I want to convert EM DASH in Shift_JIS (0x815C, 1-1-29) to EM DASH in Unicode (U+2014). It's OK in JIS X 0208 (and JIS X 0213), but it's NG in CP932; the table of index-jis0208.txt maps EM DASH to HORIZONTAL BAR (U+2015). It's not what I want to do.
So my suggestion is:
I have a machine-readable mapping table, "JIS X 0213:2004 8-bit code vs Unicode mapping table":
http://x0213.org/codetable/jisx0213-2004-8bit-std.txt. But it's JIS X 0213, not JIS X 0208.
I found issue #31, and I also don't have a real use case for SHIFT_JISX0213 or Shift_JIS-2004 (when I want to use characters in JIS X 0213, I always use UTF-8). I need only JIS X 0208, a subset of jisx0213-2004-8bit-std.txt.
I wrote a small application (http://rishida.io/apps/encodings/) to work with Encoding tests, and ran into some trouble with the utf-8 decoder. I tried to closely follow the algorithms in the spec, as a way of testing them, but when it came to:
"6. Increase utf-8 bytes seen by one and set utf-8 code point to utf-8 code point + (byte − 0x80) << (6 × (utf-8 bytes needed − utf-8 bytes seen)). "
I ended up with
u8cp = u8cp + (byte - 0x80) << (6 * (bytesneeded - bytesseen))
which gives a much too high number.
What's needed is
u8cp = u8cp + ((byte - 0x80) << (6 * (bytesneeded - bytesseen)))
or
u8cp += (byte - 0x80) << (6 * (bytesneeded - bytesseen))
The spec text would be clearer if a couple of extra brackets were introduced, i.e.:
"set utf-8 code point to utf-8 code point + ((byte − 0x80) << (6 × (utf-8 bytes needed − utf-8 bytes seen))). "
to show that the shift takes place before adding to utf-8 code point.
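A sketch of the trap with concrete numbers, decoding the trail bytes of "€" (UTF-8 bytes 0xE2 0x82 0xAC). In JavaScript, as in most C-like languages, "+" binds tighter than "<<", so the unbracketed reading shifts the whole sum:

```javascript
// After the lead byte 0xE2, utf-8 code point is (0xE2 & 0xF) << 12 = 8192
// and utf-8 bytes needed is 2. Now process the first trail byte, 0x82.
let u8cp = (0xE2 & 0xF) << 12; // 8192
const byte = 0x82;
const bytesNeeded = 2;
const bytesSeen = 1; // just increased by one

// Unbracketed reading: "+" binds tighter, so the whole sum gets shifted.
const wrong = u8cp + (byte - 0x80) << (6 * (bytesNeeded - bytesSeen)); // 524416
// Intended reading: shift only the trail bits, then add.
const right = u8cp + ((byte - 0x80) << (6 * (bytesNeeded - bytesSeen))); // 8320

console.log(wrong, right);
```

Continuing the intended reading with the final byte 0xAC (shift 0) yields 8364 = 0x20AC, the euro sign, which the unbracketed reading never reaches.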
https://encoding.spec.whatwg.org/#textdecoder -
USVString decode(optional BufferSource input, optional TextDecodeOptions options);
The BufferSource type is not specified anywhere.
MDN says it is an ArrayBufferView -
https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/decode#Parameters
This should be specified.
Perhaps http://heycam.github.io/webidl/#common-BufferSource (see whatwg/fetch#108).
Results for a series of tests for EUC-jp encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#eucjp
The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3198
The tests check whether:
The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.
Notes:
Can we please investigate the failures to ascertain whether:
The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/
Filing this for tracking purposes - we may decide to immediately close. I thought we had an issue on this already.
The decode API has a do-not-flush flag (spelled stream) used when partial byte sequences might be encountered.
But the encode API does not. It consumes USVStrings, which implies that the content to be encoded will always be complete strings. This means the case where partial UTF-16 data might be seen is not supported; a string split between UTF-16 surrogates would be corrupted (courtesy of the IDL binding for USVString; nothing explicit in the API definition).
We had support for this in earlier iterations of the API, but we moved to USVStrings, and thus a stream option became a no-op and was dropped.
I don't have a use case for this, but if anyone does we could reconsider. cc: @hsivonen since he is actively poking at the spec and should be aware of this limitation.
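A sketch of the corruption in question, splitting a character between its UTF-16 surrogates before encoding each half:

```javascript
// "💩" (U+1F4A9) is a surrogate pair in UTF-16; encode it whole and in halves.
const encoder = new TextEncoder();
const s = '\u{1F4A9}'; // two UTF-16 code units

const whole = encoder.encode(s);         // F0 9F 92 A9
const firstHalf = encoder.encode(s[0]);  // lone high surrogate
const secondHalf = encoder.encode(s[1]); // lone low surrogate

// The USVString conversion replaces each lone surrogate with U+FFFD
// (EF BF BD), so the two halves no longer concatenate to the original bytes.
console.log([...whole], [...firstHalf], [...secondHalf]);
```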
GB18030-2005 already has a one-to-one mapping between Unicode and GB18030, except for the 14 characters that were still mapped into the Unicode PUA as of 2005.
Nowadays, all 14 of those characters have corresponding non-PUA mappings in Unicode,
so I suggest the Encoding Standard map those characters to regular Unicode characters instead of PUA characters.
The following 80 characters are the GBK characters that were ever mapped to the Unicode PUA, together with
the corresponding non-PUA Unicode character:
Han Character GBK Unicode PUA Unicode non-PUA
FE50 E815 2E81
FE51 E816 20087
FE52 E817 20089
FE53 E818 200CC
FE54 E819 2E84
FE55 E81A 3473
FE56 E81B 3447
FE57 E81C 2E88
FE58 E81D 2E8B
FE59 E81E 9FB4
FE5A E81F 359E
FE5B E820 361A
FE5C E821 360E
FE5D E822 2E8C
FE5E E823 2E97
FE5F E824 396E
FE60 E825 3918
FE61 E826 9FB5
FE62 E827 39CF
FE63 E828 39DF
FE64 E829 3A73
FE65 E82A 39D0
FE66 E82B 9FB6
FE67 E82C 9FB7
FE68 E82D 3B4E
FE69 E82E 3C6E
FE6A E82F 3CE0
FE6B E830 2EA7
FE6C E831 215D7
FE6D E832 9FB8
FE6E E833 2EAA
FE6F E834 4056
FE70 E835 415F
FE71 E836 2EAE
FE72 E837 4337
FE73 E838 2EB3
FE74 E839 2EB6
FE75 E83A 2EB7
FE76 E83B 2298F
FE77 E83C 43B1
FE78 E83D 43AC
FE79 E83E 2EBB
FE7A E83F 43DD
FE7B E840 44D6
FE7C E841 4661
FE7D E842 464C
FE7E E843 9FB9
FE80 E844 4723
FE81 E845 4729
FE82 E846 477C
FE83 E847 478D
FE84 E848 2ECA
FE85 E849 4947
FE86 E84A 497A
FE87 E84B 497D
FE88 E84C 4982
FE89 E84D 4983
FE8A E84E 4985
FE8B E84F 4986
FE8C E850 499F
FE8D E851 499B
FE8E E852 49B7
FE8F E853 49B6
FE90 E854 9FBA
FE91 E855 241FE
FE92 E856 4CA3
FE93 E857 4C9F
FE94 E858 4CA0
FE95 E859 4CA1
FE96 E85A 4C77
FE97 E85B 4CA2
FE98 E85C 4D13
FE99 E85D 4D14
FE9A E85E 4D15
FE9B E85F 4D16
FE9C E860 4D17
FE9D E861 4D18
FE9E E862 4D19
FE9F E863 4DAE
FEA0 E864 9FBB
The following 14 characters are the GB18030-2005 characters that are still mapped to the Unicode PUA.
I suggest the Encoding Standard map those characters to non-PUA Unicode, because we have no need
to wait for GB18030 to update its spec just for these 14 characters, and we can be sure their
corresponding non-PUA Unicode characters are already decided.
Han Character GBK Unicode PUA Unicode non-PUA
FE51 E816 20087
FE52 E817 20089
FE53 E818 200CC
FE59 E81E 9FB4
FE61 E826 9FB5
FE66 E82B 9FB6
FE67 E82C 9FB7
FE6C E831 215D7
FE6D E832 9FB8
FE76 E83B 2298F
FE7E E843 9FB9
FE90 E854 9FBA
FE91 E855 241FE
FEA0 E864 9FBB
With these mappings, we can decode all strings in the GBK encoding family to non-PUA Unicode.
Beyond that, we still need to convert all the historical Unicode PUA characters
to the proper GBK (GB18030) characters.
Please clean up the math for 4-byte sequences in the gb18030 decoder to perform the three multiplications by constant expressions and then adding the results. This both helps readability and reduces data dependencies if the spec text is transcribed to code.
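A sketch of what the cleaned-up decoder math could look like (my phrasing, not spec text): three multiplications by constant expressions, then a single sum:

```javascript
// gb18030 4-byte pointer: second and fourth bytes span 10 values (0x30-0x39),
// the third spans 126 (0x81-0xFE), giving constant weights 12600, 1260, 10, 1.
function gb18030FourBytePointer(b1, b2, b3, b4) {
  return (b1 - 0x81) * (10 * 126 * 10) +
         (b2 - 0x30) * (10 * 126) +
         (b3 - 0x81) * 10 +
         (b4 - 0x30);
}

// 0x81 0x30 0x81 0x30 is the lowest four-byte sequence, pointer 0;
// bumping the last byte bumps the pointer by one.
console.log(gb18030FourBytePointer(0x81, 0x30, 0x81, 0x30)); // 0
console.log(gb18030FourBytePointer(0x81, 0x30, 0x81, 0x31)); // 1
```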
Please clean up the math for 4-byte sequences in the gb18030 encoder to use the % operator with constant expressions on the right-hand side. This helps readability, reduces data dependencies if the spec text is transcribed to code, and reduces operations when / and % with the same operands can be compiled into a single instruction.
In the EUC-JP and Shift_JIS decoders, there's this math expression: "0xFF61 + byte − 0xA1".
Please rearrange it as "0xFF61 − 0xA1 + byte" to make it easy to get correct results without the implementor rearranging the expression.
In a programming language that's aware of integer overflow to the extent that the compiler doesn't perform constant folding in a way that would change whether an expression has integer overflow, the former doesn't allow constant folding but the latter does.
Furthermore, if the math happens to be done at 16-bit precision, for example as an artifact of trying to compute a UTF-16 code unit, the expression in the spec actually overflows for the relevant values of byte. (Which doesn't matter in e.g. Java but does matter in e.g. debug-mode Rust.)
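A sketch of the rearranged expression for the half-width katakana range (byte in 0xA1 to 0xDF, inclusive); the constant part can now be folded ahead of time:

```javascript
// 0xFF61 - 0xA1 folds to the constant 0xFEC0; adding byte afterwards stays
// within range, with no intermediate overflow risk at 16-bit precision.
const HALFWIDTH_BASE = 0xFF61 - 0xA1; // 0xFEC0

function halfwidthKatakana(byte) {
  return String.fromCharCode(HALFWIDTH_BASE + byte);
}

console.log(halfwidthKatakana(0xA1)); // ｡ U+FF61
console.log(halfwidthKatakana(0xB1)); // ｱ U+FF71
```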
Results for a series of tests for gb18030 encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#gb18030
The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3195
The tests check whether:
The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.
Notes:
Can we please investigate the failures to ascertain whether:
The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/
In 929a3ff#commitcomment-15013806 @domenic points out that some nesting of floor() might be needed to preserve the original integer division semantics. I haven't been able to find a case where it gives a different outcome, however.
It's pretty sad that this specification chose to invent a new case (the lower case) for canonical names when a historical casing (the IANA case implemented by WebKit/Blink and, except for gbk, by Gecko) already existed.
Imagine that you are writing a library for this stuff. You can't compute the DOM name from the name in this specification without hard-coding knowledge of the DOM names. However, you could compute the name in this specification by ASCII-lower-casing the DOM name.
Is there a good reason not to make the .encoding attribute of TextEncoder and TextDecoder return the name (i.e. now the mixed-case IANA/DOM name) ASCII-lowercased?
The Preface claims:
"… this specification … defines … the utf-8 encoding."
Isn't that formalised in ISO/IEC 10646:2014 and Unicode? I'm not suggesting that the contents of this document aren't useful, just that they don't define utf-8.
I suggest resolving this issue by:
AFAICT, non-UTF-8 encoders are exposed in three places in the Web Platform:
Since URL parsing and form submission use UTF-8 when the encoding of the document is either of the UTF-16 encodings, AFAICT TextEncoder is the only place where the Web Platform exposes UTF-16 encoder functionality.
This is pretty sad if TextEncoder supports UTF-16 just because it can and the support doesn't have strong use cases.
Am I wrong? Do proper use cases exist for UTF-16 in TextEncoder to a degree that sets UTF-16 apart from the other legacy encodings that aren't supported in TextEncoder?
Exposed=Window,Worker should be Exposed=(Window,Worker).
According to:
https://lists.w3.org/Archives/Public/public-whatwg-archive/2012Apr/0095.html
the decoding of F9FE differs between vendors, and the WHATWG Encoding Standard chose F9FE → U+FFED. Is that the correct behavior?
From the vendors' point of view, U+2593 looks like the better option:
> F9FE =>
> opera-hk: U+FFED
> firefox: U+2593
> chrome: U+2593
> firefox-hk: U+2593
> opera: U+2593
> chrome-hk: U+FFED
> internetexplorer: U+2593
> windows-os: U+2593
> hkscs-2008: <U+FFED>
Maybe this was discussed before, but I couldn't find a bug on this. What do you think of treating UTF-7 the same way as ISO-2022-{KR,CN}, HZ-GB, etc?
When decoding, the whole input is replaced by U+FFFD. When encoding, use UTF-8.
Background: Blink began to use the Compact Encoding Detector (google/compact_enc_det) when no encoding label is found (http, meta). When 7-bit encoding detection is on, it detects ISO-2022-{KR,CN}, HZ-GB AND UTF-7 in addition to ISO-2022-JP. 7-bit encoding detection is ON for ISO-2022-JP, but we want to suppress the other 7-bit encodings. I think the best way to 'suppress' (unsupport) them is to turn the whole input into U+FFFD.
Hi, I appreciate the enforcement of UTF-family encodings in TextEncoder, but the "legacy" ones might be needed in some situations. I created this thread to demonstrate one and ask your opinion.
I have come across a problem which probably cannot be solved without "legacy" encoding support: byte counting of a text in its original encoding. I need to count the bytes of a text in <textarea> as if it were in windows-1250, koi8-u, etc. Do you happen to know how to count it, please?
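One possible sketch, assuming a single-byte target encoding whose label the engine's TextDecoder supports: build a reverse map by decoding each high byte, then count one byte per representable character. How unrepresentable characters are counted is an assumption here (this sketch uses a form-submission-style &#...; escape); it does not generalize to multi-byte encodings:

```javascript
// Count the bytes a string would occupy in a single-byte legacy encoding,
// using TextDecoder to discover which characters the encoding can represent.
function byteCounter(label) {
  const decoder = new TextDecoder(label);
  const mappable = new Set();
  for (let b = 0x80; b <= 0xFF; b++) {
    mappable.add(decoder.decode(new Uint8Array([b])));
  }
  mappable.delete('\uFFFD'); // unmapped bytes decode to U+FFFD
  return (text) => {
    let bytes = 0;
    for (const ch of text) {
      if (ch.codePointAt(0) <= 0x7F || mappable.has(ch)) {
        bytes += 1; // every representable character is exactly one byte
      } else {
        // Assumption: unrepresentable characters are escaped the way the
        // spec's "html" encoder error mode would do it, as &#N; references.
        bytes += `&#${ch.codePointAt(0)};`.length;
      }
    }
    return bytes;
  };
}

const countWin1250 = byteCounter('windows-1250');
console.log(countWin1250('příliš')); // 6, one byte per character
```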
Please add a note to "index Shift_JIS pointer" saying that there are duplicate code points in the index, so the exclusion of the earlier ones causes the later ones to be used.
Hello,
BRF is a charset for encoding braille. http://brl.thefreecat.org/test.php is an example of a text file encoded in the BRF charset, holding 3 braille patterns. BRF is the standard way of providing documents ready for braille embossing; all official documents available on the web ready for embossing use it (books, courses, tax forms, income declarations, etc.), as required per Section 508 in the US, for instance. There is currently no other really-used standard way of shipping them using UTF-8 (the PEF format is still at a very early stage); you will never see BRF documents encoded in UTF-8.
For now, browsers ignore the "charset=brf" content-type qualifier and show the file as if it were ASCII, i.e. they print "A B C". They should recognize the BRF charset like other charsets (and, for instance, use iconv to convert it to Unicode, and then display it just like any Unicode text file), and thus print "⠁⠀⠃⠀⠉" instead of "A B C".
The BRF format defines bytes 0x00-0x1F as 1-to-1 equivalents to ascii 0x00-0x1F, and 0x20-0x5f as equivalents of 6-dot braille patterns of U+2800-U+283f.
BRF got added to IANA's list of charsets in 2006, see http://www.iana.org/assignments/charset-reg/BRF
BRF got added to glibc's iconv around the same period, see https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=localedata/charmaps/BRF;hb=HEAD
Regards,
Samuel
I've written "The name of an encoding is also one of its labels, except in the case of the replacement encoding whose name is not one of its labels." inconveniently many times when either documenting code that implements the Encoding Standard or when otherwise explaining the concepts.
Also, I've written code that does something like "if input string is 'replacement', don't run get an encoding, otherwise, run get an encoding" when working with interfaces that were designed (four links there, but GitHub's styling makes it unobvious) before there was clarity of what strings are labels and what strings are names used where an enum value or a reference to a singleton object representing an encoding would be more appropriate software design (when those interfaces potentially have callers in add-ons that I can't fix).
All this would become simpler if get an encoding for the name of an encoding always returned the encoding itself. However, it's kinda sad to expose another Web-exposed label just to make implementing and explaining stuff easier, so I'm not sure if I should request this.
But at least this deserves some discussion.
#38 changed how the lead byte is processed. It removed "Then (byte is in the range 0xC2 to 0xF4, inclusive) set UTF-8 code point to UTF-8 code point << (6 × UTF-8 bytes needed) and return continue."
The first two and final switch branches (0x00 to 0x7F, 0xC2 to 0xDF, and Otherwise) have returns, but the remaining two (0xE0 to 0xEF, 0xF0 to 0xF4) do not, so they fall through to the trail-byte processing.
Looks like they just need additional "Return continue" substeps added.
Results for a series of tests for EUC-KR encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#euckr
The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3201
The tests check whether:
The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.
Notes:
Can we please investigate the failures to ascertain whether:
The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/
The Big5 encoder first does an index lookup and then discards the code point as an error if the Big5 lead for the pointer is less than 0xA1. This makes the encoder discard code points that have two mappings: one whose Big5 lead is less than 0xA1 and another whose Big5 lead is greater or equal to 0xA1.
The following code points have such double pointer mappings:
7BB8
7C06
7CCE
7DD2
7E1D
8005
8028
83C1
84A8
840F
89A6
8D77
90FD
92B9
96B6
975C
97FF
9F16
5159
515B
515D
515E
7479
6D67
799B
9097
5B28
732A
7201
77D7
7E87
99D6
91D4
60DE
6FB6
8F36
4FBB
71DF
9104
9DF0
83CF
5C10
79E3
5A67
8F0B
7B51
62D0
5605
5ED0
6062
75F9
6C4A
9B2E
50ED
62CE
60A4
7162
When testing Gecko's old Big5 encoder, at least the first of these is a non-error: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/3610 (test case violates the Same Origin Policy in Blink due to different treatment of data: origins). That is, Gecko's old encoder encodes U+7BB8 as 0xBA, 0xE6.
I believe that instead of checking whether lead is less than 0xA1 after the lead computation, the spec should say that when looking up pointers from the index when encoding, pointers below (0xA1 - 0x81) * 157 should be ignored, i.e. search the index from pointer (0xA1 - 0x81) * 157 onwards.
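A sketch of the proposed lookup rule (the index layout here is a toy model, not the spec's data structure): start the search at pointer (0xA1 − 0x81) × 157 so pointers with sub-0xA1 leads are never found in the first place:

```javascript
// Proposed rule: when encoding, search the Big5 index only from the first
// pointer whose lead byte is 0xA1, i.e. (0xA1 - 0x81) * 157.
const BIG5_ENCODE_START = (0xA1 - 0x81) * 157; // = 5024

// Toy model: the index as a sparse array mapping pointer -> code point.
function big5EncodePointer(index, codePoint) {
  for (let pointer = BIG5_ENCODE_START; pointer < index.length; pointer++) {
    if (index[pointer] === codePoint) return pointer;
  }
  return null;
}

// U+7BB8 is one of the doubly-mapped code points listed above; give it a
// hypothetical low (skipped) pointer and a hypothetical regular one.
const toyIndex = [];
toyIndex[100] = 0x7BB8;
toyIndex[6000] = 0x7BB8;
console.log(big5EncodePointer(toyIndex, 0x7BB8)); // 6000
```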
See http://logs.glob.uno/?c=freenode%23whatwg&s=18%20Apr%202016&e=18%20Apr%202016#c992642 for a clever one by @hsivonen. We could add a note somewhere or even an implementation considerations section.
Was talking to @bsittler about using TextDecoder and he mentioned that he needs to be able to serialize the current decoding state to IndexedDB in order to survive e.g. service worker upgrades. TextDecoder definitely has internal state (a few bytes left in its internal "stream" variable, basically) which can get lost in this manner. It would be great to expose this, and allow construction of a new TextDecoder primed with that state.