Replacement character (`\uFFFD`) in JavaScript/TypeScript causes breakage (typescript issue, CLOSED)

wooorm commented on April 28, 2024

Replacement character (`\uFFFD`) in JavaScript/TypeScript causes breakage

from typescript.

Comments (19)

RyanCavanaugh commented on April 28, 2024

Situation: \ufffd exists because it shouldn't ever appear in a correctly-encoded file on purpose

"We need to put a raw \ufffd in a file so we can test it"

Situation: \ufffd exists in a correctly-encoded file on purpose

https://xkcd.com/927/


RyanCavanaugh commented on April 28, 2024

Just write '\ufffd' instead? Putting the signifier that says "this file was corrupted" into a file, causing a tool for humans to say "this file looks corrupted", seems like correct behavior.
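A minimal sketch of what the escape sequence buys you: the string is identical at runtime, but the source file contains only ASCII and no literal U+FFFD. The variable names are illustrative and not taken from any project in this thread.

```typescript
// "\uFFFD" in source is six ASCII characters; at runtime it is one code unit.
const escaped = "\uFFFD";
const fromCodePoint = String.fromCodePoint(0xfffd);

// A test can still compare against what a lenient decoder produces:
// 0xFF is never valid in UTF-8, so it decodes to a single U+FFFD.
const decoded = new TextDecoder().decode(new Uint8Array([0xff]));

console.log(escaped === fromCodePoint); // true
console.log(escaped.length); // 1
console.log(decoded === escaped); // true
```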


fatcerberus commented on April 28, 2024

I always thought the entire point of the U+FFFD code point was that it's only supposed to be generated by a text decoding failure and has no reason to ever appear literally in the source text. As Ryan says, it's specifically designated as an indicator of corrupted text.


RyanCavanaugh commented on April 28, 2024

Really the thing we need is for Buffer.toString to say when it inserted a replacement character instead of directly reading one. Without that information there's not a lot we can do that isn't expensive.
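For comparison, the WHATWG `TextDecoder` can be asked to throw on malformed input instead of substituting U+FFFD, which gives roughly the signal described here. This is only a sketch: `decodeStrict` is a hypothetical helper, it says nothing about the performance cost, and it is not how TypeScript reads files.

```typescript
// Hypothetical helper: returns the decoded text, or null when the bytes are
// not valid UTF-8. A literal, validly encoded U+FFFD (EF BF BD) is accepted.
function decodeStrict(bytes: Uint8Array): string | null {
  const decoder = new TextDecoder("utf-8", { fatal: true });
  try {
    return decoder.decode(bytes);
  } catch {
    // With fatal: true, malformed input raises a TypeError.
    return null;
  }
}

const literal = new Uint8Array([0xef, 0xbf, 0xbd]); // U+FFFD written on purpose
const broken = new Uint8Array([0x61, 0xff, 0x62]); // 0xFF is never valid UTF-8

console.log(decodeStrict(literal) === "\uFFFD"); // true
console.log(decodeStrict(broken)); // null
```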


jakebailey commented on April 28, 2024

Since we have a Buffer in sys.readFile (thanks to us dealing with BOMs and so on), we can technically write:

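// Re-encode the decoded text and compare it with the original bytes.
const buf = fs.readFileSync(p);
const asText = buf.toString("utf8");
// A mismatch means toString substituted U+FFFD for invalid input.
const badDecode = Buffer.compare(buf, Buffer.from(asText, "utf8")) !== 0;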

We could do that conditionally based on the presence of U+FFFD, but then we don't have any way to bubble that information up (because readFile returns just a string, and the point would be to not error on parse because the lower-level thing said it was okay).
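The conditional variant described above might look like the sketch below. `decodeWithCheck` and its return shape are assumptions made for illustration; the real `sys.readFile` returns only a string, which is exactly the plumbing problem mentioned.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical helper: decode as UTF-8 and report whether any U+FFFD in the
// result was inserted by the decoder rather than present in the file.
function decodeWithCheck(buf: Buffer): { text: string; badDecode: boolean } {
  const text = buf.toString("utf8");
  // Fast path: without U+FFFD in the output, nothing was substituted.
  if (!text.includes("\uFFFD")) {
    return { text, badDecode: false };
  }
  // Slow path: a validly encoded U+FFFD round-trips to the same bytes,
  // while a substituted one re-encodes to different bytes.
  const badDecode = Buffer.compare(buf, Buffer.from(text, "utf8")) !== 0;
  return { text, badDecode };
}

const onPurpose = Buffer.from([0x61, 0xef, 0xbf, 0xbd]); // "a" + literal U+FFFD
const corrupted = Buffer.from([0x61, 0xff]); // "a" + invalid byte

console.log(decodeWithCheck(onPurpose).badDecode); // false
console.log(decodeWithCheck(corrupted).badDecode); // true
```

The extra encode and compare only run for files that already contain U+FFFD after decoding, so ordinary source files pay for a single `includes` scan.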


wooorm commented on April 28, 2024

"exists because it shouldn't ever appear in a correctly-encoded file on purpose"

I don't think this is really the case. There are many algorithms in the html spec, and markdown spec, that do things that produce the replacement character.

It's like saying NaN shouldn't exist. Of course you don't want NaN normally but it has to exist as a concept. And that also means that you can check for it in code. And in tests. You have to be able to talk about it. The Unicode spec and Wikipedia and html and markdown need to be able to talk about the character.

Also, no other tool does what TS just started doing.

I'd think it's better to check for control characters instead of a replacement character, perhaps before that toString call.


fatcerberus commented on April 28, 2024

"It's like saying NaN shouldn't exist. Of course you don't want NaN normally but it has to exist as a concept. And that also means that you can check for it in code. And in tests. You have to be able to talk about it. The Unicode spec and Wikipedia and html and markdown need to be able to talk about the character."

NaN is actually a really good analogy, but maybe not for the reason you think: there's a big difference between testing for a computation that produces NaNs, as a way to detect bugs, vs. literally writing

let x = NaN;

The minute you assume the latter is allowed to happen in the wild, for any reason, then the former test becomes useless as an error-detection mechanism (this is probably the rationale behind why direct tests against NaN don't work, come to think of it; it discourages writing them literally for any reason). Yes, x is equal to NaN, but it was done on purpose in this case and doesn't indicate an invalid computation. Which is pretty much exactly like the problem we're having here.
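For readers less familiar with the NaN behaviour referred to here, a short illustration (the variable name is arbitrary):

```typescript
// 0 / 0 is an invalid computation and produces NaN.
const computed = 0 / 0;

// Direct equality never succeeds, even when comparing the value to itself.
console.log(computed === computed); // false

// Dedicated checks are needed to detect it.
console.log(Number.isNaN(computed)); // true
console.log(Object.is(computed, NaN)); // true
```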

Now, that having been said, I will admit that testing for binary files by looking for a Unicode decode failure is kind of hacky. But alas, design constraints (specifically, the test is done at a time when all the compiler has access to is the UTF-8 decoded text from the source file).


wooorm commented on April 28, 2024

Buffer#toString resulting in � does not mean that Buffer#toString is the only function that is ever allowed to result in � per Unicode.
Markdown for example, which has to deal with potentially malicious authors, has to make documents safe. So, the function markdown(input) also produces �.
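One concrete case is the CommonMark rule that U+0000 in the input is replaced with U+FFFD. The sketch below shows only that single rule; `replaceNul` is a hypothetical name and not the API of any markdown library.

```typescript
// Hypothetical helper: replace NUL characters with U+FFFD, as CommonMark
// requires for security reasons.
function replaceNul(input: string): string {
  return input.replace(/\u0000/g, "\uFFFD");
}

const out = replaceNul("safe\u0000text");
console.log(out === "safe\uFFFDtext"); // true

// The result is ordinary, valid UTF-8: U+FFFD encodes as EF BF BD.
console.log(Array.from(new TextEncoder().encode("\uFFFD"))); // [ 239, 191, 189 ]
```

Any test that checks this output needs to mention U+FFFD somehow, whether as a literal character or as an escape sequence.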

Here are some more practical examples:

15 hits in WHATWG, https://github.com/search?q=org%3Awhatwg+%22%EF%BF%BD%22&type=code
16 hits in my personal code: https://github.com/search?q=%22%EF%BF%BD%22+user%3Awooorm&type=code

Even if there were never �s in input, I also argue that Buffer#toString (implying Buffer#toString('utf8')) resulting in � does not mean that a file is binary. Whether a buffer is valid UTF-8 is not the same as whether a buffer is binary or not.

Looking on npm for “is binary” and checking out the code, yields:

In the TS code base, there is access to the buffer. And UTF-8 is enforced already. So the bytes can be checked? Not too complex? https://github.com/wayfind/is-utf8/blob/master/is-utf8.js


snarbies commented on April 28, 2024

"(this is probably the rationale behind why direct tests against NaN don't work, come to think of it; it discourages writing them literally for any reason)."

Off-topic, but NaN compares false to NaN because there are multiple bit patterns that can represent it and never equal is much better than sometimes equal.


fatcerberus commented on April 28, 2024

Yes, I'm aware that's the theory - but I tend to suspect there's a more pragmatic rationale behind it too 😅 I don't think many people are pulling NaNs apart to inspect their low-level bit patterns, and if they are that's well outside the purview of the IEEE float spec, so from the perspective of normal FP operations the existence of multiple representations isn't even observable.


snarbies commented on April 28, 2024

¯\_(ツ)_/¯ I guess it can be two things


RyanCavanaugh commented on April 28, 2024

"In the TS code base, there is access to the buffer. And UTF-8 is enforced already. So the bytes can be checked? Not too complex?"

Everything can be fixed; the question is how much slower everyone's tsc should be in order for a handful of test files to not need to use escape sequences.


wooorm commented on April 28, 2024

Indeed! But then I’d move one more step back: why are people passing 700 MB video files through TypeScript? How many folks are doing that 😅

The patch for that already only looks at the first 256 characters. For every file.
There is apparently code for BOMs too.
I’d wager that looking at the first 256 bytes with something like https://github.com/gjtorikian/isBinaryFile/blob/main/src/index.ts, particularly when modified to actually do what the goal is (bail quickly when not UTF-8), won’t be that slow.
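A rough sketch of such a prefix check, assuming the goal is simply "bail quickly when the first bytes are not well-formed UTF-8". `isValidUtf8Prefix` and its 256-byte default are illustrative; this is neither the code in the linked isBinaryFile package nor anything in TypeScript, and it has not been benchmarked.

```typescript
// Hypothetical helper: checks that the first `limit` bytes are well-formed
// UTF-8, rejecting overlong forms, surrogates and values above U+10FFFF.
function isValidUtf8Prefix(bytes: Uint8Array, limit = 256): boolean {
  const end = Math.min(bytes.length, limit);
  const truncated = end < bytes.length; // window ends before the file does
  let i = 0;
  while (i < end) {
    const b0 = bytes[i];
    if (b0 <= 0x7f) { i += 1; continue; } // ASCII
    let needed: number;
    let lo = 0x80; // allowed range for the first continuation byte
    let hi = 0xbf;
    if (b0 >= 0xc2 && b0 <= 0xdf) {
      needed = 1;
    } else if (b0 >= 0xe0 && b0 <= 0xef) {
      needed = 2;
      if (b0 === 0xe0) lo = 0xa0; // no overlong three-byte forms
      if (b0 === 0xed) hi = 0x9f; // no UTF-16 surrogates
    } else if (b0 >= 0xf0 && b0 <= 0xf4) {
      needed = 3;
      if (b0 === 0xf0) lo = 0x90; // no overlong four-byte forms
      if (b0 === 0xf4) hi = 0x8f; // nothing above U+10FFFF
    } else {
      return false; // 0x80..0xC1 and 0xF5..0xFF cannot start a sequence
    }
    for (let k = 1; k <= needed; k++) {
      // A sequence cut off by the window is fine; one cut off by EOF is not.
      if (i + k >= end) return truncated;
      const b = bytes[i + k];
      if (b < lo || b > hi) return false;
      lo = 0x80; // later continuation bytes use the full range
      hi = 0xbf;
    }
    i += needed + 1;
  }
  return true;
}

console.log(isValidUtf8Prefix(new TextEncoder().encode("héllo \uFFFD"))); // true
console.log(isValidUtf8Prefix(new Uint8Array([0x47, 0xff, 0x00]))); // false
```

Note that this only tests UTF-8 well-formedness. A binary file whose first bytes happen to be ASCII, including NUL bytes, would still pass, so a control-character check would be a separate step.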


wooorm commented on April 28, 2024

Looking more into the original issue, with the knowledge that many binary file formats have a “BOM-like” signature in the first few bytes: #21136 (comment)


RyanCavanaugh commented on April 28, 2024

0x47 is "G". I don't know why someone seeing an error when the letter G appears at two random offsets would be less surprised to see "file is binary" than you are when looking at �.


wooorm commented on April 28, 2024

Right. You can still use those two uncommon bytes to go into a slightly slower path, though?


RyanCavanaugh commented on April 28, 2024

We're also not just trying to detect MPEG transport streams. There are other tools, e.g. Expo, which are/were emitting arbitrary binary files into .js file extensions and causing slowdowns because we tried to parse a giant garbage file.


wooorm commented on April 28, 2024

Interesting! I dunno, detecting “arbitrary binary files” made by various tools is just going to be complex I think.

Right now TS throws. I assume that on giant binary files TS was also throwing.
Why not improve that crash with a better message, check whether something appeared binary, and suggest adding an ignore pattern?
Or, when files are 1 MB or larger, do a small byte check then?
Or when � appears with this recent patch.
And also, as @jakebailey mentions, comparing the buffer: #57930 (comment).

I feel like there are a couple of cases where a slightly more thorough check can be done. And I’m not 100% sure that such a check is that slow.


RyanCavanaugh commented on April 28, 2024

PRs accepted

