GithubHelp home page GithubHelp logo

Regression vs legacy conhost: different behavior from ReadConsoleOutputCharacterW when surrogate pair(s) are present. about terminal HOT 12 CLOSED

chrisant996 avatar chrisant996 commented on May 26, 2024
Regression vs legacy conhost: different behavior from ReadConsoleOutputCharacterW when surrogate pair(s) are present.

from terminal.

Comments (12)

zadjii-msft avatar zadjii-msft commented on May 26, 2024 2

throwaway thought before I leave for lunch: #10810 sounds a lot like what you want

from terminal.

lhecker avatar lhecker commented on May 26, 2024 1

The regression is not intentional and unrelated to the intentional ones in #13339, but it is in the neighborhood. I think it's caused by #13626 which added proper support for surrogate pairs but changed the behavior of TextBufferCellIterator. It now iterates in columns instead of characters and this breaks single-cell surrogate pairs. The fact that this ever worked is mostly a coincidence because the iterator code is based on UCS2 and not UTF16, but still somewhat dealt with surrogate pairs (albeit poorly).

This internal regression causes the buffer to be filled with (for instance) 121 chars in a 120 column terminal, but you're only giving a 120 char buffer in your repro. The following code then drops the returned text and still returns S_OK. It has existed long before my changes to support surrogate pairs:

// Only copy if the whole result will fit.
if (chars.size() <= buffer.size())
{
std::copy(chars.cbegin(), chars.cend(), buffer.begin());
written = chars.size();
}

I've had a brief glance at the original conhost v1 code and it doesn't seem like this behavior existed there, so I believe this is another regression that was introduced in the original conhost v2 release.

I'm submitting a PR to fix this issue soon. It'll then correctly count the number of characters again.

from terminal.

chrisant996 avatar chrisant996 commented on May 26, 2024 1

@lhecker that all makes sense.

I can confirm that for my own purposes, the proposed change to RCOCW() would not cause any negative side effects in any of my projects (not even for users in locales where surrogate pairs are common). And specifically, Clink will not encounter problems from such a change. Because none of my projects do anything more than look for empty lines, anymore (well, Clink can "scrape" the screen for text to use in completions, but it's ok for that to miss some things).



Something to consider for the future:

It would be fantastic to have some way to save/restore the terminal screen (or a region within the terminal screen), and preserve arbitrarily complex text and colors and styles. I don't care whether it's an escape code or an API, and it would be perfectly fine for the saved state to be opaque (e.g. if an API filled an out-param buffer with some opaque payload that only the API knows how to unpack and restore -- though of course someone would try to "crack" the payload, and the API would need to guard against fuzzing, etc).

The Alternate Screen Buffer private mode codes (e.g. CSI ? 1049 h and CSI ? 1049 l) are almost suitable, except that there's no way to copy the Normal Screen Buffer to the Alternate Screen Buffer, so while the save/restore part works they also clear the Alternate Screen Buffer and thus are useless for showing a "popup" or "overlay" over the existing terminal screen.

FWIW, I stopped using ReadConsoleOutputW once it couldn't preserve colors and styles beyond the 4-bit VGA colors. But, I strongly wish there were a way to save and restore either the whole screen or a region of the screen (including colors, styles, and full Unicode text). The problem with the guidance "apps should keep track of what they print" is that sometimes an app wants to show a temporary popup window in a region of the screen that was printed by some other app. PowerShell's progress bars is one example, the Azure CLI installer is another example (but maybe it's just another PowerShell progress bar, I'm not sure), and there are several places in my own projects where I want to save/restore a region of the screen, but no longer can. In the case of Clink, it scrolls the screen to make room for the popup at the bottom of the screen where it doesn't overlay anything else -- it's functional, but it yields an unpolished and less pleasant experience.

from terminal.

lhecker avatar lhecker commented on May 26, 2024 1

For better or worse, the question about whether to use U+FFFD or not has been answered...
The DBCS implementation has some fairly exhaustive unit tests and those tests are failing now if I properly support narrow-cell surrogate pairs. The unit tests make is very clear that the nLength parameter in ReadConsoleOutputCharacterW is intended as a column count.

I'll make it return U+FFFD for now as that was my plan for the CHAR_INFO API anyways, which have the same restriction (CHAR_INFO is explicitly per-column/row after all).

from terminal.

zadjii-msft avatar zadjii-msft commented on May 26, 2024

I'm thinking this is in the neighborhood of #13339

from terminal.

chrisant996 avatar chrisant996 commented on May 26, 2024

I could potentially understand cutting the ability to read such lines. Although Windows has always been able to access the console buffer, *nix has never been able to.

At a minimum, though, the API should return failure (false) when it fails. The main problem in the case at hand wasn't even the failure to read the console text; the main issue was the API reported that it succeeded but it had actually failed.

Returning failure would have avoided the real world usage problem linked above, since the code was already actively checking for failures. Ideally the API would succeed and behave how it did in legacy conhost -- but if it won't, then returning that there was a failure lets the app handle fallback appropriately.

from terminal.

chrisant996 avatar chrisant996 commented on May 26, 2024

@lhecker ah ok that makes sense.

Thank you for working on a fix quickly!




It sounds like you're aware of everything I'm about to add, but for completeness, I'll still call these out overtly:

The trouble with 120 vs 121, is that the API signature uses the nLength param to mean two different things:

  1. The size of the buffer to receive the read content.
  2. The number of cells to read.

So, if the screen width is 120 and a caller passes 121, then on a line with no surrogate pairs the 121st character will be read from the next line. And that requires a caller to post-process the results to figure out how much of the returned content corresponds to the target line.

A caller that has a robust Unicode iterator could ask for e.g. 2 * the screen width and then parse the returned content, measuring widths (e.g. with the wcwidth functions) until they exceed the width of the screen. But in a line that contains zero width joiners (or emoji qualifiers or etc), it could require much more than 2 * the screen width. (And my projects do have a robust Unicode iterator, so anywhere they need the full actual content, they can indeed use iterative calls -- but for the most part they just want to know if a line is empty, i.e. contains only spaces.)

I recognize that it's problematic to fit Unicode into the console APIs whose original signatures aren't detailed enough to accommodate surrogates and joiners (or anything else beyond UCS2). I really like the balanced compromises that the API has made for that over the years. For example, a caller can iteratively call and analyze the returned output and piece together an accurate representation over multiple calls (or a single extra-large calls and then parse the results), regardless of how many WCHARs are required to fetch a full screen line of text. That's pretty cool, albeit complex, and gives callers the ability to operate to whatever degree of "correctness" or "thoroughness" that they wish to.

Thanks again for the speedy response and follow-up!

from terminal.

lhecker avatar lhecker commented on May 26, 2024

I've also been wondering if we should just replace surrogate pairs with U+FFFD to make this behave more consistent. While I also agree with your point ("gives callers the ability to [...]"), the support for surrogate pairs makes this API different from the CHAR_INFO APIs which don't ever support them.

from terminal.

chrisant996 avatar chrisant996 commented on May 26, 2024

I've also been wondering if we should just replace surrogate pairs with U+FFFD to make this behave more consistent. While I also agree with your point ("gives callers the ability to [...]"), the support for surrogate pairs makes this API different from the CHAR_INFO APIs which don't ever support them.

Have there been complaints about the existing behavior over the past 20+ years that it's behaved the current way?

It would be a breaking change. Personally, I'm very happy about how it's been working for years.

But since English is my primary language, surrogate pairs are an outlier for me and only appear when certain nerdfont icons or emoji are present. It might be useful to gather opinions from software developers for whom surrogate pairs are common in normal language text.

If there are neither complaints nor feedback, then I would lean towards preserving compatibility with previous behavior.

from terminal.

chrisant996 avatar chrisant996 commented on May 26, 2024

@lhecker also, if surrogate pairs would be stripped, then I assume the intent would also be to strip zero width joiners, and all combining marks, right? There's a lot of room for complications and impactful side effects.

from terminal.

lhecker avatar lhecker commented on May 26, 2024

Have there been complaints about the existing behavior over the past 20+ years that it's behaved the current way?

It would be a breaking change. [...]

That's true, sort of. The console so far was (in pseudocode terms) a TCHAR[height][width] buffer for the longest time. Even when TCHAR got replaced with WCHAR in conhost v2, surrogate pair support was rudimentary and only worked properly in a few select spots. The underlying buffer was still a rectangular matrix.

The reason your repro case worked at all is because that broken UTF16 implementation coincidentally counted the narrow cell with a surrogate pair as a wide cell and thus the iterator stepped over both the leading and trailing surrogate pair. In other words, before I "fixed" it, it actually thought that the row had 121 columns (as an example), even though the buffer is only 120 wide. This caused a ton of issues and regressions elsewhere.

So, while it's definitely a breaking change by definition (= change in behavior causing failures in other components), I don't believe that the exact contents that can be read via these APIs are properly specced out yet and never were. Sure, ReadConsoleOutputCharacterW happened to work coincidentally, but what about the other Win32 APIs and VT sequences?

But let me get to a much more concerning point...

@lhecker also, if surrogate pairs would be stripped, then I assume the intent would also be to strip zero width joiners, and all combining marks, right? There's a lot of room for complications and impactful side effects.

This is exactly my biggest concern right now. I'm planning to add proper Unicode support with extended grapheme clusters in the near term. That is, support for ZWJs, VS15/16, combining marks, and so on. I'm rather worried that this will make ReadConsoleOutputCharacterW behave extremely erratic, in particular in respect to the CHAR_INFO APIs which must return U+FFFD for such complex graphemes.

Thoughts?

from terminal.

chrisant996 avatar chrisant996 commented on May 26, 2024

throwaway thought before I leave for lunch: #10810 sounds a lot like what you want

Yes @zadjii-msft that'd be a great way for save/restore to work (i.e. to use pages and copy back and forth between pages, which I think are essentially a collection of alternate screen buffers).

It's currently NYI, though, right?

And yes, Far is an excellent example of an app that relies on accurately saving/restore content printed by other apps. Any solution that's viable for Far would meet even all of the "stretch" goals that I've ever had for what I would need from a save/restore mechanism.

from terminal.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.