Currently, smlfmt reports an error on non-ASCII input. Example f

Support mlb option "allowExtendedTextConsts true" (smlfmt, closed)

shwestrick commented on August 17, 2024

Support mlb option "allowExtendedTextConsts true"

Comments (4)

UltimatePea commented on August 17, 2024

Thanks for the info!

I am not very familiar with UTF8/Unicode, but I would suggest we at least fix the lexer to not produce an error when encountering a UTF8 character.

I am not so familiar with the difference between semantic position and visual position, so I vote for whatever is easier to implement, which is probably UTF8 semantic position.

shwestrick commented on August 17, 2024

By the way, what is the accepted standard practice these days for visually handling "characters" that are made up of more than one Unicode code point? E.g., the flag emoji "🇺🇸" is actually two code points ("🇺" followed by "🇸"), each of which takes four bytes in UTF8. But of course, it is intended to be visually represented as a single character.
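
To make the flag example concrete, here is a small Python sketch (not smlfmt code; the variable names are only for illustration) that counts the same emoji three ways: as code points, as UTF-8 bytes, and as bytes per code point.

```python
# The flag emoji is a pair of regional-indicator code points.
flag = "\U0001F1FA\U0001F1F8"  # U+1F1FA followed by U+1F1F8

# Python's len() on a str counts code points, not bytes or glyphs.
code_points = [f"U+{ord(ch):04X}" for ch in flag]
num_code_points = len(flag)

# Encoding to UTF-8 shows how many bytes a byte-offset column would count.
utf8_bytes = flag.encode("utf-8")
num_bytes = len(utf8_bytes)
bytes_per_code_point = [len(ch.encode("utf-8")) for ch in flag]
```

A column counter based on bytes, code points, or rendered glyphs would therefore advance by 8, 2, or 1 for this one emoji.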

My initial thought is that this is important for smlfmt because we need to know positions to vertically align things correctly. Do we use the UTF8 semantic position, or the intended visual position? I'm inclined to use UTF8 semantic position...

shwestrick commented on August 17, 2024

Supporting this won't be too bad, but will require changes in a few places.

For the lexer, we'll need to skip over UTF8 characters in the function advance_oneCharOrEscapeSequenceInString. Note that this function already skips over escape sequences; handling UTF8 should be similar. And then we can selectively enable this functionality by adding an additional flag to the lexer functions Lexer.next and Lexer.tokens.
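
As a rough illustration of the lexer change described above, here is a Python sketch, not the actual SML code in smlfmt. The names `utf8_sequence_length`, `advance_one_char`, and `allow_extended` are hypothetical, the escape handling is deliberately simplified to "backslash plus one byte", and the UTF-8 check only looks at the lead byte and the continuation-byte pattern.

```python
def utf8_sequence_length(lead: int) -> int:
    """Byte length of a UTF-8 sequence judged from its lead byte; 0 if not a valid lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or a byte that never starts a sequence


def advance_one_char(src: bytes, i: int, allow_extended: bool) -> int:
    """Return the index just past the character or escape sequence starting at i."""
    b = src[i]
    if b == 0x5C:  # backslash; simplified assumption: an escape is exactly two bytes
        if i + 1 >= len(src):
            raise ValueError("unterminated escape sequence")
        return i + 2
    if b < 0x80:
        return i + 1
    if not allow_extended:
        raise ValueError(f"non-ASCII byte at offset {i}")
    n = utf8_sequence_length(b)
    if n == 0 or i + n > len(src):
        raise ValueError(f"malformed UTF-8 sequence at offset {i}")
    # Every byte after the lead must look like 10xxxxxx.
    if any((src[j] & 0xC0) != 0x80 for j in range(i + 1, i + n)):
        raise ValueError(f"malformed UTF-8 sequence at offset {i}")
    return i + n
```

The flag argument mirrors the idea of threading an extra option through the lexer entry points: with it off, the behavior stays ASCII-only.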

We'll need to update the implementation of Source, too, such as Source.absoluteStart which returns the position (line and col) of a source file segment. Currently these are computed via byte offsets, which is no longer correct under UTF8. I believe other functions will need to be updated, too, to ensure that a Source.t never starts or ends in the middle of a UTF8 sequence.
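
To show why byte offsets give the wrong column, here is a Python sketch of a position lookup that counts code points instead of bytes and refuses offsets that fall inside a multi-byte sequence. It is only an illustration under my own assumptions; `line_col` and `is_continuation` are hypothetical helpers, not part of smlfmt's Source module.

```python
def is_continuation(b: int) -> bool:
    """True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx."""
    return (b & 0xC0) == 0x80


def line_col(src: bytes, offset: int):
    """1-based (line, col) of byte `offset`, with col counted in code points."""
    if offset < len(src) and is_continuation(src[offset]):
        raise ValueError("offset falls in the middle of a UTF-8 sequence")
    # Start of the current line is just past the previous newline (or 0).
    line_start = src.rfind(b"\n", 0, offset) + 1
    line = src.count(b"\n", 0, offset) + 1
    # Each code point contributes exactly one non-continuation byte.
    col = 1 + sum(1 for b in src[line_start:offset] if not is_continuation(b))
    return line, col
```

For the source `val s = "é"`, the closing quote sits at byte offset 11, so a byte-based column would be 12, while this function reports column 11.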

shwestrick commented on August 17, 2024

It occurred to me that a simpler way to support this is to allow for UTF-8 bytes but not check for validity of a UTF-8 byte sequence. #74 implements this.
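
For readers wondering what "allow bytes but not check validity" amounts to, here is a minimal Python sketch of the idea. This is my own illustration, not the code in #74; the function names are hypothetical, escape sequences and the closing quote are ignored, and the "printable ASCII" range is an assumption about what the ASCII-only mode accepts.

```python
def is_text_const_byte_ok(b: int, allow_extended: bool) -> bool:
    """Decide whether a single byte may appear inside a text constant."""
    if 0x20 <= b <= 0x7E:  # printable ASCII is always accepted
        return True
    # With the option on, any high byte passes; no UTF-8 validation is done.
    return allow_extended and b >= 0x80


def check_string_body(body: bytes, allow_extended: bool) -> bool:
    """True if every byte of the string body is acceptable."""
    return all(is_text_const_byte_ok(b, allow_extended) for b in body)
```

The trade-off is visible in the last case of the usage below: a byte string such as `b"\xff\xfe"` is not valid UTF-8, yet it is accepted when the option is on.

```python
check_string_body("héllo".encode("utf-8"), allow_extended=True)   # accepted
check_string_body("héllo".encode("utf-8"), allow_extended=False)  # rejected
check_string_body(b"\xff\xfe", allow_extended=True)               # accepted, though not valid UTF-8
```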

This is disabled by default. It can be enabled with -allow-extended-text-consts true on the command line, or with the "allowExtendedTextConsts true" annotation within an MLB.

Your example above should now be working. Let me know if you have any trouble!
