Unicode characters are not handled properly. (lintr issue, closed, 6 comments)

r-lib commented on September 18, 2024

unicode characters are not handled properly.

from lintr.

Comments (6)

jimhester commented on September 18, 2024

This is due to substr not working with them correctly. I need to figure out how to handle multi-byte characters correctly in R.

daroczig commented on September 18, 2024

Can you please share some details on this bug?

jimhester commented on September 18, 2024

Sure. If you use any multibyte Unicode characters in your source file, all of the column counts after them are thrown off. This happens because the string functions assume each character is only one byte. I need to learn how to handle multibyte Unicode characters properly in R.

æ = 1

# bad.R:1:4: style: Use <-, not =, for assignment.
# æ = 1
#    ^

daroczig commented on September 18, 2024

But this is lurking in getParseData rather than in substr, no?

> getParseData(parse('/tmp/bad.R'))
   line1 col1 line2 col2 id parent     token terminal text
1      1    1     1    2  1      3    SYMBOL     TRUE    æ
3      1    1     1    2  3      0      expr    FALSE     
2      1    4     1    4  2      0 EQ_ASSIGN     TRUE    =
4      1    6     1    6  4      5 NUM_CONST     TRUE    1
5      1    6     1    6  5      0      expr    FALSE     
11     2    1     2    1 11     13    SYMBOL     TRUE    a
13     2    1     2    1 13      0      expr    FALSE     
12     2    3     2    3 12      0 EQ_ASSIGN     TRUE    =
14     2    5     2    5 14     15 NUM_CONST     TRUE    1
15     2    5     2    5 15      0      expr    FALSE

Where bad.R includes:

æ = 1
a = 1

In some of my packages, I identify the real width of multi-byte characters by checking nchar, which also helped with CJK characters such as 乂.
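As a small illustration of the nchar idea, the sketch below compares the byte and character counts that base R reports for an ASCII letter, æ, and 乂. The variable name x is only for illustration, and the display width can depend on the platform and locale, so no expected value is given for it.

```r
# Illustrative strings: ASCII "a", "æ" (U+00E6), and the CJK character "乂" (U+4E42).
# The \u escapes produce UTF-8 encoded strings regardless of the session locale.
x <- c("a", "\u00e6", "\u4e42")

nchar(x, type = "bytes")  # 1 2 3 (UTF-8 storage size)
nchar(x, type = "chars")  # 1 1 1 (one character each)
nchar(x, type = "width")  # display columns; CJK characters are usually wider
```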

But back on topic: if core R does not handle this (getParseData reports columns in bytes, not characters), a workaround might be to compute nchar for each SYMBOL etc. and adjust the col1 and col2 values accordingly.
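A minimal sketch of that kind of adjustment, assuming the columns are 1-based byte offsets as in the output above. The helper byte_col_to_char_col is hypothetical and is not lintr's actual fix; it only shows how nchar can map a byte column back to a character column on a single source line.

```r
# Hypothetical helper: map 1-based byte columns on one line to character columns.
byte_col_to_char_col <- function(line, byte_col) {
  line <- enc2utf8(line)
  # Byte offset at which each character of the line ends.
  ends <- cumsum(nchar(strsplit(line, "")[[1]], type = "bytes"))
  # A byte column belongs to the first character whose end offset reaches it.
  vapply(byte_col, function(b) which(ends >= b)[1], integer(1))
}

line <- "\u00e6 = 1"                    # the first line of bad.R
byte_col_to_char_col(line, c(1, 4, 6))  # 1 3 5
```

With the byte columns 1, 4 and 6 reported for æ, = and 1, this places = at character column 3, which is where the lint marker should point.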

jimhester commented on September 18, 2024

You are correct about the error being in getParseData, not substr. Your nchar workaround seems like a good solution to me, thanks for the suggestion!

jimhester commented on September 18, 2024

I think I have written a function that fixes this. If you run into any problems with multi-byte characters, let me know. Thank you for your nchar suggestion; that was exactly what I needed.
