
Comments (6)

jimhester commented on September 18, 2024

This is due to substr not handling multi-byte characters correctly. I need to figure out how to handle them properly in R.

from lintr.

daroczig commented on September 18, 2024

Can you please share some details on this bug?


jimhester commented on September 18, 2024

Sure. If you use any multibyte Unicode characters in your source file, all of the column counts are thrown off. This happens because the string functions assume each character is only one byte. I need to learn how to handle multibyte Unicode characters properly in R.

æ = 1

# bad.R:1:4: style: Use <-, not =, for assignment.
# æ = 1
#    ^
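To make the byte/character mismatch concrete, here is a small sketch using base R's `nchar`. It is not lintr code; the variable name `src` is only illustrative, and the string is written with the `\u00e6` escape so that it is UTF-8 regardless of locale.

```r
# Illustrative only: `src` stands in for the first line of bad.R.
# "\u00e6" is the escaped form of the character æ.
src <- "\u00e6 = 1"

n_chars <- nchar(src, type = "chars")  # characters a reader sees: 5
n_bytes <- nchar(src, type = "bytes")  # bytes in the UTF-8 encoding: 6

# æ occupies two bytes in UTF-8, so every byte offset after it is one
# larger than the corresponding character column.
n_bytes - n_chars
```

The difference of one is exactly how far the reported column for `=` is shifted in the lint message above.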


daroczig commented on September 18, 2024

But this is lurking in getParseData rather than in substr, no?

> getParseData(parse('/tmp/bad.R'))
   line1 col1 line2 col2 id parent     token terminal text
1      1    1     1    2  1      3    SYMBOL     TRUE    æ
3      1    1     1    2  3      0      expr    FALSE     
2      1    4     1    4  2      0 EQ_ASSIGN     TRUE    =
4      1    6     1    6  4      5 NUM_CONST     TRUE    1
5      1    6     1    6  5      0      expr    FALSE     
11     2    1     2    1 11     13    SYMBOL     TRUE    a
13     2    1     2    1 13      0      expr    FALSE     
12     2    3     2    3 12      0 EQ_ASSIGN     TRUE    =
14     2    5     2    5 14     15 NUM_CONST     TRUE    1
15     2    5     2    5 15      0      expr    FALSE     

Where bad.R includes:

æ = 1
a = 1

In some of my packages, I identify the real width of multi-byte chars by checking nchar, which helped with CJK chars as well, such as .

But to be on-topic: if this is not handled in core R (it returns the column in bytes, not characters), a workaround might be to compute nchar for each SYMBOL etc. and adjust the col1 and col2 values accordingly.
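A minimal sketch of that kind of adjustment, assuming the source line is available as a string and that col1/col2 are byte offsets into it. The helper name `byte_col_to_char_col` is hypothetical; it is not part of lintr or base R, and the fix that eventually went into lintr may look different.

```r
# Hypothetical helper: map byte-based columns (as reported in col1/col2)
# to character-based columns on the same source line.
byte_col_to_char_col <- function(line, byte_col) {
  # Split the line into single characters (not bytes).
  chars <- strsplit(enc2utf8(line), "")[[1]]
  # Byte offset at which each character ends.
  byte_ends <- cumsum(nchar(chars, type = "bytes"))
  # For each byte column, find the first character whose bytes reach it.
  vapply(byte_col, function(b) which(byte_ends >= b)[1], integer(1))
}

line <- "\u00e6 = 1"  # the first line of bad.R, with æ escaped
# Byte columns 1, 4 and 6 are those reported for the symbol, `=` and `1`.
byte_col_to_char_col(line, c(1, 4, 6))  # 1 3 5
```

For a pure ASCII line such as `a = 1` the mapping is the identity, so applying it to every row of the parse data should leave unaffected lines unchanged.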


jimhester commented on September 18, 2024

You are correct about the error being in getParseData, not substr. Your nchar workaround seems like a good solution to me, thanks for the suggestion!


jimhester commented on September 18, 2024

I think I have written a function that fixes this. If you run into any problems with multi-byte characters, let me know. Thank you for your nchar suggestion; that was exactly what I needed.

