GithubHelp home page GithubHelp logo

Comments (9)

blueyed avatar blueyed commented on July 23, 2024

Yeah, I think we should not use universal newlines.
Can you tell where in vimlparser this happens?

from vim-vimlparser.

ssokolow avatar ssokolow commented on July 23, 2024

Sorry for the delay. The last couple of days were distracting.

I think I know where the problem is, but I'll need to find a moment to test my hypothesis before I can give you a definitive answer.

Also, I can't say for certain without reading the Vim source code to try to match whatever it does, but I'm guessing that the proper solution for detecting newlines would be to check for \n, \r\n, and \r and assume whichever occurs first is what the file should be split on.

(I'll have to try generating a test .vimrc to see if Vim's parser breaks if I include a \r on the very first line of an otherwise \n file with Vim configured to support classic MacOS-style newlines.)

Failing that, maybe splitting on \n unconditionally and then trimming a maximum of one trailing \r from each line.

from vim-vimlparser.

ssokolow avatar ssokolow commented on July 23, 2024

I didn't have time to check vimlparser yet, but I did find a moment to look up Vim's documentation on the line-ending detection algorithm that we probably want to replicate:

From :h 'fileformats':

	- When more than one name is present, separated by commas, automatic
	  <EOL> detection will be done when reading a file.  When starting to
	  edit a file, a check is done for the <EOL>:
	  1. If all lines end in <CR><NL>, and 'fileformats' includes "dos",
	     'fileformat' is set to "dos".
	  2. If a <NL> is found and 'fileformats' includes "unix", 'fileformat'
	     is set to "unix".  Note that when a <NL> is found without a
	     preceding <CR>, "unix" is preferred over "dos".
	  3. If 'fileformat' has not yet been set, and if a <CR> is found, and
	     if 'fileformats' includes "mac", 'fileformat' is set to "mac".
	     This means that "mac" is only chosen when:
	      "unix" is not present or no <NL> is found in the file, and
	      "dos" is not present or no <CR><NL> is found in the file.
	     Except: if "unix" was chosen, but there is a <CR> before
	     the first <NL>, and there appear to be more <CR>s than <NL>s in
	     the first few lines, "mac" is used.
	  4. If 'fileformat' is still not set, the first name from
	     'fileformats' is used.

In other words, assuming that unix, dos, and mac are all listed for detection, it behaves like this:

  1. If a \n is found that isn't preceded by \r, assume UNIX line endings.
  2. Else, if no lone \n was found, but at least one \r\n was found, assume DOS line endings.
  3. Else, if neither \n nor \r\n were found, but \r was found, assume Classic MacOS line endings.
  4. Else, use whatever type of line-endings was listed first in fileencodings. (Defaults to dos,unix for native Windows builds, unix,dos for everything else including Cygwin.)

For step 4, the best way to eliminate surprises would probably be to use os.name == 'nt' to decide whether to split on \r\n or \n. That'll perfectly match Vim's behaviour in every situation except "single-line VimL file being loaded by a Vim with a non-default first entry in fileformats."

from vim-vimlparser.

ssokolow avatar ssokolow commented on July 23, 2024

That said, step 4 may not even be necessary because, a little further down, it looks like it's saying that it uses a simplified form of that algorithm for VimL that's sourced or in vimrc:

	For systems with a Dos-like <EOL> (<CR><NL>), when reading files that
	are ":source"ed and for vimrc files, automatic <EOL> detection may be
	done:
	- When 'fileformats' is empty, there is no automatic detection.  Dos
	  format will be used.
	- When 'fileformats' is set to one or more names, automatic detection
	  is done.  This is based on the first <NL> in the file: If there is a
	  <CR> in front of it, Dos format is used, otherwise Unix format is
	  used.

(Which, now that I think about it, would make sense. I'm on Linux and had to manually change the line endings from DOS to Unix on bwHomeEndAdv to get it to load properly.)

In that case, I suppose the best solution would be to just follow that "systems with a Dos-like <EOL>" routine unconditionally. If there's a <CR> in front of the first <NL>, split on \r\n. Otherwise, split on \n.

Sorry about not catching that before I made the previous post. I didn't sleep well and I'm just about to go back to bed.

from vim-vimlparser.

ssokolow avatar ssokolow commented on July 23, 2024

Sorry again for the delay.

The problem is this function:

def viml_readfile(path):
    lines = []
    f = open(path)
    for line in f.readlines():
        lines.append(line.rstrip("\r\n"))
    f.close()
    return lines

...and it works if experimentally changed to this:

def viml_readfile(path):
    lines = []

    # Replicate Vim's algorithm for identifying DOS vs. UNIX line endings 
    # in sourced scripts and .vimrc files
    with open(path, 'rb') as f:
        content = f.read().decode('utf-8')
        first_n = content.index('\n')
        if first_n > 0 and content[first_n - 1] == '\r':
            raw_lines = content.split('\r\n')
        else:
            raw_lines = content.split('\n')

    for line in raw_lines:
        lines.append(line.rstrip("\r\n"))
    return lines

...though, to be honest, that's still broken and was always broken, because there's another bug in the Python version of the parser that I noticed while writing that. My unconditional .decode('utf-8') replicates Python 3's behaviour in text mode and, if you try to open a non-UTF8 file in Python's text mode, you'll get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte

The proper solution to match Vim is tricky because, again, it depends on stuff you can reconfigure external to the script but it boils down to "Parse once using the system's default encoding as specified by $LANG, falling back to latin1, and, if a scriptencoding statement is found, then re-parse as whatever encoding it declares. (eg. scriptencoding utf-8 in my case)

My best guess at an algorithm that doesn't require a big table of regional encoding would be something like this:

  1. If os.environ.get('LANG', '').endswith('.utf8'), then attempt to parse as UTF-8.
  2. If that fails, parse as latin1.
  3. If the parse turned up a scriptencoding statement, re-parse as the encoding it declares.

That works for two reasons:

  1. There are no invalid bit patterns under latin1, so it can't return an Exception... just gibberish interpretations of stuff outside 7-bit ASCII.
  2. Nothing needed to recognize and interpret scriptencoding <name> will involve characters outside 7-bit ASCII.

I'd have done that too, but the whole "search through the AST for scriptencoding ... and then redo the parse is far less trivial.

from vim-vimlparser.

ssokolow avatar ssokolow commented on July 23, 2024

The closest trivial hack I can think of would be to replace content = f.read().decode('utf-8') in my changed version with:

    content = f.read()
    try:
        content = content.decode('utf-8')
    except UnicodeDecodeError:
        content = content.decode('latin1')

Since all structural elements of VimL are 7-bit ASCII, that'll get you something that parses UTF-8 correctly and, if that fails, you instead get a correct AST but any tokens containing non-ASCII characters will be assumed to be latin1, even if that makes them gibberish.

(Of course, not irretrievable gibberish. Someone could always do gibberish.encode('latin1').decode('correct_encoding') at a later stage.)

from vim-vimlparser.

whonore avatar whonore commented on July 23, 2024

Does the newline argument of open solve the problem?

with open('nl', 'bw') as f:
    f.write(b'newline\ncarriage\rnewline carriage\r\n')

with open('nl', 'r') as f:
    print(f.readlines())
# -> ['newline\n', 'carriage\n', 'newline carriage\n']

with open('nl', 'r', newline='') as f:
    print(f.readlines())
# -> ['newline\n', 'carriage\r', 'newline carriage\r\n']

from vim-vimlparser.

ssokolow avatar ssokolow commented on July 23, 2024

Sorry for the delayed response. I'll try to get you an answer soon.

from vim-vimlparser.

ssokolow avatar ssokolow commented on July 23, 2024

No. Substituting open(path) for open(path, 'r', newline='') has no apparent effect.

To avoid this back-and-forthing, here's a test file, containing the relevant two lines bit of my vimrc, plus a few extra lines of context to help make sure things are working properly:

vimrc.txt

Vim parses it properly, as does vimlparser when run with Python 2.x, or when my proposed "Replicate Vim's algorithm for identifying DOS vs. UNIX line endings in sourced scripts and .vimrc files" modifications are run with Python 3.x.

Otherwise, vimlparser with Python 3.x will produce the somewhat amusingly confusing failure mode of appearing to output nothing at all because the error message doesn't use repr() when showing the problem line and the portion after the \r contains a literal ^[ (i.e. ESC) which the terminal dutifully interprets.

If you use less to force the non-printables to not be interpreted by the terminal, you see this:

vimlparser: E492: Not an editor command: JA TODO:: line 4 col 1

For that reason, I'd also suggest changing the %s in this line to a %r:

raise VimLParserException(Err(viml_printf("E492: Not an editor command: %s", self.reader.peekline()), self.ea.cmdpos))

...that removes the need for less to see the message and makes it look like this:

vimlparser: E492: Not an editor command: '\x1bkJA TODO:': line 4 col 1

However, the successful output still looks truncated unless you run it through something like less because it doesn't use repr() on the ^[ either.

from vim-vimlparser.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.