I've finished new lexer. Conceptually it doesn't lock user into some particular method


Discussion about new lexer about megaparsec HOT 12 CLOSED

mrkkrp commented on June 9, 2024

Discussion about new lexer

from megaparsec.

Comments (12)

abooij commented on June 9, 2024

Is indentGuard flexible enough to allow different styles of indentation? e.g., maybe you want space to normally be a whitespace, but not allow it in indentation. Or maybe you want the whitespace to be whatever sequence of (white) symbols (tabs/spaces), but fixed for any block of code.

I don't like the naming of integer: in particular, there is a very strong consensus among mathematicians that an integer can be negative (i.e. is signed). Mathematicians call a positive integer a natural, and there is also Data.Natural. There's also Data.Word for an Int-sized positive number.

Similarly, I think a float should be able to be negative, although this is less of an issue, since when you're going to deal with floating point numbers, you should really look at implementation details anyway. Then again, it outputs a Double, which suggests it accepts negative input values.

In the same spirit, if you claim that a number is either an Integer or a Double, then we should be talking about the signed variants.

Why can the new lexer read integer-style numbers with unlimited precision (Integer), but floating point-style numbers only with finite precision (Double)? Either:

- use Int instead of Integer
- use the appropriate typeclass instead of Integer
- use some kind of arbitrary precision real number instead of Double
- emphasize that one is arbitrary size while the other is not

Regarding octal: C people would say starting with 0 suffices. I think your documentation here is sufficiently clear about the difference, but it's worth mentioning somewhere, so here you go.

I like the new design and think it is much more generally useful. Indeed, essentially the only usage of the old TokenParser stuff is in example code, while this can be used more widely.

I read most of the code, and algebraically it looks correct. Thanks again for your work!

mrkkrp commented on June 9, 2024

(I will address your points in separate comments.)

I think indentGuard is flexible enough for the vast majority of cases. After all, you can supply a custom parser to consume white space, and that parser can accept only tabs, for example.

It's currently not possible (using only the built-in indentGuard) to make sure that every indentation in an indentation block consists of an identical sequence of white space characters. To make this happen, the user will need to write about 4 lines of their own Haskell, I guess.

I would be much more concerned about the fact that the tab width is hard-coded in the updatePosChar function. I'm not sure how to make it configurable.
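
A minimal sketch of the "custom parser that only accepts tabs" idea. It assumes a recent megaparsec release, where the lexer lives in Text.Megaparsec.Char.Lexer (the module layout of the version discussed in this thread may differ); tabsOnly and indentedWord are hypothetical names used for illustration only.

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (letterChar)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- Hypothetical space consumer: skips tabs and nothing else.
tabsOnly :: Parser ()
tabsOnly = () <$ takeWhileP (Just "tab") (== '\t')

-- A word that must be indented past column 1, where only tabs
-- count as indentation because of the consumer passed to indentGuard.
indentedWord :: Parser String
indentedWord = L.indentGuard tabsOnly GT pos1 *> some letterChar
```

With this consumer, leading spaces are not skipped, so a line indented with spaces is still at column 1 when the guard runs and is rejected.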

mrkkrp commented on June 9, 2024

The idea to parse only unsigned values originated here:

haskell/parsec#35 (comment)

It turns out that the Haskell report doesn't say anything about a sign in numeric literals, so the “basic” versions (those that are supposed to parse things according to the Haskell report) don't parse a sign. After all, you may want to parse only positive numbers.

In any case, these functions should preferably return values of signed types, so that they can easily be turned into parsers for signed numbers with the help of the signed combinator.
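
As a rough illustration of that point, here is how an unsigned parser can be wrapped by a signed combinator. This is a sketch assuming a recent megaparsec release (Text.Megaparsec.Char.Lexer with decimal and signed); unsignedInt and signedInt are illustrative names.

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (space)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- The "basic" parser: digits only, no sign.
unsignedInt :: Parser Integer
unsignedInt = L.decimal

-- The same parser with an optional leading '+' or '-'.
-- The first argument says what may appear between the sign and the digits.
signedInt :: Parser Integer
signedInt = L.signed space unsignedInt
```

Because the result type is signed, the wrapper only has to negate the value; no change to the underlying parser is needed.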

About the size of integers: this is mostly taken from Parsec without changes. The choice of data types is not that unusual: Integer can be used to parse arbitrarily large integers, which is better than the bounded Int. An Integer can always be downgraded afterwards, but an Int cannot be “upgraded”.
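
One way to do the "downgrade" safely, sketched with base only: toIntegralSized from Data.Bits returns Nothing when the value does not fit the target type. The helper name narrowToInt is made up for this example.

```haskell
import Data.Bits (toIntegralSized)

-- Narrow an arbitrarily large Integer to Int,
-- failing explicitly instead of wrapping around on overflow.
narrowToInt :: Integer -> Maybe Int
narrowToInt = toIntegralSized
```

A plain fromIntegral would also type-check, but it silently wraps out-of-range values.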

“Use some kind of arbitrary precision real number instead of Double.”

Haskell doesn't come with this sort of thing by default, AFAIK. Otherwise yes, it would make sense given how the float parser is defined (it accepts an unlimited sequence of digits in both the whole and fractional parts).

mrkkrp commented on June 9, 2024

About octal: indeed, different languages vary in these subtle details. I think I will make these parsers (octal and hexadecimal) parse “raw” values without prefixes, so the programmer will be able to prefix them with any sort of parser. This will be more flexible.
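
A sketch of how prefix-free parsers compose with a prefix of the caller's choice. It assumes a recent megaparsec release in which L.octal and L.hexadecimal parse bare digits; haskellOctal, cOctal and hex are illustrative names.

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (char, char')
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- Haskell-style octal literal: "0o17" or "0O17".
haskellOctal :: Parser Integer
haskellOctal = char '0' *> char' 'o' *> L.octal

-- C-style octal literal: a leading zero is the whole prefix, as in "017".
cOctal :: Parser Integer
cOctal = char '0' *> L.octal

-- Hexadecimal literal with a case-insensitive "0x" prefix.
hex :: Parser Integer
hex = char '0' *> char' 'x' *> L.hexadecimal
```

Note that cOctal as written needs at least one digit after the leading zero, so a lone "0" would have to be handled separately.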

mrkkrp commented on June 9, 2024

Done in 3de3f69.

doppioandante commented on June 9, 2024

I just wanted to say that I'm trying this out; I should have some feedback within 2 weeks at most.
I have to parse a pseudo-asm, so the parsing part is practically non-existent; it's all about lexing. Parsec eating newlines was troublesome because a newline has to be used as the separator between asm statements.
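
For a newline-separated language like this, the space consumer can simply refuse to eat newlines. The following is a sketch under assumptions: a recent megaparsec release, ';' as the line-comment marker, and a toy statement made of alphanumeric words. None of these details come from the actual pseudo-asm.

```haskell
import Control.Applicative (empty)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (alphaNumChar, eol)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- Skips spaces, tabs and ';' line comments, but never a newline.
sc :: Parser ()
sc = L.space inlineSpace (L.skipLineComment ";") empty
  where
    inlineSpace = () <$ takeWhile1P (Just "white space") (`elem` " \t")

word :: Parser String
word = L.lexeme sc (some alphaNumChar)

-- One statement is a non-empty run of words on a single line.
statement :: Parser [String]
statement = some word

-- Statements are separated (and optionally terminated) by newlines.
program :: Parser [[String]]
program = sc *> (statement `sepEndBy` some (eol *> sc)) <* eof
```

Since sc never consumes a line break, the newline stays visible to program and can act as the statement separator.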

mrkkrp commented on June 9, 2024

@doppioandante, great. Please post here descriptions of any difficulties that you experience, so we can improve design of the lexer if necessary.

abooij commented on June 9, 2024

Re: naturals vs integers.

base >= 4.8 has Numeric.Natural, so you can use that.
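
For instance, if the decimal parser is polymorphic in its result type (as L.decimal is in recent megaparsec releases, which this sketch assumes), a natural-number parser is just a type annotation. The name natural is illustrative.

```haskell
import Data.Void (Void)
import Numeric.Natural (Natural)
import Text.Megaparsec
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- Unsigned, arbitrary-size result: the type itself rules out negatives.
natural :: Parser Natural
natural = L.decimal
```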

neongreen commented on June 9, 2024

It'd be pretty nice to be able to get

parens    = between (symbol "(") (symbol ")")
braces    = between (symbol "{") (symbol "}")
angles    = between (symbol "<") (symbol ">")
brackets  = between (symbol "[") (symbol "]")

and so on without defining them all manually every time, but there's an obstacle: all those definitions need to have access to the whitespace parser, which would result in a lot of passing-stuff-around a la Parsec:

parens   = L.parens spaceConsumer
braces   = L.braces spaceConsumer
angles   = L.angles spaceConsumer
brackets = L.brackets spaceConsumer

This could probably be fixed by letting them get “whitespace configuration” (and possibly other things?) from MonadReader, but then we make it impossible for others to use Reader in the transformer stack. An alternative is defining our own MonadParserConfig as a Reader newtype, but that's too much and I'm not actually proposing for this to end up in the main library.

Just throwing this in because maybe somebody would have a better idea – and if so, I'd like to know about it.

neongreen commented on June 9, 2024

@mrkkrp, the hassle is in the fact that once symbol is defined, you also have to define

parens = between (symbol "(") (symbol ")")

where symbol refers to your definition. So, everyone keeps writing parens = between (symbol "(") (symbol ")") over and over, and yet this definition can't be reused.

Actually, I think it's something Backpack might fix once it's released, so maybe the question is moot.

mrkkrp commented on June 9, 2024

@neongreen, I think the idea of “white space configuration” in any form is an unnecessary complication. We could provide definitions like parens by default, where they would take the space-consuming parser as an argument. Yes, passing the space-consuming parser around may be boilerplate in most cases, but don't forget that it allows you to tune the white space consumption policy on a per-lexeme basis.
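
A sketch of what such definitions could look like when the space consumer is an explicit argument. It assumes a recent megaparsec release; these are illustrative definitions, not functions the library is known to export.

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (space)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- Each wrapper takes the space consumer to use after its delimiters.
parens :: Parser () -> Parser a -> Parser a
parens sc = between (L.symbol sc "(") (L.symbol sc ")")

brackets :: Parser () -> Parser a -> Parser a
brackets sc = between (L.symbol sc "[") (L.symbol sc "]")

-- Per-lexeme tuning: the same wrapper with two different policies.
looseNumber, tightNumber :: Parser Integer
looseNumber = parens space (L.lexeme space L.decimal)
tightNumber = parens (pure ()) L.decimal
```

Here looseNumber accepts "( 42 )" while tightNumber accepts only "(42)", which is the per-lexeme flexibility described above.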

neongreen commented on June 9, 2024

“but don't forget that it allows you to tune white space consumption policy on per-lexeme basis”

Yep, but so does the Reader solution (with its local function). I also think it could be interesting to be able to do the following:

-- This is possible already.
expr = ...                        -- parses “1+2+3”, “( 1 + 2 + 3 )  ”, etc

-- This is possible as well.
angles expr                       -- parses “<1+2+3>”, “< 1+2 +3 >”, etc

-- This is trivial with the Reader solution.
noSpaces expr                     -- parses “(1+2+3)” 
                                  -- but not “(1+(2+ 3))”

-- I have no idea how to achieve this, and it's probably useless.
(glued angles) expr               -- parses “<1+2+3>”, “<1 + 2 + 3>”, etc
                                  -- but not “<1+2+3 >” or “< 1+2+3>”

“I think the idea with “white space configuration” in any form is an unnecessary complication.”

Maybe. I'd like to stress once again that it's not something I'm proposing or even have any use for. I merely wonder how much flexibility we can get out of the lexer without it becoming too complicated or weird.
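
For reference, a minimal sketch of the Reader idea, with the space consumer stored in a ReaderT environment and noSpaces implemented with local. It assumes a recent megaparsec release plus the transformers package; Base, symbol, number, expr and run are hypothetical names, and the toy expr just sums numbers.

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Reader (ReaderT, ask, local, runReaderT)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (space)
import qualified Text.Megaparsec.Char.Lexer as L

type Base = Parsec Void String

-- The environment is the space consumer currently in effect.
type Parser = ReaderT (Base ()) Base

symbol :: String -> Parser String
symbol s = ask >>= \sc -> lift (L.symbol sc s)

number :: Parser Integer
number = ask >>= \sc -> lift (L.lexeme sc L.decimal)

-- No space consumer argument needed: it comes from the environment.
parens :: Parser a -> Parser a
parens = between (symbol "(") (symbol ")")

-- Run a parser with white space skipping switched off.
noSpaces :: Parser a -> Parser a
noSpaces = local (const (pure ()))

-- Toy expression: numbers and parenthesised sums joined by '+'.
expr :: Parser Integer
expr = sum <$> term `sepBy1` symbol "+"
  where
    term = number <|> parens expr

run :: Parser a -> String -> Maybe a
run p = parseMaybe (runReaderT p space)
```

With these definitions, run expr accepts "( 1 + 2 + 3 )  ", while run (noSpaces expr) accepts "(1+2+3)" and rejects "(1+(2+ 3))". The cost mentioned earlier still applies: the Reader layer is now occupied by the lexer.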
