mrkkrp / megaparsec
Industrial-strength monadic parser combinator library
License: Other
I often write parsers for the small languages that come up in my PL research, and when it's time to write a paper, I like to use lhs2tex --newcode to extract the source code of examples from my *.tex sources for automatic checking. I want the error messages to point to my *.tex file, not the generated file, so I have to add support for {-# LINE ... #-} pragmas to my parser. For example, see this commit.
While the implementation is not overly hard, it requires me to drop to a relatively low level of abstraction. I would appreciate a more high-level API for parsing and processing line pragmas.
Some things that should be done:
CHANGELOG.md, group changes by categories (general changes, error messages, character parsing, etc.), so it's easier to read.
README.md file and information in .cabal file.
I would like to ask if anyone can help with the last task, because I don't have accounts on most social sites like that, but I feel the project would benefit from a mention on such a site. I'm OK with creating an account for this purpose, but maybe someone with an existing account would like to do it?
Looking at this PR: haskell/parsec#25, I see that a parser that could parse at least n tokens and up to m tokens is quite handy. countUpTo doesn't seem necessary; at any rate it should be defined in terms of the more powerful countFromTo. The original count can also be defined in terms of countFromTo, which makes me think we should have countFromTo in Megaparsec. Since it's the most powerful parser in this family, I propose to call it count and define count' n = count n n.
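The relationship between these combinators can be sketched as follows. This is only an illustration written against the Alternative class from base, not the code from the linked PR; the names countExactly and the clamping of out-of-range bounds to an empty result are assumptions made for the example.

```haskell
import Control.Applicative (Alternative (..))

-- | Run @p@ at least @m@ and at most @n@ times, collecting the results.
-- Out-of-range bounds (n <= 0 or m > n) yield an empty list (an assumption).
countFromTo :: Alternative f => Int -> Int -> f a -> f [a]
countFromTo m n p
  | n <= 0 || m > n = pure []
  | m > 0           = (:) <$> p <*> countFromTo (m - 1) (n - 1) p
  | otherwise       = ((:) <$> p <*> countFromTo 0 (n - 1) p) <|> pure []

-- | Exactly @n@ repetitions, expressed through the general combinator.
countExactly :: Alternative f => Int -> f a -> f [a]
countExactly n = countFromTo n n

-- | Up to @n@ repetitions, likewise.
countUpTo :: Alternative f => Int -> f a -> f [a]
countUpTo = countFromTo 0
```

Since Maybe is an Alternative, the sketch can be tried in GHCi without any parser library: countFromTo 1 3 (Just 'x') gives Just "xxx", while countFromTo 2 3 (Nothing :: Maybe Char) gives Nothing. In a parser that does not backtrack after consuming input, the optional repetitions may additionally need try.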
This should be rather easy. It's more or less a question of whether you actually want backwards compatibility, and whether you want to use {-# LANGUAGE CPP #-} for the MIN_VER macro or accept "redundant import" warnings.
I just tried a quick 'n dirty patch and imported either Control.Applicative (17 files*) or Data.Foldable (Foldable, foldMap) (1 file), which passed all tests. Note that the import of Control.Applicative isn't necessary in all of those files, e.g. all parseFromFile could be written with infix fmap instead of <$>.
If you're interested in a patch, I can probably prepare a proper one later today.
* Bug9.hs is the only file that needs hiding (Const) due to data Expr = Const ….
This can be reproduced on the master branch, via cabal repl:
λ> parseTest (string "rere" <* eof) "reri"
ghc: panic! (the 'impossible' happened)
(GHC version 7.10.1 for x86_64-unknown-linux):
Loading temp shared object failed: /tmp/ghc9380_0/libghc9380_93.so: undefined symbol: _hpc_tickboxes_megapzuEw3SHAmfXgNLpm5a31oXO6_TextziMegaparsecziError_hpc
Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
I have no idea why this happens or how to construct a minimal example to reproduce it. I can do the same via cabal repl tests without any problems. The original Parsec works, of course. I'm not sure whether this is my fault; maybe it is indeed a GHC bug.
OK, this is an important announcement. I'm considering removing the following parsers from Text.Megaparsec.Combinator:
chainl
chainl1
chainr
chainr1
sepEndBy
sepEndBy1
Chaining functions are useless and misleading. What they do is actually better performed by Text.Megaparsec.Expr, as is obvious from the examples in their descriptions. I guess they were written before the module existed, so the author of Parsec tried to provide more functionality this way.
A user who encounters these combinators will think that this is how expressions should be parsed, and may start writing code using these tools instead of Text.Megaparsec.Expr.
sepEndBy and its variant sepEndBy1 are not that useful either. If needed (although I think that's not often), they can easily be implemented in terms of existing combinators (the author of Parsec, however, preferred to re-implement the whole thing, strange…):
-- | @sepEndBy p sep@ parses /zero/ or more occurrences of @p@,
-- separated and optionally ended by @sep@. Returns a list of values
-- returned by @p@.
sepEndBy p sep = sepBy p sep <* optional sep
-- | @sepEndBy1 p sep@ parses /one/ or more occurrences of @p@,
-- separated and optionally ended by @sep@. Returns a list of values
-- returned by @p@.
sepEndBy1 p sep = sepBy1 p sep <* optional sep
Note that it's not only easy to write the combinators using the existing ones, but sepBy p sep <* optional sep is also clearer than sepEndBy. In fact, I had to read the description of that function to understand what it really does.
Please provide your arguments if you want to save the combinators.
It's obviously a monoid and it's a shame I haven't noticed this before. The definition is quite simple:
instance Monoid ParseError where
  mempty  = newErrorUnknown (initialPos "")
  mappend = mergeError
The library could use <> instead of mergeError. A couple of tests should be written to check that this satisfies the monoid laws.
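To make the monoid laws concrete, here is a toy model that can be checked in isolation. It is a sketch under stated assumptions, not the library's code: the ParseError record below is a simplified stand-in, and this mergeError (the error that got further wins, ties pool their messages) is a hypothetical merge rule chosen for the example. It also assumes positions are at least (1, 1) and message lists contain no duplicates.

```haskell
import Data.List (union)

-- | Toy stand-in for ParseError: a (line, column) position plus messages.
data ParseError = ParseError
  { errorPos  :: (Int, Int)
  , errorMsgs :: [String]
  } deriving (Eq, Show)

-- | Hypothetical merge rule: the error at the later position wins;
-- errors at the same position pool their messages.
mergeError :: ParseError -> ParseError -> ParseError
mergeError e1 e2 = case compare (errorPos e1) (errorPos e2) of
  LT -> e2
  GT -> e1
  EQ -> e1 { errorMsgs = errorMsgs e1 `union` errorMsgs e2 }

instance Semigroup ParseError where
  (<>) = mergeError

-- | The identity is an error at the initial position with no messages.
instance Monoid ParseError where
  mempty = ParseError (1, 1) []
```

With a = ParseError (1, 3) ["x"], b = ParseError (1, 3) ["y"] and c = ParseError (1, 2) ["z"], both (a <> b) <> c and a <> (b <> c) evaluate to ParseError (1, 3) ["x", "y"], and mempty <> a equals a. A QuickCheck property over arbitrary errors would be the natural next step for the tests mentioned above.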
Hi, I'm developing an interpreter and was using Parsec for some preprocessing of an abstract syntax tree. While developing, I asked on StackOverflow how to get the input whether the parser succeeds or not, and I have switched to Megaparsec since your announcement on the mailing list.
I don't need this feature anymore, but it may be a good idea to have this option anyway.
Parsec itself is a monad transformer (ParsecT). This means that a state monad can easily be used with it if the user of the library needs one. However, Parsec, unlike other similar libraries, keeps a special user state field in its State record. As far as I can tell, in the vast majority of cases the user state is nothing but unit ().
Here is a comment found in the original Parsec source code, above the definition of the MonadState instance:
I'm presuming the user might want a separate, non-backtracking state aside from the Parsec user state.
I see the stateUser field as an unnecessary complication. I propose removing this custom user state from Megaparsec. What do you think, @albertnetymk, @neongreen?
(The title is a sort of pun.) See haskell/parsec#32 for more information. This may be a good thing to implement.
See haskell/parsec#39 for more information. This should be easy to fix given our propositions in the thread. Also, new tests should be added.
Currently columns are counted from 1, with 1 being the minimum. This seems to be incorrect. I'm writing (rather stupid, really) tests for Text.Megaparsec.Pos, and this module's function updatePosChar itself proves that the current numbering is wrong. Here is the description of the function:
Update a source position given a character. If the character is a newline ('\n') or carriage return ('\r') the line number is incremented by 1. If the character is a tab ('\t') the column number is incremented to the nearest 8'th column, i.e. column + 8 - ((column-1) `mod` 8). In all other cases, the column is incremented by 1.
So, this means that the screen is divided by tabs in a non-regular fashion, because the second "section" according to that logic starts from the 8th column and ends on the 16th column, and it's 8 characters long. The problem is that the first "section" starts from the 1st column and ends on the 8th column, and so it's 7 characters long!
Also, the source code doesn't check the \r character at all. I wonder if that may have implications on some systems, because \n\r will result in column 2 according to the source code...
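One way to check the section boundaries discussed above is to tabulate the documented formula directly. The snippet below only encodes the formula quoted from the documentation; the names nextTabColumn and tabTable are made up for this illustration.

```haskell
-- | Column reached after a tab at the given column, per the quoted formula.
nextTabColumn :: Int -> Int
nextTabColumn column = column + 8 - ((column - 1) `mod` 8)

-- | Pairs of (column before the tab, column after the tab).
tabTable :: [(Int, Int)]
tabTable = [(c, nextTabColumn c) | c <- [1 .. 17]]
```

Evaluating it gives nextTabColumn 1 = 9, nextTabColumn 8 = 9 and nextTabColumn 9 = 17, so the full table in tabTable can be compared against the expected column numbering.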
This is strange: why would continuations that represent a successful outcome of parsing take ParseError records? To find out, I simply rewrote the definition of ParsecT this way:
newtype ParsecT s u m a = ParsecT
  { unParser :: forall b . State s u
             -> (a -> State s u -> m b) -- consumed-OK
             -> (ParseError -> m b)     -- consumed-error
             -> (a -> State s u -> m b) -- empty-OK
             -> (ParseError -> m b)     -- empty-error
             -> m b }
Then I removed all the useless stuff like cerr x s (unknownError s) which Parsec needed to fill in the ParseError argument everywhere. This actually almost fixed the failing test from the old tests and made the code in Text.Megaparsec.Prim a lot clearer.
But that argument had its purpose, of course. Its only purpose is to provide information about the expected token when something fails later. I hope an example will clarify. At a lower level, many p will succeed even if p fails at some point. Everything collected so far is returned along with this failure, but as an argument of the “consumed-OK” continuation, so normally it will be ignored. However, if the parser after many p (let's call it n) fails, this will contribute to the error message, so the user will see:
unexpected something
expecting <what n parses> or <what p parses>
rather than
unexpected something
expecting <what n parses>
The first error message is better, of course. But the current implementation is rather convoluted. It conflicts with the clear principles that I want to keep for merging error messages. And that ParseError in the “consumed-OK” continuation is not really a parse error on its own. It's more like a hint, useful only in very specific cases (when something fails right after that “successful” parser) and rather buggy in other cases.
I'm wondering what would be the best solution to this problem. If we eliminate the ParseError argument from successful continuations (as common sense tells us to do), we will need some way to store this sort of hint. This hint should automatically expire when another parser succeeds (consuming input) after the tricky one.
If anyone wants to discuss, please do.
According to this pull request for Parsec, haskell/parsec#46, some tokens (more “complex” than characters) may benefit if the second argument of the token combinator is of type (t -> Either e a), so additional information can be returned and included in the error message.
We could include such a change in Megaparsec.
Although I think this proposed solution is a bit bulky. If we want to give the caller of token the ability to influence the type of error message, it would be better written as:
token :: Stream s m t =>
     (SourcePos -> t -> s -> SourcePos)
  -> (t -> Either Message a)
  -> ParsecT s u m a
…then ParseError can easily be constructed from Message.
If we decide to make this change, it will be very easy. Only satisfy from Text.Megaparsec.Char uses token, and it would be very natural to return Left . Unexpected . showToken $ x when the supplied predicate isn't satisfied.
Hi, I'm having trouble with a SQL parser using megaparsec. When I parse "SELECT DISTINCT db.stuff.id FROM stuff WHERE id = 1" it works but when I parse "SELECT DISTINCT db.stuff.id FROM stuff WHERE id = 1 ORDER BY id, name ASC LIMIT 20" it errors:
> parse query "" "SELECT DISTINCT db.stuff.id FROM stuff WHERE id = 1 ORDER BY id, name ASC LIMIT 20"
Left line 1, column 53:
unexpected 'O'
expecting '%' or white space
% is the first operator in my expression parser:
expression :: Parsec String Expression
expression = label "expression" $ makeExprParser term operators
  where
    term = ...
    operators =
      [ [ InfixN $ (ExpressionOPERATION .: OperationMOD) <$ sqltoken "%" ]
      , ...
      ]
It's kind of weird that the error refers to only the first operator in the operator table. How do I get this to work? Sorry, not very experienced with parsing libraries.
See haskell/parsec#24 for more information. I wonder if haskell/parsec#41 can be helpful?
As laid out in the StackOverflow question "How to return multiple parse failures within Parsec's monadic context", the fail combinator only allows a single Message value to be placed into the ParseError message list, whereas parsing context-sensitive grammars can produce multiple failures at once. There is no way to place multiple Message values into the ParseError message list through the functions exported by the Parsec library.
I would like a fails or failures combinator of type:
fails :: [String] -> ParsecT s u m a
This combinator would allow multiple Message values to be placed in the ParseError message list when there are multiple failures in a context-sensitive grammar.
I looked for a function that did this, and none seemed to do it. If we are parsing certain parts of a file, we may want to set the initial SourcePos to something other than 1, 1.
This could be easily implemented with
runParserInitPos :: Stream s t => Parsec s a -> String -> Int -> Int -> s -> Either ParseError a
runParserInitPos act src lin col str = runParser (setPosition pos >> act) src str
  where
    pos :: SourcePos
    pos = newPos src lin col
But this may be a good function to have exposed to the users.
Hi,
I would like to have such a feature; however, using it in Parsec requires copying the whole Text.Parsec.Token code. Maybe you have some ideas on how to make this token generator more flexible? This is also related to the whitespace issue we discussed before.
Currently it returns its argument, unlike char'. It seems to me that returning the parsed string could be more useful, especially if it can be made faster than the naive string' = mapM char'.
We need to achieve 100% coverage of most modules. Parsec tends to export more than necessary, so we should really clean up its API and everything we export should be thoroughly documented.
Currently in Megaparsec (just like in Parsec) the tab width is not configurable: it's hard-coded in the updatePosChar function and is equal to 8. This can be a problem. Many people use tabs to indent code, and 4 is a popular value for tab width. If we cannot provide a way to specify the tab width, parsers created with Megaparsec may output confusing column positions in error messages in some cases.
I don't think it's possible to release Megaparsec 4.0.0 with such a flaw.
The proposed solution includes adding a tab width field to the State record and a couple of functions to set and get it in Text.Megaparsec.Prim; updatePosChar should take the tab width as an argument, and other functions should get it and pass it to updatePosChar.
If you have other propositions, I would like to hear them.
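A minimal sketch of a position update that takes the tab width as an argument might look like this. It is not the library's implementation: positions are modelled as plain (line, column) pairs, the names updatePosCharWith and posAfter are hypothetical, and the tab width is assumed to be positive.

```haskell
-- | A source position as (line, column), both counted from 1.
type Pos = (Int, Int)

-- | Advance a position over one character, using the given tab width
-- (assumed to be > 0). A tab moves to the next tab stop.
updatePosCharWith :: Int -> Pos -> Char -> Pos
updatePosCharWith _ (l, _) '\n' = (l + 1, 1)
updatePosCharWith w (l, c) '\t' = (l, c + w - ((c - 1) `mod` w))
updatePosCharWith _ (l, c) _    = (l, c + 1)

-- | Position after a whole string, starting from line 1, column 1.
posAfter :: Int -> String -> Pos
posAfter w = foldl (updatePosCharWith w) (1, 1)
```

For instance, posAfter 8 "\tx" is (1, 10) while posAfter 4 "\tx" is (1, 6), which is the kind of difference in reported columns the issue is about.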
So, while I was working on #45, I noticed that there wasn't much missing in 4.6.0.0, except for Data.Bool (bool) and Data.Either (isRight, isLeft).
I've prepared a commit based on the 4.7.0.0 patch, that mainly provides a proof of concept to show that 4.6.0.x compatibility is feasible. However, it's a mess and not really easy to maintain.
Instead, if there's interest in adding base-4.6.0.x compatibility, I would introduce a new module (Text.Megaparsec.Comp or similar), which re-exports the functions on base-4.7.0.0 or higher, and defines the three functions for older versions. That should give a single point of maintenance.
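The fallback definitions such a module would carry are small. The sketch below shows only those definitions, written out unconditionally so they can be loaded on their own; in the actual module they would presumably sit behind a CPP guard such as the Cabal-provided MIN_VERSION_base(4,7,0) macro, with a plain re-export otherwise (that arrangement is an assumption, not taken from the commit).

```haskell
-- Fallback definitions for base < 4.7, where Data.Bool.bool and
-- Data.Either.isLeft / isRight are not yet available.

-- | Case analysis on Bool: the first argument is returned for False,
-- the second for True.
bool :: a -> a -> Bool -> a
bool f _ False = f
bool _ t True  = t

-- | True if the value is a Left.
isLeft :: Either a b -> Bool
isLeft (Left _)  = True
isLeft (Right _) = False

-- | True if the value is a Right.
isRight :: Either a b -> Bool
isRight = not . isLeft
```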
See haskell/parsec#35 for more information. This should be fixed and corresponding tests added. Also consider improving parsing of other types of numbers.
I've started writing QuickCheck tests for Text.Megaparsec.Char, and the first function I tested, oneOf, showed me that the current printing of characters in error messages is inconsistent and unacceptable in general.
Currently the printing function (from the Parsec era) amounts to:
showCh x = show [x]
Thanks to my first test for this module, I know that it sometimes displays characters as strings. For example, this may result in unexpected 'a' being printed as unexpected "a" when actually a character is meant, as in this case:
λ> parseTest (oneOf "a" <* eof) "ab"
parse error at line 1, column 2:
unexpected 'b'
expecting end of input
-- ↑ correct, we talk about a character
λ> parseTest (oneOf "a" <* eof) "blah"
parse error at line 1, column 1:
unexpected "b"
-- ↑ what is this? a string?
Ironically enough, the authors of Parsec have a better solution for printing characters, not in their code though, but in an example for tokenPrim:
-- > char c = tokenPrim showChar nextPos testChar
-- > where showChar x = "'" ++ x ++ "'"
-- > testChar x = if x == c then Just x else Nothing
-- > nextPos pos x xs = updatePosChar pos x
Of course they meant "'" ++ [x] ++ "'", but even in this form it won't work decently, because it will print newlines literally, breaking the text of the error message into two lines, for example.
If we choose another way and write showChar = show, we will get nicer Haskell-style syntax, where newlines are printed as \n, but also this sort of thing: '\456' for various otherwise perfectly legal characters like Greek letters.
Conclusion: we should develop a better printing function, where newlines are printed as the word newline, so it would say unexpected newline, while printing all other characters like π in their normal form.
Comments are welcome. I ask myself if I should open a similar issue for vanilla Parsec, although something tells me that I will be ignored.
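A printing function along the lines of the conclusion could be sketched like this. The name prettyChar is hypothetical, and naming the tab character as well is an extra assumption beyond what the issue asks for.

```haskell
-- | Render a character for use in an error message: whitespace that would
-- disturb the layout is named, everything else is shown literally in quotes.
prettyChar :: Char -> String
prettyChar '\n' = "newline"
prettyChar '\t' = "tab"
prettyChar c    = "'" ++ [c] ++ "'"
```

Here prettyChar '\n' yields newline and prettyChar 'π' yields the three characters 'π', whereas show 'π' yields the escaped form '\960'.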
Pros of MegaParsec:
Cons of MegaParsec:
Here is an example:
λ> parseTest (string "re" <* eof) "ri"
parse error at line 1, column 1:
unexpected "i"
expecting "re"
This reports the first mismatching character, but the specified position is at the beginning of the whole string. Also, since it's logical to tell the user the whole expected word, we should also tell them the whole mismatching part, so:
λ> parseTest (string "re" <* eof) "ri"
parse error at line 1, column 1:
unexpected "ri"
expecting "re"
λ> parseTest (string "foo" <* eof) "bar"
parse error at line 1, column 1:
unexpected 'b'
expecting "foo"
↑ this is how a proper error message should look, imho. Let's see if this can be fixed.
This is how to reproduce the problem:
λ> parseTest ((((try $ string ">>>") <|> (empty <|> return "foo")) <?> "bar") <* eof) ">>"
parse error at line 1, column 1:
unexpected '>'
expecting ">>>", bar, or end of input
λ> parseTest (((((try $ string ">>>") <|> empty) <|> return "foo") <?> "bar") <* eof) ">>"
parse error at line 1, column 1:
unexpected '>'
expecting bar or end of input
Sorry for so many parentheses, this is just to make it clear how things are grouped. Note that the former example models foldr while the latter example corresponds to foldl. If we group combinators a bit differently, we get different error messages. This is a bug from my point of view.
This is a subtle one, and it's not easy to reproduce. You can play with this too; maybe you can find the cause of this problem faster than me.
Megaparsec might be able to replace Parsec much faster if it has a tutorial (not just a list of distinctions between parsec and megaparsec) – this way instead of everyone saying “okay there's Megaparsec but maybe use Parsec instead because all tutorials cover it and they aren't that different” we might end up with “okay there's Parsec but Megaparsec comes with an extensive tutorial so look at it instead”. So, I offer to write such a tutorial, and then we could put it in Text.Megaparsec.Tutorial.
(The reason I'm making this an issue instead of just silently writing it and then making a pull request is that I don't want a repetition of the situation with Tekmo and lens tutorial. So, if e.g. @mrkkrp has strong opinions on how it should be written, it would be better to hear them early on instead of ending up with something written in my usual idiosyncratic style and then having the pull request rejected.)
It'd be nice if it was possible to use label like this:
yearP :: Parser Integer
yearP = label "year" $ do
  s <- some digit
  return $ if length s < 4 then 2000 + read s else read s
(The current signature is label :: ParsecT s u m a -> String -> ParsecT s u m a.)
I'm following up on a discussion following the announcement of megaparsec-4.0.0.0 on Haskell-Cafe.
I'll cut to the chase and say that quite a few people are excited about megaparsec and see it as a worthwhile candidate for reinvigorating parsec (as major version 4.0), the development of which has been pretty stale lately. Probably, the smoothest way to achieve that would be for @mrkkrp to become a (primary) maintainer of parsec on Hackage. Would you like that, Mark?
If so, there's a formal process for that https://wiki.haskell.org/Taking_over_a_package . But I don't think there would be any issues with that as there is awareness and support of this among some of the parsec maintainers and Hackage Trustees already. I can lend my hand in facilitating the process, if needed.
I think this is the smoothest way to "replace" parsec, as you would be leveraging the well-known name, while fixing a lot of long standing issues. I wouldn't worry about backwards compatibility, especially, since megaparsec is already compatible with the two previous versions of GHC.
See haskell/parsec#8 for more information. This should definitely be fixed in megaparsec.
I've finished the new lexer. Conceptually it doesn't lock the user into a particular methodology of parsing but just gives a collection of useful helpers. It's the responsibility of the programmer to glue everything together.
See branch new-lexer: Text.Megaparsec.Lexer.
Note that the old record-based system is eliminated, thus Text.Megaparsec.Language is eliminated too (after all, I don't think it was ever useful to anyone except for providing emptyDef).
This is the place where you can criticize my decisions, say “What have you done!”, and change something before it's too late.
One of the old tests is currently failing; this will be investigated, and tests for the entire Text.Megaparsec.Lexer will be written. After that the branch new-lexer will be merged. I estimate all the work won't take longer than a couple of days.
Hi Mark,
is this still planned?
Daniel
A Travis build has shown that the case-insensitive test doesn't work as intended. A minimal example that shows the defect has also been provided in that build:
-- prop_string a s
prop_string "a" "A"
This returns the match "A" correctly, but the test expects a matched "a" due to the logic in checkString. The problem here is that the parser string' a doesn't necessarily return a, but s if the case-insensitive match is successful.
You can reproduce this result with the following property:
prop_string'_custom :: String -> Property
prop_string'_custom a = prop_string' (map toLower a) (map toUpper a)
As you probably know, Megaparsec is going to be thoroughly tested. This is absolutely necessary to make sure that our library won't “surprise” users with some ugly quirks. It's also necessary if we want to continue active development after Megaparsec 4.0.0 is released. Look at Parsec: it cannot move forward anymore, because every change can backfire in unpredictable ways (thanks to how it's coded, too).
So, I'm trying to cover everything and I'm currently working on Text.Megaparsec.Prim, but I foresee that I'll soon get quite busy and the free time that I can devote to free software development will be minimal.
Thus I ask you to help me write some tests. It's not necessary to submit tests for an entire module; one or two tests will help me a lot.
Here is some information you may find helpful if you wish to help me.
Util module and how existing tests for Text.Megaparsec.Char and Text.Megaparsec.Prim are written. Every function in Util has documentation, so it should be clear how to use them to test parsers.
simpleParse function from Util.
Current status of Megaparsec is:
Text.Megaparsec.Expr and Text.Megaparsec.Token are finished and work (or should work) as intended;
Text.Megaparsec.Language), and some need testing.
Finished modules that need testing:
Text.Megaparsec.Combinator
Text.Megaparsec.Perm
Tests for unfinished modules are not currently welcomed, because their API/provided features will change in future, so it's too soon to write tests for them.
Thank you for your interest in Megaparsec project.
It might be a good idea to benchmark at least basic things. I'm not sure I can do all of it myself, but let this hang here; maybe someone could help in the future and write some benchmarks. I think the criterion library can be used.
It's really annoying that I have to hide <|>, many, optional from Control.Applicative whenever I want to use Parsec, so I propose making them reexports from Control.Applicative in this package.
I think now we can concentrate on tutorials and other materials. In particular, we can use GitHub project pages to host some useful information related to the project.
Here is what I think such a site should contain:
If you have propositions, tell me.
There is also an idea to edit the README.md file. It should define Megaparsec not in terms of Parsec, but as something self-sufficient that a non-Parsec user can understand. The comparison against Parsec should be moved to a dedicated section. It's a good idea to add a hyperlinked table of contents at the beginning.
I've written a simple parser and while testing it I get the above ErrorCall even though I am not using the many combinator directly. Someone on IRC suggested that I'm getting this error because the some combinator uses many under the hood.
I find this behaviour rather unexpected.
There are still uncovered pieces of code and untested cases. In particular:
Text.Megaparsec.Prim.
While writing tests for Text.Megaparsec.Combinator I ran into the following bug:
λ> parseTest (manyTill letterChar (char 'c')) "ab"
parse error at line 1, column 1:
unexpected 'a'
expecting 'c'
This error message is wrong. It should be:
parse error at line 1, column 3:
unexpected end of input
expecting 'c' or letter
It's worth noticing that Parsec doesn't suffer from this:
λ> parseTest (manyTill letter (char 'c')) "ab"
parse error at (line 1, column 3):
unexpected end of input
expecting "c" or letter
The current implementation, however, is functionally the same as Parsec's:
-- Megaparsec
manyTill :: Stream s m t =>
  ParsecT s u m a -> ParsecT s u m end -> ParsecT s u m [a]
manyTill p end = (end *> return []) <|> someTill p end
someTill :: Stream s m t =>
  ParsecT s u m a -> ParsecT s u m end -> ParsecT s u m [a]
someTill p end = (:) <$> p <*> manyTill p end
-- Parsec
manyTill :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m end -> ParsecT s u m [a]
manyTill p end = scan
  where
    scan = do{ end; return [] }
         <|>
           do{ x <- p; xs <- scan; return (x:xs) }
I've checked: changing Megaparsec's manyTill to Parsec's manyTill doesn't solve the problem. This means there is a bug at a lower level.
While it is sad that Text.Megaparsec.Prim still has bugs, I'm glad they at least don't get through our tests.
I'll report the cause of the issue here once I've fixed it.
For ease of adoption, you should export all of Parsec's useful combinators. many1 and spaces are missing, amongst others.
Forcing users to define these manually will likely lead to people not adopting the new library as they will not have a clean "drop-in" substitution.
Here's a Parsec parser:
import Text.Parsec
import Text.Parsec.Language
import Text.Parsec.Token
decimal' = decimal (makeTokenParser haskellDef)
pair = do
  char '{' <?> "opening brace"
  a <- decimal' <?> "1st number"
  char ',' <?> "comma"
  b <- decimal' <?> "2nd number"
  char '}' <?> "closing brace"
  eof
  return (a, b)
It gives the following errors on various inputs:
> pair "{1a,456}"
*** Exception: (line 1, column 3):
unexpected "a"
expecting digit or comma
> pair "{a,456}"
*** Exception: (line 1, column 2):
unexpected "a"
expecting 1st number
Here's an equivalent Megaparsec parser:
import Text.Megaparsec
import Text.Megaparsec.Lexer
pair = do
  char '{' <?> "opening brace"
  a <- decimal <?> "1st number"
  char ',' <?> "comma"
  b <- decimal <?> "2nd number"
  char '}' <?> "closing brace"
  eof
  return (a, b)
And its errors:
> pair "{1a,456}"
*** Exception: line 1, column 3:
unexpected 'a'
expecting 1st number or comma
> pair "{a,456}"
*** Exception: line 1, column 2:
unexpected 'a'
expecting 1st number
Parsec's error when parsing {1a,456} seems more helpful to me (it would be even more apparent if instead of decimal there was a more complicated parser). Am I mistaken?
(See the previous discussion here.)
ParsecT lets us combine parsing with things such as keeping state, accessing configuration, etc. However, any monad you embed into ParsecT won't be affected by Parsec's backtracking properties:
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad.State
import Text.Parsec hiding (State)
trying s = modify (++ ["trying " ++ s])
test :: ParsecT String () (State [String]) Char
test = (trying "'a'" >> char 'a')
   <|> (trying "'b'" >> char 'b')
main = print $ flip execState [] $ runParserT test () "" "b"
-- prints:
-- trying 'a'
-- trying 'b'
In this example having non-backtracking state was desirable. Sometimes having backtracking state is desirable – for instance, my usecase is parsing a list of rules, where each rule is merely a named list of patterns, and patterns can reference earlier-defined patterns (which is why all parsed patterns are added to state); if a single pattern in a rule can't be parsed I want to throw the rule away completely and forget that I ever parsed it, which means forgetting about all successfully-parsed patterns in the rule. (Sorry if the example is unclear or seems contrived.)
One way to have backtracking state is to use Parsec's built-in “user state”. Unfortunately, this solution isn't extensible, as Parsec provides no built-in writer, reader, or any other useful monad you could think of.
Another way is to use
StateT MyStateType Parser a
which, if you aren't familiar with StateT, is another way of saying
MyStateType -> Parser (a, MyStateType)
which hopefully makes it clearer why this variant would preserve backtracking; if it doesn't, look at the definition of <|> for StateT:
m <|> n = StateT $ \ s -> runStateT m s <|> runStateT n s
Here both actions are applied to the same state, so it's impossible that both parsers' modifications would end up in the final state.
At the moment this solution isn't viable either, because all of Parsec's parsers work in ParsecT and can't be used in StateT. So, we have to use lift:
{-# LANGUAGE FlexibleContexts #-}
import Control.Applicative
import Control.Monad.State
import Text.Parsec hiding (State, (<|>))
import Text.Parsec.String
trying s = modify (++ ["trying " ++ s])
test :: StateT [String] Parser Char
test = (trying "'a'" >> lift (char 'a'))
   <|> (trying "'b'" >> lift (char 'b'))
main = do
  let Right res = parse (flip execStateT [] test) "" "b"
  mapM_ putStrLn res
-- prints:
-- trying 'b'
However, we can't just define char = lift . primitiveCharParser because that would make char unusable in stacks of 2 or more transformers. So, we might have to create a new mtl-style class (perhaps MonadParsec?) instead. A similar class already exists in parsers (see Parsing, CharParsing); however, we likely won't be able to reuse parsers because parsers depends on parsec/attoparsec and because parsers has different naming conventions etc:
— mrkkrp
It's probably not possible to write such a thing quickly, at least not with 100% coverage. However, we could cover 30% or more for a start, improving coverage later. What's really important is that megaparsec should be perceived as bug-free and robust software. To achieve this, good testing is necessary.
A service like Coveralls can be used to measure coverage.
Here is the plan:
Text.Megaparsec.Pos
Text.Megaparsec.Error
Text.Megaparsec.Prim
Text.Megaparsec.Combinator
Text.Megaparsec.Char
Text.Megaparsec.Expr
Text.Megaparsec.Perm
Text.Megaparsec.Token
Module Data.Char contains many interesting tests, like isControl, isPrint, isMark, and even allows testing the Unicode categories of characters. We could add more parsers to the Text.Megaparsec.Char module leveraging this functionality. New tests will be easy to write since we already have good utilities for this sort of testing.
Somehow one of the tests failed. It failed on a very simple input. Currently every test is iterated 1000 times. And all other combinations of Cabal/GHC pass the tests. Why this particular test failed is beyond my understanding.
If you have any idea, I'd be interested to hear it.
The format of the location information in error messages generated by parsec is unusual, which makes it harder to provide basic IDE support for languages with parsec-based implementations. For example, I often write code like in this commit to help Emacs jump to locations in parsec errors. Megaparsec could improve on this by making it easier to customize the formatting and/or by choosing a more conventional default.
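As an illustration of what a more conventional default could look like, the sketch below renders a location in the GNU-style "file:line:column: message" form, which tools such as Emacs compilation mode recognize. The function name formatError and its plain-argument signature are assumptions made for the example, not part of either library.

```haskell
-- | Render a location and message in the GNU-style
-- "file:line:column: message" form.
formatError :: FilePath -> Int -> Int -> String -> String
formatError file line col msg =
  file ++ ":" ++ show line ++ ":" ++ show col ++ ": " ++ msg
```

For example, formatError "paper.tex" 12 5 "unexpected 'x'" produces paper.tex:12:5: unexpected 'x'.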
Let's check what the Error module tells us about message constructors:
SysUnExpect message is automatically generated by the Text.Parsec.Combinator.satisfy combinator. The argument is the unexpected input.
UnExpect message is generated by the Text.Parsec.Prim.unexpected combinator. The argument describes the unexpected item.
Expect message is generated by the Text.Parsec.Prim.<?> combinator. The argument describes the expected item.
Message message is generated by the fail combinator. The argument is some general parser message.
Note to self: refresh this section in docs.
After much guessing about how the Error module works, or rather how it's supposed to work versus how it actually works (because of missing doc-strings), I conclude that we could simplify this too. Why should we have two types of unexpected thing? Does it really matter why a thing is unexpected (because the library says so or because the user says so)?
I propose removing the SysUnExpect constructor and keeping only three constructors of error messages:
Unexpected
Expected
Message
The last of them is never really used in Parsec itself, but it can certainly be of some use in the hands of a good programmer, so let it be.
Parsec is not capable of generating good error messages at the program level. Its logic for merging errors and the like is (was, if we talk about Megaparsec) absolutely broken.
For example, the instance of Eq Message says that Expect "foo" and Expect "bar" are equal! This means no serious error checking is possible without rendering the error message (more on this below), and indeed, look at Utils.hs from the old tests:
-- | Returns the error messages associated with a failed parse.
parseErrors :: Parser a -> String -> [String]
parseErrors p input =
  case parse p "" input of
    Left err -> drop 1 $ lines $ show err
    Right _  -> []
Are you serious? This sort of check should be done by examining the list of error messages, which can be extracted with the errorMessages function.
But to make that possible you need to make it work. Not only could the original Parsec not compare error messages, it actually allowed duplicate messages in the list. This means that you could have something like [UnExpect "a", Expect "b", Expect "b"], so even with a proper equality check you cannot compare two lists of messages.
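The equality problem can be illustrated with a small sketch. The Loose type below imitates the behaviour described above by comparing constructors only, while Strict uses derived structural equality. Both types and the tag helper are hypothetical stand-ins, not Parsec's actual definitions.

```haskell
-- | Messages compared by constructor only, imitating the behaviour
-- described in the text (hypothetical stand-in type).
data Loose = LUnExpect String | LExpect String

-- | Constructor tag used by the loose comparison.
tag :: Loose -> Int
tag (LUnExpect _) = 0
tag (LExpect _)   = 1

instance Eq Loose where
  a == b = tag a == tag b

-- | The same messages with derived, structural equality.
data Strict = SUnExpect String | SExpect String
  deriving (Eq, Show)

-- | With loose equality, lists that differ in content compare equal, so a
-- test comparing message lists cannot tell the two parses apart.
looseListsEqual :: Bool
looseListsEqual = [LExpect "foo"] == [LExpect "bar"]

-- | With structural equality the difference is visible.
strictListsEqual :: Bool
strictListsEqual = [SExpect "foo"] == [SExpect "bar"]
```

Here looseListsEqual evaluates to True and strictListsEqual to False, which is why tests written against the loose instance had to fall back on rendered strings.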
I needed quite a bit of luck to identify what some functions should exactly do in Parsec:
-- | @addErrorMessage m err@ returns @ParseError@ @err@ with message @m@
-- added. This function makes sure that list of messages is always ordered
-- and doesn't contain duplicates.
addErrorMessage :: Message -> ParseError -> ParseError
addErrorMessage m (ParseError pos ms) = ParseError pos (pre ++ [m] ++ post)
  where pre  = filter (< m) ms
        post = filter (> m) ms
-- | @setErrorMessage m err@ returns @err@ with message @m@ added. This
-- function also deletes all existing error messages that were created with
-- the same constructor as @m@.
setErrorMessage :: Message -> ParseError -> ParseError
setErrorMessage m (ParseError pos ms) = addErrorMessage m (ParseError pos xs)
  where xs = filter ((/= fromEnum m) . fromEnum) ms
↑ Here, I use the interface of the Error module to keep the list of messages not only sorted, but free of duplicates too. Since a user of the Error module cannot create a ParseError with its constructor (it's not exported, which is the right decision), there is no way to end up with a malformed list of messages now (and this is thoroughly checked now).
This alone has enabled me to actually check messages without the need to render them. This was an essential requirement for building a decent test suite, of course.
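A self-contained sketch of the invariant, under simplifying assumptions: the position is reduced to an Int, Message has the three proposed constructors with derived Eq and Ord, and msgIndex stands in for the fromEnum call used above. The wellFormed check and the sample value are hypothetical additions for illustration.

```haskell
data Message = Unexpected String | Expected String | Message String
  deriving (Eq, Ord, Show)

-- | Constructor index, standing in for the 'fromEnum' call in the post.
msgIndex :: Message -> Int
msgIndex (Unexpected _) = 0
msgIndex (Expected _)   = 1
msgIndex (Message _)    = 2

-- | Position simplified to a single number for this sketch.
data ParseError = ParseError Int [Message]
  deriving (Eq, Show)

errorMessages :: ParseError -> [Message]
errorMessages (ParseError _ ms) = ms

-- | Insert a message, keeping the list sorted and duplicate-free.
addErrorMessage :: Message -> ParseError -> ParseError
addErrorMessage m (ParseError pos ms) = ParseError pos (pre ++ [m] ++ post)
  where pre  = filter (< m) ms
        post = filter (> m) ms

-- | Replace all messages built with the same constructor as @m@.
setErrorMessage :: Message -> ParseError -> ParseError
setErrorMessage m (ParseError pos ms) = addErrorMessage m (ParseError pos xs)
  where xs = filter ((/= msgIndex m) . msgIndex) ms

-- | The invariant: strictly ascending, hence sorted and without duplicates.
wellFormed :: ParseError -> Bool
wellFormed (ParseError _ ms) = and (zipWith (<) ms (drop 1 ms))

-- | Built from unsorted input that contains a duplicate.
sample :: ParseError
sample = foldr addErrorMessage (ParseError 1 [])
  [Expected "b", Unexpected "a", Expected "b"]
```

Because every value is built through addErrorMessage, a test can compare errorMessages results directly with (==) instead of comparing rendered text.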
I would like to tell you about the mergeError function. This is a really weird one (original Parsec code):
mergeError :: ParseError -> ParseError -> ParseError
mergeError e1@(ParseError pos1 msgs1) e2@(ParseError pos2 msgs2)
    -- prefer meaningful errors
    | null msgs2 && not (null msgs1) = e1
    | null msgs1 && not (null msgs2) = e2
    | otherwise
    = case pos1 `compare` pos2 of
        -- select the longest match
        EQ -> ParseError pos1 (msgs1 ++ msgs2)
        GT -> e1
        LT -> e2
This is totally nonsensical, and our growing test suite recently spotted the last bug remaining from this implementation (msgs1 ++ msgs2 allowed duplicates).
Both parts of the function are strange.
Here is the correct implementation:
-- | Merge two error data structures into one joining their collections of
-- messages and preferring shortest match.
mergeError :: ParseError -> ParseError -> ParseError
mergeError e1@(ParseError pos1 ms1) e2@(ParseError pos2 ms2) =
  case pos1 `compare` pos2 of
    LT -> e1
    EQ -> foldr addErrorMessage (ParseError pos1 ms1) ms2
    GT -> e2
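The duplicate bug in the equal-position case can be reproduced with a short sketch. It assumes a simplified ParseError whose position is an Int and a three-constructor Message with derived Ord. The names mergeConcat, mergeFold, errA and errB are hypothetical and exist only to contrast plain concatenation with folding through addErrorMessage.

```haskell
data Message = Unexpected String | Expected String | Message String
  deriving (Eq, Ord, Show)

-- | Position simplified to a single number for this sketch.
data ParseError = ParseError Int [Message]
  deriving (Eq, Show)

-- | Insert a message, keeping the list sorted and duplicate-free.
addErrorMessage :: Message -> ParseError -> ParseError
addErrorMessage m (ParseError pos ms) = ParseError pos (pre ++ [m] ++ post)
  where pre  = filter (< m) ms
        post = filter (> m) ms

-- | Equal-position merge by plain concatenation, as in the old code.
mergeConcat :: ParseError -> ParseError -> ParseError
mergeConcat (ParseError pos ms1) (ParseError _ ms2) = ParseError pos (ms1 ++ ms2)

-- | Equal-position merge by folding through 'addErrorMessage'.
mergeFold :: ParseError -> ParseError -> ParseError
mergeFold e1 (ParseError _ ms2) = foldr addErrorMessage e1 ms2

-- | Two errors at the same position that share one message.
errA, errB :: ParseError
errA = ParseError 5 [Unexpected "x", Expected "digit"]
errB = ParseError 5 [Expected "digit", Expected "letter"]
```

With these inputs, mergeConcat keeps Expected "digit" twice, while mergeFold yields a list that contains it once and stays sorted.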
And finally, such a collection of nonsense could not have gone unnoticed in such a well-known library unless it managed to present things as if they were OK. This is done with the showErrorMessages function:
showErrorMessages ::
    String -> String -> String -> String -> String -> [Message] -> String
showErrorMessages msgOr msgUnknown msgExpecting msgUnExpected msgEndOfInput msgs
    | null msgs = msgUnknown
    | otherwise = concat $ map ("\n"++) $ clean $
                 [showSysUnExpect,showUnExpect,showExpect,showMessages]
    where
      (sysUnExpect,msgs1) = span ((SysUnExpect "") ==) msgs
      (unExpect,msgs2)    = span ((UnExpect    "") ==) msgs1
      (expect,messages)   = span ((Expect      "") ==) msgs2
      showExpect      = showMany msgExpecting expect
      showUnExpect    = showMany msgUnExpected unExpect
      showSysUnExpect | not (null unExpect) ||
                        null sysUnExpect = ""
                      | null firstMsg    = msgUnExpected ++ " " ++ msgEndOfInput
                      | otherwise        = msgUnExpected ++ " " ++ firstMsg
          where
              firstMsg  = messageString (head sysUnExpect)
      showMessages      = showMany "" messages
      -- helpers
      showMany pre msgs = case clean (map messageString msgs) of
                            [] -> ""
                            ms | null pre  -> commasOr ms
                               | otherwise -> pre ++ " " ++ commasOr ms
      commasOr []       = ""
      commasOr [m]      = m
      commasOr ms       = commaSep (init ms) ++ " " ++ msgOr ++ " " ++ last ms
      commaSep          = separate ", " . clean
      separate   _ []     = ""
      separate   _ [m]    = m
      separate sep (m:ms) = m ++ sep ++ separate sep ms
      clean             = nub . filter (not . null)
What a… no, it doesn't look complex because it does complex things. It's just coded this way. First of all, according to the comments in the original Parsec, it's supposed to support many languages, which is why it accepts that many arguments. I wonder if it ever occurred to the author of the function that translation into a different language is not done by one-to-one word substitution. Let's implement it just in English, OK? This alone removes a lot of junk.
But this function has many things to do. It has to compensate for all the flaws in the ParseError model to present things normally. This includes removing duplicate messages and handling a couple of special cases. Also, there is Data.List.intercalate for what they call separate.
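As a rough sketch of the direction suggested here, an English-only renderer over a three-constructor Message type can be much shorter. This assumes the message list is already sorted and duplicate-free. The names renderMessages and orList are hypothetical, and the output layout (one line per kind of message) is an assumption rather than the layout Megaparsec actually uses.

```haskell
import Data.List (intercalate, nub)

data Message = Unexpected String | Expected String | Message String
  deriving (Eq, Ord, Show)

-- | Join alternatives as "a", "a or b", "a, b or c".
orList :: [String] -> String
orList []  = ""
orList [x] = x
orList xs  = intercalate ", " (init xs) ++ " or " ++ last xs

-- | English-only rendering: one line for unexpected items, one line for
-- expected items, then any free-form messages on their own lines.
renderMessages :: [Message] -> String
renderMessages [] = "unknown parse error"
renderMessages ms =
  intercalate "\n" (filter (not . null) [unexpected, expected] ++ others)
  where
    unexpected = render "unexpected" [s | Unexpected s <- ms]
    expected   = render "expecting"  [s | Expected s <- ms]
    others     = nub [s | Message s <- ms, not (null s)]
    -- Drop empty strings and repeats before joining with "or".
    render pre xs = case nub (filter (not . null) xs) of
      [] -> ""
      ys -> pre ++ " " ++ orList ys
```

The hand-written separate helper disappears entirely, since intercalate covers both the comma-separated list and the line breaks.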
Of course, fixing this module alone is not enough for really good error messages in our library, but there are many other changes I've made. I hope this is really going to be a high-quality library now.