mrkkrp / megaparsec

Industrial-strength monadic parser combinator library

megaparsec's Issues

Support parsing and processing of line pragmas

I often write parsers for the small languages that come up in my PL research, and when it's time to write a paper, I like to use lhs2tex --newcode to extract the source code of examples from my *.tex sources for automatic checking. I want the error messages to point to my *.tex file, not the generated file, so I have to add support for {-# LINE ... #-} pragmas to my parser. For example, see this commit.

While the implementation is not overly hard, it requires me to drop to a relatively low level of abstraction. I would appreciate a more high-level API for parsing and processing line pragmas.

Preparations for release of Megaparsec 4.0.0

Some things that should be done:

Review CHANGELOG.md, group changes by categories (general changes, error messages, characters parsing, etc.), so it's easier to read.
Review README.md file and information in .cabal file.
Tag release.
Upload on Hackage.
Add to Stackage.
Mention Megaparsec's release on Reddit or similar site where interested people could find it.

I would like to ask if anyone can help with the last task: I don't have accounts on most sites like that, but I feel the project would benefit from being mentioned there. I'm OK with creating an account for this purpose, but maybe someone with an existing account would like to do it?

Enhance 'count' parser

Looking at this PR: haskell/parsec#25, I see that a parser that could parse at least n tokens and up to m tokens is quite handy. countUpTo doesn't seem necessary; at any rate it should be defined in terms of the more powerful countFromTo. The original count can also be defined in terms of countFromTo, which makes me think we should have countFromTo in Megaparsec. Since it's the most powerful parser in this family, I propose to call it count and define count' n = count n n.
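
To make the proposal concrete, here is a minimal sketch of the idea, written against a deliberately tiny backtracking parser defined in the snippet itself so that it depends only on base. The names Parser, satisfy, countFromTo and countExactly are illustrative assumptions, not Megaparsec's API, and a real implementation would also have to produce sensible error messages.

```haskell
import Control.Applicative (Alternative (..))

-- | A deliberately tiny backtracking parser, defined here only so the
-- sketch is self-contained; it is not Megaparsec's parser type.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> case p s of
    Nothing      -> Nothing
    Just (a, s') -> Just (f a, s')

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    Nothing      -> Nothing
    Just (f, s') -> case pa s' of
      Nothing       -> Nothing
      Just (a, s'') -> Just (f a, s'')

instance Alternative Parser where
  empty = Parser (const Nothing)
  Parser p <|> Parser q = Parser $ \s -> case p s of
    Nothing -> q s
    r       -> r

-- | Accept one character satisfying the predicate.
satisfy :: (Char -> Bool) -> Parser Char
satisfy f = Parser $ \s -> case s of
  (c:cs) | f c -> Just (c, cs)
  _            -> Nothing

-- | Hypothetical @countFromTo m n p@: run @p@ at least @m@ and at most
-- @n@ times. An empty range (@n <= 0@ or @m > n@) yields no results,
-- which is a choice made for this sketch only.
countFromTo :: Alternative f => Int -> Int -> f a -> f [a]
countFromTo m n p
  | n <= 0 || m > n = pure []
  | m > 0           = (:) <$> p <*> countFromTo (m - 1) (n - 1) p
  | otherwise       = ((:) <$> p <*> countFromTo 0 (n - 1) p) <|> pure []

-- | The exact-count parser recovered as the special case @m == n@.
countExactly :: Alternative f => Int -> f a -> f [a]
countExactly n = countFromTo n n
```

For instance, runParser (countFromTo 2 3 (satisfy (== 'a'))) applied to "aaaa" consumes three characters and leaves one, while the same parser applied to "a" fails because the lower bound is not met.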

Add base-4.7.0.0 compatibility

This should be rather easy. It's more or less a question of whether you actually want backwards compatibility, and whether you prefer {-# LANGUAGE CPP #-} with the MIN_VERSION_base macro or "redundant import" warnings.

I just tried a quick 'n dirty patch and imported either Control.Applicative (17 files*) or Data.Foldable (Foldable, foldMap) (1 file), which passed all tests. Note that the import of Control.Applicative isn't necessary in all of those files, e.g. all parseFromFile could be written with infix fmap instead of <$>.

If you're interested in a patch, I can probably prepare a proper one later today.

* Bug9.hs is the only file that needs hiding (Const) due to data Expr = Const ….

GHC Panic!

This can be reproduced on master branch, via cabal repl:

λ> parseTest (string "rere" <* eof) "reri"
ghc: panic! (the 'impossible' happened)
  (GHC version 7.10.1 for x86_64-unknown-linux):
    Loading temp shared object failed: /tmp/ghc9380_0/libghc9380_93.so: undefined symbol: _hpc_tickboxes_megapzuEw3SHAmfXgNLpm5a31oXO6_TextziMegaparsecziError_hpc

Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

I have no idea why this happens or how to construct a minimal example that reproduces it. I can do the same via cabal repl tests without any problems. Original Parsec works, of course. I'm not sure whether this is something I introduced; maybe it is indeed a GHC bug.

Removal of several parsers from 'Text.Megaparsec.Combinator'

OK, this is an important announcement. I'm considering the possibility of
removal of the following parsers from Text.Megaparsec.Combinator:

chainl
chainl1
chainr
chainr1
sepEndBy
sepEndBy1

Chaining functions are useless and misleading. What they do is actually
better performed by Text.Megaparsec.Expr, as is obvious from the examples in
their descriptions. I guess they were written before the module existed, so
the author of Parsec tried to provide more functionality this way.

A user who encounters these combinators will think that this is how
expressions should be parsed, and may start writing code using these
tools instead of Text.Megaparsec.Expr.

sepEndBy and its variant sepEndBy1 are not that useful either. If needed
(and I don't think that is often), they can easily be implemented in terms
of existing combinators, although the author of Parsec strangely preferred
to re-implement the whole thing:

-- | @sepEndBy p sep@ parses /zero/ or more occurrences of @p@,
-- separated and optionally ended by @sep@. Returns a list of values
-- returned by @p@.

sepEndBy p sep = sepBy p sep <* optional sep

-- | @sepEndBy1 p sep@ parses /one/ or more occurrences of @p@,
-- separated and optionally ended by @sep@. Returns a list of values
-- returned by @p@.

sepEndBy1 p sep = sepBy1 p sep <* optional sep

Note that it's not only easy to write these combinators using the existing
ones, but sepBy p sep <* optional sep is also clearer than sepEndBy. In
fact, I had to read the description of that function to understand what it
really does.

Please speak up and provide your arguments if you want to save the
combinators.

Make ParseError a monoid

It's obviously a monoid and it's a shame I haven't noticed this before. The definition is quite simple:

instance Monoid ParseError where
  mempty  = newErrorUnknown (initialPos "")
  mappend = mergeError

The library could use <> instead of mergeError. A couple of tests to check that this satisfies monoid laws should be written.
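
As a rough illustration of what such law checks could look like, here is a sketch on a toy error type. ToyError, toyError and mergeToy are hypothetical stand-ins, not the library's ParseError or mergeError. The merge rule (the error further into the input wins, errors at the same position pool their messages) is an assumption made for this example, and the laws hold only for values built with the smart constructor at positions no earlier than (1, 1).

```haskell
import Data.List (nub, sort)

-- | Toy stand-in for ParseError: a (line, column) position plus messages.
data ToyError = ToyError
  { errPos  :: (Int, Int)
  , errMsgs :: [String]
  } deriving (Eq, Show)

-- | Smart constructor keeping messages sorted and duplicate-free, so that
-- structural equality coincides with "same error".
toyError :: (Int, Int) -> [String] -> ToyError
toyError pos msgs = ToyError pos (sort (nub msgs))

-- | Assumed merge rule: the error further into the input wins; errors at
-- the same position pool their messages.
mergeToy :: ToyError -> ToyError -> ToyError
mergeToy e1 e2 = case compare (errPos e1) (errPos e2) of
  LT -> e2
  GT -> e1
  EQ -> toyError (errPos e1) (errMsgs e1 ++ errMsgs e2)

instance Semigroup ToyError where
  (<>) = mergeToy

-- | The identity is an error with no messages at the initial position.
instance Monoid ToyError where
  mempty = toyError (1, 1) []

-- | The three monoid laws as plain predicates, ready to be handed to a
-- property-testing library or checked on sample values.
leftIdentity, rightIdentity :: ToyError -> Bool
leftIdentity  e = mempty <> e == e
rightIdentity e = e <> mempty == e

associative :: ToyError -> ToyError -> ToyError -> Bool
associative a b c = (a <> b) <> c == a <> (b <> c)
```

With QuickCheck, these predicates would become properties once an Arbitrary instance that goes through the smart constructor is available.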

Get the left input in failure or success

Hi, I'm developing an interpreter and was using Parsec for some preprocessing of an Abstract Syntax Tree. While developing, I asked on StackOverflow how to get the leftover input whether the parser succeeds or not. I have since switched to Megaparsec, following your announcement on the mailing list.

I don't need this feature anymore, but it may be a good idea to have this option anyway.

Remove user state from Megaparsec state

Parsec itself is a monad transformer (ParsecT). This means that a state monad can easily be used with it if the user of the library needs one. However, Parsec, unlike other similar libraries, keeps a special user state field in its State record. As far as I can tell, in the vast majority of cases the user state is nothing but unit ().

Here is a comment found in original Parsec's source code, above definition of instance of MonadState:

I'm presuming the user might want a separate, non-backtracking state aside from the Parsec user state.

I see the stateUser field as an unnecessary complication. I propose removing this custom user state from Megaparsec. What do you think, @albertnetymk, @neongreen?

More expressive functions in expression parsers

(Title is a sort of pun.) See haskell/parsec#32 for more information. This may be a good thing to implement.

Integer doesn't fail for leading white spaces

See haskell/parsec#39 for more information. This should be easy to fix given our propositions in the thread. Also, new tests should be added.

Columns should be counted from zero

Currently columns are counted from 1 with 1 being the minimum. This seems to be incorrect. I'm writing (rather stupid, really) tests for Text.Megaparsec.Pos and this module's function updateCharPos itself proves that current numbering is wrong. Here is description of the function:

Update a source position given a character. If the character is a newline ('\n') or carriage return ('\r') the line number is incremented by 1. If the character is a tab ('\t') the column number is incremented to the nearest 8'th column, i.e. column + 8 - ((column-1) mod 8). In all other cases, the column is incremented by 1.

So, this means that screen is divided by tabs in non-regular fashion, because second "section" according to that logic starts from 8th column and ends on 16th column and it's 8 characters long. The problem is that first "section" starts from 1st column and ends on 8th column and so it's 7 characters long!

Also, the source code doesn't check the \r character at all. I wonder if this may have implications on some systems, because \n\r will result in column 2 according to the source code...

“Consumed-OK” and “empty-OK” continuations take 'ParseError' as argument

This is strange, why would continuations that represent successful outcome of parsing provide ParseError records? To find out, I simply re-wrote definition of ParsecT this way:

newtype ParsecT s u m a = ParsecT
  { unParser :: forall b . State s u
             -> (a -> State s u -> m b) -- consumed-OK
             -> (ParseError -> m b)     -- consumed-error
             -> (a -> State s u -> m b) -- empty-OK
             -> (ParseError -> m b)     -- empty-error
             -> m b }

Then I removed all the useless stuff like cerr x s (unknownError s) which Parsec needed to fill out ParseError argument everywhere. This actually almost fixed that failing test from old tests and made code in Text.Megaparsec.Prim a lot clearer.

But that argument had its purpose, of course. Its only purpose is to provide information about the expected token when something fails later. I hope an example will clarify. At a lower level, many p will succeed even if p fails at some point. Everything collected so far is returned along with this failure, but as an argument of the “consumed-OK” continuation, so normally it will be ignored. However, if the parser after many p (let's call it n) fails, the stored failure will contribute to the error message, so the user will see:

unexpected something
expecting <what n parses> or <what p parses>

rather than

unexpected something
expecting <what n parses>

First error message is better, of course. But current implementation is rather convoluted. It conflicts with clear principles that I want to keep about merging of error messages. And that ParseError in “consumed-OK” continuation is not really a parse error on its own. It's more like a hint, useful in only very specific cases (when something fails right after that “successful” parser) and rather buggy in other cases.

I'm wondering what would be the best solution to this problem. If we eliminate ParseError argument from successful continuations (as common sense tells us to do), we will need some way to store this sort of hint. This hint should automatically expire when another parser succeeds (consuming input) after the tricky one.

If anyone wants to discuss this, please do.

Possible change in 'Text.Megaparsec.Prim.token'

According to this pull request for Parsec: haskell/parsec#46, some tokens (more “complex” than characters) may benefit if the second argument of the token combinator is of type (t -> Either e a), so that additional information can be returned and included in the error message.

We could include such a change in Megaparsec.

However, I think the proposed solution is a bit bulky. If we want to give the caller of token the ability to influence the type of error message, it would be better written as:

token :: Stream s m t =>
         (SourcePos -> t -> s -> SourcePos)
      -> (t -> Either Message a)
      -> ParsecT s u m a

…then ParseError can easily be constructed from Message.

If we decide to do this change it will be very easy. Only satisfy from Text.Megaparsec.Char uses token and it would be very natural to return Left . Unexpected . showToken $ x when supplied predicate isn't satisfied.

Having trouble with `makeExprParser`

Hi, I'm having trouble with a SQL parser using megaparsec. When I parse "SELECT DISTINCT db.stuff.id FROM stuff WHERE id = 1" it works but when I parse "SELECT DISTINCT db.stuff.id FROM stuff WHERE id = 1 ORDER BY id, name ASC LIMIT 20" it errors:

> parse query "" "SELECT DISTINCT db.stuff.id FROM stuff WHERE id = 1 ORDER BY id, name ASC LIMIT 20"
Left line 1, column 53:
unexpected 'O'
expecting '%' or white space

% is the first operator in my expression parser

expression :: Parsec String Expression
expression = label "expression" $ makeExprParser term operators
  where
    term = ...
    operators =
      [ [ InfixN $ (ExpressionOPERATION .: OperationMOD) <$ sqltoken "%" ]
      , ...
      ]

It's kind of weird that the error refers to only the first operator in the operator table. How do I get this to work? Sorry, not very experienced with parsing libraries.

Add option to avoid automatic consumption of new-lines by lexemes

See haskell/parsec#24 for more information. I wonder if haskell/parsec#41 can be helpful?

Create combinator for multiple failures

As laid out in the StackOverflow question “How to return multiple parse failures within Parsec's monadic context”, the fail combinator only allows a single Message value to be placed into the ParseError message list, even though there are cases in parsing context-sensitive grammars where multiple failures can exist. There is no way to place multiple Message values into the ParseError message list through the Parsec library's exported functions.

I would like a fails or failures combinator of type:

fails :: [String] -> ParsecT s u m a

This combinator would allow multiple Message values to be placed in the ParseError message list when multiple failures exist in a context-sensitive grammar.

runParser function that receives initial SourcePos

I looked for a function that did this, and none seemed to do it. If we are parsing certain parts of a file, we may want to set the initial SourcePos to something different than 1, 1.

This could be easily implemented with

runParserInitPos :: Stream s t => Parsec s a -> String -> Int -> Int -> s -> Either ParseError a
runParserInitPos act src lin col str = runParser (setPosition pos >> act) src str
    where
        pos :: SourcePos
        pos = newPos src lin col

But this may be a good function to have exposed to the users.

multiline string literals

Hi,

I would like to have such a feature; however, doing that in Parsec requires copying the whole Text.Parsec.Token code. Maybe you have some ideas on how to make this token generator more flexible? This is also related to the whitespace issue we discussed before.

Should string' return the actual parsed string?

Currently it returns its argument, unlike char'. It seems to me that returning the parsed string could be more useful, especially if it can be made faster than naive string' = mapM char'.
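
To illustrate the difference between returning the argument and returning the parsed text, here is a small sketch on plain strings, independent of any parser library. The function matchCI is a hypothetical helper, and comparing with toLower is only an approximation of proper Unicode case folding.

```haskell
import Data.Char (toLower)

-- | Match @pat@ against the start of @input@ ignoring case. On success,
-- return the text actually present in the input (not @pat@ itself)
-- together with the remaining input.
matchCI :: String -> String -> Maybe (String, String)
matchCI pat input
  | length matched == length pat
      && map toLower matched == map toLower pat = Just (matched, rest)
  | otherwise = Nothing
  where
    -- Take as many characters as the pattern has; fewer if input is short.
    (matched, rest) = splitAt (length pat) input
```

Here matchCI "select" "SeLeCt * from t" yields the original spelling "SeLeCt" rather than the pattern "select", which is what a caller needs in order to echo or re-emit the source text faithfully.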

Improve documentation

We need to achieve 100% coverage of most modules. Parsec tends to export more than necessary, so we should really clean up its API and everything we export should be thoroughly documented.

Tab width must be configurable

Currently in Megaparsec (just like in Parsec) tab width is not configurable, it's hard-coded in updatePosChar function and is equal to 8. This can be a problem. Many people use tabs to indent code and 4 is popular value for tab-width. If we cannot provide a way to specify tab-width, parsers created with Megaparsec may output confusing column positions in error messages in some cases.

I don't think it's possible to release Megaparsec 4.0.0 with such a flaw.

The proposed solution includes adding a tab-width field to the State record and a couple of functions to set and get it in Text.Megaparsec.Prim. updatePosChar should take the tab width as an argument, and other functions should get it and pass it to updatePosChar.
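
A minimal sketch of what a tab-width-aware position update could look like is given below. Pos, updatePosCharWith and updatePosStringWith are illustrative names rather than the library's actual types or functions. The tab rule simply generalises the documented formula column + 8 - ((column-1) mod 8) from 8 to an arbitrary positive width w; only '\n' starts a new line in this sketch.

```haskell
-- | Toy source position with 1-based line and column (a stand-in for
-- SourcePos, without the file name).
data Pos = Pos { posLine :: Int, posColumn :: Int }
  deriving (Eq, Show)

-- | Advance a position by one character, given a tab width @w@
-- (assumed to be positive).
updatePosCharWith :: Int -> Pos -> Char -> Pos
updatePosCharWith w (Pos l c) ch = case ch of
  '\n' -> Pos (l + 1) 1                       -- new line, back to column 1
  '\t' -> Pos l (c + w - ((c - 1) `mod` w))   -- jump to the next tab stop
  _    -> Pos l (c + 1)                       -- ordinary character

-- | Advance a position over a whole string.
updatePosStringWith :: Int -> Pos -> String -> Pos
updatePosStringWith w = foldl (updatePosCharWith w)
```

With width 8 a tab at column 1 moves to column 9, and with width 4 it moves to column 5, so the reported column would follow whatever the user has configured.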

If you have other propositions, I would like to hear them.

Add base-4.6.0.x compatibility

So, while I was working on #45, I noticed that there wasn't much missing in 4.6.0.0, except for Data.Bool (bool) and Data.Either (isRight, isLeft).

I've prepared a commit based on the 4.7.0.0 patch, that mainly provides a proof of concept to show that 4.6.0.x compatibility is feasible. However, it's a mess and not really easy to maintain.

Instead, if there's interest to add base-4.6.0.x compatibility, I would introduce a new module (Text.Megaparsec.Comp or similar), which re-exports the functions in base-4.7.0.0 or higher, and defines the three functions for older versions. That should give a single point of maintenance.

A possible bug in Text.Parsec.Token.float

See haskell/parsec#35 for more information. This should be fixed and corresponding tests added. Also consider improving parsing of other types of numbers.

Better representation of characters in error messages

I've started writing QuickCheck tests for Text.Megaparsec.Char, and the first function I tested, oneOf, showed me that the current printing of characters in error messages is inconsistent and unacceptable in general.

Currently printing function (from Parsec-era) amounts to:

showCh x = show [x]

Thanks to my first test for this module, I know that it displays characters as strings sometimes. For example, this may result in unexpected 'a' printed as unexpected "a", when actually character is meant, for example in this case:

λ> parseTest (oneOf "a" <* eof) "ab"
parse error at line 1, column 2:
unexpected 'b'
expecting end of input
-- ↑ correct, we talk about a character
λ> parseTest (oneOf "a" <* eof) "blah"
parse error at line 1, column 1:
unexpected "b"
-- ↑ what is this? a string?

Ironically enough, authors of Parsec have better solution to print characters, not in their code though, but in an example for tokenPrim:

-- > char c  = tokenPrim showChar nextPos testChar
-- >   where showChar x       = "'" ++ x ++ "'"
-- >         testChar x       = if x == c then Just x else Nothing
-- >         nextPos pos x xs = updatePosChar pos x

Of course they meant "'" ++ [x] ++ "'", but even in this form this won't work decently because it will print newlines literally, breaking (in two lines) text of error message for example.

If we choose another way and write showChar = show we will get better Haskellish syntax, where newlines are \ns, but also this sort of thing '\456' for various otherwise perfectly legal characters like greek letters.

Conclusion: we should develop better printing function, where newlines are printed as word newline, so it would say unexpected newline, while printing all other characters like π in their normal form.
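
A possible shape for such a function is sketched below. The name prettyChar and the exact set of specially named characters are assumptions made for illustration, not what the library does.

```haskell
-- | Render a character for use in an error message: a few whitespace
-- characters get descriptive names, everything else is shown literally
-- between single quotes (so non-ASCII letters are not escaped).
prettyChar :: Char -> String
prettyChar '\n' = "newline"
prettyChar '\t' = "tab"
prettyChar ' '  = "space"
prettyChar c    = ['\'', c, '\'']

-- | Build the "unexpected ..." line of an error message.
unexpectedChar :: Char -> String
unexpectedChar c = "unexpected " ++ prettyChar c
```

With this, unexpectedChar '\n' gives "unexpected newline", and a Greek letter such as π is printed as 'π' instead of the escaped '\960' that show would produce.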

Comments are welcome. I ask myself if I should open similar issue for vanilla Parsec, although something tells me that I will be ignored.

MegaParsec or Megaparsec?

Pros of MegaParsec:

It's more obvious that it's related to Parsec. (Is it actually a good thing?)

Cons of MegaParsec:

“parsec” is a unit of measurement, so it shouldn't be capitalised in “megaparsec”.
“Megaparsec” is consistent with “Attoparsec” and “Picoparsec”, 2 other libraries with “parsec” in them.
I suspect (but can't prove) that everyone who uses ThisCapitalisation in nir product's name regrets it when the product becomes popular.

Error messages by 'string' parser suck

Here is an example:

λ> parseTest (string "re" <* eof) "ri"
parse error at line 1, column 1:
unexpected "i"
expecting "re"

This reports first mismatching character, but specified position is at the beginning of the whole string. Also, since it's logical that we should tell the user whole expected word, we should also tell him/her whole mismatching part, so:

λ> parseTest (string "re" <* eof) "ri"
parse error at line 1, column 1:
unexpected "ri"
expecting "re"
λ> parseTest (string "foo" <* eof) "bar"
parse error at line 1, column 1:
unexpected 'b'
expecting "foo"

↑ this is how proper error message should look, imho. Let's see if this can be fixed.

Grouping of (<|>) operators can change result error message

This is how to reproduce the problem:

λ> parseTest ((((try $ string ">>>") <|> (empty <|> return "foo")) <?> "bar") <* eof) ">>"
parse error at line 1, column 1:
unexpected '>'
expecting ">>>", bar, or end of input
λ> parseTest (((((try $ string ">>>") <|> empty) <|> return "foo") <?> "bar") <* eof) ">>"
parse error at line 1, column 1:
unexpected '>'
expecting bar or end of input

Sorry for so many parentheses, this is just to be sure how things are
grouped. Note that former example models foldr while latter example
corresponds to foldl. If we group combinators a bit differently we get
different error messages. This is a bug from my point of view.

This is a subtle one, it's not easy to reproduce. You can play with this
too, maybe you can find out cause of this problem faster than me.

An embedded tutorial

Megaparsec might be able to replace Parsec much faster if it has a tutorial (not just a list of distinctions between parsec and megaparsec) – this way instead of everyone saying “okay there's Megaparsec but maybe use Parsec instead because all tutorials cover it and they aren't that different” we might end up with “okay there's Parsec but Megaparsec comes with an extensive tutorial so look at it instead”. So, I offer to write such a tutorial, and then we could put it in Text.Megaparsec.Tutorial.

(The reason I'm making this an issue instead of just silently writing it and then making a pull request is that I don't want a repetition of the situation with Tekmo and lens tutorial. So, if e.g. @mrkkrp has strong opinions on how it should be written, it would be better to hear them early on instead of ending up with something written in my usual idiosyncratic style and then having the pull request rejected.)

Flip argument order for “label”

It'd be nice if it was possible to use label like this:

yearP :: Parser Integer
yearP = label "year" $ do
  s <- some digit
  return $ if length s < 4 then 2000 + read s else read s

(The current signature is label :: ParsecT s u m a -> String -> ParsecT s u m a.)

Become parsec-4.0.0.0

I'm following up on a discussion following the announcement of megaparsec-4.0.0.0 on Haskell-Cafe.

I'll cut to the chase and say that quite a few people are excited about megaparsec and see it as a worthwhile candidate for reinvigorating parsec (as major version 4.0), the development of which has been pretty stale lately. Probably, the smoothest way to achieve that would be for @mrkkrp to become a (primary) maintainer of parsec on Hackage. Would you like that, Mark?

If so, there's a formal process for that https://wiki.haskell.org/Taking_over_a_package . But I don't think there would be any issues with that as there is awareness and support of this among some of the parsec maintainers and Hackage Trustees already. I can lend my hand in facilitating the process, if needed.

I think this is the smoothest way to "replace" parsec, as you would be leveraging the well-known name, while fixing a lot of long standing issues. I wouldn't worry about backwards compatibility, especially, since megaparsec is already compatible with the two previous versions of GHC.

`notFollowedBy` always succeeds with parsers that don't consume input

See haskell/parsec#8 for more information. This should be definitely fixed in megaparsec.

Discussion about new lexer

I've finished new lexer. Conceptually it doesn't lock user into some particular methodology of parsing but just gives collection of useful helpers. It's responsibility of the programmer to glue everything together.

See branch new-lexer: Text.Megaparsec.Lexer.

Note that the old record-based system is eliminated, thus Text.Megaparsec.Language is eliminated too (after all, I don't think it was ever useful to anyone except for providing emptyDef).

This is the place where you can criticize my decisions, say “What have you done!”, and change something before it's too late.

One of old tests is currently failing, this will be investigated and tests for entire Text.Megaparsec.Lexer will be written. After that the branch new-lexer will be merged. I estimate all the work won't take longer than a couple of days.

What about an indent parser?

Hi Mark,

is this still planned?

Daniel

string' test broken

A Travis build has shown that the case-insensitive test doesn't work as intended. A minimal example that shows the defect has also been provided in that build:

-- prop_string a s
prop_string "a" "A"

This returns the match "A" correctly, but the test expects a matched "a" due to the logic in checkString. The problem here is that the parser string' a doesn't necessarily return a, but s if the case-insensitive match is successful.

You can reproduce this result with the following property:

prop_string'_custom :: String -> Property
prop_string'_custom a = prop_string' (map toLower a) (map toUpper a)

I'm looking for help to finish tests sooner

Intro

As you probably know, Megaparsec is going to be thoroughly tested. This is absolutely necessary to make sure that our library won't “surprise” users with some ugly quirks. It's also necessary if we want to continue active development after Megaparsec 4.0.0 is released. Look at Parsec: it cannot move forward anymore, because every change can backfire in unpredictable ways (thanks to how it's coded, too).

So, I'm trying to cover everything and I'm currently working on Text.Megaparsec.Prim, but I foresee that I'll soon get quite busy and my free time that I can devote to free software development will be minimal.

Thus I ask you to help me write some tests. It's not necessary to submit tests for entire module, one or two tests will help me a lot.

How to write tests

Here is some information you may find helpful if you wish to help me.

Take a look at the Util module and how the existing tests for Text.Megaparsec.Char and Text.Megaparsec.Prim are written. Every function in Util has documentation, so it should be clear how to use them to test parsers.
Please construct situations that lead to various outcomes of parsing. Check all possible results of parsing, including error messages.
For error messages we need to check exact collection of messages, they need to be adequate and precise, always. If you think your test is good, but Megaparsec fails to pass it, please report it (this is still possible, although probability of such problems is minimal now).
It's OK to get result of parsing using equivalent (already tested) parser with help of simpleParse function from Util.

What to test

Current status of Megaparsec is:

all modules except for Text.Megaparsec.Expr and Text.Megaparsec.Token are finished and work (or should work) as intended;
from finished modules, some are already covered with tests, some don't require testing (Text.Megaparsec.Language), and some need testing.

Finished modules that need testing:

Text.Megaparsec.Combinator
Text.Megaparsec.Perm

Tests for unfinished modules are not currently welcomed, because their API/provided features will change in future, so it's too soon to write tests for them.

Thank you for your interest in Megaparsec project.

Why don't we benchmark the stuff?

It might be a good idea to benchmark at least basic things. Not sure I can do all the stuff myself, but let it hang here, maybe someone could help in future and write some benchmarks. I think criterion library can be used.

<|>, etc should be reexports

It's really annoying that I have to hide <|>, many, optional from Control.Applicative whenever I want to use Parsec, so I propose making them reexports from Control.Applicative in this package.

Propositions for Megaparsec's site

I think now we can concentrate on tutorials and other materials. In particular, we can use GitHub project pages to host some useful information related to the project.

Here is what I think such a site should contain:

Link to GitHub repository (“fork me on GitHub” or something like this).
Link to Hackage page of the project.
An article explaining how to switch from Parsec to Megaparsec.
A couple of tutorials.

If you have propositions, tell me.

There is also an idea to edit README.md file. It should define Megaparsec not in terms of Parsec, but as something self-sufficient that non-Parsec user can understand. Comparison against Parsec should be moved to dedicated section. It's a good idea to add hyperlinked table of contents at the beginning.

Combinator 'many' is applied to a parser that accepts an empty string.

I've written a simple parser and while testing it I get the above ErrorCall even though I am not using the many combinator directly. Someone on IRC suggested that I'm getting this error because the some combinator uses many under the hood.

I find this behaviour rather unexpected.

Write more tests

There are still uncovered pieces of code and non-tested cases. In particular:

Functions and instances in Text.Megaparsec.Prim.
Quality of error messages of expression parser.

Bug in 'manyTill'

While writing tests for Text.Megaparsec.Combinator I ran into the following bug:

λ> parseTest (manyTill letterChar (char 'c')) "ab"
parse error at line 1, column 1:
unexpected 'a'
expecting 'c'

this error message is wrong. It should be:

parse error at line 1, column 3:
unexpected end of input
expecting 'c' or letter

It's worth noticing that Parsec doesn't suffer from this:

λ> parseTest (manyTill letter (char 'c')) "ab"
parse error at (line 1, column 3):
unexpected end of input
expecting "c" or letter

Current implementation however is functionally the same as Parsec's:

-- Megaparsec

manyTill :: Stream s m t =>
            ParsecT s u m a -> ParsecT s u m end -> ParsecT s u m [a]
manyTill p end = (end *> return []) <|> someTill p end

someTill :: Stream s m t =>
            ParsecT s u m a -> ParsecT s u m end -> ParsecT s u m [a]
someTill p end = (:) <$> p <*> manyTill p end

-- Parsec

manyTill :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m end -> ParsecT s u m [a]
manyTill p end      = scan
                    where
                      scan  = do{ end; return [] }
                            <|>
                              do{ x <- p; xs <- scan; return (x:xs) }

I've checked: changing Megaparsec's manyTill to Parsec's manyTill doesn't solve the problem. This means there is a bug at a lower level.

While it is sad that Text.Megaparsec.Prim still has bugs, I'm glad they at least don't get through our tests.

I'll report the cause of the issue here once I've fixed it.

Export all of original parsec's combinators

For ease of adoption you should export all of Parsec's useful combinators.

many1 and spaces are missing amongst others.

Forcing users to define these manually will likely lead to people not adopting the new library as they will not have a clean "drop-in" substitution.

A possible regression regarding error reporting

Here's a Parsec parser:

import Text.Parsec
import Text.Parsec.Language
import Text.Parsec.Token

decimal' = decimal (makeTokenParser haskellDef)

pair = do
  char '{'      <?> "opening brace"
  a <- decimal' <?> "1st number"
  char ','      <?> "comma"
  b <- decimal' <?> "2nd number"
  char '}'      <?> "closing brace"
  eof
  return (a, b)

It gives the following errors on various inputs:

> pair "{1a,456}"
*** Exception: (line 1, column 3):
unexpected "a"
expecting digit or comma

> pair "{a,456}"
*** Exception: (line 1, column 2):
unexpected "a"
expecting 1st number

Here's an equivalent Megaparsec parser:

import Text.Megaparsec
import Text.Megaparsec.Lexer

pair = do
  char '{'     <?> "opening brace"
  a <- decimal <?> "1st number"
  char ','     <?> "comma"
  b <- decimal <?> "2nd number"
  char '}'     <?> "closing brace"
  eof
  return (a, b)

And its errors:

> pair "{1a,456}"
*** Exception: line 1, column 3:
unexpected 'a'
expecting 1st number or comma

> pair "{a,456}"
*** Exception: line 1, column 2:
unexpected 'a'
expecting 1st number

Parsec's error when parsing {1a,456} seems more helpful to me (it would be even more apparent if instead of decimal there was a more complicated parser). Am I mistaken?

Using parsers with monad transformers

(See the previous discussion here.)

ParsecT lets us combine parsing with things such as keeping state, accessing configuration, etc. However, any monad you embed into ParsecT won't be affected by Parsec's backtracking properties:

{-# LANGUAGE FlexibleContexts #-}

import Control.Monad.State
import Text.Parsec hiding (State)

trying s = modify (++ ["trying " ++ s])

test :: ParsecT String () (State [String]) Char
test = (trying "'a'" >> char 'a')
   <|> (trying "'b'" >> char 'b')

main = mapM_ putStrLn $ flip execState [] $ runParserT test () "" "b"
-- prints:
--   trying 'a'
--   trying 'b'

In this example having non-backtracking state was desirable. Sometimes having backtracking state is desirable. For instance, my use case is parsing a list of rules, where each rule is merely a named list of patterns, and patterns can reference earlier-defined patterns (which is why all parsed patterns are added to state). If a single pattern in a rule can't be parsed, I want to throw the rule away completely and forget that I ever parsed it, which means forgetting about all successfully parsed patterns in the rule. (Sorry if the example is unclear or seems contrived.)

One way to have backtracking state is to use Parsec's built-in “user state”. Unfortunately, this solution isn't extensible, as Parsec provides no built-in writer, reader, or any other useful monad you could think of.

Another way is to use

StateT MyStateType Parser a

which, if you aren't familiar with StateT, is another way of saying

MyStateType -> Parser (a, MyStateType)

which hopefully makes it clearer why this variant would preserve backtracking; if it doesn't, look at the definition of <|> for StateT:

m <|> n = StateT $ \ s -> runStateT m s <|> runStateT n s

Here both actions are applied to the same state, so it's impossible that both parsers' modifications would end up in the final state.

At the moment this solution isn't viable either, because all Parsec's parsers work in ParsecT and can't be used in StateT. So, we have to use lift:

{-# LANGUAGE FlexibleContexts #-}

import Control.Applicative
import Control.Monad.State
import Text.Parsec hiding (State, (<|>))
import Text.Parsec.String

trying s = modify (++ ["trying " ++ s])

test :: StateT [String] Parser Char
test = (trying "'a'" >> lift (char 'a'))
   <|> (trying "'b'" >> lift (char 'b'))

main = do
  let Right res = parse (flip execStateT [] test) "" "b"
  mapM_ putStrLn res
-- prints:
--   trying 'b'

However, we can't just define char = lift . primitiveCharParser because that would make char unusable in stacks of two or more transformers. So we might have to create a new mtl-style class (perhaps MonadParsec?) instead. A similar class already exists in parsers (see Parsing, CharParsing); however, we likely won't be able to reuse parsers because parsers depends on parsec/attoparsec and because parsers has different naming conventions, etc.:

I'm not against reinventing the wheel if it's easier than buying one and gives better results. For example, Megaparsec has different naming conventions for parsing various categories of characters, etc. Not sure we could reuse this “as is”.

— mrkkrp
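To make the mtl-style idea concrete, here is a minimal sketch that uses only base. Everything in it is hypothetical: the toy Parser, the hand-rolled StateT, and the single-method MonadParsec class are stand-ins chosen for illustration, not the library's design.

```haskell
import Control.Applicative (Alternative (..))

-- A toy backtracking parser over String (a stand-in for the real one).
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> fmap (\(a, r) -> (f a, r)) (p s)

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    Nothing      -> Nothing
    Just (f, s') -> fmap (\(a, r) -> (f a, r)) (pa s')

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> case p s of
    Nothing      -> Nothing
    Just (a, s') -> runParser (f a) s'

instance Alternative Parser where
  empty = Parser (const Nothing)
  Parser p <|> Parser q = Parser $ \s -> maybe (q s) Just (p s)

-- A hand-rolled StateT, so the sketch needs nothing beyond base.
newtype StateT s m a = StateT { runStateT :: s -> m (a, s) }

instance Functor m => Functor (StateT s m) where
  fmap f (StateT g) = StateT $ \s -> fmap (\(a, s') -> (f a, s')) (g s)

instance Monad m => Applicative (StateT s m) where
  pure a = StateT $ \s -> return (a, s)
  StateT mf <*> StateT ma = StateT $ \s -> do
    (f, s')  <- mf s
    (a, s'') <- ma s'
    return (f a, s'')

instance Monad m => Monad (StateT s m) where
  StateT m >>= f = StateT $ \s -> m s >>= \(a, s') -> runStateT (f a) s'

-- Both branches start from the same state, so the state backtracks.
instance (Monad m, Alternative m) => Alternative (StateT s m) where
  empty = StateT (const empty)
  StateT m <|> StateT n = StateT $ \s -> m s <|> n s

modify :: Monad m => (s -> s) -> StateT s m ()
modify f = StateT $ \s -> return ((), f s)

-- The hypothetical mtl-style class: one primitive, lifted through transformers.
class (Monad m, Alternative m) => MonadParsec m where
  satisfy :: (Char -> Bool) -> m Char

instance MonadParsec Parser where
  satisfy ok = Parser $ \s -> case s of
    (c : rest) | ok c -> Just (c, rest)
    _                 -> Nothing

instance MonadParsec m => MonadParsec (StateT s m) where
  satisfy ok = StateT $ \s -> fmap (\c -> (c, s)) (satisfy ok)

-- 'char' works in any stack built from these pieces, with no explicit 'lift'.
char :: MonadParsec m => Char -> m Char
char = satisfy . (==)

trying :: Monad m => String -> StateT [String] m ()
trying s = modify (++ ["trying " ++ s])

abParser :: StateT [String] Parser Char
abParser = (trying "'a'" >> char 'a') <|> (trying "'b'" >> char 'b')

-- Run 'abParser' and return the log left in the state, if the parse succeeds.
logFor :: String -> Maybe [String]
logFor input = fmap (snd . fst) (runParser (runStateT abParser []) input)
```

Because the instance for StateT is written once against the class, the same char would also work under further layers that get their own lifting instance, which is the property that a plain lift-based definition lacks.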

Complete QuickCheck test suite

It's probably not possible to write such a thing quickly, at least not with 100% coverage. However, we could cover 30% or more for a start and improve coverage later. What's really important is that megaparsec should be perceived as bug-free and robust software. To achieve this, good testing is necessary.

A service like Coveralls can be used to measure coverage.

Here is the plan:

Add more parsers to 'Text.Megaparsec.Char' module

The Data.Char module contains many interesting predicates, like isControl, isPrint, and isMark, and even allows testing the Unicode category of a character. We could add more parsers to the Text.Megaparsec.Char module leveraging this functionality. New tests will be easy to write since we already have good utilities for this sort of testing.
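A minimal sketch of what such parsers might look like. The names controlChar, printChar, markChar, and charCategory are hypothetical, and ReadP from base stands in for the library's parser type so the example is self-contained.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isControl, isMark, isPrint)
import Text.ParserCombinators.ReadP (ReadP, readP_to_S, satisfy)

-- Hypothetical names; each parser accepts one character matching a predicate.
controlChar, printChar, markChar :: ReadP Char
controlChar = satisfy isControl
printChar   = satisfy isPrint
markChar    = satisfy isMark

-- Match any character belonging to the given Unicode general category.
charCategory :: GeneralCategory -> ReadP Char
charCategory cat = satisfy ((== cat) . generalCategory)

-- Report whether a single-character parser accepts the head of the input.
accepts :: ReadP Char -> String -> Bool
accepts p = not . null . readP_to_S p
```

For instance, accepts (charCategory CurrencySymbol) "$" holds, while accepts printChar "\t" does not, since a tab is a control character.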

A failing test (GHC 7.8.4)

Somehow one of the tests failed, and it failed on very simple input. Currently every test is iterated 1000 times, and all other combinations of Cabal/GHC pass the tests. Why this particular test failed is beyond my understanding.

If you have any idea, I'd be interested to hear it.

Show errors in more standard format

The format of the location information in error messages generated by parsec is unusual, which makes it harder to provide basic IDE support for languages with parsec-based implementations. For example, I often write code like in this commit to help Emacs jump to locations in parsec errors. Megaparsec could improve on this by making it easier to customize the formatting and/or by choosing a more standard default.
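One possible reading of "more standard" is the file:line:column layout that compilers such as GCC and GHC print. The sketch below assumes that layout; the SourcePos record here is a simplified stand-in whose field names only mirror Parsec's accessors, and showPos and showError are hypothetical helpers.

```haskell
-- A minimal source position; a stand-in for the library's own type.
data SourcePos = SourcePos
  { sourceName   :: FilePath
  , sourceLine   :: Int
  , sourceColumn :: Int
  }

-- Render a position as "file:line:column", or "line:column" when the
-- source has no name.
showPos :: SourcePos -> String
showPos (SourcePos name line col)
  | null name = coords
  | otherwise = name ++ ":" ++ coords
  where
    coords = show line ++ ":" ++ show col

-- Put the position on the first line, followed by the message lines.
showError :: SourcePos -> [String] -> String
showError pos msgs = unlines ((showPos pos ++ ":") : msgs)
```

For example, showPos (SourcePos "paper.tex" 12 3) yields "paper.tex:12:3".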

Remove `SysUnExpect` error message constructor

Let's check what Error module tells us about message constructors:

A SysUnExpect message is automatically generated by the Text.Parsec.Combinator.satisfy combinator. The argument is the unexpected input.
A UnExpect message is generated by the Text.Parsec.Prim.unexpected combinator. The argument describes the unexpected item.
A Expect message is generated by the Text.Parsec.Prim.<?> combinator. The argument describes the expected item.
A Message message is generated by the fail combinator. The argument is some general parser message.

Note to self: refresh this section in docs.

After much guessing about how the Error module works, or rather how it's supposed to work and how it actually works (because of missing doc-strings), I conclude that we could simplify this too. Why should we have two types of unexpected thing? Does it really matter why a thing is unexpected (because the library says so or the user says so)?

I propose removing the SysUnExpect constructor and having only three constructors of error messages:

Unexpected
Expected
Message

The last of them is never really used in Parsec itself, but it can certainly be of some use in the hands of a good programmer, so let it be.
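A minimal sketch of the proposed type, assuming each constructor carries a String and that Eq and Ord are simply derived. The helper messageKind is hypothetical.

```haskell
-- The three proposed constructors. Deriving Eq and Ord structurally means
-- that two messages are equal only when both constructor and text agree.
data Message
  = Unexpected String
  | Expected   String
  | Message    String
  deriving (Eq, Ord, Show)

-- Index of the constructor, useful for grouping messages by kind.
messageKind :: Message -> Int
messageKind (Unexpected _) = 0
messageKind (Expected   _) = 1
messageKind (Message    _) = 2
```

With derived instances, Expected "foo" and Expected "bar" compare as different, and every Unexpected message sorts before every Expected one.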

For Casual Reading

Parsec is not capable of generating error messages that are good for programmatic use. Its logic for merging errors, etc. is (was, if we talk about Megaparsec) absolutely broken.

For example, the Eq Message instance says that Expect "foo" and Expect "bar" are equal! This means no serious error checking is possible without rendering the error message (more on this below), and indeed, look at Utils.hs from the old tests:

-- | Returns the error messages associated with a failed parse.

parseErrors :: Parser a -> String -> [String]
parseErrors p input =
  case parse p "" input of
    Left err -> drop 1 $ lines $ show err
    Right _  -> []

Are you serious? This sort of check should be done by examining the list of error messages, which can be extracted with the errorMessages function.

But to make it possible you need to make it work. Not only could the original Parsec not compare error messages, it actually allowed duplicate messages in the list. This means that you could have something like [UnExpect "a", Expect "b", Expect "b"], so even with a normal equality check, you cannot compare two lists of messages.

I needed quite a bit of luck to identify what exactly some functions should do in Parsec:

-- | @addErrorMessage m err@ returns @ParseError@ @err@ with message @m@
-- added. This function makes sure that list of messages is always ordered
-- and doesn't contain duplicates.

addErrorMessage :: Message -> ParseError -> ParseError
addErrorMessage m (ParseError pos ms) = ParseError pos (pre ++ [m] ++ post)
    where pre  = filter (< m) ms
          post = filter (> m) ms

-- | @setErrorMessage m err@ returns @err@ with message @m@ added. This
-- function also deletes all existing error messages that were created with
-- the same constructor as @m@.

setErrorMessage :: Message -> ParseError -> ParseError
setErrorMessage m (ParseError pos ms) = addErrorMessage m (ParseError pos xs)
    where xs = filter ((/= fromEnum m) . fromEnum) ms

↑ Here, I use the interface of the Error module to keep the list of messages not only sorted, but free of duplicates too. Since a user of the Error module cannot create a ParseError with its constructor (it's not exported, which is the right decision), there is no way to have a malformed list of messages now (and this is thoroughly checked now).

This alone has enabled me to actually check messages without needing to render them. This was an essential requirement for building a decent test suite, of course.

I would like to tell you about the mergeError function. This is a really weird one (original Parsec code):

mergeError :: ParseError -> ParseError -> ParseError
mergeError e1@(ParseError pos1 msgs1) e2@(ParseError pos2 msgs2)
    -- prefer meaningful errors
    | null msgs2 && not (null msgs1) = e1
    | null msgs1 && not (null msgs2) = e2
    | otherwise
    = case pos1 `compare` pos2 of
        -- select the longest match
        EQ -> ParseError pos1 (msgs1 ++ msgs2)
        GT -> e1
        LT -> e2

This is totally nonsensical, and our growing test suite recently spotted the last bug remaining from this implementation (msgs1 ++ msgs2 allowed duplicates).

Both parts of the function are strange.

Prefer meaningful errors. What? So what do you do when your meaningful error occurs after a meaningless one? Ignore the earlier one.
Select the longest match. This is really something I cannot understand. When you parse something, you want to report the first error that you spot, that is, prefer earlier errors; this is the only rule. When two errors have identical positions, join their messages, but intelligently.

Here is the correct implementation:

-- | Merge two error data structures into one joining their collections of
-- messages and preferring shortest match.

mergeError :: ParseError -> ParseError -> ParseError
mergeError e1@(ParseError pos1 ms1) e2@(ParseError pos2 ms2) =
    case pos1 `compare` pos2 of
      LT -> e1
      EQ -> foldr addErrorMessage (ParseError pos1 ms1) ms2
      GT -> e2
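
To see the two rules in action, here is a self-contained sketch that models the pieces above. The Message type with derived Ord and the (line, column) position are simplifying assumptions, and sameSpot and earlier are hypothetical example values.

```haskell
data Message = Unexpected String | Expected String | Message String
  deriving (Eq, Ord, Show)

-- Position is modelled as (line, column), an assumption of this sketch.
type Pos = (Int, Int)

data ParseError = ParseError Pos [Message]
  deriving (Eq, Show)

-- Insert a message, keeping the list sorted and free of duplicates.
addErrorMessage :: Message -> ParseError -> ParseError
addErrorMessage m (ParseError pos ms) = ParseError pos (pre ++ [m] ++ post)
  where
    pre  = filter (< m) ms
    post = filter (> m) ms

-- Prefer the earlier error; at equal positions, join the messages.
mergeError :: ParseError -> ParseError -> ParseError
mergeError e1@(ParseError pos1 ms1) e2@(ParseError pos2 ms2) =
  case pos1 `compare` pos2 of
    LT -> e1
    EQ -> foldr addErrorMessage (ParseError pos1 ms1) ms2
    GT -> e2

-- Two errors at the same position that share an 'Unexpected' item.
sameSpot :: ParseError
sameSpot = mergeError
  (ParseError (1, 3) [Unexpected "a", Expected "comma"])
  (ParseError (1, 3) [Unexpected "a", Expected "digit"])

-- Errors at different positions: the one at (1, 2) wins.
earlier :: ParseError
earlier = mergeError
  (ParseError (1, 5) [Expected "closing brace"])
  (ParseError (1, 2) [Expected "1st number"])
```

Here sameSpot carries Unexpected "a" only once alongside both Expected items, and earlier is the error at position (1, 2) unchanged.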

And finally, such a collection of nonsense couldn't have remained unnoticed in such a well-known library if it couldn't present things as if they were OK. This is done with the showErrorMessages function:

showErrorMessages ::
    String -> String -> String -> String -> String -> [Message] -> String
showErrorMessages msgOr msgUnknown msgExpecting msgUnExpected msgEndOfInput msgs
    | null msgs = msgUnknown
    | otherwise = concat $ map ("\n"++) $ clean $
                 [showSysUnExpect,showUnExpect,showExpect,showMessages]
    where
      (sysUnExpect,msgs1) = span ((SysUnExpect "") ==) msgs
      (unExpect,msgs2)    = span ((UnExpect    "") ==) msgs1
      (expect,messages)   = span ((Expect      "") ==) msgs2

      showExpect      = showMany msgExpecting expect
      showUnExpect    = showMany msgUnExpected unExpect
      showSysUnExpect | not (null unExpect) ||
                        null sysUnExpect = ""
                      | null firstMsg    = msgUnExpected ++ " " ++ msgEndOfInput
                      | otherwise        = msgUnExpected ++ " " ++ firstMsg
          where
              firstMsg  = messageString (head sysUnExpect)

      showMessages      = showMany "" messages

      -- helpers
      showMany pre msgs = case clean (map messageString msgs) of
                            [] -> ""
                            ms | null pre  -> commasOr ms
                               | otherwise -> pre ++ " " ++ commasOr ms

      commasOr []       = ""
      commasOr [m]      = m
      commasOr ms       = commaSep (init ms) ++ " " ++ msgOr ++ " " ++ last ms

      commaSep          = separate ", " . clean

      separate   _ []     = ""
      separate   _ [m]    = m
      separate sep (m:ms) = m ++ sep ++ separate sep ms

      clean             = nub . filter (not . null)

What a… no, it doesn't look complex because it does complex things; it's just coded this way. First of all, see the comments in the original Parsec: it's supposed to support many languages, which is why it accepts that many arguments. I wonder if it ever occurred to the author of the function that translation into a different language is not done by one-to-one word substitution. Let's implement it just in English, OK? This alone removes a lot of junk.

But this function has many things to do. It has to compensate for all the flaws in the ParseError model to present things normally. This includes removing duplicate messages and handling a couple of special cases. Also, there is Data.List.intercalate for what they call separate.

Of course, fixing this module alone is not enough for really good error messages in our library, but there are many other changes I've made. I hope this is really going to be a high-quality library now.
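A minimal sketch of what an English-only renderer built on intercalate might look like. It assumes the three-constructor Message type, leaves out the end-of-input special case, and the names orList and renderMessages are hypothetical.

```haskell
import Data.List (intercalate, nub)

data Message = Unexpected String | Expected String | Message String
  deriving (Eq, Ord, Show)

-- Join alternatives as "a", "a or b", "a, b or c".
orList :: [String] -> String
orList []  = ""
orList [x] = x
orList xs  = intercalate ", " (init xs) ++ " or " ++ last xs

-- Render one line per message kind, English only.
renderMessages :: [Message] -> String
renderMessages [] = "unknown parse error"
renderMessages ms =
  intercalate "\n" (filter (not . null) [unexpected, expected, other])
  where
    unexpected = labelled "unexpected " [s | Unexpected s <- ms]
    expected   = labelled "expecting "  [s | Expected   s <- ms]
    other      = intercalate "\n" (clean [s | Message s <- ms])
    -- Prefix a non-empty group of items with its label.
    labelled pre items = case clean items of
      [] -> ""
      xs -> pre ++ orList xs
    -- Drop empty strings and repeated entries.
    clean = nub . filter (not . null)
```

If the list of messages is already sorted and free of duplicates, the clean step becomes a safety net rather than a necessity, and the renderer could shrink further.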