
zesterer / chumsky


Write expressive, high-performance parsers with ease.

Home Page: https://crates.io/crates/chumsky

License: MIT License

Rust 100.00%
parser parser-combinators context-free-grammar errors recursive-descent-parser peg lexing parsing


chumsky's People

Contributors

bew, bglw, cookie04de, craftspider, creatorsiso, damien-white, danaugrs, dyslexicsteak, epage, fowenix, gimbling-away, jim-ec, mechslayer, mgr0dzicki, natemartinsf, ooboomberoo, rumpuslabs, ryo33, stefnotch, striezel, taka231, tesujimath, timmmm, tornaxo7, tzvipm, wackbyte, willothy, zesterer, zij-it, zyansheep


chumsky's Issues

Missing `SeparatedBy::at_most(n)` and add a `exactly(n)` for `SeparatedBy` & `Repeated` ?

I see at_least but not at_most.

Use case: AoC 2021 day 4 (https://adventofcode.com/2021/day/4)
The input has game boards, each of 5 lines of numbers separated by a newline.

I've made a parser for a line of numbers and want to do board_line.separated_by(newline).at_most(5)

Another option that would work for me (and better matches the spec of the input) is a SeparatedBy::exactly(n) that sets both at_least and at_most to the same value.
--> In that case, should it also be added to Repeated?

What do you think?
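For what it's worth, the proposed bounds are easy to sketch in plain Rust (separated_bounded is a hypothetical helper to illustrate the semantics, not chumsky's API):

```rust
// Collect between `min` and `max` separator-delimited items; a plain-Rust sketch
// of the proposed `at_least(n)` / `at_most(n)` bounds (illustrative, not chumsky).
fn separated_bounded(input: &str, sep: char, min: usize, max: usize) -> Option<Vec<&str>> {
    let items: Vec<&str> = input.split(sep).collect();
    // The proposed `exactly(n)` would simply set min == max == n.
    if items.len() >= min && items.len() <= max {
        Some(items)
    } else {
        None
    }
}
```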

`separated_by` does not work as expected

Hi,

Thanks for this wonderful library! I've been using chumsky 0.5.0 on a project and it works great! However, I am baffled by a weird behavior of separated_by, which I think may be a potential bug.

Here's a simplified piece of code that reproduces the problem I met:

use chumsky::prelude::*;

fn main() {
  let digit_list =
    one_of::<char, _, Simple<_>>("123".chars()).separated_by(just(','));
  let letter_list = one_of("abc".chars()).separated_by(just(','));
  let parser = digit_list.clone().or(letter_list);

  // works as expected: parses the full list
  assert_eq!(digit_list.parse("1,2,3"), Ok(vec!['1', '2', '3']));

  // works as expected: trailing tokens are ignored
  assert_eq!(digit_list.parse("1,2,3X"), Ok(vec!['1', '2', '3']));

  // does not work as expected: the trailing "," is not ignored, though I expect it to be skipped like other trailing tokens
  //
  // expected: Ok(['1', '2', '3'])
  // actual: Err([Simple { span: 6..7, reason: Unexpected, expected: {'2', '1', '3'}, found: None, label: None }])
  assert_eq!(digit_list.parse("1,2,3,"), Ok(vec!['1', '2', '3']));

  // does not work as expected. This result is even weirder. In any case I don't expect an empty list to be returned.
  //
  // expected: Ok(['1', '2', '3'])
  // actual: Ok([])
  assert_eq!(parser.parse("1,2,3,"), Ok(vec!['1', '2', '3']))
}

Basically, the trailing separator is not treated like other trailing tokens and triggers an error when encountered.

P.S. I know about the allow_trailing option. This example is simplified and doesn't fully represent my use case, where I cannot allow trailing separators because I need to leave them for another part of the program to handle.

Implement memoisation

Memoisation is a technique that makes backtracking parser performance linear in the input length, at the cost of higher memory consumption and extra overhead for grammars that don't otherwise exhibit exponential behaviour. It would be nice to be able to support it.

See also #74
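A minimal sketch of the technique, independent of chumsky's internals: cache each (rule, input position) pair so that re-entering a rule at the same offset is a table lookup instead of a re-parse. The memo table belongs to a single input; rule_ab and the grammar are illustrative.

```rust
use std::collections::HashMap;

// Packrat-style memo table: (rule id, input position) -> end position on success.
// One table per input string; reusing it across inputs would return stale results.
type Memo = HashMap<(u32, usize), Option<usize>>;

// Rule 0: matches zero or more 'a's followed by 'b' (illustrative grammar).
fn rule_ab(input: &[u8], pos: usize, memo: &mut Memo) -> Option<usize> {
    if let Some(&cached) = memo.get(&(0, pos)) {
        return cached; // already parsed this rule at this offset
    }
    let mut p = pos;
    while input.get(p) == Some(&b'a') {
        p += 1;
    }
    let result = if input.get(p) == Some(&b'b') { Some(p + 1) } else { None };
    memo.insert((0, pos), result);
    result
}
```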

Add `From<&[T; N]> for Stream<...>` impl

Currently, there are From implementations on Stream for a variety of slice types. The one missing is &[T; N]. This is useful because the lack of coercion on generics makes a call like .parse(b"") fail, as the argument is a reference to an array, not a slice.

I'd be willing to PR it, if there's interest in it existing.

Simple parser without map

I'm trying to define a simple parser that ignores whitespace (which is already tokenized by my lexer):

let whitespace = just(WHITESPACE).or_not().ignored();

This is the error I get:

63 | let whitespace = just(WHITESPACE).or_not().ignored();
| ---------- ^^^^ cannot infer type for type parameter E declared on the function just
| |
| consider giving whitespace the explicit type To<OrNot<Just<Token<'_>, E>>, Option<Token<'_>>, ()>, where the type parameter E is specified

That's a lot of boilerplate to add for a simple parser. Am I doing something incorrectly here?

Exponential blowup in recursive parsers

Consider this simple parser:

use chumsky::prelude::*;

fn parser() -> impl Parser<char, String, Error = Simple<char>> {
  recursive(|expr| {
    let atom = text::ident()
      .or(expr.clone().delimited_by('(', ')'));

    let expression = atom
      .clone()
      .then_ignore(just('+'))
      .then(atom.clone())
      .map(|(a, b)| format!("{}{}", a, b))
      .or(atom);

    expression
  }).then_ignore(end())
}

fn main() {
  println!("{:?}", parser().parse("((((((((((((((((((((((a+b))))))))))))))))))))))"));
}

Parsing the string

((((((((((((((((((((((a+b))))))))))))))))))))))

takes 8 seconds when compiling in release mode (around a minute in debug mode), and that time rapidly grows as more pairs of parentheses are added.

When the parser inside the recursive closure is more complex, the problem becomes even worse. With the parser I'm working on, I'm seeing multi-second parse times already for 3-4 levels of nested brackets.

Boxing parsers does not solve this issue. What can be done here?

Feature request: Allow users to implement custom recovery strategies

As it stands, users only have access to skip_until, skip_then_retry_until, and nested_delimiters. I'm unsure whether the best way forward is exposing the Strategy trait (which currently uses private types), preferably in a cleaner form, or adding lower-level recovery strategies/parsers that can be composed into more complex ones. As it stands, I believe it isn't possible to implement a strategy like expect from https://eyalkalderon.com/blog/nom-error-recovery/; skip_until comes close but provides no way to skip zero tokens.

`repeated` followed by `at_least`

In the nano_rust example, when lexing operations repeated is followed by a call to at_least.

let op = one_of("+-*/!=")
.repeated()
.at_least(1)
.collect::<String>()
.map(Token::Op);

At first glance, this looks like a bug that should be fixed by using either repeated or at_least, depending on the semantics you want. But I'm new to chumsky, so I'm not sure.

Is this in fact a bug? If not, what are the semantics of the parser that this will generate?

Add a State parameter to parsers

In some cases, it's useful to be able to pass something around to parsers, such as a tree builder, interner, arena, or similar gizmo. Something that can't just be created ex nihilo in a leaf and passed outward (an ego-tree document, in my case: the parent of a node has to exist before the children do).

zesterer: An extra type parameter on Parser with a default could work
zesterer: trait Parser<I, O, State = ()>
zesterer: Then an extra combinator like .map_with_state(|output, state: &mut State|)
zesterer: And an extra Parser::parse_with_state function

Which is basically what Logos does modulo exactly how the state is retrieved.
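A plain-Rust sketch of the idea, with hand-rolled parser functions instead of chumsky combinators (Interner and parse_ident_with_state are illustrative names; the point is that the state is threaded in as &mut rather than created inside a leaf):

```rust
// Stand-in for a tree builder, arena, or interner that the caller owns.
struct Interner {
    names: Vec<String>,
}

impl Interner {
    // Return the index of `s`, adding it if it hasn't been seen before.
    fn intern(&mut self, s: &str) -> usize {
        if let Some(i) = self.names.iter().position(|n| n.as_str() == s) {
            return i;
        }
        self.names.push(s.to_string());
        self.names.len() - 1
    }
}

// A "parser" that consumes an identifier and maps it with access to the state,
// returning (interned id, consumed length) -- the shape `map_with_state` implies.
fn parse_ident_with_state(input: &str, state: &mut Interner) -> Option<(usize, usize)> {
    let end = input.find(|c: char| !c.is_alphanumeric()).unwrap_or(input.len());
    if end == 0 {
        return None;
    }
    Some((state.intern(&input[..end]), end))
}
```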

Memory leak in `recursive`

Hi again, and thanks for the great crate! I've had a lot of fun using it so far.

Issue

The recursive implementation in this crate leaks memory if it is used in the intended way, due to an Rc reference cycle. Here is the relevant excerpt from recursive.rs.

/// A parser that can be defined in terms of itself by separating its [declaration](Recursive::declare) from its
/// [definition](Recursive::define).
///
/// Prefer to use [`recursive()`], which exists as a convenient wrapper around both operations, if possible.
pub struct Recursive<'a, I, O, E: Error<I>>(Rc<OnceCell<Box<dyn Parser<I, O, Error = E> + 'a>>>);

impl<'a, I: Clone, O, E: Error<I>> Recursive<'a, I, O, E> {
    /// Declare the existence of a recursive parser, allowing it to be used to construct parser combinators before
    /// being fully defined.
    ///
    /// <abridged>
    pub fn declare() -> Self {
        Recursive(Rc::new(OnceCell::new()))
    }

    /// Defines the parser after declaring it, allowing it to be used for parsing.
    pub fn define<P: Parser<I, O, Error = E> + 'a>(&mut self, parser: P) {
        self.0
            .set(Box::new(parser))
            .unwrap_or_else(|_| panic!("Parser defined more than once"));
    }
}

Even after a recursive parser is dropped, if it is self-referential, the underlying memory will not be freed. For example, the json example has this issue. Running the following file with cargo +nightly miri run will report a memory leak:

use std::collections::HashMap;

use chumsky::prelude::*;

#[derive(Clone, Debug)]
enum Json {
    Invalid,
    Null,
    Bool(bool),
    Str(String),
    Num(f64),
    Array(Vec<Json>),
    Object(HashMap<String, Json>),
}

fn parser() -> impl Parser<char, Json, Error = Simple<char>> {
    recursive(|value| {
        let frac = just('.').chain(text::digits(10));

        let exp = just('e')
            .or(just('E'))
            .ignore_then(just('+').or(just('-')).or_not())
            .chain(text::digits(10));

        let number = just('-')
            .or_not()
            .chain(text::int(10))
            .chain(frac.or_not().flatten())
            .chain::<char, _, _>(exp.or_not().flatten())
            .collect::<String>()
            .map(|s| s.parse().unwrap())
            .labelled("number");

        let escape = just('\\').ignore_then(
            just('\\')
                .or(just('/'))
                .or(just('"'))
                .or(just('b').to('\x08'))
                .or(just('f').to('\x0C'))
                .or(just('n').to('\n'))
                .or(just('r').to('\r'))
                .or(just('t').to('\t')),
        );

        let string = just('"')
            .ignore_then(filter(|c| *c != '\\' && *c != '"').or(escape).repeated())
            .then_ignore(just('"'))
            .collect::<String>()
            .labelled("string");

        let array = value
            .clone()
            .chain(just(',').ignore_then(value.clone()).repeated())
            .or_not()
            .flatten()
            .delimited_by('[', ']')
            .map(Json::Array)
            .labelled("array");

        let member = string.clone().then_ignore(just(':').padded()).then(value);
        let object = member
            .clone()
            .chain(just(',').padded().ignore_then(member).repeated())
            .or_not()
            .flatten()
            .padded()
            .delimited_by('{', '}')
            .collect::<HashMap<String, Json>>()
            .map(Json::Object)
            .labelled("object");

        seq("null".chars())
            .to(Json::Null)
            .labelled("null")
            .or(seq("true".chars()).to(Json::Bool(true)).labelled("true"))
            .or(seq("false".chars()).to(Json::Bool(false)).labelled("false"))
            .or(number.map(Json::Num))
            .or(string.map(Json::Str))
            .or(array)
            .or(object)
            .recover_with(nested_delimiters('{', '}', [('[', ']')], |_| Json::Invalid))
            .recover_with(nested_delimiters('[', ']', [('{', '}')], |_| Json::Invalid))
            .recover_with(skip_then_retry_until(['}', ']']))
            .padded()
    })
    .then_ignore(end().recover_with(skip_then_retry_until([])))
}

fn main() {
    let _parser = parser();
}

Here is a more minimal example of the memory leak:

use chumsky::prelude::*;

fn main() {
    let parser = recursive(|f| just::<char, Simple<_>>('x').or(f.delimited_by('(', ')')));
    parser.parse("((x))").unwrap();
}

And the output from Miri:

The following memory was leaked: alloc1125 (Rust heap, size: 40, align: 8) {
    0x00 │ 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 │ ................
    0x10 │ 00 00 00 00 00 00 00 00 ╾0x26658[a1232]<2580>─╼ │ ........╾──────╼
    0x20 │ ╾0x268b0[a1265]<2571>─╼                         │ ╾──────╼
}
alloc1232 (Rust heap, size: 24, align: 8) {
    0x00 │ ╾0x25d08[a1125]<untagged> (8 ptr bytes)╼ 28 00 00 00 29 00 00 00 │ ╾──────╼(...)...
    0x10 │ 78 00 00 00 __ __ __ __                         │ x...░░░░
}
alloc1265 (global (static or const), size: 40, align: 8) {
    0x00 │ ╾0x268e0[a1262]<2572>─╼ 18 00 00 00 00 00 00 00 │ ╾──────╼........
    0x10 │ 08 00 00 00 00 00 00 00 ╾0x268ee[a1263]<2573>─╼ │ ........╾──────╼
    0x20 │ ╾0x268fe[a1264]<2574>─╼                         │ ╾──────╼
}
alloc1262 (fn: std::ptr::drop_in_place::<chumsky::combinator::Or<chumsky::primitive::Just<char, chumsky::error::Simple<char>>, chumsky::combinator::DelimitedBy<chumsky::recursive::Recursive<char, char, chumsky::error::Simple<char>>, char>>> - shim(Some(chumsky::combinator::Or<chumsky::primitive::Just<char, chumsky::error::Simple<char>>, chumsky::combinator::DelimitedBy<chumsky::recursive::Recursive<char, char, chumsky::error::Simple<char>>, char>>)))
alloc1263 (fn: <chumsky::combinator::Or<chumsky::primitive::Just<char, chumsky::error::Simple<char>>, chumsky::combinator::DelimitedBy<chumsky::recursive::Recursive<char, char, chumsky::error::Simple<char>>, char>> as chumsky::Parser<char, char>>::parse_inner_verbose)
alloc1264 (fn: <chumsky::combinator::Or<chumsky::primitive::Just<char, chumsky::error::Simple<char>>, chumsky::combinator::DelimitedBy<chumsky::recursive::Recursive<char, char, chumsky::error::Simple<char>>, char>> as chumsky::Parser<char, char>>::parse_inner_silent)

error: the evaluated program leaked memory

note: pass `-Zmiri-ignore-leaks` to disable this check

error: aborting due to previous error

I'm not sure what the best way to fix this is. Is it possible to demote some of the Rcs to Weak pointers?
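The Weak demotion asked about here is the standard std fix for Rc cycles. A minimal, chumsky-free illustration (Node is just an analogy for a self-referential parser, not how Recursive is structured internally):

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// A self-referential value: storing an Rc back to itself would keep the strong
// count above zero forever; a Weak back-reference breaks the cycle.
struct Node {
    this: RefCell<Option<Weak<Node>>>,
}

fn make_node() -> Rc<Node> {
    let node = Rc::new(Node { this: RefCell::new(None) });
    // Downgrade instead of clone: the strong count stays 1, so dropping the
    // last external Rc actually frees the allocation.
    *node.this.borrow_mut() = Some(Rc::downgrade(&node));
    node
}
```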

What would be the right way to parse punctuation like ">>"?

I'm trying to parse a language with a bunch of punctuation types, many of them similar to Rust's. It doesn't seem difficult to parse single-character punctuation, but I'm not sure which parser combinators to plug together to lex the tokens with multiple characters:

pub enum Punctuation {
    Plus, // +
    Minus, // -
    Star, // *
    Slash, // /
    Percent, // %
    Caret, // ^
    Not, // !
    And, // &
    Or, // |
    AndAnd, // &&
    OrOr, // ||
    Shl, // <<
    Shr, // >>
    Eq, // =
    EqEq, // ==
    Ne, // !=
    Gt, // >
    Lt, // <
    Ge, // >=
    Le, // <=
    Underscore, // _
    Dot, // .
    Comma, // ,
    Semi, // ;
    Colon, // :
    PathSep, // ::
    RArrow, // ->
    FatArrow, // =>
}
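One common answer is longest-match: try the two-character operators before their one-character prefixes. A plain-Rust sketch of that ordering (match_punct is a hypothetical helper; with chumsky the same idea is usually expressed by ordering the alternatives longest-first):

```rust
// Longest-match punctuation lexing: two-character operators are tried before
// the one-character operators that prefix them, so ">>" wins over ">".
fn match_punct(input: &str) -> Option<(&'static str, usize)> {
    const TWO: [&str; 11] = ["&&", "||", "<<", ">>", "==", "!=", ">=", "<=", "::", "->", "=>"];
    const ONE: [&str; 17] = [
        "+", "-", "*", "/", "%", "^", "!", "&", "|", "=", "<", ">", ".", ",", ";", ":", "_",
    ];
    for op in TWO {
        if input.starts_with(op) {
            return Some((op, 2));
        }
    }
    for op in ONE {
        if input.starts_with(op) {
            return Some((op, 1));
        }
    }
    None
}
```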

`.delimited_by()` should take parsers as arguments, not input tokens

This is another inconsistency I noticed when working with the API. .separated_by() takes a parser as a separator, so it makes sense to expect that .delimited_by() take parsers as delimiters.

This would for example enable .delimited_by() to express surrounding XML tags in char parsers, which consist of multiple characters, and can contain parsable attributes themselves.

Release a new version?

Is there anything that blocks a release for a new version?

I plan to release a library using chumsky so would be nice to have current master on the crates.io. In particular, I'm using parsers in delimited_by (a small problem), and then_with (which can't be implemented without private API I think).

NanoRust example does not work

On master (9a2787a):

Hello, world!
The meaning of life is...
...something we cannot know
However, I can tell you that the factorial of 10 is...
Error: 'factorial' called with wrong number of arguments (expected 1, found 2)
    ╭─[<unknown>:22:18]
    │
 22 │        print(factorial(10, 11));
    ·                       ────┬───  
    ·                           ╰───── 'factorial' called with wrong number of arguments (expected 1, found 2)
────╯

It would be good for these to be checked in CI.

In this case the only issue preventing it from succeeding is precisely what the diagnostic is telling us, which is super cool. 😀 Still, I'd suggest this is a good candidate for rustc-style "UI tests", maybe using something like trybuild if dtolnay/trybuild#64 were implemented (which might be a worthwhile yak to shave), because checking that it prints the expected output on success is probably important, too.

Unclosed delimiter error should allow multiple char items

After fixing #52 there are couple of issues left:

  1. Error::unclosed_delimiters takes only single token (char if char stream is parsed)
  2. nested_delimiters strategy allows only single token

My use case is parsing nested comments /* .. */ and r#"..."#. So this isn't a case where a tokenizer is very helpful (comments are usually skipped by the tokenizer, and strings are usually a single token).

I can probably live without (2), because it might be too hard to implement, but fixing (1) should yield much better error messages with some simple custom code I think.

"Recursive" isn't compatible with "map_with_span"

If you create a recursive parser that uses map_with_span to emit a tuple (T, span), you get the following error:

81 | let expr = recursive(|expr| {
| ^^^^^^^^^ expected enum Expr, found tuple
|

I found a workaround by emitting just a single value with .map(|x| x.0)

I'm not sure if I'm doing something wrong here, or if this is just a limitation.

How to parse a repeated parser with recovery

Hello, I am trying to write a parser which can parse a series of items from a known set and I want to be able to skip over an item if it is wrong but still produce an error for this wrong item.

To be more specific I am trying to use chumsky to write a simplified css parser and I am stuck on the parsing of pseudoclass selectors. Let's say I have something like this:

":hover:activ:focus"

Here 'active' is spelled wrong, so I want this to produce an error but for the parser to skip over it and continue if possible. So far I have attempted to write a pseudoclass parser using choice and text::keyword together, and then I have used repeated and recover_with, but the recovery part does not seem to be working with repeated.

I wish I could be more specific, but I'm still trying to learn this crate, so I'm not even sure of the correct terminology just yet. Any help or advice would be greatly appreciated. Thanks.

Combinator for Eliminating Left Recursion

When writing parsers by hand and dealing with left recursion I generally use the following pattern:

fn parse_expr(s: &mut Tokens) -> Option<Expr> {
    let left = parse_left_expr(s)?;

    parse_right_expr(s, left)
}

fn parse_left_expr(s: &mut Tokens) -> Option<Expr> {
    match s.next()? {
        /* ... */
    }
}

fn parse_right_expr(s: &mut Tokens, left: Expr) -> Option<Expr> {
    match s.next() {
        Some("+") => parse_right_expr(s, Expr::Add(left, parse_expr(s)?)),
        Some("(") => parse_right_expr(s, /* parse a call */),
        /* ... */
        _ => Some(left),
    }
}

As I understand it, this is a fairly common way of eliminating left recursion (I don't have an academic background, so I don't know its name), but I was unable to find a good way of expressing this pattern with the provided combinators. What would be the best way of accomplishing this?
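The recursive pattern above is equivalent to a left-fold loop, which is the usual way left recursion gets eliminated: parse one atom, then repeatedly extend it leftward. A self-contained sketch with illustrative types (a token slice instead of a real lexer, and only "+"):

```rust
#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
}

// Left recursion eliminated: `left` starts as one atom and each "+ atom"
// extends it in a loop, producing a left-associative tree.
fn parse_expr(tokens: &[&str]) -> Option<Expr> {
    let mut iter = tokens.iter();
    let mut left = Expr::Num(iter.next()?.parse().ok()?);
    while let Some(&tok) = iter.next() {
        if tok != "+" {
            return None; // unexpected token
        }
        let right = Expr::Num(iter.next()?.parse().ok()?);
        left = Expr::Add(Box::new(left), Box::new(right));
    }
    Some(left)
}
```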

Support parsing nested inputs

Currently, there's no clean way to do something like

parse_keyword()
    .then_with(|kw| parser(kw))

Creating a parser that requires the ok result of the previous parser. It's possible with custom, but definitely kind of ugly.

Can't specify a single value with `just`

I must be missing something obvious. As a first step, I am writing a parser for just one single u64:

#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Value2(u64),
}

fn expr_parser() -> impl Parser<char, Vec<Expr>, Error = Simple<Token>> {
    let number = just('-').or_not()
        .chain(text::int(10))
        .collect::<String>()
        .map(|s| s.parse::<u64>().unwrap())
        .labelled("number");

    just(number.map(Expr::Value2))  // <--- error here
        .padded()
}

The error:

error[E0277]: can't compare `chumsky::combinator::Map<Label<chumsky::combinator::Map<chumsky::combinator::Map<chumsky::combinator::Map<Then<OrNot<Just<char, _>>, impl chumsky::Parser<char, <char as Character>::Collection>+Copy+Clone>, fn((Option<char>, String)) -> Vec<_>, (Option<char>, String)>, fn(Vec<_>) -> String, Vec<_>>, [closure@src\main.rs:78:14: 78:43], String>, &str>, fn(u64) -> Expr {Expr::Value2}, u64>` with `chumsky::combinator::Map<Label<chumsky::combinator::Map<chumsky::combinator::Map<chumsky::combinator::Map<Then<OrNot<Just<char, _>>, impl chumsky::Parser<char, <char as Character>::Collection>+Copy+Clone>, fn((Option<char>, String)) -> Vec<_>, (Option<char>, String)>, fn(Vec<_>) -> String, Vec<_>>, [closure@src\main.rs:78:14: 78:43], String>, &str>, fn(u64) -> Expr {Expr::Value2}, u64>`
  --> src\main.rs:81:10
   |
81 |     just(number.map(Expr::Value2))
   |          ^^^^^^^^^^^^^^^^^^^^^^^^ no implementation for `chumsky::combinator::Map<Label<chumsky::combinator::Map<chumsky::combinator::Map<chumsky::combinator::Map<Then<OrNot<Just<char, _>>, impl chumsky::Parser<char, <char as Character>::Collection>+Copy+Clone>, fn((Option<char>, String)) -> Vec<_>, (Option<char>, String)>, fn(Vec<_>) -> String, Vec<_>>, [closure@src\main.rs:78:14: 78:43], String>, &str>, fn(u64) -> Expr {Expr::Value2}, u64> == chumsky::combinator::Map<Label<chumsky::combinator::Map<chumsky::combinator::Map<chumsky::combinator::Map<Then<OrNot<Just<char, _>>, impl chumsky::Parser<char, <char as Character>::Collection>+Copy+Clone>, fn((Option<char>, String)) -> Vec<_>, (Option<char>, String)>, fn(Vec<_>) -> String, Vec<_>>, [closure@src\main.rs:78:14: 78:43], String>, &str>, fn(u64) -> Expr {Expr::Value2}, u64>`
   | 
  ::: C:\Users\cedri\.cargo\registry\src\github.com-1ecc6299db9ec823\chumsky-0.5.0\src\primitive.rs:99:24
   |
99 | pub fn just<I: Clone + PartialEq, E>(x: I) -> Just<I, E> {
   |                        --------- required by this bound in `chumsky::primitive::just`
   |
   = help: the trait `PartialEq` is not implemented for `chumsky::combinator::Map<Label<chumsky::combinator::Map<chumsky::combinator::Map<chumsky::combinator::Map<Then<OrNot<Just<char, _>>, impl chumsky::Parser<char, <char as Character>::Collection>+Copy+Clone>, fn((Option<char>, String)) -> Vec<_>, (Option<char>, String)>, fn(Vec<_>) -> String, Vec<_>>, [closure@src\main.rs:78:14: 78:43], String>, &str>, fn(u64) -> Expr {Expr::Value2}, u64>`

The map function that gets called is inside chumsky and it returns a Map<...>, so it makes sense that this value cannot be compared.

What am I missing to parse a single number?

Debugging: How, what, why?

Chumsky currently supports a primitive debugging system, allowing parsers to print to stdout when entered during a call to Parser::parse_recovery_verbose. Expanding this further will require some thought.

  1. What problems should debugging attempt to solve?
  • Parsers that consume zero input and repeat
  • Paths erroneously taken
  • Priority errors (i.e: a.or(b) vs b.or(a))
  2. What information needs to be shown to the user?
  • Entered parsers
  • Number of iterations
  • Source location of parser
  • Recursion points
  3. How is best to show this information?
  • Annotated tree?
  4. What API features should be supported?
  • Recursion limit to prevent stack overflows

The `Seq` parser should yield the input it consumes

It's strange that Seq yields the unit type, while Just yields its input. This makes Seq useless if you combine multiple Seq parsers with or and then want to know which one matched.

In general, I find matching strings in a char parser to be too cumbersome. I have code that looks like this

just('=').chain(just('='))
  .or(just('!').chain(just('=')))
  .or(just('<').chain(just('=')))
  .or(just('<').to(vec!['<']))
  .or(just('>').chain(just('=')))
  .or(just('>').to(vec!['>']))

when all I really want is something like

one_of(&["==", "!=", "<=", "<", ">=", ">"])

This doesn't currently seem to be provided by Chumsky's built-in functions. I realize that this pattern may not generalize to arbitrary token types, but character-based parsers are extremely common, so it might make sense to have some special functions for this purpose.
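The requested helper is easy to sketch in plain Rust; the key constraint is that the alternatives must be ordered longest-first so that "<=" wins over "<" (one_of_str is a hypothetical name, not a chumsky function):

```rust
// Return the first alternative that is a prefix of the input.
// Callers must list longer alternatives before their prefixes ("<=" before "<").
fn one_of_str<'a>(alts: &[&'a str], input: &str) -> Option<&'a str> {
    alts.iter().copied().find(|alt| input.starts_with(*alt))
}
```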

Examples for using Parser::debug()

Can you share any examples for using the debug() parser? It requires a single argument, x, but the documentation doesn't specify what this should be.

I've tried wrapping another parser in debug(), and tried adding it as the final combinator, but neither would compile.

`skip_until` and `skip_then_retry_until` always skip at least one token

This can be an issue when parsing input like let = expr given a rust-like grammar:

let ident = ident().recover_with(skip_until([Equals, Semicolon], |_| Error));

just(Let)
    .then(ident)
    .then(just(Equals))
    .then(expression())

Ideally, this should still be able to parse input like the above, though since skip_until always skips at least one token, we miss the required = for the next rule. As far as I can tell, the only way around this is to remove the skip_until and make the whole ident rule optional which will throw out the error from it being missing entirely.

Library Compilation Error (0.7.0): conflicting implementations of trait `error::Error<_>` for type `error::Simple<_, _>`

I am not sure why, but Rust cannot compile the library version 0.7.0. The error can be seen below. Could someone else please also confirm this?

error[E0119]: conflicting implementations of trait `error::Error<_>` for type `error::Simple<_, _>`
   --> /home/colin/.cargo/registry/src/github.com-1ecc6299db9ec823/chumsky-0.7.0/src/error.rs:352:1
    |
223 | impl<I: Hash + Eq, S: Span + Clone + fmt::Debug> Error<I> for Simple<I, S> {
    | -------------------------------------------------------------------------- first implementation here
...
352 | impl<I, S: Span + Clone + fmt::Debug> Error<I> for Simple<I, S> {
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ conflicting implementation for `error::Simple<_, _>`

For more information about this error, try `rustc --explain E0119`.
error: could not compile `chumsky` due to previous error

In the current versions, the error corresponds to here and here in the current source code.

Notice that in the latter Simple was changed to Cheap at some point, though I couldn't find out when.

Feature-Request: Apply parsers in any order

An example use case would be LDAP schema as retrieved via LDAP

( 1.3.6.1.4.1.1466.115.121.1.8 DESC 'Certificate' X-BINARY-TRANSFER-REQUIRED 'TRUE' X-NOT-HUMAN-READABLE 'TRUE' )

which I want to parse into

pub struct LDAPSyntax {
    pub oid: ObjectIdentifier,
    pub desc: String,
    pub x_binary_transfer_required: bool,
    pub x_not_human_readable: bool,
}

The tags (DESC, X-BINARY-TRANSFER-REQUIRED,...) here can appear in any order and some of them can be optional (with or without a default value) but others are required (DESC in this case) for a given type of schema entry (the above is one for an LDAP-Syntax). Some have values after the tag and some do not (but a given tag always has a value or never has one). So what we really want is to express the form of each tag as a parser.

I can of course build a parser for a specific entry by chaining or() and then making it repeated() and then manually sifting through the results to create custom errors if required tags are missing but it feels like the general problem of having things to parse that are made from components that are required to exist but in an unspecified order could benefit from a general solution in chumsky directly, maybe something that could take a parameter

vec![ Foobar::Required(parser1),
        Foobar::Required(parser2),
        Foobar::Optional(parser3),
      ]

Maybe the API could even optionally allow the use of a Builder (like the ones created by derive-builder) pattern for the result type.

Alternatively one could return a vector sorted the same way as the input vector with a value for required parsers and Option for optional ones. Then one would probably want to require all the parsers to return the same type.
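A chumsky-free sketch of the matching logic such a combinator would need (Field and parse_any_order are hypothetical names): each field may match at most once, in any order, and every required field must have matched before the whole parse succeeds. Tags stand in for sub-parsers here.

```rust
struct Field<'a> {
    name: &'a str,
    required: bool,
}

// Returns, for each field, the index in `tags` where it matched (if it did),
// mirroring the "vector sorted the same way as the input vector" idea above.
fn parse_any_order<'a>(fields: &[Field<'a>], tags: &[&str]) -> Result<Vec<Option<usize>>, String> {
    let mut positions = vec![None; fields.len()];
    for (t, tag) in tags.iter().enumerate() {
        match fields.iter().position(|f| f.name == *tag) {
            Some(i) if positions[i].is_none() => positions[i] = Some(t),
            Some(_) => return Err(format!("duplicate tag {tag}")),
            None => return Err(format!("unknown tag {tag}")),
        }
    }
    for (f, pos) in fields.iter().zip(&positions) {
        if f.required && pos.is_none() {
            return Err(format!("missing required tag {}", f.name));
        }
    }
    Ok(positions)
}
```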

`Simple` display is slightly inconsistent

When displaying a Simple error, the found token and the expected tokens are formatted inconsistently: the found token is wrapped in quotes while the expected ones aren't, e.g. found 'x' but one of a, y, z was expected.
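A sketch of quoting both sides consistently (an illustrative formatter, not chumsky's actual Display implementation):

```rust
// Quote the found token and every expected token the same way, so the message
// reads: found 'x' but one of 'a', 'y', 'z' was expected.
fn format_error(found: Option<char>, expected: &[char]) -> String {
    let found = found.map_or("end of input".to_string(), |c| format!("'{c}'"));
    let expected: Vec<String> = expected.iter().map(|c| format!("'{c}'")).collect();
    format!("found {found} but one of {} was expected", expected.join(", "))
}
```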

Help understanding when to map the span tokens exactly.

Hey!

I am still a little bit confused about when exactly I should use map_with_span. I've looked at the examples in this repo, and they basically do it all at the end, on the last parser that gets returned. But after looking at the code in tao, you often do it in between as well.

Besides, could you enable GitHub Discussions on your repositories as well? That way the issue tracker doesn't get cluttered with questions like these :)

Have a great week!

Reduce type inflation by decoupling combinator types

Hi there! I'm writing a computer algebra system and I thought I'll give Chumsky a try. Your API design is excellent and in general I found Chumsky quite straightforward to use. Thank you for this crate!

That being said, I encountered a major problem: My simple expression parser takes almost five minutes to compile and results in a 1.3 GB debug executable. Rust-Analyzer is brought to its knees and barely capable of inspecting the code at all, but reveals that the final disjunction parser has a type that is dozens of pages when written out!

The main difference between that parser and your "nano_rust" example is that there are a few more levels of precedence, but that's about it. The language being parsed is by no means complicated or syntactically ambiguous. It appears that some mechanism in Chumsky results in exponential growth of parser complexity when nesting parsers hierarchically.

Any idea what's going on here? Is this a bug, or am I just using Chumsky incorrectly?

`validate_map` for handling a `Result` with a parser error

There is the function validate which can emit an error and continue parsing, but the data type that is returned by the closure passed to the function must be the return type of the parser.

It would be great if the validation function could also transform the result in order to emit an error that was returned e.g. by a previous map which produces a Result.

For now, my solution is to validate, transforming the Err into some Ok default, then use unwrapped and map. It kinda does what I want, but not completely, and with steps involved that I don't think should be necessary.

My specific use case is to first just detect a string literal in the lexer and only parse the escape sequences (and possibly placeholders etc.) in the parsing step depending on the context. This parse function returns a Result<Cow<'src, str>, StringParseError> and my error type has a specific variant for this error such that I can emit it easily.
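For illustration, here is a minimal stdlib-only sketch of that second phase, assuming a hypothetical `parse_escapes` function and a simplified `StringParseError` type (neither is chumsky API). It borrows the input when no escapes are present and reports the byte offset of a bad escape:

```rust
use std::borrow::Cow;

// Hypothetical, simplified stand-in for the `StringParseError` mentioned above.
#[derive(Debug, PartialEq)]
struct StringParseError {
    offset: usize, // byte offset of the offending backslash within the literal
}

// Expand `\n`, `\t` and `\\` escapes; borrow the input when there are none.
fn parse_escapes(raw: &str) -> Result<Cow<'_, str>, StringParseError> {
    if !raw.contains('\\') {
        return Ok(Cow::Borrowed(raw));
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.char_indices();
    while let Some((i, c)) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some((_, 'n')) => out.push('\n'),
                Some((_, 't')) => out.push('\t'),
                Some((_, '\\')) => out.push('\\'),
                _ => return Err(StringParseError { offset: i }),
            }
        } else {
            out.push(c);
        }
    }
    Ok(Cow::Owned(out))
}

fn main() {
    // No escapes: the input is borrowed, not copied.
    assert!(matches!(parse_escapes("plain"), Ok(Cow::Borrowed(_))));
    assert_eq!(parse_escapes("a\\nb").unwrap(), "a\nb");
    // An unknown escape yields the error variant with its offset.
    assert_eq!(parse_escapes("bad\\q"), Err(StringParseError { offset: 3 }));
}
```

A `validate_map` as proposed would let this `Result` be unwrapped inside the parser, emitting the `Err` case as a parse error while continuing.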

Matching over struct/tuple enum items (using Logos)

Hello!

I wanted to combine the speed and simplicity of the Lexer generator Logos with the expressiveness of Chumsky and have run into an issue that I don't know how to resolve. Logos implements a function which returns a Lexer. This struct is an iterator over Tokens and has some additional functions like span, which I need for error messages.

Is there any way I can access the token iterator, similar to just(Token::Abc).map(|token| (lexer.span(), token)), and is it possible to match on something like Token::Int(u32)?

I can also convert the Lexer into an iterator over spanned Tokens using spanned, but the problem of matching on this tuple remains.
I'm still learning this library, so sorry if this is something trivial.
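For what it's worth, the core of the question, accepting one tuple variant while extracting its payload, can be sketched without any parser library. The `Token` enum below is illustrative; in a combinator setting the same shape would live inside a filter-map-style closure:

```rust
// Illustrative token enum in the style of what Logos would derive.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Int(u32),
    Ident(String),
    Plus,
}

// Accept only `Token::Int` and extract its payload in one step, the
// way a filter-map-style combinator would.
fn match_int(tok: &Token) -> Option<u32> {
    match tok {
        Token::Int(n) => Some(*n),
        _ => None,
    }
}

fn main() {
    // Spanned tokens, shaped like what `Lexer::spanned` yields.
    let tokens = vec![
        (Token::Int(7), 0..1),
        (Token::Plus, 2..3),
        (Token::Int(35), 4..6),
    ];
    // Keep the span paired with the extracted payload.
    let ints: Vec<(u32, std::ops::Range<usize>)> = tokens
        .iter()
        .filter_map(|(t, span)| match_int(t).map(|n| (n, span.clone())))
        .collect();
    assert_eq!(ints, vec![(7, 0..1), (35, 4..6)]);
}
```

The same closure body is what you would hand to a filter-map-style combinator so the parser's output is the extracted `u32` rather than the whole token.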

Switch to byte offset spans for strings by default

  1. Why does it count characters and not byte offsets? This is hard to spot, buried deep in the documentation. Also, most error reporting libraries expect byte offsets; codespan and miette both do. Is ariadne different?
  2. The end-of-input span is x..x+1 by default. I believe it should be zero-length. miette just skips a label that is out of range (i.e. doesn't display it), and codespan even used to crash on one (it may be fixed now; I haven't tested recently).

Both are easy to fix in my code using Stream::from_iter, but the defaults are confusing.
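To illustrate point 1, here is a stdlib-only sketch of converting a character-index span into the byte-offset span that codespan and miette expect; the helper name is made up for this example:

```rust
// Convert a span of character indices into byte offsets within `src`.
// For pure-ASCII input the two are identical; a multi-byte character
// (like the two-byte 'ä' below) shifts every byte offset after it.
fn char_span_to_byte_span(
    src: &str,
    span: std::ops::Range<usize>,
) -> std::ops::Range<usize> {
    let byte_at = |char_idx: usize| {
        src.char_indices()
            .nth(char_idx)
            .map(|(byte, _)| byte)
            .unwrap_or(src.len()) // past the end: clamp to the byte length
    };
    byte_at(span.start)..byte_at(span.end)
}

fn main() {
    let src = "äbc";
    // Characters 1..2 cover "b", but 'ä' occupies bytes 0..2 in UTF-8,
    // so the byte span is 2..3.
    assert_eq!(char_span_to_byte_span(src, 1..2), 2..3);
    // A zero-length end-of-input span maps to len..len.
    assert_eq!(char_span_to_byte_span(src, 3..3), 4..4);
}
```

This also shows why zero-length end-of-input spans (point 2) compose cleanly: `len..len` is always in range for the reporting libraries, whereas `len..len+1` is not.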

parsing '{{ }}'

I'd like to parse '{{ some text }}' as " some text ".

The version using just '{' and '}' as delimiters works fine like this:

    let inner = none_of("}".chars()).repeated().collect::<String>();
    let parser = just('{').padding_for(inner).padded_by(just('}'));

but I'm stuck defining inner for the '{{' '}}' case.
The Stream can only peek one token ahead, so I see no way to take characters until '}}' is encountered.
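As a point of comparison, the "take characters until `}}`" logic is easy to express outside the combinator framework with two-character lookahead; this stdlib sketch (not chumsky API) shows the behaviour being asked for:

```rust
// Extract the content of "{{ ... }}" by scanning for the two-character
// terminator, instead of peeking a single token at a time.
fn parse_double_braced(input: &str) -> Option<&str> {
    let rest = input.strip_prefix("{{")?; // require the opening delimiter
    let end = rest.find("}}")?;           // locate the closing delimiter
    Some(&rest[..end])                    // everything in between, untouched
}

fn main() {
    assert_eq!(parse_double_braced("{{ some text }}"), Some(" some text "));
    assert_eq!(parse_double_braced("{ not double }"), None);
    assert_eq!(parse_double_braced("{{ unterminated"), None);
}
```

In combinator terms the equivalent trick is to make `inner` reject a `}` that is immediately followed by another `}`, which needs some form of lookahead rather than a plain `none_of`.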

Is there anything I'm missing?

No `Delimiters` strategy advertised by error message

Just got this:

Start and end delimiters cannot be the same when using `NestedDelimiters`, consider using `Delimiters` instead

But I don't see any Delimiters structure or delimiters function. Is it still on the to-do list?

`no_std` support

Hey!

I've taken a look at the code, and there doesn't seem to be anything besides error.rs preventing the crate from being no_std compatible (considering only the code). The problems with the above-mentioned file are:

  1. The error type implements std::error::Error. This impl can be put behind a feature flag.
  2. It uses HashSet, which is not available in no_std. The next best solution would be BTreeSet, but that requires its data to be Ord, which doesn't always make sense.

Is this something you have considered/thought about?

I am asking because I have a parser and wanted to have a version that runs on the web as well. no_std would be the best solution! :)
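To illustrate point 2 above: in a no_std + alloc build, BTreeSet is the usual stand-in for HashSet, but it imposes an Ord bound on the element type. The error type below is purely illustrative, not chumsky's actual definition:

```rust
use std::collections::BTreeSet; // would be `alloc::collections::BTreeSet` under no_std

// Illustrative error type in the spirit of an "expected one of ..." parse
// error, using `BTreeSet` so it can be built without `std`. Note the `Ord`
// bound, which `HashSet` (with `Hash + Eq`) does not require.
#[derive(Debug)]
struct NoStdError<T: Ord> {
    expected: BTreeSet<T>,
    found: Option<T>,
}

fn main() {
    let err = NoStdError {
        expected: ['+', '-'].into_iter().collect::<BTreeSet<char>>(),
        found: Some('*'),
    };
    assert!(err.expected.contains(&'+'));
    assert_eq!(err.found, Some('*'));
}
```

The `Ord` requirement is exactly the trade-off mentioned above: any token type stored in the expected set would have to be orderable, which not every token type naturally is.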

`binary` mod to match `text`

Proposal

A binary module to mirror the text module, with similar helpers but for non-textual input.

TL;DR Explanation

The binary module would provide parsers like int and string that interpret the relevant types from u8 streams.

Full Explanation

The proposed module would contain helpers for parsing binary files, in the same vein as there already exist helpers for parsing text.

Some helpers that would be nice to have (exact signatures up for bikeshedding, main point is what they'd allow):

  • int<I>(endian: Endian) for reading size_of::<I>() bytes as an integer of a particular type and endianness
  • float<F>(endian: Endian), the same for floats (maybe combine the two)
  • string(ty: StringTy) for reading strings, where ty would be something like 'null-terminated' or 'length-prefixed'
  • Possibly other types as well, such as Vecs, arrays, or similar

I think chumsky has good potential as a tool for parsing non-textual files as well as text-based ones, but these kinds of primitive operations are currently missing. With just a handful of these basic tools, parsing binary files could be just as painless as any other format.
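For a concrete sense of what an `int` helper would wrap, the standard library already provides the underlying operation; this sketch (with an illustrative signature, not a proposed chumsky API) reads a little-endian u32 and hands back the remaining input, the shape a binary parser primitive would take:

```rust
// Read `size_of::<u32>()` bytes as a little-endian integer, returning the
// value together with the unconsumed remainder of the input.
fn le_u32(input: &[u8]) -> Option<(u32, &[u8])> {
    if input.len() < 4 {
        return None; // not enough input: the parser would report an error here
    }
    let (head, rest) = input.split_at(4);
    Some((u32::from_le_bytes(head.try_into().ok()?), rest))
}

fn main() {
    let bytes = [0x78, 0x56, 0x34, 0x12, 0xFF];
    let (value, rest) = le_u32(&bytes).unwrap();
    assert_eq!(value, 0x1234_5678);
    assert_eq!(rest, &[0xFF]);
    // Truncated input fails cleanly rather than panicking.
    assert_eq!(le_u32(&[0x01]), None);
}
```

A big-endian variant would use `u32::from_be_bytes`, which is why an `Endian` parameter (as proposed) is the natural knob for such a helper.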

Reduce Allocations in .repeated()

Calling .repeated() on a parser will allocate even if the pattern occurs only a few times. I would like some way to use the smallvec crate, which provides small-vector optimization. Is this something you'd want to implement or accept a PR for?

If yes, I had two ideas about how it can be done.

  • Just use it by default. This is the simplest option, but it might not be desirable, as it changes the default behaviour.
  • Declare a Push trait and implement it for Vec. It would be a one-method trait that pushes an item onto the end of a container, and the user could implement it for any newtyped container they want to use. .repeated() would then take a generic parameter bounded by Push and use it for its output.
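The second idea is small enough to sketch with std types only; the trait name and shape below follow the proposal and are not existing chumsky API:

```rust
// A one-method trait abstracting over "containers you can push onto",
// so a `.repeated()`-style combinator could collect into any of them.
trait Push<T>: Default {
    fn push(&mut self, item: T);
}

impl<T> Push<T> for Vec<T> {
    fn push(&mut self, item: T) {
        Vec::push(self, item); // forward to the inherent method
    }
}

// A user-defined newtype (it would wrap `SmallVec` in real code; a
// counter here keeps the sketch dependency-free).
#[derive(Default)]
struct CountOnly(usize);

impl<T> Push<T> for CountOnly {
    fn push(&mut self, _item: T) {
        self.0 += 1;
    }
}

// Stand-in for what a generic `.repeated()` would do internally.
fn collect_repeated<T, C: Push<T>>(items: impl IntoIterator<Item = T>) -> C {
    let mut out = C::default();
    for item in items {
        out.push(item);
    }
    out
}

fn main() {
    let v: Vec<u32> = collect_repeated([1, 2, 3]);
    assert_eq!(v, vec![1, 2, 3]);
    let c: CountOnly = collect_repeated([1, 2, 3]);
    assert_eq!(c.0, 3);
}
```

The appeal of this design is that `Vec` stays the default (no behaviour change for existing users), while anyone who wants small-vector optimization implements `Push` for their own wrapper type.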

Unclear how to use `chumsky::primitive::custom()`; perhaps not possible

The documentation for custom() states that you shouldn't need to use it, but gives no indication of how you would if you did.

There are a few issues I see:

First, the custom() function takes an unrestricted generic type F, but the Parser trait is only implemented for a Custom<F: Fn(&mut StreamOf<I, E>) -> PResult<I, O, E>, ...>

This means that, as a user, I don't know what type of closure to put in my custom parser, and also that the error is pushed to the call site rather than to the definition of my custom parser. For example, in the following, the error appears at the .or(cust) location rather than at the definition of cust, which would make debugging easier.

    let cust = chumsky::primitive::custom(|| 3); // Totally valid

    let parser = chumsky::primitive::just('3').or(cust).parse("3").unwrap(); // :(

The second, and perhaps larger, issue is that even if you do have a closure of the correct Fn(&mut StreamOf<I, E>) -> PResult<I, O, E> type, I am not sure you can actually do anything with it.

All of the useful parsing methods on Stream are marked pub(crate), which means you can't actually manipulate the stream within your custom parser.

Support integration with cstree

With the eventual merge of #82, it might be possible for chumsky to integrate with cstree, a library for lossless parsing using untyped syntax trees. This could be achieved by allowing implementers of the Input trait to specify functions that are run when sequences of the input are consumed by the parser. For a dedicated Input implementation (one that wraps another internally) we could specify these functions ourselves, allowing the emission of parse events.
