I have a use case where throughout the lifetime of the program I want to invoke the sa

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

How to reuse parsers without recreating them?,about zesterer/chumsky

Comments (25)

zesterer commented on June 2, 2024 2

So I was going to talk about how caching parsers in this way was not feasible in general... until I did some thinking and realised that, actually, it should be possible to design an API that allows it to be done safely and without sacrificing much in the way of versatility.

I've implemented an initial swing at such an API in the latest commit (01b96cd) and you can see an example of its usage in the tests here. It's currently behind the unstable feature flag since this is just an early attempt and is likely to change later (or perhaps it'll turn out to be unsound, I'm not sure).

The crux of it is: you implement Cached for a type (any old type will do, the type is just there to allow the trait to be implemented). When implementing the trait, you specify the input, output, and extra (i.e: error, context, etc.) types, along with providing a method that does the actual creating of the parser. Now, you get to use Cache to carry your parser around any way you like, with the ability to use the parser by calling .get().

Does this do the job for you?

from chumsky.

zesterer commented on June 2, 2024 2

I don't feel capable yet of judging soundness about this yet, so I can't offer feedback there.

I've discussed this with CraftSpider and we're both pretty happy that this is sound, so all good on that front (MIRI also likes it).

There's a bit of commented code in the Cached trait type (Parser<'a>:)

Specifying the parser type would allow one to skip boxing the parser, at the slight inconvenience that the parser type must be specified.

The comment of Cache::get has a sentence that trails off

Thanks! I'll fix that.

The implementation of a cached type is a bit verbose, given that all the information that goes into is already in the signature of the parser function. I take it that is necessary in order to let get() have the appropriate lifetimes. I imagine devising a macro that comes up with this automatically given a parser function would be tricky to implement, as it would effectively need to extract the type information from the returned parser, correct?

Yes, I don't see an easy way to avoid this (without losing the generality of the solution, anyway). Sadly it's not possible to use HRTBs in this case since you'd need both higher-ranked and higher-kinded types to generalise over the input and output type lifetimes (unless you were to use a trait for each, but that's just moving the problem).

I'd have a use-case where I think the same solution would be very useful.

This definitely sounds like a good use-case!

from chumsky.

stefnotch commented on June 2, 2024 1

I'd have a use-case where I think the same solution would be very useful.

I have a table-driven Pratt parser, where the used symbols can be chosen at runtime. For example, one might want a math parser that only understands simple high school math. Meanwhile a different use case calls for a math parser that also understands units.(physics)

Which means that every once in a while, my code will construct a new parser at runtime.

And then, that parser is used to parse anywhere between 1 to a few hundred separate formulas.

from chumsky.

zesterer commented on June 2, 2024 1

The thing to search for is 'Generic Associated Types'.

from chumsky.

Zij-IT commented on June 2, 2024 1

It would be great if you could get a MRE of the issue with the parser going, that way we are taking more than just some stabs in the dark.

from chumsky.

faassen commented on June 2, 2024 1

All right, in the process of stripping down this massive real world parser in the hope to find a minimal reproducing example, I realized I was doing a pretty hacky thing with recursive:

let  mut another_parser = None

let parser = recursive(|parser| { 
   let internal_parser = ...
   another_parser = Some(internal_parser);
  // stuff
});

I did this because I needed both the internal_parser as well as parser, but obviously it's quite hacky. I rewrote the code to use Recursive::declare and Recursive::define and the crash went away.

from chumsky.

Zij-IT commented on June 2, 2024 1

@faassen For whatever reason I thought that #494 was already merged into main, but that's not the case yet. As #486 shows, there is a leak when using declare and define.

from chumsky.

faassen commented on June 2, 2024

I've tried the simple approach above: store the parser required in a struct in advance so I create it only once, and then use it multiple times. Unfortunately I'm running into difficulties there too.

To store the parser in the struct I am using chumsky::Boxed. This takes two lifetimes, 'a and 'b. 'a is the lifetime of the input, 'b appears to be the lifetime of the DynParser that underlies the boxed parser, so I think the lifetime of storage.

The input passed into the parse function needs to live at least as long as 'a, but this 'a is now baked into the stored parser, and any input will live a lot more briefly.

I don't know how to express the lifetimes so that the lifetime of the input isn't baked into things.

from chumsky.

zesterer commented on June 2, 2024

Unfortunately this is difficult to support. The current implementation has the lifetime of the input be part of the Parser trait (i.e: each parser is effectively tied to a specific input data lifetime and can't easily be reused for different data lifetimes). The ideal would be an API that looks something like this:

trait Parser {
    type Input<'a>: Input<'a>;
    type Output<'a>;

    fn parse<'src>(&self, input: Self::Input<'src>) -> Self::Output<'src> { ... }
}

But unfortunately the assumed invariance of GAT parameters means this doesn't generalise right now.

It might be that you simply must recreate the parser each time (alternatively you can unsafely transmute the parser to whatever lifetime you require, provided you make sure that the input and output lifetimes match).

That said, recreating the parser each time isn't going to be that expensive. It's not the sort of thing you should be doing in a very hot loop, but recreating it for each file you parse or each input in a REPL isn't going to have a meaningful performance impact.

from chumsky.

faassen commented on June 2, 2024

Thanks for your answer! That's unfortunate.

I am currently using the Chumsky parser to parse strings at runtime in an interpreter implementation: parsing datetime formats and the like. It could totally happen in a hot loop, unfortunately. It sounds then like Chumsky isn't a good fit for parsing input like that. I didn't want to use regexes as the rules are somewhat involved and I think the chumsky code is more readable, but it sounds like I should reconsider that.

In another case I am reusing the language level parser to also parse the same thing during runtime and I didn't want to implement the same rules twice.

Would it be possible to support a cache in Chumsky itself? That is, Chumsky support memoization of parser functions? Since it's not possible to get this right outside of Chumsky, perhaps Chumsky should offer explicit support for this use case, if possible.

If not, I think it would be useful to discuss this limitation in the docs: do not use Chumsky if you want to parse at runtime during a hot loop.

from chumsky.

zesterer commented on June 2, 2024

If you're not using recursive or boxed, no allocation needs to occur during parser creation and the process of creating a parser becomes effectively free (or, at least, the compiler is able to statically dispatch everything and optimise it as it pleases). If your syntax can be expressed via regex, none of the aforementioned combinators should be required anyway!

from chumsky.

zesterer commented on June 2, 2024

Also, what do you mean by the 'cache' point here? Chumsky already supports memoisation (see .memoized()), but it sounds like this might not be what you're looking for?

from chumsky.

faassen commented on June 2, 2024

I am using boxed as I am using a parser multiple times and therefore need to be able to clone it. I am unclear about when parsers are copy but they aren't always, especially if I compose parsers using separate functions. Maybe I can avoid it. In any case, this requires a lot of knowledge of details in order to know whether it's maybe safe to use a parser in a hot loop.

Concerning memoized parsers, my understanding is that memoized memoizes parser output given input. I am talking about memoizing the parser function itself. So:

#[cache]
fn my_parser() -> impl Parser<...> {
   ...
}

If you call my_parser twice, the second time it returns the parser from cache. This way I can do whatever I want during parser construction and have it be fast.

Without this, storing parsers is impossible and there are pitfalls if you reuse one (avoid hot loops, or avoid boxed, etc). With this, it's easy.

from chumsky.

faassen commented on June 2, 2024

Thank you so much, this looks like it would do the job!

A few comments:

I don't feel capable yet of judging soundness about this yet, so I can't offer feedback there.
There's a bit of commented code in the Cached trait type (Parser<'a>:)
The comment of Cache::get has a sentence that trails off
The implementation of a cached type is a bit verbose, given that all the information that goes into is already in the signature of the parser function. I take it that is necessary in order to let get() have the appropriate lifetimes. I imagine devising a macro that comes up with this automatically given a parser function would be tricky to implement, as it would effectively need to extract the type information from the returned parser, correct?

from chumsky.

stefnotch commented on June 2, 2024

Unfortunately this is difficult to support. The current implementation has the lifetime of the input be part of the Parser trait (i.e: each parser is effectively tied to a specific input data lifetime and can't easily be reused for different data lifetimes). The ideal would be an API that looks something like this:
trait Parser {
    type Input<'a>: Input<'a>;
    type Output<'a>;

    fn parse<'src>(&self, input: Self::Input<'src>) -> Self::Output<'src> { ... }
}
But unfortunately the assumed invariance of GAT parameters means this doesn't generalise right now.

I'd love to understand this better. Is there a Rust issue that I could check out, or would you have any recommendations on what to read to understand advanced trait and lifetime topics better?

from chumsky.

faassen commented on June 2, 2024

I've finally found some time to try this out. The bad news first: I tried this on a big parser and it panics:

panicked at 'Recursive parser used before being defined', /home/faassen/.cargo/git/checkouts/chumsky-651fb76526ac39e3/b6c5df6/src/recursive.rs:155:18

The parser indeed uses recursive parsers, but I don't understand why caching it should have this effect.

Then a few implementation notes, perhaps useful to help improve the documentation:

My parser uses tokens defined by a logos lexer. In the parser, I used this generic parameter a lot:

where I: ValueInput<'a, Token = Token<'a>, Span = Span>

This caused a bit of a struggle. I want to define a Parsers struct that contains all the cached parsers, along these lines:

struct MyParserCache;

struct Parsers {
   my_parser_cache: chumsky::cache::Cache<MyParserCache>
}

But Parsers and thus MyParserCache cannot rely on a generic I parameter, otherwise we get the whole "source must outlive the duration of the program" namespace issue we started with. So what now? I landed on the following:

impl chumsky::cache::Cached for MyParserCache {
    type Parser<'src> = BoxedParser<
        'src,
        SpannedInput<
            Token<'src>,
            SimpleSpan,
            chumsky::input::BoxedStream<'src, (Token<'src>, SimpleSpan)>,
        >,
        ParsedAST,
    >;

    fn make_parser<'src>(self) -> Self::Parser<'src> {
        Parser::boxed(my_parser())
    }
}

For non-spanned scenario using BoxedStream with BoxedParser is sufficient, but I needed to wrap the whole thing in SpannedInput again. It took quite a bit of puzzling to get the types right.

But unfortunately after this I'm getting this crash. Looking at the source it appears RecursiveInner::Unowned cannot be upgraded for some reason.

from chumsky.

Zij-IT commented on June 2, 2024

I've finally found some time to try this out. The bad news first: I tried this on a big parser and it panics:
panicked at 'Recursive parser used before being defined', /home/faassen/.cargo/git/checkouts/chumsky-651fb76526ac39e3/b6c5df6/src/recursive.rs:155:18
The parser indeed uses recursive parsers, but I don't understand why caching it should have this effect.

Then a few implementation notes, perhaps useful to help improve the documentation:

My parser uses tokens defined by a logos lexer. In the parser, I used this generic parameter a lot:
where I: ValueInput<'a, Token = Token<'a>, Span = Span>
This caused a bit of a struggle. I want to define a Parsers struct that contains all the cached parsers, along these lines:
struct MyParserCache;

struct Parsers {
   my_parser_cache: chumsky::cache::Cache<MyParserCache>
}
But Parsers and thus MyParserCache cannot rely on a generic I parameter, otherwise we get the whole "source must outlive the duration of the program" namespace issue we started with. So what now? I landed on the following:
impl chumsky::cache::Cached for MyParserCache {
    type Parser<'src> = BoxedParser<
        'src,
        SpannedInput<
            Token<'src>,
            SimpleSpan,
            chumsky::input::BoxedStream<'src, (Token<'src>, SimpleSpan)>,
        >,
        ParsedAST,
    >;

    fn make_parser<'src>(self) -> Self::Parser<'src> {
        Parser::boxed(my_parser())
    }
}
For non-spanned scenario using BoxedStream with BoxedParser is sufficient, but I needed to wrap the whole thing in SpannedInput again. It took quite a bit of puzzling to get the types right.

But unfortunately after this I'm getting this crash. Looking at the source it appears RecursiveInner::Unowned cannot be upgraded for some reason.

Is your repo public by chance? Without being able to see your code, there’s no way for me to be able to test it and probe around properly

from chumsky.

faassen commented on June 2, 2024

Unfortunately not (yet); and it's a huge codebase anyhow. I can try to create a replication - what I will try is using parser recursion at all to trigger, but if it's more complicated than that it will take a while.

from chumsky.

faassen commented on June 2, 2024

Okay, I just tried a simple recursive parser with caching and it doesn't crash. So unfortunately it looks like I'm going to do more work to reproduce this behavior.

from chumsky.

faassen commented on June 2, 2024

Okay, I think I triggered a memory leak somewhere. When I use a cached parser (and thus do not recreate it), the memory usage of the process rises alarmingly (by gigabytes) and eventually it's killed off by Linux. This is when I'm using a large cached parser on the order of 10k times. Strangely enough if I recreate the cache for each run (and thus no caching should take place), the memory leak also happen; just the fact that the cache is in use seems to be enough.

When I don't use the parser caching in the same run the memory usage stays below 30 megabytes.

P.S. Again a reproducible example is probably quite a bit of effort, as my parser is significantly complex.

from chumsky.

Zij-IT commented on June 2, 2024

Okay, I think I triggered a memory leak somewhere. When I use a cached parser (and thus do not recreate it), the memory usage of the process rises alarmingly (by gigabytes) and eventually it's killed off by Linux. This is when I'm using a large cached parser on the order of 10k times. Strangely enough if I recreate the cache for each run (and thus no caching should take place), the memory leak also happen; just the fact that the cache is in use seems to be enough.

When I don't use the parser caching in the same run the memory usage stays below 30 megabytes.

P.S. Again a reproducible example is probably quite a bit of effort, as my parser is significantly complex.

Without knowing how your parser looks and purely based on your comment, I wonder if you’ve accidentally created mutually recursive parsers that, in an attempt to define the first, attempts to define the second until that loop results in Linux killing it.

Akin to: “to parse a Rust statement I must know how to parse an expression, but to parse a block expression I must know how to parse a Rust statement”

from chumsky.

faassen commented on June 2, 2024

I wonder if you’ve accidentally created mutually recursive parsers that, in an attempt to define the first, attempts to
define the second until that loop results in Linux killing it.

I don't think so. The parser definition is the same. The parser works - it parses. It works thousands of times, even. It's just that if I use the caching mechanism (and only then) the memory grows unbounded.

from chumsky.

faassen commented on June 2, 2024

@Zij-IT, you got me thinking and do some extra testing. If I just make the change to use Recursive.declare() then the parser memory usage explodes, without any use of caching whatsoever. So it's not due to the caching logic but the switch to use Recursive.declare(). But I switched to that as without it my hack makes the caching system crash. And to repeat: this parser works, it's just extremely memory intensive.

I will note that the docs claim that recursive is a convenience wrapper around define and declare but looking at the source it looks like that's not what is happening.

from chumsky.

faassen commented on June 2, 2024

That explains it then! So this means I am stuck: I can't cache this parser until I can create this type of recursive structure without a leak.

from chumsky.

How to reuse parsers without recreating them? about chumsky HOT 25 OPEN

Comments (25)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs