Comments (25)
So I was going to talk about how caching parsers in this way was not feasible in general... until I did some thinking and realised that, actually, it should be possible to design an API that allows it to be done safely and without sacrificing much in the way of versatility.
I've implemented an initial swing at such an API in the latest commit (01b96cd) and you can see an example of its usage in the tests here. It's currently behind the unstable
feature flag since this is just an early attempt and is likely to change later (or perhaps it'll turn out to be unsound, I'm not sure).
The crux of it is: you implement Cached
for a type (any old type will do, the type is just there to allow the trait to be implemented). When implementing the trait, you specify the input, output, and extra (i.e: error, context, etc.) types, along with providing a method that does the actual creating of the parser. Now, you get to use Cache
to carry your parser around any way you like, with the ability to use the parser by calling .get()
.
Does this do the job for you?
from chumsky.
- I don't feel capable yet of judging soundness about this yet, so I can't offer feedback there.
I've discussed this with CraftSpider and we're both pretty happy that this is sound, so all good on that front (MIRI also likes it).
- There's a bit of commented code in the
Cached
trait type (Parser<'a>:
)
Specifying the parser type would allow one to skip boxing the parser, at the slight inconvenience that the parser type must be specified.
- The comment of
Cache::get
has a sentence that trails off
Thanks! I'll fix that.
- The implementation of a cached type is a bit verbose, given that all the information that goes into is already in the signature of the parser function. I take it that is necessary in order to let
get()
have the appropriate lifetimes. I imagine devising a macro that comes up with this automatically given a parser function would be tricky to implement, as it would effectively need to extract the type information from the returned parser, correct?
Yes, I don't see an easy way to avoid this (without losing the generality of the solution, anyway). Sadly it's not possible to use HRTBs in this case since you'd need both higher-ranked and higher-kinded types to generalise over the input and output type lifetimes (unless you were to use a trait for each, but that's just moving the problem).
I'd have a use-case where I think the same solution would be very useful.
This definitely sounds like a good use-case!
from chumsky.
I'd have a use-case where I think the same solution would be very useful.
I have a table-driven Pratt parser, where the used symbols can be chosen at runtime. For example, one might want a math parser that only understands simple high school math. Meanwhile a different use case calls for a math parser that also understands units.(physics)
Which means that every once in a while, my code will construct a new parser at runtime.
And then, that parser is used to parse anywhere between 1 to a few hundred separate formulas.
from chumsky.
The thing to search for is 'Generic Associated Types'.
from chumsky.
It would be great if you could get a MRE of the issue with the parser going, that way we are taking more than just some stabs in the dark.
from chumsky.
All right, in the process of stripping down this massive real world parser in the hope to find a minimal reproducing example, I realized I was doing a pretty hacky thing with recursive
:
let mut another_parser = None
let parser = recursive(|parser| {
let internal_parser = ...
another_parser = Some(internal_parser);
// stuff
});
I did this because I needed both the internal_parser
as well as parser
, but obviously it's quite hacky. I rewrote the code to use Recursive::declare
and Recursive::define
and the crash went away.
from chumsky.
@faassen For whatever reason I thought that #494 was already merged into main, but that's not the case yet. As #486 shows, there is a leak when using declare
and define
.
from chumsky.
I've tried the simple approach above: store the parser required in a struct in advance so I create it only once, and then use it multiple times. Unfortunately I'm running into difficulties there too.
To store the parser in the struct I am using chumsky::Boxed
. This takes two lifetimes, 'a
and 'b
. 'a
is the lifetime of the input, 'b
appears to be the lifetime of the DynParser
that underlies the boxed parser, so I think the lifetime of storage.
The input passed into the parse
function needs to live at least as long as 'a
, but this 'a
is now baked into the stored parser, and any input will live a lot more briefly.
I don't know how to express the lifetimes so that the lifetime of the input isn't baked into things.
from chumsky.
Unfortunately this is difficult to support. The current implementation has the lifetime of the input be part of the Parser
trait (i.e: each parser is effectively tied to a specific input data lifetime and can't easily be reused for different data lifetimes). The ideal would be an API that looks something like this:
trait Parser {
type Input<'a>: Input<'a>;
type Output<'a>;
fn parse<'src>(&self, input: Self::Input<'src>) -> Self::Output<'src> { ... }
}
But unfortunately the assumed invariance of GAT parameters means this doesn't generalise right now.
It might be that you simply must recreate the parser each time (alternatively you can unsafely transmute the parser to whatever lifetime you require, provided you make sure that the input and output lifetimes match).
That said, recreating the parser each time isn't going to be that expensive. It's not the sort of thing you should be doing in a very hot loop, but recreating it for each file you parse or each input in a REPL isn't going to have a meaningful performance impact.
from chumsky.
Thanks for your answer! That's unfortunate.
I am currently using the Chumsky parser to parse strings at runtime in an interpreter implementation: parsing datetime formats and the like. It could totally happen in a hot loop, unfortunately. It sounds then like Chumsky isn't a good fit for parsing input like that. I didn't want to use regexes as the rules are somewhat involved and I think the chumsky code is more readable, but it sounds like I should reconsider that.
In another case I am reusing the language level parser to also parse the same thing during runtime and I didn't want to implement the same rules twice.
Would it be possible to support a cache in Chumsky itself? That is, Chumsky support memoization of parser functions? Since it's not possible to get this right outside of Chumsky, perhaps Chumsky should offer explicit support for this use case, if possible.
If not, I think it would be useful to discuss this limitation in the docs: do not use Chumsky if you want to parse at runtime during a hot loop.
from chumsky.
If you're not using recursive
or boxed
, no allocation needs to occur during parser creation and the process of creating a parser becomes effectively free (or, at least, the compiler is able to statically dispatch everything and optimise it as it pleases). If your syntax can be expressed via regex, none of the aforementioned combinators should be required anyway!
from chumsky.
Also, what do you mean by the 'cache' point here? Chumsky already supports memoisation (see .memoized()
), but it sounds like this might not be what you're looking for?
from chumsky.
I am using boxed as I am using a parser multiple times and therefore need to be able to clone it. I am unclear about when parsers are copy
but they aren't always, especially if I compose parsers using separate functions. Maybe I can avoid it. In any case, this requires a lot of knowledge of details in order to know whether it's maybe safe to use a parser in a hot loop.
Concerning memoized parsers, my understanding is that memoized
memoizes parser output given input. I am talking about memoizing the parser function itself. So:
#[cache]
fn my_parser() -> impl Parser<...> {
...
}
If you call my_parser
twice, the second time it returns the parser from cache. This way I can do whatever I want during parser construction and have it be fast.
Without this, storing parsers is impossible and there are pitfalls if you reuse one (avoid hot loops, or avoid boxed
, etc). With this, it's easy.
from chumsky.
Thank you so much, this looks like it would do the job!
A few comments:
-
I don't feel capable yet of judging soundness about this yet, so I can't offer feedback there.
-
There's a bit of commented code in the
Cached
trait type (Parser<'a>:
) -
The comment of
Cache::get
has a sentence that trails off -
The implementation of a cached type is a bit verbose, given that all the information that goes into is already in the signature of the parser function. I take it that is necessary in order to let
get()
have the appropriate lifetimes. I imagine devising a macro that comes up with this automatically given a parser function would be tricky to implement, as it would effectively need to extract the type information from the returned parser, correct?
from chumsky.
Unfortunately this is difficult to support. The current implementation has the lifetime of the input be part of the
Parser
trait (i.e: each parser is effectively tied to a specific input data lifetime and can't easily be reused for different data lifetimes). The ideal would be an API that looks something like this:trait Parser { type Input<'a>: Input<'a>; type Output<'a>; fn parse<'src>(&self, input: Self::Input<'src>) -> Self::Output<'src> { ... } }But unfortunately the assumed invariance of GAT parameters means this doesn't generalise right now.
I'd love to understand this better. Is there a Rust issue that I could check out, or would you have any recommendations on what to read to understand advanced trait and lifetime topics better?
from chumsky.
I've finally found some time to try this out. The bad news first: I tried this on a big parser and it panics:
panicked at 'Recursive parser used before being defined', /home/faassen/.cargo/git/checkouts/chumsky-651fb76526ac39e3/b6c5df6/src/recursive.rs:155:18
The parser indeed uses recursive parsers, but I don't understand why caching it should have this effect.
Then a few implementation notes, perhaps useful to help improve the documentation:
My parser uses tokens defined by a logos
lexer. In the parser, I used this generic parameter a lot:
where I: ValueInput<'a, Token = Token<'a>, Span = Span>
This caused a bit of a struggle. I want to define a Parsers
struct that contains all the cached parsers, along these lines:
struct MyParserCache;
struct Parsers {
my_parser_cache: chumsky::cache::Cache<MyParserCache>
}
But Parsers
and thus MyParserCache
cannot rely on a generic I
parameter, otherwise we get the whole "source must outlive the duration of the program" namespace issue we started with. So what now? I landed on the following:
impl chumsky::cache::Cached for MyParserCache {
type Parser<'src> = BoxedParser<
'src,
SpannedInput<
Token<'src>,
SimpleSpan,
chumsky::input::BoxedStream<'src, (Token<'src>, SimpleSpan)>,
>,
ParsedAST,
>;
fn make_parser<'src>(self) -> Self::Parser<'src> {
Parser::boxed(my_parser())
}
}
For non-spanned scenario using BoxedStream
with BoxedParser
is sufficient, but I needed to wrap the whole thing in SpannedInput
again. It took quite a bit of puzzling to get the types right.
But unfortunately after this I'm getting this crash. Looking at the source it appears RecursiveInner::Unowned
cannot be upgraded for some reason.
from chumsky.
I've finally found some time to try this out. The bad news first: I tried this on a big parser and it panics:
panicked at 'Recursive parser used before being defined', /home/faassen/.cargo/git/checkouts/chumsky-651fb76526ac39e3/b6c5df6/src/recursive.rs:155:18
The parser indeed uses recursive parsers, but I don't understand why caching it should have this effect.
Then a few implementation notes, perhaps useful to help improve the documentation:
My parser uses tokens defined by a
logos
lexer. In the parser, I used this generic parameter a lot:where I: ValueInput<'a, Token = Token<'a>, Span = Span>This caused a bit of a struggle. I want to define a
Parsers
struct that contains all the cached parsers, along these lines:struct MyParserCache; struct Parsers { my_parser_cache: chumsky::cache::Cache<MyParserCache> }But
Parsers
and thusMyParserCache
cannot rely on a genericI
parameter, otherwise we get the whole "source must outlive the duration of the program" namespace issue we started with. So what now? I landed on the following:impl chumsky::cache::Cached for MyParserCache { type Parser<'src> = BoxedParser< 'src, SpannedInput< Token<'src>, SimpleSpan, chumsky::input::BoxedStream<'src, (Token<'src>, SimpleSpan)>, >, ParsedAST, >; fn make_parser<'src>(self) -> Self::Parser<'src> { Parser::boxed(my_parser()) } }For non-spanned scenario using
BoxedStream
withBoxedParser
is sufficient, but I needed to wrap the whole thing inSpannedInput
again. It took quite a bit of puzzling to get the types right.But unfortunately after this I'm getting this crash. Looking at the source it appears
RecursiveInner::Unowned
cannot be upgraded for some reason.
Is your repo public by chance? Without being able to see your code, there’s no way for me to be able to test it and probe around properly
from chumsky.
Unfortunately not (yet); and it's a huge codebase anyhow. I can try to create a replication - what I will try is using parser recursion at all to trigger, but if it's more complicated than that it will take a while.
from chumsky.
Okay, I just tried a simple recursive parser with caching and it doesn't crash. So unfortunately it looks like I'm going to do more work to reproduce this behavior.
from chumsky.
Okay, I think I triggered a memory leak somewhere. When I use a cached parser (and thus do not recreate it), the memory usage of the process rises alarmingly (by gigabytes) and eventually it's killed off by Linux. This is when I'm using a large cached parser on the order of 10k times. Strangely enough if I recreate the cache for each run (and thus no caching should take place), the memory leak also happen; just the fact that the cache is in use seems to be enough.
When I don't use the parser caching in the same run the memory usage stays below 30 megabytes.
P.S. Again a reproducible example is probably quite a bit of effort, as my parser is significantly complex.
from chumsky.
Okay, I think I triggered a memory leak somewhere. When I use a cached parser (and thus do not recreate it), the memory usage of the process rises alarmingly (by gigabytes) and eventually it's killed off by Linux. This is when I'm using a large cached parser on the order of 10k times. Strangely enough if I recreate the cache for each run (and thus no caching should take place), the memory leak also happen; just the fact that the cache is in use seems to be enough.
When I don't use the parser caching in the same run the memory usage stays below 30 megabytes.
P.S. Again a reproducible example is probably quite a bit of effort, as my parser is significantly complex.
Without knowing how your parser looks and purely based on your comment, I wonder if you’ve accidentally created mutually recursive parsers that, in an attempt to define the first, attempts to define the second until that loop results in Linux killing it.
Akin to: “to parse a Rust statement I must know how to parse an expression, but to parse a block expression I must know how to parse a Rust statement”
from chumsky.
I wonder if you’ve accidentally created mutually recursive parsers that, in an attempt to define the first, attempts to
define the second until that loop results in Linux killing it.
I don't think so. The parser definition is the same. The parser works - it parses. It works thousands of times, even. It's just that if I use the caching mechanism (and only then) the memory grows unbounded.
from chumsky.
@Zij-IT, you got me thinking and do some extra testing. If I just make the change to use Recursive.declare()
then the parser memory usage explodes, without any use of caching whatsoever. So it's not due to the caching logic but the switch to use Recursive.declare()
. But I switched to that as without it my hack makes the caching system crash. And to repeat: this parser works, it's just extremely memory intensive.
I will note that the docs claim that recursive
is a convenience wrapper around define
and declare
but looking at the source it looks like that's not what is happening.
from chumsky.
That explains it then! So this means I am stuck: I can't cache this parser until I can create this type of recursive structure without a leak.
from chumsky.
Related Issues (20)
- ease support for collections of option HOT 2
- Example foo.rs fails to parse expr in parentheses HOT 6
- `unwrapped` vs. `collect` / `collected` HOT 4
- IntoIter combinator unnecessarily triggers infinite loop checks in other combinators such as foldr. HOT 3
- Heavyweight vergen HOT 3
- `select!` when returning an error gives an empty `expected` HOT 1
- add `into_span()`, a parser which simply returns the span HOT 2
- [Feature request] Pratt parsing with spans HOT 8
- Use case: `InputRef::skip_n` or equivalent HOT 3
- Support collecting with custom allocators HOT 7
- `ignored` vs `ignore` HOT 6
- Get span in bytes instead of characters? HOT 6
- Labelling parsers doesn't work sometimes HOT 5
- (post-v1?) Add chumsky to rust serialization benchmark page HOT 1
- Example nanorust parser does not work. HOT 13
- Custom error isn't respected in some cases HOT 6
- Pratt Parsing Error Recovery HOT 3
- Documentation request: State and Context HOT 10
- Extension parser API request HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from chumsky.