
Comments (11)

SuperCuber commented on July 24, 2024

When I think about it, I'm not quite sure why tokenize and identify are different steps - don't you know the type of the token while you produce it? Seems strange to me; I might be missing something.


kueblc commented on July 24, 2024

Hi @SuperCuber thanks for sharing your thoughts.

For performance reasons, the current parser does not support lookbehind or lookahead regular expressions.

You might be able to work around this using the CSS adjacent sibling selector, i.e.

.tagKey + .tagValue {
    color: red;
}

Let me know if this works for you and I can add this to the examples.

> When I think about it, I'm not quite sure why tokenize and identify are different steps - don't you know the type of the token while you produce it? Seems strange to me; I might be missing something.

We construct a single RegExp from the individual rules, which allows for very fast tokenization, but does not tell us which source rule produced the token. We then run identify only on the tokens that changed. This was the most performant model explored when originally developing this library in ~2010, though I am certainly open to exploring other options.
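Roughly, the combined-RegExp approach looks like this (a minimal sketch with made-up rule names, not LDT's actual internals):

var rules = {
    comment: /\/\/[^\n]*/,
    number: /\b\d+\b/,
    word: /\w+/,
    other: /[\s\S]/
};

// One alternation built from every rule: a single fast pass
// over the input, but the matching rule is not recorded.
var tokenizer = new RegExp(Object.keys(rules).map(function (name) {
    return rules[name].source;
}).join('|'), 'g');

function tokenize(input) {
    return input.match(tokenizer) || [];
}

// identify runs per token, and only on tokens that changed.
function identify(token) {
    for (var name in rules)
        if (rules[name].test(token))
            return name;
}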

I'm also exploring performant grammar parsing, which may also address your needs and open the door to all sorts of advanced introspection capabilities.

The focus of this library is to be lightweight and fast enough to do live highlighting on low end machines, so any enhancements that have the potential to impact file size or performance must be optional. We can include multiple parsing models and let the end developer choose one which fits the performance and capability requirements of their project.


SuperCuber commented on July 24, 2024

> We can include multiple parsing models and let the end developer choose one which fits the performance and capability requirements of their project.

I think this is the way to go. The most straightforward way to do this that I see is to add a method that does both steps (on Parser it would just call tokenize then identify, giving basically no performance loss plus backwards compatibility for custom parsers).
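Something along these lines (a hypothetical sketch; parse is an invented name, not part of the current API):

// A convenience method that just chains the two existing
// steps; custom parsers inherit it for free, and callers who
// want incremental updates keep using tokenize/identify.
Parser.prototype.parse = function (input) {
    var self = this;
    return this.tokenize(input).map(function (token) {
        return { token: token, type: self.identify(token) };
    });
};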

Then, if you choose to implement new parsers, you can do it in a new file, making the whole set of changes opt-in not only at the API level but also at the file size level.


kueblc commented on July 24, 2024

> The most straightforward way to do this that I see is to add a method that does both steps (on Parser it would just call tokenize then identify

I appreciate your input. I believe we will need to keep these separate so that we are not running identify in cases where tokens have not changed. Let me do some deeper thinking on this.
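To illustrate why the split matters, the incremental path is roughly this (an illustrative sketch, not the actual LDT source):

// After an edit, retokenize everything (cheap, one RegExp pass),
// but only re-identify tokens that differ from the previous
// stream; unchanged tokens reuse their cached type.
function update(input, oldTokens, oldTypes) {
    var tokens = tokenize(input);
    var types = tokens.map(function (token, i) {
        return token === oldTokens[i]
            ? oldTypes[i]       // unchanged: keep cached type
            : identify(token);  // changed: classify again
    });
    return { tokens: tokens, types: types };
}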

Did you try the workaround using CSS sibling selectors? Let me know if this is a viable strategy or if there are issues I am not seeing with this approach.


SuperCuber commented on July 24, 2024

> Did you try the workaround using CSS sibling selectors?

It's going to be a bit fiddly. For example, if I wanted to color only the text inside the brackets, I'm not sure how I would do that, since I want it to continue until the closing bracket, but a regex of [^\)]+ would match stuff from the start of the whole text...
And if I tried something like having a .word token, that wouldn't help much either, since I can't say .tagKey + .leftParen + .word* in CSS.

Also, there isn't a way in CSS to look ahead (according to this).


SuperCuber commented on July 24, 2024

> Let me do some deeper thinking on this

So did you come up with some ideas? I'm interested in this feature.


kueblc commented on July 24, 2024

Yes, I had a few ideas I'm experimenting with. It might take me a bit due to work and life; I'm pretty booked up for the next week or so.

One idea is to pass identify some sort of context, such as the preceding and following tokens, the position in the original string, or index in the token stream. I'd want to test these ideas for functionality and performance in most of the supported browsers.
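For example, the signature could grow an optional context argument, something like this (all names hypothetical):

// identify receives the token plus its surroundings.
function identify(token, context) {
    // context.prev / context.next: adjacent substrings
    // context.index: index in the token stream
    // context.position: offset into the original string
    if (/^\w+$/.test(token) && context.prev === ':')
        return 'tagValue'; // a word right after ':' is a value
    // ... otherwise fall through to the usual per-rule tests
}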

Feel free to share any of your own ideas. I'll hopefully have something to roll out by the end of the month.


SuperCuber commented on July 24, 2024

> preceding and following tokens

I assume that at least the following tokens are not .identify'd yet, so if you wanted to do that you'd need to re-implement identifying those tokens inside your identify function, which is weird...

> the position in the original string

I don't think that helps much, unless you expect the implementation of the function to go back to the original string, try breaking it into tokens on its own, and then locate the tokens it wants to check based on the character index? That seems really convoluted and inefficient to me.

> I'm also exploring performant grammar parsing, which may also address your needs and open the door to all sorts of advanced introspection capabilities.

This seems interesting; can you elaborate on how you imagine it would be used (say I want to write my own grammar)?
Maybe you could leverage an existing library for it, or even allow users to somehow hook up their library/implementation of choice...


kueblc commented on July 24, 2024

> > preceding and following tokens

> I assume that at least the following tokens are not .identify'd yet, so if you wanted to do that you'd need to re-implement identifying those tokens inside your identify function, which is weird...

There may be a misunderstanding about how the process works. tokenize is responsible for breaking up the original string into an array of substrings. Then the (unidentified) substrings are individually passed to identify, which determines the token type.

tokenize should already be able to handle lookbehind/lookahead, but the identify step will fail because it lacks context. So the idea is to provide the substring to identify along with the substrings that surround it. The result of identify does not depend on other results, so it doesn't matter whether we run identify on the surrounding substrings first, or at all.

> > the position in the original string

> I don't think that helps much, unless you expect the implementation of the function to go back to the original string, try breaking it into tokens on its own, and then locate the tokens it wants to check based on the character index? That seems really convoluted and inefficient to me.

We can leverage RegExp to test for a match at a given position. This is just one idea; it will certainly need to be evaluated for performance.
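For what it's worth, the sticky (y) flag does exactly this in engines that support it (a small sketch; older browsers would need a fallback):

// Test whether a pattern matches exactly at a given offset;
// the sticky flag anchors the match at lastIndex.
function matchesAt(pattern, input, position) {
    var re = new RegExp(pattern.source, 'y');
    re.lastIndex = position;
    return re.test(input);
}

matchesAt(/MyType/, 'let x: MyType = f();', 7); // true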

> > I'm also exploring performant grammar parsing, which may also address your needs and open the door to all sorts of advanced introspection capabilities.

> This seems interesting; can you elaborate on how you imagine it would be used (say I want to write my own grammar)?
> Maybe you could leverage an existing library for it, or even allow users to somehow hook up their library/implementation of choice...

This is still very much in the exploration phase. A lot of work remains to be done to make this possible in real time. I'm looking into ways to produce partial parse trees and partial updates, and also investigating fast (best-guess) updates in between full updates.


SuperCuber commented on July 24, 2024

> the idea is to provide the substring to identify along with the substrings that surround it. The result of identify does not depend on other results

But it does: if it's not looking at the other substrings, then what's the point?
What I mean is, say I want to color the type of the variable in this Rust snippet:

let my_var: MyType = create_new_mytype();

I'd need to define .tokenize in a very basic way - for example, break on every word boundary.
Then I'd need to, in .identify("MyType"), check the previous token and see that it's :, then check the one before and see that it's a proper variable identifier, then check the one before that and see that it's let, then check the one after and see that it's =...
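Concretely, with the context idea from above it would look something like this (purely illustrative; assumes whitespace is not tokenized and that context exposes the whole token stream):

// Identifying 'MyType' by walking its neighbors; almost every
// check is really about the *other* tokens.
function identify(token, context) {
    var t = context.tokens, i = context.index;
    if (t[i - 1] === ':' &&          // preceded by ':'
        /^\w+$/.test(t[i - 2]) &&    // ...after an identifier
        t[i - 3] === 'let' &&        // ...in a let binding
        t[i + 1] === '=')            // ...followed by '='
        return 'type';
    // ...
}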

So a lot of my logic for identifying a type token is actually logic for identifying the other tokens... Which IMO is a bit strange.

I guess this is just a bit too complex for a basic regex-based parser, and you'd want a proper grammar-based parser for this kind of thing... But then an .identify with context doesn't really serve much purpose IMO: it's unneeded for a simple regex parser and not strong enough for a more complicated one.

> A lot of work remains to be done to make this possible in real time.

I see, I didn't really take into account that these parsers need to work at actual typing speed, so a generic "give me a string and I'll break it into parts" parser just won't be fast enough.

> I'm looking into ways to produce partial parse trees and partial updates, and also investigating fast (best-guess) updates in between full updates.

Yeah, the parser would need to support half-typed, not-yet-recognized tokens, maybe even in the middle of the text... Seems like a pretty hard problem to solve efficiently.


I guess what I'm looking for in the meantime is a way to make it "just work" with my more complex logic, because it simply doesn't fit into the 2-step process... If I could define my own function that updates the colored elements on input, for example, I could use that to at least make it work, even if slowly.

