Comments (11)
When I think about it, I'm not quite sure why tokenize and identify are different steps - don't you know the type of the token while you produce it? Seems strange to me, I might be missing something.
from ldt.
Hi @SuperCuber thanks for sharing your thoughts.
For performance reasons, the current parser does not support look behind or look ahead regular expressions.
You might be able to work around this using the CSS sibling selector, ie
.tagKey + .tagValue {
color: red;
}
Let me know if this works for you and I can add this to the examples.
When I think about it, I'm not quite sure why tokenize and identify are different steps - don't you know the type of the token while you produce it? Seems strange to me, I might be missing something.
We construct a single RegExp from the individual rules, which allows for very fast tokenization, but does not tell us which source rule produced the token. We then run identify only on the tokens that changed. This was the most performant model explored when originally developing this library in ~2010, though I am certainly open to exploring other options.
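To illustrate the model described above, here is a rough sketch (hypothetical rule names, not LDT's actual implementation) of tokenizing with one combined RegExp and then re-testing each substring individually to recover its type:

```javascript
// Sketch of the two-step model: tokenize with one alternation built from
// all rules (fast, but the match alone does not say which rule produced
// it), then identify each substring against the individual rules.
const rules = {
  number: /\d+/,
  word: /[a-zA-Z]+/,
  whitespace: /\s+/
};

const tokenizer = new RegExp(
  Object.values(rules).map(r => r.source).join('|'), 'g'
);

function tokenize(input) {
  return input.match(tokenizer) || [];
}

// Re-test each substring on its own to determine its type.
function identify(token) {
  for (const [type, rule] of Object.entries(rules)) {
    if (new RegExp('^(?:' + rule.source + ')$').test(token)) return type;
  }
  return 'unknown';
}

const tokens = tokenize('foo 42 bar');
console.log(tokens.map(t => [t, identify(t)]));
// → [['foo','word'], [' ','whitespace'], ['42','number'],
//    [' ','whitespace'], ['bar','word']]
```

Because identify only re-tests a changed substring, unchanged tokens keep their previous type for free.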
I'm also exploring performant grammar parsing, which may also address your needs and open the door to all sorts of advanced introspection capabilities.
The focus of this library is to be lightweight and fast enough to do live highlighting on low end machines, so any enhancements that have the potential to impact file size or performance must be optional. We can include multiple parsing models and let the end developer choose one which fits the performance and capability requirements of their project.
We can include multiple parsing models and let the end developer choose one which fits the performance and capability requirements of their project.
I think this is the way to go. The most straightforward way to do this that I see is to add a method that will do both steps (that on Parser will just call tokenize then identify, giving basically no performance loss + backwards compatibility for custom parsers).
Then, if you choose to implement new parsers, you can do it in a new file, making the whole set of changes opt-in not only on the api level but also on the file size level.
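A sketch of what that opt-in wrapper could look like (Parser, tokenize, and identify here are simplified stand-ins, not the library's real API):

```javascript
// Hypothetical sketch of the proposed convenience method: a parse() that
// simply chains the two existing steps, so custom parsers that override
// tokenize/identify keep working unchanged.
class Parser {
  tokenize(input) {
    // split on whitespace runs as a stand-in for the real tokenizer
    return input.split(/(\s+)/).filter(s => s.length > 0);
  }

  identify(token) {
    return /^\d+$/.test(token) ? 'number' : 'other';
  }

  // the proposed combined step: no extra work, just composition
  parse(input) {
    return this.tokenize(input).map(t => ({ token: t, type: this.identify(t) }));
  }
}

const p = new Parser();
console.log(p.parse('a 1'));
// → [{token:'a', type:'other'}, {token:' ', type:'other'},
//    {token:'1', type:'number'}]
```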
The most straightforward way to do this that I see is to add a method that will do both steps (that on Parser will just call tokenize then identify)
I appreciate your input. I believe we will need to keep these separate so that we are not running identify in cases where tokens have not changed. Let me do some deeper thinking on this.
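The "only re-run identify on changed tokens" idea could be sketched roughly like this (hypothetical code, assuming tokens are compared by position):

```javascript
// Compare the new token list against the previous pass and reuse the
// cached type wherever the substring at that position is unchanged;
// identify runs only on tokens that actually changed.
function reidentify(oldTokens, newTokens, identify) {
  return newTokens.map((token, i) => {
    const prev = oldTokens[i];
    if (prev && prev.token === token) return prev; // unchanged: reuse
    return { token, type: identify(token) };       // changed: re-identify
  });
}

const identify = t => (/^\d+$/.test(t) ? 'number' : 'word');
const pass1 = reidentify([], ['foo', '42'], identify);
const pass2 = reidentify(pass1, ['foo', '43'], identify);
// only '43' is re-identified on the second pass; 'foo' is reused
```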
Did you try the workaround using CSS sibling selectors? Let me know if this is a viable strategy or if there are issues I am not seeing with this approach.
Did you try the workaround using CSS sibling selectors?
It's going to be a bit fiddly. For example, if I wanted to color only the text inside the brackets, I'm not sure how I would do that: I want the match to continue until the closing bracket, but a regex of [^\)]+ would match stuff from the start of the whole text...
And if I tried something like having a .word class, that wouldn't help much either, since I can't say .tagKey + .leftParen + .word* in CSS.
Also, there isn't a way in CSS to look ahead (according to this)
Let me do some deeper thinking on this
So did you come up with some ideas? I'm interested in this feature.
Yes, I had a few ideas I'm experimenting with. It might take me a bit due to work and life; I'm pretty booked up for the next week or so.
One idea is to pass identify some sort of context, such as the preceding and following tokens, the position in the original string, or the index in the token stream. I'd want to test these ideas for functionality and performance in most of the supported browsers.
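One possible shape for such a context-aware identify (purely illustrative; this API does not exist yet):

```javascript
// Hypothetical context-aware identify: the parser hands each substring
// its neighbors and index, so rules can look around without needing
// lookbehind/lookahead in the tokenizing regex itself.
function identify(token, context) {
  const { after } = context; // substrings following this token
  // example rule: a word directly followed by ':' is a key
  if (/^\w+$/.test(token) && after[0] === ':') return 'tagKey';
  if (/^\w+$/.test(token)) return 'word';
  return 'other';
}

const tokens = ['color', ':', 'red'];
const types = tokens.map((t, i) =>
  identify(t, { before: tokens.slice(0, i), after: tokens.slice(i + 1), index: i })
);
console.log(types); // → ['tagKey', 'other', 'word']
```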
Feel free to share any of your own ideas. I'll hopefully have something to roll out by the end of the month.
preceding and following tokens
I assume that at least the following tokens are not .identify-ed yet, so if you wanted to do that you'd need to re-implement identifying those tokens inside your identify function, which is weird...
the position in the original string
I don't think that helps much unless you expect the implementation of the function to then go to the original string and try breaking it into tokens on its own, then locate the tokens it wants to check based on the character's index? That seems really convoluted and inefficient to me
I'm also exploring performant grammar parsing, which may also address your needs and open the door to all sorts of advanced introspection capabilities.
This seems interesting, can you elaborate on how you imagine it would be used (say I want to write my own grammar)?
Maybe you can leverage an existing library for it or even allow users to somehow hook up their library/implementation of choice...
preceding and following tokens
I assume that at least the following tokens are not .identify-ed yet, so if you wanted to do that you'd need to re-implement identifying those tokens inside your identify function, which is weird...
There may be a misunderstanding in how the process works. tokenize is responsible for breaking up the original string into an array of substrings. Then the (unidentified) substrings are individually passed to identify, which determines the token type.
tokenize should already be able to handle look behind/ahead, but the identify step will fail because it lacks context. So the idea is to provide the substring to identify along with the substrings that surround it. The result of identify does not depend on other results, so it doesn't matter whether we run identify on the surrounding substrings first, or at all.
the position in the original string
I don't think that helps much unless you expect the implementation of the function to then go to the original string and try breaking it into tokens on its own, then locate the tokens it wants to check based on the character's index? That seems really convoluted and inefficient to me
We can leverage RegExp to test for a match at a given position. This is just one idea; it will certainly need to be evaluated for performance.
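In JavaScript this can be done with the RegExp sticky flag, which anchors a test at a specific index. A small illustration of the idea (not LDT code):

```javascript
// Test whether a pattern matches exactly at a given position in the
// original string, using the sticky flag ('y') plus lastIndex.
function matchesAt(source, pattern, position) {
  const re = new RegExp(pattern.source, 'y'); // sticky: match only at lastIndex
  re.lastIndex = position;
  return re.test(source);
}

const source = 'tagKey: tagValue';
console.log(matchesAt(source, /:\s*/, 6)); // true: ':' sits at index 6
console.log(matchesAt(source, /:\s*/, 0)); // false: index 0 is 't'
```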
I'm also exploring performant grammar parsing, which may also address your needs and open the door to all sorts of advanced introspection capabilities.
This seems interesting, can you elaborate on how you imagine it would be used (say I want to write my own grammar)?
Maybe you can leverage an existing library for it or even allow users to somehow hook up their library/implementation of choice...
This is still very much in the exploration phase. A lot of work remains to be done to make this possible in real time. I'm looking into ways to produce partial parse trees and partial updates, and I'm also investigating fast (best guess) updates in between full updates to make this feasible.
the idea is to provide the substring to identify, along with the substrings that surround it. The result of identify does not depend on other results
But it does - if it's not looking at the other substrings then what's the point.
What I mean is, say I want to color the type of the variable in this rust snippet:
let my_var: MyType = create_new_mytype();
I'd need to define .tokenize in a very basic way - for example, break on every word boundary. Then I'd need to, in .identify("MyType"), check the previous token and see that it's :, then check the one before and see that it's a proper variable identifier, and check the one before that and see that it's let, then check the one after and see that it's =...
So a lot of my logic for identifying a type token is actually logic for identifying the other tokens... Which IMO is a bit strange.
I guess this is just a bit too complex for a basic regex-based parser and you'd want proper grammar-based parsers for this kind of thing... But then an .identify with context doesn't really serve much purpose IMO - it's unneeded for a simple regex parser and not strong enough for a more complicated one.
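The awkward backward-chaining described above might look like this in a hypothetical context-aware identify (illustrative only):

```javascript
// To classify 'MyType' in `let my_var: MyType = ...`, identify must
// re-check every neighboring token itself - the point being made is
// that this duplicates the other tokens' identification logic.
function identifyType(token, before, after) {
  return /^[A-Z]\w*$/.test(token) &&
    before[before.length - 1] === ':' &&             // previous token is ':'
    /^[a-z_]\w*$/.test(before[before.length - 2]) && // before that, a variable
    before[before.length - 3] === 'let' &&           // before that, 'let'
    after[0] === '=';                                // next token is '='
}

const tokens = ['let', 'my_var', ':', 'MyType', '='];
console.log(identifyType('MyType', tokens.slice(0, 3), tokens.slice(4))); // true
```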
A lot of work remains to be done to make this possible to do in real time.
I see, I didn't really take into account that these parsers need to work at actual typing speed, so a generic "give me a string and I'll break it into parts" parser just won't be fast enough.
I'm looking into ways to produce partial parse trees and partial updates. Also investigating doing fast (best guess) updates in between full updates to make this possible.
Yeah the parser would need to support half-typed unrecognized (yet) tokens, maybe even in the middle of the text... Seems like a pretty hard problem to do it efficiently.
I guess what I'm looking for in the meantime is a way to make it "just work" with my more complex logic, because it simply doesn't fit into the 2-step process... If I could define my own function that updates the colorful elements on input, for example, I could use that to at least make it work, even slowly.