Comments (17)
Hey, do agree. Autolinker does look for protocols between 3 and 9 characters, so that's why it's getting cut off. Thinking maybe it should just take as many characters as are there
from autolinker.js.
https://github.com/jonschlinkert/remarkable/blob/dev/lib/common/url_schemas.js
Schemes can be longer, and can come without //
from autolinker.js.
Wow, yeah. I'm also not currently including dashes or dots. Any other characters that are accepted that you happen to know of off-hand?
from autolinker.js.
Take a look at the IANA spec and the CommonMark discussion. IANA has a lot of garbage, and I'm not sure all of it is needed. But it has examples for each protocol.
Also, if you plan to rewrite, consider moving the text operations to a separate module. Here is the full list of "questions" we encountered while using it with remarkable.
from autolinker.js.
Good to know. Seems like this is the spec on the scheme name: http://tools.ietf.org/html/rfc3986#section-3.1
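For reference, the grammar in that section is `scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )`, with no upper bound on length. A minimal sketch of a check against that rule follows; `SCHEME_RE` and `isValidScheme` are hypothetical names used only for illustration, not part of Autolinker.

```javascript
// RFC 3986, section 3.1:
//   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
// No length limit, so the pattern does not impose one.
const SCHEME_RE = /^[A-Za-z][A-Za-z0-9+.-]*$/;

// Hypothetical helper: expects the bare scheme name, without the trailing ':'.
function isValidScheme(scheme) {
  return SCHEME_RE.test(scheme);
}

console.log(isValidScheme('http'));         // true
console.log(isValidScheme('view-source'));  // true (dash)
console.log(isValidScheme('z39.50r'));      // true (dot and digits)
console.log(isValidScheme('1http'));        // false (must start with a letter)
```

Whether a scheme that is merely well-formed should be linked at all is a separate question from whether it parses.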
When you say "text operations," are you looking for just the find/replace part? Judging from jonschlinkert/remarkable#108, I gather that you don't need the html parsing part for your purposes?
Also, if I understand correctly from your use of Autolinker in your source code, are you just using it for its parsing capabilities? Would having a separate Parser class help? (Not sure how quickly I'd be able to implement something like this, but if that's your intention then I can try to move in that direction when working on the project :))
from autolinker.js.
Yes, we apply autolinker to the AST. For remarkable it would be best to have a text scanner only (even without replace) + a url generator. That would reduce the output size when browserifying. The most convenient would be to have it in a separate package. But having a separate "requireable" file would be good too, if you keep a stable interface/filename across build version changes (..X).
Also, you have a suspicious bug report about 100% CPU load. That could DoS a server if it really happens.
I don't have the slightest desire to create one more package; I already have tons to develop & maintain. IMHO, yours is the best of all I've seen, but it needs some care.
from autolinker.js.
Thanks, but yeah, it def needs some work. I only get to work on it every so often, so there are def some things that have fallen by the wayside. The 100% cpu bug is def an interesting one. Freezes in the regex engine itself. You're totally right about the possibility of a dos attack though, I'll def look into that asap (have a feeling it has to do with the part I added to handle <!doctype> tags)
Would love to help you out in general though, and save you from writing your own. I'll try to devote some time this week to fixing a few of these. The one thing I'm unfamiliar with though is non-english chars in urls. Do you have any experience with this?
from autolinker.js.
Hey, FYI, just published v0.14.0, which should match the full scheme names. Give that a try and let me know.
from autolinker.js.
The one thing I'm unfamiliar with though is non-english chars in urls. Do you have any experience with this?
Full support, including surrogate chars, is not easy, but it is possible.
- using XRegExp would solve the problem, but would significantly increase the size for the browser.
- it's possible to extract the expression for the Letter class (~2kb), and do/generate the other things manually.
- see a more generic validation expression with international chars support: https://github.com/flatiron/revalidator/blob/master/lib/revalidator.js#L152 (it will not work as-is here, but it gives an idea of what's involved)
- search "javascript regexp unicode" to find info about all the caveats.
Personally, I don't like recursive expressions, because those are too easy to DoS. For example, with long patterns like ((((((((((((((((((((((((((((((((((((((((((((((((((((((())))))))))))))))))))))))))))))))))))))))))))). Combined with unicode support, it may be easier to write the scanner manually.
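To illustrate the "write the scanner manually" idea, here is a minimal sketch of a single-pass scanner that tracks parenthesis depth with a counter instead of nested regex groups. The function name `scanUrlEnd` and the choice of terminating characters are assumptions made for this example; none of it comes from Autolinker or remarkable.

```javascript
// Hypothetical helper: returns the index just past the URL that starts at `start`.
// One left-to-right pass with no backtracking, so the cost is linear in the
// input length even for deeply nested parentheses.
function scanUrlEnd(text, start) {
  let depth = 0;
  let i = start;
  for (; i < text.length; i++) {
    const ch = text[i];
    // Assumption: whitespace and angle brackets always end a URL.
    if (/\s/.test(ch) || ch === '<' || ch === '>') break;
    if (ch === '(') {
      depth++;
    } else if (ch === ')') {
      if (depth === 0) break; // unmatched ')' belongs to the surrounding text
      depth--;
    }
  }
  return i;
}

const text = 'see (http://en.wikipedia.org/wiki/Foo_(bar)) now';
const start = text.indexOf('http');
console.log(text.slice(start, scanUrlEnd(text, start)));
// http://en.wikipedia.org/wiki/Foo_(bar)
```

A long run of opening and closing parentheses only increments and decrements the counter, so it cannot trigger the kind of blow-up a backtracking pattern might.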
https://github.com/ljosa/urlize.js - one more package trying to solve the same problem the correct way, but with similar problems in unicode support. May be useful for a url scanner redesign.
from autolinker.js.
Hey, FYI, just published v0.14.0, which should match the full scheme names. Give that a try and let me know.
Will it match javascript/vbscript? That would be a problem (not for remarkable but for your users). Also keep in mind that those can be masked by URL-encoding: https://github.com/jonschlinkert/remarkable/blob/dev/test/fixtures/remarkable/xss.txt#L73
from autolinker.js.
https://gist.github.com/puzrin/ce95a25581a4d069e173
That's only my attempt to separate the data abstraction levels, plus a summary of known caveats. Written as personal notes, but it may be useful for you too.
Reason why separation can be useful:
- Unicode handling will be kept separate. No need to care about it at the html parse phase
- I doubt that it's possible to implement a secure html parser easily. It would be simpler to use the DOM in the browser, and cheerio for node
- minimal total size for browser
- no additional problems with security (parse quality)
- possibility to provide additional rules & options (like @twitter names replace) without complicating matcher interface and defaults.
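The separation described in this list can be sketched roughly as follows: a text-only matcher that reports where links are, and a separate layer that decides what to do with them. Everything here (`findLinks`, `linkify`, `escapeHtml`, the shape of the match objects, and the deliberately simplistic pattern) is hypothetical and only meant to show the layering, not a proposed interface.

```javascript
// Layer 1 (hypothetical): plain-text matcher. Reports matches, replaces nothing.
function findLinks(text) {
  const re = /\bhttps?:\/\/[^\s<>]+/g; // simplistic on purpose, for illustration only
  const matches = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    matches.push({
      index: m.index,                    // where the match starts
      lastIndex: m.index + m[0].length,  // where it ends (exclusive)
      text: m[0],                        // text as it appears in the source
      url: m[0],                         // normalized URL; identical in this sketch
    });
  }
  return matches;
}

function escapeHtml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;')
          .replace(/>/g, '&gt;').replace(/"/g, '&quot;');
}

// Layer 2 (hypothetical): one possible consumer that renders HTML.
// An AST-based consumer could use findLinks() directly and skip this.
function linkify(text) {
  let out = '';
  let pos = 0;
  for (const m of findLinks(text)) {
    out += escapeHtml(text.slice(pos, m.index));
    out += '<a href="' + escapeHtml(m.url) + '">' + escapeHtml(m.text) + '</a>';
    pos = m.lastIndex;
  }
  return out + escapeHtml(text.slice(pos));
}

console.log(linkify('go to http://example.com now'));
// go to <a href="http://example.com">http://example.com</a> now
```

With this split, extra rules such as @twitter names could be added as further matchers without touching the rendering layer.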
from autolinker.js.
Will it match javascript/vbscript?
Yes :( I guess I should explicitly ignore those.
Interesting info otherwise though. Some thoughts:
- I'm thinking that I wouldn't want to introduce the overhead of a DOM implementation just for html parsing. Really all that Autolinker is interested in from an HTML perspective is to not auto-link within html tags' attributes, and to not auto-link inside any descendant of an <a> tag. One could use an html purifier after the fact for security purposes, if he/she would like to.
- XRegExp looks awesome, thanks for the link. Of course, it is extra size.. I did create an issue a couple of weeks ago to use Google Closure Compiler for minification though, which may reduce the size of Autolinker significantly enough to warrant its use. Otherwise, perhaps extracting just the unicode part might be the way to go.
What I'm thinking is that I'll work on fixing the current bugs, and toward extracting a separate module that performs the search/replace. That part should definitely be able to be abstracted. Might make it easy to introduce the unicode char support at that time too.
from autolinker.js.
Maybe, as a temporary solution, add an option to skip the html parse? I guess the regexp bug is not in the matcher.
from autolinker.js.
Hey, so I fixed the initial issue of this ticket in v0.14.1. Going to close this ticket for now, and will open new ones for the other items.
For skipping the HTML parse, I'm thinking that I would prefer not to add another option, but give me a day or two, I might be able to figure out this regex bug. Otherwise, I definitely want to give you a straight interface to the search/replace functionality that will be independent of the html parsing.
from autolinker.js.
No prob.
Just for info, sindresorhus recommended this package: https://github.com/webmodules/urlregexp . It's a bit big (5kb gzipped), but very useful for understanding IDN support and logic.
This could be rewritten to parse without regexp, to optimize speed.
from autolinker.js.
Wow, yeah, quite a regex. Def could help though, thanks.
from autolinker.js.
Hi.
I see you guys here trying to filter dangerous protocols like "javascript" and "vbscript". Currently this is done by this code: return ( uriScheme !== 'javascript:' && uriScheme !== 'vbscript:' );
I don't think this is a mature solution. I believe the only way to deal with scheme filtering is a whitelist (maybe customizable by the user). E.g. look at this scary big list: https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet (all those case-insensitivity and meaningless-character-insertion techniques). I guess many of them don't work in modern browsers, but it's better to be safe than sorry.
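A minimal sketch of what such a whitelist check might look like, assuming the candidate link is available as a string. The names `ALLOWED_SCHEMES` and `isAllowedHref` and the list of schemes are illustrative assumptions, not Autolinker's actual code, and this is not a complete defense against every trick on that cheat sheet.

```javascript
// Assumption: an illustrative allow-list; a real one would be user-configurable.
const ALLOWED_SCHEMES = new Set(['http', 'https', 'ftp', 'mailto']);

// Hypothetical helper: true only if the link's scheme is on the allow-list.
function isAllowedHref(href) {
  let decoded;
  try {
    decoded = decodeURIComponent(href); // unmask e.g. "javascript%3Aalert(1)"
  } catch (e) {
    return false; // malformed percent-escape: reject rather than guess
  }
  // Drop ASCII control characters and whitespace, which can be inserted
  // into a scheme name to dodge naive string comparisons.
  const cleaned = decoded.replace(/[\u0000-\u0020\u007f]/g, '');
  const match = /^([A-Za-z][A-Za-z0-9+.-]*):/.exec(cleaned);
  if (!match) return false; // no scheme at all: this sketch rejects it
  return ALLOWED_SCHEMES.has(match[1].toLowerCase()); // case-insensitive compare
}

console.log(isAllowedHref('https://example.com'));    // true
console.log(isAllowedHref('JaVaScRiPt:alert(1)'));    // false
console.log(isAllowedHref('javascript%3Aalert(1)'));  // false
console.log(isAllowedHref('jav\tascript:alert(1)'));  // false
```

Unlike the blacklist comparison quoted above, an unknown or disguised scheme fails closed here: anything that does not normalize to a listed scheme is rejected.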
from autolinker.js.