Strange protocol cut about autolinker.js HOT 17 CLOSED

puzrin commented on May 29, 2024
Strange protocol cut

from autolinker.js.

Comments (17)

gregjacobs commented on May 29, 2024

Hey, I do agree. Autolinker does look for protocols between 3 and 9 characters, so that's why it's getting cut off. Thinking maybe it should just take as many characters as are there.

puzrin commented on May 29, 2024

https://github.com/jonschlinkert/remarkable/blob/dev/lib/common/url_schemas.js

Schemes can be longer, and can come without //.

gregjacobs commented on May 29, 2024

Wow, yeah. I'm also not currently including dashes or dots. Any other characters that are accepted that you happen to know of off-hand?

puzrin commented on May 29, 2024

Take a look at the IANA spec and the CommonMark discussion. IANA has a lot of garbage, and I'm not sure all of it is needed. But it has examples for each protocol.

Also, if you plan a rewrite, consider moving the text operations to a separate module. Here is the full list of "questions" we encountered while using it with remarkable.

gregjacobs commented on May 29, 2024

Good to know. Seems like this is the spec on the scheme name: http://tools.ietf.org/html/rfc3986#section-3.1
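
As a rough illustration of the difference (a sketch, not Autolinker's actual source): `boundedScheme` below is a hypothetical pattern with the 3 to 9 letter limit mentioned earlier in the thread, while `rfcScheme` follows the RFC 3986 section 3.1 grammar, `ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )`. The helper name `extractScheme` is made up for this example.

```javascript
// Hypothetical bounded pattern: 3 to 9 letters followed by a colon.
// This only illustrates the length limit; it is not copied from Autolinker.
const boundedScheme = /[A-Za-z]{3,9}:/;

// RFC 3986, section 3.1: a letter, then any number of letters,
// digits, "+", "-" or ".", followed by the ":" delimiter.
const rfcScheme = /^[A-Za-z][A-Za-z0-9+.-]*:/;

// Returns the matched scheme (including the colon), or null if none is found.
function extractScheme(text, pattern) {
  const match = pattern.exec(text);
  return match ? match[0] : null;
}
```

With an input such as `chrome-extension://abc`, the bounded pattern can only latch onto `extension:`, whereas the RFC-based one takes the whole `chrome-extension:`.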

When you say "text operations," are you looking for just the find/replace part? Judging from jonschlinkert/remarkable#108, I gather that you don't need the html parsing part for your purposes?

Also, if I understand correctly from your use of Autolinker in your source code, are you just using it for its parsing capabilities? Would having a separate Parser class help? (Not sure how quickly I'd be able to implement something like this, but if that's your intention then I can try to move in that direction when working on the project :))

puzrin commented on May 29, 2024

Yes, we apply autolinker on the AST. For remarkable it would be best to have a text scanner only (even without replace) plus a URL generator. That will reduce output size on browserification. The most convenient would be to have it in a separate package. But having a separate "requireable" file would be good too, if you keep a stable interface/filename across build version changes (..X).
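
A scanner-only interface of the kind described here might look roughly like the following. Everything in it is hypothetical: the name `scanLinks`, the shape of the returned objects, and the deliberately naive pattern are assumptions for illustration, not Autolinker's API. It only reports where links are and what URL each would point to, leaving replacement to the caller (for example, an AST walker).

```javascript
// Naive link scanner: finds http(s):// and www. links in plain text.
// Returns positions and generated URLs, but performs no replacement.
function scanLinks(text) {
  const pattern = /\b(?:https?:\/\/|www\.)[^\s<>]+/gi;
  const matches = [];
  for (const m of text.matchAll(pattern)) {
    // Drop trailing punctuation that most likely belongs to the sentence.
    const raw = m[0].replace(/[.,;:!?]+$/, '');
    matches.push({
      index: m.index,                  // start offset in the input
      lastIndex: m.index + raw.length, // end offset (exclusive)
      text: raw,                       // the text as it appeared
      // "URL generator" part: add a scheme to bare www. links.
      url: /^www\./i.test(raw) ? 'http://' + raw : raw
    });
  }
  return matches;
}

// Usage: the caller decides what to do with each match.
const found = scanLinks('See www.example.com, or https://a.b/c.');
```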

Also, you have a suspicious bug about 100% CPU load. That can DoS a server if it really happens.

I don't have the slightest desire to create one more package; I already have tons to develop and maintain. IMHO, yours is the best of all I've seen, but it needs some care.

gregjacobs commented on May 29, 2024

Thanks, but yeah, it def needs some work. I only get to work on it every so often, so there are def some things that have fallen by the wayside. The 100% cpu bug is def an interesting one. Freezes in the regex engine itself. You're totally right about the possibility of a dos attack though, I'll def look into that asap (have a feeling it has to do with the part I added to handle <!doctype> tags)

Would love to help you out in general though, and save you from writing your own. I'll try to devote some time this week to fixing a few of these. The one thing I'm unfamiliar with though is non-english chars in urls. Do you have any experience with this?

gregjacobs commented on May 29, 2024

Hey, FYI, just published v0.14.0, which should match the full scheme names. Give that a try and let me know.

puzrin commented on May 29, 2024

> The one thing I'm unfamiliar with though is non-english chars in urls. Do you have any experience with this?

Full support, including surrogate chars, is not easy, but it is possible.

  • Using XRegExp will solve the problem, but will significantly increase the size for the browser.
  • It's possible to extract the expression for the Letter class (~2 KB) and do/generate the other things manually.
  • See a more generic validation expression with international chars support: https://github.com/flatiron/revalidator/blob/master/lib/revalidator.js#L152 (it will not be OK here, but it gives an idea of what happens).
  • Search "javascript regexp unicode" to find info about all the caveats.
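
For readers on newer engines: since ES2018, JavaScript regular expressions support Unicode property escapes with the `u` flag, so the Letter class no longer has to be extracted by hand. A minimal sketch, assuming such an engine; the function name `isUnicodeLabel` and the choice of letters, digits and inner hyphens as the allowed characters are illustrative assumptions, not a full IDN validator.

```javascript
// Matches a single domain-like label made of Unicode letters and digits,
// with hyphens allowed only in the middle. The `u` flag makes the engine
// work on code points, so characters outside the BMP (surrogate pairs in
// UTF-16) count as one letter.
const unicodeLabel = /^[\p{L}\p{N}](?:[\p{L}\p{N}\-]*[\p{L}\p{N}])?$/u;

function isUnicodeLabel(label) {
  return unicodeLabel.test(label);
}
```

This trades bundle size for a dependency on a recent engine, which is a different trade-off from shipping XRegExp or a generated Letter class.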

Personally, i don't like recursive expressions, because those are too easy to ddos. For example, with long patterns like ((((((((((((((((((((((((((((((((((((((((((((((((((((((())))))))))))))))))))))))))))))))))))))))))))). In pair with unicode support it can be more easy to write scanner manually.
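
To illustrate the manual-scanner idea: nesting can be tracked with a plain counter in a single left-to-right pass, which stays linear no matter how deep the parentheses go. This is only a sketch; `scanBalanced` is a hypothetical helper, and a real URL scanner would need far more rules than whitespace and parentheses.

```javascript
// Scans from `start` and returns the longest run that stops at whitespace
// or at a closing parenthesis that has no matching opener inside the run.
// Each character is visited once, so there is no backtracking.
function scanBalanced(text, start = 0) {
  let depth = 0;
  let end = start;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (/\s/.test(ch)) break;
    if (ch === '(') {
      depth++;
    } else if (ch === ')') {
      if (depth === 0) break; // unmatched closer belongs to the surrounding text
      depth--;
    }
    end = i + 1;
  }
  return text.slice(start, end);
}

// A deeply nested input is handled in one pass.
const deep = '('.repeat(50000) + ')'.repeat(50000);
const scanned = scanBalanced(deep);
```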

https://github.com/ljosa/urlize.js is one more package that tries to solve the same problem in the correct way, but with similar problems in unicode support. It may be useful for a URL scanner redesign.

puzrin commented on May 29, 2024

> Hey, FYI, just published v0.14.0, which should match the full scheme names. Give that a try and let me know.

Will it match javascript/vbscript? That would be a problem (not for remarkable, but for your users). Also keep in mind that those can be masked by URL-encoding: https://github.com/jonschlinkert/remarkable/blob/dev/test/fixtures/remarkable/xss.txt#L73
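
A small sketch of why the encoding matters, under the assumption that some later step in the pipeline decodes the link before it reaches the browser. The names `normalizeForSchemeCheck` and `looksLikeScriptUrl` are hypothetical, and the linked fixture may exercise different encodings than the percent-encoding shown here.

```javascript
// Decodes percent-escapes once, removes tabs/newlines, trims and lowercases,
// so that the scheme check sees roughly what a decoder downstream would see.
function normalizeForSchemeCheck(url) {
  let decoded = url;
  try {
    decoded = decodeURIComponent(url);
  } catch (e) {
    // Malformed escape sequence: fall back to the raw string.
  }
  return decoded.replace(/[\t\n\r]/g, '').trim().toLowerCase();
}

// Denylist check for the two schemes named in this thread.
function looksLikeScriptUrl(url) {
  return /^(?:javascript|vbscript):/.test(normalizeForSchemeCheck(url));
}
```

Note that this decodes only one level, so a doubly encoded input would still slip through; that is one reason a denylist is fragile.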

puzrin commented on May 29, 2024

https://gist.github.com/puzrin/ce95a25581a4d069e173

That's only my attempt to separate the data abstraction levels, plus a summary of known caveats. It was done as a personal memo, but it can be useful for you too.

Reasons why separation can be useful:

  • Unicode handling will be kept separate. No need to care about it at the HTML parse phase.
  • I doubt that it's possible to implement a secure HTML parser easily. It would be simpler to use the DOM in the browser, and cheerio for node:
    • minimal total size for the browser
    • no additional problems with security (parse quality)
  • The possibility to provide additional rules and options (like @twitter name replacement) without complicating the matcher interface and defaults.

gregjacobs commented on May 29, 2024

> Will it match javascript/vbscript?

Yes :( I guess I should explicitly ignore those.

Interesting info otherwise though. Some thoughts:

  • I'm thinking that I wouldn't want to introduce the overhead of a DOM implementation just for html parsing. Really all that Autolinker is interested in from an HTML perspective is to not auto-link within html tags' attributes, and to not auto-link inside any descendant of an <a> tag. One could use an html purifier after the fact for security purposes, if he/she would like to.
  • XRegExp looks awesome, thanks for the link. Of course, it is extra size.. I did create an issue a couple of weeks ago to use Google Closure Compiler for minification though, which may reduce the size of Autolinker significantly enough to warrant its use. Otherwise, perhaps extracting just the unicode part might be the way to go.
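
The two HTML constraints from the first point (skip tag attributes, skip anything inside an `<a>` element) can be sketched with a small tag splitter. This is an assumption-laden illustration rather than Autolinker's parser: `linkableTextRuns` is a hypothetical name, and the tag pattern is simplified (for instance, it would break on a `>` inside an attribute value and ignores comments and doctypes).

```javascript
// Splits markup into tags and text, and returns only the text runs that are
// outside of tags and not nested inside an <a> element.
function linkableTextRuns(html) {
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  const runs = [];
  let anchorDepth = 0; // how many <a> elements are currently open
  let cursor = 0;      // end of the previous tag
  let m;
  while ((m = tagPattern.exec(html)) !== null) {
    if (anchorDepth === 0 && m.index > cursor) {
      runs.push(html.slice(cursor, m.index));
    }
    if (m[2].toLowerCase() === 'a') {
      anchorDepth = m[1] === '/' ? Math.max(0, anchorDepth - 1) : anchorDepth + 1;
    }
    cursor = tagPattern.lastIndex;
  }
  if (anchorDepth === 0 && cursor < html.length) {
    runs.push(html.slice(cursor));
  }
  return runs;
}

const runs = linkableTextRuns(
  'go to <b>x.com</b> or <a href="http://y.com">y.com <i>now</i></a> end'
);
```

Only the returned runs would then be handed to the link matcher, so attribute values and existing anchors are never touched.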

What I'm thinking is that I'll work on fixing the current bugs, and toward extracting a separate module that performs the search/replace. That part should definitely be able to be abstracted. Might make it easy to introduce the unicode char support at that time too.

puzrin commented on May 29, 2024

Maybe, as a temporary solution, add an option to skip the HTML parse? I guess the regexp bug is not in the matcher.

gregjacobs commented on May 29, 2024

Hey, so I fixed the initial issue of this ticket in v0.14.1. Going to close this ticket for now, and will open new ones for the other items.

For skipping the HTML parse, I'm thinking that I would prefer not to add another option, but give me a day or two, I might be able to figure out this regex bug. Otherwise, I definitely want to give you a straight interface to the search/replace functionality that will be independent of the html parsing.

puzrin commented on May 29, 2024

No prob.

Just for info, sindresorhus advised this package: https://github.com/webmodules/urlregexp . It's a bit big (5 KB gzipped), but very useful for understanding IDL support and logic.

This can be rewritten to parse without regexp, to optimize speed.

gregjacobs commented on May 29, 2024

Wow, yeah, quite a regex. Def could help though, thanks.

Kagami commented on May 29, 2024

Hi.
I see you guys are trying to filter dangerous protocols like "javascript" and "vbscript". Currently this is done by this code: return ( uriScheme !== 'javascript:' && uriScheme !== 'vbscript:' );
I don't think this is a mature solution. I believe the only way to deal with scheme filtering is a whitelist (maybe customizable by the user). E.g. look at this scary big list: https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet (all those case-insensitivity and meaningless-character-insertion techniques). I guess many of them don't work in modern browsers, but it's better to be safe than sorry.
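
A whitelist check along these lines could be sketched as follows. The default list in `ALLOWED_SCHEMES`, the function name `isAllowedScheme`, and the optional `allowed` parameter are assumptions for illustration, not something Autolinker provides.

```javascript
// Schemes that are linked by default; callers may pass their own set.
const ALLOWED_SCHEMES = new Set(['http', 'https', 'ftp', 'mailto']);

// Returns true only when the input starts with a scheme from the allowed set.
// Inputs without any scheme are rejected here; a caller that wants to link
// bare domains would handle them separately.
function isAllowedScheme(url, allowed = ALLOWED_SCHEMES) {
  // URL parsers drop tabs and newlines and ignore leading control
  // characters and spaces, so remove them before looking at the scheme.
  const cleaned = url
    .replace(/[\t\n\r]/g, '')
    .replace(/^[\u0000-\u0020]+/, '');
  const match = /^([A-Za-z][A-Za-z0-9+.-]*):/.exec(cleaned);
  if (!match) return false;
  // Scheme names are case-insensitive, so compare in lowercase.
  return allowed.has(match[1].toLowerCase());
}
```

Unlike the two-entry comparison quoted above, anything not explicitly listed is refused, so new or obfuscated scheme spellings fail closed.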
