
Comments (13)

smallstepstoday avatar smallstepstoday commented on August 30, 2024

I probably should have said this already, but I believe the whole unified collection is amazing. Super productive.

Now that I am all caught up...

My project involves working with actual customer websites, parsing them and making recommendations. You are absolutely right as to the spec: only link, style and meta statements. In this case their developers are being a bit creative...

Your screen snapshots show the browsers when JavaScript is disabled, and I have confirmed that rehype-parse matches the browser in that instance. However, when JavaScript is enabled the output looks the same as what I showed earlier. See this output from Safari 13.1 (15609.1.20.111.8):
[Screenshot: Safari output]

There are two different modes applied to the parsing based on whether JavaScript is active. The difference lies, at the very least, with the noscript tag. When JavaScript is allowed, the content of the noscript tag, valid or invalid, is ignored, resulting in a different structure. How will rehype-parse reconcile these differences? Should there be a configuration flag to select which mode to use?

from rehype.

wooorm avatar wooorm commented on August 30, 2024

👋

Have you tried opening your input file in a browser? I believe that gives the same behavior as rehype, no?


smallstepstoday avatar smallstepstoday commented on August 30, 2024

Yep. Nope. 🙂
[Screenshot]


wooorm avatar wooorm commented on August 30, 2024

Are you using an out-of-date Chrome? Or maybe Chrome on your platform does not parse HTML correctly?

Here are SF, Ch, and FF:
[Three screenshots, taken 2020-04-26]


And here is the spec, which, for the content model of noscript in head, says: “When scripting is disabled, in a head element: in any order, zero or more link elements, zero or more style elements, and zero or more meta elements.” Text nodes other than inter-element whitespace are not allowed.
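
To make the quoted content model concrete, here is a minimal sketch (not part of rehype) that checks the children of a noscript node against it. It assumes hast-shaped plain objects (`type`, `tagName`, `children`, `value`). The helper name `isValidHeadNoscript`, the sample trees, and the decision to accept comments are all assumptions made for illustration.

```javascript
// Elements the spec allows inside <noscript> when it appears in <head>
// and scripting is disabled: link, style, and meta.
const allowedInHeadNoscript = new Set(['link', 'style', 'meta'])

// Inter-element whitespace consists of ASCII whitespace only.
const interElementWhitespace = /^[ \t\n\f\r]*$/

// Hypothetical helper: true if every child of the given hast-like
// `noscript` node fits the content model quoted above.
function isValidHeadNoscript(node) {
  return node.children.every((child) => {
    if (child.type === 'element') return allowedInHeadNoscript.has(child.tagName)
    if (child.type === 'text') return interElementWhitespace.test(child.value)
    // Assumption: comments are treated as harmless and allowed.
    return child.type === 'comment'
  })
}

// Sample trees (made up for this sketch).
const valid = {
  type: 'element',
  tagName: 'noscript',
  properties: {},
  children: [
    {type: 'text', value: '\n  '},
    {type: 'element', tagName: 'link', properties: {rel: ['stylesheet'], href: 'no-js.css'}, children: []}
  ]
}

const invalid = {
  type: 'element',
  tagName: 'noscript',
  properties: {},
  children: [{type: 'element', tagName: 'img', properties: {src: 'pixel.gif'}, children: []}]
}

console.log(isValidHeadNoscript(valid)) // true
console.log(isValidHeadNoscript(invalid)) // false
```

A tree produced by a spec-compliant parse with scripting disabled should never contain the `invalid` shape, because the parser closes the noscript (and the head) as soon as it meets a disallowed token. A check like this is mainly useful for trees built by hand or by other tools.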


wooorm avatar wooorm commented on August 30, 2024

Thank you!! ✨

We do indeed parse with scripting turned off. The reason for that is: well, we don’t support scripting.
Whether scripting is enabled or not, the HTML is still incorrect, but browsers ignore it when scripting is turned on, and don’t when it’s turned off.

Adding a flag to act as if scripting was turned on, while not supporting scripting, will result in other errors like this. So while it may help you in this case, it will cause other problems.

My project involves working with actual customer websites, parsing them and making recommendations.

What kind of recommendations would you like to give here, that you can’t now?


smallstepstoday avatar smallstepstoday commented on August 30, 2024

It is not essential for the tool to support scripting, only that it support scripting as a valid mode for parsing. The resulting DOM, and therefore the resulting AST, is different depending on whether the assumed context supports scripting or not.

The WHATWG spec has a scripting flag to indicate whether scripting is enabled at the time of parsing. It follows then that any tool that parses according to the spec should support the same...

The scripting flag can be enabled even when the parser was created as part of the HTML fragment parsing algorithm, even though script elements don't execute in that case.

Elsewhere it states:

All these contortions are required because, for historical reasons, the noscript element is handled differently by the HTML parser based on whether scripting was enabled or not when the parser was invoked.

Relative to parsing HTML fragments:

If the scripting flag is enabled, switch the tokenizer to the RAWTEXT state. Otherwise, leave the tokenizer in the data state.

This demonstrates the importance of handling the parsing based on whether scripting is assumed to be supported. Given that these instructions cover the tokenization, parsing and tree construction, it is not out of scope regardless of whether the tool supports scripting.
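
As a rough illustration of the fragment-parsing step quoted above, the sketch below picks the tokenizer's starting state from the context element and the scripting flag. The function name `initialTokenizerState` and the string labels for the states are assumptions of this sketch. It also assumes the context element is in the HTML namespace.

```javascript
// Starting tokenizer state for the HTML fragment parsing algorithm,
// chosen from the context element's tag name (HTML namespace assumed).
function initialTokenizerState(contextTagName, scriptingEnabled) {
  switch (contextTagName) {
    case 'title':
    case 'textarea':
      return 'RCDATA'
    case 'style':
    case 'xmp':
    case 'iframe':
    case 'noembed':
    case 'noframes':
      return 'RAWTEXT'
    case 'script':
      return 'script data'
    case 'noscript':
      // The step quoted above: only here does the scripting flag matter.
      return scriptingEnabled ? 'RAWTEXT' : 'data'
    case 'plaintext':
      return 'PLAINTEXT'
    default:
      return 'data'
  }
}

console.log(initialTokenizerState('noscript', true)) // RAWTEXT
console.log(initialTokenizerState('noscript', false)) // data
```

In the RAWTEXT state the tokenizer emits everything up to the closing tag as characters, which is why a scripting-enabled parse yields a single text node inside noscript instead of child elements.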

There is nothing wrong with leaving the default state as scripting disabled in order to prevent breaking changes. Adding a flag to enable script-enabled parsing would be beneficial to users because the output can be adjusted to check both conditions, and would make the parser even more standards-compliant.

To answer your question: I develop heuristics based on a number of factors. An error I see creeping in occasionally relates to this tag. This is either due to practices like the one that sparked this conversation, or because there is a lost opportunity to brand a site for visitors who choose to do without JavaScript. When the parsing is done in an environment where scripting is enabled, the resulting analysis is impacted. Using the AST to perform the analysis enables me to spot multiple issues using the same tool chain. That only works if the AST represents expected results.


wooorm avatar wooorm commented on August 30, 2024

Adding a flag to enable script-enabled parsing would be beneficial to users because the output can be adjusted to check both conditions.

While adding the option is technically trivial, the implications aren’t trivial: because now the tree can be either in scripting, or in non-scripting. So plugins won’t be able to inspect or inject in noscripts.

As your goal is to inspect the tree, I’m not sure that acting as if we support scripting will allow you to inspect more. Instead, the parser would see elements as a string, whereas the current behavior is to expose them as nodes. That sounds like you’d be able to inspect less.

or because there is a lost opportunity to brand a site for visitors who choose to do without JavaScript

<noscript> is allowed in <body> too, it doesn’t have to be in <head>. That opportunity is still there if HTML is valid?


smallstepstoday avatar smallstepstoday commented on August 30, 2024

As I said before, this is an amazing collection of tools. And it is definitely more productive than many others. You are ultimately deciding what issues this library addresses and which issues it does not. This may be one that isn't. I am okay if that is your decision.

So plugins won’t be able to inspect or inject in noscripts.

I am not sure I agree as long as the behavior is documented and consistent. Most likely plugins are not going to inject invalid HTML anyway, so this may be moot.

That sounds like you’d be able to inspect less.

I think that depends on your purpose in creating the AST in the first place. Providing a flag, especially if doing so is trivial, opens this tool to additional use cases beyond transformation, etc.

The branding opportunity is lost when the tag is in the body. There are different concerns when the tag is in the head. There the issue has more to do with validity of the allowed tags and whether the combined head content is valid within a scripting-enabled environment.

We can continue to debate the virtues and pitfalls of adding a flag, but that isn't going to benefit anyone. You do great work. You decide how this library handles this, and what you gain or lose by not doing so.


wooorm avatar wooorm commented on August 30, 2024

Thank you!

I am not sure I agree as long as the behavior is documented and consistent. Most likely plugins are not going to inject invalid HTML anyway, so this may be moot.

For example a linter that checks noscripts, or for links in body, which with this flag would no longer be able to (fully) do so.

Providing a flag [...] opens this tool to additional use cases beyond transformation, etc.

Such as?

The branding opportunity is lost when the tag is in the body.

How? You can still put a heading in a body, as the first thing, for users without scripting turned on? I don’t get this, could you provide an example of what you are doing in noscript in head, that can’t work with valid HTML?

We can continue to debate the virtues and pitfalls of adding a flag, but that isn't going to benefit anyone

Communication is exactly the way to benefit everyone. We (and many other open source projects) follow consensus seeking; to summarize: “In consensus-seeking, discussion of potential proposals is held first, followed by the framing of a solution, and then modifying it until the group reaches a consensus if there is no longer any disagreement.”


smallstepstoday avatar smallstepstoday commented on August 30, 2024

There is no penalty for implementing a scripting flag since the existing behavior accurately mirrors the non-scripting mode. Adding a scripting mode changes the output only for those who use the flag. Since parse5 already supports the flag, the implementation is, as you stated before, trivial.

I get why the decision was made to parse in non-scripting mode: it causes the links, styles and meta tags to be added to the AST. It is not difficult to create routines to map the AST to reflect a scripting environment's view of the document, but it should not be necessary to do so if the tool can do it for me.
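
A minimal sketch of such a mapping routine, under strong simplifying assumptions: it takes a hast-like tree from a non-scripting parse and empties every noscript element, which approximates what a scripting-enabled browser acts on. A real scripting-enabled parse would keep the content as one raw text node, and this sketch does not undo the restructuring caused by invalid content in head. The helper name `dropNoscriptContent` and the sample tree are hypothetical.

```javascript
// Hypothetical helper: returns a new tree in which every <noscript>
// element has no children. The input is not mutated; leaf nodes
// (text, comments) are shared between the two trees.
function dropNoscriptContent(node) {
  if (!Array.isArray(node.children)) return node
  const isNoscript = node.type === 'element' && node.tagName === 'noscript'
  return {
    ...node,
    children: isNoscript ? [] : node.children.map(dropNoscriptContent)
  }
}

// Sample tree (made up for this sketch).
const tree = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'head',
      properties: {},
      children: [
        {type: 'element', tagName: 'title', properties: {}, children: [{type: 'text', value: 'Hi'}]},
        {
          type: 'element',
          tagName: 'noscript',
          properties: {},
          children: [
            {type: 'element', tagName: 'link', properties: {rel: ['stylesheet'], href: 'no-js.css'}, children: []}
          ]
        }
      ]
    }
  ]
}

const scriptingView = dropNoscriptContent(tree)
console.log(scriptingView.children[0].children[1].children.length) // 0
console.log(tree.children[0].children[1].children.length) // 1, original untouched
```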

When I started using rehype I assumed that it would spit out the tags as it found them, making tight error corrections to preserve as much of the underlying structure. That meant something like leaving a text node in the spot where the invalid content was, and posting an error or warning about it.

In working through this issue I learned that it faithfully interprets the document from a non-scripting perspective. That means replicating damage to the head tag as shown in the examples above. If the tool is not merely providing a hypothetical parse, but reflects the actual handling of the document within the browser, then the current solution is partial in that it does not support the scripting flag.

Either way the consequence is trivial. My vote is for more flexibility from the tool, especially when that flexibility comes at a low cost and does not dramatically increase the maintenance cost of the package.

The branding opportunity is lost when the tag is in the body.

What I meant was: when developers don't dress up the noscript in the body and deliver a branded experience, they lose an opportunity to stand out among their competitors.


wooorm avatar wooorm commented on August 30, 2024

Hi again, sorry for the delay.
Thanks for the write up. I believe I understand where you are coming from.

The consequence is trivial.

It is not. If scripting: true were supported, <noscript> would include textual content that, in order to round trip properly, would have to be injected into the resulting HTML document without escaping it. That is a serious security problem. It is also a usability problem, because it is unclear to plugins whether the contents of noscript would need to be parsed or not, which could lead to parsing it twice, and it’s also unclear whether a node should be injected into a <noscript> as a node, or a string.
Every plugin now needs to know about noscript elements and change its handling of them.
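
The escaping concern can be sketched with a toy text serializer. Everything here is hypothetical: `serializeText` is not rehype's serializer, and `scripting` stands for the option being debated in this thread, not a documented rehype option. The sketch only shows why text that must round trip unescaped inside noscript becomes an injection point.

```javascript
// Escape the characters that matter in HTML text content.
function escapeText(value) {
  return value.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
}

// Hypothetical serializer for one text node. With `scripting: true`, the
// content of <noscript> is raw text, so it must be written verbatim to
// round trip; everywhere else, text is escaped.
function serializeText(value, parentTagName, options = {}) {
  const raw = options.scripting === true && parentTagName === 'noscript'
  return raw ? value : escapeText(value)
}

// Untrusted text that some plugin placed inside <noscript> as a text node.
const untrusted = '</noscript><script>alert(1)</script>'

console.log(serializeText(untrusted, 'noscript'))
// &lt;/noscript&gt;&lt;script&gt;alert(1)&lt;/script&gt;

console.log(serializeText(untrusted, 'noscript', {scripting: true}))
// </noscript><script>alert(1)</script>
```

In the second call the text closes the noscript and opens a script element in the output, which is the kind of problem described above.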

What I meant was: when developers don't dress up the noscript in the body and deliver a branded experience, they lose an opportunity to stand out among their competitors.

Can you provide an example (as in, code) of what people do and why it is apparently necessary to author incorrect HTML?


smallstepstoday avatar smallstepstoday commented on August 30, 2024

My turn to apologize for the delay.

Isn't it possible for someone to create an AST with all the attributes you describe and feed it to the downline processes without involving rehype? If it is, then adding the flag to rehype is not going to add any additional risk. After all parse5, which is used under the covers, already supports that flag.

I understand your concern about the plugins having to deal with the difference. I am sure you have a better idea than I do how much impact there would be. Having said that, I don't believe you will be seeing different AST nodes created that aren't already present, making it unlikely to be significant. Plugins that ignore the noscript tag will continue to do so. Those that emit the tag will output whatever is in it, just like today. My comment regarding the triviality had more to do with users of the library than with the implementers. 😀

The responsibility of the user is to ensure that dangerous or invalid content is handled appropriately. Documentation can provide a warning, as can a warning message output to the console. If the behavior is clear, then the user is able to ensure safe handling. It is the same thing in my mind as React's dangerouslySetInnerHTML.

My hope is that you bring the library closer to spec and let users like me worry about the fuzziness around a historically fuzzy tag. Having fidelity across the entire scope will make this library even better.

The need is not to produce the incorrect HTML, but rather to handle it appropriately and according to spec. If you are parroting back in AST form what the author produced, then I am able to highlight the issues and advise. I have encountered at least one company that uses this technique to deter screen scrapers from accessing their site. The error causes a number of parsers to break, defeating the scraper. In fact, the code I added to start is more or less exactly what they have in their site. They use JavaScript to replace the incorrect code when the page is loaded correctly. While screen scraping is not my use case, the result is the same.


wooorm avatar wooorm commented on August 30, 2024

I feel really strongly about the security issues I mentioned before due to introducing an unstable AST, which will lead to XSS vulnerabilities. rehype-parse handles everything according to spec, which means that text in a noscript will close that noscript.

The error causes a number of parsers to break, defeating the scraper

If folks are directly working against browsers and tools like us with incorrect HTML, then, well, what’d you expect 😅

If you are parroting back in AST form what the author produced, then I am able to highlight the issues and advise.

Representing what authors wrote is something different than representing HTML. There are other projects that do this, I believe rehype-webparser might work in your case?


I’m sorry, I understand what you’re looking for, but I don’t think rehype-parse should do it. But, rehype-parse is only a couple lines, and everything in unified is loose modules that you can combine together. So, feel free to add this in userland!

