Comments (19)

jch commented on June 16, 2024

I wonder why doc is a DocumentFragment instead of a Nokogiri::HTML::Document. The line that parses it is https://github.com/jch/html-pipeline/blob/master/lib/html/pipeline.rb#L53. When I search your sample with a Document, it works as expected:

irb(main):018:0> Nokogiri::HTML('hi').search("text()")
=> [#<Nokogiri::XML::Text:0x3fd03a089798 "hi">]

Your implementation works, but I'd be worried about the performance of doc.text == html, both in creating a string object from the doc and in comparing it against the existing value. Another implementation would be to iterate through all the child nodes and only operate on text nodes:

doc.children.each do |node|
  next unless node.text?
  # snip...
end

Thanks for digging in on this bug. Could you open a PR with a test and we can continue the discussion from there?

from html-pipeline.

wideopenspaces commented on June 16, 2024

@jch I will dig deeper. When this wasn't working, I created a test pipeline with only EmojiFilter in it, so I know it wasn't any of the custom filters I built, but it's quite possible I did do something wrong.

If it's not an ID10T error on my part, I'll certainly work up a PR!

jch commented on June 16, 2024

@wideopenspaces any luck?

wideopenspaces commented on June 16, 2024

Work got in the way the last two weeks. I'll see if I can set aside a few hours this week to tackle this. Thanks for reminding me!

Razer6 commented on June 16, 2024

@wickedshimmy Hitting the same issue. Did you have any success?

Razer6 commented on June 16, 2024

@jch Your approach works. I could provide a PR if that would be acceptable.

jch commented on June 16, 2024

@Razer6 👍 a PR would be awesome. I'd be happy to review and test it for compatibility.

jch commented on June 16, 2024

Fixed by #146

aroben commented on June 16, 2024

I ran into this problem too. I think some versions of libxml2 don't return top-level text nodes inside a DocumentFragment when using .search("text()"). I was finding that things worked fine on my Mac laptop, which has a new-ish version of libxml2, but not on a Linux server with an older version.

HTML::Pipeline normally avoids this by wrapping everything inside a <div> in PlainTextInputFilter. If you use PlainTextInputFilter this problem never occurs because there are no top-level text nodes.

@wideopenspaces Were you using PlainTextInputFilter? If you weren't, then you're probably opening yourself to XSS attacks or at least bad parsing/rendering (e.g., if your input string happens to contain HTML).
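
For illustration, the escape-and-wrap behavior described above can be sketched with the standard library alone. This is a rough sketch, not html-pipeline's actual source: `wrap_plain_text` is a hypothetical helper name, and using `CGI.escapeHTML` for the escaping is an assumption about how one might do it.

```ruby
require "cgi/escape"

# Hypothetical helper (not part of html-pipeline): escape plain text and wrap
# it in a <div>, so the parsed fragment has one root element and no top-level
# text nodes.
def wrap_plain_text(text)
  "<div>#{CGI.escapeHTML(text)}</div>"
end

puts wrap_plain_text("Tom & Jerry :smile:")
# prints: <div>Tom &amp; Jerry :smile:</div>
```

With a root `<div>` in place, text-oriented filters only ever see text nodes nested under an element, which is the case that behaved consistently in this thread.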

aroben commented on June 16, 2024

So I guess what I'm saying is that #146 seems unnecessary. It seems like the correct fix is "Use PlainTextInputFilter". (Unless you were using it, in which case you and I weren't seeing the same bug.)

jch commented on June 16, 2024

Eeeenteresting. I had forgotten about PlainTextInputFilter. I suppose this problem doesn't just apply to the EmojiFilter, but to all filters that work on text without a root node.

@aroben is there a downside to inlining PlainTextInputFilter behavior into all pipelines to avoid this gotcha? It would add some overhead to all pipelines, but it feels like an implicit dependency as it is.

aroben commented on June 16, 2024

Well if you are in fact starting with HTML, not plaintext, you should not use PlainTextInputFilter.

jch commented on June 16, 2024

Ya, then I guess you'd add unnecessary overhead. I'll document this better and revert #146 then.

cc @Razer6

aroben commented on June 16, 2024

Not just overhead. All your HTML would get escaped and thus rendered as plain text. I.e. you'd see HTML tags in your output.
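
A small sketch of that effect, assuming the escaping step behaves like the standard library's `CGI.escapeHTML` (the variable names are made up for the example):

```ruby
require "cgi/escape"

# Input that is already HTML, not plain text.
html_input = "<strong>bold</strong>"

# Escaping it turns the markup into literal text, so a browser would show
# the tags themselves instead of bold text.
escaped = CGI.escapeHTML(html_input)

puts escaped
# prints: &lt;strong&gt;bold&lt;/strong&gt;
```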

jch commented on June 16, 2024

Ooo ya. Good point.

wideopenspaces commented on June 16, 2024

@aroben No, we are sanitizing separately. I will look into whether or not PlainTextInputFilter will work for us.

aroben commented on June 16, 2024

@wideopenspaces If you already have HTML-escaped text on your hands then you could manually wrap it in a <div> like PlainTextInputFilter does.
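
A minimal sketch of the manual wrap, assuming the text has already been HTML-escaped elsewhere. `wrap_escaped` is a hypothetical helper, not an html-pipeline API. The point is to add the root element without escaping a second time:

```ruby
require "cgi/escape"

# Hypothetical helper: wrap text that is ALREADY HTML-escaped in a <div>.
# It deliberately does not escape again.
def wrap_escaped(escaped_text)
  "<div>#{escaped_text}</div>"
end

already_escaped = "Tom &amp; Jerry"

puts wrap_escaped(already_escaped)
# prints: <div>Tom &amp; Jerry</div>

# Escaping a second time would double-encode the entity:
puts CGI.escapeHTML(already_escaped)
# prints: Tom &amp;amp; Jerry
```

The result could then be handed to a pipeline as HTML input, so filters that search for text nodes always find them under a root element.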

wideopenspaces commented on June 16, 2024

Yep, that may be the best solution. Thanks for chiming in! And @Razer6 thanks for picking up my slack!

Razer6 commented on June 16, 2024

> HTML::Pipeline normally avoids this by wrapping everything inside a <div>

I also proposed this solution in #144. I didn't see that there is a dedicated filter doing that. Thanks for pointing that out 👍

> Eeeenteresting. I had forgotten about PlainTextInputFilter. I suppose this problem doesn't just apply to the EmojiFilter, but to all filters that work on text without a root node.

This is right.

@jch @aroben Thanks for finding the correct solution 👍
