Comments (19)
I wonder why doc is a DocumentFragment instead of a Nokogiri::HTML::Document. The line that parses it is https://github.com/jch/html-pipeline/blob/master/lib/html/pipeline.rb#L53. When I search your sample with a Document, it works as expected:
irb(main):018:0> Nokogiri::HTML('hi').search("text()")
=> [#<Nokogiri::XML::Text:0x3fd03a089798 "hi">]
Your implementation works, but I'd be worried about the performance of doc.text == html, both in creating a string object from the doc and in comparing it against the existing value. Another implementation would be to iterate through all the child nodes and only work on text nodes:
doc.children.each do |node|
  next unless node.text?
  # snip...
end
Thanks for digging in on this bug. Could you open a PR with a test and we can continue the discussion from there?
from html-pipeline.
@jch I will dig deeper. When this wasn't working, I created a test pipeline with only EmojiFilter in it, so I know it wasn't any of the custom filters I built, but it's quite possible I did do something wrong.
If it's not an ID10T error on my part, I'll certainly work up a PR!
from html-pipeline.
@wideopenspaces any luck?
from html-pipeline.
Work got in the way the last two weeks. I'll see if I can set aside a few hours this week to tackle this. Thanks for reminding me!
from html-pipeline.
@wickedshimmy Hitting the same issue. Had you any success?
from html-pipeline.
@jch Your approach works. I could provide a PR if that would be acceptable.
from html-pipeline.
@Razer6 👍 a PR would be awesome. I'd be happy to review and test it for compatibility.
from html-pipeline.
Fixed by #146
from html-pipeline.
I ran into this problem too. I think some versions of libxml2 don't return top-level text nodes inside a DocumentFragment when using .search("text()"). I was finding that things worked fine on my Mac laptop, which has a new-ish version of libxml2, but not on a Linux server with an older version.
HTML::Pipeline normally avoids this by wrapping everything inside a <div> in PlainTextInputFilter. If you use PlainTextInputFilter this problem never occurs because there are no top-level text nodes.
@wideopenspaces Were you using PlainTextInputFilter? If you weren't, then you're probably opening yourself to XSS attacks or at least bad parsing/rendering (e.g., if your input string happens to contain HTML).
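For illustration, here is a rough sketch of the escape-and-wrap behavior described above. It uses only Ruby's standard library; wrap_plain_text is a hypothetical stand-in, not the actual PlainTextInputFilter code, which may differ in detail.

```ruby
require "cgi"

# Hypothetical stand-in for what PlainTextInputFilter is described as doing:
# HTML-escape the plain-text input, then wrap it in a <div> so the parsed
# fragment has no top-level text nodes.
def wrap_plain_text(text)
  "<div>#{CGI.escapeHTML(text)}</div>"
end

puts wrap_plain_text("hi :smile:")
# prints: <div>hi :smile:</div>

puts wrap_plain_text("<b>hi</b>")
# prints: <div>&lt;b&gt;hi&lt;/b&gt;</div>
```

The second call shows the escaping side of this: any markup in the input is neutralized, so it is only appropriate when the input really is plain text.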
from html-pipeline.
So I guess what I'm saying is that #146 seems unnecessary. It seems like the correct fix is "Use PlainTextInputFilter". (Unless you were using it, in which case you and I weren't seeing the same bug.)
from html-pipeline.
Eeeenteresting. I had forgotten about PlainTextInputFilter. I suppose this problem doesn't just apply to the EmojiFilter, but to all filters that work on text without a root node.
@aroben is there a downside to inlining PlainTextInputFilter behavior into all pipelines to avoid this gotcha? It would add some overhead to all pipelines, but it feels like an implicit dependency as it is.
from html-pipeline.
Well if you are in fact starting with HTML, not plaintext, you should not use PlainTextInputFilter.
from html-pipeline.
Ya, then I guess you'd add unnecessary overhead. I'll document this better and revert #146 then.
cc @Razer6
from html-pipeline.
Not just overhead. All your HTML would get escaped and thus rendered as plain text. I.e. you'd see HTML tags in your output.
from html-pipeline.
Ooo ya. Good point.
from html-pipeline.
@aroben No, we are sanitizing separately. I will look into whether or not PlainTextInputFilter will work for us.
from html-pipeline.
@wideopenspaces If you already have HTML-escaped text on your hands then you could manually wrap it in a <div> like PlainTextInputFilter does.
from html-pipeline.
Yep, that may be the best solution. Thanks for chiming in! And @Razer6 thanks for picking up my slack!
from html-pipeline.
> HTML::Pipeline normally avoids this by wrapping everything inside a <div>
I proposed this solution also in #144. Didn't see that there is a dedicated filter doing that. Thanks for pointing out 👍
> Eeeenteresting. I had forgotten about PlainTextInputFilter. I suppose this problem doesn't just apply to the EmojiFilter, but to all filters that work on text without a root node.
This is right.
@jch @aroben Thanks for finding the correct solution 👍
from html-pipeline.