gjtorikian / html-pipeline

HTML processing filters and utilities

License: MIT License

Ruby 100.00%

html-pipeline's Introduction

HTML-Pipeline

HTML processing filters and utilities. This module is a small framework for defining CSS-based content filters and applying them to user provided content.

Although this project was started at GitHub, they no longer use it. This gem must be considered standalone and independent from GitHub.

Installation

Add this line to your application's Gemfile:

gem 'html-pipeline'

And then execute:

$ bundle

Or install it yourself:

$ gem install html-pipeline

Usage

This library provides a handful of chainable HTML filters to transform user content into HTML markup. Each filter does some work, and then hands off the results to the next filter. A pipeline has several kinds of filters available to use:

Multiple TextFilters, which operate on a UTF-8 string
A ConvertFilter, which turns text into HTML (e.g., Commonmark/Asciidoc -> HTML)
A SanitizationFilter, which removes dangerous/unwanted HTML elements and attributes
Multiple NodeFilters, which operate on a UTF-8 HTML document

You can assemble each sequence into a single pipeline, or choose to call each filter individually.

As an example, suppose we want to transform Commonmark source text into Markdown HTML:

Hey there, @gjtorikian

With the content, we also want to:

change every instance of Hey to Hello
strip undesired HTML
linkify @mention

We can construct a pipeline to do all that like this:

require 'html_pipeline'

class HelloJohnnyFilter < HTMLPipeline::TextFilter
  def call
    text.gsub("Hey", "Hello")
  end
end

pipeline = HTMLPipeline.new(
  text_filters: [HelloJohnnyFilter.new],
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  # note: next line is not needed as sanitization occurs by default;
  # see below for more info
  sanitization_config: HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG,
  node_filters: [HTMLPipeline::NodeFilter::MentionFilter.new]
)
pipeline.call(user_supplied_text) # recommended: can call pipeline over and over

Filters can be custom ones you create (like HelloJohnnyFilter), and HTMLPipeline additionally provides several helpful ones (detailed below). If you only need a single filter, you can call one individually, too:

filter = HTMLPipeline::ConvertFilter::MarkdownFilter.new
filter.call(text)

Filters combine into a sequential pipeline, and each filter hands its output to the next filter's input. Text filters are processed first, then the convert filter, sanitization filter, and finally, the node filters.
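The stage ordering described above can be sketched in plain Ruby. This is only an illustration of the ordering, not the gem's implementation: the lambdas and the run_pipeline helper are hypothetical stand-ins, and the "sanitizer" is a toy regex rather than a real one.

```ruby
# Plain-Ruby sketch of the stage ordering; NOT the gem's implementation.
# Every lambda below is a hypothetical stand-in for a real filter.
text_filters   = [->(text) { text.gsub("Hey", "Hello") }]
convert_filter = ->(text) { "<p>#{text}</p>" }                        # stand-in for Markdown conversion
sanitize       = ->(html) { html.gsub(%r{<script.*?</script>}m, "") } # toy sanitizer, for illustration only
node_filters   = [->(html) { html.gsub(/@(\w+)/, '<a href="/\1">@\1</a>') }]

def run_pipeline(input, text_filters, convert_filter, sanitize, node_filters)
  # 1. text filters run first, each receiving the previous filter's output
  text = text_filters.reduce(input) { |acc, filter| filter.call(acc) }
  # 2. the convert filter turns text into HTML, 3. which is then sanitized
  html = sanitize.call(convert_filter.call(text))
  # 4. node filters run last, on the sanitized HTML
  node_filters.reduce(html) { |acc, filter| filter.call(acc) }
end

output = run_pipeline("Hey there, @gjtorikian", text_filters, convert_filter, sanitize, node_filters)
puts output
```

Because node filters run after sanitization, anything they add to the HTML is not sanitized again in this sketch.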

Some filters take optional context and/or result hash(es). These are used to pass around arguments and metadata between filters in a pipeline. For example, if you want to disable footnotes in the MarkdownFilter, you can pass an option in the context hash:

context = { markdown: { extensions: { footnotes: false } } }
filter = HTMLPipeline::ConvertFilter::MarkdownFilter.new(context: context)
filter.call("Hi **world**!")

Alternatively, you can construct a pipeline, and pass in a context during the call:

pipeline = HTMLPipeline.new(
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  node_filters: [HTMLPipeline::NodeFilter::MentionFilter.new]
)
pipeline.call(user_supplied_text, context: { markdown: { extensions: { footnotes: false } } })

Please refer to the documentation for each filter to understand what configuration options are available.

More Examples

Different pipelines can be defined for different parts of an app. Here are a few paraphrased snippets to get you started:

# The context hash is how you pass options between different filters.
# See individual filter source for explanation of options.
context = {
  asset_root: "http://your-domain.com/where/your/images/live/icons",
  base_url: "http://your-domain.com"
}

# Pipeline used for user provided content on the web
MarkdownPipeline = HTMLPipeline.new(
  text_filters: [HTMLPipeline::TextFilter::ImageFilter.new],
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  node_filters: [
    HTMLPipeline::NodeFilter::HttpsFilter.new,
    HTMLPipeline::NodeFilter::MentionFilter.new
  ],
  context: context
)

# Pipelines aren't limited to the web. You can use them for email
# processing also.
HtmlEmailPipeline = HTMLPipeline.new(
  text_filters: [
    HTMLPipeline::TextFilter::PlainTextInputFilter.new,
    HTMLPipeline::TextFilter::ImageFilter.new
  ]
)

Filters

TextFilters

TextFilters must define a method named call which is called on the text. @text, @config, and @result are available to use, and any changes made to these ivars are passed on to the next filter.

ImageFilter - converts an image URL into an <img> tag
PlainTextInputFilter - HTML-escapes text and wraps the result in a <div>

ConvertFilter

The ConvertFilter takes text and turns it into HTML. @text, @config, and @result are available to use. A ConvertFilter must define a method named call, taking one argument, text. call must return a string representing the new HTML document.

MarkdownFilter - creates HTML from text using Commonmarker

Sanitization

Because the web can be a scary place, HTML is automatically sanitized after the ConvertFilter runs and before the NodeFilters are processed. This is to prevent malicious or unexpected input from entering the pipeline.

The sanitization process takes a hash configuration of settings. See the Selma documentation for more information on how to configure these settings.

A default sanitization config is provided by this library (HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG). A sample custom sanitization allowlist might look like this:

ALLOWLIST = {
  elements: ["p", "pre", "code"]
}

pipeline = HTMLPipeline.new \
  text_filters: [
    HTMLPipeline::TextFilter::ImageFilter.new,
  ],
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  sanitization_config: ALLOWLIST

result = pipeline.call <<-CODE
This is *great*:

    some_code(:first)

CODE
result[:output].to_s

This would print:

<p>This is great:</p>
<pre><code>some_code(:first)
</code></pre>

Sanitization can be disabled if and only if nil is explicitly passed as the config:

pipeline = HTMLPipeline.new \
  text_filters: [
    HTMLPipeline::TextFilter::ImageFilter.new,
  ],
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  sanitization_config: nil

For more examples of customizing the sanitization process to include the tags you want, check out the tests and the FAQ.

NodeFilters

NodeFilters can operate either on HTML elements or text nodes using CSS selectors. Each NodeFilter must define a method named selector which provides an instance of Selma::Selector. If elements are being manipulated, handle_element must be defined, taking one argument, element; if text nodes are being manipulated, handle_text_chunk must be defined, taking one argument, text_chunk. @config and @result are available to use, and any changes made to these ivars are passed on to the next filter.

NodeFilter also has an optional method, after_initialize, which is run after the filter initializes. This can be useful in setting up a custom state for result to take advantage of.

Here's an example NodeFilter that adds a base url to images that are root relative:

require 'uri'

class RootRelativeFilter < HTMLPipeline::NodeFilter

  SELECTOR = Selma::Selector.new(match_element: "img")

  def selector
    SELECTOR
  end

  def handle_element(img)
    return if img['src'].nil? # `next` is invalid in a method body; use return
    src = img['src'].strip
    if src.start_with? '/'
      img["src"] = URI.join(context[:base_url], src).to_s
    end
  end
end
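The URL handling inside handle_element can be tried on its own with only the standard library. The sketch below is a hypothetical helper (absolutize is not part of the gem) that applies the same strip, start_with? and URI.join steps to a bare string instead of an element.

```ruby
require "uri"

# Hypothetical standalone helper mirroring the filter's URL logic.
def absolutize(src, base_url)
  return src if src.nil?

  src = src.strip
  # Only root-relative paths are rewritten; everything else passes through.
  src.start_with?("/") ? URI.join(base_url, src).to_s : src
end

base_url = "http://your-domain.com" # same example value as the context hash earlier

root_relative    = absolutize(" /images/logo.png ", base_url)
already_absolute = absolutize("https://example.com/a.png", base_url)

puts root_relative    # http://your-domain.com/images/logo.png
puts already_absolute # https://example.com/a.png
```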

For more information on how to write effective NodeFilters, refer to the provided filters listed below, and see the documentation for the underlying library, Selma.

AbsoluteSourceFilter: replaces relative image URLs with fully qualified versions
AssetProxyFilter: replaces image links with an encoded link to an asset server
EmojiFilter: converts :<emoji>: to emoji
- (Note: the included MarkdownFilter will already convert emoji)
HttpsFilter: replaces http URLs with https versions
ImageMaxWidthFilter: links to the full-size image for large images
MentionFilter: replaces @user mentions with links
SanitizationFilter: sanitizes user markup against an allowlist
SyntaxHighlightFilter: applies syntax highlighting to pre blocks
- (Note: the included MarkdownFilter will already apply highlighting)
TableOfContentsFilter: anchors headings with name attributes and generates a table of contents as an HTML unordered list linking to the headings
TeamMentionFilter: replaces @org/team mentions with links

Dependencies

Since filters can be customized to your heart's content, gem dependencies are not bundled; this project doesn't know which of the default filters you might use, and as such, you must bundle each filter's gem dependencies yourself.

For example, SyntaxHighlightFilter uses rouge to detect and highlight languages; to use the SyntaxHighlightFilter, you must add the following to your Gemfile:

gem "rouge"

Note: See the Gemfile :test group for any version requirements.

When developing a custom filter, call HTMLPipeline.require_dependency at the start to ensure that the local machine has the necessary dependency. You can also use HTMLPipeline.require_dependencies to provide a list of dependencies to check.
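As a rough idea of what such a dependency check involves, here is a minimal sketch in plain Ruby. The helper name require_optional_dependency and the MissingDependencyError class are hypothetical; the gem's own HTMLPipeline.require_dependency may behave differently.

```ruby
# Hypothetical helper; not the gem's API.
class MissingDependencyError < StandardError; end

def require_optional_dependency(name, requirer)
  require name
rescue LoadError => e
  # Re-raise with a message that tells the user what to add to their Gemfile.
  raise MissingDependencyError,
        "Missing dependency '#{name}' for #{requirer}. " \
        "Add `gem \"#{name}\"` to your Gemfile. (#{e.message})"
end

# "json" ships with Ruby, so this call succeeds silently.
require_optional_dependency("json", "ExampleFilter")

message = nil
begin
  require_optional_dependency("surely_not_installed_gem_xyz", "ExampleFilter")
rescue MissingDependencyError => e
  message = e.message
end
puts message
```

Failing at load time with a descriptive message is friendlier than a bare LoadError raised the first time the filter runs.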

On a similar note, you must manually require whichever filters you desire:

require "html_pipeline" # must be included
require "html_pipeline/convert_filter/markdown_filter" # included because you want to use this filter
require "html_pipeline/node_filter/mention_filter" # included because you want to use this filter

Documentation

Full reference documentation can be found here.

Instrumenting

Filters and Pipelines can be set up to be instrumented when called. The pipeline must be set up with an ActiveSupport::Notifications-compatible service object and a name. New pipeline objects default to the HTMLPipeline.default_instrumentation_service object.

# the AS::Notifications-compatible service object
service = ActiveSupport::Notifications

# instrument a specific pipeline
pipeline = HTMLPipeline.new [MarkdownFilter], context
pipeline.setup_instrumentation "MarkdownPipeline", service

# or set default instrumentation service for all new pipelines
HTMLPipeline.default_instrumentation_service = service
pipeline = HTMLPipeline.new [MarkdownFilter], context
pipeline.setup_instrumentation "MarkdownPipeline"

Filters are instrumented when they are run through the pipeline. A call_filter.html_pipeline event is published once any filter finishes; call_text_filters and call_node_filters are published when all of the text and node filters have finished, respectively. The payload should include the filter name. Each filter triggers its own instrumentation call.

service.subscribe "call_filter.html_pipeline" do |event, start, ending, transaction_id, payload|
  payload[:pipeline] #=> "MarkdownPipeline", set with `setup_instrumentation`
  payload[:filter] #=> "MarkdownFilter"
  payload[:context] #=> context Hash
  payload[:result] #=> instance of result class
  payload[:result][:output] #=> output HTML String
end

The full pipeline is also instrumented:

service.subscribe "call_text_filters.html_pipeline" do |event, start, ending, transaction_id, payload|
  payload[:pipeline] #=> "MarkdownPipeline", set with `setup_instrumentation`
  payload[:filters] #=> ["MarkdownFilter"]
  payload[:doc] #=> HTML String
  payload[:context] #=> context Hash
  payload[:result] #=> instance of result class
  payload[:result][:output] #=> output HTML String
end
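If ActiveSupport is not available, a small stand-in service can be useful for experimenting with subscribers. The sketch below is a hypothetical TinyNotifier class that mimics the instrument and subscribe methods of ActiveSupport::Notifications; whether the gem needs anything beyond those two methods is an assumption, so treat it as a test double rather than a drop-in replacement.

```ruby
# Hypothetical stand-in for an ActiveSupport::Notifications-style service.
class TinyNotifier
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
  end

  # Register a block to be called whenever event_name is instrumented.
  def subscribe(event_name, &block)
    @subscribers[event_name] << block
  end

  # Run the given block, then notify subscribers with timing and payload.
  def instrument(event_name, payload = {})
    started = Time.now
    result = yield(payload) if block_given?
    finished = Time.now
    @subscribers[event_name].each do |subscriber|
      subscriber.call(event_name, started, finished, object_id.to_s, payload)
    end
    result
  end
end

service = TinyNotifier.new
seen = []
service.subscribe("call_filter.html_pipeline") do |event, _start, _ending, _transaction_id, payload|
  seen << [event, payload[:filter]]
end

html = service.instrument("call_filter.html_pipeline", { filter: "MarkdownFilter" }) { "<p>hi</p>" }
puts html
puts seen.inspect
```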

Third Party Extensions

If you have an idea for a filter, propose it as an issue first. This allows us to discuss whether the filter is a common enough use case to belong in this gem, or should be built as an external gem.

Here are some extensions people have built:

html-pipeline-asciidoc_filter
jekyll-html-pipeline
nanoc-html-pipeline
html-pipeline-bitly
html-pipeline-cite
tilt-html-pipeline
html-pipeline-wiki-link - WikiMedia-style wiki links
task_list - GitHub flavor Markdown Task List
html-pipeline-nico_link - An HTMLPipeline filter for niconico description links
html-pipeline-gitlab - This gem implements various filters for html-pipeline used by GitLab
html-pipeline-youtube - An HTMLPipeline filter for YouTube links
html-pipeline-flickr - An HTMLPipeline filter for Flickr links
html-pipeline-vimeo - An HTMLPipeline filter for Vimeo links
html-pipeline-hashtag - An HTMLPipeline filter for hashtags
html-pipeline-linkify_github - An HTMLPipeline filter to autolink GitHub urls
html-pipeline-redcarpet_filter - Render Markdown source text into Markdown HTML using Redcarpet
html-pipeline-typogruby_filter - Add Typogruby text filters to your HTMLPipeline
korgi - HTMLPipeline filters for links to Rails resources

FAQ

1. Why doesn't my pipeline work when there's no root element in the document?

To make a pipeline work on a plain text document, put the PlainTextInputFilter at the end of your text_filters config. This will wrap the content in a div so the filters have a root element to work with. If you're passing in an HTML fragment that doesn't have a root element, you can wrap the content in a div yourself.
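The effect of escaping plain text and wrapping it in a div can be sketched with the standard library alone. The wrap_plain_text helper below is hypothetical and only approximates what PlainTextInputFilter is described as doing; the gem's actual output may differ in detail.

```ruby
require "erb"

# Hypothetical helper: HTML-escape plain text and give it a root <div>.
def wrap_plain_text(text)
  "<div>#{ERB::Util.html_escape(text)}</div>"
end

wrapped = wrap_plain_text("1 < 2 & @gjtorikian says hi")
puts wrapped # <div>1 &lt; 2 &amp; @gjtorikian says hi</div>
```

With a root element in place, element- and text-based filters have a node tree to match against instead of a bare string.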

2. How do I customize an allowlist for `SanitizationFilter`s?

HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG is the default allowlist used if no sanitization_config argument is given. The default is a good starting template to which you can add elements. You can either modify the constant's value, or define your own config and pass that in, such as:

config = HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG.deep_dup
config[:elements] << "iframe" # sure, whatever you want
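Note that deep_dup is provided by ActiveSupport, not by core Ruby. Outside Rails, one alternative is a Marshal round trip, sketched here against a hypothetical stand-in hash (EXAMPLE_DEFAULT_CONFIG is not the gem's constant, and the assumption that the real default is frozen has not been checked).

```ruby
# Stand-in for a frozen default config; the real DEFAULT_CONFIG has more keys.
EXAMPLE_DEFAULT_CONFIG = {
  elements: ["p", "pre", "code"].freeze,
  attributes: { "a" => ["href"].freeze }.freeze
}.freeze

# Marshal round trip: a deep copy whose nested objects are not frozen.
custom = Marshal.load(Marshal.dump(EXAMPLE_DEFAULT_CONFIG))
custom[:elements] << "iframe"

puts custom[:elements].inspect                 # the copy gained "iframe"
puts EXAMPLE_DEFAULT_CONFIG[:elements].inspect # the original is untouched
```

A shallow dup would not be enough here, because the nested arrays would still be the same frozen objects as in the original.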

Contributors

Thanks to all of these contributors.

This project is a member of the OSS Manifesto.


html-pipeline's Issues

AutolinkFilter link_attr doesn't seem to work

Hi,
In my code I have:

context = {
      asset_root: 'https://a248.e.akamai.net/assets.github.com/images/icons/',
      link_attr: 'target="_blank"',
      gfm: true
    }

    pipeline = HTML::Pipeline.new [
      HTML::Pipeline::MarkdownFilter,
      HTML::Pipeline::SanitizationFilter,
      HTML::Pipeline::EmojiFilter,
      HTML::Pipeline::AutolinkFilter
    ], context

    pipeline.call(text)[:output].to_s

and
%p= raw format(answer.body) to invoke it.

The link however doesn't add the attribute target="_blank"
Any idea?

Thanks,
Roy

Rename repository to html-pipeline

I'll be renaming this repository to html-pipeline in 3 days. Giving a heads up to let people change any references.

Executable for previewing

Any interest in an executable that people can use to preview the output of an html-pipeline run easily? I just wrote one, but I could submit it as a pull: https://gist.github.com/indirect/5096633

$ echo "foo" | html-pipeline
<p>foo</p>

History

It'd be cool to retain the original history when extracting libraries like this. Would you guys mind if I push a branch with the full history from the github/github repo? We'd need to rebase everything that's happened here on top and force push unfortunately. Sorry, I would have chimed in here earlier but had no idea this was going on.

OSX HTML::Pipeline::MarkdownFilter Fails on Right Double Quotation Mark around email address

When using the HTML::Pipeline::MarkdownFilter on a string containing a "Right Double Quotation Mark" (U+201D) around an email address the output html will include an invalid byte sequence when trying to autolink it as a mailto:

I'm only having this issue on OSX. I'm running 10.10.2.

To reproduce:

renderer = HTML::Pipeline.new([HTML::Pipeline::MarkdownFilter]).freeze
renderer.to_html("This is  an “[email protected]” example").split

This is really a bug within github-markdown, but I'm submitting it here as github-markdown doesn't seem to have a Github repository. I've also tried using Redcloth and it fails as well.

ruby 2.1.5p273 (2014-11-13 revision 48405) [x86_64-darwin14.0]

# Nokogiri (1.6.5)
    ---
    warnings: []
    nokogiri: 1.6.5
    ruby:
      version: 2.1.5
      platform: x86_64-darwin14.0
      description: ruby 2.1.5p273 (2014-11-13 revision 48405) [x86_64-darwin14.0]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/ericgoodwin/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/nokogiri-1.6.5/ports/x86_64-apple-darwin14.1.0/libxml2/2.9.2"
      libxslt_path: "/Users/ericgoodwin/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/nokogiri-1.6.5/ports/x86_64-apple-darwin14.1.0/libxslt/1.1.28"
      libxml2_patches:
      - 0001-Revert-Missing-initialization-for-the-catalog-module.patch
      - 0002-Fix-missing-entities-after-CVE-2014-3660-fix.patch
      libxslt_patches:
      - 0001-Adding-doc-update-related-to-1.1.28.patch
      - 0002-Fix-a-couple-of-places-where-f-printf-parameters-wer.patch
      - 0003-Initialize-pseudo-random-number-generator-with-curre.patch
      - 0004-EXSLT-function-str-replace-is-broken-as-is.patch
      - 0006-Fix-str-padding-to-work-with-UTF-8-strings.patch
      - 0007-Separate-function-for-predicate-matching-in-patterns.patch
      - 0008-Fix-direct-pattern-matching.patch
      - 0009-Fix-certain-patterns-with-predicates.patch
      - 0010-Fix-handling-of-UTF-8-strings-in-EXSLT-crypto-module.patch
      - 0013-Memory-leak-in-xsltCompileIdKeyPattern-error-path.patch
      - 0014-Fix-for-bug-436589.patch
      - 0015-Fix-mkdir-for-mingw.patch
      compiled: 2.9.2
      loaded: 2.9.2

Emoji syntax gravatars

I'm not sure if this is a good idea or if this is actually the place to suggest it, but it'd be cool if you could put something like :cameronmcefee: in any gfm field and have the person's avatar appear, probably linked to their profile and maybe tool-tipped with their name.

Contributing Guidelines

CONTRIBUTING.md is a cool feature; we should add it to html-pipeline! 😄

When a user submits a New Issue or sends a Pull Request, they are linked to the project's CONTRIBUTING.md.

New Issue:

Pull Request:

Since CONTRIBUTING.md is linked from both places, we could split it into two pieces of documentation. At the top of the document, we could have navigation to both pieces. Here is a rough draft for review. Thoughts?

Submitting New Issue

Please include:

Example code
Result output
nokogiri -v

Sending Pull Request

How to run the tests:

bundle exec rake

Potential class loading conflict with add-on filters

Due to the fact that HTML::Pipeline is a class, not a module, there is risk that an add-on filter will prematurely define this class before it's extended in the core library, which causes the notorious "superclass mismatch" exception.

Here's an example of where this happens. While creating a new gem for the BarFilter, we define a version file:

lib/html/pipeline/bar_filter/version.rb

module HTML
  class Pipeline
    class BarFilter
      VERSION = '1.0.0'
    end
  end
end

If we load this at the top of a gemspec file, for instance, then if we attempt to load 'html/pipeline', it goes 💥.

Normally the way these things are defined (as far as I understand it), the top-level type in a gem is a module, not a class. One way to accomplish this without breaking the current API (much), is to define the class method new on the module that instantiates the concrete class. Something like:

module HTML
  module Pipeline
    def self.new filters, default_context = {}, result_class = nil
      Engine.new filters, default_context, result_class
    end

    class Engine
      # relocate Pipeline class definition here
    end
  end
end

The other solution, which I used in html-pipeline-asciidoc_filter, is to put the filter class in a different module for the purpose of holding the VERSION constant.

module HTML_Pipeline
  class BarFilter
    VERSION = '1.0.0'
  end
end

Either way, I think this is an important issue to address to minimize the challenges of creating an add-on filter.
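The general Ruby mechanism behind a "superclass mismatch" can be reproduced without the gem. In this sketch the Demo module and its classes are hypothetical: a class is first defined with the implicit superclass Object (as a version file might do), then reopened with an explicit, different superclass.

```ruby
module Demo
  class Filter; end

  # Loaded first, e.g. from a version file: implicit superclass is Object.
  class BarFilter
    VERSION = "1.0.0"
  end
end

error = nil
begin
  # Loaded later: the "real" definition names a different superclass.
  module Demo
    class BarFilter < Filter; end
  end
rescue TypeError => e
  error = e
end

puts error.class
puts error.message
```

Reopening the class without naming a superclass, or naming the same one both times, avoids the error.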

Feature Request: Add "details" tag to whitelist

I'm a little surprised that the html5 "summary" tag is whitelisted, but the "details" tag (that it is used with) is not whitelisted:

http://html5doctor.com/the-details-and-summary-elements/

Might it be possible to include the "details" tag in the white list? I think this could be a really useful feature

Loosen Markdown Dependency.

Considering that Github Markdown tends to lack documentation on how to configure it (that or Google is failing me,) and it does a lot of things that aren't necessarily nice for user content that you want to restrict (such as autolinking) it would be nice if the dependence on github-markdown was loosened so that people who wish to use redcarpet can.

No stylesheets for SyntaxHighlightFilter

Using the example in the README:

    pipeline = HTML::Pipeline.new [
      HTML::Pipeline::MarkdownFilter,
      HTML::Pipeline::SyntaxHighlightFilter
    ]
    result = pipeline.call input
    result[:output].to_s

produces the requisite <span>s with classes, but there are no styles / stylesheets to colorize the output.

Is there something I need to add to application.css?

Can I use MentionFilter without MarkdownFilter

It seems that MentionFilter has to work with MarkdownFilter, but I don't want to provide Markdown support.

Place Dependency Management On Filters

#48 kickstarted discussion, and here is a plan for placing dependency management on Filters.

Add dependency management tests
Add dependency management to Filter with descriptive exception message
Refactor Filters to use new dependency management logic
For CI, move gem dependencies from gemspec to Gemfile :test block
Add gem post install message alerting users to new dependency management
Update README to detail each Filter's dependencies e.g. FaradayMiddleware README

@mention_filter should not replace mentions in style blocks.

HTML 5 allows style tags inline in a page. style should be added to the list of parents that are excluded, since otherwise @media queries get turned into mentions. :(

Optionally require github-linguist

charlock_holmes is a hassle to deploy on Heroku (brianmario/charlock_holmes#4). Could github/linguist (which depends on charlock_holmes) be an optional dependency? I'm guessing that quite a few sites that use html-pipeline won't need syntax highlighting.

Question - Can this work with Rouge?

Title says it all, can this work with Rouge instead of pygments? https://github.com/jneen/rouge

I prefer to stick with all Ruby solution :)

EmailReplyParser is undefined

I might be missing some dependency, but the EmailReplyFilter references an EmailReplyParser constant which is not defined in the gem, at all :)

Can't remember if this is something that was there in github/github or maybe github/html-pipeline? But it should proooobably be here. Or maybe it's EmailReplyFilter that shouldn't be :P

Do not mention or emojify in a codeblock

If the @mention or :emoji: is in a code block, do not transform it. This applies to the HTML, rather than to a Markdown, codeblock.

Related issues:

Thanks!

Open source, transferring repo ownership

I think this repo is ready for 🚢ing. #6 extracted this project from .com, and removed GitHub specific references in the gem. Here's a list of remaining things I'd like to do before I share the ❤️ with the world:

update the readme
write a blog post with some examples
add travis
transfer ownership to jch (per @rtomayko, having a maintainer rather than putting it under the org)

Is there anything I'm missing?

Implement an AsciiDoc filter based on Asciidoctor

Implement an AsciiDoc filter based on Asciidoctor.

Adding this filter will allow AsciiDoc output to be syntax highlighted. The filter should invoke Asciidoctor using attributes that make the HTML produced reasonably consistent with the HTML generated from Markdown (notitle! idprefix idseparator=-)

Ensure we get the latest from github/github

There have been some necessary changes to the pipeline over in github/github...just want to create as a TODO to make sure they get merged onto here at some point.

Support ActiveSupport 4.1

ActiveSupport v4.1.0 depends upon 'minitest', '~> 5.1'.

This dependency breaks the html-pipeline build. Here is a more detailed explanation.
#123 found a temporary solution by disallowing ActiveSupport 4.1.0. or greater. A more permanent solution must be found.

Straight quote → unicode curly quotes filter

What do you think about including this filter?

https://gist.github.com/r38y/7663375

If this filter is applied and they want straight quotes, they can escape them with ". It will also turn -- into – and --- into —.

Passed content must be valid XML to be filtered

Right now HTML::Pipeline::MentionFilter.new "test @benbalter test" will return the input string, while filter = HTML::Pipeline::MentionFilter.new "<p>test @benbalter test</p>" will return the expected @mentioned string.

I believe this is due to the doc.search('text()') pattern. Would be awesome if html-pipeline could support arbitrary strings, as right now I believe the input must be HTML, or the first filter must be the markdown filter for the expected behavior to occur.

At the very least, documentation could help clear things up for new users.

Enable syntax highlighting for inline code

Copying from this issue from github/markup:

Currently, you can syntax-highlight code blocks. For example,

main :: IO ()
main = putStrLn "Hello, World!"

renders as

main :: IO ()
main = putStrLn "Hello, World!"

However, you cannot do the same with inline code such as

main :: IO ()

both of which get rendered as main :: IO () (without syntax highlighting) when used inline. It would be nice to have something like

haskell main :: IO ()

that gives you inline syntax-highlighting (right now, that would render as haskell main :: IO ()).

As gjtorikian suggested on the other issue, this could conceivably be fixed by changing this line to match on code tags, as well as pre.

Warn if "pipelines" are out of order.

I would love it if, rather than sending a generic error that means nothing to the user (in some cases) and could be confusing, html-pipeline detected order issues when there is a clear processing order, or if emoji converted the DocumentFragment. What I mean is:

[
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::EmojiFilter
]

Works, but

[
  HTML::Pipeline::EmojiFilter,
  HTML::Pipeline::MarkdownFilter
]

Fails. However your lib sends people a broad message that doesn't even hint closely to what the problem might be, it only sends: https://github.com/jch/html-pipeline/blob/master/lib/html/pipeline/text_filter.rb#L7 which can confuse some users who are simply doing the most simple things like:

class HTMLPipeline < Filter
  FILTERS =
    [
      HTML::Pipeline::EmojiFilter,
      HTML::Pipeline::MarkdownFilter
    ]

  def run(content, opts = {})
    opts = { gfm: true, asset_root: "/assets/img" }.merge(opts)
    HTML::Pipeline.new(FILTERS, opts).to_html(content)
  end
end

This might be a problem with Emoji on Ruby 2.0.0-p0 though.

De-github https filter

Pretty sure this is the only filter that still includes github refs:
https://github.com/jch/html-pipeline/blob/master/lib/html/pipeline/https_filter.rb

How about piggybacking on the :base_url option instead?

Medico

It seems too complicated to make a repository. What help can you give when the code to paste within the page's body doesn't click?

Better error notification on missing linguist dependency?

Chalk this up to RTFM, but with a simple filter like this

HTML::Pipeline.new [
          HTML::Pipeline::MarkdownFilter,
          HTML::Pipeline::SyntaxHighlightFilter
        ]

I kept getting the help rails app to crash:

SystemExit in Help/articles#show

Showing /Users/garentorikian/github/help/app/views/help/articles/_article.html.erb where line #22 raised:

exit
Extracted source (around line #22):

Finally, after looking at the logs, I found: You need to install linguist before using the SyntaxHighlightFilter. See README.md for details.

Not sure if this error can be raised in the browser itself, but it'd be nice. Also not sure if this'll be fixed by #28 anyway.

Whitelist table sections (thead, tbody, tfoot)

Add the table section elements to the whitelist.

Table sections (thead, tbody, tfoot) are important table elements that control how a table gets rendered. If handled with the same restrictions as the table element (they can only contain tr, th and td elements), allowing them does not impose any security risk.

Decrease number of dependencies

Remove as many gem dependencies as possible because not everyone uses every single filter. The responsibility of checking for dependencies will be on the filter. This is similar to what faraday does for its adapters. I don't want the current filters to be split up into a bunch of mini-gems (html-pipeline-emoji, html-pipeline-markdown) because that's just dicing things too thin.

Camo Filter doesn't return doc when disabled

During some testing this morning I started using the disable_asset_proxy option. It seems that when you pass that in, the CamoFilter just returns nil instead of the doc, causing the rest of the filter chain to break.

cut a 1.6.0 release

We should bump a release. I want to get the Digest deprecation taken care of in some projects upstream.

/cc @jch

Question about github markdown filter (low priority!)

Hi there,

I have been trying to work out how to stop newlines being inserted into a (GitHub-flavored) Markdown blockquote.

If I have a markdown file like this:

> this is a start of a quote
> this is a continuation of a quote

according to the docs, github markdown does not put a <br> tag in there.

I have been using your excellent pipeline in a small gem I created for using markdown with the excellent vimwiki plugin, and I keep getting <br> tags inside my generated html. I'm happy to create a test case if it'll help, but I'm wondering if you can tell me what (if any) other filters I should be using. Currently it just uses your sample ones:

pipeline = HTML::Pipeline.new [
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::SyntaxHighlightFilter
]

Any help most appreciated!

Allow SSH protocol links

It'd be handy if you could also use SSH protocol links like [test server](ssh://[email protected]). Is there any chance of adding that to the protocol whitelist in SanitizationFilter? I don't think there should be any security implications, but I may be missing something.

header tags are html-rendered with name="" instead of id=""

This bug was previously reported in markup repo
A section header in markdown is rendered as h3 > a[name]

To anchor an element in URL, the id attribute must be used

The name attribute is reserved for usage in form elements. Its availability as an id is an inheritance from Netscape days and should not have been used here.

Alas, as I read in your README.md, « Note that the id attribute is not whitelisted. »
So how can I patch this ?

Fix travis-ci build

The builds are failing because ActiveSupport 4.x requires Ruby 1.9:

Installing activesupport (4.0.0) 
Gem::InstallError: activesupport requires Ruby version >= 1.9.3.
An error occurred while installing activesupport (4.0.0), and Bundler cannot
continue.
Make sure that `gem install activesupport -v '4.0.0'` succeeds before bundling.

Need to add separate gemfiles for CI to fix this.

EmojiFilter doesn't work on strings that don't contain HTML

When I pass this string...

"I can do this.\r\n:scream: Juice 3: Whoa, that's a LOT of cayenne!"

...to a pipeline containing EmojiFilter, it does not replace the emoji-cheat-sheet code with the Emoji as expected.

I tracked the problem down to here:

irb(main):204:0> doc.search('text()')
=> []

What does happen is that the DocumentFragment in doc contains one child Nokogiri::XML::Text node, and doc.text contains the same text that html contains. So....

Armed with that knowledge, I made the following changes:

def call
- doc.search('text()').each do |node|
+ nodes(doc).each do |node|
    content = node.to_html
    next if !content.include?(':')
    next if has_ancestor?(node, %w(pre code))
    html = emoji_image_filter(content)
    next if html == content
    node.replace(html)
  end
  doc
end

# Look for text nodes in the DocumentFragment
# 
# If doc's text is the same as original string,
# just nab its children to get the proper nodes.
# Otherwise do a search for text nodes.
+ def nodes(doc)
+   doc.text == html ? doc.children : doc.search('text()')
+ end

... and that fixed it for me.

Anyone see any problems with that fix? If not, I'll work up a PR as soon as I can.

Tweaks to the email reply filter

Am I correct in thinking this is used to parse the replies on GitHub? If so, what do you think about adding a way to strip the garbage from this:

I'm happy to do it but I wanted to make sure this filter was the correct place to do it.

I think the non-code solution is for that dude to delete the garbage from his email but that is sort of "you're holding it wrong".

@mention at end of parenthetical sentence doesn't get linked

@mentions.) gets left alone instead of being turned into a linked @mentions.

How do I keep finding these?
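
A minimal sketch of a mention pattern that tolerates a trailing period followed by a closing parenthesis. The pattern and the `link_mentions` helper are hypothetical illustrations, not the expression MentionFilter actually ships with.

```ruby
# Hypothetical helper: link @logins, allowing trailing periods that are
# followed by any non-word character (such as ")") or the end of the text.
def link_mentions(text, base_url = "/")
  pattern = /(^|\W)@([a-z0-9][a-z0-9-]*)(?=\.+\W|\.+\z|[^\w.]|\z)/i
  text.gsub(pattern) do
    # $1 is the character before the "@", $2 is the login itself.
    "#{$1}<a href=\"#{base_url}#{$2}\">@#{$2}</a>"
  end
end
```

Under this sketch, `link_mentions("(cc @mentions.)")` links the login and leaves the `.)` after the link, while something like `@foo.bar` is left alone because the period is followed by a word character.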

Support for ActiveSupport 4

We were upgrading from 0.0.14 to 0.2.0, but got blocked by the gemspec requirement on activesupport 3 or earlier.

Bundler could not find compatible versions for gem "activesupport":
  In Gemfile:
    html-pipeline (~> 0.1.0) ruby depends on
      activesupport (< 4, >= 2) ruby

    rails (~> 4.0) ruby depends on
      activesupport (4.0.0)

MentionFilter base_url config question

Hi. I am using MentionFilter, and my user pages live at www.lvh.me:3000/~jch.

HTML::Pipeline.new [
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::SanitizationFilter,
  HTML::Pipeline::MentionFilter
], context.merge(gfm: true, base_url: '/~')

If I specify base_url: '~' or '/~', it gives me

www.lvh.me:3000/~/jch

instead of

www.lvh.me:3000/~jch.

How can I achieve this behaviour with MentionFilter?

Currently I do the replacement myself:

text.gsub!(/@([a-z0-9][a-z0-9-]*)/i) do |match|
  %Q(<a href="/~#{$1}">#{match}</a>)
end
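
One plausible explanation for the extra slash, offered as an assumption rather than something confirmed in this thread, is that the filter joins base_url and the login as path segments. The sketch below contrasts that with plain prefixing; all three helper names are hypothetical.

```ruby
# Assumed behaviour: joining as path segments always inserts a separator.
def joined_url(base_url, login)
  File.join(base_url, login)
end

# Plain string prefixing keeps "/~" and the login adjacent.
def prefixed_url(base_url, login)
  "#{base_url}#{login}"
end

# The workaround above, wrapped in a method so it does not mutate its input.
def link_tilde_mentions(text, base_url = "/~")
  text.gsub(/@([a-z0-9][a-z0-9-]*)/i) do |match|
    %(<a href="#{base_url}#{$1}">#{match}</a>)
  end
end
```

Here `joined_url("/~", "jch")` yields `/~/jch`, whereas `prefixed_url("/~", "jch")` yields `/~jch`.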

Thanks!

Spaces inserted into code

Using

    pipeline = HTML::Pipeline.new [
      HTML::Pipeline::MarkdownFilter,
      HTML::Pipeline::SyntaxHighlightFilter
    ]

produces code that has 10 spaces prepended to every line after the first, including an extra line with 10 spaces at the end.

This

```css
@media (max-width: 992px) {
    #contact_email{ display: none; }
}
```

produces

@media (max-width: 992px) {
              #contact_email{ display: none; }
          }
          // 10 spaces at end

Getting Started Guide

The README has tons of information (usage, dependencies, examples, etc). However, new users would benefit from a Getting Started Guide; factory_girl's guide is a good example. The Getting Started Guide could detail common implementations such as integrating with Rails or Sinatra. Thoughts?

Give rsanheim rights to pushing gems

hey @jch, can you grant me rights to push the gem to rubygems? Would be useful for getting releases out the door.

My account is rsanheim - email is [email protected]

Tagged releases for 0.3.0 and 0.3.1

Let's add 'em.

cc @jch

Where to report custom filters?

Maybe a wiki page, like the one for Jekyll plugins?

TocFilter: non-English characters in headers don't get proper anchor names

Using something like <h1>日本語</h1> results in an anchor with a blank name.
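
A minimal sketch of how this can happen, assuming the anchor name is built by stripping every character that is not matched by `\w`, a hyphen, or a space. In Ruby, `\w` only matches ASCII word characters, while `\p{Word}` is Unicode-aware. Both helper names are hypothetical.

```ruby
# Assumed current behaviour: \w is ASCII-only, so CJK text is stripped entirely.
def ascii_anchor(text)
  text.downcase.gsub(/[^\w\- ]/, "").tr(" ", "-")
end

# Possible alternative: \p{Word} also keeps non-ASCII letters, marks and digits.
def unicode_anchor(text)
  text.downcase.gsub(/[^\p{Word}\- ]/, "").tr(" ", "-")
end
```

With these definitions, `ascii_anchor("日本語")` returns an empty string, while `unicode_anchor("日本語")` keeps the header text. English headers behave the same in both.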

Separate gems for versioning external dependencies

We don't specify versions for external dependencies and raise runtime errors when a dependency is missing (#80). For example, HTML::Pipeline::AutolinkFilter depends on rinku:

begin
  require "rinku"
rescue LoadError => _
  abort "Missing dependency 'rinku' for AutolinkFilter. See README.md for details."
end

This approach is simple, but couples html-pipeline's versioning to the versions of its external dependencies. For example, to update from gemoji ~> 1 to ~> 2, we would need to increase the major version for html-pipeline (#159).
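
To make the coupling concrete, here is a small sketch using RubyGems' own requirement matching. The helper name and the version numbers passed to it are illustrative assumptions, not part of html-pipeline.

```ruby
require "rubygems"

# Hypothetical helper: does an installed version satisfy a requirement string?
def dependency_satisfied?(requirement, version)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(version))
end
```

A filter written against `~> 1` accepts a 1.x release such as `"1.8.0"` but rejects `"2.0.0"`, so moving the requirement to `~> 2` is a breaking change for anyone still on 1.x.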

Here are a few ideas I came up with:

Keep things the same

This requires the fewest changes. We would raise html-pipeline's major version whenever one of its dependencies made breaking changes. There are 8 external dependencies for 8 filters. They are all pretty stable gems and unlikely to change frequently.

Separate gems, same repository

I experimented with this in the separate-gems branch. This is similar to how rails/rails is composed of separate gems (actionpack, actionmailer, activesupport) that all live in the same repository for an easy development workflow. The problem I ran into is that bundler does not like having multiple projects within the same folder. If you poke around rails/rails, you can see they've added a good number of helper methods to the Rakefile and their own set of conventions for bumping versions to make it work well. This feels like overkill to me, but maybe I'm missing something obvious.

Separate gems, separate repositories

We recommend that third-party filters be written this way. We could do the same thing with the existing filters and package them as their own gems in separate repositories. The trade-off here is that we'd have to jump between 9 projects (html-pipeline and 8 filter gems). We could add an html-pipeline organization to help with this, but it is more overhead and would make the project harder to discover and harder to contribute to. This is also how the bkeepers/qu gem handles swapping different backend stores.

@simeonwillbanks @JuanitoFatas @rsanheim @bkeepers What do you think? Are there other factors I haven't covered? Another possible way?

Detect asset pipeline availability

In the GitHub app, the emoji icons are frozen to public/images, and URLs to images are coded relative to the value of :asset_root. It'd be preferable to detect the availability of the asset pipeline and use asset_path when it's available.
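
A rough sketch of the detection idea, under the assumption that a view-like object responding to asset_path can be handed to the filter. The `emoji_image_url` helper, its `view` parameter, and the `emoji/` directory layout are hypothetical.

```ruby
# Hypothetical helper: prefer asset_path when a helper object provides it,
# otherwise fall back to a URL built from context[:asset_root].
def emoji_image_url(name, context, view = nil)
  file = "emoji/#{name}.png"
  if view && view.respond_to?(:asset_path)
    view.asset_path(file)
  else
    File.join(context[:asset_root].to_s, file)
  end
end
```

Without a helper object, `emoji_image_url("smile", { asset_root: "/images" })` falls back to `/images/emoji/smile.png`.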
