threedaymonk / htmlentities

HTMLEntities is a simple library to facilitate encoding and decoding of named (&yacute; and so on) or numerical (&#123; or &#x12a;) entities in HTML and XHTML documents.

License: Other

Ruby 100.00%

Contributors: champierre, janne, merrells, threedaymonk, tricknotes

htmlentities's Issues

doesn't decode &Amp; - purposeful?

I'm scraping, so I can't really control the HTML entity itself. I don't know whether &Amp; is a valid HTML entity (as opposed to &amp;); to be honest I don't really care, I just need to decode it to &.

The regex to match an entity is case insensitive, but the map (even the expanded flavor) doesn't include the capitalized version. I figure this may have been done on purpose to match the HTML specs.

I'm happy to submit a PR to handle this if anyone is interested, but short of that I'm curious what the advised path is.

My first thought (as seen in other issues) would be to define a custom mapping that includes this value. Being wary of what other entities might be out there that are mis-capitalized, another thought is to downcase the match before checking the map.
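The second idea (downcasing the match before the map lookup) could look roughly like the sketch below. It is plain Ruby and does not touch the gem: `ENTITY_MAP` and `decode_lenient` are hypothetical names, and the table is deliberately tiny. The exact-case lookup is tried first because some entity names differ only by case (for example &Eacute; and &eacute;).

```ruby
# Hypothetical, tiny entity table used only for illustration.
ENTITY_MAP = {
  "amp"    => "&",
  "lt"     => "<",
  "eacute" => "\u00E9",
  "Eacute" => "\u00C9"
}.freeze

# Decode named entities, falling back to a downcased lookup when the
# exact-case name is unknown. Unknown entities are left untouched.
def decode_lenient(text)
  text.gsub(/&([a-zA-Z][a-zA-Z0-9]*);/) do
    match = Regexp.last_match
    ENTITY_MAP[match[1]] || ENTITY_MAP[match[1].downcase] || match[0]
  end
end

puts decode_lenient("Tom &Amp; Jerry")
```

Trying the exact name first keeps correctly cased input decoding as before; only names that would otherwise be left alone go through the fallback.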

Encoding failure on well formed utf-8 (mdash)

Hello,

I'm using htmlentities in one of my projects. Recently users have been complaining about a failure in the text encoding process. I finally have a minimal example that exposes the issue (using Ruby 1.8.7):

require "htmlentities"
dec = HTMLEntities.new
dec.encode(dec.decode("—"),:decimal)

throws:

/htmlentities-4.2.0/lib/htmlentities/encoder.rb:85:in `unpack': malformed UTF-8 character (expected 3 bytes, given 1 bytes) (ArgumentError)

git bisect points to commit e7f336b as the one that introduces the bug. After looking at the regexp, I suspect a + sign is missing after the new pattern.

PS: Version 4.0.0 works just fine!

Improperly decoding apostrophe

When using an apostrophe encoded to &#39; from Rails, this is being improperly decoded to an empty string.

To reproduce this problem:
HTMLEntities gem version: 4.3.4
Ruby 2.2.2
Rails 4.2.7

$> HTMLEntities.new.decode(HTMLEntities.new.encode("'", :decimal))
 => "'"
$> HTMLEntities.new.encode("'", :decimal)
 => "&#39;"
$> ERB::Util.h("'")
 => "&#39;"
$> ERB::Util.h("'") == HTMLEntities.new.encode("'", :decimal)
 => true
$> ERB::Util.h("'") === HTMLEntities.new.encode("'", :decimal)
 => true
$> HTMLEntities.new.decode(ERB::Util.h("'"))
 => ""

I was able to get around this behavior by monkey-patching prepare

class HTMLEntities
  class Decoder
    private

    def prepare(string)
-      string.to_s.encode(Encoding::UTF_8)
+      string.to_s.encode(Encoding::UTF_8).unicode_normalize
    end
  end
end

NameError: uninitialized constant HTMLEntities::Encoder::Encoding

Hi,
I'm getting this error:

>> require 'htmlentities'
=> []
>> coder = HTMLEntities.new
=> #<HTMLEntities:0x10902c6f0 @flavor="xhtml1">
>> string = "<élan>"
=> "<élan>"
>> coder.encode(string) # => "&lt;élan&gt;"
NameError: uninitialized constant HTMLEntities::Encoder::Encoding
    from /Users/xx/.rbenv/versions/1.8.7-p371/lib/ruby/gems/1.8/gems/activesupport-2.3.17/lib/active_support/dependencies.rb:131:in `const_missing'
    from /Users/xx/.rbenv/versions/1.8.7-p371/lib/ruby/gems/1.8/gems/htmlentities-4.3.2/lib/htmlentities/encoder.rb:25:in `prepare'
    from /Users/xx/.rbenv/versions/1.8.7-p371/lib/ruby/gems/1.8/gems/htmlentities-4.3.2/lib/htmlentities/encoder.rb:19:in `encode'
    from /Users/xx/.rbenv/versions/1.8.7-p371/lib/ruby/gems/1.8/gems/htmlentities-4.3.2/lib/htmlentities.rb:73:in `encode'
    from (irb):4

Using Ruby 1.8.7 + Rails 2.3.17
Any help please?

Missing helper file helpers/htmlentities.rb

Using Rails 3.2.11

The gem installed successfully, but I can't use it. I'm not using its functionality in either of the following two files, which confuses me.

Trace:

app/controllers/application_controller.rb:1:in `<top (required)>'
app/controllers/users_controller.rb:1:in `<top (required)>'

This error occurred while loading the following files:
   htmlentities

Cannot Decode &#44; HTML to Comma

htmlentities seems to work great for everything except &#44; which should decode to a comma

Example Code - Long_Description contains text with entities such as &#44; which should decode to a comma but does not.

require 'htmlentities'
coder = HTMLEntities.new
self.Long_Description = coder.decode(self.Long_Description)

Any ideas?

Side Note: Even here on Github if you don't enclose &#44; into code tags it decodes it to the comma character ( , )

htmlentities gem appearing as corrupt to Bundler

I'm attempting to install the htmlentities gem onto a new system via Bundler.

When running bundle install I keep hitting this error:

bundle install --deployment --local
/opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/package/tar_input.rb:111:in `initialize': No metadata found! (Gem::Package::FormatError)
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/package/tar_input.rb:17:in `new'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/package/tar_input.rb:17:in `open'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/package.rb:58:in `open'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/format.rb:63:in `from_io'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/format.rb:51:in `from_file_by_path'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/1.8/open-uri.rb:32:in `open_uri_original_open'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/1.8/open-uri.rb:32:in `open'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/format.rb:50:in `from_file_by_path'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:197:in `cached_specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:195:in `each'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:195:in `cached_specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:194:in `each'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:194:in `cached_specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:157:in `fetch_specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/index.rb:7:in `build'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:155:in `fetch_specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:70:in `specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/lazy_specification.rb:48:in `__materialize__'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/spec_set.rb:83:in `materialize'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/spec_set.rb:81:in `map!'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/spec_set.rb:81:in `materialize'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/definition.rb:93:in `specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/definition.rb:81:in `resolve_with_cache!'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/installer.rb:34:in `run'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/installer.rb:8:in `install'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/cli.rb:217:in `install'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/vendor/thor/task.rb:22:in `send'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/vendor/thor/task.rb:22:in `run'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/vendor/thor/invocation.rb:118:in `invoke_task'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/vendor/thor.rb:246:in `dispatch'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/vendor/thor/base.rb:389:in `start'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/bin/bundle:13
from /opt/REE/bin/bundle:19:in `load'
from /opt/REE/bin/bundle:19

Based on the suggestion at this page - http://codebeef.com/bundler-no-metadata-found-problem - we have patched tar_input.rb to provide further details.

This yields the following:

/opt/www/appname/vendor/cache/htmlentities-4.2.1.gem may be corrupt! Delete it and retry the operation

Based on this error message, I had Bundler recache the gem, but hit the same issue again. I don't believe htmlentities to actually be corrupt, especially as I've run into this issue with a fresh cache of the gem file. Why would it appear as corrupt to Bundler?

Why do ldquo and rdquo appear differently?

Can anyone explain why the left quote displays differently than the right quote? Is this normal?

coder = HTMLEntities.new
string = "&ldquo;These pretzels are making me thirsty&hellip;&rdquo;"
coder.decode(string) => "“These pretzels are making me thirsty…\342\200\235"

I'm using ruby enterprise 1.8.7.

Using this with Controller

I am new to Ruby. I tried implementing the gem in my controller but it is not working.

class linksController < ApplicationController
  require 'erb'
  include ERB::Util
  require 'open-uri'
  require 'htmlentities'

  def index
    coder = HTMLEntities.new
    @test = coder.encode('hdhdhd-shgssg- shsah', :basic, :decimal)
  end
end

When I look at the source code, it is not being converted.

Verify HTML entity names

While looking at that duplicate inodot key that was fixed in #17, I noticed the capital letter below is called Iodot. That seemed inconsistent with other capital letters, so I looked it up.

What I found was &Idot; without the o. That's a bit strange too.

I'm curious where this list of entities came from and if there is any way to verify that they are all correct?

Decoding removes &pound; entity when given a SafeBuffer

Use Case

The Rails number_to_currency helper returns currency symbols as HTML entities. When exporting an HTML report to CSV, we would like to use the UTF-8 symbol instead.

Versions

  • ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.5.0]
  • Rails 4.0.0 (same issue with 3.x)
  • htmlentities (4.3.1)

Steps to Reproduce

New rails app:

rails new testapp
cd testapp
echo "gem 'htmlentities'" >> Gemfile
bundle
rails c

In console:

HTMLEntities.new.decode("&pound;12.34") # => "£12.34"
ActiveSupport::SafeBuffer.new("&pound;12.34") # => "&pound;12.34"
HTMLEntities.new.decode(ActiveSupport::SafeBuffer.new("&pound;12.34")) # => "12.34"

It decodes the way I would want when not given a SafeBuffer, but removes the entity entirely when given a SafeBuffer.

Work around

Instead of using a nice generic solution like HTMLEntities, I end up using a few gsubs to get the job done:

str.to_s.gsub("&pound;", "\u00a3").gsub("&#8364;", "\u20ac").gsub("&#269;", "\u010d")

(which works, but then this list needs to be maintained).
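The chained gsubs can be folded into a single pass driven by one lookup table, so that only the table has to be maintained. This is a plain-Ruby sketch, not part of htmlentities; `CURRENCY_ENTITIES`, `CURRENCY_PATTERN` and `replace_currency_entities` are hypothetical names, and the table covers only the three entities from the workaround above.

```ruby
# Hypothetical lookup table: only the entities from the workaround above.
CURRENCY_ENTITIES = {
  "&pound;" => "\u00a3",
  "&#8364;" => "\u20ac",
  "&#269;"  => "\u010d"
}.freeze

# One regexp that matches any key of the table.
CURRENCY_PATTERN = Regexp.union(CURRENCY_ENTITIES.keys)

def replace_currency_entities(value)
  # String.new copies the content into a plain String, so the result is
  # not an instance of whatever String subclass was passed in.
  String.new(value.to_s).gsub(CURRENCY_PATTERN, CURRENCY_ENTITIES)
end

puts replace_currency_entities("&pound;12.34")
```

Passing a hash as the second argument to gsub looks up each matched substring as a key, which is what lets one call replace all three entities.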

Does not decode, when using regex.

I thought that in the third block of code the entities would get decoded. Am I missing something?

>> require 'htmlentities'
>> @coder = HTMLEntities.new

>> poi_csv = ["space&nbsp;", "dots&hellip;", "arrow&raquo;"]

>> poi_csv.collect { |column| column.gsub( /(&.*?;)/, @coder.decode('\1')) }
=> ["space&nbsp;", "dots&hellip;", "arrow&raquo;"]

>> @coder.decode("&hellip;")
=> "…"

I am using ruby 1.8.7 and rails 2.3.17
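A likely explanation, sketched below: in the argument form, `@coder.decode('\1')` is evaluated once, before gsub runs, on the literal two-character string `\1`. That string contains no entity, so it comes back unchanged, and gsub then substitutes each match with its own first group. The block form is evaluated for every match instead. To keep the sketch runnable without the gem, `decode` is a hypothetical stand-in lambda backed by a tiny table.

```ruby
# Stand-in for a real decoder; the table is hypothetical and deliberately tiny.
table = { "&nbsp;" => "\u00A0", "&hellip;" => "\u2026", "&raquo;" => "\u00BB" }
decode = ->(entity) { table.fetch(entity, entity) }

poi_csv = ["space&nbsp;", "dots&hellip;", "arrow&raquo;"]

# Argument form: decode runs once on the literal string '\1', returns it
# unchanged, and gsub then replaces each match with itself.
unchanged = poi_csv.map { |column| column.gsub(/(&.*?;)/, decode.call('\1')) }

# Block form: the block runs for every match, so each entity is decoded.
decoded = poi_csv.map { |column| column.gsub(/&.*?;/) { |entity| decode.call(entity) } }

p unchanged
p decoded
```

With the real gem, the equivalent would presumably be passing a block that calls `@coder.decode` on the matched text, or simply calling `@coder.decode(column)` on the whole string.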

encode(string, :named) is re-encoding valid entries and breaking the HTML

Hi,
First of all, I would like to thank you for this awesome gem. But I found a bug while trying to sanitize a string that has both valid and invalid characters. I explain the problem below:

coder = HTMLEntities.new
string = "> Car &amp; Bike <"
new = coder.encode(string)  # BUG =>  "&gt; Car &amp;amp; Bike &lt;" 
worst_then_new = coder.encode(new) # BUG => "&amp;gt; Car &amp;amp;amp; Bike &amp;lt;" 

A workaround for this problem would be to "decode" before "encode", but this hack is too slow...
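One way to avoid the decode-then-encode round trip is to escape only ampersands that do not already begin an entity. The sketch below is plain Ruby and is not something htmlentities provides; `BARE_AMPERSAND`, `BASIC` and `encode_once` are hypothetical names. It trades correctness for convenience: input that genuinely contains the literal text "&amp;" will be left as-is, because the two cases cannot be told apart.

```ruby
# Matches an ampersand that does NOT already start a named, decimal or
# hexadecimal entity.
BARE_AMPERSAND = /&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+|#x\h+);)/

# Remaining basic characters and their entities.
BASIC = { "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Encode basic special characters without re-encoding existing entities.
def encode_once(text)
  text.gsub(BARE_AMPERSAND, "&amp;").gsub(/[<>"]/, BASIC)
end

first  = encode_once("> Car &amp; Bike <")
second = encode_once(first)
puts first
puts second
```

Because existing entities are skipped, applying the function twice gives the same result as applying it once.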

Expanded encoder doesn't encode colon character

Not sure if this is by design, but the expanded encoder (which includes a mapping for the colon character) doesn't convert colons to their HTML entity form. The use case is encoding title text for use in a YAML front matter block (and subsequently embedding in an HTML page).

My code:

require 'htmlentities'
title = "Foo: Bar"
coder = HTMLEntities.new(:expanded)
coder.encode(title, :hexadecimal)

Expected: "Foo&#x3a; Bar"
Got: "Foo: Bar"

Also tried coder.encode(title, :decimal), coder.encode(title, :named, :decimal), and coder.encode(title). Tested with IRB to make sure the problem isn't coming from somewhere else.
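For the YAML front matter use case, a small plain-Ruby helper can turn chosen characters into hexadecimal entities regardless of what the gem decides to encode. This is only a sketch; `encode_chars_as_hex` is a hypothetical name and not part of htmlentities.

```ruby
# Replace each of the given characters with its hexadecimal numeric entity.
def encode_chars_as_hex(text, chars)
  pattern = Regexp.union(chars)
  text.gsub(pattern) { |char| format("&#x%x;", char.ord) }
end

puts encode_chars_as_hex("Foo: Bar", [":"])
```

The colon is code point 58, which is 3a in hexadecimal, so the title comes out as "Foo&#x3a; Bar".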

Encode Registered Trademark (®)

See the following in my Rails console:

[11] pry(#<DesignsController>)> HTMLEntities.new.encode("®")
=> "®"
[12] pry(#<DesignsController>)> HTMLEntities.new.encode("&")
=> "&amp;"

How can I encode the Registered Trademark symbol?

Feature request: only encode HTML special characters

I'm not sure how to do this with your library, but I would like the ability to encode only special characters. For example, I have a block of HTML which has UTF-8 characters, but I don't want to encode the HTML tags. Is there a way to pass an option, or configure the encoder, to skip the < and > characters which make up an HTML tag?
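Until such an option exists, one rough approach is to leave all ASCII (including the tags and any entities already present) alone and convert only non-ASCII characters to decimal entities. The sketch below is plain Ruby; `encode_non_ascii` is a hypothetical helper, not an htmlentities method.

```ruby
# Convert every non-ASCII character to a decimal numeric entity and leave
# ASCII, including < and >, untouched.
def encode_non_ascii(html)
  html.gsub(/[^[:ascii:]]/) { |char| format("&#%d;", char.ord) }
end

puts encode_non_ascii("<p>caf\u00E9 &amp; cr\u00E8me</p>")
```

This produces numeric rather than named entities, since mapping to names would need the library's entity tables.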

Add License information to gemfile

This will make it show up on rubygems.org. I'm doing due diligence on our gems and need to find out the licenses for all the gems. Having it show up on rubygems.org cuts out the step of having to go to the github repo.

Add support for incorrect numerical entity format

Add an option that allows users to decode (invalid) HTML entities that omit the # sign, such as &1234; instead of &#1234;.

(I'll open a PR for this soon, I just need the link to this issue for now).
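As a rough illustration of the requested behaviour (not the eventual PR), the # can be made optional in the pattern that matches decimal entities. The sketch is plain Ruby and `decode_lenient_numeric` is a hypothetical name.

```ruby
# Decode decimal numeric entities, treating the '#' as optional.
def decode_lenient_numeric(text)
  text.gsub(/&#?(\d+);/) { [Regexp.last_match(1).to_i].pack("U") }
end

puts decode_lenient_numeric("&1234; and &#1234;")
```

Both forms decode to the same character here; a real option would probably need to stay off by default, since text such as "&1234;" can also be meant literally.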

Serious decoding bug with masked entities

Hi there,

consider the following input string:

&amp;#3346;

When calling decode() on this string, it will get decoded to the unicode character referenced by #3346. I think this happens because you first decode the &amp; and then decode the generated &#3346; in the string. A valid solution would be to decode &amp; last.

Greetings,
CK
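Whether the library really decodes in two passes is the reporter's guess, but the general point can be shown in plain Ruby: when all entities are matched in a single gsub pass, replacement text is never rescanned, so a masked entity survives as text. `NAMED` and `decode_single_pass` are hypothetical names and the table is deliberately tiny.

```ruby
# Hypothetical table of named entities, for illustration only.
NAMED = { "amp" => "&", "lt" => "<", "gt" => ">" }.freeze

# Single pass: each entity in the input is replaced exactly once, and the
# replacement text is never scanned again.
def decode_single_pass(text)
  text.gsub(/&(?:#(\d+)|([a-zA-Z]+));/) do
    match = Regexp.last_match
    if match[1]
      [match[1].to_i].pack("U")          # decimal numeric entity
    else
      NAMED.fetch(match[2], match[0])    # named entity, kept if unknown
    end
  end
end

puts decode_single_pass("&amp;#3346;")
```

The masked input decodes to the literal text "&#3346;" rather than to the character with that code point.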

Add support for case-insensitive decoding

Add an option that allows users to decode (invalid) HTML entities with incorrect casing, such as &Amp; instead of &amp;.

(I'll open a PR for this soon, I just need the link to this issue for now).

decoding failure for &Ccedil;

I seem to be having a problem with Ç. For example: 'FRANÇOIS' is being decoded as 'FRANÇOIS'. However, 'François' is correctly handled as 'François'. I thought that it might be a case of the input string being latin1, but I'm pretty sure that's not the case and your documentation seems to imply that it won't decode things that it doesn't understand.
