
An IDNA2008, UTS46 and Punycode implementation in pure Ruby

License: MIT License



URI::IDNA


An IDNA2008, UTS46, IDNA from WHATWG URL Standard and Punycode implementation in pure Ruby.

This gem provides a number of functions for converting internationalized domain names (IDNs) between the Unicode and ASCII Compatible Encoding (ACE) forms.

Sponsored by Evil Martians

Installation

Add to your Gemfile:

gem "uri-idna"

And then run bundle install.

Usage

There are plenty of ways to convert IDNs between Unicode and ACE forms.

IDNA2008

RFC 5891 defines two protocols for IDN conversion: Registration and Domain Name Lookup.

Registration protocol

URI::IDNA.register(alabel:, ulabel:, **options)

Options
  • check_hyphens: true – whether to check hyphens according to Section 5.4.
  • leading_combining: true – whether to check leading combining marks according to Section 5.4.
  • check_joiners: true – whether to check CONTEXTJ code points according to Section 5.4.
  • check_others: true – whether to check CONTEXTO code points according to Section 5.4.
  • check_bidi: true – whether to check bidirectional characters according to Section 5.4.
require "uri/idna"

URI::IDNA.register(alabel: "xn--gdkl8fhk5egc.jp", ulabel: "ハロー・ワールド.jp")
#=> "xn--gdkl8fhk5egc.jp"

URI::IDNA.register(ulabel: "ハロー・ワールド.jp")
#=> "xn--gdkl8fhk5egc.jp"

URI::IDNA.register(alabel: "xn--gdkl8fhk5egc.jp")
#=> "xn--gdkl8fhk5egc.jp"

URI::IDNA.register(ulabel: "☕.us")
#<URI::IDNA::InvalidCodepointError: Codepoint U+2615 at position 1 of "☕" not allowed>
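
To make the check_hyphens option above more concrete, here is a minimal sketch of the hyphen restrictions that RFC 5891 places on a Unicode label: no leading or trailing hyphen, and no "--" in the third and fourth positions. The hyphens_ok? helper is hypothetical and is not part of the uri-idna API; the gem's own check may differ in detail.

```ruby
# Illustrative only: a hypothetical helper, not the gem's implementation.
# Returns true when a single Unicode label passes the hyphen restrictions.
def hyphens_ok?(label)
  return false if label.start_with?("-") || label.end_with?("-")

  # Positions 3 and 4 (1-based) are the characters at index 2 and 3.
  label[2, 2] != "--"
end

hyphens_ok?("example")  #=> true
hyphens_ok?("-example") #=> false
hyphens_ok?("ab--cd")   #=> false
```

The rule applies to the Unicode form of a label; an A-label necessarily starts with "xn--" and would fail this test by design.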

Domain Name Lookup protocol

URI::IDNA.lookup(domain_name, **options)

Options
  • check_hyphens: true – whether to check hyphens according to Section 4.2.3.1.
  • leading_combining: true – whether to check leading combining marks according to Section 4.2.3.2.
  • check_joiners: true – whether to check CONTEXTJ code points according to Section 4.2.3.3.
  • check_others: true – whether to check CONTEXTO code points according to Section 4.2.3.3.
  • check_bidi: true – whether to check bidirectional characters according to Section 4.2.3.4.
  • verify_dns_length: true – whether to check DNS length according to Section 4.4.
require "uri/idna"

URI::IDNA.lookup("ハロー・ワールド.jp")
#=> "xn--pck0a1b0a6a2e.jp"

URI::IDNA.lookup("xn--pck0a1b0a6a2e.jp")
#=> "xn--pck0a1b0a6a2e.jp"

URI::IDNA.lookup("Ῠ.me")
#<URI::IDNA::InvalidCodepointError: Codepoint U+1FE8 at position 1 of "Ῠ" not allowed>
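
The verify_dns_length option listed above concerns the usual DNS size limits. As a rough sketch, assuming the commonly cited limits of 1 to 63 bytes per label and 1 to 253 bytes for the whole name (not counting a trailing root dot), the check could look like this. The dns_length_ok? helper is hypothetical and only illustrates the rule; it is not how the gem implements it.

```ruby
# Hypothetical helper, not part of uri-idna: checks an ASCII (ACE) domain
# name against the assumed DNS length limits.
def dns_length_ok?(ascii_domain)
  name = ascii_domain.chomp(".")   # ignore a single trailing root dot
  labels = name.split(".", -1)     # -1 keeps empty labels, so "a..b" fails
  return false unless (1..253).cover?(name.bytesize)

  labels.all? { |label| (1..63).cover?(label.bytesize) }
end

dns_length_ok?("xn--pck0a1b0a6a2e.jp") #=> true
dns_length_ok?("#{"a" * 64}.jp")       #=> false (label longer than 63 bytes)
dns_length_ok?("a..jp")                #=> false (empty label)
```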

Unicode UTS46 (TR46)

Current revision: 31

UTS46 defines two IDN conversion functions: ToASCII and ToUnicode.

ToASCII

URI::IDNA.to_ascii(domain_name, **options)

Options
require "uri/idna"

URI::IDNA.to_ascii("Bloß.de")
#=> "xn--blo-7ka.de"

# UTS46 transitional processing is disabled by default,
# but can be enabled via option:
URI::IDNA.to_ascii("Bloß.de", transitional_processing: true)
#=> "bloss.de"

# Note that UTS46 processing is not fully IDNA2008 compliant:
URI::IDNA.to_ascii("☕.us")
#=> "xn--53h.us"
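
The difference between the two results for "Bloß.de" above comes from the UTS46 deviation characters, which transitional processing maps away and nontransitional processing keeps. The sketch below illustrates only that step, using the four classic deviation characters (ß, ς, ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER). DEVIATIONS and map_deviations are hypothetical names, and the gem uses generated Unicode data files rather than a table like this.

```ruby
# Illustrative table of the UTS46 deviation characters and their
# transitional mappings. Hypothetical, not taken from the gem.
DEVIATIONS = {
  "ß" => "ss",     # U+00DF LATIN SMALL LETTER SHARP S
  "ς" => "σ",      # U+03C2 GREEK SMALL LETTER FINAL SIGMA
  "\u200C" => "",  # ZERO WIDTH NON-JOINER is removed
  "\u200D" => ""   # ZERO WIDTH JOINER is removed
}.freeze

DEVIATION_PATTERN = Regexp.union(DEVIATIONS.keys)

# Replaces every deviation character with its transitional mapping.
def map_deviations(domain)
  domain.gsub(DEVIATION_PATTERN, DEVIATIONS)
end

map_deviations("bloß.de") #=> "bloss.de"
```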

ToUnicode

URI::IDNA.to_unicode(domain_name, **options)

Options
require "uri/idna"

URI::IDNA.to_unicode("xn--blo-7ka.de")
#=> "bloß.de"

IDNA2008 compatibility

It's possible to apply UTS46 mapping first and then IDNA2008 processing, so that the result fully conforms to IDNA2008:

require "uri/idna"

# For example we can use UTS46 mapping to downcase some characters
char = "⼤"
char.ord # "\u2F24"
#=> 12068

# just downcase doesn't work in this case
char.downcase.ord
#=> 12068

# but UTS46 mapping does its thing:
URI::IDNA::UTS46::Mapping.call(char).ord
#=> 22823

# so here is a full example:
domain = "⼤.cn" # "\u2F24.cn"
URI::IDNA.lookup(domain)
#<URI::IDNA::InvalidCodepointError: Codepoint U+2F24 at position 1 of "⼤" not allowed>

mapped_domain = URI::IDNA::UTS46::Mapping.call(domain)
URI::IDNA.lookup(mapped_domain)
#=> "xn--pss.cn"
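
For this particular character, the mapped result coincides with what Ruby's built-in NFKC normalization produces, which may help explain where the value 22823 comes from. The snippet below uses only the standard String#unicode_normalize method; it is an approximation for illustration and is not a replacement for URI::IDNA::UTS46::Mapping, which covers more than compatibility normalization.

```ruby
char = "\u2F24"                        # KANGXI RADICAL BIG, as in the example above
mapped = char.unicode_normalize(:nfkc) # compatibility normalization from the stdlib

mapped.ord         #=> 22823
mapped == "\u5927" #=> true (the CJK ideograph U+5927)
```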

WHATWG

WHATWG's URL Standard uses the UTS46 algorithm to define its ToASCII and ToUnicode functions. It hides the individual UTS46 flags and exposes a single one instead: be_strict.

Note that the check_hyphens UTS46 option is set to false in this algorithm.

ToASCII

URI::IDNA.whatwg_to_ascii(domain_name, **options)

Options
  • be_strict: true – defines the values of the use_std3_ascii_rules and verify_dns_length UTS46 options.
require "uri/idna"

URI::IDNA.whatwg_to_ascii("Bloß.de")
#=> "xn--blo-7ka.de"

# The be_strict flag sets use_std3_ascii_rules and verify_dns_length UTS46 flags to its value
URI::IDNA.whatwg_to_ascii("2003_rules.com", be_strict: false)
#=> "2003_rules.com"

# By default be_strict is set to true
URI::IDNA.whatwg_to_ascii("2003_rules.com")
#<URI::IDNA::InvalidCodepointError: Codepoint U+005F at position 5 of "2003_rules" not allowed>
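
The error above is a consequence of the STD3 ASCII rules, which limit the ASCII part of a label to letters, digits and the hyphen. A minimal sketch of such a check follows. The first_non_std3_char helper is hypothetical and only mirrors the 1-based position reported in the error message; it is not the gem's implementation.

```ruby
# Hypothetical helper: returns [char, position] for the first ASCII character
# of a label that the STD3 rules reject, or nil when the label passes.
def first_non_std3_char(label)
  label.each_char.with_index(1) do |char, position|
    next unless char.ascii_only?        # STD3 only restricts ASCII characters
    next if char.match?(/[A-Za-z0-9-]/) # letters, digits and hyphen are allowed

    return [char, position]
  end
  nil
end

first_non_std3_char("2003_rules") #=> ["_", 5]
first_non_std3_char("2003-rules") #=> nil
```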

ToUnicode

URI::IDNA.whatwg_to_unicode(domain_name, **options)

Options
  • be_strict: true – defines the value of the use_std3_ascii_rules UTS46 option.
require "uri/idna"

URI::IDNA.whatwg_to_unicode("xn--blo-7ka.de")
#=> "bloß.de"

Punycode

The Punycode module converts between Unicode and Punycode. Note that Punycode on its own is not IDNA2008 compliant: the module only performs the conversion and does no validation.

require "uri/idna/punycode"

URI::IDNA::Punycode.encode("ハロー・ワールド")
#=> "gdkl8fhk5egc"

URI::IDNA::Punycode.decode("gdkl8fhk5egc")
#=> "ハロー・ワールド"
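
For readers curious about what the encoding step involves, below is a compact sketch of the Punycode encoder following the pseudocode in RFC 3492. PunycodeSketch is a hypothetical module written for illustration; it omits overflow handling and is not the code used by URI::IDNA::Punycode.

```ruby
# A minimal Punycode encoder after RFC 3492, Section 6. Illustration only.
module PunycodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128
  DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"

  module_function

  # Bias adaptation function (RFC 3492, Section 6.1).
  def adapt(delta, num_points, first_time)
    delta = first_time ? delta / DAMP : delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= BASE - TMIN
      k += BASE
    end
    k + (((BASE - TMIN + 1) * delta) / (delta + SKEW))
  end

  def encode(input)
    codepoints = input.codepoints
    basic = codepoints.select { |cp| cp < INITIAL_N }
    output = basic.pack("U*")          # basic (ASCII) code points are copied first
    output << "-" unless basic.empty?  # delimiter between basic part and deltas

    n = INITIAL_N
    delta = 0
    bias = INITIAL_BIAS
    handled = basic.size

    while handled < codepoints.size
      m = codepoints.select { |cp| cp >= n }.min # next code point to insert
      delta += (m - n) * (handled + 1)
      n = m
      codepoints.each do |cp|
        delta += 1 if cp < n
        next unless cp == n

        # Emit delta as a generalized variable-length integer.
        q = delta
        k = BASE
        loop do
          t = (k - bias).clamp(TMIN, TMAX)
          break if q < t

          output << DIGITS[t + ((q - t) % (BASE - t))]
          q = (q - t) / (BASE - t)
          k += BASE
        end
        output << DIGITS[q]
        bias = adapt(delta, handled + 1, handled == basic.size)
        delta = 0
        handled += 1
      end
      delta += 1
      n += 1
    end
    output
  end
end

PunycodeSketch.encode("bloß") #=> "blo-7ka"
```

The result matches the label in "xn--blo-7ka.de" from the UTS46 examples once the "xn--" ACE prefix is added.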

Full technical reference:

IDNA2008

Punycode

  • RFC 3492 – Punycode: A Bootstring encoding of Unicode

UTS46 (also referenced as TR46)

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.

Generating Unicode data

This gem uses Unicode data files to perform IDN conversion. To generate new Unicode data files, run bundle exec rake idna:generate.

To specify the Unicode version, use the VERSION environment variable, e.g. VERSION=15.1.0 bundle exec rake idna:generate.

By default, the Unicode version is the one bundled with the running Ruby version (RbConfig::CONFIG["UNICODE_VERSION"]).
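
To see which Unicode version that default resolves to on a given machine, the value can be printed with the standard rbconfig library. This is just a convenience snippet, not something the rake task requires:

```ruby
require "rbconfig"

# The Unicode version this Ruby build was compiled against.
unicode_version = RbConfig::CONFIG["UNICODE_VERSION"]
puts "Ruby #{RUBY_VERSION} ships Unicode #{unicode_version}"
```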

To set the directory for generated files, use the DEST_DIR environment variable, e.g. DEST_DIR=lib/uri/idna/data bundle exec rake idna:generate.

Unicode data is cached in the tmp directory by default. To change this, use the CACHE_DIR environment variable, e.g. CACHE_DIR=~/.cache/unicode_data bundle exec rake idna:generate.

Note: rake idna:generate might generate different results on different versions of Ruby due to usage of built-in Unicode normalization methods.

Inspect Unicode data

To inspect Unicode data, run bundle exec rake 'idna:inspect[<HEX_CODE>]'.

To specify the Unicode version or the cache directory, use the VERSION or CACHE_DIR environment variables, e.g. VERSION=15.1.0 bundle exec rake 'idna:inspect[1f495]'.

Update UTS46 test suite data

To update UTS46 test suite data, run bundle exec rake idna:update_uts46_test_suite.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/skryukov/uri-idna.

License

The gem is available as open source under the terms of the MIT License.

