camertron / camertron-eprun

This project forked from duerst/eprun

Efficient Pure Ruby Unicode Normalization

License: Other

Ruby 99.03% Perl 0.53% HTML 0.44%

camertron-eprun's Introduction

Efficient Pure Ruby Unicode Normalization (eprun)

(pronounced e-prune)

The Talk

Please see the Internationalization & Unicode Conference 37 talk on Implementing Normalization in Pure Ruby - the Fast and Easy Way.

Directories and Files

  • lib/normalize.rb: The core normalization code.
  • lib/string_normalize.rb: Defines String#normalize.
  • lib/generate.rb: Generation script, generates lib/normalize_tables.rb from data/UnicodeData.txt and data/CompositionExclusions.txt. This needs to be run only once when updating to a new Unicode version.
  • lib/normalize_tables.rb: Data used for normalization, automatically generated by lib/generate.rb.
  • data/: All three files in this directory are downloaded from the Unicode Character Database. They are currently at Unicode version 6.3. They need to be updated for a newer Unicode version (happens about once a year).
  • test/test_normalize.rb: Tests for lib/string_normalize.rb, using data/NormalizationTest.txt.
  • benchmark/benchmark.rb: Runs the benchmark with example text files. Automatically checks for existing gems/libraries; if, for example, the unicode_utils gem is not available, that part of the benchmark is skipped. This also applies to eprun, which will not be run on Ruby 1.8.
  • benchmark/Deutsch_.txt, Japanese_.txt, Korean_.txt, Vietnamese_.txt: Example texts extracted from random Wikipedia pages (see http://en.wikipedia.org/wiki/Wikipedia:Random). The languages were chosen based on the number of characters affected by normalization (Deutsch < Japanese < Vietnamese < Korean). These files differ somewhat in length, so the results cannot be compared directly across languages. Adding other files with names ending in "_.txt" will include them in the benchmark.
  • benchmark/benchmark_results.rb: Results of benchmark for eprun, unicode_utils, ActiveSupport::Multibyte (version 3.0.0), twitter_cldr, and the unicode gem. Eprun, unicode_utils, and unicode normalizations are run 100 times each, ActiveSupport::Multibyte is run 10 times each, and twitter_cldr is run only 1 time (didn't want to wait any longer).
  • benchmark/benchmark_results_jruby.txt: Results of benchmark when using jruby (excludes unicode gem), version 1.7.4 (1.9.3p392, 2013-05-16 2390d3b on Java HotSpot(TM) Client VM 1.7.0_07-b10 [Windows 7-x86]).
  • benchmark/benchmark.pl: Runs the benchmark using Perl, both with xsub (i.e. C) version (run 100 times) and pure Perl version (run 10 times).
  • benchmark/benchmark_results_pl.txt: Results of Perl benchmarks.
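
As a rough illustration of the table generation step described above, the following sketch extracts decomposition mappings and combining classes from lines in the UnicodeData.txt format. It is not the actual lib/generate.rb: the names `build_tables` and `SAMPLE_UNICODE_DATA` are hypothetical, the three sample lines stand in for the real data file, and the handling of data/CompositionExclusions.txt is left out.

```ruby
# Hypothetical sketch, not the project's lib/generate.rb.
# UnicodeData.txt is semicolon-separated: field 0 is the code point,
# field 3 the canonical combining class, field 5 the decomposition mapping.
SAMPLE_UNICODE_DATA = <<~DATA
  00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
  00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;
  030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;
DATA

def build_tables(unicode_data)
  canonical = {}        # code point => canonical decomposition
  compatibility = {}    # code point => compatibility decomposition
  combining_class = {}  # code point => non-zero combining class

  unicode_data.each_line do |line|
    fields = line.chomp.split(";", -1)  # -1 keeps trailing empty fields
    next if fields.size < 6             # skip blank or malformed lines

    code = fields[0].to_i(16)
    ccc = fields[3].to_i
    combining_class[code] = ccc unless ccc.zero?

    mapping = fields[5]
    next if mapping.empty?

    if mapping.start_with?("<")
      # A leading <tag> marks a compatibility (NFKC/NFKD only) mapping.
      hex = mapping.sub(/\A<[^>]+>\s*/, "")
      compatibility[code] = hex.split.map { |h| h.to_i(16) }
    else
      canonical[code] = mapping.split.map { |h| h.to_i(16) }
    end
  end

  { canonical: canonical, compatibility: compatibility,
    combining_class: combining_class }
end

TABLES = build_tables(SAMPLE_UNICODE_DATA)
```

With the real data file, the same function could be called as `build_tables(File.read("data/UnicodeData.txt"))`; a generator would then write the resulting hashes out as Ruby source.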

TODOs and Ideas

  • Publish as a gem, or several gems.
  • Deal better with encodings other than UTF-8.
  • Add methods such as String#nfc, String#nfd,...
  • Add methods for normalization variants.
  • See talk for more.
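
The String#nfc, String#nfd,... idea above could look roughly like the following sketch. It is only an assumption about the shape of such an API: the module name `NormalizationShortcuts` is hypothetical, and the methods delegate to Ruby's built-in String#unicode_normalize (available since Ruby 2.2) rather than to eprun's own String#normalize.

```ruby
# Hypothetical convenience methods, one per normalization form.
# Each delegates to the built-in String#unicode_normalize (Ruby 2.2+),
# used here as a stand-in for eprun's String#normalize.
module NormalizationShortcuts
  %i[nfc nfd nfkc nfkd].each do |form|
    define_method(form) { unicode_normalize(form) }
  end
end

class String
  include NormalizationShortcuts
end

composed   = "A\u030A".nfc   # "A" + COMBINING RING ABOVE, composed
decomposed = "\u00C5".nfd    # LATIN CAPITAL LETTER A WITH RING ABOVE, decomposed
```

Including a module into String keeps the sketch short; a published gem might prefer refinements so that the extra methods are opt-in per file.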
