bbc / unicode-bidirectional

A Javascript implementation of the Unicode 9.0.0 Bidirectional Algorithm

License: MIT License

unicode-bidirectional's Introduction

unicode-bidirectional

A Javascript implementation of the Unicode 9.0.0 Bidirectional Algorithm

This is an implementation of the Unicode Bidirectional Algorithm (UAX #9) that works in both Browser and Node.js environments. The implementation is conformant as per definition UAX#9-C1.

Installation

npm install unicode-bidirectional --save

Usage

unicode-bidirectional is declared as a Universal Module (UMD), meaning it can be used with all conventional Javascript module systems:

1. ES6

import { resolve, reorder } from 'unicode-bidirectional';

const codepoints = [0x28, 0x29, 0x2A, 0x05D0, 0x05D1, 0x05D2]
const levels = resolve(codepoints, 0);  // [0, 0, 0, 1, 1, 1]
const reordering = reorder(codepoints, levels); // [0x28, 0x29, 0x2A, 0x05D2, 0x05D1, 0x05D0]

2. CommonJS

var UnicodeBidirectional = require('unicode-bidirectional/dist/unicode.bidirectional');
var resolve = UnicodeBidirectional.resolve;
var reorder = UnicodeBidirectional.reorder;

var codepoints = [0x28, 0x29, 0x2A, 0x05D0, 0x05D1, 0x05D2]
var levels = resolve(codepoints, 0);  // [0, 0, 0, 1, 1, 1]
var reordering = reorder(codepoints, levels); // [0x28, 0x29, 0x2A, 0x05D2, 0x05D1, 0x05D0]

3. RequireJS

require(['UnicodeBidirectional'], function (UnicodeBidirectional) {
  var resolve = UnicodeBidirectional.resolve;
  var reorder = UnicodeBidirectional.reorder;

  var codepoints = [0x28, 0x29, 0x2A, 0x05D0, 0x05D1, 0x05D2]
  var levels = resolve(codepoints, 0);  // [0, 0, 0, 1, 1, 1]
  var reordering = reorder(codepoints, levels); // [0x28, 0x29, 0x2A, 0x05D2, 0x05D1, 0x05D0]
});

4. HTML5 <script> tag

<script src="unicode.bidirectional.js"></script> <!-- exposes window.UnicodeBidirectional -->
<script>
  var resolve = UnicodeBidirectional.resolve;
  var reorder = UnicodeBidirectional.reorder;

  var codepoints = [0x28, 0x29, 0x2A, 0x05D0, 0x05D1, 0x05D2];
  var levels = resolve(codepoints, 0);  // [0, 0, 0, 1, 1, 1]
  var reordering = reorder(codepoints, levels); // [0x28, 0x29, 0x2A, 0x05D2, 0x05D1, 0x05D0]
</script>

You can download unicode.bidirectional.js from Releases. Using this file with a <script> tag will expose UnicodeBidirectional as global variable on the window object.

API

resolve(codepoints, paragraphLevel[, automaticLevel = false])

Returns the resolved levels associated with each codepoint in codepoints [1]. This levels array determines: (i) the relative nesting of LTR and RTL characters, and hence (ii) how characters should be reversed when displayed on the screen.

The input codepoints are assumed to all be in one paragraph that has a base direction of paragraphLevel – this is a Number that is either 0 or 1 and represents whether the paragraph is left-to-right (0) or right-to-left (1). automaticLevel is an optional Boolean flag that, when present and set to true, causes this function to ignore the paragraphLevel argument and instead attempt to deduce the paragraph level from the codepoints. [2]
The input array is not mutated.
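
As a rough illustration of what automatic level deduction involves, the sketch below applies rules P2 and P3 of UAX #9 to an array of bidi class names (the notation used in BidiTest.txt). The function name deduceParagraphLevel and its input format are assumptions made for this example; it is not this library's implementation, which works on codepoints.

```javascript
// Hypothetical helper, not part of unicode-bidirectional.
// Input: bidi class names such as 'L', 'R', 'AL', 'EN', 'RLI', 'PDI'.
function deduceParagraphLevel(bidiTypes) {
  let isolateDepth = 0; // > 0 while inside an isolate, whose content P2 skips
  for (const type of bidiTypes) {
    if (type === 'LRI' || type === 'RLI' || type === 'FSI') {
      isolateDepth += 1;
    } else if (type === 'PDI') {
      if (isolateDepth > 0) isolateDepth -= 1; // an unmatched PDI is ignored
    } else if (isolateDepth === 0) {
      if (type === 'L') return 0;                  // P3: first strong is L
      if (type === 'R' || type === 'AL') return 1; // P3: first strong is R or AL
    }
  }
  return 0; // P3: no strong character found
}

console.log(deduceParagraphLevel(['EN', 'R', 'L']));         // 1
console.log(deduceParagraphLevel(['RLI', 'R', 'PDI', 'L'])); // 0
```

An isolate initiator with no matching PDI leaves isolateDepth above zero, so everything up to the end of the paragraph is skipped, as P2 requires.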

reorder(codepoints, levels)

Returns the codepoints in codepoints reordered (i.e. permuted) according to the levels array. [3]
Neither of the two input arrays are mutated.
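
To make the reordering step concrete, here is a minimal sketch of rule L2 of UAX #9: from the highest level down to the lowest odd level, reverse every contiguous run of characters at that level or higher. The name reorderByLevels is hypothetical and this is not the library's source code; it also assumes every entry of levels is a number (no 'x' entries).

```javascript
// Hypothetical stand-in for illustration; not the library's implementation.
function reorderByLevels(codepoints, levels) {
  const result = codepoints.slice(); // leave the input untouched
  const oddLevels = levels.filter((l) => l % 2 === 1);
  if (oddLevels.length === 0) return result; // nothing to reverse
  const highest = Math.max(...levels);
  const lowestOdd = Math.min(...oddLevels);

  for (let level = highest; level >= lowestOdd; level -= 1) {
    let i = 0;
    while (i < result.length) {
      if (levels[i] < level) { i += 1; continue; }
      let j = i;
      while (j < result.length && levels[j] >= level) j += 1;
      // Reverse the run occupying positions i..j-1.
      const reversed = result.slice(i, j).reverse();
      result.splice(i, j - i, ...reversed);
      i = j;
    }
  }
  return result;
}

console.log(reorderByLevels([0x28, 0x29, 0x2A, 0x05D0, 0x05D1, 0x05D2], [0, 0, 0, 1, 1, 1]));
// [ 40, 41, 42, 1490, 1489, 1488 ]
```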

reorderPermutation(levels[, IGNORE_INVISIBLE = false])

Returns the reordering that levels represents as a permutation array. When this array has an element at index i with value j, it denotes that the codepoint previously positioned at index i is now positioned at index j. [4]
The input array is not mutated. The IGNORE_INVISIBLE parameter controls whether or not invisible characters (characters with a level of 'x' [5]) are to be included in the permutation array. By default, they are included in the permutation (they are not ignored, hence IGNORE_INVISIBLE is false).
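
A small sketch of how a permutation array could be applied, following the convention stated above (the item at index i moves to index permutation[i]). applyPermutation is a hypothetical helper, not part of the library, and it assumes the permutation has one entry for every item.

```javascript
// Hypothetical helper. permutation[i] === j means the item at index i moves to index j.
function applyPermutation(items, permutation) {
  const result = new Array(items.length);
  permutation.forEach((target, source) => {
    result[target] = items[source];
  });
  return result;
}

console.log(applyPermutation(['a', 'b', 'c'], [2, 0, 1])); // [ 'b', 'c', 'a' ]
```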

mirror(codepoints, levels)

Replaces each codepoint in codepoints with its mirrored glyph according to rule L4 and the levels array.
Neither of the two input arrays are mutated.
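
Rule L4 of UAX #9 says a character is shown with a mirrored glyph when its resolved direction is R (an odd level) and its Bidi_Mirrored property is Yes. The sketch below illustrates that idea with a tiny hand-written map; mirrorSketch and SAMPLE_MIRROR_MAP are hypothetical names, and the map is only a small subset of BidiMirroring.txt, not the library's data.

```javascript
// Tiny illustrative subset of BidiMirroring.txt; the real data is much larger.
const SAMPLE_MIRROR_MAP = new Map([
  [0x28, 0x29], [0x29, 0x28], // ( )
  [0x3C, 0x3E], [0x3E, 0x3C], // < >
  [0x5B, 0x5D], [0x5D, 0x5B], // [ ]
]);

// Hypothetical stand-in for illustration; returns a new array.
function mirrorSketch(codepoints, levels, mirrorMap = SAMPLE_MIRROR_MAP) {
  return codepoints.map((cp, i) => {
    const isRtl = typeof levels[i] === 'number' && levels[i] % 2 === 1;
    return isRtl && mirrorMap.has(cp) ? mirrorMap.get(cp) : cp;
  });
}

console.log(mirrorSketch([0x28, 0x05D0, 0x29], [1, 1, 1])); // [ 41, 1488, 40 ]
```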

constants

An object containing metadata used by the bidirectional algorithm. This object includes the following keys:

  • mirrorMap: a map mapping a codepoint to its mirrored counterpart, e.g. looking up "<" gives ">". If a codepoint does not have a mirrored counterpart, then there is no key-value pair in the map and so a lookup will give undefined. [6]
  • oppositeBracket: a map mapping a codepoint to its bracket pair counterpart, e.g. looking up "(" gives ")". If a codepoint does not have a bracket pair counterpart, then there is no key-value pair in the map and so a lookup will give undefined. [7]
  • openingBrackets: a set containing all brackets that are opening brackets. [7]
  • closingBrackets: a set containing all brackets that are closing brackets. [7]

Additional Notes:

For all the above functions, codepoints are represented by an Array of Numbers where each Number denotes the Unicode codepoint of the character, that is, an integer between 0x0 and 0x10FFFF inclusive. levels are represented by an Array of Numbers where each Number is an integer between 0 and 127 inclusive. One or more entries of levels may be the string 'x'. This denotes a character that does not have a level [5].
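
Since the functions take codepoint arrays rather than strings, a conversion step is usually needed. The helpers below are a minimal sketch using only standard JavaScript; the names toCodepoints and fromCodepoints are made up for this example and are not exported by the library.

```javascript
// Hypothetical helpers built on standard String APIs.
function toCodepoints(text) {
  // Array.from iterates a string by codepoint, so surrogate pairs stay together.
  return Array.from(text, (ch) => ch.codePointAt(0));
}

function fromCodepoints(codepoints) {
  return String.fromCodePoint(...codepoints);
}

console.log(toCodepoints('()*\u05D0\u05D1\u05D2'));
// [ 40, 41, 42, 1488, 1489, 1490 ]
console.log(fromCodepoints([0x1F600]).length); // 2 (one codepoint, two UTF-16 code units)
```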

[1]: Codepoints are automatically converted to NFC normal form if they are not already in that form.
[2]: This function deduces the paragraph level according to: UAX#9-P1, UAX#9-P2 and UAX#9-P3.
[3]: This is an implementation of UAX#9-L2.
[4]: More formally known as the one-line notation for permutations. See Wikipedia.
[5]: Some characters have a level of x – the levels array has a string 'x' instead of a number. This is expected behaviour. The Unicode Bidirectional Algorithm (by rule X9) does not assign a level to certain invisible characters / control characters. They are essentially ignored by the algorithm. They are invisible and so have no impact on the visual RTL/LTR ordering of characters. Most of the invisible characters that fall into this category are in this list.
[6]: This is taken from BidiMirroring.txt.
[7]: This is taken from BidiBrackets.txt.

Polyfills

unicode-bidirectional uses some ECMAScript 2015 (ES6) features that are not fully supported by Internet Explorer and older versions of other browsers.

If you are targeting these browsers, you'll need to add one or more Polyfill libraries to fill in these features (for example, es6-shim and unorm).

More Info

For other Javascript Unicode Implementations see:

License

MIT.
Copyright (c) 2017 British Broadcasting Corporation

unicode-bidirectional's People

Contributors

jameslawson

unicode-bidirectional's Issues

npm install error

Hello, I get the following error during installation:
npm install unicode-bidirectional --save

npm ERR! Error while executing:
npm ERR! /usr/bin/git ls-remote -h -t ssh://git@github.com/jameslawson/unicode-bidiclass.git
npm ERR! 
npm ERR! Permission denied (publickey).
npm ERR! fatal: Could not read from remote repository.
npm ERR! 
npm ERR! Please make sure you have the correct access rights
npm ERR! and the repository exists.
npm ERR! 
npm ERR! exited with error code: 128

npm ERR! A complete log of this run can be found in:
npm ERR!     /Users/bluemix/.npm/_logs/2017-08-05T02_55_06_459Z-debug.log

A smaller and faster alternative

I used this library for many years, but I have now found a better alternative, and I would like to share it.

There is a C library called Fribidi, and a WASM port: https://harfbuzz.github.io/harfbuzzjs/fribidi/

This library minified is 112 kB, Zipped 32 kB.

Fribidi WASM is 33 kB, Zipped 7 kB. It is also much faster! Would be great, if the author could mention it in Readme.md :)

Performance

Running the conformance tests can apparently be done in under 30 seconds using Go.
While JavaScript running on Node/V8 is expected to be slower than Go, the tests here take roughly 20 minutes, which is far longer.

To improve time:

  • look at complexities / data structures and find any nasty time complexities
  • consider using multiple cores when running tests
  • improve input/output when running tests (don't spam STDOUT, read the whole file vs buffering, etc.)

Rule L4. Add Mirroring

To achieve full conformance, rule L4 needs to be implemented.
http://www.unicode.org/reports/tr9/#L4

Before mirroring, the code needs to check the Bidi_Mirrored property of the character
and see if it is "yes" (Y), meaning "yes, this character is required to be mirrored".

Can use mathiasbynens/unicode-9.0.0 to both check if mirroring is required, and to also find
the mirroring glyph.

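// Regular expression matching characters whose Bidi_Mirrored property is Yes
const mirroredRegex = require('unicode-9.0.0/Binary_Property/Bidi_Mirrored/regex');
// Mirrored counterpart of U+00AB
const mirrored = require('unicode-9.0.0/Bidi_Mirroring_Glyph').get(0xAB);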

Remove dependency on Node.js

Would it be possible to remove the Node.js-specific code? I think such a library can be made with simple strings, arrays and numbers; there is no need to have I/O etc. inside it. I would like to use a similar tool in the browser environment.

You use many third-party libraries. I am sure that writing your own functions could improve the performance. These libraries are often stupid-proof (e.g. they check every time whether the input parameter is a string, a number, or undefined), which ruins the performance.

Test Case No. 447029 (91%)

{"levels":[1,1,2,2],"reorder":["2","3","1","0"],"bidiTypes":["ET","CS","ET","EN"],"bitset":4}
INPUT: List [ "ET", "CS", "ET", "EN" ] LEVEL = 1, AUTO = false
ACTUAL OUTPUT: List [ 2, 2, 2, 2 ]
EXPECTED OUTPUT: List [ 1, 1, 2, 2 ]

Improve Naive Classification of Brackets

In src/util/constant.js, the following functions need to use Unicode database data:

  • bracketType
  • oppositeBracket

Right now they naively only consider (), {} and [].

The approach to mirroring

I am using your library to find embedding levels and the reordering permutation for Unicode strings. Then, I need to render the string using the available TTF / OTF file.

When a character run has an odd embedding level (goes from right to left), some characters in it must be mirrored (brackets, >, <, ...). I can do it on a geometry level, but I must know, which characters should be mirrored.

The information about mirroring is here. I expect, that a similar database is already built into this library. Can it be exposed to users somehow?

Another approach could be, if this library replaced actual characters in a string with a mirrored version of the character, but such "dual" character does not always exist.

What is the correct way to do it?

Break out N0 - N2 from "weak"

Separate weak rules W1-W7 from neutral rules N0 - N2 in file resolvedWeak.js.
It doesn't make much sense to call rules N0 - N2 "weak" as such.

Conformance tests aren't running all cases

The conformance tests aren't running all the test cases. The bit mask in BidiTest.txt data is a mask of up to 3 different runs that should be made whereas the test cases are only running one case.

In ./test/conform/bidiclass/runner.js, instead of

    const bitset = test.bitset;
    const paragraphLevel = ((bitset & 2) > 0) ? 0 : 1;
    const autoLTR = ((bitset & 1) > 0) ? true : false;

it should be something like this:

    for (let bit = 1; bit < 8; bit = bit << 1)
    {
      if ((test.bitset & bit) == 0)
        continue;

      const paragraphLevel = ((bit & 2) > 0) ? 0 : 1;
      const autoLTR = ((bit & 1) > 0) ? true : false;

      // ... run the test case with paragraphLevel and autoLTR ...
    }
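
The same idea can be written as a standalone helper that expands a bitset into the list of runs to perform. This is only a sketch: runsForBitset is a hypothetical name, the bit meanings (1 = auto-LTR, 2 = LTR, 4 = RTL) come from the header of BidiTest.txt, and the two expressions are copied from the snippet above.

```javascript
// Hypothetical helper for a test runner; not code from this repository.
// BidiTest.txt bitset: 1 = auto-LTR, 2 = LTR, 4 = RTL.
function runsForBitset(bitset) {
  const runs = [];
  for (let bit = 1; bit < 8; bit <<= 1) {
    if ((bitset & bit) === 0) continue; // this paragraph mode is not requested
    runs.push({
      paragraphLevel: (bit & 2) > 0 ? 0 : 1,
      autoLTR: (bit & 1) > 0,
    });
  }
  return runs;
}

console.log(runsForBitset(7).length); // 3
console.log(runsForBitset(4));        // [ { paragraphLevel: 1, autoLTR: false } ]
```

Each returned entry corresponds to one call of the resolver, so a test line with bitset 7 is checked three times instead of once.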

Test Case No. 65879

INPUT: List [ "R", "RLI", "R" ] LEVEL = 0, AUTO = false
ACTUAL OUTPUT: List [ 1, 1, 1 ]
EXPECTED OUTPUT: List [ 1, 0, 1 ]

Current Code Behaviour:

  • Embedding Levels = [0, 0, 1]
  • Level Runs = [[R, RLI], [R]]
  • Isolating Run Sequences = [[R, RLI], [R]] with sos/eos's of [(L, R), (R,R)]

By N1. The RLI is between R and eos = R, so is changed to R.
After types have been resolved, we have [R,R,R]. Implicit levels give [1,1,1].
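
For comparison, rule X10 of UAX #9 says that eos is taken from the higher of the last character's level and the level of the character after the sequence, except that the paragraph embedding level is used instead when there is no following character or when the sequence ends in an isolate initiator with no matching PDI. A minimal sketch of that rule follows; boundaryType and its parameters are hypothetical names, not code from this repository.

```javascript
// Hypothetical helper sketching the sos/eos choice in rule X10.
// edgeLevel:         level of the first (sos) or last (eos) character of the sequence
// neighbourLevel:    level of the adjacent character outside the sequence, or undefined
// useParagraphLevel: true when the sequence ends in an isolate initiator lacking a PDI
function boundaryType(edgeLevel, neighbourLevel, paragraphLevel, useParagraphLevel) {
  const other = (useParagraphLevel || neighbourLevel === undefined)
    ? paragraphLevel
    : neighbourLevel;
  return Math.max(edgeLevel, other) % 2 === 1 ? 'R' : 'L';
}

// Sequence [R, RLI] from this test case: the RLI has level 0, the next
// character has level 1, and the RLI has no matching PDI.
console.log(boundaryType(0, 1, 0, true));  // L
console.log(boundaryType(0, 1, 0, false)); // R
```

If the rule is read this way, eos for [R, RLI] would be L rather than R, so N1 would not turn the RLI into R, which would be consistent with the expected level of 0 for that character.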

Add API

Expose an API.
Probably expose resolve/resolvedLevels.js and resolve/reorderedLevels.js

Multiple NSMs after a bracket pair

Fix rule N0.

          // TODO: fix this so that _a SEQUENCE of NSMs_ are all changed to the strong type
          //       eg. ( NSM NSM NSM NSM ==>  ( L L L L
          //       eg. ) NSM NSM NSM NSM ==>  ) L L L L
          //       right now _ONLY ONE_ NSM that follows is changed to the strong type

There may be more than one NSM adjacent to a bracket that should be changed to the same strong type. One fix is to use a for loop (with a break) to convert the whole chain of NSMs.
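
A minimal sketch of such a loop is shown below. The function name, its parameters and the array-of-strings representation are assumptions for illustration only; the actual data structures in this repository may differ.

```javascript
// Hypothetical sketch of the N0 follow-up step.
// types:         current bidi types (mutated in place)
// originalTypes: types before W1, used to recognise characters that were NSM
// bracketIndex:  index of a paired bracket already set to strongType by N0
function propagateToFollowingNSMs(types, originalTypes, bracketIndex, strongType) {
  for (let i = bracketIndex + 1; i < types.length; i += 1) {
    if (originalTypes[i] !== 'NSM') break; // the chain of NSMs has ended
    types[i] = strongType;
  }
  return types;
}

const original = ['ON', 'NSM', 'NSM', 'NSM', 'L'];
const current = ['L', 'ON', 'ON', 'ON', 'L']; // bracket at index 0 already set to L
console.log(propagateToFollowingNSMs(current, original, 0, 'L'));
// [ 'L', 'L', 'L', 'L', 'L' ]
```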

Normalize is ES6

src/main.js: String.prototype.normalize() is ES6 (ECMAScript 2015). Older browsers, including Internet Explorer, do not support it.
Find a way to gracefully degrade and/or add a note in README.

Zero-width space

Hi, I am trying to reorder zero-width space characters (U+200B):

var cdp = [8203, 8203, 8203, 8203, 8203, 8203, 8203, 10];
var lvls = UnicodeBidirectional.resolve(cdp);  // lvls = ["x", "x", "x", "x", "x", "x", "x", 0]
var perm = UnicodeBidirectional.reorderPermutation(lvls) // perm = [7]

Is this expected behavior? I was quite surprised that I started with 8 characters but received a permutation of 1 character (which is not even a permutation), so my program crashed.
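
Because entries with level 'x' carry no level, one defensive option for callers is to work out which positions are visible before indexing into any result. The helper below is a sketch with a hypothetical name; it is not part of the library's API.

```javascript
// Hypothetical guard for callers: indices of entries that have a numeric level.
function visibleIndices(levels) {
  const indices = [];
  levels.forEach((level, i) => {
    if (level !== 'x') indices.push(i);
  });
  return indices;
}

const reportedLevels = ['x', 'x', 'x', 'x', 'x', 'x', 'x', 0];
console.log(visibleIndices(reportedLevels)); // [ 7 ]
```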

Test Case No. 490841 (99%)

{"levels":[124,125,125,125,125,124],"reorder":["62","73","71","66","64","75"],"bidiTypes
":["LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","
LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE"
,"LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LR
E","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","LRE","
LRE","LRE","LRE","LRE","ON","RLO","L","LRE","RLI","LRE","RLE","LRO","RLO","PDI","PDF","L
","PDF","ON"],"bitset":7}
INPUT: List [ "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE
", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "
LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE"
, "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "L
RE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE", "LRE",
 "LRE", "ON", "RLO", "L", "LRE", "RLI", "LRE", "RLE", "LRO", "RLO", "PDI", "PDF", "L", "
PDF", "ON" ] LEVEL = 0, AUTO = true
ACTUAL OUTPUT: List [ 124, 124, 124, 124, 124, 124 ]
EXPECTED OUTPUT: List [ 124, 125, 125, 125, 125, 124 ]
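
This case sits right at the depth limit: UAX #9 defines max_depth as 125 (BD2), and the 62 leading LREs bring the embedding level to 124. The sketch below shows only the "next level" computation from rules X2 to X5; nextEmbeddingLevel is a hypothetical name, and the overflow counters that the full algorithm also tracks are left out.

```javascript
const MAX_DEPTH = 125; // max_depth from BD2 in UAX #9

// Least odd (RTL) or even (LTR) level greater than `current`,
// or null when that level would exceed MAX_DEPTH (an overflow).
function nextEmbeddingLevel(current, isRtl) {
  const next = isRtl
    ? (current % 2 === 0 ? current + 1 : current + 2)
    : (current % 2 === 0 ? current + 2 : current + 1);
  return next <= MAX_DEPTH ? next : null;
}

let depthLevel = 0;
for (let i = 0; i < 62; i += 1) depthLevel = nextEmbeddingLevel(depthLevel, false);
console.log(depthLevel);                            // 124
console.log(nextEmbeddingLevel(depthLevel, false)); // null (126 would exceed 125)
console.log(nextEmbeddingLevel(depthLevel, true));  // 125
```

So after the 62 LREs, a further LRE overflows while an RLO is still valid at level 125, which is in line with the 125s in the expected output.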
