open-i18n / rust-unic Goto Github PK

View Code? Open in Web Editor NEW

234.0 17.0 24.0 14.54 MB

UNIC: Unicode and Internationalization Crates for Rust

Home Page: https://crates.io/crates/unic

License: Other

Rust 98.99% Shell 1.01%

unicode internationalization text-processing crates rust cldr locale-data unic unicode-characters unicode-algorithms

rust-unic's Introduction

UNIC: Unicode and Internationalization Crates for Rust

https://github.com/open-i18n/rust-unic

UNIC is a project to develop components for the Rust programming language to provide high-quality and easy-to-use crates for Unicode and Internationalization data and algorithms. In other words, it's like ICU for Rust, written completely in Rust, mostly in safe mode, but also benefiting from performance gains of unsafe mode when possible.

See UNIC Changelog for latest release details.

Project Goal

The goal for UNIC is to provide access to all levels of Unicode and Internationalization functionalities, starting from Unicode character properties, to Unicode algorithms for processing text, and more advanced (locale-based) processes based on Unicode Common Locale Data Repository (CLDR).

Other standards and best practices, like IETF RFCs, are also implemented, as needed by Unicode/CLDR components, or common demand.

Project Status

At the moment UNIC is under heavy development: the API is updated frequently on master branch, and there will be API breakage between each 0.x release. Please see open issues for changes planed.

We expect to have the 1.0 version released in 2018 and maintain a stable API afterwards, with possibly one or two API updates per year for the first couple of years.

Design Goals

Primary goal of UNIC is to provide reliable functionality by way of easy-to-use API. Therefore, new components are added may not be well-optimized for performance, but will have enough tests to show conformance to the standard, and examples to show users how they can be used to address common needs.
Next major goal for UNIC components is performance and low binary and memory footprints. Specially, optimizing runtime for ASCII and other common cases will encourage adaptation without fear of slowing down regular development processes.
Components are guaranteed, to the extend possible, to provide consistent data and algorithms. Cross-component tests are used to catch any inconsistency between implementations, without slowing down development processes.

Components and their Organization

UNIC Components have a hierarchical organization, starting from the unic root, containing the major components. Each major component, in turn, may host some minor components.

API of major components are designed for the end-users of the libraries, and are expected to be extensively documented and accompanies with code examples.

In contrast to major components, minor components act as providers of data and algorithms for the higher-level, and their API is expected to be more performing, and possibly providing multiple ways of accessing the data.

The UNIC Super-Crate

The unic super-crate is a collection of all UNIC (major) components, providing an easy way of access to all functionalities, when all or many are needed, instead of importing components one-by-one. This crate ensures all components imported are compatible in algorithms and consistent data-wise.

Main code examples and cross-component integration tests are implemented under this crate.

Major Components

unic-char: Unicode Character Tools.
unic-ucd: Unicode Character Database (UAX#44).
unic-bidi: Unicode Bidirectional Algorithm (UAX#9).
unic-normal: Unicode Normalization Forms (UAX#15).
unic-segment: Unicode Text Segmentation Algorithms (UAX#29).
unic-idna: Unicode IDNA Compatibility Processing (UTS#46).
unic-emoji: Unicode Emoji (UTS#51).

Applications

unic-cli: UNIC Command-Line Tools

Code Organization: Combined Repository

Some of the reasons to have a combined repository these components are:

Faster development. Implementing new Unicode/i18n components very often depends on other (lower level) components, which in turn may need adjustments—expose new API, fix bugs, etc—that can be developed, tested and reviewed in less cycles and shorter times.
Implementation Integrity. Multiple dependencies on other components mean that the components need to, to some level, agree with each other. Many Unicode algorithms, composed from smaller ones, assume that all parts of the algorithm is using the same version of Unicode data. Violation of this assumption can cause inconsistencies and hard-to-catch bugs. In a combined repository, it's possible to reach a better integrity during development, as well as with cross-component (integration) tests.
Pay for what you need. Small components (basic crates), which cross-depend only on what they need, allow users to only bring in what they consume in their project.
Shared bootstrapping. Considerable amount of extending Unicode/i18n functionalities depends on converting source Unicode/locale data into structured formats for the destination programming language. In a combined repository, it's easier to maintain these bootstrapping tools, expand coverage, and use better data structures for more efficiency.

Documentation

Unicode and Rust
UNIC Versioning
UNIC Unicode API
UNIC API Guideline
UNIC API Reference (autogenerated on docs.rs)

How to Use UNIC

In Cargo.toml:

[dependencies]
unic = "0.9.0"  # This has Unicode 10.0.0 data and algorithms

And in main.rs:

extern crate unic;

use unic::ucd::common::is_alphanumeric;
use unic::bidi::BidiInfo;
use unic::normal::StrNormalForm;
use unic::segment::{GraphemeIndices, Graphemes, WordBoundIndices, WordBounds, Words};
use unic::ucd::normal::compose;
use unic::ucd::{is_cased, Age, BidiClass, CharAge, CharBidiClass, StrBidiClass, UnicodeVersion};

fn main() {

    // Age

    assert_eq!(Age::of('A').unwrap().actual(), UnicodeVersion { major: 1, minor: 1, micro: 0 });
    assert_eq!(Age::of('\u{A0000}'), None);
    assert_eq!(
        Age::of('\u{10FFFF}').unwrap().actual(),
        UnicodeVersion { major: 2, minor: 0, micro: 0 }
    );

    if let Some(age) = '🦊'.age() {
        assert_eq!(age.actual().major, 9);
        assert_eq!(age.actual().minor, 0);
        assert_eq!(age.actual().micro, 0);
    }

    // Bidi

    let text = concat![
        "א",
        "ב",
        "ג",
        "a",
        "b",
        "c",
    ];

    assert!(!text.has_bidi_explicit());
    assert!(text.has_rtl());
    assert!(text.has_ltr());

    assert_eq!(text.chars().nth(0).unwrap().bidi_class(), BidiClass::RightToLeft);
    assert!(!text.chars().nth(0).unwrap().is_ltr());
    assert!(text.chars().nth(0).unwrap().is_rtl());

    assert_eq!(text.chars().nth(3).unwrap().bidi_class(), BidiClass::LeftToRight);
    assert!(text.chars().nth(3).unwrap().is_ltr());
    assert!(!text.chars().nth(3).unwrap().is_rtl());

    let bidi_info = BidiInfo::new(text, None);
    assert_eq!(bidi_info.paragraphs.len(), 1);

    let para = &bidi_info.paragraphs[0];
    assert_eq!(para.level.number(), 1);
    assert_eq!(para.level.is_rtl(), true);

    let line = para.range.clone();
    let display = bidi_info.reorder_line(para, line);
    assert_eq!(
        display,
        concat![
            "a",
            "b",
            "c",
            "ג",
            "ב",
            "א",
        ]
    );

    // Case

    assert_eq!(is_cased('A'), true);
    assert_eq!(is_cased('א'), false);

    // Normalization

    assert_eq!(compose('A', '\u{030A}'), Some('Å'));

    let s = "ÅΩ";
    let c = s.nfc().collect::<String>();
    assert_eq!(c, "ÅΩ");

    // Segmentation

    assert_eq!(
        Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
        &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
    );

    assert_eq!(
        Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
        &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
    );

    assert_eq!(
        GraphemeIndices::new("a̐éö̲\r\n").collect::<Vec<(usize, &str)>>(),
        &[(0, "a̐"), (3, "é"), (6, "ö̲"), (11, "\r\n")]
    );

    assert_eq!(
        Words::new(
            "The quick (\"brown\") fox can't jump 32.3 feet, right?",
            |s: &&str| s.chars().any(is_alphanumeric),
        ).collect::<Vec<&str>>(),
        &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
    );

    assert_eq!(
        WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
        &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
    );

    assert_eq!(
        WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
        &[
            (0, "Brr"),
            (3, ","),
            (4, " "),
            (5, "it's"),
            (9, " "),
            (10, "29.3"),
            (14, "°"),
            (16, "F"),
            (17, "!")
        ]
    );
}

You can find more examples under examples and tests directories. (And more to be added as UNIC expands...)

License

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Code of Conduct

UNIC project follows The Rust Code of Conduct. You can find a copy of it in CODE_OF_CONDUCT.md or online at https://www.rust-lang.org/conduct.html.

rust-unic's People

Contributors

Stargazers

Watchers

rust-unic's Issues

API to override char property values for Private-Use chars

From http://www.unicode.org/faq/private_use.html:

Private-use characters are code points whose interpretation is not specified by a character encoding standard and whose use and interpretation may be determined by private agreement among cooperating users. Private-use characters are sometimes also referred to as user-defined characters (UDC) or vendor-defined characters (VDC).

One should not expect the rest of an operating system to override the character properties for private-use characters, since private use characters can have different meanings, depending on how they originated. In terms of line breaking, case conversions, and other textual processes, private-use characters will typically be treated by the operating system as otherwise undistinguished letters (or ideographs) with no uppercase/lowercase distinctions.

Basically, a system can assign its own internal meaning to PUA characters, and with meaning comes the character properties. UNIC should allow override of property values for PUA characters.

How we can do that in Rust while maintaining Cargo package boundaries could be tricky and needs some pondering.

What assumptions can we make?

I assume it's safe to assume that any override would affect any and all instances of UNIC libraries in existence, even when only used internally by some dependencies.
The above comes with the assumption that non of the dependent libraries is assigning a meaning to any PUA char.
And, since it's logical to have libraries that assign PUA chars, to be used by other libraries, we need to make sure parallel assignments do not conflict in anyway; meaning that either the codepoints don't overlap, or if they do, all the char property values overridden are exactly the same.

I think this is one of those areas that would require cutting edge features of rustc. We need to investigate more on implementation solutions.

In addition:

We may also want to provide a query method for PUAs, in the same level as the definition. In other words, if we use compiler plugins to assign PUAs, we should provide a compile-time query method for the current state of PUA assignments.
We need to make sure any sensitive area, like Security Mechanisms, blocks any PUA on its own boundary. I believe there are parts of the specs, but we need to double-check this.

Implement CharRange type

[assign: @CAD97]

Replace usages of (char, char) as an inclusive range with a custom Range struct adapted for use on characters.

CharProperty::display() ?

Do we want to add display() as an instance method to CharProperty API? We already have impl fmt::Display for it, but returning &str would be useful in cases that the display string is not expected to be matched with formatting.

I have seen both cases in third-party libraries.

What do you think?

Implement unic-ucd-ident

Char Properties:

ID_Start/ID_Continue
XID_Start/XID_Continue
Pattern_Syntax/Pattern_White_Space

References:

Cross-component tests:

http://www.unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers

Similar crates:

https://github.com/unicode-rs/unicode-xid

Implement Emoji Component

The Emoji component will provide access to character properties defined by the Emoji spec, and API for handling Emojis, like API to construct Emojis, parse strings and detect Emojis, and being able to do conversions between emoji-style and text-style.

References

Homepage: http://unicode.org/emoji/
Spec: http://unicode.org/reports/tr51/index.html
Data: http://www.unicode.org/Public/emoji/

[ucd/normal] Characters 가 through 힣 have the wrong Decomposition_Type

See #27 which adds a failing test.

11172 test cases failed! (1100892 passed) {

0: Fail { line_num: Some(234), char: '가', exp_dt: Some(Canonical), actual_dt: None }

...

11171: Fail { line_num: Some(234), char: '힣', exp_dt: Some(Canonical), actual_dt: None }

}

Relevant line from DecompositionType.txt:

AC00..D7A3    ; Canonical # Lo [11172] HANGUL SYLLABLE GA..HANGUL SYLLABLE HIH

IDNA is still using python generated tables

The generation code for simplified tables exists under unic-gen::generate::idna, but I still need to coerce unic-idna into using the new tables.

This would also be a good time to address #51.

Add Conformance Test to ucd-normal component

Use the extracted data file for DecompositionType property to test conformance of the implementation.

Test source file:
http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedDecompositionType.txt

NOTE: The test should also verify that the characters not listed in the file get None value for the property.

`EnumeratedCharProperty`'s `FromStr` impl should be fully UAX44-LM3 compliant

Here's the current implementation:

https://github.com/behnam/rust-unic/blob/99164db5dd8f909cd5b491dcd516b13b3356c6ea/unic/char/property/src/macros.rs#L111-L129

It does a quick check for simple equivalence against the abbreviated or long aliases, then does a case insensitive compare. However, for proper compliance we should comply with UAX44-LM3:

5.9.3 Matching Symbolic Values

Property aliases and property value aliases are symbolic values. When comparing them, use loose matching rule UAX44-LM3.

UAX44-LM3. Ignore case, whitespace, underscore (_), hyphens, and any initial prefix string "is".

"linebreak" is equivalent to "Line_Break" or "Line-break"

"lb=BA" is equivalent to "lb=ba" or "LB=BA"

"Script=Greek" is equivalent to "Script=isGreek" or "Script=Is_Greek"

Loose matching is generally appropriate for the property values of [our EnumeratedCharacterProperty], which have symbolic aliases defined for their values. Loose matching should not be done for the property values of String properties, which do not have symbolic aliases defined for their values; exact matching for String property values is important, as case distinctions or other distinctions in those values may be significant.

For loose matching of symbolic values, an initial prefix string "is" is ignored. The reason for this is that APIs returning property values are often named using the convention of prefixing "is" (or "Is" or "Is_", and so forth) to a property value. Ignoring any initial "is" on a symbolic value during loose matching is likely to produce the best results in application areas such as regex. Removal of an initial "is" string for a loose matching comparison only needs to be done once for a symbolic value, and need not be tested recursively. There are no property aliases or property value aliases of the form "isisisisistooconvoluted" defined just to test implementation edge cases.

Existing and future property aliases and property value aliases are guaranteed to be unique within their relevant namespaces, even if an initial prefix string "is" is ignored. The existing cases of note for aliases that do start with "is" are: dt=Iso (Decomposition_Type=Isolated) and lb=IS. The Decomposition_Type value alias does not cause any problem, because there is no contrasting value alias dt=o (Decomposition_Type=olated). For lb=IS, note that the "IS" is the entire property value alias, and is not a prefix. There is no null value for the Line_Break property for it to contrast with, but implementations of loose matching should be careful of this edge case, so that "lb=IS" is not misinterpreted as matching a null value.

FromStr should not allocate heap memory so using String returning APIs isn't a (good) option. This comparison can, however, be done relatively simply using character iterators, stripping ignored characters, and AsciiExt::eq_ignore_ascii_case.

Property aliases are guaranteed to be ASCII-only so using ASCII algorithms rather than Unicode-aware ones is safe.

Expose versions of non-UCD data

At the moment, we only have UNICODE_VERSION, which exposes the version of UCD data in each component. We need to have API and data for other versions, like IDNA Version, Emoji Version, etc.

Enact The Not Rocket Science Rule Of Software Engineering

We should enforce The Not Rocket Science Rule of Software Engineering with the help of bors.

The benefit of using bors is well hashed out in the Rust community. Let's bring the benefits to UNIC!

This does not have to be done right away, but I would suggest integrating with bors before 1.0.0.

Cargo package metadata fields for all crates

We need to figure out a good way to have all Cargo package metadata fields set, since not all packages have README.md and LICENCE-* files next to them.

Unfortunately, it's not that clear yet how package.readme is being used in a package yet. This needs follow up with the Cargo team.

But, if we assume that package.readme is going to be used as a longer/formatted description of the crate package by some packaging system (cargo or dist-level systems), then we can add basics README.md files containing only the basics.

Implement Serde

At the moment, only some components have serde support. We should start implementing serde everywhere, as an optional dependency.

Optimize unic-ucd-case data tables

#157 implemented UCD Case character properties, which are:

Lowercase
Uppercase
Cased
Case_Ignorable
Changes_When_Lowercase
Changes_When_Uppercase
Changes_When_Titlecase
Changes_When_Casefolded
Changes_When_Casemapped

However, based on their definitions, we can reduce the number of tables we store and compose the result from other properties here and the General_Category property (unic-ucd-category). We want to do so here as a way to optimize data table footprint of the library.

And, when doing so, we should move the now-unused auto-generated tables to case/tests/tables/ and use them to match the behavior of the composition with direct lookup. This way we make sure the upstream data and our implementations are consistent with the spec.

Implement locale-based collation

This would depend on having the locale tags and CLDR data processing in place and should be a good CLDR-level target.

Expand components/ucd/tests/category_tests.rs

We have a cross-component test in components/ucd/tests/category_tests.rs that checks values of the Bidi_Class property against General_Category property, based on UAX#9's Table 4. Bidirectional Character Types.

Now that we have component/ucd/category, we can expand the test to also cover General_Category values.

Determine property instance name API for all Properties

We already have abbr_name, long_name, and human_name fn returning &'static str on EnumeratedCharProperty and BinaryCharProperty.

We should decide on if and how we want to expand these fn to NumericCharProperty and CustomCharProperty.

Because these have procedural name derivation from data, they will not be able to return &'static str, and will likely need to follow a fmt pattern or return String. Along with #144, we don't want to move further away from #![no_std] support in the UCD, so we probably want to favor the fmt API surface, though that is up for debate.

Implement unic-ucd-case and unic-case

Char Properties:

Lowercase
Uppercase
Cased
Case_Ignorable
Changes_When_*

Implements case-folding algorithms:
(TBD.)

Related crates:

[unic-segment] Implement Setence Boundry algorithm

References:

UAX #29: Unicode Text Segmentation http://www.unicode.org/reports/tr29/

#174 has implementation of the properties needed (e.g. Setence_Break) and the other two segmentation algorithms.

Add allocation free API

I'd like to see a generic "transformation" or "streaming" API that can be used to transform data across the subcrates, possibly without requiring a heap allocation if some mutable storage is already allocated (eg. that can be used for IDNA transformations as well as for normalization, giving the two packages a consistent API). I've recently been playing with adapting the Go text/transform APIs, but as you can imagine it doesn't exactly map cleanly to Rust, being a drastically different language (Initial trait and experimentation can be found here).

Instead, I suspect having some sort of io::Read/io::Write impl would work better. This would allow us to transform into pre-allocated space, as opposed to the current implementation (in IDNA at least) that returns a String which must always be heap allocated when IDNA is called. It might even let us wrap several transforms up into a single object without doing multiple allocations, which would be useful when implementing PRECIS (something I've also been experimenting with lately).

It would be nice to start discussing such an API in this issue if you think such a thing would be desirable. Thanks for your work on this!

[ucd-name] Complete implementation for Unicode Name Property

We have added a basic partial implementation for character names. We need a few more sources (data-based and algorithmic) to complete the implementation.

From http://www.unicode.org/reports/tr44/#About_Property_Table

Jamo.txt must also be used, and the Name property for CJK unified ideographs, Tangut ideographs, and Nushu ideographs is derived by rule.

Source data:

http://www.unicode.org/Public/UCD/latest/ucd/Jamo.txt

From the spec (http://www.unicode.org/versions/Unicode10.0.0/ch04.pdf) (Section 4.8 Name), NR3 and NR4 are already implemented, and we still need implement these rules:

NR1: Hangul Jamos
NR2: Ideographics

From Table 4-8. Name Derivation Rule Prefix Strings:

Range       Rule Prefix String
AC00..D7A3   NR1 “hangul syllable”
3400..4DB5   NR2 “cjk unified ideograph-”
4E00..9FEA   NR2 “cjk unified ideograph-”
20000..2A6D6 NR2 “cjk unified ideograph-”
2A700..2B734 NR2 “cjk unified ideograph-”
2B740..2B81D NR2 “cjk unified ideograph-”
2B820..2CEA1 NR2 “cjk unified ideograph-”
2CEB0..2EBE0 NR2 “cjk unified ideograph-”
17000..187EC NR2 “tangut ideograph-”
1B170..1B2FB NR2 “nushu character-”
F900..FA6D*  NR2 “cjk compatibility ideograph-”
FA70..FAD9   NR2 “cjk compatibility ideograph-”
2F800..2FA1D NR2 “cjk compatibility ideograph-”

NOTE: Code Point Labels, as defined later in that chapter shall be implemented with its own API. Let's not include those in this issue, and we get to them after we're done here.

[ucd/normal] [proposal] Allow ucd-normal to use mark information from ucd-category

unic-ucd-normal has its own table of symbols with a general general category of Mark. This duplication from unic-ucd-category is by design because ucd/normal does not need the entire classification that ucd/category exposes.

However, when including unic-ucd or just happening to want ucd/category as well, this is then duplicated information.

Hence, the proposal, which has three resolution paths:

Add a feature (say, ucd-category) to unic-ucd-normal that, if on, does not compile the libraries own category mark table, and instead implements is_combining_mark as GeneralCategory::of(character).is_mark()
Through benchmarks and other measurements, show that adding ucd/category as an unconditional dependency of ucd/normal is a acceptable level of bloat to avoid data duplication
Leave it the way that it is and just test it into oblivion to make sure the implementations are in agreement (do this anyway)

Of which I suggest the first.

Implement unic-ucd-script

Provides access to:

Script
Script_Extensions

References:

Consistency tests:

Test relations between Script and Script_Extensions impl.

Cross-component tests:
TBD.

Similar crates:

https://github.com/servo/unicode-script

Is unic-ucd-indent available somewhere?

Hi! I'd love to use unic-ucd-indent crate (https://github.com/behnam/rust-unic/tree/master/unic/ucd/ident) for `is_pattern_whitespace property, but I am not able to find it on crates.io. Is it available somewhere?

[ucd/bidi] BidiClass should use the long names

The Bidi_Class_Values enum should use the long names of the bidi classes, for clarity and to fit in better with the rest of the ucd api and the Rust ecosystem.

This can probably be bikeshedded ad nauseum, but defaulting to the descriptive names seems the better idea and §5.8.1 Property Aliases tells us that the long symbolic names are the preferred aliases. (Cases like Age where we can provide a more meaningful struct rather than a enum excepted, of course.)

We could offer a alias mod or such which provides pub use bindings to the abbreviated symbolic name. PropertyValueAliases.txt could (in theory) be used to generate and/or test the aliases.

Implement UCD EA-Width

Spec:

Implements:

East_Asian_Width

Depends on:

Emoji_Presentation
Regional_Indicator

Similar existing crates:

https://crates.io/crates/unicode-width

Improve unic-ucd-bidi and ucd-bidi

We need to add more char properties to unic-ucd-bidi and use them in ucd-bidi to support more features like bidi mirroring of chars.

Char Properties:

Bidi_Control
Bidi_Mirrored
Bidi_Mirroring_Glyph

Drop dependency on rustc_test

Background: rust-lang/rust#43683

Integration tests using rustc_test, although look nicer in the actual test file, aren't that organic (files cannot be just copy-pasted) and can break regularly, because of the changes to rustc internal.

For these kinds of tests, we have an alternate solution which collects failed test cases and throws one panic at the end, if there are any fails, with some useful information which can be catched with a #[should_panic(expected = "... test cases failed! (... passed)")] pattern.

Example: https://github.com/behnam/rust-unic/blob/master/unic/bidi/tests/conformance_tests.rs#L25-L44

This approach also has the benefit that it allows temporary failures for big tests, meaning that you don't have to get everything in perfect shape before you can commit your big test to master. We've been using this for the non-100%-conforming bidi module, enabling gradual improvements over time.

Release UNIC-0.5

Unicode_API: Add new UCD properties supported.
Update package versions and publish.

[ucd] Expand API for resolved Character Properties

Tracking the additions we like to have for UCD, specially for unic-ucd-common.

ucd-common:

is_mark()
is_graphic()
is_punct()
is_symbol()

Release 0.7.0

We've had a good deal of changes made since the last release.

Could we get version 0.7.0 released?

Bumped minimum Rust to 1.22
A lot of internal structure updates
Added
- unic/emoji/char
- unic/segment
- unic/ucd/common
- unic/ucd/ident
- unic/ucd/segment

Write up CONTRIBUTING.md

Should write up CONTRIBUTING.md with the basics or how code is organized and other development guides.

Implement unic-ucd-segment and unic-segment

References:

UAX #29: Unicode Text Segmentation http://www.unicode.org/reports/tr29/

Defines Char Properties:

Grapheme_Cluster_Break
Word_Break
Sentence_Break

Needs Char Properties:

General_Category
Alphabetic

Related Char Properties that are not needed in algorithm implementation, therefore can be made optional feature in unic-ucd-segment, or implemented in a separate component:

Grapheme_Base
Grapheme_Extend

Tests:

http://www.unicode.org/reports/tr44/tr44-4.html#Segmentation_Test_Files

Similar crates:

https://crates.io/crates/unicode-segmentation

Drop all direct deps on /data/

Now that we have very simple and easy ways to generate data tables from source data files, it's better to replace almost all direct reads for files under /data/ from /unic/**/tests/**/*.rs with conversion of test data into test data tables (under <component>/tests/tables/, in an RSV format), and using the test data tables for running the tests.

This will allow us to package more (integration) tests, which is a good practice because of the cargo packages being the source for distro packages.

For some test data sources, the complicated format of the test file may actually make it less useful to apply this pattern. So, we should only do this when it's not too much overhead.

CanonicalCombiningClass should probably be a newtype

(Tuple struct, whatever)

Blocking on associated consts (we like that don't we), here's the proposed design:

struct CanonicalCombiningClass(u8);

impl CanonicalCombiningClass {
    pub const NOT_REORDERED = CanonicalCombiningClass(0);
    pub const OVERLAY = CanonicalCombiningClass(1);
    pub const NUKTA = CanonicalCombiningClass(7);
    // and so on
}

impl CanonicalCombiningClass {
    pub fn of(ch: char) -> CanonicalCombiningClass;
}

impl CanonicalCombiningClass {
    fn is_not_reordered(&self) -> bool;
}

Since CanonicalCombiningClass should is clearly a distinct concept from a u8, it should be a distinct type. We should use Rust's mechanisms to express that with this zero cost abstraction.

[char/property] Document un-enforced expectations

Background: #113 (comment)

Actually, since we implement of() on the main type, that takes priority over traits and there won't be any conflicts. But, unfortunately, that's one of things we cannot enforce on other users via type tools. So, I guess we need to document this expectation in the docs.

Need to document how to define of(), since PartialCP and CompleteCP overlap.

CharProperty: API to get all possible values, if applicable

Some properties, like GeneralCategory and BidiClass, are a flat enum, and having a type function returning all possible values of them can help with writing tests.

For example, having an iterator over GC values can reduce this test to 20 execs of the loop block, not all Unicode code points: https://github.com/behnam/rust-unic/blob/master/unic/ucd/category/tests/coverage_tests.rs

[ucd] Make FromStr follow UAX44-LM3 for char props

Rust names for aliases defined in the Unicode Character Database will be consistent with the formal long aliases under UAX44-LM3. This is an invariant and helps for API discovery and navigability.

Rust names will follow Rust naming conventions. This is an invariant and helps for API discovery navigability.

UCD aliases for properties are given by PropertyAliases.txt. ACD aliases for property values are given by [PropertyValueAliases.txt].

The question then, is how to deirive the rust name from the long alias.

99.9% of long aliases in PropertyValueAliases.txt are of form Long_Name. For those it is clear that the algoritm for long alias to rust name is just:

let rust_name = long_name.chars().filter(|c| c != '_').collect::<String>()

But the 0.1% is Decomposition_Type=Nobreak (dt=Nb).

If we apply the above algorithm, we get DecompositionType::Nobreak. However, it might be more in line with Rust API guidelines to name it DecompositionType::NoBreak, which is still equivalent under UAX44-LM3 (or even the subset of just case insensitivity).

Do we allow this less-strict transformation between the formal long alias and the rust alias, or do we stick to the simple mapping?

Enhance CharProperty API for fetching values

Add API to CharProperty types to fetch property values for CharRange, efficiently.

Also, for the current API, of(ch: char), we can introduce locality caching of the last-seen range in the table. But, this would work in most real-world cases only if we also have a shortcut for ASCII range that does not change the locality.

Replace most String-returning methods with Display impls

In #168 I found just two different methods named display which simply call format! to return a string. The Display impl then simply prints this allocated string.

Instead of doing this, we should put the necessary formatting into a Display impl and let the user call to_string if the want a string version. There can even be Into<String> impls if necessary. This avoids unnecessary allocations, and also brings us closer to having a more allocation-free API.

I've noticed a lot of unnecessary allocations in this library and this would be an easy way to fix that.

Organizing benching code

In unic-bidi (and going back to the source unicode-bidi crate), I had added bench_it feature to not run benching on every single test run.

In travis.yml, benching is only run once and on the nightly rustc.

I think we did this because #[bench] was not stable yet. I think it's got stabilized now. Is that so?

Also, it is common to have simple data-less benches inside the modules, like unit tests, and not as an integration thing. Should we do this, or not?

Exclude tools, data and data-dependent tests/benches in packages

We should use the Cargo exclude option to not publish data files to crates.io, as they are usually very large and almost of no use for 99.99% of the callers.

See servo/unicode-bidi#43 for report issue with publishing tests without data when published crate is being used in repackaging.

See also rust-lang/cargo#4268 for current limits of the exclude syntax, and the changes coming.

Mapping the UCD into unic-ucd subcrates

UAX44 § 5.1 Property Index gives a list of UCD properties. For convenience, I have reproduced below those which are intended for exposure in library APIs. This issue will serve as a tracking list for exposing those. Each property is also given a type in UAX44 § 5.3 Property Definitions (one of Catalog, Enumeration, Binary, String, Numeric, or Miscellaneous). For definitions of those, see UAX44 § 5.2. The type is also included in the below table for ease of reference.

Property Index

General

Case

Numeric

Numeric_Value (Numeric)
Numeric_Type (Enumeration)
Hex_Digit (Binary)
ASCII_Hex_Digit (Binary)

Normalization

Canonical_Combining_Class (Numeric)
Decomposition_Type (Enumerated)
NFC_Quick_Check (Enumerated)
NFKC_Quick_Check (Enumerated)
NFD_Quick_Check (Enumerated)
NFKD_Quick_Check (Enumerated)
NFKC_Casefold (String)
Changes_When_NFKC_Casefolded (Binary)

Shaping and Rendering

Bidirectional

Bidi_Class (Enumerated)
Bidi_Control (Binary)
Bidi_Mirrored (Binary)
Bidi_Mirroring_Glyph (Miscellaneous)
Bidi_Paired_Bracket (Miscellaneous)
Bidi_Paried_Bracket_Type (Enumerated)

Identifiers

CJK

Miscellaneous

These need to be partitioned into subcrates. Some properties clearly fit into one of the crates already implemented or planned. Below is a the list of UCD crates planned, along with the properties which they are most likely to contain. How these properties are exposed is a separate question, which this issue does not intend to address. Properties marked "(??)" are included as this is a logical place to put the property, but needs further consideration.

Note that these crates may include more tables than that listed here. Namely, those which are contributory and thus excluded from this listing.

core

None. The version of Unicode.

age

name

Name
(??) Name_Alias (??)

category

General_Category

block

Block

script

Script
Script_Extensions

normal

Canonical_Combining_Class
Decomposition_Type

normal-quickcheck

NFC_Quick_Check
NFKC_Quick_Check
NFD_Quick_Check
NFKD_Quick_Check

case

Uppercase
Lowercase
Lowercase_Mapping
Titlecase_Mapping
Uppercase_Mapping
Cased
Case_Ignorable

case-quickcheck

Changes_When_Lowercased
Changes_When_Uppercased
Changes_When_Titlecased
Changes_When_Casefolded
Changes_When_Casemapped

grapheme

Grapheme_Base
Grapheme_Link

numeric

Numeric_Value
Numeric_Type
(??) Hex_Digit (??)
(??) ASCII_Hex_Digit (??)

bidi

Bidi_Class
(??) Bidi_Control (??)
(??) Bidi_Mirrored (??)
(??) Bidi_Mirroring_Glyph (??)
(??) Bidi_Paired_Bracket (??)
(??) Bidi_Paired_Bracket_Type (??)

joining

Join_Control
Joining_Group
Joining_Type

ea-width

East_Asian_Width

This leaves the following list of properties which should be exposed, but don't have a definite home yet. (Properties included in the above listings with a (??) indicating inconclusive placement are not re-included here.)

Homeless Properties

General

White_Space
Alphabetic
Hangul_Syllable_Type
Noncharacter_Code_Point
Default_Ignorable_Code_Point
Deprecated
Logical_Order_Exception

Case

Case_Folding
Simple_Lowercase_Mapping
Simple_Titlecase_Mapping
Simple_Uppercase_Mapping
Simple_Case_Folding
Soft_Dotted

Numeric

Normalization

NFKC_Casefold
Changes_When_NFKC_Casefolded

Shaping and Rendering

Vertical_Orientation
Line_Break
Sentence_Break
Word_Break
Prepended_Concatenation_Mark

Bidirectional

Identifiers

ID_Continue
ID_Start
XID_Continue
XID_Start
Pattern_Syntax
Pattern_White_Space

CJK

Ideographic
Unified_Ideograph
Radical
IDS_Binary_Operator
IDS_Trinary_Operator
Unicode_Radical_Stroke

Miscellaneous

Math
Quotation_Mark
Dash
Sentence_Terminal
Terminal_Punctuation
Diacritic
Extender
Regional_Indicator
Indic_Positional_Category
Indic_Syllabic_Category

These properties need to be given a home crate before they can be included.

#![no_std] where possible?

A common feature among unicode crates is that many of them are no_std or are opt-out std.

Do we want to support the no_std use case? If we do, we should do so soon to avoid including std things in our crates.

For the UCD at the very least, no_std does not seem difficult. char_property works as-is with #![no_std] use core as std. (caveat: my small test didn't cover the macro...) char_range works so long as the Bound-based construction API is gated on std support. utils has iter_all_chars still (why? that should probably be removed since we have CharRange) which returns a Box, but works if that is dropped.

A quick search of the ucd directory shows one use of std::collections, which is in a test. Other than that, I don't think any non-core apis are used in ucd. std::ascii fn can be easily shimmed for those that are being used (I know they exist some places). I mean, we're writing a text-processing library, I hope we could shim it 😆.

Working in a no_std would also force us to think Iterator-first, as we would no longer have allocating APIs available at all.

normal has one use of VecDeque. Other than that, String, and std::ascii, I don't think we're using any non-core APIs in the libraries. (I of course exclude the source generation tools.)

Pack all utils together in one crate: unic-utils

The reason for having separate crates for UNIC components is the large data size for some of them, which may not be needed in other parts. Specially, UCD components all having data of all sizes, and not all algorithms need all of that data.

That rule doesn't apply to utility code, which basically doesn't have any data tables. So, it doesn't bother to have all of them under one umbrella crate and not maintain multiple utils sub-crates.

What do you think?

Create shortcuts for ASCII inputs

A big part of the input to many Unicode algorithms is ASCII-only or ASCII-majority, specially in early stages of application development.

Most properties/algorithms have a clear and easy effect on ASCII input, that can be put together in a couple of lines of code, usually. And in many cases, this only adds one if condition to the log(n) bsearch checks that follows.

We want to benefit from this and make these algorithms much faster for the common case, to make adoption easier for developers.

UCD Age: Custom enum type or Option<UnicodeVersion>?

Right now, we have Age as this:

https://github.com/behnam/rust-unic/blob/72ca9a893373f1857bc6ab6440389dfdceea6f13/unic/ucd/age/src/age.rs#L36-L42

It's nice to have this CompleteCharProperty, as it gives meaningful API like is_assigned() and is_unassigned().
Another option is to convert this to a PartialCharProperty, which returns None if the char is Unassigned, and Some(UnicodeVersion) otherwise.
One more option would be to have it as PartialCharProperty, but the return type will be more following the UAX#44 spec: only having major and minor numbers. Basically, return value would be Some(Age), with Age being similar to UnicodeVersion, but without minor number. Then, we can have transformations between Age and UnicodeVersion.

Option (3) makes it much more like other types in UCD, option (2) is a bit farther, and option (1), which is the current one, is the farthest from general approach here.

BUT, there's also the fact that option (1) looks nicer and more organic comparing to the other ones.

I'm filing this mostly because I need to figure out a way to provide a property-range contract for Age, and would be great if it's not just an empty contract.

Any ideas?

Use Option<(u8, u8)> for ucd::Age

Using Option<(u8, u8)> for Age property allows more flexibility in the API and is nothing short of the current enum.

[normal] Implement isNFC()

Many APIs need to check input for being NFC, and it's something that can be done pretty fast for a majority of cases.

Ref: http://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms

Decouple download and generate scripts

The requirements we have for download and generate scripts are fairly different.

For download, we want to keep files from every source in sync, as we read the version from their ReadMe.txt. So, for each source, we want to delete all the files before re-downloading. The files under /data/, for each source, should never go out of sync, version-wise. Because of this, this part of the code should be grouped together based on the data source.

For generate, in contrast, we want to have steps the generate consistent table files, which depend on one or more sources, and generates one or more table files. For this part of the code, it makes sense to group them based on major component, because that's closer to the development process.

The current model that has coupled download and generate code to be called from one binary makes it hard to satisfy this pattern.

Also, with the current model, we cannot add the download logic for a new source, like emoji, in one step and get to the generate code later; forcing us to put in a dummy generate code until we get to the implementation.

Also, the two steps don't share any model configuration, and only sharing some CLI implementation.

I think the current model doesn't have much benefits and it's going to be much easier if we have two separate creates for these tasks, each with their own functionality.

Also, now we can rename gen to tools, as we have killed all the Python code. So, we can have /tools/download/ and /tools/generate/, each a separate binary create. Or, we can have one tools create with multiple binaries inside.

open-i18n / rust-unic Goto Github PK

rust-unic's Introduction

UNIC: Unicode and Internationalization Crates for Rust

Project Goal

Project Status

Design Goals

Components and their Organization

The UNIC Super-Crate

Major Components

Applications

Code Organization: Combined Repository

Documentation

How to Use UNIC

License

Contribution

Code of Conduct

rust-unic's People

Contributors

Stargazers

Watchers

Forkers

rust-unic's Issues

References

General

Case

Numeric

Normalization

Shaping and Rendering

Bidirectional

Identifiers

CJK

Miscellaneous

General

Case

Numeric

Normalization

Shaping and Rendering

Bidirectional

Identifiers

CJK

Miscellaneous

Recommend Projects

Recommend Topics

Recommend Org

Jobs