
yeslogic / unicode-case-mapping


Fast mapping of char to lowercase, uppercase, or titlecase in Rust.

License: Apache License 2.0

Rust 99.89% Makefile 0.11%
rust unicode uppercase lowercase titlecase ucd


unicode-case-mapping


Fast mapping of a char to lowercase, uppercase, titlecase, or its simple case folding in Rust using Unicode 15.0 data.

Usage

use std::num::NonZeroU32;

fn main() {
    // 'İ' lowercases to 'i' followed by U+0307 COMBINING DOT ABOVE.
    assert_eq!(unicode_case_mapping::to_lowercase('İ'), ['i' as u32, 0x0307]);
    // Unused trailing slots are filled with 0.
    assert_eq!(unicode_case_mapping::to_lowercase('ß'), ['ß' as u32, 0]);
    assert_eq!(unicode_case_mapping::to_uppercase('ß'), ['S' as u32, 'S' as u32, 0]);
    assert_eq!(unicode_case_mapping::to_titlecase('ß'), ['S' as u32, 's' as u32, 0]);
    // '-' has no titlecase mapping, so every slot is 0.
    assert_eq!(unicode_case_mapping::to_titlecase('-'), [0; 3]);
    // case_folded returns an Option<NonZeroU32>.
    assert_eq!(unicode_case_mapping::case_folded('I'), NonZeroU32::new('i' as u32));
    assert_eq!(unicode_case_mapping::case_folded('ß'), None);
    assert_eq!(unicode_case_mapping::case_folded('ẞ'), NonZeroU32::new('ß' as u32));
}
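
The arrays returned above are padded with zeros, so callers usually need a small step to turn them back into text. The following is a minimal sketch of such a step, not part of the crate: `mapping_to_string` is a hypothetical helper, and it assumes (as the `'-'` example suggests) that an all-zero array means the character has no mapping. The arrays are copied from the usage example rather than obtained by calling the crate, so the snippet compiles on its own.

```rust
/// Hypothetical helper: converts a zero-padded mapping array into a `String`.
/// An all-zero array is treated as "no mapping", so the original char is returned.
fn mapping_to_string(original: char, mapped: &[u32]) -> String {
    let s: String = mapped
        .iter()
        .take_while(|&&cp| cp != 0) // stop at the zero padding
        .filter_map(|&cp| char::from_u32(cp))
        .collect();
    if s.is_empty() {
        original.to_string()
    } else {
        s
    }
}

fn main() {
    // Values taken from the usage example above.
    let upper_sharp_s = ['S' as u32, 'S' as u32, 0];
    assert_eq!(mapping_to_string('ß', &upper_sharp_s), "SS");

    let lower_dotted_i = ['i' as u32, 0x0307];
    assert_eq!(mapping_to_string('İ', &lower_dotted_i), "i\u{307}");

    let title_hyphen = [0u32; 3];
    assert_eq!(mapping_to_string('-', &title_hyphen), "-");
}
```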

Motivation / When to Use

The Rust standard library supplies to_uppercase and to_lowercase methods on char, so you might wonder why this crate exists or when to use it. You should almost certainly use the standard library, unless:

  • You need support for titlecase conversion or case folding according to the Unicode character database (UCD).
  • You need lower level access to the mapping table data, compared to the iterator interface supplied by the standard library.
  • You need faster performance than the standard library.
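
For comparison, the iterator interface of the standard library looks like the sketch below. It uses only `char::to_uppercase` and `char::to_lowercase` from `std`; the wrapper names `std_uppercase` and `std_lowercase` are made up for this illustration.

```rust
// Illustrative wrappers around the standard library's iterator-based API.
fn std_uppercase(c: char) -> String {
    // to_uppercase returns an iterator over chars, collected here into a String.
    c.to_uppercase().collect()
}

fn std_lowercase(c: char) -> String {
    c.to_lowercase().collect()
}

fn main() {
    assert_eq!(std_uppercase('ß'), "SS");
    assert_eq!(std_lowercase('İ'), "i\u{307}");

    // The mapped chars can also be consumed one at a time.
    let mut it = 'ß'.to_uppercase();
    assert_eq!(it.next(), Some('S'));
    assert_eq!(it.next(), Some('S'));
    assert_eq!(it.next(), None);
}
```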

An additional motivation for creating this crate was to be able to version the UCD data independently of the Rust version. This allows us to ensure that all our Unicode-related crates use the same UCD version.

Performance & Implementation Notes

ucd-generate is used to generate tables.rs. A build script (build.rs) compiles this into a three-level lookup table. The lookup time is constant, as it is just indexing into the arrays.

The multi-level approach maps a code point to a block, then to a position within that block, which gives the index of a record describing how to map that code point to lower, upper, and title case. This allows the data to be deduplicated, saving space, whilst also providing fast lookup. The code is parameterised over the block size, which must be a power of 2. The value in the build script is optimal for the data set.
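
To make the block/position/record chain concrete, here is a toy sketch of the technique that covers ASCII only. It is not the crate's generated code: the table names, the block size of 16, and the delta-based `Record` layout are all assumptions chosen to keep the example small.

```rust
// Toy three-level lookup covering only code points 0x00..0x80.
const SHIFT: u32 = 4; // block size must be a power of 2: 1 << 4 = 16
const BLOCK_SIZE: usize = 1 << SHIFT;
const MASK: u32 = (1 << SHIFT) - 1;

/// Hypothetical record: signed offsets to the lower and upper case forms.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Record {
    lower_delta: i32,
    upper_delta: i32,
}

// Level 3: the distinct records. Index 0 means "no mapping".
const RECORDS: [Record; 3] = [
    Record { lower_delta: 0, upper_delta: 0 },
    Record { lower_delta: 32, upper_delta: 0 },  // 'A'..='Z'
    Record { lower_delta: 0, upper_delta: -32 }, // 'a'..='z'
];

// Level 1: (code point >> SHIFT) selects a block. The first four blocks
// contain no mappings, so they are deduplicated into block 0.
const LEVEL1: [u8; 8] = [0, 0, 0, 0, 1, 2, 3, 4];

// Level 2: the unique blocks stored back to back; each entry indexes RECORDS.
const LEVEL2: [u8; 5 * BLOCK_SIZE] = [
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // block 0: no mappings
    0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // block 1: '@', 'A'..='O'
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, // block 2: 'P'..='Z', rest unmapped
    0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // block 3: '`', 'a'..='o'
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, // block 4: 'p'..='z', rest unmapped
];

/// Constant-time lookup: three array indexing steps, no searching.
fn lookup(c: char) -> Option<Record> {
    let cp = c as u32;
    // Code points beyond the toy table have no entry.
    let block = *LEVEL1.get((cp >> SHIFT) as usize)? as usize;
    let pos = (cp & MASK) as usize;
    let record = LEVEL2[block * BLOCK_SIZE + pos] as usize;
    Some(RECORDS[record])
}

fn apply_delta(c: char, delta: i32) -> char {
    if delta == 0 {
        return c;
    }
    char::from_u32((c as u32 as i32 + delta) as u32).unwrap_or(c)
}

fn toy_lowercase(c: char) -> char {
    lookup(c).map_or(c, |r| apply_delta(c, r.lower_delta))
}

fn toy_uppercase(c: char) -> char {
    lookup(c).map_or(c, |r| apply_delta(c, r.upper_delta))
}

fn main() {
    assert_eq!(toy_lowercase('A'), 'a');
    assert_eq!(toy_uppercase('z'), 'Z');
    assert_eq!(toy_lowercase('-'), '-'); // lands in the shared empty block
    assert_eq!(lookup('é'), None); // outside the toy table
}
```

In this toy table, four of the eight first-level entries point at the same empty block, which is the kind of deduplication the multi-level layout makes possible.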

This approach trades some space for faster lookups. The tables take up about 101 KiB. Benchmarks (run with cargo bench) show this approach to be ~5–10× faster than the binary search approach used in the Rust standard library.

It's possible there are further optimisations that could be made to eliminate some runs of repeated values in the first level array.

Regenerating tables.rs

  1. Regenerate with yeslogic-ucd-generate (see header of file).
  2. Add #[allow(dead_code)] to each table to prevent warnings.
  3. Delete entries that map to themselves. E.g. in Vim: :g/(\(\d\+\), &\[\1\])/d.
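
If Vim is not to hand, step 3 could be scripted instead. The sketch below is an assumption-laden stand-in for the Vim command, not a tool shipped with the crate: it assumes each table entry sits on its own line in the form `(N, &[N]),` (the shape implied by the Vim pattern), and the function names are invented here. Unlike the Vim pattern, which matches anywhere in a line, it only recognises lines that consist of a single entry.

```rust
/// Returns true for a line holding a single entry of the form `(N, &[N]),`,
/// that is, a code point that maps only to itself.
fn maps_to_self(line: &str) -> bool {
    let entry = line.trim().trim_end_matches(',');
    let parsed = entry
        .strip_prefix('(')
        .and_then(|s| s.strip_suffix(')'))
        .and_then(|s| s.split_once(", "))
        .and_then(|(key, value)| {
            let value = value.strip_prefix("&[")?.strip_suffix(']')?;
            Some((key, value))
        });
    match parsed {
        Some((key, value)) => {
            !key.is_empty() && key.bytes().all(|b| b.is_ascii_digit()) && key == value
        }
        None => false,
    }
}

/// Drops every self-mapping entry and keeps all other lines unchanged.
fn drop_self_mappings(source: &str) -> String {
    source
        .lines()
        .filter(|line| !maps_to_self(line))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    // Made-up input in the assumed table format.
    let input = "  (65, &[97]),\n  (223, &[223]),\n  (304, &[105, 775]),";
    let output = drop_self_mappings(input);
    assert_eq!(output, "  (65, &[97]),\n  (304, &[105, 775]),");
}
```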

Contributors

adrianwong, djudd, wezm
