
Comments (4)

camertron commented on August 12, 2024

Hey @jrochkind,

This looks pretty interesting, but unfortunately we don't have plans to incorporate UTR#30 into TwitterCLDR at the moment. At first glance, it seems like a fairly straightforward algorithm, and I would happily accept a pull request. TwitterCLDR's current transformations are really normalizations, one of which UTR#30 specifically depends on (NFD), so at least that's already done. You can make use of NFD normalization using the corresponding class:

TwitterCldr::Normalization::NFD.normalize(text)

# alternatively:
text.localize.normalize(:using => :NFD)

Good luck!

from twitter-cldr-rb.

jrochkind commented on August 12, 2024

Thanks! I may try a pull request in the future.


jrochkind commented on August 12, 2024

Do you have any advice on how to use mapping data files of the sort here with TwitterCLDR? That is, does TwitterCLDR already have code that consumes this kind of mapping data, just applied to a different data set? I ask because this looks like it may be some kind of standard Unicode mapping data format, but I'm not sure.


camertron commented on August 12, 2024

It looks like those files contain a series of folding rules that map one character (or range of characters) to another. The algorithm in UTR#30 says to perform the following steps:

a. Apply optional folding operations (i.e. rules from the Solr files)
b. Apply canonical decomposition (described above)
c. Repeat (a) and (b) until stable (I think "stable" means "until you can't decompose any more")
d. Apply composition if necessary (only if you want the string in composed form, based on your technical requirements)
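
As a rough sketch of steps (a) through (c), the loop might look something like the following. This is only an illustration under a few assumptions: the `FOLDINGS` table, `apply_foldings`, and `fold_until_stable` are hypothetical names, the table holds a single example rule rather than the real mapping data, and Ruby's built-in `String#unicode_normalize(:nfd)` stands in for TwitterCLDR's NFD normalizer so the snippet runs on its own. Here "stable" is taken to mean that another pass no longer changes the string.

```ruby
# Hypothetical folding table: source code point => replacement code point.
# The single entry is only an example; a real table would be loaded from
# the mapping data files.
FOLDINGS = { 0x058A => 0x002D }.freeze

# Step (a): replace every code point that has a folding rule.
def apply_foldings(text, foldings = FOLDINGS)
  text.codepoints.map { |cp| foldings.fetch(cp, cp) }.pack('U*')
end

# Steps (a) to (c): fold, decompose, and repeat until nothing changes.
# max_passes is an assumed safety limit, not something the spec asks for.
def fold_until_stable(text, foldings = FOLDINGS, max_passes: 10)
  current = text
  max_passes.times do
    folded = apply_foldings(current, foldings)     # step (a)
    decomposed = folded.unicode_normalize(:nfd)    # step (b), stand-in for TwitterCLDR's NFD
    return decomposed if decomposed == current     # step (c): stable
    current = decomposed
  end
  current
end
```

Step (d) is left out because composition is only needed if you want the result in composed form.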

Applying a folding operation might look something like this: given a rule like 058A>002D, every time you encounter the character U+058A (ARMENIAN HYPHEN), you'd replace it with U+002D (HYPHEN-MINUS). Bear in mind that I only took a cursory glance over the UTR#30 spec, so that might be incorrect. Indeed, the spec is quite a bit more complicated than that.
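
To make that concrete, here is a minimal sketch of parsing and applying one such rule. It assumes the simplest possible format, a single hex code point on each side of `>`, and does not handle ranges or anything else the real files may contain. `parse_rule` and `apply_rule` are hypothetical helpers, not part of TwitterCLDR.

```ruby
# Parses a rule such as "058A>002D" into [source_char, replacement_char].
# Assumes exactly one hex code point on each side of ">".
def parse_rule(rule)
  source, target = rule.split('>', 2).map(&:strip)
  [[source.hex].pack('U'), [target.hex].pack('U')]
end

# Replaces every occurrence of the rule's source character in text.
def apply_rule(text, rule)
  source, replacement = parse_rule(rule)
  # Block form so the replacement is never treated as a backreference pattern.
  text.gsub(source) { replacement }
end

apply_rule("foo\u058Abar", '058A>002D')
```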

If you do decide to work on this feature and submit a PR, I'd suggest looking around for a test file. Unicode publishes a set of test data (inputs and correct outputs) for algorithms like normalization and bidi, so it's possible they have one for folding as well.

Good luck!

