
Any interest in UTR#30 Normalization? (twitter-cldr-rb, closed)

twitter commented on August 12, 2024

Any interest in UTR#30 Normalization?

from twitter-cldr-rb.

Comments (4)

camertron commented on August 12, 2024

Hey @jrochkind,

This looks pretty interesting, but unfortunately we don't have plans to incorporate UTR#30 into TwitterCLDR at the moment. At first glance, it seems like a fairly straightforward algorithm, and I would happily accept a pull request. The transformations TwitterCLDR currently supports are all normalizations, and UTR#30 specifically depends on one of them (NFD), so at least that part is already done. You can apply NFD normalization using the corresponding class:

TwitterCldr::Normalization::NFD.normalize(text)

# alternatively:
text.localize.normalize(:using => :NFD)

Good luck!

jrochkind commented on August 12, 2024

Thanks! I may try a pull request in the future.

jrochkind commented on August 12, 2024

Do you have any advice on how to use mapping data files of the sort here with TwitterCLDR? That is, does TwitterCLDR already have code written to consume this kind of mapping data, even if it is currently applied to other mappings? I ask because this looks like it may be some kind of standard Unicode mapping data file, but I'm not sure.

camertron commented on August 12, 2024

It looks like those files contain a series of folding rules that map one character (or range of characters) to another. The algorithm in UTR#30 says to perform the following steps:

a. Apply optional folding operations (i.e. rules from the Solr files)
b. Apply canonical decomposition (described above)
c. Repeat (a) and (b) until stable (I think "stable" means "until you can't decompose any more")
d. Apply composition if necessary (only if you want the string in composed form, based on your technical requirements)
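
Steps (a) through (d) can be sketched roughly as follows. This is only an illustration of the loop, not TwitterCLDR code: it uses Ruby's built-in `String#unicode_normalize` as a stand-in for `TwitterCldr::Normalization::NFD`, and `FOLDINGS`, `fold_until_stable` and the `max_passes` guard are hypothetical names made up for the example. It also assumes "stable" means "another pass changes nothing".

```ruby
# Hypothetical folding table: { source_char => replacement_string }.
# The single entry is for illustration only, not real UTR#30 data.
FOLDINGS = { "\u058A" => "-" }.freeze

# Repeats folding (a) and canonical decomposition (b) until a pass
# leaves the string unchanged (c). max_passes is a safety guard that
# is an assumption of this sketch, not something the spec mentions.
def fold_until_stable(text, foldings, max_passes: 10)
  current = text
  max_passes.times do
    folded = current.each_char.map { |ch| foldings.fetch(ch, ch) }.join # (a)
    decomposed = folded.unicode_normalize(:nfd)                         # (b)
    return decomposed if decomposed == current                          # (c)
    current = decomposed
  end
  current
end

result = fold_until_stable("caf\u00E9\u058A", FOLDINGS)

# (d) Recompose only if the caller wants composed output.
composed = result.unicode_normalize(:nfc)
```

Here `result` is in decomposed form (the "é" becomes "e" plus a combining acute accent), and `composed` puts it back together if that is what you need downstream.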

Applying a folding operation might look something like this: given a rule like 058A>002D, every time you encounter U+058A (ARMENIAN HYPHEN), you'd replace it with U+002D (HYPHEN-MINUS). Bear in mind that I only took a cursory glance over the UTR#30 spec, so that might be incorrect. Indeed, the spec is quite a bit more complicated than that.
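
As a rough sketch of that idea, the snippet below parses rules written like 058A>002D and applies them one character at a time. The helper names `parse_rule` and `apply_rules` are hypothetical, and the sketch assumes the simplest possible rule shape: one hex code point on the left and zero or more space-separated hex code points on the right. It does not handle ranges, comments, or anything else the real data files may contain.

```ruby
# Parses a rule such as "058A>002D" into [source_char, replacement_string].
# Assumed format: HEX>HEX[ HEX...]; ranges and comments are not handled.
def parse_rule(line)
  source, target = line.strip.split(">", 2)
  source_char = [source.hex].pack("U")
  replacement = target.to_s.split.map(&:hex).pack("U*")
  [source_char, replacement]
end

# Builds a lookup table from the rules, then replaces each matching
# character; characters without a rule pass through unchanged.
def apply_rules(text, rules)
  table = rules.map { |rule| parse_rule(rule) }.to_h
  text.each_char.map { |ch| table.fetch(ch, ch) }.join
end

folded = apply_rules("a\u058Ab", ["058A>002D"])
```

With that one rule, the Armenian hyphen in the input is replaced by an ASCII hyphen, so `folded` is "a-b".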

If you do decide to work on this feature and submit a PR, I'd suggest looking around for a test file. Unicode publishes a set of test data (inputs and correct outputs) for algorithms like normalization and bidi, so it's possible they have one for folding as well.
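
For reference, the normalization test data (NormalizationTest.txt) uses semicolon-separated columns of hex code points: source, NFC, NFD, NFKC, NFKD. Here is a minimal sketch of checking one such line, using Ruby's built-in `String#unicode_normalize` rather than TwitterCLDR. The helper name `parse_test_line` is hypothetical, and whether a folding test file would use the same layout is purely an assumption.

```ruby
# Parses one line of NormalizationTest.txt-style data into an array of
# strings, one per column. Returns nil for comment lines, blank lines
# and "@Part" headers.
def parse_test_line(line)
  data = line.split("#", 2).first.to_s.strip
  return nil if data.empty? || data.start_with?("@")
  data.split(";").map { |field| field.split.map(&:hex).pack("U*") }
end

sample = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE"
source, _nfc, nfd = parse_test_line(sample)

# True when the implementation's NFD output matches the expected column.
passes = source.unicode_normalize(:nfd) == nfd
```

Swapping the `unicode_normalize` call for the implementation under test would turn this into a simple conformance check.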

Good luck!
