
Comments (13)

camertron commented on August 14, 2024

Hey @jrochkind, wow thanks for the thorough writeup about this! Your research is quite timely, in fact. I just attended the International Unicode Conference here in Santa Clara and was fortunate to talk to a guy named Martin Dürst, a professor at Aoyama-Gakuin University in Tokyo. He and one of his graduate students had invented a very fast normalization algorithm called "eprun" (now available on github here). He had also run benchmarks against most of the known Ruby normalization implementations and found twitter-cldr-rb to be severely lacking - in fact, he originally wanted to title his talk at the conference "How to Beat Twitter at Normalization". For that reason, before and during the conference I was hell-bent on making twitter-cldr-rb faster. When all was said and done, I managed to speed it up by around 70%. You can see the results in the v3.0 branch. Professor Dürst's algorithm is, of course, still quite a bit faster. He tested all implementations on very large bodies of text that twitter-cldr-rb wasn't originally designed to handle, plus he's pretty much the inventor of the Unicode Normalization Algorithm.

As much as I want twitter-cldr-rb's normalization algorithm to be the fastest around, there are a number of reasons why I decided not to use any of the existing implementations:

  1. Compatibility. One of my original goals for twitter-cldr-rb was for it to run everywhere, but especially in Ruby 1.8. Martin's code depends on at least Ruby 1.9 for two reasons: gsub with a block, and gsub with multi-byte characters.
  2. Purity. I have really tried to limit the number of dependencies twitter-cldr-rb has. At the moment, it only depends on json, and even that has no version requirement. I had to introduce hamster for the faster normalization implementation in the v3.0 branch, but reluctantly. I also really, really want to stay away from C-extensions and all of the headaches they bring.
  3. Maintainability. I don't know how well these other implementations are maintained or if the authors will keep them updated when new versions of the Unicode spec come out.
  4. As you noted in your blog post, most of the time twitter-cldr-rb's implementation is fast enough. If you're chewing through gigabytes of text, an alternate, faster implementation is probably better for your use case.

That said, if you have any ideas regarding speeding up our normalization algorithm, please let me know! Martin's implementation works by pre-computing a map of all possible substitutions, then running a gsub (from the standard library, implemented in C) over the text, replacing characters via successive yields to gsub's optional block. Very clever. In contrast, to maintain compatibility with 1.8, twitter-cldr-rb converts the entire body of text into code points, normalizes them, then converts them back into text. Most of the computation time is spent converting to code points and back again, plus the memory needed to store that potentially huge array. Most of the speed improvements I was able to make were due to hamster's screaming fast list implementation (Ruby doesn't have to keep copying as the array grows).
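To make the technique concrete, here's a minimal sketch of the gsub-with-block approach (my own illustration, not eprun's actual code; the two-entry table is hypothetical, and a real implementation would build it from the Unicode Character Database and also handle canonical ordering and composition). It needs Ruby 1.9+ for the \u string escapes:

# Tiny stand-in for the precomputed substitution table.
DECOMPOSITIONS = {
  "\u00E9" => "e\u0301",  # é -> e + combining acute accent
  "\u00C5" => "A\u030A",  # Å -> A + combining ring above
}.freeze

# One precomputed regexp matching only characters with table entries,
# so gsub's C-level scan skips over everything else.
UNSTABLE = Regexp.union(DECOMPOSITIONS.keys)

def decompose(text)
  text.gsub(UNSTABLE) { |ch| DECOMPOSITIONS[ch] }
end

decompose("caf\u00E9")  # => "cafe" + combining acute accent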

Your thoughts are always appreciated!


jrochkind commented on August 14, 2024

Thanks for the response.

I agree with wanting to avoid 'screaming fast' implementations that are not portable.

However, the unf gem works on ruby 1.8.7. And doesn't require hamster. :)

I understand wanting to avoid dependencies.... but it seems undesirable for everyone writing a gem that deals with unicode to re-implement the normalization algorithms, and run into their own bugs and performance issues. It really seems like unicode normalization ought to be in ruby stdlib.

Failing that, it seems preferable for everyone to use a common gem dependency for normalization, instead of everyone inventing their own, and then having to deal on their own with bugs or performance issues. No?

But I understand wanting to avoid any dependencies at all, especially ones written as C extensions. It's your call. But personally I'd be a lot more worried about a Hamster dependency than a dependency on something small and focused that just does unicode normalization. The thing about a dependency on unf is, if for some reason it stops working later, it's pretty darn easy to just switch it out for something else again.

(However, the fact that unf monkey-patches String may be a bigger deal breaker! We're trying to convince the unf maintainer not to do that. knu/ruby-unf#4)

I do not have any ideas about speeding up your normalization algorithm -- other than replacing it with someone else's gem that already does it well. The fact that in the current release you are 2 orders of magnitude slower than everyone else is troubling though; hopefully the upcoming release improves that. There's no need for twitter_cldr's normalization to be 'the fastest around', but multiple orders of magnitude slower seems undesirable.


camertron commented on August 14, 2024

However, the unf gem works on ruby 1.8.7. And doesn't require hamster. :)

Right, but it does depend on a native extension.

It really seems like unicode normalization ought to be in ruby stdlib.

That would be awesome. Do you know how we might advocate for this?

...it seems preferable for everyone to use a common gem dependency for normalization, instead of everyone inventing their own, and then having to deal on their own with bugs or performance issues. No?

Yes, but at least with our own implementation we can ensure it always works with the right Ruby implementations/versions and with the version of the Unicode spec the rest of the gem adheres to. However, I'm not saying adopting a faster implementation is bad... just don't know if one exists at the moment that covers all the criteria.

I'd be a lot more worried about a Hamster dependency.

Why is that? Are you thinking of potential gem version conflicts?

The thing about a dependency on unf is, if for some reason it stops working later, it's pretty darn easy to just switch it out for something else again.

That's a great point. Perhaps we could extract twitter-cldr-rb's normalization implementation into a separate library and switch it out/back in as necessary.

However, the fact that unf monkey-patches String may be a bigger deal breaker!

You know, twitter-cldr-rb does this too :) It defines the localize method on a bunch of core objects. I've thought about making such integration optional... looks like that might be a good idea.
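For reference, the integration looks roughly like this (along the lines of the examples in our README; return values abbreviated):

require 'twitter_cldr'

# localize is added to String, numbers, dates, and so on.
1337.localize(:es).to_s   # => "1.337" (Spanish digit grouping)
"o hai".localize(:es)     # => a LocalizedString wrapper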

The fact that in the current release you are 2 orders of magnitude slower than everyone else is troubling though, hopefully the upcoming release improves that. ... multiple orders of magnitude slower seems undesirable.

Yes... I've definitely thought quite a bit about this, lost sleep even. Would you mind benchmarking the implementation in the v3.0 branch? I'd be curious to see how it stacks up against the others.


jrochkind commented on August 14, 2024

It does depend on a native extension -- but it provides a JRuby alternative that just proxies to the Java stdlib instead of trying to use a native extension. I guess that might still leave out rbx though, not sure? (I've never used rbx.)

But I totally understand your reluctance to introduce any dependencies, especially a C one.

Based on the benchmarking I did, I suspect that a pure ruby implementation is never going to get close to a C extension -- or to the Java stdlib implementation. But I understand why you might want a pure ruby implementation anyway.

I'd be worried about Hamster only because it's a fairly large and complex dependency, not because of anything particular to Hamster itself -- I've never used Hamster, although I've read the docs.

Your idea of pulling out twitter's implementation as a standalone gem others can re-use, to keep everyone from having to re-write normalization algorithms all the time, is an interesting one. Although with so many unicode normalization options out there already, I'm not sure about adding yet another one; you'd have to make the case for what makes yours better, and why people should standardize on yours instead of one of the existing ones. That's one reason I benchmarked them: to have at least some basis for saying "okay, let's try standardizing on this one." Which led me to suggest unf for standardization, since it performs best and does nothing but unicode normalization.

Might be interesting to somehow provide a gem with a pure-ruby implementation with the exact same API as unf... but man, I don't really want to go through the 'MultiJson' style of thing just for unicode normalization again. It would be so nice not to have to think about this and just have unicode normalization available. I have no idea how you try to get it into the ruby stdlib; I've had no luck before at having any influence over that.
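To illustrate what I mean, a unf-compatible shim might look something like this (the interface names here are my own sketch, not necessarily unf's exact API, and the backend lambda is a placeholder for a real pure-ruby implementation):

# A pure-ruby gem could expose a unf-like interface so callers could
# swap backends without code changes.
module PureUNF
  class Normalizer
    def self.instance
      @instance ||= new
    end

    # form is :nfc, :nfd, :nfkc, or :nfkd, as with unf
    def normalize(string, form = :nfc)
      BACKEND.call(string, form)
    end
  end

  # No-op placeholder; a real gem would implement normalization here.
  BACKEND = lambda { |string, form| string }
end

PureUNF::Normalizer.instance.normalize("caf\u00E9", :nfd)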

Benchmark the v3.0 branch? Okay, normally I'd say "feel free to fork my repo and do it yourself, that's why it's open source" (every time you release benchmarks, everyone starts telling you you did it wrong and asking won't you please include what they want!), but because I feel like I owe you for twitter_cldr, and I'm curious myself, okay, stay tuned.


jrochkind commented on August 14, 2024

Okay, I don't think there's any way to include both the 2.x and 3.x twitter_cldr in the same benchmark run, since I don't think I can load them both simultaneously.

So here are all my runs again, using twitter_cldr 3.x instead of 2.x -- this is the current tip of the v3.0 branch, pulled in via my Gemfile:

gem "twitter_cldr", :git => "https://github.com/twitter/twitter-cldr-rb.git", :branch => "v3.0"

mri:

$ ruby benchmark.rb
benchmark.rb:9: warning: already initialized constant Unicode
Rehearsal --------------------------------------------------
unicode_utils    1.160000   0.000000   1.160000 (  1.166212)
active_support   1.800000   0.010000   1.810000 (  1.808460)
twitter_cldr    16.320000   0.030000  16.350000 ( 16.366699)
unf              0.070000   0.000000   0.070000 (  0.071209)
unicode_gem      0.350000   0.000000   0.350000 (  0.350502)
---------------------------------------- total: 19.740000sec

                     user     system      total        real
unicode_utils    1.180000   0.000000   1.180000 (  1.178134)
active_support   1.700000   0.010000   1.710000 (  1.710028)
twitter_cldr    15.940000   0.020000  15.960000 ( 15.970978)
unf              0.050000   0.000000   0.050000 (  0.051623)
unicode_gem      0.340000   0.000000   0.340000 (  0.339548)

jruby:

$ ruby benchmark.rb
Rehearsal --------------------------------------------------
unicode_utils    3.970000   0.050000   4.020000 (  1.872000)
active_support   3.150000   0.050000   3.200000 (  1.925000)
twitter_cldr    21.320000   0.290000  21.610000 ( 16.252000)
unf              0.750000   0.010000   0.760000 (  0.402000)
---------------------------------------- total: 29.590000sec

                     user     system      total        real
unicode_utils    0.960000   0.010000   0.970000 (  0.950000)
active_support   1.300000   0.010000   1.310000 (  1.298000)
twitter_cldr    13.180000   0.200000  13.380000 ( 13.042000)
unf              0.120000   0.000000   0.120000 (  0.120000)

In the original 2.x, twitter_cldr took nearly 90 seconds; now it's down to about 15. So you've improved it roughly sixfold -- but it's still about 9x slower than its nearest competitor, and around two orders of magnitude slower than unf.

I think unicode_utils is pure ruby too, and it's over 10x faster than you. So if you really need pure ruby, maybe you want to look at its implementation, copy it, or just use it as a dependency (although it includes things other than just unicode normalization). But it seems silly that all these projects are working on their own unicode normalization algorithms independently, doesn't it?

(I still haven't found the code for ActiveSupport's implementation; I have no idea if it's pure ruby, or different in MRI and JRuby, or what.)


jrochkind commented on August 14, 2024

https://github.com/lang/unicode_utils/blob/master/lib/unicode_utils/nfc.rb


camertron commented on August 14, 2024

Hey @jrochkind,

Finally responding to your last comment, thanks for the benchmarks :) I've taken your points to heart and think it's time to integrate a better, faster normalization algorithm from another gem. I mentioned Martin Dürst earlier in this thread as well as his normalization implementation called "eprun" (Efficient Pure Ruby Unicode Normalization). It's the fastest I've seen so far, even beating out the unicode_utils and unicode gems in the benchmarks he's run (no benchmarks yet for unf). He's trying to get his code integrated into the Ruby stdlib, but as you know, these things can take time. Until recently, his code was more of a script than a reusable library, so I've re-started a conversation with him to create an eprun gem, with a focus on supporting MRI 1.8. You can see my pull request and the resulting conversation here. Let me know what you think!

Eprun took an average performance hit of around 400ms after I added support for MRI 1.8. Here are a few benchmarks I ran this morning (using 1.8):

________________ Deutsch (89435 characters, 90745 bytes) ________________
             user     system      total        real
Fast normalization using eprun (100 times)
NFD:     1.870000   0.000000   1.870000 (  1.874002)
NFKD:    2.520000   0.010000   2.530000 (  2.524128)
NFC:     2.120000   0.000000   2.120000 (  2.119158)
NFKC:    2.760000   0.000000   2.760000 (  2.770890)
Hash size: NFD 19, NFC 0, K 2

Using unicode_utils gem (100 times)
NFD:     5.630000   0.010000   5.640000 (  5.635517)
NFKD:    7.910000   0.000000   7.910000 (  7.926094)
NFC:    13.920000   0.010000  13.930000 ( 13.931892)
NFKC:   16.160000   0.010000  16.170000 ( 16.171123)

Using ActiveSupport::Multibyte::Chars (100 times)
NFD:    10.000000   0.110000  10.110000 ( 10.102845)
NFKD:   10.000000   0.100000  10.100000 ( 10.094339)
NFC:    17.630000   0.120000  17.750000 ( 17.750051)
NFKC:   17.510000   0.120000  17.630000 ( 17.628064)

Using unicode gem (native code, 100 times)
NFD:     1.420000   0.050000   1.470000 (  1.465032)
NFKD:    1.480000   0.050000   1.530000 (  1.523354)
NFC:     5.330000   0.070000   5.400000 (  5.398401)
NFKC:    5.390000   0.060000   5.450000 (  5.453838)

________________ Japanese (76221 characters, 226953 bytes) ________________
             user     system      total        real
Fast normalization using eprun (100 times)
NFD:     3.260000   0.020000   3.280000 (  3.275335)
NFKD:    4.120000   0.030000   4.150000 (  4.142771)
NFC:     3.140000   0.000000   3.140000 (  3.141171)
NFKC:    4.020000   0.000000   4.020000 (  4.019722)
Hash size: NFD 68, NFC 0, K 21

Using unicode_utils gem (100 times)
NFD:     6.080000   0.010000   6.090000 (  6.094526)
NFKD:    7.780000   0.010000   7.790000 (  7.801275)
NFC:    13.040000   0.010000  13.050000 ( 13.053922)
NFKC:   15.100000   0.000000  15.100000 ( 15.103054)

Using ActiveSupport::Multibyte::Chars (100 times)
NFD:    10.670000   0.100000  10.770000 ( 10.765291)
NFKD:   10.900000   0.090000  10.990000 ( 10.990938)
NFC:    25.730000   0.130000  25.860000 ( 25.844477)
NFKC:   26.010000   0.110000  26.120000 ( 26.108892)

Using unicode gem (native code, 100 times)
NFD:     1.590000   0.070000   1.660000 (  1.663076)
NFKD:    1.630000   0.070000   1.700000 (  1.696432)
NFC:     5.500000   0.080000   5.580000 (  5.580656)
NFKC:    5.430000   0.060000   5.490000 (  5.489105)


jrochkind commented on August 14, 2024

Awesome, thanks for working on this.

I'm curious why you require support for ruby 1.8. You guys use 1.8 internally? Personally I think it would make sense to drop support for 1.8; ruby 1.8 is no longer supported by the ruby team -- not even for security fixes.

It's a bit sad to take a performance hit in order to support a ruby that is end-of-lifed and no longer maintained by the ruby team. But ultra-high performance isn't necessarily required, so I guess it's fine (2-3 orders of magnitude worse, like your previous code, is still probably not acceptable, though).

If I were you, I'd still just depend on an existing gem that does a fine job of high-performance normalization, like unf, which works on both MRI and jruby -- but I think not 1.8; is that the problem? Another option: you could depend on unf but use runtime checks on the ruby version -- use unf on 1.9.3 and fall back to your existing algorithm on 1.8.7, I suppose.
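Something like this sketch, I mean (unf's monkey-patched String#to_nfc and friends are the interface I'm aware of; the TwitterCldr call is a hypothetical stand-in for whatever your internal entry point is):

# Runtime dispatch: use unf where it works, fall back to the existing
# pure-ruby code on 1.8.
if RUBY_VERSION >= "1.9"
  require "unf"

  def normalize(string, form = :nfc)
    string.send("to_#{form}")  # to_nfc, to_nfd, to_nfkc, to_nfkd
  end
else
  def normalize(string, form = :nfc)
    TwitterCldr::Normalization.normalize(string, form)  # hypothetical
  end
end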


camertron commented on August 14, 2024

Hey @jrochkind,

We require support for Ruby 1.8 for a few reasons. Firstly, we use a patched version of REE (named Kiji) internally at Twitter. In fact, a number of projects still use it, the most important being parts of the Monorail (i.e. twitter.com). More than that, however, Ruby 1.8 doesn't have great unicode support, so one of my initial goals was to bring at least partial unicode support to it to plug that particular hole for 1.8 users. Users of 1.9 and 2.0 have nice, built-in tools for this kind of thing, of course. I know of at least two other companies that still depend on 1.8 for their production apps. It's dying, but far from dead, and for the most part isn't too difficult to support if you've got good tests.

In terms of taking a performance hit, it turns out I was wrong about that. Martin and I have been working on conditionally loading the appropriate version of eprun depending on what version of Ruby you're running. For 1.8, we automatically require my less performant port, but for >= 1.9, we load Martin's original, faster implementation. The benchmarks look pretty good, although admittedly still not faster than unf :(
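The gist of the conditional loading (the file paths here are illustrative, not eprun's actual layout):

# Load-time dispatch: pick the implementation that matches the runtime.
if RUBY_VERSION < "1.9"
  require "eprun/ruby18"  # my slower port, avoids 1.9-only features
else
  require "eprun"         # Martin's original, faster implementation
end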

I've decided that depending on an existing gem isn't a bad thing, and I think (after considering your comments) that depending on unf or eprun is better than depending on hamster. Martin is trying to get eprun merged into Ruby core (plus he's a personal acquaintance of mine). For those reasons, plus the fact that we've managed to get it to work with MRI 1.8, eprun is my first choice. It covers all the required bases: it works on 1.8, doesn't rely on native extensions, and is quite performant compared to other implementations.


jrochkind commented on August 14, 2024

Sounds like you've got all the bases covered and something great is going to result. Thanks for this gem, for open-sourcing it, and for explaining what's going on to the peanut gallery. And good to know about eprun, I'll check that one out. (So many choices for unicode normalization in ruby! I hope they just add it to the stdlib so nobody needs to spend time figuring out which unicode normalization option to use again!)


jrochkind commented on August 14, 2024

Oh, and maybe consider taking unf's simple jruby approach of just delegating through to the Java stdlib for unicode normalization when on jruby. While there are probably Java alternatives for most of what twitter-cldr does, I'm probably not alone in sometimes writing code that isn't necessarily targeted at jruby but might run on it and needs to run decently there.
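Something along these lines (java.text.Normalizer is the real JDK class, available since Java 6; the wrapper around it is just a sketch):

# On JRuby, delegate to the JDK's built-in normalizer -- no C extension
# and no pure-ruby fallback needed.
if RUBY_PLATFORM == "java"
  require "java"

  def normalize(string, form = :nfc)
    java.text.Normalizer.normalize(
      string,
      java.text.Normalizer::Form.value_of(form.to_s.upcase)
    )
  end
end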


jrochkind commented on August 14, 2024

PPS: Also with regard to eprun (which I'm just taking a look at and plan to benchmark next to the others), I'd really suggest making the monkey-patching of String optional, and not using the monkey-patched API in twitter-cldr, but instead using the Normalize.normalize API.

Monkey-patching seems okay when you consider a first-level dependency: someone saying "I choose to use (e.g.) eprun, and it adds these methods to String, okay, good to know." But when you consider a multi-level dependency... jrochkind has a do_something gem that depends on twitter-cldr, which depends on eprun, and as a result jrochkind's do_something has to warn all users that, oh yeah, it's going to add these normalize methods to every String in your app if you use do_something... it starts to seem less acceptable, no?


camertron commented on August 14, 2024

I couldn't agree more, @jrochkind, which is why the string monkey patching is not required by default in my ruby18 branch. You have to call Eprun.enable_core_extensions! which simply requires lib/eprun/core_ext/string.rb. Not really sure if there's a better way to do it, but without refinements and such I think that's the only way.
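In miniature, the idiom looks like this (the real enable_core_extensions! just requires lib/eprun/core_ext/string.rb; I've inlined a method here so the sketch stands alone, and the normalize body is a placeholder):

module Eprun
  # Nothing touches String until the caller explicitly opts in.
  def self.enable_core_extensions!
    String.class_eval do
      def normalize(form = :nfc)
        Eprun.normalize(self, form)
      end
    end
  end

  # Placeholder so the sketch runs; the real gem does the work here.
  def self.normalize(string, form)
    string
  end
end

Eprun.enable_core_extensions!
"caf\u00E9".normalize(:nfd)  # available only after opting in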

Also, I think your idea of calling straight through to Java stdlib in JRuby is a great idea. I'll explore that avenue when it comes time to integrate the eprun gem.

