
Comments (6)

nalimilan avatar nalimilan commented on July 23, 2024

Could you write a summary of the results? Also, I think it would make sense to measure the performance of converting many strings using the same iconv/ICU handler, since this is a more reasonable scenario.

I think progressively adding pure-Julia converters, starting with the most common encodings, is a good idea. That would justify changing the name of the package. One difficult point is getting relatively consistent behavior as regards invalid characters, given that iconv will be less flexible than your Julia code; need to think about it. Anyway, if you can find a consistent plan to implement this, that would be nice.

from stringencodings.jl.

ScottPJones avatar ScottPJones commented on July 23, 2024

The test in the gist used different sizes: 1, 16, 32, 64, 256, 1024, and 5120 bytes.
The strings were created with random bytes 0–255, then any bytes invalid for that character set were changed to '?'.
iconv.jl was quite a bit slower: > 70x slower even on larger strings when converting to UTF-16,
and > 45x slower when converting to UTF-8.
ICU.jl was about twice as slow in general as the pure Julia conversion code.
With the pure Julia converter, converting to UTF-8 is about 57% slower than converting to UTF-16 (which is to be expected: dealing with UTF-8 is generally significantly slower than UTF-16).
This was on large strings (5120 bytes), which should have reduced the effect of setting up the StringEncoder in the iconv.jl tests.
ICU.jl was pretty fast, but only supports UTF-16 (there is support for UTF-8 now in ICU, but it isn't
available in ICU.jl).
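
For reference, a minimal sketch of the kind of benchmark described above, using the sizes from this thread. The `decode` call follows StringEncodings' API, which may differ from the iconv.jl-era code being discussed, so treat the exact names as an assumption:

```julia
# Hedged benchmark sketch: decode random byte buffers of the sizes
# mentioned above. Assumes StringEncodings' decode(bytes, enc) API.
using StringEncodings

for n in (1, 16, 32, 64, 256, 1024, 5120)
    bytes = rand(UInt8, n)
    # For ISO-8859-1 every byte is valid; stricter 8-bit charsets would
    # need invalid bytes replaced with '?' as described above.
    t = @elapsed for _ in 1:1000
        decode(bytes, "ISO-8859-1")
    end
    println("n = $n: ", t / 1000 * 1e6, " µs per conversion")
end
```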

I'm not sure exactly what you want, about measuring converting many strings.
I suppose I could call StringEncoder directly, instead of using decode().
Is that what you were thinking of?

About the pure-Julia converters, I think it will be pretty easy for me to add all 8-bit encodings, with all of the invalid-character behaviors.
I was also thinking about using the Encoding types (if you looked at the code in the Gist), having the tables loadable on demand, with a Dict to keep track of loaded encodings, and finally implementing next(), start(), done(), so that it would be possible to get a character at a time from an IOBuffer, a Vector{UInt8}, a pointer, or maybe even MMap'ed memory, based on a loaded encoding.
That would also make transcoding (using Unicode as the intermediary, like iconv does) easy in pure Julia.
We could still use iconv for multi-byte character sets until we have time to move those to pure Julia.
(I've done this in the past, pretty efficiently, in a table-driven fashion, so I'm confident I could add that part; it's just a matter of time to do it [not too long, but more than a Saturday afternoon] and my "paying work" priorities (we need 8-bit support now, but MB character sets, not yet).)
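
The table-driven design described above might be sketched roughly like this; the cache structure and the identity mapping are illustrative assumptions, not the actual Gist code:

```julia
# Hedged sketch: one 256-entry Char table per 8-bit encoding, loaded
# on demand and cached in a Dict, as described above. Table contents
# here are placeholders (identity / Latin-1 mapping).
const TABLE_CACHE = Dict{String, Vector{Char}}()

function load_table(enc::String)
    get!(TABLE_CACHE, enc) do
        # Real code would load the mapping from a data file; as a
        # stand-in, treat every encoding as Latin-1.
        [Char(i) for i in 0x00:0xff]
    end
end

function decode_8bit(bytes::Vector{UInt8}, enc::String)
    table = load_table(enc)
    io = IOBuffer()
    for b in bytes
        # Invalid positions could hold a sentinel such as '\ufffd',
        # dispatched on per the chosen invalid-character policy.
        print(io, table[Int(b) + 1])
    end
    String(take!(io))
end
```

The Dict lookup happens once per call, so the per-byte loop stays a plain table index, which is what makes the 8-bit case easy to do fast in pure Julia.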

The big takeaway, which I learned when doing the Unicode conversion code in Base after being pushed by Tony et al. to do it in pure Julia, is that pure Julia rocks! 😀


nalimilan avatar nalimilan commented on July 23, 2024

> The test in the gist used different sizes: 1, 16, 32, 64, 256, 1024, and 5120 bytes.
> The strings were created with random bytes 0–255, then any bytes invalid for that character set were changed to '?'.
> iconv.jl was quite a bit slower: > 70x slower even on larger strings when converting to UTF-16,
> and > 45x slower when converting to UTF-8.
> ICU.jl was about twice as slow in general as the pure Julia conversion code.
> With the pure Julia converter, converting to UTF-8 is about 57% slower than converting to UTF-16 (which is to be expected: dealing with UTF-8 is generally significantly slower than UTF-16).
> This was on large strings (5120 bytes), which should have reduced the effect of setting up the StringEncoder in the iconv.jl tests.

5120 bytes still isn't that much to amortize the cost of creating a handle. But indeed, in many use cases that overhead is an issue, so that's fair.

> ICU.jl was pretty fast, but only supports UTF-16 (there is support for UTF-8 now in ICU, but it isn't
> available in ICU.jl).

Why do you say that? UnicodeExtras.jl supports UTF8String too, though it would likely be slow because I think ICU uses UTF-16 as an intermediate.
https://github.com/nolta/UnicodeExtras.jl#file-encoding

> I'm not sure exactly what you want, about measuring converting many strings.
> I suppose I could call StringEncoder directly, instead of using decode().
> Is that what you were thinking of?

Yeah, though I don't expect this to change the results too much. I wonder why iconv (and ICU to some extent) are so slow. Note that I haven't done any optimization on iconv.jl, so there might be significant performance issues to fix before considering the results as significant.

> About the pure-Julia converters, I think it will be pretty easy for me to add all 8-bit encodings, with all of the invalid-character behaviors.
> I was also thinking about using the Encoding types (if you looked at the code in the Gist), having the tables loadable on demand, with a Dict to keep track of loaded encodings, and finally implementing next(), start(), done(), so that it would be possible to get a character at a time from an IOBuffer, a Vector{UInt8}, a pointer, or maybe even MMap'ed memory, based on a loaded encoding.
> That would also make transcoding (using Unicode as the intermediary, like iconv does) easy in pure Julia.
> We could still use iconv for multi-byte character sets until we have time to move those to pure Julia.
> (I've done this in the past, pretty efficiently, in a table-driven fashion, so I'm confident I could add that part; it's just a matter of time to do it [not too long, but more than a Saturday afternoon] and my "paying work" priorities (we need 8-bit support now, but MB character sets, not yet).)
>
> The big takeaway, which I learned when doing the Unicode conversion code in Base after being pushed by Tony et al. to do it in pure Julia, is that pure Julia rocks! 😀

Makes sense.


ScottPJones avatar ScottPJones commented on July 23, 2024

> Why do you say that? UnicodeExtras.jl supports UTF8String too, though it would likely be slow because I think ICU uses UTF-16 as an intermediate.

Ah, we are using ICU.jl at work, so that's what I benchmarked. I'll have to look at UnicodeExtras.jl

> Yeah, though I don't expect this to change the results too much. I wonder why iconv (and ICU to some extent) are so slow. Note that I haven't done any optimization on iconv.jl, so there might be significant performance issues to fix before considering the results as significant.

Yes, I understand iconv.jl isn't optimized yet. We'd really need to benchmark separately:

  1. calling iconv_open;
  2. the rest of the overhead of creating a StringEncoder
     (might show that we need to cache encoders, just like I want to cache the tables for pure Julia);
  3. the actual call to iconv!;
  4. the overhead of doing things through IOBuffer / write.

Part of the problem might be that the transcoding iconv call might not be optimizing or simplifying the case where the source or destination is UTF-8 or UTF-16. (Internally all transcoding goes via Unicode, I think UTF-16 or UTF-32.)
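
A rough way to separate the setup cost (items 1–2) from the full conversion path could look like this; `StringDecoder`/`decode` follow StringEncodings' current API, which may not match the iconv.jl-era code being benchmarked:

```julia
# Hedged sketch separating handle/stream setup cost from the full
# conversion, per the breakdown above.
using StringEncodings

bytes = encode("some sample text to convert", "ISO-8859-1")

# Full path: handle creation + conversion + IOBuffer plumbing per call
t_full = @elapsed for _ in 1:1000
    decode(bytes, "ISO-8859-1")
end

# Setup only: create and close a decoder stream without reading it,
# approximating items 1-2 (iconv_open + stream construction)
t_setup = @elapsed for _ in 1:1000
    close(StringDecoder(IOBuffer(bytes), "ISO-8859-1"))
end

println("full decode: $t_full s, setup only: $t_setup s for 1000 calls")
```

If setup dominates, caching encoders (as suggested above) would be the obvious fix.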

I like to benchmark as many possibilities as possible - so I don't get caught out by some case I didn't benchmark that turns out to be common.


sambitdash avatar sambitdash commented on July 23, 2024

julia-iconv performance will be the platform-native iconv performance plus a small marshalling overhead from the Julia-to-C communication layer. iconv is a very mature library: no significant code additions have gone in since 2011, which makes it ultra-stable. Any newer library may lack that track record. Unless proven otherwise, it may not be a good idea to move away from iconv.


nalimilan avatar nalimilan commented on July 23, 2024

The problem is not the cost of communication between Julia and C (that cost should be essentially zero); it's just that iconv is said to be relatively slow. Julia allows generating very efficient code on the fly for common conversions, which should be worth it at least for simple cases. Anyway, we'd only do this if benchmarks show it's really faster.
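
As an illustration of why a conversion specialized for one encoding pair can beat a generic handle-based library, here is a hedged sketch of a Latin-1 → UTF-8 converter (the function name is made up; this is just the standard UTF-8 two-byte rule, not code from either package):

```julia
# Hedged sketch: a converter specialized for one encoding pair, with
# no per-call handle or table lookup. Latin-1 code points 0x00-0xFF
# map directly to Unicode, so encoding to UTF-8 is branch-per-byte.
function latin1_to_utf8(bytes::Vector{UInt8})
    out = IOBuffer(sizehint = length(bytes))
    @inbounds for b in bytes
        if b < 0x80
            write(out, b)                  # 1-byte ASCII
        else
            write(out, 0xc0 | (b >> 6))    # 2-byte UTF-8 sequence
            write(out, 0x80 | (b & 0x3f))
        end
    end
    String(take!(out))
end
```

For example, `latin1_to_utf8(UInt8[0x48, 0xe9])` yields `"Hé"`. The whole loop compiles to a tight branch per byte, with none of the generic dispatch an iconv-style converter pays for.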

