
Comments (6)

nalimilan avatar nalimilan commented on July 23, 2024

Could you write a summary of the results? Also, I think it would make sense to measure the performance of converting many strings using the same iconv/ICU handler, since this is a more reasonable scenario.

I think progressively adding pure-Julia converters, starting with the most common encodings, is a good idea. That would justify changing the name of the package. One difficult point is getting relatively consistent behavior as regards invalid characters, given that iconv will be less flexible than your Julia code; need to think about it. Anyway, if you can find a consistent plan to implement this, that would be nice.

from stringencodings.jl.

ScottPJones avatar ScottPJones commented on July 23, 2024

The test in the gist used different sizes: 1, 16, 32, 64, 256, 1024, and 5120 bytes.
The strings were created with random bytes 0–255, then any bytes invalid for that character set were changed to '?'.
iconv.jl was quite a bit slower: > 70x slower even on larger strings when converting to UTF-16,
and > 45x slower when converting to UTF-8.
ICU.jl was about twice as slow in general as the pure Julia conversion code.
With the pure Julia converter, converting to UTF-8 is about 57% slower than converting to UTF-16 (which is to be expected: dealing with UTF-8 is generally significantly slower than UTF-16).
This was on large strings (5120 bytes), which should have reduced the effect of setting up the StringEncoder in the iconv.jl tests.
ICU.jl was pretty fast, but only supports UTF-16 (there is support for UTF-8 now in ICU, but it isn't
available in ICU.jl).
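
For reference, a minimal sketch of the kind of benchmark described above, using the sizes from this thread. The `decode` call follows StringEncodings' API, which may differ from the iconv.jl-era code being discussed, so treat the exact names as an assumption:

```julia
# Hedged benchmark sketch: decode random byte buffers of the sizes
# mentioned above. Assumes StringEncodings' decode(bytes, enc) API.
using StringEncodings

for n in (1, 16, 32, 64, 256, 1024, 5120)
    bytes = rand(UInt8, n)
    # For ISO-8859-1 every byte is valid; stricter 8-bit charsets would
    # need invalid bytes replaced with '?' as described above.
    t = @elapsed for _ in 1:1000
        decode(bytes, "ISO-8859-1")
    end
    println("n = $n: ", t / 1000 * 1e6, " µs per conversion")
end
```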

I'm not sure exactly what you want, about measuring converting many strings.
I suppose I could call StringEncoder directly, instead of using decode().
Is that what you were thinking of?

About the pure-Julia converters, I think it will be pretty easy for me to add all 8-bit encodings, with all of the invalid-character behaviors.
I was also thinking about using the Encoding types (if you looked at the code in the Gist), having the tables loadable on demand, with a Dict to keep track of loaded encodings, and finally implementing next(), start(), done(), so that it would be possible to get a character at a time from an IOBuffer, a Vector{UInt8}, a pointer, or maybe even MMap'ed memory, based on a loaded encoding.
That would also make transcoding (using Unicode as the intermediary, like iconv does) easy in pure Julia.
We could still use iconv for multi-byte character sets until we have time to move those to pure Julia.
(I've done this in the past, pretty efficiently, in a table-driven fashion, so I'm confident I could add that part; it's just a matter of time to do it [not too long, but more than a Saturday afternoon] and my "paying work" priorities (we need 8-bit support now, but MB character sets, not yet).)
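
The table-driven design described above might be sketched roughly like this; the cache structure and the identity mapping are illustrative assumptions, not the actual Gist code:

```julia
# Hedged sketch: one 256-entry Char table per 8-bit encoding, loaded
# on demand and cached in a Dict, as described above. Table contents
# here are placeholders (identity / Latin-1 mapping).
const TABLE_CACHE = Dict{String, Vector{Char}}()

function load_table(enc::String)
    get!(TABLE_CACHE, enc) do
        # Real code would load the mapping from a data file; as a
        # stand-in, treat every encoding as Latin-1.
        [Char(i) for i in 0x00:0xff]
    end
end

function decode_8bit(bytes::Vector{UInt8}, enc::String)
    table = load_table(enc)
    io = IOBuffer()
    for b in bytes
        # Invalid positions could hold a sentinel such as '\ufffd',
        # dispatched on per the chosen invalid-character policy.
        print(io, table[Int(b) + 1])
    end
    String(take!(io))
end
```

The Dict lookup happens once per call, so the per-byte loop stays a plain table index, which is what makes the 8-bit case easy to do fast in pure Julia.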

The big takeaway, which I learned when doing the Unicode conversion code in Base after being pushed by Tony et al. to do it in pure Julia, is that pure Julia rocks! 😀


nalimilan avatar nalimilan commented on July 23, 2024

> The test in the gist used different sizes: 1, 16, 32, 64, 256, 1024, and 5120 bytes.
> The strings were created with random bytes 0–255, then any bytes invalid for that character set were changed to '?'.
> iconv.jl was quite a bit slower: > 70x slower even on larger strings when converting to UTF-16,
> and > 45x slower when converting to UTF-8.
> ICU.jl was about twice as slow in general as the pure Julia conversion code.
> With the pure Julia converter, converting to UTF-8 is about 57% slower than converting to UTF-16 (which is to be expected: dealing with UTF-8 is generally significantly slower than UTF-16).
> This was on large strings (5120 bytes), which should have reduced the effect of setting up the StringEncoder in the iconv.jl tests.

5120 bytes still isn't that much to amortize the cost of creating a handle. But indeed, in many use cases that overhead is an issue, so that's fair.

> ICU.jl was pretty fast, but only supports UTF-16 (there is support for UTF-8 now in ICU, but it isn't
> available in ICU.jl).

Why do you say that? UnicodeExtras.jl supports UTF8String too, though it would likely be slow because I think ICU uses UTF-16 as an intermediate.
https://github.com/nolta/UnicodeExtras.jl#file-encoding

> I'm not sure exactly what you want, about measuring converting many strings.
> I suppose I could call StringEncoder directly, instead of using decode().
> Is that what you were thinking of?

Yeah, though I don't expect this to change the results too much. I wonder why iconv (and ICU to some extent) are so slow. Note that I haven't done any optimization on iconv.jl, so there might be significant performance issues to fix before considering the results as significant.

> About the pure-Julia converters, I think it will be pretty easy for me to add all 8-bit encodings, with all of the invalid-character behaviors.
> I was also thinking about using the Encoding types (if you looked at the code in the Gist), having the tables loadable on demand, with a Dict to keep track of loaded encodings, and finally implementing next(), start(), done(), so that it would be possible to get a character at a time from an IOBuffer, a Vector{UInt8}, a pointer, or maybe even MMap'ed memory, based on a loaded encoding.
> That would also make transcoding (using Unicode as the intermediary, like iconv does) easy in pure Julia.
> We could still use iconv for multi-byte character sets until we have time to move those to pure Julia.
> (I've done this in the past, pretty efficiently, in a table-driven fashion, so I'm confident I could add that part; it's just a matter of time to do it [not too long, but more than a Saturday afternoon] and my "paying work" priorities (we need 8-bit support now, but MB character sets, not yet).)
>
> The big takeaway, which I learned when doing the Unicode conversion code in Base after being pushed by Tony et al. to do it in pure Julia, is that pure Julia rocks! 😀

Makes sense.


ScottPJones avatar ScottPJones commented on July 23, 2024

> Why do you say that? UnicodeExtras.jl supports UTF8String too, though it would likely be slow because I think ICU uses UTF-16 as an intermediate.

Ah, we are using ICU.jl at work, so that's what I benchmarked. I'll have to look at UnicodeExtras.jl

> Yeah, though I don't expect this to change the results too much. I wonder why iconv (and ICU to some extent) are so slow. Note that I haven't done any optimization on iconv.jl, so there might be significant performance issues to fix before considering the results as significant.

Yes, I understand iconv.jl isn't optimized yet. We'd really need to benchmark separately:

  1. calling iconv_open;
  2. the rest of the overhead of creating a StringEncoder
     (might show that we need to cache encoders, just like I want to cache the tables for pure Julia);
  3. the actual call to iconv!;
  4. the overhead of doing things through IOBuffer / write.

Part of the problem might be that the transcoding iconv call might not be optimizing or simplifying the case where the source or destination is UTF-8 or UTF-16. (Internally all transcoding goes via Unicode, I think UTF-16 or UTF-32.)
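
A rough way to separate the setup cost (items 1–2) from the full conversion path could look like this; `StringDecoder`/`decode` follow StringEncodings' current API, which may not match the iconv.jl-era code being benchmarked:

```julia
# Hedged sketch separating handle/stream setup cost from the full
# conversion, per the breakdown above.
using StringEncodings

bytes = encode("some sample text to convert", "ISO-8859-1")

# Full path: handle creation + conversion + IOBuffer plumbing per call
t_full = @elapsed for _ in 1:1000
    decode(bytes, "ISO-8859-1")
end

# Setup only: create and close a decoder stream without reading it,
# approximating items 1-2 (iconv_open + stream construction)
t_setup = @elapsed for _ in 1:1000
    close(StringDecoder(IOBuffer(bytes), "ISO-8859-1"))
end

println("full decode: $t_full s, setup only: $t_setup s for 1000 calls")
```

If setup dominates, caching encoders (as suggested above) would be the obvious fix.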

I like to benchmark as many possibilities as possible - so I don't get caught out by some case I didn't benchmark that turns out to be common.


sambitdash avatar sambitdash commented on July 23, 2024

julia-iconv performance will be the platform-native iconv performance plus a small marshalling overhead from the Julia-to-C communication layer. iconv is a very mature library: no significant code additions have gone in since 2011, which makes it ultra-stable. Any newer library may lack that track record. Unless proven otherwise, it may not be a good idea to move away from iconv.


nalimilan avatar nalimilan commented on July 23, 2024

The problem is not the cost of communication between Julia and C (that cost should be essentially zero); it's just that iconv is said to be relatively slow. Julia allows generating very efficient code on the fly for common conversions, which should be worth it at least for simple cases. Anyway, we'd only do this if benchmarks show it's really faster.
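
As an illustration of why a conversion specialized for one encoding pair can beat a generic handle-based library, here is a hedged sketch of a Latin-1 → UTF-8 converter (the function name is made up; this is just the standard UTF-8 two-byte rule, not code from either package):

```julia
# Hedged sketch: a converter specialized for one encoding pair, with
# no per-call handle or table lookup. Latin-1 code points 0x00-0xFF
# map directly to Unicode, so encoding to UTF-8 is branch-per-byte.
function latin1_to_utf8(bytes::Vector{UInt8})
    out = IOBuffer(sizehint = length(bytes))
    @inbounds for b in bytes
        if b < 0x80
            write(out, b)                  # 1-byte ASCII
        else
            write(out, 0xc0 | (b >> 6))    # 2-byte UTF-8 sequence
            write(out, 0x80 | (b & 0x3f))
        end
    end
    String(take!(out))
end
```

For example, `latin1_to_utf8(UInt8[0x48, 0xe9])` yields `"Hé"`. The whole loop compiles to a tight branch per byte, with none of the generic dispatch an iconv-style converter pays for.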

