It would be great to add in-place conversion from/to utf-16 with endianess hints. To k

Endianess hints, about nemtrif/utfcpp

Comments (17)

nemtrif commented on August 21, 2024

Not sure I understand your proposal. Would you provide an example of such function?

from utfcpp.

nemtrif commented on August 21, 2024

No update since December. Closing...

ceztko commented on August 21, 2024

Sorry, when I read your reply I couldn't answer, and then I forgot. My request is about adding full support for byte swapping of UTF-16 data when needed or requested. I would add more raw overloads of the functions utf16to8 and utf8to16, taking a boolean flag swapbytes, to the header unchecked.h (and probably also to checked.h). In utf16to8 the flag means the bytes of the input UTF-16 string are interpreted as swapped; in utf8to16 it means the bytes of the output UTF-16 string are swapped. Then I would add overloads of the same functions with an enum parameter hint like this:

enum class utf16endianess
{
    platform,
    smallendian,
    bigendian,
};

Calling these overloads with utf16endianess::platform calls the functions utf16to8 and utf8to16 with swapbytes set to false. Calling them with a different utf16endianess value performs a runtime check to detect the endianness of the system and swaps the bytes if the platform and target endianness differ.
The current overloads of utf16to8 and utf8to16, with neither the hint nor the swapbytes parameter, just call the added overload with hint equal to utf16endianess::platform (adding no performance penalty in normal use).

In summary, the signatures would look like this:

octet_iterator utf16to8(u16bit_iterator start, u16bit_iterator end, octet_iterator result);
octet_iterator utf16to8(bool swapbytes, u16bit_iterator start, u16bit_iterator end, octet_iterator result);
octet_iterator utf16to8(utf16endianess hint, u16bit_iterator start, u16bit_iterator end, octet_iterator result);
u16bit_iterator utf8to16(octet_iterator start, octet_iterator end, u16bit_iterator result);
u16bit_iterator utf8to16(bool swapbytes, octet_iterator start, octet_iterator end, u16bit_iterator result);
u16bit_iterator utf8to16(utf16endianess hint, octet_iterator start, octet_iterator end, u16bit_iterator result);

Also utf32 would need a similar treatment. Please note this issue is not about adding handling of BOM bytes.
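
A minimal sketch of how the enum hint could be reduced to the boolean swapbytes flag described above. All names here (endian_hint, needs_swap, swap16, read_unit) are hypothetical and chosen for illustration; they are not taken from utfcpp or from the pull request, and the runtime probe is just one possible way to detect the platform byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical names for illustration; not part of utfcpp or the pull request.
enum class endian_hint { platform, little, big };

// Runtime probe: look at the first byte of a known 16-bit value.
inline bool platform_is_little_endian() {
    const std::uint16_t probe = 0x0102;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 0x02;
}

// Maps the enum hint onto the boolean "swapbytes" flag.
inline bool needs_swap(endian_hint hint) {
    if (hint == endian_hint::platform) return false;
    const bool data_is_little = (hint == endian_hint::little);
    return data_is_little != platform_is_little_endian();
}

// Swaps the two bytes of one UTF-16 code unit.
inline std::uint16_t swap16(std::uint16_t v) {
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// Reads one code unit, swapping only when the flag asks for it.
inline std::uint16_t read_unit(bool swapbytes, std::uint16_t raw) {
    return swapbytes ? swap16(raw) : raw;
}

inline void hint_example() {
    // Platform-order data never needs a swap.
    assert(read_unit(needs_swap(endian_hint::platform), 0x0041) == 0x0041);
    // Exactly one of the two explicit orders differs from the platform.
    assert(needs_swap(endian_hint::little) != needs_swap(endian_hint::big));
}
```

The decision is made once per call, so the per-unit work is a single branch on a value that does not change inside the loop.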

ceztko commented on August 21, 2024

I request re-opening of the ticket, since your questions were answered and pull request #67 with the suggested implementation has been created.

patrolez commented on August 21, 2024

I also think this is important, but the topic does not seem to have been explored in depth by the repository maintainer.

UTF-16 can be either UTF-16BE or UTF-16LE.
UTF-32 can be either UTF-32BE or UTF-32LE.

BE and LE denote the endianness.

The BOM ("byte order mark") exists to identify endianness, but it is intended for stored text files, not for string processing.
For string processing, this information must be provided through a side channel.

UTF-8 is not affected, since its code units are one byte each; there the BOM only serves as a signature that distinguishes UTF-8 text files from UTF-16/UTF-32 ones.
A UTF-16 code unit is encoded in 2 bytes, so one CPU might store 0x0A0B as the bytes [0A, 0B] and another as [0B, 0A]. Each interprets its own representation as 0x0A0B, but exchanging the raw bytes between them results in the misinterpretation 0x0B0A.
A UTF-32 code unit is encoded in 4 bytes, so one CPU might store 0x0A0B0C0D as the bytes [0A, 0B, 0C, 0D] and another as [0D, 0C, 0B, 0A]. Each interprets its own representation as 0x0A0B0C0D, but exchanging the raw bytes between them results in the misinterpretation 0x0D0C0B0A.

The endianness problem appears when:

two instances of the same application are compiled for two CPU architectures with different endianness and try to exchange data,
two differently specified communication protocols, each with an explicitly stated endianness, are meant to be used together.
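
The misinterpretation described above can be reproduced with a few lines. This is only an illustrative sketch: the helper names (to_be_bytes, from_be_bytes, from_le_bytes) are made up for the example, and the bytes are assembled explicitly so the result does not depend on the machine that runs it.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using two_bytes = std::array<unsigned char, 2>;

// Serializes one UTF-16 code unit in big-endian byte order (UTF-16BE).
inline two_bytes to_be_bytes(std::uint16_t unit) {
    two_bytes b = {{ static_cast<unsigned char>(unit >> 8),
                     static_cast<unsigned char>(unit & 0xFF) }};
    return b;
}

// Reassembles a code unit, treating the bytes as big-endian.
inline std::uint16_t from_be_bytes(const two_bytes& b) {
    return static_cast<std::uint16_t>((b[0] << 8) | b[1]);
}

// Reassembles a code unit, treating the same bytes as little-endian.
inline std::uint16_t from_le_bytes(const two_bytes& b) {
    return static_cast<std::uint16_t>((b[1] << 8) | b[0]);
}

inline void byte_order_example() {
    const two_bytes wire = to_be_bytes(0x0A0B);
    assert(from_be_bytes(wire) == 0x0A0B);  // reader agrees with the writer
    assert(from_le_bytes(wire) == 0x0B0A);  // reader assumes the wrong order
}
```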

ceztko commented on August 21, 2024

Thank you @patrolez for the in-depth explanation. At the time I opened pull request #67, which was promptly closed by the maintainer without explanation. I don't blame @nemtrif: maybe he was just very busy, as we all are. On my side, I didn't do my homework very well either: the first pull request was problematic since it didn't compile on non-MSVC compilers and also had one oversight. My second attempt addresses those issues, and I use it in my PDF library, which is battle-tested on several platforms (Windows, Linux, Android, iOS, macOS...). I would like to pursue merging endianness hint support into utfcpp again, but this time I will wait for the maintainer to become more aware of what is requested and to say whether he agrees on adding such a feature to utfcpp.

ceztko commented on August 21, 2024

I would also add that having endianness-aware support directly in utfcpp, instead of leaving the user to perform the swap separately, matters because of const buffers: the memory from which strings with a different endianness are read may be read-only, and copying it to a separate buffer may be a waste. It's much more convenient if the code performing the conversions is endianness-aware and performs the right swap accordingly. In my implementation I was careful not to have a conditional check at each code unit iteration deciding whether to perform the swap. Instead I made a templatized version of the code that uses a static method handler that performs (or skips) the swap. In this way, the non-swap code path should end up being a no-op in optimized builds. I didn't actually check the generated assembly, but other people may recommend better code if it's not as efficient as intended.
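
A rough sketch of the "static method handler" idea, assuming a loop that is templated on a handler type. The names no_swap, byte_swap and count_code_points are hypothetical and only stand in for the real conversion code; the point is that the swap decision is a template parameter rather than a per-unit runtime check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Handler for data already in native order: does nothing.
struct no_swap {
    static std::uint16_t apply(std::uint16_t v) { return v; }
};

// Handler for data in the opposite byte order: swaps the two bytes.
struct byte_swap {
    static std::uint16_t apply(std::uint16_t v) {
        return static_cast<std::uint16_t>((v << 8) | (v >> 8));
    }
};

// Stand-in for a conversion loop: counts code points in a UTF-16 range.
// The handler is chosen at compile time, so there is no branch per unit.
template <typename Handler, typename It>
std::size_t count_code_points(It first, It last) {
    std::size_t n = 0;
    for (; first != last; ++first) {
        const std::uint16_t unit =
            Handler::apply(static_cast<std::uint16_t>(*first));
        if (unit < 0xDC00 || unit > 0xDFFF) ++n;  // skip trail surrogates
    }
    return n;
}

inline void handler_example() {
    // 'A' followed by U+1F600 (a surrogate pair): two code points.
    const std::uint16_t native[]  = {0x0041, 0xD83D, 0xDE00};
    const std::uint16_t swapped[] = {0x4100, 0x3DD8, 0x00DE};
    assert((count_code_points<no_swap>(native, native + 3) == 2));
    assert((count_code_points<byte_swap>(swapped, swapped + 3) == 2));
}
```

Whether no_swap::apply really compiles down to nothing depends on the compiler and optimization level; inspecting the generated assembly would be the way to confirm it.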

dishather commented on August 21, 2024

I also add that the importance of having endianness aware support directly in utfcpp, instead of just leaving the user manually performing the swap separately, is because of const buffers: the memory where to read strings with different endianness may be read only, and copy it to separate buffer may be a waste.

The library uses uint16_t for UTF16 data. This makes the library endianness-agnostic. If we want to serialize/deserialize UTF16 to/from a stream of bytes, this must be the place where we handle the endianness issues, right? If some code simply gets a pointer to a string of uint16_t's, and treats it as a byte pointer, the code is incorrect.

"Swapping" a uint16_t will produce an incorrect UTF16 string in memory. Combined with an incorrect serialization routine, it is supposed to work (the bugs will magically annihilate each other). Looks flimsy... Besides, what if the code tries to work with the uint16_t string directly? The library will feed it invalid UTF16 data. What if the serialization code is finally fixed? We will get an invalid UTF16 representation because the data in the UTF16 string is backasswards.

Calling these overloads with utf16endianess::platform calls the functions utf16to8 and utf8to16 with swapbytes set to false. Calling them with a different utf16endianess value performs a runtime check to detect the endianness of the system and swaps the bytes if the platform and target endianness differ.
The current overloads of utf16to8 and utf8to16, with neither the hint nor the swapbytes parameter, just call the added overload with hint equal to utf16endianess::platform (adding no performance penalty in normal use).

Some thoughts on this:

Checking endianness at runtime is not needed. The program cannot change its endianness once compiled, so this information is readily available at compile time. Since C++20, we have std::endian; before that, we have defines like __ORDER_LITTLE_ENDIAN__, __ORDER_BIG_ENDIAN__, and __BYTE_ORDER__.
I'd prefer extra template parameters to adding function args. Function arguments take up stack space (or machine registers); template parameters can be completely reduced to zero code bloat by the compiler.
As said above, endianness comes into play when we convert UTF-16 strings to/from bytes. So the swapping must be done in the serialization code, not in the library.
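
A sketch of compile-time detection combined with a template parameter, under the assumption that the toolchain either provides the __BYTE_ORDER__ family of macros or targets Windows. The SKETCH_LITTLE_ENDIAN macro, the byte_order enum and to_native are hypothetical names; with C++20, std::endian::native could replace the preprocessor part.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    defined(__ORDER_BIG_ENDIAN__)
#  if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#    define SKETCH_LITTLE_ENDIAN 1
#  elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#    define SKETCH_LITTLE_ENDIAN 0
#  else
#    error "unsupported byte order"
#  endif
#elif defined(_WIN32)
#  define SKETCH_LITTLE_ENDIAN 1  // assumption: Windows targets are little-endian
#else
#  error "byte order unknown; define SKETCH_LITTLE_ENDIAN manually"
#endif

enum class byte_order { little, big };

constexpr byte_order native_order =
    SKETCH_LITTLE_ENDIAN ? byte_order::little : byte_order::big;

// Converts a code unit stored in DataOrder to native order.
// The comparison is a compile-time constant, so one branch is dead code.
template <byte_order DataOrder>
constexpr std::uint16_t to_native(std::uint16_t v) {
    return DataOrder == native_order
               ? v
               : static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

inline void compile_time_example() {
    // Cross-check the macro-based answer against a runtime probe.
    const std::uint16_t probe = 0x0102;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    assert((first == 0x02) == (native_order == byte_order::little));
    assert(to_native<native_order>(0x0A0B) == 0x0A0B);
}
```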

ceztko commented on August 21, 2024

The library uses uint16_t for UTF16 data. This makes the library endianness-agnostic. If we want to serialize/deserialize UTF16 to/from a stream of bytes, this must be the place where we handle the endianness issues, right? If some code simply gets a pointer to a string of uint16_t's, and treats it as a byte pointer, the code is incorrect.

I would be more concerned if this heavily affected the std::u16string container, which is more explicit about its content. A container of uint16_t is more abstract about its content, so the end user must be even more careful about how to use it. In my implementation[1] I added endianness hints exclusively to the low-level functions. In this way it is entirely the user's responsibility to supply a buffer of 2-byte units of their choice, and the user is solely responsible for correctly using the buffer they passed. Other high-level functions, for example functions that handle std::u16string directly, are not affected.

Checking endianness in runtime is not needed. The program cannot change its endianness once compiled, so this information is readily available during compile-time. Since C++20, we have std::endian; before that, we have defines like __ORDER_LITTLE_ENDIAN__, __ORDER_BIG_ENDIAN__, __BYTE_ORDER__,

In my implementation I ended up not trusting the compiler toolchains. I may change my mind, though. Can you provide evidence that these defines are reliable before C++20 on the more relevant platforms?

I'd prefer extra template parameters to adding function args. Function arguments take up stack space (or machine registers); template parameters can be completely reduced to zero code bloat by the compiler.

This is possible only if one decides to trust the defines above.

As said above, endianness comes into play when we convert utf16 strings to/from bytes. So the swapping must be done in serialization code, not in the library.

So, if I understood correctly, you're basically against adding this feature to the library. As I already said, sometimes the buffer one reads content from can be read-only, leaving the user in the uncomfortable situation of needing to allocate a separate buffer to perform the swap. Also, the ICU library appears to have endianness hint support[2]. Because of these two supporting facts, I disagree with your conclusion that utfcpp is not the right place to perform the swap.

[1] https://github.com/ceztko/utfcpp/blob/8a551a3afed5478f98a1defc9db5bbb725324892/source/utf8/checked.h#L284
[2] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucnv_8h.html#adb0b44c6bd828c9d4cc2defcbba0f902aac97a8806bad1e28965f045cdbd8e305

dishather commented on August 21, 2024

In my implementation I ended not trusting the compiler toolchains. I may change my mind though. Can you provide evidence that they are reliable pre C++20 for the more relevant platforms?

The definitions are supported by GCC, clang, and modern versions of Microsoft C++ compiler (Visual Studio 2019 and further on), so most relevant platforms are covered.
On rare or obscure platforms we could always resort to good old compile-time definitions like -DLITTLE_ENDIAN.

BTW, I checked the source code for boost::endian, and they only check some assorted defines (plus special treatment for older MS C++ compilers). See https://www.boost.org/doc/libs/develop/boost/endian/detail/order.hpp

As I already said sometimes the buffer when one reads content from can be read only, leaving the user in the uncomfortable situation of needing to allocate a separate buffer to perform the swap.

Not necessarily. There is a better, more elegant solution.
The user can supply his/her own u16bit_iterator implementation for reading the data from such buffers. This iterator will do the required conversions under the hood. Moreover, this is the only correct way of accessing such data - and it does not require any modifications to the library.

Also ICU library appears to have endianness hints support[2].

Which is not surprising because ICU works with byte buffers.

Because of these two supporting facts, I disagree with your conclusion that utfcpp is not the right place where to perform the swap.

What I am mostly against is changing the library code. This may be unavoidable for ICU, which is written in C; on the other hand, utf8cpp is an extensible library based on C++ templates. Thus, it can be extended simply by providing code as template parameters.

If you are determined to add endianness support to the library, I propose extending it with two iterators:
u16bit_be_iterator and u16bit_le_iterator. The first one will treat the underlying bytes as uint16_t's in big-endian byte order, the second in little-endian. Since they both do similar conversions, there is no need to detect the platform endianness at all (using u16bit_be_iterator on, say, a big-endian machine will not be any faster than using u16bit_le_iterator).
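
One possible shape for such iterators, as a minimal sketch. The names u16bit_be_iterator and u16bit_le_iterator follow the proposal above, while the shared template u16_byte_iterator is an assumption made for the example; neither exists in utfcpp. The iterator reads individual bytes, so it needs neither alignment nor knowledge of the platform byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>

// Hypothetical input iterator: yields UTF-16 code units assembled from a
// byte buffer in the requested byte order. Only bytes are dereferenced.
template <bool BigEndian>
class u16_byte_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = std::uint16_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const std::uint16_t*;
    using reference = std::uint16_t;

    explicit u16_byte_iterator(const unsigned char* p) : p_(p) {}

    std::uint16_t operator*() const {
        return BigEndian
                   ? static_cast<std::uint16_t>((p_[0] << 8) | p_[1])
                   : static_cast<std::uint16_t>((p_[1] << 8) | p_[0]);
    }
    u16_byte_iterator& operator++() { p_ += 2; return *this; }
    u16_byte_iterator operator++(int) {
        u16_byte_iterator old(*this);
        p_ += 2;
        return old;
    }
    bool operator==(const u16_byte_iterator& o) const { return p_ == o.p_; }
    bool operator!=(const u16_byte_iterator& o) const { return p_ != o.p_; }

private:
    const unsigned char* p_;
};

using u16bit_be_iterator = u16_byte_iterator<true>;
using u16bit_le_iterator = u16_byte_iterator<false>;

inline void iterator_example() {
    // "A" followed by the lead surrogate 0xD83D, in both byte orders.
    const unsigned char be[] = {0x00, 0x41, 0xD8, 0x3D};
    const unsigned char le[] = {0x41, 0x00, 0x3D, 0xD8};
    u16bit_be_iterator b(be);
    u16bit_le_iterator l(le);
    assert(*b == *l);      // both yield 0x0041
    ++b; ++l;
    assert(*b == 0xD83D && *l == 0xD83D);
}
```

In principle a pair of these could be passed wherever the library expects a u16bit_iterator range, but that pairing is not tested here and the byte range must have an even length.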

ceztko commented on August 21, 2024

Moreover, this is the only correct way of accessing such data

I don't understand exactly why you have such a strong conviction, but that sounds a bit opinionated to me. I would rather frame the topic this way:

(1) One may prefer the core library to be endianness-aware, because the Unicode specification mandates the existence of different encodings and it's helpful when the library API is expressive enough that you can find the right tool at a glance to solve a common issue;
(2) Others may prefer the core library to be endianness-unaware, because in this case (C++ templatized code) the same issue can be solved by supplying external iteration structures.

Both solutions can be coded efficiently, and there is IMO no reason to believe one is right and the other is wrong. I would say it's just a matter of taste, with (1) being a little bit more user-friendly because the solution to a common issue is coupled with the API itself rather than being a decoupled structure to plug in, which implies that the user must know such a structure exists in order to use it. Even in case (2), it would be a little bit lazy for the library itself not to provide such extra iteration structures, because writing custom iterators in C++ is notoriously not a 1-minute task (I know from having created some), unless you do it every other day, which is rare.

Of course I would love utfcpp to supply one of these two solutions, because I don't want to copy boilerplate code into every place where I use utfcpp, but it's up to the maintainer to decide whether to accept patches here and which approach should be used.

dishather commented on August 21, 2024

I don't understand exactly why you have such strong conviction but that sounds a bit opinionated to me.

I admit I sound somewhat categorical, so let me explain:

(On preferences). The preferred way of extending (or modifying the behavior of) template-based libraries is by supplying template parameters, not by modifying the library sources. For example, if a user wants to "extend" std::sort to sort in descending order, the preferred way is to supply a comparator to the template. Modifying the library function to accept a flag (e.g., bool reverse_sort) is not a good idea. Similarly, if a user wants std::copy() to add elements to the destination container, he/she simply uses std::back_inserter - which in essence changes the type of a template parameter.
(On correctness). The current implementation of utf8cpp uses u16bit_iterators, which can be pointers, iterators to std::u16string, etc. In fact, the only way to avoid making a copy of read-only data (a string of bytes in UTF16-BE encoding) is to use a pointer to uint16_t. But if the data is not properly aligned (remember: it's a string of BYTES after all), on some platforms dereferencing the pointer will generate a hardware exception. Say, if the code is executed on an ARM chip, it may die with a SIGBUS error. The problem is that we are accessing bytes using a pointer to another, wider data type. That's why I think that using an intelligent iterator instead of a pointer is a much better solution (besides, the iterator will access the underlying data as bytes, not as uint16_t's).

Extensibility through template parameters is the key to the success of the C++ Standard Library. Supplying your own template type instead of a default one is easy, it generates no extra code, it is portable and does not require modifications to already existing code.
Adding extra flags to library functions, on the other hand, requires modifications to the existing code, generates extra machine code (passing and checking the flag, etc).
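
The two standard library examples mentioned above look roughly like this. The wrapper functions sorted_descending and appended are only there to make the snippet self-contained; the relevant part is that the behavior changes through what is passed to std::sort and std::copy, not through a modified library.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <iterator>
#include <vector>

// Descending order via a comparator argument, not a "reverse" flag.
inline std::vector<int> sorted_descending(std::vector<int> v) {
    std::sort(v.begin(), v.end(), std::greater<int>());
    return v;
}

// Appending via an output iterator adaptor instead of a modified std::copy.
inline std::vector<int> appended(std::vector<int> dst,
                                 const std::vector<int>& src) {
    std::copy(src.begin(), src.end(), std::back_inserter(dst));
    return dst;
}

inline void extension_example() {
    assert((sorted_descending({3, 1, 2}) == std::vector<int>{3, 2, 1}));
    assert((appended({1, 2}, {3, 4}) == std::vector<int>{1, 2, 3, 4}));
}
```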

Both solutions can be coded efficiently and there's IMO no reason to believe one is right and the other is wrong.

Not quite. Adding an argument to a function and checking it inside means more machine instructions. The impact is tiny, but it is there anyway. Thus, a million calls to current implementation of utf16to8(start, end, result) will take less time than a million calls to your implementation (which in turn calls utf16to8(false, start, end, result)).

I would say it's just a matter of taste, with (1) being a little bit more user friendly because the solution of a common issue is coupled with the API itself and not a decoupled structure to plug-in, implying that the user must know the existence of such structure to use it.

These extra iterators can be incorporated into the library, included into the documentation, etc. Usage examples can be extended to employ the new iterators. Thus, library users will know of their existence and usage.

(2) it would be a little bit lazy not having the library itself providing such extra iteration structures, because writing custom iterators in C++ is notoriously not a 1 minute task (I know from having created some), unless you do it every other day which is rare.

I am sure nemtrif will add the iterators to the library once they are sufficiently tested and documented.

ceztko commented on August 21, 2024

@dishather I still have some comments on your points:

The flag approach I suggested was driven by the fact that I did not really trust non-standardized endianness macro configurations. If one is willing to trust the available toolchain macros, the same code can be rewritten with template-parameterized endianness hints, exactly as you suggest. Still, I understand your performance concerns: leaving the library as is certainly guarantees that performance will not decrease in the same-endianness no-op scenario, even in non-optimized builds. Nevertheless, I am convinced that in optimized builds an endianness-aware solution could be as efficient as the endianness-unaware one;
As a matter of fact, the low-level functions in upstream utfcpp already support iterating over pointer types (pointer ranges are valid iterators in most places in the STL anyway), so I don't understand your point about correctness and alignment. Strings are contiguous memory and have no strides: just pass a valid pointer range (const uint16_t* or const char16_t*), or cast some opaque pointer types to the above ones, and today's utfcpp low-level functions will work just fine. The range can be invalid, for example when (end - start) % 2 == 1, but that's a condition that should be handled in the current code anyway, and endianness is not relevant there.

As another consideration: from a maintainer's perspective, making no modifications to the core library is simply better than having to deal with several intrusive changes. If this helps in getting endianness hint support directly into utfcpp, then no doubt supplying external iterators is the way to go. From the perspective of an end user who just wants to decode some bytes, any library-supplied solution is better than no solution at all.

dishather commented on August 21, 2024

just pass a valid pointer range (const uint16_t* or const char16_t*), or cast some opaque pointer types to the above ones and today's utfcpp low level functions will work just fine.

I highlighted the wrong part. Here's the trap: they won't.
Some platforms require that, when you dereference a pointer, the data you are accessing is properly aligned. E.g., if you are dereferencing a pointer to uint16_t, the data must be aligned on an even boundary - which means the address the pointer points to must be even. And if you are dereferencing a uint32_t pointer, the address must be divisible by 4. The details may vary, but the gist is the same: you cannot just cast an arbitrary pointer to uint16_t* and dereference it. I stumbled upon this restriction many times on many different systems (SPARC and ARM jump to mind), so this restriction is not as esoteric as it might seem.

Thus, the only safe way to access a buffer of bytes is by dereferencing a uint8_t (or char, or unsigned char) pointer. This is guaranteed to produce the expected result on any system.
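
To make the point concrete, here is a small sketch of two ways to read a 16-bit value from an address that may be odd, without ever dereferencing a uint16_t pointer. The function names load_u16_native and load_u16_be are hypothetical; std::memcpy copies bytes and therefore has no alignment requirement on its source.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reads a native-order 16-bit value from a possibly unaligned address.
inline std::uint16_t load_u16_native(const unsigned char* p) {
    std::uint16_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Reads a big-endian 16-bit value byte by byte; works on any platform.
inline std::uint16_t load_u16_be(const unsigned char* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

inline void unaligned_example() {
    // The payload starts at offset 1, so it is misaligned for uint16_t
    // whenever the buffer itself starts at an even address.
    const unsigned char buf[] = {0xFF, 0x0A, 0x0B};
    assert(load_u16_be(buf + 1) == 0x0A0B);

    const std::uint16_t x = 0x1234;
    unsigned char raw[3] = {0, 0, 0};
    std::memcpy(raw + 1, &x, sizeof x);
    assert(load_u16_native(raw + 1) == 0x1234);
}
```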

As another consideration: from a maintainer perspective no modifications in the core library is just better than having to deal with several intrusive changes.

Yes, exactly so!

ceztko commented on August 21, 2024

@dishather thanks for spending some extra time to explain the alignment issue to me. Unaligned access is something I never stumbled upon, even when working with ARM, but that may be down to luck or because I used aligned buffers for strings that I may (or may not) decode. This issue does not appear to be currently checked in the upstream utfcpp (refer to the first dereference in the checked API), so it's something that could also happen in the no-op same-endianness scenario. Maybe it is not checked specifically because it would produce a hard fault on affected platforms anyway, and the check would be redundant on all the other platforms where it is not fatal (as far as I know x86 should not be affected unless vector instructions are used). Also, supplying pointer ranges of a non-2-byte type may lead to code that does not fail but produces wrong results. This could be handled specifically (use case: reading a utf16 string from an opaque buffer, endianness is irrelevant) but it's currently not.

dishather commented on August 21, 2024

This issue does not appear to be currently checked in the upstream utfcpp

The library should not be checking it.
Utf8cpp accepts pointers to uint16_t and never converts char* pointers to uint16_t pointers - so its implementation is correct.
Besides, while it is recommended that data be aligned, some systems are okay with unaligned access. It would be strange if the library refused to accept unaligned pointers on such systems.

This could be handled specifically (use case: reading a utf16 string from an opaque buffer, endianness is irrelevant) but it's currently not.

The library does not do this; we are talking about the user's code here. I think we should not check this: normally, the compiler and OS will take care of properly aligning the data. If the user casts random pointers to uint16_t*, he/she must take care that the alignment is correct.

ceztko commented on August 21, 2024

I finally decided to move to upstream utfcpp instead of my endianness-aware custom version and, following @dishather's valuable hints, I coded a couple of custom iterators[1] that can read little- or big-endian UTF-16 encoded strings from unaligned raw octet buffers. Here[2] is an example of use. To the best of my knowledge the code should not hit any known undefined behavior (UB). Knowing a bit about how iterators work, it was a trivial task, but as predicted it was not a 5-minute one (at least not for me). I'm still very convinced that such helper classes belong in utfcpp itself. I'm testing the code in a big code base, but unfortunately I am not in a position right now to write a few unit tests and prepare a pull request for utfcpp. Also, the author @nemtrif still has to comment on the matter. I would be very glad if someone left a review here, though.

[1] https://github.com/pdfmm/pdfmm/blob/20382a6058ff6a67543170c94998648f2df7945a/src/pdfmm/private/utfcpp_extensions.h#L14
[2] https://github.com/pdfmm/pdfmm/blob/20382a6058ff6a67543170c94998648f2df7945a/src/pdfmm/base/PdfDeclarations.cpp#L204
