nemtrif / utfcpp
UTF-8 with C++ in a Portable Way
License: Boost Software License 1.0
I put some effort into integrating C++11 language features, unifying code paths, porting the unit tests to the Boost unit test framework, and refactoring some internals over here... What is your opinion of these changes?
In addition to that I started writing new API docs using mkdocs and doxygen, but got carried away by my studies. If you want I will share the current state with you.
Best wishes,
Henrik Gaßmann
It would be great to add in-place conversion from/to UTF-16 with endianness hints. To keep it portable, a simple runtime check for endianness can be put in the overloads of the higher-level functions that take the hint. Internal functions take just a boolean for swap or no swap.
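The proposal above could look roughly like the following sketch. All names here (`byte_order`, `is_little_endian`, `swap16`, `swap_units_if`, `utf16_to_native_in_place`) are hypothetical and not part of utfcpp; only the split between a hint-taking overload with a runtime check and an inner routine that takes a plain boolean comes from the request.

```cpp
#include <cstdint>

// Hypothetical hint type; utfcpp does not define it.
enum class byte_order { little, big };

// Runtime check: look at the first byte of a known 16-bit value.
inline bool is_little_endian() {
    const std::uint16_t probe = 1;
    return *reinterpret_cast<const unsigned char*>(&probe) == 1;
}

// Swap the two bytes of one UTF-16 code unit.
inline std::uint16_t swap16(std::uint16_t v) {
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// Internal routine: only knows "swap" or "do not swap".
inline void swap_units_if(std::uint16_t* first, std::uint16_t* last, bool swap) {
    if (!swap) return;
    for (; first != last; ++first) *first = swap16(*first);
}

// Higher-level overload: takes the hint and decides at run time whether
// the buffer has to be swapped to reach the native byte order.
inline void utf16_to_native_in_place(std::uint16_t* first, std::uint16_t* last,
                                     byte_order source) {
    const bool source_is_little = (source == byte_order::little);
    swap_units_if(first, last, source_is_little != is_little_endian());
}
```

Because the check is done once per call, the inner loop stays branch-free; a compile-time check would be faster still but is harder to keep portable across old compilers.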
The library fails to build from source with the following error:
[ 18%] Building CXX object tests/CMakeFiles/noexceptionstests.dir/test_unchecked_api.cpp.o
/home/lukas/spe5/utfcpp/tests/test_unchecked_api.cpp:1:10: fatal error: ../extern/ftest/ftest.h: No such file or directory
1 | #include "../extern/ftest/ftest.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
$ git clone https://github.com/nemtrif/utfcpp.git
$ cd utfcpp
$ mkdir build
$ cd build
$ cmake ..
$ make
Linux hst02 5.10.0-1051-oem #53-Ubuntu SMP Thu Oct 28 08:11:53 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
The compilation finishes successfully after another repository is checked out with
utfcpp/extern$ git clone https://github.com/nemtrif/ftest.git
and make is run again.
Still, this is unexpected behaviour. I suggest adding the necessary commands to the documentation, either directly in the README or in a separate INSTALL file.
When compiling a client project that uses utfcpp with Clang 10.0.0, warnings about "implicit conversion changes signedness" are displayed, e.g.
libs/utfcpp/source/utf8/cpp17.h:73:73: error: operand of ? changes signedness: 'long' to 'unsigned long' [-Werror,-Wsign-conversion]
[build] return (invalid == s.end()) ? std::string_view::npos : (invalid - s.begin());
See attached log file for details
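The quoted expression mixes `std::string_view::npos` (unsigned) with an iterator difference (signed), which is what `-Wsign-conversion` complains about. One way to silence it is an explicit cast on the signed branch, sketched below. The function name `invalid_offset` is hypothetical, and this is not necessarily the fix adopted upstream.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for the return statement quoted in the warning.
// The iterator difference is never negative here, so the cast is safe.
inline std::size_t invalid_offset(std::string_view s,
                                  std::string_view::const_iterator invalid) {
    return (invalid == s.end())
        ? std::string_view::npos
        : static_cast<std::size_t>(invalid - s.begin());
}
```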
Hello!
MSVC:
Microsoft (R) C/C++ Optimizing Compiler Version 19.21.27702.2 for x64
Microsoft (R) Incremental Linker Version 14.21.27702.2
The problem happens with this code:
FORCEINLINE const char32_t& String::operator[] ( uint32_t Index )
{
    char* first = m_buffer;
    char* last = m_buffer + Size ();
    utf8::advance ( first, Index, last );
    return utf8::peek_next ( first, last );
}
m_buffer is a char* with a UTF-8-encoded string and FORCEINLINE is simply inline.
Here is my test (I replaced the Russian word with "hello!" for convenience, but it doesn't change the result):
TEST ( RuntimeCore, UTF8Char1 )
{
    const char* str_raw = _T("hello!");
    std::cout << "Raw string: " << str_raw << "\n";
    Wrench::String hello ( str_raw );
    char32_t z = hello[2];
    EXPECT_EQ ( z, U'l' );
}
_T simply adds the u8 string literal prefix.
And this is the result when the code was compiled with /O1 (actually, it doesn't really matter because only /Od solves the problem):
[ RUN ] RuntimeCore.UTF8Char1
Raw string: hello!
***\Tests\Tests_Core.h(25): error: Expected equality of these values:
z
Which is: 3900834211
U'l'
Which is: 108
As you can see, the value of z is completely different when optimization is enabled (it matches the expected value with /Od). It is also different on each execution (without rebuilding), which leaves me with no idea what is happening in the code :(
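One possible explanation, offered as an assumption rather than a confirmed diagnosis: operator[] is declared to return `const char32_t&`, but utf8::peek_next returns the code point by value, so the reference binds to a temporary that is gone once the function returns. Reading it afterwards is undefined behaviour, which would fit a result that changes with the optimization level and between runs. The sketch below returns the code point by value instead. `code_point_at` is a hypothetical, simplified decoder that stands in for the utf8::advance plus utf8::peek_next pair and assumes well-formed UTF-8.

```cpp
#include <cstddef>

// Hypothetical helper: returns the code point at position `index` BY VALUE.
// Assumes well-formed UTF-8; returns 0 if `index` is past the end.
inline char32_t code_point_at(const char* first, const char* last, std::size_t index) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(first);
    const unsigned char* end = reinterpret_cast<const unsigned char*>(last);

    // Skip `index` code points: one lead byte plus its 10xxxxxx continuation bytes.
    while (index > 0 && p != end) {
        ++p;
        while (p != end && (*p & 0xC0) == 0x80) ++p;
        --index;
    }
    if (p == end) return 0;

    // Decode the code point that starts at p.
    const unsigned char lead = *p++;
    int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    char32_t cp = extra == 0 ? lead : static_cast<char32_t>(lead & (0x3F >> extra));
    for (; extra > 0 && p != end; --extra) {
        cp = (cp << 6) | static_cast<char32_t>(*p++ & 0x3F);
    }
    return cp;  // a copy, so nothing dangles after the function returns
}
```

Applied to the class in the report, the same idea would mean changing the return type of operator[] from `const char32_t&` to `char32_t`.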
What is the intended include for this lib?
#include <utf8/utf8.h>: maybe too generic
#include <utf8.h>: even less specific and can conflict
#include <utf8cpp/utf8.h>: preferred, although it does not match the GitHub repo name (see #53); already used for install.
I'd suggest deciding on one and adapting the source and install tree plus includes accordingly.
After upgrading utf8cpp to 3.2.1 in nixpkgs, the tests started to fail in the following way:
cmake flags: -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_SKIP_BUILD_RPATH=ON -DCMAKE_INSTALL_LOCALEDIR=/nix/store/rfnafncznyd5z6rjjqw008mx547061m0-utf8cpp-3.2.1/share/locale -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/rfnafncznyd5z6rjjqw008mx547061m0-utf8cpp-3.2.1/libexec -DCMAKE_INSTALL_LIBDIR=/nix/store/rfnafncznyd5z6rjjqw008mx547061m0-utf8cpp-3.2.1/lib -DCMAKE_INSTALL_DOCDIR=/nix/store/rfnafncznyd5z6rjjqw008mx547061m0-utf8cpp-3.2.1/share/doc/utf8cpp -DCMAKE_INSTALL_INFODIR=/nix/store/rfnafncznyd5z6rjjqw008mx547061m0-utf8cpp-3.2.1/share/info -DCMAKE_INSTALL_MANDIR=/nix/store/rfnafncznyd5z6rjjqw008mx547061m0-utf8cpp-3.2.1/share/man -DCMAKE_INSTALL_OLDINCLUDEDIR=/nix/store/rfnafncznyd5z6rjjqw008mx547061m0-utf8cpp-3.2.1/include -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/rfnafncznyd5z6rjjqw008mx547061m0-utf8cpp-3.2.1/include -DCMAKE_INSTALL_SBINDIR=/nix/store/rfnafncznyd5z6rjjqw008mx547061m0-utf8cpp-3.2.1/sbin -DCMAKE_INSTALL_BINDIR=/nix/store/rfnafncznyd5z6rjjqw008mx547061m0-utf8cpp-3.2.1/bin -DCMAKE_INSTALL_NAME_DIR=/nix/store/rfnafncznyd5z6rjjqw008mx547061m0-utf8cpp-3.2.1/lib -DCMAKE_POLICY_DEFAULT_CMP0025=NEW -DCMAKE_OSX_SYSROOT= -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_STRIP=/nix/store/5j78r51lhl6pwfjxd2cz25p40xcnnywn-cctools-binutils-darwin-949.0.1/bin/strip -DCMAKE_RANLIB=/nix/store/5j78r51lhl6pwfjxd2cz25p40xcnnywn-cctools-binutils-darwin-949.0.1/bin/ranlib -DCMAKE_AR=/nix/store/5j78r51lhl6pwfjxd2cz25p40xcnnywn-cctools-binutils-darwin-949.0.1/bin/ar -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_INSTALL_PREFIX=/nix/store/rfnafncznyd5z6rjjqw008mx547061m0-utf8cpp-3.2.1 -DCMAKE_INSTALL_LIBDIR=lib
-- The CXX compiler identification is Clang 7.1.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /nix/store/mc3wc45mhv836fs8dnjw4rfm7pj7gljl-clang-wrapper-7.1.0/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
CMake Warning:
Manually-specified variables were not used by the project:
CMAKE_C_COMPILER
CMAKE_EXPORT_NO_PACKAGE_REGISTRY
CMAKE_FIND_USE_PACKAGE_REGISTRY
CMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY
CMAKE_POLICY_DEFAULT_CMP0025
-- Build files have been written to: /tmp/nix-build-utf8cpp-3.2.1.drv-0/source/build
cmake: enabled parallel building
building
build flags: -j4 -l4 SHELL=/nix/store/vm564n0g8k6pkiwqb31x1dw1y02h2mak-bash-5.1-p8/bin/bash
[ 6%] Building CXX object tests/CMakeFiles/cpp17.dir/test_cpp17.cpp.o
[ 12%] Building CXX object CMakeFiles/docsample.dir/samples/docsample.cpp.o
[ 18%] Building CXX object tests/CMakeFiles/apitests.dir/test_checked_api.cpp.o
[ 25%] Building CXX object tests/CMakeFiles/negative.dir/negative.cpp.o
In file included from /tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/test_checked_api.cpp:2:
/tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/../extern/ftest/ftest.h:159:13: warning: delete called on 'ftest::Test' that is abstract but has non-virtual destructor [-Wdelete-non-virtual-dtor]
delete m_tests[i];
^
In file included from /tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/test_cpp17.cpp:1:
/tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/../extern/ftest/ftest.h:159:13: warning: delete called on 'ftest::Test' that is abstract but has non-virtual destructor [-Wdelete-non-virtual-dtor]
delete m_tests[i];
^
[ 31%] Linking CXX executable negative
[ 31%] Built target negative
[ 37%] Building CXX object tests/CMakeFiles/apitests.dir/test_unchecked_api.cpp.o
1 warning generated.
1 warning generated.
[ 43%] Building CXX object tests/CMakeFiles/noexceptionstests.dir/test_unchecked_api.cpp.o
[ 50%] Linking CXX executable cpp17
[ 56%] Linking CXX executable docsample
[ 56%] Built target cpp17
[ 62%] Building CXX object tests/CMakeFiles/noexceptionstests.dir/test_unchecked_iterator.cpp.o
[ 62%] Built target docsample
[ 68%] Building CXX object tests/CMakeFiles/cpp11.dir/test_cpp11.cpp.o
In file included from /tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/test_unchecked_api.cpp:1:
/tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/../extern/ftest/ftest.h:159:13: warning: delete called on 'ftest::Test' that is abstract but has non-virtual destructor [-Wdelete-non-virtual-dtor]
delete m_tests[i];
^
In file included from /tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/test_unchecked_api.cpp:1:
/tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/../extern/ftest/ftest.h:159:13: warning: delete called on 'ftest::Test' that is abstract but has non-virtual destructor [-Wdelete-non-virtual-dtor]
delete m_tests[i];
^
In file included from /tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/test_unchecked_iterator.cpp:2:
/tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/../extern/ftest/ftest.h:159:13: warning: delete called on 'ftest::Test' that is abstract but has non-virtual destructor [-Wdelete-non-virtual-dtor]
delete m_tests[i];
^
In file included from /tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/test_cpp11.cpp:1:
/tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/../extern/ftest/ftest.h:159:13: warning: delete called on 'ftest::Test' that is abstract but has non-virtual destructor [-Wdelete-non-virtual-dtor]
delete m_tests[i];
^
1 warning generated.
[ 75%] Building CXX object tests/CMakeFiles/apitests.dir/test_checked_iterator.cpp.o
1 warning generated.
[ 81%] Building CXX object tests/CMakeFiles/apitests.dir/test_unchecked_iterator.cpp.o
1 warning generated.
[ 87%] Linking CXX executable noexceptionstests
[ 87%] Built target noexceptionstests
In file included from /tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/test_checked_iterator.cpp:2:
/tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/../extern/ftest/ftest.h:159:13: warning: delete called on 'ftest::Test' that is abstract but has non-virtual destructor [-Wdelete-non-virtual-dtor]
delete m_tests[i];
^
In file included from /tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/test_unchecked_iterator.cpp:2:
/tmp/nix-build-utf8cpp-3.2.1.drv-0/source/tests/../extern/ftest/ftest.h:159:13: warning: delete called on 'ftest::Test' that is abstract but has non-virtual destructor [-Wdelete-non-virtual-dtor]
delete m_tests[i];
^
1 warning generated.
[ 93%] Linking CXX executable cpp11
[ 93%] Built target cpp11
1 warning generated.
1 warning generated.
[100%] Linking CXX executable apitests
[100%] Built target apitests
running tests
check flags: SHELL=/nix/store/vm564n0g8k6pkiwqb31x1dw1y02h2mak-bash-5.1-p8/bin/bash VERBOSE=y test
Running tests...
/nix/store/bn00vnhjd5p1ry7j1ipfakfnnqs1bbhs-cmake-3.21.2/bin/ctest --force-new-ctest-process
Test project /tmp/nix-build-utf8cpp-3.2.1.drv-0/source/build
Start 1: negative_test
Start 2: cpp11_test
Start 3: cpp17_test
Start 4: api_test
1/5 Test #1: negative_test .................... Passed 0.01 sec
Start 5: noexceptions_test
2/5 Test #2: cpp11_test .......................***Exception: Illegal 0.01 sec
[==========] Running 9 tests from 1 test cases.
[----------] 9 tests from CPP11APITests
[ RUN ] CPP11APITests.test_append
[ OK ] CPP11APITests.test_append
[ RUN ] CPP11APITests.test_utf16to8
[ OK ] CPP11APITests.test_utf16to8
[ RUN ] CPP11APITests.test_utf8to16
[ OK ] CPP11APITests.test_utf8to16
[ RUN ] CPP11APITests.test_utf32to8
[ OK ] CPP11APITests.test_utf32to8
[ RUN ] CPP11APITests.test_utf8to32
[ OK ] CPP11APITests.test_utf8to32
[ RUN ] CPP11APITests.test_find_invalid
[ OK ] CPP11APITests.test_find_invalid
[ RUN ] CPP11APITests.test_is_valid
[ OK ] CPP11APITests.test_is_valid
[ RUN ] CPP11APITests.test_replace_invalid
[ OK ] CPP11APITests.test_replace_invalid
[ RUN ] CPP11APITests.test_starts_with_bom
[ OK ] CPP11APITests.test_starts_with_bom
[----------] 9 tests from CPP11APITests
[==========] 9 tests from 1 test cases ran.
[ PASSED ] 9 tests.
3/5 Test #3: cpp17_test .......................***Exception: Illegal 0.01 sec
[==========] Running 9 tests from 1 test cases.
[----------] 9 tests from CPP17APITests
[ RUN ] CPP17APITests.test_utf16to8
[ OK ] CPP17APITests.test_utf16to8
[ RUN ] CPP17APITests.test_utf8to16
[ OK ] CPP17APITests.test_utf8to16
[ RUN ] CPP17APITests.test_utf32to8
[ OK ] CPP17APITests.test_utf32to8
[ RUN ] CPP17APITests.test_utf8to32
[ OK ] CPP17APITests.test_utf8to32
[ RUN ] CPP17APITests.test_find_invalid
[ OK ] CPP17APITests.test_find_invalid
[ RUN ] CPP17APITests.test_is_valid
[ OK ] CPP17APITests.test_is_valid
[ RUN ] CPP17APITests.test_replace_invalid
[ OK ] CPP17APITests.test_replace_invalid
[ RUN ] CPP17APITests.test_starts_with_bom
[ OK ] CPP17APITests.test_starts_with_bom
[ RUN ] CPP17APITests.string_class_and_literals
[ OK ] CPP17APITests.string_class_and_literals
[----------] 9 tests from CPP17APITests
[==========] 9 tests from 1 test cases ran.
[ PASSED ] 9 tests.
4/5 Test #4: api_test .........................***Exception: Illegal 0.01 sec
[==========] Running 15 tests from 3 test cases.
[----------] 11 tests from UnCheckedAPITests
[ RUN ] UnCheckedAPITests.test_append
[ OK ] UnCheckedAPITests.test_append
[ RUN ] UnCheckedAPITests.test_next
[ OK ] UnCheckedAPITests.test_next
[ RUN ] UnCheckedAPITests.test_peek_next
[ OK ] UnCheckedAPITests.test_peek_next
[ RUN ] UnCheckedAPITests.test_prior
[ OK ] UnCheckedAPITests.test_prior
[ RUN ] UnCheckedAPITests.test_advance
[ OK ] UnCheckedAPITests.test_advance
[ RUN ] UnCheckedAPITests.test_distance
[ OK ] UnCheckedAPITests.test_distance
[ RUN ] UnCheckedAPITests.test_utf32to8
[ OK ] UnCheckedAPITests.test_utf32to8
[ RUN ] UnCheckedAPITests.test_utf8to32
[ OK ] UnCheckedAPITests.test_utf8to32
[ RUN ] UnCheckedAPITests.test_utf16to8
[ OK ] UnCheckedAPITests.test_utf16to8
[ RUN ] UnCheckedAPITests.test_utf8to16
[ OK ] UnCheckedAPITests.test_utf8to16
[ RUN ] UnCheckedAPITests.test_replace_invalid
[ OK ] UnCheckedAPITests.test_replace_invalid
[----------] 11 tests from UnCheckedAPITests
[----------] 2 tests from CheckedIteratrTests
[ RUN ] CheckedIteratrTests.test_increment
[ OK ] CheckedIteratrTests.test_increment
[ RUN ] CheckedIteratrTests.test_decrement
[ OK ] CheckedIteratrTests.test_decrement
[----------] 2 tests from CheckedIteratrTests
[----------] 2 tests from UnCheckedIteratrTests
[ RUN ] UnCheckedIteratrTests.test_increment
[ OK ] UnCheckedIteratrTests.test_increment
[ RUN ] UnCheckedIteratrTests.test_decrement
[ OK ] UnCheckedIteratrTests.test_decrement
[----------] 2 tests from UnCheckedIteratrTests
[==========] 15 tests from 3 test cases ran.
[ PASSED ] 15 tests.
5/5 Test #5: noexceptions_test ................***Exception: Illegal 0.01 sec
[==========] Running 13 tests from 2 test cases.
[----------] 11 tests from UnCheckedAPITests
[ RUN ] UnCheckedAPITests.test_append
[ OK ] UnCheckedAPITests.test_append
[ RUN ] UnCheckedAPITests.test_next
[ OK ] UnCheckedAPITests.test_next
[ RUN ] UnCheckedAPITests.test_peek_next
[ OK ] UnCheckedAPITests.test_peek_next
[ RUN ] UnCheckedAPITests.test_prior
[ OK ] UnCheckedAPITests.test_prior
[ RUN ] UnCheckedAPITests.test_advance
[ OK ] UnCheckedAPITests.test_advance
[ RUN ] UnCheckedAPITests.test_distance
[ OK ] UnCheckedAPITests.test_distance
[ RUN ] UnCheckedAPITests.test_utf32to8
[ OK ] UnCheckedAPITests.test_utf32to8
[ RUN ] UnCheckedAPITests.test_utf8to32
[ OK ] UnCheckedAPITests.test_utf8to32
[ RUN ] UnCheckedAPITests.test_utf16to8
[ OK ] UnCheckedAPITests.test_utf16to8
[ RUN ] UnCheckedAPITests.test_utf8to16
[ OK ] UnCheckedAPITests.test_utf8to16
[ RUN ] UnCheckedAPITests.test_replace_invalid
[ OK ] UnCheckedAPITests.test_replace_invalid
[----------] 11 tests from UnCheckedAPITests
[----------] 2 tests from UnCheckedIteratrTests
[ RUN ] UnCheckedIteratrTests.test_increment
[ OK ] UnCheckedIteratrTests.test_increment
[ RUN ] UnCheckedIteratrTests.test_decrement
[ OK ] UnCheckedIteratrTests.test_decrement
[----------] 2 tests from UnCheckedIteratrTests
[==========] 13 tests from 2 test cases ran.
[ PASSED ] 13 tests.
20% tests passed, 4 tests failed out of 5
Total Test time (real) = 0.02 sec
The following tests FAILED:
2 - cpp11_test (ILLEGAL)
3 - cpp17_test (ILLEGAL)
4 - api_test (ILLEGAL)
5 - noexceptions_test (ILLEGAL)
Errors while running CTest
make: *** [Makefile:136: test] Error 8
We have a somewhat different toolchain from the one distributed by Apple, for example using clang 7.1.0. I'm struggling to pinpoint the exact nature of the issue from the error messages; any pointers would be appreciated.
utfcpp defines some macro names such as NOEXCEPT. Unfortunately, those can clash with similar macros defined in other projects/libraries. It would be nice if all macros were qualified, e.g. UTF_CPP_NOEXCEPT.
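A prefixed macro along the lines suggested here might be defined as in the sketch below. UTF_CPP_NOEXCEPT is the name proposed in the request; the version test and the function `answer` are illustrative assumptions, not the library's actual code.

```cpp
// Prefixed macro: unlikely to collide with a NOEXCEPT macro from another library.
#if __cplusplus >= 201103L
    #define UTF_CPP_NOEXCEPT noexcept
#else
    #define UTF_CPP_NOEXCEPT throw()
#endif

// Hypothetical function, only here to show the macro in use.
inline int answer() UTF_CPP_NOEXCEPT { return 42; }
```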
I use this library via add_subdirectory in cmake, but I do not want to install utfcpp alongside my library. Please add an option to not install utfcpp. Thanks
Our project disables exceptions in the compiler. It would be very helpful if you could make a patch to enable/disable them easily. Suggested changes:
Thanks!
Line 28 in 7db7281
#ifndef UTF8_FOR_CPP_a184c22c_d012_11e8_a8d5_f2801f1b9fd1
#define UTF8_FOR_CPP_a184c22_cd012_11e8_a8d5_f2801f1b9fd1
Hello @nemtrif ,
I have main.cc.zip with some code snippet, which uses library functionality.
When that code is compiled on Ubuntu 16.04 with g++ 5.4, it gives compilation error. See
for more details.
In order to resolve the compilation issue, we need to change the declaration of class utf8::iterator in 'checked.h' as given below.
We also need to include "cstddef" for std::ptrdiff_t.
template <typename octet_iterator> class iterator : public std::iterator < std::bidirectional_iterator_tag, uint32_t, std::ptrdiff_t, const uint32_t*, const uint32_t& >
However, even after resolving the compilation issue, more changes may be needed in the library code, since the output of the program is incorrect.
Expected Result :
Within String ℇ¥ǢǾ¥ǽƱ
Find String ¥
String found at position 2
Actual Result :
Within String ℇ¥ǢǾ¥ǽƱ
Find String ¥
Error occurred
Please note that, with the changes in 'checked.h', we get the expected result when compiling with g++ 4.8.
To summarize, the code appears to behave incorrectly with g++ version 5.4.
extern/ftest is empty.
I get the following warnings (lines 220, 266) when compiling with Visual Studio 2019 16.5 Preview 2:
utf8\unchecked.h(220,40): warning C4996: 'std::iterator<std::bidirectional_iterator_tag,utf8::uint32_t,ptrdiff_t,utf8::uint32_t *,utf8::uint32_t &>': warning STL4015: The std::iterator class template (used as a base class to provide typedefs) is deprecated in C++17. (The <iterator> header is NOT deprecated.) The C++ Standard has never required user-defined iterators to derive from std::iterator. To fix this warning, stop deriving from std::iterator and start providing publicly accessible typedefs named iterator_category, value_type, difference_type, pointer, and reference. Note that value_type is required to be non-const, even for constant iterators. You can define _SILENCE_CXX17_ITERATOR_BASE_CLASS_DEPRECATION_WARNING or _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS to acknowledge that you have received this warning.
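The fix the warning text describes is to declare the five typedefs directly instead of inheriting them from std::iterator. A minimal sketch follows; `code_point_iterator` is a hypothetical skeleton rather than the library's iterator, and only the member typedefs matter here.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <type_traits>

// Hypothetical skeleton: no std::iterator base class, typedefs spelled out.
template <typename octet_iterator>
class code_point_iterator {
public:
    typedef std::bidirectional_iterator_tag iterator_category;
    typedef std::uint32_t                   value_type;       // non-const, as the warning requires
    typedef std::ptrdiff_t                  difference_type;
    typedef std::uint32_t*                  pointer;
    typedef std::uint32_t&                  reference;

    explicit code_point_iterator(octet_iterator it) : it_(it) {}
    octet_iterator base() const { return it_; }

private:
    octet_iterator it_;
};
```

std::iterator_traits picks these typedefs up automatically, so generic algorithms see the same types as before.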
Just like there is a utf8::iterator adapter, I think it would be nice to have a utf8::range so as to write
std::string str = ...;
utf8::range<std::string> codepoints(str);
for(char32_t cp : codepoints) { ... }
Or even
std::string str = ...;
for(char32_t cp : utf8::make_range(str)) { ... }
I tried to convert an istream to a sequence of code points. Unfortunately, you can't use utf8::iterator for this because it doesn't compile. It's possible to bypass the compilation problem by removing the two blocking lines, which are just checks.
The problem is the same with both versions of the iterator (checked and unchecked): it converts only every second character. The cause is operator*(), which assumes that the injected iterator doesn't change while the next method is performed.
I suggest that the iterator class store the code point.
Here is another problem: with a stream iterator, you can't go back! So operator--() can't work.
I also suggest providing an end point for the iterator, just like std::istream_iterator, so that you can write a classical for loop:
for(utf8::iterator iter ; iter != end ; ++iter) {//do stuff}
I hope you can find a solution.
Regards
#include <iostream>
#include <sstream>
#include <iomanip>
#include <iterator>
#include <utf8.h>

void print(uint32_t cp) {
    // std::cout << std::hex << std::setfill('0') << std::setw(2) << cp << ' ';
    std::cout << (char)cp;
}

int main() try {
    using iterator = std::istream_iterator<char>;
    using utf8_iterator = utf8::unchecked::iterator<iterator>;
    //using utf8_iterator = utf8::iterator<iterator>;
    std::istringstream is("abc");
    iterator it(is);
    iterator eos{};
    utf8_iterator end_iter{};
    /*
    for(utf8_iterator iter(it, it, eos) ; iter != end_iter ; ++iter) {
        std::cout << std::hex << std::setfill('0') << std::setw(2) << *iter << ' ';
    }
    */
    for(utf8_iterator iter(it) ; iter != end_iter ; ++iter) {
        print(*iter);
    }
    std::cout << std::endl;
    return 0;
} catch(const std::exception & e) {
    std::cerr << "exception: " << e.what() << '\n';
    return 1;
}
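The suggestion to store the code point inside the iterator can be sketched as a forward-only input iterator that decodes eagerly and caches the result, with a default-constructed value acting as the end marker. Everything below (`caching_utf8_iterator`, `decode_via_stream`) is hypothetical and not part of utfcpp, and the decoder assumes well-formed UTF-8 with no validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical single-pass adapter: each underlying byte is read exactly once.
template <typename InputIt>
class caching_utf8_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef std::uint32_t           value_type;
    typedef std::ptrdiff_t          difference_type;
    typedef const std::uint32_t*    pointer;
    typedef const std::uint32_t&    reference;

    // Default construction yields the end marker, like std::istream_iterator.
    caching_utf8_iterator() : it_(), end_(), cp_(0), at_end_(true) {}
    caching_utf8_iterator(InputIt first, InputIt last)
        : it_(first), end_(last), cp_(0), at_end_(false) { read_next(); }

    reference operator*() const { return cp_; }  // no re-reading of the stream
    caching_utf8_iterator& operator++() { read_next(); return *this; }

    bool operator==(const caching_utf8_iterator& o) const {
        return at_end_ == o.at_end_ && (at_end_ || it_ == o.it_);
    }
    bool operator!=(const caching_utf8_iterator& o) const { return !(*this == o); }

private:
    // Decode one code point into cp_, or mark the iterator as finished.
    void read_next() {
        if (it_ == end_) { at_end_ = true; return; }
        const unsigned char lead = static_cast<unsigned char>(*it_);
        ++it_;
        int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        std::uint32_t cp = extra == 0 ? lead : static_cast<std::uint32_t>(lead & (0x3F >> extra));
        for (; extra > 0 && it_ != end_; --extra) {
            cp = (cp << 6) | (static_cast<unsigned char>(*it_) & 0x3Fu);
            ++it_;
        }
        cp_ = cp;
    }

    InputIt it_, end_;
    std::uint32_t cp_;
    bool at_end_;
};

// Usage sketch: decode the contents of a stream into code points.
inline std::vector<std::uint32_t> decode_via_stream(const std::string& bytes) {
    std::istringstream in(bytes);
    typedef caching_utf8_iterator<std::istreambuf_iterator<char> > iter;
    std::istreambuf_iterator<char> first(in), last;
    std::vector<std::uint32_t> out;
    for (iter it(first, last), end; it != end; ++it) out.push_back(*it);
    return out;
}
```

The sketch uses std::istreambuf_iterator rather than std::istream_iterator<char> because the latter skips whitespace by default. There is deliberately no operator--(), since a stream cannot be rewound.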
What is the proper name to refer to this project as? utfcpp or utf8cpp? It's called utfcpp on GitHub, but there are many references to utf8cpp or UTF8-CPP inside. Even the default installation path is /usr/include/utf8cpp/.
I have this string, VINAIGRE ALCOOL BLC 6° 100CL, which I pass to the is_valid function, but it returns false.
What's the issue here?
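A plausible cause, assuming the string reached is_valid in a single-byte encoding such as Latin-1 (ISO-8859-1): there the degree sign is the lone byte 0xB0, which is not a valid UTF-8 sequence, whereas in UTF-8 U+00B0 is the two bytes C2 B0. If that is the case, converting the input first would make it valid. The helper `latin1_to_utf8` below is hypothetical and only correct for genuine Latin-1 input.

```cpp
#include <string>

// Hypothetical helper: Latin-1 bytes map directly to U+0000..U+00FF, so every
// byte >= 0x80 becomes a two-byte UTF-8 sequence.
inline std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Dumping the bytes of the original string in hex is a quick way to check which encoding it actually has before applying such a conversion.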
Steps used to reproduce:
Expected result:
Compiles without issue.
Actual result:
Fails on line 324 of utf8/checked.h because the compiler is told to include utf8/cpp11.h as a path relative to utf8/checked.h.
Expected resolution:
Modify the #include to read:
#include "cpp11.h"
Visual Studio takes a very conservative approach to the __cplusplus macro. Even when used with the /std:c++17 switch it still shows this:
So in fact Visual Studio will not use the cpp17.h header without forcing the value of UTF_CPP_CPLUSPLUS to 201703L. For Windows it is better to use this:
#if (defined(_MSVC_LANG) && _MSVC_LANG >= 201703L)
//whatever
#endif
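Building on the check above, one way to make the detection automatic is to pick whichever of the two macros reports the newer standard. MSVC leaves __cplusplus at 199711L unless /Zc:__cplusplus is passed, while _MSVC_LANG follows the /std switch. The macro names EXAMPLE_CPLUSPLUS and EXAMPLE_HAS_CPP17 are hypothetical, chosen so as not to suggest this is how the library defines UTF_CPP_CPLUSPLUS.

```cpp
// Hypothetical names; pick the macro that reflects the selected language standard.
#if defined(_MSVC_LANG) && _MSVC_LANG > __cplusplus
    #define EXAMPLE_CPLUSPLUS _MSVC_LANG
#else
    #define EXAMPLE_CPLUSPLUS __cplusplus
#endif

// Feature switch derived from the effective value.
#if EXAMPLE_CPLUSPLUS >= 201703L
    #define EXAMPLE_HAS_CPP17 1
#else
    #define EXAMPLE_HAS_CPP17 0
#endif
```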
Related issue in transmission: transmission/transmission#2256
Hi! I need the possibility to convert from CESU-8 to UTF-8.
CESU-8 is the format we get from the MySQL (and Oracle) DBMS. While we were using code points in the range U+0000 to U+FFFF, all was OK (for this range CESU-8 is equal to UTF-8), but things got messed up when one user used a Unicode supplementary character, the "THINKING FACE" emoji, which was returned from the DB as eda0beedb494 (CESU-8) instead of f09fa494 (UTF-8)...
Can you please provide this functionality?
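Until something like this exists in the library, the conversion can be sketched by hand: CESU-8 stores a supplementary character as two three-byte sequences (one per UTF-16 surrogate), and these six bytes need to be merged into one four-byte UTF-8 sequence. `cesu8_to_utf8` is a hypothetical helper, not a utfcpp function, and it copies everything that is not a well-formed surrogate pair unchanged.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: rewrites CESU-8 surrogate pairs as 4-byte UTF-8.
inline std::string cesu8_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b[6] = {0, 0, 0, 0, 0, 0};
        const bool have_six = i + 5 < in.size();
        if (have_six) {
            for (int k = 0; k < 6; ++k) b[k] = static_cast<unsigned char>(in[i + k]);
        }
        // Pattern: ED A0..AF xx (high surrogate) followed by ED B0..BF xx (low surrogate).
        if (have_six
            && b[0] == 0xED && (b[1] & 0xF0) == 0xA0 && (b[2] & 0xC0) == 0x80
            && b[3] == 0xED && (b[4] & 0xF0) == 0xB0 && (b[5] & 0xC0) == 0x80) {
            const std::uint32_t high = 0xD000u | ((b[1] & 0x3Fu) << 6) | (b[2] & 0x3Fu);
            const std::uint32_t low  = 0xD000u | ((b[4] & 0x3Fu) << 6) | (b[5] & 0x3Fu);
            const std::uint32_t cp = 0x10000u + ((high - 0xD800u) << 10) + (low - 0xDC00u);
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
            i += 6;
        } else {
            out.push_back(in[i]);  // BMP characters are identical in both encodings
            ++i;
        }
    }
    return out;
}
```

For the bytes quoted in the report, ED A0 BE ED B4 94 decodes to the surrogates D83E and DD14, which combine to U+1F914 and encode as F0 9F A4 94.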
As an enhancement to the app, may I suggest supporting utf-8 chars in command arguments and file names.
After downloading the zip release 3.1 and unpacking I did md build and cd build and called cmake .. and got the following error:
-- Building for: Visual Studio 15 2017
-- The CXX compiler identification is MSVC 19.16.27030.1
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2017/Professional/VC/Tools/MSVC/14.16.27023/bin/Hostx86/x86/cl.exe
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2017/Professional/VC/Tools/MSVC/14.16.27023/bin/Hostx86/x86/cl.exe -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at CMakeLists.txt:35 (add_subdirectory):
The source directory
T:/temp/utfcpp31/extern/gtest
does not contain a CMakeLists.txt file.
-- Configuring incomplete, errors occurred!
See also "T:/temp/utfcpp31/build/CMakeFiles/CMakeOutput.log".
It seems gtest is missing.
There seems to be an error here. I don't really understand how it should work, so I can't make a PR.
Line 91
Lines 86 to 92 in 2ad9957
test code is:
#include <iostream>
#include "utf8.h"
#include<vector>
using namespace std;
int main() {
    string str = "这是一个测试句子";
    // Find the iterator to the end of the valid UTF-8 part of the string
    auto end_it = utf8::find_invalid(str.begin(), str.end());
    // If it is not the end of the string, the string is not entirely valid UTF-8
    if (end_it != str.end()) {
        cout << "invalid utf-8 encoding detected at line." << endl;
        cout << "this part is fine" << string(str.begin(), end_it);
    }
    // Length of the UTF-8 string in code points
    int length = utf8::distance(str.begin(), end_it);
    cout << "the length of str is " << length << endl;
    // Holds the converted UTF-16 string
    vector<unsigned short> utf16line;
    // Convert it to UTF-16
    utf8::utf8to16(str.begin(), end_it, back_inserter(utf16line));
    // Convert it back to UTF-8
    string utf8line;
    utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line));
    // If the round-trip result differs from the original, the conversion went wrong
    if (utf8line != string(str.begin(), end_it)) {
        cout << "error in utf-16 conversion" << endl;
    }
    // Print the converted string
    cout << utf8line << endl;
}
and I just got the message "invalid utf-8 encoding detected at line."
The cpp file is encoded as UTF-8 with BOM (code page 65001).
I didn't find it in the README or in the issues. It looks like this functionality is not included and I must write it on my own. If so, what is the best way to do it? Do I need to iterate over every character and call std::towupper, or is there some other way? Thanks
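The per-character approach mentioned in the question can be sketched as follows, assuming the text has already been decoded to code points (for example with utf8::utf8to32). `to_upper_simple` is a hypothetical helper. std::towupper depends on the current C locale and only handles one-to-one mappings, so cases such as ß to SS are out of its reach; a full solution needs a Unicode library such as ICU.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: upper-cases code point by code point with std::towupper.
inline std::u32string to_upper_simple(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t cp : in) {
        // wchar_t can be 16 bits wide (e.g. on Windows); leave code points
        // that do not fit untouched instead of truncating them.
        if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
            out.push_back(static_cast<char32_t>(
                std::towupper(static_cast<std::wint_t>(cp))));
        } else {
            out.push_back(cp);
        }
    }
    return out;
}
```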
It's a great lib. It would be even better if we could get it easily in Visual Studio. NuGet is a platform (tool) for this.
A couple of important missing features are the ability to directly iterate over the code points of UTF-16 strings and to append code points to existing UTF-16 encoded strings. For iterating code points, one implementation can be found in the ICU documentation[1][2].
[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utf16_8h.html#a844bb48486904fdca40c8b883e9c80ee
[2] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utf16_8h.html#ae98a64ae0f42bc6ad4179293c3638be4
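The two requested operations can be sketched with plain std::u16string as below. The names `append_utf16` and `next_utf16` are hypothetical, not utfcpp functions, and the decoder passes unpaired surrogates through as-is rather than reporting an error.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append one code point, using a surrogate pair above U+FFFF.
inline void append_utf16(char32_t cp, std::u16string& out) {
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
}

// Hypothetical helper: read the code point at `pos` and advance `pos` past it.
// Precondition: pos < s.size().
inline char32_t next_utf16(const std::u16string& s, std::size_t& pos) {
    const char16_t unit = s[pos++];
    const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
    if (is_high && pos < s.size() && s[pos] >= 0xDC00 && s[pos] <= 0xDFFF) {
        const char16_t low = s[pos++];
        return static_cast<char32_t>(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
    }
    return unit;
}
```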
A utility function to convert a whole UTF-8 string to JSON encoding (and the reverse), where all non-ASCII characters are expected to be escaped as, for instance, \u00e4, would be super useful, since JSON constructs are at the heart of remote APIs nowadays (see https://graph.microsoft.com for instance).
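The escaping direction can be sketched as follows, assuming the UTF-8 input has already been decoded to code points (for example with utf8::utf8to32). `escape_non_ascii` is a hypothetical helper; it only handles non-ASCII code points and leaves quotes, backslashes and control characters alone, so it is not a complete JSON string encoder.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: writes every non-ASCII code point as \uXXXX,
// using a UTF-16 surrogate pair for code points above U+FFFF.
inline std::string escape_non_ascii(const std::u32string& text) {
    std::string out;
    char buf[8];  // "\uXXXX" plus terminating NUL fits in 7 bytes
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x10000) {
            std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(cp));
            out += buf;
        } else {
            const char32_t v = cp - 0x10000;
            std::snprintf(buf, sizeof buf, "\\u%04x",
                          static_cast<unsigned>(0xD800 + (v >> 10)));
            out += buf;
            std::snprintf(buf, sizeof buf, "\\u%04x",
                          static_cast<unsigned>(0xDC00 + (v & 0x3FF)));
            out += buf;
        }
    }
    return out;
}
```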
Currently, the root CMake file for utfcpp creates the CMake config files when UTF8_INSTALL is set. Based on my (ever incomplete) understanding of CMake, it is missing an export() command as follows:
if(UTF8_INSTALL)
    if(MSVC)
        set(DEF_INSTALL_CMAKE_DIR CMake)
    else()
        include(GNUInstallDirs) # define CMAKE_INSTALL_*
        set(DEF_INSTALL_CMAKE_DIR ${CMAKE_INSTALL_LIBDIR}/cmake/utf8cpp)
    endif()
    install(DIRECTORY source/ DESTINATION include/utf8cpp)
    install(TARGETS utf8cpp EXPORT utf8cppConfig)
    install(EXPORT utf8cppConfig DESTINATION ${DEF_INSTALL_CMAKE_DIR})
    export(EXPORT utf8cppConfig) # <-------- this is missing
endif()
Without it, if I use utfcpp in a library of my own and want to create an install target for my library, with exports, CMake currently complains that
CMake Error in CMakeLists.txt:
export called with target "mylib" which requires target "utfcpp" that is not
in any export set.
This error is fixed by adding the export() command. I will create a PR to suggest the modification.
Tested with a 126MB UTF-8 format text file against ww898's [https://github.com/ww898/utf-cpp]
Testing source code here: https://github.com/chipsethan/UTF-test.git
It seems yours has a little room to improve. I hope you can optimize your code some day.
Visual Studio 2019 - 16.10.0 (Win10 Pro 20H2, x64, Intel i7-8700)
Mikhail Pilin (2.3)
UTF8-->UTF32:584.876ms
UTF8-->UTF16:417.439ms
UTF32-->UTF8:311.549ms
UTF16-->UTF8:276.505ms
Nemanja Trifunovic (3.2.1)
UTF8-->UTF32:1081.22ms
UTF8-->UTF16:950.651ms
UTF32-->UTF8:309.367ms
UTF16-->UTF8:476.447ms
identical results, ok
g++ 10.3.0 - msys2 (Win10 Pro 20H2, x64, Intel i7-8700)
Mikhail Pilin (2.3)
UTF8-->UTF32:334.683ms
UTF8-->UTF16:256.315ms
UTF32-->UTF8:240.74ms
UTF16-->UTF8:323.077ms
Nemanja Trifunovic (3.2.1)
UTF8-->UTF32:371.187ms
UTF8-->UTF16:253.828ms
UTF32-->UTF8:429.623ms
UTF16-->UTF8:467.776ms
identical results, ok
I think that it would be good to flush currently committed changes and make new release :)
Using utf8cpp 3.0.2 on Archlinux
$ g++ --version
g++ (GCC) 8.2.1 20181127
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
file1.cc:
#include <utf.h>
file2.cc:
#include <utf.h>
$ g++ test1.cc test2.cc -std=c++11
/usr/bin/ld: /tmp/ccmqjhz6.o: in function `utf8::append(char32_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)':
test2.cc:(.text+0x0): multiple definition of `utf8::append(char32_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'; /tmp/ccKIEFGd.o:test1.cc:(.text+0x0): first defined here
/usr/bin/ld: /tmp/ccmqjhz6.o: in function `utf8::utf16to8(std::__cxx11::basic_string<char16_t, std::char_traits<char16_t>, std::allocator<char16_t> > const&)':
test2.cc:(.text+0x2e): multiple definition of `utf8::utf16to8(std::__cxx11::basic_string<char16_t, std::char_traits<char16_t>, std::allocator<char16_t> > const&)'; /tmp/ccKIEFGd.o:test1.cc:(.text+0x2e): first defined here
/usr/bin/ld: /tmp/ccmqjhz6.o: in function `utf8::utf8to16(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
test2.cc:(.text+0xd2): multiple definition of `utf8::utf8to16(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'; /tmp/ccKIEFGd.o:test1.cc:(.text+0xd2): first defined here
/usr/bin/ld: /tmp/ccmqjhz6.o: in function `utf8::utf32to8(std::__cxx11::basic_string<char32_t, std::char_traits<char32_t>, std::allocator<char32_t> > const&)':
test2.cc:(.text+0x176): multiple definition of `utf8::utf32to8(std::__cxx11::basic_string<char32_t, std::char_traits<char32_t>, std::allocator<char32_t> > const&)'; /tmp/ccKIEFGd.o:test1.cc:(.text+0x176): first defined here
/usr/bin/ld: /tmp/ccmqjhz6.o: in function `utf8::utf8to32(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
test2.cc:(.text+0x21a): multiple definition of `utf8::utf8to32(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'; /tmp/ccKIEFGd.o:test1.cc:(.text+0x21a): first defined here
/usr/bin/ld: /tmp/ccmqjhz6.o: in function `utf8::find_invalid(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
test2.cc:(.text+0x2be): multiple definition of `utf8::find_invalid(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'; /tmp/ccKIEFGd.o:test1.cc:(.text+0x2be): first defined here
/usr/bin/ld: /tmp/ccmqjhz6.o: in function `utf8::is_valid(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
test2.cc:(.text+0x372): multiple definition of `utf8::is_valid(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'; /tmp/ccKIEFGd.o:test1.cc:(.text+0x372): first defined here
/usr/bin/ld: /tmp/ccmqjhz6.o: in function `utf8::replace_invalid(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char32_t)':
test2.cc:(.text+0x3ac): multiple definition of `utf8::replace_invalid(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char32_t)'; /tmp/ccKIEFGd.o:test1.cc:(.text+0x3ac): first defined here
/usr/bin/ld: /tmp/ccmqjhz6.o: in function `utf8::replace_invalid(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
test2.cc:(.text+0x458): multiple definition of `utf8::replace_invalid(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'; /tmp/ccKIEFGd.o:test1.cc:(.text+0x458): first defined here
/usr/bin/ld: /tmp/ccmqjhz6.o: in function `utf8::starts_with_bom(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
test2.cc:(.text+0x4fc): multiple definition of `utf8::starts_with_bom(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'; /tmp/ccKIEFGd.o:test1.cc:(.text+0x4fc): first defined here
/usr/bin/ld: /usr/lib/gcc/x86_64-pc-linux-gnu/8.2.1/../../../../lib/Scrt1.o: in function `_start':
(.text+0x24): undefined reference to `main'
collect2: error: ld returned 1 exit status
$ g++ test1.cc test2.cc -std=c++03
/usr/bin/ld: /usr/lib/gcc/x86_64-pc-linux-gnu/8.2.1/../../../../lib/Scrt1.o: in function `_start':
(.text+0x24): undefined reference to `main'
collect2: error: ld returned 1 exit status
As expected since we didn't include a main
I get a very similar error using clang as well.
Is there a way to work around this?
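A linker reports "multiple definition" like this when a non-template function is defined in a header without `inline` and the header is included from more than one translation unit. That is a likely cause here, though the log alone does not confirm it. A minimal sketch of the pattern, using a hypothetical `demo::all_ascii` rather than any real utfcpp function:

```cpp
#include <string>

namespace demo {
    // Function templates may be defined in a header without 'inline':
    // each translation unit instantiates its own copy and the linker merges them.
    template <typename It>
    bool all_ascii(It first, It last) {
        for (; first != last; ++first)
            if (static_cast<unsigned char>(*first) > 0x7F) return false;
        return true;
    }

    // A non-template overload defined in a header needs 'inline'.
    // Without it, every .cc file that includes the header emits its own
    // definition and the link step fails with "multiple definition".
    inline bool all_ascii(const std::string& s) {
        return all_ascii(s.begin(), s.end());
    }
}
```

Until the header itself is fixed, one workaround under the same assumption is to include it from a single translation unit only and expose what you need through your own declarations.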
docsample.cpp(39,20): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data
I hope this is in the scope of this project.
Can you add a function that calculates the "display width" of the UTF string? By display width I mean the "number of columns that the text would take up when displayed in the terminal". This is especially useful when trying to correctly render East-Asian characters in aligned tabular form on the terminal.
This is not the same as the string length (number of multi-byte characters).
This can be done by using wcswidth in *nix.
Thanks.
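For reference, a minimal sketch of the wcswidth approach mentioned above. It assumes a POSIX system and a string that has already been converted to wide characters; `demo::display_width` is a hypothetical helper, not part of utfcpp. The widths reported for non-ASCII characters depend on the locale selected with setlocale.

```cpp
#include <wchar.h>   // wcswidth is POSIX, not ISO C++
#include <string>

namespace demo {
    // Returns the number of terminal columns the text occupies,
    // or -1 if it contains a non-printable wide character.
    inline int display_width(const std::wstring& text) {
        return wcswidth(text.c_str(), text.size());
    }
}
```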
Setup:
latest version utfcpp on windows, x86-Debug (or any other) build with latest visual c++ 2019, with /std:c++17 flag
Not sure if problem is in my code or not.
Problem:
Project builds fine, however there's a warning in utf8/checked.h (line 151)
warning C4244: 'argument': conversion from 'const wchar_t' to 'utf8::uint8_t'
reference to function template instantiation at checked.h (line 200) 'utf8::uint32_t utf8::next<octet_iterator>(octet_iterator &,octet_iterator)' being compiled
with
[
octet_iterator=const wchar_t *
]
reference to function template instantiation in my own code:
'std::_Iterator_traits_pointer_base<_Ty,true>::difference_type utf8::distance<const wchar_t*>(octet_iterator,octet_iterator)' being compiled
with
[
_Ty=const wchar_t,
octet_iterator=const wchar_t *
]
Proposed solution:
a simple static_cast<uint8_t> at utf8/checked.h (line 151) fixes the problem
This license allows me to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software. It does not mention "modify".
May I modify the Software?
Hi,
the current homepage points to the sourceforge project page, which in turn only contains a small link to a blog post, which points to this repository.
When searching, for example, with google, this does not always make clear that this is the official source repository as sometimes this repository is listed as better match.
Please change the submodule url for googletest to
https://github.com/google/googletest
(note the protocol https).
This still provides secured connection but allows to clone the repository anonymously, without setting up ssh keys.
Thanks
Visual Studio supports pretty much all C++11 features used here. However, up until MSVC 2019(?) it does not correctly define __cplusplus.
Could you add an additional check for MSVC so that those features are available there too?
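One way such a check could look, as a sketch only: MSVC predefines `_MSVC_LANG` with the language version actually in use, so a header can prefer it over `__cplusplus` when it is larger. `DEMO_CPLUSPLUS` and `demo::has_cpp11` are hypothetical names for this illustration, not the library's.

```cpp
// DEMO_CPLUSPLUS is a hypothetical macro name used only in this sketch.
#if defined(_MSVC_LANG) && _MSVC_LANG > __cplusplus
#define DEMO_CPLUSPLUS _MSVC_LANG   // MSVC: trust its own language-version macro
#else
#define DEMO_CPLUSPLUS __cplusplus  // other compilers report the standard here
#endif

namespace demo {
    // True when the compiler is building as C++11 or later.
    inline bool has_cpp11() { return DEMO_CPLUSPLUS >= 201103L; }
}
```

Feature checks in the header would then test `DEMO_CPLUSPLUS >= 201103L` instead of testing `__cplusplus` directly.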
Hey,
I'm working on a project using this, but I use C++20. I'm getting the following warning:
In file included from c:\msys64\opt\devkitpro\devkitarm\arm-none-eabi\include\c++\12.2.0\bits\stl_construct.h:61,
from c:\msys64\opt\devkitpro\devkitarm\arm-none-eabi\include\c++\12.2.0\bits\char_traits.h:46,
from c:\msys64\opt\devkitpro\devkitarm\arm-none-eabi\include\c++\12.2.0\string:40,
from c:\msys64\opt\devkitpro\devkitarm\arm-none-eabi\include\c++\12.2.0\bitset:47,
from G:/GitHub/Homebrew/lovepotion/include/common/type.hpp:3,
from G:/GitHub/Homebrew/lovepotion/include/common/luax.hpp:3,
from G:/GitHub/Homebrew/lovepotion/include/objects/font/wrap_font.hpp:3:
c:\msys64\opt\devkitpro\devkitarm\arm-none-eabi\include\c++\12.2.0\bits\stl_iterator_base_types.h:127:34: note: declared here
127 | struct _GLIBCXX17_DEPRECATED iterator
| ^~~~~~~~
In file included from G:/GitHub/Homebrew/lovepotion/libraries/utf8/utf8.h:32:
G:/GitHub/Homebrew/lovepotion/libraries/utf8/utf8/unchecked.h:179:40: warning: 'template<class _Category, class _Tp, class _Distance, class _Pointer, class _Reference> struct std::iterator' is deprecated [-Wdeprecated-declarations]
179 | class iterator : public std::iterator <std::bidirectional_iterator_tag, uint32_t> {
| ^~~~~~~~
c:\msys64\opt\devkitpro\devkitarm\arm-none-eabi\include\c++\12.2.0\bits\stl_iterator_base_types.h:127:34: note: declared here
127 | struct _GLIBCXX17_DEPRECATED iterator
| ^~~~~~~~
If there's something I can do about fixing this, let me know, unless the library needs to be updated to work with C++20.
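The usual replacement for the deprecated `std::iterator` base class is to declare the five member typedefs directly in the iterator class. A minimal sketch with a hypothetical `demo::code_point_iterator` standing in for the library's class (only the typedefs are shown; the traversal operators are omitted):

```cpp
#include <cstddef>
#include <iterator>
#include <stdint.h>

namespace demo {
    // Hypothetical stand-in for an iterator over code points.
    // These typedefs are what std::iterator<std::bidirectional_iterator_tag, uint32_t>
    // used to provide through inheritance.
    class code_point_iterator {
    public:
        typedef uint32_t                        value_type;
        typedef uint32_t*                       pointer;
        typedef uint32_t&                       reference;
        typedef std::ptrdiff_t                  difference_type;
        typedef std::bidirectional_iterator_tag iterator_category;
    };
}
```

With the typedefs in place, `std::iterator_traits` still resolves them, so the class no longer needs the deprecated base.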
Hello.
I'd like to point out that project's homepage at http://utfcpp.sourceforge.net/ points to outdated sourceforge repo. Further, google search doesn't hit this repository at all (I'm seeing https://github.com/ledger/utfcpp instead).
IMHO the best solution would be to migrate the homepage completely to utfcpp.github.io and delete the SF repository altogether.
// an example of UTF8<-->ANSI
#if defined _WIN32 || defined _WIN64
#include <windows.h>
#include <string>
// utf8 --> Windows-1252
std::string to_ansi(const std::string& utf8)
{
int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0);
std::wstring wstr(wlen, static_cast<wchar_t>('\0'));
MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wstr[0], wlen);
int len = WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), -1, NULL, 0, NULL, NULL);
std::string ansi(len, '\0');
WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), -1, &ansi[0], len, NULL, NULL);
if(!ansi.empty() && ansi.back() == '\0') // drop the terminator counted by WideCharToMultiByte
ansi.pop_back();
return ansi;
}
// Windows-1252 --> UTF-8
std::string ansi2utf8(const std::string& ansi)
{
int wlen = MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), -1, NULL, 0);
std::wstring wstr(wlen, static_cast<wchar_t>('\0'));
MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), -1, &wstr[0], wlen);
int len = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, NULL, 0, NULL, NULL);
std::string utf8(len, '\0');
WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, &utf8[0], len, NULL, NULL);
if(!utf8.empty() && utf8.back() == '\0') // drop the terminator counted by WideCharToMultiByte
utf8.pop_back();
return utf8;
}
#endif
replace tag 'master' to 'v2.3.6'
You should not use any extra variables and you should use -DCMAKE_BUILD_TYPE=Debug instead to activate tests.
https://raymii.org/s/tutorials/Cpp_project_setup_with_cmake_and_unit_tests.html
https://github.com/nemtrif/utfcpp/blob/master/CMakeLists.txt#L10
Due to this issue the antlr4-cpp runtime is having hard times.
antlr/antlr4#2913
Could you add override to the what() functions of the exception classes? They currently trigger -Wsuggest-override warnings.
You already have a version check for C++11 so this is easy to add and helps for warning free builds.
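A sketch of how this could be done behind a version check. `DEMO_OVERRIDE`, `DEMO_NOEXCEPT` and `demo::invalid_input` are hypothetical names for this illustration, not the library's:

```cpp
#include <exception>

// Hypothetical macros: expand to the C++11 keywords only when they are available.
#if __cplusplus >= 201103L
#define DEMO_OVERRIDE override
#define DEMO_NOEXCEPT noexcept
#else
#define DEMO_OVERRIDE
#define DEMO_NOEXCEPT throw()
#endif

namespace demo {
    // Hypothetical exception class following the same pattern as the request.
    class invalid_input : public std::exception {
    public:
        // 'override' silences -Wsuggest-override and makes the compiler
        // verify that the signature really matches std::exception::what().
        virtual const char* what() const DEMO_NOEXCEPT DEMO_OVERRIDE {
            return "invalid input";
        }
    };
}
```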
Some systems emit UTF-8 strings with surrogate pairs encoded as two 3-byte sequences. utfcpp does not support such an encoding and throws an exception about an invalid code point when encountering it.
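This encoding is commonly known as CESU-8. A caller who has to accept it could decode each 3-byte sequence separately and then join the two surrogate values into one code point. A minimal sketch of that last step, using hypothetical helpers in a `demo` namespace that are not part of utfcpp:

```cpp
#include <stdint.h>

namespace demo {
    inline bool is_lead_surrogate(uint32_t cp)  { return cp >= 0xD800u && cp <= 0xDBFFu; }
    inline bool is_trail_surrogate(uint32_t cp) { return cp >= 0xDC00u && cp <= 0xDFFFu; }

    // Joins a lead/trail surrogate pair (each decoded from its own 3-byte
    // sequence) into the supplementary-plane code point it represents.
    // The caller is expected to have checked both values with the predicates above.
    inline uint32_t combine_surrogates(uint32_t lead, uint32_t trail) {
        return 0x10000u + ((lead - 0xD800u) << 10) + (trail - 0xDC00u);
    }
}
```

The combined value can then be re-encoded as a regular 4-byte UTF-8 sequence before it is handed to code that validates strictly.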
In C++20, std::u8string was added.
Are there any plans to add conversion between std::string and std::u8string,
and also to support something like
std::u8string utf8::utf16to8(std::u16string)?
If utf16to8 is given an array of the form [..., 0xd800u], it tries to read the trail surrogate without checking whether it has already run past the end of the array. As a result, the following while (start != end) loop never stops until the code reads unmapped memory and segfaults.
I know that it's unchecked, but segfaulting on invalid input is a pretty grim failure mode :)
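To illustrate the missing bounds check, here is a self-contained sketch of a UTF-16 decoding loop that tests for the end of the range before reading a trail surrogate. `demo::utf16_to_code_points` is a hypothetical function, not the library's implementation:

```cpp
#include <stdint.h>
#include <vector>

namespace demo {
    // Decodes UTF-16 code units in [start, end) into code points.
    // Returns false on ill-formed input instead of reading past 'end'.
    inline bool utf16_to_code_points(const uint16_t* start, const uint16_t* end,
                                     std::vector<uint32_t>& out) {
        while (start != end) {
            uint32_t cp = *start++;
            if (cp >= 0xD800u && cp <= 0xDBFFu) {           // lead surrogate
                if (start == end) return false;             // truncated pair: stop here
                uint32_t trail = *start;
                if (trail < 0xDC00u || trail > 0xDFFFu) return false;
                ++start;
                cp = 0x10000u + ((cp - 0xD800u) << 10) + (trail - 0xDC00u);
            } else if (cp >= 0xDC00u && cp <= 0xDFFFu) {    // trail without a lead
                return false;
            }
            out.push_back(cp);
        }
        return true;
    }
}
```

The single `start == end` test is what keeps the loop condition reachable for input that ends in a lone lead surrogate.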
`
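#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>
#include "utf8.h"
using namespace std;

// Prints every code point in [begin, end), re-encoded as UTF-8, in its own brackets.
template<typename in_iterator>
void print_symbols(in_iterator begin, in_iterator end)
{
    const size_t symbolSize = 5; // up to 4 UTF-8 bytes plus the terminating '\0'
    char symbol[symbolSize];
    uint32_t cp;
    while (begin != end)
    {
        memset(symbol, 0, symbolSize);
        cp = utf8::next(begin, end); // decodes one code point and advances begin
        utf8::append(cp, symbol);
        cout << "[" << symbol << "]";
    }
    cout << endl;
}
int main()
{
    string x = "😶🌫️";
    print_symbols(x.begin(), x.end());
    cin.get();
}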
`
why is this smiley face multiple characters? Is this a library error or should it be?
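This is most likely expected behaviour rather than a library error. As far as I know, the "face in clouds" emoji is an emoji sequence of several code points (U+1F636, the zero-width joiner U+200D, U+1F32B and the variation selector U+FE0F), and a loop over code points prints each one separately. A small sketch that counts code points without the library, using a hypothetical `demo::count_code_points` and assuming well-formed UTF-8 input:

```cpp
#include <cstddef>
#include <string>

namespace demo {
    // Counts code points in well-formed UTF-8 by counting every byte
    // that is not a continuation byte (continuation bytes look like 10xxxxxx).
    inline std::size_t count_code_points(const std::string& s) {
        std::size_t n = 0;
        for (std::size_t i = 0; i < s.size(); ++i)
            if ((static_cast<unsigned char>(s[i]) & 0xC0u) != 0x80u) ++n;
        return n;
    }
}
```

Grouping code points into what a user sees as one character requires grapheme cluster segmentation, which is a separate problem from UTF-8 decoding.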