PUCU Pascal UniCode Utils Library

You need only the src\PUCU.pas file for normal use of this library.

Avoid use of LongInt and LongWord

In Delphi, LongInt and LongWord are platform-dependent types: 32-bit on Windows and 64-bit on 64-bit Linux, iOS, and Android.
In FreePascal, LongInt and LongWord are 32-bit on all platforms. I don't know whether this depends on Delphi mode.

You probably want to use Integer and Cardinal instead, as these are 32-bit on all platforms in Delphi and in FreePascal's Delphi and ObjFPC modes.

It appears that you were aware of this when you wrote the code, but got the handling of it reversed:

TPUCUInt32={$ifdef fpc}Int32{$else}LongInt{$endif};
TPUCUUInt32={$ifdef fpc}UInt32{$else}LongWord{$endif};

Regardless, it would be better to just alias the Integer and Cardinal types.
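A minimal sketch of that aliasing, assuming the code is compiled with Delphi or with FreePascal in Delphi mode (where Integer and Cardinal are both 32-bit). The size check is only an illustration and is not part of PUCU.pas:

```pascal
program AliasSketch;
{$ifdef fpc}{$mode delphi}{$else}{$apptype console}{$endif}

type
  // Same type names as in PUCU.pas, but aliased without a compiler switch.
  TPUCUInt32 = Integer;
  TPUCUUInt32 = Cardinal;

begin
  // Abort with a non-zero exit code if either alias is not exactly 4 bytes.
  if (SizeOf(TPUCUInt32) <> 4) or (SizeOf(TPUCUUInt32) <> 4) then
    Halt(1);
  WriteLn('TPUCUInt32 and TPUCUUInt32 are 32-bit');
end.
```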

Changing case according to language

I went through PUCU.pas as carefully as I could, but I couldn't find a way to change the case (i.e. upper/lower/title) of a codepoint based on language.

Let me try to make the question a little clearer:

Consider 'LATIN CAPITAL LETTER I' (U+0049).

In all languages the lowercase of this is 'LATIN SMALL LETTER I' (U+0069), except in Turkish and Azerbaijani, where it becomes 'LATIN SMALL LETTER DOTLESS I' (U+0131).

Similar exceptions apply to other languages, for title casing and so on.

The issue here is not a codepage one, since Unicode does not have codepages.

Nor is it a script (or code block) issue.

It is solely and purely a language one.

In other words, case-change rules differ by the language a string is written in.

To sum up, I suppose I am looking for a function such as:

NewString := LowerCase(OriginalString, Turkish);
NewString := UpperCase(OriginalString, German);
NewString := TitleCase(OriginalString, Spanish);

Is there anything in PUCU.pas that serves (or can be made to serve) this purpose?
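To illustrate the kind of behaviour I mean, here is a minimal per-codepoint sketch. The names TCaseLanguage and LowerCaseCodePoint are hypothetical and do not exist in PUCU.pas, and only ASCII letters plus the two Turkic I exceptions are covered:

```pascal
program TurkicLowerSketch;
{$ifdef fpc}{$mode delphi}{$else}{$apptype console}{$endif}

type
  // Hypothetical language tag, invented for this illustration only.
  TCaseLanguage = (clDefault, clTurkic);

// Lower-cases one codepoint. A real version would consult the full Unicode
// tables; this one knows only ASCII A..Z and the Turkish/Azerbaijani rules.
function LowerCaseCodePoint(c: Cardinal; Lang: TCaseLanguage): Cardinal;
begin
  if (Lang = clTurkic) and (c = $0049) then
    Result := $0131          // I -> dotless i
  else if (Lang = clTurkic) and (c = $0130) then
    Result := $0069          // I with dot above -> i
  else if (c >= $0041) and (c <= $005A) then
    Result := c + $20        // plain ASCII A..Z
  else
    Result := c;             // everything else is left unchanged
end;

begin
  // Halt with a non-zero exit code if either mapping is unexpected.
  if LowerCaseCodePoint($0049, clDefault) <> $0069 then Halt(1);
  if LowerCaseCodePoint($0049, clTurkic) <> $0131 then Halt(1);
  WriteLn('done');
end.
```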

Use 'case' here, faster

function PUCUUnicodeIsWhiteSpace(c:TPUCUUInt32):boolean; {$ifdef caninline}inline;{$endif}
//result:=UnicodeGetCategoryFromTable(c) in [PUCUUnicodeCategoryZs,PUCUUnicodeCategoryZp,PUCUUnicodeCategoryZl];
 result:=((c>=$0009) and (c<=$000d)) or (c=$0020) or (c=$00a0) or (c=$1680) or (c=$180e) or ((c>=$2000) and (c<=$200b)) or (c=$2028) or (c=$2029) or (c=$202f) or (c=$205f) or (c=$3000) or (c=$feff) or (c=$fffe);

A case block would be better here.
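A sketch of what I have in mind, using the same codepoint set as the expression above. The function name IsWhiteSpaceCase is made up for this example, and I have not measured whether it is actually faster with every compiler:

```pascal
program WhiteSpaceCaseSketch;
{$ifdef fpc}{$mode delphi}{$else}{$apptype console}{$endif}

// Same codepoints as the boolean expression, written as a case statement
// so the compiler can choose a jump table or range comparisons.
function IsWhiteSpaceCase(c: Cardinal): Boolean;
begin
  case c of
    $0009..$000d, $0020, $00a0, $1680, $180e, $2000..$200b,
    $2028, $2029, $202f, $205f, $3000, $feff, $fffe:
      Result := True;
  else
    Result := False;
  end;
end;

begin
  // Spot checks: space and U+2005 are in the set, 'A' is not.
  if not IsWhiteSpaceCase($0020) then Halt(1);
  if not IsWhiteSpaceCase($2005) then Halt(1);
  if IsWhiteSpaceCase($0041) then Halt(1);
  WriteLn('done');
end.
```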

Integer overflow in PUCUUTF32Normalize

The statement...


Lines 3340 to 3342 in ea0b2a5

CompositionSequenceIndex:=PUCUUnicodeCharacterCompositionHashTableData[TPUCUUInt32((TPUCUUInt32(StartCodePoint)*98303927) xor
(TPUCUUInt32(CodePoint)*24710753)) and

...can cause an integer overflow when the product of the two 32-bit operands exceeds 32 bits.

One way to avoid it is to simply promote the numbers to 64-bit like this:

CompositionSequenceIndex :=
    TPUCUUInt32((TPUCUUInt64(StartCodePoint)*98303927) xor (TPUCUUInt64(CodePoint)*24710753)) and
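As a standalone illustration of the same idea: for StartCodePoint = $0041 the first product is already 65 * 98303927 = 6389755255, which does not fit in 32 bits. The sketch below computes the hash with 64-bit intermediates and keeps the low 32 bits, which is the same value an unchecked 32-bit multiplication would wrap to. PairHash is a hypothetical helper, and the table lookup and mask are left out because the quoted excerpt is truncated:

```pascal
program HashOverflowSketch;
{$ifdef fpc}{$mode delphi}{$else}{$apptype console}{$endif}
{$Q+,R+} // overflow and range checks on: the 64-bit version must not trip them

// Hypothetical helper, not part of PUCU.pas.
function PairHash(StartCodePoint, CodePoint: Cardinal): Cardinal;
var
  h: UInt64;
begin
  // Both products are formed in 64 bits, so they cannot overflow for
  // valid codepoints (at most $10FFFF).
  h := (UInt64(StartCodePoint) * UInt64(98303927)) xor
       (UInt64(CodePoint) * UInt64(24710753));
  // Keep only the low 32 bits before narrowing the type.
  Result := Cardinal(h and UInt64($FFFFFFFF));
end;

begin
  // A followed by combining ring above
  WriteLn(PairHash($0041, $030A));
end.
```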

Special lower/upper case handling

Some characters need special handling because they turn into multiple characters rather than one when case converting, as listed in "SpecialCasing.txt". E.g. when converting to lower case, İ (U+0130) becomes 0069 0307.

For upper case, ß (which becomes SS) is the most notorious example.

That should be handled in PUCUUTF*LowerCase/PUCUUTF*UpperCase.
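A rough sketch of what a one-to-many mapping could look like. TCodePoints, UpperCaseFull and LowerCaseFull are hypothetical names rather than PUCU.pas API, and only the two examples above plus ASCII letters are covered:

```pascal
program SpecialCasingSketch;
{$ifdef fpc}{$mode delphi}{$else}{$apptype console}{$endif}

type
  TCodePoints = array of Cardinal;

// Upper-cases one codepoint into one or more codepoints.
function UpperCaseFull(c: Cardinal): TCodePoints;
begin
  if c = $00DF then
  begin
    // sharp s -> S S
    SetLength(Result, 2);
    Result[0] := $0053;
    Result[1] := $0053;
  end
  else
  begin
    SetLength(Result, 1);
    if (c >= $0061) and (c <= $007A) then
      Result[0] := c - $20   // plain ASCII a..z
    else
      Result[0] := c;        // everything else is left unchanged
  end;
end;

// Lower-cases one codepoint into one or more codepoints.
function LowerCaseFull(c: Cardinal): TCodePoints;
begin
  if c = $0130 then
  begin
    // I with dot above -> i + combining dot above
    SetLength(Result, 2);
    Result[0] := $0069;
    Result[1] := $0307;
  end
  else
  begin
    SetLength(Result, 1);
    if (c >= $0041) and (c <= $005A) then
      Result[0] := c + $20   // plain ASCII A..Z
    else
      Result[0] := c;
  end;
end;

begin
  // Both special cases expand to two codepoints.
  if Length(UpperCaseFull($00DF)) <> 2 then Halt(1);
  if Length(LowerCaseFull($0130)) <> 2 then Halt(1);
  WriteLn('done');
end.
```

The point of returning an array is that callers of the string-level functions would have to grow the output buffer instead of assuming one codepoint in, one codepoint out.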

NFC Normalization of Å

Decomposing and normalizing the letter Å (Latin Capital Letter A with Ring Above, codepoint $00C5) produces the sequence $0041 $030A. This is correct.
However, composing the sequence $0041 $030A produces the codepoint $212B (Angstrom Sign).

$00C5 and $212B are canonically equivalent codepoints, but their normal form is $00C5, so the composition is wrong.

For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").

This is a pretty big problem as Å is a fairly common letter in Scandinavian languages (which is why I discovered this problem).

Building on Linux

This is hard to build on Linux.

I always have to do it like this:

$ fpc PUCUConvertUnicode.dpr
$ ./PUCUConvertUnicode
$ fpc -Twin32 PUCUGenCodePages.dpr
Error: Illegal parameter: -Twin32
$ fpc -Twin32 -Pi386 PUCUGenCodePages.dpr
$ wine ./PUCUGenCodePages.exe
$ fpc -B PUCUBuild.dpr
$ ./PUCUBuild

