You need only the src\PUCU.pas file for normal use of this library.
PUCU Pascal UniCode Utils Library
In Delphi, LongInt and LongWord are platform-dependent types: 32-bit on Windows (and on all 32-bit targets) and 64-bit on 64-bit Linux, iOS, and Android.
In FreePascal LongInt and LongWord are 32-bit on all platforms. I don't know if this is dependent on Delphi-mode or not.
You probably want to use Integer and Cardinal instead, as these are 32-bit on all platforms in Delphi, and in FreePascal whenever Delphi or ObjFPC mode is active (in FreePascal's default mode, Integer is 16-bit).
https://docwiki.embarcadero.com/RADStudio/Sydney/en/Simple_Types_(Delphi)#Platform-Dependent_Integer_Types
https://www.freepascal.org/docs-html/rtl/system/longint.html
It appears that you were aware of this when you wrote the code, but got the handling of it reversed:
TPUCUInt32={$ifdef fpc}Int32{$else}LongInt{$endif};
TPUCUUInt32={$ifdef fpc}UInt32{$else}LongWord{$endif};
Regardless, it would be better to just alias the integer and cardinal types.
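The aliasing suggested above could look like the following minimal sketch. It assumes Delphi or FreePascal in Delphi mode, and the startup check is an illustrative addition rather than anything present in PUCU.pas.

```pascal
program AliasSketch;
{$ifdef fpc}{$mode delphi}{$else}{$apptype console}{$endif}

type
  // Assumption: Integer and Cardinal are 32-bit on every target of interest.
  TPUCUInt32 = Integer;
  TPUCUUInt32 = Cardinal;

begin
  // Fail loudly at startup if the assumption does not hold on this target.
  if (SizeOf(TPUCUInt32) <> 4) or (SizeOf(TPUCUUInt32) <> 4) then
  begin
    Writeln('unexpected integer size');
    Halt(1);
  end;
  Writeln('TPUCUInt32 and TPUCUUInt32 are both 4 bytes');
end.
```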
I went through PUCU.pas as carefully as I could, but I couldn't find an answer to changing case (i.e. upper/lower/title) of a codepoint based on language.
Let me try to make the question a little clearer:
Consider 'LATIN CAPITAL LETTER I' (U+0049).
For all languages the lowercase for this is 'LATIN SMALL LETTER I' (U+0069) except for Turkish and Azerbaijani in which case it becomes 'LATIN SMALL LETTER DOTLESS I' (U+0131).
Similar rules apply to other languages for title case, etc.
The issue here is not one of codepages, since Unicode does not have codepages.
Nor is it a script (or code block) issue.
It is solely and purely a language issue.
In other words, case-mapping rules differ by the language in which a string is written.
To sum up, I suppose I am looking for a function such as:
NewString := LowerCase(OriginalString, Turkish);
or
NewString := UpperCase(OriginalString, German);
or
NewString := TitleCase(OriginalString, Spanish);
etc.
Is there anything in PUCU.pas that serves (or can be made to serve) this purpose?
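As a rough illustration of what such a language-aware mapping involves, here is a minimal sketch for a single code point. TCaseLanguage and TailoredLowerCase are hypothetical names, not part of PUCU, and the fallback mapping only covers ASCII. The sketch also ignores the contextual conditions that SpecialCasing.txt attaches to the Turkish and Azerbaijani rules.

```pascal
program TailoredLowerSketch;
{$ifdef fpc}{$mode delphi}{$else}{$apptype console}{$endif}

type
  // Hypothetical language tag; not part of PUCU.
  TCaseLanguage = (clDefault, clTurkish, clAzerbaijani);

// Hypothetical helper: lower-cases one code point, applying the
// Turkish/Azerbaijani tailoring first and a default mapping otherwise.
function TailoredLowerCase(CodePoint: Cardinal; Language: TCaseLanguage): Cardinal;
begin
  if Language in [clTurkish, clAzerbaijani] then
  begin
    case CodePoint of
      $0049: begin Result := $0131; Exit; end; // I -> dotless i
      $0130: begin Result := $0069; Exit; end; // I with dot above -> i
    end;
  end;
  // Stand-in for a language-neutral mapping (ASCII A..Z only here).
  if (CodePoint >= $0041) and (CodePoint <= $005A) then
    Result := CodePoint + $20
  else
    Result := CodePoint;
end;

begin
  Writeln(TailoredLowerCase($0049, clDefault) = $0069);
  Writeln(TailoredLowerCase($0049, clTurkish) = $0131);
end.
```

A string-level LowerCase(OriginalString, Turkish) would decode each code point, pass it through a helper of this kind, and re-encode the result.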
function PUCUUnicodeIsWhiteSpace(c:TPUCUUInt32):boolean; {$ifdef caninline}inline;{$endif}
begin
//result:=UnicodeGetCategoryFromTable(c) in [PUCUUnicodeCategoryZs,PUCUUnicodeCategoryZp,PUCUUnicodeCategoryZl];
result:=((c>=$0009) and (c<=$000d)) or (c=$0020) or (c=$00a0) or (c=$1680) or (c=$180e) or ((c>=$2000) and (c<=$200b)) or (c=$2028) or (c=$2029) or (c=$202f) or (c=$205f) or (c=$3000) or (c=$feff) or (c=$fffe);
end;
A case block would be better here.
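A sketch of the case-based form, using the same code points as the boolean expression above. IsWhiteSpaceSketch is an illustrative name, and Cardinal stands in for TPUCUUInt32 so the example compiles on its own.

```pascal
program WhiteSpaceCaseSketch;
{$ifdef fpc}{$mode delphi}{$else}{$apptype console}{$endif}

// Illustrative stand-in for PUCUUnicodeIsWhiteSpace.
function IsWhiteSpaceSketch(c: Cardinal): Boolean;
begin
  case c of
    $0009..$000d, $0020, $00a0, $1680, $180e,
    $2000..$200b, $2028, $2029, $202f, $205f,
    $3000, $feff, $fffe:
      Result := True;
  else
    Result := False;
  end;
end;

begin
  Writeln(IsWhiteSpaceSketch($0020)); // space
  Writeln(IsWhiteSpaceSketch($0041)); // letter A
end.
```

The compiler is free to turn the label list into range checks or a jump table, and the list is easier to review against the Unicode data than a long chain of comparisons.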
The statement at lines 3340 to 3342 in ea0b2a5 can cause an integer overflow when the product of two 32-bit numbers does not fit in 32 bits.
One way to avoid it is to simply promote the numbers to 64-bit like this:
CompositionSequenceIndex :=
PUCUUnicodeCharacterCompositionHashTableData[
TPUCUUInt32((TPUCUUInt64(StartCodePoint)*98303927) xor (TPUCUUInt64(CodePoint)*24710753)) and
PUCUUnicodeCharacterCompositionHashTableMask];
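The same idea can be isolated in a small helper, shown here as a sketch. CompositionHash and the mask value $FFFF are assumptions made for illustration; the real mask is whatever PUCUUnicodeCharacterCompositionHashTableMask holds.

```pascal
program HashPromotionSketch;
{$ifdef fpc}{$mode delphi}{$else}{$apptype console}{$endif}

// Hypothetical helper wrapping the hash expression from the issue.
function CompositionHash(StartCodePoint, CodePoint, Mask: Cardinal): Cardinal;
begin
  // Widen before multiplying so the products cannot overflow 32 bits.
  // Masking while still in 64-bit keeps the narrowing cast in range.
  Result := Cardinal(((UInt64(StartCodePoint) * 98303927) xor
                      (UInt64(CodePoint) * 24710753)) and UInt64(Mask));
end;

begin
  // $FFFF is an assumed mask, not PUCU's actual table mask.
  Writeln(CompositionHash($0041, $030A, $FFFF));
end.
```

Because only the low bits survive the mask, the result matches what wrapping 32-bit arithmetic would give, so existing hash tables stay valid while overflow checking no longer trips.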
Some characters need special handling because they turn into multiple characters rather than one when case converting, as in "SpecialCasing.txt". E.g. when converting to lower case, İ becomes something like 0069 0307.
For upper case, ß is the most notorious.
That should be handled in PUCUUTF*LowerCase/PUCUUTF*UpperCase.
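A minimal sketch of one-to-many case mapping, covering only the two characters mentioned above. TCodePoints, FullUpperCase and FullLowerCase are hypothetical names, and the fallback simply returns the input where a real implementation would apply the simple one-to-one mapping.

```pascal
program SpecialCasingSketch;
{$ifdef fpc}{$mode delphi}{$else}{$apptype console}{$endif}

type
  TCodePoints = array of Cardinal;

// Hypothetical helper: upper-case mapping that may expand to several code points.
function FullUpperCase(c: Cardinal): TCodePoints;
begin
  case c of
    $00DF: // sharp s -> "SS"
      begin
        SetLength(Result, 2);
        Result[0] := $0053;
        Result[1] := $0053;
      end;
  else
    SetLength(Result, 1);
    Result[0] := c; // stand-in for the simple one-to-one mapping
  end;
end;

// Hypothetical helper: lower-case mapping that may expand to several code points.
function FullLowerCase(c: Cardinal): TCodePoints;
begin
  case c of
    $0130: // I with dot above -> i + combining dot above
      begin
        SetLength(Result, 2);
        Result[0] := $0069;
        Result[1] := $0307;
      end;
  else
    SetLength(Result, 1);
    Result[0] := c; // stand-in for the simple one-to-one mapping
  end;
end;

begin
  Writeln(Length(FullUpperCase($00DF)));
  Writeln(Length(FullLowerCase($0130)));
end.
```

The string-level routines would append every returned code point to the output, which means the result can be longer than the input.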
Decomposing and normalizing the letter Å (Latin Capital Letter A with Ring Above, codepoint $00C5) produces the sequence $0041 $030A. This is correct.
However, composing the sequence $0041 $030A produces the codepoint $212B (Angstrom Sign).
$00C5 and $212B are equivalent codepoints, but their normal form is $00C5, so the composition is wrong.
For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
This is a pretty big problem as Å is a fairly common letter in Scandinavian languages (which is why I discovered this problem).
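One possible cause, stated as an assumption rather than a diagnosis of PUCU's table generator, is that singleton decompositions such as $212B are allowed into the composition table. Unicode excludes singletons from composition, so only characters whose raw canonical mapping has exactly two code points should be registered as composites. The sketch below uses hypothetical names and a two-entry table taken from UnicodeData.txt.

```pascal
program CompositionExclusionSketch;
{$ifdef fpc}{$mode delphi}{$else}{$apptype console}{$endif}

type
  TDecomposition = record
    CodePoint: Cardinal;
    Count: Integer;                  // length of the raw canonical mapping
    Mapping: array[0..1] of Cardinal;
  end;

const
  // Raw (not fully expanded) canonical mappings from UnicodeData.txt.
  Entries: array[0..1] of TDecomposition = (
    (CodePoint: $212B; Count: 1; Mapping: ($00C5, 0)),     // singleton
    (CodePoint: $00C5; Count: 2; Mapping: ($0041, $030A))  // real pair
  );

// Hypothetical lookup: returns the composite for a pair, or 0 if none.
function Compose(First, Second: Cardinal): Cardinal;
var
  i: Integer;
begin
  Result := 0;
  for i := Low(Entries) to High(Entries) do
    // Singletons (Count = 1) never take part in composition.
    if (Entries[i].Count = 2) and
       (Entries[i].Mapping[0] = First) and
       (Entries[i].Mapping[1] = Second) then
    begin
      Result := Entries[i].CodePoint;
      Exit;
    end;
end;

begin
  Writeln(Compose($0041, $030A) = $00C5);
end.
```

If the generator instead expands $212B fully to $0041 $030A before building the table, the later entry would overwrite the mapping to $00C5, which would match the behaviour described above.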
This is hard to build on Linux
I always have to do it like this:
$ fpc PUCUConvertUnicode.dpr
..
$ ./PUCUConvertUnicode
..
$ fpc -Twin32 PUCUGenCodePages.dpr
Error: Illegal parameter: -Twin32
$ fpc -Twin32 -Pi386 PUCUGenCodePages.dpr
...
$ wine ./PUCUGenCodePages.exe
$ fpc -B PUCUBuild.dpr
...
$ ./PUCUBuild