pucu's Introduction

PUCU Pascal UniCode Utils Library

You need only the src\PUCU.pas file for normal use of this library.

pucu's People

Contributors

bero1985

pucu's Issues

Avoid use of longint and longword

In Delphi, LongInt and LongWord are platform-dependent types: 32-bit on Windows (and on all 32-bit targets) and 64-bit on 64-bit Linux, iOS, and Android.
In FreePascal LongInt and LongWord are 32-bit on all platforms. I don't know if this is dependent on Delphi-mode or not.

You probably want to use Integer and Cardinal instead, as these are 32-bit on all platforms in Delphi, and in Free Pascal when compiling in Delphi or ObjFPC mode (in Free Pascal's default mode, Integer is 16-bit).

https://docwiki.embarcadero.com/RADStudio/Sydney/en/Simple_Types_(Delphi)#Platform-Dependent_Integer_Types
https://www.freepascal.org/docs-html/rtl/system/longint.html

It appears that you were aware of this when you wrote the code, but got the handling of it reversed:

TPUCUInt32={$ifdef fpc}Int32{$else}LongInt{$endif};
TPUCUUInt32={$ifdef fpc}UInt32{$else}LongWord{$endif};

Regardless, it would be better to just alias the integer and cardinal types.
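
A minimal sketch of that aliasing, assuming Free Pascal is compiled in Delphi or ObjFPC mode (where Integer is 32-bit). The program wrapper and the SizeOf checks are only there for illustration and are not part of PUCU:

```pascal
program AliasCheck;
{$ifdef fpc}{$mode delphi}{$endif}

type
  // Integer and Cardinal are 32-bit in Delphi on every platform, and in
  // Free Pascal when compiling in Delphi or ObjFPC mode.
  TPUCUInt32 = Integer;
  TPUCUUInt32 = Cardinal;

begin
  // Illustrative sanity check: stop with a non-zero exit code if either
  // alias is not exactly four bytes wide.
  if (SizeOf(TPUCUInt32) <> 4) or (SizeOf(TPUCUUInt32) <> 4) then
    Halt(1);
  WriteLn('TPUCUInt32: ', SizeOf(TPUCUInt32), ' bytes, TPUCUUInt32: ',
    SizeOf(TPUCUUInt32), ' bytes');
end.
```

If the unit has to build in Free Pascal's default mode as well, keeping Int32/UInt32 on the fpc branch and switching only the Delphi branch to Integer/Cardinal would avoid the 16-bit Integer there.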

Changing case according to language

I went through PUCU.pas as carefully as I could, but I couldn't find a way to change the case (i.e. upper/lower/title) of a codepoint based on language.

Let me try to make the question a little clearer:

Consider 'LATIN CAPITAL LETTER I' (U+0049).

In most languages the lowercase of this is 'LATIN SMALL LETTER I' (U+0069), but in Turkish and Azerbaijani it becomes 'LATIN SMALL LETTER DOTLESS I' (U+0131).

Similar rules apply in other languages, for title case and so on.

This is not a codepage issue, since Unicode has no codepages.

Nor is it a script (or code block) issue.

It is purely a language issue.

In other words, case-mapping rules depend on the language a string is written in.

To sum up, I suppose I am looking for a function such as:

NewString := LowerCase(OriginalString, Turkish);
or
NewString := UpperCase(OriginalString, German);
or
NewString := TitleCase(OriginalString, Spanish);
etc.

Is there anything in PUCU.pas that serves (or can be made to serve) this purpose?
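
To make the request concrete, here is a rough sketch of what a language-aware lowercase mapping for a single codepoint could look like. TCaseLanguage and LowerCaseCodePoint are hypothetical names, not part of PUCU. The default branch only handles ASCII as a stand-in for the library's normal mapping, and the context-dependent Turkic rules in SpecialCasing.txt (for example I followed by a combining dot above) are ignored:

```pascal
program TurkicLowerSketch;
{$ifdef fpc}{$mode delphi}{$endif}

type
  TPUCUUInt32 = Cardinal;
  // Hypothetical language tag; PUCU itself has no such type.
  TCaseLanguage = (clDefault, clTurkish, clAzerbaijani);

// Hypothetical helper: applies the Turkic dotted/dotless i rules first,
// then falls back to a plain ASCII mapping (a stand-in for the real tables).
function LowerCaseCodePoint(c: TPUCUUInt32; Lang: TCaseLanguage): TPUCUUInt32;
begin
  if Lang in [clTurkish, clAzerbaijani] then
  begin
    case c of
      $0049: begin result := $0131; exit; end; // I -> dotless i
      $0130: begin result := $0069; exit; end; // I with dot above -> i
    end;
  end;
  case c of
    $0041..$005A: result := c + $20; // ASCII A..Z -> a..z
    else result := c;
  end;
end;

begin
  WriteLn(LowerCaseCodePoint($0049, clDefault)); // prints 105 (U+0069)
  WriteLn(LowerCaseCodePoint($0049, clTurkish)); // prints 305 (U+0131)
end.
```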

Use 'case' here, faster

function PUCUUnicodeIsWhiteSpace(c:TPUCUUInt32):boolean; {$ifdef caninline}inline;{$endif}
begin
//result:=UnicodeGetCategoryFromTable(c) in [PUCUUnicodeCategoryZs,PUCUUnicodeCategoryZp,PUCUUnicodeCategoryZl];
 result:=((c>=$0009) and (c<=$000d)) or (c=$0020) or (c=$00a0) or (c=$1680) or (c=$180e) or ((c>=$2000) and (c<=$200b)) or (c=$2028) or (c=$2029) or (c=$202f) or (c=$205f) or (c=$3000) or (c=$feff) or (c=$fffe);
end;

A case block would be better here.
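
A sketch of the same test written as a case statement, using exactly the code points from the expression above. TPUCUUInt32 is aliased locally so the example compiles on its own. Whether this is actually faster depends on the code the compiler generates, so it is worth measuring:

```pascal
program WhiteSpaceCase;
{$ifdef fpc}{$mode delphi}{$endif}

type
  TPUCUUInt32 = Cardinal; // local alias so the sketch is self-contained

// Same code points as the boolean expression, grouped into case labels.
function PUCUUnicodeIsWhiteSpace(c: TPUCUUInt32): boolean;
begin
  case c of
    $0009..$000d, $0020, $00a0, $1680, $180e,
    $2000..$200b, $2028, $2029, $202f, $205f,
    $3000, $feff, $fffe:
      result := true;
    else
      result := false;
  end;
end;

begin
  WriteLn(PUCUUnicodeIsWhiteSpace($0020)); // space
  WriteLn(PUCUUnicodeIsWhiteSpace($2003)); // inside the $2000..$200b range
  WriteLn(PUCUUnicodeIsWhiteSpace($0041)); // 'A', not white space
end.
```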

Integer overflow in PUCUUTF32Normalize

The statement...

pucu/src/PUCUCode.pas

Lines 3340 to 3342 in ea0b2a5

CompositionSequenceIndex:=PUCUUnicodeCharacterCompositionHashTableData[TPUCUUInt32((TPUCUUInt32(StartCodePoint)*98303927) xor
(TPUCUUInt32(CodePoint)*24710753)) and
PUCUUnicodeCharacterCompositionHashTableMask];

...can cause an integer overflow, because the product of two 32-bit numbers can exceed 32 bits.

One way to avoid it is to simply promote the numbers to 64-bit like this:

CompositionSequenceIndex :=
  PUCUUnicodeCharacterCompositionHashTableData[
    TPUCUUInt32((TPUCUUInt64(StartCodePoint)*98303927) xor (TPUCUUInt64(CodePoint)*24710753)) and
    PUCUUnicodeCharacterCompositionHashTableMask];
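
As a self-contained illustration of the promoted form, the sketch below computes the index with overflow and range checking switched on. HashMask is a made-up value standing in for PUCUUnicodeCharacterCompositionHashTableMask, and CompositionHashIndex is a hypothetical helper name. The mask is applied before narrowing, so the final cast cannot trip a range check:

```pascal
program HashIndexSketch;
{$ifdef fpc}{$mode delphi}{$endif}
{$Q+}{$R+} // overflow and range checks on, to show the 64-bit form is safe

type
  TPUCUUInt32 = Cardinal;
  TPUCUUInt64 = UInt64;

const
  MulStart: TPUCUUInt64 = 98303927;
  MulCode: TPUCUUInt64 = 24710753;
  // Hypothetical mask; the real value comes from PUCU's generated tables.
  HashMask: TPUCUUInt64 = $FFF;

// Hypothetical helper: multiply in 64 bits, mask, then narrow to 32 bits.
function CompositionHashIndex(StartCodePoint, CodePoint: TPUCUUInt32): TPUCUUInt32;
var
  h: TPUCUUInt64;
begin
  h := (TPUCUUInt64(StartCodePoint) * MulStart) xor
       (TPUCUUInt64(CodePoint) * MulCode);
  result := TPUCUUInt32(h and HashMask);
end;

begin
  // The largest code point still fits comfortably in 64 bits.
  WriteLn(CompositionHashIndex($10FFFF, $10FFFF));
  WriteLn(CompositionHashIndex($0041, $030A));
end.
```

Because only the low bits survive the mask, the resulting index is the same one the original expression yields when overflow checking is off.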

Special lower/upper case handling

Some characters need special handling because case conversion turns them into multiple characters rather than one, as listed in "SpecialCasing.txt". E.g. when converting to lower case, İ (U+0130) becomes 0069 0307.

For upper case, ß is the most notorious example.

That should be handled in PUCUUTF*LowerCase/PUCUUTF*UpperCase.
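
A rough sketch of what a one-to-many mapping could look like, covering only the two SpecialCasing.txt entries mentioned above. FullLowerCase, FullUpperCase and TCodePoints are hypothetical names, and the fallback branch simply returns the input as a placeholder for the ordinary one-to-one mapping:

```pascal
program SpecialCasingSketch;
{$ifdef fpc}{$mode delphi}{$endif}

type
  TPUCUUInt32 = Cardinal;
  TCodePoints = array of TPUCUUInt32;

// Hypothetical helper: full lowercase mapping for one code point.
function FullLowerCase(c: TPUCUUInt32): TCodePoints;
begin
  if c = $0130 then
  begin
    // I with dot above -> i followed by combining dot above
    SetLength(result, 2);
    result[0] := $0069;
    result[1] := $0307;
  end
  else
  begin
    SetLength(result, 1);
    result[0] := c; // placeholder for the ordinary one-to-one mapping
  end;
end;

// Hypothetical helper: full uppercase mapping for one code point.
function FullUpperCase(c: TPUCUUInt32): TCodePoints;
begin
  if c = $00DF then
  begin
    // sharp s -> SS
    SetLength(result, 2);
    result[0] := $0053;
    result[1] := $0053;
  end
  else
  begin
    SetLength(result, 1);
    result[0] := c; // placeholder for the ordinary one-to-one mapping
  end;
end;

begin
  WriteLn(Length(FullLowerCase($0130))); // prints 2
  WriteLn(Length(FullUpperCase($00DF))); // prints 2
end.
```

The string-level functions would then need to allow the output to be longer than the input.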

NFC Normalization of Å

Decomposing and normalizing the letter Å (Latin Capital Letter A with Ring Above, codepoint $00C5) produces the sequence $0041 $030A. This is correct.
However, composing the sequence $0041 $030A produces the codepoint $212B (Angstrom Sign).

$00C5 and $212B are equivalent codepoints but their normal form is $00C5 so the composition is wrong.

For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").

This is a pretty big problem as Å is a fairly common letter in Scandinavian languages (which is why I discovered this problem).
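
One possible cause, which is an assumption and not confirmed against PUCU's table generator, is that singleton decompositions end up in the composition table. In UnicodeData.txt, $212B decomposes to the single codepoint $00C5, and such singletons are excluded from composition. The sketch below uses a hypothetical two-entry table and a hypothetical Compose function to show the effect of skipping them:

```pascal
program CompositionExclusionSketch;
{$ifdef fpc}{$mode delphi}{$endif}

type
  TDecomp = record
    Composite: Cardinal;
    Count: Integer;          // code points in the canonical decomposition
    First, Second: Cardinal;
  end;

const
  // Raw canonical decomposition mappings as listed in UnicodeData.txt.
  Decomps: array[0..1] of TDecomp = (
    (Composite: $00C5; Count: 2; First: $0041; Second: $030A),
    (Composite: $212B; Count: 1; First: $00C5; Second: 0) // singleton
  );

// Hypothetical pair composition that ignores singleton decompositions.
// Returns 0 when the pair has no composite (a simplification for the sketch).
function Compose(First, Second: Cardinal): Cardinal;
var
  i: Integer;
begin
  result := 0;
  for i := Low(Decomps) to High(Decomps) do
    if (Decomps[i].Count = 2) and (Decomps[i].First = First) and
       (Decomps[i].Second = Second) then
    begin
      result := Decomps[i].Composite;
      exit;
    end;
end;

begin
  // $0041 $030A composes to $00C5; $212B can never be produced.
  if Compose($0041, $030A) <> $00C5 then
    Halt(1);
  WriteLn('composed to $00C5');
end.
```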

linux building

This is hard to build on Linux.

I always have to do it like this:

$ fpc PUCUConvertUnicode.dpr
..
$ ./PUCUConvertUnicode
..
$ fpc -Twin32 PUCUGenCodePages.dpr
Error: Illegal parameter: -Twin32
$ fpc -Twin32 -Pi386 PUCUGenCodePages.dpr
...
$ wine ./PUCUGenCodePages.exe
$ fpc -B PUCUBuild.dpr
...
$ ./PUCUBuild
