iandoug / unicodeccount

This project forked from hughp/unicodeccount

UnicodeCCount

This is a wish-list of things I would like to add to UnicodeCCount. The original author would likely consider these features scope creep, so it is probably up to me to add them. So here is to ... someday. I am not a Perl programmer, but as I have used this tool, there are some things which I wish it did, and other things which I wish it did a little differently.

Additional canonical equivalences

  • Unicode has other forms related to canonical equivalence (i.e. FCC and FCD). I wish there was a flag which enabled these; flags for NFC and NFD already exist. See http://perldoc.perl.org/5.8.8/Unicode/Normalize.html for a Perl-centric discussion.
    • This would allow the proper ordering of diacritics to be checked with the same tool that is used to count characters. So if we needed to pre-flight a document for another process, we could use UnicodeCCount. Upon more reflection: UCC is only a diagnostic tool. It reads the original data and never rewrites an original file, so the only diagnostic in this area is whether the target file is FCC or FCD compliant.
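As a rough illustration of that diagnostic, the sketch below reports whether a string is already in NFC or NFD and whether it passes the FCD test. It is written in Python rather than in the tool's own Perl, and the names `is_fcd` and `normalization_report` are hypothetical. The FCD test used here is the usual one: after canonically decomposing each character on its own, no non-zero leading combining class may be lower than the trailing combining class of the character before it. FCC is not covered.

```python
import unicodedata


def is_fcd(text):
    """Return True if text passes the FCD test (hypothetical helper)."""
    prev_trail = 0
    for ch in text:
        # Canonically decompose this single character.
        decomposed = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposed[0])
        # A non-zero leading class lower than the previous trailing class
        # means the combining marks are out of canonical order.
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = unicodedata.combining(decomposed[-1])
    return True


def normalization_report(text):
    """Read-only diagnostics: nothing is rewritten."""
    return {
        "NFC": unicodedata.normalize("NFC", text) == text,
        "NFD": unicodedata.normalize("NFD", text) == text,
        "FCD": is_fcd(text),
    }


if __name__ == "__main__":
    # 'a' + acute (class 230) + dot below (class 220) is out of order.
    print(normalization_report("a\u0301\u0323"))
```

A report like this only inspects the input, which matches the idea that UCC never rewrites the original file.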

Using a third argument as a reference to a data set

  • I wish there was a flag that permitted counting functional units (sets of Unicode code points that orthography users think of as a single unit). A third file would be needed, listing a set of strings, and functional units could then be counted based on that file. I have a prototype script in a GitHub repo; I need to find it.
  • I wish there was also a custom collation order. Three use-cases I can think of would be:
    • Case pairing: Aa, Bb, Cc, etc. Each character would get its own row, as is the default; pairs would be defined either by a third file or by a chosen locale.
    • Segmentation based on the collation order of the locale, or on an order presented in a separate file.
    • Segmentation based on the typology of characters presented in Unicode report #tr35.
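The functional-unit idea can be sketched as a greedy longest-match scan over the text. This is only an illustration in Python, not the prototype script mentioned above; the function name `count_functional_units` is hypothetical, and the list of units is assumed to have been read from the third file, one string per line.

```python
from collections import Counter


def count_functional_units(text, units):
    """Count listed multi-character units first, single characters otherwise."""
    # Try longer units before shorter ones so "ngb" wins over "ng".
    ordered = sorted(set(units), key=len, reverse=True)
    counts = Counter()
    i = 0
    while i < len(text):
        for unit in ordered:
            if unit and text.startswith(unit, i):
                counts[unit] += 1
                i += len(unit)
                break
        else:
            # No listed unit starts here: count the single character.
            counts[text[i]] += 1
            i += 1
    return counts


if __name__ == "__main__":
    print(count_functional_units("ngong", ["ng"]))
```

Greedy matching is a design assumption; an orthography where units can overlap ambiguously would need a smarter segmentation rule.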

Casing

A flag would be needed for applying the Unicode-aware Perl function lc to the input. See the discussion here: http://perldoc.perl.org/functions/lc.html and http://perl.about.com/od/programmingperl/qt/perllcfunction.htm

  • It would be good to also create a paired output, ordered by upper-lower case pairing rather than by, say, Unicode code point order.
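A minimal sketch of the paired ordering, in Python for illustration only: each character keeps its own row and count, but rows are sorted so that upper and lower case forms sit next to each other. The name `paired_case_rows` is hypothetical, and pairing is assumed to follow the language-independent default lowercase mapping rather than a locale or a third file.

```python
from collections import Counter


def paired_case_rows(text):
    """Return (character, count) rows with case pairs adjacent, e.g. A, a, B, b."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return sorted(
        counts.items(),
        # Group by lowercase form, then upper case before lower case.
        key=lambda item: (item[0].lower(), item[0].islower(), item[0]),
    )


if __name__ == "__main__":
    for ch, n in paired_case_rows("bAaB a"):
        print(ch, n)
```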

Collation Order

Sort order could additionally be linguistic, orthographic, or locale-based:

  • Linguistic
    • 1: PTK, BDG, etc. (place of articulation)
    • 2: PB, TD, KG, etc. (manner of articulation)
  • Orthographic: vowels, consonants, punctuation, currency.
  • Locale: as defined in the locale chosen.

Unicode #tr35

For instance, I was looking at Masai on ScriptSource and noticed that there are "Main characters", "Auxiliary characters", and "Index characters". Auxiliary and index characters are informally defined in https://www.unicode.org/reports/tr35/tr35-general.html#Exemplars. Index characters are more fully described at http://unicode.org/reports/tr35/tr35-collation.html#Index_Characters. Basically, an index character list just defines a useful set of buckets. Note that index characters are, by definition, in local collation order, but they do not define (nor could they be used to deduce) the full collation details. You can see that by recognizing that "A D G J M P S V Y" would be a valid (if uncommon) index character list for English. Similarly, for the Masai example, there are many more characters in the "Main characters used" list than in the "Index characters used" list, so you would have to know how to collate Masai to put things into the buckets defined by the index.
Ideally such an implementation would be based on the Unicode Collation Algorithm (which is available from Perl), using tailoring to get what one wanted.
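To make the bucket idea concrete, here is a small Python sketch that files words under the nearest preceding index character. It is not an implementation of the Unicode Collation Algorithm: the `key` parameter is a stand-in for a real collation key, and the default (`str.casefold`) is only adequate for plain English examples. The name `bucket_by_index` is hypothetical.

```python
from bisect import bisect_right


def bucket_by_index(words, index_chars, key=str.casefold):
    """Group words under index characters, using `key` as a stand-in collation key."""
    labels = sorted(index_chars, key=key)
    keys = [key(label) for label in labels]
    buckets = {label: [] for label in labels}
    for word in sorted(words, key=key):
        # Find the last index character that sorts at or before the word.
        pos = bisect_right(keys, key(word)) - 1
        # Anything sorting before the first label goes into the first bucket.
        buckets[labels[max(pos, 0)]].append(word)
    return buckets


if __name__ == "__main__":
    words = ["apple", "cat", "dog", "zebra", "egg"]
    for label, members in bucket_by_index(words, "ADGJMPSVY").items():
        print(label, members)
```

The buckets only come out right because the key function already knows how to order the words, which is the point made above: the index list alone cannot supply the collation.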

Do the UnicodeCCount thing according to all indicated options;
order the output according to the order in file $_

Something on the command line like:

`CCount -m -s sortorderfile.txt -o outputfile.txt input.txt`  
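The ordering step could look roughly like the Python sketch below. The format of the sort order file is not specified above, so this assumes one character per line, already read into a list; characters missing from the list are placed at the end in code point order. The name `ordered_counts` is hypothetical.

```python
from collections import Counter


def ordered_counts(text, sort_order):
    """Return (character, count) rows in the order given by sort_order."""
    counts = Counter(text)
    rank = {ch: i for i, ch in enumerate(sort_order)}
    return sorted(
        counts.items(),
        # Unlisted characters share the last rank and fall back to code point order.
        key=lambda item: (rank.get(item[0], len(rank)), item[0]),
    )


if __name__ == "__main__":
    print(ordered_counts("banana", ["n", "a"]))
```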

My use case for this has been to ingest files in a language and then to check those files against the characters in the "orthography" and the characters on the keyboard for that language. If the documents show that authors are using characters which are not "in their orthography" but are "produced by their keyboard", then I have been taking note of that for my keyboard layout work.
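That check amounts to a few set operations. The Python sketch below is illustrative only; `off_orthography_chars` is a hypothetical name, and the orthography and keyboard inventories are assumed to be available as plain collections of characters.

```python
def off_orthography_chars(text, orthography, keyboard):
    """Report characters used in text that are outside the orthography."""
    used = {ch for ch in text if not ch.isspace()}
    outside = used - set(orthography)
    return {
        # Outside the orthography, yet reachable from the keyboard.
        "typed_from_keyboard": sorted(outside & set(keyboard)),
        # Outside the orthography and not on the keyboard either.
        "not_on_keyboard": sorted(outside - set(keyboard)),
    }


if __name__ == "__main__":
    print(off_orthography_chars("aqx", orthography="ab", keyboard="abq"))
```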

Non-concatenative pairing

  • In tonal languages which represent tone via diacritics, it is often the case that the tonal patterns are phonologically important. That is, the sequence of diacritics across the tops of vowels is just as important as digraphs or trigraphs. So, how can we compute whether these exist? If we have the types of tonal patterns (also known as melodies) in a list, then it makes sense to be able to count these as functional units (and to count across how many characters the melody occurs). Something like: find a diacritic, then length($string) to the next vowel, to match the next melody pattern. (On this point it might be good for me to look at the following tutorials 1 & 2)
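One possible way to approach this, sketched in Python under several assumptions: the tone marks and their labels in `TONE_MARKS` are illustrative, a melody is taken to be the sequence of tone marks within one whitespace-delimited word, and unmarked vowels are ignored. The names `word_melody` and `count_melodies` are hypothetical, and the span length of each melody is not computed here.

```python
import unicodedata
from collections import Counter

# Assumed mapping from combining tone marks to tone labels.
TONE_MARKS = {"\u0301": "H", "\u0300": "L", "\u0304": "M"}


def word_melody(word):
    """Return the tone labels of a word in order, e.g. 'HL'."""
    # Decompose so precomposed letters expose their combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS)


def count_melodies(text, melodies):
    """Count how many words carry each of the listed melodies."""
    counts = Counter()
    for word in text.split():
        melody = word_melody(word)
        if melody in melodies:
            counts[melody] += 1
    return counts


if __name__ == "__main__":
    print(count_melodies("b\u00e1d\u00e0 b\u00e0d\u00e1 b\u00e1d\u00e0", {"HL", "LH"}))
```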

Output clarifications

  • I sometimes wish that the Unicode NAME for a character was also available in a column. See: https://perldoc.perl.org/charnames.html
  • I wish that a list of the scripts from which characters are present in the input text could be output, with those scripts' IDs. See: https://perldoc.perl.org/Unicode/UCD.html#*charscript()* and https://perldoc.perl.org/5.8.8/perlunicode.html#Scripts
  • I wish that tabs (and, in general, characters which do not have glyphs) were not output without a visible glyph. I think two flags are needed here: one for tab-related issues only, and one for all glyphless characters. Tab is especially difficult: if the output of UCC is to be used as a data file, reading it as a tab-separated file is problematic when the character being reported is itself a tab. Other glyphless characters are just difficult to read without the Unicode names or without a unique glyph.
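A small Python sketch of what such output rows might look like, for illustration only. It swaps C0 control characters for the matching symbols in the Unicode Control Pictures block (U+2400 plus the code point), maps DEL to U+2421 and, following the list in this document, no-break space to U+2420, and adds a name column. The names `visible_glyph` and `output_row` and the column layout are assumptions, not UCC's actual format.

```python
import unicodedata


def visible_glyph(ch):
    """Return a printable stand-in for characters that have no glyph."""
    code = ord(ch)
    if code < 0x20:
        # C0 controls map directly onto the Control Pictures block.
        return chr(0x2400 + code)
    if code == 0x7F:
        return "\u2421"  # SYMBOL FOR DELETE
    if code == 0xA0:
        return "\u2420"  # SYMBOL FOR SPACE, as in this document's list
    return ch


def output_row(ch, count):
    """Build one tab-separated row: glyph, code point, name, count."""
    # Control characters have no name, so fall back to an empty string.
    name = unicodedata.name(ch, "")
    return "\t".join([visible_glyph(ch), f"U+{ord(ch):04X}", name, str(count)])


if __name__ == "__main__":
    print(output_row("\t", 3))
    print(output_row("a", 2))
```

With the tab replaced in the first column, the row can be split on tabs again without the reported character being mistaken for a separator.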

The following is a list of some characters and the glyphs that Unicode has registered for them:

,␁
,␂
,␃
,␄
,␅
,␆
,␇
,␈
,␋
,␌
,␎
,␏
,␐
,␑
,␒
,␓
,␔
,␕
,␖
,␗
,␘
,␙
,␚
,␛
±,±
,␜
,␝
,␞
,␟
�,␀

,␊

,␍
	,␉
,␡
§,§
&,&
&#x003C;,<
&#x003E;,>
&#x00A0;,␠
&#x1;,␁
&#x2;,␂
&#x3;,␃
&#x4;,␄
&#x5;,␅
&#x6;,␆
&#x7;,␇
&#x8;,␈
&#xB;,␋
&#xC;,␌
&#xE;,␎
&#xF;,␏
&#x10;,␐
&#x11;,␑
&#x12;,␒
&#x13;,␓
&#x14;,␔
&#x15;,␕
&#x16;,␖
&#x17;,␗
&#x18;,␘
&#x19;,␙
&#x1A;,␚
&#x1B;,␛
&#xB1;,±
&#x7F;,␡
&#xA7;,§
&#x26;,&
&#x22;,"
&#x3C;,<
&#x3E;,>
&#xA0;,␠
&#x1C;,␜
&#x1D;,␝
&#x1E;,␞
&#x1F;,␟
&#x0;,␀
&#xA;,␊
&#xD;,␍
&#x9;,␉
