
lang's People

Contributors

chrissimpkins, chuckxiong0211, davelab6, dependabot[bot], emmamarichal, felipesanches, gino-m, m4rc1e, moyogo, neilsureshpatel, rosawagner, simoncozens, sking-2003, vv-monsalve, yanone, zhaoxiong0211


lang's Issues

Are we happy with the API?

We only have four functions, all living in the module gflanguages.lang_support, which is also the only module we have. Couldn't these functions just live in the __init__ file?

Moving them to __init__ would give us more concise import statements, e.g.

from gflanguages import LoadScripts instead of the current from gflanguages.lang_support import LoadScripts.
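For illustration, a minimal sketch of the re-export (only LoadScripts and LoadLanguages are named in this issue; the other two functions would be re-exported the same way):

# Lib/gflanguages/__init__.py — sketch only.
# Re-export the public helpers so callers can import them from the package root.
from gflanguages.lang_support import (  # noqa: F401
    LoadLanguages,
    LoadScripts,
)

Both spellings would then work, so existing callers of gflanguages.lang_support wouldn't break.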

Yoruba sample text issue (or glyphset issue?)

@moyogo inspecting the changes of #16 in sandbox, some glyphs are still falling back to another font

(Screenshot, 2022-10-26: sample text with several glyphs falling back to a different font.)

I don't remember if this is the kind of issue where the sequence is
o dotbelow _ acutecomb instead of o acute _ dotbelowcomb (but I don't think it should change anything in terms of text composition),

Or if it is because the comb accents are missing from the Latin .nam glyphset.
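For reference, a quick way to inspect which mark ordering a sample actually contains (plain Python, no dependencies):

import unicodedata

# NFD places dot below (combining class 220) before acute (230), so a
# canonically ordered ọ́ always decomposes to o + dotbelowcomb + acutecomb.
for ch in unicodedata.normalize("NFD", "ọ́"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")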

Larger corpus

It would be interesting to consolidate various language data repos into this one, so we could use the data in e.g. regression testing, as a source of real-world words and sentences to test-shape with a font.

Data sources could be the Android Open Source Project UI strings, all of Wikipedia, the Universal Declaration of Human Rights, etc.

Regenerate pb2 files

Preface: it is perfectly possible to run diffenator by setting PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python.

How may I regenerate the Protobuf files with a newer protoc compiler?

I know that protobuf is pinned (#19); I'm just asking how to fix my own issue without changing your source tree.
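Not an answer, but for reference, a sketch of the regeneration step using the protoc bundled with grpcio-tools; the .proto path is a guess based on the generated module's name:

# Regenerates languages_public_pb2.py; adjust the paths to wherever
# languages_public.proto actually lives in the source tree.
from grpc_tools import protoc

protoc.main([
    "protoc",
    "--proto_path=Lib/gflanguages",
    "--python_out=Lib/gflanguages",
    "Lib/gflanguages/languages_public.proto",
])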

Consistency in orthography listing

As we prepare to implement shaperglot testing of African language support, I am noticing variation in the way language orthographies are incorporated in gflang. Here are a few examples:

bas_Latn

exemplar_chars {
  base: "a á à â ǎ ā {a᷆}{a᷇} b ɓ c d e é è ê ě ē {e᷆}{e᷇} ɛ {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆}{ɛ᷇} f g h i í ì î ǐ ī {i᷆}{i᷇} j k l m n ń ǹ ŋ o ó ò ô ǒ ō {o᷆}{o᷇} ɔ {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆}{ɔ᷇} p r s t u ú ù û ǔ ū {u᷆}{u᷇} v w y z {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇}"
  auxiliary: "q x"
  marks: "◌̀ ◌́ ◌̂ ◌̄ ◌̌ ◌᷆ ◌᷇"
  numerals: "  - ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9"
  index: "A B Ɓ C D E Ɛ F G H I J K L M N Ŋ O Ɔ P R S T U V W Y Z"
}

bin_Latn

exemplar_chars {
  base: "A B D E F G H I K L M N O P R S T U V W Y Z Á É È Ẹ Í Ó Ò Ọ Ú a b d e f g h i k l m n o p r s t u v w y z á é è ẹ í ó ò ọ ú \'"
  marks: "◌̀ ◌́ ◌̣"
}

af_Latn

exemplar_chars {
  base: "a á â b c d e é è ê ë f g h i î ï j k l m n o ô ö p q r s t u û v w x y z"
  auxiliary: "à å ä ã æ ç í ì ó ò ú ù ü ý"
  marks: "◌̀ ◌̂ ◌̈"
  numerals: "  - ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9"
  punctuation: "- ‐ ‑ – — , ; : ! ? . … \' ‘ ’ \" “ ” ( ) [ ] § @ * / & # † ‡ ′ ″"
  index: "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z"
}

The first inconsistency is that not all language profiles contain auxiliary bases when they should. And when auxiliary bases include a mark, the marks list doesn't always include those accents.

The second big inconsistency is the inclusion of non-precomposed base/mark pairs in the base list: sometimes these pairs are in the base list and sometimes they are not.

In order for shaperglot to properly parse gflang and run its orthography tests, we need some consistency in how the exemplar character lists are constructed. For the purposes of shaperglot, it is good for gflang to contain all necessary base/mark pairs, regardless of whether they can be precomposed. The variation appears to be caused by the incoming source data. (The bas_Latn entry reflects the data in CLDR, including the lack of spaces between certain bases.) Should we have a guideline that specifically spells out what needs to be included in base, auxiliary, and marks? (A sketch of an automated check follows the list below.)

Perhaps something like:

  • base: all primary characters of a language, including precomposed base/mark pairs and, when no precomposed character is encoded, non-composed base/mark pairs
  • auxiliary: all secondary characters of a language, including precomposed base/mark pairs and, when no precomposed character is encoded, non-composed base/mark pairs
  • marks: all standalone marks, whether they are primary or auxiliary
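A rough sketch of the automated check for the marks side of this guideline, assuming an exemplar_chars message loaded via gflanguages (NFD decomposition handles precomposed characters and {bracketed} sequences alike):

import unicodedata

def missing_marks(exemplar):
    """Combining marks used in base/auxiliary but absent from the marks list."""
    used = {
        ch
        for token in (exemplar.base + " " + exemplar.auxiliary).split()
        for ch in unicodedata.normalize("NFD", token.strip("{}"))
        if unicodedata.combining(ch)
    }
    listed = {ch for ch in exemplar.marks if unicodedata.combining(ch)}
    return used - listed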

Programmatic access to exemplar_chars without dropping index chars

For a given language textproto whose exemplar_chars contains index, e.g. Asturian:

exemplar_chars {
  base: "A Á B C D E É F G H Ḥ I Í L Ḷ M N Ñ O Ó P Q R S T U Ú Ü V X Y Z a á b c d e é f g h ḥ i í l ḷ m n ñ o ó p q r s t u ú ü v x y z"
  auxiliary: "À Ă Â Å Ä Ã Ā Æ Ç È Ĕ Ê Ë Ē Ì Ĭ Î Ï Ī J K Ò Ŏ Ô Ö Ø Ō Œ Ù Ŭ Û Ū W Ÿ ª à ă â å ä ã ā æ ç è ĕ ê ë ē ì ĭ î ï ī j k º ò ŏ ô ö ø ō œ ù ŭ û ū w ÿ"
  marks: "◌́ ◌̃ ◌̈ ◌̣"
  numerals: "- , . % + 0 1 2 3 4 5 6 7 8 9"
  punctuation: "- – — , ; : ! ¡ ? ¿ . … \' ‘ ’ \" “ ” « » ( ) [ ] @ * / \\ & #"
  index: "A B C D E F G H I L M N Ñ O P Q R S T U V X Y Z"
}

…the Python interface returns base/auxiliary with the index chars, and any components thereof, subtracted, so e.g.:

from gflanguages import LoadLanguages
LoadLanguages()["ast_Latn"].exemplar_chars.base
> 'a á b c d e é f g h ḥ i í l ḷ m n ñ o ó p q r s t u ú ü v x y z'

I'd expect the return to be: 'A Á B C D E É F G H Ḥ I Í L Ḷ M N Ñ O Ó P Q R S T U Ú Ü V X Y Z a á b c d e é f g h ḥ i í l ḷ m n ñ o ó p q r s t u ú ü v x y z'

The return without the index characters/capitals is unexpected, since other languages without an index return upper and lower case combined, e.g.

LoadLanguages()["arn_Latn"].exemplar_chars.base
> 'A B C D E F G I J K L M N O P Q R S T U W X Y Z Ü Ñ a b c d e f g i j k l m n o p q r s t u w x y z ü ñ'

Is this intentional? And how can one get the full base without the magic removal of index characters and composites?
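As a stopgap (leaving aside whether the behaviour is intentional), merging index back into base at least recovers the plain capitals, though not subtracted precomposed forms such as Á; a sketch, using the API shown above:

from gflanguages import LoadLanguages

chars = LoadLanguages()["ast_Latn"].exemplar_chars
# Note: this does NOT restore precomposed uppercase forms (Á, É, …)
# that were subtracted together with their base letters.
full_base = " ".join(sorted(set(chars.base.split()) | set(chars.index.split())))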

Support separate orthographies

Some languages have multiple orthographies within the same script. For example, yo_Latn could have both the current yo_Latn_NG and the missing yo_Latn_BJ.

Language ≠ Script

The ‘Language’ dropdown menu contains an inconsistent mix of script and language identifiers. ‘Bengali’, for example, could refer either to the Bengali* language or to the Bengali script, which is also used to write several other languages. ‘Devanagari’, on the other hand, refers only to the script, which is used to write numerous languages that exhibit significant orthographic differences, differences that may be reflected in font selection or OpenType Layout behaviour; such orthographic differences are explicitly recognised in the three types of Chinese identified in the list. ‘Vietnamese’ appears to be the only entry in the ‘Language’ list that unambiguously refers to a language rather than a script.

I would like to suggest that Google Fonts consider moving to a system of two linked menus, one for script and one for language, with the content of the latter depending on the selection of the former. This would enable finer-grained browsing of fonts and more accurate selection of results, especially if fonts contain a meta table that identifies primary design script or language tags, which could be taken into account when prioritising results. For example, almost all Devanagari fonts will support the Hindi language reasonably well, but some might be particularly tailored to the orthographic conventions of Marathi or Sanskrit (as is the case with fonts in the Tiro Indic collection). Even if fonts do not contain meta tables, it would be a good idea for Google Fonts to tag fonts in a way that prioritises browsing results by script and language; this would also make it easier to improve specimen language selection. At the moment, for example, all Devanagari fonts show the same Hindi specimen texts, regardless of the orthography targets of the fonts.

  * Bangla is the preferred endonym for both the language and script.

Add "source" metadata field

Adding the Tangsa sample text in #25 reminded me that, at some point, someone's going to look at that and think "What is that text? Where did we get it from? Is it the UDHR or is it something else? Where do we report bugs in it?" They might find the commit log and possibly trace it back to the PR (although I didn't put any notes in the commit log, just in the PR body), but it would be more robust if we could annotate the textproto files themselves with this kind of source information.

I suppose if we don't really need it as a field, we could just add comments.
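For concreteness, a hypothetical annotation; no source field exists in languages_public.proto today, so until one is added a comment would carry the same information:

sample_text {
  # Hypothetical field, not yet in the schema:
  source: "UDHR Article 1, via https://unicode.org/udhr/"
}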

Broken clusters in various sample texts

The following sample texts do not contain correct orthographic clusters:

  • S'gaw Karen (ksw_Mymr): "ဟီၣၲ"
  • Zanabazar Square (sa_Zamb): "𑨬𑨴𑩇𑨂"
  • Pwo (pwo_Mymr): "အခံွးအရ့ၩ" (Maybe it should be ခွံး?)
  • Sanskrit in Limbu (sa_Limb): "ᤐ᤺ᤢᤷᤏᤖ᤺ᤢᤐᤣᤏ"
  • Limbu (lif_Limb): "ᤆᤁᤡ ᤁᤧᤘᤠ᤹ᤒᤠ"
  • Sanskrit in Gunjala Gondi (sa_Gong): "𑶍𑶗𑶖𑵡 "
  • Chakma (ccp_Cakm): "𑄖𑄧𑄖𑄧𑄧𑄱" (This looks like HarfBuzz's cluster definition has changed)
  • NP Hmong (hmn_Hmnp): "𞄰𞄱"

For the three Sanskrit texts, something obviously went wrong with the transliteration. For the rest, I don't know where we got those texts from.
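For anyone wanting to reproduce the cluster analysis, a minimal sketch that dumps the cluster values HarfBuzz assigns, assuming uharfbuzz and a font with coverage (the font path is a placeholder):

import uharfbuzz as hb

blob = hb.Blob.from_file_path("NotoSansMyanmar-Regular.ttf")  # placeholder font
font = hb.Font(hb.Face(blob))

buf = hb.Buffer()
buf.add_str("ဟီၣၲ")  # the S'gaw Karen sample above
buf.guess_segment_properties()
hb.shape(font, buf)

# Glyphs that share a cluster value belong to the same cluster; an
# unexpected boundary in the middle of a syllable indicates a broken sample.
for info in buf.glyph_infos:
    print(f"gid={info.codepoint} cluster={info.cluster}")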

tk_Arab and ku_Cyrl are actually Latin

id: "tk_Arab"
script: "Arab"
...
exemplar_chars {
  base: "a b ç d e ä f g h i j ž k l m n ň o ö p r s ş t u ü w y ý z"
  auxiliary: "c q v x"
  numerals: "  - ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9"
  punctuation: "- ‑ – — , ; : ! ? . … \" “ ” ( ) [ ] { } § @ * # { } { } { } { } { } { } { } { } { } { } { } { } { } { } { }"
  index: "A B Ç D E Ä F G H I J Ž K L M N Ň O Ö P R S Ş T U Ü W Y Ý Z"
}
id: "ku_Cyrl"
language: "ku"
script: "Cyrl"
...
exemplar_chars {
  base: "a b c ç d e ê f g h i î j k l m n o p q r s ş t u û v w x y z"
  auxiliary: "á à ă â å ä ã ā æ è ĕ ë ē é ì ĭ ï ī í ñ ó ò ŏ ô ø ō œ ß ù ŭ ū ú ÿ"
  marks: "◌̆ ◌̈"
  punctuation: "- ‐ ‑ – — , ; : ! ? . … \' ‘ ’ \" “ ” ( ) [ ] § @ * / & # † ‡ ′ ″"
  index: "A B C Ç D E Ê F G H I Î J K L M N O P Q R S Ş T U Û V W X Y Z"
}

I don't know how to fix this, because I'm not sure how they were generated. Just delete them?

Adding wordlists to this repo

Diffenator2 uses wordlists in order to check for font regressions. I currently include these wordlists in its repo, https://github.com/m4rc1e/diffenator2/tree/main/diffenator/data/wordlists (the repo will move to googlefonts/ soon).

I'm very tempted to move this data to this repo so more people can benefit from it without having to install diffenator2 and its dependencies. Another solution is to make a new repo just for these wordlists. This may make more sense, since it's going to be a fair chunk of data and the words have been constructed from Wikipedia, which has a CC license.

cc @madig @simoncozens

cut a new release

I want to merge #19 and cut a new release. I don't actually maintain this repo, so is there anything I should be aware of?

I realise this repo is a subtree in google/fonts, but are there any other nasty surprises?

cc @moyogo, @vv-monsalve, @chrissimpkins

Does not work with latest protobuf

I could use gflanguages with protobuf 3.17.3; when it is installed as part of gftools (which requires protobuf with no version constraint, and hence picks up 4.21.1), I get:

   File "/home/runner/work/latin-greek-cyrillic/latin-greek-cyrillic/venv/lib/python3.8/site-packages/gftools/packager.py", line 42, in <module>
     from gflanguages import LoadLanguages
   File "/home/runner/work/latin-greek-cyrillic/latin-greek-cyrillic/venv/lib/python3.8/site-packages/gflanguages/__init__.py", line 28, in <module>
     from gflanguages import languages_public_pb2
   File "/home/runner/work/latin-greek-cyrillic/latin-greek-cyrillic/venv/lib/python3.8/site-packages/gflanguages/languages_public_pb2.py", line 36, in <module>
     _descriptor.FieldDescriptor(
   File "/home/runner/work/latin-greek-cyrillic/latin-greek-cyrillic/venv/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 560, in __new__
     _message.Message._CheckCalledFromGeneratedFile()
 TypeError: Descriptors cannot not be created directly.
 If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
 If you cannot immediately regenerate your protos, some other possible workarounds are:
  1. Downgrade the protobuf package to 3.20.x or lower.
  2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

Note that while gflanguages pins protobuf to a specific version in its requirements.txt, it does not pin it in setup.py.
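If the fix is simply to mirror the requirements.txt pin in setup.py, a sketch of that side; the exact bounds are illustrative, not the project's actual policy:

from setuptools import setup

setup(
    name="gflanguages",
    # Illustrative bounds only: keep the runtime protobuf in the 3.x
    # range the checked-in _pb2 files were generated for.
    install_requires=["protobuf>=3.7.0,<3.21"],
)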

Out-of-script characters in sd_Khoj and sd_Khud

sd_Khoj now contains quite a few Arabic letters (ڪ, ٿ, ٻ, ڀ, چ, ڻ):

𑈧𑈥𑈺ڪ𑈀𑈞 ... 𑈟𑈥𑈛𑈬ٿ𑈥𑈬 ... ڪ𑈥𑈺 ... 𑈺ٿ𑈥𑈨 ... 𑈺ڪ𑈦𑈥𑈺 ... ڪ𑈥𑈺𑈪ڪ𑈺ٻ𑈥𑈥𑈺𑈩𑈧𑈞𑈺ڀ𑈧𑈥𑈥چ𑈀𑈦𑈺𑈥𑈺𑈨𑈬𑈦𑈨𑈺𑈩𑈬𑈨ڪ𑈺𑈀𑈉𑈶𑈙𑈥𑈬𑈦𑈺ڪ𑈦ڻ𑈺

Khudawadi has the same issue. I'm guessing Aksharamukha gave up on letters that it couldn't directly put into these scripts.

There's clearly a problem here with our test suite, which should have caught this. I'll look into fixing the tests; a sketch of such a check follows the lists below.

@SKing-2003, could you prepare another PR with fixes to these files? According to the Khojki Unicode proposal:

  • Kaf ڪ needs to be replaced with 𑈈.
  • Rnoon ڻ is equivalent to 𑈘.
  • Bha ڀ goes to 𑈣.
  • Bba ٻ goes to 𑈢.

I'm guessing چ should be 𑈏, and ٿ should be... 𑈔?

And for Khudawadi:

  • Kaf ڪ needs to be replaced with 𑊺.
  • Rnoon ڻ is equivalent to 𑋌.
  • Bha ڀ goes to 𑋖.
  • Bba ٻ goes to 𑋕.
  • Cheh چ should be 𑋁.
  • Teheh ٿ should be 𑋆.
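Here is a rough sketch of the kind of test that would have caught this, using the third-party regex module's Unicode script properties (Common and Inherited cover punctuation, digits, and combining marks):

import regex  # third-party; supports \p{Script=...}

def out_of_script(text, script):
    """Characters whose script is neither `script` nor Common/Inherited."""
    allowed = regex.compile(
        rf"[\p{{Script={script}}}\p{{Script=Common}}\p{{Script=Inherited}}]"
    )
    return [ch for ch in text if not allowed.fullmatch(ch)]

# Should flag the stray Arabic kaf in the Khojki sample:
print(out_of_script("𑈧𑈥𑈺ڪ𑈀𑈞", "Khojki"))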
