googlefonts / lang
A Python API for evaluating language support in the Google Fonts collection.
License: Apache License 2.0
After a complaint from a user (https://twitter.com/fauxparse/status/1602944061023088643) saying the macron accents in the Maori sample text were removed, I wanted to check the textproto, but couldn't find one. So I guess we need to add one to fix the issue?
We only have four functions, which live in the module gflanguages.lang_support. This module is also the only module that we have. Couldn't these functions just live in the __init__ file?
Moving them to __init__ means we'd get a more concise import statement, e.g. from gflanguages import LoadScripts. We currently have to do from gflanguages.lang_support import LoadScripts.
@moyogo, inspecting the changes of #16 in sandbox, some glyphs are still falling back to another font.
I don't remember if it is this kind of issue:
o dotbelow _ acutecomb
instead of o acute _ dotbelowcomb
(but I don't think it should change anything in terms of text composition)
Or if it is because the comb accents are missing from the Latin .nam glyphset.
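Whichever order the marks are stored in, the two decomposed spellings above are canonically equivalent, so Unicode normalization makes them compare equal. A quick stdlib check (a sketch, independent of the fonts in question):

```python
import unicodedata

# "o" with dot below + acute can be typed with the marks in either
# order; canonical reordering sorts them by combining class
# (dot below = 220, acute = 230), so both spellings normalize
# to the same string.
s1 = "o\u0323\u0301"  # o + combining dot below + combining acute
s2 = "o\u0301\u0323"  # o + combining acute + combining dot below
print(unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2))  # True
```

So the mark order shouldn't matter for text composition, only for how a font's mark-attachment rules are exercised.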
It would be interesting to consolidate various language data repos into this one, so we can use it in e.g. regression testing, as a source of words and sentences to test-shape for a font: real-world data to use for testing.
Data sources could be the Android Open Source Project UI strings, all of Wikipedia, the Universal Declaration of Human Rights, etc.
chn_Dupl.textproto has region: "US". However, Chinook Jargon was written in Duployan in Canada, not in the United States.
Preface: it is perfectly possible to run diffenator with the help of PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python.
How may I regenerate the Protobuf files with a newer protoc compiler?
I know that protobuf is pinned (#19); I'm just asking how to fix my own issue without changing your source tree.
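For anyone hitting the same wall: the pure-Python fallback can also be set from inside a script, as long as it happens before any *_pb2 module is imported (a workaround sketch, not a fix for the pinned version):

```python
import os

# Must be set before the first import of a generated *_pb2 module;
# the pure-Python implementation skips the generated-code version
# check, at the cost of much slower parsing.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# ...now it is safe to import modules that pull in protobuf, e.g.:
# from gflanguages import LoadLanguages
```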
The 2 scripts are not listed in the "Language" drop-down menu, so I guess it is connected. Could we add the language tags into METADATA.pb for these 2 fonts?
New African language profiles from PR #52 can be further improved with index and punctuation data. These were not collected for that PR since the immediate need was for exemplar data.
As we prepare to implement shaperglot for testing African language support, I am noticing that there is variation in the way language orthographies are incorporated in gflang. Here are a few examples:
bas_Latn
exemplar_chars {
base: "a á à â ǎ ā {a᷆}{a᷇} b ɓ c d e é è ê ě ē {e᷆}{e᷇} ɛ {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆}{ɛ᷇} f g h i í ì î ǐ ī {i᷆}{i᷇} j k l m n ń ǹ ŋ o ó ò ô ǒ ō {o᷆}{o᷇} ɔ {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆}{ɔ᷇} p r s t u ú ù û ǔ ū {u᷆}{u᷇} v w y z {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇}"
auxiliary: "q x"
marks: "◌̀ ◌́ ◌̂ ◌̄ ◌̌ ◌᷆ ◌᷇"
numerals: " - ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9"
index: "A B Ɓ C D E Ɛ F G H I J K L M N Ŋ O Ɔ P R S T U V W Y Z"
}
bin_Latn
exemplar_chars {
base: "A B D E F G H I K L M N O P R S T U V W Y Z Á É È Ẹ Í Ó Ò Ọ Ú a b d e f g h i k l m n o p r s t u v w y z á é è ẹ í ó ò ọ ú \'"
marks: "◌̀ ◌́ ◌̣"
}
af_Latn
exemplar_chars {
base: "a á â b c d e é è ê ë f g h i î ï j k l m n o ô ö p q r s t u û v w x y z"
auxiliary: "à å ä ã æ ç í ì ó ò ú ù ü ý"
marks: "◌̀ ◌̂ ◌̈"
numerals: " - ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9"
punctuation: "- ‐ ‑ – — , ; : ! ? . … \' ‘ ’ \" “ ” ( ) [ ] § @ * / & # † ‡ ′ ″"
index: "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z"
}
The first inconsistency is that not all language profiles contain auxiliary bases when they should. When auxiliary bases include a mark, the marks list doesn't always include those accents.
The second big inconsistency is the inclusion of non-precomposed base/mark pairs in the base list. Sometimes these pairs are in the base list and sometimes they are not.
In order for shaperglot to properly parse gflang to run its orthography tests, we need some consistency in how the exemplar character lists are constructed. For the purposes of shaperglot, it is good to have gflang contain all necessary base/mark pairs regardless of whether they can be precomposed or not. It appears that the variation is caused by the incoming source data. (The bas_Latn entry reflects the data in CLDR, including the lack of spaces between certain bases.) Should we have a guideline that specifically spells out what needs to be included in bases, auxiliary, and marks?
Perhaps something like:
-bases: all primary characters of a language, including precomposed base/mark pairs and, when a precomposed character is not encoded, non-composed base/mark pairs
-auxiliary: all secondary characters of a language, including precomposed base/mark pairs and, when a precomposed character is not encoded, non-composed base/mark pairs
-marks: all standalone marks whether they are primary or auxiliary
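For what it's worth, a consumer like shaperglot could split these exemplar strings uniformly whichever convention is used, treating {…} groups as single clusters even when they are not space-separated; a minimal sketch (the function name is mine, not part of gflanguages):

```python
import re

def split_exemplars(s: str) -> list[str]:
    """Split an exemplar_chars string into clusters, treating a
    brace-wrapped sequence like {ɛ́} as one cluster even when it is
    not space-separated from its neighbours."""
    return [m.group(1) or m.group(2)
            for m in re.finditer(r"\{([^}]+)\}|(\S+)", s)]

print(split_exemplars("a á {ɛ́}{ɛ̀} b"))
```

This makes the presence or absence of spaces between brace groups irrelevant to downstream tooling, though a documented guideline would still be cleaner.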
For a given language textproto with exemplar_chars containing index, like e.g. Asturian:
exemplar_chars {
base: "A Á B C D E É F G H Ḥ I Í L Ḷ M N Ñ O Ó P Q R S T U Ú Ü V X Y Z a á b c d e é f g h ḥ i í l ḷ m n ñ o ó p q r s t u ú ü v x y z"
auxiliary: "À Ă Â Å Ä Ã Ā Æ Ç È Ĕ Ê Ë Ē Ì Ĭ Î Ï Ī J K Ò Ŏ Ô Ö Ø Ō Œ Ù Ŭ Û Ū W Ÿ ª à ă â å ä ã ā æ ç è ĕ ê ë ē ì ĭ î ï ī j k º ò ŏ ô ö ø ō œ ù ŭ û ū w ÿ"
marks: "◌́ ◌̃ ◌̈ ◌̣"
numerals: "- , . % + 0 1 2 3 4 5 6 7 8 9"
punctuation: "- – — , ; : ! ¡ ? ¿ . … \' ‘ ’ \" “ ” « » ( ) [ ] @ * / \\ & #"
index: "A B C D E F G H I L M N Ñ O P Q R S T U V X Y Z"
}
…the Python interface will return base / auxiliary with the index chars, and any components thereof, subtracted, so e.g.:
from gflanguages import LoadLanguages
LoadLanguages()["ast_Latn"].exemplar_chars.base
> 'a á b c d e é f g h ḥ i í l ḷ m n ñ o ó p q r s t u ú ü v x y z'
I'd expect the return to be: 'A Á B C D E É F G H Ḥ I Í L Ḷ M N Ñ O Ó P Q R S T U Ú Ü V X Y Z a á b c d e é f g h ḥ i í l ḷ m n ñ o ó p q r s t u ú ü v x y z'
The return without the index characters / capitals is unexpected when other languages without the index will return upper and lower case combined, e.g.
LoadLanguages()["arn_Latn"].exemplar_chars.base
> 'A B C D E F G I J K L M N O P Q R S T U W X Y Z Ü Ñ a b c d e f g i j k l m n o p q r s t u w x y z ü ñ'
Is this intentional? And how can you get the full base without the magic removal of index characters and composites?
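Until that behaviour is settled, a caller can rebuild a bicameral list from the lowercase-only return with a naive upper-casing pass (a workaround sketch only; it ignores multi-character {…} sequences and any index characters that are not simple case pairs):

```python
def with_uppercase(base: str) -> str:
    """Prepend uppercase counterparts to a lowercase exemplar string."""
    chars = base.split()
    # str.upper() handles precomposed accented letters like á -> Á;
    # caseless characters are skipped.
    upper = [c.upper() for c in chars if c.upper() != c]
    return " ".join(upper + chars)

print(with_uppercase("a á b"))  # 'A Á B a á b'
```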
Add uppercase (except when the orthography is monocameral) like #53; also remove ẞ from the de_Latn index after it's added to the de_Latn base.
Could you move the build system to PEP 517?
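A minimal PEP 517 migration might just declare the build backend in pyproject.toml; a sketch assuming the project keeps setuptools and setuptools_scm (the version bounds are illustrative, not taken from the repo):

```toml
# pyproject.toml — build-backend declaration (PEP 517/518)
[build-system]
requires = ["setuptools>=45", "setuptools_scm[toml]>=6.2"]
build-backend = "setuptools.build_meta"
```

The rest of the metadata could stay in setup.py/setup.cfg initially and be moved into a [project] table later.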
Some languages have multiple orthographies within the same script. For example yo_Latn could have both the current yo_Latn_NG and the missing yo_Latn_BJ.
Remove "◌̒" from lv, ltg, prg as it's not used in the orthography (but may be used as a component for ģ).
Follow up to #81.
The ‘Language’ dropdown menu contains an inconsistent mix of script and language identifiers. ‘Bengali’, for example, could refer either to the Bengali language or to the Bengali script, which is also used to write several other languages. ‘Devanagari’, on the other hand, refers only to the script, which is used to write numerous languages that exhibit significant orthographic differences that may be reflected in font selection or OpenType Layout behaviour; such orthographic differences are explicitly recognised in the three types of Chinese identified in the list. ‘Vietnamese’ appears to be the only entry in the ‘Language’ list that unambiguously refers to a language rather than a script.
I would like to suggest that Google Fonts consider moving to a system of two linked menus: one for script and one for language, with the content of the latter depending on the selection of the former. This would enable finer-grained browsing of fonts and more accurate selection of results, especially if fonts contain a meta table that identifies primary design script or language tags, which could be taken into account in prioritising results. So, for example, almost all Devanagari fonts will support the Hindi language reasonably well, but might be particularly tailored to the orthographic conventions of Marathi or Sanskrit (as is the case with fonts in the Tiro Indic collection). Even if fonts do not contain meta tables, it would be a good idea for Google Fonts to tag fonts in a way that prioritises browsing results by script and language; this would also make it easier to improve specimen language selection. At the moment, for example, all Devanagari fonts show the same Hindi specimen texts, regardless of the orthography targets of the fonts.
Adding the Tangsa sample text in #25 reminded me that, at some point, someone's going to look at that and think "What is that text? Where did we get it from? Is it the UDHR or is it something else? Where do we report bugs in it?" And they might find the commit log and possibly trace it back to the PR (although I didn't put any notes in the commit log, just in the PR body), but it would be more robust if we could annotate the textproto files themselves with this kind of source information.
I suppose if we don't really need it as a field, we could just add comments.
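If a comment is enough, textproto supports # comments, so the provenance could ride along in the file itself; the wording and placement below are purely illustrative:

```textproto
# Sample text source: contributed by <contributor> in PR #25
# (not taken from the UDHR).
# Report bugs in the text in a new issue on this repo.
sample_text {
  # ...
}
```

A dedicated field would survive round-tripping through protobuf tooling, which strips comments, so that trade-off is worth considering.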
Is there a reason to restrict the setuptools_scm version to <6.1?
The following sample texts do not contain correct orthographic clusters:
For the three Sanskrit texts, something obviously went wrong with the transliteration. For the rest, I don't know where we got those texts from.
id: "tk_Arab"
script: "Arab"
...
exemplar_chars {
base: "a b ç d e ä f g h i j ž k l m n ň o ö p r s ş t u ü w y ý z"
auxiliary: "c q v x"
numerals: " - ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9"
punctuation: "- ‑ – — , ; : ! ? . … \" “ ” ( ) [ ] { } § @ * #"
index: "A B Ç D E Ä F G H I J Ž K L M N Ň O Ö P R S Ş T U Ü W Y Ý Z"
}
id: "ku_Cyrl"
language: "ku"
script: "Cyrl"
...
exemplar_chars {
base: "a b c ç d e ê f g h i î j k l m n o p q r s ş t u û v w x y z"
auxiliary: "á à ă â å ä ã ā æ è ĕ ë ē é ì ĭ ï ī í ñ ó ò ŏ ô ø ō œ ß ù ŭ ū ú ÿ"
marks: "◌̆ ◌̈"
punctuation: "- ‐ ‑ – — , ; : ! ? . … \' ‘ ’ \" “ ” ( ) [ ] § @ * / & # † ‡ ′ ″"
index: "A B C Ç D E Ê F G H I Î J K L M N O P Q R S Ş T U Û V W X Y Z"
}
I don't know how to fix these, because I'm not sure how they were generated. Just delete them?
Link to #64
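The mismatch is easy to flag mechanically: every base character in these entries is Latin even though the declared script is Arab/Cyrl. A crude stdlib check (the stdlib exposes no Unicode Script property, so this leans on character names; the function name is mine):

```python
import unicodedata

def looks_latin(ch: str) -> bool:
    """Crude script test via the Unicode character name."""
    try:
        return unicodedata.name(ch).startswith("LATIN")
    except ValueError:  # unnamed characters (controls, unassigned)
        return False

# base string quoted from the ku_Cyrl entry above
ku_cyrl_base = "a b c ç d e ê f g h i î j k l m n o p q r s ş t u û v w x y z"
print(all(looks_latin(c) for c in ku_cyrl_base.split()))  # True: all Latin, wrong for Cyrl
```

A proper test-suite check would use a real Script-property lookup, but even this would have caught both entries.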
We have a sample text for Tifinagh, but Akatab only supports a subset of it. We can't push to prod until these subsets are defined.
cc @jvgaultney and @devosb
Diffenator2 uses wordlists in order to check for font regressions. I currently include these wordlists in its repo, https://github.com/m4rc1e/diffenator2/tree/main/diffenator/data/wordlists (the repo will move to googlefonts/ soon).
I'm very tempted to move this data to this repo so more people can benefit from it without having to install diffenator2 and its dependencies. Another solution is to make a new repo just for these wordlists. This may make more sense since it's going to be a fair chunk of data, and the words have been constructed from Wikipedia, which has a CC license.
Same issue as in #31 that was fixed in https://github.com/google/fonts/pull/5819/files
Narnoor in sandbox shows only Latin.
I was not here when Mingzat got fixed. @chrissimpkins @tomasdev, is there anything that needs to be done prior to adding languages in METADATA.pb?
I want to merge #19 and cut a new release. I don't actually maintain this repo, so is there anything I should be aware of?
I realise this repo is a subtree in google/fonts, but are there any other nasty surprises?
This appears to have started on the main branch as of commit d885b2e
I could use gflanguages with protobuf 3.17.3; if installed as part of the installation of gftools (which requires protobuf with no version dependency, and hence picks up 4.21.1), I get
File "/home/runner/work/latin-greek-cyrillic/latin-greek-cyrillic/venv/lib/python3.8/site-packages/gftools/packager.py", line 42, in <module>
from gflanguages import LoadLanguages
File "/home/runner/work/latin-greek-cyrillic/latin-greek-cyrillic/venv/lib/python3.8/site-packages/gflanguages/__init__.py", line 28, in <module>
from gflanguages import languages_public_pb2
File "/home/runner/work/latin-greek-cyrillic/latin-greek-cyrillic/venv/lib/python3.8/site-packages/gflanguages/languages_public_pb2.py", line 36, in <module>
_descriptor.FieldDescriptor(
File "/home/runner/work/latin-greek-cyrillic/latin-greek-cyrillic/venv/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 560, in __new__
_message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
Note that while gflanguages depends on a certain protobuf version in its requirements.txt, it does not depend on a pinned version in setup.py.
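One way to address that last point: mirror the requirements.txt pin in install_requires so pip's resolver enforces it at install time (the exact range below is an assumption, not the project's actual pin):

```python
# Sketch of a setup.py fragment; the version range is illustrative.
INSTALL_REQUIRES = [
    # The generated _pb2 code predates protobuf 4's generated-code
    # version check, so cap below 4.x until the protos are
    # regenerated with a newer protoc.
    "protobuf>=3.7.0,<4",
]

# In setup.py:
# setup(..., install_requires=INSTALL_REQUIRES)
```

With the pin only in requirements.txt, downstream installs via gftools never see it, which is exactly the failure mode above.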
sd_Khoj now contains quite a few Arabic letters (ڪ, ٿ, ٻ, ڀ, چ, ڻ):
𑈧𑈥𑈺ڪ𑈀𑈞 ... 𑈟𑈥𑈛𑈬ٿ𑈥𑈬 ... ڪ𑈥𑈺 ... 𑈺ٿ𑈥𑈨 ... 𑈺ڪ𑈦𑈥𑈺 ... ڪ𑈥𑈺𑈪ڪ𑈺ٻ𑈥𑈥𑈺𑈩𑈧𑈞𑈺ڀ𑈧𑈥𑈥چ𑈀𑈦𑈺𑈥𑈺𑈨𑈬𑈦𑈨𑈺𑈩𑈬𑈨ڪ𑈺𑈀𑈉𑈶𑈙𑈥𑈬𑈦𑈺ڪ𑈦ڻ𑈺
Khudawadi also has the same issue. I'm guessing Aksharamukha gave up on letters that it couldn't directly put into these scripts.
There's clearly a problem here with our test suite, which should have caught this. I'll look into fixing the tests.
@SKing-2003, could you prepare another PR with fixes to these files? According to the Khojki Unicode proposal:
I'm guessing چ should be 𑈏, and ٿ should be... 𑈔?
And for Khudawadi:
The specimen in idu_Latn does not correspond with the Idoma orthography description.
See upstream Unicode UDHR issue eric-muller/udhr#103