eticaai / hxltm

HXLTM - Multilingual Terminology in Humanitarian Language Exchange. TBX, TMX, XLIFF, UTX, XML, CSV, Excel XLSX, Google Sheets, (...)

Home Page: https://hxltm.etica.ai

License: The Unlicense

Python 88.65% Shell 11.35%
humanitarian-exchange-language hxl translation l10n i18n xliff tbx tmx utx

hxltm's Introduction

HXLTM - ontologia and initial reference implementation

Site · GitHub · PyPI: hxltm-eticaai

The Multilingual Terminology in Humanitarian Language Exchange (abbreviation: HXLTM) is a valid HXLated tabular format (strictly, a documented subset of the HXL Standard, started by HXL-CPLP) with a strong focus on storing community-contributed translations and multilingual terminology while maximizing portability for implementers.

The documentation is available at https://hxltm.etica.ai/.

pip install hxltm-eticaai --upgrade
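
As a quick illustration of the intended workflow, here is a sketch that writes a tiny HXLated table and exports it as TMX. It is not normative: the HXL hashtags and the positional CLI arguments follow the style of the HXLTM documentation, but consult hxltmcli --help and https://hxltm.etica.ai/ for the exact schema (the --objectivum-TMX option is the one named in the issues below).

# Illustrative sketch only: the hashtags and positional CLI arguments
# follow the HXLTM documentation style, but are not normative.
import csv
import subprocess

rows = [
    ["codicem", "rem (eng)", "rem (por)"],        # human-readable headers
    ["#item+conceptum+codicem",                   # HXL hashtag row
     "#item+rem+i_eng+is_latn",
     "#item+rem+i_por+is_latn"],
    ["C001", "water", "água"],
]

with open("exemplum.tm.hxl.csv", "w", newline="", encoding="utf-8") as fp:
    csv.writer(fp).writerows(rows)

# Export to Translation Memory eXchange (TMX) v1.4b.
subprocess.run(
    ["hxltmcli", "exemplum.tm.hxl.csv", "exemplum.tmx", "--objectivum-TMX"],
    check=True,
)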

Licenses

Public domain means each issue only needs to be resolved once

Software license

Public Domain

To the extent possible under law, Etica.AI and non-anonymous collaborators have waived all copyright and related or neighboring rights to this work, dedicating it to the public domain.

Optionally, the BSD Zero Clause License (0BSD) is an explicit alternative to the Unlicense, as an older license approved by the Open Source Initiative:

SPDX-License-Identifier: Unlicense OR 0BSD

Creative content license (algorithms and concepts as pivot exchange for other data standards, and user documentation)

Public Domain

To the extent possible under law, Etica.AI and non-anonymous collaborators have waived all copyright and related or neighboring rights to this work via public domain dedication. As of 2021, this is the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication, with an additional right granted to you upfront:

  • The collaborators explicitly waive any potential patent rights for algorithms/ideas. We also preemptively ask that any potential requests from our heirs, in an unforeseen future and in any jurisdiction, be ignored by any regional or international court.
More context about this

This different license for creative content is mostly for lawyers who would complain about use of the Unlicense for creative content. More context (from the open source point of view) on waiving patent rights explicitly (since no better license for creative content exists) is here: https://opensource.org/faq#cc-zero.

There is no interest on our side in patent trolling (for monetary gain) or in allowing copyright abuse (to coerce companies, organizations, or governments), even if:

  • we directly and strongly disagree with an entity;
  • some entity tries to use us as a proxy to enforce some sort of boycott against any other entity.

Note that data exchange in a humanitarian context, even outside global war-time, is already quite complex, and the need for accurate linguistic content conversion makes it even more critical to not have known errors. While the stories behind cases like "黙殺" ("mokusatsu") are disputable, the modern tooling to deal with multilingual terminology (including that used to create dictionaries) is prone to human error.


hxltm's Issues

`hxltmcli .po`: HXL Trānslātiōnem Memoriam -> Gettext .PO

`hxltmcli .asa.hxltm.json / .asa.hxltm.yml`: HXLTM Abstractum Syntaxim Arborem

# @ARCHIVUM       ontologia/cor.hxltm.yml
# @DESCRIPTIONEM  HXL Trānslātiōnem Memoriam (HXLTM)
# @LICENTIAM      Dominium publicum
formatum:
  # (...)

  HXLTM-ASA:
    __meta:
      archivum_extensionem: 
        - .asa.hxltm.json
        - .asa.hxltm.yml
      normam:
        - https://hdp.etica.ai/hxltm/archivum/#HXLTM-ASA
      descriptionem: |
        _[eng-Latn]
        The HXLTM-ASA is a not strictly documented Abstract Syntax Tree
        of a data conversion operation.

        This format, different from the HXLTM permanent storage, is not
        meant to be used by end users. In fact, JSON (or other formats,
        like YAML) here is more a tool for users debugging the initial
        reference implementation hxltmcli, OR for developers using JSON
        as a more advanced input than the end user permanent storage.

        Warning: The HXLTM-ASA is not meant to be a strictly documented
        format even if HXLTM eventually gets used by a large public. If
        necessary, some special format could be created, but this would
        require feedback from the community or some work already done by
        implementers.
        [eng-Latn]_

        Trivia:
          - abstractum, https://en.wiktionary.org/wiki/abstractus#Latin
          - syntaxim, https://en.wiktionary.org/wiki/syntaxis#Latin
          - arborem, https://en.wiktionary.org/wiki/arbor#Latin
          - conceptum de Abstractum Syntaxim Arborem
            - https://www.wikidata.org/wiki/Q127380
      nomen:
        eng-Latn: 'HXLTM Abstractum Syntaxim Arborem'
      situs_interretialis:
        referens_officinale:
          - https://hdp.etica.ai/hxltm
          - https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/223
          - https://github.com/EticaAI/HXL-Data-Science-file-formats/labels/HXLTM

The idea of creating a format that uses HXL to store not just translation memories (as the XLIFF format does) but also glossaries, and in particular terminology, is hardcore: not so much in the code implementation, but because the issue it tries to abstract is complex.

Even if mostly for internal usage (e.g. not strictly documented for external use), instead of 'converting' HXLated data (aka CSVs) directly to other formats (in special the XML ones), we're already drafting what could be called an Abstract Syntax Tree (https://en.wikipedia.org/wiki/Abstract_syntax_tree). It can be a simple one, but at least we're not passing raw CSV pointers to the converters.
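
A minimal sketch of what such an intermediate tree could look like, assuming a hypothetical concept -> language -> term nesting (the real .asa.hxltm.json layout is intentionally not strictly documented, as the YAML above says):

# Hypothetical shape of an HXLTM-ASA-like intermediate tree; the real
# .asa.hxltm.json layout is intentionally not strictly documented.
import json

def rows_to_asa(rows):
    """Group flat HXLated rows into concept -> language -> terms."""
    asa = {}
    for row in rows:
        conceptum = asa.setdefault(row["conceptum_codicem"], {})
        linguam = conceptum.setdefault(row["linguam"], [])
        linguam.append({"rem": row["rem"]})
    return asa

rows = [
    {"conceptum_codicem": "C001", "linguam": "eng-Latn", "rem": "water"},
    {"conceptum_codicem": "C001", "linguam": "por-Latn", "rem": "água"},
]
print(json.dumps(rows_to_asa(rows), ensure_ascii=False, indent=2))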

Comparison to other linguistic Abstract Syntax approaches

See also:

It turns out that some long-standing ideas about abstract linguistic content do exist, but what could be called 'HXLTM ASA' operates more at the container level (where it could be useful to convert between file types) than at the term level (where it would be about understanding what a term is, to use for translating concepts).

So even if HXLTM ASA becomes usable by external tools, we will not even try to do too much micromanagement. BUT one thing we could do here is intentionally make it easy for others to convert to whatever format they want: we do not try to be strict on what HXLTM ASA is, so if someone else wants to inject even more details at the term level, they can.

On Grammatical Framework

The Grammatical Framework (cited a lot in "Abstract Syntax as Interlingua") seems to be the state of the art in generating a way to understand sentences in different natural languages. I, Rocha, do not plan to go deep on this, since the short to medium term interest is more about how to store terminology and translation memories. If even the minimal implementation to support TBX export can take time, the best I can do is make it easier (if interest exists years later) for people to use HXLTM dialects to store linguistic data while still having decent portability between other data formats.

`hxltmcli`: HXLTM additional file for advanced processing (YAML-like, as opposed to table-like)

Related:
- hxltmcli .asa.hxltm.json / .asa.hxltm.yml: HXLTM Abstractum Syntaxim Arborem #3
- hxltmcli (or equivalent cli): MVP of custom output based on template + HXLTM data as cli option (without need of Ruby) #4


This topic is about at least a Minimal Viable Product of additional instructions to process what HXLTM files would mean. While the reference implementation would be the hxltmcli tool, one idea is for this file to be as independent as possible.

For example, if necessary, instead of embedding Python code, keep the trend of using a templating system like Liquid (https://shopify.github.io/liquid/). Liquid also somewhat provides some level of a minimal ad hoc virtual machine (since a good implementation would only expose an allowed, special context; but we don't take an official position on this at the moment, since it would require auditing the current Python library that processes Liquid templates). Note that even such an ad hoc VM cannot prevent denial of service: if someone makes an infinite loop by accident, we could not prevent it.
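
A minimal sketch with the python-liquid package (one of the candidate libraries; as said above, none has been audited yet), showing how HXLTM rows could be exposed to a user-supplied template:

# Sketch with the python-liquid package (pip install python-liquid);
# no specific Liquid library has been audited for this use yet.
from liquid import Template

template = Template(
    "{% for rem in res %}{{ rem.conceptum }}: {{ rem.eng }} -> {{ rem.por }}\n"
    "{% endfor %}"
)
res = [
    {"conceptum": "C001", "eng": "water", "por": "água"},
]
print(template.render(res=res))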

Change the way the HXLTM ontologia file is described/referenced to suggest the possibility of verbs in other writing systems


Currently, the HXLTM ontology file uses Latin in Latin script for the reference implementation, while part of the documentation is itself in English. The point of this issue is to consider at least renaming the file used by the ontology in such a way that versions could eventually exist with the verbs themselves in any other writing system. I understand that, as of 2021, most people tend to tolerate writing in some Latin script when developing programs, but by at least intentionally making the writing system part of the ontology file naming, we could avoid locking in that type of thinking.

For a short explanation of this issue, here are some excerpts.

# @ARCHIVUM       cor.hxltm.yml
# (...)
__ontologia_cor_versionem__:  v0.8.6+EticaAI+voluntārium-commūne

fontem_archivum_extensionem:
  .tm.hxl.csv: HXLTM
  .xliff.hxl.csv: CSV-HXL-XLIFF
  # (...)

normam:
  Ad-Hoc:
    __meta:
 # (...)
  CSV-3:
    __meta:
      # archivum_extensionem: .csv
      archivum:
        extensionem: .csv
      descriptionem: |
        ...
 # (...)


ontologia_aliud:
  accuratum:
    "?":
      # The '?' expresses what to do when the entire column does not exist,
      # i.e. it is not just a particular value that is missing
      _IATE_valorem_codicem: "★★"
      _IATE_valorem_descriptionem: |
        Automatically assigned to terms entered or updated by native speakers.
      _IATE_valorem_nomen: "Minimum reliability"
      _IATE_valorem_numerum: 6

# (...)

  genus_grammaticum:
    lat_commune:
      _aliud: 'TBX_other'
      # _codicem: lat_commune
      _codicem_TBX: TBX_other
      _descriptionem: 
      codicem_lat: commune
# (...)

  partem_orationis:
    lat_adverbium:
      _aliud: 'TBX_adverb|UTX_adverb'
      _codicem: lat_adverbium
      _codicem_TBX: TBX_adverb
      _codicem_UTX: UTX_adverb
      _codicem_wikidata: Q380057 # https://www.wikidata.org/wiki/Q380057
      _normam: https://la.wikipedia.org/wiki/Adverbium
      codicem_lat: adverbium
# (...)

Note that this is different from "documentation translation". Both the documentation and even file paths, and the new data standards added by users to the current ontology file, already allow full Unicode support. The main point here is to at least make the writing system of the verbs part of the ontology file name.

How to make even the ontology file tolerate verbs in different writing systems

One requirement would be a mapping of what each ontology verb means across writing systems. Since the verbs are limited in number, even without adding hardcoded support in the reference implementations, someone could replace the verbs from/to new languages. If at some point there are interested people who use a non-Latin script, such a mapping (it may be done by an external tool) could be used when converting from one region to another, as the sketch below shows.
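
A sketch of what such an external mapping tool could do, assuming a hand-made verb table (the target verbs below are placeholders, not real proposals for any writing system):

# Sketch of an external tool remapping ontologia verbs between writing
# systems; the target verbs here are placeholders, not real proposals.
import yaml  # pip install pyyaml

VERBA = {
    "normam": "norma_in_alio_scripto",          # placeholder target verb
    "ontologia_aliud": "aliud_in_alio_scripto",  # placeholder target verb
}

def remap_verba(node):
    """Recursively rename known ontologia verbs (dict keys)."""
    if isinstance(node, dict):
        return {VERBA.get(k, k): remap_verba(v) for k, v in node.items()}
    if isinstance(node, list):
        return [remap_verba(item) for item in node]
    return node

with open("cor.hxltm.yml", encoding="utf-8") as fp:
    ontologia = yaml.safe_load(fp)

print(yaml.safe_dump(remap_verba(ontologia), allow_unicode=True))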

Note that a good part of these verbs are also part of the command line arguments. So if such mappings are well documented, this makes it possible for at least our reference implementation to be used in other regions. The opposite could also be true.

Anyway, one potential advantage of allowing this is that if for some reason there exists a baseline community in other regions (for example, speakers of Arabic dialects, or Hindi, etc.), they would be free to have some differences without waiting for Etica.AI to merge them.

Example

In practice this means that terms like normam (https://en.wiktionary.org/wiki/norma#Latin), ontologia_aliud (https://la.wikipedia.org/wiki/Ontologia, https://en.wiktionary.org/wiki/alius#Latin), and partem_orationis (https://en.wiktionary.org/wiki/pars_orationis#Latin) would need their relations to other scripts explained (aka "translated"), while terms like Ad-Hoc and CSV-3 would actually be the same in any ontologia, since they are content.

If some mappings are important enough (for example, the specifications related to Ad-Hoc or HXLTM-ASA, which in other languages could be something different), since there are far fewer writing systems than languages, such aliases could be part of each ontologia.

But anyway, the point here is to show that even if the verbs of the ontologia itself are in Latin today, they are not hardcoded in Latin script; this topic may only be remembered far in the future.

Reorganization of online documentation for v1.0.0 of HXLTM

Quick links (as of this issue's creation):


Documentation is quite important. Most tools (in special the ones dealing with linguistic data) are poorly documented; add to this that people sometimes use wrong codes to express languages, so the amount of validation needed when people import from other tools to HXLTM would be huge.

While we did move some of the documentation to the ontology YAML, the average end user may still prefer an HTML web page to get started. So for v1.0.0, let's re-organize the documentation.

1. Potential strong changes

1.1. documentation-site.tld

The entry page of the documentation site becomes a simple list of links to the translations. At the moment, we have only English, so the home page would link only to it.

1.1.1. Question: why not have "versioned" pages, like documentation-site.tld/vN.Z?

The combination of translations plus versions in the URL alone would be hard to cope with. Also, as new translations come, old versions would gain new strings. Add to this that the idea of having to plan formal releases is stressful.

At this moment, I believe that taking inspiration from the so-called HTML living standard (https://html.spec.whatwg.org/), plus some discipline, could be good enough. Creating documentation in the first place takes a lot of time (which alone is already sufficient to avoid too many changes without reason). Also, we could intentionally make all the reference software cope with changes, and for someone who leaves some automated parsing using HXLTM (like the HXLTM GitHub Action) and wants to keep it working for years, we could try to plan ahead how that person could freeze the versions.

1.1.2. What about having "real versioned releases"?

This is one of the reasons the PyPI package has the suffix -eticaai (even though the name hxltm alone is free at the moment; maybe we will reserve it just to avoid a random person taking it). The idea would be to intentionally have a namespace.

If necessary (which could happen, in case of usage by big organizations), one ideal approach would be some sort of official release by them. Since we are 101% not like the ISO organization, any group (if not @HXLStandard itself), even without its own documentation page, could at least have frozen versions.

1.2 documentation-site.tld/zzz-Zzzz

The current draft already uses https://hxltm.etica.ai/eng-Latn/ as the home page. But while the "introduction" page could be the one with the smaller URL prefix, this would have problems:

  • With the potential for more translations, the URL that "really" would have the contents becomes hard to plan.
  • The average user is more interested in how to use the tools than in how to contribute or understand the ontologia, so this organization is not intuitive.
    • Add to this that HXL (so HXLTM too) is better optimized for emergency response, which makes this even more critical.

Minor refactoring for compatibility with libhxl-python >= 4.25.1


The way the pip package hxltm-eticaai works is by direct usage of the way libhxl-python's hxl/scripts.py accesses some HXL classes. We will need to release some version to either pin the libhxl version or (the ideal) make it work with 4.25.1. This upgrade of libhxl-python does not cause any problem on the command line but, as obviously expected, requires us to update our implementation whenever the way hxl/scripts.py works changes.

In any case, we're likely to start signaling the version to use. The https://github.com/EticaAI/MDCIII-boostrapper running daily (not just for libhxl, but also for frictionless) is always getting the latest version of everything, and this makes the automated operator see issues before we test them locally.
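
Until that release, a defensive runtime check is one option (a sketch; libhxl is the PyPI distribution name of libhxl-python, and 4.25.1 is the version cited above):

# Sketch: fail fast when the installed libhxl-python predates the
# hxl/scripts.py behaviour we rely on.
from importlib.metadata import version
from packaging.version import Version  # pip install packaging

if Version(version("libhxl")) < Version("4.25.1"):
    raise RuntimeError("hxltm-eticaai requires libhxl-python >= 4.25.1")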

Test the viability of abstracting even further the mappings between XML-like formats and the HXL tabular format, using XML nested tags (vs HXL attributes) as base and concept/language/term as baseline.

Test what the title says.

I'm not very sure this is possible without making things more complicated for the end user, but I think it may be viable to become a bit less repetitive.

The general idea of how to organize rows of a table into concepts in XML (which could span several rows, with relationships) needs a lot of creativity. The second point (but already likely to be solved) is how to generalize language code parsing. Then there are the terms.

But, after these three big groups, I'm starting to think that additional data attached to these groups (which today, with the same logic, requires adding more lines of Python) may actually be worth abstracting.

One way to generalize such an idea

Note: it is obviously possible to add more semantics by adding more lines of Python. The point here is to make the ontologia even more powerful.

In the current state, the way things are in the HXLTM tabular format (aka HXL, with some extra attributes), it could be ported to a direct mapping. For example, when reading from XML (it could be JSON or YAML, but we would need to make sure to avoid powerful XML features that are not portable), we already know when we are at the concept, language, or term level. So in theory the generalization here would be some XML tag (that could appear at the 3 levels) with an attribute that tells how inner XML tags (or, in HXL tabular format, additional attributes) change the parsing strategy.
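
A sketch of the level-aware reading described above, with hypothetical tag names (conceptum, linguam, rem) standing in for whatever the real XML dialect would use:

# Sketch of level-aware XML reading; <conceptum>/<linguam>/<rem> are
# hypothetical tag names for the concept/language/term baseline.
import xml.etree.ElementTree as ET

XML = """
<glossarium>
  <conceptum codicem="C001">
    <linguam bcp47="pt-BR">
      <rem>água</rem>
    </linguam>
  </conceptum>
</glossarium>
"""

root = ET.fromstring(XML)
for conceptum in root.iter("conceptum"):
    for linguam in conceptum.iter("linguam"):
        for rem in linguam.iter("rem"):
            # At this point we know the level of every extra attribute,
            # which is what makes a generic mapping strategy possible.
            print(conceptum.get("codicem"), linguam.get("bcp47"), rem.text)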

Some disadvantages

  • This implementation would force HXLTM to be stricter about tag order than HXL allows.
        - This may not be an issue if new attributes are only one level deeper than each baseline; but it cannot be generalized without inverting the order of tags in the XML. It could convert between tabular HXL and XML, but would not make much sense to other tools trying to parse the XML.
  • Well, this is not implemented yet. It may not be as simple to generalize.

Some advantages:

  • This obviously makes converting the HXLTM tabular format to XML (and back) much more generalized, while still staying somewhat aware of what the user is doing. The potentially chaotic part is reduced.
        - And, if the XML format is intentionally not too complex, it is also viable to reuse the logic to use JSON or YAML as storage.
            - Note that allowing somewhat intuitive editing by hand is interesting. But either tabular data or XML/JSON/YAML becomes complex at some point, because the subject is complex.
  • This makes it easy to add new complex data structures without previous planning.
    • The downside is that, despite editing XML (where users expect some way to validate; people who use JSON may not care as much about JSON Schemas), allowing this flexibility means never being able to create a schema to validate what is intentionally open and flexible. This means more human error.
      • Part of these errors may be easier to catch by validating the exported formats against TBX/TMX/XLIFF but, again, creating custom formats means that some information may not map to other specifications.

Built-in specialized commands to convert wide and narrow data (pivot) for language


This issue is about implementing a command specialized in converting language data from and to wide and narrow layouts.


Edit:

  • Changed from Expand-lists-filter to Implode-data-filter
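
Independent of the final filter name, the pivot itself is a small operation. A sketch with hypothetical column labels:

# Sketch of a wide -> narrow (melt) pivot for language columns; the
# column labels are hypothetical.
def wide_to_narrow(rows, id_key="conceptum_codicem"):
    """Yield one row per (concept, language) instead of one column per language."""
    for row in rows:
        for key, value in row.items():
            if key != id_key and value:
                yield {id_key: row[id_key], "linguam": key, "rem": value}

wide = [{"conceptum_codicem": "C001", "eng-Latn": "water", "por-Latn": "água"}]
for narrow in wide_to_narrow(wide):
    print(narrow)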

`hxltmcli --objectivum-TMX`: HXL Trānslātiōnem Memoriam -> Translation Memory eXchange (TMX) v1.4b

Potential guidelines to deal with source terms for new translations in case of proofreading/terminology review based on new evidence when compiling multilingual terminology.


For the sake of simplified documentation, I think we may create a convention of language attributes to mark source terms as proofread (or terminologically reviewed based on translators' feedback), optimized for cases where the source terms already cannot be changed even if the creators of the initial terms would see the complaints as valid. [2], for example, mentions that at the UN they have a strong review system in place (even before texts are passed to translators), which is very different from the situation in [3].

I'm not sure how many types of attributes we would create, but at least one related to average proofreading (for example, work done by software developers, or people copy-pasting from other references that may already be wrong) and another related to translations done from material whose context does explain what the concept means, but where the exact term in that language is likely to produce wrong translations.

Considering that we now encode English with an approach similar to BCP 47 (but with ISO 639-3 as baseline and ISO 15924 non-optional), the source terms could still be eng-Latn, with a variant "eng-Latn-x-term1234" where term1234 identifies the variant. So when exporting translation jobs, the human could try to export "eng-Latn-x-term1234" and, for terms not found there, export from the base eng-Latn. Or the XLIFF files (or spreadsheets) given to translators could already differentiate the official term from the reviewed term.
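
A sketch of that export-time fallback:

# Sketch: prefer the reviewed variant ("eng-Latn-x-term1234", from the
# example above) and fall back to the base "eng-Latn" source term.
def source_term(terms, variant="eng-Latn-x-term1234", base="eng-Latn"):
    return terms.get(variant) or terms.get(base)

terms = {"eng-Latn": "gender", "eng-Latn-x-term1234": "sex or gender"}
print(source_term(terms))  # -> "sex or gender"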

Examples of use cases

Core-Person-Vocab head term "gender" (with a definition that mixes two concepts)

See also comment SEMICeu/CPOV#12 (comment).

[Screenshot: 2021-11-29 20-06-22]

Note: from the point of view of terminology, defining a term in such generic terms is already a problem (in special considering that Core-Person-Vocabulary was supposed to be a planned controlled vocabulary). But even if it is intentionally ambiguous, the way to design the head term in English would be to make a composed term with "or", as in "Sex or Gender" or "Biological Sex or Gender Identity". Taking the head term from one of the "sub-concepts" and attaching the definition of both concepts causes even more confusion.

For example, this table by HL7, https://confluence.hl7.org/display/VOC/Gender+Coding+with+International+Data+Exchange+Standards (https://archive.ph/VQR42), already uses the term "gender" in English while in German the word is "Biologisches Geschlecht".
[Screenshot: 2021-11-29 21-00-31]

Note that the old 1.00 version, in addition, used tables from ISO/IEC 5218:2004 (so, if the compilers of HL7 tried to find the better English term, following CPV 1.00 would lead them to use "gender"). Add to this that [1] (2017, Interaction of law and language in the EU: Challenges of translating in multilingual environment, 17 pages) already mentions the issue of English as a working language being more ambiguous than German or French; this is not an isolated case.

Examples of conflicting issues with the head term "gender"

[Screenshot: 2021-11-29 20-31-30]

"Gender interacts with but is different from sex, which refers to the different biological and physiological characteristics of females, males and intersex persons, such as chromosomes, hormones and reproductive organs. Gender and sex are related to but different from gender identity." -- WHO

Already in English, most references on the definition used in the preview of the Person-Specification (and not just the WHO) would disagree with using the same short head word for both concepts; moreover, most de facto values used for these fields are strongly related to "biological sex" (which already has terminology of its own).

Quick comments on potential strategies to document changes to source terms before preparing translation jobs

Based on another job we're doing to compile HXLTM, the TICO-19 (see https://tico-19.github.io/ and https://github.com/EticaAI/tico-19-hxltm): on this podcast, https://www.stitcher.com/show/the-global-podcast-2/episode/episode-14-twb-and-tico-19-project-80576088, the TICO-19 members mention that using translated versions, instead of going directly from English, is already relevant. Considering both the Spanish and the French versions, I somewhat agree that the translations seem to be less literal than the English source.

But here is one thing: even for TICO-19 (which, very different from CPV, was a project based on urgency), one alternative could also be a proofread version of the English source.

I think in the case of urgency projects, like TICO-19, if we add some feature to label alternative versions of a source term, whoever prepares the work to distribute for new translations could have more freedom and optimize for speed (dozens of terms; days, if not hours, to take action). But in the case of Core-Person-Vocab, not only because there are fewer terms (but also because there is more planning involved), if we document some additional attribute to justify the change to a new variant of the source term, we may also document that this needs more metadata (for example, organizations like the WHO that could back up the feedback from translators that the source terms are not aligned with the definitions).

Map (at least) Europe IATE termType to BCP-47 language attributes *AND* have documentation on it


If we manage to make a minimal viable product of #11, content generated either by 3rd party software or by complex interfaces not directly edited as HXLTM spreadsheets is likely to use the non-wide format (like the one used in the TICO-19 terminologies).

So, since with #11 it is already necessary to pivot the formats, if we also document the term type as a specialized language attribute, this could make things easier for users later.

Also note that in some use cases (not the fault of HXLTM, but how the world is), someone could actually edit the non-wide formats by hand and then import them into other systems (either ones optimized for HXLTM, or those of private companies who may use the HXLTM documentation for closed-source terminology standards behind paywalls, like the new ones used by TBX 2019).

Note that people (even if years later) are likely to come to HXLTM not only for the file format, but for a crash course on how to deal with multilingual terminology.
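
A sketch of such a mapping, using the IATE termType values cited in the trivia below and hypothetical BCP 47 private-use subtags (each -x- subtag is limited to 8 characters):

# Sketch: Europe IATE termType values mapped to hypothetical BCP 47
# private-use subtags (each -x- subtag must be 1-8 alphanumeric chars).
IATE_TERM_TYPE = {
    "fullForm": "x-fullform",
    "abbreviation": "x-abbrev",
    "shortForm": "x-shortf",
    "phrase": "x-phrase",
    "formula": "x-formula",
    "variant": "x-variant",
}

def tag_with_term_type(bcp47: str, term_type: str) -> str:
    """E.g. ('pt-BR', 'abbreviation') -> 'pt-BR-x-abbrev'."""
    return f"{bcp47}-{IATE_TERM_TYPE[term_type]}"

print(tag_with_term_type("pt-BR", "abbreviation"))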


Trivia:

  • one potential advantage of this is that implementers (like private companies already trying to help the humanitarian sector) mostly have to adhere to full BCP 47 (which is useful beyond HXLTM or the humanitarian sector) rather than implement new code conventions.
  • Most extensions of BCP 47, including the Unicode -u- extension, are poorly documented outside Unicode.
    • This means we somewhat need to cite that they exist (also to avoid others creating new extension namespaces).
  • Most codes, like the ones used in the Europe IATE (fullForm, abbreviation, shortForm, phrase, formula, variant), are actually only documented as part of some ISOs (which are behind paywalls), which means whatever we create in Latin (which is optimized as a public domain reference for translations) is actually relevant even for developers who could read the English/French version of such terminology standards, but don't have access.
    • Why people from the global south dedicate time to create ISOs that cannot be used by their own population (not just because of language issues, but license issues to read the specifications) is beyond my comprehension.

`hxltmcli` (or equivalent cli): MVP of custom output based on template + HXLTM data as cli option (without need of Ruby)

The current hxltmcli, as documented at https://hdp.etica.ai/hxltm/archivum/, allows several exporters of the complete dataset. But some of the "more basic" functionality of https://github.com/HXL-CPLP/Auxilium-Humanitarium-API / https://hapi.etica.ai/, where it is possible to use the HXLTM terms to create custom formats, still needs Ruby code.

Either the Hapi or the hxltmcli (or a new cli) must be able to allow the extra fields (like the term definitions) to be used in templates.

`hxltmcli --objectivum-XLIFF`: HXL Trānslātiōnem Memoriam -> XLIFF Version 2.1
