oasis-tcs / lexidma

OASIS Lexicographic Infrastructure Data Model and API (LEXIDMA) TC: A repository designed for use in development of TC chartered work products and test suites. https://github.com/oasis-tcs/lexidma

License: Other

CSS 0.17% HTML 3.08% XSLT 91.51% Java 3.24% Perl 0.05% JavaScript 1.37% Batchfile 0.04% Shell 0.01% Makefile 0.02% Python 0.49% M4 0.01%

lexidma's People

Contributors

davidfatdavidf, jmccrae, michmech, mjakubicek, oasis-op-admin, tomazerjavec, vojtech-kovar

lexidma's Issues

URIs for entries and senses

(Submitted by Louis Cotgrove)

Can there be an option for a uri tag at the entry and sense levels? I can foresee use cases where this would be useful, such as an abbreviated version of a resource that links to the full text, or citations. I see this might be covered by the linking module, but that seems more complicated to me as a potential user.

Should sameAs really be called sameAs?

(Submitted by David Lindemann)

One suggestion: Isn’t skos:exactMatch better than owl:sameAs for expressing that a POS tag matches a lexinfo or OLiA or whatever URI? owl:sameAs means that two entities are exactly the same, including all statements made about them (= including all triples with one of the two as subject). That is not the case here, so I think skos:exactMatch is better; it only claims that two URIs represent entities describing the same thing, while allowing different statements for the two. I think this distinction is important for machine reasoning. For example, in Wikidata, owl:sameAs is used for internal redirects when a user merges items (for the machine: instead of looking at entity A, you can look at entity B, and assume that every single statement about B is true for A as well).
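
The difference can be shown with a toy reasoner in Python. This is only a sketch: the `dict:` and `vocab:` identifiers and the function names are hypothetical, and the closure below is a deliberately simplified stand-in for real OWL semantics. It copies statements across owl:sameAs links and leaves skos:exactMatch links alone.

```python
SAME_AS = "owl:sameAs"
EXACT_MATCH = "skos:exactMatch"


def apply_same_as(triples):
    """Toy closure: copy statements across owl:sameAs links, and only those."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(known):
            if p != SAME_AS:
                continue
            # sameAs is symmetric, so statements flow in both directions.
            for source, target in ((s, o), (o, s)):
                for s2, p2, o2 in list(known):
                    if s2 == source and p2 != SAME_AS:
                        copied = (target, p2, o2)
                        if copied not in known:
                            known.add(copied)
                            changed = True
    return known


def link(predicate):
    """Hypothetical data: a dictionary POS tag linked to an external concept."""
    return {
        ("dict:noun", predicate, "vocab:Noun"),
        ("vocab:Noun", "rdfs:comment", "defined by an external vocabulary"),
    }


COMMENT = ("dict:noun", "rdfs:comment", "defined by an external vocabulary")
print(COMMENT in apply_same_as(link(SAME_AS)))      # statement is inherited
print(COMMENT in apply_same_as(link(EXACT_MATCH)))  # statement is not inherited
```

With owl:sameAs, the dictionary's own tag silently acquires every statement made about the external concept; with skos:exactMatch, the two stay separate.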

Listing order

(Submitted by Robert Lew)

Second, referring to the WALK example with listingOrder:1 and listingOrder:2 -- to me this seems worryingly literal and inflexible. I'm thinking of a scenario where, in analogous cases, we might want to implement a rule that these entries are presented in a separately given order of POS, e.g. n > v > adj > adv > prep ..., rather than a hard-coded one. Would that be possible? Another option: using a rule to look up some kind of metric to compute the order, e.g. in a separate table

walk n 9
walk v 6

(This could be frequency or figures from known usage logs expressing past user behaviour, or a combination of different things.)
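
Both alternatives can be sketched in a few lines of Python. Nothing here is defined by DMLex: `POS_PRIORITY`, `METRIC`, the dictionary keys and both function names are hypothetical, the figures are the ones from the table above, and "higher value first" is an assumption about how the metric would be read.

```python
# Hypothetical illustration: neither structure is defined by DMLex.
POS_PRIORITY = ["n", "v", "adj", "adv", "prep"]

# Separate lookup table keyed by (headword, part of speech).
METRIC = {("walk", "n"): 9, ("walk", "v"): 6}


def order_by_pos(entries):
    """Order homographs by a separately given part-of-speech ranking."""
    rank = {pos: i for i, pos in enumerate(POS_PRIORITY)}
    # Parts of speech missing from the ranking sort last.
    return sorted(entries, key=lambda e: rank.get(e["pos"], len(rank)))


def order_by_metric(entries):
    """Order homographs by a looked-up metric, highest value first."""
    return sorted(
        entries,
        key=lambda e: METRIC.get((e["headword"], e["pos"]), 0),
        reverse=True,
    )


entries = [
    {"headword": "walk", "pos": "v"},
    {"headword": "walk", "pos": "n"},
]
print([e["pos"] for e in order_by_pos(entries)])
print([e["pos"] for e in order_by_metric(entries)])
```

Either rule can be applied at presentation time, so it does not by itself require a change to the stored listingOrder values.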

URI required for lexicographicResource?

(Submitted by Louis Cotgrove)

On p.13 of the PDF [= section 3.1], it says uri is required but the text says “zero or one” - is that not then optional?

Subentries

(Submitted by Bob Boelhouwer)

A distinctive feature of DMLex is that all entries are at the same level. There are no subentries. The main argument for this is that it is computationally more efficient. But I’m not sure that, with modern computing equipment, this is a significant advantage. An argument for using embedded structures is that editors compiling a dictionary would prefer to have an overview of all the submeanings and collocations that can make up an entry. Moreover, they would have a better indication of how the entry will look when presented to the public.

Variant headwords

(Submitted by Robert Lew)

First, using the example colour / color, I see that the model assumes "full treatment" under one headword. At this point there's a somewhat arbitrary decision about which 'variant' becomes the reference, and there's a structural imbalance between the variants. Suppose the publisher wanted to vary the presentation depending on the user's preference for American or European spelling (also (S) African, Australian and NZ, but that's beside the point). To make the American spelling primary, they'd have to do some extra operations when using this particular model: perhaps it would be possible to include a sentence in the summary saying that this is doable, to anticipate any worries to that effect (and another sentence on HOW this might be done, if not too complex).

`<text>` element in XML serialization

There seem to be some inconsistencies about the use of a <text> element in serialization. Sometimes it is required (as a child of <example>) and sometimes not (as a child of <definition>) and the examples seem to mix this.

Is the idea that this is optional or can we clarify why it is used sometimes and other times not?

Names of ‘tags’ and ‘types’

(Submitted by Jonatan Steller, talking about his DMLex implementation in TYPO3)

A more severe hurdle I encountered in the implementation was
the names of several properties clashing with each other due to the fact
that I implemented all tags as well as "RelationType" and "MemberType"
in a single "Tag" table of TYPO3's database with multiple "type"s
designating which type of tag was being provided. Crawling the spec repo
and the mailing list I realise that an early draft of the spec contained
and abandoned this logic before I first encountered it, but I would
assume that other implementers may take a similar shortcut because it
significantly reduces the number of database tables required to run
DMLex. I provide a list of classes and properties that I needed to
rename below - not to convince the LEXIDMA group to follow suit, but to
document potential hassle around a spec feature where I found the names
of classes and properties to be more confusing than elsewhere and
possibly clashing in lazy implementations like mine:

  • InflectedFormTag: renamed to "InflectionTypeTag" to align it with
    "DefinitionTypeTag"
  • RelationType and MemberType: renamed to "RelationTypeTag" and
    "MemberRoleTag", respectively
  • LabelTag, LabelTypeTag, PartOfSpeechTag, InflectionTypeTag,
    DefinitionTypeTag, TranscriptionSchemeTag: property "tag" renamed to
    "code" to avoid confusing the "tag" property with the "Tag" classes
    containing them
  • LabelTag: property "typeTag" renamed to "labelType" to align it with
    other type indicators
  • MemberRoleTag: property "role" renamed to "text" in alignment with
    other tags
  • MemberRoleTag: property "type" renamed to "memberType" because it
    clashed with the type indicator needed for the unified table of all tags
  • RelationTypeTag: "type" renamed to "text" because it clashed with the
    type indicator needed for the unified table of all tags
  • RelationTypeTag: "memberType" renamed to "memberRole" to avoid
    conflict with the new "memberType" property now used for the
    MemberRoleTag class

I hope this makes some sense. I similarly aligned another property in
the class "InflectedForm" where I renamed "tag" to "labelType" similar
to how "Definition" has a "definitionType". The two illustrations I
attached depict all classes and properties I needed to rename in a
fuchsia colour. They were all necessitated either by changing from
bottom-up to top-down relations or by simplifying all tags and tag-like
classes into a single database table.

Validation artefacts

(Submitted by Miloš Jakubíček)

It has been brought to my attention that one of the remaining outstanding issues is that we still need to take a decision on validation artifacts, so this is a formal request to do so.

usefulness of "for" properties in the controlled vocabulary module

The tags have multiple "for" properties, e.g., forHeadwords. Do we have restrictions on how these may be combined, e.g., can an inflectedFormTag apply to headwords or translations or languages? Would it not make sense to combine these into a single property with values, e.g., for=headwords instead of forHeadwords=true?

Validation artefacts

(Submitted by Jan Niestadt)

Comprehensive testing/validation tools would be a necessity as well to aid development and ensure conformance. I'm sure this had already crossed your mind.

Need a how-to guide

(Submitted by Jan Niestadt)

Probably an obvious statement, but the specification is large and complex, making it difficult to wrap your head around the whole thing. People new to DMLex likely will find it very challenging to understand and use it starting from just that document. I think it would be great if there were a separate ebook introducing each part, starting from real-life examples and only referring to the spec for the details.

Whitespace and the annotation module

(Submitted by John McCrae)

I have been thinking and I
am still not sure about our rule for whitespace in the elements. In
particular, in the converter I have been developing I am having problems
because we cannot apply pretty printing (indenting) to an XML file
without changing the content in the model. Further, I think the rules
are unintuitive and many will add whitespace and create unfortunate
errors. Instead I propose that we adopt the HTML methodology as
described here:

https://infra.spec.whatwg.org/#strip-newlines

In this case, before processing the content of any text carrying
element, we will first remove all new lines ('\n', '\r'), delete all
trailing and leading whitespace and replace all remaining blocks of
ASCII whitespace with a single space.

I would also make a model change, replacing all references to 'non-empty
string' with a 'normalised string'. This means a string that contains no
new lines, does not start or end with a whitespace, contains no block of
ASCII whitespace more than a single space and is non-empty. This ensures
that other serializations (JSON, RDF) cannot generate content that
cannot be represented in XML.

I do worry that this does not really cover Chinese, Japanese (and maybe
Thai/Lao), as the whitespace rules for HTML are more complex in Unicode,
but I think that this can probably be worked around by lexicographers
working in these languages. We can add a note to the spec for these
languages.
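
A literal reading of the proposal can be sketched in Python as follows. The function names are hypothetical, and the set of ASCII whitespace characters (tab, line feed, form feed, carriage return, space) is taken from the WHATWG Infra definition linked above.

```python
import re

# ASCII whitespace as defined by the WHATWG Infra standard.
ASCII_WHITESPACE = "\t\n\x0c\r "


def normalise(text: str) -> str:
    """Apply the three proposed steps in the order given above."""
    # 1. Remove all newlines.
    text = text.replace("\n", "").replace("\r", "")
    # 2. Delete leading and trailing whitespace.
    text = text.strip(ASCII_WHITESPACE)
    # 3. Replace each remaining run of ASCII whitespace with one space.
    return re.sub(r"[\t\n\x0c\r ]+", " ", text)


def is_normalised(text: str) -> bool:
    """Check the proposed 'normalised string' constraint."""
    return text != "" and text == normalise(text)


print(normalise("\n    Koroner provedl\n    pitvu.\n  "))
print(is_normalised("Koroner provedl pitvu."))
```

One consequence of removing newlines rather than replacing them with a space: in this reading, `"a\nb"` becomes `"ab"`, so two words separated only by a line break with no indentation would be joined.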

Some JSON properties are plural and some aren't

Most elements in the JSON serialization use a plural form (e.g., parts_of_speech) but a few don't (etymonLanguage, etymonType, etymology, indicator). Should this be made consistent?

This relates to #58

We should either explain the logic behind the non-plural names or change them.

Various editorial issues

(Submitted by Paul Knight)

During publication of the files for CSD02, the OASIS staff noted some items which the LEXIDMA TC may wish to consider.

These items are numbered for convenience, not ranked in any particular order of importance.

  1. "Committee" is misspelled on the front page (PDF only - corrected in the HTML during publication). This also affects the "Citation format" text towards the end of the front pages.

  2. XML parsing error for the file /conformance/conformance.xml. (&version; undefined - inside a warning message).

  3. The informative text for Appendix C mentions "OM" without any definition. (maybe Object Model ??)

  4. The initial sentence in Appendix F is essentially obsolete, and could be rewritten or simply removed.

  5. In the Normative References (Appendix B.2), the additional text " at https://docs.oasis-open.org/lexidma/dmlex/v1.0/csd02/schemas/informativeCopiesOf3rdPartySchemas/w3c/xml.xsd in this distribution" should either be removed or made correct by including the target file.

  6. Section 1.2 lists serialization languages which appear to be hyperlinked to empty targets. It appears that the intent is to link to the references in Appendix B. These hyperlinks should be properly instantiated.

Issues with RDF serialization

The RDF serialization has some discrepancies with the examples that need to be resolved. This is part of the implementation of #93

  • Examples use dmlex:langCode for LexicographicResources but the serialization spec says dmlex:language
  • dmlex:indicator has a string value in the examples but a URI value in the spec
  • dmlex:translationLanguage has a string value in the examples but a URI value in the spec
  • Changes from issues #66, #67 and #68 have not been applied to the RDF serialization

Example Errors

I started work on a parser, and here are some of the minor issues with the examples in the current spec.
I am using the numbering from the source (so Example 00 is Example A1.1).
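
Many of the problems listed below are plain syntax errors, which can be caught mechanically before any DMLex-specific parsing. The following Python sketch uses only the standard library; the function names are hypothetical and it is not the parser mentioned above.

```python
import json
import xml.etree.ElementTree as ET
from typing import Optional


def json_error(text: str) -> Optional[str]:
    """Return a description of the first JSON syntax error, or None."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


def xml_error(text: str) -> Optional[str]:
    """Return a description of the first XML well-formedness error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# A trailing comma and a literal "..." are both rejected by the JSON parser.
print(json_error('{"entries": [{"headword": "focal"},]}'))
print(json_error('{"senses": [...]}'))
# An unclosed element is rejected by the XML parser.
print(xml_error("<entry><sense></entry>"))
```

This only checks well-formedness. Errors such as `value` instead of `tag`, or `indicator` given as a string instead of a list, need a check against the serialization rules themselves.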

Example 00

diff --git a/examples/0.json b/examples/0.json
index 34df8ee..0daa9e0 100644
--- a/examples/0.json
+++ b/examples/0.json
@@ -2,7 +2,7 @@
     "uri": "http://example.com",
     "langCode": "en",
     "title": "Example Lexicon",
-    "entry": [{
+    "entries": [{
         "id": "abandon-verb",
         "headword": "abandon",
         "partsOfSpeech": ["verb"],
diff --git a/examples/0.xml b/examples/0.xml
index 9e28e2e..0cc9b5a 100644
--- a/examples/0.xml
+++ b/examples/0.xml
@@ -1,7 +1,8 @@
 <lexicographicResource uri="http://example.com" langCode="en">
+    <title>Example Dictionary</title>
     <entry id="abandon-verb">
         <headword>abandon</headword>
-        <partOfSpeech value="verb"/>
+            <partOfSpeech tag="verb"/>
         <sense id="abandon-verb-1">
             <definition>to suddenly leave a place or a person</definition>
             <example>
@@ -9,14 +10,16 @@
             </example>
             <example>
                 <text>Abandon ship!</text>
-                <label value="idiom"/>
+                <label tag="idiom"/>
             </example>
+        </sense>
         <sense id="abandon-verb-2">
-            <label value="mostly-passive"/>
+            <label tag="mostly-passive"/>
             <definition>to stop supporting an idea</definition>
             <example>
                 <text>That theory has been abandoned.</text>
             </example>
         </sense>
     </entry>
-<lexicographicResource>
+</lexicographicResource>
+
  • JSON property is called entries
  • Add title to example?
  • label has a tag attribute not value
  • First <sense> tag not closed
  • XML <lexicographicResource> is closed wrong

Example 01

diff --git a/examples/1.xml b/examples/1.xml
index 5654140..ef026b1 100644
--- a/examples/1.xml
+++ b/examples/1.xml
@@ -7,5 +7,4 @@
     <inflectedForm tag="pl">
         <text>folúsghlantóirí</text>
     </inflectedForm>
-    <sense>...</sense>
 </entry>
diff --git a/examples/10.json b/examples/10.json
index 961a24f..2c8f6ca 100644
--- a/examples/10.json
+++ b/examples/10.json
@@ -3,6 +3,8 @@
     "uri": "http://example.com",
     "langCode": "ga",
     "translationLanguages": ["en", "de", "cs"],
-    ...
+    "entries": [{
+        "headword": "focal"
+    }]
 }
  • Removed ...

Example 10

diff --git a/examples/10.xml b/examples/10.xml
index 64e4520..a938168 100644
--- a/examples/10.xml
+++ b/examples/10.xml
@@ -3,6 +3,8 @@
     <translationLanguage langCode="en"/>
     <translationLanguage langCode="de"/>
     <translationLanguage langCode="cs"/>
-    ...
+    <entry>
+        <headword>focal</headword>
+    </entry>
 </lexicographicResource>
  • Remove ...

Example 11

diff --git a/examples/11.json b/examples/11.json
index b69fa6e..2c9794b 100644
--- a/examples/11.json
+++ b/examples/11.json
@@ -28,5 +28,5 @@
             "langCode": "cs",
             "text": "sklizeň"
         }]
-    },]
+    }]
 }
  • JSON syntax error

Example 12

diff --git a/examples/12.json b/examples/12.json
index 0ab250f..1b7ee1d 100644
--- a/examples/12.json
+++ b/examples/12.json
@@ -7,30 +7,30 @@
         "senses": [{
             "id": "glasses-1",
             "definitions": [{"text": "an optical seeing aid"}]
-        }, {
+        }]}, {
         "id": "microscope",
         "headword": "microscope",
         "senses": [{
             "id": "microscope-1",
             "definitions": [{"text": "equipment for looking at very small things"}]
-        }, {
+        }]}, {
         "id": "lens",
         "headword": "lens",
         "senses": [{
             "id": "lens-1",
             "definitions": [{"text": "curved glass that makes things seem bigger"}]
         }]
-        }],
-        "relations": [{
-            "type": "meronymy",
-            "members": [{
-                "memberID": "glasses-1",
-                "role": "whole"
-            }, {
+    }],
+    "relations": [{
+        "type": "meronymy",
+        "members": [{
+            "memberID": "glasses-1",
+            "role": "whole"
+        }, {
             "memberID": "lens-1",
             "role": "part"
-            }]
-        }, {
+        }]
+    }, {
         "type": "meronymy",
         "members": [{
             "memberID": "microscope-1",
@@ -39,22 +39,22 @@
         "memberId": "lens-1",
         "role": "part"
         }]
-        }],
-        "relationTypes": [{
-            "type": "meronymy",
-            "description": "part-whole relationship",
-            "memberTypes": [{
-                "role": "whole",
-                "type": "sense",
-                "min": 1,
-                "max": 1,
-                "hint": "navigate"
-            }, {
+    }],
+    "relationTypes": [{
+        "type": "meronymy",
+        "description": "part-whole relationship",
+        "memberTypes": [{
+            "role": "whole",
+            "type": "sense",
+            "min": 1,
+            "max": 1,
+            "hint": "navigate"
+        }, {
             "role": "part",
             "type": "sense",
             "min": 1,
             "max": 1,
             "hint": "navigate"
-            }]
         }]
-        }
+    }]
+}
diff --git a/examples/12.xml b/examples/12.xml
index 5e57296..0104494 100644
--- a/examples/12.xml
+++ b/examples/12.xml
@@ -29,5 +29,5 @@
         <description>part-whole relationship</description>
         <memberType role="whole" type="sense" min="1" max="1" hint="navigate"/>
         <memberType role="part" type="sense" min="1" max="1" hint="navigate"/>
-        </relationType
-    </lexicographicResource>
+    </relationType>
+</lexicographicResource>
  • JSON has many validity errors
  • XML has a small validity error

Example 13

diff --git a/examples/13.json b/examples/13.json
index 35a0c8a..d084cee 100644
--- a/examples/13.json
+++ b/examples/13.json
@@ -7,28 +7,28 @@
         "senses": [{
             "id": "buy-1",
             "definitions": [{"text": "get something by paying money for it"}]
-        }, {
+        }]}, {
         "id": "sell",
         "headword": "sell",
         "senses": [{
             "id": "sell-1",
             "definitions": [{"text": "exchange something for money"}]
         }]
-        }],
-        "relations": [{
-            "type": "antonyms",
-            "members": [
-                {"memberID": "buy-1"},
-                {"memberID": "sell-1"}
-            ]
-        }],
-        "relationTypes": [{
-            "type": "antonyms",
-            "memberTypes": [{
-                "type": "sense",
-                "min": 2,
-                "max": 2,
-                "hint": "navigate"
-            }]
+    }],
+    "relations": [{
+        "type": "antonyms",
+        "members": [
+            {"memberID": "buy-1"},
+            {"memberID": "sell-1"}
+        ]
+    }],
+    "relationTypes": [{
+        "type": "antonyms",
+        "memberTypes": [{
+            "type": "sense",
+            "min": 2,
+            "max": 2,
+            "hint": "navigate"
         }]
-    }
+    }]
+}
  • JSON syntax errors

Example 14

diff --git a/examples/14.xml b/examples/14.xml
index 62f6d7d..d7036f3 100644
--- a/examples/14.xml
+++ b/examples/14.xml
@@ -2,21 +2,21 @@
     <translationLanguage langCode="de"/>
     <entry id="die-see">
         <headword>See</headword>
-        <partOfSpeech value="n-fem"/>
+        <partOfSpeech tag="n-fem"/>
         <sense id="die-see-1">
             <headwordTranslation><text>sea</text></headwordTranslation>
         </sense>
     </entry>
     <entry id="das-meer">
         <headword>Meer</headword>
-        <partOfSpeech value="n-neut"/>
+        <partOfSpeech tag="n-neut"/>
         <sense id="das-meer-1">
             <headwordTranslation><text>sea</text></headwordTranslation>
         </sense>
     </entry>
     <entry id="der-ozean">
         <headword>Ozean</headword>
-        <partOfSpeech value="n-masc"/>
+        <partOfSpeech tag="n-masc"/>
         <sense id="der-ozean-1">
             <headwordTranslation><text>ocean</text></headwordTranslation>
         </sense>
  • partOfSpeech uses tag not value

Example 17

diff --git a/examples/17.json b/examples/17.json
index e86433c..aea04f7 100644
--- a/examples/17.json
+++ b/examples/17.json
@@ -6,11 +6,11 @@
         "headword": "safe",
         "senses": [{
             "id": "safe-1",
-            "indicator": "protected from harm",
+            "indicator": ["protected from harm"],
             "examples": [{"text": "It isn't safe to park here."}]
         }, {
             "id": "safe-2",
-            "indicator": "not likely to cause harm",
+            "indicator": ["not likely to cause harm"],
             "examples": [{"text": "Is the ride safe for a small child?"}]
         }]
     }, {
  • indicator is zero or more so must be a list

Example 19

diff --git a/examples/19.json b/examples/19.json
index 64e4560..85029ae 100644
--- a/examples/19.json
+++ b/examples/19.json
@@ -4,6 +4,6 @@
   "placeholderMarkers": [
      {"startIndex": 9, "endIndex": 13}
   ],
-  "senses": [...]
+  "senses": []
 }
 
diff --git a/examples/19.xml b/examples/19.xml
index 3844464..381f4a6 100644
--- a/examples/19.xml
+++ b/examples/19.xml
@@ -2,6 +2,5 @@
     <headword>
         continue <placeholderMarker>your</placeholderMarker> studies
     </headword>
-    <sense.../>
 </entry>
  • Remove ...

Example 02

diff --git a/examples/2.json b/examples/2.json
index 7c0ded3..0c2de59 100644
--- a/examples/2.json
+++ b/examples/2.json
@@ -4,5 +4,5 @@
     "pronunciations": [{
         "transcriptions": [{"text": "a:rdva:rk"}]
     }],
-    "senses": [...]
+    "senses": []
 }
diff --git a/examples/2.xml b/examples/2.xml
index 55af24c..fa5a24c 100644
--- a/examples/2.xml
+++ b/examples/2.xml
@@ -3,5 +3,4 @@
     <pronunciation>
         <transcription>a:rdva:rk</transcription>
     </pronunciation>
-    <sense>...</sense>
 </entry>
  • Remove ...

Example 20

diff --git a/examples/20.json b/examples/20.json
index d2fc2ed..873aca7 100644
--- a/examples/20.json
+++ b/examples/20.json
@@ -10,7 +10,7 @@
       "text": "jemanden verprügeln",
       "placeholderMarkers": [
           {"startIndex": 0, "endIndex": 8}
-      ],
+      ]
     }]
   }]
 }
  • JSON syntax error

Example 22

diff --git a/examples/22.json b/examples/22.json
index 1997d48..5708b7a 100644
--- a/examples/22.json
+++ b/examples/22.json
@@ -19,7 +19,7 @@
         ],
         "collocateMarkers": [
           {"startIndex": 8, "endIndex": 15, "lemma": "provést"}
-        ],
+        ]
       }]
     }]
   }]
  • JSON syntax error

Example 23

diff --git a/examples/23.json b/examples/23.json
index 8e24909..6c0870d 100644
--- a/examples/23.json
+++ b/examples/23.json
@@ -1,9 +1,9 @@
 {
   "id": "cat-n",
   "headword": "cat",
-  "senses": [...],
+  "senses": [],
   "etymology": {
-    "etymons" [{
+    "etymons": [{
       "etymonUnits": [
         {"langCode": "enm", "text": "catte"}
       ]
  • Remove ...
  • JSON syntax error

Example 03

diff --git a/examples/3.json b/examples/3.json
index ffd405e..339f5d4 100644
--- a/examples/3.json
+++ b/examples/3.json
@@ -4,5 +4,5 @@
     "pronunciations": [{
         "soundFile": "aardvark.mp3"
     }],
-    "senses": [...]
+    "senses": []
 }
  • Remove ...

Example 04

diff --git a/examples/4.json b/examples/4.json
index 2b80790..430a537 100644
--- a/examples/4.json
+++ b/examples/4.json
@@ -5,5 +5,5 @@
         "soundFile": "aardvark.mp3",
         "transcriptions": [{"text": "a:rdva:rk"}]
     }],
-    "senses": [...]
+    "senses": []
 }
  • Remove ...

Example 05

diff --git a/examples/5.json b/examples/5.json
index 02443ec..125633e 100644
--- a/examples/5.json
+++ b/examples/5.json
@@ -12,7 +12,7 @@
         "tag": "pl",
         "text": "folúsghlantóirí"
         }],
-        "senses": [...]
+        "senses": []
     }],
     "partOfSpeechTags": [{
         "tag": "n-masc",
@@ -23,7 +23,7 @@
     }],
     "inflectedFormTags": [{
         "tag": "sg-gen",
-        "description": "singular genitive"
+        "description": "singular genitive",
         "forPartsOfSpeech": ["n-masc", "n-fem"]
     }, {
     "tag": "pl",
diff --git a/examples/5.xml b/examples/5.xml
index 95b8947..b5ceb85 100644
--- a/examples/5.xml
+++ b/examples/5.xml
@@ -1,7 +1,7 @@
 <lexicographicResource uri="http://example.com" langCode="ga">
     <entry id="folúsghlantóir-n">
         <headword>folúsghlantóir</headword>
-        <partOfSpeech value="n-masc"/>
+        <partOfSpeech tag="n-masc"/>
         <inflectedForm tag="sg-gen">
             <text>folúsghlantóra</text>
         </inflectedForm>
  • Remove ...
  • JSON syntax error
  • partOfSpeech uses tag

Example 07

diff --git a/examples/7.json b/examples/7.json
index bbfc18b..d94bd65 100644
--- a/examples/7.json
+++ b/examples/7.json
@@ -3,5 +3,7 @@
     "uri": "http://example.com",
     "langCode": "de",
     "translationLanguages": ["en"],
-    ...
 }
diff --git a/examples/7.xml b/examples/7.xml
index b1b0fb5..55bcc8c 100644
--- a/examples/7.xml
+++ b/examples/7.xml
@@ -1,5 +1,7 @@
 <lexicographicResource uri="http://example.com" langCode="de">
     <title>My German-English Dictionary</title>
     <translationLanguage langCode="en"/>
-    ...
 </lexicographicResource>
  • Remove ... (and trailing comma in JSON)

Example 08

diff --git a/examples/8.json b/examples/8.json
index 4835d8b..b34baea 100644
--- a/examples/8.json
+++ b/examples/8.json
@@ -3,7 +3,7 @@
     "headword": "doctor",
     "senses": [{
         "id": "doctor-n-1",
-        "indicator": "medical doctor",
+        "indicator": ["medical doctor"],
         "headwordTranslations": [{
             "text": "Arzt",
             "partsOfSpeech": ["n-masc"]
@@ -13,7 +13,7 @@
         }]
     }, {
         "id": "doctor-n-2",
-        "indicator": "academic title",
+        "indicator": ["academic title"],
         "headwordTranslations": [{
             "text": "Doktor",
             "partsOfSpeech": ["n-masc"]
  • indicator must be a list

Example 09

diff --git a/examples/9.xml b/examples/9.xml
index 2d2c4f1..b84e5eb 100644
--- a/examples/9.xml
+++ b/examples/9.xml
@@ -1,6 +1,6 @@
 <entry id="treppenwitz">
     <headword>Treppenwitz</headword>
-    <partOfSpeech value="n-masc"/>
+    <partOfSpeech tag="n-masc"/>
     <sense id="treppenwitz-1">
         <headwordExplanation>
             belated realisation of what one could have said
@@ -9,4 +9,5 @@
             <text>staircase wit</text>
         </headwordTranslation>
     </sense>
+</entry>
  • partOfSpeech uses tag
  • No final tag

What’s the point of homograph number?

(Submitted by David Lindemann)

Something strange from the Wikibase point of view is having a “homograph number” at entry level. It only tells you that homographs exist (assuming lemmata without homographs carry no number at all, rather than “1”, to make clear they are alone), but not how many there are, or which ones. I understand that this comes from digitized dictionaries. In Wikibase, instead of numbering homographs, one would state which entries are homographs of each other and link them using a dedicated property, just as is done on Wikidata.

Only one headword per entry

(Submitted by Christian-Emil Smith Ore, comparing DMLex to a lexicographic database he has been working on)

Since Norwegian spelling allows a lot of variation, a lexical item is a set of variants, and we find it useful to allow more than one headword.

Purpose of sense indicator

(Submitted by Louis Cotgrove)

I don’t quite understand the purpose of sense > indicator. Perhaps it needs to be more clearly differentiated from either definition or label

Only one headword per entry

(Submitted by Jan Niestadt)

I can see that sometimes you'd want these to be separate entries, but in other cases I feel putting them in the main entry would be simpler, especially if the variants will never actually be processed or displayed as separate entries, just shown as a list in the main entry. Again, this is only my opinion, but it would be nice if the standard was flexible enough to support both approaches, allowing each project to choose what works best for them.

Subsenses

(Submitted by Jan Niestadt)

The way DMLex proposes to model these feels a bit unnatural to me, like it's fighting the domain it's modelling. As far as I understand, to a lexicographer, a subsense actually is a part of the main sense, so it feels natural to model it as contained within the main sense structure. Or alternatively, each sense could be modeled as a separate hierarchical object, with relations to define the tree structure between the entry and its main senses, as well as the main senses and their subsenses. The hybrid structure proposed for DMLex where (sub)senses are all in a flat list as part of the entry, with separate relations indicating their "real" structure feels somewhat awkward and roundabout to me, and an extra hurdle to deal with when processing entries. That's just my opinion, and I can see the arguments to the contrary as well, but standardizing by choosing one approach kind of kills any debate and risks making the standard more difficult to adopt.
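
To make the "extra hurdle" concrete, here is a rough Python sketch of the processing step involved: turning a flat list of senses plus separate relations into a nested structure. The member shape (`memberID`, `role`) follows the JSON relation examples quoted elsewhere in these issues; the relation type name `"subsense"`, the role names `"parent"` and `"child"`, the sense IDs and the function name are all hypothetical.

```python
def build_sense_tree(senses, relations, relation_type="subsense"):
    """Nest a flat list of senses using parent/child relations."""
    nodes = {s["id"]: {"id": s["id"], "subsenses": []} for s in senses}
    children = set()
    for rel in relations:
        if rel["type"] != relation_type:
            continue
        # Map each role name to the ID of the sense that plays it.
        roles = {m["role"]: m["memberID"] for m in rel["members"]}
        parent, child = roles["parent"], roles["child"]
        nodes[parent]["subsenses"].append(nodes[child])
        children.add(child)
    # Top-level senses are those never named as a child; list order is kept.
    return [nodes[s["id"]] for s in senses if s["id"] not in children]


senses = [{"id": "s1"}, {"id": "s1a"}, {"id": "s2"}]
relations = [{
    "type": "subsense",
    "members": [
        {"memberID": "s1", "role": "parent"},
        {"memberID": "s1a", "role": "child"},
    ],
}]
print(build_sense_tree(senses, relations))
```

The step is short, but every consumer that wants the hierarchy has to perform it, which is the cost being weighed against the simplicity of a flat list.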

Goals that are (probably) out of scope

(Submitted by Jan Niestadt)

I have a few other, more practical/technical questions, such as how one would effectively query a DMLex implementation across its hierarchical and relational structures, and how to efficiently prepare entries for presentation. I understand that DMLex is designed as an abstract data model, but I feel that seeing how it might be implemented would help to properly evaluate it. It would be great to have a proof of concept to try out at some point.

Whitespace rules are not defined in XML

Whitespace rules in the XML are not clearly defined. For example consider this

                <exampleTranslation>
                    <text>
                            Koroner <collocateMarker lemma="provést">provedl</collocateMarker>
                            <headwordMarker>pitvu</headwordMarker>.
                    </text>
                </exampleTranslation>

How do we know that the whitespace between 'provedl' and 'pitvu' is a single character, while the whitespace before 'Koroner' and after '.' is not part of the translation?

Blocking implementation of #61
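
The ambiguity is easy to reproduce with a standard XML parser. The Python sketch below parses the snippet above (re-indented, otherwise unchanged) with `xml.etree.ElementTree` and prints the character data of `<text>` exactly as the parser reports it; the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<exampleTranslation>
    <text>
        Koroner <collocateMarker lemma="provést">provedl</collocateMarker>
        <headwordMarker>pitvu</headwordMarker>.
    </text>
</exampleTranslation>"""

text_element = ET.fromstring(SNIPPET).find("text")

# itertext() yields the element's text, each child's text and each child's
# tail in document order, without altering any whitespace.
raw = "".join(text_element.itertext())
print(repr(raw))

# Splitting on whitespace recovers the intended words, but only by assuming
# that every run of whitespace stands for exactly one space.
print(raw.split())
```

The parser hands back the leading and trailing indentation, and a newline plus indentation between 'provedl' and 'pitvu', so any rule for what counts as content has to come from the DMLex specification rather than from XML itself.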

Pseudocode in the Examples section uses IDs

All the pseudo-code examples in section ‘A.1 Examples’, which are supposed to demonstrate the data model at a serialization-independent level, use IDs to indicate what relations point to. This is problematic because the data model doesn’t actually allow any ID properties at the model-level.

Perhaps it might be better to replace the pseudocode in this section with diagrams like in my “unofficial introduction”: https://www.lexiconista.com/dmlex/

Downside: It’s a lot of work to draw these diagrams. They cannot be generated automatically.

Upside: The diagrams really do make it clearer to human readers how the model “ticks” at model-level. People who have read the “unofficial introduction” have responded very well to them.

Uniqueness constraints

(Submitted by Louis Cotgrove)

Is a unique definition or headword really required? I can foresee some use cases where this could be desired - can uniqueness be controlled by the entryID?

DMLex is too different from traditional ways of modelling things

(Submitted by Jan Niestadt while commenting on DMLex’s approach to modelling variant headwords)

Finding the right balance is tricky, but for better or worse, I would lean a bit more towards supporting more existing practices to try to get everyone on board.

Parent-child relations being top-down versus bottom-up

(Submitted by Jonatan Steller, talking about his DMLex implementation in TYPO3)

An easy challenge during the implementation was the direction in which
parent/child relations are designed in the part of the spec dealing with
relational databases. The spec decides in favour of children indicating
their parents while the automatic forms generated by TYPO3 can be used
more easily (and in a more user-friendly way) when parents identify
their children. The spec already treats the respective section as a
suggestion rather than normative content, but a simple note on
parent/child relations being possible top-down or bottom-up could help
others implement DMLex with less friction.
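
As a rough illustration of why such a note would be cheap to honour, the two directions carry the same information and can be converted into each other. The sketch below uses made-up rows and hypothetical names (`entry_id`, `children_by_parent`, `parent_by_child`); it is not taken from the spec or from the TYPO3 implementation.

```python
from collections import defaultdict

# Bottom-up storage: each child row names its parent (hypothetical data).
senses = [
    {"id": "s1", "entry_id": "e1"},
    {"id": "s2", "entry_id": "e1"},
    {"id": "s3", "entry_id": "e2"},
]


def children_by_parent(rows, parent_key):
    """Turn child-to-parent references into parent-to-children lists,
    keeping the original row order as the child order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[parent_key]].append(row["id"])
    return dict(grouped)


def parent_by_child(children):
    """Inverse direction: parent-to-children lists back to child-to-parent."""
    return {
        child: parent
        for parent, kids in children.items()
        for child in kids
    }


top_down = children_by_parent(senses, "entry_id")
bottom_up = parent_by_child(top_down)

print(top_down)   # {'e1': ['s1', 's2'], 'e2': ['s3']}
print(bottom_up)  # {'s1': 'e1', 's2': 'e1', 's3': 'e2'}
```

One design difference worth noting: the top-down form stores the order of children implicitly in the list, whereas the bottom-up form needs a separate ordering column if order matters.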

Doing more with examples

(Submitted by Jonatan Steller, talking about his DMLex implementation in TYPO3)

There is one aspect that I think should not go
unnoticed as it illustrates the neat design of DMLex. The lexicographic
resources we produce at the Academy of Sciences and Literature Mainz
often contain historical examples, as in the case of field names or
names of historical persons. To accommodate this, we simply added
properties like "period", "locationRelation" and "agentRelation" to the
existing "Examples" class and allowed for an "example" property in the
"Entry" class in addition to the "Sense" class. Furthermore, we needed
to attach frequency data for multiple countries to both "Entry" and "Sense" and
were able to just add a respective class and property as a sort-of
custom module. This is to highlight that I have become very fond of the
modular design of the spec because implementers like me may need this
sort of flexibility.

Cardinality of core objects

Do we really want to allow lexicographic resources with zero entries? Similarly, entries with zero senses?

Define which elements have `id`

It is not defined in the standard which elements have an id.

Some elements (partOfSpeech and label) cannot have an id, as this would not be representable in the JSON and RDF serializations.

Issue arose during implementation of #61

No sense-specific parts of speech

(Submitted by David Lindemann)

There are dictionaries that don’t attach a POS value to the headword but have POS sections inside the entry (the headword is presented as POS neutral). For example, this entry: https://www.euskaltzaindia.eus/index.php?option=com_oehberria&task=bilaketa&Itemid=413&lang=eu-ES&query=aditu

In such a case, would you force a re-modeling of the inner entry hierarchy (in this case, making three entries out of one, so that no entry is POS-ambiguous)? There are reasons for modeling a dictionary the way you see in the example. In Basque, for example, there are many nominals that can be interpreted as nouns or adjectives, and the border is not clear. The above example, aditu, means “expert”, and in English, too, it is not that clear where it is an ADJ and where a NOUN (“I am a Basque expert / I am an expert Basque”). Another reason is that if you describe inflected forms (and there are a lot of forms for each lemma in Basque), the entries become very redundant if you have to list all possible forms in each of them. Related to that: we have frequency data for Basque word forms, but we are not able to say in each case whether a form is ADJ or NOUN. And if you have an inflected past participle, is this a verb form or a nominal (its inflection behaves like that of nouns and adjectives)?

In Ontolex-on-Wikibase, I am modeling that as follows: I introduce a POS-disambiguating property at sense level (“this sense applies to this lemma as noun”), and I do the same for inflected forms if it is clear what POS a certain form may have (there can be more than one). Examples: “aditu” with POS on senses, “aditu” with POS on forms (different sources / tools give different values here, which is what I want to record in that case).

Also in German, there are dictionaries that have such POS-like sections (not across POS, but refining POS). Some dictionaries group verb senses inside an element describing a syntactic entity (“verb transitive” vs. “verb intransitive”, “verb reflexive”, etc.) - example.
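
A minimal sketch of what sense-level POS could look like as a data structure, assuming a hypothetical optional `parts_of_speech` list on the sense that overrides the entry-level value when present. None of these class or field names come from the DMLex specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sense:
    definition: str
    # Hypothetical: optional sense-level POS, may hold more than one value.
    parts_of_speech: List[str] = field(default_factory=list)


@dataclass
class Entry:
    headword: str
    # Entry-level POS; left empty for a POS-neutral headword.
    parts_of_speech: List[str] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)


def effective_pos(entry, sense):
    """Sense-level POS wins if present, otherwise fall back to the entry."""
    return sense.parts_of_speech or entry.parts_of_speech


# POS-neutral headword with POS recorded per sense (illustrative data only).
aditu = Entry(
    headword="aditu",
    senses=[
        Sense("expert, as in 'a Basque expert'", ["noun"]),
        Sense("expert, as in 'an expert Basque'", ["adjective"]),
        Sense("expert, unclear reading", ["noun", "adjective"]),
    ],
)

for sense in aditu.senses:
    print(sense.definition, "->", effective_pos(aditu, sense))
```

With this shape a single entry can stay POS-neutral without being split, while dictionaries that do assign POS at entry level keep working unchanged through the fallback.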

memberType has property memberType

The memberType object has a property called memberType. We should probably avoid clashes between property and element names (to simplify serialization) and rename the property to type.
