
schemas's Introduction

Global WordNet Schemas

Read the documentation here

Building the metadata

index.html is built with Pandoc

pandoc -t html -H template/header -A template/afterbody -B template/beforebody --shift-heading-level-by=-1 index.md > index.html

wn.rdf is generated from the Turtle with Rapper

rapper -i turtle -o rdfxml-abbrev wn.ttl > wn.rdf

schemas's People

Contributors

arademaker, fcbond, goodmami, jmccrae, simongray

schemas's Issues

Versioning

It would be good to have a tagged release with the 1.0 DTD (the version we are using for LiLT) and then maybe start a new file (WN-LMF-1.1.dtd) for the new DTD, so that people are not surprised by changes.

`causes` and `is_caused_by` have the same descriptions as `involved_result` and `result`

On the schemas page, the descriptions for both pairs are exactly the same:

A relation between two concepts where concept B comes into existence as a result of concept A.
A relation between two concepts where concept A comes into existence as a result of concept B.

Yet in the more extensive docs, the entries for causes, is_caused_by, involved_result, and result do seem to have semantically distinguishable (but similar) meanings.

I guess the descriptions for causes and is_caused_by are the most "wrong" on the schemas page as they differ more significantly from the descriptions in the GWA docs.

Order of words inside a synset

The order of the words within a synset is not specified in the format; however, this is important information that many wordnets have. I would propose adding a new tag, Member, inside the Synset to indicate the order of the words in the synset.

For example:

<Synset id="ewn-02327239-n" ili="i47769" partOfSpeech="n">
  <Definition>(usually informal) especially a young rabbit</Definition>
  <SynsetRelation relType="hypernym" target="ewn-02326697-n"/>
  <Member>bunny</Member>
  <Member>bunny rabbit</Member>
</Synset>
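A minimal sketch of how a consumer could read that order, assuming the proposed `<Member>` element (it is only a proposal from this issue, not part of any released DTD) and using Python's standard `xml.etree.ElementTree`. The function name `ordered_members` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# <Member> is the element proposed in this issue, not part of a released DTD.
SYNSET_XML = """
<Synset id="ewn-02327239-n" ili="i47769" partOfSpeech="n">
  <Definition>(usually informal) especially a young rabbit</Definition>
  <SynsetRelation relType="hypernym" target="ewn-02326697-n"/>
  <Member>bunny</Member>
  <Member>bunny rabbit</Member>
</Synset>
"""


def ordered_members(synset_xml: str) -> list[str]:
    """Return the member words of a synset in document order."""
    synset = ET.fromstring(synset_xml)
    # findall() yields direct children in the order they appear in the file,
    # so the file order doubles as the word order.
    return [member.text or "" for member in synset.findall("Member")]


print(ordered_members(SYNSET_XML))
```

Because the order is carried by document position, no extra rank attribute would be needed under this proposal.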

Synsets are inferred to be lexical entries in wordnet RDF ontology

Hi,
we would like to point out that according to the current definition of the property wn:partOfSpeech (which has domain ontolex:LexicalEntry), synsets with parts of speech are inferred to be of type lexical entry. This means for example that they have to have at least one lexical form which does not make sense with synsets....

Cheers,
Andrea and Fahad

LexicalEntry ids

In the DTDs, a LexicalEntry has an identifier defined at https://github.com/globalwordnet/schemas/blob/master/WN-LMF-1.1.dtd#L35

The type ID, https://www.w3.org/TR/REC-xml/#id, is quite restricted and can potentially be an issue for words in other languages with accents, etc. Nevertheless, I want to preserve legibility and avoid creating extra artificial ids. Ideally, I would like a 1-1 relation with the URI used in the RDF encoding. In URIs we can use %-escapes, though "%" is not allowed in an XML ID. Any ideas?
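To make the tension concrete, here is a rough sketch in Python. The helper `looks_like_xml_name` is hypothetical and only approximates the XML Name production (it is not the full rule from the spec), and the `ewn-…-n` id pattern is just borrowed from the examples in these issues. As far as I understand the Name production, accented letters are allowed, while characters such as `%`, spaces, and apostrophes are not.

```python
from urllib.parse import quote


def looks_like_xml_name(value: str) -> bool:
    """Rough approximation of the XML Name rule used by the ID type.

    Assumption: Unicode letters/digits plus '_', ':', '-', '.' are accepted.
    The real production is defined by explicit character ranges.
    """
    if not value:
        return False
    first_ok = value[0].isalpha() or value[0] in "_:"
    rest_ok = all(ch.isalnum() or ch in "_:-." for ch in value[1:])
    return first_ok and rest_ok


for lemma in ["café", "rock'n'roll", "bunny rabbit"]:
    raw_id = f"ewn-{lemma}-n"
    escaped_id = f"ewn-{quote(lemma)}-n"  # percent-escaped, as in a URI
    print(raw_id, looks_like_xml_name(raw_id))
    print(escaped_id, looks_like_xml_name(escaped_id))
```

Under this approximation the accent itself is not the obstacle; the percent-escaped form is what fails, because `%` cannot appear in an ID. A 1-1 mapping to URIs would therefore need some other escaping convention on the XML side.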

Can we add a logo to the metadata

Many (some) projects have logos, and these can be useful to display as visual representations. Can we add a logo to the metadata?

Core / Extensions

Strictly speaking, this is not an issue but a set of tentative proposals, open for discussion.
Here they go:

1 - It is appropriate to distinguish between core levels and extensions

2 - The core level should define
- the core contents of a word net
- possibly an extension mechanism, but not the extensions themselves

3 - The core level should deal only with stand-alone internal coherence and well-formedness. It excludes external references to external databases. Internal references should be checked.

4 - Extensions are permissible and should have their own namespace. They can deal with external reference. They are responsible for their own validation.

5 - Each extension should provide a strip-down mechanism that produces core-conformant data, in effect stripping down non-core data.
This can be easily done with each extension providing an XSLT transform script "to_core.xsl" that, when invoked, will filter non-core data away.
Invocation can then be triggered by simply including this in the XML header:
<?xml-stylesheet type="text/xsl" href="to_core.xsl"?> or passing the XSLT to DOM builders.
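The strip-down idea in point 5 can also be sketched outside XSLT. The following is a Python stand-in for the proposed "to_core.xsl", not the XSLT itself. It assumes that core content lives in no namespace (plus the Dublin Core namespace for `dc:` attributes) and that every extension uses its own namespace, as proposed in point 4. The function names and the `ext` namespace are invented for the example.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def namespace_of(name: str) -> str:
    """Return the namespace URI of an ElementTree name like '{uri}local'."""
    return name[1:].split("}", 1)[0] if name.startswith("{") else ""


def strip_to_core(root: ET.Element, core_namespaces=("", DC_NS)) -> ET.Element:
    """Remove every attribute and element outside the core namespaces."""
    # Materialize the element list first so removals do not disturb iteration.
    for elem in list(root.iter()):
        for attr in list(elem.attrib):
            if namespace_of(attr) not in core_namespaces:
                del elem.attrib[attr]
        for child in list(elem):
            if isinstance(child.tag, str) and namespace_of(child.tag) not in core_namespaces:
                elem.remove(child)
    return root


# Hypothetical document mixing core data with an "ext" extension.
DOC = """
<Lexicon xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ext="http://example.org/ext" id="demo">
  <Synset id="demo-1" dc:subject="noun.animal" ext:note="extra">
    <ext:Extra>not core</ext:Extra>
    <Definition>a definition</Definition>
  </Synset>
</Lexicon>
"""

core = strip_to_core(ET.fromstring(DOC))
print(ET.tostring(core, encoding="unicode"))
```

An XSLT version would have the advantage mentioned above of being triggered directly from the XML header; the Python version is only meant to show how small the filtering rule is.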

dc:language metadata

The WN-LMF 1.0 DTD allows 14 of the 15 Dublin Core elements as attributes (setting aside the attribute vs element thing (#5) for a moment) on many LMF elements. The missing one is dc:language. Some of the LMF elements have their own language attribute, but not all that allow Dublin Core metadata.

Is there a reason for this situation? Can we just add in dc:language and use that instead of the default namespace's language attribute?

How to specify wordnet project "dependencies"

We have non-English wordnets that build on an existing wordnet (generally PWN). For instance, the Japanese Wordnet is built on top of PWN 3.0's synsets and relations but contributes lemmas, definitions, and examples (here in an older LMF variant retrieved from http://compling.hss.ntu.edu.sg/wnja/index.en.html):

<LexicalResource>
        <GlobalInformation label="Japanese WordNet 1.1 by NICT"/>
        <!-- produced on 2010-10-22 -->
<Lexicon languageCoding='ISO 639-3' label='Japanese Wordnet' language='jpn' owner='NICT' version='1.1'>
   <LexicalEntry id ='w239520'>
      <Lemma writtenForm='夜半' partOfSpeech='n'/>
      <Sense id='w239520_15168185-n' synset='jpn-1.1-15168185-n'/>
      <Sense id='w239520_15167027-n' synset='jpn-1.1-15167027-n'/>
   </LexicalEntry>
   ...
   <Synset id='jpn-1.1-15168185-n' baseConcept='3'>
        <Definition gloss="夜の12時。夜中。">
        <Statement example="幼い子供が真夜中まで起きているのを許されるべきではない"/>
        </Definition>
      <SynsetRelations>
         <SynsetRelation targets='jpn-1.1-15228378-n' relType='hype'/>
         <SynsetRelation targets='jpn-1.1-15167027-n' relType='hprt'/>
      </SynsetRelations>
   </Synset>

While I'm not certain of details like how these IDs map to PWN synsets, my point is that there is no explicit encoding of the dependency on PWN 3.0, and I don't see it in the DTD for the current LMF, either. Let's pretend the DTD has something like this:

<!ELEMENT Lexicon (Require*, LexicalEntry+, Synset*)>
...
<!ELEMENT Require EMPTY>
<!ATTLIST Require
    lexicon NMTOKEN #REQUIRED
    version CDATA #REQUIRED>

This would allow something like:

<Lexicon id="jwn" ...>
  <Require lexicon="ewn" version="2020" />
  ...

I'd like for this to extend to other kinds of extensions, too. For instance, we could extract all taboo words or scientific names into wordnet "layers" that can be enabled or disabled as needed.
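As a sketch of what a loader could do with such declarations, assuming the hypothetical `<Require>` element above and a made-up set of installed (lexicon, version) pairs:

```python
import xml.etree.ElementTree as ET

# <Require> is the element imagined in this issue; it is not in the DTD.
LEXICON_XML = """
<Lexicon id="jwn">
  <Require lexicon="ewn" version="2020" />
</Lexicon>
"""


def missing_requirements(lexicon_xml: str, installed: set) -> list:
    """Return required (lexicon, version) pairs that are not installed."""
    lexicon = ET.fromstring(lexicon_xml)
    required = {
        (req.get("lexicon"), req.get("version"))
        for req in lexicon.findall("Require")
    }
    return sorted(required - installed)


# Only ewn 2019 is available here, so the 2020 dependency is reported.
print(missing_requirements(LEXICON_XML, {("ewn", "2019")}))
```

The same check would work for "layers": a layer would simply declare the base lexicon it requires.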

Pronunciation information

We are looking to add some pronunciation information to English WordNet and it would be good to add this as a schema extension. As I see it we would need to have the following information

  • The actual form
  • The notation scheme (e.g., IPA)
  • The dialect, encoded with an ISO 3166 code
  • Further notes: free text describing the pronunciation in more detail

As such, I would suggest something like the following:

<LexicalEntry id="ewn-transport-n">
  <Lemma writtenForm="transport" partOfSpeech="n">
    <Pronunciation notation="ipa" dialect="GB" notes="RP">/tɹænzˈpɔːt/</Pronunciation>
    <Pronunciation notation="ipa" dialect="GB" notes="RP">/tɹɑːnˈspɔːt/</Pronunciation>
    <Pronunciation notation="ipa" dialect="US" notes="GenAM">/tɹænzˈpɔɹt/</Pronunciation>
  </Lemma>
  <Sense>...</Sense>
</LexicalEntry>
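For illustration, a consumer could group the suggested elements by dialect roughly as follows. This is a sketch that assumes the proposed `<Pronunciation>` element and its `dialect` attribute exactly as written above; `pronunciations_by_dialect` is an invented helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# <Pronunciation> follows the suggestion in this issue (not a released schema).
ENTRY_XML = """<LexicalEntry id="ewn-transport-n">
  <Lemma writtenForm="transport" partOfSpeech="n">
    <Pronunciation notation="ipa" dialect="GB" notes="RP">/tɹænzˈpɔːt/</Pronunciation>
    <Pronunciation notation="ipa" dialect="GB" notes="RP">/tɹɑːnˈspɔːt/</Pronunciation>
    <Pronunciation notation="ipa" dialect="US" notes="GenAM">/tɹænzˈpɔɹt/</Pronunciation>
  </Lemma>
</LexicalEntry>"""


def pronunciations_by_dialect(entry_xml: str) -> dict:
    """Map each dialect code to its transcriptions, in document order."""
    entry = ET.fromstring(entry_xml)
    grouped = defaultdict(list)
    for pron in entry.iter("Pronunciation"):
        grouped[pron.get("dialect", "")].append(pron.text or "")
    return dict(grouped)


print(pronunciations_by_dialect(ENTRY_XML))
```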

DTD 1.1

Why is partOfSpeech an attribute of the Lemma and not an attribute of LexicalEntry?

Missing relation type in DTD

The following wordnet pointers don't appear to have entries (relType) in the DTD:

pointer string in the wndb2lmf converter
$ verb_group
;u domain_usage
-u missing

Why is ILI a required attribute for synsets in the WN-LMF schemas?

I just implemented WN-LMF as a new export format for DanNet to be queried using goodmami/wn.

Limiting myself to the relations defined directly by GWA is fine (it is a limited format, after all), however, I also ran into an issue where elements without a corresponding ILI key resulted in a failed import of the DanNet dataset in the goodmami/wn library.

I tracked the issue back to here where the various WN-LMF.dtd files state:

<!ATTLIST Synset
    id ID #REQUIRED
    ili CDATA #REQUIRED
    ... >

I must say that I find this requirement to be quite limiting.

Why should a WordNet be fully linked to the ILI to be valid as WN-LMF...? AFAIK only the English WordNet fits this requirement and only because (and correct me if I'm wrong here) the CILI is essentially just the repurposed, core structure of the Princeton WordNet.

I suggest that this requirement be scrapped entirely.

Definition of 'exemplifies' and 'is_exemplified_by' in ontology

IMO the definition of exemplifies is not clear and makes me think of a hyponym/hypernym relation.

exemplifies
rdfs:comment
A relation between two concepts where B is a type of concept A

is_exemplified_by
rdfs:comment
A relation between two concepts where A an example of the type B

Cf. also "example for vs. example of".

Compare the definitions given on the github page:

exemplifies: Indicates the usage of this word
is_exemplified_by: Indicates a word involved in the usage described by this word

In order to understand I looked at some examples, e.g.

"hence" exemplifies "archaism, archaicism"
"archaism, archaicism" is_exemplified_by "hence"

IMO this relation would be easier to understand if we put it like this:

has_example
A relation between two concepts A and B where A has example B.

is_example_for
A relation between two concepts A and B where A is an example for B.

Example:
"hence" is_example_for "archaism, archaicism"
"archaism, archaicism" has_example "hence"

similar vs verb_group

The wn30.xml from https://github.com/bond-lab/omw-data/tree/main/wns/pwn30 uses similar in place of the verb_group relation documented at https://globalwordnet.github.io/schemas/. The DTD files do not contain verb_group:

    <Synset id="pwn-02136271-v" ili="i32435" partOfSpeech="v" dc:subject="verb.perception">
      <Definition >to throw or bend back (from a surface); &quot;Sound is reflected well in this auditorium&quot;</Definition>
      <SynsetRelation relType="domain_topic" target="pwn-06094774-n" />
      <SynsetRelation relType="similar" target="pwn-02136533-v" />
      <SynsetRelation relType="hyponym" target="pwn-02136533-v" />
      <SynsetRelation relType="hyponym" target="pwn-02766925-v" />
    </Synset>
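One way to find the affected relations is to look for `similar` on verb synsets. The sketch below does that with Python's standard library; the XML is a shortened copy of the excerpt above (the `dc:subject` attribute and the definition are dropped so no namespace declaration is needed), and `verb_similar_targets` is an invented name.

```python
import xml.etree.ElementTree as ET

SYNSET_XML = """<Synset id="pwn-02136271-v" ili="i32435" partOfSpeech="v">
  <SynsetRelation relType="domain_topic" target="pwn-06094774-n" />
  <SynsetRelation relType="similar" target="pwn-02136533-v" />
  <SynsetRelation relType="hyponym" target="pwn-02136533-v" />
</Synset>"""


def verb_similar_targets(synset: ET.Element) -> list:
    """Return targets of 'similar' relations if the synset is a verb."""
    if synset.get("partOfSpeech") != "v":
        return []
    return [
        rel.get("target")
        for rel in synset.findall("SynsetRelation")
        if rel.get("relType") == "similar"
    ]


print(verb_similar_targets(ET.fromstring(SYNSET_XML)))
```

Whether each such relation really stands for a verb group is an assumption that would need checking against the original PWN pointers.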

Validation Schema (dc + foreign keys + namespaces + sensekeys)

The bad news

The dc: attributes are defined as elements, not attributes, by Dublin Core, so any attempt to validate against this external reference will fail. How can we see this?
The namespace URL http://purl.org/dc/elements/1.1/ redirects to
http://dublincore.org/specifications/dublin-core/dcmi-terms/2012-06-14/?v=elements

This page maps the URL to a schema location and contains the following:

Target namespace: http://purl.org/dc/elements/1.1/
Schema location: http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd

The latter dc.xsd can be downloaded and defines

  <xs:element name="title" substitutionGroup="any"/>
  <xs:element name="creator" substitutionGroup="any"/>
  <xs:element name="subject" substitutionGroup="any"/>
  <xs:element name="description" substitutionGroup="any"/>
  <xs:element name="publisher" substitutionGroup="any"/>
  <xs:element name="contributor" substitutionGroup="any"/>
  <xs:element name="date" substitutionGroup="any"/>
  <xs:element name="type" substitutionGroup="any"/>
  <xs:element name="format" substitutionGroup="any"/>
  <xs:element name="identifier" substitutionGroup="any"/>
  <xs:element name="source" substitutionGroup="any"/>
  <xs:element name="language" substitutionGroup="any"/>
  <xs:element name="relation" substitutionGroup="any"/>
  <xs:element name="coverage" substitutionGroup="any"/>
  <xs:element name="rights" substitutionGroup="any"/>

All elements are declared as substitutable for the abstract element any,
which means that the default type for all elements is dc:SimpleLiteral.

The good news

Some documents specify the schema they expect to be validated against, typically using
xsi:noNamespaceSchemaLocation and/or xsi:schemaLocation attributes.

However, normally this isn't what you want. Usually the document consumer should choose the schema, not the document producer. This is what I did here where validation data is split according to namespaces between dc.xsd and WN-LMF-1.1.xsd.

Foreign-key attributes in the schema should have their own namespaces

dc: attributes are obviously metadata (I have grouped them into a Meta attribute group) except

  • dc:identifier (the sensekey): this acts more as a foreign key into another database and is the umbilical cord to the wn31 data, a sort of birth certificate with a mention of the parents. As such it means nothing within the confines of WordNet2019, as pointed out elsewhere: it is NOT a sense (nor synset) pointer. Besides, 38 senses in adj alone have no dc:identifier. That's why I call it a foreign key. As such it should have its own namespace. I would suggest changing dc:identifier to wn31:sensekey.

  • arguably dc:subject is another umbilical-cord foreign key into wn31 in which case it should be transformed into wn31:lexfile. ... But it is also a reference to the lexical file it belongs to. If we anticipate more lexical files, it should be a lexfile attribute with the current (null) namespace.

  • likewise the ili attribute is a foreign key. It points to nothing within this database. I would suggest changing it to ili:id with its own namespace.

Need for a 'sensekey' attribute

This leaves the problem of sensekeys that would have a meaning within the current database. If they have such a meaning, they are generated, not copied. Incidentally, let me mention that each version of WordNet generates its own sensekeys; the grinder tool does that. It turns out the sensekeys can be generated. I have worked on an XSLT-based transformer tool that does just that in a declarative way (XML-to-XML XSLT transformation description) and is to be found here. More on this later. The transformer adds a sensekey attribute. It would make sense to use it in a standalone generation of index.sense.

Besides being pointers, generated sensekeys have also been considered a measure of stability between successive versions of the WordNet database (if two versions generate the same sensekeys it's highly likely that nothing has changed in the distribution of senses). It can be used as such (and given an important weight) by the relaxmapper which is meant to find mappings between hierarchies of data. Again, this makes sense if the sensekeys are generated, not copied.

Need for a 'lexfile' attribute

See above in Foreign-key attributes section.
Another option is to put it in the top element (either LexicalResource or Lexicon) of the XML lexical file (it can then be easily accessed by tools). But the problem will remain when merging.

Factor out SyntacticBehaviour

Allowing it

  • to be a child of a top element (again either LexicalResource or Lexicon) where it could be defined, receive an id and content
  • while it could be referenced by an idref in the LexicalEntry

would avoid considerable redundancy ("The banks %s the check" is repeated 7433 times!)

Like this:

<LexicalResource>
  <Lexicon>
...
        <SyntacticBehaviour id='svo1' subcategorizationFrame="The banks %s the check" />
        <SyntacticBehaviour id='sv1'  subcategorizationFrame="The coins %s "/>
        <SyntacticBehaviour id='svo2' subcategorizationFrame="They %s the bags on the table" />
        <SyntacticBehaviour id='svo3' subcategorizationFrame="They %s the coin " />
...
        <LexicalEntry id="ewn-inoculate-v" >
            <Lemma writtenForm="inoculate" partOfSpeech="v" />
            <Sense id="ewn-inoculate
            <SyntacticBehaviour idref="svo1" senses="ewn-inoculate-v-00086587-03"/>
            <SyntacticBehaviour idref="sv1" senses="ewn-inoculate-v-00053234-01 ewn-inoculate-v-00055835-01 ewn-inoculate-v-00188584-01"/>
            <SyntacticBehaviour idref="svo2" senses="ewn-inoculate-v-00188584-01"/>
            <SyntacticBehaviour idref="svo3" senses="ewn-inoculate-v-00086587-03 ewn-inoculate-v-00834278-01" />
...
        </LexicalEntry>
  </Lexicon>
</LexicalResource>

Note I left out the problem of naming these frames.
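The factoring itself is mechanical. Here is a small Python sketch of it, assuming frames are identified by their exact text and using generated ids ("frame1", "frame2", ...) as placeholders, since the naming question is left open above. The second entry and its sense id are invented for the example.

```python
def factor_frames(entries: dict) -> tuple:
    """Split per-entry frames into a shared frame table plus references.

    entries maps an entry id to a list of (frame_text, sense_ids) pairs.
    Returns (frame_ids, refs): frame_ids maps frame text to a generated id,
    refs maps each entry id to a list of (frame_id, sense_ids) pairs.
    """
    frame_ids = {}
    refs = {}
    for entry_id, frames in entries.items():
        refs[entry_id] = []
        for text, senses in frames:
            if text not in frame_ids:
                # Placeholder naming scheme; real names are an open question.
                frame_ids[text] = f"frame{len(frame_ids) + 1}"
            refs[entry_id].append((frame_ids[text], senses))
    return frame_ids, refs


ENTRIES = {
    "ewn-inoculate-v": [
        ("The banks %s the check", ["ewn-inoculate-v-00086587-03"]),
        ("The coins %s ", ["ewn-inoculate-v-00053234-01"]),
    ],
    # Hypothetical second entry, only to show a frame being shared.
    "example-entry-v": [
        ("The banks %s the check", ["example-entry-v-01"]),
    ],
}

frame_ids, refs = factor_frames(ENTRIES)
print(frame_ids)
print(refs)
```

Each distinct frame text is stored once, and every entry only carries an id reference plus its sense list.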

wn:is_caused_by and wn:result have identical definition

Both are defined as

A relation between two concepts where concept A comes into existence as a result of concept B@en

Also, wn:causes and wn:involved_result share the same definition:

A relation between two concepts where concept B comes into existence as a result of concept A@en
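Duplicates like these can be found mechanically. The sketch below groups relation names by their normalized comment text. It assumes the comments have already been extracted into a plain dictionary (no RDF library is used); the four definitions are the ones quoted in this issue, and the `entails` text is a placeholder added only so that one entry is unique.

```python
from collections import defaultdict

COMMENTS = {
    "causes": "A relation between two concepts where concept B comes into existence as a result of concept A",
    "involved_result": "A relation between two concepts where concept B comes into existence as a result of concept A",
    "is_caused_by": "A relation between two concepts where concept A comes into existence as a result of concept B",
    "result": "A relation between two concepts where concept A comes into existence as a result of concept B",
    # Placeholder text, not the real definition.
    "entails": "placeholder definition for a relation with a unique comment",
}


def shared_definitions(comments: dict) -> list:
    """Return groups of relation names whose comment text is identical."""
    by_text = defaultdict(list)
    for name, text in comments.items():
        # Normalize whitespace and case so trivial differences do not hide duplicates.
        by_text[" ".join(text.split()).lower()].append(name)
    return [sorted(names) for names in by_text.values() if len(names) > 1]


print(shared_definitions(COMMENTS))
```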

JSON Schema Synset has no example

In the DTD file, Synset has an example. (https://github.com/globalwordnet/schemas/blob/master/WN-LMF-relaxed-1.0.dtd#L88)

But the Synset in the JSON Schema file has no example field (

"properties": {
"partOfSpeech": {
"enum": [
"noun",
"verb",
"adjective",
"adverb",
"adjective_satellite",
"phrase",
"conjunction",
"adposition",
"other",
"unknown" ]
},
"@id": { "type": "string" },
"ili": { "type": "string" },
"value": { "type": "string" },
"status": { "type": "string" },
"confidenceScore": { "type": "number", "minimum": 0.0, "maximum": 1.0 },
"contributor": { "type": "string" },
"coverage": { "type": "string" },
"creator": { "type": "string" },
"date": { "type": "string" },
"description": { "type": "string" },
"format": { "type": "string" },
"identifier": { "type": "string" },
"publisher": { "type": "string" },
"relation": { "type": "string" },
"source": { "type": "string" },
"subject": { "type": "string" },
"title": { "type": "string" },
"type": { "type": "string" },
"lexfile": { "type": "string" },
"definition": {
"type": "array",
"minItems": 1,
"items": {
"type": "object",
"required": ["gloss"],
"additionalProperties": false,
"properties": {
"gloss": { "type": "string" },
"language": { "type": "string" },
"status": { "type": "string" },
"confidenceScore": { "type": "number", "minimum": 0.0, "maximum": 1.0 },
"contributor": { "type": "string" },
"coverage": { "type": "string" },
"creator": { "type": "string" },
"date": { "type": "string" },
"description": { "type": "string" },
"format": { "type": "string" },
"identifier": { "type": "string" },
"publisher": { "type": "string" },
"relation": { "type": "string" },
"source": { "type": "string" },
"subject": { "type": "string" },
"title": { "type": "string" },
"type": { "type": "string" }
}
}
},
"iliDefinition": {
"type": "object",
"required": ["gloss"],
"additionalProperties": false,
"properties": {
"gloss": { "type": "string" },
"status": { "type": "string" },
"confidenceScore": { "type": "number", "minimum": 0.0, "maximum": 1.0 },
"contributor": { "type": "string" },
"coverage": { "type": "string" },
"creator": { "type": "string" },
"date": { "type": "string" },
"description": { "type": "string" },
"format": { "type": "string" },
"identifier": { "type": "string" },
"publisher": { "type": "string" },
"relation": { "type": "string" },
"source": { "type": "string" },
"subject": { "type": "string" },
"title": { "type": "string" },
"type": { "type": "string" }
}
},
"members": {
"type": "array",
"minItems": 1,
"items": { "type": "string" }
},
"relations": {
"type": "array",
"minItems": 1,
"items": {
"type": "object",
"required": ["relType","target"],
"additionalProperties": false,
"properties": {
"relType": {
"enum": [
"agent",
"also",
"attribute",
"be_in_state",
"causes",
"classified_by",
"classifies",
"co_agent_instrument",
"co_agent_patient",
"co_agent_result",
"co_instrument_agent",
"co_instrument_patient",
"co_instrument_result",
"co_patient_agent",
"co_patient_instrument",
"co_result_agent",
"co_result_instrument",
"co_role",
"direction",
"domain_region",
"domain_topic",
"exemplifies",
"entails",
"eq_synonym",
"has_domain_region",
"has_domain_topic",
"is_exemplified_by",
"holo_location",
"holo_member",
"holo_part",
"holo_portion",
"holo_substance",
"holonym",
"hypernym",
"hyponym",
"in_manner",
"instance_hypernym",
"instance_hyponym",
"instrument",
"involved",
"involved_agent",
"involved_direction",
"involved_instrument",
"involved_location",
"involved_patient",
"involved_result",
"involved_source_direction",
"involved_target_direction",
"is_caused_by",
"is_entailed_by",
"location",
"manner_of",
"mero_location",
"mero_member",
"mero_part",
"mero_portion",
"mero_substance",
"meronym",
"similar",
"other",
"patient",
"restricted_by",
"restricts",
"result",
"role",
"source_direction",
"state_of",
"target_direction",
"subevent",
"is_subevent_of",
"antonym"
]
},
"target": { "type": "string" },
"status": { "type": "string" },
"confidenceScore": { "type": "number", "minimum": 0.0, "maximum": 1.0 },
"contributor": { "type": "string" },
"coverage": { "type": "string" },
"creator": { "type": "string" },
"date": { "type": "string" },
"description": { "type": "string" },
"format": { "type": "string" },
"identifier": { "type": "string" },
"publisher": { "type": "string" },
"relation": { "type": "string" },
"source": { "type": "string" },
"subject": { "type": "string" },
"title": { "type": "string" },
"type": { "type": "string" }
}
)

Is this a bug, or is there a reason for it?

Part of speech modifiers

WordNet includes modifiers for adjective position, e.g.,

  • a (attributive, e.g., the coming release)
  • ip (immediate postnominal, e.g., beaches galore)
  • p (predicative, e.g., the house is ablaze)

We should add an attribute to allow this modelling. Perhaps in a language independent way?

It is not clear how to show ranking

In some wordnets (I think PWN and plWordnet) the ranking of senses is meaningful, and I do not know how we should capture this in the DTD.

WordNet part-of-speech vs Ontolex part-of-speech

Scanning through the Turtle file, I noticed that you define your own POS relations and classes rather than use the lexinfo:partOfSpeech relation which is heavily used in the Ontolex specification, which I understand that @jmccrae helped bring to life. I'm unsure why this is the case?

In the Ontolex specification it is specifically stated that

the model abstracts from specific linguistic theory or category systems used to describe the linguistic properties of lexical entries and their syntactic behavior, encouraging reuse of existing data category systems or linguistic ontologies.

I think that this is an excellent ideal as it makes integration of existing datasets mostly a matter of merging sets of triples. The second best option would be having some kind of derived lexinfo relation triple which can be inferred via equivalent/subclass relations.

Unfortunately, the GWA schema's partOfSpeech relation and PartOfSpeech class are proprietary and not linked to any external definitions. I have used Ontolex as the basis for the new version of DanNet, so my part-of-speech tags are all defined using lexinfo:partOfSpeech relation rather than wn:partOfSpeech.

How do you suggest we bridge this gap? The way I see it, either version 1.2 of the schema removes this bit and datasets use lexinfo:partOfSpeech directly -OR- a direct equivalency to lexinfo:partOfSpeech is established in the schema -- preferably the first as it simplifies things.

I could also add both wn and lexinfo relations for all LexicalEntry classes in the new DanNet, but that's both confusing and a messy fix IMO. Better to fix the schema than work around its flaws. Having competing standards for this is not a great situation.


The relevant part of the schema:

:partOfSpeech a owl:ObjectProperty ;
  rdfs:domain ontolex:LexicalEntry ;
  rdfs:range :PartOfSpeech ;
  rdfs:label "part of speech"@en ;
  rdfs:comment "The syntactic class of the entry, e.g., noun, verb"@en .

:PartOfSpeech a owl:Class ;
  rdfs:label "part of speech"@en ;
  rdfs:comment "The syntactic class of the entry, e.g., noun, verb"@en ;
  owl:oneOf (
    :noun :verb :adjective :adverb :adjective_satellite :named_entity 
    :conjunction :adposition :other_pos :unknown_pos ) .

:noun a :PartOfSpeech ;
  rdfs:label "noun"@en.
:verb a :PartOfSpeech ;
  rdfs:label "verb"@en .
:adjective a :PartOfSpeech ;
  rdfs:label "adjective"@en .
:adverb a :PartOfSpeech ;
  rdfs:label "adverb"@en .
:adjective_satellite a :PartOfSpeech ;
  rdfs:label "adjective satellite"@en .
:named_entity a :PartOfSpeech ;
  rdfs:label "named entity"@en .
:conjunction a :PartOfSpeech ;
  rdfs:label "conjunction"@en .
:adposition a :PartOfSpeech ;
  rdfs:label "adposition"@en .
:other_pos a :PartOfSpeech ;
  rdfs:label "other pos"@en .
:unknown_pos a :PartOfSpeech ;
  rdfs:label "unknown pos"@en .

Lexicographer File Attribute

Should we have an attribute for lexicographer files? Is this something that many (more than one) wordnet projects use?

No `owl:inverseOf` or other inter-Resource relations defined in Ontolex schemas

Most of the relations described at https://globalwordnet.github.io/gwadoc/ have inverse relations (called "reverse" in the documentation). However, none of these have been marked as owl:inverseOf in the Ontolex schema files. I wonder why that is the case? I really hope it's simply a case of a lack of time or an oversight, rather than a conscious omission.

There is a wn-simple-1.1.ttl which seems to be a pre-Ontolex RDF schema and this does retain some useful OWL definitions for inter-Resource relations, including owl:inverseOf. However, it seems that this is fundamentally incompatible with the later Ontolex versions (e.g. wn-lemon-1.1.ttl or wn-lemon-1.1.rdf) -- or at least I see no path to obtain equivalence between the different ontology definitions. If I choose the Ontolex schemas I gain a broader set of relations and build on the most current shared definition of what constitutes a WordNet graph, but on the other hand I lose very valuable OWL information. In my case, I really need both.

I am currently using the 1.1 schema files to construct an ontology for DanNet. I actually expected the various relations to have e.g. owl:inverseOf since this is so heavily featured in the documentation (almost every relation listed has a "reverse" relation) so I never bothered to check the schemas if they actually contained this information. I am relying on OWL reasoning to infer the "missing" RDF triples in order to have a consistent dataset and less duplication of data. The fact that the Ontolex-schemas have relatively sparse ontological information compared to wn-simple-1.1.ttl is really not a great foundation for us at the Center for Language Technology to build a knowledge graph on. The schemas are too simple in their current incarnations.

I appreciate the work you're doing to consolidate WordNet standards and create a common foundation, so I hope that you will reconsider.
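To illustrate what I am relying on the reasoner for, here is a plain-Python sketch of the inference that `owl:inverseOf` would license. It does not use an OWL reasoner or any RDF library; triples are simple tuples, and the inverse pairs are a small hand-written sample based on the "reverse" relations in the documentation.

```python
# Sample of relation pairs documented as reverses of each other.
INVERSES = {
    "hypernym": "hyponym",
    "causes": "is_caused_by",
    "exemplifies": "is_exemplified_by",
}


def with_inverses(triples: set, inverses: dict) -> set:
    """Return the triples plus every triple implied by an inverse pair."""
    # Make the lookup symmetric: hyponym -> hypernym as well as the reverse.
    both = dict(inverses)
    both.update({right: left for left, right in inverses.items()})
    result = set(triples)
    for subject, predicate, obj in triples:
        if predicate in both:
            result.add((obj, both[predicate], subject))
    return result


stored = {("ewn-02327239-n", "hypernym", "ewn-02326697-n")}
for triple in sorted(with_inverses(stored, INVERSES)):
    print(triple)
```

With the inverse declared in the schema, only one direction needs to be stored in the dataset and the other can be derived, which is the duplication I would like to avoid.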

New Relations from plWordNet

The following new synset relations have been suggested by Ewa

  • simple_aspect: czytać "read/be reading (habitual/progressive)" -> przeczytać "have read" [pl]
  • secondary_aspect: kopać "dig/be digging" -> nakopać "have dug out a lot of sth" [pl]
  • feminine_form: pig -> sow
  • masculine_form: pig -> boar
  • young_form: pig -> piglet
  • diminutive: pig -> piggy
  • augmentative: дом "house" -> домище "great house" [ru]
  • anto_gradable: hot <-> cold, warm <-> cool
  • anto_simple: complete <-> incomplete
  • anto_converse: wife <-> husband, employer <-> employee
  • ir_synonym or interregister_synonym: money <-> dough, loot informal, 食べる taberu "eat"<-> 召し上がる meshiagaru "honored person eats" honorific [ja]

I think we could add these quite easily, right?

Add xml:space attribute for WN-LMF format

If a wordnet author wishes to ensure whitespace is preserved in things like examples, definitions, etc. in the WN-LMF format, they should use the xml:space attribute with the value "preserve", but this attribute must be declared in the schema if it is to be used in a valid document. See goodmami/wn#151 (comment) for further discussion.

I'm not advocating for or against this attribute's inclusion as I don't know if there's a real need, but just raising the issue for discussion since it came up in goodmami/wn#151.

wn-simple TTL files missing @prefix declarations

The wn-simple-1.1.ttl, wn-simple-1.2.ttl, and wn-simple-1.3.ttl files are missing the following @prefix annotations:

@prefix cc: <http://creativecommons.org/ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix vann: <http://purl.org/vocab/vann/> .
@prefix voaf: <http://purl.org/vocommons/voaf#> .

As such, they will fail to be processed by an RDF application.
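A quick way to spot such omissions before running a full RDF parser is a rough text scan. The sketch below is regex-based and only approximates Turtle syntax (it ignores multi-line strings, `PREFIX` in SPARQL style, and other edge cases); the sample document and the function name `missing_prefixes` are invented for illustration.

```python
import re


def missing_prefixes(turtle_text: str) -> list:
    """Return prefixes that are used but never declared with @prefix."""
    declared = set(re.findall(r"@prefix\s+([A-Za-z][\w-]*):", turtle_text))
    # Drop IRIs and string literals so "http:" or quoted text is not counted.
    stripped = re.sub(r"<[^>]*>", " ", turtle_text)
    stripped = re.sub(r'"[^"\n]*"', " ", stripped)
    used = set(re.findall(r"\b([A-Za-z][\w-]*):", stripped))
    return sorted(used - declared)


SAMPLE = """@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix wn: <https://example.org/wn#> .

wn:Synset a owl:Class ;
  dc:title "Synset: a set of synonyms"@en ;
  cc:license <https://example.org/license> .
"""

print(missing_prefixes(SAMPLE))
```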

addition of grammaticalGender to ATTLIST for Lexical Entry and Form?

This would be useful for languages with grammatical gender. In languages like Italian, the same Lexical Entry can have singular and plural forms of different genders, so it would be good to have this in the ATTLIST for Form too. I also think part of speech would make more sense in the ATTLIST for Lexical Entry if we start to add grammatical attributes at the level of Lexical Entry.

Dublin Core attributes or elements?

Original comment from @1313ou:

The dc: attributes are defined as elements, not attributes, by Dublin Core, so any attempt to validate against this external reference will fail. How can we see this?
The namespace URL http://purl.org/dc/elements/1.1/ redirects to
http://dublincore.org/specifications/dublin-core/dcmi-terms/2012-06-14/?v=elements

This page maps the URL to a schema location and contains the following:

Target namespace: http://purl.org/dc/elements/1.1/
Schema location: http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd

The proposed change would be to change the DC attributes to elements. The obvious issue with this is that it breaks backwards compatibility, and the guidelines on Dublin Core are not a hard requirement (RDF for example uses attributes).

pertainym

In https://globalwordnet.github.io/schemas/ we have the following:

The set of relations between senses is limited to the following

  1. pertainym: A relational adjective. Adjectives that are pertainyms are usually defined by such phrases as “of or pertaining to” and do not have antonyms. A pertainym can point to a noun or another pertainym

Non-Princeton WordNet Relations:

  1. pertainym: usually an adjective, which can be defined as “of or pertaining to” another word.

In the DTDs:

  1. https://github.com/globalwordnet/schemas/blob/master/WN-LMF-relaxed-1.0.dtd#L195
  2. https://github.com/globalwordnet/schemas/blob/master/WN-LMF-relaxed-1.1.dtd#L206

From https://globalwordnet.github.io/gwadoc/#pertainym, I gather that this corresponds to the \ pointer symbol in PWN (https://wordnet.princeton.edu/documentation/wninput5wn).

So, in conclusion, the page https://globalwordnet.github.io/schemas/ lists the same relation twice as if it were two different relations, but they are the same one, right?

Breaking changes

This issue is meant to collect the changes we would like to make to WN-LMF but have not because doing so would break backward compatibility. When we get to a 2.0 version we have a chance for some simplification and belt-tightening, so it would be a shame if we miss some and have to wait for the next major version.

For better discussion, these issues could be broken up into separate issues (maybe with an appropriate label or milestone to group them?).

Deferred Changes

These are changes we would have made in WN-LMF 1.1 if backwards compatibility were not an issue.

  • Remove <SyntacticBehaviour> from <LexicalEntry>; it became a child of <Lexicon>
  • Remove the senses attribute from <SyntacticBehaviour>; these associations are handled by the subcat attribute on <Sense> elements
  • Make the id attribute on <SyntacticBehaviour> required

Proposed Changes

These are new changes that we might consider

  • Remove <Tag> (edit: in the comments below, a case is made for other uses of <Tag>)

    The use case presented in Bond et al. 2020 ("Some Issues with Building a Multilingual Wordnet") seems more elegantly handled by the script attribute on <Lemma> and <Form>:

    <Lemma writtenForm="头发" partOfSpeech="n" script="Hans" />
    <Form writtenForm="頭髮" script="Hant" />
    <Form writtenForm="tóufa" script="Latn-pinyin" />
    <Form writtenForm="tou2fa5" script="Latn-pinyin-x-numeric" />
    <Form writtenForm="toufa" script="Latn-pinyin-x-simple" />

    Above, if script were limited to ISO15924 script names, then all 3 pinyin variants would be just "Latn", so I used BCP-47-like tags minus the language and region names. The "pinyin" variant and private-use tags "numeric" and "simple" can be used to distinguish them.

  • Remove <Count>? (see comments below)

  • Remove <ILIDefinition>? (see comments below)

  • Remove (apparently) unused attributes?

    • sourceSense on <Definition>
    • lexicalized on <Sense> and <Synset>
    • status on anything with metadata
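The BCP-47-like `script` values described in the `<Tag>` item above ("Latn-pinyin-x-numeric" and so on) can be split into their parts with a few lines of Python. This is a sketch under the assumption that the first subtag is the script, that `x` introduces private-use subtags, and that everything in between is a variant; it is not a full BCP 47 parser, and `parse_script_tag` is an invented name.

```python
def parse_script_tag(tag: str) -> dict:
    """Split a script value into script, variants, and private-use parts."""
    parts = tag.split("-")
    script, rest = parts[0], parts[1:]
    if "x" in rest:
        # Everything after the singleton "x" is private use.
        marker = rest.index("x")
        variants, private_use = rest[:marker], rest[marker + 1:]
    else:
        variants, private_use = rest, []
    return {"script": script, "variants": variants, "private_use": private_use}


for value in ["Hans", "Hant", "Latn-pinyin", "Latn-pinyin-x-numeric", "Latn-pinyin-x-simple"]:
    print(value, parse_script_tag(value))
```

A tool that only cares about the ISO 15924 script can read the first part and ignore the rest, while the pinyin variants remain distinguishable.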
