LexicalEntry ids about schemas HOT 20 OPEN

globalwordnet commented on August 29, 2024
LexicalEntry ids

from schemas.

Comments (20)

arademaker commented on August 29, 2024

but it would be useful to discuss possibilities so as to develop a set of general recommendations for our schemas

This is the plan, and the reason I opened the issue and asked @FredsoNerd to comment here. Of course, I never considered changing the actual ID type in the schema.

Thank you for your suggestions, @goodmami. I will discuss with @FredsoNerd how to implement them.

jmccrae commented on August 29, 2024

As @goodmami pointed out we have some ad-hoc rules in OEWN for special characters:

https://github.com/globalwordnet/english-wordnet/blob/master/scripts/wordnet_yaml.py#L13

A less 'ad-hoc' approach is to replace them with XML character entities such as &apos;

goodmami commented on August 29, 2024

For the XML files I think we should follow XML conventions and use ID for ids. This may also be necessary for validation tools to ensure, e.g., that IDs are unique in a document and that IDREF targets are present. There is no interpretable meaning within the ID strings, and using forms that look like lemmas is only a convenience for human annotators. The actual forms are in <Lemma> and <Form>.

If you must have a LexicalEntry ID be an accurate representation of the lemma, you might try using Punycode (update: see comments below) as it is ASCII-only and might fit in XML's range for IDs. Since IDs cannot start with hyphens, numbers, etc., you'll need to give it an appropriate prefix, which is the recommendation anyway for WN-LMF. The downside of this method is that it won't necessarily be legible to a human. E.g., for fácil you might have own-pt-fcil-5na.
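
As a rough illustration of the Punycode idea (which later comments in this thread no longer recommend), the sketch below uses Python's built-in punycode codec. The helper names and the own-pt prefix handling are assumptions made for this example, not part of WN-LMF or OWN-PT.

```python
def punycode_id(lemma: str, lexicon_id: str = "own-pt") -> str:
    """Build an ASCII-only ID from a lemma using Python's punycode codec."""
    encoded = lemma.encode("punycode").decode("ascii")
    return f"{lexicon_id}-{encoded}"


def lemma_from_punycode_id(entry_id: str, lexicon_id: str = "own-pt") -> str:
    """Recover the lemma by stripping the lexicon prefix and decoding."""
    encoded = entry_id[len(lexicon_id) + 1:]
    return encoded.encode("ascii").decode("punycode")


entry_id = punycode_id("fácil")
print(entry_id)                          # own-pt-fcil-5na
print(lemma_from_punycode_id(entry_id))  # fácil
```

The mapping is reversible, but the encoded part (fcil-5na) is hard for a human to read, which is the downside noted above.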

arademaker commented on August 29, 2024

Thank you, Michael. We were considering using a hash of the lemmas, but Punycode seems more robust. It gives a one-to-one correspondence with the lemma, which may be useful for validation.

goodmami commented on August 29, 2024

Actually, what was the problem with putting the Unicode lemmas directly in the ID value, such as own-pt-fácil-a? I don't think that's disallowed, but the suggestions for names mention avoiding easily confusable sequences, like combining characters when a composed character exists (e.g., a + ◌́ instead of á).
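
One way to avoid the confusable sequences mentioned here is to normalize lemmas to NFC before building IDs. The sketch below uses Python's standard unicodedata module; the helper name and the own-pt-...-a ID pattern are only assumptions taken from the example above.

```python
import unicodedata


def nfc_entry_id(lemma: str, pos: str, lexicon_id: str = "own-pt") -> str:
    """Normalize the lemma to NFC so composed characters are used when they exist."""
    return f"{lexicon_id}-{unicodedata.normalize('NFC', lemma)}-{pos}"


decomposed = "fa\u0301cil"  # 'a' followed by COMBINING ACUTE ACCENT
composed = "f\u00e1cil"     # precomposed 'á'

# The two strings look the same but are different code point sequences...
print(decomposed == composed)  # False
# ...while the IDs built from them are identical after normalization.
print(nfc_entry_id(decomposed, "a") == nfc_entry_id(composed, "a"))  # True
```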

jmccrae commented on August 29, 2024

Accented characters can be used in XML IDs so I don't really see an issue here.

The use of IDs also provides some extra validation to the DTD, namely that IDs are unique and that all references to the IDs actually exist.

1313ou commented on August 29, 2024

English WordNet 2021 sense IDs conform to the old ID definition but not the more recent xsd:ID.
Edit: Sorry I realize now this is more relevant to English WordNet:
globalwordnet/english-wordnet#749

1313ou commented on August 29, 2024

@arademaker, did you consider that hashes are one-way (you cannot recover what you hashed), so the resulting IDs are not legible? The same is true, to a lesser extent, of '_' substitutions for a number of off-limits characters.

1313ou commented on August 29, 2024

@goodmami, Punycode is rather English-centric and may be very cumbersome when more than one character cannot be reduced to ASCII.

goodmami commented on August 29, 2024

@1313ou I'd say it's English-centric only in that it's ASCII-based, but so are some other languages, e.g., Malay or Rotokas. In any case, having looked closer at the XML spec, I suggested in my second comment above that Punycode, or any such encoding, is not necessary as the accented characters can be used in IDs. To be clear, I no longer recommend using it for this purpose, and I've edited my comment above to make this more obvious.

I do suggest that we add some text to the page for ID suggestions, or maybe even a JavaScript-based validator. All we have currently that I can see is:

All synsets must have an ID that starts with ID of the lexicon followed by a dash, e.g., example-en + - + local_synset_id.

The lexicon ID prefix is probably good advice for lexical entries as well because we might have lexical entries for digits or something else that shouldn't appear as the initial character in an XML ID. This means we should have recommendations for lexicon IDs (e.g., that it follows xsd:ID). I'm not sure if RDF has any similar encoding constraints, but those should be taken into account for these recommendations as well.

1313ou commented on August 29, 2024

I don't think the global schema must define IDs beyond the requirement that they be valid xsd:IDs. What's the problem with letting each wordnet define what its IDs look like? The basic reason is that IDs are functionally opaque (and as such should not be parsed), even if it's nice for the lexicographer to recognize something in them. So "recommendations" is the right word.

goodmami commented on August 29, 2024

Right, I'm only suggesting that we write some "recommendations". Even if the current text says "...synsets must have...", it might be better to change that must to should. These recommendations are just to help ensure the lexicons can be validated correctly. Otherwise wordnet authors should be free to design their own conventions.

FredsoNerd commented on August 29, 2024

About the discussion, in fact, as @jmccrae said

Accented characters can be used in XML IDs so I don't really see an issue here.

The problem occurs for some other characters, such as &;()+º',?–!’\, found in OWN-PT. For instance: vapor d’água from 15055442-n; 1º from 02202047-s; Jack, o Estripador from 11077369-n; Miltiade? from 11180952-n.

FredsoNerd commented on August 29, 2024

Our first option was to replace spaces with underscores and then apply some other substitutions, as follows:

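        def make_word_id(written_form, part_of_speech):
            # formatting lexical_entry: spaces become underscores
            written_form_ = written_form.replace(" ", "_")
            word_id = f"word-{written_form_}-{part_of_speech}"

            # every listed punctuation character collapses to "_" (not reversible)
            for char in "&;()+º',?–!’":
                word_id = word_id.replace(char, "_")
            # "/" is mapped to ":" instead
            word_id = word_id.replace("/", ":")
            return word_id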

But we'd like to avoid this ad-hoc solution: in the future, a new character could break the code.

FredsoNerd commented on August 29, 2024

Again, it makes sense to have a global (not depending on a specific language or environment) and reversible mapping instead of generating a hash or random ID:

The use of IDs also provides some extra validation to the DTD, namely that IDs are unique and that all references to the IDs actually exist.

One option is to use the UTF-8 hexadecimal encoding of the lemma, with the part of speech for uniqueness. In this case, for the earlier example "Jack, o Estripador" from 11077369-n, we generate the ID word-4a61636b2c206f2045737472697061646f72-n
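
A minimal sketch of this hexadecimal mapping, in both directions, could look like the following. The function names are hypothetical; the word- prefix and the -n suffix follow the example ID above.

```python
def hex_id(lemma: str, pos: str, prefix: str = "word") -> str:
    """Encode the lemma as the hex digits of its UTF-8 bytes."""
    return f"{prefix}-{lemma.encode('utf-8').hex()}-{pos}"


def lemma_from_hex_id(entry_id: str) -> str:
    """Reverse the mapping: take the middle field and decode it."""
    # rsplit from the right so a prefix that itself contains dashes still works
    _prefix, hex_part, _pos = entry_id.rsplit("-", 2)
    return bytes.fromhex(hex_part).decode("utf-8")


entry_id = hex_id("Jack, o Estripador", "n")
print(entry_id)                     # word-4a61636b2c206f2045737472697061646f72-n
print(lemma_from_hex_id(entry_id))  # Jack, o Estripador
```

The result is reversible and uses only ASCII letters and digits, at the cost of being unreadable to a human annotator.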

What do you think @jmccrae @goodmami @1313ou @arademaker ?

goodmami commented on August 29, 2024

@FredsoNerd thanks for the additional context. Yes, many punctuation characters are excluded from the NAME production in XML, used by ID, etc., so you'll need some way to handle these. But, for reasons @jmccrae and I outlined above, I don't think the WN-LMF should change the use of ID/IDREF/IDREFS in the DTD. How you deal with these characters is thus up to you (maintainers of OWN-PT), but it would be useful to discuss possibilities so as to develop a set of general recommendations for our schemas.

First, if you collapse multiple punctuation characters into a single replacement character (as you currently do with _), you risk collisions when two entries differ only in these punctuation characters, e.g., 1º and the hypothetical 1+. To help here, you might find some way to uniquely enumerate them (own-pt-1_-1-n, own-pt-1_-2-n, etc.), or you might find some way to encode the characters uniquely (own-pt-1-ordm-n, own-pt-1-plus-n). The latter option is easier to implement.
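
The collision risk can be reproduced with a small sketch that mirrors the underscore substitution posted earlier in this thread. The function name is hypothetical, and the lemma 1+ is the hypothetical one from the paragraph above.

```python
def collapsed_id(written_form: str, pos: str) -> str:
    """Collapse every listed punctuation character to '_' (lossy)."""
    word_id = f"word-{written_form.replace(' ', '_')}-{pos}"
    for char in "&;()+º',?–!’":
        word_id = word_id.replace(char, "_")
    return word_id


# Two different lemmas end up with the same ID, so the mapping cannot be reversed.
print(collapsed_id("1º", "n"))  # word-1_-n
print(collapsed_id("1+", "n"))  # word-1_-n
```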

Let's also look at some examples from the Open English Wordnet:

  • comma, replaced with -cm-

    <LexicalEntry id="ewn-Prince_William-cm-_Duke_of_Cumberland-n">
      <Lemma writtenForm="Prince William, Duke of Cumberland" partOfSpeech="n" />
  • colon, replaced with -003a- (not necessary unless using xsd:id)

    <LexicalEntry id="ewn-Capital-003a-_Critique_of_Political_Economy-n">
      <Lemma writtenForm="Capital: Critique of Political Economy" partOfSpeech="n" />
  • apostrophe, replaced with -ap-

    <LexicalEntry id="ewn-John_O-ap-Hara-n">
      <Lemma writtenForm="John O'Hara" partOfSpeech="n" />

So it looks like the OEWN has some ad-hoc rules for replacing those. In addition, spaces are replaced with underscores (_) and dashes (hyphens) are also used literally. The OEWN mixes shortened name-based escapes (e.g., ap for apostrophe) and hexadecimal (e.g., 003a for colon), but I'd suggest sticking to one and using established forms as in XML/HTML escapes, e.g., apos for apostrophe, so you don't need to maintain your own lookup table. Otherwise, the dash-escape-dash pattern isn't so bad, but, to avoid further collisions, you might also escape literal underscores (-lowbar-) and dashes (-dash-). In addition, replace only regular spaces with underscores; other kinds of whitespace (tabs, non-breaking spaces, double-width spaces, newlines, etc.) should be escaped.

To construct an ID, you can then:

  1. Replace disallowed ID characters with the dash-escape-dash patterns
  2. Prefix own-pt- (or some other lexicon ID followed by a dash)
  3. Append a dash and the part-of-speech

To recover the form from the ID, you do those steps in reverse. That is, after stripping the lexicon ID and part of speech and their dashes, all other dash characters indicate escape patterns to be unescaped.
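
The construction and recovery steps above can be sketched as follows. This is only one possible reading of the scheme: the escape table, the -xHHHH- hexadecimal fallback, and the function names are assumptions made for this example, and str.isalnum() is only an approximation of what XML allows in an ID. The sketch also keeps both dashes of an escape at the end of a form, so 1º becomes own-pt-1-ordm--n rather than the shorter own-pt-1-ordm-n shown earlier.

```python
import re

# Illustrative escape names, mostly borrowed from HTML entity names;
# "dash" (for the ASCII hyphen) follows the suggestion in this thread.
ESCAPES = {
    "'": "apos", "’": "rsquo", ",": "comma", "+": "plus", "º": "ordm",
    "?": "quest", "!": "excl", "&": "amp", ";": "semi", "(": "lpar",
    ")": "rpar", ":": "colon", "/": "sol", "–": "ndash",
    "_": "lowbar", "-": "dash",
}
UNESCAPES = {name: char for char, name in ESCAPES.items()}


def escape_form(form: str) -> str:
    """Step 1: replace disallowed characters with dash-escape-dash patterns."""
    out = []
    for ch in form:
        if ch == " ":
            out.append("_")                  # only regular spaces become "_"
        elif ch in ESCAPES:
            out.append(f"-{ESCAPES[ch]}-")
        elif ch.isalnum() or ch == ".":
            out.append(ch)                   # approximation of allowed ID characters
        else:
            out.append(f"-x{ord(ch):04x}-")  # assumed fallback, e.g. tab -> -x0009-
    return "".join(out)


def make_id(lexicon_id: str, form: str, pos: str) -> str:
    """Steps 2 and 3: prefix the lexicon ID, append the part of speech."""
    return f"{lexicon_id}-{escape_form(form)}-{pos}"


def _unescape(match: re.Match) -> str:
    name = match.group(1)
    if name is None:                         # the "_" alternative matched
        return " "
    if name in UNESCAPES:
        return UNESCAPES[name]
    return chr(int(name[1:], 16))            # -xHHHH- fallback


def recover_form(lexicon_id: str, entry_id: str) -> str:
    """Undo the steps in reverse: strip suffix and prefix, then unescape."""
    body, _pos = entry_id.rsplit("-", 1)
    body = body[len(lexicon_id) + 1:]
    return re.sub(r"-([a-z0-9]+)-|_", _unescape, body)


entry_id = make_id("own-pt", "Jack, o Estripador", "n")
print(entry_id)                          # own-pt-Jack-comma-_o_Estripador-n
print(recover_form("own-pt", entry_id))  # Jack, o Estripador
```

Because literal dashes and underscores are themselves escaped, every dash left in the stripped ID belongs to an escape pattern, which is what makes a single-pass reversal possible.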

1313ou commented on August 29, 2024

@FredsoNerd, I think the global wordnet does not have to impose constraints on specific wordnets. However, xsd:ID well-formedness is required because uniqueness and proper reference are involved. I have a problem with legacy sense keys being promoted to IDs, because the colon makes them non-conformant. However, they can prove useful in the database and should be kept as an extra field, possibly as an extension.

arademaker commented on August 29, 2024

A less 'ad-hoc' approach is to replace them with XML character entities such as &apos;

Yes, I like that idea, which is in line with @goodmami's suggestion too. But & and ; are not allowed. So we can use -apos- as you do in OEWN, or we can use another symbol that the xsd:ID rules accept, maybe #apos?

goodmami commented on August 29, 2024

maybe #apos

The # character is also excluded punctuation. There is a small set of explicitly allowed ASCII punctuation. From the XML spec:

Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

That is, :, -, ., _, and ·. As the first character, however, only : and _ are allowed from that set. If we go with the xsd:id definition, : is also excluded, in any position. Unfortunately, the middle dot is not easy to type (at least on US keyboards), so we're really down to 3 usable ASCII punctuation characters: -, ., and _.
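
For checking candidate IDs against these rules, a small validator can help. The sketch below encodes the NameStartChar and NameChar ranges from the XML 1.0 (Fifth Edition) Name production, with the colon removed as for xsd:ID. The function name is hypothetical, and this is an illustration rather than a replacement for a schema validator.

```python
import re

# NameStartChar ranges from XML 1.0 (Fifth Edition), without ":" (as for xsd:ID).
_START = (
    r"A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, MIDDLE DOT, and two small ranges.
_REST = _START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

_ID_PATTERN = re.compile(rf"[{_START}][{_REST}]*")


def is_valid_xsd_id(value: str) -> bool:
    """Return True if value is a colon-free XML Name."""
    return _ID_PATTERN.fullmatch(value) is not None


for candidate in ["own-pt-fácil-a", "own-pt-1º-n", "1-n", "#apos", "own-pt-Jack-comma-_o_Estripador-n"]:
    print(candidate, is_valid_xsd_id(candidate))
```

Accented letters such as á pass, while º, #, ’ and a leading digit are rejected, which matches the cases discussed in this thread.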

arademaker commented on August 29, 2024

One extra puzzle for me: why does XML use begin and end marks (& and ;)? Could we eventually run into trouble by using only a single mark, as in -apos-?
