
Comments (11)

goodmami avatar goodmami commented on August 29, 2024

Looks good in general. A few things:

  1. I would maybe go for a more general word for "dialect" to avoid political controversies. Maybe "variety"?
  2. Instead of a country code in "dialect" (or whatever we call it) and some specialization under "notes", can we combine them into one BCP-47 tag? This saves an attribute, and it dovetails as a specialization of the lexicon's language attribute. E.g., en-GB-x-RP.
  3. Do we need the / in the transcription? I can't imagine a use for phonetic transcription ([...] vs /.../), so perhaps we can assume it is a phonemic transcription and drop the / characters?
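A minimal sketch of point 2, assuming the variety is stored as one BCP-47 tag. The helper names `specializes` and `private_use` are hypothetical and not part of any schema; the check is only a case-insensitive subtag-prefix comparison, not full BCP-47 validation.

```python
def subtags(tag: str) -> list[str]:
    """Split a BCP-47 tag into lowercase subtags."""
    return tag.lower().split("-")


def specializes(variety: str, language: str) -> bool:
    """True if `variety` begins with every subtag of `language`."""
    v, lang = subtags(variety), subtags(language)
    return v[:len(lang)] == lang


def private_use(tag: str) -> list[str]:
    """Return the subtags after the private-use singleton 'x', if any."""
    parts = tag.split("-")
    lowered = [p.lower() for p in parts]
    if "x" in lowered:
        return parts[lowered.index("x") + 1:]
    return []


# The variety tag from the comment above, checked against a lexicon language "en".
print(specializes("en-GB-x-RP", "en"))
print(private_use("en-GB-x-RP"))
```

With this shape, a tool could reject a pronunciation whose variety tag does not extend the lexicon's language, without needing a separate country-code attribute.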

from schemas.

jmccrae avatar jmccrae commented on August 29, 2024

Hi.

  1. yes, this is a good point, variety is better than dialect
  2. we could do this. I guess that this would mean duplicating the language code, but this is okay
  3. I would guess that some would prefer a phonemic transcription. We could drop the / and have an attribute for phonemic transcriptions?

fcbond avatar fcbond commented on August 29, 2024

lmorgadodacosta avatar lmorgadodacosta commented on August 29, 2024

Hi there,

We have been working/discussing this exact topic a tiny bit for Kristang -- we are hoping to provide IPA and voice recordings for individual lemmas soon.

The problem I'd like to raise here is that Kristang shows a lot of metathesis in certain consonant clusters, e.g. ‘-dr-’ (kodrah and kordah for ‘to wake up’). Within the context of revitalization, as we want people to start using a single spelling, we have decided it would be best to cluster these as "Forms" under a single lemma (i.e. the canonical form). However, these internal forms do have different pronunciations.

Up to this point we were happy to use the Tag element (available to both Forms and Lemmas) and come up with our own "category" notation. But I think including such a Pronunciation element is definitely an improvement. However, I would like to see what you all think about:

  1. making "Pronunciation" available to both Lemmas and Forms
  2. adding an explicit attribute to a sound file path

goodmami avatar goodmami commented on August 29, 2024

> 1. making "Pronunciation" available to both Lemmas and Forms

I think this makes sense. Actually, in a new Python-based wordnet module I'm working on, all lemmas are just forms anyway, so doing something different for lemmas and forms would be more trouble than doing the same thing (but this, at least, is just a selfish reason).

> 2. adding an explicit attribute to a sound file path

I'm less enthused about this, but if we're adding logos (#3), then it's not breaking new ground to link to external files. However, shouldn't this be a URL instead of a file path? If a file path, then absolute paths won't work, and we'd need some kind of resource directory such that the paths are relative to this directory, or something. This sounds like over-engineering.

Better, perhaps, would be that your application provides a mapping from local paths into the ids of the wordnet. The trouble is that lemmas/forms do not have their own ids, so it would have to be linked to the LexicalEntry, then to the writtenForm under that entry (are forms guaranteed to be unique under a lexical entry?).

Another issue is when you want multiple audio files for the same lemma/form (e.g., from multiple speakers). It doesn't seem like an attribute for a file path or URL would easily scale to multiple files.

1313ou avatar 1313ou commented on August 29, 2024

Implications: IPA symbols are not ASCII, so all tools must handle UTF8 (or whatever charset is defined as desired)

jmccrae avatar jmccrae commented on August 29, 2024
  1. Yes, I had intended this to be available for Forms as well as Lemmas
  2. We could add a URL to the sound file if available; this is useful for some even if it is never used
  3. I think UTF-8 is already required by the serialization. If someone wants a strict ASCII file they will have to use an ASCII-based transcription scheme

lmorgadodacosta avatar lmorgadodacosta commented on August 29, 2024

Sorry, by sound file path I definitely meant URL or some public URI.
I do kinda see the problem raised by Mike for multiple recordings of the same lemma/form... But if pronunciations are multiple elements, then you could just provide multiple Pronunciation elements. Individual projects could then use the notes attribute to keep information about the speaker, if necessary (e.g. male/female). But it would give a quick link to something that is very meaningful under the element we're discussing.
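
A rough sketch of what that could look like, assuming repeatable Pronunciation elements under both Lemma and Form. The attribute names `audio` and `notes`, the entry id, and the example.org URLs are placeholders for illustration only; the transcription text is left empty on purpose.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: attribute names and URLs are assumptions, not schema.
SAMPLE = """
<LexicalEntry id="example-kodrah-v">
  <Lemma writtenForm="kodrah" partOfSpeech="v">
    <Pronunciation audio="https://example.org/audio/kodrah-1.ogg" notes="speaker 1"/>
    <Pronunciation audio="https://example.org/audio/kodrah-2.ogg" notes="speaker 2"/>
  </Lemma>
  <Form writtenForm="kordah">
    <Pronunciation audio="https://example.org/audio/kordah-1.ogg" notes="speaker 1"/>
  </Form>
</LexicalEntry>
"""


def recordings(entry_xml: str) -> dict[str, list[tuple[str, str]]]:
    """Map each written form to its (audio URL, notes) pairs."""
    entry = ET.fromstring(entry_xml.strip())
    result = {}
    for child in entry:
        # Lemma and Form are treated the same way.
        if child.tag not in ("Lemma", "Form"):
            continue
        result[child.get("writtenForm")] = [
            (p.get("audio"), p.get("notes"))
            for p in child.findall("Pronunciation")
        ]
    return result


for written_form, items in recordings(SAMPLE).items():
    print(written_form, items)
```

Because each recording is its own element, several speakers per form need no extra machinery, and the lookup is keyed on writtenForm since lemmas and forms have no ids of their own.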

1313ou avatar 1313ou commented on August 29, 2024

To my knowledge the current state of EWN does not use characters that require coding outside ASCII, so the current files are both ASCII and UTF8. So the relevant tests are still to come.

goodmami avatar goodmami commented on August 29, 2024

> To my knowledge the current state of EWN does not use characters that require coding outside ASCII, so the current files are both ASCII and UTF8. So the relevant tests are still to come.

I was surprised by this and thought that surely things like jalapeño and résumé would have the diacritics in EWN, even if only as alternative forms, but found nothing but ASCII throughout the whole file. In any case, there are non-English wordnets with plenty of non-ASCII forms, so it would be unfortunate if any tools assumed wordnets to be ASCII-only.
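
For anyone who wants to run the same kind of check, here is a small sketch. The function name `non_ascii_forms` and the word list are made up for illustration; this is not how I searched the EWN file.

```python
def non_ascii_forms(forms: list[str]) -> list[str]:
    """Return the forms containing at least one non-ASCII character."""
    return [form for form in forms if not form.isascii()]


def utf8_length(form: str) -> int:
    """Number of bytes the form occupies when serialized as UTF-8."""
    return len(form.encode("utf-8"))


forms = ["resume", "résumé", "jalapeño"]
print(non_ascii_forms(forms))

# A pure-ASCII form has one byte per character; others need more.
for form in forms:
    print(form, len(form), utf8_length(form))
```

A tool that assumes one byte per character would pass on the first form and miscount the other two, which is the kind of assumption worth testing before IPA is added.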

goodmami avatar goodmami commented on August 29, 2024

Returning to the issue of marking a transcription as phonemic or phonetic... What do people think of keeping the IPA delimiters (/.../ or [...]) in the actual transcription? When I suggested dropping the delimiters (see (3) in my first comment above), I assumed we only cared about phonemic transcriptions. Since for non-IPA transcription the attribute may be irrelevant (or implicit given the notation attribute), perhaps the phonemic attribute is a bad idea. Furthermore, the IPA delimiters are shorter and may be clearer for someone familiar with IPA.
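
If the delimiters stay in the text, a consumer can still recover the distinction. A minimal sketch, assuming a hypothetical helper `classify` and that anything without matching delimiters is left unclassified:

```python
from typing import Optional


def classify(transcription: str) -> tuple[Optional[str], str]:
    """Return (kind, bare transcription) based on the IPA delimiters.

    kind is "phonemic" for /.../, "phonetic" for [...], and None when
    no matching delimiters are found (e.g. a non-IPA notation).
    """
    text = transcription.strip()
    if len(text) >= 2:
        if text[0] == "/" and text[-1] == "/":
            return "phonemic", text[1:-1]
        if text[0] == "[" and text[-1] == "]":
            return "phonetic", text[1:-1]
    return None, text


for sample in ["/kæt/", "[kʰæt]", "k ae t"]:
    print(classify(sample))
```

This keeps the schema to a single text value, at the cost of every tool having to agree on what the delimiters mean.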
