Comments (14)
Check out this post on the MCV multi-orthography feature @alvynabranches @anniedempe
from common-voice.
Thank you for the recommended adjustments and the valuable insights you provided. While we're eager to implement the proposed modifications, we first need to consult the community members and translators who requested the addition of the languages (knn and gom) to Common Voice. This ensures that we accommodate the linguistic preferences of the entire community.
from common-voice.
Hello @thak123 @anniedhempe and @alvynabranches
As you already know, Konkani in देवनागरी (Devanagari) is the official standard for reading and writing Konkani in Goa. However, there are variants or dialects of it spoken in smaller communities in the Konkan region. That is, a person living in Maharashtra's Konkan region will speak differently from a person living in Margao, and both will speak differently from a person living in Mangalore. Also, not everyone can read sentences in देवनागरी: Mangalorean people read in the Kannada script, while many Catholics in Goa prefer to read and speak in Romi-Konkani.
A Mozilla Discourse post from July 2022 discussed the introduction of Language Variants. Please watch the video linked in the post.
There is also an MCV post from March 2024, "Multi-Orthography for language variants", introducing support for languages with different writing systems.
I am aware that the dataset is currently split into two locales for Konkani. However, I firmly believe that combining the two datasets is possible, because Romi and Devanagari Konkani differ mainly in pronunciation and writing script. With the upcoming multi-orthography support for language variants that have multiple writing systems, one locale (the same dataset) could serve a language with multiple variants. Refer to the 2024 Discourse post.
Mozilla is already working to support languages with multiple orthographies. The features I suggested in the main post of this issue would make it easier for everyone to take part in building the Konkani Common Voice dataset.
I am trying to plan ahead, before all the translation and sentence collection work for Konkani is done.
Please comment if you are okay with combining the datasets, on the condition that users can choose a "language variant" + "writing system" before they record their voices. With your support, Mozilla will hopefully make the appropriate changes.
तुमचें वीचार एकदम गरजेचे आसा. (Your views are very much needed.)
from common-voice.
@alvynabranches, manager of the gom locale on Pontoon
from common-voice.
Related to #3266
from common-voice.
Offering two options on the same dataset is not possible, hence a separate language was created for each of the Devanagari and Roman scripts. If this had to be supported, we would have to build a custom platform for it. @chasingdragonflies
from common-voice.
If the datasets cannot be merged, then at least the sentences can be linked. It would provide an easier way to further use both datasets for transliteration with machine learning. I am not aware of other use cases at the moment.
How?
In the Roman dataset, a new column "devanagari_sentence_id" can be added to sentences.tsv. It would hold the sentence ID of the matching Devanagari sentence in the Devanagari dataset.
Likewise, the Devanagari dataset would get a "roman_sentence_id" column.
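As a rough sketch of how such a link column could be used, the Python snippet below joins two tiny in-memory TSVs on the proposed "devanagari_sentence_id" column. The "sentence_id" and "sentence" column names, the IDs (r1, d1, ...) and the sample rows are assumptions made for illustration only; they are not taken from the actual Common Voice files.

```python
import csv
import io

# Hypothetical miniature sentences.tsv files. Only the link column names
# ("devanagari_sentence_id", "roman_sentence_id") come from the proposal;
# the other column names, the IDs, and the rows are assumed for illustration.
ROMAN_TSV = (
    "sentence_id\tsentence\tdevanagari_sentence_id\n"
    "r1\tzata\td1\n"
    "r2\tghod\t\n"  # not linked yet: the link column is empty
)
DEVANAGARI_TSV = (
    "sentence_id\tsentence\troman_sentence_id\n"
    "d1\tजाता\tr1\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def link_sentences(roman_rows, devanagari_rows):
    """Return (roman, devanagari) sentence pairs for every linked row."""
    by_id = {row["sentence_id"]: row for row in devanagari_rows}
    pairs = []
    for row in roman_rows:
        match = by_id.get(row["devanagari_sentence_id"])
        if match is not None:
            pairs.append((row["sentence"], match["sentence"]))
    return pairs


pairs = link_sentences(read_tsv(ROMAN_TSV), read_tsv(DEVANAGARI_TSV))
print(pairs)  # [('zata', 'जाता')]
```

Pairs built this way could serve as parallel data for a transliteration model, while unlinked rows are simply skipped.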
from common-voice.
Tools that can aid in transliteration and linking of sentences:
• https://konkanverter.com, developed by World Konkani Center in Mangalore (an automatic transliteration tool, although it is not perfect)
• English to Devanagari-Konkani dictionary (official Goa Konkani Basha Mandal app, co-developed with Shabdkosh)
• Modern English to Roman-Konkani dictionary, a PDF file uploaded to Wikimedia Commons.
from common-voice.
Transliteration of Konkani is not as simple as converting Devanagari words to English pronunciation.
Here are some examples for non-Konkani speakers. (Use Hindi text-to-speech for pronunciation.)
1. For "sweet taste":
In gom, we say "ghod" (pronunciation: घोड)
In knn, we say "ghad" (pronunciation: घड)
2. For "my":
In gom, we say "mojea" (pronunciation: म्हौज्या)
In knn, we say "mhajea" (pronunciation: म्हज्या)
However, some words have the same pronunciation in both Konkani scripts.
3. For "is happening":
In gom, we say "zata"
In knn, we also say "zata" (pronunciation: जाता)
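To illustrate why a word-level lookup is needed rather than a character-by-character conversion, here is a minimal Python sketch built only from the three examples above. The table and function names are hypothetical; a real tool would need a far larger dictionary.

```python
# Word pairs from the examples above: English gloss -> (gom form, knn form).
VARIANT_PAIRS = {
    "sweet taste": ("ghod", "ghad"),
    "my": ("mojea", "mhajea"),
    "is happening": ("zata", "zata"),
}

# Lookup table from the gom spelling to the knn spelling.
GOM_TO_KNN = {gom: knn for gom, knn in VARIANT_PAIRS.values()}


def gom_to_knn(word):
    """Return the knn form of a gom word, or None if it is not in the table."""
    return GOM_TO_KNN.get(word)


def differs(word):
    """True when the word is known and its knn form is spelled differently."""
    knn = gom_to_knn(word)
    return knn is not None and knn != word


for word in ("ghod", "mojea", "zata"):
    print(word, "->", gom_to_knn(word), "| differs:", differs(word))
```

Words missing from the table return None, which is where a human linker or a tool such as a dictionary would have to step in.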
from common-voice.
Mozilla Common Voice can introduce a new tab, "Link", alongside the Speak, Listen, Write and Review tabs, for linking the Devanagari and Roman sentences across the datasets. This will also boost sentence collection for both datasets.
The flow could be: Write -> Review (linguistically correct and copyright-free?) -> Link (type the corresponding devanagari/roman sentence) -> Record -> Validate
Once linked, the other language script can have the flow:
Review (linguistically correct?) -> Record -> Validate.
This way we can collect sentences for both the datasets together!
In the contribution guidelines, for linking purposes, we can suggest the tools I mentioned here, which will make it easier to convert sentences between the Devanagari and Roman scripts.
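The two flows described above can be sketched as ordered step lists. This is only an illustration of the proposal, not how Common Voice is implemented; the names PRIMARY_FLOW, LINKED_FLOW and next_step are hypothetical.

```python
# Flow for a sentence written in one script (from the proposal above).
PRIMARY_FLOW = ["write", "review", "link", "record", "validate"]

# Flow for the linked sentence in the other script: it was already typed
# during the "link" step, so it starts at review.
LINKED_FLOW = ["review", "record", "validate"]


def next_step(flow, current):
    """Return the step that follows `current`, or None when the flow is done."""
    index = flow.index(current)
    if index + 1 < len(flow):
        return flow[index + 1]
    return None


print(next_step(PRIMARY_FLOW, "review"))  # link
print(next_step(LINKED_FLOW, "review"))   # record
```

Keeping the two flows as separate lists makes it explicit that the linked sentence skips the write and link steps but still goes through its own review.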
from common-voice.
"Giving two options on the same dataset is not possible, hence there is a separate language created for Devanagari and Roman scripts. If this has to be created, we would have to make a custom platform for it. @chasingdragonflies"
They have planned it. Refer: #4144 (comment)
@ginamoape Could you guide us?
from common-voice.
Last date is 25 April 2024
from common-voice.