Comments (4)
Thanks so much for getting in touch with this issue.
When you say that the sentences in this file have been "recorded more than once", do you mean that identical sentences are being shown to Catalan voice contributors, or that the sentences as listed are repetitive for contributors because they repeat the same sentence structures with only the place names changing from sentence to sentence?
from common-voice.
Thank you for your question.
Some of these sentences have been shown to Catalan voice contributors (and recorded) more than once (up to 4 times in v.16). I see that others (about 1100) have never been recorded, but since they are too similar (same sentence structures with only the place names changing), we see little value in recording them.
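For anyone who wants to check this against their own copy of a release, here is a minimal sketch of how the per-sentence recording counts could be tallied. It assumes a tab-separated file with a `sentence` column; the function names and the sample rows are made up for illustration and are not part of any Common Voice tooling.

```python
import csv
import io
from collections import Counter


def recording_counts(tsv_text, column="sentence"):
    """Count how many rows (recordings) each sentence has in TSV text.

    Assumption: the TSV has a header row containing `column`.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return Counter(row[column] for row in reader)


def repeated(counts, threshold=2):
    """Return only the sentences recorded at least `threshold` times."""
    return {s: n for s, n in counts.items() if n >= threshold}


# Hypothetical sample rows, not taken from the real dataset.
sample = (
    "path\tsentence\n"
    "a.mp3\tVisc a Girona.\n"
    "b.mp3\tVisc a Lleida.\n"
    "c.mp3\tVisc a Girona.\n"
)

counts = recording_counts(sample)
print(repeated(counts))
```

On a real release the same functions could be fed the contents of the clips file, for example via `open(path, encoding="utf-8").read()`, where `path` is wherever the file was unpacked.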
Apologies for the delay in responding. Getting sentences back out of the validated text corpus is exceptionally challenging from a technical perspective and would have to wait behind feature work and bug fixes for our team.
The fastest fix for rebalancing the Catalan dataset would be to dilute these sentences with fresh bulk uploads of Catalan sentences, which would give speakers and the dataset a more varied pool of sentences to draw from. We've seen language communities have great success with CC0 books and texts, copyright-free government or cultural writings, and community-driven writing challenges. Could this be a faster way to rebalance the text corpus, keep it interesting for contributors, and create more useful data for dataset consumers?
Thank you for your response.
What we are asking for is not to remove them from the validated dataset, only to prevent them from being proposed to speakers to be read. Is there any way to do that?
Regarding adding more sentences to the corpus, we are working on it. We hope to have more soon, since we are committed to achieving a varied and reliable corpus.
Kind regards