training data about langid.py HOT 9 CLOSED

wammar commented on August 29, 2024

training data

from langid.py.

Comments (9)

saffsd commented on August 29, 2024

Hi Waleed!

Thanks for getting in touch. Could you give me an idea of what languages
you are looking to add, and what data (source(s), quantity) you have
available? We have made some data available at [1,2], though it isn't the
full dataset used for the default model of langid.py due to license
restrictions.

[1] http://www.csse.unimelb.edu.au/~tim/etc/ijcnlp2011-langid.tgz
[2] http://www.csse.unimelb.edu.au/research/lt/resources/naacl2010-langid/

On Mon, Apr 21, 2014 at 11:30 PM, Waleed Ammar wrote:

Thanks for making langid available! It's awesome! We (researchers at
Carnegie Mellon University) would like to augment the training data with
more languages. Shall we send you the data so that you can retrain the
models when your time permits? Alternatively, feel free to send us the data
and we would retrain the models ourselves.

many thanks!
waleed ammar

wammar commented on August 29, 2024

That's great! I didn't know you've made these datasets available! We are
primarily interested in high precision classification of Yoruba documents
at this point, but we've also collected high quality web crawls in other
African languages (1 million words in Malagasy, 6 million words in
Kinyarwanda, also in Swahili but I'm not sure about the size).

It seems like the data you made available are all we need now though. Many
thanks for your quick reply.

waleed

corpulent commented on August 29, 2024

@saffsd Is it possible to further train the models with custom data, such as slang in different languages? My project deals primarily with English, and I want to be able to detect English slang.

saffsd commented on August 29, 2024

@detrop it certainly is possible, though the effectiveness of the method would need to be empirically verified. Note that you would need to re-train a complete model - there is no way to add a new language to an existing model at the moment. The quality of the outcome will depend largely on the scope of the task and the quality of the training data. Are you able to assume that all the documents that you have are English, and that you just want to separate "English" from "English Slang"? This is an easier task than say adding "English Slang" to the set of 97 languages currently supported by langid.py.

The training tools for langid.py can be found at https://github.com/saffsd/langid.py/tree/master/langid/train

corpulent commented on August 29, 2024

@saffsd At this point I would manually build up the training data for English slang and retrain the whole model. I don't want to separate "English" from "English Slang"; instead, I want "English Slang" to be classified as "English".
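
A minimal sketch of how such hand-built training data could be organised on disk. It assumes the langid.py training tools read one document per file from a two-level `domain/language` folder hierarchy; check that layout against the project README before relying on it. The helper `write_corpus`, the domain names, and the sample sentences are all hypothetical. Slang documents are filed under the plain `en` label so that a retrained model would treat them as English.

```python
import tempfile
from pathlib import Path


def write_corpus(root, documents):
    """Write (domain, lang, text) triples as one file per document.

    Assumed layout (not confirmed by this thread): <root>/<domain>/<lang>/<n>.txt
    """
    root = Path(root)
    counts = {}
    for domain, lang, text in documents:
        target = root / domain / lang
        target.mkdir(parents=True, exist_ok=True)
        n = counts.get((domain, lang), 0)
        (target / f"{n:06d}.txt").write_text(text, encoding="utf-8")
        counts[(domain, lang)] = n + 1
    return counts


# Hypothetical toy documents. Slang goes under "en", not a separate label.
documents = [
    ("news", "en", "The council approved the budget on Tuesday."),
    ("slang", "en", "ngl that movie was lowkey fire"),
    ("news", "de", "Der Stadtrat hat den Haushalt am Dienstag beschlossen."),
]

corpus_root = Path(tempfile.mkdtemp()) / "corpus"
counts = write_corpus(corpus_root, documents)
for path in sorted(corpus_root.rglob("*.txt")):
    print(path.relative_to(corpus_root).as_posix())
```

Keeping slang in its own domain folder while reusing the `en` label keeps the two sources of English text separate on disk without introducing a new language code.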

saffsd commented on August 29, 2024

@detrop are you finding that the existing model doesn't work for "English Slang" data? Could you share a few examples that are incorrectly classified? On the basis of that I can probably give you better suggestions.
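
One way to collect such examples is to run a set of texts known to be English through the classifier and keep the ones that come back with a different label. The sketch below is only an illustration: `find_misclassified` and `toy_classify` are hypothetical names, and the toy classifier is a stand-in so the code runs without langid.py installed. With langid.py, `langid.classify(text)` returns a `(language, score)` tuple, so `lambda text: langid.classify(text)[0]` could be passed instead.

```python
def find_misclassified(samples, classify, expected="en"):
    """Return (text, predicted) pairs whose predicted label differs from expected."""
    errors = []
    for text in samples:
        predicted = classify(text)
        if predicted != expected:
            errors.append((text, predicted))
    return errors


def toy_classify(text):
    """Stand-in classifier (NOT langid.py): 'en' only if the word 'the' appears."""
    return "en" if " the " in f" {text.lower()} " else "xx"


# Hypothetical sample texts, all assumed to be English.
samples = ["I saw the game last night", "ngl lowkey fire fr fr"]
errors = find_misclassified(samples, toy_classify)
for text, predicted in errors:
    print(f"{predicted}\t{text}")
```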

Casyfill commented on August 29, 2024

Sorry for jumping in, and thanks a lot for this module.
I don't need any new languages, but the module works much worse on Cyrillic-script languages (Belarusian, Russian, Ukrainian), as do others, so maybe I could help with data, too?
There is, for example, an open corpus of the Russian language:
http://opencorpora.org/

saffsd commented on August 29, 2024

@Casyfill some groups of languages are known to be harder to tell apart, and it's actually still a bit of an open issue in research. There was recently a shared task [1] on this subject, though the East Slavic languages you mention were not included. Based on our experiences in the shared task, the best thing to do is a 2-layer classification - i.e. train an additional classifier for only ru/uk/be, and use it to re-classify any document that the general classifier labels as ru/uk/be. If you are interested in doing this, I can provide some guidance.

[1] http://corporavm.uni-koeln.de/vardial/sharedtask.html
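
The 2-layer idea can be sketched as a small wrapper around two classifiers. This is an illustration under stated assumptions, not the shared-task system: `two_stage_classify`, `toy_general`, and `toy_specialist` are hypothetical names, and both toy classifiers are crude stand-ins so the code runs on its own. In practice the general classifier could wrap `langid.classify(text)[0]`, and the specialist would be a second model trained only on ru/uk/be data.

```python
CONFUSABLE = frozenset({"ru", "uk", "be"})


def two_stage_classify(text, general, specialist, confusable=CONFUSABLE):
    """Use the specialist only when the general label falls in the confusable group."""
    first = general(text)
    if first in confusable:
        return specialist(text)
    return first


def toy_general(text):
    """Stand-in general classifier: any Cyrillic character means 'ru'."""
    if any("\u0400" <= ch <= "\u04ff" for ch in text):
        return "ru"
    return "en"


def toy_specialist(text):
    """Stand-in ru/uk/be classifier based on a few distinctive letters.

    A real specialist would be a trained model, not a letter lookup.
    """
    lowered = text.lower()
    if any(ch in lowered for ch in "їєґ"):  # letters used in Ukrainian
        return "uk"
    if "ў" in lowered:  # letter used in Belarusian
        return "be"
    return "ru"


for sample in ["Це їжа", "Я быў там", "Я был там", "hello"]:
    print(sample, two_stage_classify(sample, toy_general, toy_specialist))
```

Passing the classifiers in as plain callables keeps the routing logic independent of how either model is trained or loaded.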

saffsd commented on August 29, 2024

Closed as it has been inactive for some time.
