
training data about langid.py (closed)

wammar commented on August 29, 2024
training data

from langid.py.

Comments (9)

saffsd commented on August 29, 2024

Hi Waleed!

Thanks for getting in touch. Could you give me an idea of what languages
you are looking to add, and what data (source(s), quantity) you have
available? We have made some data available at [1,2], though it isn't the
full dataset used for the default model of langid.py due to license
restrictions.

[1] http://www.csse.unimelb.edu.au/~tim/etc/ijcnlp2011-langid.tgz
[2] http://www.csse.unimelb.edu.au/research/lt/resources/naacl2010-langid/

On Mon, Apr 21, 2014 at 11:30 PM, Waleed Ammar wrote:

Thanks for making langid available! It's awesome! We (researchers at
Carnegie Mellon University) would like to augment the training data with
more languages. Shall we send you the data so that you can retrain the
models when your time permits? Alternatively, feel free to send us the data
and we would retrain the models ourselves.

many thanks!
waleed ammar


wammar commented on August 29, 2024

That's great! I didn't know you had made these datasets available! We are
primarily interested in high-precision classification of Yoruba documents
at this point, but we've also collected high-quality web crawls in other
African languages (1 million words in Malagasy, 6 million words in
Kinyarwanda, and some Swahili, though I'm not sure about the size).

It seems like the data you made available are all we need now though. Many
thanks for your quick reply.

waleed

In reply to saffsd's message of Tue, Apr 22, 2014 at 3:27 AM.


corpulent commented on August 29, 2024

@saffsd Is it possible to further train the models with custom data, such as slang in different languages? My project deals primarily with English, and I want to be able to detect English slang.


saffsd commented on August 29, 2024

@detrop it certainly is possible, though the effectiveness of the method would need to be empirically verified. Note that you would need to re-train a complete model: there is no way to add a new language to an existing model at the moment. The quality of the outcome will depend largely on the scope of the task and the quality of the training data. Are you able to assume that all the documents that you have are English, and that you just want to separate "English" from "English Slang"? This is an easier task than, say, adding "English Slang" to the set of 97 languages currently supported by langid.py.

The training tools for langid.py can be found at https://github.com/saffsd/langid.py/tree/master/langid/train


corpulent commented on August 29, 2024

@saffsd At this point I would manually build up the training data for English slang and retrain the whole model. I don't want to separate "English" from "English Slang"; instead, I want "English Slang" to be classified as "English".
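
A minimal sketch of how such training data could be organized, assuming the training tools read a corpus laid out as corpus/domain/language/file (this layout is recalled from the langid.py README and should be checked against the tools in langid/train before relying on it). Under that assumption, slang text can simply be added as an extra domain under the existing `en` label, so it is trained as "English" rather than as a separate class. The function name `write_corpus` and the domain names are hypothetical.

```python
from pathlib import Path


def write_corpus(root, docs):
    """Write (domain, lang, text) triples to root/domain/lang/NNNNNN.txt.

    The domain/lang/file hierarchy is an assumption about what the
    langid.py training tools expect; verify it before training.
    Returns the list of paths that were written.
    """
    root = Path(root)
    counters = {}  # next file index per (domain, lang) pair
    written = []
    for domain, lang, text in docs:
        target_dir = root / domain / lang
        target_dir.mkdir(parents=True, exist_ok=True)
        index = counters.get((domain, lang), 0)
        counters[(domain, lang)] = index + 1
        path = target_dir / f"{index:06d}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

Calling `write_corpus("corpus", [("slang", "en", "gonna grab some grub"), ("formal", "en", "We will have dinner.")])` would place both documents under the `en` label in two different domains. The complete model still has to be retrained on the full corpus afterwards.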


saffsd commented on August 29, 2024

@detrop are you finding that the existing model doesn't work for "English Slang" data? Could you share a few examples that are incorrectly classified? On the basis of that I can probably give you better suggestions.


Casyfill commented on August 29, 2024

Sorry for joining in, and thanks a lot for this module.
I don't need any new languages, but the module works much worse on Cyrillic-script languages (Belarusian, Russian, Ukrainian), among others, so maybe I could help with data, too?
There is, for example, an open corpus for Russian:
http://opencorpora.org/


saffsd commented on August 29, 2024

@Casyfill some groups of languages are known to be harder to tell apart, and it's actually still a bit of an open issue in research. There was recently a shared task [1] on this subject, though the East Slavic languages you mention were not included. Based on our experiences in the shared task, the best thing to do is a two-layer classification, i.e. train an additional classifier for only ru/uk/be, and use it to re-classify any document that the general classifier labels as ru/uk/be. If you are interested in doing this, I can provide some guidance.

[1] http://corporavm.uni-koeln.de/vardial/sharedtask.html
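
A rough sketch of the two-layer idea, under stated assumptions: both classifiers are plain functions from text to a label, and the names `make_two_layer`, `toy_general` and `toy_specialist` are made up for illustration. The two toy functions are letter-based stand-ins so the example runs on its own; they are not trained models and not part of langid.py.

```python
from typing import Callable, Iterable

Classifier = Callable[[str], str]


def make_two_layer(general: Classifier, specialist: Classifier,
                   confusable: Iterable[str]) -> Classifier:
    """Re-classify with `specialist` whenever `general` returns a confusable label."""
    confusable_set = frozenset(confusable)

    def classify(text: str) -> str:
        label = general(text)
        if label in confusable_set:
            # Second layer: a classifier trained only on the confusable group.
            return specialist(text)
        return label

    return classify


def toy_general(text: str) -> str:
    """Stand-in general classifier: any Cyrillic letter means 'ru', else 'en'."""
    has_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in text)
    return "ru" if has_cyrillic else "en"


def toy_specialist(text: str) -> str:
    """Stand-in ru/uk/be classifier based on letters distinctive to each alphabet."""
    lowered = text.lower()
    if any(ch in lowered for ch in "їєґ"):  # used in Ukrainian, not in ru/be
        return "uk"
    if "ў" in lowered:  # used in Belarusian, not in ru/uk
        return "be"
    return "ru"


two_layer = make_two_layer(toy_general, toy_specialist, {"ru", "uk", "be"})
```

In practice the general layer could be something like `lambda text: langid.classify(text)[0]`, since `langid.classify` returns a (language, score) pair, and the specialist would be a separate model trained only on ru/uk/be data.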


saffsd commented on August 29, 2024

Closed as it has been inactive for some time.
