martinthoma / lidtk

Language Identification Toolkit

License: MIT License

lidtk

lidtk, the language identification toolkit, was written to investigate the current state of language identification performance.

Installation

The recommended way to install lidtk is:

$ pip install lidtk --user

If you want the latest version:

$ git clone https://github.com/MartinThoma/lidtk.git; cd lidtk
$ pip install -e . --user

I recommend getting the WiLI-2018 dataset.

Usage

$ lidtk --help

Usage: lidtk [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  analyze-data           Utility function for the languages...
  analyze-unicode-block  Analyze how important a Unicode block is for...
  char-distrib           Use the character distribution language...
  cld2                   Use the CLD-2 language classifier.
  create-dataset         Create sharable dataset from downloaded...
  download               Download 1000 documents of each language.
  google-cloud           Use the CLD-2 language classifier.
  langdetect             Use the langdetect language classifier.
  langid                 Use the langid language classifier.
  map                    Map predictions to something known by WiLI
  nn                     Use a neural network classifier.
  textcat                Use the CLD-2 language classifier.
  tfidf_nn               Use the TfidfNNClassifier classifier.

For example:

$ lidtk cld2 predict --text 'This is a test.'
eng

The usual order is:

  1. lidtk download: Please use WiLI-2018 instead of downloading the dataset on your own.
  2. lidtk create-dataset: This step can be skipped if you use WiLI-2018
  3. lidtk analyze-unicode-block --start 0 --end 128
  4. lidtk tfidf_nn train vectorizer --config lidtk/classifiers/config/tfidf_nn.yaml
  5. lidtk tfidf_nn wili --config lidtk/classifiers/config/tfidf_nn.yaml
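
The commands from analyze-unicode-block onward can also be scripted. The following is a minimal sketch, not part of lidtk itself: it assumes lidtk is on the PATH, that the WiLI-2018 dataset is already in place (so the download and create-dataset steps are skipped), and that the config path is valid relative to the current directory. The names STEPS, build_commands and run_pipeline are hypothetical helpers introduced here for illustration; each distinct command from the list is included once.

```python
import shlex
import subprocess

# Config path taken from the usage list above; assumed to be relative
# to the directory the script is started from.
CONFIG = "lidtk/classifiers/config/tfidf_nn.yaml"

# Commands copied from the usage list, in the order given there.
# download / create-dataset are left out because WiLI-2018 is recommended.
STEPS = [
    "lidtk analyze-unicode-block --start 0 --end 128",
    f"lidtk tfidf_nn train vectorizer --config {CONFIG}",
    f"lidtk tfidf_nn wili --config {CONFIG}",
]


def build_commands(steps=STEPS):
    """Split each command line into an argv list suitable for subprocess."""
    return [shlex.split(step) for step in steps]


def run_pipeline(steps=STEPS, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for argv in build_commands(steps):
        print("$ " + " ".join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)
```

Calling run_pipeline() only prints the commands; run_pipeline(dry_run=False) would actually execute them and stop at the first step that exits with an error.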

Or to use a classifier directly:

$ lidtk cld2 predict --text 'This text is written in some language.'

eng
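
To call a classifier from Python code, one option is to wrap the command line shown above. This is a hedged sketch that only shells out to the lidtk command; it does not use any lidtk Python API. The function names predict_command and predict_language are hypothetical. It assumes that lidtk is on the PATH and that the predicted language code is the only thing printed on standard output, as in the example above. The README only shows `predict --text` for cld2; that other classifiers accept the same arguments is an assumption.

```python
import subprocess


def predict_command(text, classifier="cld2"):
    """Build the argv list for `lidtk <classifier> predict --text <text>`."""
    # Only the cld2 form appears in the README; other classifier names
    # are assumed to accept the same subcommand and option.
    return ["lidtk", classifier, "predict", "--text", text]


def predict_language(text, classifier="cld2"):
    """Run lidtk and return whatever it prints, stripped of whitespace."""
    result = subprocess.run(
        predict_command(text, classifier),
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()


# Example call (requires lidtk and the CLD-2 bindings to be installed):
# predict_language("This text is written in some language.")
```

Passing the text as a separate list element avoids shell quoting problems with apostrophes or other special characters in the input.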

Development

Run the tests with tox.
