abhaykul / language-classifier

Decision-tree and Ada-boosted-stump classifiers that tag sentences with their language.

Python 100.00%
python decision-trees adaboost-algorithm pickle wikidata-dump

Language Classifier


Train the model:

  • Training: The data file on which the model is trained.

  • Output: The name of the file that will hold the trained model as a pickle object.

  • Type: The type of model to train: "dt" for a decision tree, "ada" for Ada-boosted stumps.

  • Testing: The file on which the model is tested for parameter tuning (for example, choosing the best tree depth).

      python3 train.py <Training> <Output> <Type> <Testing>
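
As a rough sketch of how these four positional arguments might be validated, the snippet below checks the argument count and the model type. The helper name `parse_train_args`, the returned dictionary, and the example file names are hypothetical; the actual train.py may handle its arguments differently.

```python
VALID_TYPES = {"dt", "ada"}  # the two model types named in this README


def parse_train_args(argv):
    """Parse <Training> <Output> <Type> <Testing> (hypothetical helper)."""
    if len(argv) != 4:
        raise SystemExit(
            "usage: python3 train.py <Training> <Output> <Type> <Testing>"
        )
    training, output, model_type, testing = argv
    if model_type not in VALID_TYPES:
        raise SystemExit('Type must be "dt" or "ada"')
    return {
        "training": training,
        "output": output,
        "type": model_type,
        "testing": testing,
    }


# Example call with made-up file names.
args = parse_train_args(["train.dat", "model.pkl", "dt", "test.dat"])
```

In a real script the list would come from `sys.argv[1:]`.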
    

Test the model:

  • Model: The filename of the model generated by train.py.

  • Data: The data file containing the sentences to classify.

      python3 predict.py <Model> <Data>
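
Since the trained model is stored as a pickle object, saving it in train.py and loading it in predict.py could look roughly like this. The helper names `save_model` and `load_model` and the placeholder dictionary are assumptions for illustration, not the project's actual model class.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Write the trained model to disk as a pickle object."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Read a pickled model back (only unpickle files you trust)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Stand-in model: a placeholder dict, purely for illustration.
model = {"type": "dt", "max_depth": 3}
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_model(model, path)
restored = load_model(path)
```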
    

Features Selected:

I selected all the features based on the most frequently recurring words in each language. Before that, I tried using the average length of words, grammar, sentence construction, the letters used, and the frequency of letters.

None of these earlier features gave accurate classifications.

For example, for letter frequency: in Dutch, E, N, and A are the most frequently used letters, whereas in English they are E, T, and A. This was a poor differentiator.
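
A letter-frequency check of this kind can be sketched with `collections.Counter`. The function name `top_letters` is hypothetical, and restricting the count to the ASCII letters A-Z is an assumption made here for simplicity.

```python
from collections import Counter


def top_letters(text, n=3):
    """Return the n most frequent ASCII letters in text, uppercased."""
    counts = Counter(ch for ch in text.upper() if "A" <= ch <= "Z")
    return [letter for letter, _ in counts.most_common(n)]
```

Running this over a large sample of each language gives the two rankings, which overlap too much to separate the languages on their own.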

For average word length, both languages have a similar curve, with a spike between 3 and 4. This, too, was a poor differentiator.
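
The word-length comparison can be reproduced with a small sketch like the one below. Both function names are hypothetical, and splitting on whitespace is a simplifying assumption (punctuation stays attached to words).

```python
from collections import Counter


def word_length_histogram(text):
    """Count how many words of each length appear in text."""
    return Counter(len(word) for word in text.split())


def average_word_length(text):
    """Mean word length in characters; 0.0 for empty text."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0
```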

For sentence make-up and grammar, it was difficult to identify the proper nouns, adjectives, pronouns, etc. in a sentence.

Dutch does not follow the same rules as English, and some sentences failed to give the right answers. Hence, the features I selected are based on the most common words used in each language.

  1. Check if the sentence contains “the”: the most common word in the English language.
  2. Check if the sentence contains “het” or “de”: the Dutch equivalents of “the”.
  3. Check if the sentence contains “and”.
  4. Check if the sentence contains “ik”: the Dutch equivalent of “I”.
  5. Check if the sentence contains “een”: the Dutch equivalent of “a”.
  6. Check if the sentence contains “en”: the Dutch equivalent of “and”.
  7. Check if the sentence contains “he” or “she”: used in most conversations and third-person sentences.
  8. Check if the sentence contains “hij”, “ze”, or “zij”: the Dutch equivalents of “he”/“she”.
  9. Check if the sentence contains “van”: the Dutch equivalent of “from”/“of”/“by”.
  10. Check if the sentence contains “a”: the English word “a”.
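
The ten checks above can be sketched as a function that turns a sentence into ten booleans. The names `FEATURE_WORDS` and `extract_features` are hypothetical, and lowercasing plus whole-word matching is an assumption: the README only says "contains", so the original code may match differently.

```python
import re

# One tuple of marker words per feature, in the order listed above.
FEATURE_WORDS = [
    ("the",),
    ("het", "de"),
    ("and",),
    ("ik",),
    ("een",),
    ("en",),
    ("he", "she"),
    ("hij", "ze", "zij"),
    ("van",),
    ("a",),
]


def extract_features(sentence):
    """Return ten booleans, one per feature, using whole-word matches."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return [any(w in words for w in group) for group in FEATURE_WORDS]
```

Whole-word matching matters here: with plain substring search, “en” would also fire on English words such as “then”, and “a” on almost every sentence.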


Algorithms used:


For Decision Trees:
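
The README does not describe how the tree is built. As a generic sketch only, a decision tree over boolean features commonly picks, at each node, the feature with the highest information gain. The function names and the toy data below are illustrative assumptions, not the project's code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction from splitting on one boolean feature index."""
    true_side = [y for x, y in zip(rows, labels) if x[feature]]
    false_side = [y for x, y in zip(rows, labels) if not x[feature]]
    n = len(labels)
    remainder = ((len(true_side) / n) * entropy(true_side)
                 + (len(false_side) / n) * entropy(false_side))
    return entropy(labels) - remainder


def best_feature(rows, labels):
    """Index of the feature with the highest information gain."""
    return max(range(len(rows[0])),
               key=lambda f: information_gain(rows, labels, f))


# Toy data (made up): feature 0 separates the classes, feature 1 does not.
rows = [[True, False], [True, True], [False, False], [False, True]]
labels = ["en", "en", "nl", "nl"]
split = best_feature(rows, labels)
```

Repeating this choice recursively on each side of the split, up to a depth limit tuned on the Testing file, yields the tree.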


For Ada-boosted stumps:
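
The README gives no details for this part either. As a generic sketch of AdaBoost over one-feature stumps, assuming boolean features and labels encoded as +1 and -1 (an encoding chosen here for illustration), the training loop could look like this. All names and the toy data are hypothetical.

```python
import math


def stump_predict(x, feature, polarity):
    """Predict +polarity when the feature is present, -polarity otherwise."""
    return polarity if x[feature] else -polarity


def train_adaboost(rows, labels, rounds):
    """Train AdaBoost over stumps; returns (feature, polarity, alpha) tuples."""
    n = len(rows)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        # Pick the stump (feature, polarity) with the lowest weighted error.
        for f in range(len(rows[0])):
            for polarity in (1, -1):
                err = sum(w for x, y, w in zip(rows, labels, weights)
                          if stump_predict(x, f, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, f, polarity)
        err, f, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((f, polarity, alpha))
        # Raise the weight of misclassified examples, then renormalise.
        weights = [w * math.exp(-alpha * y * stump_predict(x, f, polarity))
                   for x, y, w in zip(rows, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, f, p) for f, p, alpha in ensemble)
    return 1 if score >= 0 else -1


# Toy data (made up): feature 0 alone separates the two classes.
rows = [[True, False], [True, True], [False, False], [False, True]]
labels = [1, 1, -1, -1]
ensemble = train_adaboost(rows, labels, rounds=3)
```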

