kshitijkarthick / tvecs

Establish Semantic Relatedness across Languages Documentation - http://kshitijkarthick.github.io/tvecs

Home Page: https://tvecs.kshitijkarthick.me

License: MIT License

machine-learning natural-language-processing indic-languages word2vec semantic-relationship-extraction

tvecs's Introduction

https://travis-ci.org/KshitijKarthick/tvecs.svg?branch=master

T-Vecs

Prerequisites

  • Python 2.7 set up and installed
  • pip set up and installed
  • Ensure all dependencies in requirements.txt are satisfied
  • Download nltk_data using nltk.download() -> only the tokenizers are required
  • Download corpus and extract in specified directory

Setup Development Environment

git clone https://github.com/KshitijKarthick/tvecs.git
cd tvecs
pip install -r requirements.txt
# Only the model needs to be downloaded and extracted into the t-vex directory

Install as a Package

# Install package
pip install git+https://github.com/KshitijKarthick/tvecs.git

# Usage from cmd line without recommendations menu
tvecs -c ./config.json

# Usage from cmd line with recommendations menu
tvecs -c ./config.json -r

# Usage without config file, with models, without recommendations menu
tvecs -l1 english -l2 hindi -m1 ./data/models/t-vex-english-models -m2 ./data/models/t-vex-hindi-models

# Usage without config file, with models, with recommendations menu
tvecs -r -l1 english -l2 hindi -m1 ./data/models/t-vex-english-models -m2 ./data/models/t-vex-hindi-models

# Usage from inside python as a library
import tvecs.vector_space_mapper.vector_space_mapper as vm

Data

Corpus Download details

We are focusing on English and Hindi; Kannada and Tamil are other possible prospects we could look into.

Sources

Bilingual Dictionary details

Provided in the repository, data/bilingual_dictionary. Compiled using the following sources.

Credits

Evaluation Dataset details

Human relatedness judgement score datasets are provided in data/evaluate

Credits
  • wordsim_relatedness_goldstandard
  • MEN_dataset_natural_form_full
  • Mturk_287
  • Mturk_771
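Evaluation against such datasets typically computes the Spearman rank correlation between the model's relatedness scores and the human judgement scores. A minimal numpy sketch of that metric (illustrative only, not tvecs' actual evaluation code; ties are not given averaged ranks here):

```python
import numpy as np

def spearman_correlation(model_scores, human_scores):
    """Spearman rank correlation between model and human relatedness scores."""
    def ranks(values):
        # Rank positions of each value (ties are not averaged in this sketch).
        order = np.argsort(values)
        r = np.empty(len(values))
        r[order] = np.arange(len(values))
        return r

    rx = ranks(np.asarray(model_scores))
    ry = ranks(np.asarray(human_scores))
    rx, ry = rx - rx.mean(), ry - ry.mean()
    # Pearson correlation of the rank vectors.
    return float(np.dot(rx, ry) / np.sqrt(np.dot(rx, rx) * np.dot(ry, ry)))
```

A perfectly monotonic relationship yields 1.0, a perfectly reversed one -1.0.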

Ensure the model is downloaded and extracted into the t-vex directory

  • data/corpus -> corpus
  • data/models -> models

Usage Details

T-Vecs Driver Module Cmd Line Args

$ python -m tvecs --help

usage: __main__.py [-h] [-v] [-s] [-i ITER] [-m1 MODEL1] [-m2 MODEL2]
               [-l1 LANGUAGE1] [-l2 LANGUAGE2] [-c CONFIG]
               [-b BILINGUAL_DICT] [-r]

Script used to generate models

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase output verbosity
  -s, --silent          silence all logging
  -i ITER, --iter ITER  number of Word2Vec iterations
  -m1 MODEL1, --model1 MODEL1
                        pre-computed model file path
  -m2 MODEL2, --model2 MODEL2
                        pre-computed model file path
  -l1 LANGUAGE1, --language1 LANGUAGE1
                        language name of model 1/ text 1
  -l2 LANGUAGE2, --language2 LANGUAGE2
                        language name of model 2/ text 2
  -c CONFIG, --config CONFIG
                        config file path
  -b BILINGUAL_DICT, --bilingual BILINGUAL_DICT
                        bilingual dictionary path
  -r, --recommendations
                        provide recommendations

Config File Format

  • See config.json in the repository for example.
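For reference, a hypothetical sketch of what such a config file might contain, with keys mirroring the cmd-line args above (the actual key names are defined by config.json in the repository and may differ):

```json
{
    "language1": "english",
    "language2": "hindi",
    "model1": "./data/models/t-vex-english-model",
    "model2": "./data/models/t-vex-hindi-model",
    "bilingual_dict": "./data/bilingual_dictionary/english_hindi_train_bd",
    "iter": 5,
    "verbose": false,
    "silent": false
}
```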

Execution & Building

# Preprocessing, model generation, bilingual dictionary generation and vector space mapping between the two languages english and hindi, from the corpus, using the config file

python -im tvecs -c config.json

# [ utilise the dictionary tvex_calls which contains results of every step performed ]

# Bilingual dictionary generation and vector space mapping between the two languages english and hindi, providing pre-computed models

python -im tvecs -l1 english -l2 hindi -m1 ./data/models/t-vex-english-model -m2 ./data/models/t-vex-hindi-model -b ./data/bilingual_dictionary/english_hindi_train


Obtain Recommendations

# Provide Recommendations using config file
python -m tvecs -c ./config.json -r

# Provide Recommendations using cmd line params
python2 -m tvecs -l1 english -l2 hindi -m1 ./data/models/t-vex-english-model -m2 ./data/models/t-vex-hindi-model -b ./data/bilingual_dictionary/english_hindi_train_bd -r

# Output for recommendations

Enter your Choice:
1> Recommendation
2> Exit

Choice: 1
Enter word in Language english: examination

Word    =>  Score

जाँच    =>  0.643208742142
नियुक्ति    =>  0.640852451324
जांच    =>  0.638412773609
अध्ययन  =>  0.638307392597
विवेचना =>  0.638229370117
मंत्रणा =>  0.634038448334
पुनर्मूल्यांकन  =>  0.627283990383
अध्‍ययन =>  0.624040842056
निरीक्षण    =>  0.623490035534
जाच =>  0.619904220104

Visualisation of vector space

python -m tvecs.visualization.server
[ Open browser to localhost:5000 for visualization ]
[ Ensure model generation is completed before running visualization ]

Execution of Individual Modules

# bilingual dictionary generation -> clustering vectors from trained model
python -m tvecs.bilingual_generator.clustering

# model generation
python -m tvecs.model_generator.model_generation

# vector space mapping [ utilise the object vm to obtain recommendations ]
python -m tvecs.vector_space_mapper.vector_space_mapper

Execution of Unit Tests

# Run all unit tests
py.test

# Run individual module tests separately
py.test tests/test_emille_preprocessor.py
py.test tests/test_leipzig_preprocessor.py
py.test tests/test_hccorpus_preprocessor.py

Generate Documentation

# Generate HTML Documentation
make html
cd documentation/html && python -m SimpleHTTPServer
# [ Open browser to localhost:8000 for the documentation ]

# Generate Man Pages
make man
cd documentation/man && man -l tvecs.1


# Other Makefile options
make

Please use `make <target>' where <target> is one of
html       to make standalone HTML files
dirhtml    to make HTML files named index.html in directories
singlehtml to make a single large HTML file
pickle     to make pickle files
json       to make JSON files
htmlhelp   to make HTML files and a HTML help project
qthelp     to make HTML files and a qthelp project
applehelp  to make an Apple Help Book
devhelp    to make HTML files and a Devhelp project
epub       to make an epub
epub3      to make an epub3
latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter
latexpdf   to make LaTeX files and run them through pdflatex
latexpdfja to make LaTeX files and run them through platex/dvipdfmx
text       to make text files
man        to make manual pages
texinfo    to make Texinfo files
info       to make Texinfo files and run them through makeinfo
gettext    to make PO message catalogs
changes    to make an overview of all changed/added/deprecated items
xml        to make Docutils-native XML files
pseudoxml  to make pseudoxml-XML files for display purposes
linkcheck  to check all external links for integrity
doctest    to run all doctests embedded in the documentation (if enabled)
coverage   to run coverage check of the documentation (if enabled)

tvecs's People

Contributors

kshitijkarthick, prarthana-s, prateeksha13, upman


tvecs's Issues

Driver Module should support utilisation of different kinds of corpora

Add T-Vecs Driver module support for multiple corpus types

  • Currently t-vecs.py builds all corpora with HcCorpusPreprocessor
  • Add a suitable config file, accepted as a param, to select the preprocessor per corpus

Example Config File

Language 1 : 'English'
corpora : [
    './path/corpus_fname': 'HcCorpusPreprocessor',
    './path/corpus_fname': 'EmillePreprocessor'
]
Language 2 : 'English'
corpora : [
    './path/corpus_fname': 'HcCorpusPreprocessor',
    './path/corpus_fname': 'EmillePreprocessor'
]

Cross Language Visualization support

Support for Visualization for Cross Language.

Example: Given an English word, recommend 10 most semantically similar hindi words.

  • visualization/server.py -> CherryPy server needs to add external route for handling this service.
    • Utilise the module: modules/vector_space_mapper/vector_space_mapper.py -> get_recommendations_from_word()
  • visualization/static/index.html -> Client side changes to add cross language support.
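The new route would essentially wrap get_recommendations_from_word() from the vector space mapper. A framework-neutral sketch of the handler logic, with a stub standing in for the real VectorSpaceMapper (the stub and the JSON shape are illustrative assumptions, not tvecs' actual API):

```python
import json

def cross_language_handler(mapper, word, top_n=10):
    """Return the top-n cross-language recommendations for `word` as JSON."""
    recommendations = mapper.get_recommendations_from_word(word)
    return json.dumps(recommendations[:top_n], ensure_ascii=False)

# Stub standing in for the real VectorSpaceMapper (hypothetical return shape).
class StubMapper(object):
    def get_recommendations_from_word(self, word):
        return [["जाँच", 0.643], ["अध्ययन", 0.638]]
```

In the CherryPy server this handler would be exposed as an external route, with the mapper constructed once at startup.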

Setup Logging

Setup Logging in all Modules

  • Setup Logging under the logging handler name T-Vecs
  • Show Distinction between modules while logging, using a module name
  • Employ multiple levels of logging
    • Info
    • Debug
    • Warn/Error
  • [ Optional ] Allow Logging into file
  • Set verbosity level in t-vecs cmd line args

Display definitions of words in the graph

When a user hovers on a node, display English definition(s) of the word.

  • Cache the results and hit translation API only if word is not in the cached list.
  • If it is an offline demo, provide a means of dealing with hovering on words not in the cached list.

Contributes to #9

Validate root word entered in input text box

Regarding the word entered in the input text box which sets the root word.

Reject the word if it consists of:

  • digits
  • punctuation
  • special characters
  • multiple words
  • URLs

Ensure the word entered is in the language that the selected radio button represents.

Display suitable error message on violation of any of the above conditions.

Contributes to #9
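The checks above can be sketched as a small validator (the function name and the letter-only rule are illustrative assumptions; the language-vs-radio-button check would additionally need the selected language passed in):

```python
import re

def is_valid_root_word(word):
    """Reject digits, punctuation, special characters, multiple words and URLs."""
    word = word.strip()
    if not word or len(word.split()) > 1:       # empty or multiple words
        return False
    if re.search(r'\d', word):                  # digits
        return False
    if re.match(r'(https?://|www\.)', word):    # URLs
        return False
    # Only letters allowed: rejects punctuation and special characters.
    return all(ch.isalpha() for ch in word)
```

On a False result, the UI would display a suitable error message for the violated condition.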

Poor model results when built using chaining of preprocessor objects

  • Multiple preprocessor objects are chained together using itertools.chain.
  • Verified that the __iter__ function is called for all the preprocessor objects.
  • Verified sentence count increments in the word2vec logs when multiple preprocessor objects are chained.
  • Verified that with and without chaining both result in the same vocabulary size.

With all the above criteria satisfied, the word2vec model results are still poor compared to building a single large preprocessor object and performing model generation.
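One common cause of exactly this symptom (offered as a hypothesis, not a confirmed diagnosis) is that a chain object is a one-shot iterator: Word2Vec iterates the corpus several times (once to build the vocabulary, then once per training epoch), and every pass after the first silently yields nothing. A minimal restartable wrapper that rebuilds the chain on each pass:

```python
from itertools import chain

class RestartableChain(object):
    """Re-create the chained iterator on every __iter__ call, so repeated
    passes (vocabulary build + training epochs) all see the full corpus."""

    def __init__(self, *iterables):
        self.iterables = iterables

    def __iter__(self):
        # A fresh chain per pass; each underlying iterable is re-iterated.
        return chain.from_iterable(iter(i) for i in self.iterables)
```

Passing such a wrapper instead of a raw itertools.chain object would be worth testing against the single-large-preprocessor baseline.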

Multivariate Analysis

Multivariate Analysis on the T-Vecs Model
  • Corpus Size
  • Bilingual Dictionary Size
  • P - Value
  • Correlation Score
  • Algorithm for mapping between Vector Spaces
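One candidate algorithm for the vector-space mapping is a linear transformation learned by least squares over the bilingual dictionary pairs (in the style of a translation matrix). A minimal numpy sketch, not necessarily the project's actual implementation:

```python
import numpy as np

def learn_mapping(source_vecs, target_vecs):
    """Learn a least-squares linear map W such that source_vecs @ W ~ target_vecs.

    source_vecs, target_vecs: (n_pairs, dim) arrays of word vectors for the
    bilingual dictionary pairs (hypothetical inputs, not tvecs' real API).
    """
    W, _, _, _ = np.linalg.lstsq(source_vecs, target_vecs, rcond=None)
    return W
```

The quality of the learned W as a function of dictionary size and cluster count is exactly what the analysis above would measure.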

Organise Directory Structure

A sensible directory structure needs to be set up which differentiates between
  • corpus
  • models
  • clustered output
  • bilingual_dictionary
  • src-code/preprocessing
  • src-code/model_generation
  • src-code/bilingual_dictionary_generation
  • src-code/mapping_vector_space

Article Similarity

Application of T-vecs model

Provide a means to compare 2 or more articles using our T-vecs model based on semantic similarity across languages.
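One straightforward way to do this (a sketch of one possible design, not necessarily the intended one) is to map each article's words into the shared cross-language space, average them, and compare the centroids by cosine similarity:

```python
import numpy as np

def article_similarity(vecs_a, vecs_b):
    """Cosine similarity between the mean word vectors of two articles.

    vecs_a, vecs_b: (n_words, dim) arrays of word vectors already mapped
    into the shared cross-language space (assumed inputs).
    """
    ca, cb = np.mean(vecs_a, axis=0), np.mean(vecs_b, axis=0)
    return float(np.dot(ca, cb) / (np.linalg.norm(ca) * np.linalg.norm(cb)))
```

Identical articles score 1.0; orthogonal centroids score 0.0.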

Improve performance of server for visualization

Factors for Performance improvement

Bilingual Dictionary Creation

Tasks

  • Generate Bilingual Dictionaries across selected languages.
  • Observe & analyse the minimum number of words required to obtain a suitable correlation score against human semantic similarity scores.
    • Pick only a sample of words from unique concepts, leading to a disparate collection of words in the generated bilingual dictionary.
    • Measure the effect of the sample size and the number of clusters to be generated.
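The sampling step could be sketched as follows, assuming cluster assignments are already available as a word -> cluster-id mapping (a hypothetical interface; round-robin across clusters keeps the sampled words spread over disparate concepts):

```python
import random
from collections import defaultdict

def sample_bilingual_entries(word_clusters, sample_size):
    """Pick words round-robin across clusters so the bilingual dictionary
    covers disparate concepts rather than one dense region of the space."""
    by_cluster = defaultdict(list)
    for word, cluster_id in word_clusters.items():
        by_cluster[cluster_id].append(word)

    sampled = []
    rng = random.Random(0)  # seeded for reproducibility
    while len(sampled) < sample_size and any(by_cluster.values()):
        for cluster_id, words in list(by_cluster.items()):
            if words and len(sampled) < sample_size:
                # One random word per cluster per round.
                sampled.append(words.pop(rng.randrange(len(words))))
    return sampled
```

Varying sample_size and the number of clusters here is precisely the experiment described in the tasks above.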

Need semantically correct handling of Unicode apostrophes

Code section to be looked into
  • hccorpus_preprocessor.py : _clean_word, _tokenize_sentences
  • Determine the need for including the commented-out code in _tokenize_sentences, and whether punctuation handling is needed at the sentence level or whether the word level suffices
Cases that need to be satisfied:
  • i'm -> i
  • test's -> test
  • tests' -> tests
  • hello, -> hello // Resolved by #1
  • world? -> world // Resolved by #1
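The word-level handling described above can be sketched as follows (illustrative only, not the actual _clean_word implementation), covering both the ASCII apostrophe and the Unicode right single quotation mark:

```python
import re

def clean_word(word):
    # Strip trailing punctuation:  "hello," -> "hello",  "world?" -> "world"
    word = re.sub(r"[^\w'’]+$", "", word)
    # Drop ASCII/Unicode apostrophe suffixes:
    # "i'm" -> "i", "test's" -> "test", "tests'" -> "tests"
    word = re.sub(r"['’]\w*$", "", word)
    return word
```

This keeps the handling at the word level; whether sentence-level punctuation handling is also needed is the open question above.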

Surface relatedness score in visualisation

While visualizing related words of a particular word, we initially intended to show the measure of relatedness through the length of the edge. This isn't really noticeable, so the visualization ends up looking like we're trying to translate the word. The value of our system lies in being able to put a number on how closely related two words from different languages are, not in rendering related words in a different language.

We need to focus the visualization on the relatedness score that we are able to retrieve.
Bonus if we can have a couple of text boxes where you can enter words and get a relatedness score.
Better yet, a digraph editor where you can drop nodes, type words on them and draw edges between them; the weight (length) of each edge is determined by sending a request to the vector space mapper.
