kshitijkarthick / tvecs Goto Github PK

View Code? Open in Web Editor NEW

3.0 7.0 0.0 7.95 MB

Establish Semantic Relatedness across Languages Documentation - http://kshitijkarthick.github.io/tvecs

Home Page: https://tvecs.kshitijkarthick.me

License: MIT License

Python 56.50% CSS 6.38% HTML 10.20% Makefile 3.35% Batchfile 3.16% JavaScript 19.81% Dockerfile 0.59%

machine-learning natural-language-processing indic-languages word2vec semantic-relationship-extraction

tvecs's Introduction

T-Vecs

Prerequisites

Python 2.7 setup and installed
Pip setup and installed
Ensure all dependencies of requirements.txt are satisfied
Download nltk_data using nltk.download() -> only tokenizers required
Download corpus and extract in specified directory

Setup Development Environment

git clone https://github.com/KshitijKarthick/tvecs.git
cd tvecs
pip install -r requirements.txt
# Only Model needs to be downloaded and extracted in the t-vex directory

Install as a Package

# Install package
pip install git+https://github.com/KshitijKarthick/tvecs.git

# Usage from cmd line without recommendations menu
tvecs -c ./config.json

# Usage from cmd line with recommendations menu
tvecs -c ./config.json -r

# Usage without config file, with models, without recommendations menu
tvecs -l1 english -l2 hindi -m1 ./data/models/t-vex-english-models -m2 ./data/models/t-vex-hindi-models

# Usage without config file, with models, with recommendations menu
tvecs -r -l1 english -l2 hindi -m1 ./data/models/t-vex-english-models -m2 ./data/models/t-vex-hindi-models

# Usage from inside python as a library
import tvecs.vector_space_mapper.vector_space_mapper as vm

Data

Corpus Download details

We are focusing on [English, Hindi] other possible prospects we could look into Kannada, Tamil languages

Sources

HcCorpora http://www.corpora.heliohost.org/download.html
Emille Corpora http://www.emille.lancs.ac.uk/
Leipzig Corpora http://corpora.uni-leipzig.de/

Bilingual Dictionary details

Provided in the repository, data/bilingual_dictionary. Compiled using the following sources.

Credits

Shabdakosh http://www.shabdkosh.com/content/category/downloads/
Dicts Corpora http://dicts.info/dictlist1.php?l=Hindi

Evaluation Dataset details

Human relatedness judgement score datasets provided in data/evaluate

Credits

wordsim_relatedness_goldstandard
MEN_dataset_natural_form_full
Mturk_287
Mturk_771

Ensure Model is downloaded and extracted in the t-vex directory

data/corpus -> corpus
data/models -> models

Usage Details

T-Vecs Driver Module Cmd Line Args

$ python -m tvecs --help

usage: __main__.py [-h] [-v] [-s] [-i ITER] [-m1 MODEL1] [-m2 MODEL2]
               [-l1 LANGUAGE1] [-l2 LANGUAGE2] [-c CONFIG]
               [-b BILINGUAL_DICT] [-r]

Script used to generate models

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase output verbosity
  -s, --silent          silence all logging
  -i ITER, --iter ITER  number of Word2Vec iterations
  -m1 MODEL1, --model1 MODEL1
                        pre-computed model file path
  -m2 MODEL2, --model2 MODEL2
                        pre-computed model file path
  -l1 LANGUAGE1, --language1 LANGUAGE1
                        language name of model 1/ text 1
  -l2 LANGUAGE2, --l2 LANGUAGE2
                        language name of model 2/ text 2
  -c CONFIG, --config CONFIG
                        config file path
  -b BILINGUAL_DICT, --bilingual BILINGUAL_DICT
                        bilingual dictionary path
  -r, --recommendations
                        provide recommendations

Config File Format

See config.json in the repository for example.

Execution & Building

# Preprocessing, Model Generation, Bilingual Generation, Vector Space Mapping between two languages english hindi from the corpus using the config file

python -im tvecs -c config.json

# [ utilise the dictionary tvex_calls which contains results of every step performed ]

# Bilingual generation, Vector space mapping between two languages english hindi providing the models

python -im tvecs -l1 english -l2 hindi -m1 ./data/models/t-vex-english-model -m2 ./data/models/t-vex-hindi-model -b ./data/bilingual_dictionary/english_hindi_train

python -im tvecs -c config.json

# [ utilise the dictionary tvex_calls which contains results of every step performed ]

Obtain Recommendations

# Provide Recommendations using config file
python -m tvecs -c ./config.json -r

# Provide Recommendations using cmd line params
python2 -m tvecs -l1 english -l2 hindi -m1 ./data/models/t-vex-english-model -m2 ./data/models/t-vex-hindi-model -b ./data/bilingual_dictionary/english_hindi_train_bd -r

# Output for recommendations

Enter your Choice:
1> Recommendation
2> Exit

Choice: 1
Enter word in Language english: examination

Word    =>  Score

जाँच    =>  0.643208742142
नियुक्ति    =>  0.640852451324
जांच    =>  0.638412773609
अध्ययन  =>  0.638307392597
विवेचना =>  0.638229370117
मंत्रणा =>  0.634038448334
पुनर्मूल्यांकन  =>  0.627283990383
अध्‍ययन =>  0.624040842056
निरीक्षण    =>  0.623490035534
जाच =>  0.619904220104

Visualisation of vector space

python -m tvecs.visualization.server
[ Open browser to localhost:5000 for visualization ]
[ Ensure model generation is completed before running visualization ]

Execution of Individual Modules

# bilingual dictionary generation -> clustering vectors from trained model
python -m tvecs.bilingual_generator.clustering

# model generation
python -m tvecs.model_generator.model_generation

# vector space mapping [ utilise the object vm to obtain recommendations
python -m tvecs.vector_space_mapper.vector_space_mapper

Execution of Unit Tests

# Run all unit tests
py.test

# Run individual module tests seperately
py.test tests/test_emille_preprocessor.py
py.test tests/test_leipzig_preprocessor.py
py.test tests/test_hccorpus_preprocessor.py

Generate Documentation

# Generate HTML Documentation
make html
cd documentation/html && python -m SimpleHTTPServer
# [ Open browser to localhost:8000 for visualization ]

# Generate Man Pages
make man
cd documentation/man && man -l tvecs.1


# Other Makefile options
make

Please use `make <target>' where <target> is one of
html       to make standalone HTML files
dirhtml    to make HTML files named index.html in directories
singlehtml to make a single large HTML file
pickle     to make pickle files
json       to make JSON files
htmlhelp   to make HTML files and a HTML help project
qthelp     to make HTML files and a qthelp project
applehelp  to make an Apple Help Book
devhelp    to make HTML files and a Devhelp project
epub       to make an epub
epub3      to make an epub3
latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter
latexpdf   to make LaTeX files and run them through pdflatex
latexpdfja to make LaTeX files and run them through platex/dvipdfmx
text       to make text files
man        to make manual pages
texinfo    to make Texinfo files
info       to make Texinfo files and run them through makeinfo
gettext    to make PO message catalogs
changes    to make an overview of all changed/added/deprecated items
xml        to make Docutils-native XML files
pseudoxml  to make pseudoxml-XML files for display purposes
linkcheck  to check all external links for integrity
doctest    to run all doctests embedded in the documentation (if enabled)
coverage   to run coverage check of the documentation (if enabled)

tvecs's People

Contributors

Stargazers

Watchers

tvecs's Issues

Need to setup evaluation method for the mapping between vector spaces

Obtain Correlation Score between semantic similarity score of 2 words from the T-Vecs model and semantic similarity score of two words using human evaluation.

Need to remove digits from corpora during Preprocessing

Corpora to be enhanced:

HC Corpus
Emille Corpus
Leipzig Corpus

Driver Module should support utilisation of different kind of corpus

Add T-Vecs Driver module support for multiple corpus type

Currently t-vecs.py builds all corpus with HcCorpusPreprocessor
Add suitable conf file to accept as param to build based on the same

Example Config File

Language 1 : 'English'
corpora : [
'./path/corpus_fname': 'HcCorpusPreprocessor'
'./path/corpus_fname': 'EmillePreprocessor
]
Language 2 : 'English'
corpora : [
'./path/corpus_fname': 'HcCorpusPreprocessor'
'./path/corpus_fname': 'EmillePreprocessor
]

Improve the UI/UX of the Visualization

Cross Lingual Visualization
Semantic Space Visualization
Multivariate Analysis

Cross Language Visualization support

Support for Visualization for Cross Language.

Example: Given an English word, recommend 10 most semantically similar hindi words.

visualization/server.py -> CherryPy server needs to add external route for handling this service.
- Utilise the module: modules/vector_space_mapper/vector_space_mapper.py -> get_recommendations_from_word()
visualization/static/index.html -> Client side changes to add cross language support.

Setup Logging in all Modules

Setup Logging under the logging handler name T-Vecs
Show Distinction between modules while logging, using a module name
Employ multiple levels of logging
- Info
- Debug
- Warn/Error
[ Optional ] Allow Logging into file
Set verbosity level in t-vecs cmd line args

Display definitions of words in the graph

When a user hovers on a node, display English definition(s) of the word.

Cache the results and hit translation API only if word is not in the cached list.
If it is an offline demo, provide a means of dealing with hovering on words not in the cached list.

Contributes to #9

Validate root word entered in input text box

Regarding the word entered in the input text box which sets the root word.

Reject the word if it consists of:

digits
punctuation
special characters
multiple words
URLs

Ensure the word entered is of the same language as the radio button selected represents.

Display suitable error message on violation of any of the above conditions.

Contributes to #9

Poor model results when built using chaining of preprocessor objects

Multiple preprocessor objects are chained together using itertools.chain.
Verified that __ iter __ function is called for all the preprocessor objects.
Verified sentence increments in word2vec logs when multiple preprocessor objects chained.
Verified with/without chaing both result in same vocabulary size.

With all the above criterion satisfied yet the word2vec model results are poor when compared to building a single large preprocessor object and performing model generation.

Multivariate Analysis

Multivariate Analysis on the T-Vecs Model

Corpus Size
Bilingual Dictionary Size
P - Value
Correlation Score
Algorithm for mapping b/w Vector Spaces

Organise Directory Structure

Sensible Directory Structure needs to be set-up which differentiates between

corpus
models
clustered output
bilingual_dictionary
src-code/preprocessing
src-code/model_generation
src-code/bilingual_dictionary_generation
src-code/mapping_vector_space

Application of T-vecs model

Provide a means to compare 2 or more articles using our T-vecs model based on semantic similarity across languages.

Improve performance of server for visualization

Factors for Performance improvement

Cache results
- Tools that can be considered
  - https://pypi.python.org/pypi/HermesCache [ Supports Redis and other options ]
  - https://pypi.python.org/pypi/pyfscache/
  - https://pypi.python.org/pypi/filecache
- Sections needed to be cached
  - T-Vecs Model loading when new language is slow.
  - Cross Lingual Instance loading is very slow
Memory Usage
- Sections to be considered
  - vector_space_mapper instance pickle filesize usually ~800mb
  - t-vecs models filesize ~120mb

Tasks

Generate Bilingual Dictionaries across selected languages.
Observe & analyse the minimum number of words required for obtaining suitable output on correlation score with human semantic similarity score.
- Pick only sample size of words from unique concepts leading to a disparate collection of words in the generated bilingual dictionary.
- Measure the affect of sample size and number of clusters to be generated for the same.

Obtain Correlation Score b/w Human Scores & T-Vecs Model

Generate Correlation Score Between Human Score & T-Vecs Model generated for semantically similar words.

Need Considerable size of Bilingual Dictionary [ English-Hindi ]

Obtain Bilingual Dictionary from either methods for English-Hindi :

Pre-processed Bilingual Dictionary available online.
Generated Bilingual Dictionary by some online translation service with well distributed sampled words from the clusters generated.

Need Semantically correct handling of Unicode apostrophe's

Code section to be looked into

hccorpus_preprocessor.py : _clean_word, _tokenize_sentences
Determine need for inclusion of the commented out code in _tokenize_sentences and if any kind of punctuation handling needed at sentence level or only word level suffices

Cases that need to be satisfied:

i'm -> i
test's -> test
tests' -> tests
hello,-> hello // Resolved by #1
world? -> world // Resolved by #1

Unicode Punctations Need Preprocessing

Unicode punctuations need to be preprocessed before model generation

Surface relatedness score in visualisation

While visualizing related words of a particular word, we initially intended to show the measure of relatedness through the length of the edge. This isn't really noticeable. So the visualization is ending up looking like we're trying to translate the word. The value of our system lies in being able to put a number on HOW closely related, two words from different languages are and not in rendering related words in a different language.

We need to focus the visualization on the relatedness score that we are able to retrieve.
Bonus if we can have a couple text boxes and be able to enter words and get a relatedness score.
Better yet, a digraph editor where you can drop nodes, type out words on them and draw edges between them, the weight (length) of the edge gets determined by sending a request to vector space mapper.