
iarroyof / distributionalsemanticstabilitythesis


This repo contains code and some self-contained documentation about the implementation of theoretical issues from my PhD thesis. This work, entitled "Learning Kernels for Distributional Semantic Clustering", presents a novel (distributional) semantic embedding method.

Home Page: www.corpus.unam.mx:8069/member/1

License: GNU General Public License v2.0

Python 99.76% Shell 0.24%

distributionalsemanticstabilitythesis's Introduction

Distributional Semantics Stability and Consistency

This repo contains code and some self-contained documentation about the implementation of theoretical issues from my PhD thesis on Natural Language Processing (NLP). This work is entitled "Learning Kernels for Distributional Semantic Clustering", in which we present a novel (distributional) semantic embedding method that aims at consistently performing semantic clustering at the sentence level. Taking into account special aspects of Vector Space Models (VSMs), we propose learning reproducing kernels in classification tasks. In this way, it is possible to capture spectral features from the data. These features make it theoretically plausible to model semantic similarity criteria in Hilbert spaces, i.e. the embedding spaces where consistency of semantic similarity measures and stability of the algorithms implementing them can be ensured. We could improve semantic assessment over embeddings, which are criterion-derived distributional representations of traditional semantic vectors (e.g. windowed co-occurrence counts). The learned kernel can easily be transferred to clustering methods where the Class Imbalance Problem is considered (e.g. semantic clustering of polysemous definitions of terms, whose multiple meanings are Zipf distributed: Passonneau et al., 2012).

See the initial publications on my personal web site for further details: www.corpus.unam.mx:8069/member/1

Usage

We have run some tests with toy datasets. At the moment the tool is parallelized on a single local machine with multiple CPUs.

-- Dependencies --

Ubuntu 14.04 - The operating system on which we developed this tool

modshogun - The Shogun Machine Learning Toolbox

scipy - The scientific Python stack

-- Command line --

Execute the following piped command in the Ubuntu shell (make sure all the files are in the same directory from which you execute the command):

$ python gridGen.py -f gridParameterDic.txt -t 5 | ./mklParallel.sh | python mklReducer.py [-options] > o.txt

The gridGen.py script writes, e.g., 5 parameter search paths to stdout, one per line. These paths are randomly generated from the file gridParameterDic.txt, which contains a Python dictionary of parameters. The set of paths is read by the mklParallel.sh bash script, which launches multiple training jobs (the OS schedules them if there are more processes than machine cores). As the results of these jobs are written to stdout, the mklReducer.py script reads them, prints them, and emits the one with the maximum performance (run python mklReducer.py -h to see [-options]). These results are written either to an output file, e.g. o.txt, or to stdout (if no output file is specified with >).
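To illustrate what gridGen.py's output looks like, here is a minimal sketch of random grid-path generation; the parameter names and values are hypothetical, not the actual contents of gridParameterDic.txt:

    # Sketch of random parameter-path generation (hypothetical keys and values;
    # see gridGen.py and gridParameterDic.txt for the real parameter grid).
    import ast
    import random

    def random_paths(dict_file, n_paths):
        # Pick one value per parameter, n_paths times, printing one search path per stdout line.
        with open(dict_file) as f:
            grid = ast.literal_eval(f.read())  # e.g. {'kernel_width': [0.5, 1.0, 2.0], 'C': [1, 10]}
        for _ in range(n_paths):
            path = dict((name, random.choice(values)) for name, values in grid.items())
            print(path)

    if __name__ == '__main__':
        random_paths('gridParameterDic.txt', 5)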

Currently, the training and test sets are provided in the same file. This dataset file must be specified in the configuration file called mkl_object.conf. Its first line is the name of the file containing data samples (one sample vector per line of the file); the second line is the name of the file containing the labels {-1, 1} associated with each sample, one label per line. The third line is optional (unlike the first two) and may specify the directory where the first two files are stored. The portion of samples (and labels) used for training and testing, respectively, may be specified in code (for now) as the first argument of the function load_binData(), which is called at the beginning of the mklCall.py script. The current default is 75% of the samples for training; the remaining samples are used for testing.
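For reference, a minimal mkl_object.conf following the layout just described might look like this (the file names and directory are hypothetical examples, not files shipped with the repo):

    samples.txt
    labels.txt
    /home/user/data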

After the results are emitted, we can use the resulting best path as the parameter set for computing inner products between pairs of input word/phrase/sentence (WPS) vectors. These pairwise products can then be interpreted as consistent semantic similarity scores.
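As an illustration only (not the repo's actual scoring code), the following sketch computes such a similarity score for a single pair of WPS vectors as a weighted sum of Gaussian kernels, assuming the best path supplies the kernel widths and the learned MKL weights:

    import numpy as np

    def mkl_similarity(x, y, widths, weights):
        # Weighted sum of Gaussian kernels evaluated on one vector pair.
        d2 = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
        return sum(w * np.exp(-d2 / t) for w, t in zip(weights, widths))

    # Hypothetical learned parameters taken from the best search path:
    score = mkl_similarity([0.1, 0.3], [0.2, 0.1], widths=[0.5, 2.0], weights=[0.7, 0.3])
    print(score)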

Current idea

Following the approach of SemEval-2015 Task 1, we currently pose our problem as a two-class one. Class 1 corresponds to similar sentence pairs and class 0 (-1) to dissimilar ones ({s_a, s_b}). A similarity criterion is learned from the labels by the multikernel machine during training. Given that this machine classifies single vectors (not pairs of them), we propose different sentence vector combination schemata, i.e. sentence_pair_vector = combine_pair{combine_sa(word_vector_1, word_vector_2, ...), combine_sb(word_vector_1, word_vector_2, ...)}, where sentence_pair_vector is associated with a label in {0, 1} and fed to the multikernel machine. We expect the combine_pair{} operation to preserve the dissimilarity features between the combine_sa() and combine_sb() vectors (for sentences s_a and s_b, respectively), so that the multikernel machine will filter them. The constituent word vectors word_vector_i of each sentence are simply window co-occurrence counts.
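A minimal sketch of one possible instantiation of this schema, summing word vectors within each sentence and concatenating the two sentence vectors for the pair (the actual combine_sa/combine_sb/combine_pair operators may differ):

    import numpy as np

    def combine_sentence(word_vectors):
        # combine_sa / combine_sb: collapse a sentence's word vectors into one vector (here, a sum).
        return np.sum(word_vectors, axis=0)

    def combine_pair(v_a, v_b):
        # combine_pair: one possible pair operator, concatenation of both sentence vectors.
        return np.concatenate([v_a, v_b])

    # Hypothetical window co-occurrence count vectors for sentences s_a and s_b:
    s_a = [np.array([1, 0, 2]), np.array([0, 1, 1])]
    s_b = [np.array([2, 1, 0])]
    sentence_pair_vector = combine_pair(combine_sentence(s_a), combine_sentence(s_b))
    # sentence_pair_vector is then associated with a label in {0, 1} and fed to the multikernel machine.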

We also plan to use morphological and PMI (pointwise mutual information) vectors as alternative inputs to our combination schemata. A binary Naive Bayes classifier as well as a neural network with a logistic output are considered as baseline algorithms for addressing the task.
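For reference, such baselines could be sketched with scikit-learn (not a declared dependency of this repo; the pair vectors and labels below are random placeholders, and LogisticRegression stands in for the logistic-output network):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    X = np.random.rand(100, 6)              # hypothetical sentence-pair vectors
    y = np.random.randint(0, 2, size=100)   # labels in {0, 1}

    nb = GaussianNB().fit(X[:75], y[:75])            # binary Naive Bayes baseline
    lr = LogisticRegression().fit(X[:75], y[:75])    # stand-in for the logistic-output network
    print(nb.score(X[75:], y[75:]), lr.score(X[75:], y[75:]))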

distributionalsemanticstabilitythesis's People

Contributors

iarroyof, julianss


Forkers

wandabwa2004

distributionalsemanticstabilitythesis's Issues

Dimension-inference error when trying to create the output sparse matrix for a full input file

@julianss When trying to generate the pair matrix with db_word_space/combiner.py I get the following error (I already set limit = False in the main):

 Traceback (most recent call last):
  File "combiner.py", line 112, in <module>
    m = csr_matrix((data, (row, col)))
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 45, in __init__
    other = self.__class__(coo_matrix(arg1, shape=shape))
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 143, in __init__
    raise ValueError('cannot infer dimensions from zero sized index arrays')
ValueError: cannot infer dimensions from zero sized index arrays
^C
[1]+  Exit 1                  python combiner.py

The input and output files were also specified...
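For context, scipy raises this error when the data/row/col arrays passed to csr_matrix are empty and no explicit shape is given, which suggests the pair-building loop produced no entries; a minimal reproduction with one possible mitigation (the shape values are hypothetical):

    import numpy as np
    from scipy.sparse import csr_matrix

    data = np.array([], dtype=float)
    row = np.array([], dtype=int)
    col = np.array([], dtype=int)
    # m = csr_matrix((data, (row, col)))                     # ValueError: cannot infer dimensions
    m = csr_matrix((data, (row, col)), shape=(750, 473464))  # an explicit shape avoids the inference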

Error in CCBSP - res mode when trying to save the output sparse matrix

Hi @julianss, combiner.py hit an error with the 750-pair file (sts.input.headlines.txt). The script was in ccbsp and res mode:

python combiner.py
(750, 473464)
Traceback (most recent call last):
  File "combiner.py", line 114, in <module>
    io.mmwrite(output_file, m)    
  File "/usr/lib/python2.7/dist-packages/scipy/io/mmio.py", line 96, in mmwrite
    MMFile().write(target, a, comment, field, precision)
  File "/usr/lib/python2.7/dist-packages/scipy/io/mmio.py", line 337, in write
    self._write(stream, a, comment, field, precision)
  File "/usr/lib/python2.7/dist-packages/scipy/io/mmio.py", line 625, in _write
    IJV = vstack((coo.row, coo.col, coo.data)).T
  File "/usr/lib/python2.7/dist-packages/numpy/core/shape_base.py", line 228, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
MemoryError

I ran it on my machine. I suppose this issue will remain unresolved unless a cluster is used to generate files of this size. I will try with the cluster.
