This project forked from artetxem/uncovec

Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation

License: GNU General Public License v3.0

UncoVec

This is an open source implementation of our word embedding post-processing and evaluation framework, described in the following paper:

Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, and Eneko Agirre. 2018. Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation. In Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018).

If you use this software for academic research, please cite the paper in question:

@inproceedings{artetxe2018conll,
  author    = {Artetxe, Mikel  and  Labaka, Gorka  and Lopez-Gazpio, Inigo  and  Agirre, Eneko},
  title     = {Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation},
  booktitle = {Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018)},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics}
}

Requirements

If you just want to apply our proposed post-processing to your own embeddings, all you need is Python 3 with NumPy.

If you want to use the full evaluation framework, which tests embeddings in word analogy, word similarity/relatedness and semantic textual similarity for different values of the alpha parameter of our proposed post-processing, you will additionally need the following:

  • Python 2 with the following dependencies:
    • NumPy
    • h5py
  • Python 3 with the following dependencies:
    • NumPy
    • h5py
    • PyTorch (tested with v0.2)
    • NLTK
  • A copy of DAM-STS and VecMap at third-party/
  • A copy of the evaluation datasets at data/

You will need to take care of the Python libraries yourself, but we provide the following scripts to automatically download the required datasets and the dependencies under third-party/:

./get-data.sh
./get-third-party.sh
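
Since the Python libraries have to be installed by hand, a quick check of the Python 3 side can save a failed run. The following is a minimal sketch, not part of this repository: the function name `missing_modules` and the list `FULL_FRAMEWORK_MODULES` are hypothetical, and the only assumption is that the libraries listed above are imported under their usual module names (`torch` for PyTorch).

```python
import importlib.util

# Import names of the Python 3 dependencies listed above
# (PyTorch is imported as "torch"). This list is an illustration only.
FULL_FRAMEWORK_MODULES = ["numpy", "h5py", "torch", "nltk"]


def missing_modules(names):
    """Return the names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(FULL_FRAMEWORK_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

The check only covers the interpreter that runs it, so the Python 2 dependencies (NumPy and h5py) would need to be verified separately.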

Usage

The following command applies the proposed post-processing to the given embeddings in word2vec text format:

python3 post-process.py ALPHA < INPUT.EMB.TXT > OUTPUT.EMB.TXT
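
This README does not spell out what the post-processing computes, so the following is only a sketch of the idea under stated assumptions, not the code of `post-process.py`. It assumes the linear transformation defined in the paper, where the embedding matrix X is multiplied by Q Λ^α, with Q and Λ taken from the eigendecomposition of XᵀX. It also assumes the word2vec text format consists of a `COUNT DIM` header followed by one `word v1 ... vDIM` line per word. All function names are hypothetical.

```python
import io

import numpy as np


def read_embeddings(stream):
    """Parse word2vec text format into a word list and a NumPy matrix."""
    count, dim = (int(field) for field in stream.readline().split())
    words, rows = [], []
    for _ in range(count):
        word, *values = stream.readline().rstrip().split(" ")
        words.append(word)
        rows.append([float(v) for v in values[:dim]])
    return words, np.array(rows)


def adjust_similarity_order(x, alpha):
    """Return X @ Q @ diag(lam ** alpha), where X^T X = Q diag(lam) Q^T.

    Assumption: X has full column rank, so every eigenvalue is positive
    and lam ** alpha is well defined for any real alpha.
    """
    lam, q = np.linalg.eigh(x.T @ x)
    # Broadcasting scales each column of Q by the matching eigenvalue power.
    return x @ (q * lam ** alpha)


def write_embeddings(words, x, stream):
    """Write the matrix back in word2vec text format."""
    stream.write(f"{x.shape[0]} {x.shape[1]}\n")
    for word, vector in zip(words, x):
        stream.write(word + " " + " ".join(f"{v:.6f}" for v in vector) + "\n")
```

Under this definition, an alpha of 0 leaves all dot products between words unchanged, and an alpha of 0.5 turns them into second order similarities, that is, dot products between rows of XXᵀ.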

Alternatively, you can run the full evaluation framework, which tests embeddings in word analogy, word similarity/relatedness and semantic textual similarity for different values of the alpha parameter (note that this requires an NVIDIA GPU with CUDA support):

./evaluate.sh EMBEDDINGS.TXT

Using the above script, you can reproduce the experiments reported in our paper as follows:

./evaluate.sh embeddings/glove.200k.txt
./evaluate.sh embeddings/fasttext.200k.txt

FAQ

Why doesn't your script download and preprocess the word2vec embeddings used in your paper? How can I do that myself?

These embeddings are hosted on Google Drive and require a few clicks to download, which cannot be easily automated with command-line tools. In any case, you can do it manually as follows:

  1. Download the embeddings from here.
  2. Convert the embeddings from binary to text format. You can use convertvec for that.
  3. Cut off the vocabulary as done in our experiments. You can use the following command for that:

(echo '200000 300' && tail -n +2 WORD2VEC.FULL.TXT | head -200000) > embeddings/word2vec.200k.txt

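
If you prefer to do the vocabulary cutoff of step 3 in Python, it could be sketched as follows. This is an illustration rather than a script shipped with this repository, and the function name `truncate_vocabulary` is hypothetical. Unlike the shell command, it reads the dimensionality from the original header instead of hardcoding 300, and it writes the actual number of kept entries if the file has fewer than the limit.

```python
import io
from itertools import islice


def truncate_vocabulary(src, dst, limit=200000):
    """Copy the first `limit` entries of a word2vec text file, rewriting the header.

    `src` and `dst` are text streams; returns the number of entries kept.
    """
    count, dim = (int(field) for field in src.readline().split())
    kept = min(count, limit)
    dst.write(f"{kept} {dim}\n")
    # The header has already been consumed, so this copies entries 1..kept.
    for line in islice(src, kept):
        dst.write(line)
    return kept


# Example usage with the file names from the command above:
# with open("WORD2VEC.FULL.TXT", encoding="utf-8") as src, \
#         open("embeddings/word2vec.200k.txt", "w", encoding="utf-8") as dst:
#     truncate_vocabulary(src, dst)
```
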
I am following the instructions in this README, but I am not getting the exact same numbers reported in your paper. Why is that?

You may get slightly different results for the following two reasons, but differences should be negligible and the general trends should be the same:

  1. Different hardware and library versions might result in minor numerical variations in the underlying computations. These are generally imperceptible, but tend to be magnified by the stochastic nature of neural network training, which in this case affects the STS-DAM experiments. For that reason, we report the average across 10 runs for our STS-DAM experiments, which should minimize the effect of this type of variability.
  2. In the case of MEN, the provided script downloads the official natural form version, whereas we used an in-house lemmatized version in our experiments. You should actually get slightly better results with the official version used in this public release. In any case, the general trends for different values of the alpha parameter are the same.

License

Copyright (C) 2018, Mikel Artetxe

Licensed under the terms of the GNU General Public License, either version 3 or (at your option) any later version. A full copy of the license can be found in LICENSE.txt.
