jun20061588 / cpae

This project forked from tombosc/cpae

Code for EMNLP 2018 paper "Auto-Encoding Dictionary Definitions into Consistent Word Embeddings"

License: MIT License

Shell 1.84% Python 94.25% Jupyter Notebook 3.91%

cpae's Introduction

Dependencies

See requirements.txt. Some packages, such as blocks and fuel, should be installed with pip directly from their GitHub repositories:

  • pip install git+https://github.com/mila-udem/blocks.git@stable -r https://raw.githubusercontent.com/mila-udem/blocks/stable/requirements.txt
  • pip install git+https://github.com/mila-udem/fuel.git@stable

This code is heavily based on the dict_based_learning repo.

We directly include slightly modified versions of several software packages:

  • Word Embeddings Benchmark, which we have prepackaged into the archive because our modified version includes more datasets and also reads specific model files.
  • Retrofitting, in a version that fixes a minor bug and adds more options.

We also include the WordNet dictionary (definitions only) in data/dict_wn.json and its license in data/wordnet_LICENSE.
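
As a rough illustration, the definitions file can be inspected with Python's standard json module. The layout assumed here (a JSON object mapping each word to a list of definitions, each definition being a list of tokens) is an assumption and is not stated in this README, so check the file before relying on it. The helper names load_dictionary and count_definitions are hypothetical and not part of the repository.

```python
import json


def load_dictionary(path):
    """Load a JSON dictionary file into a Python dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_definitions(dictionary):
    """Return (number of defined words, total number of definitions).

    Assumes each value is a list of definitions (assumed layout).
    """
    n_words = len(dictionary)
    n_defs = sum(len(defs) for defs in dictionary.values())
    return n_words, n_defs


if __name__ == "__main__":
    # Tiny in-memory stand-in for data/dict_wn.json (assumed layout).
    toy = {
        "cat": [["a", "small", "feline"]],
        "bank": [["a", "financial", "institution"],
                 ["the", "side", "of", "a", "river"]],
    }
    print(count_definitions(toy))
```

With the real file, one would call load_dictionary("data/dict_wn.json") first and pass the result to count_definitions.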

Prepare the data

  1. Run ./build_split_dict.sh to build the split dictionary.
  2. Run ./build_full_dict.sh to build the full dictionary.

Pretrained embeddings

To use pretrained embeddings, you need .npy archives that will be loaded as input embeddings into the model and frozen (not trained). You will also need a custom vocabulary. For that purpose, you can modify and use one of two scripts: build_pretrained_archive.sh and build_pretrained_w2v_defs.sh. The first one includes words that have definitions but do not appear in any definition, while the second one does not.
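
The difference between the two vocabularies, and the shape of an embedding archive, can be sketched as follows. This is a minimal illustration and not the code the two scripts run: the function names, the dictionary layout (word mapped to a list of tokenized definitions), and the choice of leaving words without a pretrained vector as zero rows are all assumptions.

```python
import numpy as np


def build_vocab(definitions, include_defined_only=True):
    """Build a sorted vocabulary from a {word: [definition tokens, ...]} dict.

    Tokens that occur inside definitions are always included. When
    include_defined_only is True, defined words that never occur inside
    any definition are added as well.
    """
    vocab = {tok for defs in definitions.values() for d in defs for tok in d}
    if include_defined_only:
        vocab |= set(definitions)
    return sorted(vocab)


def build_embedding_matrix(vocab, pretrained, dim):
    """Stack pretrained vectors in vocabulary order.

    Words missing from `pretrained` keep a zero row (an assumption made
    for this sketch only).
    """
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for i, word in enumerate(vocab):
        if word in pretrained:
            matrix[i] = pretrained[word]
    return matrix


if __name__ == "__main__":
    defs = {"cat": [["small", "feline"]]}
    vocab = build_vocab(defs)
    pretrained = {"small": np.ones(4), "feline": np.full(4, 2.0)}
    emb = build_embedding_matrix(vocab, pretrained, dim=4)
    np.save("toy_embeddings.npy", emb)  # hypothetical file name
    print(vocab, np.load("toy_embeddings.npy").shape)
```

Row i of the saved matrix corresponds to word i of the vocabulary, so the vocabulary file and the .npy archive have to be produced together and kept in the same order.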

Once you have the custom vocabulary, you can add configurations for the new models in dictlearn/s2s_configs.py. We provide the configurations for the full dump experiment, the (very similar) dictionary data experiment with the word2vec pretrained archive, and the full dictionary experiment without any pretraining.

Train

See run.sh and the corresponding configuration names in dictlearn/s2s_configs.py for how to run one specific experiment.

Generate and evaluate embeddings

Once your model is trained, you can use it to generate embeddings for all the words that have a definition. Use evaluate_embeddings.sh to generate and evaluate embeddings. It is not fully automatic (it requires the right .tar archive containing the trained model), so please read it to make sure that the filenames are consistent with the number of epochs that you trained for, etc. The script generates the scores on the dev and test sets. The notebook notebooks/eval_embs.ipynb shows how to do model selection.

There is a separate script to evaluate the one-shot learning abilities of the model: see analyze_one_shot.sh.
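
The general idea of model selection on dev and test scores can be sketched as below: pick the checkpoint with the best dev score, then report its test score. This is a generic illustration under assumed inputs (a dict from epoch to dev and test scores where higher is better); the notebook may organize the scores differently, and select_best_epoch is a hypothetical name.

```python
def select_best_epoch(scores):
    """Return (best_epoch, test_score_at_best_epoch).

    `scores` maps an epoch number to {"dev": float, "test": float}.
    The epoch is chosen on the dev score only, so the reported test
    score is not used for selection.
    """
    best_epoch = max(scores, key=lambda epoch: scores[epoch]["dev"])
    return best_epoch, scores[best_epoch]["test"]


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    toy_scores = {
        10: {"dev": 0.40, "test": 0.50},
        20: {"dev": 0.60, "test": 0.45},
    }
    print(select_best_epoch(toy_scores))
```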

Comparing against the baselines

  • Hill's model is recovered (with shared embeddings between the encoder and decoder and an L2 distance instead of cosine) when c['proximity_coef'] = 0 for the configuration c, so you can use the same code as for AE and CPAE to run that model.
  • To do retrofitting, you can look into preparation_retrofitting/README.md.
  • To use dict2vec, please look at preparation_dict2vec/README.md.

Misc

To export definitions in a format that word2vec can read (using the naive concatenation scheme described in the paper), you can use bin/export_definitions.py. If you are looking for something that is not described here, please look at the scripts in bin/. There might be something undocumented that can help you.
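
For orientation, here is one plausible reading of such an export: each definition becomes one line of whitespace-separated tokens, with the defined word prepended. Whether this matches the scheme in the paper or the behavior of bin/export_definitions.py is an assumption, as are the dictionary layout and the name definitions_to_lines.

```python
def definitions_to_lines(definitions):
    """Turn {word: [definition tokens, ...]} into plain-text lines.

    Each line is the defined word followed by the tokens of one of its
    definitions (assumed scheme, not confirmed by the repository).
    """
    lines = []
    for word in sorted(definitions):
        for tokens in definitions[word]:
            lines.append(" ".join([word] + list(tokens)))
    return lines


if __name__ == "__main__":
    toy = {"cat": [["a", "small", "feline"]]}
    for line in definitions_to_lines(toy):
        print(line)
```

A file of such lines has the one-sentence-per-line layout that word2vec-style tools commonly accept as a training corpus.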

cpae's People

Contributors

tombosc
