Skip-gram model

Implementation of the Skip-gram model, as described in the paper Distributed Representations of Words and Phrases and their Compositionality by Tomas Mikolov et al. This is the extended version of the model, which includes negative sampling and subsampling of frequent words.

You can learn about the expected behavior of the scripts by looking at the tests, located in the tests folder. They include simple examples that illustrate how the classes work.

The model

When initialized, the model receives as inputs the size of the vocabulary and the number of dimensions each vector representation should have. These are used to initialize the embeddings.

The forward method of the model receives three tensors with indices:

  1. Input indices: indices of the words that are fed into the model.
  2. Output indices: indices of the context words we want to compute a score for.
  3. Noise indices: indices of the words used for negative sampling (explained below).

Given a batch size N, the input and output indices should be 1D tensors of size N, while the noise indices are given as a 2D tensor of size (N x k), where k is the number of negative samples used for each input-output pair.

The authors recommend values of k ranging from as low as 2 for large datasets up to 20 for smaller ones.

The forward method returns the score of each input-output pair, taking the negative samples into account. The higher the score, the more likely it is that the output word appears near the input word in the data. The formula for the score function can be found at the bottom of p. 3 of the paper.
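As an illustration, here is a minimal PyTorch sketch of such a model, computing the negative-sampling score log σ(u_o · v_i) + Σ_k log σ(−u_k · v_i) from the paper. The class and attribute names are assumptions for illustration, not necessarily the repository's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # Separate embedding tables for center (input) and context (output) words.
        self.in_embed = nn.Embedding(vocab_size, embedding_dim)
        self.out_embed = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, input_idx, output_idx, noise_idx):
        # input_idx, output_idx: (N,); noise_idx: (N, k)
        v = self.in_embed(input_idx)         # (N, d) center-word vectors
        u = self.out_embed(output_idx)       # (N, d) context-word vectors
        u_noise = self.out_embed(noise_idx)  # (N, k, d) negative samples

        # log sigmoid(u . v) for the true pair ...
        pos = F.logsigmoid((u * v).sum(dim=1))
        # ... plus the sum over k of log sigmoid(-u_k . v) for the negatives.
        neg = F.logsigmoid(-(u_noise * v.unsqueeze(1)).sum(dim=2)).sum(dim=1)
        return pos + neg                     # (N,) one score per pair

model = SkipGram(vocab_size=10_000, embedding_dim=100)
scores = model(torch.tensor([1, 2]), torch.tensor([3, 4]),
               torch.tensor([[5, 6, 7], [8, 9, 10]]))  # N=2, k=3
```

Training then maximizes this score (or, equivalently, minimizes its negation) over the sampled pairs.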

Negative Sampling

The Dataset (see load_data.py) samples k negative words for each pair of center and context words. The number is set when initializing the Dataset; if it is not set explicitly, it defaults to 15. The paper recommends values between 5 and 20 for small datasets, while for larger ones 2 negative samples may suffice.

Negative words are sampled from the unigram distribution, which the authors found to perform best. This is a categorical distribution over the vocabulary, where each word's probability is proportional to its frequency in the corpus.

More precisely, the authors state that the unigram distribution raised to the power of three quarters performs best. This means that the counts of the words are raised to 0.75 before being normalized into frequencies, which increases the relative probability of words that appear less often in the corpus. Given a dictionary with the words as keys and their counts as values, you can compute frequencies raised to any power with the script compute_frequencies.py (see the sketch below).
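A minimal sketch of that computation (the actual interface of compute_frequencies.py may differ):

```python
def compute_frequencies(counts, power=0.75):
    """Convert raw word counts into sampling probabilities raised to `power`."""
    powered = {word: count ** power for word, count in counts.items()}
    total = sum(powered.values())
    return {word: value / total for word, value in powered.items()}

# Raising to 0.75 boosts rare words: "cat" gets ~0.031 here,
# versus ~0.010 under the plain unigram distribution (10 / 1010).
print(compute_frequencies({"the": 1000, "cat": 10}))
```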

The data

Two data files are required: one containing the terms of the vocabulary, with one term per line, and another containing the text data, with one sentence per line. Only words present in the vocabulary may appear in the text data.
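For illustration, a tiny vocabulary file and a matching text data file could look like this (the file names are hypothetical):

vocab.txt (one term per line):

```
the
cat
sat
on
mat
```

data.txt (one sentence per line):

```
the cat sat on the mat
the cat sat
```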

Usually, you will want the data to already be tokenized and lemmatized. This can be achieved by running the script process_data.py (TODO).

The results

After running the train_model.py script, the state of the model is stored in the embeddings folder, under a subfolder named with the timestamp at which training started. The average score of each epoch, as well as of every 100 batches, is logged to a file named with the same timestamp used to name the folder where the model is stored.

You can retrieve the embeddings with the script get_embeddings.py. The embeddings will be stored as a dictionary, with the words as keys and their vector representations as values.
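A sketch of how such a dictionary could be built from a trained model (the helper name is hypothetical and in_embed refers to the attribute from the model sketch above; see get_embeddings.py for the actual interface):

```python
def build_embedding_dict(model, vocab):
    """Map each word to its learned input-embedding vector."""
    # vocab: list of words, where a word's position is its index in the model.
    vectors = model.in_embed.weight.detach().cpu().numpy()
    return {word: vectors[i] for i, word in enumerate(vocab)}
```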

Installing the package

  1. Build the EGG file by running the command python setup.py install.
  2. Move to the dist folder and run the command pip install skipgram-0.0.1-py3.7.egg to install the module.
    • You may need to change the Python version in the name of the EGG file.
  3. You can then access the module from other projects by importing it (import skipgram).
