
EmbeddingsTools


EmbeddingsTools.jl is a Julia package that provides additional tools for working with word embeddings, complementing existing packages such as Embeddings.jl. Note that compatibility with other packages is currently limited; in particular, type conversions are missing. Still, the package can be used as a standalone tool for working with embedding vectors.

Installation

You can install EmbeddingsTools.jl from GitHub through the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run:

pkg> add https://github.com/Marwolaeth/EmbeddingsTools.jl.git

Or, within your Julia environment, use the following command:

using Pkg
Pkg.add("https://github.com/Marwolaeth/EmbeddingsTools.jl.git")

Usage

The package is intended for reading local embedding files; it currently supports text files (e.g., .vec) and binary Julia files, and it can perform basic operations on these embeddings.

The embeddings are represented as one of two types, WordEmbedding or IndexedWordEmbedding. Both contain an embedding table and a token vocabulary, similar to embedding objects in Embeddings.jl, plus ntokens and ndims fields that store the dimensions of the embedding table. In addition, IndexedWordEmbedding objects carry a lookup dictionary that maps each token to a view of its embedding vector.
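As a rough sketch (the field names follow the description above, but the actual struct definitions in the package may differ), the two types look something like this:

# Hypothetical sketch of the two types; field names are inferred from
# the description above and may not match the package exactly.
struct WordEmbedding
    embeddings::Matrix{Float32}  # the embedding table
    vocab::Vector{String}        # the token vocabulary
    ntokens::Int                 # number of tokens
    ndims::Int                   # dimensionality of the vectors
end

struct IndexedWordEmbedding
    embeddings::Matrix{Float32}
    vocab::Vector{String}
    ntokens::Int
    ndims::Int
    dict::Dict{String,SubArray}  # lookup: token => view of its vector
end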

Indexing is useful when the embedding table must be aligned with a pre-existing vocabulary, such as the one obtained from a corpus of texts.

Loading Word Embeddings

The original goal of the package was to let users read local embedding vectors into Julia, a feature we found quite limited in Embeddings.jl. For example, a user can manually download an embedding table from the FastText repository or the RusVectōrēs project (a collection of Ukrainian and Russian embeddings) and then read it into Julia:

using EmbeddingsTools

# Download and unzip the embedding file
# (skip this step if you prefer to do it manually)
download(
    "https://rusvectores.org/static/models/rusvectores4/taiga/taiga_upos_skipgram_300_2_2018.vec.gz",
    "taiga_upos_skipgram_300_2_2018.vec.gz"
)
# Requires the gzip utility to be available on the system
run(`gzip -dk taiga_upos_skipgram_300_2_2018.vec.gz`);

# Load word embeddings from a file
embtable = read_vec("taiga_upos_skipgram_300_2_2018.vec")

The read_vec() function is the basic reading function. It takes two arguments, path and delim (the delimiter), and creates a WordEmbedding object using CSV.jl. It reads the entire embedding table at once, which yields good performance thanks to its straightforward logic, but it may fail on embeddings with more than 500k words.
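For instance, to read a comma-delimited file (assuming delim is the second positional argument and using an illustrative file name; see ?read_vec for the exact signature):

# Override the default delimiter for a comma-separated embedding file
embtable_csv = read_vec("my_embeddings.csv", ',')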

read_embedding() is an alternative function that provides more control through keyword arguments. If max_vocab_size is specified, the function limits the vocabulary to that many words. If a vector keep_words is provided, only those words are kept; for any word in keep_words that is not found, the function returns a zero vector.

If the file is a WordEmbedding object stored in a Julia binary file (with the extension .jld or the specific formats .emb or .wem), the entire embedding is loaded and the keyword arguments do not apply. You can also use the read_emb() function directly on binary files. See ?write_embedding for saving embedding objects to Julia binary files so they can be read faster in the future.

# Load word embeddings for 10k most frequent words in a model
embtable = read_embedding(
    "taiga_upos_skipgram_300_2_2018.vec",
    max_vocab_size=10_000
)
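A sketch of the remaining options described above (the argument order of write_embedding, the .emb file name, and the word list are illustrative assumptions; see ?write_embedding):

# Keep only a fixed set of words; any word missing from the
# embedding vocabulary gets a zero vector
embtable_subset = read_embedding(
    "taiga_upos_skipgram_300_2_2018.vec",
    keep_words=["человек_NOUN", "жизнь_NOUN"]
)

# Save the embedding to a Julia binary file for faster reloading,
# then read it back directly with read_emb()
write_embedding(embtable, "taiga_upos_skipgram_300_2.emb")
embtable_bin = read_emb("taiga_upos_skipgram_300_2.emb")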

Creating Embedding Indices

Some functions in EmbeddingsTools.jl behave differently depending on whether the embedding object contains a lookup dictionary. An object with a lookup dictionary has the type IndexedWordEmbedding and is considerably faster to operate on; an object without one has the type WordEmbedding. Indexing takes a bit of time, so it should only be done when necessary. To index an embedding object, call either the constructor IndexedWordEmbedding() or index() on it.

# These are equivalent
embtable_ind = IndexedWordEmbedding(embtable)
embtable_ind = index(embtable)

Querying Embeddings

We can use the get_vector() function with either an indexed or a simple embedding table to obtain the word vector for a given word:

get_vector(embtable, "человек_NOUN")
get_vector(embtable_ind, "человек_NOUN")

Limiting Embedding Vocabulary

Regardless of whether we read the embedding with a limited vocabulary size, we can limit it afterwards with the limit() function:

small_embtable = limit(embtable, 111)

Embedding Subspaces

At times, we may need to adjust an embedding table to match a given set of words or tokens, for example one obtained by pre-processing a corpus of text documents with the TextAnalysis.jl package. The subspace() function creates a new WordEmbedding object from an existing embedding and a vector of strings containing the tokens of interest. The order of the new embedding vectors corresponds to the order of the input tokens; if a token is not present in the source vocabulary, a zero vector is returned for that token.

Note that the subspace() method performs much faster on an indexed embedding object.

words = embtable.vocab[13:26]
embtable2 = subspace(embtable_ind, words)

Dimensionality Reduction

The reduce_emb() function reduces the dimensionality of embedding objects, whether they are indexed or not. You can choose between two reduction techniques via the method keyword: pca (the default) for Principal Component Analysis, or svd for Singular Value Decomposition.

# Reduce the dimensionality of the word embeddings using PCA or SVD
embtable20 = reduce_emb(embtable, 20)
embtable20_svd = reduce_emb(embtable, 20, method="svd")

Compatibility

As of the current version, EmbeddingsTools.jl has limited compatibility with Embeddings.jl, the package that inspired this project. We are actively working on expanding compatibility and interoperability with a wider range of packages.

Contributing

We welcome contributions from the community to enhance the functionality and compatibility of EmbeddingsTools.jl. If you encounter any issues or have ideas for improvement, please feel free to open an issue or submit a pull request on our GitHub repository.

License

EmbeddingsTools.jl is provided under the MIT License. See the LICENSE file for more details.
