ninpnin / probabilistic-word-embeddings

Train and evaluate probabilistic word embeddings with Python.

Home Page: https://ninpnin.github.io/probabilistic-word-embeddings/

Language: Python 100%
Topics: probabilistic-modeling, word-embeddings, tensorflow

probabilistic-word-embeddings's Introduction

probabilistic-word-embeddings v1.10.0

A probabilistic word embedding module for Python, built on TensorFlow 2.x and TensorFlow Probability.

Documentation is available at https://ninpnin.github.io/probabilistic-word-embeddings/.

probabilistic-word-embeddings's People

Contributors

github-actions[bot], iscyb, ninpnin


Forkers

rojanka

probabilistic-word-embeddings's Issues

Refactoring overhaul

  • Remove unnecessary reliance on TensorFlow Probability
    • The Embedding class doesn't really need to be a tfd.Distribution
  • Switch from numeric indices to string keys
    • theta[23] -> theta["dog"]
    • theta[4394] -> theta["tea_2010"]
    • theta[230372] -> theta["professor_context"]
    • Adapt sample generation to this
    • Adapt preprocessing to this
  • Narrow the number of likelihoods down to two (SGNS and CBOW), as theta will always have a similar structure
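The string-keyed lookup from the list above can be sketched as follows. This is a minimal illustration, not the library's actual Embedding class: the class name, constructor, and dimensionality are hypothetical, and it simply wraps a vocabulary dict around a NumPy parameter matrix.

```python
import numpy as np

class Embedding:
    """Sketch of a string-indexed embedding (hypothetical API)."""
    def __init__(self, vocab, dim=25, seed=0):
        rng = np.random.default_rng(seed)
        # Map each word to a row of the parameter matrix theta
        self.vocab = {word: i for i, word in enumerate(vocab)}
        self.theta = rng.normal(size=(len(vocab), dim))

    def __getitem__(self, word):
        # Strings replace raw numeric indices: theta["dog"] instead of theta[23]
        return self.theta[self.vocab[word]]

theta = Embedding(["dog", "tea_2010", "professor_context"])
vec = theta["dog"]   # instead of theta[23]
print(vec.shape)     # (25,)
```

Sample generation and preprocessing would then emit string keys such as "tea_2010" or "professor_context" directly, instead of maintaining a separate word-to-index mapping.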

Scale the prior weight in proportion to the amount of data

Currently, the prior is applied at full weight even when only a subset of the data is used. Since this is often the case, there needs to be a way to scale the strength of the prior to match the number of data points in the likelihood.
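One common way to implement this is to weight the prior log-density by the fraction of data actually used, so the prior is applied exactly once in expectation over the full dataset. A minimal sketch, with illustrative function and argument names rather than the library's API:

```python
def scaled_objective(log_likelihood, log_prior, n_batch, n_total):
    """Weight the prior by the fraction of data used, so that summing
    the objective over all batches applies the prior exactly once."""
    return log_likelihood + (n_batch / n_total) * log_prior

# With a 10% subsample, the prior contributes only 10% of its full weight.
obj = scaled_objective(log_likelihood=-120.0, log_prior=-50.0,
                       n_batch=100, n_total=1000)
print(obj)  # -125.0
```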

Saving large outputs

Trying to save the output (e) of a largeish dynamic model with e.save(file_name) raises:

OverflowError: cannot serialize a bytes object larger than 4 GiB

Suggested solution: change line 108 of probabilistic_word_embeddings/embeddings.py to "pickle.dump(d, f, protocol=4)"


Python 3.7.6
Ubuntu 20.04.5 LTS
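The suggested fix works because pickle protocol 4 (available since Python 3.4) supports objects larger than 4 GiB, whereas protocols 0-3 cannot serialize a single bytes object over that limit. A minimal illustration of the change, with a small stand-in dict in place of the actual embedding output:

```python
import os
import pickle
import tempfile

d = {"theta": b"\x00" * 1024}  # stand-in for the large embedding dictionary

path = os.path.join(tempfile.mkdtemp(), "embedding.pkl")
with open(path, "wb") as f:
    # protocol=4 lifts the 4 GiB per-object limit of older protocols
    pickle.dump(d, f, protocol=4)

with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored == d)  # True
```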

Handling edge cases in the data and enabling multiple text segments

Currently, data is fed in as one long string that ignores sentence and document boundaries. This probably does not have a large effect in large-scale experiments, but it has a few problems/downsides.

  1. Users will ask about it, because it is a strange way to handle data. This will limit the adoption of the code in applied settings.
  2. It limits the use of the code on data with scrambled sentence order, such as copyrighted news data.
  3. The results with the PELP model will be slightly worse than with models that handle boundaries correctly, and the effect depends on the size of the segments; i.e., our code would be worse for tweets than for literature.

The only change we need is to handle arbitrary segments of text rather than one long string.

The best solution is to simply cap the context window at the segment edges (see the example below), while still treating the whole data as one dataset with regard to negative samples.

Example:
"The brown dog. \n It jumps over the lazy fix."

CBOW (window size = 1, observations):
p(the| brown)
p(brown|the, dog)
p(dog | brown )
p(it | jumps )
p(jumps | it, over)
p(over| jumps , the)
...
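The capped-window observation generation described above can be sketched as follows. This is an illustrative implementation, not the library's code; it assumes tokenized segments are already available and shows only how the context is clipped at segment edges:

```python
def cbow_observations(segments, window=1):
    """Generate (center, context) pairs per segment, capping the
    context window at segment edges instead of crossing them."""
    for tokens in segments:
        for i, center in enumerate(tokens):
            # Slicing clips automatically at both ends of the segment
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            yield center, context

segments = [["the", "brown", "dog"],
            ["it", "jumps", "over", "the", "lazy", "fox"]]
for center, ctx in cbow_observations(segments):
    print(f"p({center} | {', '.join(ctx)})")
```

Negative samples would still be drawn from the unigram distribution of the whole dataset, so only the positive (center, context) pairs are affected by the segmentation.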
