Currently, we feed in the data as one long string that ignores sentence and document boundaries. This probably does not have a large effect in large-scale experiments, but it has three problems/downsides.
- Users will start to ask about it, because it is a strange way to handle data. This will hinder wider adoption of the code in applied settings.
- It limits the use of the code for scrambled data, such as copyrighted news data.
- The results with the PELP model will be slightly worse than with models that handle this correctly, and the size of the effect depends on the size of the segments, i.e. our code would suffer more for tweets than for literature.
The only change we need to make is to handle arbitrary segments of text rather than one long string.
The simplest solution is to cap the context window at the segment edges (see the example below), while still treating the whole dataset as one corpus with regard to negative sampling.
Example:
"The brown dog. \n It jumps over the lazy fox."
CBOW (window size = 1, observations):
p(the | brown)
p(brown | the, dog)
p(dog | brown)
p(it | jumps)
p(jumps | it, over)
p(over | jumps, the)
...
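The pair generation above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the function name `cbow_pairs` and the whitespace tokenization are placeholders. The key point is that the context slice is capped at each segment's edges, so no training pair crosses a sentence or document boundary, while negative sampling (not shown) would still draw from the whole corpus.

```python
def cbow_pairs(segments, window=1):
    # Illustrative sketch: generate CBOW (context, target) pairs
    # per segment, capping the window at segment edges so that
    # no pair spans a sentence/document boundary.
    for segment in segments:
        tokens = segment.split()  # placeholder tokenization
        for i, target in enumerate(tokens):
            lo = max(0, i - window)  # cap at the left edge
            # slicing past the end caps at the right edge automatically
            context = tokens[lo:i] + tokens[i + 1:i + 1 + window]
            yield context, target

segments = ["the brown dog", "it jumps over the lazy fox"]
pairs = list(cbow_pairs(segments, window=1))
# first pair: (["brown"], "the")
```

Running this on the example above reproduces the observations listed: the first segment yields three pairs and the second six, with no context word taken from the other segment.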