This project is forked from jonathanraiman/pythonobjectlm.

Python implementation of Object Language Model with Cython fast training.


Object Language Model

A language model for documents and words. It trains a one-hidden-layer feed-forward neural network that takes document vectors and word vectors as inputs and predicts multiclass and binary labels as targets, using softmax and sigmoid activations respectively. Both word and document vectors are trained through backpropagation. Word vectors are shared among all documents, while document vectors live in their own embedding. The graph for this neural network is shown below:

[figure: computation graph of the network]
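
As a rough sketch of the forward pass (a minimal illustration, not the library's code: it assumes the window's word vectors are concatenated with the document vector, one tanh hidden layer, and separate softmax and sigmoid output heads):

import numpy as np

def forward_pass(word_vecs, doc_vec, W_hidden, b_hidden, W_softmax, W_sigmoid):
    # Concatenate the window's word vectors with the document's own vector.
    x = np.concatenate(word_vecs + [doc_vec])
    # One hidden layer (tanh assumed here).
    h = np.tanh(W_hidden.dot(x) + b_hidden)
    # Multiclass targets go through a softmax head ...
    scores = W_softmax.dot(h)
    softmax_out = np.exp(scores - scores.max())
    softmax_out /= softmax_out.sum()
    # ... and binary label targets through a sigmoid head.
    sigmoid_out = 1.0 / (1.0 + np.exp(-W_sigmoid.dot(h)))
    return softmax_out, sigmoid_out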

Euclidean distances between documents are observed to possess fuzzy-search properties over all the labels. Using t-SNE we can visualize these embeddings for restaurants in Seattle here:

[figures: t-SNE visualizations of the restaurant embeddings]
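
A minimal illustration of that fuzzy-search property, assuming the trained document vectors are stacked as rows of a NumPy matrix (hypothetical names, not the library's API):

import numpy as np

def nearest_documents(doc_vectors, query_index, topn=10):
    # Euclidean distance from the query document to every other document.
    deltas = doc_vectors - doc_vectors[query_index]
    distances = np.sqrt((deltas ** 2).sum(axis=1))
    order = distances.argsort()
    # Skip the query itself (distance 0) and keep the closest neighbours.
    return [(int(i), float(distances[i])) for i in order if i != query_index][:topn]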

Usage

Here we initialize a model that uses the vectors for the words in a window, together with a special object vector corresponding to the document (restaurant), to perform classification. By gradient descent we can then update the word vectors and the object vectors so that the object vectors come to reflect the labels / targets provided to us (in this case the Yelp category, pricing, and rating labels).

We first prepare the dataset (nothing fancy here, just some iterator magic):

import gzip
import pickle

from objectlm import ObjectLM, DatasetGenerator, CategoriesConverter

# Load the pickled texts and their metadata (binary mode for pickle).
with gzip.open("saves/saved_texts.gz", 'rb') as f:
    texts, texts_data = pickle.load(f)

# Collect the set of all Yelp categories appearing in the metadata.
categories = set()
for el in texts_data:
    for c in el["categories"]:
        categories.add(c)

catconvert = CategoriesConverter(categories)
dataset_gen = DatasetGenerator(texts, texts_data, catconvert)

Then we construct the model:

model = ObjectLM(
    vocabulary = lmsenti,  # a previously built word vocabulary, not defined in this snippet
    object_vocabulary_size = len(texts),
    window = 10,
    bilinear_form = False,
    size = 20,           # word vector dimensionality
    object_size = 20,    # document (object) vector dimensionality
    output_sigmoid_classes = catconvert.num_categories,
    output_sigmoid_labels = catconvert.index2category,
    output_classes = [5, 5],  # 5 price classes ("", "$", "$$", "$$$", "$$$$") and 5 rating classes
    output_labels = [["", "$", "$$", "$$$", "$$$$"], ["1", "2", "3", "4", "5"]]
)

min_alpha = 0.001
max_alpha = 0.0035
max_epoch = 9
for epoch in range(max_epoch):
    # Linearly decay the learning rate from max_alpha, clipped at min_alpha.
    alpha = max(min_alpha, max_alpha * (1. - float(epoch) / max_epoch))
    model._alpha = alpha
    objects, err = model.train(dataset_gen, workers = 8, chunksize = 24)
    print("Error = %.3f, alpha = %.3f" % (err, alpha))

This performs gradient descent on all the examples and minimizes the classification error for each object. Running it for about 9 epochs works for a small dataset, and hopefully carries over to the larger case here.

In this particular instance we find that the Euclidean distance between object vectors acts as a fuzzy search over all the attributes. It remains to be evaluated how much of the semantic information about the objects is contained in these vectors. Furthermore, this model is not auto-regressive, so there is no way to generalize to unlabeled data in the future. Nonetheless, for document retrieval purposes it is effective.

It is important to note that there are hundreds of labels to predict but only 20 dimensions in the object vector, so this bottleneck enforces specificity.

A Java implementation that can do prediction, but no training, can be found here.

Saving

Saving the model for Matlab & Java:

The model's parameters can be saved to interact with Java as follows:

model.save_model_to_java("saves/current_model")

Then from Java you can import this model as described here.

Additional Exports

Other files can be saved separately for exporting purposes:

dataset_gen.save("saves/objectlm_window_10_lm_20_objlm_20_4/__dataset__.gz")
dataset_gen.save_ids("saves/objectlm_window_10_lm_20_objlm_20_4/__objects__.gz")
model.save_model_parameters("saves/objectlm_window_10_lm_20_objlm_20_4")
model.save_model_to_java("saves/objectlm_window_10_lm_20_objlm_20_4")
catconvert.save_to_java("saves/objectlm_window_10_lm_20_objlm_20_4/__categories__.gz")
model.save_vocabulary("saves/objectlm_window_10_lm_20_objlm_20_4")

Loading a saved model:

To load a saved model, point it to a directory with the saved matrices:

model.load_saved_weights("saves/objectlm_window_10_lm_20_objlm_20_4/")

Querying the model

First we create normalized matrices:

model.create_normalized_matrices()

Then we can perform searches on them using inner product distance (cosine):

model.most_similar_word("science")

[('request', 0.9545077085494995, 7163),
 ('hopefully', 0.9531156420707703, 38713),
 ('community', 0.9526830911636353, 6000),
 ('infused', 0.9511095285415649, 7513),
 ('yummy', 0.9509859085083008, 34636),
 ('fallen', 0.9509795904159546, 6096),
 ('feeling', 0.9508317708969116, 38029),
 ("'d", 0.9483151435852051, 26839),
 ('reading', 0.9478667974472046, 20754),
 ('work', 0.9475015997886658, 586)]
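
For reference, on the normalized matrices the cosine similarity is just an inner product; a minimal sketch of what such a query amounts to (hypothetical variable names, not the library's internals):

import numpy as np

def most_similar(norm_matrix, query_vector, topn=10):
    # Rows of norm_matrix are unit-length embeddings, so the dot product with
    # a unit-length query vector equals the cosine similarity.
    query = query_vector / np.linalg.norm(query_vector)
    similarities = norm_matrix.dot(query)
    best = similarities.argsort()[::-1][:topn]
    return [(int(i), float(similarities[i])) for i in best]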

That's not brilliant; however, the document vectors have captured "invariant" properties. Use most_similar_object to repeat the operation above among documents.
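
For example (the exact argument most_similar_object takes, an object index here, is an assumption that mirrors most_similar_word):

# Hypothetical call: documents whose vectors are closest to document 0.
model.most_similar_object(0)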

Dependencies

You will probably want xml_cleaner for cleaning up text if you want to easily process weirdly formatted inputs. It is easy to get:

pip3 install xml_cleaner

Creating trees

We can infer new properties of restaurants given a tree from this compression model here.

And on a real dataset we can see results here.
