SSAT Analogies

A walkthrough of solving the SSAT verbal section's analogy questions through vector semantics.



Background

Word2vec is a well-known NLP (natural language processing) model published in 2013 by researchers at Google. The word2vec algorithm uses a neural network to train vector embeddings from a large corpus of text.

Gensim is a Python library that implements the word2vec algorithm with fast training.

We will use Gensim to train our vector embeddings with a word2vec model and a dataset from Wikipedia, "text8". Using vector semantics, we will try to solve real-world analogy problems.


Setting Up

The first step is to install Gensim using pip (its dependencies will be installed automatically):

pip install gensim

Next we will load the text8 corpus to begin our training.

import gensim
import gensim.downloader as api

corpus = api.load('text8')

Problems may arise from these lines of code; check the solutions in the Common Problems section below.

Start training with the text8 corpus:

from gensim.models.word2vec import Word2Vec
model = Word2Vec(corpus)

Go ahead and save the model so that you don't have to retrain every time you run the script.

model.save("trainedModel.bin")

Load it later with

model = Word2Vec.load("trainedModel.bin")

Test it with something like

print(model.wv.most_similar('dog'))

Output:

[('cat', 0.836550772190094), ('hound', 0.7825815081596375), ('pig', 0.7662081718444824), ('ass', 0.7483778595924377), ('cow', 0.7459826469421387), ('goat', 0.7381320595741272), ('dogs', 0.7354974746704102), ('bird', 0.7333177328109741), ('panda', 0.7282117009162903), ('pie', 0.72819584608078)]


Finding Distance Between Words

_"In machine learning, an embedding is a way of representing data as points in n-dimensional space so that similar data points cluster together." _

Since we trained our embeddings on Wikipedia text (the text8 dataset), words with similar semantics or context (words that usually appear around one another) will be clustered together in space.



Testing it on some words shows that strongly related words score a higher similarity than weakly related words. This similarity is calculated through cosine similarity.

cosine_similarity(A, B) = (A · B) / (‖A‖ ‖B‖)

print(
    f"Similarity between france and spain: {model.wv.similarity('france', 'spain')}")
print(
    f"Similarity between basketball and pluto: {model.wv.similarity('basketball', 'pluto')}")

Similarity between france and spain: 0.8095079660415649

Similarity between basketball and pluto: 0.07107406854629517
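
As a sanity check, we can recompute the same value by hand from the raw vectors. A minimal sketch with NumPy (using the model trained above):

import numpy as np

# Recompute cosine similarity from the raw vectors; this should match
# model.wv.similarity('france', 'spain') up to floating-point error.
v1 = model.wv['france']
v2 = model.wv['spain']
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))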


Difference Between Two Vectors

Instead of simply finding the distance between two vectors, we can also find the difference vector between them, which helps outline the relationship between two words:

By computing (vector of Paris - vector of France) we capture the relationship between Paris and France, namely that Paris is the capital of France. If we then add this difference vector to the vector of "England" and find the most similar word to the resulting vector, we should get London, the capital of England, since London has the same relationship to England as Paris does to France.

paris_vector = model.wv["paris"]
france_vector = model.wv["france"]
england_vector = model.wv["england"]

difference_vector = paris_vector - france_vector
print("The difference between paris and france: " + str(difference_vector))

final_vector = england_vector + difference_vector
# topn=2 and index [1]: the most similar word is usually "england" itself,
# so we take the second-closest word instead.
final_word, sim = model.wv.similar_by_vector(final_vector, topn=2)[1]

print("A final similar word after adding the difference between paris and france to england: " + final_word)

A final similar word after adding the difference between paris and france to england: london

**Note that these vector manipulations depend largely on the dataset we trained on. Since text8 is a dataset from 2006, the model is unlikely to return the expected results for words from more modern contexts.**

For example, subtracting the basketball vector from the Durant vector and adding the result to the soccer vector should yield the name of some star soccer player; however, Kevin Durant was not yet in the league in 2006, so the results would likely not be as expected.
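
Gensim can also perform this kind of analogy arithmetic directly through most_similar with positive and negative word lists. A minimal sketch ('durant' may be absent from the text8 vocabulary entirely, which would raise a KeyError):

# Analogy query: durant - basketball + soccer ≈ ?
# Guarded with try/except since 'durant' may not be in the 2006 text8 vocabulary.
try:
    print(model.wv.most_similar(positive=['durant', 'soccer'],
                                negative=['basketball'], topn=5))
except KeyError as e:
    print(f"Word not in vocabulary: {e}")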


Solving the SSAT Analogy Problems

Since we can subtract word vectors to find the difference vector between them, we can apply this technique to analogy problems.

We can subtract word B from word A in each of the options, compare each difference vector to the difference vector of the question pair, and choose the closest one.

Flake is to snow as

(A) storm is to hail (B) drop is to rain (C) field is to wheat (D) stack is to hay (E) cloud is to fog

(The answer is B. A flake is a small unit of snow, just as a drop is a small unit of rain.)

Gensim does not provide a function for cosine similarity between two arbitrary vectors that have no word associated with them, so we will have to write our own function first.

import numpy as np

def cosine_similarity(vector1, vector2):
    # Calculate the cosine similarity between the two vectors using NumPy
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

Then we compute the difference vector between words A and B in the question and in each of the answer options for comparison.

model = Word2Vec.load("trainedModel.bin")

question = model.wv["flake"] - model.wv["snow"]
a = model.wv["storm"] - model.wv["hail"]
b = model.wv["drop"] - model.wv["rain"]
c = model.wv["field"] - model.wv["wheat"]
d = model.wv["stack"] - model.wv["hay"]
e = model.wv["cloud"] - model.wv["fog"]

print("Cosine Similarities: ")
print("Option A " + str(cosine_similarity(a, question)))
print("Option B " + str(cosine_similarity(b, question)))
print("Option C " + str(cosine_similarity(c, question)))
print("Option D " + str(cosine_similarity(d, question)))
print("Option E " + str(cosine_similarity(e, question)))

Printing out the results shows that option B has the highest similarity to our question's difference vector, giving us the correct answer.

Cosine Similarities:

Option A -0.5162843

Option B 0.5129441

Option C 0.111720726

Option D 0.1051023

Option E -0.29211172
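
The same procedure can be wrapped in a small helper so other questions can be checked in one call. A sketch (solve_analogy is a hypothetical helper, not part of the original script):

def solve_analogy(model, question_pair, options):
    # Compare each option's difference vector to the question pair's
    # difference vector and return the label of the closest option.
    q = model.wv[question_pair[0]] - model.wv[question_pair[1]]
    scores = {label: cosine_similarity(model.wv[a] - model.wv[b], q)
              for label, (a, b) in options.items()}
    return max(scores, key=scores.get)

print(solve_analogy(model, ("flake", "snow"), {
    "A": ("storm", "hail"), "B": ("drop", "rain"), "C": ("field", "wheat"),
    "D": ("stack", "hay"), "E": ("cloud", "fog"),
}))  # expected output: B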


Visualizing Solutions

Using scikit-learn's principal component analysis (PCA), we can flatten the word vectors to 2D space, which can then be visualized using Matplotlib.

import numpy as np
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Replace with the path to your trained model file
model = Word2Vec.load("trainedModel.bin")

question = model.wv["flake"] - model.wv["snow"]
a = model.wv["storm"] - model.wv["hail"]
b = model.wv["drop"] - model.wv["rain"]
c = model.wv["field"] - model.wv["wheat"]
d = model.wv["stack"] - model.wv["hay"]
e = model.wv["cloud"] - model.wv["fog"]
vectors = [question, a, b, c, d, e]

# Apply PCA to reduce the word vectors to 2 dimensions
pca = PCA(n_components=2)
vectors_2d = pca.fit_transform(vectors)

# Plot the 2D word vectors using Matplotlib
plt.figure(figsize=(10, 8))
plt.scatter(vectors_2d[:, 0], vectors_2d[:, 1], marker='o', s=30)

# Annotate the points with the labels 'question', 'a' to 'e'
labels = ['question', 'a', 'b', 'c', 'd', 'e']
for i, label in enumerate(labels):
    if label == 'question':
        plt.annotate(label, xy=(
            vectors_2d[i, 0], vectors_2d[i, 1]), fontsize=12, color='red')
    else:
        plt.annotate(label, xy=(
            vectors_2d[i, 0], vectors_2d[i, 1]), fontsize=12)

plt.title('Embeddings Visualization')
plt.grid(True)
plt.show()

It can be observed that vector b lies much closer to the question's difference vector, showing that the relationship between its word A and word B matches the question better than the rest of the options.

Another example: Perimeter is to square as

(A) chord is to cylinder

(B) side is to polygon

(C) degree is to angle

(D) height is to pyramid

(E) circumference is to circle

Cosine Similarities:

Option A 0.04448876

Option B -0.16866955

Option C 0.23843354

Option D -0.26991528

Option E 0.38653418
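
Option E scores highest, and it is indeed the correct answer: the circumference bounds a circle just as the perimeter bounds a square. The same question can be run through the hypothetical solve_analogy helper sketched earlier:

print(solve_analogy(model, ("perimeter", "square"), {
    "A": ("chord", "cylinder"), "B": ("side", "polygon"),
    "C": ("degree", "angle"), "D": ("height", "pyramid"),
    "E": ("circumference", "circle"),
}))  # expected output: E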


Testing Solutions

Manually testing a few more problems shows that the embeddings have not reached the desired results. They fail on many of the harder problems, which involve more intricate relationships and require more human reasoning. Text8 is only a 100MB dataset; larger datasets are usually needed to create a more robust model. We can also increase the dimensionality of our embeddings beyond 50 dimensions to better represent contextual information, as sketched below.
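
The dimensionality is fixed when the model is created. A minimal sketch (vector_size=100 is an illustrative value, not a setting from the original script):

# Retrain on text8 with higher-dimensional embeddings; vector_size
# controls the number of dimensions per word vector.
model = Word2Vec(corpus, vector_size=100)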


Using the BookCorpus dataset, which is over 1GB, we can generate much better vector embeddings suited to our purpose.

pip install datasets

from datasets import load_dataset
book_corpus = load_dataset("bookcorpus")
train_data = book_corpus["train"]

We then preprocess the data for Gensim's word2vec model and feed it in to generate vector embeddings of 100 dimensions.

from gensim.utils import simple_preprocess

sentences = [simple_preprocess(sentence["text"]) for sentence in train_data]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

Note that downloading and generating training splits will take some time.
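
Building the full sentences list in memory can also be expensive for a corpus this size. One alternative (a sketch, not from the original repo) is a restartable iterable, since Word2Vec makes multiple passes over the data and a plain generator would be exhausted after the first pass:

from gensim.utils import simple_preprocess

class BookCorpusSentences:
    """A restartable iterable over preprocessed BookCorpus sentences."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        # Re-creating the iterator lets Word2Vec scan the corpus repeatedly.
        for row in self.dataset:
            yield simple_preprocess(row["text"])

model = Word2Vec(BookCorpusSentences(train_data),
                 vector_size=100, window=5, min_count=5, workers=4)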


Using pretrained vectors from GloVe

To save time and computing resources, we can use pretrained vectors from GloVe, an unsupervised learning algorithm developed by Stanford researchers. Note that this is different from running word2vec ourselves: we are using vectors trained by a different model.

Solver with GloVe
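
Gensim's downloader can fetch pretrained GloVe vectors directly. A minimal sketch ("glove-wiki-gigaword-100" is one of several pretrained models in the gensim-data catalog):

import gensim.downloader as api

# Load pretrained 100-dimensional GloVe vectors; returns a KeyedVectors object.
glove = api.load("glove-wiki-gigaword-100")

# KeyedVectors supports the same queries we used on model.wv above.
print(glove.most_similar('dog', topn=5))
question = glove["flake"] - glove["snow"]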


Common Problems

These problems were common when first setting up Gensim and executing the code above:

ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)

FileNotFoundError: [Errno 2] No such file or directory: ---

ValueError: unable to read local cache during fallback, connect to the Internet and retry

All that needs to be done is to create the missing file under your /Users/(your username)/Gensim-data directory and fill it with the lines from list.json found on the GitHub repo.

vim /Users/<username>/Gensim-data/information.json

After creating and filling in information.json, a certificate verification error commonly arises:

ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)

Solve it with

bash /Applications/Python*/Install\ Certificates.command
