
Generating Word Embeddings - Lab

Introduction

In this lab, you'll learn how to generate word embeddings by training a Word2Vec model, and then building embedding layers into deep neural networks for NLP!

Objectives

You will be able to:

  • Train a Word2Vec model and transform words into vectors
  • Obtain most similar words by using methods associated with word vectors

Getting Started

In this lab, you'll start by creating your own word embeddings by making use of the Word2Vec model. Then, you'll move on to building neural networks that make use of Embedding Layers to accomplish the same end goal, but directly in your model.

As you've seen, the easiest way to make use of Word2Vec is to import it from the Gensim library. This model contains a full implementation of Word2Vec, which you can use to begin training immediately. For this lab, you'll be working with the News Category Dataset from Kaggle. This dataset contains headlines and article descriptions from the news, as well as the category each article belongs to.

Run the cell below to import everything you'll need for this lab.

import pandas as pd
import numpy as np
np.random.seed(0)
from gensim.models import Word2Vec
import nltk
from nltk import word_tokenize
nltk.download('punkt')  # word_tokenize needs the 'punkt' tokenizer data

Now, import the data. The data is stored in the file 'News_Category_Dataset_v2.json'. This file is compressed, so that it can be more easily stored in a GitHub repo. Make sure to unzip the file before continuing!

In the cell below, use the read_json() function from Pandas to read the dataset into a DataFrame. Be sure to include the parameter lines=True when reading in the dataset!

Once you've imported the data, inspect the first few rows of the DataFrame to see what your data looks like.

df = None
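
A minimal sketch of what that cell might look like, assuming the unzipped 'News_Category_Dataset_v2.json' file sits in the same directory as this notebook:

df = pd.read_json('News_Category_Dataset_v2.json', lines=True)
df.head()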

Preparing the Data

Since you're working with text data, you need to do some basic preprocessing, including tokenization. Notice from the data sample that two different columns contain text data -- headline and short_description. The more text data your Word2Vec model has to train on, the better it will perform. Therefore, you'll want to combine the two columns before tokenizing each entry and training your Word2Vec model.

In the cell below:

  • Create a column called 'combined_text' that consists of the data from the 'headline' column plus a space character (' ') plus the data from the 'short_description' column
  • Use the 'combined_text' column's .map() method and pass in word_tokenize. Store the result returned in data
df['combined_text'] = None
data = None
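
One way to fill in those placeholders (the column names 'headline' and 'short_description' come straight from the dataset):

df['combined_text'] = df['headline'] + ' ' + df['short_description']
data = df['combined_text'].map(word_tokenize)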

Inspect the first 5 items in data to see how everything looks.

data[:5]

Notice that although the words are tokenized, they are still in the same order they were in as headlines. This is important, because the words need to be in their original order for Word2Vec to establish their meaning. Remember that for a Word2Vec model you can specify a window size that tells the model how many words to take into consideration at one time.

If your window size was 5, then the model would start by looking at the words "Will Smith joins Diplo and", and then slide the window by one, so that it's looking at "Smith joins Diplo and Nicky", and so on, until it had completely processed the text example at index 1 above. By doing this for every piece of text in the entire dataset, the Word2Vec model learns excellent vector representations for each word in an Embedding Space, where the relationships between vectors capture semantic meaning (recall the vector that captures gender in the previous "king - man + woman = queen" example you saw).

Now that you've prepared the data, train your model and explore a bit!

Training the Model

Start by instantiating a Word2Vec Model from gensim.

In the cell below:

  • Create a Word2Vec model and pass in the following arguments:
    • The dataset we'll be training on, data
    • The size of the word vectors to create, size=100
    • The window size, window=5
    • The minimum number of times a word needs to appear in order to be counted in the model, min_count=1
    • The number of threads to use during training, workers=4
model = None
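
A sketch of the instantiation using the keyword names listed above (these match the gensim 3.x API; in gensim 4.0 and later, size was renamed to vector_size):

model = Word2Vec(data, size=100, window=5, min_count=1, workers=4)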

Now that you've instantiated a Word2Vec model, train it on your text data.

In the cell below:

  • Call the .train() method on your model and pass in the following parameters (a sketch follows this list):
    • The dataset we'll be training on, data
    • The total number of examples (sentences) in the dataset, total_examples, which you can find in model.corpus_count
    • The number of epochs you want to train for, which we'll set to 10
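
A minimal sketch of that call, assuming the model was instantiated as above:

model.train(data, total_examples=model.corpus_count, epochs=10)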

Great! You now have a fully trained model! The word vectors themselves live in a Word2VecKeyedVectors instance, which is stored in the model's .wv attribute. To simplify things, store this object in the variable wv to save yourself some keystrokes down the line.

wv = None
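
Filling in that placeholder is a one-liner:

wv = model.wv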

Examining Your Word Vectors

Now that you have a trained Word2Vec model, go ahead and explore the relationships between some of the words in the corpus!

One cool thing you can use Word2Vec for is to get the most similar words to a given word. You can do this by passing in the word to wv.most_similar().

In the cell below, try getting the most similar word to 'Texas'.
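
For example (the exact neighbors returned will vary from one training run to the next):

wv.most_similar('Texas')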

Interesting! All of the most similar words are also states.

You can also get the least similar vectors to a given word by passing in the word to the .most_similar() method's negative parameter.

In the cell below, get the least similar words to 'Texas'.
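
For example, passing the word in a list to the negative parameter:

wv.most_similar(negative=['Texas'])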

This seems like random noise. It is a result of the way Word2Vec is computing the similarity between word vectors in the embedding space. Although the word vectors closest to a given word vector are almost certainly going to have similar meaning or connotation with your given word, the word vectors that the model considers 'least similar' are just the word vectors that are farthest away, or have the lowest cosine similarity. It's important to understand that while the closest vectors in the embedding space will almost certainly share some level of semantic meaning with a given word, there is no guarantee that this relationship will hold at large distances.

You can also get the vector for a given word by passing in the word as if you were passing in a key to a dictionary.

In the cell below, get the word vector for 'Texas'.
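
Dictionary-style indexing looks like this:

wv['Texas']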

Now get all of the word vectors from the object at once. You can find these inside of wv.vectors. Try it out in the cell below.
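
The full matrix of word vectors, one row per vocabulary word, is simply:

wv.vectors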

As a final exercise, try to recreate the 'king' - 'man' + 'woman' = 'queen' example previously mentioned. You can do this by using the .most_similar() method and translating the word analogy into an addition/subtraction formulation: pass the words you want to add ('king' and 'woman') to the positive parameter, and the word you want to subtract ('man') to the negative parameter.

Do this now in the cell below.
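
One way to express the analogy (note that casing matters here: the tokens were never lowercased, so 'king' and 'King' are separate vocabulary entries):

wv.most_similar(positive=['king', 'woman'], negative=['man'])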

As you can see from the output above, your model isn't perfect: 'Queen' and 'Princess' are still in the top 5, but so are unrelated words such as 'reminiscent'. This is likely because you didn't have enough training data. That said, given the small amount of training data provided, the model still performs remarkably well!

In the next lab, you'll reinvestigate transfer learning, loading in the weights from an open-sourced model that has already been trained for a very long time on a massive amount of data. Specifically, you'll work with the GloVe model from the Stanford NLP Group. There's not really any benefit to training the model yourself, unless your text uses specialized vocabulary that isn't likely to be well represented inside an open-source model.

Summary

In this lab, you learned how to train and use a Word2Vec model to create vectorized word embeddings!
