
AI Masterclass lab: NLP

Introduction

This lab's goal is to familiarize you with natural language processing by building a sentiment analysis model using PyTorch.

Prerequisites (you should follow 42-AI's instructions here)

  • Python 3.5+
  • PyTorch 0.2+
  • Download and unzip a pre-trained embedding file, preferably the English Wikipedia fastText embeddings wiki-news-300d-1M.vec.zip. Note: this file will be made available on a server internal to 42.

Task definition

The dataset is an extract from the Stanford Sentiment Treebank. It consists of fragments of English sentences about movies, each paired with a sentiment label: 0 for fragments expressing a negative opinion about the movie, 1 for positive ones. E.g., a gorgeous , witty , seductive movie is labeled 1, while forced , familiar and thoroughly condescending is labeled 0.

Instructions

  1. Fork this repo and add your fork's URL to this document
  2. Put the embedding file described in the prerequisites into a folder called data at your repo root.
  3. Create a Python script that loads the embeddings using embedding.py and runs the following tasks:
    1. Nearest-neighbor search: print the 10 nearest neighbors of the word 'geek'.
    2. Analogy: retrieve the embeddings closest to a combination of embeddings that encodes an analogy, e.g. 'Tokyo' + 'Spain' - 'Japan' = ?. Caveat: you will need to explicitly exclude the embeddings of the query words themselves (in this case, Tokyo, Spain and Japan) from the tokens you retrieve.
    3. Find the best analogy you can and add it to the analogy tab of the Google Sheet. Be creative!
  4. Run the baseline model using train.py. Once the data is loaded, training starts and the loss and accuracy on the train and dev sets are reported. Remember: the goal is to reach the highest possible test accuracy. At each epoch, the script writes a model.pth file to your repo; push it along with the rest of your code so that your work can be evaluated.
  5. Try to improve the model. Several options are available:
    1. The baseline uses only the first word of the sentence to perform classification. Edit model.py so that it averages all embeddings in the sentence instead.
    2. (requires 1. above) Use an importance-weighting scheme for each word in the sentence, e.g. scale each embedding in inverse proportion to the frequency of its token in the documents. This reduces the weight of very common words such as "the" or "and". After that, you could move on to more advanced weighting schemes, e.g. TF-IDF.
    3. Change the optimizer parameters: try different learning rates or different optimizers.
    4. Change the model altogether: you could go for a recurrent neural network such as an LSTM.
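The nearest-neighbor and analogy tasks of step 3 can be sketched as follows. This sketch assumes the raw .vec text format (a "count dim" header line, then one token and its vector per line); the repo's embedding.py may expose a different interface, so treat this as illustrative only.

```python
# Minimal sketch of nearest-neighbor search and analogy retrieval over
# fastText .vec embeddings. Cosine similarity is computed as a dot
# product of unit-normalized vectors.
import numpy as np

def load_vec(path, limit=50000):
    """Load the first `limit` embeddings from a fastText .vec file."""
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "count dim" header line
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.asarray(parts[1:], dtype=np.float32))
    mat = np.stack(vecs)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # unit rows: dot = cosine
    return words, mat

def nearest(words, mat, query_vec, exclude=(), k=10):
    """Return the k tokens closest (cosine) to query_vec, skipping `exclude`."""
    sims = mat @ (query_vec / np.linalg.norm(query_vec))
    out = []
    for idx in np.argsort(-sims):
        if words[idx] in exclude:
            continue  # drop the query words themselves, per the caveat
        out.append(words[idx])
        if len(out) == k:
            break
    return out

# Hypothetical usage (path and tokens from the instructions above):
# words, mat = load_vec("data/wiki-news-300d-1M.vec")
# w2i = {w: i for i, w in enumerate(words)}
# print(nearest(words, mat, mat[w2i["geek"]], exclude={"geek"}))
# analogy = mat[w2i["Tokyo"]] + mat[w2i["Spain"]] - mat[w2i["Japan"]]
# print(nearest(words, mat, analogy, exclude={"Tokyo", "Spain", "Japan"}))
```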
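Improvements 5.1 and 5.2 can be sketched as a small PyTorch module. This is not the repo's actual model.py (it is written against a modern PyTorch API, and the two-class linear head and the word_weights vector are illustrative assumptions); it only shows the shape of the averaging and weighting logic.

```python
# Sketch: classify a sentence by averaging all of its word embeddings,
# optionally scaling each embedding by a per-word importance weight
# (e.g. inverse document frequency).
import torch
import torch.nn as nn

class AvgEmbedClassifier(nn.Module):
    def __init__(self, embeddings, word_weights=None):
        super().__init__()
        # embeddings: (vocab, dim) float tensor of pre-trained vectors
        self.embed = nn.Embedding.from_pretrained(embeddings, freeze=True)
        # word_weights: (vocab,) tensor, e.g. 1 / document frequency
        self.word_weights = word_weights
        self.fc = nn.Linear(embeddings.size(1), 2)  # 2 sentiment classes

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) long tensor
        vecs = self.embed(token_ids)                        # (batch, seq, dim)
        if self.word_weights is not None:
            w = self.word_weights[token_ids].unsqueeze(-1)  # (batch, seq, 1)
            vecs = vecs * w                                 # down-weight common words
        avg = vecs.mean(dim=1)  # average over the sentence, not just the first word
        return self.fc(avg)

# Hypothetical usage:
# emb = torch.randn(100, 16)
# model = AvgEmbedClassifier(emb, word_weights=torch.ones(100))
# logits = model(torch.randint(0, 100, (4, 7)))  # shape (4, 2)
```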

Good luck!
