
Natural-Language-Processing

Project: Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks

Seq2seq models are trained by maximizing the likelihood of the next token given both the previous token (from the previous LSTM step) and the ground-truth summary (the current state); at inference (testing), however, the model can only depend on its own previous token, and no ground truth is provided. In other words, the model is trained to depend on outside supervision, yet at test time it is forced to depend only on itself, something it was never raised to do. This causes a major problem, the discrepancy between training and inference (testing), known as exposure bias. There have been various approaches to resolving this problem. One of them is scheduled sampling, a form of curriculum learning that we use to help our seq2seq models: while training, it makes the model begin learning to depend on itself by exposing the model to its own mistakes so that it can learn from them during training. The models were implemented with TensorFlow or PyTorch in Google Colab, so there was no need to run the code on my laptop. The method builds on the concepts introduced by Bengio, Vinyals, Jaitly, and Shazeer from Google in their paper "Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks". I performed experiments using scheduled sampling to improve the parameter estimation process, and my research focuses on validating the strong performance of scheduled sampling on sequence prediction tasks.

1 The problem statement

In the last few years, we have seen many practical applications of models like recurrent neural networks. For instance, machine translation goes from a sequence of words in one language to a sequence in another language with the same meaning, and image captioning goes from an image to a sequence of words describing it.

There are also more technical problems, such as constituency parsing, which goes from a sequence of words (a sentence) to its parse tree, identifying which spans are noun phrases, verb phrases, and so on. The same techniques can be used for speech recognition and even music generation. Many researchers work on predicting the elements of a sequence from the preceding elements, in areas such as speech recognition, machine translation, conversation modeling, image and video captioning, constituency parsing, and robotics. Recurrent neural networks can process sequences as input, output, or both. Depending on the number of inputs and outputs, there are vanilla RNNs (one input to one output), one-to-many RNNs (image captioning), many-to-one RNNs (sentiment analysis), and many-to-many RNNs (video classification). So there are many applications of these recurrent nets that generate sequences of tokens, and we have to be careful about how we use and train them: the model is never trained to depend on itself. Seq2seq models are trained to rely on:

  • The output of the previous state
  • The input summary

The problem occurs at the inference (testing) step, where the input summary is not provided to the model; it depends only on the previous LSTM decoder step. This causes a discrepancy between how the model is trained and how it is used at inference: at inference, the model must feed its own predicted tokens back in. This problem is called exposure bias. The biggest issue is that mistakes made early in the generation process are fed back as input to the model and can accumulate quickly. If a wrong decision is taken at time t-1, the model can end up in a part of the state space that is very different from those visited under the training distribution, where it does not know what to do; worse, this can easily lead to cumulative bad decisions. In the inference phase, the input of the model is different from the training phase: if a bad result is output at time t-1, the model does not know the real output and can only predict the output at time t based on that bad result, so the outputs are likely to get worse and worse in that region of the state space. A model that has just started training will only produce garbage, and if the input is garbage, the output will be garbage, so you have to help the model. In the paper, the authors propose a curriculum learning approach to solve this problem and reduce the gap between training and inference for sequence prediction tasks with recurrent neural networks: while training, make the model begin learning to depend on itself by exposing it to its own mistakes so that it learns to recover from them (i.e., it learns from its mistakes during the training phase). This is called scheduled sampling, a form of curriculum learning that we use to help our sequence prediction models. I present my proposed approach to better train sequence prediction models with recurrent neural networks, together with experimental results.
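To make the mismatch concrete, here is a minimal PyTorch sketch of a single-layer GRU decoder (all names and dimensions are illustrative, not the code of this repository): during training the previous ground-truth token is fed back, while at inference only the model's own previous prediction is available.

```python
# Minimal sketch of the training/inference mismatch, assuming a tiny GRU decoder.
# All names and dimensions are illustrative, not this repository's actual code.
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 100, 32, 64
embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.GRUCell(emb_dim, hid_dim)
out = nn.Linear(hid_dim, vocab_size)
criterion = nn.CrossEntropyLoss()

def decoder_step(prev_token, state):
    state = cell(embed(prev_token), state)
    return out(state), state

target = torch.randint(0, vocab_size, (1, 6))   # toy ground-truth sequence (batch=1)
state = torch.zeros(1, hid_dim)

# Training (teacher forcing): the previous *ground-truth* token is fed back.
loss = 0.0
for t in range(1, target.size(1)):
    logits, state = decoder_step(target[:, t - 1], state)
    loss = loss + criterion(logits, target[:, t])

# Inference: only the model's *own* previous prediction is available,
# so a single early mistake is fed back in and can compound.
prev = torch.zeros(1, dtype=torch.long)         # assume index 0 is the start token
state = torch.zeros(1, hid_dim)
for t in range(10):
    logits, state = decoder_step(prev, state)
    prev = logits.argmax(dim=-1)
```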

2 High-level idea of the solution (network architecture & algorithm)

2.1 Recurrent Neural Network

The general idea of the RNN here is to model the probability of an output Y (a generated sentence) given an input X (static or dynamic, an image or a sentence). A few groups, in particular one at Google, proposed the sequence-to-sequence framework, in which you start from a sequence of discrete tokens, say x1, x2, x3, x4, and generate another sequence, say y1, y2, not necessarily of the same length or over the same vocabulary. There is a nice mathematical framework for this: the conditional probability of all the y's given all the x's can be factored into the product of the probability of generating each y given the previous y's and all the x's. RNNs are generally trained to maximize the likelihood of generating the target sequence of tokens given the input. In practice, this is done by maximizing the likelihood of each target token given the current state of the model (which summarizes the input and the past output tokens) and the previous target token, which helps the model learn a kind of language model over the target tokens.
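Written out, this factorization (for an input sequence $x_1,\dots,x_S$ and a target sequence $y_1,\dots,y_T$) is:

$$P(y_1,\dots,y_T \mid x_1,\dots,x_S) \;=\; \prod_{t=1}^{T} P(y_t \mid y_1,\dots,y_{t-1},\, x_1,\dots,x_S)$$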

2.2 Scheduled sampling

At the inference (testing) stage, the model depends only on its own previous step, which means that it depends entirely on itself. The problem arises when the model produces a bad output at time t-1: it can be led into a state space entirely different from the one it was trained on, which simply results in cumulative bad output decisions. The gap between training and inference when predicting token y_t is whether we use the true previous token y_{t-1} or an estimate ŷ_{t-1} coming from the model itself. The proposal is a sampling mechanism that decides randomly, during training, whether to use y_{t-1} or ŷ_{t-1}. That is, the solution suggested by the Google research team is to gradually shift the model's reliance from being totally dependent on the ground truth to relying on itself. Making the learning task harder over time is a form of curriculum learning, also called a curriculum strategy: start small, learn the easier aspects of the task or easier subtasks first, and then gradually increase the difficulty. In the end, this makes the model depend only on itself. They call it scheduled sampling. They build a simple sampling mechanism that, while training, randomly chooses between the following (a minimal sketch of this decision and the decay of εi follows the list):

  • the ground-truth token, with probability εi (where i is the training batch index)
  • the model's own prediction, with probability (1 − εi)
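The probability εi starts high (mostly ground truth) and is decayed as training progresses, so the model gradually shifts toward its own predictions. Below is a minimal Python sketch of the sampling decision together with the kinds of decay schedules discussed in the paper (linear, exponential, and inverse sigmoid); the constants are illustrative, not the values used in this project.

```python
# Sketch: decay schedules for the teacher-forcing probability eps_i and the
# per-step sampling decision. Constants (k, c, eps_min) are illustrative only.
import math
import random

def linear_decay(i, k=1.0, c=1e-4, eps_min=0.1):
    # eps_i = max(eps_min, k - c*i): decreases linearly with the batch index i
    return max(eps_min, k - c * i)

def exponential_decay(i, k=0.9999):
    # eps_i = k^i with k < 1
    return k ** i

def inverse_sigmoid_decay(i, k=500.0):
    # eps_i = k / (k + exp(i / k)): stays near 1 early, then drops smoothly
    return k / (k + math.exp(i / k))

def choose_previous_token(ground_truth_token, predicted_token, eps_i):
    # With probability eps_i feed the true previous token, otherwise the model's own.
    return ground_truth_token if random.random() < eps_i else predicted_token
```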

3 Experiment

In the paper, the authors describe experiments on three different tasks (image captioning, constituency parsing, and speech recognition) in order to verify that scheduled sampling can improve the original models in different settings. Here, I focus specifically on experimenting with a scheduled-sampling seq2seq model and comparing it against the original seq2seq models. I evaluate the performance of my models through the Frame Error Rate (FER) or by comparing the loss of each model in my experiments.
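As a rough stand-in for a frame/token-level error rate, one can compare greedy predictions against the reference tokens position by position; the sketch below is only illustrative and is not the repository's exact metric.

```python
# Sketch of a simple token-level error rate (in the spirit of a frame error rate):
# the fraction of positions where the greedy prediction disagrees with the reference.
def token_error_rate(predictions, references):
    """predictions, references: parallel lists of token-id sequences."""
    errors, total = 0, 0
    for pred, ref in zip(predictions, references):
        total += len(ref)
        errors += sum(p != r for p, r in zip(pred, ref))
        errors += abs(len(pred) - len(ref))   # count missing/extra tokens as errors
    return errors / max(total, 1)

# Example: 2 wrong tokens out of 8 reference tokens -> 0.25
print(token_error_rate([[1, 2, 3, 4], [5, 6, 7, 9]], [[1, 2, 3, 4], [5, 6, 8, 8]]))
```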

- Question Answering

I built a chatbot conditioned on gender, using sequence prediction models and scheduled-sampling sequence prediction models in PyTorch. When we ask a question such as "Hello", the chatbot answers "Hello. Nice to see you." I used curriculum learning to address the major problem that sequence prediction models suffer from. I do not want to simply generate random sentences; I want to generate sentences based on conditions. Sequence prediction models are trained by maximizing the likelihood of the next token given BOTH the previous token (from the previous LSTM step) and the ground-truth summary, while at inference (testing) they can only depend on the previous token, since no ground-truth summary is provided. Longer context needs to be considered during chatting, so I used an LSTM. For the dataset, I used the Cornell Movie-Dialogs Corpus at https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html. It contains 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies. The movie scripts are split into a female dataset (33,681) and a male dataset (78,382). The model focuses on learning to output the next token given the current state of the model AND the previous token, and a special token marks the end of each sequence. Four different models were built: seq2seq for the female dataset, seq2seq for the male dataset, scheduled sampling for the female dataset, and scheduled sampling for the male dataset. The resulting chatbot can learn the different speaking styles of male and female characters from the movie scripts, and I compared the original models (without scheduled sampling) against the revised models (with scheduled sampling) to validate the performance of the scheduled sampling method. Scheduled sampling improves the baselines on many tasks and yields better sequence prediction models, and the trained model answered the questions accurately. In the beginning, the scheduled-sampling models did not perform better, because the original models were already performing well, so I carried out hyper-parameter tuning to train them better and obtained the best outcome in the final models.
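For reference, here is a hedged PyTorch sketch of what one scheduled-sampling training pass over a target reply could look like; the module names (`decoder`, `embedding`, `projection`), the dimensions, and `teacher_forcing_prob` (playing the role of εi) are illustrative assumptions, not the actual code of this repository.

```python
# Sketch of a scheduled-sampling decoder pass, assuming an LSTM chatbot decoder.
# Names and dimensions are illustrative, not this repository's actual code.
import random
import torch
import torch.nn as nn

def decode_with_scheduled_sampling(decoder, embedding, projection, criterion,
                                   target, hidden, teacher_forcing_prob):
    """One training pass over a target reply of shape (batch, seq_len)."""
    batch_size, seq_len = target.shape
    loss = 0.0
    prev = target[:, 0]                       # start-of-sequence tokens
    for t in range(1, seq_len):
        output, hidden = decoder(embedding(prev).unsqueeze(1), hidden)
        logits = projection(output.squeeze(1))
        loss = loss + criterion(logits, target[:, t])
        if random.random() < teacher_forcing_prob:
            prev = target[:, t]               # ground truth (teacher forcing)
        else:
            prev = logits.argmax(dim=-1)      # the model's own prediction
    return loss / (seq_len - 1)

# Illustrative usage with toy dimensions:
vocab, emb, hid, batch = 1000, 64, 128, 4
embedding = nn.Embedding(vocab, emb)
decoder = nn.LSTM(emb, hid, batch_first=True)
projection = nn.Linear(hid, vocab)
criterion = nn.CrossEntropyLoss()
target = torch.randint(0, vocab, (batch, 12))
hidden = (torch.zeros(1, batch, hid), torch.zeros(1, batch, hid))
loss = decode_with_scheduled_sampling(decoder, embedding, projection, criterion,
                                      target, hidden, teacher_forcing_prob=0.75)
loss.backward()
```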

