Machine Learning NLP project
#50.007 Machine Learning Design Project
Darren Ng 1000568, Glen Choo 1000472, Vu Xuan Kim Cuong 1000646
[TOC]

##Part II
This part covers the implementation of the emission parameters, the fix for new words that do not appear in the training set, the POS tagger, and the accuracy score.
###Code List
The following scripts are used to estimate the emission parameters:
####training.py
training.py processes the training data and writes out all of the counts needed to compute the emission parameters. It outputs the computed counts into 3 main files:
- emission_train_count.txt: the count of all tags
- emission_training.txt: the count of all words in each tag
- emission_trainingReadable.txt: emission_training.txt in readable format
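The counting step can be sketched as follows (a minimal illustration; the `word tag` line format and the function name are assumptions, not the project's actual code):

```python
from collections import defaultdict

def count_emissions(lines):
    """Count tag occurrences and (word, tag) pairs from 'word tag' lines."""
    tag_count = defaultdict(int)       # analogous to emission_train_count.txt
    word_tag_count = defaultdict(int)  # analogous to emission_training.txt
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines separate sentences
        word, tag = line.rsplit(" ", 1)
        tag_count[tag] += 1
        word_tag_count[(word, tag)] += 1
    return tag_count, word_tag_count

tags, pairs = count_emissions(["the DT", "dog NN", "", "the DT"])
```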
####testing.py
testing.py processes the testing data using the output files generated from training.py. To improve efficiency, the mode of the function testing_splitter is set to unique, so that the emission parameter probabilities and tags are computed only once for each unique word found in the testing dataset.
testing.py outputs the following files:
- emission_testing: file containing the probabilities of each word with their respective tag.
- emission_testing_tags: file containing the tags of each word based on the MLE estimation.
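A minimal sketch of the emission estimate and the per-word MLE tagger, assuming a simple add-k fix for unseen words (the project's exact smoothing may differ):

```python
def emission_prob(word, tag, tag_count, word_tag_count, k=1):
    """MLE emission estimate with a simple add-k fix for unseen words:
    e(x|y) = count(y -> x) / (count(y) + k) if (x, y) was seen in training,
    and k / (count(y) + k) otherwise (one common choice)."""
    denom = tag_count[tag] + k
    if (word, tag) in word_tag_count:
        return word_tag_count[(word, tag)] / denom
    return k / denom

def best_tag(word, tag_count, word_tag_count):
    """Naive tagger: pick the tag that maximises the emission probability."""
    return max(tag_count,
               key=lambda t: emission_prob(word, t, tag_count, word_tag_count))

# toy counts, standing in for the output of training.py
tag_count = {"DT": 2, "NN": 1}
word_tag_count = {("the", "DT"): 2, ("dog", "NN"): 1}
```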
####accuracy.py
accuracy.py computes an accuracy score that measures how closely the predicted file matches the annotated testing data, and prints a value between 0 and 1 to the console.
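The accuracy computation can be sketched as tag-level agreement between the predicted and annotated sequences (function name is an assumption):

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag."""
    assert len(predicted) == len(gold), "tag sequences must align"
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```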
####e_pos_tagger.py
e_pos_tagger.py generates the dev.p2.out file based on the output files generated by testing.py.
###Instructions
To compute the emission parameter estimation of any file, the following steps must be done sequentially.

1. Input the training file and run training.py. It will generate 4 files.
2. Input the testing file dev.in into testing.py.
3. Ensure that the correct emission_count and emission_train_count files generated from training.py are assigned to the emission_count and state_count variables.
4. Run testing.py. It will generate 2 files.
5. Input the files generated by testing.py into e_pos_tagger.py to generate the tag files.
6. Lastly, run the accuracy.py python file with the predicted file generated in step 5 to compare with the annotated testing dataset. A number between 0 and 1 will be printed on the console.
###Accuracy Findings
The code has been updated to accept new words that are not seen in the training data.
The accuracy findings for both the POS and NPC files are as follows:
These accuracy findings are based on the naive POS tagger, where the state of each word is determined only by the emission parameters implemented in Part 2.
##Part III
This part covers the implementation of the transition parameters and of the Viterbi algorithm, which combines the emission parameters estimated in Part II with the transition parameters.
###Code List
####transition_MLE.py
transition_MLE.py computes the estimation of the transition parameters. It outputs the following 2 files:
- transition.txt: file containing all the transition parameters.
- transitionReadable.txt: file containing all the transition parameters in a readable format.
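The transition MLE can be sketched as follows (the START/STOP padding and the function name are assumptions):

```python
from collections import defaultdict

def transition_params(sentences):
    """MLE transition estimates q(v|u) = count(u, v) / count(u),
    with START/STOP padding around each sentence's tag sequence."""
    pair_count = defaultdict(int)
    state_count = defaultdict(int)
    for tags in sentences:
        padded = ["START"] + tags + ["STOP"]
        for u, v in zip(padded, padded[1:]):
            pair_count[(u, v)] += 1
            state_count[u] += 1
    return {(u, v): c / state_count[u] for (u, v), c in pair_count.items()}

q = transition_params([["DT", "NN"], ["DT", "VB"]])
```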
####viterbi.py
viterbi.py runs the Viterbi algorithm using the transition and emission parameters computed earlier, and writes the output to the file p3_viterbi_train.txt.
###Instructions
To compute the transition parameter estimation of any file or the Viterbi algorithm, the following steps must be done sequentially.
####i. Transition Parameters
1. Input the training file to inFile.
2. Input the names of the files to be saved to outFile and outFileReable.
3. Run transition_MLE.py.
####ii. Viterbi Algorithm
This implementation of the Viterbi algorithm uses a ViterbiSequence class which stores a sequence of tags and its associated probability. The class can compute the probability of a longer sequence by calling its method probTransmission with the appropriate next tag and next word.
Each iteration of the algorithm writes a ViterbiSequence to the DP table for every tag. The next iteration scans these entries to find the longer sequence with the best probability, after which the DP table is replaced with the newly computed entries. At the final step, the DP table is scanned to find the maximum-probability sequence, which is then appended to the outputs. The ending state is END and the word emitted is None.
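The procedure described above can be sketched as follows; this is a simplified stand-in for the ViterbiSequence class, working in log space and flooring unseen probabilities (both assumptions, not the project's exact code):

```python
import math

LOG_EPS = math.log(1e-12)  # floor for unseen transitions/emissions

def logp(d, key):
    p = d.get(key, 0.0)
    return math.log(p) if p > 0 else LOG_EPS

def viterbi(words, tags, trans, emit):
    """Simplified first-order Viterbi: the DP table maps each tag to the best
    (log-probability, tag sequence) so far, and is replaced by the newly
    computed entries after every iteration."""
    table = {t: (logp(trans, ("START", t)) + logp(emit, (words[0], t)), [t])
             for t in tags}
    for word in words[1:]:
        new_table = {}
        for v in tags:
            score, seq = max((p + logp(trans, (u, v)) + logp(emit, (word, v)), s)
                             for u, (p, s) in table.items())
            new_table[v] = (score, seq + [v])
        table = new_table
    # final transition into the END state, where the emitted word is None
    _, best = max((p + logp(trans, (u, "END")), s) for u, (p, s) in table.items())
    return best

# toy parameters (illustrative only)
trans = {("START", "DT"): 1.0, ("DT", "NN"): 1.0, ("NN", "END"): 1.0}
emit = {("the", "DT"): 1.0, ("dog", "NN"): 1.0}
```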
###Accuracy Findings
The accuracy findings for both the POS and NPC files are as follows:
Based on the accuracy findings shown above, the accuracy score for POS dropped significantly. This may be because the relative frequencies between the tags are higher, resulting in possible overfitting of the model. In contrast, the NPC accuracy score greatly improved with the use of the Viterbi algorithm, as all tags in NPC occur at relatively high frequencies, so the model does not suffer from overfitting of the data.
##Part IV
###Algorithm
The implemented algorithm finds the top 10 best tag sequences for each input sentence. For the modified Viterbi algorithm, every entry in the DP table stores the top 10 sequences ending in that tag rather than only the best one. In the code, we use a self-maintaining priority queue of maximum length 10 so that only the best-scoring sequences are kept at each step.
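Such a bounded priority queue can be sketched with `heapq` (a minimal illustration, not the project's actual class):

```python
import heapq

class TopK:
    """Keeps only the k highest-scoring (score, item) entries seen so far."""
    def __init__(self, k):
        self.k = k
        self.heap = []  # min-heap: the smallest kept score sits at heap[0]

    def push(self, score, item):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (score, item))
        elif score > self.heap[0][0]:
            # evict the current worst entry to make room for a better one
            heapq.heapreplace(self.heap, (score, item))

    def best_first(self):
        return sorted(self.heap, reverse=True)

top = TopK(3)
for s in [5, 1, 9, 7, 3]:
    top.push(s, f"seq{s}")
```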
###Code List
####viterbiModified.py
viterbiModified.py runs the modified Viterbi algorithm and writes the outputs to POS/Part 4/p4_viterbi_<rank>.txt. Each of these files corresponds to the testing file tagged with one of the top 10 best sequences, with rank running from 0 to 9. The lower the rank, the better the sequence; thus the file tagged with the 10th best sequence is p4_viterbi_9.txt.
viterbiModified.py requires the following file names to be defined:
- transmissionFileName: the name of the file containing the transition parameters as generated by transition_MLE.py. Default value is POS/transition.txt
- emissionsFileName: the name of the file containing the emission parameters for the testing data as generated by testing.py. Default value is POS/Part 3/emission_testing.txt
- sequenceFileName: the name of the file containing the input sequences. Default value is POS/dev.in
- outputFileFormat: the Python string format of the output files written with ranks 0-9. Default value is POS/Part 4/p4_viterbi_{0}.txt
###Instructions
###Accuracy Findings
As expected the
##Part V
###Algorithm
One of the key observations from using the suggested regularisation method in Part 2 for unseen words is that each unseen word will be tagged with the lowest-count tag found in the training data. This is in fact not a very good regularisation method. With this observation in mind, the revised algorithm is implemented with the following changes:
####Regularisation
Using the MLE parameters from the training set without regularisation can lead to unexpected results. Modifying the regularisation for unseen words by assuming that an unseen word can correspond to any tag uniformly can also improve predictive accuracy over the default method, which assigns every unseen word to a single low-count tag.
####REGEX
A better design for an improved POS tagger for tweets is to modify the Viterbi algorithm by adding an additional regular expression check against patterns found in the training dataset. Common word patterns, such as number and address patterns, often imply a particular tag (for example, 234 → CD, cardinal number). By searching the training dataset for these patterns and computing their associated probabilities, we can use these REGEX to capture such patterns and better estimate the hidden states in a given testing dataset.
The key motivation behind using REGEX to identify underlying linguistic structures is that certain word types, such as usernames and numbers, are in reality infinite. It is not possible for the training data to capture every single username, so an unseen username will be tagged with high inaccuracy. This can be avoided if an additional REGEX parameter is weighted in when interpreting the hidden labels, leading to higher accuracy than before.
An example is shown below:
- @blackmanwalking is seen in the training data.
- @whitemandancing is not seen in the training data but appears in the testing data, so it would normally be treated as an unseen word.
- REGEX checks whether @whitemandancing has the underlying structure that describes a username.
However, it is important to note that if too many regular expressions are used, the model might overfit. It is therefore imperative to use only REGEX that capture common linguistic structure, in cases where the feature correlates strongly with the tag, and not for everything.
The following are examples of common regular expressions used in the revised Viterbi algorithm:
- starts with 'http://'
- all numeric
- starts with ':'
- a regular expression to catch consonants
The full list of REGEX parameters can be found in the file regularised_feature_probs.txt.
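These patterns might look as follows in Python (hypothetical regular expressions for illustration, not the exact ones in regularised_feature_probs.txt):

```python
import re

# Hypothetical patterns mirroring the examples listed above
FEATURE_PATTERNS = {
    "url": re.compile(r"^http://"),       # starts with 'http://'
    "numeric": re.compile(r"^[0-9]+$"),   # all numeric
    "emoticon": re.compile(r"^:"),        # starts with ':'
    "username": re.compile(r"^@\w+$"),    # Twitter-style username
}

def matching_features(word):
    """Return the names of all feature patterns that the word matches."""
    return [name for name, pat in FEATURE_PATTERNS.items() if pat.search(word)]
```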
Parameters used in the modified Viterbi with REGEX hence become:
- $a_{ij}$: transition parameters
- $b_j(o)$: emission parameters
- $c_v(R)$: REGEX parameters
The underlying assumption behind the REGEX parameters is that words matching a certain REGEX pattern are more likely to be generated by a small number of tags. The REGEX parameters are generated in the training phase by scanning the entire training dataset to identify words and tags corresponding to each REGEX. For example, if 0.9 of all words matching the pattern [0-9]+ have the tag CD and the rest have the tag TO, we can model the tags of these words as coming from a multinomial distribution.
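The estimation of one REGEX parameter can be sketched as follows (the function name and the (word, tag) pair format are assumptions):

```python
import re
from collections import Counter

def regex_params(pairs, pattern):
    """Estimate c_v(R): among training words fully matching `pattern`,
    the relative frequency of each tag (a multinomial over tags)."""
    counts = Counter(tag for word, tag in pairs if re.fullmatch(pattern, word))
    total = sum(counts.values())
    return {tag: c / total for tag, c in counts.items()}

params = regex_params([("123", "CD"), ("45", "CD"), ("2", "TO"), ("dog", "NN")],
                      r"[0-9]+")
```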
In the testing phase, if a word matches a REGEX, the probability of transiting to the new tag is additionally weighted by the corresponding REGEX parameter.
The following pseudocode explains the logic behind the Viterbi with REGEX algorithm.

```
for k = 1,...,n:
    if word matches REGEX:
        perform modified viterbi with REGEX
    else:
        perform viterbi as usual (where conditional probability c == 1)
```
Mathematically, the probabilities are now computed with an additional weighted parameter:

$$\pi(k, v) = \max_{u}\left\{\pi(k-1, u)\cdot a_{u,v}\cdot b_v(o_k)\cdot c_v(R)\right\}$$

where $c_v(R)$ is the REGEX parameter for tag $v$ under the matched pattern $R$. If the word does not match any pattern in the set of REGEX, the conditional probability $c_v(R)$ is taken to be 1.
####Second Order Penalty
In theory, a Second Order HMM uses the conditional probability $P(y_i \mid y_{i-2}, y_{i-1})$, which conditions on the previous two tags rather than only the previous one. To capture some of this information without a full second-order model, we instead introduce a penalty term dependent on the previous two tags; the extra parameters are estimated by transition_2nd_order_MLE_regularised.py and stored in transition_2nd_order.txt.
####Term by term weighting
In the actual implementation, the parameters were not worked with directly due to numerical underflow; instead, the logarithms of the parameters were used. We found that incorporating the additional information into the HMM by simple addition of logarithms did not lead to large improvements until we weighted each term with a relevant weight. Writing $w_1, \dots, w_4$ for the term weights, a sample Viterbi score would thus be of the form

$$\text{score} = w_1 \log a_{u,v} + w_2 \log b_v(o) + w_3 \log c_v(R) + w_4 \log P(y_i \mid y_{i-2}, y_{i-1})$$
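The weighted combination of log-parameters can be sketched as follows (the function name and weight values are illustrative placeholders, not the project's tuned weights):

```python
import math

def weighted_score(a, b, c, penalty, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the logarithms of the transition, emission, REGEX,
    and second-order penalty parameters. Working in log space avoids the
    numerical underflow of multiplying many small probabilities."""
    wa, wb, wc, wp = weights
    return (wa * math.log(a) + wb * math.log(b)
            + wc * math.log(c) + wp * math.log(penalty))
```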
###Code List
####viterbi_2nd_order.py
viterbi_2nd_order.py performs the Viterbi algorithm on the modified HMM, using the parameters specified above.
viterbi_2nd_order.py requires the following file names to be defined:
- FEATURE_PROB_IN: the name of the file containing the REGEX parameters as generated by featureFinder.py. Default value is regularised_feature_probs.txt.
- TRANSITION_PENALTY: the name of the file containing the second order HMM penalty parameters as generated by transition_2nd_order_MLE_regularised.py. Default value is transition_2nd_order.txt.
- EMISSIONS: the name of the file containing the emission parameters as generated by emission_part5. Default value is part5_emission_testing.txt.
- TRANSITIONS: the name of the file containing the transition parameters as generated by transition_MLE_regularised.py. Default value is transition.txt.
- TESTING_FILE: the name of the file to test on. Default value is ../dev.in.
- OUTPUT_FILE: the name of the file to write to. Default value is p5_viterbi_2nd_order.txt.
- GOLD_STANDARD: the name of the file containing the tags to compare against. Default value is ../dev.out.
###Instructions
1. Name the intended testing file in TESTING_FILE.
2. Run the parameter estimator files. These are featureFinder.py, transition_2nd_order_MLE_regularised.py, emission_part5 and transition_MLE_regularised.py.
3. Specify the parameter file names in the variables above.
4. The code will perform automatic checking of the accuracy and output a diff in difference_2nd_order.txt.
###Accuracy Findings
The modified algorithm was designed with two goals in mind: a better regularisation method, and the identification of common patterns and word constructs, in order to yield better accuracy results. On the test sample provided, the modified Viterbi with REGEX algorithm yielded better accuracy results. This shows that with the modified regularisation method and the inclusion of the REGEX parameters, term-by-term weighting and the Second Order penalty, the modified algorithm provides a more accurate way of determining the hidden states in a given test dataset.