
MLproject

Machine Learning NLP project

# 50.007 Machine Learning Design Project

Darren Ng 1000568, Glen Choo 1000472, Vu Xuan Kim Cuong 1000646

[TOC]

## Part II

The implementation of the emission parameters, a fix for new words that do not appear in the training set, the POS tagger and the accuracy score.

### Code List

The following code is used for the estimation of the emission parameters $e(x|y)$ using maximum likelihood estimation (MLE).
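
As a sketch of this estimation (names are illustrative, not the actual training.py code), $e(x|y) = \frac{Count(y \rightarrow x)}{Count(y)}$ can be computed as:

```python
from collections import Counter, defaultdict

def estimate_emissions(pairs):
    """MLE emission parameters e(x|y) = Count(y -> x) / Count(y).

    `pairs` is an iterable of (word, tag) tuples from the training data.
    """
    tag_count = Counter()
    emit_count = defaultdict(Counter)
    for word, tag in pairs:
        tag_count[tag] += 1
        emit_count[tag][word] += 1
    return {tag: {word: c / tag_count[tag] for word, c in words.items()}
            for tag, words in emit_count.items()}
```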

#### training.py

training.py processes the training data and outputs all of the counts needed to compute the emission parameters. It writes these counts into 3 main files:

  1. emission_train_count.txt: the count of each tag
  2. emission_training.txt: the count of each word under each tag
  3. emission_trainingReadable.txt: emission_training.txt in a readable format

#### testing.py

testing.py processes the testing data using the output files generated by training.py. To improve efficiency, the mode of the function testing_splitter is set to unique, so that the emission parameter probabilities and tags are computed only for the unique words found in the testing dataset.

testing.py outputs the following files:

  1. emission_testing: file containing the probability of each word under its respective tag.
  2. emission_testing_tags: file containing the tag of each word based on the MLE.

#### accuracy.py

accuracy.py computes how accurate the predicted file is compared with the annotated testing data, and prints a value between 0 and 1 to the console.
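
A minimal sketch of such an accuracy function (illustrative, not the actual accuracy.py code):

```python
def accuracy(predicted_tags, gold_tags):
    """Fraction of positions where the predicted tag matches the gold tag.

    Both arguments are flat lists of tags of equal length.
    """
    assert len(predicted_tags) == len(gold_tags)
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags)
```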

#### e_pos_tagger.py

e_pos_tagger.py generates the dev.p2.out file based on the output files generated by testing.py.

### Instructions

To compute the emission parameter estimation for any file, the following steps must be done sequentially.

  1. Input the training file and run training.py. It will generate 4 files.
  2. Input the testing file dev.in into testing.py.
  3. Ensure that the correct emission_count and emission_train_count files generated by training.py are assigned to the emission_count and state_count variables.
  4. Run testing.py. It will generate 2 files.
  5. Input the files generated by testing.py into e_pos_tagger.py to generate the tag files.
  6. Lastly, run accuracy.py with the predicted file generated in step 5 to compare against the annotated testing dataset. A number between 0 and 1 will be printed to the console.

### Accuracy Findings

The code has been updated to handle new words not seen in the training data:

$$e(x|y) = \frac{1}{Count(y) + 1}$$
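
As a sketch (function and variable names are illustrative, not those in the actual code), the emission lookup with this fallback might be:

```python
def emission_prob(word, tag, emissions, tag_count):
    """e(word|tag), falling back to 1 / (Count(tag) + 1) for words
    that never appeared in the training data."""
    probs = emissions.get(tag, {})
    if word in probs:
        return probs[word]
    return 1.0 / (tag_count[tag] + 1)
```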

The accuracy findings for both the POS and NPC files are as follows:

$$Accuracy(POS) = \frac{23161}{33087} = 0.70000302\\ Accuracy(NPC) = \frac{1709}{2844} = 0.60091420$$

These accuracy findings are based on the naive POS tagger, where the tag of each word is determined only by the emission parameters implemented in Part 2.
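
A minimal sketch of such a naive tagger (illustrative names and data structures, not the actual e_pos_tagger.py code): each word is simply assigned the tag maximising $e(x|y)$, with the $\frac{1}{Count(y)+1}$ fallback for unseen words.

```python
def naive_tag(word, emissions, tag_count):
    """Pick the tag y maximising e(word|y); unseen words fall back
    to the regularised probability 1 / (Count(y) + 1)."""
    def score(tag):
        return emissions.get(tag, {}).get(word, 1.0 / (tag_count[tag] + 1))
    return max(tag_count, key=score)
```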

## Part III

The implementation of the transition parameters and of the Viterbi algorithm, using the emission parameters estimated in Part II together with the transition parameters.

### Code List

#### transition_MLE.py

transition_MLE.py computes the estimation of the transition parameters. It outputs the following 2 files:

  1. transition.txt: file containing all the transition parameters.
  2. transitionReadable.txt: file containing all the transition parameters in a readable format.
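
As a sketch of the estimation (illustrative names; the actual transition_MLE.py may differ), the MLE transition parameters $a(u,v) = \frac{Count(u,v)}{Count(u)}$ can be computed as:

```python
from collections import Counter, defaultdict

def estimate_transitions(tag_sequences):
    """MLE transition parameters a(u, v) = Count(u, v) / Count(u).

    Each sequence of tags is padded with START and STOP states so
    that entry and exit probabilities are estimated too.
    """
    from_count = Counter()
    pair_count = defaultdict(Counter)
    for tags in tag_sequences:
        padded = ["START"] + list(tags) + ["STOP"]
        for u, v in zip(padded, padded[1:]):
            from_count[u] += 1
            pair_count[u][v] += 1
    return {u: {v: c / from_count[u] for v, c in vs.items()}
            for u, vs in pair_count.items()}
```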

#### viterbi.py

viterbi.py runs the Viterbi algorithm using the transition and emission parameters computed earlier and writes the result to the file p3_viterbi_train.txt.

### Instructions

To compute the transition parameter estimation of any file, or to run the Viterbi algorithm, the following steps must be done sequentially.

#### i. Transition Parameters

  1. Input the training file to inFile.
  2. Input the names of the files to be saved to outFile and outFileReable.
  3. Run transition_MLE.py.

#### ii. Viterbi Algorithm

This implementation of the Viterbi algorithm uses a ViterbiSequence class, which stores a sequence of tags and its associated probability. The class can compute the probability of a longer sequence by calling its method probTransmission with the appropriate next tag and next word.

Each iteration of the algorithm writes a ViterbiSequence to the DP table for every tag. The next iteration scans these entries to find the extended sequence with the highest probability. After every iteration, the entries from the previous iteration are replaced with the newly computed ones. At the final step, the DP table is scanned for the maximum probability sequence, which is then appended to the outputs. The ending state is END and the word emitted is None.
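
The iteration described above can be sketched as follows (a simplified dictionary-based version; the actual viterbi.py wraps sequences in the ViterbiSequence class):

```python
def viterbi(words, tags, transition, emission):
    """Return the most likely tag sequence for `words` and its probability.

    transition[u][v] = a(u, v); emission[v][x] = e(x|v).
    Missing entries are treated as probability 0.
    """
    # DP table: tag -> (probability, best sequence ending in that tag)
    table = {"START": (1.0, [])}
    for word in words:
        new_table = {}
        for v in tags:
            e = emission.get(v, {}).get(word, 0.0)
            best = max(
                (p * transition.get(u, {}).get(v, 0.0) * e, seq)
                for u, (p, seq) in table.items()
            )
            new_table[v] = (best[0], best[1] + [v])
        table = new_table  # replace old entries with this iteration's
    # final transition into the STOP state
    prob, seq = max((p * transition.get(u, {}).get("STOP", 0.0), s)
                    for u, (p, s) in table.items())
    return seq, prob
```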

### Accuracy Findings

The accuracy findings for both the POS and NPC files are as follows:

$$Accuracy(POS) = 0.5928\\ Accuracy(NPC) = 0.776$$

Based on the accuracy findings shown above, the accuracy score for POS dropped significantly. This may be because the relative frequencies of the POS tags differ greatly, resulting in possible overfitting of the model. In contrast, the NPC accuracy score improved greatly with the use of the Viterbi algorithm, as all tags in NPC occur at relatively high frequencies; the model therefore does not overfit the data.

## Part IV

### Algorithm

The algorithm implemented to find the $10^{th}$ best POS tag sequence is a modified Viterbi algorithm. Instead of returning only the most likely sequence of hidden states for a sequence of observed events, the modified Viterbi algorithm returns the top 10 sequences. The typical Viterbi algorithm gives the maximum probability for a sequence of length $k$, ending in tag $v$: $$\pi(k,v) = \max_u\{\pi(k-1,u) \cdot a_{u,v} \cdot b_v(x_k)\}$$

For the modified Viterbi algorithm, every entry in the DP table stores the top $10$ sequences of length $k$, ending in tag $v$. This is necessary as the $10^{th}$ best sequence of length $k+1$ may come from the $10^{th}$ best sequence of length $k$, ending in a certain tag $v$. The algorithm evaluates every entry in the DP table to find the 10 best sequences from the state $k$ to state $k+1$.

In the code, we use a self-maintaining priority queue of maximum length $10$ for each entry in the DP table. Every iteration at state $k$ scans every entry in the DP table in order and adds a new sequence to the queue if the queue is less than 10 items long or if the probability exceeds the smallest probability in the queue. The algorithm will stop scanning the DP table entry if the highest probability sequence in that entry cannot transit to a top 10 sequence.
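
The bounded priority queue can be sketched with Python's heapq (an illustrative stand-in for the self-maintaining queue described above, not the project's actual class):

```python
import heapq

class TopK:
    """Keeps the K highest-probability (prob, sequence) entries.

    A min-heap of size K lets us compare against the smallest kept
    probability in O(1) and insert in O(log K).
    """
    def __init__(self, k=10):
        self.k = k
        self._heap = []  # min-heap ordered by probability

    def push(self, prob, seq):
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (prob, seq))
        elif prob > self._heap[0][0]:
            # evict the current smallest probability
            heapq.heapreplace(self._heap, (prob, seq))

    def best_first(self):
        """Entries sorted from highest to lowest probability."""
        return sorted(self._heap, key=lambda t: t[0], reverse=True)
```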

### Code List

#### viterbiModified.py

viterbiModified.py runs the modified Viterbi algorithm and writes the outputs to POS/Part 4/p4_viterbi_<rank>.txt. Each of these files corresponds to the testing file tagged with one of the top 10 best sequences, with rank running from 0 to 9. The lower the rank, the better the sequence. Thus the file tagged with the $10^{th}$ best sequence is p4_viterbi_9.txt.

viterbiModified.py requires the following file names to be defined:

  1. transmissionFileName: the name of the file containing the transition parameters as generated by transition_MLE.py. Default value is POS/transition.txt
  2. emissionsFileName: the name of the file containing the emission parameters for the testing data as generated by testing.py. Default value is POS/Part 3/emission_testing.txt
  3. sequenceFileName: the name of the file containing the input sequences. Default value is POS/dev.in
  4. outputFileFormat: the Python string format of the output files written with ranks 0-9. Default value is POS/Part 4/p4_viterbi_{0}.txt

### Instructions

### Accuracy Findings

$$Accuracy(POS) = 0.5784$$

As expected, the $10^{th}$ best sequence performs worse than the original Viterbi algorithm used in Part III.

## Part V

### Algorithm

One of the key observations from using the suggested regularisation method from Part 2 for unseen words is that each unseen word will be tagged with the lowest-count tag found in the training data. This is, in fact, not a very good regularisation method. With this observation in mind, the revised algorithm is implemented with the following changes:

**Regularisation** Using the MLE parameters from the training set without regularisation can lead to unexpected results. For instance, if at state $k$ one can only transit to the tags $PRP$ and $NN$, but the word at state $k+1$ can only be emitted by the tag $CD$, one ends up in a situation where the maximum probability of a sequence of length $k+1$ is $0$. This affects all predictions from $k+1$ onwards, as all probabilities from then on will be $0$. This can be solved by adding a regularisation factor to the transition parameters so that any tag can transit to any other tag with a small probability, even if the transition was not observed in the training data.

Modifying the regularisation for unseen words by assuming that they correspond to all tags uniformly can also improve predictive accuracy. The default method of regularisation, assigning $b_j(o) = \frac{1}{count(j) + 1}$, heavily favours rarer tags over more common tags, even though unseen words may be likelier to correspond to common tags. Drawing the tags uniformly reduces this bias towards rare tags in the Viterbi algorithm.

**REGEX** A better design for an improved POS tagger for tweets is to modify the Viterbi algorithm with an additional regular expression check against patterns found in the training dataset. Common word patterns such as numbers and addresses often imply a particular tag (for example, 234 → CD, cardinal number). By searching the training dataset for these patterns and computing their associated probabilities, we can use regular expressions to capture these patterns and better estimate the hidden states in a given testing dataset.

The key motivation behind using REGEX to identify underlying linguistic structures is the fact that certain word types, such as usernames and numbers, are in reality infinite. It is not possible for the training data to capture every single username, so an unseen username will be tagged with low accuracy. This can be avoided by weighing in an additional REGEX parameter when interpreting the hidden labels, leading to higher accuracies than before.

An example is shown below:

  • @blackmanwalking is seen in the training data
  • @whitemandancing is not seen in the training data but appears in the testing data, and will be treated as an unseen word
  • REGEX checks whether @whitemandancing has the underlying structure that describes a username
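
A sketch of such a structural check with Python's re module (the patterns here are hypothetical stand-ins for those in regularised_feature_probs.txt):

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = [
    ("USERNAME", re.compile(r"^@\w+$")),   # e.g. @whitemandancing
    ("URL", re.compile(r"^http://")),      # starts with 'http://'
    ("NUMERIC", re.compile(r"^[0-9]+$")),  # all numeric, e.g. 234
]

def matching_feature(word):
    """Return the name of the first pattern the word matches, or None."""
    for name, pattern in PATTERNS:
        if pattern.match(word):
            return name
    return None
```

Even though @whitemandancing never appears in the training data, it still matches the username pattern, so it can be tagged accordingly instead of being treated as an arbitrary unseen word.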

However, it is important to note that using too many regular expressions might cause the model to overfit. It is therefore imperative to use only REGEX that captures common linguistic structure, in cases where the feature correlates strongly with the tag, and not for everything.

The following are examples of common regular expressions used in the revised Viterbi algorithm:

  • starts with 'http://'
  • all numeric
  • starts with ':'
  • a regular expression to catch consonants

The full list of REGEX parameters can be found in the file regularised_feature_probs.txt.

The parameters used in the modified Viterbi with REGEX hence become:

  • $a_{ij}$: transition parameters
  • $b_j(o)$: emission parameters
  • $c_v(R)$: REGEX parameters

The underlying assumption behind the REGEX parameters is that words matching a certain REGEX pattern are more likely to be generated by a small number of tags. The REGEX parameters are generated in the training phase by scanning the entire training data set to identify the words and tags corresponding to each REGEX. For example, if 0.9 of all words matching the pattern [0-9]+ have the tag CD and the rest have the tag TO, we can model the tags of these words as coming from a multinomial distribution. These parameters are generated in the training phase, where $c_v(R) = \frac{count(tag = v, word \in R)}{count(word \in R)}$ for all $R$. The parameters are then regularised for unseen $v, R$ pairs by assigning them a small, fixed probability.
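
A sketch of this estimation (illustrative names; the regularisation floor of 1e-6 for unseen pairs is an assumed value, not the one used in the project):

```python
import re
from collections import Counter, defaultdict

def estimate_regex_params(pairs, patterns, floor=1e-6):
    """c_v(R) = count(tag = v, word in R) / count(word in R), with a
    small fixed probability `floor` for unseen (tag, pattern) pairs.

    `pairs` is (word, tag) training data; `patterns` maps a feature
    name to a compiled regular expression.
    """
    match_count = Counter()
    tag_match_count = defaultdict(Counter)
    for word, tag in pairs:
        for name, pattern in patterns.items():
            if pattern.match(word):
                match_count[name] += 1
                tag_match_count[name][tag] += 1

    def c(tag, name):
        total = match_count[name]
        matched = tag_match_count[name][tag]
        return matched / total if matched else floor
    return c
```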

In the testing phase, if the word matches a REGEX, the probability of transiting to the new tag $v$ is multiplied by the relevant $c_v(R)$. If the word does not match any of the patterns listed, we compute Viterbi as usual.

The following pseudocode explains the logic behind the Viterbi with REGEX algorithm.

    for k = 1,...,n:
        if word matches REGEX:
            perform modified viterbi with REGEX
        else:
            perform viterbi as usual (where conditional probability c == 1)

Mathematically, the probabilities are now computed with an additional weighted parameter $c_v(R)$ as shown below:

$$\pi(k,v) = \max_u\{\pi(k-1,u) \cdot a_{u,v} \cdot b_v(x_k) \cdot c_v(R)\}$$

where $c_v(R) = P(y=v | R)$ and the list of regular expressions is given by $R$.

If the word does not match any pattern in the set of REGEX, the conditional probability $c_v(R)$ becomes $1$ and the usual Viterbi algorithm takes place.

**Second Order Penalty** In theory, a Second Order HMM that uses the conditional probability $P(y_j = v_0|y_{j-2} = v_1, y_{j-1}= v_2)$ would lead to better predictions than a First Order HMM. In practice, however, the Viterbi algorithm for a Second Order HMM is infeasibly slow.

To capture some of this information nonetheless, we introduce a penalty term dependent on the previous two tags. The extra parameter introduced is $d_{uv,w}$, the conditional probability $P(y_j = w|y_{j-2} = u, y_{j-1}= v)$. The term acts as a penalty on tag sequences $u,v,w$ that appear in the training set with low probability. The Viterbi score is thus multiplied by the additional term $d_{uv,w}$ if the last two tags in the sequence are $u,v$ and the next tag is $w$.

**Term by term weighting** In the actual implementation, the parameters were not used directly due to numerical underflow; instead, the logarithms of the parameters were used. We found that incorporating the additional information into the HMM by simple addition of logarithms did not lead to large improvements until we weighted each term with a relevant weight. A sample Viterbi score is thus of the form $k_a\ln(a_{u,v}) + k_b\ln(b_j(o)) + k_c\ln(c_v(R)) + k_d\ln(d_{uv,w})$.
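
A sketch of such a weighted log score (the weights K_A to K_D here are hypothetical placeholders; the actual values were chosen empirically):

```python
import math

# Hypothetical weights for illustration only.
K_A, K_B, K_C, K_D = 1.0, 1.0, 0.5, 0.5

def weighted_log_score(a_uv, b_vo, c_vR, d_uvw):
    """Weighted sum of log-parameters; working in log space avoids
    numerical underflow when multiplying many small probabilities."""
    return (K_A * math.log(a_uv) + K_B * math.log(b_vo)
            + K_C * math.log(c_vR) + K_D * math.log(d_uvw))
```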

### Code List

#### viterbi_2nd_order.py

viterbi_2nd_order.py performs the Viterbi algorithm on the modified HMM, using the parameters specified above.

viterbi_2nd_order.py requires the following file names to be defined:

  1. FEATURE_PROB_IN: the name of the file containing the REGEX parameters as generated by featureFinder.py. Default value is regularised_feature_probs.txt.
  2. TRANSITION_PENALTY: the name of the file containing the second order HMM penalty parameters as generated by transition_2nd_order_MLE_regularised.py. Default value is transition_2nd_order.txt.
  3. EMISSIONS: the name of the file containing the emission parameters as generated by emission_part5. Default value is part5_emission_testing.txt.
  4. TRANSITIONS: the name of the file containing the transition parameters as generated by transition_MLE_regularised.py. Default value is transition.txt.
  5. TESTING_FILE: the name of the file to test on. Default value is ../dev.in.
  6. OUTPUT_FILE: the name of the file to write to. Default value is p5_viterbi_2nd_order.txt.
  7. GOLD_STANDARD: the name of the file containing the tags to compare against. Default value is ../dev.out.

### Instructions

  1. Name the intended testing file in TESTING_FILE.
  2. Run the parameter estimator files: featureFinder.py, transition_2nd_order_MLE_regularised.py, emission_part5 and transition_MLE_regularised.py.
  3. Specify the parameter file names in the variables above.
  4. The code will perform automatic checking of the accuracy and output a diff in difference_2nd_order.txt.

### Accuracy Findings

$$Accuracy(Part 5) = 0.7974$$

The modified algorithm was designed with the following decisions in mind: a better regularisation method, and the identification of common patterns and word constructs. On the test sample provided, the modified Viterbi with REGEX algorithm yielded better accuracy. This shows that with the modified regularisation method and the inclusion of the REGEX parameters, term by term weighting and the Second Order penalty, the modified algorithm provides a more accurate way of determining the hidden states in a given test dataset.
