The pos_tagger from harshkaria

pos_tagger's Introduction

In this work, we explore hidden Markov models (HMMs) for part-of-speech tagging. We use the WSJ Penn Treebank corpus, which contains human-annotated part-of-speech tags for its sentences.

We begin by parsing the data and creating a vocabulary. Rare words are folded into a single unknown-word token using a frequency threshold of 3. We output each word and its frequency within the corpus to vocab.txt. The total size of the vocabulary ends up being 16,920 words, and the unknown-word token has a total frequency of 32,537.
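The vocabulary step can be sketched roughly as below. This is a minimal illustration, not the repository's code: the function name `build_vocab`, the `<unk>` token, and the choice to treat a word as rare when its count is strictly below the threshold are all assumptions, since the text does not say whether the cutoff is inclusive.

```python
from collections import Counter


def build_vocab(sentences, threshold=3, unk="<unk>"):
    """Count word frequencies and fold rare words into one unknown token.

    `sentences` is a list of sentences, each a list of (word, tag) pairs.
    Assumption: a word is rare when its count is strictly below `threshold`.
    """
    counts = Counter(word for sentence in sentences for word, _tag in sentence)
    kept = {w: c for w, c in counts.items() if c >= threshold}
    unk_count = sum(c for c in counts.values() if c < threshold)
    # Unknown token first, then the kept words by descending frequency.
    ranked = sorted(kept.items(), key=lambda wc: (-wc[1], wc[0]))
    return [(unk, unk_count)] + ranked


# Toy data, invented for illustration only.
toy = [
    [("the", "DT"), ("dog", "NN")],
    [("the", "DT"), ("cat", "NN")],
    [("the", "DT"), ("cat", "NN")],
]
vocab = build_vocab(toy, threshold=3)
print(vocab)  # [('<unk>', 3), ('the', 3)]
```

Here "dog" (1 occurrence) and "cat" (2 occurrences) fall below the threshold, so their counts are added to the unknown token.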

We then train an HMM on our training data. We compute transition probabilities as count(s -> s') / count(s) and emission probabilities as count(s -> x) / count(s). We output these parameters to hmm.json. There are 1392 transition parameters and 50285 emission parameters in our HMM.

We then move on to decoding. Greedy decoding selects the most probable tag for one word at a time, conditioned on the previously chosen tag. Because it makes only local decisions, it is not guaranteed to find the best tag sequence overall. The accuracy of the greedy decoder on the dev data is 92.13%.
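The count-based estimates and the greedy decoder described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the repository's implementation: the names `train_hmm` and `greedy_decode`, the `<s>` start state, and the dictionary layout are hypothetical, and no smoothing or unknown-word handling is shown.

```python
from collections import Counter

START = "<s>"  # hypothetical start-of-sentence state, not taken from the source


def train_hmm(sentences):
    """Estimate transition and emission probabilities from tagged sentences."""
    trans_count, emit_count, state_count = Counter(), Counter(), Counter()
    for sentence in sentences:
        prev = START
        state_count[START] += 1
        for word, tag in sentence:
            trans_count[(prev, tag)] += 1
            emit_count[(tag, word)] += 1
            state_count[tag] += 1
            prev = tag
    # count(s -> s') / count(s) and count(s -> x) / count(s), as in the text.
    transition = {(s, s2): c / state_count[s] for (s, s2), c in trans_count.items()}
    emission = {(s, x): c / state_count[s] for (s, x), c in emit_count.items()}
    return transition, emission


def greedy_decode(words, tags, transition, emission):
    """Pick the locally best tag for each word given the previous chosen tag."""
    prev, output = START, []
    for word in words:
        best = max(
            tags,
            key=lambda t: transition.get((prev, t), 0.0) * emission.get((t, word), 0.0),
        )
        output.append(best)
        prev = best
    return output


# Toy data, invented for illustration only.
toy_train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
transition, emission = train_hmm(toy_train)
tags = sorted({tag for sentence in toy_train for _word, tag in sentence})
print(greedy_decode(["the", "cat", "barks"], tags, transition, emission))
# ['DT', 'NN', 'VBZ']
```

Each step commits to one tag before looking at the next word, which is why a locally attractive choice can lead to a worse sequence overall.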

Another way to decode the HMM is the Viterbi algorithm. Viterbi decoding is a dynamic-programming algorithm: for each tag at each time step, it computes the probability of the best path ending in that tag from the best paths at the previous time step. It runs in O(mk^2) time, where m is the sentence length and k is the number of tags, and it finds the most probable tag sequence under the model. The accuracy of the Viterbi decoder on the dev data is 93.64%. We write the greedy and Viterbi predictions for the test data to greedy.out and viterbi.out, respectively.
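A minimal sketch of Viterbi decoding is given below. It assumes transition and emission probabilities stored as dictionaries keyed by `(previous_tag, tag)` and `(tag, word)`, plus a `<s>` start state. These names, the dictionary layout, and the use of log probabilities are illustrative choices, not details confirmed by the source, and the probabilities in the example are made up.

```python
import math


def viterbi_decode(words, tags, transition, emission, start="<s>"):
    """Return the most probable tag sequence for `words` under the HMM."""
    if not words:
        return []

    def logp(table, key):
        # Missing entries are treated as probability zero (no smoothing).
        p = table.get(key, 0.0)
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: log-probability of the best path that ends in tag t at position i.
    best = [{t: logp(transition, (start, t)) + logp(emission, (t, words[0])) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # k candidates for each of k tags at each of m positions: O(m k^2).
            prev = max(tags, key=lambda p: best[i - 1][p] + logp(transition, (p, t)))
            best[i][t] = (
                best[i - 1][prev] + logp(transition, (prev, t)) + logp(emission, (t, words[i]))
            )
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Made-up toy probabilities, for illustration only.
tags = ["N", "V"]
transition = {
    ("<s>", "N"): 0.6, ("<s>", "V"): 0.4,
    ("N", "N"): 0.3, ("N", "V"): 0.7,
    ("V", "N"): 0.8, ("V", "V"): 0.2,
}
emission = {
    ("N", "fish"): 0.8, ("N", "sleep"): 0.2,
    ("V", "fish"): 0.3, ("V", "sleep"): 0.7,
}
print(viterbi_decode(["fish", "sleep"], tags, transition, emission))  # ['N', 'V']
```

Working in log space replaces products with sums, which avoids numerical underflow on long sentences without changing which path is selected.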
