GithubHelp home page GithubHelp logo

pos_tagger's Introduction

In this work, we explore hidden markov models for tagging parts of speech. We begin by using the WSJ Penn Treebank corpus, which contains human annotated parts of speech for the sentences.

We begin by parsing the data and creating a vocabulary. For optimization, we select a threshold of 3. We output the word and its frequency within the corpus to vocab.txt and as aforementioned, our unknown word threshold as 3. The total size of the vocabulary ends up being 16,920 words. We have 32,537 as the frequency of our unknown word count

We then implement a HMM on our training data. We compute transition probabailities as a function of count(s -> s') / count(s) and emission probabilties as a function of count(s -> x) / count(s). We output these parameters to hmm.json . There are 1392 transition parameters and 50285 emission parameters in our HMM. We then move on to decoding our HMM. Greedy decoding selects the most probable tag for a single word at a time, it is not the most optimal since it takes local decisions. The accuracy of the greedy decoder on the dev data is 92.13%.

Another way to decode the HMM is using the Viterbi algorithm. Viterbi decoding is an algorithm that computes the maximum probability of a state at a given time step given the maximum path at the previous time step with complexity O(mk^2). The accuracy of the Viterbi decoder on the dev data is 93.64%. We write the output of the test data to greedy.out and viterbi.out , respectively.

pos_tagger's People

Watchers

Harsh Karia avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.