nlp_project's Introduction

Java NLP Project Using Apache OpenNLP

This repository contains Java demonstration code that uses Apache OpenNLP 1.5.3. I wrote this code as a first experience with Natural Language Processing (NLP). There are many different ways to implement NLP, but the OpenNLP library is a good place for beginners to start.

This repository was uploaded with all necessary Java, OpenNLP, and Maven libraries and is more or less ready to go. Refer to the Apache OpenNLP documentation for information on requirements and versions.

Goals

The goal was to read an input.txt text file (kept small, because running these classes on longer text can take a long time) and write the NLP analysis to an output.txt file. The appropriate OpenNLP English-language model was used for the analysis. This code does the following:

  • Reads the sentences of input.txt
  • Writes the sentences as a text string to console
  • Writes the number of sentences found to both console and to output.txt
  • Writes the number of tokens (words, punctuation, numbers, etc.) found to both console and to output.txt
  • Writes the proper names of individuals found to console
  • Writes POS tags to output.txt
  • Writes chunks to output.txt
  • Writes parse results to output.txt

This is actually a small subset of what one can do with these classes. For example, you can parse sentences for grammatical structure and much more.
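
The read-and-write flow described above can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the class name `CountReportSketch`, its method names, and the report wording are hypothetical, and the regular-expression splitting in `main` is only a placeholder for the OpenNLP sentence detector and tokenizer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the input.txt -> output.txt flow; not the repository's code. */
public class CountReportSketch {

    /** Reads the whole input file into one string. */
    static String readInput(Path input) {
        try {
            return new String(Files.readAllBytes(input), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds the report lines from already-detected sentences and tokens. */
    static List<String> buildReport(String[] sentences, String[] tokens) {
        List<String> lines = new ArrayList<>();
        lines.add("Sentences found: " + sentences.length);
        lines.add("Tokens found: " + tokens.length);
        return lines;
    }

    /** Prints the report to the console and writes it to the output file. */
    static void writeReport(Path output, List<String> lines) {
        lines.forEach(System.out::println);
        try {
            Files.write(output, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = readInput(Paths.get("input.txt"));
        // Placeholder splitting (an assumption for this sketch): the project
        // itself relies on OpenNLP models for sentence and token detection.
        String[] sentences = text.trim().split("(?<=[.!?])\\s+");
        String[] tokens = text.trim().split("\\s+");
        writeReport(Paths.get("output.txt"), buildReport(sentences, tokens));
    }
}
```

Keeping the report building separate from file access makes it easy to swap the placeholder splitting for real OpenNLP output without touching the I/O code.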

Features

This code contains the following features:

  • Sentence detector
  • Sentence tokenizer
  • Name finder to detect named entities
  • Part-of-speech tagger
  • Chunker
  • Parser

As mentioned above, the appropriate OpenNLP English-language model is used to do these tasks.

Structure

The main() method is implemented in OpenNLP_App.java, which executes the following classes:

  • ReadInput.java for reading the contents of input.txt
  • SentenceDetect.java for detecting sentence boundaries
  • SentenceTokenize.java for detecting words and punctuation
  • NamedEntityRecognition.java for finding names in sentences
  • TaggerPOS.java for assigning English grammar categories to detected words
  • SentenceChunk.java for organizing sentences into chunks, based on detected tokens
  • SentenceParse.java for iteratively parsing a sentence according to parts of speech
  • PrintOutput.java for writing results to output.txt
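
As a rough illustration of how the first few stages listed above fit together, the sketch below chains the OpenNLP 1.5.x sentence detector, tokenizer, and part-of-speech tagger. It is not the repository's code: the class name `PipelineSketch`, the sample text, and the `models/` directory are assumptions, and the model file names (`en-sent.bin`, `en-token.bin`, `en-pos-maxent.bin`) are assumed to be the pre-trained English models downloaded separately. It needs the `opennlp-tools` library on the classpath.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

/** Hypothetical sketch of a sentence -> token -> POS tag chain with OpenNLP. */
public class PipelineSketch {

    public static void main(String[] args) throws IOException {
        String text = "Pierre Vinken joined the board. He is 61 years old.";

        // Model locations are assumptions; adjust them to wherever the
        // English model files are stored.
        try (InputStream sentIn = new FileInputStream("models/en-sent.bin");
             InputStream tokIn = new FileInputStream("models/en-token.bin");
             InputStream posIn = new FileInputStream("models/en-pos-maxent.bin")) {

            SentenceDetectorME detector = new SentenceDetectorME(new SentenceModel(sentIn));
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
            POSTaggerME tagger = new POSTaggerME(new POSModel(posIn));

            // Stage 1: split the raw text into sentences.
            String[] sentences = detector.sentDetect(text);
            System.out.println("Sentences found: " + sentences.length);

            for (String sentence : sentences) {
                // Stage 2: split each sentence into tokens.
                String[] tokens = tokenizer.tokenize(sentence);
                // Stage 3: assign one part-of-speech tag per token.
                String[] tags = tagger.tag(tokens);

                StringBuilder line = new StringBuilder();
                for (int i = 0; i < tokens.length; i++) {
                    line.append(tokens[i]).append('/').append(tags[i]).append(' ');
                }
                System.out.println(line.toString().trim());
            }
        }
    }
}
```

Each stage consumes the output of the one before it, which is why tokens must exist before tagging can run. The name finder and chunker follow the same load-a-model-then-call pattern on token arrays, and the parser follows it on whole sentences.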

Other Features

This project uses Maven for dependency management. See the OpenNLP documentation for more information.
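
For orientation, OpenNLP 1.5.3 is published on Maven Central under the coordinates below. This is a sketch of what a typical dependency entry looks like; the repository's own pom.xml may declare it differently.

```xml
<!-- Assumed dependency entry; check the project's pom.xml for the actual declaration. -->
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.5.3</version>
</dependency>
```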
