Text Extraction

Author

This project wraps and combines Stanford CoreNLP, OpenNLP, OpenIE, ClausIE, WordNet and the ConceptNet API so that these tools are easier to use for common NLP tasks such as named entity recognition, POS tagging, chunking, information extraction, dependency parsing and concept extraction.

Dependencies

Examples

Stanford CoreNLP Parser

For named entity recognition, first construct a CoreNLPParser and annotate the text you want to process. Stanford Named Entity Recognizer (NER) provides three models: a 3-class model (Location, Person, Organization), a 4-class model (Location, Person, Organization, Misc) and a 7-class model (Location, Person, Organization, Money, Percent, Date, Time). Because each model covers different entity types, CoreNLPParser creates one NER detector per model, and all NER detection methods combine the results of the three models. Each detector can also be used on its own, for example: corenlp.detectNERInlineXML(corenlp.getNerDetector7Class()).

public class NameEntityRecogExample {
    public static void main (String[] args) {
        // input text, kept identical to the text that appears in the output shown after this example
        String text = "Mary is studying in Stanford University, which is located at California, since July 2015. " + 
            "John got up this morning at 9:00 am and went to a shop to spend five dollars to buy a 50% off toothbrush. " + 
            "After he came back, she found his backyard was looking a little empty, so he decided he would plant something in it.";
        // create corenlp parser
        CoreNLPParser corenlp = new CoreNLPParser();
        corenlp.annotate(text); // annotate the given text
        System.out.println(corenlp.detectNERInlineXML()); // print detected NER in text with inline XML.
        System.out.println("Person: " + corenlp.findPerson()); // find person
        System.out.println("Location: " + corenlp.findLocation()); // find location entity
        System.out.println("Organization: " + corenlp.findOrganization()); // find organization entity
        System.out.println("Date: " + corenlp.findDate()); // find date entity
        System.out.println("Time: " + corenlp.findTime()); // find time entity
        System.out.println("Percent: " + corenlp.findPercent()); // find percent entity
        System.out.println("Money: " + corenlp.findMoney()); // find money entity
        System.out.println("MISC: " + corenlp.findMISC()); // find MISC (anything else) entity
        System.out.println();
        // run each Stanford CoreNLP NER model on its own
        // model: english.all.3class.distsim.crf.ser.gz
        System.out.println(corenlp.detectNERInlineXML(corenlp.getNerDetector3Class()));
        // model: english.conll.4class.distsim.crf.ser.gz
        System.out.println(corenlp.detectNERInlineXML(corenlp.getNerDetector4Class()));
        // model: english.muc.7class.distsim.crf.ser.gz
        System.out.println(corenlp.detectNERInlineXML(corenlp.getNerDetector7Class()));
    }
}

Here is the output:

<PERSON>Mary</PERSON> is studying in <ORGANIZATION>Stanford University</ORGANIZATION>, which is located at <LOCATION>California</LOCATION>, since <DATE>July 2015</DATE>. <PERSON>John</PERSON> got up <TIME>this morning</TIME> at 9:00 am and went to a shop to spend <MONEY>five dollars</MONEY> to buy a <PERCENT>50%</PERCENT> off toothbrush. After he came back, she found his backyard was looking a little empty, so he decided he would plant something in it.
Person: [Mary, John]
Location: [California]
Organization: [Stanford University]
Date: [July 2015]
Time: [this morning]
Percent: [50%]
Money: [five dollars]
MISC: []

<PERSON>Mary</PERSON> is studying in <ORGANIZATION>Stanford University</ORGANIZATION>, which is located at <LOCATION>California</LOCATION>, since July 2015. <PERSON>John</PERSON> got up this morning at 9:00 am and went to a shop to spend five dollars to buy a 50% off toothbrush. After he came back, she found his backyard was looking a little empty, so he decided he would plant something in it.
<PERSON>Mary</PERSON> is studying in <ORGANIZATION>Stanford University</ORGANIZATION>, which is located at <LOCATION>California</LOCATION>, since July 2015. <PERSON>John</PERSON> got up this morning at 9:00 am and went to a shop to spend five dollars to buy a 50% off toothbrush. After he came back, she found his backyard was looking a little empty, so he decided he would plant something in it.
Mary is studying in <ORGANIZATION>Stanford University</ORGANIZATION>, which is located at <LOCATION>California</LOCATION>, since <DATE>July 2015</DATE>. John got up <TIME>this morning</TIME> at 9:00 am and went to a shop to spend <MONEY>five dollars</MONEY> to buy a <PERCENT>50%</PERCENT> off toothbrush. After he came back, she found his backyard was looking a little empty, so he decided he would plant something in it.
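
The inline XML strings above are easy to post-process when the entities are needed as data. The following is a minimal sketch, not part of the project: the class InlineXmlEntities and its group method are hypothetical names, and the sketch assumes the output only contains flat, non-nested tags with upper-case names, as in the sample output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the project): groups the entities found in
// inline-XML NER output by their tag name.
class InlineXmlEntities {
    // Matches <LABEL>text</LABEL>; the back-reference \1 requires the closing
    // tag to carry the same name as the opening tag.
    private static final Pattern ENTITY = Pattern.compile("<([A-Z]+)>(.*?)</\\1>");

    static Map<String, List<String>> group(String inlineXml) {
        // LinkedHashMap keeps labels in order of first appearance.
        Map<String, List<String>> byLabel = new LinkedHashMap<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            byLabel.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2));
        }
        return byLabel;
    }

    public static void main(String[] args) {
        String tagged = "<PERSON>Mary</PERSON> is studying in "
            + "<ORGANIZATION>Stanford University</ORGANIZATION>, which is located at "
            + "<LOCATION>California</LOCATION>, since <DATE>July 2015</DATE>.";
        // prints {PERSON=[Mary], ORGANIZATION=[Stanford University], LOCATION=[California], DATE=[July 2015]}
        System.out.println(group(tagged));
    }
}
```

Running the same helper over the output of a single model makes it easy to see which entity types that model contributes.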

For tokenization and POS tagging, CoreNLPParser returns the results with much less code:

import java.util.List;

public class StanfordCoreNLPExample {
    public static void main (String[] args) {
        String text = "..."; // same as before
        // create corenlp parser
        CoreNLPParser corenlp = new CoreNLPParser();
        corenlp.annotate(text); // annotate the given text
        // sentence-level tokenization
        List<String> sentences = corenlp.sentenceTokenizer(); // each string in the list is a sentence
        // word-level tokenization
        List<List<String>> tokens = corenlp.wordTokenizer(); // each List<String> holds the tokens of one sentence
        List<List<String>> lemmaTokens = corenlp.lemmaTokenizer(); // each List<String> holds the token lemmas of one sentence
        // POS tagging
        List<List<POSTagPhrase>> tags = corenlp.posTagger(); // each POSTagPhrase holds a word and its POS tag
        List<String> tagsStr = corenlp.posTags2String(); // one string per sentence, with each token marked by its POS tag
        // Example: Mary/NNP is/VBZ studying/VBG in/IN Stanford/NNP University/NNP ,/, which/WDT is/VBZ located/JJ at/IN California/NNP ,/, since/IN July/NNP 2015/CD ./.
    }
}
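
If only the word/TAG strings are at hand, they can be split back into pairs. This is a small sketch under the assumption that every item has the form word/TAG separated by whitespace, as in the example comment above; TaggedSentenceReader is a hypothetical name and is not a class of this project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the project): splits a "word/TAG word/TAG ..."
// string into {word, tag} pairs.
class TaggedSentenceReader {
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            // Split on the LAST slash so that tokens such as ",/," still work.
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                continue; // not of the form word/TAG, skip it
            }
            pairs.add(new String[] { item.substring(0, slash), item.substring(slash + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String tagged = "Mary/NNP is/VBZ studying/VBG in/IN Stanford/NNP University/NNP ,/, "
            + "which/WDT is/VBZ located/JJ at/IN California/NNP ./.";
        // Print only the proper nouns (tag NNP).
        for (String[] pair : parse(tagged)) {
            if (pair[1].equals("NNP")) {
                System.out.println(pair[0]);
            }
        }
    }
}
```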

For dependency parsing, anaphora (coreference) resolution and other functions, see the source code.

Apache OpenNLP and UW OpenIE Parser

Apache OpenNLP Parser:

import java.util.List;

public class OpenNLPExample {
    public static void main (String[] args) {
        String singleSent = "Most large cities in the US had morning and afternoon newspapers, but New York doesn't have on Thursday, Stanford University locates in California.";
        String paragraph = FileUtils.readNthParagraph("paragraphs.txt", 2);
        System.err.println("Create OpenNLP Parser...");
        OpenNLPParser opennlp = new OpenNLPParser();
        System.err.println("Done...");
        // Named entity detection
        List<String> persons = opennlp.findPerson(singleSent);
        System.out.println("Persons: " + persons);
        List<String> dates = opennlp.findDate(singleSent);
        System.out.println("Dates: " + dates);
        List<String> times = opennlp.findTime(singleSent);
        System.out.println("Time: " + times);
        List<String> locations = opennlp.findLocation(singleSent);
        System.out.println("Locations: " + locations);
        List<String> organizations = opennlp.findOrganization(singleSent);
        System.out.println("Organization: " + organizations);
        // Tokenize, pos tagging, chunking
        List<String> sentences = opennlp.sentenceTokenize(paragraph); // segment paragraph into sentences
        for (String sentence : sentences) {
            List<String> tokens = opennlp.tokenize(sentence);
            List<String> tags = opennlp.tag(sentence);
            List<String> chunks = opennlp.chunk(sentence);
            for (int i = 0; i < tokens.size(); i++)
                System.out.println(tokens.get(i) + "\t" + tags.get(i) + "\t" + chunks.get(i));
            List<ChunkedPhrase> chunkedPhrases = opennlp.chunkedPhrases(sentence);
            chunkedPhrases.forEach(System.out::println);
        }
    }
}
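
The token, tag and chunk columns printed above can be merged into phrases by hand. The sketch below assumes that the chunk column uses the B-/I-/O tag scheme produced by the Apache OpenNLP chunker (for example B-NP, I-NP, O); whether the project's ChunkedPhrase is built this way is not confirmed by this README, and BioChunkGrouper is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of the project): groups tokens into phrases
// from parallel lists of tokens and B-/I-/O chunk tags.
class BioChunkGrouper {
    static List<String> group(List<String> tokens, List<String> chunkTags) {
        List<String> phrases = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentType = null; // type of the phrase being built, or null
        for (int i = 0; i < tokens.size(); i++) {
            String tag = chunkTags.get(i);
            boolean continues = tag.startsWith("I-")
                && currentType != null
                && tag.substring(2).equals(currentType);
            if (continues) {
                current.append(' ').append(tokens.get(i));
                continue;
            }
            // Any other tag closes the phrase that is currently open.
            if (currentType != null) {
                phrases.add("[" + currentType + " " + current + "]");
                currentType = null;
                current.setLength(0);
            }
            // B-X starts a new phrase; a stray I-X is treated the same way.
            if (tag.startsWith("B-") || tag.startsWith("I-")) {
                currentType = tag.substring(2);
                current.append(tokens.get(i));
            }
        }
        if (currentType != null) {
            phrases.add("[" + currentType + " " + current + "]");
        }
        return phrases;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Stanford", "University", "is", "located", "in", "California", ".");
        List<String> tags = Arrays.asList("B-NP", "I-NP", "B-VP", "I-VP", "B-PP", "B-NP", "O");
        // prints [NP Stanford University], [VP is located], [PP in], [NP California], one per line
        group(tokens, tags).forEach(System.out::println);
    }
}
```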

UW OpenIE Parser:

import java.util.List;
import java.util.stream.Collectors;

public class OpenIEExample {
    public static void main (String[] args) {
        String singleSent = "The U.S. president Barack Obama gave his speech on Tuesday at White House to thousands of people";
        System.err.println("Create OpenIE Parser...");
        OpenIEParser openie = new OpenIEParser();
        System.err.println("Done...");
        List<Token> tokens = openie.tokenize(singleSent);
        System.out.println(tokens);
        List<String> tokensStr = openie.tokenize2String(singleSent);
        System.out.println(tokensStr);
        List<PostaggedToken> tags = openie.posTag(singleSent);
        System.out.println(tags);
        List<String> tagsStr = openie.posTag2String(singleSent);
        System.out.println(tagsStr);
        List<ChunkedToken> chunks = openie.chunk(singleSent);
        System.out.println(chunks);
        List<String> chunksStr = openie.chunk2String(singleSent);
        System.out.println(chunksStr);
        List<ChunkedPhrase> chunkedPhrases = openie.getChunkedPhrases(singleSent);
        List<String> list = chunkedPhrases.stream().map(ChunkedPhrase::toString).collect(Collectors.toList());
        System.out.println(String.join(", ", list).concat("\n"));
        // Extract information
        System.err.println("Information Extraction Demo...");
        String paragraph = FileUtils.readNthParagraph("paragraphs.txt", 3);
        List<String> sentences = openie.sentenceTokenize(paragraph);
        for (String sentence : sentences) {
            List<ArgumentPhrase> argumentPhrases = openie.extract(sentence);
            argumentPhrases.forEach(arg -> System.out.println(arg.toString()));
            System.out.println();
            List<ChunkedPhrase> chunked = openie.getChunkedPhrases(sentence);
            chunked.forEach(c -> System.out.println(c.toString()));
            List<POSTagPhrase> posTagPhrases = openie.getPosTagPhrases(sentence);
            System.out.println("\n" + posTagPhrases.toString());
        }
    }
}

ConceptNet and ClausIE Parser

ConceptNet Parser: a simple HTTP requester and component extractor. It sends a request to the ConceptNet API, receives the response as JSON, then uses Gson to extract the useful information and store it in ConceptPhrase objects (each holding an entity1-relation-entity2 triple, a weight and an example).

import java.util.List;

public class ConceptNetExample {
    public static void main (String[] args) {
        ConceptNetParser conceptnet = new ConceptNetParser();
        String phrase = "plant_tree";
        System.out.println("Raw Json Response: ".concat(conceptnet.getResponse(phrase)).concat("\n"));
        List<ConceptPhrase> conceptPhrases = conceptnet.extractConceptPhrases(phrase);
        System.out.println("Number of Concept Phrases: " + conceptPhrases.size() + "\n");
        for (ConceptPhrase conceptPhrase : conceptPhrases) System.out.println(conceptPhrase.toString().concat("\n"));
        List<Triple<String, String, String>> triples = conceptnet.extractTriples(phrase);
        triples.forEach(t -> System.out.println("[" + t.first() + ", " + t.second() + ", " + t.third() + "]"));
    }
}
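
To illustrate the request step only, here is a rough sketch that turns a phrase into a lookup URI and fetches the raw JSON with the JDK's built-in HTTP client (Java 11 or later). The base URL http://api.conceptnet.io and the /c/{language}/{term} path are assumptions based on the public ConceptNet 5 web API; the endpoint this project targets may differ. ConceptNetLookupSketch and its methods are hypothetical names, and JSON parsing with Gson is left out.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch (not part of the project): builds a ConceptNet lookup
// URI for a phrase and optionally fetches the raw JSON response.
class ConceptNetLookupSketch {
    // Assumption: public ConceptNet 5 API endpoint.
    static final String BASE = "http://api.conceptnet.io";

    static URI conceptUri(String phrase, String lang) {
        // ConceptNet terms are lower-case with underscores instead of spaces.
        String term = phrase.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", "_");
        return URI.create(BASE + "/c/" + lang + "/" + URLEncoder.encode(term, StandardCharsets.UTF_8));
    }

    static String fetch(URI uri) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri)
            .header("Accept", "application/json")
            .GET()
            .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        URI uri = conceptUri("plant tree", "en");
        System.out.println(uri);
        // The network call only runs when explicitly requested.
        if (args.length > 0 && args[0].equals("--fetch")) {
            System.out.println(fetch(uri));
        }
    }
}
```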

ClausIE Parser: TODO
