This project forked from boudinfl/takahe

takahe3 is a multi-sentence compression module

License: MIT License

Python 100.00%
nlp python3 unsupervised-nlp

takahe3

Unsupervised multi-sentence compression

takahe3 is a Python 3 port of the takahe multi-sentence compression package. Given a set of redundant sentences, it builds a word graph by adding the sentences to it one at a time. The best compression is then obtained by finding the shortest path through the word graph. The original algorithm is described in:

  • Katja Filippova, Multi-Sentence Compression: Finding Shortest Paths in Word Graphs, Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 322-330, 2010.
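
To make the word-graph idea concrete, here is a minimal sketch, not the takahe3 implementation and not Filippova's exact method. It assumes that every identical lowercased (word, tag) pair shares one node, and that an edge weight is simply the inverse of how often the edge was seen; the published algorithm uses more careful node mapping and edge weighting. The names `build_word_graph`, `shortest_path`, `START` and `END` are hypothetical.

```python
import heapq
from collections import defaultdict

# Hypothetical sentinel nodes marking sentence start and end.
START, END = ("<s>", "<s>"), ("</s>", "</s>")


def build_word_graph(tagged_sentences):
    """Merge 'word/TAG' sentences into one weighted directed graph."""
    edge_counts = defaultdict(int)
    for sentence in tagged_sentences:
        tokens = [tuple(tok.rsplit("/", 1)) for tok in sentence.split()]
        # Assumption: identical lowercased (word, tag) pairs share a node.
        nodes = [START] + [(w.lower(), t) for w, t in tokens] + [END]
        for a, b in zip(nodes, nodes[1:]):
            edge_counts[(a, b)] += 1
    graph = defaultdict(dict)
    for (a, b), count in edge_counts.items():
        # Assumption: frequent transitions are cheap (weight = 1 / count).
        graph[a][b] = 1.0 / count
    return graph


def shortest_path(graph, source=START, target=END):
    """Dijkstra's algorithm; returns (cost, nodes between source and target)."""
    queue = [(0.0, [source])]
    visited = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == target:
            return cost, path[1:-1]
        if node in visited:
            continue
        visited.add(node)
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in visited:
                heapq.heappush(queue, (cost + weight, path + [nxt]))
    return None


demo = [
    "Hillary/NNP Clinton/NNP visited/VBD China/NNP ./PUNCT",
    "Hillary/NNP Clinton/NNP visited/VBD China/NNP last/JJ Monday/NNP ./PUNCT",
    "Ms./NNP Clinton/NNP visited/VBD Chinese/JJ officials/NNS ./PUNCT",
]
cost, path = shortest_path(build_word_graph(demo))
print("%.3f: %s" % (cost, " ".join(word for word, _ in path)))
```

The sketch has no minimum-length constraint, so it tends to return the shortest well-attested sentence; a practical compressor would also enforce a minimum number of words and rank several candidate paths rather than one.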

A keyphrase-based reranking method can be applied to generate more informative compressions. The reranking method is described in:

  • Florian Boudin and Emmanuel Morin, Keyphrase Extraction for N-best Reranking in Multi-Sentence Compression, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), 2013.

Requirements

  • Python 3.5+

pip installs all other dependencies automatically; see requirements.txt for the complete list.

Installation

You can install from this GitHub repository with:

git clone https://github.com/Nyctanthous/takahe3.git
cd takahe3
pip install .

Note that this package expects every input word to carry a Part-of-Speech (POS) tag, written as word/TAG with tokens separated by spaces (see the Example section). nltk is a good choice for this task.
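
As an illustration of the expected input format, the sketch below turns (word, tag) pairs, the shape that nltk's `pos_tag` returns, into a word/TAG string. The helper name `to_tagged_sentence` is hypothetical, and the rule that any tag without letters is rewritten to a single punctuation tag is an assumption made for this example, not documented package behavior. The pairs are hard-coded so the snippet runs without nltk installed.

```python
def to_tagged_sentence(pairs, punct_tag="PUNCT"):
    """Join (word, tag) pairs into space-separated 'word/TAG' tokens."""
    tokens = []
    for word, tag in pairs:
        # Assumption: tags with no letters (".", ",", ":") mark punctuation
        # and are collapsed into one punctuation tag.
        if not any(ch.isalpha() for ch in tag):
            tag = punct_tag
        tokens.append("{}/{}".format(word, tag))
    return " ".join(tokens)


# Hard-coded pairs in the (word, tag) shape produced by a POS tagger.
pairs = [("Hillary", "NNP"), ("Clinton", "NNP"), ("visited", "VBD"),
         ("China", "NNP"), (".", ".")]
print(to_tagged_sentence(pairs))
```

Whatever punctuation tag you choose here should match the `punct_tag` value passed to `WordGraph`.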

Example

A typical usage of this module is:

from takahe3.takahe import WordGraph, KeyphraseReranker

sentences = ["The/DT wife/NN of/IN a/DT former/JJ U.S./NNP president/NN \
              Bill/NNP Clinton/NNP Hillary/NNP Clinton/NNP visited/VBD \
              China/NNP last/JJ Monday/NNP ./PUNCT", "Hillary/NNP Clinton/NNP \
              wanted/VBD to/TO visit/VB China/NNP last/JJ month/NN but/CC \
              postponed/VBD her/PRP$ plans/NNS till/IN Monday/NNP last/JJ \
              week/NN ./PUNCT", "Hillary/NNP Clinton/NNP paid/VBD a/DT \
              visit/NN to/TO the/DT People/NNP Republic/NNP of/IN China/NNP \
              on/IN Monday/NNP ./PUNCT", "Last/JJ week/NN the/DT \
              Secretary/NNP of/IN State/NNP Ms./NNP Clinton/NNP visited/VBD \
              Chinese/JJ officials/NNS ./PUNCT"]

# Create a word graph from the set of sentences with these parameters:
# - minimum number of words in the compression: 6
# - language of the input sentences: en (English)
# - POS tag for punctuation marks: PUNCT
compressor = WordGraph(sentences, nb_words=6, lang='en', punct_tag="PUNCT")

# Get the 5 best paths
candidates = compressor.get_compression(5)

# Re-rank compressions by keyphrases (Boudin and Morin's method)
reranker = KeyphraseReranker(sentences, candidates, lang="en")

reranked_candidates = reranker.rerank_nbest_compressions()

# Print each re-ranked candidate with its score
for score, path in reranked_candidates:
    # The first item of each path element is the word itself
    print("%.3f: %s" % (score, " ".join([u[0] for u in path])))

If you choose to use nltk to tag words with parts of speech, a utility is provided. Once you install and configure nltk, an example usage is:

from takahe3.utilities import tag_text_part_of_speech

text = "The wife of former U.S. president Bill Clinton, Hillary Clinton, visited China last Monday. Hillary Clinton wanted to visit China last month but postponed her plans till Monday last week. Hillary Clinton paid a visit to the People's Republic of China on Monday. Secretary of State Ms. Clinton visited Chinese officials."
sentences = tag_text_part_of_speech(text)

# Now, you can process your text as in the example above
