batuhankursatunal / extractive-text-summarizer-octopus

Automatic Extractive Text Summarization Algorithm

Python 100.00%
nlp nltk pandas scikit-learn summarization-algorithm

extractive-text-summarizer-octopus's Introduction

This is an automatic extractive text summarization algorithm.

Extractive text summarization builds a summary using only text that already appears in the original document. Compared with abstractive summarization algorithms, extractive ones are easier to implement and do not necessarily require training a neural network, but they are generally less accurate and less useful.

Extractive document summarization algorithms rank the pre-processed sentences of the original text according to selected features and build the summary solely from these ranked sentences. This project follows TextRank, a graph-based summarization algorithm inspired by PageRank. Sentences are represented as nodes, and the connections between them are the edges. After the text documents are pre-processed, features are extracted and used to build a cosine similarity matrix, which is then used to construct the graph and finally rank the sentences.
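To make the ranking step concrete, here is a minimal sketch of the TextRank idea using only the Python standard library. It is not the project's code: the function names (`cosine_similarity`, `similarity_matrix`, `pagerank`, `summarize`), the damping factor of 0.85, and the use of raw word counts as features are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(sentences):
    """Pairwise cosine similarities; the diagonal is zeroed (no self-loops)."""
    vectors = [Counter(tokens) for tokens in sentences]
    n = len(vectors)
    return [
        [0.0 if i == j else cosine_similarity(vectors[i], vectors[j])
         for j in range(n)]
        for i in range(n)
    ]


def pagerank(weights, damping=0.85, iterations=100, tol=1e-8):
    """Weighted PageRank by power iteration over an edge-weight matrix."""
    n = len(weights)
    if n == 0:
        return []
    out_weight = [sum(row) for row in weights]
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by nodes without edges is spread evenly over all nodes.
        dangling = sum(ranks[j] for j in range(n) if out_weight[j] == 0.0)
        new_ranks = []
        for i in range(n):
            incoming = sum(
                ranks[j] * weights[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0.0
            )
            new_ranks.append(
                (1.0 - damping) / n + damping * (incoming + dangling / n)
            )
        converged = sum(abs(x - y) for x, y in zip(new_ranks, ranks)) < tol
        ranks = new_ranks
        if converged:
            break
    return ranks


def summarize(sentences, top_k=2):
    """Return the indices of the top_k ranked sentences, in document order."""
    ranks = pagerank(similarity_matrix(sentences))
    best = sorted(range(len(sentences)), key=lambda i: ranks[i], reverse=True)
    return sorted(best[:top_k])
```

Each sentence is passed in as a list of already pre-processed tokens. Returning the selected indices in document order keeps the summary readable, which is a design choice of this sketch rather than something the project states.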

Project outline

The main dependencies are NLTK, used mainly for its tokenizers and lemmatizers in the pre-processing step, and scikit-learn, used for feature extraction.

The pre-processing step includes special character and punctuation removal, case conversion, tokenization, stop-word removal, and lemmatization.
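The five pre-processing steps can be sketched as a single function. To stay runnable without downloading NLTK data, this sketch uses standard-library stand-ins: the `STOP_WORDS` set and the `LEMMAS` lookup table are tiny hypothetical placeholders, and the whitespace split is a simplification. In practice, NLTK's `word_tokenize`, its `stopwords` corpus, and `WordNetLemmatizer` could take their place.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset(
    {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}
)

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"cats": "cat", "mats": "mat", "sitting": "sit", "sat": "sit"}


def preprocess(sentence: str) -> list[str]:
    """Clean one sentence and return its normalized tokens."""
    # 1. Special character and punctuation removal
    #    (keep ASCII letters, digits, and whitespace).
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", sentence)
    # 2. Case conversion.
    cleaned = cleaned.lower()
    # 3. Tokenization (simple whitespace split).
    tokens = cleaned.split()
    # 4. Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 5. Lemmatization via the toy lookup table; unknown words pass through.
    return [LEMMAS.get(t, t) for t in tokens]
```

For example, `preprocess("The cats are sitting on the mats!")` returns `["cat", "sit", "mat"]`.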

Feature extraction follows, with three variants: N-gram bag of words, word frequency vectorizer, and TF-IDF vectorizer. Finally, sentence ranks are calculated using the PageRank algorithm, and summaries are generated for the News category of the Brown corpus.
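As a rough illustration of what these feature types compute, the sketch below builds an N-gram bag-of-words count matrix and then applies TF-IDF weighting, using only the standard library. The function names (`ngrams`, `bag_of_ngrams`, `tf_idf`) and the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 are assumptions for illustration; the sketch also skips the row normalization that vectorizers often apply. scikit-learn offers these feature types through `CountVectorizer` and `TfidfVectorizer`.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(docs, n_min=1, n_max=2):
    """Count n-grams per document; return (sorted vocabulary, count matrix)."""
    counts = [
        Counter(g for n in range(n_min, n_max + 1) for g in ngrams(doc, n))
        for doc in docs
    ]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def tf_idf(matrix):
    """Scale raw counts by a smoothed inverse document frequency."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]
```

With `n_min=1, n_max=1` the count matrix is a plain word frequency representation. Terms that occur in every document keep a weight equal to their count, while rarer terms are scaled up.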

Data

The Brown and Reuters corpora are used via the NLTK library. The Brown corpus is the main set the model uses to generate summaries; the Reuters corpus is used only for trial purposes.
