This project compares the KL-Sum, TF-IDF, LexRank, and LSA algorithms for text summarization.
All of these algorithms employ an extractive summarization methodology: important sentences are selected from the original document and concatenated to form a summary.
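To illustrate the extractive approach in general terms, here is a minimal sketch of a word-frequency-based sentence scorer. This is an illustrative toy, not the scoring used by KL-Sum, TF-IDF, LexRank, or LSA; each of those uses its own, more sophisticated sentence-ranking criterion.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Toy extractive summarizer: score each sentence by the total
    corpus frequency of its words, then keep the top-scoring
    sentences in their original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    scored = [(sum(freq[w] for w in re.findall(r'\w+', s.lower())), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:num_sentences]
    # Re-sort the selected sentences by position and concatenate.
    return ' '.join(s for _, _, s in sorted(top, key=lambda t: t[1]))
```

The key property shared with the project's algorithms is that the summary consists only of sentences copied verbatim from the input, never newly generated text.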
The paper associated with this project was published in the peer-reviewed journal IJCST and can be found here: http://www.ijcstjournal.org/volume-4/issue-3/IJCST-V4I3P63.pdf
- Install Python 3
- Run pip install -r requirements.txt
- Open a Python shell and run these commands:
- import nltk
- nltk.download('punkt')
- nltk.download('stopwords')
- Open each algorithm's folder and run its Python script to generate summaries of the text files in the input folder.
- Finally, run the comparison scripts.
The input files have word counts ranging from 500 to 25,000.
The CSV files for each algorithm contain the word count of each text file and the time required to generate its summary.
The corresponding output files contain the automatically generated summaries.
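A benchmark row of this kind can be sketched with the standard-library csv module. The column names below (file, word_count, time_seconds) are assumptions for illustration; the project's actual CSV headers may differ.

```python
import csv
import io

def record_run(writer, filename, word_count, seconds):
    """Append one benchmark row: which input file, its word count,
    and how long summary generation took (hypothetical schema)."""
    writer.writerow([filename, word_count, f"{seconds:.3f}"])

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["file", "word_count", "time_seconds"])
record_run(writer, "input1.txt", 500, 0.012)
record_run(writer, "input2.txt", 25000, 0.8419)
```

Writing word count alongside timing makes it straightforward to plot generation time against document size when comparing the algorithms.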
Natural Language Toolkit (NLTK) -
Open source Python library for Natural Language Processing.
http://www.nltk.org/
Sumy -
Python library and command-line utility (version 0.4.1) used for
extracting summaries from HTML pages and plain-text documents.
https://pypi.python.org/pypi/sumy
For larger files (those with a greater word count), LSA is faster than LexRank; for smaller files (those with a smaller word count), LexRank is faster.
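A comparison of this kind can be measured with a small timing wrapper. This is a generic sketch using time.perf_counter; the stand-in "summarizer" below is a placeholder, not any of the project's algorithms.

```python
import time

def timed(fn, *args):
    """Run one summarization call and measure its wall-clock
    duration, returning (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Placeholder "summarizer" that just takes the first sentence.
summary, elapsed = timed(lambda text: text.split(". ")[0],
                         "First sentence. Second sentence.")
```

Recording elapsed time per input file, across a range of word counts, is what makes the LSA-vs-LexRank speed crossover observable.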
Future work includes comparing the quality of the generated summaries, in addition to speed, for greater accuracy and ease of summarization.