The Opinosis dataset contains 51 articles. Each article is about a single product feature, such as the iPod's battery life, and is a collection of reviews by customers who purchased that product. Each article in the dataset has 5 manually written "gold" summaries. Usually the 5 gold summaries differ, but they can also be the same text repeated 5 times.
CNN contains the documents and accompanying questions from the news articles of CNN. There are approximately 90k documents and 380k questions.
Daily Mail contains the documents and accompanying questions from the news articles of Daily Mail. There are approximately 197k documents and 879k questions.
The processed CNN and Daily Mail datasets are simply the concatenation of all data instances, keeping only the document, question, and answer fields as inputs.
Large Scale Chinese Short Text Summarization Dataset (LCSTS): this corpus is constructed from the Chinese microblogging website Sina Weibo. It consists of over 2 million real Chinese short texts, each with a short summary given by the author of the text.
sumeval, implemented in Python, is a well-tested, multi-language evaluation framework for text summarization.
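Computing ROUGE scores with sumeval might look like the following minimal sketch (based on the library's documented API; exact signatures may vary between versions):

```python
from sumeval.metrics.rouge import RougeCalculator

# ROUGE calculator with English stop-word removal.
rouge = RougeCalculator(stopwords=True, lang="en")

summary = "the cat sat on the mat"
references = ["a cat was sitting on the mat"]

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence).
print(rouge.rouge_n(summary=summary, references=references, n=1))
print(rouge.rouge_l(summary=summary, references=references))
```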
sumy is a simple library and command line utility for extracting summaries from HTML pages or plain texts. The package also contains a simple evaluation framework for text summaries. Implemented summarization methods are Luhn, Edmundson, LSA, LexRank, TextRank, SumBasic and KL-Sum.
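A minimal usage sketch with sumy's LexRank summarizer (modeled on the examples in sumy's README; the input text here is a made-up placeholder):

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.utils import get_stop_words

text = (
    "The battery life of this player is excellent. "
    "It easily lasts a full day of listening. "
    "The screen, however, scratches far too easily. "
    "Customer support was slow to respond to my questions."
)

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer(Stemmer("english"))
summarizer.stop_words = get_stop_words("english")

# Extract the 2 most central sentences according to LexRank.
for sentence in summarizer(parser.document, 2):
    print(sentence)
```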
TextRank4ZH implements the TextRank algorithm to extract keywords/phrases and to summarize Chinese text. It is written in Python.
snownlp is a Python library for processing Chinese text.
PKUSUMSUM is an integrated toolkit for automatic document summarization. It supports single-document, multi-document and topic-focused multi-document summarization, and a variety of summarization methods have been implemented in the toolkit. It supports Western languages (e.g. English) and Chinese.
fnlp is a toolkit for Chinese natural language processing.
They proposed to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun and Sanja Fidler. Skip-Thought Vectors. arXiv:1506.06726, 2015. The source code in Python is skip-thoughts.
Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov. Enriching Word Vectors with Subword Information. arXiv:1607.04606, 2017. The source code in C++11 is fastText, which is a library for efficient learning of word representations and sentence classification.
Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer and Hervé Jégou. Word Translation Without Parallel Data. arXiv:1710.04087, 2017. The source code in Python is MUSE, which is a library for multilingual unsupervised or supervised word embeddings.
David M. Blei, Andrew Y. Ng and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003. The source code in Python is sklearn.decomposition.LatentDirichletAllocation. The idea is to reimplement Luhn's algorithm, but with topics instead of words, applied to several documents instead of one (a sketch follows the steps below):
Train LDA on all products of a certain type (e.g. all the books)
Treat all the reviews of a particular product as one document, and infer their topic distribution
Infer the topic distribution for each sentence
For each topic that dominates the reviews of a product, pick some sentences that are themselves dominated by that topic.
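A minimal sketch of these steps with scikit-learn's LatentDirichletAllocation (the review texts, the number of topics, and the choice of picking the top 2 topics and one sentence per topic are illustrative assumptions, not prescribed values):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical corpus: one string per product of the same type (e.g. all books),
# plus the individual review sentences of the product we want to summarize.
all_product_reviews = [
    "battery life is great and the screen is sharp",
    "shipping was slow but the battery lasts a long time",
    "the screen scratches easily and support was unhelpful",
]
target_sentences = [
    "The battery life is excellent.",
    "The screen is bright and sharp.",
    "Customer support never answered my emails.",
]

# 1. Train LDA on all products of a certain type.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(all_product_reviews)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

# 2. Treat all reviews of the target product as one document and infer its topics.
product_doc = " ".join(target_sentences)
product_topics = lda.transform(vectorizer.transform([product_doc]))[0]

# 3. Infer the topic distribution for each sentence.
sentence_topics = lda.transform(vectorizer.transform(target_sentences))

# 4. For each dominant topic of the product, pick the sentence most dominated by it.
summary = []
for topic in np.argsort(product_topics)[::-1][:2]:
    best_sentence = int(np.argmax(sentence_topics[:, topic]))
    summary.append(target_sentences[best_sentence])
print(summary)
```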
Rada Mihalcea and Paul Tarau. TextRank: Bringing Order into Texts. ACL, 2004. The source code in Python is pytextrank. pytextrank works in four stages, each feeding its output to the next (a usage sketch follows these stages):
Part-of-Speech Tagging and lemmatization are performed for every sentence in the document.
Key phrases are extracted along with their counts, and are normalized.
A score is calculated for each sentence by approximating the Jaccard distance between the sentence and the key phrases.
The document is summarized based on the most significant sentences and key phrases.
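A minimal usage sketch of pytextrank as a spaCy pipeline component (the four staged steps above describe the original implementation; attribute and method names here follow more recent pytextrank releases and may differ across versions):

```python
import spacy
import pytextrank  # registers the "textrank" spaCy pipeline component

nlp = spacy.load("en_core_web_sm")  # requires the small English model to be installed
nlp.add_pipe("textrank")

doc = nlp(
    "Compatibility of systems of linear constraints over the set of natural "
    "numbers is considered. Criteria of compatibility of a system are given."
)

# Ranked key phrases extracted by TextRank.
for phrase in doc._.phrases[:5]:
    print(phrase.rank, phrase.count, phrase.text)

# Sentence-level summary built from the top-ranked phrases.
for sent in doc._.textrank.summary(limit_phrases=10, limit_sentences=2):
    print(sent)
```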
Federico Barrios, Federico López, Luis Argerich and Rosa Wachenchauzer. Variations of the Similarity Function of TextRank for Automated Summarization. arXiv:1602.03606, 2016. The source code in Python is gensim.summarization. Gensim's summarization only works for English for now, because the text is pre-processed so that stop words are removed and the words are stemmed, and these processes are language-dependent. TextRank works as follows (a usage sketch follows these steps):
Pre-process the text: remove stop words and stem the remaining words.
Create a graph where vertices are sentences.
Connect every sentence to every other sentence by an edge. The weight of the edge is how similar the two sentences are.
Run the PageRank algorithm on the graph.
Pick the vertices (sentences) with the highest PageRank score.
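A minimal usage sketch of gensim's summarization module (it shipped with gensim up to the 3.x series and was removed in gensim 4.0; the input text is a made-up placeholder and should contain several sentences):

```python
from gensim.summarization import summarize, keywords

text = (
    "Automatic summarization is the process of shortening a text document. "
    "Extractive methods select a subset of existing sentences to form the summary. "
    "Abstractive methods build an internal semantic representation and generate new text. "
    "Graph-based methods such as TextRank rank sentences by running PageRank "
    "on a sentence similarity graph. "
    "The highest ranking sentences are then returned as the summary."
)

# Extractive TextRank summary keeping roughly 40% of the sentences.
print(summarize(text, ratio=0.4))

# TextRank keyword extraction from the same module.
print(keywords(text))
```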
TextTeaser uses basic summarization features and builds on them (a rough scoring sketch follows this list). Those features are:
Title feature is used to score the sentence with regard to the document title. It is calculated as the count of words that the sentence has in common with the title.
Sentence length is scored according to how many words are in the sentence. TextTeaser defines a constant "ideal" (with value 20), which represents the ideal sentence length in number of words. Sentence length is scored as a normalized distance from this value.
Sentence position is where the sentence is located in the document. Sentences in the introduction and the conclusion receive a higher score for this feature.
Keyword frequency is just the frequency of the words used in the whole text in the bag-of-words model (after removing stop words).
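A rough sketch of the title and sentence-length features described above (illustrative only, not TextTeaser's actual code; the whitespace tokenization and the normalization of the title score are simplifying assumptions):

```python
IDEAL_SENTENCE_LENGTH = 20  # TextTeaser's "ideal" constant


def title_score(title: str, sentence: str) -> float:
    """Share of title words that also appear in the sentence."""
    title_words = {w.lower() for w in title.split()}
    sentence_words = {w.lower() for w in sentence.split()}
    if not title_words:
        return 0.0
    return len(title_words & sentence_words) / len(title_words)


def length_score(sentence: str) -> float:
    """Normalized distance of the sentence length from the ideal length.

    Close to 1.0 near 20 words; can drop below 0 for very long sentences.
    """
    n_words = len(sentence.split())
    return 1.0 - abs(IDEAL_SENTENCE_LENGTH - n_words) / IDEAL_SENTENCE_LENGTH


print(title_score("iPod battery life", "The battery life of the iPod is great."))
print(length_score("The battery life of the iPod is great."))
```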
LexRank uses IDF-modified cosine as the similarity measure between two sentences. This similarity is used as the weight of the graph edge between the two sentences. LexRank also incorporates an intelligent post-processing step which makes sure that the top sentences chosen for the summary are not too similar to each other.
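The IDF-modified cosine from the LexRank paper translates directly into code; a small sketch, assuming whitespace-tokenized sentences and precomputed IDF values (the IDF numbers below are hypothetical):

```python
import math
from collections import Counter


def idf_modified_cosine(sent_x, sent_y, idf):
    """IDF-modified cosine similarity between two tokenized sentences."""
    tf_x, tf_y = Counter(sent_x), Counter(sent_y)
    common = set(tf_x) & set(tf_y)
    numerator = sum(tf_x[w] * tf_y[w] * idf.get(w, 0.0) ** 2 for w in common)
    norm_x = math.sqrt(sum((tf_x[w] * idf.get(w, 0.0)) ** 2 for w in tf_x))
    norm_y = math.sqrt(sum((tf_y[w] * idf.get(w, 0.0)) ** 2 for w in tf_y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return numerator / (norm_x * norm_y)


idf = {"battery": 2.1, "life": 1.7, "screen": 2.3}  # hypothetical IDF values
s1 = "the battery life is great".split()
s2 = "battery life could be better".split()
print(idf_modified_cosine(s1, s2, idf))
```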
They use sequence-to-sequence encoder-decoder LSTM with attention.
They use the first sentence of a document. The source document is quite small (about 1 paragraph or ~500 words in the training dataset of Gigaword) and the produced output is also very short (about 75 characters). It remains an open challenge to scale up these limits - to produce longer summaries over multi-paragraph text input (even good LSTM models with attention models fall victim to vanishing gradients when the input sequences become longer than a few hundred items).
The evaluation method used for automatic summarization has traditionally been the ROUGE metric - which has been shown to correlate well with human judgment of summary quality, but also has a known tendency to encourage "extractive" summarization - so that using ROUGE as a target metric to optimize will lead a summarizer towards a copy-paste behavior of the input instead of the hoped-for reformulation type of summaries.
They use a GRU-based encoder-decoder with attention, with a bidirectional encoder.
They use the first 2 sentences of a document, with a limit of 120 words.
They use the Large vocabulary trick (LVT) of Jean et al. 2014, which means when you decode, use only the words that appear in the source - this reduces perplexity. But then you lose the capability to do "abstractive" summary. So they do "vocabulary expansion" by adding a layer of "word2vec nearest neighbors" to the words in the input.
Feature-rich encoding - they concatenate TF-IDF and named-entity-type features to the word embeddings, which adds encoding dimensions that reflect the "importance" of the words.
The most interesting of all is what they call the "Switching Generator/Pointer" layer. In the decoder, they add a layer that decides to either generate a new word based on the context / previously generated word (the usual decoder) or copy a word from the input (that is, add a pointer to the input). They learn when to generate vs. point and, when pointing, which word of the input to point to.
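A highly simplified PyTorch-style sketch of such a switch (illustrative only, not the paper's implementation: the paper uses a hard generate/point decision, whereas this sketch mixes the two distributions softly, closer to later pointer-generator formulations; shapes and parameterization are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchingDecoderHead(nn.Module):
    """Mix a generation distribution over the vocabulary with a pointer
    distribution over source positions, weighted by a learned switch."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.generate = nn.Linear(hidden_size, vocab_size)  # generator head
        self.switch = nn.Linear(hidden_size, 1)              # P(generate) head

    def forward(self, decoder_state, attention_weights, src_token_ids):
        # decoder_state: (batch, hidden)
        # attention_weights, src_token_ids: (batch, src_len)
        p_gen = torch.sigmoid(self.switch(decoder_state))             # (batch, 1)
        gen_dist = F.softmax(self.generate(decoder_state), dim=-1)    # (batch, vocab)

        # Pointer distribution: scatter attention mass onto source token ids.
        copy_dist = torch.zeros_like(gen_dist)
        copy_dist.scatter_add_(1, src_token_ids, attention_weights)   # (batch, vocab)

        # Soft mixture of generating a vocabulary word and copying from the input.
        return p_gen * gen_dist + (1.0 - p_gen) * copy_dist
```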
Ryang, Seonggi, and Takeshi Abekawa. "Framework of automatic text summarization using reinforcement learning." In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 256-265. Association for Computational Linguistics, 2012. [not neural-based methods]
Rushdi Shams, M. M. A. Hashem, Afrina Hossain, Suraiya Rumana Akter and Monika Gope. Corpus-based Web Document Summarization using Statistical and Linguistic Approach. arXiv:1304.2476. In Proceedings of the IEEE International Conference on Computer and Communication Engineering (ICCCE10), pp. 115-120, Kuala Lumpur, Malaysia, May 11-13, 2010.
They addressed an important problem in sequence-to-sequence (Seq2Seq) learning referred to as copying, in which certain segments of the input sequence are selectively replicated in the output sequence. In this paper, they incorporated copying into neural network-based Seq2Seq learning and proposed a new model called CopyNet with an encoder-decoder structure. CopyNet nicely integrates the regular way of word generation in the decoder with a new copying mechanism that can choose sub-sequences of the input sequence and put them at proper places in the output sequence.
Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, Yu Chi. Deep Keyphrase Generation. arXiv:1704.06879, 2017. The source code written in Python is seq2seq-keyphrase.
They compared modern extractive methods like LexRank, LSA, Luhn and Gensim's existing TextRank summarization module on the Opinosis dataset of 51 (article, summary) pairs. They also tried an abstractive technique using TensorFlow's textsum model, but didn't obtain good results due to its extremely high hardware demands (7000 GPU hours).
Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, Krys Kochut. Text Summarization Techniques: A Brief Survey. arXiv:1707.02268, 2017.
They constructed a large-scale Chinese short text summarization dataset from the Chinese microblogging website Sina Weibo, which is released to the public. They then applied a GRU-based encoder-decoder method to it to generate summaries. They treated the whole short text as one sequence, which may not be very reasonable, because most short texts contain several sentences.
LCSTS contains 2,400,591 (short text, summary) pairs as the training set and 1,106 pairs as the test set.
All the models were trained on Tesla M2090 GPUs for about one week.
The results show that the RNN with context outperforms the RNN without context on both character-based and word-based input.
Moreover, the character-based input outperforms the word-based input.
Yang, Zichao, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. "Hierarchical Attention Networks for Document Classification." In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016.