The Opinosis dataset contains 51 articles. Each article is about a single product feature, such as the iPod's battery life, and is a collection of reviews by customers who purchased that product. Each article in the dataset has 5 manually written "gold" summaries. Usually the 5 gold summaries differ, but they can also be the same text repeated 5 times.
CNN contains the documents and accompanying questions from the news articles of CNN. There are approximately 90k documents and 380k questions.
Daily Mail contains the documents and accompanying questions from the news articles of Daily Mail. There are approximately 197k documents and 879k questions.
The processed CNN and Daily Mail datasets are simply the concatenation of all data instances, keeping only the document, question, and answer fields as inputs.
Large Scale Chinese Short Text Summarization Dataset (LCSTS): this corpus is constructed from the Chinese microblogging website Sina Weibo. It consists of over 2 million real Chinese short texts, each with a short summary given by the author of the text.
sumeval, implemented in Python, is a well-tested, multi-language evaluation framework for text summarization.
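Computing ROUGE scores with sumeval might look like the following minimal sketch (based on the library's documented API; exact signatures may vary between versions):

```python
from sumeval.metrics.rouge import RougeCalculator

# ROUGE calculator with English stop-word removal.
rouge = RougeCalculator(stopwords=True, lang="en")

summary = "the cat sat on the mat"
references = ["a cat was sitting on the mat"]

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence).
print(rouge.rouge_n(summary=summary, references=references, n=1))
print(rouge.rouge_l(summary=summary, references=references))
```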
sumy is a simple library and command line utility for extracting summaries from HTML pages or plain texts. The package also contains a simple evaluation framework for text summaries. Implemented summarization methods are Luhn, Edmundson, LSA, LexRank, TextRank, SumBasic and KL-Sum.
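A minimal usage sketch with sumy's LexRank summarizer (modeled on the examples in sumy's README; the input text here is a made-up placeholder):

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.utils import get_stop_words

text = (
    "The battery life of this player is excellent. "
    "It easily lasts a full day of listening. "
    "The screen, however, scratches far too easily. "
    "Customer support was slow to respond to my questions."
)

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer(Stemmer("english"))
summarizer.stop_words = get_stop_words("english")

# Extract the 2 most central sentences according to LexRank.
for sentence in summarizer(parser.document, 2):
    print(sentence)
```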
TextRank4ZH implements the TextRank algorithm to extract keywords/phrases and to summarize Chinese text. It is written in Python.
snownlp is a Python library for processing Chinese text.
PKUSUMSUM is an integrated toolkit for automatic document summarization. It supports single-document, multi-document and topic-focused multi-document summarization, and a variety of summarization methods have been implemented in the toolkit. It supports Western languages (e.g. English) and Chinese.
fnlp is a toolkit for Chinese natural language processing.
They proposed to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun and Sanja Fidler. Skip-Thought Vectors. arXiv:1506.06726, 2015. The source code in Python is skip-thoughts.
Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov. Enriching Word Vectors with Subword Information. arXiv:1607.04606, 2017. The source code in C++11 is fastText, which is a library for efficient learning of word representations and sentence classification.
Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer and Hervé Jégou. Word Translation Without Parallel Data. arXiv:1710.04087, 2017. The source code in Python is MUSE, which is a library for multilingual unsupervised or supervised word embeddings.
David M. Blei, Andrew Y. Ng and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003. The source code in Python is sklearn.decomposition.LatentDirichletAllocation. The idea is to reimplement Luhn's algorithm, but with topics instead of words, applied to several documents instead of one (a sketch follows the steps below):
Train LDA on all products of a certain type (e.g. all the books)
Treat all the reviews of a particular product as one document, and infer their topic distribution
Infer the topic distribution for each sentence
For each topic that dominates the reviews of a product, pick some sentences that are themselves dominated by that topic.
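A minimal sketch of these steps with scikit-learn's LatentDirichletAllocation (the review texts, the number of topics, and the choice of picking the top 2 topics and one sentence per topic are illustrative assumptions, not prescribed values):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical corpus: one string per product of the same type (e.g. all books),
# plus the individual review sentences of the product we want to summarize.
all_product_reviews = [
    "battery life is great and the screen is sharp",
    "shipping was slow but the battery lasts a long time",
    "the screen scratches easily and support was unhelpful",
]
target_sentences = [
    "The battery life is excellent.",
    "The screen is bright and sharp.",
    "Customer support never answered my emails.",
]

# 1. Train LDA on all products of a certain type.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(all_product_reviews)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

# 2. Treat all reviews of the target product as one document and infer its topics.
product_doc = " ".join(target_sentences)
product_topics = lda.transform(vectorizer.transform([product_doc]))[0]

# 3. Infer the topic distribution for each sentence.
sentence_topics = lda.transform(vectorizer.transform(target_sentences))

# 4. For each dominant topic of the product, pick the sentence most dominated by it.
summary = []
for topic in np.argsort(product_topics)[::-1][:2]:
    best_sentence = int(np.argmax(sentence_topics[:, topic]))
    summary.append(target_sentences[best_sentence])
print(summary)
```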
Rada Mihalcea and Paul Tarau. TextRank: Bringing Order into Texts. ACL, 2004. The source code in Python is pytextrank. pytextrank works in four stages, each feeding its output to the next (a usage sketch follows these stages):
Part-of-Speech Tagging and lemmatization are performed for every sentence in the document.
Key phrases are extracted along with their counts, and are normalized.
A score is calculated for each sentence by approximating the Jaccard distance between the sentence and the key phrases.
The document is summarized based on the most significant sentences and key phrases.
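A minimal usage sketch of pytextrank as a spaCy pipeline component (the four staged steps above describe the original implementation; attribute and method names here follow more recent pytextrank releases and may differ across versions):

```python
import spacy
import pytextrank  # registers the "textrank" spaCy pipeline component

nlp = spacy.load("en_core_web_sm")  # requires the small English model to be installed
nlp.add_pipe("textrank")

doc = nlp(
    "Compatibility of systems of linear constraints over the set of natural "
    "numbers is considered. Criteria of compatibility of a system are given."
)

# Ranked key phrases extracted by TextRank.
for phrase in doc._.phrases[:5]:
    print(phrase.rank, phrase.count, phrase.text)

# Sentence-level summary built from the top-ranked phrases.
for sent in doc._.textrank.summary(limit_phrases=10, limit_sentences=2):
    print(sent)
```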
Federico Barrios, Federico López, Luis Argerich and Rosa Wachenchauzer. Variations of the Similarity Function of TextRank for Automated Summarization. arXiv:1602.03606, 2016. The source code in Python is gensim.summarization. Gensim's summarization only works for English for now, because the text is pre-processed so that stop words are removed and the words are stemmed, and these processes are language-dependent. TextRank works as follows (a usage sketch follows these steps):
Pre-process the text: remove stop words and stem the remaining words.
Create a graph where vertices are sentences.
Connect every sentence to every other sentence by an edge. The weight of the edge is how similar the two sentences are.
Run the PageRank algorithm on the graph.
Pick the vertices (sentences) with the highest PageRank score.
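A minimal usage sketch of gensim's summarization module (it shipped with gensim up to the 3.x series and was removed in gensim 4.0; the input text is a made-up placeholder and should contain several sentences):

```python
from gensim.summarization import summarize, keywords

text = (
    "Automatic summarization is the process of shortening a text document. "
    "Extractive methods select a subset of existing sentences to form the summary. "
    "Abstractive methods build an internal semantic representation and generate new text. "
    "Graph-based methods such as TextRank rank sentences by running PageRank "
    "on a sentence similarity graph. "
    "The highest ranking sentences are then returned as the summary."
)

# Extractive TextRank summary keeping roughly 40% of the sentences.
print(summarize(text, ratio=0.4))

# TextRank keyword extraction from the same module.
print(keywords(text))
```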
TextTeaser uses basic summarization features and builds on them (a rough scoring sketch follows this list). Those features are:
Title feature is used to score the sentence with regard to the document title. It is calculated as the count of words that the sentence has in common with the title.
Sentence length is scored according to how many words are in the sentence. TextTeaser defines a constant "ideal" (with value 20), which represents the ideal sentence length in number of words. Sentence length is scored as a normalized distance from this value.
Sentence position is where the sentence is located in the document. Sentences in the introduction and the conclusion receive a higher score for this feature.
Keyword frequency is just the frequency of the words used in the whole text in the bag-of-words model (after removing stop words).
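A rough sketch of the title and sentence-length features described above (illustrative only, not TextTeaser's actual code; the whitespace tokenization and the normalization of the title score are simplifying assumptions):

```python
IDEAL_SENTENCE_LENGTH = 20  # TextTeaser's "ideal" constant


def title_score(title: str, sentence: str) -> float:
    """Share of title words that also appear in the sentence."""
    title_words = {w.lower() for w in title.split()}
    sentence_words = {w.lower() for w in sentence.split()}
    if not title_words:
        return 0.0
    return len(title_words & sentence_words) / len(title_words)


def length_score(sentence: str) -> float:
    """Normalized distance of the sentence length from the ideal length.

    Close to 1.0 near 20 words; can drop below 0 for very long sentences.
    """
    n_words = len(sentence.split())
    return 1.0 - abs(IDEAL_SENTENCE_LENGTH - n_words) / IDEAL_SENTENCE_LENGTH


print(title_score("iPod battery life", "The battery life of the iPod is great."))
print(length_score("The battery life of the iPod is great."))
```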
LexRank uses IDF-modified cosine as the similarity measure between two sentences. This similarity is used as the weight of the graph edge between the two sentences. LexRank also incorporates an intelligent post-processing step which makes sure that the top sentences chosen for the summary are not too similar to each other.
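The IDF-modified cosine from the LexRank paper translates directly into code; a small sketch, assuming whitespace-tokenized sentences and precomputed IDF values (the IDF numbers below are hypothetical):

```python
import math
from collections import Counter


def idf_modified_cosine(sent_x, sent_y, idf):
    """IDF-modified cosine similarity between two tokenized sentences."""
    tf_x, tf_y = Counter(sent_x), Counter(sent_y)
    common = set(tf_x) & set(tf_y)
    numerator = sum(tf_x[w] * tf_y[w] * idf.get(w, 0.0) ** 2 for w in common)
    norm_x = math.sqrt(sum((tf_x[w] * idf.get(w, 0.0)) ** 2 for w in tf_x))
    norm_y = math.sqrt(sum((tf_y[w] * idf.get(w, 0.0)) ** 2 for w in tf_y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return numerator / (norm_x * norm_y)


idf = {"battery": 2.1, "life": 1.7, "screen": 2.3}  # hypothetical IDF values
s1 = "the battery life is great".split()
s2 = "battery life could be better".split()
print(idf_modified_cosine(s1, s2, idf))
```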
They use sequence-to-sequence encoder-decoder LSTM with attention.
They use the first sentence of a document. The source document is quite small (about 1 paragraph or ~500 words in the training dataset of Gigaword) and the produced output is also very short (about 75 characters). It remains an open challenge to scale up these limits - to produce longer summaries over multi-paragraph text input (even good LSTM models with attention models fall victim to vanishing gradients when the input sequences become longer than a few hundred items).
The evaluation method used for automatic summarization has traditionally been the ROUGE metric - which has been shown to correlate well with human judgment of summary quality, but also has a known tendency to encourage "extractive" summarization - so that using ROUGE as a target metric to optimize will lead a summarizer towards a copy-paste behavior of the input instead of the hoped-for reformulation type of summaries.
They use a GRU-based encoder-decoder with attention, with a bidirectional encoder.
They use the first 2 sentences of a document, with a limit of 120 words.
They use the Large vocabulary trick (LVT) of Jean et al. 2014, which means when you decode, use only the words that appear in the source - this reduces perplexity. But then you lose the capability to do "abstractive" summary. So they do "vocabulary expansion" by adding a layer of "word2vec nearest neighbors" to the words in the input.
Feature-rich encoding - they concatenate TF-IDF and named-entity-type features to the word embeddings, which adds encoding dimensions that reflect the "importance" of the words.
The most interesting of all is what they call the "Switching Generator/Pointer" layer. In the decoder, they add a layer that decides to either generate a new word based on the context / previously generated word (the usual decoder) or copy a word from the input (that is, add a pointer to the input). They learn when to generate vs. point and, when pointing, which word of the input to point to.
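A highly simplified PyTorch-style sketch of such a switch (illustrative only, not the paper's implementation: the paper uses a hard generate/point decision, whereas this sketch mixes the two distributions softly, closer to later pointer-generator formulations; shapes and parameterization are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchingDecoderHead(nn.Module):
    """Mix a generation distribution over the vocabulary with a pointer
    distribution over source positions, weighted by a learned switch."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.generate = nn.Linear(hidden_size, vocab_size)  # generator head
        self.switch = nn.Linear(hidden_size, 1)              # P(generate) head

    def forward(self, decoder_state, attention_weights, src_token_ids):
        # decoder_state: (batch, hidden)
        # attention_weights, src_token_ids: (batch, src_len)
        p_gen = torch.sigmoid(self.switch(decoder_state))             # (batch, 1)
        gen_dist = F.softmax(self.generate(decoder_state), dim=-1)    # (batch, vocab)

        # Pointer distribution: scatter attention mass onto source token ids.
        copy_dist = torch.zeros_like(gen_dist)
        copy_dist.scatter_add_(1, src_token_ids, attention_weights)   # (batch, vocab)

        # Soft mixture of generating a vocabulary word and copying from the input.
        return p_gen * gen_dist + (1.0 - p_gen) * copy_dist
```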
Ryang, Seonggi, and Takeshi Abekawa. "Framework of automatic text summarization using reinforcement learning." In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 256-265. Association for Computational Linguistics, 2012. [not neural-based methods]
Rushdi Shams, M. M. A. Hashem, Afrina Hossain, Suraiya Rumana Akter and Monika Gope. Corpus-based Web Document Summarization using Statistical and Linguistic Approach. arXiv:1304.2476. In Proceedings of the IEEE International Conference on Computer and Communication Engineering (ICCCE10), pp. 115-120, Kuala Lumpur, Malaysia, May 11-13, 2010.
They addressed an important problem in sequence-to-sequence (Seq2Seq) learning referred to as copying, in which certain segments of the input sequence are selectively replicated in the output sequence. In this paper, they incorporated copying into neural network-based Seq2Seq learning and proposed a new model called CopyNet with an encoder-decoder structure. CopyNet nicely integrates the regular way of word generation in the decoder with a new copying mechanism that can choose sub-sequences of the input sequence and put them at proper places in the output sequence.
Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, Yu Chi. Deep Keyphrase Generation. arXiv:1704.06879, 2017. The source code written in Python is seq2seq-keyphrase.
They compared modern extractive methods like LexRank, LSA, Luhn and Gensim's existing TextRank summarization module on the Opinosis dataset of 51 (article, summary) pairs. They also tried an abstractive technique using TensorFlow's textsum model, but didn't obtain good results due to its extremely high hardware demands (7000 GPU hours).
Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, Krys Kochut. Text Summarization Techniques: A Brief Survey. arXiv:1707.02268, 2017.
They constructed a large-scale Chinese short text summarization dataset from the Chinese microblogging website Sina Weibo, which is released to the public. They then applied a GRU-based encoder-decoder method to it to generate summaries. They treated the whole short text as one sequence, which may not be very reasonable, because most short texts contain several sentences.
LCSTS contains 2,400,591 (short text, summary) pairs as the training set and 1,106 pairs as the test set.
All the models were trained on Tesla M2090 GPUs for about one week.
The results show that the RNN with context outperforms the RNN without context on both character-based and word-based input.
Moreover, the character-based input outperforms the word-based input.
Yang, Zichao, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. "Hierarchical Attention Networks for Document Classification." In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016.