Robust Single-Document Summarizations and a Semantic Measurement of Quality

Shao, Liqun; Zhang, Hao; Wang, Jie

doi:10.1007/978-3-030-15640-4_7

Part of the book series: Communications in Computer and Information Science (CCIS, volume 976)

Included in the following conference series:

International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management

Abstract

The goal of this paper is to generate an effective summary for a given document under specific real-time requirements. We use the softplus function to enhance keyword rankings so that they favor important sentences, and on this basis we present a number of extractive summarization algorithms that use various keyword extraction and topic clustering methods. We show that our algorithms not only meet the real-time requirements but also yield the best ROUGE scores on DUC-02 among all previously known algorithms. We also evaluate our summarization methods on the SummBank dataset and other datasets to verify that they are robust. Experiments show that summaries generated by our methods achieve ROUGE scores higher than or about the same as those of extractive summaries generated by human evaluators. Moreover, we define a semantic measure, based on word embeddings and Word Mover’s Distance, that evaluates the quality of summaries without human-generated benchmarks. We show that for our algorithms, the orderings of the ROUGE scores and the scores under the new measure are highly comparable, suggesting that this new measure may serve as a viable alternative for measuring the quality of a summary.
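
The abstract states only that the softplus function is used to enhance keyword rankings so that important sentences are favored; it does not give the formula. The following is a minimal sketch of one way such a scheme could look, not the authors' algorithm. The function names, the rule of summing softplus-transformed keyword scores per sentence, and the keyword scores in the usage note are all assumptions made for illustration.

```python
import math
import re


def softplus(x: float) -> float:
    """Numerically stable softplus, log(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def score_sentence(sentence: str, keyword_scores: dict) -> float:
    """Sum the softplus of each keyword score found in the sentence.

    Assumption: keyword_scores maps lowercase words to raw scores from
    some keyword extractor; the summation rule is illustrative only.
    """
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return sum(softplus(keyword_scores[w]) for w in words if w in keyword_scores)


def summarize(sentences: list, keyword_scores: dict, max_sentences: int = 2) -> list:
    """Pick the top-scoring sentences and return them in document order."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_sentence(sentences[i], keyword_scores),
        reverse=True,
    )
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]
```

For instance, with hypothetical scores `{"storm": 2.0, "airport": 1.5, "flights": 1.8}`, a sentence that mentions both "storm" and "airport" outranks one that mentions only "flights", and a sentence with no keywords scores zero. Because softplus is smooth and strictly positive, low-ranked keywords still contribute a little rather than being cut off entirely.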

Acknowledgements

We thank Ming Jia, Jingwen Wang, Cheng Zhang, Wenjing Yang, and the other members of the Text Automation Lab at UMass Lowell for their support and fruitful discussions. We are grateful to Prof. Hong Yu for making the SummBank dataset available for this study.

Author information

Authors and Affiliations

Department of Computer Science, University of Massachusetts, Lowell, MA, USA
Liqun Shao, Hao Zhang & Jie Wang

Corresponding authors

Correspondence to Liqun Shao, Hao Zhang or Jie Wang.

Editor information

Editors and Affiliations

Instituto de Telecomunicações, Lisbon, Portugal
Ana Fred
University of Madeira, Funchal, Portugal
David Aveiro
Delft University of Technology, Delft, The Netherlands
Jan L. G. Dietz
Henley Business School, University of Reading, Reading, UK
Kecheng Liu
University of Coimbra, Coimbra, Portugal
Jorge Bernardino
Federal University of Pernambuco, Recife, Brazil
Ana Salgado
INSTICC and Instituto Politecnico de Setúbal, Setúbal, Portugal
Joaquim Filipe

About this paper

Cite this paper

Shao, L., Zhang, H., Wang, J. (2019). Robust Single-Document Summarizations and a Semantic Measurement of Quality. In: Fred, A., et al. Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2017. Communications in Computer and Information Science, vol 976. Springer, Cham. https://doi.org/10.1007/978-3-030-15640-4_7

DOI: https://doi.org/10.1007/978-3-030-15640-4_7
Published: 15 March 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-15639-8
Online ISBN: 978-3-030-15640-4
eBook Packages: Computer Science, Computer Science (R0)
