Robust Single-Document Summarizations and a Semantic Measurement of Quality

  • Conference paper
Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2017)

Abstract

The goal of this paper is to generate an effective summary for a given document under specific real-time requirements. We use the softplus function to enhance keyword rankings so that they favor important sentences, and on this basis we present a number of extractive summarization algorithms that use various keyword extraction and topic clustering methods. We show that our algorithms not only meet the real-time requirements but also yield the best ROUGE scores on DUC-02 among all previously known algorithms. We also evaluate our summarization methods on the SummBank dataset and other datasets to ensure that they are robust. Experiments show that summaries generated by our methods achieve ROUGE scores higher than or about the same as those of extractive summaries generated by human evaluators. Moreover, we define a semantic measure, based on word embeddings and Word Mover’s Distance, to evaluate the quality of summaries without human-generated benchmarks. We show that, for our algorithms, the orderings of the ROUGE scores and of the scores under the new measure are highly comparable, suggesting that this new measure may serve as a viable alternative for measuring the quality of a summary.
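
The abstract does not spell out how the softplus function enters the ranking, so the following is only a minimal sketch under stated assumptions: each keyword already carries a raw score from some extractor, the score is passed through softplus(x) = log(1 + e^x), and a sentence is scored by summing the transformed scores of the keywords it contains. The names `score_sentence` and `summarize`, the tokenization, and the toy scores are hypothetical and are not taken from the paper.

```python
import math
import re


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def score_sentence(sentence: str, keyword_scores: dict) -> float:
    """Sum softplus-transformed keyword scores over the words in a sentence.

    Assumption: summation is one plausible way to combine keyword scores;
    the paper may combine them differently.
    """
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return sum(softplus(keyword_scores[w]) for w in words if w in keyword_scores)


def summarize(sentences: list, keyword_scores: dict, k: int) -> list:
    """Pick the k highest-scoring sentences and return them in document order."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_sentence(sentences[i], keyword_scores),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]


if __name__ == "__main__":
    # Toy data, purely illustrative.
    sentences = [
        "The cat sat.",
        "Summarization ranks sentences by keywords.",
        "Weather is nice.",
    ]
    keyword_scores = {"summarization": 2.0, "keywords": 1.5, "sentences": 1.0, "cat": 0.1}
    print(summarize(sentences, keyword_scores, 1))
```

Because softplus is smooth and strictly positive, even low-scoring keywords contribute a small amount while high-scoring ones grow almost linearly; whether this matches the weighting used in the paper is not something the abstract confirms.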

Acknowledgements

We thank Ming Jia, Jingwen Wang, Cheng Zhang, Wenjing Yang, and the other members of the Text Automation Lab at UMass Lowell for their support and fruitful discussions. We are grateful to Prof. Hong Yu for making the SummBank dataset available for this study.

Corresponding authors

Correspondence to Liqun Shao, Hao Zhang, or Jie Wang.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this paper

Shao, L., Zhang, H., Wang, J. (2019). Robust Single-Document Summarizations and a Semantic Measurement of Quality. In: Fred, A., et al. Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2017. Communications in Computer and Information Science, vol 976. Springer, Cham. https://doi.org/10.1007/978-3-030-15640-4_7

  • DOI: https://doi.org/10.1007/978-3-030-15640-4_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-15639-8

  • Online ISBN: 978-3-030-15640-4

  • eBook Packages: Computer Science, Computer Science (R0)
