DOI: 10.1145/3015157.3015161
short-paper

Simurg: An Extendable Multilingual Corpus for Abstractive Single Document Summarization

Published: 08 December 2016

ABSTRACT

Abstractive single document summarization is considered a challenging problem in artificial intelligence and natural language processing. In the last two years in particular, several deep learning approaches to summarization have been proposed, renewing researchers' interest in the field.

It is well known that deep learning approaches do not work well with small amounts of data. With some exceptions, this is unfortunately the situation for most datasets available for the summarization task. In addition, the phonetic, morphological, semantic, and syntactic features of a language change constantly over time, yet most summarization corpora are constructed from old resources. Another problem is the language of the corpora: not only in summarization but also in other areas of natural language processing, most corpora are available only in English. Finally, license terms and fees are obstacles that prevent many academics, and non-academics in particular, from accessing these data.

This work describes an open source framework for creating an extendable multilingual corpus for abstractive single document summarization that addresses the above-mentioned problems. We describe a tool consisting of a scalable crawler and a centralized key-value store database that constructs a corpus of arbitrary size using a news aggregator service.
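The crawler-plus-store architecture named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the in-memory `KeyValueStore`, the `crawl` function, the record fields, and the injected `fetch` callable are all hypothetical names chosen for illustration, and a real deployment would replace the dictionary with an actual key-value database and `fetch` with a client for the news aggregator.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical record shape: (article_text, summary_text).
# The schema used by the actual tool is not given in the abstract.
Record = Tuple[str, str]


class KeyValueStore:
    """In-memory stand-in (assumption) for a centralized key-value database."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key: str, value: dict) -> bool:
        """Store value under key unless the key already exists."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def __len__(self) -> int:
        with self._lock:
            return len(self._data)


def url_key(url: str) -> str:
    """Derive a fixed-length key from the URL so duplicates collapse."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[Record]],
    store: KeyValueStore,
    workers: int = 4,
) -> int:
    """Fetch URLs concurrently and return how many new records were stored."""

    def work(url: str) -> bool:
        record = fetch(url)
        if record is None:  # page unavailable or not parseable
            return False
        article, summary = record
        value = {"url": url, "article": article, "summary": summary}
        return store.put_if_absent(url_key(url), value)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(work, urls))
```

Keying each record by a hash of its URL lets several crawler workers write to one shared store without storing the same article twice, which is one simple way to let the corpus grow incrementally.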


  • Published in

    FIRE '16: Proceedings of the 8th Annual Meeting of the Forum for Information Retrieval Evaluation
    December 2016
    47 pages
    ISBN: 9781450348386
    DOI: 10.1145/3015157

    Copyright © 2016 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • short-paper
    • Research
    • Refereed limited

    Acceptance Rates

    FIRE '16 Paper Acceptance Rate: 7 of 22 submissions, 32%
    Overall Acceptance Rate: 19 of 64 submissions, 30%
