ABSTRACT
Abstractive single-document summarization is considered a challenging problem in artificial intelligence and natural language processing. In the last two years in particular, several deep learning summarization approaches have been proposed, renewing researchers' interest in the field.
It is well known that deep learning approaches do not work well with small amounts of data. With some exceptions, this is unfortunately the case for most datasets available for the summarization task. In addition, the phonetic, morphological, semantic, and syntactic features of a language change constantly over time, yet most summarization corpora are constructed from old resources. Another problem is the language of the corpora: not only in summarization but also in other fields of natural language processing, most corpora are available only in English. Finally, license terms and fees are obstacles that prevent many academics, and non-academics in particular, from accessing these data.
This work describes an open-source framework for creating an extendable multilingual corpus for abstractive single-document summarization that addresses the above-mentioned problems. We describe a tool consisting of a scalable crawler and a centralized key-value store database that constructs a corpus of arbitrary size using a news aggregator service.