short-paper

Simurg: An Extendable Multilingual Corpus for Abstractive Single Document Summarization

Authors:
Pashutan Modaresi
Institute of Computer Science, Heinrich Heine University of Düsseldorf, Düsseldorf, Germany

Stefan Conrad
Institute of Computer Science, Heinrich Heine University of Düsseldorf, Düsseldorf, Germany

FIRE '16: Proceedings of the 8th Annual Meeting of the Forum for Information Retrieval Evaluation
December 2016, Pages 24–27
https://doi.org/10.1145/3015157.3015161

Published: 08 December 2016

ABSTRACT

Abstractive single document summarization is considered a challenging problem in the fields of artificial intelligence and natural language processing. In the last two years in particular, several deep learning summarization approaches have been proposed, which has once again drawn researchers' attention to this field.

It is well known that deep learning approaches do not work well with small amounts of data. With some exceptions, this is unfortunately the situation for most of the datasets available for the summarization task. In addition, the phonetic, morphological, semantic, and syntactic features of a language change constantly over time, yet most summarization corpora are constructed from old resources. Another problem is the language of the corpora: not only in summarization but also in other fields of natural language processing, most corpora are available only in English. Finally, the license terms and fees of the corpora are obstacles that prevent many academics, and especially non-academics, from accessing these data.

This work describes an open source framework for creating an extendable multilingual corpus for abstractive single document summarization that addresses the above-mentioned problems. We describe a tool consisting of a scalable crawler and a centralized key-value store database that constructs a corpus of arbitrary size using a news aggregator service.
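
The abstract names a crawler feeding a centralized key-value store but gives no record layout or database. The following is only a minimal sketch of how such a store might hold article and summary pairs, under stated assumptions: the names `Record` and `CorpusStore`, the field set, the use of a URL hash as the key, and the in-memory dict standing in for a real database are all hypothetical and are not taken from the paper.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Record:
    """One corpus entry: a source article paired with a short summary."""
    url: str
    language: str   # assumed: a language code such as "en" or "de"
    summary: str    # assumed: the short text shown by the news aggregator
    article: str    # assumed: the full text extracted from the linked page


class CorpusStore:
    """Hypothetical key-value store; a dict stands in for a real database."""

    def __init__(self):
        self._data = {}

    @staticmethod
    def key_for(url: str) -> str:
        # Hashing the URL yields a fixed-length key, so crawling the same
        # page twice maps to the same entry.
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def put(self, record: Record) -> bool:
        """Store the record; return False if the URL is already present."""
        key = self.key_for(record.url)
        if key in self._data:
            return False
        # Values are serialized to JSON strings, as a key-value store
        # would typically hold opaque values.
        self._data[key] = json.dumps(asdict(record), ensure_ascii=False)
        return True

    def get(self, url: str):
        """Return the stored Record for this URL, or None if absent."""
        raw = self._data.get(self.key_for(url))
        return Record(**json.loads(raw)) if raw is not None else None

    def count_by_language(self):
        """Count stored records per language code."""
        counts = {}
        for raw in self._data.values():
            lang = json.loads(raw)["language"]
            counts[lang] = counts.get(lang, 0) + 1
        return counts
```

In this sketch, several crawler workers could call `put` against one shared store, and the duplicate check keeps repeated crawls from inflating the corpus. Tagging each record with a language code is one way the corpus could be extended to further languages without changing the schema.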

Published in
FIRE '16: Proceedings of the 8th Annual Meeting of the Forum for Information Retrieval Evaluation
December 2016
47 pages
ISBN: 9781450348386
DOI: 10.1145/3015157
Editors: Prasenjit Majumder, Mandar Mitra, Jainisha Sankhavara, Parth Mehta
Copyright © 2016 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Abstractive text summarization
extendable corpora
multilingual corpora
single document summarization
Qualifiers
- short-paper
- Research
- Refereed limited
Acceptance Rates
FIRE '16 Paper Acceptance Rate: 7 of 22 submissions, 32%
Overall Acceptance Rate: 19 of 64 submissions, 30%