A web-based Bengali news corpus for named entity recognition

Ekbal, Asif; Bandyopadhyay, Sivaji

doi:10.1007/s10579-008-9064-x

Published: 22 February 2008

Volume 42, pages 173–182, (2008)

Language Resources and Evaluation

Abstract

The rapid development of language resources and tools using machine learning techniques for less computerized languages requires appropriately tagged corpora. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. Named Entity Recognition (NER) systems based on pattern-based shallow parsing, with and without linguistic knowledge, have been developed using a part of this corpus. The NER system that uses linguistic knowledge performed better, yielding highest F-Score values of 75.40%, 72.30%, 71.37%, and 70.13% for person, location, organization, and miscellaneous names, respectively.
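
For readers unfamiliar with the metric, the F-Score values reported above combine precision and recall. The abstract does not say which variant was used, so the following is only a minimal sketch that assumes the common balanced form (the harmonic mean of precision and recall, beta = 1). The function name `f_score` and the counts in the usage lines are hypothetical and are not taken from the article.

```python
def f_score(true_positives, false_positives, false_negatives, beta=1.0):
    """Return (precision, recall, F-beta) from entity-level counts.

    Assumption: the balanced F-score (beta = 1) is the default; the
    article's exact evaluation setup is not given in the abstract.
    """
    predicted = true_positives + false_positives  # entities the system proposed
    actual = true_positives + false_negatives     # entities in the reference annotation
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical counts for a single entity class (illustration only).
p, r, f = f_score(true_positives=75, false_positives=25, false_negatives=25)
print(f"precision={p:.2%} recall={r:.2%} F={f:.2%}")
```

Computing the score separately per entity class, as the abstract reports it, simply means calling the function once with the counts for each class.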

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
Asif Ekbal & Sivaji Bandyopadhyay

Corresponding author

Correspondence to Asif Ekbal.

About this article

Cite this article

Ekbal, A., Bandyopadhyay, S. A web-based Bengali news corpus for named entity recognition. Lang Resources & Evaluation 42, 173–182 (2008). https://doi.org/10.1007/s10579-008-9064-x

Received: 18 August 2006
Accepted: 22 January 2008
Published: 22 February 2008
Issue Date: May 2008
DOI: https://doi.org/10.1007/s10579-008-9064-x
