Abstract
The rapid development of language resources and tools using machine learning techniques for less computerized languages requires an appropriately tagged corpus. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. Named Entity Recognition (NER) systems that rely on pattern-based shallow parsing, with and without linguistic knowledge, have been developed using a part of this corpus. The NER system that uses linguistic knowledge performed better, yielding highest F-Score values of 75.40%, 72.30%, 71.37%, and 70.13% for person, location, organization, and miscellaneous names, respectively.
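The step from retrieved HTML pages to countable wordforms can be illustrated with a minimal sketch. The abstract does not describe the authors' crawler or cleaning procedure, so everything below is an assumption: the names `TextExtractor`, `html_to_text`, and `count_wordforms` are hypothetical, the code uses only the Python standard library, and a wordform is approximated as a whitespace-delimited token with surrounding punctuation (including the Bengali danda) removed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from a page, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, fragments joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def count_wordforms(text):
    """Count whitespace-delimited tokens (an assumed, simplified tokenization)."""
    tokens = [t.strip(".,;:!?\"'()[]।") for t in text.split()]
    return sum(1 for t in tokens if t)


if __name__ == "__main__":
    # Hypothetical page content, for illustration only.
    page = (
        "<html><head><style>p{}</style></head>"
        "<body><p>আজ কলকাতায় বৃষ্টি হয়েছে।</p>"
        "<script>var x = 1;</script></body></html>"
    )
    text = html_to_text(page)
    print(text, count_wordforms(text))
```

A real news archive would also need boilerplate removal (menus, advertisements) and consistent character encoding handling, which this sketch leaves out.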
Cite this article
Ekbal, A., Bandyopadhyay, S. A web-based Bengali news corpus for named entity recognition. Lang Resources & Evaluation 42, 173–182 (2008). https://doi.org/10.1007/s10579-008-9064-x
DOI: https://doi.org/10.1007/s10579-008-9064-x