Abstract
Document classification is used in applications ranging from email spam detection to article categorization. Document classification in Tamil has recently gained momentum as more data in the language has become available. A major advance in document classification comes from bidirectional encoder representations from transformers (BERT), which is built on the transformer architecture and has been applied effectively to many natural language processing problems, including sentiment analysis and document classification for Tamil. A main disadvantage of a pre-trained BERT model is that its input cannot exceed 512 tokens; otherwise, the model has to be retrained. Our implementation mitigates this issue by using a hierarchical transformer architecture, which is especially useful for resource-poor languages like Tamil. We compare the hierarchical transformer model with classical machine learning algorithms and find that recurrence over BERT shows substantial improvement over SVM, logistic regression, and random forest, achieving a weighted average F1 score of 0.88 for news article classification.
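The 512-token limit is typically worked around by splitting a long document into segments that each fit the model, encoding the segments separately, and combining the segment representations with a second-level model. The following is a minimal sketch of the splitting step only, not the authors' implementation. The function name `split_into_segments`, the segment length of 510 (512 minus the two special tokens BERT adds), and the overlap of 50 tokens are assumptions chosen for illustration; the abstract does not state the values used.

```python
from typing import List


def split_into_segments(
    token_ids: List[int],
    max_len: int = 510,   # assumed: 512 minus the [CLS] and [SEP] tokens
    overlap: int = 50,    # assumed: tokens shared by consecutive segments
) -> List[List[int]]:
    """Split a token-id sequence into overlapping segments of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not token_ids:
        return []

    stride = max_len - overlap  # how far the window advances each step
    segments: List[List[int]] = []
    start = 0
    while True:
        segments.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the document.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return segments


if __name__ == "__main__":
    # Stand-in for the token ids of a long document (hypothetical data).
    document = list(range(1200))
    for i, segment in enumerate(split_into_segments(document)):
        print(i, len(segment), segment[0], segment[-1])
```

In a hierarchical setup, each segment would then be passed through BERT, and the resulting sequence of segment vectors would be fed to a recurrent or transformer layer that produces the document-level prediction. The overlap keeps some context across segment boundaries at the cost of encoding a few tokens twice.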
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Riyaz Ahmed, M., Raghuraman, B., Briskilal, J. (2022). Using Hierarchical Transformers for Document Classification in Tamil Language. In: Smys, S., Bestak, R., Palanisamy, R., Kotuliak, I. (eds) Computer Networks and Inventive Communication Technologies. Lecture Notes on Data Engineering and Communications Technologies, vol 75. Springer, Singapore. https://doi.org/10.1007/978-981-16-3728-5_55
DOI: https://doi.org/10.1007/978-981-16-3728-5_55
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-3727-8
Online ISBN: 978-981-16-3728-5
eBook Packages: Engineering, Engineering (R0)