Using Hierarchical Transformers for Document Classification in Tamil Language

Riyaz Ahmed, M.; Raghuraman, Bhuvan; Briskilal, J.

doi:10.1007/978-981-16-3728-5_55

Conference paper
First Online: 14 September 2021

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 75)

Abstract

Document classification is used in applications ranging from spam detection in email to article classification. Document classification in Tamil has recently gained momentum as more data has become available in the language. One of the major advances in document classification is bidirectional encoder representations from transformers (BERT), which uses the transformer architecture and has been applied effectively to many natural language processing problems, including sentiment analysis and document classification for Tamil. A main disadvantage of a pre-trained BERT model is that its input cannot exceed 512 tokens; handling longer inputs requires retraining the model. Our implementation mitigates this issue by using a hierarchical transformer architecture and is especially useful for resource-poor languages like Tamil. We compare the hierarchical transformer model with classical machine learning algorithms and find that recurrence over BERT shows substantial improvement over SVM, logistic regression and random forest, with a weighted average F1 score of 0.88 for news article classification.
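
The 512-token limit described in the abstract is commonly worked around by cutting a long document into overlapping segments, encoding each segment separately, and then combining the segment representations with a second model (for example, a recurrent layer). The abstract does not give the segmentation settings, so the following is only a minimal sketch of the segmentation step: the function name `split_into_segments` and the defaults `segment_len=200` and `overlap=50` are illustrative assumptions, not values taken from the paper.

```python
from typing import List, Sequence


def split_into_segments(token_ids: Sequence[int],
                        segment_len: int = 200,   # assumed value, not from the paper
                        overlap: int = 50         # assumed value, not from the paper
                        ) -> List[List[int]]:
    """Split a token sequence into overlapping segments of at most segment_len tokens."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")

    stride = segment_len - overlap  # how far the window moves each step
    segments: List[List[int]] = []
    start = 0
    while start < len(token_ids):
        segments.append(list(token_ids[start:start + segment_len]))
        # Stop once the current window already reaches the end of the document.
        if start + segment_len >= len(token_ids):
            break
        start += stride
    return segments


if __name__ == "__main__":
    # A hypothetical 1000-token document, far longer than one 512-token input.
    document = list(range(1000))
    segments = split_into_segments(document)
    print(len(segments), [len(s) for s in segments])
```

Each segment is short enough to be passed to a pre-trained BERT model on its own. In a hierarchical setup, the per-segment outputs would then be fed in order to a second-stage model that produces the document-level prediction; that stage is not shown here.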

Author information

Authors and Affiliations

Department of Computer Science and Engineering, SRM Institute of Science and Technology, Chengalpattu, Tamil Nadu, India
M. Riyaz Ahmed, Bhuvan Raghuraman & J. Briskilal

Authors

M. Riyaz Ahmed
Bhuvan Raghuraman
J. Briskilal

Corresponding author

Correspondence to M. Riyaz Ahmed.

Editor information

Editors and Affiliations

Department of Information Technology, RVS Technical Campus, Coimbatore, Tamil Nadu, India
S. Smys
Department of Telecommunication Engineering, Czech Technical University in Prague, Praha, Czech Republic
Robert Bestak
Gerald Schwartz School of Business, St. Francis Xavier University, Antigonish, NS, Canada
Ram Palanisamy
Faculty of Informatics and Information Technology, Slovak University of Technology, Bratislava, Slovakia
Ivan Kotuliak

Cite this paper

Riyaz Ahmed, M., Raghuraman, B., Briskilal, J. (2022). Using Hierarchical Transformers for Document Classification in Tamil Language. In: Smys, S., Bestak, R., Palanisamy, R., Kotuliak, I. (eds) Computer Networks and Inventive Communication Technologies. Lecture Notes on Data Engineering and Communications Technologies, vol 75. Springer, Singapore. https://doi.org/10.1007/978-981-16-3728-5_55

DOI: https://doi.org/10.1007/978-981-16-3728-5_55
Published: 14 September 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-3727-8
Online ISBN: 978-981-16-3728-5
eBook Packages: Engineering, Engineering (R0)