Using Hierarchical Transformers for Document Classification in Tamil Language

  • Conference paper

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 75)

Abstract

Document classification is used in a range of applications, from spam detection in email to article classification. Document classification in Tamil has recently gained momentum as more data in the language has become available. One of the major advances in document classification comes from bidirectional encoder representations from transformers (BERT), which uses the transformer architecture and has been applied effectively to many natural language processing problems, including sentiment analysis and document classification for Tamil. A main disadvantage of the pre-trained BERT model is that its input cannot exceed 512 tokens; handling longer inputs requires retraining the model. Our implementation mitigates this issue by using a hierarchical transformer architecture, which is especially useful for resource-poor languages like Tamil. We compare the hierarchical transformer model with classical machine learning algorithms and find that recurrence over BERT shows substantial improvement over SVM, logistic regression and random forest, achieving a weighted average F1 score of 0.88 for news article classification.
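
To make the 512-token constraint concrete, the following minimal Python sketch shows one common way a long document can be cut into overlapping segments that each fit within a BERT-style input limit. It is an illustration under stated assumptions, not the authors' implementation: the function name `split_into_segments` and the default values `segment_len=200` and `overlap=50` are hypothetical choices, and the token IDs are placeholders rather than the output of a real tokenizer. In a hierarchical setup, each segment would then be encoded separately and the per-segment representations passed to a second-level model (for example a recurrent layer); that step is not shown here.

```python
from typing import List, Sequence


def split_into_segments(
    token_ids: Sequence[int],
    segment_len: int = 200,  # assumed value, must stay below the 512-token limit
    overlap: int = 50,       # assumed value, tokens shared by neighbouring segments
) -> List[List[int]]:
    """Split a token sequence into overlapping fixed-length segments."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")

    stride = segment_len - overlap
    segments: List[List[int]] = []
    for start in range(0, len(token_ids), stride):
        segments.append(list(token_ids[start:start + segment_len]))
        # Stop once the current segment already reaches the end of the document.
        if start + segment_len >= len(token_ids):
            break
    return segments


# Placeholder "document" of 1000 token IDs, longer than a single BERT input.
document = list(range(1000))
segments = split_into_segments(document)
print(len(segments), [len(s) for s in segments])
```

The overlap keeps some shared context between neighbouring segments so that a sentence cut at a segment boundary is still seen whole by at least one of them; the right segment length and overlap depend on the corpus and would need to be tuned.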



Author information

Corresponding author

Correspondence to M. Riyaz Ahmed.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Riyaz Ahmed, M., Raghuraman, B., Briskilal, J. (2022). Using Hierarchical Transformers for Document Classification in Tamil Language. In: Smys, S., Bestak, R., Palanisamy, R., Kotuliak, I. (eds) Computer Networks and Inventive Communication Technologies. Lecture Notes on Data Engineering and Communications Technologies, vol 75. Springer, Singapore. https://doi.org/10.1007/978-981-16-3728-5_55

  • DOI: https://doi.org/10.1007/978-981-16-3728-5_55

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-3727-8

  • Online ISBN: 978-981-16-3728-5

  • eBook Packages: Engineering, Engineering (R0)
