Abstract
The surge of pre-trained language models has ushered in a new era in Natural Language Processing (NLP) by enabling powerful language models. Among these, Transformer-based models such as BERT have become increasingly popular due to their state-of-the-art performance. However, such models are usually trained on English, leaving other languages to multilingual models with limited resources. This paper proposes ParsBERT, a monolingual BERT for the Persian language, and demonstrates its state-of-the-art performance compared with other architectures and multilingual models. Moreover, since the amount of data available for Persian NLP tasks is very limited, we compose a massive dataset for several NLP tasks as well as for pre-training the model. ParsBERT obtains higher scores on all datasets, both existing and newly gathered ones, and improves the state of the art by outperforming both multilingual BERT and prior works on Sentiment Analysis, Text Classification, and Named Entity Recognition tasks.
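ParsBERT is pre-trained with BERT's masked-language-modeling (MLM) objective. As a minimal illustration of that objective (not the authors' code), the sketch below applies the standard BERT corruption scheme: roughly 15% of token positions are selected, and of those, 80% are replaced with `[MASK]`, 10% with a random vocabulary token, and 10% are left unchanged; the model is then trained to recover the original tokens at the selected positions. The function name and the toy token list are illustrative only.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", vocab=None, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption: pick ~mask_prob of positions; of those,
    80% become mask_token, 10% a random vocab token, 10% stay unchanged.
    Returns (corrupted tokens, labels), where labels hold the original
    token at each selected position and None elsewhere."""
    rng = random.Random(seed)
    vocab = vocab or tokens  # fall back to the sequence itself as a toy vocab
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok          # model must predict this token
            r = rng.random()
            if r < 0.8:
                out[i] = mask_token  # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(vocab)  # 10%: random token
            # remaining 10%: keep the original token
    return out, labels
```

In practice this corruption runs over WordPiece subword IDs rather than whole words, and the loss is computed only at positions where a label is set.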
Acknowledgements
We hereby express our gratitude to the TensorFlow Research Cloud (TFRC) program (https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank the Hooshvare (https://hooshvare.com) Research Group for facilitating dataset gathering and scraping of online text resources.
Cite this article
Farahani, M., Gharachorloo, M., Farahani, M. et al. ParsBERT: Transformer-based Model for Persian Language Understanding. Neural Process Lett 53, 3831–3847 (2021). https://doi.org/10.1007/s11063-021-10528-4