Introduction

According to studies [1], the average person spends approximately 2 hours and 25 minutes on social media every day. It is also well established that the majority of teenagers and young adults, roughly 84% [2], use social media regularly. Apart from the positive consequences of the advent of social media, there have been several negative impacts too. A study conducted by the Pew Research Center revealed that around one in five internet users (22%) [3] has been a victim of online harassment. Moreover, one of the most common ways in which cyberbullies harass their victims is by commenting on their posts on social media.

Because young adults are readily exposed to such vitriolic content on social media, it tends to create a snowball effect: abusive comments produce more abusive comments and ultimately result in an avalanche of online aggression. This also works the other way around, with positive comments inspiring more people to leave positive comments [4]. Many studies have been conducted on this subject, including the much-disputed Facebook experiment [5] in which the tech giant modified its “Newsfeed” algorithm to show more positive or negative posts to certain users. The results demonstrate that people tend to create more positive posts upon seeing happy posts in their newsfeed, and vice versa.

Platforms such as Facebook and Instagram allow their users to create online communities of like-minded people, which promotes a sense of acceptance and belonging, especially for those who feel marginalized or lonely. During the COVID-19 pandemic, people in most parts of the world have been forced to remain isolated and have looked to social media to assuage their fears. It is therefore important to reinforce the positivity that people seek by focusing on hope speech.

Keeping this in mind, our proposed work focuses on detecting hope speech in YouTube comments. Hope speech consists of texts which have the propensity to inspire and invigorate individuals. Motivational texts and encouragements all fall under the category of hope speech. Our dataset consists of three labels, Hope_speech, Non_hope_speech and Not in language. For example:

  • thalaivar ah mathunga ellam nallabadiya marum: Change the leader, and everything will change for the better.—Hope_speech.

  • Bro neegha sonna eantha app um en kitta ialla: I don’t have any of the applications mentioned by you, brother.—Non_hope_speech.

  • madhan gowri fans hit like: Fans of madhan gowri, hit like.—not-Tamil.

Our work on hope speech identification mainly focuses on Tamil and Malayalam, which are Dravidian languages predominantly spoken in South India. Tamil is the official language of the Indian state of Tamil Nadu and is also recognized as an official language in two other sovereign nations, Singapore and Sri Lanka. Malayalam is the official language of the Indian state of Kerala and the union territory of Lakshadweep, with approximately 38 million native speakers. As opposed to other Dravidian languages, the finite verb in Malayalam is inflected only for tense and not for person, number or gender. Both Tamil and Malayalam are characterized by a series of retroflex consonants, which are formed by curling the tip of the tongue toward the palate.

Existing work [6,7,8] on hope speech detection uses neural network based architectures as well as transformers. These works perform fairly well overall but struggle with code-mixed and transliterated text. The proposed approach takes this into consideration by utilizing language-agnostic cross-lingual word embeddings, which map words from different languages with equivalent meanings to similar hidden representations. This mapping helps the model capture contextual relationships in code-mixed texts, improving performance with little feature engineering, whereas traditional and deep learning models require more extensive data augmentation.

Related Work

The first multilingual hope speech dataset [9] for Equality, Diversity and Inclusion (HopeEDI) was created by sourcing comments from YouTube. This formed the basis for the workshop on hope speech detection, conducted by LTEDI-EACL-2021 [10], which brought to light the importance of identifying positive comments on social media. The task proposed in the workshop has promoted research in this field.

In [11], the problem of hope speech detection is approached by adopting character n-gram-based TF-IDF and MuRIL text representations for the sentences in the dataset. Using these representations, the authors classify the sentences into hope speech, non-hope speech or not in language, and compare the approaches TF-IDF + LR, TF-IDF + SVM, MuRIL + LR and MuRIL + SVM for each language.

Apart from hope speech, we also referred to work related to hate speech and offensive language identification, as the crux of these NLP classification tasks is to analyze texts and categorize them. This work [12] follows two machine learning approaches for identifying offensive language on Twitter. The multilingual dataset is preprocessed using the spaCy tokenizer for English and German, and the NLTK Twitter tokenizer is used for tokenizing the input. The approach consists of an ensemble of SVM, Random Forest and AdaBoost classifiers with majority voting.

Moreover, the authors of [11] observed that models using MuRIL text representations were outperformed by models using TF-IDF representations, which led us to disregard MuRIL in our work. TF-IDF, however, cannot capture semantics the way word embeddings can, and it does not scale well to large vocabularies.

These papers formed a basis for our experiments with traditional classifiers and feature engineering such as TF-IDF and MuRIL text representations. However, both papers rely on extensive data augmentation to train traditional classifiers on the dataset. Hence, we explored literature that first obtains semantically rich text embeddings from the dataset and then applies other end classifiers.

This work [13] utilizes the FLAIR framework, which provides pre-trained embeddings for all of its language models. The approach focuses on using context-aware string embeddings as word representations in deep learning techniques; Recurrent Neural Network (RNN) and pooled document embeddings are used for text representation. The authors of [14] experiment with Bi-LSTM and Dense architectures along with transformer embeddings such as BERT, ULMFiT and MuRIL. For Tamil and Malayalam, mBERT cased with a BiLSTM architecture and mBERT uncased with a BiLSTM architecture, respectively, outperformed the other models.

BERT embeddings are fed into a CNN classifier in [15]. The embeddings are obtained from transliterated text produced with the help of language detection modules and linguistic rules. The classifier network applies convolutions followed by a max-pooling layer; the resulting representation is passed to a feed-forward network, and a dropout layer is added to counteract overfitting.

Balouchzahi et al. [16] created three models, namely CoHope-ML, CoHope-NN and CoHope-TL, based on an ensemble of classifiers, a Keras Neural Network (NN) and a BiLSTM with Conv1D model, respectively. The CoHope-ML and CoHope-NN models were trained on a feature set consisting of character sequences extracted from sentences combined with words for Ma–En (Malayalam–English) and Ta–En (Tamil–English) code-mixed texts, and a combination of word and character n-grams along with syntactic word n-grams for the English dataset. The CoHope-ML model was developed as an ensemble of three classifiers, Logistic Regression, eXtreme Gradient Boosting (XGB) and Multi-Layer Perceptron (MLP), using the bagging technique. The CoHope-NN model employs a Keras dense Neural Network architecture. The CoHope-TL model comprised three parts: training tokenizers, training a BERT language model, and then using the pre-trained BERT language model weights in a BiLSTM-Conv1D model. The CoHope-ML model performed the best among the three proposed models.

The above works make it clear that embeddings are essential for grasping the context of a sentence. Transformers, with their positional embeddings and self-attention, are therefore a more attractive choice than plain neural networks for this purpose. Moreover, given the code-mixed nature of the dataset, we were motivated to experiment with transformers while incorporating multilingual embeddings.

Sharma et al. [17] have contributed to the fields of hope speech detection and offensive language identification. The datasets were preprocessed by converting native script to Latin script using the Indic-trans library, and the ULMFiT model was trained on synthetically generated code-mixed data. For the hope speech detection task, the KNN algorithm is used to build the baseline model, and the final model consists of a classifier trained on the fine-tuned ULMFiT language model. An ensemble of RoBERTa and ULMFiT was submitted for the offensive language identification task: the first classifier is trained on the fine-tuned language model, in this case RoBERTa, and the second classifier is obtained by training the ULMFiT model.

In [18], the authors processed the dataset by applying the stratified K-fold method prior to training and employed XLM-RoBERTa with attention as their classifier for all three languages. A similar approach [19] also utilizes the XLM-RoBERTa framework; the labels are classified using the output of the final XLM-RoBERTa layer together with the weighted semantic information of TF-IDF.

This work [20] segregates the process of identifying hope speech into two phases: one phase consists of five language detection models to identify the language of the text, and the other detects the occurrence of hope speech. The authors generate SBERT embeddings and pass them to a feed-forward network for prediction.

A transformer-based pretrained BERT model [21], with a rule-based language identification system, assists in the detection of “Other language” labels. The authors of this work also experiment with traditional learning models using TF-IDF and with deep learning models using pretrained GloVe and FastText embeddings.

Other works that employ transformer models include [22, 23]. In both of these works, the authors have implemented various methods and found that XLM-RoBERTa was the best performing method, achieving promising scores.

Keeping in mind the state-of-the-art performance of transformer-based models, we propose a language-agnostic transformer model capable of providing comparable results on low-resource languages like Tamil and Malayalam. Moreover, to capture the semantics of the different languages, we utilized cross-lingual word embeddings which map similar meanings to the same vector space, regardless of the language.

Data

The dataset [9] used in our methodology was provided by LT-EDI for the purpose of detecting hope speech in YouTube comments. The dataset consists of YouTube comments belonging to three languages, namely English, Tamil and Malayalam as seen in Fig. 1.

Fig. 1
figure 1

Language-wise distribution of data

These YouTube comments were segregated into training, validation and test datasets for each language, as tabulated in Table 1. Each dataset consists of two attributes, “text” and “label”, where “text” is the YouTube comment itself and “label” is one of “Hope_speech”, “Non_hope_speech” or “Not in language”, as shown in Table 2.

Table 1 Distribution of data
Table 2 Class-wise distribution of data

In our proposed methodology, we have built two variants, original and split. In the first variant, the original datasets are used to train the model. In the second variant, the datasets of the two Dravidian languages, Tamil and Malayalam, are each divided in two. The division is performed by a language detection module based on the langdetect API [24], which separates comments according to whether the text is written in the native script: comments containing a majority of pure Tamil/Malayalam words are added to the pure Tamil/Malayalam dataset, and the remaining comments are added to the Tanglish/Manglish dataset, as sketched below. We implemented the split-dataset variation to test whether our model’s performance is affected by training on pure-script and code-mixed content separately.
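
A minimal sketch of how this split can be performed with langdetect is shown below; the pandas DataFrame layout, the "text" column name and the helper names are our own illustration rather than the exact implementation.

```python
# Illustrative sketch of the langdetect-based split into native-script and
# code-mixed (Tanglish/Manglish) subsets; column and function names are assumed.
import pandas as pd
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect deterministic across runs

def is_native_script(text: str, lang_code: str) -> bool:
    """True if langdetect identifies the comment as native-script Tamil ('ta') or Malayalam ('ml')."""
    try:
        return detect(text) == lang_code
    except Exception:  # empty or undecidable comments fall back to the code-mixed split
        return False

def split_dataset(df: pd.DataFrame, lang_code: str):
    """Split one language dataset into pure-script and code-mixed parts."""
    native_mask = df["text"].apply(lambda t: is_native_script(t, lang_code))
    return df[native_mask], df[~native_mask]

# pure_tamil_df, tanglish_df = split_dataset(tamil_df, "ta")
```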

Additionally, to test the language-agnostic nature of our transformer model, we combined the English, Tamil and Malayalam datasets provided by Language Technology for Equality, Diversity and Inclusion (LTEDI) and used them for training our model. This combined dataset was prepared by removing the “Not in language” label. In addition to the preprocessing steps mentioned in our proposed methodology, we handle out-of-vocabulary (OOV) words for the English dataset. OOV words are abbreviated, misspelled or absent from the English dictionary; they can cause misclassifications because they add randomness to the input and do not have embeddings. We handled this by employing a custom dictionary of the most frequently occurring OOV words and their corrected forms, illustrated in the sketch below.
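
An illustrative sketch of this dictionary-based correction is shown below; the entries are examples only, not the actual dictionary we built.

```python
# Example OOV correction for the English data; OOV_MAP here is a toy subset.
OOV_MAP = {"u": "you", "plz": "please", "gud": "good", "thx": "thanks"}

def replace_oov(text: str, oov_map: dict = OOV_MAP) -> str:
    """Replace frequent abbreviations and misspellings with their corrected forms."""
    return " ".join(oov_map.get(token, token) for token in text.split())

print(replace_oov("plz stay strong gud things will come"))  # -> "please stay strong good things will come"
```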

Preprocessing

Fig. 2
figure 2

Translation of emoji to target language (Malayalam)

Basic pre-processing steps performed are listed below:

  • removing punctuation and extra spaces

  • removing special characters and lowercasing

  • stopword removal

  • converting emoji to text using the demoji [25] library for the English language. For the Tamil and Malayalam datasets, this text is further translated to Tamil or Malayalam, respectively, based on the script the sentence is written in. For example, the first sentence in Fig. 2 translates to “This is in Malayalam” in English and transliterates to “ithu malayalatthilana”; the emoji in it is converted into text using the demoji library. The third sentence translates to “This is in Malayalam thumbs up”, where “thumbs up” from the second sentence has been translated into Malayalam.

  • normalizing contractions: contractions are shortened forms of words joined by an apostrophe, such as “couldn’t” and “won’t”. The contractions were normalized for the English words present in all three datasets to standardize the text; for example, “didn’t” and “wouldn’t” were replaced with “did not” and “would not”. A condensed sketch of these basic pre-processing steps follows this list.
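
The sketch below condenses these basic steps for an English comment, assuming demoji's replace_with_desc helper for the emoji-to-text step; the contraction map is a small illustrative subset, and the further translation of the emoji descriptions into Tamil/Malayalam (Fig. 2) is omitted.

```python
# Condensed sketch of the basic pre-processing steps; the contraction map is
# illustrative and the Tamil/Malayalam translation of emoji text is not shown.
import re
import string
import demoji

CONTRACTIONS = {"didn't": "did not", "wouldn't": "would not", "won't": "will not"}

def preprocess(text: str) -> str:
    text = text.lower()
    for short, full in CONTRACTIONS.items():          # normalize contractions
        text = text.replace(short, full)
    text = demoji.replace_with_desc(text, sep=" ")    # emoji -> textual description
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation/special characters
    return re.sub(r"\s+", " ", text).strip()          # collapse extra spaces
```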

Pre-processing Trade-offs and Analysis

There were several pre-processing stages that we questioned, taking into consideration our model’s objective and the training dataset. As the datasets consist of YouTube comments, the sentences are concise and in the form of “textspeak” where even stop words may contribute to the meaning of the sentence. As a result, stopwords for the English language were not removed. For the Tamil dataset, the stopwords which are part of the dictionary and written in the Tamil script were removed using the advertools [26] library.
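
A short sketch of this step is given below, assuming that advertools exposes Tamil stopwords under the "tamil" key of its stopwords mapping.

```python
# Sketch of native-script Tamil stopword removal; the "tamil" key of the
# advertools stopwords dictionary is assumed to be available.
import advertools as adv

TAMIL_STOPWORDS = set(adv.stopwords["tamil"])

def remove_tamil_stopwords(text: str) -> str:
    """Drop dictionary stopwords written in the Tamil script from a comment."""
    return " ".join(tok for tok in text.split() if tok not in TAMIL_STOPWORDS)
```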

We also considered transliteration but decided against it because it does not contribute significantly to the context of sentences in this dataset. The dataset comprises YouTube comments, most of which contain code-mixed text in which several common nouns are written in English itself, as depicted in Fig. 3.

Fig. 3
figure 3

Transliteration examples

Even though the rest of the sentence consists of Tamil words, the word “college” is in English. Since “college” is not Tanglish (a Tamil word written in the English script), the transliteration library converts each letter of “college” into the corresponding Tamil character, which does not produce a valid Tamil word. The actual Tamil translation of “college” is therefore not reflected after transliteration.

Methodology

The proposed methodology uses a transformer-based approach for identifying hope speech in YouTube comments. We adopt a stacked transformer-based encoder model that incorporates cross-lingual word embeddings. The methodology consists of two variations. In the first variant, the original datasets are used to train the model, and the flow of our approach can be seen in Fig. 5. In the second variant, the split datasets of the two Dravidian languages are used; this approach is depicted in Fig. 4.

Fig. 4
figure 4

Proposed methodology variation-2 for hope speech detection

Fig. 5
figure 5

Proposed methodology variation-1 for hope speech detection

Figure 5 depicts the first variant of the proposed methodology, where the input given to the model translates to “I am studying everything”.

The steps used in both these approaches are as follows:

  1. Cross-lingual word embeddings

    We have employed cross-lingual word embeddings derived from models pretrained on a combination of masked language modeling (MLM) and translation language modeling (TLM) [27]. TLM concatenates parallel sentences from different languages and predicts the masked tokens, forcing the model to take both languages into consideration and thereby aligning their embeddings. Figure 6 demonstrates the working of TLM for two parallel sentences in Tamil and English: the masked Tamil word can be predicted by attending to its English translation and vice versa. The final word embeddings are an aggregation of the language (LE), position (PE) and token (TE) embeddings. The language embeddings are a representation of their respective language codes. The token embeddings consist of the original sentence along with the MASK and separator tokens. The position embeddings (PE) [28] are calculated using the equations given below:

$$\begin{aligned} \mathrm{PE}_{(\mathrm{pos}, 2i)} = \sin \left( \mathrm{pos}/10000^{2i/d_{\mathrm{model}}} \right) , \end{aligned}$$
(1)
$$\begin{aligned} \mathrm{PE}_{(\mathrm{pos}, 2i+1)} = \cos \left( \mathrm{pos}/10000^{2i/d_{\mathrm{model}}} \right) . \end{aligned}$$
(2)
Fig. 6
figure 6

Working of translation language modeling

The length of the positional embedding vector is 512 and is the same as that of the word embedding vector. The positional values for the even elements are calculated using Eq. 1, whereas in the case of odd elements, positional values are calculated using Eq. 2. The final word embeddings are then given as input to the transformer layer.
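
A short numerical sketch of Eqs. (1) and (2) is given below; the vector length of 512 matches the description above, and the function is our own illustration of the standard sinusoidal encoding.

```python
# Worked sketch of Eqs. (1) and (2): sinusoidal positional encodings for
# max_len positions and an embedding dimension d_model of 512.
import numpy as np

def positional_encoding(max_len: int, d_model: int = 512) -> np.ndarray:
    pos = np.arange(max_len)[:, None]                 # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2) dimension indices
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even indices, Eq. (1)
    pe[:, 1::2] = np.cos(angles)                      # odd indices, Eq. (2)
    return pe
```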

Fig. 7
figure 7

Architecture of the bidirectional dual-encoder

The transformer layer consists of a bidirectional dual-encoder architecture [29], as seen in Fig. 7. The source text, in this case Tamil/Malayalam–English code-mixed content, is given as input to the first encoder, and the English translation of the source text is given as input to the second encoder. These inputs are then passed through the hidden layers of each encoder, and the last layer of each encoder produces the final sentence embeddings, s and t, for the first and second encoder, respectively.

Owing to the bidirectional nature of the encoders, the final embeddings s and t are translations of each other. The objective of this architecture is to rank the true translation of s above all other sentences \(t_{i}\) in the set T. Equation 3 is a log-linear model giving the probability distribution over every \(t_i\) for a given source text s, where \(\phi (s_{i}, t_{i})\) is the similarity between source and target.

$$\begin{aligned} P(t_i | s_i) = \frac{e^{\phi (s_i, t_i)}}{\sum _{{\bar{t}} \in T} e^{\phi (s_i, {\bar{t}})}}. \end{aligned}$$
(3)

This probability distribution can be effectively approximated during training by using the other in-batch (cross-accelerator) samples as negatives, as given in Eq. 4.

$$\begin{aligned} P_\mathrm{approx} (t_i | s_i) = \frac{e^{\phi (s_i, t_i)}}{e^{\phi (s_i, t_i)} + \sum _{k=1, k \ne i}^{K}e^{\phi (s_i, t_k)}}. \end{aligned}$$
(4)
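
The sketch below illustrates this approximation numerically, using the dot product of L2-normalized embeddings as the similarity \(\phi\), which is a common choice and an assumption on our part.

```python
# Numerical sketch of Eq. (4): for K in-batch source/target pairs, the softmax
# over each row of the similarity matrix approximates P(t_i | s_i), with the
# other in-batch targets acting as negatives.
import numpy as np

def approx_translation_prob(source_emb: np.ndarray, target_emb: np.ndarray) -> np.ndarray:
    """source_emb, target_emb: (K, d) L2-normalized embeddings of parallel sentences."""
    phi = source_emb @ target_emb.T                         # (K, K) pairwise similarities
    exp_phi = np.exp(phi - phi.max(axis=1, keepdims=True))  # numerically stable softmax
    return exp_phi / exp_phi.sum(axis=1, keepdims=True)     # row i: probabilities over all t_k
```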

Parameter sharing between the two encoders improves the mapping of words with similar meanings in different languages to similar hidden representations. The CLS tokens in the last layers of the encoders contain the final embeddings, which are combined to obtain the cross-lingual word embeddings.
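
As an illustration of such aligned embeddings, the sketch below queries the publicly released sentence-transformers checkpoint of LaBSE [31] (the framework used in our implementation); the checkpoint name and the example sentence pair are for demonstration only.

```python
# Hedged sketch: cross-lingual sentence embeddings from the public
# sentence-transformers LaBSE checkpoint; names and sentences are illustrative.
from sentence_transformers import SentenceTransformer, util

labse = SentenceTransformer("sentence-transformers/LaBSE")
pair = [
    "thalaivar ah mathunga ellam nallabadiya marum",                  # code-mixed Tamil comment
    "Change the leader, and everything will change for the better.",  # its English gloss
]
emb = labse.encode(pair, normalize_embeddings=True)
print(util.cos_sim(emb[0], emb[1]))  # similarity of the comment and its translation
```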

  2. Transformer network

    The Transformer [30] is an attention-based architecture that learns contextual relations between words. It comprises two components, an encoder and a decoder. The encoder is bidirectional, as opposed to directional models that read the text input sequentially. This bidirectional behavior allows the model to learn the context of a word from all of its surroundings (to the left and right of the word), which makes it useful for NLP tasks. Our transformer network consists of a stack of encoders that use a self-attention mechanism. Attention enables the model to highlight essential features by assigning weights to the input features based on their significance, and it can be explained in terms of queries, keys and values. Each word in the sequence can be considered a query that is matched against key–value pairs in order of relevance, so each encoder performs the attention step first. The number of queries depends on the number of words in the sentence. The attention function is computed using the equation given below.

$$\begin{aligned} \mathrm{Attention}(Q,K,V) = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V. \end{aligned}$$
(5)

According to the equation, attention is computed concurrently by packing the queries, keys and values into the matrices Q, K and V, respectively. For example, if our sentence contains 5 words, each matrix has dimension \(5 \times 512\). This process is illustrated in Fig. 8.

Fig. 8
figure 8

Computation of Q, K, V matrices

The attention discussed above is a single attention unit. Our approach uses a variant known as multi-head attention, which consists of 12 attention heads, as seen in Fig. 9.

Fig. 9
figure 9

Attention heads

With multiple attention heads, the network calculates a different set of Q, K and V matrices for each head, where each head corresponds to a single attention unit, i.e., a scaled dot-product attention computed using Eq. 5. The concatenated outputs of the scaled dot-product attentions undergo a linear transformation with \(W^O\), as depicted in Eq. 6, where \(W^O\) is an additional weight matrix learned during training.

$$\begin{aligned} \mathrm{MultiHead}(Q,K,V) = \mathrm{concat}(\mathrm{Attention}_1,\ldots ,\mathrm{Attention}_h)W^O. \end{aligned}$$
(6)
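
The NumPy sketch below works through Eqs. (5) and (6); the input shapes and per-head projection matrices are illustrative, with 12 heads as in our model.

```python
# NumPy sketch of Eqs. (5) and (6): scaled dot-product attention per head and
# the multi-head combination; shapes and weight matrices are illustrative.
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # QK^T / sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over the keys
    return weights @ V                                       # Eq. (5)

def multi_head(X, W_q, W_k, W_v, W_o, h=12):
    """X: (n_words, d_model); W_q/W_k/W_v: lists of h per-head projections; W_o: output projection."""
    heads = [attention(X @ W_q[i], X @ W_k[i], X @ W_v[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_o              # Eq. (6)
```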

The output from the attention layer is passed to a position-wise feed-forward network. The embedding vectors from the last encoder are fed into a fully connected layer followed by a softmax layer, which computes the probability of each class label.

The word embeddings, prepended with the CLS (classification) token, are passed to the model. The encoders in the stack pass their inputs progressively to the next layer, each applying self-attention and propagating the result through a feed-forward neural network; the attention mechanism in each encoder creates context-sensitive representations of each word. The final output of the model is a vector of the hidden size (768), and the output corresponding to the CLS token is used to adapt the model for classification tasks.

The proposed methodology can be summarized by Algorithm 1 given below.

Algorithm 1

This methodology was implemented using the language-agnostic BERT sentence embedding (LaBSE) [31] model through the Simple Transformers library. The cross-lingual word embeddings are obtained from LaBSE, which is pretrained on 6 billion bilingual sentence pairs. The framework supports 109 languages with a 500k vocabulary and provides language-agnostic cross-lingual sentence embeddings for them. Our language-agnostic transformer model consists of an encoder stack of 12 layers with a hidden size of 768 and 12 attention heads, and is used for classification.

Implementation

Following the approach described in the previous section, we trained our model on the Dravidian language datasets for 3 epochs with a batch size of 32; a hedged sketch of this setup is given below. The validation results are reported in Table 3 and pertain to our transformer model trained on the original datasets.
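
A hedged sketch of this training setup with the Simple Transformers library is shown below; the Hugging Face checkpoint name, the argument names and the toy DataFrame are assumptions for illustration rather than the exact configuration we used.

```python
# Hedged sketch of fine-tuning the LaBSE-based classifier with Simple
# Transformers; the checkpoint name and the toy DataFrame are illustrative.
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Labels: 0 = Hope_speech, 1 = Non_hope_speech, 2 = Not in language.
train_df = pd.DataFrame(
    [["thalaivar ah mathunga ellam nallabadiya marum", 0],
     ["Bro neegha sonna eantha app um en kitta ialla", 1]],
    columns=["text", "labels"],
)

model = ClassificationModel(
    "bert",
    "setu4993/LaBSE",  # assumed public LaBSE checkpoint on Hugging Face
    num_labels=3,
    args={"num_train_epochs": 3, "train_batch_size": 32, "overwrite_output_dir": True},
)
model.train_model(train_df)
# result, model_outputs, wrong_preds = model.eval_model(validation_df)
```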

Table 3 Original dataset validation results

Table 4 represents the scores obtained from the models trained on the split datasets. This table consists of the evaluation of the combined predictions obtained from its two underlying models, one being trained on the dataset containing texts written in the native script and the other trained on code-mixed texts.

Table 4 Split dataset validation results

The scores achieved by both variations are similar, which shows that multilingual text can be given as input to the model without needing to be divided by script. To test the language-agnostic nature of our proposed approach, we trained the transformer model on the combined dataset; the results of its performance when tested against the original validation datasets of the Tamil and Malayalam languages are given in Table 5.

Tables 3, 4, 5 contain weighted average precision, recall and F1 scores for comparison.

Table 5 Combined dataset validation results

Result Analysis

Empirical Analysis

To evaluate our proposed methodology against other popular text classification approaches in the NLP sphere, we have experimented with the following.

Traditional and Deep Learning Approaches

Apart from the general preprocessing steps mentioned above, the dataset is further prepared to maximize the learning ability of the classifiers. The texts are represented as vectors using TF-IDF, limited to a maximum of 5000 unique features. For the Malayalam dataset, an additional step was performed to overcome class imbalance using the Synthetic Minority Oversampling Technique (SMOTE) [32]; this procedure was not applied to the Tamil datasets, as they were comparatively more balanced. A sketch of this pipeline is given below.
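
A sketch of this pipeline for the Malayalam data is shown below; train_texts, train_labels and val_texts stand in for the preprocessed dataset splits, and the default SVC kernel is an assumption.

```python
# Sketch of the TF-IDF + SMOTE + SVM baseline; the placeholder variables
# train_texts, train_labels and val_texts represent the preprocessed splits.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE

vectorizer = TfidfVectorizer(max_features=5000)       # at most 5000 unique TF-IDF features
X_train = vectorizer.fit_transform(train_texts)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, train_labels)  # oversample minority classes

clf = SVC().fit(X_res, y_res)
val_predictions = clf.predict(vectorizer.transform(val_texts))
```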

Subsequently, a support vector machine (SVM) [33] classifier was built on the oversampled data; other traditional classifiers, namely Naive Bayes and logistic regression, were also implemented. As a deep learning approach, we built a model using the bidirectional long short-term memory (Bi-LSTM) [34] architecture. The model has four layers: an embedding layer, a Bi-LSTM layer and two dense layers. The embedding dimension was 32 and the vocabulary size was 35029. The first dense layer used the ReLU activation function and the second dense layer had a softmax output. The model was trained for 10 epochs with a batch size of 32; a Keras sketch of this architecture follows. The results obtained on the validation set are given in Table 6.
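
A Keras sketch of the Bi-LSTM model is given below; the LSTM and first dense-layer widths are not stated above and are assumed values.

```python
# Keras sketch of the Bi-LSTM architecture (embedding dim 32, vocabulary size
# 35029, ReLU dense layer, softmax output); LSTM/dense widths are assumed.
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 35029, 32, 3

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Bidirectional(layers.LSTM(64)),             # assumed hidden size
    layers.Dense(64, activation="relu"),               # assumed width
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))
```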

Table 6 Validation results for traditional and deep learning models

Transfer Learning Approach

Transfer learning is a machine learning technique in which a model developed for one task is reused as the starting point for a model on a different task. It accelerates training and lowers the generalization error. We have applied this technique to the Dravidian language datasets, i.e., Tamil and Malayalam.

For all of the above-mentioned approaches, we tokenized the text based on the frequency of occurrence in the given dataset. We now extend our tokenization methods using the pretrained models offered by the iNLTK [35] library, which contains pretrained language models along with support for other NLP tasks for Indic languages. More importantly, the provided pretrained language models also include code-mixed embeddings, which are crucial when dealing with YouTube comments. To utilize these code-mixed language models, we use the Tamil/Malayalam split datasets obtained from the process described in the “Data” section.

The classifiers for Tamil/Malayalam and Tanglish/Manglish are built by fine-tuning a pre-trained language model using ULMFiT [36]. ULMFiT is a method provided by fast.ai [37] that enables transfer learning for any NLP task and provides the key techniques required for fine-tuning language models. It uses the AWD-LSTM [38] (ASGD Weight-Dropped LSTM) architecture for its language modeling, a type of LSTM that uses DropConnect and a variant of averaged SGD along with other well-known regularization strategies. The classifiers for both Tamil and Malayalam were trained for 5 epochs with the dropout and learning rate set to 0.5 and 1e−3, respectively; a fastai-style sketch is given below. The results obtained on the validation set are tabulated in Table 7.
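
A fastai-style sketch of the classifier fine-tuning is given below; loading the iNLTK pretrained language-model weights and vocabulary is omitted, and drop_mult=0.5 is our reading of the dropout setting above.

```python
# Hedged fastai sketch of the ULMFiT classifier (5 epochs, dropout 0.5,
# learning rate 1e-3); the iNLTK pretrained weights are not loaded here.
from fastai.text.all import TextDataLoaders, text_classifier_learner, AWD_LSTM, accuracy

# df: DataFrame of a split (e.g. Tanglish) dataset with "text" and "label" columns.
dls = TextDataLoaders.from_df(df, text_col="text", label_col="label", valid_pct=0.1)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fit_one_cycle(5, 1e-3)
```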

Table 7 Validation results for transfer learning approach

Transformer-Based Approach

Apart from the traditional and deep learning approaches, we also implemented transformer-based models. We implemented the multilingual text-to-text transformer (mT5) [39] architecture for our Dravidian language datasets, training it with a batch size of 4 and evaluating it against the validation data. We further experimented with mBERT, training the model with balanced class weights computed using the scikit-learn library [40] to overcome class imbalance, as sketched below. To extend our experimentation to models pretrained specifically for Indic languages, we used the indic-BERT architecture [41] to train our Tamil and Malayalam datasets. Table 8 reports the validation results.
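
A sketch of the balanced class-weight computation used for mBERT is shown below; the label array is illustrative.

```python
# Balanced class weights via scikit-learn [40]; train_labels is illustrative.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

train_labels = np.array([0, 0, 0, 0, 1, 1, 2])  # toy integer-encoded labels
class_weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(train_labels), y=train_labels
)
print(dict(zip(np.unique(train_labels), class_weights)))  # weights passed to the loss during training
```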

Table 8 Validation results for transformer-based models

From the above experiments, we can observe that the traditional, transfer learning and transformer-based approaches achieve results comparable with our model, but at the cost of considerable time and effort spent on feature engineering and data preparation. Moreover, the above approaches necessitate tailoring the model according to the language of the text in the dataset. Our approach, however, works well in a multilingual context owing to the use of cross-lingual word embeddings, which makes the process of model building simpler. The ability of our language-agnostic model to produce satisfactory results while providing cross-lingual support even where training data is not available (zero-shot cases) [31] makes it an attractive choice for our dataset of YouTube comments.

Statistical Analysis

To further ground our approach among other pre-existing models for text classification, we evaluated the learning ability of our models using k-fold (fivefold) cross-validation. Among all the models we implemented, we chose the best-performing model from each approach, as elaborated in “Empirical Analysis”.

Tamil Language

Figures 10 and 11 compare the performance of our proposed model against SVM, the best-performing model under the traditional implementation, and indic-BERT, the best-performing model under the transformer implementation, respectively. Table 9 contains the performance metrics of the various models over the five folds.

Fig. 10
figure 10

Comparison of our Tamil language approach against SVM

Fig. 11
figure 11

Comparison of our Tamil language approach against indic-BERT

Table 9 Cross-validation results for Tamil language

Malayalam Language

Figures 12 and 13 compare the performance of our proposed model against SVM, the best-performing model under the traditional implementation, and indic-BERT, the best-performing model under the transformer implementation, respectively. Table 10 contains the performance metrics of the various models over the five folds.

Fig. 12
figure 12

Comparison of our Malayalam language approach against SVM

Fig. 13
figure 13

Comparison of our Malayalam language approach against indic-BERT

Table 10 Cross-validation results for Malayalam language

T-test

To show that the improvement obtained by our model is statistically significant, we performed a k-fold paired t test. The t test was one-tailed, type 1 (paired), with an alpha value of 0.05. The precision, recall and F-score values over the five folds were used to compute the t test values.

Table 11 shows the p values obtained from the t test computed for our approach and SVM. All the values are less than 0.05, which shows that the improvement of our approach over SVM is statistically significant for both Tamil and Malayalam. Similarly, Table 12 shows that the improvement in performance of our approach over indic-BERT is statistically significant. A sketch of this test is given below.
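
A SciPy sketch of the fivefold paired t test is shown below; the fold-wise F1 values are illustrative, and the two-sided p value returned by ttest_rel is halved for the one-tailed test (assuming the proposed model has the higher mean).

```python
# Sketch of the fivefold paired t test; the fold-wise scores are illustrative.
from scipy import stats

proposed_f1 = [0.84, 0.85, 0.83, 0.86, 0.85]   # our model, per fold
baseline_f1 = [0.80, 0.81, 0.79, 0.82, 0.80]   # e.g. SVM, per fold

t_stat, p_two_sided = stats.ttest_rel(proposed_f1, baseline_f1)
p_one_tailed = p_two_sided / 2                  # one-tailed, given t_stat > 0
print(p_one_tailed < 0.05)                      # True -> statistically significant improvement
```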

Table 11 SVM fivefold paired t test
Table 12 indic-BERT fivefold paired t test

Results and Performance Comparison

Results

Among the models tested on the validation data, one model per language performed marginally better than the others: our language-agnostic transformer model performed the best for both Tamil and Malayalam. Table 13 depicts the results obtained by our model on the Tamil and Malayalam test sets and contains the weighted average precision, recall and F1 scores.

Table 13 Test set results

Figures 14 and 15 depict the confusion matrix of Tamil and Malayalam test sets respectively. On observing Fig. 14, we can conclude that out of 2020 instances, our transformer model for the Tamil language has predicted 1234 instances accurately. Moreover for Malayalam, our transformer model classified 907 out of 1071 total instances correctly.

Fig. 14
figure 14

Confusion matrix for Tamil test set

Fig. 15
figure 15

Confusion matrix for Malayalam test set

Performance Comparison with State-of-the-Art Results

Table 14 Performance comparison between methods for the Tamil language
Table 15 Performance comparison between methods for the Malayalam language

In this section, the state-of-the-art methods specified in [10] are compared with the performance of our proposed methodology. The metrics chosen for comparison are the weighted average precision, recall and F1-score, and the values are presented in Tables 14 and 15. For the Tamil language, our methodology achieves an improvement of 8.92% in F1-score over the baseline approach; for Malayalam, an improvement of 16.43% is observed. As seen in Table 14, the ULMFiT methodology [17] achieved the highest scores, and our methodology is on an equal footing with that work. The same is observed for Malayalam, where our methodology’s performance is equivalent to that of the ensemble of LR, XGB and MLP [16].

Conclusion

We have presented a language-agnostic transformer model to detect hope speech in the Dravidian language datasets. The proposed model makes use of a bidirectional dual-encoder to produce the embeddings for the texts in our dataset. By utilizing attention mechanisms, the network better understands contextual relationships between words, as evidenced by the comparison of our proposed approach with other methodologies in the k-fold paired t test scores. Moreover, this approach requires less data augmentation and reduces overhead compared with the traditional, transfer learning and transformer-based methods we evaluated. The proposed model achieved an F1-score of 0.85 in Malayalam and 0.61 in Tamil, which exceeds the baseline scores for the respective languages. Our study contributes to the reduction of abrasive content on social media by filtering out negative comments so that only positive and hopeful comments are retained. In the future, we would like to extend this work by building an ensemble of transformer-based models that use cross-lingual word embeddings.