Towards filtering undesired short text messages using an online learning approach with semantic indexing
Introduction
Combating spam is an important problem in the online world. Since the word “spam” was first used to describe an unsolicited bulk message, this plague has “infected” almost all popular types of text-based electronic communication. Although email spam is the most widely recognized form, spam is also spreading to short text message applications, such as blogs, instant messaging, mobile phone messaging (SMS), and social media.
The volume of text information is increasing rapidly and a significant amount of it is spam, which makes manual selection impractical and gives automatic expert and intelligent spam detection systems an important role in filtering undesired content (Alsaleh, Alarifi, Al-Quayed, & Al-Salman, 2015). This problem is an example of adversarial classification, in which spammers constantly attempt to evade filtering while the predictive models try to adapt to continuously evolving spamming techniques (Bratko, Filipič, Cormack, Lynam, & Zupan, 2006; Dalvi, Domingos, Mausam, Sanghai, & Verma, 2004).
According to Akismet, a spam filtering service for blog comments, its systems have kept spam off the web at an average rate of about 7.5 million spam comments per hour. On average, legitimate blog comments account for less than 5% of the total messages published. A report by Nexgate, a computer security company, showed that on social media sites, such as Facebook and YouTube, 1 in 200 messages contains spam, including lures to adult content and malware. In fact, experts estimate that as many as 40% of social network accounts are used to disseminate spam.
Some types of spam can harm users. On blogs, for instance, text comments represent between 15% and 30% of the total blog content and are therefore an inseparable part of each blog and a motivation for authors to keep publishing (Alberto, Lochter, & Almeida, 2015a; Mishne & Glance, 2006). If this interaction is flooded with undesired comments, the quality of the information drops and search engines can be confused, which directly affects reader traffic (Mishne & Glance, 2006).
While some users are aware of spam, most of them lack the knowledge to deal with it and are often lured into problems such as hacking and phishing (Ridzuan, Potdar, & Hui, 2012). Traditional methods for preventing it, such as user registration, CAPTCHA, and IP blacklisting, might limit the ability of automatic spam bots, but they also tend to hinder the experience of legitimate users (Alsaleh et al., 2015). Moreover, spam messages are sent not only by bots, but also by people who pose as legitimate users and attempt to post messages with links and advertisements (Alberto et al., 2015a). To make matters worse, these messages are usually very short and rife with slang, acronyms, symbols, and misspelled words, which hinder both the computational representation of their content and the learning process necessary to filter them automatically.
Many of the traditional text categorization techniques cannot be employed to deal with real spam problems in short text messages because they require either that all examples be stored in memory or that they be presented simultaneously, a process known as batch or offline learning. The predictive model created by offline classification methods is static, which harms spam detection performance, since spammers tend to adapt and change their message style to slip through filtering techniques (Bratko et al., 2006). Moreover, since the messages are usually very short and written with arbitrary grammar, they are prone to redundancy, polysemy, and synonymy, which make the computational representation of the samples more difficult and thus hamper the learning process.
Given this scenario, in this study we evaluated MDLText, a new text classification approach based on the minimum description length (MDL) principle (Rissanen, 1978), to filter spam in short and noisy text messages. This method can be easily deployed in an expert system for spam detection and offers many desirable characteristics, such as (1) incremental learning, which is necessary for online and dynamic scenarios, and (2) an inherent ability to prevent overfitting, because it selects the model that fits the data well while naturally favoring less complex models.
We conducted a comprehensive performance evaluation using the proposed text classifier in online spam detection, and compared our approach with benchmark online learning methods. We also investigated the impact of applying text normalization and semantic indexing techniques to avoid common text problems and improve sample computational representation. In addition, based on our findings, we proposed a new ensemble approach that combines the predictions obtained by the classifiers using the original text messages and their variations generated after applying text normalization and semantic indexing.
In summary, the proposed MDLText categorization method, which is robust and online, is assisted by text processing techniques that remove noise and enrich the text samples using background expertise. The approaches proposed in this paper have expert-level competence and provide powerful and flexible means of obtaining solutions to the spam detection problem in short text messages.
The remainder of this study is organized as follows: in Section 2, we briefly describe the related work available in the literature. The basic concepts of the MDL principle are given in Section 3. In Section 4, we present the text classification approach. In Section 5, we discuss the main concepts about text normalization and semantic indexing techniques. In Section 6 we present the ensemble approach. Section 7 describes our experimental setup. Section 8 is devoted to our experimental results. Finally, Section 9 concludes the study and offers guidelines for future work.
Related work
Some years ago, the main target of spammers was email. However, with its decreasing popularity, and mainly due to the popularization of smartphones, spam has invaded all electronic platforms across all media, and new types of spam have been emerging. Many of them are spreading to short text message applications, such as short message service (SMS), online instant messages (IM), comments on blogs, and social media. In this section, we first discuss the main environments
The MDL principle
The MDL principle was introduced by Rissanen (1978, 1983) for the problem of model selection. It is based on the idea that the model that fits the data better can also provide a more compact description of the data. The more regularity detected, the more the model has learned about the data (Grünwald, 2005). In terms of coding, this means that the best model is the one that provides the shortest description length for the given data.
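To make the trade-off between fit and compactness concrete, the following is a minimal sketch of a crude two-part code for a binary sequence. It is our own illustration and is not taken from the cited works. We assume that a Bernoulli parameter stored with k bits of precision costs k bits to describe (the cost of encoding k itself is ignored), and that the data cost is the usual negative log-likelihood in bits. The names data_cost_bits and best_model are hypothetical.

```python
import math

def data_cost_bits(bits, p_one):
    """Code length, in bits, of a binary sequence under Bernoulli(p_one)."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return -(ones * math.log2(p_one) + zeros * math.log2(1.0 - p_one))

def best_model(bits, max_precision=8):
    """Return (total_bits, precision, p_one) with the shortest two-part code.

    Assumption: a parameter stored with k bits of precision costs k bits,
    so finer-grained (more complex) models pay a larger model cost.
    """
    best = None
    for k in range(1, max_precision + 1):
        # With k bits of precision, the parameter is one of j / 2**k.
        for j in range(1, 2 ** k):
            p_one = j / 2 ** k
            total = k + data_cost_bits(bits, p_one)  # L(M) + L(D | M)
            if best is None or total < best[0]:
                best = (total, k, p_one)
    return best

data = [1, 1, 1, 0] * 25  # 75 ones and 25 zeros
total_bits, precision, p_one = best_model(data)
```

On this toy sequence, the coarsest model (precision 1, parameter 0.5) fits poorly, while models with more than 2 bits of precision cannot fit better than 0.75 and only add model cost, so the 2-bit model is selected.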
Mathematically, given a set of potential models the
Mathematical basis of MDLText
Given an unlabeled text document d, the MDLText (Silva et al., 2017) uses the main equation of the MDL principle (Eq. 1) to predict the class of the document. The set of potential classes represents the set of potential models M, while d represents the data X. Therefore, d receives the label j, which corresponds to class cj with the minimum overall description length related to d:
We have ignored the description length of the potential classes (models) because
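The decision rule described above (assign d to the class with the minimum description length, ignoring the cost of describing the classes themselves) can be illustrated with the following minimal sketch. It is a deliberately simplified stand-in and not the MDLText formulation of Silva et al. (2017): we assume that the description length of a document under a class is the sum of the negative base-2 logarithms of additively smoothed token frequencies, and the class name MdlSketchClassifier and its methods are hypothetical.

```python
import math
from collections import Counter, defaultdict

class MdlSketchClassifier:
    """Toy classifier: the label is the class giving the shortest code for d."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.token_counts = defaultdict(Counter)  # class -> token -> count
        self.vocabulary = set()

    def update(self, tokens, label):
        """Incrementally add one labeled document (no retraining needed)."""
        self.token_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def description_length(self, tokens, label):
        """Bits needed to encode the tokens with the statistics of one class."""
        counts = self.token_counts[label]
        # One extra vocabulary slot is reserved for unseen tokens.
        total = sum(counts.values()) + self.smoothing * (len(self.vocabulary) + 1)
        return sum(-math.log2((counts[t] + self.smoothing) / total) for t in tokens)

    def predict(self, tokens):
        """Return the class with the minimum description length.

        Requires at least one prior call to update().
        """
        return min(self.token_counts,
                   key=lambda c: self.description_length(tokens, c))

clf = MdlSketchClassifier()
clf.update("win free prize now".split(), "spam")
clf.update("free cash call now".split(), "spam")
clf.update("see you at lunch".split(), "ham")
clf.update("call me after lunch".split(), "ham")
label_a = clf.predict("free prize".split())
label_b = clf.predict("lunch at noon".split())
```

Because update only increments counts, the same sketch supports the incremental learning setting: a new labeled message changes the model without revisiting earlier samples.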
Text normalization and semantic indexing
Messages propagated through recent means of electronic communication, over the Internet or smartphones, are usually very short and rife with idioms, slang, symbols, emoticons, and abbreviations. With such characteristics, established text categorization approaches suffer serious performance degradation when applied to filter spam in these messages. However, in a recent study, Almeida et al. (2016) demonstrated that traditional spam filters can have their performance greatly improved by the employment
Ensemble of predictions by combining different expansions
Considering we can create ten new processed text documents from each single original message, we can combine them in an ensemble of classifiers instead of using them individually. Therefore, in this study, we evaluate a new ensemble approach that combines the individual predictions obtained using the original messages with the ones generated by the TextExpansion tool (Figure 5).
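A minimal sketch of such an ensemble is given below. The combination rule is an assumption on our part: we use a simple majority vote, with one classifier per text representation. The three expansion functions and the keyword rules are hypothetical toy stand-ins and do not reproduce the TextExpansion tool or any trained model.

```python
from collections import Counter

# Hypothetical normalization dictionary, for illustration only.
LINGO = {"fr33": "free", "u": "you", "4": "for"}

def identity(text):
    return text

def lowercase(text):
    return text.lower()

def normalize(text):
    """Lowercase the text and replace known lingo terms."""
    return " ".join(LINGO.get(word, word) for word in text.lower().split())

def rule_free(text):
    """Toy classifier standing in for a model trained on one representation."""
    return "spam" if "free" in text.split() else "ham"

def rule_prize(text):
    return "spam" if "prize" in text.split() else "ham"

def ensemble_predict(message, expanders, classifiers):
    """Majority vote over one prediction per text representation.

    expanders[i] maps the raw message to the i-th representation and
    classifiers[i] is the model associated with that representation.
    """
    votes = [clf(expand(message)) for expand, clf in zip(expanders, classifiers)]
    return Counter(votes).most_common(1)[0][0]

expanders = [identity, lowercase, normalize]
classifiers = [rule_free, rule_prize, rule_free]
vote_a = ensemble_predict("FR33 prize 4 u", expanders, classifiers)
vote_b = ensemble_predict("see u at lunch", expanders, classifiers)
```

With one original and ten expanded representations there are eleven voters, so a two-class majority vote cannot tie.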
As shown in Figure 5, there is one predictive model generated using the original training samples and ten other
Experimental settings
To simulate a real spam filtering scenario, we consider that only a small number of text messages are available to train the classifier (20% of the messages in each class). Next, one message at a time is presented to the classifier, which makes its prediction. Then, the classifier receives the user feedback and computes the loss incurred. If the loss is greater than 0, the model is updated with the true label. The overall process is described in Algorithm 1.
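The protocol just described can be sketched as follows. This is an illustration under our own assumptions and not a transcription of Algorithm 1: we use a 0/1 loss, shuffle the samples with a fixed seed, and plug in a hypothetical token-count learner (TokenCountClassifier) only so that the example runs on its own.

```python
import random
from collections import Counter, defaultdict

class TokenCountClassifier:
    """Hypothetical stand-in learner that scores classes by token frequency."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, tokens, label):
        self.counts[label].update(tokens)

    def predict(self, tokens):
        def score(label):
            counts = self.counts[label]
            total = sum(counts.values()) + 1
            return sum(counts[t] / total for t in tokens)
        return max(self.counts, key=score)

def run_online(samples, classifier, seed_fraction=0.2, rng=None):
    """Seed the model with a fraction of each class, then learn from mistakes."""
    rng = rng or random.Random(0)
    by_class = defaultdict(list)
    for tokens, label in samples:
        by_class[label].append((tokens, label))

    stream = []
    for items in by_class.values():
        rng.shuffle(items)
        n_seed = max(1, int(seed_fraction * len(items)))
        for tokens, label in items[:n_seed]:
            classifier.update(tokens, label)  # initial training set
        stream.extend(items[n_seed:])
    rng.shuffle(stream)

    predictions, mistakes = [], 0
    for tokens, label in stream:  # one message at a time
        predicted = classifier.predict(tokens)
        loss = 0 if predicted == label else 1  # assumed 0/1 loss
        if loss > 0:
            classifier.update(tokens, label)  # feedback carries the true label
            mistakes += 1
        predictions.append((label, predicted))
    return predictions, mistakes

samples = [(["free", "prize", str(i)], "spam") for i in range(10)]
samples += [(["lunch", "today", str(i)], "ham") for i in range(10)]
predictions, mistakes = run_online(samples, TokenCountClassifier())
```

Any learner exposing predict and update in this form could replace the stand-in, which is what makes the protocol suitable for comparing online methods.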
In Algorithm 1, we consider
Results
Table 3 shows the average SC, BH, and MCC obtained in 50 runs of the experimental scheme described in Algorithm 1. For each evaluated method and dataset, we present the results obtained with: (1) the original text samples (column “Orig.”), (2) the expanded text samples in which the best MCC score was obtained (column “Exp.”), and (3) the ensemble approach (column “Ens.”). The results are sorted by MCC.
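For reference, the three measures can be computed from a confusion matrix as in the sketch below. We assume here that SC denotes the spam caught rate (the fraction of spam correctly flagged), that BH denotes the blocked ham rate (the fraction of legitimate messages wrongly flagged), and that MCC is the Matthews correlation coefficient. The function name spam_metrics and the label strings are hypothetical.

```python
import math

def spam_metrics(y_true, y_pred, positive="spam"):
    """Return SC, BH, and MCC, treating `positive` as the spam class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1

    sc = tp / (tp + fn) if tp + fn else 0.0  # spam caught (assumed meaning)
    bh = fp / (fp + tn) if fp + tn else 0.0  # blocked ham (assumed meaning)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SC": sc, "BH": bh, "MCC": mcc}

y_true = ["spam"] * 4 + ["ham"] * 4
y_pred = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
scores = spam_metrics(y_true, y_pred)
```

In this toy case three of four spam messages are caught and one of four legitimate messages is blocked, which gives SC = 0.75, BH = 0.25, and MCC = 0.5.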
Bold values indicate the best score for each one of the columns for each dataset. The
Conclusions
Spam has once again become a truly challenging problem. Besides being a classical type of adversarial classification problem, it increasingly demands online and dynamic prediction models. Due to the increasing popularity of smartphones, this plague is migrating fast to new means of electronic communication characterized by short text messages. In these environments, the text documents are usually very short and rife with slang, abbreviations, symbols, emoticons, and misspelled words
Acknowledgments
The authors are grateful for financial support from the Brazilian agencies FAPESP, Capes, and CNPq (grant 141089/2013-0).
References (57)
- et al. (2015). Semi-supervised learning using frequent itemset and ensemble learning for SMS classification. Expert Systems with Applications.
- et al. (2012). Facing the spammers: A very effective approach to avoid junk e-mails. Expert Systems with Applications.
- et al. (2015). Combating comment spam with machine learning approaches. Proceedings of the 14th international conference on machine learning and applications (ICMLA’15).
- et al. (2014). LIBOL: A library for online learning algorithms. Journal of Machine Learning Research.
- (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory.
- et al. (2006). SVM-based spam filter with active and online learning. Proceedings of the 15th text retrieval conference (TREC’06).
- et al. (2015). Post or block? Advances in automatically filtering undesired comments. Journal of Intelligent & Robotic Systems.
- et al. (2015). TubeSpam: Comment spam filtering on YouTube. Proceedings of the 14th international conference on machine learning and applications (ICMLA’15).
- et al. (2015). An autonomous online malicious spam email detection system using extended RBF network. Proceedings of the 2015 international joint conference on neural networks (IJCNN’15).
- et al. (2011). Contributions to the study of SMS spam filtering: New collection and results. Proceedings of the 11th ACM symposium on document engineering (DOCENG’11).