1 Introduction

Recently, some researchers [8, 9, 11, 17, 19, 22, 25, 31] have focused on user narratives in order to study usability or User eXperience (UX). In addition to detailed UX descriptions of users’ daily lives, the study of user narratives has enabled the identification of usability problems [31]. Thus, some initiatives [7, 9, 11, 17, 19, 21] have arisen to evaluate systems in a textual way. Among those, some papers [7, 15, 16, 17, 18, 19, 21] have focused on the analysis of Postings Related to the Use (PRUs). PRUs are public postings from users who discuss the system while using it. These postings are characterized by spontaneity, since users are not required to produce a report of their use. While using the system, they spontaneously mention facts about it, such as their doubts, compliments, difficulties, reviews, suggestions, or even experience reports. The authors of [17] affirm that the spontaneity of the report is important, since questioning by evaluators, through interviews or questionnaires [9, 15, 19], or by the system itself, as in event-triggered experience sampling [20], can influence both the responses provided and the UX itself during use of the system.

In previous papers, the authors [15] investigated Twitter PRUs with regard to the presence and absence of sentiment in them (Table 1) and observed that both postings with sentiment and neutral postings (without sentiment) are important for obtaining some perception of the system. Thus, identifying the polarity of PRUs is relevant to evaluate the user’s satisfaction or dissatisfaction (frustration) with the system, whereas neutral postings are relevant to identify doubts about system functionalities.

Table 1. Classification of postings with and without sentiment on Twitter.

However, the automatic detection of polarity in sentences is a complex scientific challenge due to the subjectivity present in natural language. Sentiment Analysis is a research area that aims to define automated techniques for extracting subjective information, such as opinions and sentiments, from texts in natural language, in order to create structured knowledge that can be used by a support or decision-making system [6].

In addition to the existing challenges in the automatic detection of polarity in sentences, a differential of this work is the study of polarity in spontaneous PRUs. Users can express their opinion about the system in several ways during its use. They can address another user, a system administrator, or even the system itself, as in these PRUs [17]: “Dear Twitter, it is {dirty word} to have to delete the tweet when I misspell a word, so put the ‘edit tweet’ option”; “Twitter, remove the line break, please, it’s much better without it!”.

Another differential consists in analyzing texts to evaluate a system. Like [15], we believe that the posture of users on a product-review website is different from that of users who face a problem while using a system and decide to report it, to vent, or even to suggest a solution. As in [15], we found PRUs relating quite specific information about the use, such as: “Folks, sorry, but I’ve just had a problem here on Twitter and since then I cannot use punctuation once it opens the Twitter menu whenever I try to”, which also shows that users may report an error in the very system they are using.

Thus, the objective of this work is to investigate the automatic classification of the polarity of opinion in spontaneous PRUs, specifically: (a) which words are most relevant to each polarity?; (b) what are the characteristics of these words?; and (c) are the investigated classifiers sufficient for use in an automatic textual evaluation?

In this work, we analyzed 1,345 postings from an academic system with social characteristics (e.g., communities, forums, chats). Two investigations with the PRUs were performed: (1) using the automatic SentiStrength classifier [28] and (2) applying Data Mining (DM) algorithms.

This article is organized as follows: in the next section, we present a background on the textual evaluation of systems and on sentiment analysis. In the third section, we present research related to ours. In the fourth section, we describe the investigative studies, followed by results, discussion, conclusions, and future work.

2 Background

2.1 Textual Evaluation of Systems

The textual evaluation of systems consists of using user narratives to evaluate or obtain some perception of the system under evaluation [17]. With textual evaluation, it is possible to evaluate one or more quality-of-use criteria, such as usability, UX, and/or their facets (satisfaction, memorability, learnability, efficiency, effectiveness, comfort, support, etc.) [9, 11, 17, 19, 25]. Other criteria can also be evaluated, such as privacy [13], credibility [4, 13] and security [27]. Evaluation forms vary from identifying the context of use to identifying the facets of usability or UX. Some papers have specifically analyzed the most satisfactory and unsatisfactory user experiences with interactive systems [8, 22, 25, 31].

Textual evaluation can be manual, through questionnaires with questions about the use of the system or through experience reports, in which users are asked to describe their perceptions of or sentiments about the use of the system. The other way is automatic: evaluators can collect product evaluations from rating sites [9] or extract PRUs from Social Systems (SS) [15, 17, 19]. The automatic form yields more spontaneous reports, including doubts that arise when using the system, but, on the other hand, it may also capture many texts that are not related to the use of the system, and these must be discarded.

Textual evaluation has advantages and disadvantages, as do other types of HCI assessment, such as user testing and heuristic evaluation. The main advantage is that it considers users’ spontaneous opinions about the system, including their doubts. The main disadvantage is the long time required for text analysis. However, there are still few initiatives for automatic textual evaluation, since it is a new evaluation type.

2.2 Sentiment Analysis

Sentiment analysis is the field of study that analyzes people’s opinions, such as their feelings, evaluations, attitudes, and sentiments toward products, services, organizations, people, problems, and events, as expressed in texts (reviews, blogs, discussions, news, comments, feedback, or any other document) [12]. According to [39], sentiment analysis is a type of subjectivity evaluation that focuses on identifying positive and negative opinions, sentiments, and evaluations expressed in natural language.

According to [14], sentiment analysis techniques can be classified according to the approach they are based on: lexicon-based techniques, which use a lexicon of sentiments (a collection of precompiled sentiment items, such as dictionaries); machine-learning techniques, which make use of well-known machine learning algorithms for text classification; or hybrid techniques, which combine both approaches.

There are works in this area that seek to discover neutral terms in phrases or sentences [10, 24], others that focus on polarity recognition in order to classify sentences as positive or negative [5, 24, 32], and still others that try to identify different degrees of positivity and negativity, such as strongly negative, weakly negative, neutral, weakly positive, and strongly positive [1, 3, 5, 33]. However, none of these studies investigated spontaneous SS PRUs.

In this work, sentiment analysis is applied to classify the sentiment of PRUs as positive, negative, or neutral. We compared two sentiment classification techniques, one based on lexicon and the other based on machine learning, in order to identify the one that best suits the domain of system evaluation from PRUs.

3 Related Works

Some papers that have focused on user narratives in order to study their polarity are [8, 22, 31]. In [8], the authors collected 500 texts written by users of interactive products (mobile phones, computers, etc.) with the purpose of studying UX from positive experiences. The narratives were collected from reports in which users were asked to describe such experiences. The authors analyzed the structure of satisfaction, the links between needs, affect, and product, and the differences between categories of experience, and presented measures for classifying a UX target as positive. In [22], the authors collected 90 reports of users’ most satisfactory and most unsatisfactory experiences, in order to evaluate the UX of beginner users of augmented reality mobile applications. The narratives were extracted through online questionnaires and were analyzed with the objective of identifying the UX goal, the activity in which the user was involved, the characteristics of the reported experience, and the application resources that helped or hindered the experience. The main contribution of this study was an understanding of the measures adopted to improve these applications by analyzing user satisfaction.

In [31], the authors studied 691 user-generated narratives of positive and negative experiences with technologies in order to study UX. The narratives were also obtained through online questionnaires and were analyzed with the objective of identifying the main themes portrayed in the texts. The authors proposed a technique similar to an affinity diagram, which consisted of grouping a large number of ideas, opinions, and pieces of information according to their affinities. This study showed the importance of narrative content derived from positive and negative classifications.

Although such works have contributed to studies of positive and negative textual analysis, the texts collected were not spontaneous, and a study of the polarity of the sentences was not performed.

4 Investigations

The investigations were carried out on PRUs written in Brazilian Portuguese, collected from the database of an academic system with social characteristics (communities, discussion forums, chats, etc.) called SIGAA, the academic control system of the Federal Universities in Brazil. In this system, students have access to several functionalities, such as proof of enrollment, academic reports, and the enrollment process. The system allows the exchange of messages in a discussion forum. Its users are students and employees of the university. For this work, 408 PRUs were selected from a part of the database coming from a previous work [17]. The database contained 1,345 postings. The selection criterion was to collect postings in which users were talking about the system. An example of a collected PRU is: “I cannot stand this SIGAA anymore!”. Postings from students asking questions about their graduation courses, grades, location, etc. were not selected, for example: “Will the coordination work during vacation?” and “Professors did not post the grades yet”.

The PRUs contained between one and six sentences each, and a single posting often mixes polarities, sometimes starting by criticizing the system and ending up praising it, for example: “I think this new system has a lot to improve” (negative sentiment)…“However, it is already much better than the previous one” (positive sentiment). For this reason, we divided the PRUs into 1,100 sentences. After this division, we performed another analysis in order to separate the sentences related to the use of the system from the unrelated ones, since there were sentences such as “Good morning”, “Thank you”, “Sincerely…”, and “Coordination is on strike”, which were not related to the use of the system. We discarded such sentences, resulting in 832 sentences related to the use. After this step, two of the authors of this article categorized the sentences, resulting in 99 positive, 229 negative, and 504 neutral ones.

We carried out two investigations. The first aimed to assess the quality of lexicon-based classification of PRUs. Lexical classification requires a set of terms in which each term is associated with a sentiment (positive, negative, or neutral). For this experiment, we used the DM tool SentiStrength [30], which combines supervised and unsupervised classification methods; we used version 2.2 of the tool, available at [28]. In the second investigation, we used version 7.1 of the RapidMiner tool [26]. RapidMiner is an open-source tool widely used in DM to apply learning algorithms (supervised or not). In this experiment, the Naïve Bayes algorithm [29] was used. We then compared the results of the experiments by calculating the following evaluation metrics: (a) coverage; (b) agreement; (c) accuracy; (d) precision; (e) recall; and (f) F-measure.

4.1 First Investigation

In this investigation, we used the SentiStrength tool [30]. The tool works with a dictionary that contains several words of a given language, and there are three possible values for each word: a positive strength, varying from 1 (not positive) to 5 (extremely positive); a negative strength, varying from –1 (not negative) to –5 (extremely negative); or 0 if the word is not contained in the dictionary, in which case it is classified as neutral [5]. Using this dictionary, each word of each sentence in the input text document receives its respective positive, negative, or neutral score. Then, the positive and negative polarity strengths of each sentence are summarized. For instance, for the sentence “I like and approve the changes”, the tool scores each word as follows: “I |0| like |3| and |0| approve |2| the |0| changes |0|”. Then, the positive and negative marks of the sentence are calculated. In this case, we would have the scores 3 (the highest strength among the positive valuations present in the sentence) and –1 (since there is no negative strength to consider; the valuation –1 indicates the absence of any negative feeling). Therefore, this sentence is assigned the pair of values (3, –1), implying a sentence with predominantly positive strength.
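To make this dual-score scheme concrete, the sketch below reproduces its logic in Python. It is a minimal illustration, not SentiStrength itself: the lexicon and its strength values are hypothetical toy examples.

```python
# Minimal sketch of SentiStrength-style dual scoring.
# The lexicon below is a hypothetical toy example, not the tool's dictionary.
TOY_LEXICON = {"like": 3, "approve": 2, "hate": -4, "slow": -2}

def dual_score(sentence: str) -> tuple[int, int]:
    words = sentence.lower().split()
    scores = [TOY_LEXICON.get(w, 0) for w in words]
    positive = max([s for s in scores if s > 0], default=1)   # 1 = no positive sentiment
    negative = min([s for s in scores if s < 0], default=-1)  # -1 = no negative sentiment
    return positive, negative

# Reproduces the example from the text: (3, -1), predominantly positive.
print(dual_score("I like and approve the changes"))
```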

To perform DM on textual collections using the SentiStrength tool, it was necessary to adapt the database to a format accepted by the tool. We applied data pre-processing techniques, such as data cleaning, which consisted of removing accents and cedillas, because these special characters are not recognized by the tool. After that, we transformed the file to be used as input to the tool. After these steps, the tool generates a new document with the following results: the sentence, the sentence rewritten with the score of each word, and the final strength of the sentence.
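As an illustration of this cleaning step, the sketch below removes accents and cedillas with Python’s standard unicodedata module; the file names are hypothetical.

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose characters (NFD) and drop combining marks,
    # so that, e.g., "avaliação" becomes "avaliacao".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Hypothetical input/output files prepared for the SentiStrength run.
with open("sentences.txt", encoding="utf-8") as src, \
     open("sentences_clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(strip_accents(line))
```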

4.2 Second Investigation

In the second investigation, we used the RapidMiner tool [26] and the Naïve Bayes algorithm [29] to classify the polarity of the PRUs. The Naïve Bayes algorithm was used in this work because it is one of the Bayesian learning techniques most used in text classification problems [6]. Naïve Bayes is a probabilistic classifier based on the application of Bayes’ theorem. It calculates the probability of a new document belonging to each of the categories and assigns it to the category with the highest probability value [6].
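In the standard formulation, given the terms $t_1, \dots, t_n$ of a sentence and the “naïve” assumption that terms are conditionally independent given the class, the classifier assigns the class with the highest posterior probability:

$$\hat{c} \;=\; \arg\max_{c \,\in\, \{\text{positive},\,\text{negative},\,\text{neutral}\}} \; P(c) \prod_{i=1}^{n} P(t_i \mid c)$$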

In order to apply the algorithm in the tool, it was also necessary to adapt the database using the following data pre-processing techniques: tokenization (sectioning sentences into minimal units called tokens; a word, for instance, is a token); removal of stopwords (tokens without semantic value, such as articles and pronouns); text cleanup (removal of accents, normalization of characters to lowercase); and stemming (reduction of tokens to their radical [23]). The main objective of applying these techniques was to reduce the size of the lexicon and the computational effort, thus increasing the accuracy of the results.
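A minimal sketch of this pipeline for Brazilian Portuguese, assuming the NLTK library (its Portuguese stopword list and the RSLP stemmer for Portuguese); the sample sentence is illustrative.

```python
import re
import unicodedata

import nltk
from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer

nltk.download("stopwords")  # Portuguese stopword list
nltk.download("rslp")       # RSLP stemmer for Portuguese

STOPWORDS = set(stopwords.words("portuguese"))
STEMMER = RSLPStemmer()

def strip_accents(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def preprocess(sentence: str) -> list[str]:
    tokens = re.findall(r"\w+", sentence.lower())        # tokenization + lowercasing
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    stems = [STEMMER.stem(t) for t in tokens]            # stemming to the radical
    return [strip_accents(s) for s in stems]             # text cleanup

print(preprocess("Não aguento mais esse SIGAA!"))
```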

Then, it was necessary to format the data appropriately for the tool input. The technique used was the Vector Space Model, in which each sentence is represented by a vector of terms, and each term has an associated value indicating its degree of importance in the document. The weight of a term in a document can be calculated in several ways; for this experiment, the weight was the term frequency.
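The sketch below builds such term-frequency vectors with scikit-learn’s CountVectorizer; this is an assumed stand-in for the RapidMiner operators actually used, and the preprocessed sentences are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences already preprocessed into stems.
sentences = ["aguent sigaa", "sistem nov melhor", "sigaa melhor sistem"]

vectorizer = CountVectorizer()             # weight = raw term frequency
X = vectorizer.fit_transform(sentences)    # one row vector per sentence

print(vectorizer.get_feature_names_out())  # the terms (vector dimensions)
print(X.toarray())                         # term-frequency weights per sentence
```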

The technique used to test the Naïve Bayes algorithm was cross-validation with 10 folds. The sample is divided into k (here, 10) parts of equal size, and preferably the parts (or folds) must contain the same number of patterns, thus guaranteeing the same proportion of classes in each subset. The algorithm is trained on k–1 folds (subsets), generating the rules, and is subsequently validated on the remaining fold. That is, the training set is formed by the 9 (nine) other parts, and the model is tested by calculating the hit rate on the data of the unused part. In the end, the hit rate is the average of the hit rates over the k iterations performed.
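A minimal sketch of this procedure, assuming scikit-learn in place of RapidMiner; the term-frequency matrix and the polarity labels are synthetic stand-ins.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-ins for the term-frequency matrix and the polarity labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(99, 50))
y = np.array(["positive", "negative", "neutral"] * 33)  # equal class proportions

# Stratified 10-fold cross-validation: each fold keeps the class proportions.
cv = StratifiedKFold(n_splits=10)
scores = cross_val_score(MultinomialNB(), X, y, cv=cv)
print(scores.mean())  # hit rate averaged over the 10 iterations
```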

4.3 Results

Table 2 presents the confusion matrices for both investigations. A confusion matrix shows how the test set instances were classified: the class under analysis appears in the rows, the classifications found appear in the columns, and the diagonal of the matrix corresponds to the correct classifications. For example, the confusion matrix shows that, for the negative class, 46 sentences were correctly classified, while 183 (115 + 68) were classified incorrectly.

Table 2. Post confusion matrix.

A comparison of the results of the two investigations is presented in the following items. We performed the comparison by calculating the following model evaluation metrics: (a) coverage; (b) agreement; (c) accuracy; (d) precision; (e) recall; and (f) F-measure.

(a) Coverage:

The coverage of the methods for each class was computed for the two investigations. In Fig. 1, we can compare the coverage for each polarity, which consists of the percentage of instances of a class that are correctly identified [2]. With these results, we can verify whether the two investigations are similar in correctly detecting the sentiments. The Naïve Bayes classification algorithm obtained a better result in all classes, mainly in the negative one, with a difference of approximately 70%.

Fig. 1. Coverage of the methods.

(b) Agreement:

In Table 3, we can compare the percentage of cases in which the two classification methods agreed on the polarity of a content item. The result indicates that there is little agreement between the classifications made by the SentiStrength tool and by the Naïve Bayes algorithm.

Table 3. Percentage of agreement between methods.

(c) Accuracy, (d) Precision, (e) Recall and (f) F-measure:

Table 4 illustrates the prediction performance results for each investigation. Accuracy is the most basic measure of efficiency: the fraction of instances that are correctly classified. Precision determines how many instances of a class X were correctly predicted out of all the instances predicted for that class. Recall indicates how many instances of a class X were correctly predicted out of all the instances that truly belong to that class. Finally, the F-measure is the harmonic mean of precision and recall.
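Written in terms of the confusion-matrix counts for a class $X$ (true positives $TP$, false positives $FP$, false negatives $FN$) and the total number of instances $N$, these measures are:

$$\text{Accuracy} = \frac{\text{correctly classified instances}}{N}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$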

Table 4. Prediction performance results for each investigation.

5 Discussion

This section discusses the investigations carried out with respect to: (a) which words are most relevant to each polarity?; (b) what are the characteristics of these words?; and (c) are the investigated classifiers sufficient for use in an automatic textual evaluation?

Table 5 presents the most relevant words found (those with the highest frequency for each polarity) in Brazilian Portuguese (tokens with stemming) and translated into English.

Table 5. Words with higher frequency.

The characteristics we observed from this result were: (a) the most frequent words refer to the system and its functionalities, such as the names users employ to refer to the system (“system”, “SIGAA”, “module”, “academic”) and to its functionalities (“grade”, “enrollment”); (b) the intensifier “more” was found in the expressive polarities (positive and negative), indicating the vehemence with which users demonstrate their approval or disapproval of the system; (c) the word “new”, found in the negative polarity, indicates users’ rejection of the new system; after all, the postings were collected from the beginning of the use of a new one; (d) the names of functionalities found in the negative polarity are those that users criticized the most, whereas the functionalities present in the positive class would be the most praised, information that would be interesting to obtain in a system evaluation.

The third point to discuss is whether the investigated classifiers are sufficient for use in an automatic textual evaluation. The following paragraphs address this point for each of the two approaches used.

Evaluating the results obtained in the first investigation, we notice that the accuracy, precision, and recall measures presented low hit rates. We can attribute this result to the fact that the dictionary provided for the Portuguese language is not adapted to precisely recognize polarity in PRUs. PRUs need a different treatment to recognize what the user means about a given system, since many words can represent different sentiments depending on the context in which they are used. As an example, the sentence “Thanks for the help”, which the tool treats as expressing positive sentiment, is not positive with respect to the use of the system; that is, the word “Thanks” should not be counted as a positive valuation, but as a neutral one. Another example that shows the need to insert context into the dictionary is the sentence “Google Chrome is much better”. Given the system context, in which the system works only in the Firefox browser, this sentence should be classified as negative, since it criticizes the system’s limitation of not being supported by other browsers. However, lacking this information, the tool would rank it as positive.

We must also consider other problems that are recurrent in natural language processing and require treatment of particular cases, such as irony, slang, and language peculiarities regularly used by SS users. These problems become even more serious when we take into account the number of languages that need a particular study to improve the classification of the tool. In other words, it is necessary to build a method adapted to the domain of PRUs, aiming to obtain more accurate and reliable results in the analysis based on the semantics of the postings.

From the results obtained in the second investigation, we can observe that the Naïve Bayes classifier obtained better results for all the evaluated metrics when compared to those of the SentiStrength classifier. It is worth mentioning that it was necessary to use a training database in which each sentence was previously labeled with its respective class. Thus, some problems related to the use of Naïve Bayes arose: (1) manual labeling is onerous, so it is not a viable method for real scenarios; and (2) in systems evaluation, we would like to capture the spontaneity of the user’s report [17], and since the extraction of PRUs is done automatically, it is not possible to have users themselves label their sentiment at the moment of posting.

6 Final Considerations and Future Work

In this work, we investigated the polarity of PRUs of a university system. The results presented from the two investigations were: the most relevant words for each polarity, the characteristics of these words, and a discussion of the investigations carried out.

The results point to the development of methods for the automatic classification of sentiment in PRUs, as well as to further investigations of PRUs from other types of systems and with other algorithms.

We have carried out some initiatives on textual evaluation, such as: (1) a methodology for evaluating usability and UX through PRUs [17] and (2) the development of a tool for extracting and classifying PRUs, called UUX-Post. The methodology, called MALTU, aims to guide an HCI professional in the evaluation of a system from a set of PRUs. It comprises five stages: (i) definition of the evaluation context; (ii) extraction of PRUs; (iii) classification of PRUs; (iv) interpretation of the results; and (v) reporting of the results. In stage 1, we define which system will be evaluated, who the users are, and what the objective of the evaluation is. In stage 2, the extraction of PRUs is performed, either manually or automatically. In stage 3, the PRUs are classified into up to six different categories: (a) type (criticism, doubt, praise, suggestion, comparison); (b) intention (visceral, behavioral, reflective) [17]; (c) sentiment analysis (positive, negative, neutral); (d) functionality (which functionality of the system does the user refer to?); (e) quality-of-use criteria (usability, UX, and their facets); and (f) platform (mobile, desktop). In this stage, the sentences are analyzed by experts in order to be classified. The results are then interpreted (stage 4) and reported (stage 5). Most of these classifications are manual; the UUX-Post tool automatically classifies only the type and quality-of-use criteria categories. The research carried out in this work motivates new studies to define a context for the automatic evaluation of polarity in PRUs.

This experiment should be repeated with databases of other systems, with other algorithms, and with other classification tools, in order to compare, discuss, and improve the results to be applied in the UUX-Post tool.