1 Introduction

Nowadays, usability is a critical success factor for any software product that aims to stand out in the global market [6]. This software quality attribute can be a determinant element for people to continue using an application, especially in the case of websites or mobile apps. Given the several alternatives that users have on the Internet, if a website is difficult to use, hard to understand, or fails to clearly state its purpose, users will leave the site and look for similar options that meet all these expectations. In the E-Commerce domain, where multiple websites are available and the competition between companies is intense, considering criteria that ensure a high level of ease of use can be the decisive feature to stand out in the market.

The relevance of the usability of software products has led to the emergence of several methods that provide specialists with mechanisms to determine whether the graphical interfaces of a system are understandable, intuitive, and usable [19]. The purpose of most of these techniques is to identify aspects of the proposed design that can affect usability; with this information, the development team can fix the interfaces following the recommendations provided by specialists in HCI or by potential users. However, although these techniques allow the identification of the aspects to be improved, only a small fraction of them let specialists obtain a numeric value for the usability of the system. Quantifying the level of usability makes it possible to compare software products of the same type. Likewise, a mechanism of this nature provides companies with a systematic procedure for selecting the best option from several design proposals.

At present, the most widely recognized methods that quantify the usability of a software system in numeric values are questionnaires and usability metrics [17], specifically the measurements provided by the ISO/IEC 9126 standard [7]. However, both techniques are time-consuming and demanding in human resources, since the participation of a representative number of users is required [8]. Questionnaires as evaluation tools are based on subjective data and attempt to determine user satisfaction with the current level of usability of the system; their purpose is to measure the participants' opinions concerning their perception of usability. In contrast, the metrics are based on objective data and are measurements of the participants' performance, such as scenario completion time and successful scenario completion rate. Both software metrics and usability questionnaires are focused on analyzing the user's perspective or behavior, and can present accurate results only if they are applied to a representative number of people. In contrast to these two commonly used user-oriented methods, we propose the adoption of a specialist-oriented inspection technique to quantify usability. In this study, we describe an experimental case study in which a variant of the heuristic evaluation is applied to a real scenario to obtain quantitative data about the usability of a web application. The advantage of this method over other techniques is that the assessment process is entirely conducted by specialists in the field of HCI, and the participation of users is not required. The results establish that this variant of the traditional heuristic evaluation approach provides appropriate feedback about the usability of a software system and allows the comparison of different design proposals.

The paper is structured as follows. In Sect. 2, we present the new usability assessment process that was adopted for this experimental case study. In Sect. 3, we describe the research design. In Sect. 4, we discuss the results of the usability evaluation. Finally, in Sect. 5, the conclusions and future work are established.

2 A New Heuristic Evaluation Process

Nielsen's traditional approach [14] to carrying out heuristic evaluations does not establish well-defined steps that can guide the usability inspection process. Nevertheless, some guidelines are indicated in the original proposal [10]:

  • Each evaluator must examine the graphical system interfaces individually to determine if all elements meet certain design principles called “heuristics”. If any of the established heuristics is not met, the finding must be recorded by the inspectors as a design problem in a predefined template.

  • Once the individual evaluations have been performed, the findings must be consolidated into a single list of problems. For this activity, the team must organize a discussion session in which each identified issue is analyzed. According to the traditional guidelines, one of the team members acts as a moderator to conduct the meeting and allow the specialists to provide their opinion about the problematic aspects of the design that were identified. The purpose of this activity is to define the final list of problems as well as the positive aspects of the proposed graphical system interface.

  • Subsequently, the evaluation team rates the severity of each usability problem included in the final list of problems according to the following scale [16] (Table 1):

Table 1. Severity ratings for usability problems

In addition to the previous directives, some other considerations are established by the traditional approach:

  • Only three to five specialists are required to identify the greatest number of usability problems present in an interface design. According to a statistical study conducted by Nielsen and Landauer [13], the participation of three professionals from the field of HCI is enough to find 75% of the most critical usability issues. Based on a comprehensive analysis of several software development projects, Nielsen determined that the optimal number of heuristic evaluators to maintain a positive benefit/cost ratio is between three and five people, considering that each additional specialist represents a cost for the project, as do the interface fixes required by usability problems that were not noticed in early phases. However, in contrast to this theory, there are studies [1, 22] that establish that three or five evaluators are still too small a number to detect most of the design issues.

  • The participants in the heuristic evaluation process must be specialists in the field of Human-Computer Interaction. According to the traditional approach, this method is cataloged as a usability inspection method [12], which requires the participation of professionals with a high degree of expertise in usability. The evaluation must be conducted by experts who, due to their extensive experience, can easily identify real problems in the interface design without a subsequent verification with representative users of the software system. The accuracy of the results is strongly related to the profile of the inspectors on the assessment team. For this reason, Nielsen states that only three specialists are required for these evaluations, since their high degree of mastery of the topic enables them to identify most of the issues individually. Many of the studies that take a critical stance against the optimal number of evaluators proposed by Nielsen did not involve specialists in HCI in their experimental case studies. Only three to five evaluators are required, as long as these professionals are experienced.

  • The heuristic evaluation demands the use of a set of usability principles to identify design problems. Although the assessment methodology does not mandate a specific list of guidelines, Nielsen has defined ten usability principles for graphical user interfaces [15], which are nowadays the most relevant and widely used proposal. Nevertheless, some authors [4, 18, 20] have identified that these heuristics fail to cover relevant aspects when they are used to evaluate emerging categories of software products such as mobile apps, videogames, medical applications, educational systems, and others. Despite being widely recognized and used, the original set of principles was developed with a focus on web and desktop applications. In this sense, new types of software applications present features with a significant impact on usability that are not considered by Nielsen's ten traditional heuristics. For this reason, new heuristic proposals have appeared in the literature that are oriented to evaluating more effectively the usability of software products from different domains.

In a previous work conducted by Granollers [5], a new approach based on the traditional method was developed. The purpose of that study was to establish a methodological procedure, based on the heuristic evaluation process, with which specialists can determine in numerical values the level of usability of a specific software product. In many cases, companies request several graphical interface designs from their development teams and subsequently select the best option. However, given the original nature of the heuristic technique, there is no way to determine to what degree one interface proposal is better than another for a particular system. Specialists in HCI need a framework that can provide a measurement (e.g., from 0 to 100) of the ease of use of a graphical user interface. In this way, it would be possible to compare different designs and determine how far one is from another. To date, questionnaires are the only methods that allow professionals to obtain a numeric value for the level of usability of a software interface [21]. However, these subjective measurement tools have been designed to capture the users' opinions concerning their perception of usability. There is a lack of a mechanism, exclusively for specialists in HCI, for the quantitative estimation of the usability degree of a software product.

Table 2. Heuristics related to the category of Need Recognition and Problem Awareness

The new assessment procedure involves using, in addition to the heuristics, a checklist. Unlike the heuristic principles, which are broad design rules, the checklist items are specific to a degree that allows the inspectors to quickly identify whether they are met with a simple and fast review of the interface [9]. For this academic scenario, given that the software product to be assessed was a transactional web site, we employed the 64 items of the checklist developed by Granollers to evaluate the usability of E-Commerce web applications [2]. Each item of this proposal has been formulated as a YES/NO question, which the evaluator has to answer positively or negatively, depending on whether the guideline is met. The first fourteen items of the proposal, related to the category of “Need Recognition and Problem Awareness”, are shown in Table 2. The complete list of heuristic items can be found in the research performed by Bonastre and Granollers [2].

Fig. 1. Heuristic evaluation process to quantify the level of usability

Given the nature of the heuristic items, the proposed scoring system establishes that each usability guideline must be rated individually by the evaluators, from 0 to 4, according to the level of achievement of the design rule, where 0 refers to total non-compliance with the heuristic item and 4 to the opposite scenario, total accomplishment of the rule. In order to assign a score, the evaluators must attempt to answer each of the 64 YES/NO questions. If the answer is affirmative, the guideline is completely fulfilled and the proper score for the item is 4. Likewise, if the answer is negative, there is a total infringement of the rule and the score must be 0. In cases where it is not possible to answer YES or NO because the design rule is only partially accomplished, the inspector should assign a score between 1 and 3 according to the level of fulfillment.

Once each evaluator has rated the entire list of heuristic items, all of that evaluator's scores are added, yielding a value between 0 and 256. This value is divided by the number of heuristic items considered for the assessment (64 items in this particular scenario), obtaining a final individual value per specialist. Finally, the individual scores are averaged to obtain the resulting value. Equation 1 summarizes the way to calculate the level of usability of a software product.

$$\begin{aligned} \hbox {level of usability} = \frac{\sum \limits _{i=1}^{m} \left( \frac{\sum \limits _{j=1}^{n} s_{ij}}{n}\right) }{m} \end{aligned}$$
(1)

where:

  • \({\mathbf {s}_\mathbf {ij}}\) is the score assigned by evaluator “i” to heuristic item “j”.

  • n is the number of heuristic items.

  • m is the number of evaluators.
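Equation 1 can be sketched directly in code. The following minimal Python sketch averages each evaluator's item scores and then averages across evaluators; the scores used here are invented, purely for illustration:

```python
def usability_level(scores):
    """Compute Eq. 1: scores is a list of m lists, one per evaluator,
    each containing n item scores in the range 0-4."""
    # Average each evaluator's scores over the n checklist items...
    per_evaluator = [sum(s) / len(s) for s in scores]
    # ...then average the m individual values.
    return sum(per_evaluator) / len(per_evaluator)

# Three hypothetical evaluators rating a four-item checklist:
scores = [
    [4, 3, 2, 4],  # evaluator 1 -> 3.25
    [4, 4, 1, 3],  # evaluator 2 -> 3.00
    [3, 3, 2, 4],  # evaluator 3 -> 3.00
]
print(round(usability_level(scores), 2))  # -> 3.08
```

In the real scenario each inner list would hold the 64 checklist scores, so each evaluator's sum falls between 0 and 256 before division, exactly as described above.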

The assessment process that quantifies the level of usability is represented in Fig. 1. In this diagram, we have defined two organizational entities, the ‘assessment team’ and the ‘evaluator’, in order to separate the collaborative activities from the individual assignments. The most valued aspect in contrast to the traditional evaluation approach is the possibility of obtaining a global score for the level of usability of an interface design, which allows specialists to compare versions, or to determine to what degree the software product is better than competing applications.

3 Research Design

This experimental case study was performed with the participation of six specialists in Human-Computer Interaction (HCI). Their professional profiles are described in Table 3. For the conduct of this study, Nielsen's general recommendation regarding the proper number of evaluators was taken into consideration. The specialists were divided into two teams of three members each, according to the traditional guidelines, which establish that three evaluators are enough to identify the most relevant problems present in an interface design [11]. Each team evaluated a different E-Commerce web site, in order to determine whether the results from distinct assessments can be compared. Team A employed the quantitative approach of the heuristic evaluation method to evaluate the usability of Amazon.com. In the same way, Team B used the variant of the traditional technique to evaluate BestBuy.com. First, the participants were informed about the new framework that they should follow to perform the evaluation. Although all the team members were experienced in the traditional method, they had no prior knowledge of this particular way of conducting a heuristic evaluation, since it is a recent proposal and because of the small modifications we made to the original methodology. Once the participants were familiar with the new evaluation approach, they proceeded to evaluate the corresponding web applications according to the usability guidelines proposed by Granollers for user experience evaluation in E-Commerce websites [2]. This particular set of design guidelines is structured as a checklist and consists of YES/NO questions with which a specialist can easily perform a usability inspection. In addition, this proposal corresponds entirely with the assessment framework, in the sense that each principle can be rated 0 or 4 depending on whether the heuristic is entirely infringed or totally fulfilled, and from 1 to 3 if it is partially implemented.

Table 3. Profile of the usability specialists

The set of heuristics proposed by Bonastre and Granollers [2] to evaluate the usability and user experience of E-Commerce web sites is composed of 64 design principles grouped into six categories according to the aspect that the heuristics address:

  1. Need Recognition and Problem Awareness [14 guidelines]

  2. Information Search [6 guidelines]

  3. Purchase Decision Making [13 guidelines]

  4. Transaction [10 guidelines]

  5. Post-Sales Services Behavior [4 guidelines]

  6. Factors that affect UX during the whole purchase process [17 guidelines]
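The category structure above also makes it straightforward to compute per-category averages, as done for Tables 4 and 5. The sketch below assumes the 64 item scores arrive as a flat list in checklist order; the category names and sizes come from the checklist itself, while the example scores are invented for illustration:

```python
# The six categories of the Bonastre-Granollers checklist and their
# item counts; together they account for the 64 checklist items.
CATEGORY_SIZES = [
    ("Need Recognition and Problem Awareness", 14),
    ("Information Search", 6),
    ("Purchase Decision Making", 13),
    ("Transaction", 10),
    ("Post-Sales Services Behavior", 4),
    ("Factors that affect UX during the whole purchase process", 17),
]
assert sum(size for _, size in CATEGORY_SIZES) == 64

def category_averages(item_scores, sizes=CATEGORY_SIZES):
    """item_scores: flat list of 64 scores (0-4) in checklist order.
    Returns the average score per category."""
    averages, start = {}, 0
    for name, size in sizes:
        averages[name] = sum(item_scores[start:start + size]) / size
        start += size
    return averages

# A perfect rating on every item yields 4.0 in every category:
print(category_averages([4] * 64)["Transaction"])  # -> 4.0
```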

Both inspection teams A and B used this set of heuristics together with the assessment framework described in Sect. 2 to determine the level of usability of the assigned software products. The teams were allowed to establish their own organization and distribution of the activities of the evaluation process, under the supervision of the authors, who verified that all the steps were followed in the correct order and that no step was omitted. Finally, once the teams had consolidated their findings, the results were subjected to a comparative analysis.

4 Analysis of Results

The usability evaluation was conducted by both teams following the assessment framework as well as the scoring system explained above. It is important to highlight that this study focuses on the evaluation process and the way in which this methodological procedure provides support to quantify the level of usability of the web applications. The authors have no particular intention of improving the usability of the mentioned systems or of describing in detail the design aspects that must be fixed.

Tables 4 and 5 present the results of the scoring process for Amazon.com and BestBuy.com respectively. Likewise, Figs. 2 and 3 show the usability analysis of the web applications for each of the three evaluators.

Table 4. Average of the usability scores assigned in each category to Amazon.com
Table 5. Average of the usability scores assigned in each category to BestBuy.com

The results establish that the assessment methodology and the proposed heuristics together are a reliable tool to evaluate usability. It is possible to observe that the scoring is uniform. The highest standard deviation is 0.25 in the first scenario and 0.24 in the second, which means that the overall protocol produces consistent results. From the perspective of psychometrics [3], a research instrument can be considered reliable if it provides similar results under consistent conditions: if the instrument is applied several times to the same subject or object, essentially the same results are obtained in each attempt.
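This consistency check can be reproduced with a few lines of Python using the standard library; the three evaluator scores below are hypothetical, not the study's actual data:

```python
# A small standard deviation across the evaluators' scores indicates
# that the protocol produces consistent (reliable) results.
from statistics import mean, stdev

# Hypothetical per-evaluator average scores for one web site:
evaluator_scores = [3.2, 3.4, 3.3]

print(round(mean(evaluator_scores), 2))   # -> 3.3
print(round(stdev(evaluator_scores), 2))  # -> 0.1  (well below 0.25)
```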

The global result for the usability of each E-Commerce web site can be calculated as a simple average of all the individual scores. Figure 4 is a comparative graph that highlights the usability levels in each aspect that the selected heuristics address. Finally, we averaged the values in each category to determine the final usability score for each system (3.29 for Amazon.com and 2.90 for BestBuy.com).

Fig. 2. Results of the evaluation in each category performed on Amazon.com

Fig. 3. Results of the evaluation in each category performed on BestBuy.com

Fig. 4. Comparative analysis of the usability of BestBuy.com and Amazon.com

The analysis establishes that both applications present a proper degree of usability, given that the scores reached are above the midpoint of the scale (2.0). However, in both cases there are opportunities for improvement, especially for BestBuy.com, which obtained the lower score. Nevertheless, the current level is appropriate for both systems and explains to a certain degree their success and strong presence in the market. The fact that the heuristics are grouped by aspects is another advantage of the model, in the sense that specialists can notice which features are aligned with the design guidelines and which need restructuring. The approach also allows comparisons between different design proposals or software systems. In this particular scenario, we determined that Amazon.com is more usable and provides a better user experience than BestBuy.com. Given the proper degree of detail with which the results are offered by the assessment model, the development team can immediately identify the aspects of the interfaces that require a redesign process to match or exceed the usability level of other software products, even those of competitors.

5 Conclusions and Future Works

Usability has become an essential quality attribute about which the majority of companies are currently concerned. Nowadays, the success of any technological product released to the market depends significantly on its degree of ease of use, how intuitive the artifact is to operate, and the level of the users' satisfaction after their first interaction with the product. The field of Software Engineering is no exception: software products have to be usable enough to succeed, especially in the E-Commerce domain, in which there are several web sites for the same purpose.

Given the need to verify whether a web application meets appropriate levels of usability, several techniques have emerged and are employed to analyze not only usability but also the user experience (UX). However, the disadvantage of these evaluation methods is that most of them have a qualitative approach and only allow the identification of usability problems based on the personal impressions of specialists or end users about the graphical interfaces; they do not provide a numeric value for the level of usability of the system. Questionnaires and software metrics are the only approaches that quantify the ease of use, but these methods are time-consuming and demand the participation of a representative sample of people. In this study, we have discussed the results of applying a variant of the traditional heuristic evaluation process to determine the level of usability of E-Commerce web applications. The advantage of using an inspection method to measure usability, according to the literature, is the possibility of obtaining accurate results with the involvement of only three specialists in HCI. In this experimental case study, this hypothesis was confirmed. Six specialists from academia were requested to evaluate a web site employing a specific approach proposed by Granollers that quantifies from 0 to 4 the global level of usability of a software system. This new methodological proposal establishes the use of a checklist in which each item must be rated and then averaged with the scores of all the heuristics. The analysis establishes that this assessment framework is a reliable tool, in the sense that all results were uniform. The proposed methodology as well as the selected heuristics allow obtaining consistent results, since the use of these evaluation tools provides similar scores when the same conditions are met.
However, it is still necessary to perform more studies, especially with specialists of different backgrounds or profiles. In this research, all the specialists involved belonged to the academic field. A future work would be the analysis of the results when specialists from the software industry are considered in the evaluation. Likewise, it would be important to discuss whether there is a relation between the obtained results and the cultural profile of the evaluators, and whether the results are significantly different when more specialists participate in the evaluation. In order to generalize the conclusions, more studies are required. However, this study is intended to demonstrate that the presented methodology is reliable and can provide both quantitative and qualitative results.