Elsevier

Journal of Retailing

Volume 75, Issue 2, Summer 1999, Pages 195-217

Original Articles
Unmasking a phantom: a psychometric assessment of mystery shopping

https://doi.org/10.1016/S0022-4359(99)00004-4

Abstract

Because of increasing concern over whether they are really satisfying their customers, more retail and service firms are using mystery shoppers (sometimes also referred to as secret, phantom, or anonymous consumer shoppers) to monitor their frontline operations, to assess their customer service, and to benchmark their competitors’ performance. But virtually nothing is known about the psychometric quality of the data collected in mystery shopping studies, and how it compares with that of customer survey data. Yet such information is necessary for organizations to know when to use mystery shopping, how to design mystery shopping studies, and what weight to place on their results. Here we use a generalizability theory approach to assess the psychometric quality of mystery shopping data. First, we design a study to compare the use of mystery shopping and a traditional customer survey when collecting data to assess the service quality of competitive stores. Second, we evaluate the psychometric performance of a set of secondary mystery shopping data collected to evaluate and benchmark the performance of branches of a retail network. Finally, in a follow-up study, we examine whether mystery shopping data collected to scale the more objective store environment are more reliable than data for service quality.

Introduction

Because of increasing concern over whether they are really satisfying their customers, more retail and service firms have been using mystery shoppers (sometimes also referred to as secret, phantom, or anonymous consumer shoppers) to assess the performance of their outlets (Cobb, 1995; Dwek, 1996). In a mystery shopping study (sometimes also referred to as a shopping survey), the front-line operations of a business are evaluated by an anonymous trained observer. Typically, the observer enters the outlet to be evaluated posing as an average customer and, immediately after engaging in what appears to be a normal customer interaction, completes a detailed report on various aspects of the service and shopping environment at the outlet. Mystery shopping is best suited to assessing objective characteristics of outlet operations, such as the store environment (e.g., Were the aisles kept clear? Were all point-of-purchase signs in place?). Customer surveys (mail, telephone, or intercepts) are best suited for evaluating more subjective characteristics of operations, such as service quality (e.g., Did the salesperson provide courteous service? Did employees provide prompt service?). While it is difficult to use customer surveys to obtain detailed evaluations of objective outlet characteristics, mystery shopping surveys can evaluate service quality, and should provide useful information when the shoppers are representative of the customer population. Mystery shoppers can be assigned to compare the performance of branches of chains, such as banks, whereas customers can typically only evaluate one branch. So while used most frequently for evaluating objective characteristics of a service, mystery shopping surveys can also provide a supplementary source of subjective quality ratings.

Data collected by mystery shoppers can be reported in the form of rating scales, checklists, and open-ended responses. The results are then used to compare the performance of particular outlets and their employees, to monitor outlet performance over time, and to identify areas where outlets are in most need of improvement. As noted by Dawson and Hillier (1995), mystery shoppers are increasingly being used to benchmark the performance of important competitors or even of outlets in other industry sectors which might provide a performance standard.

Anecdotal accounts of the use of mystery shopping abound. When McDonald’s set new standards of customer satisfaction for its franchisees in response to concerns over quality, service and cleanliness relative to competition, it was reported as planning to have ‘representatives’ make surprise visits to restaurants each quarter (Gibson, 1995). Berry (1995, p. 40) reports that the Hard Rock Cafe uses researchers who shop both its restaurants and on-site merchandise stores every two months, producing ratings of individual servers. Similarly, Marketing Solutions is reported to have sent ‘researchers’ into 1,600 financial institution branches in 10 cities across Canada for an annual syndicated study designed to determine how well employees handle the marketing of mutual funds (Bell, 1995). Disney is said to routinely use the technique at its theme parks to evaluate its performance (Meister, 1990).

But despite the popularity of mystery shopping for performance evaluation in such important areas of business as retailing, financial services, food service, tourism, and direct marketing, surprisingly little discussion of the technique has appeared in the academic literature. The most developed literature on the assessment of front-line operations has focused on the use of customer surveys to measure perceived service quality and customer satisfaction (see Iacobucci, Grayson, and Ostrom, 1994 for a useful review). In contrast, in their review of the use of observational methods for services marketing, Grove and Fisk (1992) cite only one proceedings paper which used a mystery shopping approach, and it focused on a substantive, not a methodological issue (McClung, Grove, and Hughes, 1989). Only recently have Morrison, Colman, and Preston (1997) begun to speculate that factors associated with the encoding, storage, and retrieval of information are likely to influence the accuracy of the results reported by mystery shoppers. Wilson (1998) examines UK managers’ use of mystery shopping and suggests some experience-based guidelines for reliability, which is described as a critical issue if “staff are to be rewarded on the basis of the results.” But published data on the reliability and validity of mystery shopping remain non-existent.

The popularity of mystery shopping has been more apparent to research practitioners, as evidenced by a one-year increase from 28 to 187 in 1995 in the number of research agencies listed as providing the service in the MRS Organization Handbook (Dawson and Hillier, 1995). British practitioners have also discussed ensuring the acceptability of particular study results, given the types of decisions for which data are collected (Hurst, 1992; Craven and Yeomans, 1994), and of the general practice, given possible ethical concerns (Dawson and Hillier, 1995). However, some important data quality questions have not been addressed. How do the data generated in mystery shopping studies hold up against the standards of reliability and validity advocated in the marketing research literature? Is the reliability of such mystery shopping data comparable to that of customer survey data? Is the reliability of mystery shopping data higher for objective constructs such as the store environment than for subjective constructs such as service quality? If the results from mystery shopping visits are to be used in managerial decision making, what design principles should be followed in such areas as assigning visits to outlets and choosing the number of items to be evaluated in a visit? For example, does it ever make sense for managerial decisions to be based on the report of a single shopper visit, or are far more visits necessary to assess the performance of outlets? If so, how many are needed? In their survey dealing with the ethical concerns of UK clients for mystery shopping, Dawson and Hillier (1995) found the acceptable number of mystery shopper visits to a competitor in a three-month period averaged nearly four at an outlet level and nearly eight at an organization level. However, it is not clear to what extent survey respondents felt these were all that were necessary for methodological reasons or all that would be acceptable for ethical reasons, even though more would be desirable for methodological reasons. While there are a number of other important issues, such as whether mystery shopping infringes on an employee’s right to privacy (see New Zealand Privacy Commissioner, 1995; Grove and Fisk, 1992), we did not attempt to address them here.

To help answer these methodological questions, we conducted a three-phase research program. First, we designed and carried out a primary study to examine the reliability and validity of mystery shopping when used to evaluate the service quality of different retail outlets. Mystery shoppers for this study were selected from the population of customers of these retail outlets because we intended to compare the reliability and cost effectiveness of mystery shopping and a customer survey when applied to the same problem. The managerial problem of evaluating the relatively subjective service quality of retail outlets was chosen because both methods can be used. Second, we approached a company which provides mystery shopping services, requesting access to secondary data from a typical mystery shopping project. The objective was to investigate the psychometric properties of a typical example of store evaluation data collected via mystery shopping. Finally, in a follow-up study, we collected additional primary data to compare the effectiveness of mystery shopping when used for evaluating objective and subjective areas of performance. In particular, we compare its use in the assessment of the store environment with its use in the assessment of service quality. In these initial studies, we concentrate on ratings data only. These are evaluated using generalizability theory (Cronbach et al., 1972), because the theory and its associated methodology can provide answers to the questions raised here about mystery shopping.

Generalizability theory, hereafter called G-theory, is the most comprehensive approach to assessing the reliability and validity of measurement. It was originally developed for educational testing by Cronbach and his colleagues (1972) and was first identified as of potential value in marketing research by Peter (1979). Rentz (1987, 1988) provided a fuller introduction and presented some demonstration results for the generalizability of some consumer scales using data provided by students. Finn and Kayandé (1997) demonstrated how G-theory can be used to ensure the efficiency of measurement using a study of the customer perceptions of the service quality provided by retailers. If mystery shoppers can provide data of similar quality to average customers’, Finn and Kayandé’s (1997) results for comparing retail chains using crossed designs (Table 4, p. 269) raise the possibility that a relatively small number of mystery shoppers may be sufficient to obtain reliable service quality measures.

In the G-theory approach, an initial generalizability study (hereafter called a G-study) is conducted, collecting data to determine how sensitive the construct being measured is to the levels of different factors in the measurement environment. Then, when subsequent managerial decisions need to be made, knowledge of the extent of variation across and within factors can be used to determine how many observations will be necessary to draw managerial conclusions with a required degree of reliability. If cost information is also available, it is possible to identify the most cost effective designs for each subsequent decision study. Thus, the G-theory approach is most beneficial in a programmatic research context, rather than for one-off research projects.
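
The step from a G-study to a decision study can be illustrated with a minimal sketch. It assumes a fully crossed outlets × shoppers × items design in which outlets are the object of measurement and only relative (rank-ordering) decisions matter. Under that assumption the G-coefficient is the outlet variance divided by the outlet variance plus the relative error variance. The variance components, the class and function names, and the 0.80 target below are all hypothetical values chosen for illustration; none of them are results from the studies reported here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VarianceComponents:
    """Hypothetical G-study estimates for a crossed outlets x shoppers x items design."""
    outlet: float            # true differences between outlets (object of measurement)
    outlet_x_shopper: float  # shoppers disagreeing about the ordering of outlets
    outlet_x_item: float     # items ordering outlets differently
    residual: float          # three-way interaction confounded with random error


def g_coefficient(vc: VarianceComponents, n_shoppers: int, n_items: int) -> float:
    """G-coefficient for relative decisions about outlets in a decision study."""
    # Each error component shrinks as it is averaged over more shoppers and/or items.
    relative_error = (
        vc.outlet_x_shopper / n_shoppers
        + vc.outlet_x_item / n_items
        + vc.residual / (n_shoppers * n_items)
    )
    return vc.outlet / (vc.outlet + relative_error)


def min_shoppers_for_target(
    vc: VarianceComponents, n_items: int, target: float, max_shoppers: int = 200
) -> Optional[int]:
    """Smallest number of shoppers per outlet reaching the target, or None if unreachable."""
    for n_shoppers in range(1, max_shoppers + 1):
        if g_coefficient(vc, n_shoppers, n_items) >= target:
            return n_shoppers
    return None


# Illustrative values only (assumed, not estimated from any data set).
COMPONENTS = VarianceComponents(
    outlet=0.10, outlet_x_shopper=0.20, outlet_x_item=0.04, residual=0.60
)

if __name__ == "__main__":
    for n in (1, 4, 10, 20):
        print(n, round(g_coefficient(COMPONENTS, n_shoppers=n, n_items=8), 3))
    print(min_shoppers_for_target(COMPONENTS, n_items=8, target=0.80))
```

If a cost per visit were also known, the same loop could be extended to compare designs that reach the target by their total cost, which is the sense in which a decision study can be optimized.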

Applying generalizability theory to mystery shopping

G-theory recognizes that an observed performance evaluation provided for a retail outlet will depend on multiple factors, such as the time of the observation, the type of transaction, the employee observed, the specific mystery shopper doing the observation, and the particular aspect of operations being rated, as well as the particular outlet being observed. Each of these factors is referred to as a facet. The contribution of a facet to the overall variation in performance ratings can be

Primary study

The purpose of the primary study was to determine whether the psychometric properties of service quality data collected using typical mystery shopping conditions are comparable to those collected using typical customer survey conditions, and can meet the standards advocated in the marketing research literature. Of particular concern were (1) the relative size of variance components, and therefore the G-coefficient for different objects of measurement, in data collected using the two methods,

Secondary study

To get a better idea of whether one, ten, or nearly forty mystery shopping visits are generally necessary to reliably scale outlets, we examine some secondary data, collected by a supplier of mystery shopping services. These data were made available for academic study with the proviso that no substantive findings or proprietary details would be published. We can only report that the data were collected for one client with branches located throughout one metropolitan area in North America.

The

Follow-up study

To investigate whether our results are driven by our investigation of subjective performance constructs, we designed a follow-up study to collect data on both subjective and relatively objective constructs, namely service quality and store environment. Items for a store environment scale were developed from a comprehensive checklist suitable for retailers wishing to assess their own store environment (Williams and Torella, 1992, p. 128). Three items were selected for each of seven distinct

General discussion

As industry practice stands today, managers are using mystery shopping to evaluate both the objective and subjective characteristics of front line service operations. However, they have had no idea of the psychometric quality of the data collected using this increasingly popular method nor, in the case of subjective characteristics, whether they are comparable to those collected using customer surveys. A generalizability approach allows the manager to optimize the number of mystery shoppers

Conclusions

The evidence from our primary, secondary and follow-up studies all suggests mystery shoppers provide reasonably reliable ratings of the performance of retail outlets, although much less reliable than seems to be assumed for commercial studies using 3–4 shoppers. The reliability of mystery shopping data is much higher than that of customer surveys, when the data are used for the same problem of scaling outlets. There is presumably an advantage in mystery shoppers knowing they are going to be

References

  • Parasuraman, A. et al. (1994). “Alternative Scales for Measuring Service Quality: A Comparative Assessment Based on Psychometric and Diagnostic Criteria.” Journal of Retailing.
  • Bell, Andrew. (1995). Better Mutual Fund Service: Survey. The Globe and Mail, December...
  • Berry, Leonard L. (1995). On Great Service.
  • Bitner, Mary Jo et al. (1990). “The Service Encounter: Diagnosing Favorable and Unfavorable Incidents.” Journal of Marketing.
  • Brennan, Robert. (1983). Elements of Generalizability Theory.
  • Cobb, Robin. (1995). “Magical Mystery Lure.” Marketing.
  • Cox, Eli P. (1980). “The Optimal Number of Response Alternatives for a Scale: A Review.” Journal of Marketing Research.
  • Craven, Jill and Yeomans, Mark. (1994). “In Defense of Mystery Shopping,” Proceedings of the 37th Annual Conference of...
  • Cronbach, Lee J. et al. (1972). The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles.
  • Cronin, J. Joseph et al. (1994). “SERVPERF Versus SERVQUAL: Reconciling Performance Based and Perceptions-Minus-Expectations Measurement of Service Quality.” Journal of Marketing.
  • Dawson, Janet et al. (1995). “Competitor Mystery Shopping: Methodological Considerations and Implications for MRS Code of Conduct.” Journal of the Market Research Society.
  • Dwek, Robert. (1996). “Magic of Mystery Shopping.” Marketing.
  • Finn, Adam et al. (1997). “Reliability Assessment and Optimization of Marketing Measurement.” Journal of Marketing Research.
  • Gibson, Richard. (1995). “McDonald’s Approaches ‘96 With Goal of Making Its U.S. Service ‘Hassle-Free.’” Wall Street...
  • Grove, Stephen J. et al. (1992). “Observational Data Collection Methods for Services Marketing: An Overview.” Journal of the Academy of Marketing Science.
  • Hurst, Stephen C. (1992). “Quantifying Customer Service Via Mystery Shopper Surveys.” Proceedings of the 35th Annual...
    2

    Some work on this paper was carried out while the first author was Visiting Professor, Institute of Marketing, Norwegian School of Economics and Business Administration and the second author was a doctoral candidate at University of Alberta. Order of authorship is alphabetical.

    1

    Adam Finn is R.K. Banister Professor of Business at University of Alberta and Ujwal Kayandé is a Senior Lecturer in Marketing at the Australian Graduate School of Management, Sydney.
