Unmasking a phantom: a psychometric assessment of mystery shopping

doi:10.1016/S0022-4359(99)00004-4

Journal of Retailing

Volume 75, Issue 2, Summer 1999, Pages 195-217

https://doi.org/10.1016/S0022-4359(99)00004-4

Abstract

Because of increasing concern over whether they are really satisfying their customers, more retail and service firms are using mystery shoppers (sometimes also referred to as secret, phantom, or anonymous consumer shoppers) to monitor their frontline operations, to assess their customer service, and to benchmark their competitors’ performance. But virtually nothing is known about the psychometric quality of the data collected in mystery shopping studies, and how it compares with that of customer survey data. Yet such information is necessary for organizations to know when to use mystery shopping, how to design mystery shopping studies, and what weight to place on their results. Here we use a generalizability theory approach to assess the psychometric quality of mystery shopping data. First, we design a study to compare the use of mystery shopping and a traditional customer survey when collecting data to assess the service quality of competitive stores. Second, we evaluate the psychometric performance of a set of secondary mystery shopping data collected to evaluate and benchmark the performance of branches of a retail network. Finally, in a follow-up study, we examine whether mystery shopping data collected to scale the more objective store environment are more reliable than data for service quality.

Introduction

Because of increasing concern over whether they are really satisfying their customers, more retail and service firms have been using mystery shoppers (sometimes also referred to as secret, phantom, or anonymous consumer shoppers) to assess the performance of their outlets (Cobb, 1995; Dwek, 1996). In a mystery shopping study (sometimes also referred to as a shopping survey), the front-line operations of a business are evaluated by an anonymous trained observer. Typically, the observer enters the outlet to be evaluated posing as an average customer and, immediately after engaging in what appears to be a normal customer interaction, completes a detailed report on various aspects of the service and shopping environment at the outlet. Mystery shopping is best suited to assessing objective characteristics of outlet operations, such as the store environment (e.g., Were the aisles kept clear? Were all point-of-purchase signs in place?). Customer surveys (mail, telephone, or intercepts) are best suited to evaluating more subjective characteristics of operations, such as service quality (e.g., Did the salesperson provide courteous service? Did employees provide prompt service?). While it is difficult to use customer surveys to obtain detailed evaluations of objective outlet characteristics, mystery shopping surveys can evaluate service quality, and should provide useful information when the shoppers are representative of the customer population. Mystery shoppers can also be assigned to compare the performance of branches of chains, such as banks, whereas customers can typically evaluate only one branch. So while used most frequently for evaluating objective characteristics of a service, mystery shopping surveys can also provide a supplementary source of subjective quality ratings.

Data collected by mystery shoppers can be reported in the form of rating scales, checklists, and open-ended responses. The results are then used to compare the performance of particular outlets and their employees, to monitor outlet performance over time, and to identify areas where outlets are in most need of improvement. As noted by Dawson and Hillier (1995), mystery shoppers are increasingly being used to benchmark the performance of important competitors or even of outlets in other industry sectors which might provide a performance standard.

Anecdotal accounts of the use of mystery shopping abound. When McDonald’s set new standards of customer satisfaction for its franchisees in response to concerns over quality, service and cleanliness relative to competition, it was reported as planning to have ‘representatives’ make surprise visits to restaurants each quarter (Gibson, 1995). Berry (1995, p. 40) reports that the Hard Rock Cafe uses researchers who shop both its restaurants and on-site merchandise stores every two months, producing ratings of individual servers. Similarly, Marketing Solutions is reported to have sent ‘researchers’ into 1,600 financial institution branches in 10 cities across Canada for an annual syndicated study designed to determine how well employees handle the marketing of mutual funds (Bell, 1995). Disney is said to routinely use the technique at its theme parks to evaluate its performance (Meister, 1990).

But despite the popularity of mystery shopping for performance evaluation in such important areas of business as retailing, financial services, food service, tourism, and direct marketing, surprisingly little discussion of the technique has appeared in the academic literature. The most developed literature on the assessment of front-line operations has focused on the use of customer surveys to measure perceived service quality and customer satisfaction (see Iacobucci, Grayson, and Ostrom, 1994 for a useful review). In contrast, in their review of the use of observational methods for services marketing, Grove and Fisk (1992) cite only one proceedings paper which used a mystery shopping approach, and it focused on a substantive, not a methodological, issue (McClung, Grove, and Hughes, 1989). Only recently have Morrison, Colman, and Preston (1997) begun to speculate that factors associated with the encoding, storage, and retrieval of information are likely to influence the accuracy of the results reported by mystery shoppers. Wilson (1998) examines UK managers’ use of mystery shopping and suggests some experience-based guidelines for reliability, which is described as a critical issue if “staff are to be rewarded on the basis of the results.” But published data on the reliability and validity of mystery shopping remain non-existent.

The popularity of mystery shopping has been more apparent to research practitioners, as evidenced by a one-year increase, from 28 to 187 in 1995, in the number of research agencies listed as providing the service in the MRS Organization Handbook (Dawson and Hillier, 1995). British practitioners have also discussed ensuring the acceptability of particular study results, given the types of decisions for which data are collected (Hurst, 1992; Craven and Yeomans, 1994), and of the general practice, given possible ethical concerns (Dawson and Hillier, 1995). However, some important data quality questions have not been addressed. How do the data generated in mystery shopping studies hold up against the standards of reliability and validity advocated in the marketing research literature? Is the reliability of such mystery shopping data comparable to that of customer survey data? Is the reliability of mystery shopping data higher for objective constructs such as the store environment than for subjective constructs such as service quality? If the results from mystery shopping visits are to be used in managerial decision making, what design principles should be followed in such areas as assigning visits to outlets and choosing the number of items to be evaluated in a visit? For example, does it ever make sense for managerial decisions to be based on the report of a single shopper visit, or are far more visits necessary to assess the performance of outlets? If so, how many are needed? In their survey dealing with the ethical concerns of UK clients for mystery shopping, Dawson and Hillier (1995) found the acceptable number of mystery shopper visits to a competitor in a three-month period averaged nearly four at an outlet level and nearly eight at an organization level. However, it is not clear to what extent survey respondents felt these were all that were necessary for methodological reasons or all that would be acceptable for ethical reasons, even though more would be desirable for methodological reasons. While there are a number of other important issues, such as whether mystery shopping infringes on an employee’s right to privacy (see New Zealand Privacy Commissioner, 1995; Grove and Fisk, 1992), we did not attempt to address them here.

To help answer these methodological questions, we conducted a three-phase research program. First, we designed and carried out a primary study to examine the reliability and validity of mystery shopping when used to evaluate the service quality of different retail outlets. Mystery shoppers for this study were selected from the population of customers of these retail outlets because we intended to compare the reliability and cost effectiveness of mystery shopping and a customer survey when applied to the same problem. The managerial problem of evaluating the relatively subjective service quality of retail outlets was chosen because both methods can be used. Second, we approached a company which provides mystery shopping services, requesting access to secondary data from a typical mystery shopping project. The objective was to investigate the psychometric properties of a typical example of store evaluation data collected via mystery shopping. Finally, in a follow-up study, we collected additional primary data to compare the effectiveness of mystery shopping when used for evaluating objective and subjective areas of performance. In particular, we compare its use in the assessment of the store environment with its use in the assessment of service quality. In these initial studies, we concentrate on ratings data only. These are evaluated using generalizability theory (Cronbach et al., 1972), because the theory and its associated methodology can provide answers to the questions raised here about mystery shopping.

Generalizability theory, hereafter called G-theory, is the most comprehensive approach to assessing the reliability and validity of measurement. It was originally developed for educational testing by Cronbach and his colleagues (1972) and was first identified as of potential value in marketing research by Peter (1979). Rentz (1987, 1988) provided a fuller introduction and presented some demonstration results for the generalizability of some consumer scales using data provided by students. Finn and Kayandé (1997) demonstrated how G-theory can be used to ensure the efficiency of measurement using a study of customer perceptions of the service quality provided by retailers. If mystery shoppers can provide data of similar quality to average customers’, Finn and Kayandé’s (1997) results for comparing retail chains using crossed designs (Table 4, p. 269) raise the possibility that a relatively small number of mystery shoppers may be sufficient to obtain reliable service quality measures.

In the G-theory approach, an initial generalizability study (hereafter called a G-study) is conducted, collecting data to determine how sensitive the construct being measured is to the levels of different factors in the measurement environment. Then, when subsequent managerial decisions need to be made, knowledge of the extent of variation across and within factors can be used to determine how many observations will be necessary to draw managerial conclusions with a required degree of reliability. If cost information is also available, it is possible to identify the most cost-effective designs for each subsequent decision study. Thus, the G-theory approach is most beneficial in a programmatic research context, rather than for one-off research projects.
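The step from G-study variance components to a required number of observations can be illustrated with a minimal Python sketch. It assumes the simplest possible case, a single facet (shoppers) crossed with the object of measurement (outlets), where the relative G-coefficient is the object variance divided by the object variance plus the single-observation error variance averaged over n observations. The function names and the variance values below are hypothetical and chosen only for illustration; they are not estimates from the studies reported here.

```python
def g_coefficient(var_object: float, var_error: float, n_obs: int) -> float:
    """Relative G-coefficient for a one-facet crossed design.

    var_object: variance component for the object of measurement (e.g., outlets).
    var_error:  relative error variance for a single observation
                (e.g., outlet-by-shopper interaction plus residual).
    n_obs:      number of observations (e.g., shopper visits) averaged per object.
    """
    if var_object <= 0 or var_error < 0 or n_obs < 1:
        raise ValueError("need var_object > 0, var_error >= 0, n_obs >= 1")
    # Averaging over n_obs observations divides the error variance by n_obs.
    return var_object / (var_object + var_error / n_obs)


def min_observations(var_object: float, var_error: float,
                     target: float, max_obs: int = 10_000) -> int:
    """Smallest number of observations whose G-coefficient reaches the target."""
    if not 0 < target < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    for n in range(1, max_obs + 1):
        if g_coefficient(var_object, var_error, n) >= target:
            return n
    raise ValueError("target not reached within max_obs observations")


if __name__ == "__main__":
    # Hypothetical variance components, for illustration only.
    outlet_var, error_var = 1.0, 3.1
    for n in (1, 4, 13):
        print(n, round(g_coefficient(outlet_var, error_var, n), 3))
    print("visits needed for 0.80:", min_observations(outlet_var, error_var, 0.80))
```

Under these assumptions, the number of visits needed grows quickly as the error variance rises relative to the outlet variance, which is why a design based on a handful of visits may or may not be adequate depending on the estimated components. A multi-facet design would add further terms to the error variance, each divided by its own number of levels.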

Section snippets

Applying generalizability theory to mystery shopping

G-theory recognizes that an observed performance evaluation provided for a retail outlet will depend on multiple factors, such as the time of the observation, the type of transaction, the employee observed, the specific mystery shopper doing the observation, and the particular aspect of operations being rated, as well as the particular outlet being observed. Each of these factors is referred to as a facet. The contribution of a facet to the overall variation in performance ratings can be

Primary study

The purpose of the primary study was to determine whether the psychometric properties of service quality data collected using typical mystery shopping conditions are comparable to those collected using typical customer survey conditions, and can meet the standards advocated in the marketing research literature. Of particular concern were (1) the relative size of variance components, and therefore the G-coefficient for different objects of measurement, in data collected using the two methods,

Secondary study

To get a better idea of whether one, ten, or nearly forty mystery shopping visits are generally necessary to reliably scale outlets, we examine some secondary data, collected by a supplier of mystery shopping services. These data were made available for academic study with the proviso that no substantive findings or proprietary details would be published. We can only report that the data were collected for one client with branches located throughout one metropolitan area in North America.

The

Follow-up study

To investigate whether our results are driven by our investigation of subjective performance constructs, we designed a follow-up study to collect data on both subjective and relatively objective constructs, namely service quality and store environment. Items for a store environment scale were developed from a comprehensive checklist suitable for retailers wishing to assess their own store environment (Williams and Torella, 1992, p. 128). Three items were selected for each of seven distinct

General discussion

As industry practice stands today, managers are using mystery shopping to evaluate both the objective and subjective characteristics of front line service operations. However, they have had no idea of the psychometric quality of the data collected using this increasingly popular method nor, in the case of subjective characteristics, whether they are comparable to those collected using customer surveys. A generalizability approach allows the manager to optimize the number of mystery shoppers

Conclusions

The evidence from our primary, secondary, and follow-up studies suggests that mystery shoppers provide reasonably reliable ratings of the performance of retail outlets, although much less reliable than seems to be assumed for commercial studies using 3–4 shoppers. The reliability of mystery shopping data is much higher than that of customer surveys, when the data are used for the same problem of scaling outlets. There is presumably an advantage in mystery shoppers knowing they are going to be

References (33)

Parasuraman, A., et al. (1994). Alternative Scales for Measuring Service Quality: A Comparative Assessment Based on Psychometric and Diagnostic Criteria. Journal of Retailing.
Bell, Andrew. (1995). Better Mutual Fund Service: Survey. The Globe and Mail, December...
Berry, Leonard L. (1995). On Great Service.
Bitner, Mary Jo, et al. (1990). The Service Encounter: Diagnosing Favorable and Unfavorable Incidents. Journal of Marketing.
Brennan, Robert. (1983). Elements of Generalizability Theory.
Cobb, Robin. (1995). Magical Mystery Lure. Marketing.
Cox, Eli P. (1980). The Optimal Number of Response Alternatives for a Scale: A Review. Journal of Marketing Research.
Craven, Jill and Yeomans, Mark. (1994). “In Defense of Mystery Shopping,” Proceedings of the 37th Annual Conference of...
Cronbach, Lee J., et al. (1972). The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles.
Cronin, J. Joseph, et al. (1994). SERVPERF Versus SERVQUAL: Reconciling Performance Based and Perceptions-Minus-Expectations Measurement of Service Quality. Journal of Marketing.
Dawson, Janet, et al. (1995). Competitor Mystery Shopping: Methodological Considerations and Implications for MRS Code of Conduct. Journal of the Market Research Society.
Dwek, Robert. (1996). Magic of Mystery Shopping. Marketing.
Finn, Adam, et al. (1997). Reliability Assessment and Optimization of Marketing Measurement. Journal of Marketing Research.
Gibson, Richard. (1995). “McDonald’s Approaches ‘96 With Goal of Making Its U.S. Service ‘Hassle-Free.’” Wall Street...
Grove, Stephen J., et al. (1992). Observational Data Collection Methods for Services Marketing: An Overview. Journal of the Academy of Marketing Science.
Hurst, Stephen C. (1992). “Quantifying Customer Service Via Mystery Shopper Surveys.” Proceedings of the 35th Annual...


²: Some work on this paper was carried out while the first author was Visiting Professor, Institute of Marketing, Norwegian School of Economics and Business Administration and the second author was a doctoral candidate at University of Alberta. Order of authorship is alphabetical.

¹: Adam Finn is R.K. Banister Professor of Business at University of Alberta and Ujwal Kayandé is a Senior Lecturer in Marketing at the Australian Graduate School of Management, Sydney.
