
Safety Science

Volume 118, October 2019, Pages 763-771

Calibrating experts’ probabilistic assessments for improved probabilistic predictions

https://doi.org/10.1016/j.ssci.2019.05.048

Highlights

  • We propose a new calibration measure to evaluate experts’ probability assessments.

  • The new calibration measure is compared with established calibration measures.

  • Theoretical properties of the new calibration measure are investigated and discussed.

  • We contrast and discuss results using a large data-set of experts’ predictions.

Abstract

Expert judgement is routinely required to inform critically important decisions. While expert judgement can be remarkably useful when data are absent, it can be easily influenced by contextual biases which can lead to poor judgements and subsequently poor decisions. Structured elicitation protocols aim to: (1) guard against biases and provide better (aggregated) judgements, and (2) subject expert judgements to the same level of scrutiny as is expected for empirical data. The latter ensures that if judgements are to be used as data, they are subject to the scientific principles of review, critical appraisal, and repeatability. Objectively evaluating the quality of expert data and validating expert judgements are other essential elements. Considerable research suggests that the performance of experts should be evaluated by scoring experts on questions related to the elicitation questions, whose answers are known a priori. Experts who can provide accurate, well-calibrated and informative judgements should receive more weight in a final aggregation of judgements. This is referred to as performance-weighting in the mathematical aggregation of multiple judgements. The weights depend on the chosen measures of performance. We do not yet know which aggregation methods perform best, how well such aggregations perform out of sample, or the costs and benefits of the various approaches. In this paper we propose and explore a new measure of experts’ calibration. A sizeable data set containing predictions for outcomes of geopolitical events is used to investigate the properties of this calibration measure when compared to other, well established measures.

Introduction

Experts are consulted in a myriad of situations and inform all stages of the modelling process, from structuring the problem to estimating facts, and quantifying uncertainty. While consulting experts may be a valuable resource for decision-makers, it is crucial that decision-makers, stakeholders and experts play separate roles in the decision process. The experts’ role should be limited to providing estimates of facts and predictions of event outcomes (Sutherland and Burgman, 2015).

Expert judgement is used when empirical data are unavailable, incomplete, uninformative, or conflicting. These judgements then inform critically important decisions. It is therefore important that such judgements are as defensible as possible. Research into the experts’ performance when providing such judgements reveals that expert status is not correlated with the ability of an expert to give unbiased, error-free judgements (Burgman et al., 2011). Expert judgements are susceptible to a range of cognitive and motivational biases, to the expert’s particular context, and to their personal experiences (Shrader-Frechette, 1996, Slovic, 1999, Montibeller and von Winterfeldt, 2015). To counter such limitations, structured expert elicitation protocols have been developed.

A working definition of a structured protocol is given in Cooke (1991) and slightly reformulated in Hanea et al. (2018). It involves asking questions with operational meanings, following a traceable, repeatable process that is open to review, mitigating biases, and providing opportunities for empirical evaluation and validation. The above are guidelines rather than rules for what one would consider a structured and efficient protocol.

While expert interaction and the feedback provided are still debated within the expert judgement community, one point of broad agreement is that judgements from more than one expert are essential in all situations. A diversity of opinions is always desirable, and gender, age, experience, affiliation, and world view serve as proxies for diversity.

Here, we restrict our attention to eliciting expert judgements of event occurrences. Eliciting multiple judgements about the same event results in a set of probabilities, rather than the single probability of occurrence that is often needed in further modelling. It is not uncommon for these probabilities to differ, reflecting the different knowledge bases and mental models experts use when making their judgements. While these differences are crucial to understanding the problem in all its complexity and need to be recorded, they make the aggregation of different judgements (into a single one) somewhat cumbersome.

The two main ways of aggregating different judgements are behavioural aggregation (e.g. O’Hagan, 2005), and mathematical aggregation (e.g. Valverde, 2001, Cooke, 1991). Behavioural aggregation typically involves face-to-face meetings of experts who, at the end of the meetings, agree on a judgement. Discussion and consensus seeking are often practised together. The main advantage of this approach is that experts share and debate their knowledge. Nevertheless, such interaction is prone to biases including groupthink and halo effects (e.g. Hinsz et al., 1997). Sometimes experts cannot reach any consensus at all. When experts disagree, even after a facilitated discussion, attempts to impose consensus may mask the group’s diversity of opinion.

The alternative is to use a mathematical rule to aggregate the judgements. When mathematical aggregation is used, the interaction between experts is usually limited to training and briefing. Extensive discussion is discouraged because it may induce dependence between the elicited judgements (e.g. O’Hagan et al., 2006). Very few studies have been undertaken in order to investigate this effect, and even fewer found it harmful to the process (e.g. Hanea et al., 2016). Nonetheless, using a mathematical rule makes the aggregation explicit and auditable, and makes the results reproducible. Different rules satisfy different properties and unfortunately it is impossible to have all desirable properties in one rule (Clemen and Winkler, 1999). One well established and used aggregation rule was formulated in Cooke (1991), and it is a linear combination of judgements weighted by the experts’ prior performance on similar tasks. Cooke’s method, also called the Classical Model (CM) of structured expert judgement (SEJ), asks experts to give estimates that can be validated with data in a process that is transparent and neutral. This particular way of aggregation makes CM satisfy all the desiderata of a SEJ protocol.
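
To illustrate the idea of performance weighting, the sketch below combines hypothetical event probabilities with weights proportional to (equally hypothetical) prior performance scores. It is a minimal illustration of a performance-weighted linear pool, not the full Classical Model as defined by Cooke (1991); the names, probabilities, and scores are invented.

```python
import numpy as np

def linear_pool(probabilities, performance_scores):
    """Performance-weighted linear pool of experts' probabilities for one event.

    probabilities:      each expert's probability for the same event
    performance_scores: non-negative scores earned on prior calibration questions
                        (higher = better past performance); illustrative only
    """
    weights = np.asarray(performance_scores, dtype=float)
    weights = weights / weights.sum()          # normalise so the weights sum to 1
    return float(np.dot(weights, probabilities))

# Three experts assess the same event; the third scored best on the calibration questions.
p_event = linear_pool([0.20, 0.35, 0.60], performance_scores=[0.5, 1.0, 2.5])
print(p_event)  # 0.4875 -> pulled towards the best-performing expert's judgement
```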

Ideally, after the aggregation, the resulting single probability reflects many of the experts’ initial judgements, and even though experts may not recognise it as their own, they should have no valid arguments against it. The consensus is not achieved by conferencing, but in a rather external way, through the mathematical aggregation. This sort of consensus is what Cooke calls rational consensus (Cooke, 1991). Achieving a rational consensus on a single probability may be very difficult when experts strongly disagree and have very little interaction and feedback from their peers.

A structured protocol which strives to deal with such situations is the IDEA protocol (e.g. Hanea et al., 2016). IDEA builds on CM, while using elements from behavioural aggregation techniques, which makes it a mixed protocol for SEJ (similar to the well-known Delphi protocol (Rowe and Wright, 2001)). IDEA asks experts for their individual estimates without allowing them to interact, presents the anonymised set of judgements back to the group of experts, and encourages facilitated discussion and extensive interaction, while discouraging consensus. After experts share their reasoning, (sources of) data, and (mental) models, they have the opportunity to privately modify their initial estimates (if they so wish) in accordance with the discussion. These second estimates are then mathematically aggregated. The aggregation can be either an equally or a differentially weighted linear combination. If differential weights are used in the IDEA protocol, they are always proportional to measures of prior performance on similar tasks (Hanea et al., 2018).
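
As a concrete, entirely hypothetical illustration of this round structure, the sketch below shows first-round estimates for a single event, revised second-round estimates, and an equal-weight aggregation of the latter; the experts and numbers are invented, and the discussion step itself is of course not something code can capture.

```python
import numpy as np

# Hypothetical IDEA-style elicitation for one event (experts and values are invented).
round1 = {"expert_A": 0.10, "expert_B": 0.45, "expert_C": 0.30}

# The anonymised first-round estimates are fed back and discussed; experts may then
# privately revise their judgements (here two of the three choose to do so).
round2 = {"expert_A": 0.20, "expert_B": 0.40, "expert_C": 0.30}

# Equally weighted linear aggregation of the second-round estimates.
equal_weight = float(np.mean(list(round2.values())))
print(equal_weight)  # 0.30

# A differentially weighted aggregation would instead use weights proportional to
# prior performance on calibration questions, as in the linear pool sketched earlier.
```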

An aggregated opinion can be viewed as that of a “virtual” expert. The same performance measures can then be used both to evaluate this virtual expert’s performance and to justify choosing the aggregation which performs best. Commonly used measures of performance are designed to be objective and conservative and focus on different attributes of good performance. They are measured on sets of questions to ensure sustained, rather than isolated, good performance. Three of these attributes are long term accuracy, long term informativeness and calibration. Long term accuracy and informativeness are calculated per question and averaged across questions, hence they are average measures of performance. Accuracy measures how close an expert’s estimate is to the truth, which is a difficult concept to interpret when the estimate is a probability of occurrence and the truth is the occurrence or non-occurrence of the event. Informativeness may measure the amount of entropy in what the expert says (independent of the actual occurrences of events), or the entropy in the expert’s performance (without corresponding to the distribution that the expert, or anyone else, believes) (Cooke, 1991). It may also measure the departure from the uniform distribution (Hanea et al., 2016). Rather than average measures for individual questions (variables), Cooke (1991) proposes and discusses the advantages of measures for average probabilities. Calibration, rather than accuracy, is proposed as a more appropriate measure of each expert’s performance. Measures of performance are constructed using scoring rules. These scoring rules are random variables, and analysing and comparing the scores’ values requires knowledge about the scores’ respective distributions. An important reason for Cooke’s proposal is that the proposed score has a known asymptotic distribution, as opposed to (for example) the average Brier score for measuring accuracy (Brier, 1950), which does not. Moreover, the score is asymptotically proper, which means that its expected pay-off is maximised only when experts express their true beliefs about the predicted event (e.g. Winkler and Murphy, 1968). Despite the very attractive theoretical properties of Cooke’s calibration score, it has only a couple of real-life applications for discrete variables (Cooke et al., 1988, Bhola and Cooke, 1992). One reason for this lack of uptake may be its asymptotic properties, which imply the need for tens of questions in order to obtain reliable scores. These questions, commonly referred to as calibration questions, are additional to the questions of interest (which must be asked regardless), hence they are time consuming and add to the experts’ fatigue. A reduced number is desirable. Another disadvantage of an asymptotic score is that comparing scores of experts who answered different numbers of questions may be cumbersome, and a power equalisation technique may be needed (Cooke, 1991). Both disadvantages point to the need for a score with an exact distribution.
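
To make the contrast between these average measures and calibration concrete, the sketch below computes an average Brier score and a likelihood-ratio calibration statistic for binned binary assessments. The statistic is only "in the spirit of" Cooke's calibration score, and the bin layout and the choice of degrees of freedom are illustrative assumptions, not the definitions used in the paper.

```python
import numpy as np
from scipy.special import xlogy
from scipy.stats import chi2

def average_brier(probs, outcomes):
    """Average Brier score: mean squared difference between the stated
    probability and the realised 0/1 outcome (lower is better)."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((probs - outcomes) ** 2))

def calibration_statistic(bin_probs, n_events, n_occurred):
    """Likelihood-ratio calibration statistic for binned binary assessments
    (illustrative sketch, not the paper's exact score).

    bin_probs:  nominal probability attached to each bin, e.g. 0.2, 0.8
    n_events:   number of events the expert placed in each bin
    n_occurred: how many of those events actually occurred
    """
    p = np.asarray(bin_probs, float)
    n = np.asarray(n_events, float)
    s = np.asarray(n_occurred, float) / n          # observed relative frequencies
    # 2 * n_b * KL(s_b || p_b), summed over bins; xlogy handles s_b in {0, 1}
    stat = 2.0 * np.sum(n * (xlogy(s, s / p) + xlogy(1 - s, (1 - s) / (1 - p))))
    # Under the hypothesis of perfect calibration the statistic is asymptotically
    # chi-squared; the degrees of freedom used here (number of bins) are an assumption.
    p_value = float(chi2.sf(stat, df=len(p)))
    return float(stat), p_value

# An expert placed 20 events in the "0.2" bin (5 occurred) and 10 in the "0.8" bin (7 occurred).
print(average_brier([0.2, 0.8, 0.8], [0, 1, 0]))                       # 0.24
print(calibration_statistic([0.2, 0.8], n_events=[20, 10], n_occurred=[5, 7]))
```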

In this paper we propose one such score (see Section 2), discuss its theoretical properties, and compare it with Cooke’s calibration score on a synthetic, simulated data set (Section 3.1), as well as on a large dataset containing predictions for outcomes of geopolitical events for the period 2011–2015 (Section 3.2). The dataset, together with the elicitation protocol used to elicit these data, is described in Section 3.2.1. We conclude the paper in Section 4 with a discussion that outlines potential shortcomings of the new score and future research directions.

Section snippets

Methods

Assessing the probability of occurrence for certain events of interest equates to eliciting binary random variables. Nonetheless, the methodology presented in this paper can be extended to any discrete random variable. Let X be a binary random variable, such that P(X=1)=p and P(X=0)=1-p. Experts are asked to estimate p, but instead of asking directly for this probability, they are asked to assign the event whose occurrence is modelled by X to a probability bin …
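
The snippet cuts off at the bin assignment, but the basic bookkeeping can be sketched: each event is placed in a probability bin and, once outcomes are known, occurrences are tallied per bin. The bin midpoints below are assumptions for illustration, not necessarily the bins used in the paper.

```python
import numpy as np

# Illustrative probability bins (midpoints); the binning used in the paper may differ.
BIN_MIDPOINTS = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

def assign_to_bin(prob):
    """Map a raw probability assessment to the nearest bin midpoint."""
    return float(BIN_MIDPOINTS[np.argmin(np.abs(BIN_MIDPOINTS - prob))])

def tally_bins(probs, outcomes):
    """Count, per bin, how many events the expert placed there and how many occurred.
    Returns {bin midpoint: (n_events, n_occurred)} for the non-empty bins."""
    counts = {}
    for prob, outcome in zip(probs, outcomes):
        b = assign_to_bin(prob)
        n, k = counts.get(b, (0, 0))
        counts[b] = (n + 1, k + int(outcome))
    return counts

# Four events with assessed probabilities and realised outcomes (1 = occurred).
print(tally_bins([0.12, 0.18, 0.81, 0.77], [0, 1, 1, 1]))
# e.g. {0.15: (2, 1), 0.85: (1, 1), 0.75: (1, 1)}
```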

Results

We first interrogate an artificial data set to understand the behaviour of the three calibration scores when calculated for different numbers of questions. The theoretical properties of the calibration scores, such as the χ2 approximation, rely on a sufficiently large number of questions. In an ideal situation, hundreds of questions should be answered in order to distinguish between the calibration scores of experts. In practice, it is very unlikely for experts to answer so many questions for …
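
To give a feel for why asymptotic approximations demand many questions, here is a small self-contained simulation, assuming a hypothetical, perfectly calibrated expert who always uses a single 0.3 bin; it is not the artificial data set analysed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative experiment (not the paper's simulation design): a perfectly calibrated
# expert places every event in the "0.3" bin; how noisy is the observed frequency in
# that bin for different numbers of questions?
NOMINAL_P = 0.3

for n_questions in (10, 30, 100, 1000):
    freqs = []
    for _ in range(2000):                                  # repeat to see the spread
        outcomes = rng.random(n_questions) < NOMINAL_P     # well-calibrated outcomes
        freqs.append(outcomes.mean())
    freqs = np.array(freqs)
    print(n_questions, round(float(freqs.mean()), 3), round(float(freqs.std()), 3))

# The observed frequency only settles near 0.3 for large numbers of questions, which
# is one way to see why scores relying on asymptotic (chi-squared) approximations
# need many calibration questions to be reliable.
```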

Concluding remarks

A new score for measuring how well calibrated experts’ probability assessments are was introduced and discussed. This score has a known exact distribution. We have investigated the score’s theoretical properties and practical performance and discussed a number of its positive and negative attributes. The identified need for a score which uses an exact rather than an approximate distribution of its test statistic is satisfied, but this comes with other shortcomings, some identified in this …

References (29)

  • B. Bhola et al., Expert opinion in project management, Eur. J. Oper. Res. (1992)

  • A. Agresti (2003)

  • G.W. Brier, Verification of forecasts expressed in terms of probability, Mon. Weather Rev. (1950)

  • M.A. Burgman, Trusting Judgements: How to Get the Best Out of Experts (2015)

  • M.A. Burgman et al., Expert status and performance, PLoS ONE (2011)

  • K. Butler et al., The distribution of a sum of independent binomial random variables, Methodol. Comput. Appl. Probab. (2017)

  • R. Clemen et al., Combining probability distributions from experts in risk analysis, Risk Anal. (1999)

  • R.M. Cooke, Experts in Uncertainty: Opinion and Subjective Probability in Science, Environmental Ethics and Science Policy Series (1991)

  • R.M. Cooke et al., Calibration and information in expert resolution, Automatica (1988)

  • A. Hanea et al., Classical meets modern in the IDEA protocol for structured expert judgement, J. Risk Res. (2016)

  • A. Hanea et al., InvestigateDiscussEstimateAggregate for structured expert judgement, Int. J. Forecast. (2016)

  • A. Hanea et al., The value of discussion and performance weights in aggregated expert judgements, Risk Anal. (2018)

  • V.B. Hinsz et al., The emerging conceptualization of groups as information processors, Psychol. Bull. (1997)

  • H.O. Lancaster, The combination of probabilities arising from data in discrete distributions, Biometrika (1949)