1 Scope

Composite indicators have seen dramatic growth in use as well as in impact, and are increasingly used at face value by a plurality of actors. At the same time, academia maintains an honest scepticism about these measures, tempered by methodological developments that aim to remedy their most evident shortcomings.

True to its nature, a composite indicator is usually built to ‘tell a story’. It is thus ideally suited to identify and bring attention to a possibly latent phenomenon. For this, it is appreciated by ‘value entrepreneurs’.

When used for policy, the unidirectionality of composite indicators is less desirable. In the context of policy analysis and negotiation, in which different options, as well as different ‘ends in sight’, are relevant, composite indicators may fall short. Ideally, different stakeholders could confront one another, armed with different measures and indicators. Could the concept of composite indicators be stretched so as to accommodate these settings? Could one force it to tell ‘more than one story’?

In this contribution, we put this concept to the test through a set of examples.

2 The Fortune of Composite Indicators

A recent search on Scopus performed in the summer of 2019 indicates growth in interest in composite indicators, proxied by their mention in the scientific literature. Figure 1 suggests that not only is the use of composite indicators increasing but so is the pace of their growth.

Fig. 1 Search on www.scopus.com using the search string: TITLE-ABS-KEY (“composite indicator*”) OR TITLE-ABS-KEY (“composite index”) OR TITLE-ABS-KEY (“composite indices”)

The ‘Report by the Commission on the Measurement of Economic Performance and Social Progress’, prepared by Joseph Stiglitz, Amartya Sen, and Jean-Paul Fitoussi for the French president Nicolas Sarkozy (Stiglitz et al. 2009), says that the growth in the number of statistical indicators reflects different concurring trends: improvements in the level of education, increases in the complexity of modern economies, and the widespread use of information technology.

As discussed in Becker et al. (2017), human well-being and progress are areas in which composite indicators are popular, covering themes from happiness-adjusted income to environmentally adjusted income, from child development to information and communication technology. They are also used in the analysis of innovation (Balcerzak and Pietrzak 2017a; Dutta, Lanvin and Wunsch-Vincent 2018; Hausken and Moxnes 2019; Żelazny and Pietrucha 2017), analysis of real estate markets (Małkowska and Głuszak 2016), countries’ competitiveness (Cheba and Szopik-Depczynska 2017; World Bank 2019; Kruk and Waśniewska 2017; Schwab 2019), socio-economic development (Bartkowiak-Bakun 2017; Mazziotta and Pareto 2016), the quality of institutions (Balcerzak and Pietrzak 2017b), sustainable development (Balcerzak and Pietrzak 2017c; Luzzati and Gucciardi 2015; Semenenko et al. 2019), the standard of living (Greyling and Tregenna 2017; Kuc 2017), well-being (Barrington-Leigh and Escande 2018; Chaaban et al. 2016; Peiro-Palomino and Picazo-Tadeo 2018) and many others (Aparicio and Kapelko 2019; Capecchi and Simone 2019; Dinis et al. 2019; Marozzi 2015; Mann and Shideler 2015; Michener 2015; Miro and Piffaut 2019; Rogalska 2018). Many composite indicators target issues of sustainability. In this respect, an ongoing line of inquiry is to understand why these measures seem to have little traction, e.g. in relation to their capacity to displace established indicators such as gross domestic product (GDP) as measures of progress (Boulanger 2018; Popp Berman and Hirschman 2018). We return to these themes later in the paper.

3 Pros and Cons of Composite Indicators

An existing handbook on composite indicators (OECD-JRC 2008) lists several pros and cons of these measures (Table 1).

Table 1 Pros and cons of composite indicators

The report by Stiglitz et al. (2009) considers composite indexes (CIs) problematic because, even when the weighting procedure is presented transparently, the normative implications of the weights are seldom spelt out or justified. Along the same lines, for Popp Berman and Hirschman (2018), “successful quantification projects tend to hide their assumptions”. The critique of CIs also feeds into the present reflection on the abuse of metrics (Muller 2018), with the example of university rankings and similar league tables standing out as controversial (Saisana et al. 2011; Wilsdon 2016).

For mainstream economics—for the purpose of this paper exemplified by the World Bank’s economist Martin Ravallion—CIs are guilty of not being constructed on sound economics. Ravallion identifies two types of indices: those built on economic theory, either as direct monetary aggregates or as aggregates based on shadow prices; and all others, which are dismissively termed ‘mashup indices’—a definition that Ravallion applies to existing measures such as the Human Development Index (HDI) and the Multidimensional Poverty Index (MPI).

More generally, the standing of economics as a master discipline fit to adjudicate the soundness of any form of knowledge has suffered since the internal discussion in the profession on the role of mathematics and modelling. Initiated after the models’ inability to predict the last recession (Mirowski 2013), this discussion gave birth to a new term, ‘mathiness’, coined by the economist Paul Romer (2015) to denote the use of mathematics to veil positions which are in fact normative. The use of prices—whether real or shadow—is in itself fraught with the crucial normative assumption that socio-ecological outcomes can be represented in monetary terms (Funtowicz and Ravetz 1994). Finally, doubts remain as to what constitutes a ‘sound economic theory’, given the present controversy in the profession against the prevailing economic paradigm (Reinert 2008; Rethinking Economics 2017).

Indeed, we suggest that composite indicators are a field better addressed by social scientists in general than by economists in particular, as we go on to discuss in the next section.

4 Is a Theory for Composite Indicators Possible?

Without being exhaustive, we mention here some important ingredients of a possible theory of CIs. To start with, the OECD-JRC handbook (2008) offers ten recursive steps to build CIs, from building a theoretical framework to presenting the results, and includes advice on how to tackle technical choices on data selection, imputation, normalization, and aggregation. Even a cursory glance at this list is sufficient to grasp that numerous modelling assumptions are needed in these processes, not just the assignment of weights; the idea that a composite indicator can be made ‘objective’ should be put to rest. As mentioned above, ‘telling a story’ is precisely what a ‘value entrepreneur’ wishes to do. In this context, the problem may not be with the non-neutrality of a measure but, rather, with its purported neutrality. The production of apparently neutral scientific facts in the context of policy is known as ‘stealth advocacy’ (Pielke 2007), meaning that this form of advocacy is hidden behind a veneer of objectivity. Following Pielke, we suggest that scientists and CI developers be clear about their normative stances, rather than courting an impossible neutrality.

It is also important to note that most existing CIs suffer from technical shortcomings, especially when they are built as a linear combination of variables. For example, most indicators are built such that the weights attached to the different variables (i) add up to one and (ii) are taken to reflect the importance of the variables. In fact, both (i) and (ii) are highly questionable, if not outright wrong. It can be proven that it is the sum of the squared weights that should equal one (Becker et al. 2017), and that the actual importance of a variable in a CI may deviate considerably from its weight (Paruolo et al. 2013). Both results derive from the theory of sensitivity analysis (Saltelli et al. 2008), a tool used in mathematical modelling. For the ideal case of uncorrelated and standardised variables, this theory shows that the importance of a given variable is the square of its weight divided by the sum of all squared weights (Saltelli et al. 2008, p. 47). For non-standardised and non-independent variables, the importance depends on the interplay of the relative variances and the correlation among the variables.
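To make the point concrete, the following minimal sketch (with weights chosen purely for illustration) contrasts the nominal weights of a three-variable CI with the variance-based importance, computed as the square of each weight divided by the sum of the squared weights, valid under the ideal assumptions of uncorrelated and standardised variables stated above.

```python
import numpy as np

# Nominal weights assigned by a hypothetical CI developer (they sum to one).
w = np.array([0.5, 0.3, 0.2])

# For uncorrelated, standardised variables, the variance-based importance of
# variable j is w_j^2 / sum_k(w_k^2) (Saltelli et al. 2008, p. 47).
importance = w**2 / np.sum(w**2)

print("nominal weights:  ", w)                    # [0.5   0.3   0.2  ]
print("actual importance:", importance.round(3))  # [0.658 0.237 0.105]
```

Even in this ideal setting, the variable nominally weighted at 0.5 accounts for about two-thirds of the variance of the composite, while the variable weighted at 0.2 accounts for about one-tenth.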

For Paul-Marie Boulanger (2014), CIs can be situated at the intersection of three conceptual movements. The first is associated with the democratisation of expertise (Carrozza 2014), the idea that more knowledge needs to be brought to bear than that provided by experts. This is close to the concept of the extended peer community advocated in post-normal science (Funtowicz and Ravetz 1993), in which the soundness of scientific practice is not judged simply by the peers of a given discipline but by several disciplines, because each discipline offers a different viewpoint. Laypeople with a direct stake, knowledge, or interest in the matter at hand are also involved in the process. The OECD-JRC handbook insists on the need for an indicator to be co-developed, possibly and ideally involving the community of those actors (individuals, institutions, countries, or regions) that are affected by the measure. This process should extend to all the stages in a CI’s construction (OECD-JRC 2008).

The second constitutive element identified by Boulanger (2014) is CI as instrumental in the creation of a new public through a process of social discovery (Dewey 1938). Why are ‘social discoveries’ needed? For Dewey, there are ‘publics’ affected by a transaction taking place somewhere else, who need to be made aware of a problem situation—e.g. the pollution of an aquifer, a case of air contamination, the unintended consequences of technology, and so on. In his view, the “machine age has so enormously expanded, multiplied, intensified and complicated the scope of the indirect consequences […] that the resultant public cannot identify and distinguish”. The German sociologist Ulrich Beck later called a society thus affected by a myriad of mostly invisible threats a ‘risk society’ (Beck 1992 [1986]).

In general, developing an index of, say, air quality, sustainability, the rule of law, corruption, or university performance can generate and mobilise new publics, at times producing important policy outcomes. Several recent books on the emerging field of sociology of quantification reviewed in Popp Berman and Hirschman (2018) address this issue.

The third element which can be used to shed light on CIs comes from Charles Sanders Peirce, the father of semiotics, and his triadic conception of the sign as a structure connecting three elements: the sign proper (S), an object (O), and an ‘interpretant’ (I). The example reported by Boulanger (2014) is that of the African vervet monkey, which possesses a sophisticated repertory of vocal signs for signalling the presence of a predator, distinguishing a terrestrial stalker such as a leopard, an aerial raptor such as an eagle, or a ground predator such as a snake. In this case, the ‘object’ is the predator, the ‘sign’ is the cry emitted by the vervet, and the ‘interpretant’ is the collective behaviour of the monkeys in reaction to the cry, e.g. climbing up a tree if the predator is terrestrial. Thus a CI is not just a sign (a number) pointing to an issue; it also entails an interpretant, understood as the policy or social change desired or suggested by the proponents of the measure. Dewey (1938) himself made the point that any social indicator is meaningful only in the context of a desired end in sight.

Boulanger enriched the analysis of indicators of sustainability and their interpretants in a subsequent work (2018) drawing from the German sociologist and systems theorist Niklas Luhmann. For Boulanger, mobilising publics with a CI is not just a matter of providing the ‘facts’. According to Luhmann, modern societies should be seen as functionally differentiated, i.e. composed of several relatively independent sub-systems (law, science, the economy, the mass media, etc.). These systems may either actively develop/use indicators to influence their evolution or be influenced by them when one system is observing another system with an indicator. The problem with sustainability is thus which system is observing or ‘irritating’ (the word used by Luhmann) which system. Science cannot easily ‘irritate’ the economy, e.g. by developing a measure of progress to replace GDP, as science and the economy follow separate codes of communication: true/false for science and gain/loss for the economy.

There are instead cases where the strength of indicators, namely their capacity to appear objective and their addictive character, manages to impact social systems, as evidenced by the university league tables mentioned earlier (Saisana et al. 2011; Wilsdon 2016).

5 Frames and Quantitative Storytelling (QST)

It is a tautology that every measure of society corresponds to a frame and—according to Dewey—to ‘an end in sight’ as well. The use of frames in the policy discourse is a hot topic. For example, in the US, George Lakoff (2004, 2010) laments the liberals’ cultural subjugation to frames developed by conservatives. In economics, Akerlof and Shiller (2015) discuss how economic actors operating in a market are forced to exploit the frames of their consumers in order to survive. This view reverses economics’ (and Adam Smith’s) cherished paradigm that the composition of individual selfishness in a market produces a common good. The market as the best arbiter of all negotiations is at the heart of the prevailing orthodoxy, neoclassical economics. As mentioned, this theory is at present contested (Mirowski 2013; Reinert 2008; Rethinking Economics 2017). The French legal scholar Alain Supiot (2015) sees the neoliberal creed in markets as particularly reliant on quantification, in which numbers replace laws to create a world of dystopian injustice and dysfunction.

Frames are very much at the core of discussion on the use of evidence for policy (known as evidence-based policy [EBP]). Dan Kahan, an expert in cultural cognition theory, studies how we process new knowledge in order to reinforce our existing worldviews and political orientations (Kahan et al. 2011), so an actor’s deeper knowledge reinforces—rather than cures—his or her polarisation.

It is thus a current refrain in EBP that facts and values cannot always be separated, e.g. when social facts are brought to bear on the design of a policy (Gluckman 2017). The present—in our view disingenuous—brouhaha about the end of facts and the post-truth society (Flood 2016) mixes somewhat uncritically conscious strategies of confusion and manipulation—as witnessed in the recent US elections—with the existence of a plurality of legitimate frames around what constitutes a problem and hence a ‘fact’ in relation to that problem (Saltelli and Funtowicz 2017).

In other words, one should not equate the US administration’s ‘alternate facts’ with the concept of ‘extended facts’, defined in the theory of post-normal science as the product of an extended peer community (Funtowicz and Ravetz 1993). An extended fact may be a loss to a constituency which has been overlooked, a fact that is not part of the set of facts brought to bear by a regulator or by the proponent of a policy.

Quantitative storytelling (QST; Giampietro et al. 2014; Saltelli and Giampietro 2017) posits that multiple legitimate frames and worldviews are upheld by different social actors. Thus, QST draws attention to the fact that the economic and mathematical models used in EBP often take the form of risk analyses or cost–benefit analyses. These models focus on a single framing of the issue under consideration. A classic example is when a monetary equivalent is assigned to a social or environmental good. As discussed above in relation to the critique of Ravallion (2010), this implies a clear normative stance.

In the logic of QST, deepening the analysis that corresponds to a single view of what the issue is, achieved by ‘mathematizing’ the problem, distracts from possible alternative readings.

For Ravetz (1987) and Rayner (2012), alternative frames may represent ‘uncomfortable knowledge’, which is removed from the policy discourse. Lakoff (2004, 2010) suggests that frames may be used to generate ‘hypo-cognition’.

Under this critical viewpoint, mathematical models—or a CI in the present context—can be seen as a tool for ‘displacement’ (Rayner 2012). This occurs when a model or a ranking becomes the end instead of the means, e.g. when universities monitor and manage the outcome of a ranking, rather than what happens within their walls (Saisana et al. 2011). After stakeholders realise they are on the receiving end of a strategy of hypo-cognition, their trust in the actors and institutions involved may be diminished (Saltelli and Giampietro 2017). A fundamental problem with EBP is that stronger players have access to better evidence, i.e. more data and indicators in the present context, and can use it strategically (Saltelli 2018). Thus, EBP is based on a power asymmetry (Boden and Epstein 2006; Strassheim and Kettunen 2014).

In QST, one drops the hope that neutral, impersonal facts will prescribe a policy and suggests instead acknowledging ignorance, so as to identify ‘clumsy solutions’ (Rayner 2012) which may accommodate unshared epistemological or ethical principles. Similarly, post-normal science suggests ‘working deliberatively within imperfections’ (van der Sluijs et al. 2008). The solution offered by QST is to use quantification ‘via negativa’, by testing which of the available frames runs afoul of a quantitative or qualitative analytic check, as shown in the PISA example below. QST borrows from system ecology and attempts to refute frames that are based on unsound inference or that violate the constraints of (i) feasibility (can one afford a given policy in terms of external constraints, e.g. existing biophysical resources?), (ii) viability (can one afford it in the context of internal constraints, i.e. governance, socio-economic, and technological arrangements?), and (iii) desirability (will the relevant constituency accept it?) (Giampietro et al. 2014). For example, in examining the transition to a carbon-free economy, one can test the availability of natural resources (lithium, cobalt, and other minerals needed for energy storage), whether legislation promoting the transition is viable, and whether the transition is compatible with existing lifestyles. An instructive test case of QST exploring the deployment of intermittent electrical energy supply in Germany and Spain is reported in Renner and Giampietro (2019).

Perhaps the best application of the concept of QST ante litteram is an old study of how European citizens perceive the existing narratives and conflicts in the adoption of genetically modified (GMO) food and products (Marris et al. 2001). Although the prevailing narrative is that ‘GMO food is safe’ and that consumer reluctance is rooted in anti-technology or anti-science prejudice, Marris and co-authors showed that the people interviewed did not care about the safety of GMO food but, rather, expressed concern about a totally different set of issues, including:

  • who would benefit from these technologies

  • why they were introduced in the first place

  • whether existing regulatory authorities would be up to the task of resisting regulatory capture from powerful industrial incumbents.

Quantitative storytelling, like other tools for evidence appraisal, such as sensitivity analysis (Saltelli et al. 2008), NUSAP (Funtowicz and Ravetz 1990; van der Sluijs et al. 2005), and sensitivity auditing (Saltelli et al. 2013), can be useful for gauging and possibly deconstructing existing measures. Proposers of new CIs should factor in this danger if they wish to anticipate criticism. They should be the first to test the relevance and robustness of their constructs, following the well-known Mertonian principle of ‘organized scepticism’ (Merton 1973), in which scientists strive to falsify their own results and invite fellow scientists to attempt such a deconstruction.

In the present work, we apply the QST approach to the construction of a composite indicator. We consider that social convergence, with its dense web of interconnected interests, policies, and outcomes, offers an ideal environment for such an experiment. The Cohesion Policy and convergence issues are still being discussed in the policy arena (e.g. the 7th Cohesion Forum, held in Brussels on 26–27 June 2017; European Commission 2017a, b, c, d, e), as well as in academia (Anagnostou et al. 2015; Baddeley 2006; Balcerzak and Rogalska 2016; Cosci and Mirra 2017; Furkowa and Chocholata 2017; Horridge and Rokicki 2017; Pietrzak and Balcerzak 2017; Próchniak and Witkowski 2016; Scheurer and Haase 2017; Stanickova 2017). The narratives linked to the Cohesion Policy are likely to be perceived as increasingly important for mitigating the present difficulties of the European project (Applebaum 2017).

6 A Previous Example of Quantitative Storytelling

Quantitative storytelling has been used in relation to the ranking of the OECD-PISA study (Araújo et al. 2017; Saltelli 2017). We describe this work here, as it shows how the methodology can be used to deconstruct a measure. In the test cases of the following sections a constructive use of quantitative storytelling is demonstrated.

Since the publication of its first results in 2000, the Programme for International Student Assessment (PISA) implemented by the Organisation for Economic Co-operation and Development (OECD) has been a subject of controversy. PISA has been presented by some as a measure of a country’s innovation and growth potential, while others have found these metrics—published every 3 years with considerable media amplification—irrelevant and potentially counter-productive. Notably, the PISA dispute was the subject of a letter published in The Guardian newspaper and signed by several educationalists and scholars, and of a subsequent exchange with the OECD (Meyer and Zahedi 2014; for additional references, see Araújo et al. 2017).

OECD-PISA is a convenient example for discussing the importance of the issue of frames in policy, as well as some limitations in the concept of EBP.

For advocates of ‘evidence-based’ or ‘informed’ policy, PISA incarnates the dispassionate, objective facts which nourish the formulation of sound policies by allowing for comparison across countries and possibly for the identification of good practices worth emulating. For opponents of this survey, the relation between PISA and economic growth represents a neoliberal framing of education policies within a context of globalisation which is perceived as unacceptable. QST showed that—while international comparability is desirable—more tends to be read into these rankings than the quality of the evidence allows (Araújo et al. 2017).

According to the analysis in Araújo et al. (2017), a number of issues emerged:

1. Over-interpretation of PISA results: according to PISA supporters (Woessmann 2014), “If every EU Member State achieved an improvement of 25 points in its PISA score (which is what for example Germany and Poland achieved over the last decade), the GDP of the whole EU would increase by between 4% and 6% by 2090; such a 6% increase would correspond to 35 trillion Euro.”

2. PISA scoring strongly depends upon the modelling assumptions, the design of the sample, the choice of the items (questions) included or excluded, and the number and typology of students sampled. Previous works reviewed in Araújo et al. (2017) showed that shifts in the relative position of a country were attributed to the success or failure of educational policies when, in fact, they were due to different compositions of the share of students excluded from the test.

3. The PISA ranking lacks uncertainty and sensitivity analysis; PISA offers only a summary and non-conservative measure of the error of a country score.

4. The non-availability of the full data hampers a full analysis of the sensitivity of PISA scores to modelling assumptions.

5. PISA embeds strong normative stances, foremost the fact that education is investigated as an input to growth.

6. PISA may adversely affect what is taught and might run counter to our desires concerning what education should be about. It encourages focusing on the subset of educational topics selected at the expense of others.

7. In measuring what it considers ‘life skills’, PISA assumes that these skills are the same across countries and cultures, and that all societies are bound to become ‘knowledge’ societies. However, diversity in the curricula being taught might be a source of country-specific creativity and well-being.

We recognise in this list many of the ‘flags’ from QST (and from sensitivity auditing as well; see Araújo et al. 2017), e.g. in technical shortcomings in the interpretation of analysis, its non-transparency, the non-desirability of the adopted narrative, and the institutional conflicts on whether countries or a supra-national organisation such as the OECD should dictate curricula.

While in the example just given QST was used to deconstruct a frame, in the present work it will be used to enrich the spectrum of frames in order to test a new style of use for CI.

7 First Case: Analysis of Convergence

As discussed above, social convergence offers an ideal arena for testing QST. With the implementation of the European Pillar of Social Rights, a stronger focus is placed on social performance and employment: what Europe needs is less division and more cohesion, especially now, when the European Union is struggling with Brexit, a refugee crisis, a multi-speed union, and a populist upsurge of euro-scepticism.

8 Second Case: Doing Business Index (DBI)

The World Bank’s Doing Business Index—also known as the ease of doing business score—is an extremely popular CI, constructed by aggregating forty-one component indicators over ten thematic areas (World Bank 2019):

1. Starting a business.

2. Dealing with construction permits.

3. Getting electricity.

4. Registering property.

5. Getting credit.

6. Protecting minority investors.

7. Paying taxes.

8. Trading across borders.

9. Enforcing contracts.

10. Resolving insolvency.

The forty-one component indicators are first normalised according to a min–max scheme, then aggregated through a simple average within the thematic areas to which they belong, and finally into the overall index. Each of the ten areas has the same weight, and so do the individual component indicators within an area.

One hundred and ninety countries are then ranked from the highest index value to the lowest.
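As an illustration of the two-level aggregation just described, here is a minimal Python sketch. The country scores, the number of component indicators per area, and the three areas shown (of the ten) are hypothetical; only the scheme—min–max normalisation, equal-weight averages within areas and then across areas—follows the description above.

```python
import numpy as np

def minmax(x):
    """Min-max normalise one component indicator across countries to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical raw scores for three countries; each key is a thematic area
# and each inner list is one component indicator measured across countries.
# (In the real DBI, 'lower is better' indicators are reversed first.)
areas = {
    "starting_a_business": [[12.0, 30.0, 55.0], [4.0, 9.0, 14.0]],
    "getting_electricity": [[3.0, 5.0, 9.0]],
    "paying_taxes":        [[0.3, 0.5, 0.9], [10.0, 40.0, 70.0]],
}

# Equal weights at both levels: average the normalised component indicators
# within each area, then average the area scores into the overall score.
area_scores = [np.mean([minmax(ind) for ind in inds], axis=0)
               for inds in areas.values()]
score = np.mean(area_scores, axis=0)      # ease of doing business score
rank = (-score).argsort().argsort() + 1   # rank 1 = highest score
print(score.round(3), rank)
```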

A Google search on ‘world bank’ and ‘doing business index’ in July 2019 yielded as many as 5290 hits, while a search on Scopus with the search string TITLE-ABS-KEY (‘world bank’ AND ‘doing business index’) returned fifteen documents.

9 Methodology for the Convergence Analysis

9.1 Composite Indicators

The classical approach to constructing CIs involves assigning variables to a given pillar (based on the researchers’ knowledge or experts’ opinion), aggregating the variables within each pillar, and finally aggregating the pillars into a holistic CI. We follow this popular approach here.

Variables used in the analysis have a different impact on social performance. Stimulants are factors that have a positive impact on the phenomenon analysed (e.g. the employment rate), while destimulants have a negative impact (e.g. the infant mortality rate). In regional research, destimulants are often transformed into stimulants using the inversion formula:

$$x_{ijt}^{s} = \frac{1}{{x_{ijt} }} \left( {i = 1, \ldots ,n;\;j = 1, \ldots m;\;t = 1, \ldots ,k} \right)$$
(1)

where \(x_{ijt}\) is the value of destimulant j in country (region) i in year t, and \(x_{ijt}^{s}\) is the corresponding stimulant obtained from the transformation (the superscript s stands for stimulant).

The inversion formula is the simplest transformation method; it gives all the diagnostic variables the same interpretation in terms of their impact on the phenomenon analysed, i.e. the higher the value, the better from the viewpoint of the index.

After this transformation all variables are normalised according to the formula:

$$x^{\prime}_{ijt} = \frac{{x_{ijt}^{s} - min\;x_{ij2005} }}{{max\;x_{ij2005} - min\;x_{ij2005} }} \left( {i = 1, \ldots ,n;\;j = 1, \ldots m;\;t = 1, \ldots ,k} \right)$$
(2)

where \(min\;x_{ij2005}\) and \(max\;x_{ij2005}\) are, respectively, the minimum and maximum values of variable j across countries in the reference year 2005.

This normalisation method enables the results to be compared and their dynamics to be analysed by providing a fixed reference point (Pawełek 2008). In this paper, we assume that each dimension is equally important, so the CI is calculated as:

$$CI_{it} = \frac{1}{p}\mathop \sum \limits_{q = 1}^{p} z_{iqt} \left( {i = 1, \ldots ,n;t = 1, \ldots ,k} \right)$$
(3)

where \(CI_{it}\) is the composite indicator describing social performance in country (region) i in year t, \(z_{iqt}\) is the group score calculated for the variables included in group q for country (region) i in year t, and \(p\) is the number of groups.

The value of \(z_{iqt}\) is calculated as the arithmetic mean of all the variables in the corresponding dimension. The higher the CI value, the better for the phenomenon analysed.
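Eqs. (1)–(3) translate directly into code. The following sketch uses randomly generated data in place of the real variables; the function names, data shapes, and seed are ours, introduced purely for illustration.

```python
import numpy as np

def to_stimulant(x):
    """Eq. (1): invert a destimulant so that higher values read as better."""
    return 1.0 / x

def normalise(x, x_ref):
    """Eq. (2): min-max normalisation of each variable (column) against its
    minimum and maximum across countries in the reference year, 2005."""
    return (x - x_ref.min(axis=0)) / (x_ref.max(axis=0) - x_ref.min(axis=0))

def composite_indicator(groups):
    """Eq. (3): each group score z_iqt is the mean of its (normalised)
    variables; the CI is the equal-weight mean of the group scores."""
    z = [g.mean(axis=1) for g in groups]  # one array of shape (n,) per group
    return np.mean(z, axis=0)             # CI_it for each country i

# Hypothetical data for one year: 27 countries, a group of 3 stimulants and
# a group of 2 destimulants (e.g. infant mortality rates), randomly drawn.
rng = np.random.default_rng(0)
g1_raw = rng.uniform(50.0, 90.0, size=(27, 3))
g2_raw = to_stimulant(rng.uniform(2.0, 10.0, size=(27, 2)))

# For this illustration, the same year serves as its own 2005 reference.
ci = composite_indicator([normalise(g1_raw, g1_raw),
                          normalise(g2_raw, g2_raw)])
print(ci.round(3))
```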

9.2 Measuring Convergence

In a convergence analysis of EU Cohesion Policy, several measures are customarily employed: sigma convergence, delta convergence, gamma convergence, and beta convergence.

The sigma convergence concept measures gaps among time series by examining whether cross-sectional variation (measured by the standard deviation, the coefficient of variation, the Gini index, or any other dispersion measure) decreases over time, as would be anticipated if the series converged (Kong et al. 2019). To investigate the existence of a sigma-convergence trend, the following regression is usually used:

$$V_{t} = \alpha_{0} + \alpha_{1} t + \varepsilon_{t }$$
(4)

where \(V_{t}\) is the coefficient of variation in year t.

The following set of hypotheses was tested:

H0: \(V_{1} = V_{2} = \cdots = V_{t}\), no sigma convergence or divergence;

H1: \(V_{1} > V_{2}\), the existence of sigma convergence;

H1a: \(V_{1} < V_{2}\), the existence of sigma divergence;

where \(V_{1}\) is the coefficient of variation in a given year and \(V_{2}\) is the coefficient of variation in the next year.

If the estimated value of parameter \(\alpha_{1}\) turns out to be negative and statistically significant, then sigma convergence is taking place, and diversity among the objects analysed is decreasing. In the case of a positive sign, sigma divergence occurs, i.e. diversity among the entities is increasing (Barro and Sala-i-Martin 1999). In some case studies, a simple plot showing a tendency of the cross-sectional variance to decrease over time is taken as evidence in favour of sigma convergence (Tsionas 2002). This concept is widely used in the policy literature (e.g. European Commission 2014, 2016).
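A minimal sketch of the sigma convergence test of Eq. (4), using an ordinary least squares fit on synthetic CI values in place of the real data; the shapes and seed are arbitrary.

```python
import numpy as np
from scipy import stats

def sigma_convergence(ci_by_year):
    """Eq. (4): regress the cross-sectional coefficient of variation on time.
    ci_by_year has shape (n_years, n_countries); a significantly negative
    slope signals sigma convergence, a positive one sigma divergence."""
    cv = ci_by_year.std(axis=1, ddof=1) / ci_by_year.mean(axis=1)
    t = np.arange(len(cv))
    return stats.linregress(t, cv)

# Hypothetical CI values for 13 years (2005-2017) and 27 countries.
rng = np.random.default_rng(1)
ci = rng.uniform(0.2, 0.8, size=(13, 27))
res = sigma_convergence(ci)
print(f"alpha_1 = {res.slope:.4f}, p-value = {res.pvalue:.3f}")
```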

Gamma convergence is a concept proposed by Boyle and McCarthy (1997). It requires an examination of the change in the ranking of countries. A simple measure that captures the change in rankings is Kendall’s index of rank concordance, calculated as:

$$\tau = \frac{C - D}{{n\left( {n - 1} \right)/2}}$$
(5)

where \(C\) is the number of concordant pairs of countries, \(D\) is the number of discordant pairs of countries, and \(n\) is the number of observations (countries).

Two observations are concordant if the two members of one observation are larger than the respective members of the other observation; they are discordant if the two members of one observation are in the opposite order to the respective members of the other observation (Kendall 1938).

The closer τ is to zero, the larger the changes within the distribution, and gamma convergence occurs in the form of the so-called overtaking effect. The advantage of this approach is its ability to capture dynamics and mobility among objects (Boyle and McCarthy 1997; Holzinger et al. 2011). Gamma convergence is usually based on a comparison of the linear orderings of the observations analysed (countries, regions) according to the CI’s value, usually in the first and the last period of the analysis. If the Kendall tau is statistically insignificant or negative, one can say that gamma convergence occurs and the overtaking effect can be observed.
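The check can be sketched with scipy’s Kendall tau on the first- and last-period rankings; the rankings below are hypothetical.

```python
from scipy import stats

# Hypothetical CI-based country rankings in the first and the last year of
# the analysis: identical orderings except for one swap (an 'overtaking').
rank_first = [1, 2, 3, 4, 5, 6]
rank_last = [1, 3, 2, 4, 5, 6]

tau, pvalue = stats.kendalltau(rank_first, rank_last)
# A tau close to one (and statistically significant) means a stable ranking,
# hence no gamma convergence; a tau near zero or negative signals overtaking.
print(f"tau = {tau:.3f}, p-value = {pvalue:.4f}")
```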

A less known concept, which is nonetheless important policy-wise, is delta convergence. Proposed by Heichel et al. (2005), it focuses on the decreasing distance towards an exemplary model, or frontrunner object. Delta convergence can be measured by the Euclidean distance from the top performer:

$$d_{i} = \sqrt {\sum \left( {x_{ijt} - max\;x_{ijt} } \right)^{2} }$$
(6)

where \(d_{i}\) is the distance of country i from the frontrunner and \(max\;x_{ijt}\) is the frontrunner’s value of variable j in year t.

If the sum of distances from the frontrunner decreases over time, the objects are converging; otherwise, divergence patterns can be observed. As with sigma convergence, the trend is tested with a regression:

$$D_{f} = \alpha_{0} + \alpha_{1} t + \varepsilon_{t }$$
(7)

where \(D_{f}\) is the sum of distances from the frontrunner.

The following set of hypotheses was tested:

H0: \(D_{f1} = D_{f2} = \cdots = D_{ft}\), no delta convergence or divergence;

H1: \(D_{f1} > D_{f2}\), the existence of delta convergence;

H1a: \(D_{f1} < D_{f2}\), the existence of delta divergence.

In the context of EU cohesion policy, sigma and delta convergence are more desirable than beta convergence, as policymakers and the general public are interested in reducing disparities, not in pure growth per se (Eurofound 2018; European Commission 2015). For this reason, beta convergence is not discussed in this paper.
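The delta convergence check of Eqs. (6)–(7) can be sketched in the same style as the sigma test above, again on synthetic data; interpreting the frontrunner as the per-variable maximum in each year is our reading of Eq. (6).

```python
import numpy as np
from scipy import stats

def delta_convergence(x):
    """Eqs. (6)-(7): for each year, the Euclidean distance of each country
    from the per-variable frontrunner (the maximum), summed over countries
    and regressed on time; a significantly negative slope signals delta
    convergence. x has shape (n_years, n_countries, n_variables)."""
    top = x.max(axis=1, keepdims=True)         # frontrunner profile per year
    d = np.sqrt(((x - top) ** 2).sum(axis=2))  # Eq. (6): one d_i per country
    D = d.sum(axis=1)                          # D_f: sum of distances
    t = np.arange(len(D))
    return stats.linregress(t, D)              # Eq. (7): trend of D_f

# Hypothetical normalised data: 13 years, 27 countries, 5 variables.
rng = np.random.default_rng(2)
x = rng.uniform(size=(13, 27, 5))
res = delta_convergence(x)
print(f"alpha_1 = {res.slope:.4f}, p-value = {res.pvalue:.3f}")
```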

10 Quantitative Storytelling on the Convergence Test Case

We focused on convergence at a national scale, allowing the composition of the index to vary. Different narratives are associated with different measures of convergence. As mentioned in the previous section, sigma convergence concerns a reduction in the disparities among countries, gamma convergence seeks changes in the distribution, and delta convergence corresponds to reducing the distance from the frontrunner.

Here, we test QST in the context of CIs to investigate the existence of social convergence among EU countries in 2005–2017. For the sake of illustration, we introduce a set of new CIs, constructed using the 24 variables of the European Pillar of Social Rights; six further variables describe governance and fairness, and six are related to health care. The data come from the European Pillar of Social Rights, Eurostat, the World Health Organization, and national statistical offices. In the case of missing data, we used an imputation procedure based on multiple regression (James et al. 2017).
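The exact imputation procedure is not spelt out here; the sketch below shows one standard way to implement regression-based imputation, using scikit-learn’s IterativeImputer, which regresses each incomplete variable on the others. It illustrates the general technique, not necessarily the authors’ exact implementation.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data slice: rows are countries, columns are variables, with
# np.nan marking the missing entries to be imputed.
X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0]])

# Each variable with missing values is regressed on the remaining variables,
# and the predictions replace the gaps, iterating until convergence.
X_complete = IterativeImputer(random_state=0).fit_transform(X)
print(X_complete.round(2))
```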

In our research, we assumed the existence of four different stakeholder groups, each with a different point of view about which dimensions should be included in the CI. The starting point of our analysis is the set of variables in the European Pillar of Social Rights, grouped by category (Table 2). These variables are the choice of stakeholder no. 1 (see Table 3). Stakeholder no. 2 agrees with the first that those three categories are important, but from her point of view, a ‘social Europe’ should also include measures of governance and fairness. A third stakeholder thinks that governance is not relevant to a social convergence analysis and that the functioning of health care should be investigated instead. Finally, a fourth stakeholder argues that, for an exhaustive social convergence analysis, all the previously mentioned dimensions should be included (see Table 2).

Table 2 Stakeholders and the dimensions they recommend including in the composite indicator
Table 3 Stakeholders and their desired dimensions in the composite indicator

Table 3 lists the variables included in each dimension. CI values were calculated using Eq. (3). These values were the basis for estimating sigma convergence from Eq. (4) by ordinary least squares (OLS). The results are in Table 4.

Table 4 OLS estimations of sigma convergence (Eq. (4)) for different stakeholders

Table 4 shows that sigma convergence occurs for stakeholders nos. 1 and 2. By contrast, stakeholder no. 3 sees more variation among countries. In addition, the divergence among the member states is increasing (positive \(\alpha_{1}\)) for stakeholders nos. 3 and 4; therefore, it can be assumed that sigma divergence occurs. We recall that stakeholders nos. 3 and 4 are those who included the functioning of health care dimension.

The coefficient of variation in 2005 ranges from 0.27 (stakeholder no. 3) to 0.32 (stakeholders nos. 1 and 2; see Fig. 2). Hence, different stakeholders may perceive the overall spread among the member states differently, depending on which dimensions of social performance they deem relevant. During the financial crisis (2008–2010), divergence patterns were observed no matter which components were used to build the CI, showing that differences among countries grow in challenging economic conditions. From 2014 onwards, a significant increase in the value of the coefficient of variation can be observed for stakeholders nos. 3 and 4, which is once again connected with the unfavourable situation in the health-care system. The perspective of stakeholder no. 2, which includes governance and fairness, also indicates greater variation than that of stakeholder no. 1, who takes only social performance into consideration.

Fig. 2 The dynamics of the coefficient of variation

Table 5 presents the results on gamma convergence. For each stakeholder, the Kendall tau measure is positive and statistically significant, which implies no evidence of gamma convergence. In other words, the ranking of countries is relatively stable, and no overtaking effects are observed.

Table 5 Values of Kendall–tau coefficient and corresponding p-values

Figure 3 presents the countries’ aggregated distance from the best performer. As in the case of sigma convergence, a marked increase in the distance was observed during the economic crisis. For all stakeholders, the sum of Euclidean distances was larger at the end of the period analysed than in the initial year. The increase in distance for stakeholders nos. 1 and 2 was around 35%, whereas it was 125% for stakeholders nos. 3 and 4. Thus, Fig. 3 indicates that delta convergence did not occur over the analysed period. The findings on hypothesis testing for delta convergence, Eq. (7), are presented in Table 6.

Fig. 3 The dynamics of the standardised Euclidean distance from the frontrunner

Table 6 OLS estimation for Eq. (7) for 27 European Union countries

Analysing the data presented in Table 6, we are not able to say whether delta convergence occurs for all stakeholders: for stakeholders nos. 1 and 2, the parameter is statistically insignificant and, more importantly, the coefficient of determination is extremely low. A comparatively well-fitted model is obtained for stakeholders nos. 3 and 4, for whom the estimated parameter is statistically significant and positive, indicating that delta divergence occurs. Therefore, it can be argued that adding the variables related to health care affected the overall results. The significant and increasing differences among EU countries in health-care organisation, well-being and poverty, and disease prevention may have an impact on an already weakened European Union. Furthermore, they substantiate the notion of a union moving at many speeds.

While the European project is experiencing an objectively difficult moment, it is noticeable that the situation appears better when the ‘official set’ of convergence measures is used (stakeholder no. 1) than when other sets of variables meeting different concerns (fairness, health) are included. The tension between the official set and the antagonist sets discussed here is artificial, but such situations exist in practice. One illuminating example is a controversy between French trade union representatives (and their militant experts) and the statistical office INSEE about how to measure poverty in France. Bernard Sujobert recounts this episode in the volume Stat-activisme: comment lutter avec des nombres (Bruno et al. 2014). Noticeable in this story, which involves a new statistical measure known as BIP40, is that the initial resistance of the official statisticians was successfully softened by a combination of statistical activism, dialogue, and the media echo of the newly proposed measure. What motivated the stat-activists was precisely the mismatch between what they perceived as a worsening of poverty for segments of the population and the reassuring message conveyed by the official measures of INSEE.

11 Quantitative Storytelling for the Doing Business Index

The DBI has been far from uncontroversial. Conceived as a measure of competitiveness, its assessment methodology has changed repeatedly over time: new component indicators have been introduced, some topics have been removed, and other methodological changes have been made. For instance, the 2016 edition of the index introduced new component indicators, including a building quality index (in the ‘dealing with construction permits’ topic) and a reliability of supply and transparency of tariff index (‘getting electricity’ topic). Possibly the most prominent variation in the DBI assessment is the exclusion of the ‘employing workers’ thematic area from the 2011 edition onwards. As regards methodological changes, the component indicator ‘total tax and contribution rate’ (previously ‘total tax rate’) has been aggregated in a non-linear fashion in the ‘paying taxes’ area since 2015: the quantity is raised to the power of 0.8 before the min–max normalisation.
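A minimal sketch of this non-linear treatment follows; the rate values are hypothetical, and the direction of the scale (a lower tax burden scoring higher) is our assumption for illustration, not the World Bank’s published bounds.

```python
import numpy as np

# Hypothetical 'total tax and contribution rate' values (percent) for five
# economies; neither the figures nor the bounds are the World Bank's.
rate = np.array([30.0, 38.0, 45.0, 60.0, 75.0])

# As described above: the rate is raised to the power of 0.8 before the
# min-max normalisation.
transformed = rate ** 0.8
score = (transformed.max() - transformed) / (transformed.max() - transformed.min())
print(score.round(3))
```

Because the power transformation is concave, it compresses differences between high-tax economies relative to a purely linear treatment.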

These changes in the DBI assessment resulted in variations in the rankings of some countries from one edition to the next. For instance, Chile’s ranking deteriorated over the two presidencies of Michelle Bachelet. It has been alleged that this was a deliberate manoeuvre to discredit the left-wing Bachelet relative to the mandates of the conservative Sebastián Piñera.

When questioned about this result, the then–World Bank chief economist, Paul Romer, objected that the index trend was not the result of any deteriorating performance by Chile; rather, it resulted from the introduction of new component indicators (Talley 2018). In a later blog post, Romer (2018b) clarified that this was not a deliberate move by the World Bank and provided an independent analysis of the data. Thus, Romer argued that the controversy was caused merely by insufficient clarity in the World Bank’s communication. The loss of credibility caused by this episode, however, might be one of the reasons why Romer resigned from his duties as World Bank chief economist (Lawder and Wroughton 2018; Zumbrun 2018). Romer is not an ordinary economist; in the past he demonstrated considerable intellectual openness by starting a discussion on the misuse of mathematical models in economics, coining the already mentioned neologism ‘mathiness’ (2015) to signify the use of mathematics to veil normative stances in growth models.

Romer (2018a) applied QST to the DBI in an attempt to test the robustness of his collaborators’ calculations. He did so by performing a new assessment in which he included only the component indicators available over the entire period of the study.

His objective was to remove the effect of introducing new variables in the computation of the DBI assessment methodology, thus producing more stable and comparable rankings over the years.

Romer implemented a Jupyter notebook with his calculations, which he made publicly available on his GitHub repository. One of his main findings was that using only the set of twenty-four component indicators available for the entire period 2014–2018 would have produced a less volatile ranking for Chile (Fig. 4). Romer ultimately encouraged his blog readers to repeat his analysis and to evaluate longer trends, rather than annual ranking variations.
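The gist of this re-analysis, restricting the computation to the component indicators present in every edition, can be sketched as follows. This is not Romer’s actual notebook: the data layout, index names, and NaN convention are assumptions made for illustration.

```python
import pandas as pd

def stable_rankings(scores: pd.DataFrame) -> pd.Series:
    """Sketch of the idea behind the re-analysis: keep only the component
    indicators present in every edition, average them with equal weights,
    and rank countries within each year. `scores` is indexed by
    (year, country) with one column per component indicator; NaN marks
    indicators that were absent from a given edition."""
    common = scores.dropna(axis=1)   # indicators available in all editions
    overall = common.mean(axis=1)    # equal-weight average per (year, country)
    return overall.groupby(level="year").rank(ascending=False)
```

Holding the indicator set fixed in this way isolates genuine changes in country performance from changes in the measurement instrument itself.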

Fig. 4 Chile’s world ranking according to its DBI score over the period 2014–2018, depending on the accounting methodology (DBI reports vs. Paul Romer’s QST). The charts show four points for each method rather than five because the figures for 2014 and 2015 overlap

The prominence of the World Bank and the notoriety of the DBI led various countries to implement economic policies that would target an increase in their DBI score. The foreword to the 2019 edition states ‘What gets measured gets done’ and notes: ‘Since its launch in 2003, Doing Business has inspired more than 3500 reforms in the 10 areas of business regulation measured by the report’.

Developing countries are particularly keen to pursue DBI-inspired reforms and to receive the approval of the World Bank, which conceived the DBI for this very purpose. Yet this has not happened without controversy: criticisms of multiple aspects of the DBI have been raised in many quarters.

For opponents of so-called governance by numbers, the metric strategy pursued by international organizations has led to a deplorable erosion of the law and of the human condition (Supiot 2015). Berg and Cazes (2007) criticise the political perspective of the DBI in relation to the framework of labour laws, disputing the narrative whereby countries with less protective labour laws obtain a higher ranking. According to these authors, a country would be incentivised by the DBI to foster labour market deregulation, whereas the economic benefits on the ground of such a framework would be questionable. Other authors have disputed the normative perspective of some of the component indicators on a country’s employment performance. For instance, Benjamin et al. (2010) claim that the component indicators do not adequately map onto the state of labour regulation. These authors suggested integrating the assessment performed by the World Bank with other aspects, such as ‘microlegislation, labour market institutions and juridical interpretation’. The controversy around the deregulation of labour laws contributed to the removal of this thematic area from the DBI assessment from the 2011 edition onwards. Since that edition, the employing workers area has been discussed in a separate annexe, in which the dominant narrative is the search for a balance between worker protection and flexibility.

More technical criticism of the DBI comes from Høyland et al. (2012) and Pinheiro-Alves and Zambujal-Oliveira (2012). The former argue that the index completely neglects uncertainty and, with it, possible volatility in the country rankings. The latter argue that the selection of variables in the DBI may be misleading, as several of them do not contribute to variations in the score of the thematic area to which they belong; that is, they are ‘silent’ in the sense discussed in Paruolo et al. (2013) and Saisana et al. (2005). This may convey inadequate information to investors who scrutinise countries’ DBI performance.

The effect of the different DBI component indicators is discussed by Schueth (2011), who analysed the performance of Georgia according to the DBI and the Global Competitiveness Index (GCI). Georgia’s position in the DBI ranking rose from 100/155 in 2006 to 11/183 in 2010, and in the 2019 edition of the DBI report Georgia ranked sixth out of 190. By contrast, Georgia’s GCI ranking languished: the country was reported to be 85th out of 125 in 2006, and it remained at 90/133 in 2009. Even in the most recent edition of the GCI report, Georgia still ranks 66th out of 140. Schueth (2011) argues that this extreme discrepancy can be ascribed to the different sets of variables through which the two indicators capture economic phenomena. This could also be seen as a QST setting, whereby a Georgian policymaker wishing to attract investment to the country would showcase the rapid improvement in Georgia’s DBI ranking, while an opposition leader might point to Georgia’s languishing GCI ranking as proof of the ineffectiveness of the country’s policies on competitiveness and business friendliness. Doing-business controversies thus lend themselves naturally to QST experiments.

12 Conclusions

The reflections given in this paper in the context of CIs are likely to apply to a much larger set of quantification practices. As Popp Berman and Hirschman (2018) inquire, in the age of algorithms and indicators, “what qualities are specific to rankings, or indicators, or models, or algorithms?” In particular, the misuse of metrics, statistical inference, mathematical modelling, and algorithms exhibit some common patterns (Saltelli 2019, 2020).

The solution offered here is by no means unique. For example, to address the predicament of using fragile mathematical instruments to measure soft concepts, some authors have suggested resorting to the theory of partially ordered sets. This approach offers a synthesis of multidimensional indicator systems in which the original variables are not aggregated and the individuals being ranked (e.g. countries, regions, or districts; see the examples in Beycan et al. 2019; Carlsen and Bruggemann 2014, 2017) are partially ordered graphically. This procedure removes the design and modelling choices needed for a CI, such as weights, normalisation, and an aggregation scheme. Partial ordering is thus, by design, more robust than CIs.

We focus here on the present generation of CIs, briefly reviewing the existing debates and offering some constructive criticism. In particular, we modify the philosophy of CIs from ‘analysis cum advocacy’ to ‘analysis with multiple storytelling’. In other words, we examine a situation in which different stakeholders agree on the importance of evidence and the need to use statistical data while disagreeing on what ‘the end in sight’ should be, as exemplified in the real world by the BIP40 story (Bruno et al. 2014).

Cohesion policy offers a convenient battleground for testing this methodology, as it is clear that multiple definitions of cohesion are possible and desirable, at a moment when a clear overall EU narrative seems elusive (Applebaum 2017).

Should measures of fairness or health be part of the portfolio of policies to be targeted by a cohesion policy? Clearly, depending on the answers to these questions, different diagnoses can be produced as to the state and the progress of cohesion.

Unsurprisingly, EU countries differ more along a dimension that we loosely call fairness, which includes corruption, political functioning, stability and accountability, regulatory quality, and the rule of law.

EU countries appear more equal when health care is included, but at the same time this equality is being eroded by the recent onset of a divergence trend.

The case of the Doing Business Index shows that, in practice, some sort of multiple-frame analysis—what we call quantitative storytelling—is already taking place under pressure from stakeholders. This contributed to the variation in the structure of the DBI’s underpinning thematic areas over the years, most prominently the exclusion, from the 2011 edition of the index onwards, of the controversial ‘employing workers’ thematic area. The primary role played by stakeholders is also reflected in the fact that the DBI is simultaneously a measure and a target proposed to developing countries. For this reason, the danger identified by the Goodhart (or Campbell) law, namely that when a measure becomes a target it ceases to be a good measure as ‘players’ start adapting to it (Muller 2018), has been countered by scholars and stakeholders alike, who signal a mismatch between the measure and the desirability of the resulting policy.

13 Software and Data

The software and data used for the present work can be retrieved from the GitHub repository https://github.com/Confareneoclassico/Quantitative_storytelling_making_composite_indicator