1 Introduction

A popular approach to identify causal effects of education on health and longevity exploits changes in compulsory schooling policies, usually increases in the minimum age or the legally permitted grade to leave school, as instrumental variables for schooling attainment. These studies exploit an identification strategy that assumes that these changes in the law induced people born in different years (or states) to obtain different levels of schooling for reasons that are plausibly unrelated to factors that may influence their health and mortality. If it is assumed that the change in compulsory schooling law only affects health and longevity through its effect on education, one can estimate a causal effect of the additional education on longevity for those who comply with the new law and would not have done so otherwise. Estimates based on these studies point towards a small effect (Mazumder 2008, 2012; Jones et al. 2011; Van Kippersluis et al. 2011; Fletcher 2015; Meghir et al. 2018; Basu et al. 2018) or no effect (Albouy and Lequien 2009; Clark and Royer 2013; Jürges et al. 2013) of education on mortality. Here, we use the British Health and Lifestyle Survey (HALS) and exploit an educational reform in 1947 that increased the legal minimum school leaving age in England and Wales from 14 to 15 (Clark and Royer 2013).

A reason why higher education may lead to lower mortality is that the higher educated are more efficient producers of health investment (Grossman 1972, 2006). Grossman (1972) argues that this could be due to (i) productive efficiency or (ii) allocative efficiency. The former hypothesis posits that the higher educated understand medical advice better and use medical care more efficiently. The allocative efficiency hypothesis on the other hand argues that higher educated individuals choose different, more efficient inputs into health investment, typically thought to be caused by better health knowledge and a more receptive attitude towards new information.

Many studies investigating the impact of education on mortality have used linear models (see e.g., Lleras-Muney 2005; Van Kippersluis et al. 2011; Clark and Royer 2013) to estimate the educational gradient, facilitating ‘standard’ instrumental variable estimation. However, age at death is clearly a duration outcome and, hence, we use a non-linear model for the mortality hazard rate. Duration analysis models the hazard rate, the instantaneous probability that an individual dies at a certain age conditional on surviving up to that age. A common characteristic of duration data, including time to death, is that not all individuals experience the event of interest during the observation period. Such right-censoring, where the individual is only known to have survived until the end of the observation window, makes inference based on means unreliable, and simply comparing survival until the end of the survey would not account for it. Within the hazard framework, right-censoring can be modelled directly. Another characteristic of duration data is dynamic selection or left truncation: those still alive at the age at which the survey starts may not be a random selection of the original population of births. This precludes a simple comparison of survival differences at the end of a survey. We therefore use the (mortality) hazard, or the force of mortality, as this effectively deals with these data characteristics (see e.g., Lancaster 1990; Van den Berg 2001). A common way to accommodate the presence of observed characteristics in a duration model is to specify a proportional hazard (PH) model, in which the hazard is the product of the baseline hazard, which captures the age dependence of the hazard, and a log-linear function of covariates. Neglecting unobserved confounding in inherently non-linear models, such as the proportional hazard model, leads to biased inference. The common approach to address this is to explicitly model the individual-specific effects using unobserved heterogeneity that enters the hazard function multiplicatively, known as the Mixed Proportional Hazard (MPH) model.

Studies have attempted to identify the causal effect of education on mortality, using either an inverse propensity weighting method (Bijwaard et al. 2017; Bijwaard and Jones 2019) or a structural modelling approach (Bijwaard et al. 2015a, b, 2019). However, a critical assumption in propensity score weighting is that there is no selection on unobservables, which may be hard to defend. Although the structural models, in which interdependence between education, health, and cognitive ability is explicitly modelled, do account for correlated unobserved factors, they assume a particular structure. In contrast, the compulsory schooling change provides a natural instrument to identify the causal effect of education on the mortality rate. However, no unambiguous solution to instrumental variable estimation of the inherently non-linear MPH-model has been found.

Bijwaard (2009) developed a consistent estimator for the parameters of a semiparametric MPH model with an unspecified distribution of the unobserved heterogeneity and with an endogenous variable for which an instrument exists. In its simplest form, the estimator does not require nonparametric estimation of unknown densities. A limitation of this method is that the baseline duration dependence is restricted to a piecewise constant function, which may be hard to implement for fast increasing hazard rates like the mortality hazard. Another limitation is that the method is computationally intensive, because it is based on finding the roots of a multidimensional step function which does not have a derivative. The instrumental variable (IV) based methods of Terza et al. (2008) for non-linear models have been used recently for duration models. However, Wan et al. (2015, 2018) have shown that both the two-stage predictor substitution (2SPS) and the two-stage residual inclusion (2SRI) methods of Terza et al. (2008) are biased in a Weibull proportional hazard framework, at least under the standard assumptions common in the treatment evaluation literature.

The change in the compulsory schooling law offers a (fuzzy) Regression Discontinuity Design (RDD), as it generated a discontinuity in the “treatment” (number of schooling years) for those affected when the reform was implemented. We use the local randomization framework of the RDD (Lee 2008; Lee and Lemieux 2010; Cattaneo et al. 2015), where the treatment assignment (staying on longer at school) is assumed to be as-if randomly assigned in a small interval around the reform implementation date. In the local randomization framework of the RDD a principal stratification into complier types follows naturally (Imbens 2016).

The principal stratification framework (Frangakis and Rubin 2002) is a general potential outcomes framework for causal inference with instruments and/or intermediate variables. Principal stratification has its roots in instrumental variable methods, as described in Angrist et al. (1996) and Imbens and Rubin (1997), and it has been developed and formalized within the potential outcome approach to causal inference. The commonly applied framework developed by Angrist et al. (1996) to define the Local Average Treatment Effect (LATE) in a random experiment with non-compliance is a special case of the principal stratification framework. A principal stratum consists of individuals who have the same joint potential outcomes, independent of the treatment assignment (Frangakis and Rubin 1999; Zhang et al. 2009; Mealli and Mattei 2012). Therefore, comparisons of potential outcomes under different treatment levels within a principal stratum give well-defined causal effects. The principal strata are usually defined in terms of four complier types: (i) always takers: individuals who take the treatment irrespective of their assigned treatment; (ii) never takers: individuals who never take the treatment; (iii) compliers: individuals who only take the treatment if assigned to treatment; (iv) defiers: individuals who only take the treatment if not assigned to treatment. Defiers are ruled out using a monotonicity assumption.

When assuming a parametric baseline mortality hazard rate, estimation of the latent complier types and their associated hazard rate is possible using maximum likelihood estimation of the implied mixture model. We assume a Gompertz proportional mortality rate, with an exponential increase in the mortality rate by age. A Gompertz mortality rate is known to provide accurate mortality rates for middle aged individuals (Gavrilov and Gavrilova 1991). Similar methods for duration outcomes, also based on principal stratification, have been developed by Cuzick et al. (2007); Lin et al. (2014); Wan et al. (2015).

The contribution of this paper is to provide a methodological innovation in instrumental variable analysis for hazard rate models, using the principal stratification approach to motivate estimation of a mixture model.

2 Data and descriptive statistics

We use the British Health and Lifestyle Survey (HALS). This survey was conducted to collect data on health behaviours of the British population, including smoking, alcohol consumption and exercise. We use the first wave of the survey combined with the long-term follow-up of deaths. The first wave was conducted in 1984–1985, with a response rate of 73%. In total 9003 individuals (18–99 years old) were interviewed. In 1991–1992 a follow up survey was carried out for which only 5352 individuals completed the interviews. We therefore focus on the first wave. Johnston et al. (2015) have used these data to investigate the causal link between education and health knowledge. We use the same measure of schooling, the age at which a respondent left secondary school, which ranges from 14 to 19 years old. Just as for Johnston et al. (2015) our identification strategy utilises educational reforms that increased the legal school leaving age in England and Wales from 14 to 15 (in contrast to Johnston et al. (2015) we only focus on the 1947 reform and we remove all individuals living in Scotland from the sample). On 1 April 1947, the legal school leaving age was raised to age 15 in Britain, while until 31 March 1947 children in Britain could leave school when they reached 14 years of age. This reform affected children who turned 14 after 31 March 1947 (born after 31 March 1933) as they had to stay at school longer.

Figure 1 shows how the 1947 reform affects the school leaving age, the probability of leaving school before the age of 15, the probability of leaving school between age 15 and 16, the probability of leaving school between age 16 and 18 and the probability of leaving school after age 18. The 1947 reform clearly had a large effect on school leaving around the age of 15, but not on leaving school after age 16.

Fig. 1 Probability to leave school before age 15, age 15–16, age 16–18 and after age 18 (around the cut-off birth date of 1–4–1933). Dots represent average schooling from survey entry to survey end by quarter of birth

Longitudinal follow-up of the date and cause of death is available up to July 2009 in the Seventh Death Revision of the HALS. We observe the respondents from their survey interview till July 1st, 2009 or till death, which allows us to construct the mortality hazards. Figure 2 depicts the probability to survive until the end of the survey (July 1st, 2009) and the Kaplan–Meier survival curves for individuals born within 12 years before or after the cut-off birth date of the 1947 reform.

Fig. 2 Survival from first survey till end of follow up by age left school, only individuals born within 12 years before or after the cut-off birth date (31–3–1933) for the 1947 reform. Dots represent average survival from survey entry to survey end by year of birth for survival

Note that the survival gaps, depicted in the right-hand plot of Fig. 2, are based on the raw survival data and these could exist for a multitude of reasons, including selection, reverse causality and, potentially, a causal impact of education on mortality. According to a log-rank test of survival difference the survival of individuals who left school before age 15 (1947-reform) differs significantly from the survival of individuals who stayed longer in school (also for males and females separately).
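As an illustration of the raw survival curves referred to above, the following is a minimal sketch of a Kaplan–Meier estimator for right-censored follow-up times. The function name `kaplan_meier` and the toy inputs are hypothetical and not taken from the HALS data; time is assumed to be measured from the first interview, so delayed entry is not handled.

```python
def kaplan_meier(durations, died):
    """Kaplan-Meier survival estimates at each distinct death time.

    durations: follow-up time per person (hypothetical input, e.g. years since interview)
    died: True if the person died at that time, False if right-censored
    """
    death_times = sorted({t for t, d in zip(durations, died) if d})
    survival = 1.0
    curve = []
    for t in death_times:
        # Everyone whose follow-up lasts at least until t is still at risk at t.
        at_risk = sum(1 for u in durations if u >= t)
        deaths = sum(1 for u, d in zip(durations, died) if d and u == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy example: five people, two of them censored (at times 2 and 4).
curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

Censored observations contribute to the risk set until they leave follow-up but never trigger a drop in the curve, which is why a comparison of simple survival fractions at the end of the survey can differ from the Kaplan–Meier curve.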

3 Regression discontinuity design and principal stratification

Understanding the causal effect of a treatment D (education) on an outcome Y (longevity) is a fundamental goal of social science. The identification of the causal effect is complicated by the potential endogeneity of education. The association between longevity and education may partly be explained by confounding factors such as cognitive ability and parental background, which affect both education choices and longevity (McCartney et al. 2013).

To address this endogeneity we use a fuzzy regression discontinuity design, as implied by the change in minimum school leaving age of the 1947-reform in England and Wales, in a principal stratification framework. Note that a standard (proportional) hazard model for the mortality rate, such as a Gompertz model, using only observations within the RDD bandwidth is likely to be biased, as it still does not account for the endogeneity. Our instrumental variable method, based on principal stratification, a non-linear extension of the commonly applied linear Local Average Treatment Effect (LATE) approach (Angrist et al. 1996), can provide an unbiased estimate of the effect of education on longevity.

3.1 Regression discontinuity as a local randomized experiment

Before we elaborate on the non-linear analysis, we define the instrument used in this study and the regression discontinuity design. A method using observations close to a threshold to identify causal effects is known as a regression discontinuity design (RDD) (Imbens and Lemieux 2008; Lee and Lemieux 2010). The basic idea behind RDD is that assignment to treatment (in our case, continuing schooling after age 15) is determined, either completely or partly, by the value of the instrument (the change in law) being on either side of a fixed threshold (i.e., the “running variable” is the birth date and the threshold is the birth date from which the reform applied, 1–4–1933). Because people born before the reform could still stay in school beyond age 15 we have a fuzzy RDD.

In the local randomization-based approach to the RD design (Lee 2008; Lee and Lemieux 2010; Cattaneo et al. 2015), it is hypothesized that, within some finite window of an administrative threshold (e.g., a test score or age cutoff) that determines treatment assignment, subjects are “as-if” randomly assigned to treatment and control.

Formally, let \(W_0 =[r_c-h,r_c+h]\), with \(r_c\) the threshold and h the window width. The local randomization assumption can then be stated as the following two assumptions:

(A) The distribution of the running variable (birth date) in the window \(W_0\) is known and does not depend on the potential outcomes.

(B) Inside \(W_0\), the potential outcomes (potential mortality) depend on the running variable solely through the treatment indicator (stay at school beyond age 15).

Assuming that the birth date will have no effect on the (potential) mortality is unrealistic. However, as Cattaneo et al. (2017) show, if the effect of the birthday on the potential mortalities can be captured by a polynomial of order p in the distance of the individual birthday from the threshold, it is possible to allow the potential outcomes to depend on the running variable (birthday). We add a local polynomial in the distance from the threshold with the order of the polynomial chosen to minimize the AIC given the bandwidth (just as in Lee and Lemieux 2010). The fuzzy RDD can be viewed as an instrumental variable method, with the change in the law used as instrument for staying longer in school.

3.2 Choice of bandwidth

In general, choosing a bandwidth involves finding an optimal balance between precision and bias. On the one hand, using a larger bandwidth yields more precise estimates as more observations are available to estimate the regression. On the other hand, the specification is less likely to be accurate when a larger bandwidth is used, which can bias the estimate of the treatment effect.

In the local randomization approach the choice of the optimal bandwidth is based on a sequential randomization test. We follow the practical steps suggested by Cattaneo et al. (2015) to establish whether local randomization is plausible in small windows around the cut-off and determine the size of such a window. The procedure involves a simple difference-in-means test for the predetermined covariates comparing their values on each side of the cut-off. This test is carried out for each candidate window. If the p-value regarding the null that a covariate has the same value for both sides of the cut-off is below 0.15 (Cattaneo et al. 2015; Cattaneo and Titiunik 2022), then that window is rejected and we attempt the procedure with a smaller window. A window is selected if one cannot reject the null for any of the predetermined covariates using a threshold p-value of 0.15.
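The window-selection steps described above can be sketched as follows. This is only an illustration under stated assumptions: a permutation test on the difference in means stands in for the difference-in-means test, and the function names `perm_pvalue` and `select_window`, the candidate widths and the toy covariate are all hypothetical.

```python
import random


def perm_pvalue(left, right, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in means (illustrative choice of test)."""
    rng = random.Random(seed)
    observed = abs(sum(left) / len(left) - sum(right) / len(right))
    pooled = list(left) + list(right)
    n_left = len(left)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:n_left], pooled[n_left:]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def select_window(running, covariates, cutoff, candidates, alpha=0.15):
    """Return the largest candidate half-width h for which no covariate is unbalanced.

    running: running-variable values; covariates: dict of name -> list of values.
    Candidate windows are tried from largest to smallest, as described in the text.
    """
    for h in sorted(candidates, reverse=True):
        inside = [i for i, r in enumerate(running) if abs(r - cutoff) <= h]
        left = [i for i in inside if running[i] < cutoff]
        right = [i for i in inside if running[i] >= cutoff]
        if not left or not right:
            continue
        pvals = [perm_pvalue([x[i] for i in left], [x[i] for i in right])
                 for x in covariates.values()]
        if min(pvals) >= alpha:  # no covariate rejects balance at the 0.15 level
            return h
    return None


# Toy data: the covariate is balanced near the cut-off but not far to its left.
running = list(range(-10, 10))
x = [5 if r < -5 else i % 2 for i, r in enumerate(running)]
chosen = select_window(running, {"x": x}, cutoff=0, candidates=[4, 10])
```

In this toy example the wide window is rejected because the covariate mean differs strongly across the cut-off, while the narrow window passes, so the procedure settles on the narrow one.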

3.3 Assumptions for identification of causal effects

Following the literature, we define causal effects using the potential outcomes (or counterfactual) framework. Let Z denote the policy change (treatment assignment), with \(Z=1\) if an individual was affected by the policy change and zero otherwise, and let D(z) denote the (potential) discrete treatment, with \(D=0\) if the individual left school before age 15, \(D=1\) if the individual left school at age 15 to age 16, \(D=2\) if the individual left school at age 16 to age 18 and \(D=3\) if the individual left school at age 18 or beyond. We assume that the policy change does not affect the choice to stay at school after age 16.

We use the principal strata formulation of the problem (Frangakis and Rubin 2002). This implies we have six (latent) complier types (P) for education: always takers 1 are individuals who always leave school at age 15 to age 16 irrespective of whether they were affected by the policy change (i.e., \(D(1)=D(0)=1; P=a_1)\); always takers 2 are individuals who always leave school at age 16 to age 18 irrespective of whether they were affected by the policy change (i.e., \(D(1)=D(0)=2; P=a_2)\); always takers 3 are individuals who always stay beyond age 18 at school irrespective of whether they were affected by the policy change (i.e., \(D(1)=D(0)=3; P=a_3)\); never takers are individuals who never stay in school beyond age 15 (i.e., \(D(1)=D(0)=0; P=n)\). Under our identification strategy always takers and never takers do not contribute to identification of the local treatment effect. Compliers are individuals who stay in school to age 15 to age 16 only because they were induced to do so through the policy change (i.e., \(D(1)=1\) and \(D(0)=0; P=c)\). It is the compliers that identify the local treatment effect of an extended education.

Following the literature on potential outcomes we impose the following assumptions:

Assumption 1: Stable unit treatment value assumption (SUTVA)

SUTVA implies that the potential outcomes for each person i are unrelated to the treatment status (education) of other individuals.

Assumption 2: Ignorable instrument

$$\begin{aligned} \{ Y(d),D(z) \} \bot Z|X \end{aligned}$$

This assumption typically holds in a randomized experiment. The assumption is also plausible in observational studies where Z represents an instrumental variable that is regarded as exogenous after (possibly) conditioning on observed covariates.

Assumption 3: Exclusion restrictions \(\forall z=0,1; d = 0,1,2,3\):

$$\begin{aligned} Y(z,d)= Y(d) \end{aligned}$$

This assumption states that the instrument Z can only affect the outcome through its effect on education. This implies that the potential outcome can be written as Y(d). It also implies that the potential outcomes of always-takers and of never-takers do not depend on the treatment assignment. Note that this restriction is inherent in the RDD: the policy change only affects the outcome through the induced change in treatment (prolonging the time in school).

Assumption 4: Monotonicity \( D(1) \ge D(0)\)

Assumption 4 rules out the existence of defiers, individuals who would stay in school to age 15 to age 16 only if they were not affected by the policy change. This implies that the educational effect on the outcome is only identified for compliers while the educational effect for never takers and always takers is not identified.

3.4 Principal strata hazard rate model

Our work is novel in that we consider inherently non-linear hazard models, instead of linear models. We assume that the (potential) hazard depends on the complier-type.

We use the principal strata framework to establish identification of the causal effects. Denote the complier type probabilities by \(p^{a_1}, p^{a_2}, p^{a_3}, p^{n}, p^{c}\), the probability of being an always taker (of type 1, 2 or 3), never taker or complier (possibly conditional on X), which can be derived from cross tabulation of education and the instrument \(\Pr (D=d|Z=z)\):

$$\begin{aligned} \Pr (D=0|Z=1)= & {} p^n \\ \Pr (D=0|Z=0)= & {} p^n + p^c\\ \Pr (D=1|Z=1)= & {} p^{a_1} + p^c \\ \Pr (D=1|Z=0)= & {} p^{a_1} \\ \Pr (D=2|Z=1)= & {} \Pr (D=2|Z=0) = p^{a_2} \\ \Pr (D=3|Z=1)= & {} \Pr (D=3|Z=0) = p^{a_3} \end{aligned}$$

Thus \(p^c= \Pr (D=1|Z=1)-\Pr (D=1|Z=0)\). All these probabilities are estimated jointly with the other parameters of the model. This implies that our specification is a latent class model with the complier types modelled as latent classes, with the LATE identified for the sub-set of compliers:

$$\begin{aligned} LATE = \textrm{E} \Bigl [Y(1)- Y(0)\Bigr |P=c\Bigr ] \end{aligned}$$
(1)

Estimating a principal strata model gives the required functions (see the next sub-Section). Note that in our application, the compliers are the sub-population who have additional years of schooling induced by the change in the minimum school leaving age and the impact of this change in education can therefore be regarded as being due to plausibly exogenous variation.
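To make the cross-tabulation identities for the complier type probabilities concrete, here is a minimal moment-based sketch. It is not the joint maximum likelihood estimation used in the paper: the function name `complier_type_shares` and the toy data are hypothetical, and the shares of always takers 2 and 3 are simply pooled over both arms because the reform is assumed not to move them.

```python
def complier_type_shares(d, z):
    """Moment-based shares of the latent types from schooling category d (0-3) and instrument z (0/1)."""
    def share(level, arm):
        in_arm = [di for di, zi in zip(d, z) if zi == arm]
        return sum(1 for di in in_arm if di == level) / len(in_arm)

    return {
        "n": share(0, 1),                              # Pr(D=0 | Z=1)
        "a1": share(1, 0),                             # Pr(D=1 | Z=0)
        "c": share(1, 1) - share(1, 0),                # Pr(D=1 | Z=1) - Pr(D=1 | Z=0)
        "a2": sum(1 for di in d if di == 2) / len(d),  # pooled over both arms
        "a3": sum(1 for di in d if di == 3) / len(d),  # pooled over both arms
    }


# Toy cross-tabulation: 10 people born before the cut-off (z=0), 10 after (z=1).
d = [0] * 5 + [1] * 2 + [2] * 2 + [3] * 1 + [0] * 2 + [1] * 5 + [2] * 2 + [3] * 1
z = [0] * 10 + [1] * 10
shares = complier_type_shares(d, z)
```

In the toy data the reform moves three out of ten people from leaving before age 15 to leaving at age 15 to 16, so the complier share is 0.3 and the five shares add up to one.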

We assume a Gompertz proportional hazard mortality rate, which postulates that the (baseline) hazard increases exponentially with age (i.e., \(\lambda (t|X) = e^{\beta _0 + \alpha t + \beta ^{\prime } X}\)). We use the (implied) life expectancy as the outcome of interest. Assuming that the estimated Gompertz hazard holds, the life expectancy can be very well approximated by (Lenart 2014):

$$\begin{aligned} \mu _X = -\exp \biggl ( \tfrac{1}{\alpha } \exp \Bigl [\beta _0+\beta ^{\prime } X \Bigr ]\biggr ) \times \Bigl ( \beta _0+\beta ^{\prime } X -\ln (\alpha ) - \tfrac{1}{\alpha } e^{\beta _0+\beta ^{\prime } X} + 0.5772\Bigr )/\alpha \end{aligned}$$
(2)

where 0.5772 is the (rounded) Euler–Mascheroni constant. When we assume a Mixed Proportional Hazard (MPH) model with Gamma distributed unobserved heterogeneity (with unit mean and variance \(\sigma ^2\)) the life expectancy can be approximated by (Missov 2013):

$$\begin{aligned} \mu _{X}&= \biggl [ \Bigl (1-\tfrac{\sigma ^2}{\alpha } e^{\beta _0+ \beta ^{\prime } X }\Bigr )^{-\tfrac{1}{\sigma ^2}} \Bigl ( \ln (\alpha ) -\beta _0-\beta ^{\prime } X-\ln (\sigma ^2) \Bigr ) \nonumber \\&\quad -\sum _{j=1}^{\tfrac{1}{\sigma ^2}-1}\tfrac{1}{j} \Bigl (1-\tfrac{\sigma ^2}{\alpha } e^{\beta _0+\beta ^{\prime } X } \Bigr )^{j-\tfrac{1}{\sigma ^2}} \biggr ]/\alpha \end{aligned}$$
(3)

Due to right-censoring (which is affected differently by education) we cannot use the average duration directly to estimate the model parameters. Using a hazard rate model effectively accounts for censoring (and possible time-varying covariates) and allows us to estimate all the parameters. Life-expectancy can then be derived from the estimated parameters.
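As a rough check on how life expectancy follows from Gompertz parameters, the sketch below compares a numerical integral of the Gompertz survival function with a first-order expansion of the exponential integral, \(\mu \approx e^{z}(z-\ln z-0.5772)/\alpha\) with \(z = e^{\beta _0+\beta ^{\prime } X}/\alpha\). The parameter values and function names are hypothetical and chosen only for illustration; they are not estimates from the paper.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def gompertz_survival(t, a, alpha):
    """S(t) for the hazard a * exp(alpha * t), where a = exp(beta0 + beta'X)."""
    return math.exp(-(a / alpha) * math.expm1(alpha * t))


def life_expectancy_numeric(a, alpha, upper=150.0, steps=15000):
    """Trapezoid-rule integral of the survival function from age 0 to `upper`."""
    h = upper / steps
    total = 0.5 * (gompertz_survival(0.0, a, alpha) + gompertz_survival(upper, a, alpha))
    for k in range(1, steps):
        total += gompertz_survival(k * h, a, alpha)
    return total * h


def life_expectancy_approx(a, alpha):
    """First-order approximation, accurate when z = a / alpha is small."""
    z = a / alpha
    return math.exp(z) * (z - math.log(z) - EULER_GAMMA) / alpha


# Hypothetical values: baseline level 5e-5 and a 9% yearly increase in the hazard.
numeric = life_expectancy_numeric(5e-5, 0.09)
approx = life_expectancy_approx(5e-5, 0.09)
```

With these illustrative values both routes give a life expectancy of roughly 77 years, and the gap between them is far below one day, which is why a closed-form approximation is convenient for computing educational gains from estimated coefficients.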

We assume the complier type influences only the scale of the mortality rate, \(\gamma _{1}\) (for a complier who is induced to continue schooling due to the instrument, \(Z=1\)) and \(\gamma _{0}\) (for a complier who is induced not to continue schooling due to the instrument, \(Z=0\)). Thus, the potential hazard for an individual of complier type \(P \in \{a(lways),n(ever),c(omplier) \}\) is:

$$\begin{aligned} \lambda ^{(P)}(t|\cdot )&= v\lambda _0(t;\alpha )\exp \biggl (\gamma _{a_1}I(P=a_1)+ \gamma _{a_2}I(P=a_2)+\gamma _{a_3}I(P=a_3) \nonumber \\&\quad + \gamma _{n}I(P=n)+\bigl (\gamma _{1}D+ \gamma _{0}(1-D)\bigr )I(P=c) + \beta 'X \biggr ) \end{aligned}$$
(4)

Note that, due to Assumption 3, for always takers (1,2, or 3) \(\gamma _{a_1}, \gamma _{a_2}, \gamma _{a_3}\) do not depend on D, similarly for never takers we only have \(\gamma _{n}\). For compliers the education level D is either zero or one. We either assume that \(v\equiv 1\) (PH-model) or that v follows a unit mean Gamma distribution with variance \(\sigma ^2\) (MPH-model).
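The difference between the PH and MPH versions of Eq. (4) can be illustrated with the implied survival functions. The sketch below assumes the Gompertz baseline \(\lambda _0(t)=e^{\alpha t}\) and uses the standard result that integrating out a unit-mean Gamma frailty with variance \(\sigma ^2\) gives \(S(t)=(1+\sigma ^2\Lambda (t))^{-1/\sigma ^2}\), where \(\Lambda (t)\) is the cumulative hazard for \(v=1\). Function names and parameter values are hypothetical.

```python
import math


def cumulative_baseline(t, alpha):
    """Integral of exp(alpha * s) over s from 0 to t."""
    return math.expm1(alpha * t) / alpha


def survival(t, alpha, index, sigma2=0.0):
    """Survival at age t for a given linear index (type-specific gamma plus beta'X).

    sigma2 == 0 gives the PH model (v = 1); sigma2 > 0 integrates out a
    unit-mean Gamma frailty with variance sigma2 (MPH model).
    """
    cum = cumulative_baseline(t, alpha) * math.exp(index)
    if sigma2 == 0.0:
        return math.exp(-cum)
    return (1.0 + sigma2 * cum) ** (-1.0 / sigma2)


# Hypothetical index of -9 and a 9% yearly increase in the hazard.
ph_80 = survival(80.0, 0.09, -9.0)
mph_80 = survival(80.0, 0.09, -9.0, sigma2=0.5)
```

For the same index the MPH survival lies above the PH survival at older ages, because individuals with high frailty die earlier and the surviving population becomes increasingly selected.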

3.5 Estimation of principal strata hazard rate model

Based on the assumption of a known functional form of the baseline hazard, such as a Gompertz (\(\lambda _0(t)=e^{\alpha t}\)), we can derive the likelihood function contribution of individual i (see Appendix A for the full likelihood):

$$\begin{aligned} L_i = \prod _{Z_i}\!\prod _{D_i} \Bigl [\lambda _i\bigl (t_i|Z_i,D_i\bigr )^{\delta _i} \times S\bigl (t|Z_i,D_i\bigr ) \Bigr ]^{I\bigl (Z_i,D_i\bigr )} \end{aligned}$$
(5)

where \(S\bigl (t|Z_i,D_i\bigr )\) is the survival rate at age t for an individual with \(Z_i,D_i\), e.g.:

$$\begin{aligned} S(t|Z=0,D=0) = p^n\exp \Bigl (-\!\int _0^t \!\!\lambda _0(s)e^{\beta 'X }ds \Bigr ) + p^c \exp \Bigl (-\!\int _0^t \!\!\lambda _0(s)e^{\beta 'X+ \gamma _{0} }ds \Bigr ) \end{aligned}$$

or:

$$\begin{aligned} S(t|Z=1,D=3) = p^{a_3}\exp \Bigl (-\!\int _0^t \!\!\lambda _0(s)e^{\beta 'X+ \gamma _{a_3} }ds \Bigr ) \end{aligned}$$
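The mixture structure of these survival functions can be sketched in code. The function below is a simplified illustration for the PH case with a Gompertz baseline: it returns the mixture density for an observed death and the mixture survival for a right-censored observation, summing over the latent types compatible with an observed (Z, D) cell. It ignores delayed entry at the interview date, which the full likelihood in Appendix A would have to handle, and all names and values are hypothetical.

```python
import math


def cell_contribution(t, died, alpha, components):
    """Likelihood contribution at age t for one observed (Z, D) cell.

    components: list of (share, index) pairs, one per latent type that is
    compatible with the cell; index is the type's linear predictor.
    """
    total = 0.0
    for share, index in components:
        scale = math.exp(index)
        surv = math.exp(-scale * math.expm1(alpha * t) / alpha)
        hazard = scale * math.exp(alpha * t)
        total += share * (hazard * surv if died else surv)
    return total


# Cell (Z=0, D=0): never takers with index beta'X and compliers with
# index beta'X + gamma_0 (hypothetical values -9.0 and -8.5, shares 0.2 and 0.3).
cell = [(0.2, -9.0), (0.3, -8.5)]
censored_at_70 = cell_contribution(70.0, False, 0.09, cell)
death_at_70 = cell_contribution(70.0, True, 0.09, cell)
```

Writing the death contribution as a mixture density is equivalent to the hazard-times-survival form of Eq. (5), since the mixture hazard equals the mixture density divided by the mixture survival.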

RDDs identify a treatment effect locally around the threshold. A local continuity assumption is standard in the literature, implying that persons close to the threshold are comparable except for their values of the assignment variable. The standard approach to account for divergence is to include a local polynomial of the running variable, in our case the date of birth, estimated separately on each side of the threshold. We let the AIC determine the order of the polynomial functions of the time of birthdate from April 1933, separately for each side of the threshold. In Sect. 5 we discuss robustness checks based on smaller windows.

4 Empirical results

Our identification strategy relies on the mixed proportional hazard with principal stratification to identify a LATE. This LATE focuses on the compliers, who were influenced by the increase in minimum school leaving age, and the relevant treatment relates to the binary comparison of those who left school before age 15 and those who left school at ages 15 or 16. Before we report the results of this principal strata model we discuss the results for a standard Gompertz model (with or without gamma distributed unobserved heterogeneity) for the mortality rate when a dummy for staying in school beyond age 15 is one of the included variables. This provides a benchmark for our causal inference.

We base the bandwidth choice around April 1933 on a randomization test (see Sect. 3.2) and use a bandwidth of 12 years. This seems a large bandwidth around the cutoff date but is comparable to other bandwidths used in the literature, e.g., Clark and Royer (2013) use a bandwidth of 15 years and both Van Kippersluis et al. (2011) and Johnston et al. (2015) use a bandwidth of 10 years. The first panel of Table 1 gives the estimated effect on the mortality hazard. The first two columns of Table 1 provide the estimated coefficients for the basic (M)PH Gompertz model. In this benchmark model staying in school beyond age 15 is associated with a reduction of the mortality hazard by 33% (\(=1-\hbox {e}^{-0.403}\)). Including a gamma distributed unobserved heterogeneity (MPH) increases the association with staying in school. We calculate the implied life-expectancy of leaving school after age 15, based on the estimated parameters and using Eqs. (2) or (3), which are reported in the second panel of Table 1. Again the first two columns report the estimated educational gains in the implied life-expectancy for the standard Gompertz model. In the basic Gompertz model we find a total association between staying in school beyond age 15 and life-expectancy of 5.7–6.0 years.
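The conversion from a proportional hazard coefficient to a percentage change in the hazard, used for the 33% figure above, can be reproduced in a few lines. The helper name `hazard_reduction` is hypothetical; the coefficient of -0.403 is the one quoted in the text.

```python
import math


def hazard_reduction(coefficient):
    """Proportional reduction in the hazard implied by a PH coefficient."""
    return 1.0 - math.exp(coefficient)


reduction = hazard_reduction(-0.403)
print(f"{reduction:.1%}")  # prints 33.2%
```

Because the coefficient enters the hazard through an exponential, the reduction is smaller in absolute value than the coefficient itself, and the gap widens as coefficients grow.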

Table 1 Estimated impact of schooling on the mortality hazard and estimated educational gain in life-expectancy, \(N=2750\)

The standard Gompertz model presented above does not account for the potential endogeneity of staying in school. Similar to the ‘standard’ RDD analysis that involves an instrumental variable method, like 2SLS, the principal strata Gompertz model described in Sect. 3.4, that exploits the policy reform of 1947 as an instrument for staying in school, seeks to solve this endogeneity issue.

The third and fourth columns of Table 1 report the difference in the hazard parameters for the compliers: \(\gamma _1-\gamma _0\). The full estimation results are given in Table 10 in Appendix D. We find a large and statistically significant effect of staying in school till age 15–16 instead of leaving school before age 15 on the hazard rate for the compliers, when the distance from the threshold is ignored (i.e., the order of the polynomial in the running variable is zero), see Table 10 in Appendix D. However, this estimate lacks credibility as it excludes a direct influence of birth date (the period effect or a secular trend) on the mortality hazard. The model that performs best on statistical criteria, based on the AIC, contains a second order polynomial in the distance to the threshold and leads to a positive and statistically insignificant effect.

Based on these estimated parameters we calculate the life-expectancy for each level of education, using Eqs. (2) or (3). The third and fourth columns of Table 1 report the estimated educational gains for the principal strata models. For the preferred model, with a second order polynomial in the running variable, the estimated total educational gain is a statistically insignificant decrease in life-expectancy.

5 Robustness checks

In this section we check how robust our results are for males and females separately, to including covariates, to the choice of the bandwidth and to adjusting for never takers.

Figure 3 shows how the 1947 reform affects the school leaving age for males and females separately. Again, the 1947 reform clearly had a large effect on school leaving around the age of 15, but not on leaving school after age 16 for both males and females.

Fig. 3 Probability to leave school before age 15, age 15–16, age 16–18 and after age 18 (around the cut-off birth date of 1–4–1933) for males and females separately. Dots represent average schooling from survey entry to survey end by quarter of birth

Figure 4 depicts the probability to survive until the end of the survey (July 1st, 2009) and the Kaplan–Meier survival curves for individuals born within 12 years before or after the cut-off birth date of the 1947 reform for males and females separately.

Fig. 4 Survival from first survey till end of follow up by age left school, only individuals born within 12 years before or after the cut-off birth date (31–3–1933) for males and females separately. Dots represent average survival from survey entry to survey end by year of birth for survival

Note that, as stated earlier, the survival gaps depicted in the right-hand plots of Fig. 4 are based on the raw survival data and could exist for many reasons, including selection bias and a causal impact.

We re-estimate the standard Gompertz and the principal strata model separately for males and females. The first two columns of Table 2 provide the estimated coefficients for the basic (M)PH Gompertz model. Again the standard Gompertz model predicts a large reduction of the mortality hazard from staying in school beyond age 15. Including a gamma distributed unobserved heterogeneity (MPH) hardly affects the size of this effect. For men, the preferred model leads to a negative but statistically insignificant effect of staying longer in school on the hazard rate among the compliers; for women the estimate is positive (and also statistically insignificant).

As before, we base the choice of the bandwidth around April 1933 on a randomization test, which leads to a bandwidth of 14 years for males and 13 years for females. Because these bandwidths differ from the one used for the pooled sample, the male and female sample sizes do not add up to the total sample size (\(N=2750\)). The second panel of Table 2 reports the estimated educational gains in the implied life-expectancy: the first two columns for the standard Gompertz model and the third and fourth columns for the principal strata model. In the standard Gompertz model we find a total association between staying in school beyond age 15 and life-expectancy of 5.4 years (males) and 5.1 years (females). The estimated educational gains from the principal strata model are not statistically significant, and for women the estimate is negative.

Table 2 Estimated coefficient of schooling on the mortality hazard, males and females separately
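
As a rough illustration of how a hazard coefficient translates into a gain in life-expectancy, the sketch below integrates a Gompertz survival function numerically. The parameterisation (log-hazard linear in age, with schooling entering as a proportional shift `beta`) and the values of `a`, `b` and `beta` are assumptions made for illustration only; they are not the estimates reported in Table 2, and the sketch leaves out the gamma frailty and the principal strata structure.

```python
import math


def gompertz_survival(t, a, b, beta=0.0):
    """S(t) for the hazard exp(a + b*t + beta); t is years since the starting age."""
    return math.exp(-math.exp(a + beta) / b * (math.exp(b * t) - 1.0))


def life_expectancy(a, b, beta=0.0, horizon=120.0, step=0.01):
    """Remaining life-expectancy: trapezoidal integral of S(t) over [0, horizon]."""
    n = int(horizon / step)
    total = 0.5 * (gompertz_survival(0.0, a, b, beta)
                   + gompertz_survival(horizon, a, b, beta))
    for i in range(1, n):
        total += gompertz_survival(i * step, a, b, beta)
    return total * step


def educational_gain(a, b, beta):
    """Difference in life-expectancy between schooling (beta) and no schooling (0)."""
    return life_expectancy(a, b, beta) - life_expectancy(a, b, 0.0)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to give a plausible age profile.
    a, b, beta = -9.0, 0.09, -0.3
    print(round(educational_gain(a, b, beta), 2))
```

A negative `beta` lowers the hazard at every age and therefore gives a positive gain; with `beta = 0` the gain is zero by construction.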

In a local randomization view of RDD, additional covariates (data beyond the outcome and the running variable) are used to find the bandwidth around the threshold (see Sect. 3.2). Researchers also often use additional covariates to reduce the variance of their empirical estimates. A common strategy is to include the covariates additively and linearly in parameters in a local linear RD regression (Calonico et al. 2019). Table 3 provides the estimated effect on the hazard and the estimated educational gain using an RDD (principal strata model) with additional covariates (the exogenous variables region and sex). These results do not differ substantially from those without covariates and remain statistically insignificant.

Table 3 Estimated impact of schooling on the mortality hazard and estimated educational gain in life-expectancy, principal strata model including covariates

A common issue in fuzzy regression discontinuity designs is the choice of the bandwidth, that is, which individuals around the threshold to include. There is always a trade-off between bias and precision: a wider window adds observations but includes individuals who are less comparable across the threshold. The chosen window of 12 years around the threshold of being born in April 1933 is based on the randomization test. Table 4 reports the total educational gains in life-expectancy estimated for the principal strata model using smaller bandwidths, from 10 down to 5 years. For all bandwidths the estimated total educational gains are statistically insignificant.

Table 4 Estimated educational gain in life-expectancy, principal strata model, by bandwidth
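
The bandwidth check amounts to re-running the same estimator on progressively narrower windows around the cut-off. A minimal sketch of that loop follows. The record layout (`birth` as a fractional year, `d` as an indicator for staying in school beyond age 15), the encoding of April 1933 as `1933.25`, and the stand-in estimator `first_stage_jump` are all assumptions of this sketch; the estimator is a simple difference in shares, not the principal strata model.

```python
CUTOFF = 1933.25  # April 1933 as a fractional year (illustrative encoding)


def in_window(birth, h, cutoff=CUTOFF):
    """True if the birth date lies within h years of the cut-off."""
    return abs(birth - cutoff) <= h


def first_stage_jump(rows, cutoff=CUTOFF):
    """Stand-in estimator: share staying beyond 15 after minus before the cut-off."""
    after = [r["d"] for r in rows if r["birth"] >= cutoff]
    before = [r["d"] for r in rows if r["birth"] < cutoff]
    return sum(after) / len(after) - sum(before) / len(before)


def bandwidth_sensitivity(rows, estimator, bandwidths=(10, 9, 8, 7, 6, 5)):
    """Apply the estimator to the sample restricted to each bandwidth."""
    return {
        h: estimator([r for r in rows if in_window(r["birth"], h)])
        for h in bandwidths
    }
```

Any function that maps a list of records to an estimate can be passed as `estimator`, so the same loop could wrap a full hazard model.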

Another issue is that we identified a few (5%) never-takers: individuals who report leaving school before the age of 15 in the post-policy period, when the legal school leaving age had already been increased. We therefore check what happens if we either (1) remove the never-takers from the sample or (2) assume that these individuals stayed in school beyond age 15. In both cases we re-estimate the complier model and calculate the educational gains in life-expectancy. Table 5 presents the estimated total educational gains with these two adjustments for never-takers. None of the estimated educational gains is statistically significant.

Table 5 Estimated educational gain in life-expectancy at age 18, principal strata model, adjustment for never-takers
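
The two adjustments can be sketched as simple transformations of the sample before re-estimation. The record layout is an assumption of this sketch: `z` equals 1 for individuals born after the cut-off (post-reform) and `d` equals 1 for those who stayed in school beyond age 15, so observed never-takers are the records with `z == 1` and `d == 0`.

```python
def is_observed_never_taker(row):
    """Post-reform individual who reports leaving school before age 15."""
    return row["z"] == 1 and row["d"] == 0


def drop_never_takers(rows):
    """Adjustment (1): remove observed never-takers from the sample."""
    return [r for r in rows if not is_observed_never_taker(r)]


def recode_never_takers(rows):
    """Adjustment (2): treat observed never-takers as having stayed beyond 15."""
    return [dict(r, d=1) if is_observed_never_taker(r) else r for r in rows]
```

Both functions return new lists and leave the input records unchanged, so the original sample stays available for comparison.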

Finally, using a local treatment effect framework requires that all the assumptions (in Sect. 3.3) hold. We visually test the exclusion restriction (Assumption 3) and the monotonicity assumption (Assumption 4). A joint graphical ‘test’ of the exclusion restriction and monotonicity (Kitagawa 2015; Mourifié and Wan 2017) is:

$$\begin{aligned} S(t|D=1,Z=1)\Pr (D=1|Z=1)&\ge S(t|D=1,Z=0)\Pr (D=1|Z=0) \\ S(t|D=0,Z=0)\Pr (D=0|Z=0)&\ge S(t|D=0,Z=1)\Pr (D=0|Z=1) \end{aligned}$$

Thus, a figure with these four curves \(S(t|\cdot )\times \Pr (D|Z)\) serves as a graphical test of the validity of these assumptions regarding the instrument, as defined by the threshold before and after the reform of 1947. Figure 5 depicts the four curves and shows that the two inequalities hold (for the age interval with sufficient observations: ages 54–76).

Fig. 5 Graphical joint test of exclusion restriction and monotonicity of using the cut-off birth of the 1947 reform as an instrument
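
The four curves \(S(t|\cdot )\times \Pr (D|Z)\) in the two inequalities above can be computed along the following lines. This is a minimal sketch under stated assumptions: the record layout (`t` for age at death or censoring, `event` equal to 1 for a death, `d` and `z` as the schooling and instrument indicators) is hypothetical, survival is estimated with a plain Kaplan–Meier estimator that ignores delayed entry into the survey, and the check is a pointwise comparison on a grid rather than a formal statistical test.

```python
def kaplan_meier(times, events, grid):
    """Kaplan-Meier estimate of S(t) at each grid point (event: 1=death, 0=censored)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, steps = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = ties = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            ties += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
            steps.append((t, surv))
        n_at_risk -= ties
    out = []
    for g in grid:
        s = 1.0
        for t, v in steps:
            if t > g:
                break
            s = v
        out.append(s)
    return out


def weighted_curve(rows, d, z, grid):
    """S(t | D=d, Z=z) * Pr(D=d | Z=z) on the grid; assumes the Z=z arm is non-empty."""
    arm = [r for r in rows if r["z"] == z]
    cell = [r for r in arm if r["d"] == d]
    share = len(cell) / len(arm)
    km = kaplan_meier([r["t"] for r in cell], [r["event"] for r in cell], grid)
    return [share * s for s in km]


def joint_test_holds(rows, grid, tol=0.0):
    """Check both inequalities pointwise on the grid, up to a tolerance."""
    d1z1 = weighted_curve(rows, 1, 1, grid)
    d1z0 = weighted_curve(rows, 1, 0, grid)
    d0z0 = weighted_curve(rows, 0, 0, grid)
    d0z1 = weighted_curve(rows, 0, 1, grid)
    first = all(x >= y - tol for x, y in zip(d1z1, d1z0))
    second = all(x >= y - tol for x, y in zip(d0z0, d0z1))
    return first and second
```

Restricting `grid` to ages with sufficient observations mirrors the restriction to ages 54–76 in Fig. 5; plotting the four lists returned by `weighted_curve` gives the graphical version of the check.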

6 Conclusion

We investigate the educational gain in life-expectancy using data for England and Wales from the Health and Lifestyle Survey. For causal identification of the educational gain we propose a Regression Discontinuity Design implied by the increase in the minimum school leaving age in 1947 (from 14 to 15), combined with a principal stratification method for the mortality hazard rate. The principal stratification framework is a general potential outcomes framework for causal inference with instruments. It defines complier types for educational attainment (always-takers, compliers and never-takers) that depend on the policy reform.

A simple Gompertz mortality rate model suggests that staying in school beyond age 15 significantly increases life-expectancy. However, estimates of causal effects obtained from the principal strata method indicate that the total educational gain is not statistically significant. We conducted a range of robustness tests, allowing for additional covariates, using smaller bandwidths around the threshold and adjusting for never-takers, and did not find substantial changes in the estimated results. This reinforces earlier evidence that shows only a small effect (Mazumder 2008, 2012; Jones et al. 2011; Van Kippersluis et al. 2011; Fletcher 2015; Meghir et al. 2018; Basu et al. 2018) or no effect (Albouy and Lequien 2009; Clark and Royer 2013; Jürges et al. 2013) of education on mortality. Our empirical application shows that this finding also holds in the British educational system under a rigorous analysis of the mortality hazard based on non-linear duration analysis.