Profiting from pilot studies: Analysing mortality using Bayesian models with informative priors

https://doi.org/10.1016/j.baae.2012.11.003

Abstract

Pilot studies are often used to help design ecological studies. Ideally the pilot data are incorporated into the full-scale study data, but if the pilot study's results indicate a need for major changes to experimental design, then pooling pilot and full-scale study data is difficult. The default position is to disregard the preliminary data. But ignoring pilot study data after a more comprehensive study has been completed forgoes statistical power or costs more by sampling additional data equivalent to the pilot study's sample size. With Bayesian methods, pilot study data can be used as an informative prior for a model built from the full-scale study dataset. We demonstrate a Bayesian method for recovering information from otherwise unusable pilot study data with a case study on eucalypt seedling mortality. A pilot study of eucalypt tree seedling mortality was conducted in southeastern Australia in 2005. A larger study with a modified design was conducted the following year. The two datasets differed substantially, so they could not easily be combined. Posterior estimates from pilot dataset model parameters were used to inform a model for the second larger dataset. Model checking indicated that incorporating prior information maintained the predictive capacity of the model with respect to the training data. Importantly, adding prior information improved model accuracy in predicting a validation dataset. Adding prior information increased the precision and the effective sample size for estimating the average mortality rate. We recommend that practitioners move away from the default position of discarding pilot study data when they are incompatible with the form of their full-scale studies. More generally, we recommend that ecologists should use informative priors more frequently to reap the benefits of the additional data.

Summary

Pilot studies are often used to determine the design of ecological investigations. Ideally, the data from the pilot study are incorporated into the dataset of the main study, but if the pilot study indicates a need for major changes to the experimental design, merging data from the pilot and main studies is difficult. The usual decision is then to disregard the preliminary data. But ignoring the results of the pilot study after the main study has been completed means forgoing statistical power, or increasing effort by collecting additional data to make up for the sample size of the pilot study. With Bayesian methods, data from the pilot study can be used as an informative prior distribution for a model built from the main study dataset. We demonstrate a Bayesian method for recovering information from otherwise unusable pilot study data, using a case study on the mortality of eucalypt seedlings. A pilot study on eucalypt seedling mortality was conducted in southeastern Australia in 2005. A larger study with a modified design was conducted the following year. The two datasets differed considerably, so they could not readily be merged. Posterior estimates of the model parameters for the pilot study were used as the basis for a model of the second, larger dataset. Model checking showed that adding an informative prior distribution maintained the predictive capacity of the model with respect to the training data. Adding an informative prior distribution improved the accuracy of the model in predicting a validation dataset and increased the precision and effective sample size for estimating the average mortality rate.
We recommend that practitioners move away from the standard practice of discarding pilot study data when these are incompatible with their main study. More generally, we recommend that ecologists use informative prior distributions more often in order to benefit from additional data.

Introduction

The ability of Bayesian analyses to formally incorporate prior information has been little exploited in ecological research despite its distinct appeal in a world where data and resources for research are limited and the impetus for rapid learning is great. Many textbooks on Bayesian methods for ecologists introduce the concept of informative priors in the first few pages (e.g., Kéry 2010; McCarthy 2007), yet researchers typically use very vague priors. In effect, these researchers assert that they have no prior knowledge of model parameters. Informally and formally, researchers use prior information to determine questions, sampling regimes, and model structures, and to interpret results. Yet the use of informative priors is rare, perhaps because it is hard to express prior knowledge as a probability distribution (Clyde 1999) or because informative priors are perceived as overly subjective (Dennis 1996), though subjectivity is not a requirement of priors (Hobbs & Hilborn 2006). Another reason why informative priors are not used is a fear that they could reduce model accuracy. A prior affects not only the precision of estimates but also the location of the posterior and therefore, potentially, the predictive accuracy.

Here we extend the use of informative priors in ecological modelling (see Choy, O’Leary, & Mengersen 2009; Dupuis & Joachim 2006; McCarthy & Masters 2005; McCarthy, Citroen, & McCall 2008) to pilot study data. The primary goal of a pilot study is to inform the design of the subsequent full-scale study. Pilot studies are small studies that aim to reduce important uncertainties and to reveal the sample size needed to detect particular effects. They can also reveal major drivers of system variation, help identify the spatial and temporal scales at which that variation is propagated, or help refine a set of predictors. However, the data generated by a pilot study may not simply inform the design of the full-scale study, but may also help address the fundamental research question. Many ecologists’ default position is to disregard the preliminary data, a stance recommended by texts on data collection and analysis (e.g. Green 1979). However, treating the results of a pilot study as an informative prior using Bayesian methods provides a formal and transparent way to combine two otherwise incompatible data sources, improving the cost-effectiveness of the research.
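The idea of carrying a pilot study's posterior forward as a prior can be sketched with a simple conjugate Beta-binomial model for a single mortality probability. This is only an illustration under stated assumptions: the counts are hypothetical and are not the study's data, and the `Beta` class is a helper introduced here. In this simplified case the result is the same as pooling the two datasets. When the designs differ, as in the case study, the priors would instead be placed on the parameters that the pilot and full-scale models share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    """Beta(a, b) distribution over a mortality probability (illustrative helper)."""
    a: float
    b: float

    @property
    def mean(self) -> float:
        return self.a / (self.a + self.b)

    @property
    def sd(self) -> float:
        n = self.a + self.b
        return (self.a * self.b / (n * n * (n + 1))) ** 0.5

    def update(self, deaths: int, survivors: int) -> "Beta":
        """Conjugate update: add observed deaths and survivors to the parameters."""
        return Beta(self.a + deaths, self.b + survivors)


# Hypothetical counts, NOT the data from the eucalypt study.
vague = Beta(1.0, 1.0)                                # flat prior
pilot_posterior = vague.update(deaths=6, survivors=34)

full_deaths, full_survivors = 30, 90
with_vague_prior = vague.update(full_deaths, full_survivors)
with_pilot_prior = pilot_posterior.update(full_deaths, full_survivors)

print(f"vague prior: mean={with_vague_prior.mean:.3f}, sd={with_vague_prior.sd:.4f}")
print(f"pilot prior: mean={with_pilot_prior.mean:.3f}, sd={with_pilot_prior.sd:.4f}")
```

With these made-up counts the pilot-informed posterior is narrower than the vague-prior posterior, and its mean is pulled towards the lower pilot mortality.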

We illustrate the use of Bayesian informative priors to recover the inferential and predictive power of otherwise unusable pilot study data with a case study on eucalypt tree seedling mortality. Mortality rate is a key demographic parameter to be estimated. Understanding how and why it varies is key to describing and learning about the population dynamics of all species (Zens & Peart 2003). However, the mortality events from which rates are calculated are often rare in absolute terms, and in many systems may also be episodic. Combined, these issues make mortality rates difficult to characterise, so large datasets that include many individuals and span large spatial and temporal ranges are often required.
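The link between rare events and large datasets can be made concrete with a rough calculation. The sketch below assumes the simplest possible setting, which is not the study's model: independent individuals with a common mortality probability p, so the relative standard error of the estimated proportion is sqrt((1 - p) / (n p)). The function name and the target precision are illustrative choices. Site-to-site variation and episodic mortality would push the required numbers higher.

```python
import math


def individuals_needed(p: float, target_cv: float) -> int:
    """Sample size giving a relative standard error of `target_cv`
    for a binomial proportion p (independence assumed)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if target_cv <= 0.0:
        raise ValueError("target_cv must be positive")
    # Solve sqrt((1 - p) / (n * p)) = target_cv for n.
    return math.ceil((1.0 - p) / (p * target_cv ** 2))


# Hypothetical mortality probabilities, 10% relative standard error.
for p in (0.20, 0.05, 0.01):
    print(f"p={p:.2f}: about {individuals_needed(p, 0.10)} individuals")
```

Under these assumptions the required number of individuals grows roughly in proportion to 1/p as mortality becomes rarer.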

Including prior information in a Bayesian model will increase the precision of relevant parameter estimates and posterior predictive distributions (McCarthy 2007). The effect on model accuracy is harder to define, though no less important, and can only be assessed through model validation. Effects of priors on accuracy have received limited attention in ecology. If a prior disagrees with the likelihood, then the model will be less accurate with respect to the training data than it would be without the prior information. But the cost to predictive accuracy specific to the training data may be outweighed by the increased generality of the model, as it can include information from a wider range of sources than the training data alone. In this paper, we demonstrate how to treat the knowledge learned during a pilot study on eucalypt seedling mortality as an informative Bayesian prior. We express the value of this prior knowledge as the degree to which the full-scale study budget would need to increase to recover the information lost by not including the prior. The particular demonstration of using informative priors we provide here highlights their general benefit.
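Both effects described above, the gain in precision and the shift in location when prior and likelihood disagree, appear in a normal approximation for a single parameter on the logit scale. The sketch below is a simplified illustration and not the analysis used in the study. All numbers are hypothetical, `combine` is a helper introduced here, and the budget calculation assumes that the precision from the data grows in proportion to sample size and that cost grows in proportion to sample size.

```python
import math


def combine(prior_mean: float, prior_sd: float,
            data_mean: float, data_sd: float) -> tuple[float, float]:
    """Precision-weighted combination of a normal prior and a normal likelihood."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / data_sd ** 2
    post_prec = prior_prec + data_prec          # precisions add
    post_mean = (prior_prec * prior_mean + data_prec * data_mean) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)


# Hypothetical logit-scale summaries, NOT estimates from the study.
prior_mean, prior_sd = -1.9, 0.5    # pilot posterior used as the prior
data_mean, data_sd = -0.45, 0.25    # full-scale data alone

post_mean, post_sd = combine(prior_mean, prior_sd, data_mean, data_sd)

# Fractional increase in full-scale sample size needed to match the
# posterior precision without the prior (precision assumed proportional to n).
extra_fraction = (data_sd / prior_sd) ** 2

print(f"posterior mean={post_mean:.2f}, sd={post_sd:.3f}")
print(f"equivalent increase in full-scale sample size: {extra_fraction:.0%}")
```

In this made-up example the posterior is more precise than the data alone, but its mean also moves towards the prior. Whether that shift helps or harms prediction can only be judged against validation data.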

Seedling survival experiment design and analyses

During 2005–2009 a pilot and a full-scale transplant survival experiment were undertaken at 21 sites on 14 farming properties in the Goulburn–Broken Catchment, Victoria, southeastern Australia. The pilot study began in October 2005, when 54 Grey Box eucalypt (Eucalyptus microcarpa) seedlings were planted in each of four grazing exclosures in a split-plot design with two treatments. For the first treatment, plants were watered at fortnightly intervals during the first six months, while the second

Results

The average mortality rate was lower for the seedlings planted during the pilot study (13% per month) than for those planted for the full-scale study (39% per month), in both cases assuming an average maximum temperature of 28 °C. Both rates varied considerably between sites. In the full-scale study, topographically wetter sites tended to have lower rates of mortality. In both experiments, periods and places with higher average maximum temperatures

Discussion

The effect of priors on model precision is well known and documented (e.g. McCarthy & Masters 2005). However, the effect on accuracy is rarely examined, if at all. We know of no such examples in the ecological literature. Here, the observed small improvement in accuracy may be due to the increase in model generality and scope achieved by including the prior derived from the pilot study. This represents a general benefit of including prior information, and in particular from a pilot study, as it

Conclusion

We have demonstrated how preliminary data can be used as an informative Bayesian prior. A key finding of this study is that including the prior information increased the precision of some parameters at the same time as improving or at least not compromising the model's predictive accuracy. As well as changing the precision, including prior information also changed the location of some posterior distributions. Changing the location could be beneficial or undesirable depending on what is driving

Acknowledgements

We thank Libby Rumpff, Megan Watson, James Camac, Chris Jones, Rhiannon Apted, Warwick McCallum and Alex Thompson for help in the field. We also acknowledge the assistance of Carla Miles (Goulburn Broken Catchment Management Authority), Kate Hill (Department of Sustainability and Environment) and the land owners who allowed us to undertake this study on their properties. We also thank Bob O’hara, Rod Fensham, and anonymous reviewers for helpful comments on earlier versions of this manuscript.

References

  • M. Zens et al. (2003). Dealing with death data: Individual hazards, mortality and bias. Trends in Ecology & Evolution.
  • P. Allison (1982). Discrete-time methods for the analysis of event histories. Sociological Methodology.
  • S. Choy et al. (2009). Elicitation by design in ecology: Using expert opinion to inform priors for Bayesian statistical models. Ecology.
  • M. Clyde (1999). Bayesian model averaging: A tutorial: Comment. Statistical Science.
  • B. Dennis (1996). Discussion: Should ecologists become Bayesians? Ecological Applications.
  • J.A. Dupuis et al. (2006). Bayesian estimation of species richness from quadrat sampling data in the presence of prior information. Biometrics.
  • G.E. Garrard et al. (2012). A predictive model of avian natal dispersal distance provides prior information for investigating response to landscape change. Journal of Animal Ecology.
  • A. Gelman (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis.
  • A. Gelman (2008). Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine.
  • A. Gelman et al. (2004). Bayesian data analysis. Texts in Statistical Science.
  • R.H. Green (1979). Sampling design and statistical methods for environmental biologists.
  • J.A. Hanley et al. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology.
  • N.T. Hobbs et al. (2006). Alternatives to statistical hypothesis testing in ecology: A guide to self teaching. Ecological Applications.