Use of confidence intervals to demonstrate performance against forest management standards

https://doi.org/10.1016/j.foreco.2007.04.048

Abstract

The objective of continuous improvement embedded in forest management standards relies on the capacity of management to respond appropriately to evidence of performance provided by monitoring. This evidence is rarely unequivocal. Under a null hypothesis of no effect, two kinds of errors in interpretation are possible—inferring an effect where none exists (Type I error) and inferring no effect when in fact one exists (Type II error). If the monitoring relates to possible improvement in growth or yield then a Type I error leads to false optimism and a Type II error to false pessimism. If monitoring concerns a potential environmental or social impact, a Type I error implies alarmism and a Type II error a false sense of security.

Explicit consideration of statistical power in designing and interpreting monitoring data is an effective buffer against these errors. However, strict application of statistical power may be impractical. In particular, the requirement to specify tolerable error rates and effect sizes will be difficult in many circumstances where the perspectives of managers, auditors or stakeholders are contested or perceived to be arbitrary or vague. We advocate the use of confidence intervals as an alternative to power calculations. Confidence intervals offer an accessible approach to communicating performance under a standard and the extent to which a monitoring program is able to distinguish compliance from non-compliance. We illustrate these arguments and tools through a hypothetical example involving a proposed change in silviculture in which the magnitudes of yield gains and environmental impacts are unclear.

Introduction

Forestry standards seek to synthesize what is understood to be best practice. Examples include the Forest Stewardship Council International Standard (FSC, 2000), criteria and indicators developed under the Montreal Process (1999), and the Australian Forestry Standard (Standards Australia, 2003). Implicitly, accreditation under a standard asserts a company's commitment to best management and continuous improvement. The validity of this assertion rests on demonstrated compliance. This paper outlines how data gathered in monitoring can be analysed and communicated in a way that is accessible to managers, auditors and stakeholders. Its motivation is facilitation of evidence-based continuous improvement.

The criteria and indicators contained in a standard represent an attempt to encapsulate values associated with forests in a form that is amenable to measurement. Suter (1993) describes a hierarchical process for translating broad management goals into measurement endpoints. Management goals are statements that embody broad objectives. They are often ambiguous or vague, but carry with them a clear social or organizational mandate. Measurement endpoints are elements that can actually be measured. They can be regarded as operational definitions of management goals. In their emphasis on measurable outcomes as a means of demonstrating compliance, forestry standards inevitably tend toward reductionism at the expense of holistic ecosystem-level perspectives on forest processes. We regard the specification of measurable endpoints as a necessity in management systems where the insights from monitoring underpin adaptive management and continuous improvement.

The technical task of gathering and interpreting monitoring data within a continuous improvement framework takes place against a complex social and political background involving conflicting values and multiple narratives regarding the magnitude and seriousness of various impacts (Raison et al., 2001). Managers may be over-confident in the effectiveness of their actions (Morgan and Henrion, 1990; Ludwig et al., 1993) and tend to regard the claims of environmentalists and others as alarmist. The attitude of community stakeholders toward natural resource managers is often characterized by skepticism rather than trust (Bocking, 2004). They may be inclined to view the claims of managers as fostering a false sense of security. Hard data and scientific rigor are often invoked as the ultimate arbiters of contested claims. However, people show a distinct tendency to draw, from meager data, firm conclusions that are inconsistent with what they themselves would regard as a reasonable burden of proof (Tversky and Kahneman, 1971).

Statistical analysis provides a basis for assessing the extent to which data are consistent with alternative assertions. There are several approaches, including frequentist and Bayesian analysis. The most commonly used and taught approach is frequentist analysis involving null hypothesis testing, in which the calculated probability (the p-value) of obtaining data at least as extreme as those observed, given the null hypothesis, is interpreted with reference to the same sampling procedure being implemented many times. Our focus here is on frequentist interpretation of data, but we note that Bayesian analyses may be equally appropriate. Bayesian methods combine prior information with the probabilities of obtaining the data under alternative hypotheses to obtain updated estimates of the evidence in favor of those hypotheses. The Bayesian analogue of the confidence interval is the credible interval. Quinn and Keough (2002) and McCarthy (2007) provide further discussion of the different schools of statistical inference and their application to the biological sciences.
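
To make the repeated-sampling interpretation concrete, the following minimal sketch simulates many monitoring samples and checks how often a t-based 95% confidence interval covers the true mean; it is our illustration rather than anything from the paper, and all numerical values are invented.

```python
# Minimal sketch (illustrative values only): under repeated sampling, a
# procedure that builds t-based 95% confidence intervals should cover the
# true parameter in roughly 95% of replicates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, sd, n, reps = 50.0, 10.0, 25, 10_000

covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, sd, n)
    se = sample.std(ddof=1) / np.sqrt(n)          # standard error of the mean
    half = stats.t.ppf(0.975, df=n - 1) * se      # 95% CI half-width
    if sample.mean() - half <= true_mean <= sample.mean() + half:
        covered += 1

print(f"Empirical coverage: {covered / reps:.3f}")  # expect ~0.95
```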

The standard frequentist approach to the evidence provided by data is poorly placed to deal with typical natural resource management issues (Burgman, 2005). Science has an asymmetric view of evidence as a consequence of its focus on the accumulation of knowledge. It is strongly averse to the possibility of concluding from a study that an effect exists when in fact it does not. Science is less concerned with the possibility of concluding no effect when one in fact exists. This is problematic because forest managers, auditors and stakeholders need to be aware of the possibility of both kinds of errors.

Specifically, inferences drawn from monitoring data are subject to two kinds of mistakes (Table 1): inferring an effect or impact when there is none (a Type I error, denoted α) or inferring no effect or impact when there is one (a Type II error, denoted β). Where monitoring involves testing economic aspects of sustainability's triple bottom line (e.g. the effectiveness of a thinning regime or fertiliser treatment on yield), Type I and Type II errors generally imply false optimism and false pessimism, respectively. The costs of wrong inferences are borne directly by the company. Where the issue involves impacts on the environmental or social dimensions of sustainability, a Type I error translates to false alarmism and a Type II error promotes a false sense of security. The costs of Type I errors in these circumstances are again borne largely by the company through (unreasonable) loss of community and/or commercial reputation. The costs of Type II errors are borne by the broader community through (unacknowledged) attrition of public good values or resources.
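
Both error types can be demonstrated by simulation. The sketch below (our illustration; the effect size, variability and sample size are hypothetical) applies two-sample t-tests at α = 0.05 to monitoring data generated with and without a true effect, and reports the empirical Type I and Type II error rates.

```python
# Hedged sketch of the two error types: with alpha fixed at 0.05, the
# Type I rate sits near 5% when no effect exists, while the Type II rate
# depends on the (hypothetical) effect size, variability and sample size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, sd, alpha, reps = 20, 8.0, 0.05, 5_000
effect = 4.0  # assumed true yield gain under the alternative

type1 = type2 = 0
for _ in range(reps):
    control = rng.normal(100.0, sd, n)
    no_effect = rng.normal(100.0, sd, n)             # null is actually true
    real_effect = rng.normal(100.0 + effect, sd, n)  # effect actually exists

    if stats.ttest_ind(control, no_effect).pvalue < alpha:
        type1 += 1  # inferred an effect where none exists
    if stats.ttest_ind(control, real_effect).pvalue >= alpha:
        type2 += 1  # inferred no effect where one exists

print(f"Type I rate ~ {type1 / reps:.3f}; Type II rate ~ {type2 / reps:.3f}")
```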

Standard statistical conventions commonly used in scientific research specify tolerable Type I error rates (typically α = 0.05) but are blind to Type II errors. Statistical power is a measure of the confidence with which we would have detected a particular effect, if one existed (Fowler et al., 1998). It is defined as 1 − β (i.e. the complement of the Type II error rate). Explicit consideration of statistical power is required when decisions are sensitive to Type II errors as well as Type I errors. The likelihood of Type I and Type II errors in any study or monitoring program will decrease as data accumulate and the reliability of evidence improves.
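
For a two-sample comparison, power (1 − β) can be approximated in closed form. The function below is a minimal sketch using a normal approximation; the effect size, standard deviation and sample size are placeholders, not values from the paper.

```python
# Normal-approximation power for a two-sided, two-sample comparison of
# means: power = P(reject H0 | the assumed effect is real).
import math
from scipy.stats import norm

def power_two_sample(effect, sd, n, alpha=0.05):
    """Approximate power to detect `effect` with n samples per group."""
    se = sd * math.sqrt(2.0 / n)           # SE of the difference in means
    z_crit = norm.ppf(1 - alpha / 2)       # two-sided critical value
    return norm.cdf(effect / se - z_crit)  # ignores the negligible far tail

# Illustrative numbers: a 4-unit gain, sd = 8, 20 plots per group.
print(f"power = {power_two_sample(effect=4.0, sd=8.0, n=20):.2f}")  # ~0.35
```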

In consultation with auditors and stakeholders, those designing a monitoring program could set acceptable thresholds for committing Type I and Type II errors, and design the program accordingly (Mapstone, 1995; Di Stefano, 2001; Foster, 2001; Field et al., 2004). But what is a tolerable impact? What is a reasonable burden of proof in demonstrating compliance (or non-compliance)? To what extent should the burden of proof be conditioned by the cost of collecting data? If answers to these questions can be given, it is a reasonably straightforward technical exercise for a statistician to calculate the sampling effort (and the monitoring budget) required to assess whether or not a forest manager is complying with a standard or criterion.
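
As a sketch of that "straightforward technical exercise", the standard normal-approximation formula below inverts the power calculation to give samples per group; the α, β, effect size and standard deviation are illustrative stand-ins for values that managers, auditors and stakeholders would need to agree on.

```python
# Samples per group so that the Type I rate is at most alpha and the
# Type II rate at most beta, under a normal approximation (two groups,
# two-sided test). All inputs below are illustrative assumptions.
import math
from scipy.stats import norm

def samples_per_group(effect, sd, alpha=0.05, beta=0.20):
    z_a = norm.ppf(1 - alpha / 2)   # Type I threshold
    z_b = norm.ppf(1 - beta)        # Type II threshold (power = 1 - beta)
    return math.ceil(2 * ((z_a + z_b) * sd / effect) ** 2)

# Detecting a 4-unit yield gain with sd = 8 at alpha = 0.05, power = 0.8:
print(samples_per_group(effect=4.0, sd=8.0))  # ~63 plots per group
```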

Strict approaches to the use of statistical power in designing monitoring programs assume that clear answers to these questions are available. But in practice, such clarity rarely exists. Notions of burden of proof and of what might be regarded as a tolerable impact are not solely technical questions. They involve resolution of the individual and collective judgments and preferences of industry, auditors and stakeholders.

Statistical power is an essential concept in the design of monitoring programs and the interpretation of results. However, our guess is that forest managers, auditors and stakeholders will not immediately resolve the issues and values associated with alternative perspectives on how large an impact is tolerable or what burden of proof should apply. Here we advocate a loose interpretation of statistical power that encourages progressive exploration of these themes using confidence intervals. Confidence intervals provide a simple graphical basis for informed discussion on what might constitute compliance and non-compliance and on the extent to which a monitoring program is able to distinguish compliance from non-compliance.
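
The kind of graphical reading we have in mind can be sketched as follows: compute a confidence interval for the monitored quantity and compare it with a compliance threshold. The threshold and data below are invented for illustration.

```python
# Hedged sketch: interpret a 95% confidence interval against a
# (hypothetical) tolerable-impact threshold agreed under a standard.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
threshold = 5.0                      # assumed tolerable impact
impact = rng.normal(3.5, 4.0, 30)    # simulated impact measurements

mean = impact.mean()
half = stats.t.ppf(0.975, df=impact.size - 1) * stats.sem(impact)
lo, hi = mean - half, mean + half

if hi < threshold:
    verdict = "entire CI below threshold: evidence of compliance"
elif lo > threshold:
    verdict = "entire CI above threshold: evidence of non-compliance"
else:
    verdict = "CI straddles threshold: monitoring cannot yet distinguish"
print(f"95% CI = ({lo:.2f}, {hi:.2f}); {verdict}")
```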

The focus of this paper is the statistical treatment and presentation of data gathered through monitoring. Its particular emphasis is sampling error. There are other aspects of designing monitoring programs that are beyond the scope of this paper. We treat neither the broad principles of sampling design (see Philip, 1994) nor details such as selecting reference or control sites (see ANZECC/ARMCANZ, 2000) or measurable indicators (see Prabhu et al., 2001).

The next section explores the concept of statistical power in some detail. An example is used to show how statistical inference can be misused, and how consideration of power insulates against misuse. We then make a distinction between formal and loose interpretations of statistical power and recommend confidence intervals to communicate performance. The discussion emphasizes that blanket prescriptions for demonstrating compliance (or non-compliance) are unlikely to be efficient or operationally feasible. We urge auditors, managers and stakeholders to differentiate aspects or criteria within a standard that are of greater importance from those of lesser importance (recognising the management context of individual circumstances), and to allocate their monitoring resources accordingly.

Section snippets

Statistical power

Uncertainty is inevitable in natural resource management, where sampling is undertaken in environments characterized by high variability. Through use of a detailed example, we explore the central importance of uncertainty and statistical power in monitoring. We then outline how confidence intervals can be used as an effective analytical approach to summarize data and also as an effective and accessible means of communicating performance to auditors and stakeholders.

Confidence intervals as an alternative to power calculations

The formulation of power calculations is not intuitive. In the scenario above, the calculation requires us to specify a null hypothesis of no effect and an alternative hypothesis equivalent to an effect size deemed to be of commercial, social or environmental consequence. Values for α and β must also be specified. In an a priori calculation, the number of samples needed to satisfy these thresholds is then derived. In the context of auditing a standard, the objective of the calculations is to
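
The snippet breaks off here, but the confidence-interval alternative it introduces can be sketched: rather than fixing α and β, choose a sample size so that the expected 95% CI half-width is narrow relative to the effect size of consequence. The standard deviation and target half-width below are our assumptions.

```python
# Hedged sketch of a CI-based design rule: pick n so the expected 95%
# confidence-interval half-width on a mean stays below a target margin.
import math
from scipy.stats import norm

def n_for_halfwidth(sd, halfwidth, conf=0.95):
    """Samples so the CI half-width is roughly the target (known-sd approx.)."""
    z = norm.ppf(0.5 + conf / 2.0)
    return math.ceil((z * sd / halfwidth) ** 2)

# Illustrative: sd = 8 and a target half-width of 2 units.
print(n_for_halfwidth(sd=8.0, halfwidth=2.0))  # ~62 samples
```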

Discussion

We recommended confidence intervals as a graphical means of facilitating evidence-based continuous improvement, providing the basis for informed discussion on (a) what might constitute compliance and non-compliance and (b) the extent to which a monitoring program is able to distinguish compliance from non-compliance.

Monitoring programs that clearly differentiate circumstances in which management complies or does not comply typically demand intensive sampling (Mapstone, 1995). Our guess is that

Acknowledgments

This work was funded by the Forest & Wood Products Research & Development Corporation (FWPRDC) through a grant made available to the Australian Forestry Standard Ltd. The FWPRDC is jointly funded by the Australian forest and wood products industry and the Australian Government. For comments received on a draft we thank Julian Di Stefano, Mark Edwards, David Flinn, Hans Drielsma, Ross Peacock, Wayne Hammond, John Wiedemann, Erwin Epp, Marks Nester, Kevin Swanepoel and three anonymous reviewers.

References (39)

  • Di Stefano, J., 2001. Power analysis and sustainable forest management. For. Ecol. Manage.
  • Di Stefano, J., 2004. A confidence interval approach to data analysis. For. Ecol. Manage.
  • Foster, J.R., 2001. Statistical power in forest monitoring. For. Ecol. Manage.
  • ANZECC/ARMCANZ, 2000. Australian Guidelines for Water Quality Monitoring and Reporting. Australian and New Zealand...
  • AS/NZS, 2004. Risk Management (AS/NZS 4360:2004).
  • Bocking, S., 2004. Nature's Experts. Science, Politics and the Environment.
  • Burgman, M.A., 2005. Risks and Decisions for Conservation and Environmental Management.
  • Burgman, M.A., Ades, P., Hickey, J., Williams, M., Davies, C., Maillardet, R., 1998. Methodological guidelines for the...
  • Crowley, P.H., 1992. Resampling methods for computer-intensive data analysis in ecology and evolution. Annu. Rev. Ecol. Syst.
  • Di Stefano, J., 2003. How much power is enough? Against the development of an arbitrary convention for statistical power calculations. Funct. Ecol.
  • Di Stefano, J., et al. Effect size estimates and confidence intervals: an alternative focus for the presentation and interpretation of ecological data.
  • Fairweather, P.G., 1991. Statistical power and design requirements for environmental monitoring. Aust. J. Mar. Freshwater Res.
  • Fidler, F., 2006. From statistical significance to effect estimation. Statistical reform in psychology, medicine and...
  • Fidler, F., et al., 2006. Impact of criticism of null-hypothesis significance testing on statistical reporting practices in conservation biology. Conserv. Biol.
  • Field, S.A., et al., 2004. Minimizing the costs of environmental management decisions by optimizing statistical thresholds. Ecol. Lett.
  • Fischer, F., 2000. Citizens, Experts, and the Environment.
  • Fowler, J., et al., 1998. Practical Statistics for Field Biology.
  • FSC, 2000. FSC Principles and Criteria for Forest Stewardship (FSC-STD-01-001). Forest Stewardship Council,...
  • Gardner, M.J., et al. Confidence intervals rather than P values.