Use of confidence intervals to demonstrate performance against forest management standards
Introduction
Forestry standards seek to synthesize what is understood to be best practice. Examples include the Forest Stewardship Council International Standard (FSC, 2000), criteria and indicators developed under the Montreal Process (1999), and the Australian Forestry Standard (Standards Australia, 2003). Implicitly, accreditation under a standard asserts a company's commitment to best management and continuous improvement. The validity of this assertion rests on demonstrated compliance. This paper outlines how data gathered in monitoring can be analysed and communicated in a way that is accessible to managers, auditors and stakeholders. Its motivation is facilitation of evidence-based continuous improvement.
The criteria and indicators contained in a standard represent an attempt to encapsulate values associated with forests in a form that is amenable to measurement. Suter (1993) describes a hierarchical process for translating broad management goals into measurement endpoints. Management goals are statements that embody broad objectives. They are often ambiguous or vague, but carry with them a clear social or organizational mandate. Measurement endpoints are elements that can actually be measured. They can be regarded as operational definitions of management goals. In their emphasis on measurable outcomes as a means of demonstrating compliance, forestry standards inevitably tend toward reductionism at the expense of holistic ecosystem-level perspectives on forest processes. We regard the specification of measurable endpoints as a necessity in management systems where the insights from monitoring underpin adaptive management and continuous improvement.
The technical task of gathering and interpreting monitoring data within a continuous improvement framework takes place against a complex social and political background involving conflicting values and multiple narratives regarding the magnitude and seriousness of various impacts (Raison et al., 2001). Managers may be over-confident in the effectiveness of their actions (Morgan and Henrion, 1990, Ludwig et al., 1993) and tend to regard the claims of environmentalists and others as alarmist. The attitude of community stakeholders toward natural resource managers is often characterized by skepticism rather than trust (Bocking, 2004). They may be inclined to view the claims of managers as instilling a false sense of security. Hard data and scientific rigor are often invoked as the ultimate arbiters of contested claims. However, there is a distinct tendency for people to draw firm conclusions from meager data that are inconsistent with what they themselves might regard as a reasonable burden of proof (Tversky and Kahneman, 1971).
Statistical analysis provides a basis for assessing the extent to which data are consistent with alternative assertions. There are several approaches, including frequentist and Bayesian analysis. The most commonly used and taught approach is frequentist analysis involving null hypothesis testing, whereby the calculated probability (p-value) of obtaining the observed data given the null hypothesis is based on the premise of the same sampling procedure being implemented many times. Our focus here is on frequentist interpretation of data, but we note that Bayesian analyses may be equally appropriate. Bayesian methods combine prior information with the probabilities of obtaining the data under alternative hypotheses to obtain updated estimates of the evidence in favor of the hypotheses. The Bayesian analogue of the confidence interval is the credible interval. Quinn and Keough (2002) and McCarthy (2007) provide further discussion of the different schools of statistical inference and their application to the biological sciences.
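As an illustration of the ideas above, the following sketch computes a frequentist p-value and 95% confidence interval for a hypothetical set of differences between a monitored indicator and a management target. The data and variable names are invented for illustration, not drawn from the paper; a large-sample z approximation is used for simplicity. With a flat prior and known sampling variance, the Bayesian 95% credible interval coincides numerically with the confidence interval, which is why the two are natural analogues.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical data: differences between a measured indicator and its
# nominal target (units are arbitrary; values are illustrative only).
data = [1.2, -0.4, 0.8, 2.1, -0.3, 1.5, 0.9, 0.2, 1.1, 0.6]
n = len(data)
xbar = mean(data)
se = stdev(data) / n ** 0.5

z = NormalDist()

# Frequentist: two-sided p-value for H0 "true mean difference = 0",
# using a large-sample z approximation.
p_value = 2 * (1 - z.cdf(abs(xbar / se)))

# Frequentist 95% confidence interval for the mean difference.
zcrit = z.inv_cdf(0.975)
ci = (xbar - zcrit * se, xbar + zcrit * se)

# Bayesian analogue: with a flat prior and known sampling variance the
# posterior for the mean is Normal(xbar, se), so the 95% credible
# interval is numerically identical to the confidence interval.
credible = (xbar - zcrit * se, xbar + zcrit * se)

print(p_value, ci, credible)
```

The p-value alone reports only the strength of evidence against the null; the interval additionally shows the range of effect sizes consistent with the data, which is the property exploited later in the paper.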
The standard frequentist approach to the evidence provided by data is poorly placed to deal with typical natural resource management issues (Burgman, 2005). Science has an asymmetric view of evidence as a consequence of its focus on the accumulation of knowledge. It is strongly averse to the possibility of concluding from a study that an effect exists when in fact it does not. Science is less concerned with the possibility of concluding no effect when one in fact exists. This is problematic because forest managers, auditors and stakeholders need to be aware of the possibility of both kinds of errors.
Inferences drawn from monitoring data can make two kinds of mistakes (Table 1): inferring an effect or impact when there is none (Type I error; denoted α), or inferring no effect or impact when there is one (Type II error; denoted β). Where monitoring tests economic aspects of sustainability's triple bottom line (e.g. the effect of a thinning regime or fertiliser treatment on yield), Type I and Type II errors generally imply false optimism and false pessimism, respectively, and the costs of wrong inferences are borne directly by the company. Where the issue involves impacts on the environmental or social dimensions of sustainability, a Type I error translates to false alarmism and a Type II error promotes a false sense of security. The costs of Type I errors in these circumstances are again borne largely by the company, through (unreasonable) loss of community and/or commercial reputation. The costs of Type II errors are borne by the broader community through (unacknowledged) attrition of public-good values or resources.
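The two error rates can be made concrete with a small Monte Carlo sketch (ours, not the paper's): repeatedly running a two-sided z-test on simulated monitoring samples, first under a true null (where every rejection is a Type I error) and then under a real effect (where every failure to reject is a Type II error). The test, effect size and sample size are all assumed for illustration.

```python
import random
from statistics import NormalDist

random.seed(42)
Z975 = NormalDist().inv_cdf(0.975)  # critical value for alpha = 0.05

def z_test_rejects(sample, sigma=1.0):
    """Two-sided z-test of H0: mean = 0, with known sigma."""
    n = len(sample)
    zstat = (sum(sample) / n) / (sigma / n ** 0.5)
    return abs(zstat) > Z975

def rejection_rate(true_mean, n=20, trials=5000):
    """Fraction of simulated monitoring studies that reject H0."""
    rejections = sum(
        z_test_rejects([random.gauss(true_mean, 1.0) for _ in range(n)])
        for _ in range(trials)
    )
    return rejections / trials

type_1 = rejection_rate(true_mean=0.0)      # false alarm rate, ~alpha
type_2 = 1 - rejection_rate(true_mean=0.5)  # missed-effect rate, beta
print(type_1, type_2)
```

With 20 samples per study, the simulated Type I rate sits near the nominal 0.05, while the Type II rate for a half-standard-deviation effect is roughly 0.4: the conventional design controls false alarms but leaves a substantial chance of a false sense of security.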
Standard statistical conventions commonly used in scientific research specify tolerable Type I error rates (commonly α = 0.05) but are blind to Type II errors. Statistical power is a measure of the confidence with which we would have detected a particular effect, if one existed (Fowler et al., 1998). It is defined as 1 − β (i.e. the complement of the Type II error rate). Explicit consideration of statistical power is required when decisions are sensitive to Type II errors as well as Type I errors. The likelihood of Type I and Type II errors in any study or monitoring program will decrease as data accumulate and the reliability of evidence improves.
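The claim that error rates fall as data accumulate can be shown directly with the standard large-sample power approximation for a two-sided z-test (a textbook formula, not a method specific to this paper; effect size and σ below are illustrative assumptions).

```python
from statistics import NormalDist

z = NormalDist()

def power_z_test(effect, sigma, n, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided z-test to detect
    a true mean shift of `effect` with n samples of spread sigma."""
    zcrit = z.inv_cdf(1 - alpha / 2)
    shift = effect * n ** 0.5 / sigma
    # probability the test statistic lands in either rejection region
    return (1 - z.cdf(zcrit - shift)) + z.cdf(-zcrit - shift)

# Power to detect a half-standard-deviation effect grows with n.
for n in (10, 20, 50, 100):
    print(n, round(power_z_test(effect=0.5, sigma=1.0, n=n), 3))
```

At n = 10 the power is only about 0.35, so a "no significant effect" result is more likely than not even when a real effect exists; by n = 100 the power exceeds 0.99. This is the asymmetry the text warns about: α is fixed by convention while β silently depends on sampling effort.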
In consultation with auditors and stakeholders, acceptable thresholds for committing Type I and Type II errors could be agreed, and the monitoring program designed accordingly (Mapstone, 1995, Di Stefano, 2001, Foster, 2001, Field et al., 2004). But what is a tolerable impact? What is a reasonable burden of proof in demonstrating compliance (or non-compliance)? To what extent should the burden of proof be conditioned by the cost of collecting data? If answers to these questions can be given, it is a reasonably straightforward technical exercise for a statistician to calculate the sampling effort (and the monitoring budget) required to make an assessment of whether or not a forest manager is complying with a standard or criterion.
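The "straightforward technical exercise" referred to above can be sketched with the standard sample-size formula for a two-sided z-test (again a generic textbook calculation, with α, β, effect size and σ chosen here purely for illustration).

```python
import math
from statistics import NormalDist

z = NormalDist()

def required_n(effect, sigma, alpha=0.05, beta=0.20):
    """Approximate samples needed for a two-sided z-test to hold the
    Type I rate at alpha and the Type II rate at beta when the true
    effect is `effect` and the sampling spread is sigma."""
    za = z.inv_cdf(1 - alpha / 2)
    zb = z.inv_cdf(1 - beta)
    n = ((za + zb) * sigma / effect) ** 2
    return math.ceil(n)  # round up to a whole number of samples

# Halving the smallest effect deemed tolerable roughly quadruples the
# sampling effort, and hence the monitoring budget.
print(required_n(effect=1.0, sigma=2.0))
print(required_n(effect=0.5, sigma=2.0))
```

The inverse-square dependence of n on effect size is why the choice of "tolerable impact" dominates the monitoring budget, and why that choice cannot be treated as a purely technical matter.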
Strict approaches to the use of statistical power in designing monitoring programs assume that clear answers to these questions are available. But in practice, such clarity rarely exists. Notions of burden of proof and what might be regarded a tolerable impact are not solely technical questions. They involve resolution of the individual and collective judgments and preferences of industry, auditors and stakeholders.
Statistical power is an essential concept in the design of monitoring programs and the interpretation of results. However, our guess is that forest managers, auditors and stakeholders will not immediately resolve the issues and values associated with alternative perspectives on how large an impact is tolerable or what burden of proof should apply. Here we advocate a loose interpretation of statistical power that encourages progressive exploration of these themes using confidence intervals. Confidence intervals provide a simple graphical basis for informed discussion on what might constitute compliance and non-compliance and on the extent to which a monitoring program is able to distinguish compliance from non-compliance.
The focus of this paper is the statistical treatment and presentation of data gathered through monitoring. Its particular emphasis is sampling error. There are other aspects to designing monitoring programs that are beyond the scope of this paper. We treat neither the broad principles of sampling design (see Philip, 1994), nor details such as selecting reference or control sites (see ANZECC/ARCMANZ, 2000) or measurable indicators (see Prabhu et al., 2001).
The next section explores the concept of statistical power in some detail. An example is used to show how statistical inference can be misused, and how consideration of power insulates against misuse. We then make a distinction between formal and loose interpretations of statistical power and recommend confidence intervals to communicate performance. The discussion emphasizes that blanket prescriptions for demonstrating compliance (or non-compliance) are unlikely to be efficient or operationally feasible. We urge auditors, managers and stakeholders to differentiate aspects or criteria within a standard that are of greater importance from those of lesser importance (recognising the management context of individual circumstances), and to allocate their monitoring resources accordingly.
Section snippets
Statistical power
Uncertainty is inevitable in natural resource management, where sampling is undertaken in environments characterized by high variability. Through use of a detailed example, we explore the central importance of uncertainty and statistical power in monitoring. We then outline how confidence intervals can be used as an effective analytical approach to summarize data and also as an effective and accessible means of communicating performance to auditors and stakeholders.
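One way to operationalize the confidence-interval approach is to compare the interval for a monitored indicator against a compliance threshold and report one of three verdicts. The sketch below is our illustration of that idea, not code from the paper; the indicator, threshold and readings are hypothetical, and a large-sample z interval is assumed.

```python
from statistics import NormalDist, mean, stdev

def compliance_verdict(sample, threshold, confidence=0.95):
    """Compare a confidence interval for the mean of a monitored
    indicator against a compliance threshold (higher = worse impact).

    Returns one of:
      "compliant"     - whole interval below the threshold
      "non-compliant" - whole interval above the threshold
      "inconclusive"  - interval straddles the threshold; more data,
                        or discussion of tolerable impact, is needed
    """
    n = len(sample)
    se = stdev(sample) / n ** 0.5
    zcrit = NormalDist().inv_cdf(0.5 + confidence / 2)
    lo, hi = mean(sample) - zcrit * se, mean(sample) + zcrit * se
    if hi < threshold:
        return "compliant"
    if lo > threshold:
        return "non-compliant"
    return "inconclusive"

# Hypothetical stream turbidity readings (NTU) against a threshold of 25.
print(compliance_verdict([18, 21, 19, 22, 20, 17, 23, 19], threshold=25))
```

The "inconclusive" outcome is the communicative advantage of intervals over p-values: it shows auditors and stakeholders directly when the monitoring program cannot yet distinguish compliance from non-compliance.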
Confidence intervals as an alternative to power calculations
The formulation of power calculations is not intuitive. In the scenario above, the calculation requires us to specify a null hypothesis of no effect and an alternative hypothesis equivalent to an effect size deemed to be of commercial, social or environmental consequence. Values for α and β must also be specified. In an a priori power calculation, the number of samples needed to satisfy these thresholds is then derived. In the context of auditing a standard, the objective of the calculations is to
Discussion
We recommend confidence intervals as a graphical means of facilitating evidence-based continuous improvement, providing the basis for informed discussion on (a) what might constitute compliance and non-compliance and (b) the extent to which a monitoring program is able to distinguish compliance from non-compliance.
Monitoring programs that clearly differentiate circumstances in which management complies or does not comply typically demand intensive sampling (Mapstone, 1995). Our guess is that
Acknowledgments
This work was funded by the Forest & Wood Products Research & Development Corporation (FWPRDC) through a grant made available to the Australian Forestry Standard Ltd. The FWPRDC is jointly funded by the Australian forest and wood products industry and the Australian Government. For comments received on a draft we thank Julian Di Stefano, Mark Edwards, David Flinn, Hans Drielsma, Ross Peacock, Wayne Hammond, John Wiedemann, Erwin Epp, Marks Nester, Kevin Swanepoel and three anonymous reviewers.
References (39)
- Power analysis and sustainable forest management. For. Ecol. Manage. (2001)
- A confidence interval approach to data analysis. For. Ecol. Manage. (2004)
- Statistical power in forest monitoring. For. Ecol. Manage. (2001)
- ANZECC/ARCMANZ, 2000. Australian Guidelines for Water Quality Monitoring and Reporting. Australian and New Zealand...
- Risk Management (AS/NZS 4360:2004). (2004)
- Nature's Experts. Science, Politics and the Environment. (2004)
- Risks and Decisions for Conservation and Environmental Management. (2005)
- Burgman, M.A., Ades, P., Hickey, J., Williams, M., Davies, C., Maillardet, R., 1998. Methodological guidelines for the...
- Resampling methods for computer-intensive data analysis in ecology and evolution. Annu. Rev. Ecol. Syst. (1992)
- How much power is enough? Against the development of an arbitrary convention for statistical power calculations. Funct. Ecol. (2003)
- Effect size estimates and confidence intervals: an alternative focus for the presentation and interpretation of ecological data
- Statistical power and design requirements for environmental monitoring. Aust. J. Mar. Freshwater Res.
- Impact of criticism of null-hypothesis significance testing on statistical reporting practices in conservation biology. Conserv. Biol.
- Minimizing the costs of environmental management decisions by optimizing statistical thresholds. Ecol. Lett.
- Citizens, Experts, and the Environment
- Practical Statistics for Field Biology
- Confidence intervals rather than P values