Human reliability analysis: A critique and review for managers
Research highlights
- Human behaviour is often a root or significant contributing cause of system failure.
- Human reliability analyses focus on errors rather than the effects of all forms of human behaviour.
- Organisation design and culture can correlate failure risks.
- A research programme to address these issues is suggested.
- Managers need to question the numbers that risk and reliability analyses provide.
Introduction
Complex systems are never 100% reliable: they fail, sometimes catastrophically, more usually reparably. Perrow (1984, 1994) has argued that failures are an inevitable consequence of the increasing complexity of our systems. Whatever the case, inevitable or not, failures undoubtedly occur. Even in systems that appear to be largely technological rather than human, we find that in the majority of cases there is a human element involved. Maybe some erroneous or even malicious behaviour initiates the failure; maybe the human response to some event is insufficient to avoid system failure; or maybe the original design of the system did not anticipate a potential failure or unfavourable operating conditions.
Statistics show human error is implicated in (see also Hollnagel, 1993):
- over 90% of failures in the nuclear industry (Reason, 1990a; see also United States Nuclear Regulatory Commission, 2002);
- over 80% of failures in the chemical and petro-chemical industries (Kariuki and Lowe, 2007);
- over 75% of marine casualties (Ren et al., 2008);
- over 70% of aviation accidents (Helmreich, 2000);
- over 75% of failures in drinking water distribution and hygiene (Wu et al., 2009).
In addition to highly technological industries, there are other complex systems involving applications of technology in which we include complex mathematical modelling, software and web-based systems. The growth of service industries with new business models implies an even greater dependence of businesses, organisations and even economies on reliable human interactions. For instance, recently human checks and balances failed to detect dubious investment behaviour of a trader at Société Générale and led to a loss of some €4.9bn, large enough to have economic and financial effects beyond the bank. The current ‘credit crunch’ owes not a little to misjudgement and error in the banking and finance sectors, indicating the growing interdependence of many disparate parts of the modern global economy. It also owes a lot to a loss of investors’ confidence and trust, both of which inform human behaviour. These data indicate how vulnerable our systems are, even after many years of refinement and improvement; and how important an understanding of human behaviour is if we are to reduce the risk to systems. Another high profile example is the leak in the THORP plant at Sellafield (Thermal Oxide Reprocessing Plant) that was discovered in 2005 (see Board of Inquiry, 2005). This relatively modern plant had been designed to a high standard of safety, but information indicating a system problem was available for some months and yet went unnoticed. Despite previous incidents in 1998 and earlier in 2005, the information that should have suggested a leak, or at least a problem requiring investigation, was misinterpreted. The prevailing attitude was that the system was error-free and hence information that could suggest the contrary was ignored or dismissed.
Managerial processes are critical to successful operation of any complex system; and the quality of management processes depends on their understanding of the import and limitations of the results of the risk (and other) analyses that are provided to them. We emphasise here that all managers, whether or not they have an explicit responsibility for risk management, need to have some understanding of the assumptions and limitations of such analyses. In this article, we examine current and past approaches to human reliability analysis (HRA). We discuss its assumptions, limitations and potential in qualitative terms so that managers can better assess the value of the information that it provides them and so manage risks more effectively. We also suggest that further development of HRA methodologies should take more account of the managerial practices that could be applied to reduce the failures that occur at the interface of human behaviour and technology.
Managers understand human behaviour; good managers understand human behaviour extremely well. To bring out the best in a team one needs to know how each will respond to a request, an instruction, an incentive or a sanction. Yet only the most foolhardy and overconfident of managers would claim that they can predict human behaviour perfectly all the time – or even 95% of the time. The problem is that we often need to design systems with very high reliabilities, many times with overall failure rates of less than 1 in 10 million (i.e. a failure probability below 10⁻⁷). To design and analyse such systems we need a deep understanding of human behaviour in all possible circumstances that may arise in their management and operation. And that is the challenge facing HRA. Our current understanding of human behaviour is not sufficiently comprehensive: worse, current HRA methodologies seldom use all the understanding that we do have.
Of course, there is a trivial mathematical answer to this. If we are to achieve an overall system failure probability of 10⁻⁷, we do not need humans to be perfectly reliable. We simply need to know how reliable they are and then ensure that we arrange and maintain sufficient safety barriers around the system to ensure that overall system failure probabilities are as low as required. Suppose we construct seven independent safety barriers (perhaps some involving humans, some purely technological), and suppose each has a probability of 1 in 10 of failing. Then arranging them (conceptually) in sequence, so that the whole system fails if and only if every one of the seven fails, gives an overall probability of system failure of 0.1⁷ = 10⁻⁷.
The problem with this is that few barriers are truly independent: most systems offer opportunities to ‘bypass’ these barriers. Moreover, human behaviour tends to introduce significant correlations and dependencies which invalidate such calculations, reducing the benefit that each extra safety barrier brings; such problems with protective redundancy are well known (for example, Sagan, 2004). So the simplistic calculation does not apply, and we shall argue that we have yet to develop sufficiently complex mathematical modelling techniques to describe human behaviour adequately for risk and reliability analyses.
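The arithmetic here, and how quickly a small common-cause dependency erodes it, can be sketched numerically. The following is a hypothetical illustration of our own (the function and its common-cause parameter are a deliberate simplification, not a model drawn from the HRA literature): barriers fail independently except for a single common-cause event that defeats all of them at once.

```python
def system_failure_probability(n_barriers=7, p_barrier_fail=0.1, p_common_cause=0.0):
    """P(system failure) for barriers arranged in sequence: the system fails
    only if every barrier fails. Barriers fail independently, except for a
    common-cause event that defeats all of them simultaneously."""
    p_all_independent = p_barrier_fail ** n_barriers
    # Law of total probability over the common-cause event:
    return p_common_cause + (1 - p_common_cause) * p_all_independent

# Perfectly independent barriers reproduce the naive calculation:
print(system_failure_probability())                      # ~1e-07
# A 1-in-10,000 common-cause event dominates the result entirely:
print(system_failure_probability(p_common_cause=1e-4))   # ~1e-04
```

Even a common-cause probability three orders of magnitude smaller than each barrier's own failure probability leaves the extra barriers contributing almost nothing, which is the sense in which correlation invalidates the simplistic calculation.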
In many ways the roles of risk and reliability analysis in general and of HRA in particular are often misunderstood by system designers, managers and regulators. In a sense they believe in the models and the resulting numbers too much and fail to recognise the potential for unmodelled and possibly unanticipated behaviours – physical or human – to lead to overall system breakdown (cf. French and Niculae, 2005). Broadly there are two ways in which such analyses may be used.
- When HRA is incorporated into a summative analysis, its role is to help estimate the overall failure probabilities in order to support decisions on, e.g., adoption, licensing or maintenance. Such uses require quantitative modelling of human reliability; and overconfidence in these models can lead to overconfidence in the estimated probabilities and poor appreciation of the overall system risks.
- There are also formative uses of HRA in which recognising and roughly ranking the potential for human error can help improve the design of the system itself and also the organisational structures and processes by which it is operated. Effective HRA not only complements sound technical risk analysis of the physical systems, but also helps organisations develop their safety culture and manage their overall risk. Indeed, arguably it is through this that HRA achieves its greatest effect.
These uses are not independent – in designing, licensing and managing a system one inevitably iterates between the two – but they differ fundamentally in philosophy. In summative analysis the world outside the system in question learns from the outcome of an analysis; in formative analysis the world inside the system learns from the process of analysis. In summative analysis the ideal is almost to be able to throw away the process and deal only with the outcome; in formative analysis the ideal is almost to throw away the outcome and draw only from the process. While we believe that HRA has significant potential to be used more in formative ways, we are concerned about its current ability to fulfil a summative role, providing valid probabilities of sequences of failure events in which human behaviour plays a significant role. We believe that there is currently scope for considerable overconfidence in the summative power of HRA, and that management, regulators and society in general need to appreciate this, lest they make poorly founded decisions on regulating, licensing and managing systems.
The four of us were part of a recent UK EPSRC funded multi-disciplinary project Rethinking Human Reliability Analysis Methodologies to survey and critique HRA methodologies (Adhikari et al., 2008). Our purpose in this paper is to draw out the relevant conclusions from this project for the management community and, perhaps as well, for our political masters who create the regulatory context in which complex systems have to operate. Overall we believe that current practices in and uses of HRA are insufficient for the complexities of modern society. We argue that the summative outputs of risk and reliability analyses should be taken with the proverbial pinch of salt. But not all our conclusions will be negative. There is much to be gained from the formative use of HRA to shape management practices and culture within organisations and society which can lead to better, safer and less risky operations.
In the next section we briefly survey the historical development of the concepts underlying HRA and its role in risk and reliability analyses. We reflect on the widely quoted Swiss Cheese Model (Reason, 1990b), which seeks to offer a qualitative understanding of system failure – though we shall argue that it may actually lead to systematic misunderstandings! We then turn to modern theories of human behaviour, particularly those related to judgement and decision. A key issue is that HRA focuses on human errors, whereas many system failures may arise not just despite, but sometimes because of, fully appropriate and rational behaviour on the part of those involved. Thus we need a broader understanding of human behaviour than that relating to human error alone. We also need to recognise that cultural, organisational, social and other contexts influence behaviour, perhaps correlating behaviour across a system and thus invalidating the assumptions of independence commonly made in risk and reliability analyses. One flaw common to many current HRA methodologies is that they tend to focus on easily describable, sequential, generally low-level operational tasks. Yet the human behaviour implicated in many system failures may occur in quite different contexts, perhaps in developing higher-level strategy or during the response to an unanticipated initiating failure event. In recent years there have been many studies of organisational forms that seem to be more resilient to system failures than might be expected, and we briefly discuss such studies of high reliability organisations (HROs). Another flaw common to many current HRA methodologies is the lack of specification of their domain of applicability, which makes it difficult to select appropriate methods for a given problem.
We then use Snowden’s Cynefin classification of decision contexts (Snowden, 2002, Snowden and Boone, 2007) to categorise the different circumstances in which human behaviour may be involved in system failure. We believe that the use of Cynefin – or a similar categorisation of decision contexts – can help in delineating when different HRA methodologies are appropriate. Moreover, it points to areas in which we lack a really sound, appropriate HRA methodology. Our final two sections draw our discussion to a close, suggesting that:
- by drawing together current understandings from HRA with other domains of knowledge in behavioural, management and organisational theories, we can make better formative use of HRA in designing systems, processes and the organisations that run them;

but that:

- the state of the art in quantitative HRA is too poor to make the summative assessments of risk and reliability that our regulators assume, and society urgently needs to recognise this.
HRA methodologies and the Swiss cheese model
Reliability analysis and risk analysis are two subjects with a great deal of overlap (Aven, 2003, Barlow and Proschan, 1975, Bedford and Cooke, 2001, Høyland and Rausand, 1994, Melnick and Everitt, 2008). The former is generally narrower in scope and tends to deal with engineered systems subject to repeated failures and the need for preventative maintenance policies to address these. Key concepts in reliability engineering include component availability, reliability and maintainability; mean
Human behaviour and human error
Human behaviour is complex and often non-rational. For instance, it seems sensible to use modern technological advances to make the physical components of a system safer. But there is some evidence that making subsystems safer can make the overall system less safe, because of the propensity of humans to take less personal care when a system takes more care (Adams, 1988, Hollnagel, 1993). In this section we survey some recent findings from behavioural decision studies and consider how this
High reliability organisations
The past 20 years have seen several studies of high reliability organisations (HROs), which Roberts (1990) defined as organisations that fail with catastrophic consequences less than one time in 10,000. These studies recognise that certain kinds of social organisation are capable of making even inherently vulnerable technologies reliable enough for a highly demanding society.
An HRO encourages a culture and operating style which emphasises the need for reliability rather than efficiency (Weick, 1987
Decision contexts
There is a further aspect of context that HRA should consider: decision context. The judgements and decisions needed of humans in a system can vary from those needed to perform mundane repetitive operational tasks, through more complex circumstances in which information needs to be sought and evaluated to identify appropriate actions, to the ability to react to and deal with unknown and unanticipated events. Decision processes will vary accordingly. Design decisions can inadvertently introduce further
Toward an extended model of HRA
Summative HRA and related approaches emphasise quantification and prediction. While cognitive understanding of people and cultural perspectives on organisations are acknowledged, the gulf between these and quantitative risk models is generally considered too significant to be bridged. Yet the conjoining of these approaches could yield a superior model of safety critical organisations and the people working within them. In the short term, exploring the interfaces between HRA and behavioural,
Conclusion: a message for managers
The key point that we have been trying to convey in this paper is the current dislocation between the mechanistic, reductionist assumptions on which current HRA methodologies are primarily built and our current understandings of human and organisational behaviour. We must bring these into better register. Managers, regulators, politicians and the public need to be aware of this, lest they simply believe the numbers that are sometimes touted about the safety of our systems. This should not be read as a
Acknowledgements
This work was supported by the Engineering and Physical Sciences Research Council (Contract number: EP/E017800/1). We are grateful to our co-investigators and colleagues on this: Sondipon Adhikari, Clare Bayley, Jerry Busby, Andrew Cliffe, Geeta Devgun, Moetaz Eid, Ritesh Keshvala, David Tracy and Shaomin Wu. We are also grateful for many helpful discussions with Ronald Boring, Roger Cooke and John Maule.
References (101)
- Importance of human contribution within the human reliability analysis (IJS-HRA). Journal of Loss Prevention in the Process Industries (2008)
- Safety design: towards a new philosophy. Safety Science (2006)
- Looking for errors of omission and commission or the hunting of the Snark revisited. Reliability Engineering and System Safety (2000)
- Integrating human factors into process analysis. Reliability Engineering and System Safety (2007)
- Problem structuring methods in action. European Journal of Operational Research (2004)
- Model-based human reliability analysis: prospects and requirements. Reliability Engineering and System Safety (2004)
- Assessment of complex socio-technical systems – theoretical issues concerning the use of organisational culture and organisational core task concepts. Safety Science (2007)
- A methodology to model causal relationships in offshore safety assessment focusing on human and organisational factors. Journal of Safety Research (2008)
- Human reliability analysis has a role in preventing drinking water incidents. Water Research (2009)
- Risk homeostasis and the purpose of safety regulation. Ergonomics (1988)
- Foundation of Risk Analysis: A Knowledge and Decision Oriented Perspective
- The unbearable automaticity of being. American Psychologist
- Automaticity of social behavior: direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology
- Reviews on decision making. Administrative Science Quarterly
- Managerial Decision Making
- Probabilistic Risk Analysis: Foundations and Methods
- Attention and Self-Regulation: A Control Theory Approach to Human Behavior
- Heuristic and systematic information processing within and beyond the persuasion context
- Drs Pangloss and Strangelove meet organizational theory: high reliability organizations and nuclear weapons accidents. Sociological Forum
- Bayesian belief networks for safety assessment of computer-based systems
- Editorial of the special issue: design process and human factors integration. Cognition, Technology and Work
- How to integrate safety in design: methods and models. Human Factors and Ergonomics in Manufacturing & Service Industries
- Trading on illusions: unrealistic perceptions of control and trading performance. Journal of Occupational and Organizational Psychology
- The affect heuristic in judgments of risks and benefits. Journal of Behavioral Decision Making
- Problem structuring methods I. Journal of the Operational Research Society
- Problem structuring methods II. Journal of the Operational Research Society
- Decision Behaviour, Analysis and Support
- Believe in the model: mishandle the emergency. Journal of Homeland Security and Emergency Management
- Models of ecological rationality: the recognition heuristic. Psychological Review
- Risk mitigation in virtual organizations. Organization Science
- On error management: lessons from aviation. British Medical Journal
- Human Reliability Analysis: Context and Control
- Cognitive Reliability and Error Analysis Method – CREAM
- System Reliability Theory
- Practical Guide to Human Reliability Assessment