Multi-objective optimization of IT service availability and costs

https://doi.org/10.1016/j.ress.2015.11.004

Highlights

  • A redundancy allocation problem for IT service design, the ITRAP, is developed.

  • Petri net Monte Carlo simulation and meta-heuristics are used for optimization.

  • Inter-component dependencies and defective operator interaction can be modeled.

  • The ITRAP overcomes the assumption of independent component failures.

  • A use case experiment demonstrates that the ITRAP provides more realistic results.

Abstract

The continuous provision of highly available IT services is a crucial task for IT service providers in order to fulfill service level agreements with customers. Although the introduction of redundant components increases availability, the associated cost may be very high. Therefore, decision makers in the IT service design stage face a trade-off between cost and availability in order to define suitable service level objectives. Although this task can be seen as a redundancy allocation problem, the existing definitions in this area are not transferable to IT service design due to the assumption of independent component failures, which has been identified as unrealistic in IT systems.

In this paper, a multi-objective redundancy allocation problem for IT service design is defined. To this end, a Petri net Monte Carlo simulation is developed that estimates the availability and costs of a specific design. In order to provide (sub)optimal solutions to an IT service redundancy allocation problem, two meta-heuristics, namely a genetic algorithm and tabu search, are adapted. The approach is applied to the IT service design of an application service provider, optimizing availability and cost, to demonstrate its feasibility and suitability.

Introduction

The importance of IT services is ever increasing. On the one hand, trends such as Cloud Computing bring millions of consumers in contact with IT services. On the other hand, even internal IT organizations are commonly understood as IT service providers in order to effectively manage costs and the business value of IT [1]. Service or Operational Level Agreements (SLAs/OLAs) document the quality of service that an IT service consumer can expect, for instance, with respect to service availability [2].

Availability is one of the most crucial quality aspects for customers [3], and can be defined as the likelihood that a service is able to provide its function at a certain point in time [4]. Although service availability is seen “at the core of customer satisfaction and business success” [5, p. 127], even the big IT companies suffer severe service disruptions that last hours or even days (e.g. Amazon [6], Apple [7] and Microsoft [8]). In August 2013, a five-minute outage of the Google services led to a 40% decrease in internet traffic and an estimated revenue loss of over US-$500,000 for Google alone [9]. However, smaller enterprises are also affected by unavailability: 134 companies that have been studied by the Aberdeen Group each suffered on average a revenue loss of more than one million US-$ in 2012 due to IT downtime [10].

The two basic approaches to increase system availability are the introduction of more reliable components and the implementation of redundancy mechanisms [11]. However, the associated cost of these approaches may not be justified by the availability improvements. In addition to this, the special characteristics of software components limit their reliability [12]. The balancing of availability and cost with respect to desired or existing service level objectives is one of the core activities in IT Service Management, and is described in well-known frameworks such as the ISO 20000 (Service Continuity and Availability Management) [13], CoBIT 5 (Managing Availability and Capacity) [14] and the IT Infrastructure Library (ITIL) (Availability Management).

In the 2011 version of ITIL, service availability is characterized as an essential service quality attribute immediately influencing customer satisfaction [5]. Additionally, SLA violations in the operation phase can lead to penalty costs and loss of reputation for the IT service provider [15]. Since design changes due to insufficient service quality in the operation phase (reactive measures) can be very costly, measures to achieve sufficient IT service availability should be considered in the service design stage (proactive measures) [5], [16]. Nevertheless, a lack of feasible supporting tools for high availability design has been noted [5]. This is mainly because classical analytical availability/reliability models that have been successfully applied in other domains assume independent component failures [17]. This assumption is unrealistic in modern IT systems due to the presence of inter-component dependencies, thus rendering results obtained from these models useless for decision support [18].

Examples of these inter-component dependencies are common cause failures and imperfect switching. In the former case, even heterogeneous components can be subject to the same fault under certain conditions [12]. Imperfect switching describes the phenomenon that a redundant component may not cover the failure of an active component due to problems in the switching process [19].
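To illustrate why these dependencies matter, the following minimal sketch compares the availability of a redundant pair under the independence assumption with two deliberately simplified static models, one for common cause failures and one for imperfect switching. The models and the parameter names `p_common` and `coverage` are illustrative assumptions and are not taken from the ITRAP.

```python
def parallel_independent(a: float) -> float:
    """Availability of two redundant components that fail independently."""
    return 1.0 - (1.0 - a) ** 2


def parallel_common_cause(a: float, p_common: float) -> float:
    """Assumed model: with probability p_common a shared fault takes both
    components down at once; otherwise they fail independently."""
    return (1.0 - p_common) * parallel_independent(a)


def standby_imperfect_switching(a: float, coverage: float) -> float:
    """Assumed model: the standby covers a failure of the active component
    only if the switch succeeds (probability `coverage`) and the standby
    itself is available."""
    return a + (1.0 - a) * coverage * a


if __name__ == "__main__":
    a = 0.99  # hypothetical component availability
    print(f"independent:         {parallel_independent(a):.6f}")
    print(f"common cause (1%):   {parallel_common_cause(a, 0.01):.6f}")
    print(f"switch coverage 90%: {standby_imperfect_switching(a, 0.90):.6f}")
```

With perfect coverage and no common cause both dependent models reduce to the independent result; for any other parameter values they yield a lower availability, which is the effect the independence assumption hides.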

In addition, operator interaction may be required for component recovery or switching, which makes operator errors a major cause of IT service unavailability [20], [21]. However, these errors are not represented in classical availability models.

In order to provide a general modeling approach that overcomes the independent failure assumption, several approaches that are applicable for IT service availability estimation from design information were recently developed. The majority of these approaches are based on the modeling of the availability state-space that allows for the introduction of dependencies [22]. However, the capability of these approaches for decision support in availability management is questionable since these approaches require a high modeling effort and were never integrated with optimization procedures in order to suggest (sub)optimal design configurations.

Such optimization procedures have been developed in the context of redundancy allocation problems (RAPs). Under this term, several reliability/availability models and optimization procedures are subsumed that can be utilized to optimize system design in terms of availability, cost and other constraints. To this end, required subsystems and possible component choices for each subsystem are modeled so that (sub)optimal combinations of component choices can be identified. Since the combinatorial computation of availability in these approaches assumes independent component failures, they are not applicable for IT service design. Nevertheless, the developed optimization procedures are mainly based on flexible meta-heuristics such as evolutionary algorithms, which can be applied to a wide range of problems. Therefore, the question arises whether these procedures can be integrated with IT service availability estimation methods for IT service design optimization.
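The combinatorial availability computation that classical RAPs rely on can be sketched as follows for a series arrangement of subsystems, each consisting of redundant components in parallel. This is a generic textbook formulation under the independence assumption, not the ITRAP estimation method; the function names and the additive cost model are assumptions made for illustration.

```python
from math import prod
from typing import Sequence


def series_parallel_availability(design: Sequence[Sequence[float]]) -> float:
    """Availability of subsystems in series, each a parallel group of components.

    Assumes independent component failures: a subsystem is down only if all
    of its components are down, and the system is up only if every subsystem
    is up.
    """
    return prod(1.0 - prod(1.0 - a for a in subsystem) for subsystem in design)


def design_cost(costs: Sequence[Sequence[float]]) -> float:
    """Total cost as the plain sum of component costs (an assumed cost model)."""
    return sum(sum(subsystem) for subsystem in costs)


if __name__ == "__main__":
    # Hypothetical design: two redundant servers followed by a single database.
    availabilities = [[0.99, 0.99], [0.95]]
    costs = [[10.0, 10.0], [5.0]]
    print(series_parallel_availability(availabilities), design_cost(costs))
```

Because the formula multiplies component probabilities directly, it has no place for common cause failures, imperfect switching or operator errors, which is why it cannot be reused unchanged for IT service design.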

The goal of this work is to provide decision support for IT service designers. To this end, a redundancy allocation problem is defined that models the relevant aspects of IT service availability and costs depending on possible IT service designs. In the course of the paper, this problem is referred to as the ITRAP. In combination with a suitable IT service availability estimation method and solution algorithm, the ITRAP can be utilized to optimize availability and costs of an IT service based on design information.

A constructivist approach is followed in order to achieve this goal (cf. e.g. [23], [24]). In Section 2, the related literature is presented to outline the relevance of the investigated problem. This section also identifies suitable approaches to IT service availability estimation and redundancy allocation optimization on which a RAP for IT service design can be built. Based on the literature analysis, requirements for a RAP for IT service design are derived.

These requirements as well as the ITRAP artifact are presented in Section 3. This artifact is a framework designed to reach the goal of this work and consists of the problem definition, an availability and cost estimation method based on Petri net Monte Carlo simulation, and two adapted solution algorithms (a genetic algorithm and a tabu search). The artifact is evaluated by applying a prototypical implementation of the ITRAP to an IT service design optimization in a real-world use case at an international application service provider, which is presented in Section 4. Section 5 concludes the paper by discussing its contribution and providing an outlook on further research activities.
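The idea of estimating availability by Monte Carlo simulation can be sketched with a much simpler model than the Petri net used in the ITRAP: a single repairable component that alternates between up and down states. The exponential up and down times, the function name and all parameters below are assumptions chosen for illustration only.

```python
import random


def simulate_availability(mttf: float, mttr: float, horizon: float,
                          replications: int = 20, seed: int = 0) -> float:
    """Estimate the time-averaged availability of one repairable component.

    Up and down durations are drawn from exponential distributions with
    means mttf and mttr (an assumption). The estimate is the fraction of
    the simulated time that the component spends in the up state.
    """
    rng = random.Random(seed)  # seeded for reproducible estimates
    total_up = 0.0
    for _ in range(replications):
        clock, up_time, is_up = 0.0, 0.0, True
        while clock < horizon:
            mean = mttf if is_up else mttr
            # Truncate the last sojourn so the run ends at the horizon.
            duration = min(rng.expovariate(1.0 / mean), horizon - clock)
            if is_up:
                up_time += duration
            clock += duration
            is_up = not is_up
        total_up += up_time
    return total_up / (replications * horizon)


if __name__ == "__main__":
    estimate = simulate_availability(mttf=100.0, mttr=1.0, horizon=10_000.0)
    analytic = 100.0 / (100.0 + 1.0)  # steady-state value MTTF / (MTTF + MTTR)
    print(f"simulated: {estimate:.4f}, analytic: {analytic:.4f}")
```

For a single independent component the simulation merely reproduces the analytic steady-state value. The benefit of simulation appears once state-dependent behavior such as shared faults or operator actions is added, for which closed-form expressions are generally not available.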

Section snippets

The redundancy allocation problem

In reliability/availability optimization, a redundancy allocation problem (RAP) can be utilized to determine suitable redundancy configurations for system design. For this purpose, it is instantiated with system design information, e.g. about reliability characteristics of possible component choices for required functional units of a system. On that basis, optimization algorithms identify (sub)optimal solutions in terms of availability, cost and other constraints. The first RAPs were defined during

Requirements analysis

A RAP for IT service design can be described by the optimization problem in Eq. (1): for a given time t, designs X have to be found in which the costs of the service C(X,t) are minimized while the service availability A(X,t) is maximized. A multi-objective problem is chosen since it is more flexible with respect to resource constraints [25], and thus can support the definition of feasible and cost-efficient service-level objectives for availability in the design stage.

$$\min C(X,t), \qquad \max A(X,t) \tag{1}$$

With
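For a bi-objective problem of this kind, candidate designs are usually compared by Pareto dominance rather than by a single score. The following sketch filters a list of already evaluated designs down to its non-dominated set; the tuple layout and the candidate values are hypothetical and serve only to illustrate the concept.

```python
from typing import List, Tuple

Design = Tuple[str, float, float]  # (label, cost, availability)


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b in both objectives and better in one.

    Cost is minimized and availability is maximized.
    """
    no_worse = a[1] <= b[1] and a[2] >= b[2]
    better = a[1] < b[1] or a[2] > b[2]
    return no_worse and better


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


if __name__ == "__main__":
    # Hypothetical candidates; the numbers are invented for illustration.
    candidates = [
        ("single server", 10.0, 0.990),
        ("two servers", 20.0, 0.9999),
        ("two servers, pricier", 25.0, 0.9999),
        ("three servers", 30.0, 0.99999),
    ]
    for label, cost, availability in pareto_front(candidates):
        print(label, cost, availability)
```

The resulting set contains every cost and availability trade-off that is not strictly beaten by another design, which leaves the choice of a concrete service-level objective to the decision maker.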

Evaluation

In this section, the concept of the ITRAP is evaluated by applying its prototypical implementation to the optimization of the IT service design in a real-world use-case. This design is required by an IT service provider that hosts SAP ERP systems (application service provider – ASP) for hundreds of customers all over the world and is described in the following.

Conclusion

Unavailability of IT services is inconvenient for both providers and consumers. Besides a loss of reputation and possible opportunity costs for the IT service provider, violations of the service-level agreements may lead to penalty costs. Although ensuring sufficient IT service availability is a crucial task that is mainly affected by IT service design, there is a lack of suitable decision support systems that help IT service designers build highly available systems at low cost.

As a

References (93)

  • M.-S. Chern

    On the computational complexity of reliability redundancy allocation in a series system

    Oper Res Lett

    (1992)
  • D. Coit et al.

    Solving the redundancy allocation problem using a combined neural network/genetic algorithm approach

    Comput Oper Res

    (1996)
  • T.-C. Chen et al.

    Immune algorithms-based approach for redundant reliability problems with multiple component choices

    Comput Ind

    (2005)
  • T. Taguchi et al.

    Optimal design problem of system reliability with interval coefficient using improved genetic algorithms

    Comput Ind Eng

    (1999)
  • M. Ouzineb et al.

    Tabu search for the redundancy allocation problem of homogenous series–parallel multi-state systems

    Reliab Eng Syst Saf

    (2008)
  • S. Wang et al.

    Modelling redundancy allocation for a fuzzy random parallel–series system

    J Comput Appl Math

    (2009)
  • J. Safari

    Multi-objective reliability optimization of series–parallel systems with a choice of redundancy strategies

    Reliab Eng Syst Saf

    (2012)
  • A. Dolatshahi-Zand et al.

    Design of SCADA water resource management control center by a bi-objective redundancy allocation problem and particle swarm optimization

    Reliab Eng Syst Saf

    (2015)
  • A. Chambari et al.

    An efficient simulated annealing algorithm for the redundancy allocation problem with a choice of redundancy strategies

    Reliab Eng Syst Saf

    (2013)
  • L. Wang et al.

    A coevolutionary differential evolution with harmony search for reliability–redundancy optimization

    Exp Syst Appl

    (2012)
  • Y.-C. Hsieh et al.

    An effective immune based two-phase approach for the optimal reliability–redundancy allocation problem

    Appl Math Comput

    (2011)
  • G. Kanagaraj et al.

    A hybrid cuckoo search and genetic algorithm for reliability–redundancy allocation problems

    Comput Ind Eng

    (2013)
  • S. Kulturel-Konak et al.

    Multi-objective tabu search using a multinomial probability mass function

    Eur J Oper Res

    (2006)
  • J. Ward et al.

    Strategic planning for information systems

    (2002)
  • A. Keller et al.

    The WSLA framework: specifying and monitoring service level agreements for web services

    J Netw Syst Manag

    (2003)
  • E. Zambon et al.

    A2thOS: availability analysis and optimisation in SLAs

    Int J Netw Manag

    (2012)
  • D. Siewiorek et al.

    The theory and practice of reliable system design

    (1982)
  • L. Hunnebeck

    ITIL service design

    (2011)
  • Henschen D. Amazon Outage Scrooges Netflix, Heroku, URL:...
  • Keizer G. Apple service outage stretches into hours, URL:...
  • Whitney L. Microsoft pins Hotmail, Outlook outage on hot data center, URL:...
  • Tweney D. 5-minute outage costs Google $545,000 in revenue, URL:...
  • Csaplar D. The cost of downtime is rising, URL:...
  • D.-H. Chi et al.

    Optimal design for software reliability and development cost

    IEEE J Sel Areas Commun

    (1990)
  • International Organization for Standardization. ISO/IEC 20000-1;...
  • Information Systems Audit and Control Association. COBIT 5, ISACA;...
  • Terlit D, Krcmar H. Generic Performance Prediction for ERP and SOA Applications. In: Proceedings of the 18th European...
  • Callou G, Maciel P, Tutsch D, Araújo J, Ferreira J, Souza R. A Petri net-based approach to the quantification of data...
  • N. Milanovic et al.

    Automatic generation of service availability models

    IEEE Trans Serv Comput

    (2011)
  • Bosse S. Predicting an IT services availability with respect to operator errors. In: Proceedings of the 19th Americas...
  • U. Franke et al.

    An architecture framework for enterprise IT service availability analysis

    Softw Syst Model

    (2014)
  • Trivedi K, Ciardo G, Dasarathy B, Grottke M, Matias R, Rindos A, et al. Achieving and assuring high availability. In:...
  • Hevner AR, March ST, Park J, Ram S. Design science in information systems research. MIS Q....
  • K. Peffers et al.

    A design science research methodology for information systems research

    J Manag Inf Syst

    (2008)
  • W. Kuo et al.

    An annotated overview of system-reliability optimization

    IEEE Trans Reliab

    (2000)
  • R. Soltani

    Reliability optimization of binary state non-repairable systems: a state of the art survey

    Int J Ind Eng Comput

    (2014)