1 Introduction

In 1950, 29.6% of the world’s population lived in urban areas. Since then, this percentage has increased every year, reaching 55.3% in 2018. The United Nations (2018) expects this trend to continue, such that by 2050, an estimated 68% of the world’s population will be living in urban areas. North America and Europe already have reached 82.2% and 74.5% urbanization, respectively; by 2050, those values will likely be 89% in North America and 83% in Europe. In line with these trends, Vienna’s population—like that of many cities all over the world—is growing and likely will exceed two million by 2025 (Statistik Austria 2017; Hanika 2018). Such population growth, combined with traffic congestion, efforts to reduce emissions, and municipal ambitions to improve the quality of life for residents (e.g., pedestrianization, reducing automobile and truck traffic) and tourists, makes it necessary to readjust complex urban public transportation systems constantly. Strategic and tactical planners must determine whether existing provisions are effective, now and in the future.

Complex urban mass rapid transit systems (e.g., subway, metro, tube, underground, heavy rail) are a critical component of successful cities. They were introduced to improve movement in urban areas and reduce congestion. These motives remain valid but—as initially mentioned—also are reinforced by pedestrianization, tourism, and environmental demands. Such networks are ubiquitous: 25% of cities with a population of one million, 50% of cities with two million, and all cities with more than ten million inhabitants host them (Roth et al. 2012). In some cities, they have existed for more than a century, and there are significant similarities among the different networks, despite the unique cultures, economies, and historical developments in each city. That is, most networks consist of a set of stations delimited by a ring-shaped core from which branches extend beyond the city. Subway networks also tend to experience similar peak times, in the morning and afternoon, reflecting increased demand by employees traveling to or from their workplaces (Sun et al. 2014). In turn, the same issue confronts all subway systems: fluctuating demand must be satisfied by adjusting and re-adjusting transport capacity.

The term headway refers to the time difference between two consecutive vehicles. Changes to headway (or its inverse, frequency, defined as vehicles per unit of time) might lead to overcrowded train stations and vehicles, unless they are carefully planned. This challenge constitutes the transit network frequencies setting problem (TNFSP), whose solution demands balancing capital and operational expenditures (e.g., infrastructure preservation, potential expansion) with passenger satisfaction (i.e., service level). A balance between these conflicting goals produces optimal time-dependent headways for each passenger line.

The planning process for public transportation usually proceeds in the following order: (1) network and line planning, (2) frequency (i.e., headway) setting, (3) timetabling, (4) vehicle scheduling, (5) duty scheduling, and (6) crew rostering (Ceder and Wilson 1986; Ceder 2001; Guihaire and Hao 2008; Liebchen 2008). In typical planning approaches, each planning stage provides the input for the subsequent one; headway optimization thus constitutes the second step. Comprehensive surveys of public transport planning are available in Guihaire and Hao (2008), Farahani et al. (2013), and Ibarra-Rojas et al. (2015). For example, Guihaire and Hao (2008) and Ibarra-Rojas et al. (2015) review 26 contributions dealing with headway optimization, showing that early works assume fixed demand and are based on analytic models (Newell 1971; Salzborn 1972; Schéele 1980; Han and Wilson 1982). Six of the 26 works (Shrivastava et al. 2002; Shrivastava and Dhingra 2002; Yu et al. 2011; Huang et al. 2013; Li et al. 2013; Wu et al. 2015), as well as Ruano et al. (2017) more recently, employ genetic algorithms for the headway optimization problem. Most of them also use non-linear models, though Li et al. (2013) offer a simulation model. Evolutionary algorithms have been applied in similar settings (Zhao and Zeng 2006; Guihaire and Hao 2008; Yu et al. 2011), though only a few contributions address time-dependent demand (Niu and Zhou 2013; Sun et al. 2014). Herbon and Hadas (2015) focus on morning and afternoon peaks and use a generalized newsvendor model. Unlike studies that attempt to optimize a single line (e.g., the latter two and Ceder, 1984, 2001), we seek to optimize several lines (i.e., a whole network) at once (Yu et al. 2011). Following sparse contributions (Vázquez-Abad and Zubieta 2005; Mohaymany and Amiripour 2009; Ruano et al. 2017) that apply simulation to a problem-specific context, we turn to simulation optimization. This approach has proven successful in similar application contexts (Osorio and Bierlaire 2013; Osorio and Chong 2015; Chong and Osorio 2018).

Several problems arise when devising a model of a complex service system like an urban mass rapid transit network. The data pertaining to the structure of an existing transportation system (e.g., subway network’s lines and stations) are relatively easy to obtain, but passenger data are not. To model demand, we need to know how many passengers want to travel from one specific location to another and when. Prior contributions use count data (Ceder 1984), smart card data (Pelletier et al. 2011), or mobile phone data (Friedrich et al. 2010), and various technologies can support pedestrian counting and tracking (Bauer et al. 2009). We employ hourly origin-destination-matrices, originally created by the MatchMobile project (IKK 2017), and infrared count data. It also is difficult to gauge passenger behavior and how passengers decide which route to take (Agard et al. 2007). Raveau et al. (2014) identify regional differences in passenger behavior too. For reviews on route choice, the reader is referred to Bovy and Stern (1990) and Frejinger (2008).

This article seeks to add to this line of literature by making several contributions. We propose a multi-objective simulation-based headway optimization scheme for complex urban mass rapid transit systems that is inspired by a real-world case but generic enough to fit other cities’ complex rail-bound public transport systems. Three evolutionary algorithms, applied to solve the associated headway optimization problem, are embedded in a simulation optimization framework. In addition to two traditional multi-objective evolutionary algorithms, we apply a newly developed two-phase algorithm. The first phase is a single-objective covariance matrix adaptation evolutionary strategy (SO-CMA-ES), and the second is a problem-specific multi-directional local search (MD-LS). Finally, we conduct a comprehensive computational evaluation, using test instances based on a real-world subway network.

Section 2 contains an introduction to the Viennese subway system and presents the objective functions (including constraints). In Sect. 3, we describe the solution method and its building blocks, namely, the discrete event simulation model and (heuristic) optimization algorithms. Next, Sect. 4 explains the computational experiment setup and the process of tuning the applied algorithms’ respective parameters, then Sect. 5 presents the results. Last, Sect. 6 concludes and proposes some research directions.

2 Problem statement

We introduce the Viennese subway system (including network features and passenger volume) in Sect. 2.1. Section 2.2 details the objective functions, alternatives, and the decision process.

2.1 The Viennese subway network

The Viennese subway network is a complex mass rapid transit system. As of 2016, it had a total length of 78.5 km and spanned five lines. Figure 1 depicts the whole subway network, with crossing stations marked with abbreviated station identifiers. There are 104 stations at 93 locations; at ten of these locations, two or three (in a single case, Karlsplatz, KP) lines intersect. The abbreviation SC marks Stephansplatz (i.e., the city center) and its renowned landmark St. Stephen’s Cathedral. The U5 line will be constructed between 2019 and 2024. Table 1 contains some facts and figures about the Viennese subway system.

Fig. 1
figure 1

Schematic plan of the Viennese subway network as of 2016 (crossing stations are marked with abbreviated station identifiers)

Table 1 Facts and figures on the Viennese subway system

During the course of a regular weekday (work and school day from Monday to Thursday), there are 1.37 million passenger movements. Figure 2 depicts a simulation of the passenger volume with the currently employed solution. There are two peaks: between 7:00 and 9:00 and between 16:00 and 19:00. The U1 and U3 lines carry the highest passenger volume, followed by U4 and U6; U2 has the lowest passenger load. The U1 has its highest morning peak between 7:00 and 8:00, whereas the other lines peak between 8:00 and 9:00. During the afternoon peak, U3 experiences the highest volume. This is because passengers do not only move from home to work and straight back again but also head to other destinations in between. Between the U3 stations Westbahnhof (W) and Volkstheater (VT), a pedestrianized shopping area called Mariahilfer Straße attracts many passengers. The passenger volume reaches 24,000 during the 2-h morning peak and 23,000 during the 3-h afternoon peak.

Fig. 2
figure 2

Passenger volume over time (simulation of currently employed headways)

2.2 Objective functions and constraints

As mentioned in Sect. 1, there are two conflicting goals: cost minimization (measured in productive fleet mileage m) and service level maximization (measured in average waiting time per passenger, denoted w). We seek an appropriate trade-off that achieves the optimal hourly headway for each line of the Viennese urban mass rapid transit system. The productive fleet mileage m is defined as the sum of all vehicle runs (i.e., how often vehicles were released) times the respective line’s length, so it does not contain unproductive mileage such as turning maneuver distances. Equations 1 and 2 contain the objective functions for these aforementioned target measures; because of their different scale (tens of thousands of kilometers of fleet mileage versus a few minutes of average waiting time), we employ typical normalization and scalarization approaches. We must rely on a stochastic simulation model for the solution evaluation, so we can only approach Pareto-optimal solutions, without guaranteeing exact optimality. Therefore, our goal is to determine a good approximation set of the Pareto front, that is, a set of objective vectors (each corresponding to a solution candidate), such that no vector in that set dominates another one in the same set (Zitzler et al. 2003). An objective vector dominates another vector if it is at least as good as the latter in all objectives, and better for at least one objective.

$$\begin{aligned} \min Z_1 = \frac{m-m_{min^*}}{m_{max}-m_{min^*}}. \end{aligned}$$
(1)
$$\begin{aligned} \min Z_2 = \frac{w-w_{opt}}{w_{max^*}-w_{opt}}. \end{aligned}$$
(2)

Because the first phase of our two-phase algorithm (Sect. 3.2) is a single-objective algorithm, we apply the traditional weighted sum-based approach and transform a bi-objective into a single-objective optimization problem (Eq. 3). This approach has been used previously as well (Schmaranzer et al. 2018, 2019); for this study its use is limited to the first phase of our optimization scheme and the determination of the required minimum/maximum objective values (Sect. 4). Table 2 lists the notation.

$$\begin{aligned} \min Z = Z_1 \cdot \varphi + Z_2 \cdot (1-\varphi ). \end{aligned}$$
(3)
Table 2 Notation
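For illustration, Eqs. 1–3 translate into code as in the following minimal sketch; the bound constants are placeholders standing in for the instance-specific values later reported in Table 4, not the actual Viennese figures.

```java
/** Minimal sketch of the normalization (Eqs. 1 and 2) and scalarization (Eq. 3);
 *  the bound values are assumed placeholders, the real ones are instance-dependent. */
public final class Objectives {
    // assumed example bounds (km and minutes)
    static final double M_MIN = 20_000, M_MAX = 120_000;  // feasible min / technical max fleet mileage
    static final double W_OPT = 1.0,    W_MAX = 15.0;     // optimal / worst average waiting time

    /** Eq. 1: normalized fleet mileage (deterministic). */
    static double z1(double mileageKm) {
        return (mileageKm - M_MIN) / (M_MAX - M_MIN);
    }

    /** Eq. 2: normalized average waiting time per passenger (stochastic, simulation-based). */
    static double z2(double avgWaitingMin) {
        return (avgWaitingMin - W_OPT) / (W_MAX - W_OPT);
    }

    /** Eq. 3: weighted-sum scalarization used in the first (single-objective) phase. */
    static double z(double z1, double z2, double phi) {
        return phi * z1 + (1.0 - phi) * z2;
    }
}
```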

Equation 1, related to fleet mileage, is deterministic (i.e., not subject to randomness). Equation 2, regarding the average waiting time per passenger, is stochastic, due to the randomness of passenger creation (Poisson process) and other stochastic influences (e.g., vehicle travel times between stations, passenger transfer times; see Sect. 3.1). Replications (i.e., simulation re-runs of the model; Sect. 3.1) are required to obtain statistically reliable estimates. We employ a varying number of replications (see Sect. 4 and Algorithm 4), based on the weight given to the waiting time component of the objective. In the single-objective case (i.e., using Eq. 3), the weight \(\varphi \) influences the variance of the objective value Z; therefore, the average number of replications varies and is negatively correlated with \(\varphi \). In a multi-objective setting—featuring the direct use of Eqs. 1 and 2, where only the latter is of consequence—the average number of replications does not vary. The sole constraint type is the stations’ respective platform utilization \(u_{sp}\). If a platform’s capacity (two people per square meter) is exceeded, the solution is infeasible. The capacity of vehicles is limited too, but it does not directly cause infeasibility, because waiting passengers who are unable to board an overcrowded vehicle continue waiting, increasing the average waiting time w per passenger, which reduces the solution’s quality in that regard.

Guihaire and Hao (2008) and Ibarra-Rojas et al. (2015) cite several other objectives and constraints applied in similar works. A typical determinant of headway optimization is fleet size. Its minimization can be used as an objective (Salzborn 1972), and its current or maximum value can be used as a constraint (Furth and Wilson 1981; Han and Wilson 1982). Because demand fluctuates, we decided, in collaboration with the Viennese public transportation provider, against using the fleet size as an objective. Similarly, the number of vehicle runs measures how often a line’s vehicles go back and forth (Ceder 1984). Unlike fleet size, it is not driven by peak hours, but it disregards the lines’ respective lengths. For these reasons, we use the productive fleet mileage as the objective from the transportation provider’s view (Eq. 1). Because the transportation provider uses proportional cost rates (i.e., operating and maintenance cost per productive kilometer), the total cost can be derived immediately from the mileage.

In terms of the customer’s view (i.e., service level), some studies use passenger-related total or average times like travel time (Schéele 1980; Dollevoet et al. 2015) or waiting time (Furth and Wilson 1981). An extensive sensitivity analysis of a previous version of the simulation model (Schmaranzer et al. 2016) revealed that the travel time is only driven by the waiting time, and a passenger’s in-vehicle time increases only in bunching situations (i.e., when there are too many active vehicles in a line, so they start blocking and slowing down one another). The current model ensures a minimum headway (i.e., a minimum time before another vehicle can be released), so this situation is no longer likely; instead, the in-vehicle time remains consistent, as does the transfer time. The latter holds because alighting passengers do not use space on the waiting platform, and there are no limitations on concurrent transfer operations. For more details on both, see Sect. 3.1.

3 Methodology

A subway system constitutes a queuing network with synchronization and non-exponentially distributed service times. It is too complex to be investigated using simple analytic methods from queuing theory like Jackson networks (Jackson 1963) and their extensions. Instead we apply simulation optimization, as introduced by Fu (2002). For current reviews on the general method, the reader is referred to Juan et al. (2015) and Amaran et al. (2016). As a functional principle (Fig. 3), an optimizer or algorithm—in our case an evolutionary algorithm—generates candidate solutions that are handed over to a simulation model for evaluation. The result of this process (i.e., solution quality and feasibility status) then reenters the optimizer, which generates new and hopefully better solutions. Section 3.1 describes the core of the optimization approach, namely the simulation model itself. The optimization-related part, in particular our proposed bi-objective approach, is the subject of Sect. 3.2.

Fig. 3
figure 3

Functional principle of simulation optimization (Fu 2002)
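This feedback loop can be sketched with two hypothetical interfaces; these are illustrative abstractions only, not the AnyLogic or HeuristicLab APIs used in our implementation.

```java
/** Hypothetical interfaces sketching the optimizer-simulation feedback loop of Fig. 3. */
interface Simulator {
    /** Evaluates a candidate (headways per line) and returns {Z1, Z2}, or null if infeasible. */
    double[] evaluate(double[] headways);
}

interface Optimizer {
    double[][] nextCandidates();                       // e.g., an evolutionary algorithm's offspring
    void report(double[] candidate, double[] objectives, boolean feasible);
    boolean terminated();                              // e.g., run-time budget exhausted
}

final class SimulationOptimization {
    static void run(Optimizer optimizer, Simulator simulation) {
        while (!optimizer.terminated()) {
            for (double[] candidate : optimizer.nextCandidates()) {
                double[] objectives = simulation.evaluate(candidate);
                optimizer.report(candidate, objectives, objectives != null);
            }
        }
    }
}
```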

3.1 Discrete event simulation model

This section contains a brief description of the simulation model of the Viennese network as of 2016. Detailed model descriptions can be found in Schmaranzer et al. (2016, 2018, 2019). The second paper contains an earlier model version using data from 2014. The first published article not only uses older data (from 2012) but also does not consider passenger distribution along platforms and within vehicles.

The basic functional principle of a complex urban mass rapid transit system can be modeled by a directed graph, as in Fig. 4 for two lines. The three interacting entities are the stations, vehicles, and passengers in each line. Both lines contain five stations each (\(s_0\) to \(s_4\) and \(s_5\) to \(s_9\)). Vehicles only move on the black continuous arcs. At the end-of-line stations (\(s_0\), \(s_4\), \(s_5\), and \(s_9\)), additional arcs allow vehicles to perform a turning maneuver and possibly start anew. With the exception of stations \(s_2\) and \(s_7\), all stations have their own geographical location g; the others intersect at \(g_2\). In this case, additional directed arcs (italics, dotted) allow passengers to move to another line’s waiting platform. Initially, each passenger has a geographic origin and a destination (e.g., \(g_0\) to \(g_8\)) that defines a path that contains the crucial stations (e.g., \(s_0\), \(s_2\), \(s_7\), \(s_9\)). The path allows passengers to know where to begin their journey (\(s_0\)) and where to exit a vehicle, either because they have reached their final destination (\(s_9\)) or because they must perform a transfer (\(s_2\) to \(s_7\)).

Fig. 4
figure 4

Example of two intersecting lines (directed graph)
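For illustration, the passenger-routing part of this example graph can be encoded as a weighted digraph and searched with Dijkstra's algorithm, as in the sketch below; the 800 m inter-station distance is an assumed value, the 1400 m transfer penalty anticipates the weight reported later in this section, and the vehicle turning arcs are omitted.

```java
import java.util.*;

/** Sketch of the two-line example of Fig. 4 as a weighted digraph with a compact Dijkstra
 *  (Dijkstra 1959); distances are illustrative, only passenger-routing arcs are modeled. */
public final class ExampleNetwork {
    // adjacency list: station -> (neighbor -> arc weight in metres)
    static final Map<String, Map<String, Double>> arcs = new HashMap<>();

    static void addArc(String from, String to, double weight) {
        arcs.computeIfAbsent(from, k -> new HashMap<>()).put(to, weight);
        arcs.computeIfAbsent(to, k -> new HashMap<>());   // ensure every station is a vertex
    }

    public static void main(String[] args) {
        // line 1: s0..s4, line 2: s5..s9 (both directions), assumed 800 m between stations
        String[][] lines = {{"s0","s1","s2","s3","s4"}, {"s5","s6","s7","s8","s9"}};
        for (String[] line : lines)
            for (int i = 0; i + 1 < line.length; i++) {
                addArc(line[i], line[i + 1], 800);
                addArc(line[i + 1], line[i], 800);
            }
        // passenger transfer arcs at the shared location g2 (s2 <-> s7), penalized with 1400 m
        addArc("s2", "s7", 1400);
        addArc("s7", "s2", 1400);

        System.out.println(shortestPath("s0", "s9"));     // [s0, s1, s2, s7, s8, s9]
    }

    /** Plain Dijkstra over the arc weights; returns the station sequence of the cheapest path. */
    static List<String> shortestPath(String source, String target) {
        Map<String, Double> dist = new HashMap<>();
        Map<String, String> prev = new HashMap<>();
        PriorityQueue<String> queue = new PriorityQueue<>(
                Comparator.comparingDouble((String s) -> dist.getOrDefault(s, Double.MAX_VALUE)));
        dist.put(source, 0.0);
        queue.add(source);
        while (!queue.isEmpty()) {
            String u = queue.poll();
            if (u.equals(target)) break;
            for (Map.Entry<String, Double> arc : arcs.getOrDefault(u, Map.of()).entrySet()) {
                double alt = dist.get(u) + arc.getValue();
                if (alt < dist.getOrDefault(arc.getKey(), Double.MAX_VALUE)) {
                    dist.put(arc.getKey(), alt);
                    prev.put(arc.getKey(), u);
                    queue.remove(arc.getKey());            // re-insert with the updated priority
                    queue.add(arc.getKey());
                }
            }
        }
        LinkedList<String> path = new LinkedList<>();
        for (String v = target; v != null; v = prev.get(v)) path.addFirst(v);
        return path;
    }
}
```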

Each station is assigned to a specific line and has either one island platform or two separate platforms. The island platform variant potentially serves and shares its capacity with both directions. If there are separate platforms, or an island platform has an impassable wall in the middle, half of its capacity applies to each direction. The model distinguishes between platforms with and without shared capacity. Among 104 stations, 47 have an island platform without a physical barrier, and the remaining 57 have either two separate platforms or an island platform with a physical barrier, creating separate capacities. The platform capacity is two people per square meter, and it depends on its surface area (excluding safety distance between the waiting passengers and moving vehicles). As mentioned in Sect. 2.1, ten of 93 locations have two or three (single-case) stations, allowing passenger transfer between lines. Waiting platforms—as well as the upcoming vehicle entities—are about 120 m long. To account for passenger distribution, both are divided into three sections (front, middle, back) of about 40 m, or two wagons each. We included this factor at the request of the Viennese public transportation provider.

The creation of passenger entities is driven by a time-dependent Poisson process (i.e., exponentially distributed inter-arrival times) that starts at 04:45 and continues until 01:00. It requires hourly origin-destination-matrices; we use matrices based on those originally created by the MatchMobile project (IKK 2017) and infrared count data. The former are based on anonymous mobile phone data from the year 2014. Our industrial partner provided count data from 2016, which we used to create new, updated versions of the origin-destination-matrices. For the precise procedure, see Schmaranzer et al. (2019). Origin-destination-pairs represent the number of passengers who wish to travel, within a certain time frame (e.g., from 08:00 to 08:59), from one geographical location to another. As described in the example at the beginning of this section (Fig. 4), each passenger is assigned a path, generated by Dijkstra’s shortest path algorithm (Dijkstra 1959). Every line of the Viennese mass rapid transit system (Fig. 1) can be reached from any other line by at most two transfer operations, so the penalty weights for transfer arcs (see passenger transfer in Fig. 4) were set to 1400 m and 3.9 min. Without these penalty weights, passengers would be tempted to perform unnecessary transfers (i.e., more than two). The weight for non-transfer arcs is the symmetric distance between stations or the average of the two mean values of the vehicle travel times between two stations. Full factorial experiments revealed that when 85% of passengers make their path decisions based on distance and 15% base them on time, we obtain the smallest overall difference from the infrared count data; this combination is also deemed realistic by our industrial partner. A simulation of the full Viennese network (Fig. 6a) using the currently employed headways reveals that about 37% of all passengers perform one or two transfer operations.

A passenger’s journey begins at a station’s waiting platform and ends at the final destination. Provided that the platform still has free capacity, a new or transferring passenger gets assigned to a platform section and potentially boards the vehicle’s corresponding section. The passenger distribution along platforms is derived from infrared count data of boarding passengers collected at the vehicles’ doors. If the assigned platform section has no free capacity, the passenger moves on to a neighboring one. If the overcrowded section has two direct neighbors (i.e., it is the middle section), the passenger moves to either of them with a 50% chance. If a vehicle’s section is overcrowded, the passenger is forced to continue waiting on the platform, thereby increasing the average waiting time w per passenger. Once a vehicle has reached a passenger’s final destination or transfer station, the passenger alights. Several stations have separate platforms (i.e., one per direction), so transfer times can be direction-dependent. Realistic passenger transfer times were implemented by measuring the distances of all reasonable combinations (i.e., no transfers within the same line) at each geographic location with several stations and using the finding of Weidmann (1994) that the average walking speed of passengers is 1.34 m/s, with a deviation of ± 19%. The model uses a triangular distribution with ± 20% as minimum/maximum. Alighting passengers do not use space on the platform at which they arrive, and there are no limitations on concurrent transfer operations.
Due to the structural individuality of stations, some of which also serve as underpasses, modeling each station’s unique entries to platforms is beyond the scope of this study.
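A minimal sketch of the passenger-related stochastic elements just described is given below; the hourly rate, walking distance, and random seed are assumed example values rather than Viennese data, and the AnyLogic implementation itself is not reproduced.

```java
import java.util.Random;

/** Sketch of time-dependent Poisson passenger creation and transfer-time sampling;
 *  the rate and distance are assumed example values. */
public final class PassengerGeneration {
    static final Random RNG = new Random(42);

    /** Exponential inter-arrival times for one hour with a constant hourly OD rate (Poisson process). */
    static double nextArrivalMinutes(double passengersPerHour) {
        double ratePerMinute = passengersPerHour / 60.0;
        return -Math.log(1.0 - RNG.nextDouble()) / ratePerMinute;
    }

    /** Transfer time from walking distance: Weidmann's 1.34 m/s mean speed, triangular +/-20%. */
    static double transferMinutes(double distanceMetres) {
        double mean = distanceMetres / 1.34 / 60.0;          // minutes at average walking speed
        return triangular(0.8 * mean, mean, 1.2 * mean);
    }

    /** Inverse-transform sampling of a triangular(min, mode, max) distribution. */
    static double triangular(double min, double mode, double max) {
        double u = RNG.nextDouble(), f = (mode - min) / (max - min);
        return u < f ? min + Math.sqrt(u * (max - min) * (mode - min))
                     : max - Math.sqrt((1 - u) * (max - min) * (max - mode));
    }

    public static void main(String[] args) {
        // e.g., 600 passengers/h for one OD pair between 08:00 and 09:00 (assumed rate)
        double t = 0;
        int created = 0;
        while ((t += nextArrivalMinutes(600)) < 60.0) created++;
        System.out.printf("created %d passengers, sample transfer: %.2f min%n",
                created, transferMinutes(150));
    }
}
```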

Vehicles (in our case, subway trains) are released from the lines’ end-of-line stations, in accordance with the lines’ respective current headway setting, during their period of operation (from 04:30 until 01:00). Capacity depends on the vehicle type. The old U type will be replaced soon, so we focus on the remaining two. The U1 to U4 lines are served by the V type, which holds up to 878 people. The U6 line is served by type T, which holds up to 776 people. However, 100% vehicle utilization is highly undesirable and unrealistic, because at each stop, some people would have to make room and even temporarily get out to allow alighting passengers to leave. Furthermore, passengers might carry a bag, luggage, baby carriage, or bike with them. To account for these considerations, we reduce vehicle type capacity by around 20%, to 702 and 621 passengers, respectively. Their top speed is 80 km/h, and their average speeds during operation are 33.0 km/h (V type) and 30.6 km/h (T type).

The vehicle travel time refers to the time difference between a vehicle’s arrival times at two consecutive stations, so it includes the dwelling time at the first station. Dwelling time refers to the time difference between a vehicle’s arrival at a station and its departure. It includes boarding and alighting time, defined as the time difference between a standing vehicle opening its doors, thereby allowing aboard passengers to alight and waiting passengers to board the vehicle, and closing them again. A comprehensive statistical analysis of vehicles’ station arrival times (about 2500 samples each) reveals that vehicle travel time depends not on time but rather on direction. Conventional wisdom might suggest that the vehicle travel time suffers from increased passenger volume at peak hours (i.e., increased dwelling time), but instead, whenever the dwelling time is longer than expected, the driver compensates by increasing the vehicle’s speed. The top speed of 80 km/h represents a hard limit though. However, there seems to be enough buffer time within the schedule and enough restraint among passengers not to cause non-compensable delays. Most vehicle travel times appear almost normally distributed, with a longer tail on the right side, so we decided to use a log-normal distribution. Visual goodness-of-fit tests (Q–Q-plots and density-histogram plots) conducted in R provided strong support for this choice. A more detailed analysis can be found in Schmaranzer et al. (2019) and Schmaranzer et al. (2018). Once a vehicle has reached one end of a line, it remains in the station for 0.41 min (± 0.02), accounting for about 25 s of dwelling time. Thereafter, the productive fleet mileage m of the current solution increases by the respective line’s length, and the vehicle has to perform a turning maneuver. At some end-of-line stations, it is routine to turn vehicles after 20:30 simply by crossing over to the other direction’s rail, prior to the arrival at the last station, to save time and vehicles. However, in most cases—depending on the infrastructure of the respective end-of-line station—a turning maneuver takes 4–8 min (± 0.34 to ± 1.00). We employ triangular distributions for both durations.
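The fitted vehicle-related distributions can be sampled as in the following sketch; the mean, standard deviation, and maneuver duration are illustrative placeholders, not the station-pair- and direction-specific estimates used in the actual model.

```java
import java.util.Random;

/** Sketch of sampling the fitted vehicle-related distributions; parameter values are
 *  assumed placeholders. */
public final class VehicleTimes {
    static final Random RNG = new Random(7);

    /** Log-normal travel time between two consecutive stations, parameterized by the
     *  desired mean and standard deviation (in minutes) of the travel-time sample. */
    static double travelMinutes(double mean, double sd) {
        double sigma2 = Math.log(1.0 + (sd * sd) / (mean * mean));
        double mu = Math.log(mean) - 0.5 * sigma2;              // moment matching
        return Math.exp(mu + Math.sqrt(sigma2) * RNG.nextGaussian());
    }

    /** Turning maneuver at an end-of-line station: symmetric triangular around the measured duration. */
    static double turningMinutes(double mode, double spread) {
        double u = RNG.nextDouble(), min = mode - spread, max = mode + spread;
        double f = (mode - min) / (max - min);                  // = 0.5 for a symmetric triangle
        return u < f ? min + Math.sqrt(u * (max - min) * (mode - min))
                     : max - Math.sqrt((1 - u) * (max - min) * (max - mode));
    }

    public static void main(String[] args) {
        System.out.printf("travel: %.2f min, turning: %.2f min%n",
                travelMinutes(1.8, 0.2),      // assumed inter-station mean 1.8 min, sd 0.2 min
                turningMinutes(6.0, 1.0));    // assumed 6 +/- 1 min maneuver
    }
}
```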

We use key performance indicators to validate the model: vehicle cycle time (i.e., vehicle travel time from one end-of-line station to another) and passenger-based indicators (i.e., count data) at crossing stations. The deviations from the respective target values were low and were approved by the Viennese transportation provider.

This discrete event simulation model of the Viennese subway system could be used for other urban mass transit systems as well. The biggest obstacle would probably be devising the hourly origin-destination-matrices. Data about platform capacity, passenger distribution, vehicle transfer times, and turning maneuver times would be difficult to ascertain without the respective transportation provider’s support. The network itself (i.e., lines and their stations) and data on the vehicle capacity can easily be found on the internet.

Planned and unplanned service disruptions also occur in complex real urban mass rapid transit systems. Some of the network variants introduced in Sect. 4.1 (i.e., all except the full variant \(l=5\)) could be seen as the former, that is, a planned disruption of service on one or several lines. Unplanned disruptions are not part of our model. For studies of disruption management in general and recovery models on railway networks in particular, the reader is referred to Yu and Qi (2004) and Cacchiani et al. (2014), respectively.

The simulation model was developed in AnyLogic 7.0.3 (64 Bit, Linux) and uses some other Java libraries (JGraphT 1.0.1, Apache POI 3.15, and Apache Math 3.6.1).

3.2 Two-phase algorithm

Our two-phase algorithm starts by applying a weighted sum-based optimization approach to generate solutions, which then serve as starting points for an adaptation of the multi-directional local search algorithm (MD-LS), as in Tricoire (2012). Such a two-phase optimization concept has already been successfully applied to combinatorial optimization problems, like the dial-a-ride (Parragh et al. 2009) or vehicle routing problem (Matl et al. 2019).

Table 3 Parameters of the multi-directional local search
Fig. 5
figure 5

Example: second phase (multi-directional local search)

In our case, the single-objective covariance matrix adaptation evolutionary strategy (SO-CMA-ES) is used in the first phase (Hansen and Ostermeier 2001). Unlike genetic algorithms (Holland 1975), the SO-CMA-ES does not rely on crossover operators but creates new offspring by means of a sophisticated sampling approach (Sect. 4.2). The weight \(\varphi \) required for the single-objective function (Eq. 3) is set from zero to one in steps of 0.1, resulting in eleven start points, which are then handed over to the second phase. The run time budget t is split equally between the two phases, so using a varying number of start points was not possible.
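A compact sketch of this weight sweep is given below; the SO-CMA-ES call is represented by a stand-in interface, and the equal split of the phase-one budget across the eleven runs is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of phase one: eleven weighted-sum runs with phi = 0.0, 0.1, ..., 1.0, each
 *  yielding one start point for the second phase (multi-directional local search). */
final class PhaseOne {
    interface SingleObjectiveSolver {                       // stands in for one SO-CMA-ES run
        double[] optimize(double phi, long runTimeBudgetSeconds);
    }

    static List<double[]> startPoints(SingleObjectiveSolver solver, long totalBudgetSeconds) {
        List<double[]> starts = new ArrayList<>();
        long budgetPerRun = (totalBudgetSeconds / 2) / 11;  // assumed: half of t, split over 11 runs
        for (int i = 0; i <= 10; i++) {
            double phi = i / 10.0;                          // weight used in Eq. 3
            starts.add(solver.optimize(phi, budgetPerRun)); // best headway vector found for this phi
        }
        return starts;
    }
}
```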

In the second phase, we use our multi-directional local search (MD-LS), which requires the three parameters listed in Table 3. In Fig. 5 we provide a brief example and explain the basic idea of this scheme (including the aforementioned parameters). The first stage (Fig. 5a) requires a set F consisting of either the start points from phase one or the result of the preceding iteration—in which case it is already a non-dominated front (i.e., a Pareto front approximation). We use the selection distance parameter c (e.g., 0.1) to decide which of the already evaluated solutions in F should be added to the new set V, which only contains solutions selected for the next step (Fig. 5b). The first and last members of the sorted set F are always selected for set V; the others only get selected if their distance to the preceding selection is at least c. Next, up to 15 new neighboring solutions are created from the currently selected solution v by applying the step size parameter b (Fig. 5c). The mileage objective (Eq. 1) can be evaluated deterministically (as stated in Sect. 2.2), so we can quickly check whether a new neighbor is within reasonable bounds. In this third stage, we try to create up to 15 new neighbors whose objective value \(Z_1\) lies within a search area that is restricted on the y-axis. The number of tries for this stage is limited to 100, not to be confused with the number of attempted moves a per selected point. The lower and upper bounds are defined by the selection distance parameter \(c\pm 5\%\). Finally, the new neighbors are evaluated with the simulation model (Fig. 5d). Because we now have both objective values (i.e., \(Z_1\) and \(Z_2\)) for all neighbors, we can check whether the solutions are within the new search area, which is the overlap of the former search area on the y-axis and a new one on the x-axis that uses the neighboring selected points’ objective values as bounds. In case the current v has only one neighboring selected point (i.e., it is located at either end of the front), we use its current position and (0,1) or (1,0)—depending on the objective—as bounds. The best solution within the search area is then used as the new basis v. The number of attempted moves per v is limited by the parameter a.

Algorithm 1 gives a more detailed explanation of our approach. It takes a set F containing potential start points. The best solutions collector Y is initialized first. The outer loop iterates through all objectives in O, in our case \(Z_1\) and then \(Z_2\). Next, F gets sorted by the current objective o. In the very first iteration, we only have eleven evaluated solutions in F (i.e., those created by the first phase). In this case, and in the unlikely case that later, due to non-dominated sorting, there are eleven or fewer evaluated solutions in F, we select all of them (i.e., \(V \leftarrow F\)). Otherwise, the function “SelectStartPoints” (Algorithm 2) chooses which ones to add to V. It always adds the first and thereby best solution (with regard to the current objective o) to V. Thereafter, it ensures that the next additions keep a certain distance, using the parameter selection distance c as a measure. The last and thus worst solution for the current objective o also gets selected (i.e., added to V). We then iterate through the selected solutions in V and try to create up to 15 new neighbors using the current v as the basis.

figure a
figure b
figure c
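A minimal sketch of the function “SelectStartPoints” (Algorithm 2) as described above is given below; the selection distance is assumed to be measured in the current normalized objective, and at least two evaluated solutions are assumed to be present in F.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch of SelectStartPoints (Algorithm 2): always keep the best and worst solution w.r.t.
 *  the current objective, and intermediate ones only if they are at least c away from the
 *  previously selected one (distance assumed to be taken in the current normalized objective). */
final class StartPointSelection {
    record Solution(double[] headways, double z1, double z2) {
        double objective(int o) { return o == 0 ? z1 : z2; }
    }

    static List<Solution> selectStartPoints(List<Solution> front, int objective, double c) {
        List<Solution> sorted = new ArrayList<>(front);
        sorted.sort(Comparator.comparingDouble(s -> s.objective(objective)));
        List<Solution> selected = new ArrayList<>();
        selected.add(sorted.get(0));                              // best solution is always selected
        double last = sorted.get(0).objective(objective);
        for (int i = 1; i < sorted.size() - 1; i++) {
            double value = sorted.get(i).objective(objective);
            if (value - last >= c) {                              // keep the selection distance c
                selected.add(sorted.get(i));
                last = value;
            }
        }
        selected.add(sorted.get(sorted.size() - 1));              // worst solution is always selected
        return selected;
    }
}
```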

The function “CreateNewNeighbor” (Algorithm 3) works as follows: Because the basic idea is to change only part of the original solution v, we first iterate through every line of the solution vector. Each line has a 50% chance of being selected for small alterations. If a line is selected, a certain contiguous part of the solution vector is then chosen at random. The starting point of a potential alteration is marked by q, and the end e is another uniformly distributed point between q and d (the latter being the number of decision variables per line—see Table 2). The step size b is then applied between zero and two times (depending on the uniformly distributed integer h) to each position in the aforementioned random subsequence of the solution vector. Whether those positions’ values increase or decrease depends on the current objective o. Counter k ensures that the new neighbor n is not equal to the original solution v.
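The following sketch mirrors this description of “CreateNewNeighbor” (Algorithm 3); the retry loop stands in for counter k, and the direction of the change per objective is an assumption consistent with the two objectives (widening headways reduces mileage, tightening them reduces waiting time).

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch of CreateNewNeighbor (Algorithm 3); the solution vector stores the lines one after
 *  another, d decision variables per line, and b is the step size from Table 3. */
final class NeighborCreation {
    static final Random RNG = new Random();

    static double[] createNewNeighbor(double[] v, int lines, int d, double b, int objective) {
        double[] n;
        do {
            n = Arrays.copyOf(v, v.length);
            for (int line = 0; line < lines; line++) {
                if (RNG.nextBoolean()) continue;              // each line: 50% chance of alteration
                int offset = line * d;
                int q = RNG.nextInt(d);                       // start of the altered subsequence
                int e = q + RNG.nextInt(d - q);               // end, uniformly distributed in [q, d-1]
                int h = RNG.nextInt(3);                       // apply step size b zero, one, or two times
                for (int j = q; j <= e; j++)
                    // assumed direction: widen headways for Z1 (mileage), tighten them for Z2 (waiting)
                    n[offset + j] += (objective == 0 ? h * b : -h * b);
            }
        } while (Arrays.equals(n, v));                        // counter k in Algorithm 3: retry if unchanged
        return n;
    }
}
```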

As described in the small example (Fig. 5c), we try to create 15 new neighbors within mileage bounds. If, after 100 trials, we have not succeeded in creating those 15 neighbors, we stop the neighborhood generation procedure. Because \(Z_1\) is not subject to randomness, a relatively accurate value can be calculated without using the simulation model (± 1%). To get an accurate number for \(Z_1\) and the missing value for \(Z_2\), we perform a simulation-based evaluation of the new neighbors and remove those whose objectives are not within both bounds (i.e., overlapping search area) from the neighborhood N.

Finally, the neighboring solutions N get sorted according to the current objective; if the first solution in N is better than the one from which it originates, it becomes the newly selected starting point v. In each iteration, a temporary set Y stores the best solutions: provided that the new best solution is not identical to the last element in Y, it is added to Y. The parameter a (Table 3) determines how often this process is repeated before continuing with the next v. After a whole iteration through all objectives O is finished, the solutions of set Y are added to the initial set F. We then perform a non-dominated sorting of the combined set, thus creating a new Pareto front approximation F that serves as input for the next iteration. The stopping criterion is the run time limit t, which depends on the complexity of the respective test instance.

4 Computational experiment setup

As stated in Sect. 2.2, we apply a normalization and scalarization approach. Table 4 contains the minimum and maximum values for productive fleet mileage and average waiting time per passenger required for the objective functions (see Eqs. 1 and 2 as well as Table 2). The technically lowest possible headway is 1.5 min, so obtaining the maximum fleet mileage \(m_{{ max}}\) and optimal average waiting time \(w_{{ opt}}\) is straightforward: by using the aforementioned headway value of 1.5 min for all decision variables, these extreme values can be derived. Since the average waiting time is stochastic, we used 50 replications to ensure statistical significance. Because up to 37% of passengers (depending on the network variant) perform one or two transfer operations, the optimum average waiting time is significantly higher than the expected average waiting time per waiting process (0.75 min). Due to the capacity constraint (Sect. 2.2), the lowest feasible fleet mileage \(m_{{ min}^*}\) and the corresponding highest average waiting time \(w_{{ max}^*}\) were more difficult to obtain: the weight \(\varphi \) used in Eq. 3 was set to one (i.e., only mileage minimization/optimization). We used 2-day long (i.e., 48 h) parallel optimization runs (one core for the optimizer, 15 for the simulation model) with the covariance matrix adaptation evolution strategy CMA-ES (Sect. 3.2), and used 21 decision variables d per line l (i.e., network variant), to determine the lowest feasible fleet mileage \(m_{{ min}^*}\). To speed up optimization, and because the fleet mileage is deterministic, three replications were used to increase the likelihood of the optimized solution being feasible. Thereafter, we ran 50 replications on the found solutions (one per network variant) to get an accurate value for the highest average waiting time \(w_{{ max}^*}\) and to ensure the respective solution’s feasibility.

Table 4 Passenger data, \(m_{{ min}^*}\), \(m_{{ max}}\), \(w_{{ opt}}\) and \(w_{{ max}^*}\) per network variant

As mentioned in Sect. 2.2, there are several stochastic elements, so replications are required to obtain statistically reliable estimates. We employ a varying number of replications, with a minimum of three and a maximum of 50. The sequential evaluation process terminates once a 99.9% confidence interval with a relative error of one percent has been constructed. Algorithm 4 contains the function used for this purpose. The procedure works as described by Law (2013). The criterion (as depicted in Algorithm 4) is the mean waiting time w per passenger (\(Z_2\)) for all multi-objective algorithms and the multi-directional local search (i.e., the second phase); for the SO-CMA-ES (first phase), it is Z (Eq. 3). If a platform’s capacity is exceeded (Sect. 2.2), the simulation is terminated immediately, the solution is deemed infeasible, and, to save time, no (more) replications are performed.

figure d
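A sketch of this sequential procedure is given below, using the t-quantile from Apache Commons Math (already a dependency of the simulation model); the Replication interface is a stand-in for one simulation run.

```java
import org.apache.commons.math3.distribution.TDistribution;

/** Sketch of the sequential replication procedure (Algorithm 4, following Law 2013):
 *  replicate until a 99.9% confidence interval with 1% relative error is reached,
 *  with at least 3 and at most 50 replications. */
final class SequentialReplications {
    interface Replication { double run(double[] headways, long seed); }   // returns avg. waiting time w

    static double evaluate(Replication sim, double[] headways) {
        final int minRep = 3, maxRep = 50;
        final double confidence = 0.999, relativeError = 0.01;
        double sum = 0, sumSq = 0;
        int n = 0;
        while (n < maxRep) {
            double w = sim.run(headways, n);                   // one simulation replication
            sum += w; sumSq += w * w; n++;
            if (n < minRep) continue;
            double mean = sum / n;
            double variance = (sumSq - n * mean * mean) / (n - 1);
            double t = new TDistribution(n - 1)
                    .inverseCumulativeProbability(1 - (1 - confidence) / 2);
            double halfWidth = t * Math.sqrt(variance / n);
            // Law's adjusted relative-error criterion: gamma' = gamma / (1 + gamma)
            if (halfWidth / Math.abs(mean) <= relativeError / (1 + relativeError)) break;
        }
        return sum / n;                                        // estimated average waiting time
    }
}
```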

The weight \(\varphi \), as used in the single-objective function (Eq. 3), influences the standard deviation of the objective value Z; therefore, the average number of replications varies and is negatively correlated with \(\varphi \).

The huge number of samples (i.e., up to 1.37 million passenger movements) makes the standard deviation in average waiting time per passenger very small. Therefore, we introduce a “global denominator” that reduces the number of passenger entities and the capacities (platforms and vehicles) by a factor of ten. This step increased the standard deviation, but reduced the simulation run time significantly, by a factor of about six (0.58 instead of 3.52 s per run on an Intel i7-4770 with up to 3.9 GHz), with almost negligible inaccuracy with regard to the objective function (Eq. 2) values.

All experiments in this contribution were conducted on the Vienna Scientific Cluster 3 (VSC 2018), which is a high performance computing (HPC) cluster comprised of 2020 nodes, each one equipped with two Intel Xeon E5-2650v2 processors (2.6 GHz, eight cores) and at least 64 GB RAM.

4.1 Real-world test instances

The solution method described in Sect. 3 is applied to 16 different real-world test instances, based on the Viennese subway system, which serve two purposes: First, to tune (see Sect. 4.3) the respective parameters of the applied single- and multi-objective algorithms, we needed a quick instance combination (\(l\cdot d\)). Second, a proper comparison of different solution approaches necessitates multiple problem instances. The instances were created by using four different versions of the Viennese subway network l and four different numbers of decision variables d per line.

With regard to the network versions, Fig. 6a contains the whole Viennese subway network. It contains all five lines, so we refer to it as \(l=5\). The other variants (\(l=4\), \(l=3\), and \(l=2\); Fig. 6b–d, respectively) are reduced versions, created by removing one line at a time, according to each line’s passenger volume (see Fig. 2).

Fig. 6
figure 6

Network variants (schematic plan of the Viennese subway network)

For the number of decision variables per line, because the origin-destination-matrices change hourly, changing headways on an hourly basis is natural. The Viennese subway system operates from 04:45 to 01:00 (Monday–Thursday), so each line has 21 decision variables (\(d=21\)). To fill and empty the lines with vehicles, their release starts at 04:30 and ends at 01:00. In the hourly variant, the first decision variable of each line applies to the time period prior to 05:00. Because the origin-destination-matrices are also hourly, 21 is the highest value used for d. Other variants are 2- and 3-h long headways, which produce eleven (\(d=11\)) and seven (\(d=7\)) decision variables per line, respectively. The smallest version re-uses headways by means of indices and works as follows: each line has merely four decision variables (\(d=4\)). Those values are assigned to the 21 time periods via the index sequence {0, 0, 1, 2, 2, 3, 1, 1, 3, 3, 3, 3, 2, 2, 2, 3, 1, 1, 0, 0, 0}. The fourth and fifth as well as the 13th, 14th, and 15th periods, for example, all have an index of two, so a line’s value at this index is used as the headway for both the morning (07:00–09:00) and the afternoon (16:00–19:00) peak. The total number of decision variables thus lies between eight and 105 (\(l\cdot d\)), and the latter variant is the largest or full real-world instance.
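The mapping from the four decision variables of a line to the 21 hourly periods can be sketched as follows; the example headway values are assumptions.

```java
/** Sketch of the index-based encoding with d = 4: a line's four headway values are
 *  mapped onto the 21 hourly periods via the fixed index sequence given above. */
final class IndexEncoding {
    static final int[] PERIOD_INDEX =
            {0, 0, 1, 2, 2, 3, 1, 1, 3, 3, 3, 3, 2, 2, 2, 3, 1, 1, 0, 0, 0};

    /** Expands a line's four decision variables into 21 hourly headways. */
    static double[] expand(double[] fourHeadways) {
        double[] hourly = new double[PERIOD_INDEX.length];
        for (int period = 0; period < hourly.length; period++)
            hourly[period] = fourHeadways[PERIOD_INDEX[period]];
        return hourly;
    }

    public static void main(String[] args) {
        // index 2 covers the morning (07:00-09:00) and afternoon (16:00-19:00) peaks
        double[] hourly = expand(new double[]{7.5, 5.0, 3.0, 4.0});   // assumed example headways
        System.out.println(java.util.Arrays.toString(hourly));
    }
}
```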

Its optimization run time limit t was set to 110 h (i.e., 110 h of optimization on a single CPU core). The run time of the other instances was set in relation to their total number of decision variables (15 min accuracy). The smallest one has an optimization run time of 13.75 h. One evaluation of all 16 test instances takes about 728.75 h of computation on a single CPU core. Given that three different algorithms are used and five independent optimization runs (not to be confused with replications) are performed, a total of 10,931.25 h is required. The experiments were conducted on the VSC3 (VSC 2018), as introduced at the end of Sect. 4. To speed up the whole process, several nodes with 16 CPU cores each were used simultaneously.

4.2 Other multi-objective algorithms and solution encoding

In addition to the two-phase algorithm (SO-CMA-ES and MD-LS) introduced in Sect. 3.2, we implemented the non-dominated sorting genetic algorithm II (NSGA-II) and the multi-objective covariance matrix adaptation evolutionary strategy (MO-CMA-ES) for comparison purposes.

The NSGA-II (Deb et al. 2002) is a genetic algorithm for multi-objective optimization. Its selection mechanism is rank-based and relies on the identification of non-dominated solutions in the population. Apart from the crossover and mutation probabilities, two parameters, the population size and the selected parents factor, need to be set. Crossover and mutation operators create new candidate solutions (i.e., offspring). We used three standard crossover operators for real-valued encodings: (1) an average crossover that calculates an average value of the parents’ values at the respective position of their gene material; (2) an arithmetic crossover that randomly performs an average calculation or simply takes the value from the first parent; and (3) a blend alpha crossover (Takahashi and Kita 2001) that calculates an interval and uses it as the boundary for a new random value. For each offspring to be created, one of the crossover operators is chosen at random. Thereafter, a fast non-dominated sorting procedure ranks all solutions, thereby structuring the population into several fronts.
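The three operators can be sketched as follows; whether the arithmetic crossover decides per position or per offspring is not specified above, so the per-position variant and the alpha value are assumptions.

```java
import java.util.Random;

/** Sketch of the three real-valued crossover operators; alpha is an assumed typical value. */
final class Crossovers {
    static final Random RNG = new Random();

    /** (1) Average crossover: position-wise mean of both parents. */
    static double[] average(double[] p1, double[] p2) {
        double[] child = new double[p1.length];
        for (int i = 0; i < child.length; i++) child[i] = 0.5 * (p1[i] + p2[i]);
        return child;
    }

    /** (2) Arithmetic crossover: per position, randomly average or copy from the first parent. */
    static double[] arithmetic(double[] p1, double[] p2) {
        double[] child = new double[p1.length];
        for (int i = 0; i < child.length; i++)
            child[i] = RNG.nextBoolean() ? 0.5 * (p1[i] + p2[i]) : p1[i];
        return child;
    }

    /** (3) Blend-alpha (BLX-alpha) crossover: sample uniformly from the parents' interval
     *  extended by alpha on both sides. */
    static double[] blendAlpha(double[] p1, double[] p2, double alpha) {
        double[] child = new double[p1.length];
        for (int i = 0; i < child.length; i++) {
            double lo = Math.min(p1[i], p2[i]), hi = Math.max(p1[i], p2[i]), range = hi - lo;
            child[i] = lo - alpha * range + RNG.nextDouble() * range * (1 + 2 * alpha);
        }
        return child;
    }
}
```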

The MO-CMA-ES (Igel et al. 2007) is the multi-objective version of the single-objective covariance matrix adaptation evolutionary strategy (SO-CMA-ES), which we used in the first phase of our two-phase optimization scheme (Sect. 3.2). Unlike the NSGA-II, which relies on crossover and mutation operators, it creates new candidate solutions by means of a sophisticated sampling approach. However, it is similar to the NSGA-II in terms of elitism and selection based on non-dominated sorting.

The algorithms’ initial populations are based on the currently employed, real-world headways. The base solution is the very first individual in the initial population, and the remaining ones are generated by sampling from a normal distribution with the currently employed headways as mean and 20% standard deviation.

Schmaranzer et al. (2019) study four different solution encoding variants (including continuous and discrete values) and find that the SO-CMA-ES achieved the best results, in less time, when using continuous factor encoding. This encoding uses factors instead of values, applied to the currently employed headways. If, for example, the initial headway is 5.0 min and the factor (i.e., decision variable) is 1.1, the resulting headway is 5.5 min. The desired lower and upper bounds for the final headways are 1.5 and 20 min, so the bounds for the factors have to be set accordingly.
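A sketch of this encoding is given below; the clamping to the 1.5–20 min range stands in for whichever bound handling the respective algorithm applies to the factors.

```java
/** Sketch of the continuous factor encoding: decision variables are factors applied to the
 *  currently employed headways, kept within the 1.5-20 min headway range. */
final class FactorEncoding {
    static final double MIN_HEADWAY = 1.5, MAX_HEADWAY = 20.0;   // minutes

    static double[] toHeadways(double[] factors, double[] currentHeadways) {
        double[] headways = new double[factors.length];
        for (int i = 0; i < factors.length; i++) {
            double h = factors[i] * currentHeadways[i];          // e.g., 1.1 * 5.0 min = 5.5 min
            headways[i] = Math.min(MAX_HEADWAY, Math.max(MIN_HEADWAY, h));
        }
        return headways;
    }
}
```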

As for software, we used several libraries from HeuristicLab 3.3.15 (Wagner et al. 2014), which is a metaheuristics framework developed in C#.

4.3 Algorithm parameter tuning

Algorithms have various parameters that must be tuned to fit the problem. We defined reasonable values for the parameters and ran full factorial experiments, as follows: the population size (i.e., front size) was set to 50, 75, 100, 150, and 200. Because the SO/MO-CMA-ES usually tend toward lower population sizes, the following values were tested: 35, 50, 65, 80, and 100. In the NSGA-II, the crossover and mutation probabilities were set to 90% and 100%, and to 5%, 10%, 15%, 20%, 30%, and 40%, respectively. The selected parents were set to 1.5 and 2.0 times the population size. The initial \(\sigma \) in both CMA-ES was set as a fraction of the parameter range; its resulting value thus depends on the solution encoding’s bounds. Values of 1/8, 1/6, and 1/4 were tested. For the remaining SO-CMA-ES parameters, we tested 0, 50, 100, 150, 200, 300, and 500 initial iterations. The \(\mu \) parameter was set to “NULL”, 1, 5, and 10. Finally, for the parameters of MD-LS, the step size was set to 0.005, 0.010, 0.025, 0.050, and 0.1. The number of attempted moves per point was set to 3, 5, 7, 10, and 15. Selection distance values of 0.05, 0.10, and 0.15 were tested.

The instance used for tuning has only two lines (\(l=2\)) and 21 parameters (\(d=21\)) per line (Sect. 4). This instance therefore has 42 decision variables (\(l\cdot d\)) in total and only one crossing station (Fig. 6d). This combination of network l and number of decision variables d per line offers a good compromise between low run time (due to the significantly reduced passenger volume) and a sufficiently high number of headways to be set. Table 5 contains the tuned parameter values for all algorithms. Note that the population size does not vary much in most cases. Higher population sizes (up to 200 here and up to 300 in Schmaranzer et al. 2018) have been tested, but lower ones lead to better results, likely due to the long evaluation time. Apart from the population size (or front size, in the case of the MD-LS), the algorithms have different parameters. If an algorithm does not feature a certain parameter, it is marked by “–”.

Table 5 Tuned parameters for all algorithms

The NSGA-II’s rather high mutation probability is likely due to the already decent currently employed solution, on which the initial population is based. Another explanation could be the low diversity within the initial population.

5 Computational results

This section is structured as follows: We examine the results of the first phase of our algorithm (Sect. 5.1), then present the multi-objective optimization results and summarize them in tabular form (Sect. 5.2). Next, Sect. 5.3 compares the resulting best Pareto front approximations. Section 5.4 details the real-world instance.

5.1 Analysis of the first phase

The result of the first phase is a set of eleven start points that serve as input for the second phase. Figure 7 contains the start points of four of the 16 instances; the remaining twelve can be found in Appendix A.1. The four cover one of each network variant (i.e., number of lines l) and one of each number of decision variables per line d. As mentioned at the end of Sect. 4.1, five independent and reproducible optimization runs were performed. The low standard deviation prompted us to plot only the runs that led to the best result (i.e., highest hypervolume) in the second phase. We decided to use normalized values because we compare different network variants l.

Fig. 7
figure 7

Excerpt: Eleven start positions from phase one (best out of five runs)

Figure 7a depicts the smallest instance with a total of eight decision variables. The best solution in terms of quality of fleet mileage is very close to the optimum; the same holds for the other instances with more decision variables (Fig. 7b–d). However, the quality of the average waiting time per passenger deteriorates across the instances: 0.0035, 0.0037, 0.0889, and finally 0.1324. This deterioration becomes more apparent when looking at the corresponding fleet mileage qualities: 0.7332, 0.6257, 0.5340, and 0.4376. One reason for the better solutions on one end is the aforementioned (Sect. 2.2) influence of the weight \(\varphi \) on the required number of replications in the single-objective case (i.e., SO-CMA-ES only). The lower the weight \(\varphi \) (i.e., higher priority on average waiting time per passenger), the higher the number of replications. We know from previous works (Schmaranzer et al. 2018, 2019) that, compared with the equally balanced \(\varphi =0.50\), the number of evaluated solutions decreases by almost 30% with a weight of \(\varphi =0.25\) and increases by about 16% with a weight of \(\varphi =0.75\).

Apart from the smaller number of evaluations due to the influence of the weight \(\varphi \) on the variance of Z (SO-CMA-ES; single-objective case only), we must also consider that optimizing the average waiting time per passenger is harder than optimizing the fleet mileage. The average waiting time per passenger is time-dependent, because so is the passenger volume (Fig. 2). A slightly tighter headway during a peak hour may therefore lead to a better result than a much tighter headway in an off-peak hour. Fleet mileage is not time-dependent; it simply does not matter whether the headway is widened during peak or off-peak hours. As long as a wider headway does not lead to infeasibility and reduces the number of vehicle releases, the resulting fleet mileage is reduced. Of course, the affected line’s length has an influence on how great the savings are, but the affected position in the solution vector has no influence.

5.2 Multi-objective optimization results

Table 6 contains the average results obtained from the three algorithms, NSGA-II, MO-CMA-ES, and the two-phase algorithm, in terms of the resulting fronts’ hypervolumes (Zitzler and Thiele 1999). The hypervolume measures the size of the objective space covered by an approximation set. According to Riquelme et al. (2015), it is the most commonly used performance metric in multi-objective optimization, also known as S metric, hyper-area, or Lebesgue measure. It is unary and considers accuracy, diversity, and cardinality. The required reference point was set to (1.1, 1.1). It was not possible to use the concept of the Nadir-point (Ehrgott and Tenfelde-Podehl 2003) for that purpose, because the required lexicographic optimization could not be conducted in this stochastic, simulation-based context.
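For the bi-objective minimization case at hand, the hypervolume with respect to the reference point (1.1, 1.1) reduces to a sum of rectangular slices, as the following sketch with an assumed example front illustrates.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sketch of the hypervolume of a bi-objective minimization front w.r.t. a reference point;
 *  assumes the given points are mutually non-dominated and lie within the reference point. */
final class Hypervolume2D {
    static double compute(double[][] front, double refZ1, double refZ2) {
        double[][] sorted = front.clone();
        Arrays.sort(sorted, Comparator.comparingDouble((double[] p) -> p[0]));  // ascending Z1
        double volume = 0;
        for (int i = 0; i < sorted.length; i++) {
            double nextZ1 = (i + 1 < sorted.length) ? sorted[i + 1][0] : refZ1;
            volume += (nextZ1 - sorted[i][0]) * (refZ2 - sorted[i][1]);         // rectangular slice
        }
        return volume;
    }

    public static void main(String[] args) {
        double[][] front = {{0.1, 0.8}, {0.4, 0.4}, {0.9, 0.1}};                // assumed example front
        System.out.println(compute(front, 1.1, 1.1));                           // 0.64
    }
}
```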

Table 6 Tabular average results (five independent and reproducible optimization runs; best in bold; worse in italic)

In the smallest test instance (8 decision variables), the MO-CMA-ES performs best. However, all three optimization schemes perform rather well: the relative difference among the three competitors is 0.98%. Up to \(l=4\), the test instances with the smallest number of decision variables per line (\(d=4\)) are solved best by the MO-CMA-ES, followed by our proposed two-phase algorithm. The latter performs best in 13 test instances and second best in the remaining three. Considering that the NSGA-II ranks third in all 16 instances, we used its results as a baseline. When compared with the other two algorithms, it becomes apparent that the relative distance increases with the number of lines l and the number of variables per line d. The NSGA-II’s performance decreases when comparing the four hypervolume results within a specific number of lines l. This effect can also be observed in the MO-CMA-ES results. Overall, the two-phase algorithm performed 4.59% better and the MO-CMA-ES 3.40% better than the NSGA-II. The hypervolume in the largest test instance with 105 decision variables was considerably smaller for the NSGA-II.

5.3 Details on the best Pareto front approximations

Table 7 contains the multiplicative binary \(\epsilon \) (Zitzler et al. 2003) of the MO-CMA-ES and the NSGA-II (best out of five), where our two-phase optimization scheme serves as the reference set. The \(\epsilon \) indicator gives the factors by which approximation sets (MO-CMA-ES and NSGA-II) are worse than a reference set (two-phase) with respect to all objectives. Precisely speaking, the \(\epsilon \)-values reported in Table 7 represent the smallest quantities by which the objective vectors in the approximation set of our two-phase approach have to be multiplied to ensure that the whole set is weakly dominated by the respective comparison set. Our two-phase optimization scheme performs rather well, and the MO-CMA-ES can handle the increasing complexity (i.e., number of lines l and resulting total number of decision variables \(l\cdot d\)) better than the NSGA-II.

Table 7 Binary multiplicative \(\epsilon \) with two-phase as reference
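The indicator can be computed directly from the normalized objective vectors of the two sets, as sketched below; the multiplicative variant assumes strictly positive objective values.

```java
/** Sketch of the binary multiplicative epsilon indicator for minimization (Zitzler et al. 2003):
 *  the smallest factor by which every point of the reference set R must be multiplied so that
 *  it is weakly dominated by some point of the comparison set A. */
final class EpsilonIndicator {
    static double multiplicativeEpsilon(double[][] comparisonSet, double[][] referenceSet) {
        double epsilon = Double.NEGATIVE_INFINITY;
        for (double[] r : referenceSet) {
            double bestFactor = Double.POSITIVE_INFINITY;       // best comparison point for this r
            for (double[] a : comparisonSet) {
                double factor = 0;                              // factor needed so that a <= factor * r
                for (int i = 0; i < r.length; i++) factor = Math.max(factor, a[i] / r[i]);
                bestFactor = Math.min(bestFactor, factor);
            }
            epsilon = Math.max(epsilon, bestFactor);            // worst case over the reference set
        }
        return epsilon;                                         // > 1 means A is worse than R somewhere
    }
}
```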

Figures 8, 9, 10 and 11 contain the best Pareto front approximations (with respect to the hypervolume) for all algorithms (best out of five independent and reproducible optimization runs). As in Sect. 5.1, we include four representative plots, ordered by increasing network variant l and number of decision variables per line d, from the smallest to the largest test instance; the remaining twelve plots can be found in Appendix A.2.

In Fig. 8, there is not much of a difference among the three solution schemes. All cover a broad range, though the MO-CMA-ES and the two-phase algorithm cover a slightly higher spread (i.e., get closer to the optimum) in terms of fleet mileage quality (lower right corner).

With 21 decision variables (Fig. 9), the MO-CMA-ES, followed by the two-phase scheme, cover a wide range of the Pareto front approximation. Also, both manage to move the front closer to the origin.

Fig. 8
figure 8

Eight decision variables (\(l=2;d=4\))

Fig. 9
figure 9

21 decision variables (\(l=3;d=7\))

Fig. 10
figure 10

44 decision variables (\(l=4;d=11\))

Fig. 11
figure 11

105 decision variables (\(l=5;d=21\))

In the test instance with 44 decision variables (Fig. 10), the trend continues: the MO-CMA-ES and two-phase algorithm perform best. The NSGA-II covers a far smaller spread for both objectives.

Figure 11 contains the largest test instance. Here, the two-phase optimization scheme performs best, in terms of both coverage and moving the Pareto front approximation toward the origin. In the middle of the front, the NSGA-II manages to produce better solutions than the MO-CMA-ES. However, in terms of spread, the MO-CMA-ES performs better than the NSGA-II.

A reason for the MO-CMA-ES’s good performance in terms of average waiting time per passenger (\(Z_2\)) is that it does not use bounds in a traditional manner (Igel et al. 2007). Whenever a newly created solution violates the bounds, the algorithm does not correct it but evaluates the next feasible one and adds a tiny penalty to its quality; we only used the quality values without any penalties. This aspect, in conjunction with the continuous factor encoding, creates the following effect: when the MO-CMA-ES uses a negative factor, the resulting headway is negative, such that it gets replaced by the lower bound of 1.5 min, which is also the technically lowest possible headway. The end result is good quality at the waiting-time end of the front.

Moshaiov and Abramovich (2014) use the NSGA-II and MO-CMA-ES for the evolution of multi-objective neuro-controllers. They also conclude that the advantage of the NSGA-II (i.e., better solutions compared with the MO-CMA-ES) is restricted to part of the front, whereas the MO-CMA-ES is superior over a large part of the front. Because the first phase of our two-phase optimization scheme (Sect. 5.1) does not provide good results in terms of \(Z_2\) (i.e., waiting time) for test instances with a high number of decision variables, our multi-directional local search managed to improve that end of the final front quite a bit (e.g., from 0.1324 to 0.0037 for the largest instance).

5.4 Detailed results of the real-world instance

Practitioners are often interested in multiple solutions from which they can choose. Furthermore, extreme solutions that lead to, for example, 50% more fleet mileage, with similar increases in fleet size and drivers, create real-world implementation issues, especially in the short term. A large order of new vehicles requires time and preparations by manufacturers, as does recruiting and training new drivers.

Table 8 contains some results on solutions that are close to the currently employed headways, which result in 45,943 km fleet mileage and 2.69 min average waiting time per passenger. To ensure and further improve the solutions’ accuracy, all were re-evaluated using 50 replications. The solution in bold, for example, results in 44,579 km and 2.67 min. It thus offers both a cost reduction of about 3% (less fleet mileage) and a service quality improvement of 0.7% (lower average waiting time), compared with the currently employed setting.

As mentioned, the average in-vehicle time and both average transfer times do not vary. The average total travel time thus is driven by the average waiting time per passenger, which justifies our choice to use it as the service level indicator.

Table 8 Selected solutions from best Pareto front approx. (two-phase algorithm)

6 Conclusion and perspectives

We developed a two-phase multi-objective optimization scheme, comprised of a single-objective CMA-ES and a multi-directional local search, and compared it with two well-known multi-objective population-based algorithms (NSGA-II and MO-CMA-ES). The algorithms were applied to a multi-objective simulation-based headway optimization problem for complex public mass rapid transit systems. The first phase of our algorithm uses a single-objective algorithm (SO-CMA-ES), which yields eleven start points, from which the multi-directional local search embarks on its task of creating a Pareto front approximation. Computational experience is gained from several test instances based on a real-world subway network (i.e., the Viennese mass rapid transit system). The two-phase optimization scheme performed best in 81% of the test instances. The NSGA-II’s performance deteriorated with increasing test instance complexity, yet it yields better solutions than the MO-CMA-ES in a restricted part of the front, whereas the MO-CMA-ES is superior over a large spread of the front. The headways that are currently employed by the transport provider already offer a good balance between the objectives of fleet mileage and average waiting time per passenger reduction. However, one of the solutions derived by our proposed multi-objective algorithm offers both a cost reduction of about 3% (less fleet mileage) and a service quality improvement of 0.7% (lower average waiting time per passenger).

An extension of the proposed approach to other means of public transport, like trams or buses, would offer an interesting pathway for further research. Kiefer et al. (2018) recently investigated public transport network preventive maintenance tasks. The headways for the required replacement services could be determined by our multi-objective simulation optimization scheme. Other extensions involve the implementation of planned disruptions. For the time being, the real-world system is still unaffected by this study. Possible future impacts include changes in the lines’ respective hourly headways (i.e., a new schedule), planning of vehicle acquisition, infrastructure alterations, and disruption management.