1 Introduction

Modern information systems such as customer relationship management (CRM) and enterprise resource planning (ERP) systems record transactions corresponding to activities executed within the business processes that these systems support. For example, a CRM system typically records transactions corresponding to the creation of a customer lead, a request for quote, and various other activities related to customer leads, quotes, and purchase orders. These transactional records can be extracted via SQL queries or via dedicated application programming interfaces (APIs) and used to analyze the execution of the business processes supported by the CRM system, such as the lead-to-quote or the quote-to-order process.

Process mining [33] is a family of techniques to analyze the transactional records associated with a given business process, collectively known as an event log, in order to extract insights about the performance of the process. Among other things, process mining techniques allow us to discover a process model from an event log, an operation known as automated process discovery. Automatically discovered process models allow analysts to understand how the process is executed in reality and to uncover unexpected behavior. When enhanced with performance information (e.g., average activity durations or waiting times), such models can also be used for performance analysis, e.g., to detect bottlenecks.

The problem of automated process discovery has been intensively studied in the past two decades [8]. Research in this field has led to a wide range of automated process discovery approaches (APDAs) that strike various trade-offs between accuracy,Footnote 1 model complexity, and execution time. Existing approaches in this field rely on parameters (with certain default values) to strike these trade-offs. Analysts need to fine-tune these parameters to find a model with the best possible trade-off between different model quality metrics. This article addresses the question of how to automate the fine-tuning of automated process discovery techniques.

A few studies have suggested that the accuracy of APDAs can be enhanced by applying optimization metaheuristics. Early studies in this direction considered population-based metaheuristics (P-metaheuristics), chiefly genetic algorithms [13, 16]. However, these metaheuristics are computationally heavy, requiring execution times in the order of hours to converge when applied to real-life logs [8]. Such high execution times make these techniques inapplicable in the context of exploratory and interactive process discovery, where an analyst may need to discover process models corresponding to several variants of a process (e.g., one process model per type of product, per type of customer, or per region or country) in order to compare the behavior of the process in different settings. Accordingly, other studies have considered the use of single-solution-based metaheuristics (S-metaheuristics) such as simulated annealing [18, 29], which are less computationally demanding. However, these latter studies remain at the level of proposals, without validation on real-life logs or a comparison of the trade-offs between alternative metaheuristics.

In this setting, this article studies the following question: to what extent can the accuracy of APDAs be improved by applying single-solution-based metaheuristics? To address this question, the article outlines a framework to enhance APDAs by applying optimization metaheuristics. The core idea is to perturb the intermediate representation of event logs used by several of the available APDAs, namely the directly-follows graph (DFG). The paper specifically considers perturbations that add or remove edges with the aim of improving fitness or precision, and in a way that allows the underlying APDA to discover a process model from the perturbed DFG.

The proposed framework can be instantiated by linking it to three components: (i) an automated process discovery approach; (ii) an optimization metaheuristic; and (iii) the quality measure to be optimized, such as fitness, precision, or F-score. The article considers instantiations of the framework corresponding to three APDAs (Inductive Miner [24],Footnote 2 Fodina [34], and Split Miner [9]), four optimization metaheuristics (iterative local search, repeated local search, tabu search, simulated annealing), and one accuracy measure (Markovian F-score).

Using a benchmark of 20 real-life logs, the article compares the accuracy gains yielded by the above optimization metaheuristics relative to each other, and relative to the baseline (unoptimized) APDAs upon which they rely. The experimental evaluation also considers the impact of metaheuristic optimization on model complexity measures as well as on execution times.

This article is an extended and revised version of a conference paper [10]. In the conference paper, we presented an approach to optimize the accuracy of one automated process discovery approach, namely Split Miner, by applying S-metaheuristics, and we compared the benefits of applying S-metaheuristics against those of applying P-metaheuristics (using Evolutionary Tree Miner [13] as the representative APDA of this category). Our former comparison [10] showed that S-metaheuristics outperform P-metaheuristics not only in terms of execution time efficiency, but also in terms of accuracy of the discovered process models, a result that also supports the findings of the latest literature review of automated process discovery approaches [8]. This article extends our previous approach [10] into a modular framework that can be used to optimize other APDAs, specifically those that construct a DFG from the event log and use it as an intermediate artifact to discover a process model. This article also extends the conference paper by considering not only Split Miner, but also two other APDAs, namely Fodina and Inductive Miner. Finally, the article reports an empirical evaluation covering all three approaches (Split Miner, Fodina, and Inductive Miner). The evaluation not only demonstrates the applicability and relevance of S-metaheuristics to the problem of automated process discovery, but also quantifies the benefits they yield.

The rest of the paper is structured as follows. The next section gives an overview of APDAs and optimization metaheuristics, discussing the background and related work. Section 3 presents the proposed metaheuristic optimization framework and its instantiations. Section 4 reports on the empirical evaluation, and Sect. 5 draws conclusions and outlines directions for future work.

2 Background and related work

In this section, we give an overview of existing approaches for automated process discovery, followed by an introduction to optimization metaheuristics in general, and their application to automated process discovery in particular.

2.1 Automated process discovery

The execution of business processes is often recorded in the form of event logs. An event log is a collection of event records produced by individual instances (i.e., cases) of the process. The goal of automated process discovery is to generate a process model that captures the behavior observed in or implied by an event log. To assess the goodness of a discovered process model, four quality dimensions are used [33]: fitness, precision, generalization, and complexity. Fitness (a.k.a. recall) measures the amount of behavior observed in the log that is captured by the model. A perfectly fitting process model is one that recognizes every trace in the log. Precision measures the amount of behavior captured in the process model that is observed in the log. A perfectly precise model is one that recognizes only traces that are observed in the log. Generalization measures to what extent the process model captures behavior that, despite not being observed in the log, is implied by it. Finally, complexity measures the understandability of a process model and is typically quantified via size and structural metrics. In this paper, we focus on fitness, precision, and F-score (the harmonic mean of fitness and precision).
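Concretely, the F-score referred to throughout this article is the harmonic mean of the two accuracy measures, i.e., \(\textit{F-score} = 2 \cdot \frac{\textit{fitness} \,\cdot\, \textit{precision}}{\textit{fitness} \,+\, \textit{precision}}\). For example, a model with fitness 0.9 and precision 0.6 yields an F-score of \(2 \cdot 0.54 / 1.5 = 0.72\).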

A recent comparison of state-of-the-art APDAs [8] showed that an approach capable of consistently discovering models with the best fitness-precision trade-off is currently missing. The same study showed, however, that we can obtain consistently good trade-offs by hyperparameter-optimizing some of the existing APDAs based on DFGs—Inductive Miner [24], Structured Heuristics Miner [7], Fodina [34], and Split Miner [9]. These algorithms have a hyperparameter to tune the amount of filtering applied when constructing the DFG. Optimizing this and other hyperparameters via greedy search [8], local search strategies [14], or sensitivity analysis techniques [27], can greatly improve the accuracy of the discovered process models. Accordingly, in the evaluation reported later we use a hyperparameter-optimized version of Split Miner as one of the baselines.

The problem of discovering accurate process models from event logs is inevitably related to that of ensuring event log quality. There is a rich collection of methods for detecting and handling data quality issues in event logs [?]. However, this latter body of work is largely orthogonal to the contribution of this article, as this article focuses on discovering process models assuming that data quality issues have been addressed. That said, the methods presented in this paper partially address one type of data quality issue, namely the presence of noise (infrequent behavior) in an event log [?]. To mitigate the impact of noise on the discovered process model, automated process discovery approaches, including those extended in this paper, apply a dependency filtering step. The optimization techniques proposed in this article fine-tune the level of filtering in order to maximize the accuracy of the discovered process model.

2.2 Optimization metaheuristics

The term optimization metaheuristic refers to a parameterized algorithm that can be instantiated to address a wide range of optimization problems. Metaheuristics are usually classified into two broad categories [12]: (i) single-solution-based metaheuristics, or S-metaheuristics, which explore the solution space one solution at a time starting from a single initial solution of the problem; and (ii) population-based metaheuristics, or P-metaheuristics, which explore a population of solutions generated by mutating, combining, and/or improving previously identified solutions. S-metaheuristics tend to converge faster toward an optimal solution (either local or global) than P-metaheuristics, since the latter deal with a set of solutions and hence require more time to assess and improve the quality of each individual solution. P-metaheuristics are thus computationally heavier than S-metaheuristics, but they are more likely to escape local optima. Providing an exhaustive discussion of all the available metaheuristics is beyond the scope of this paper. In the following, we focus on the four S-metaheuristics that we integrated in our optimization framework and on the P-metaheuristics that have previously been adapted to address the problem of automated process discovery.

Iterated Local Search [30] starts from a (random) solution and explores the neighboring solutions (i.e., solutions obtained by applying a change to the current solution) in search of a better one. When a better solution cannot be found, it perturbs the current solution and starts again. The perturbation is meant to escape local optima. The exploration of the solution space ends when a given termination criterion is met (e.g., a maximum number of iterations or a timeout).

Tabu Search [19] is a memory-driven local search. Its initialization includes a (random) solution and three memories: short, intermediate, and long term. The short-term memory keeps track of recent solutions and prohibits revisiting them. The intermediate-term memory contains criteria driving the search toward the best solutions. The long-term memory contains characteristics that have often been found in many visited solutions, so as to avoid revisiting similar solutions. Using these memories, the neighborhood of the initial solution is explored and a new solution is selected accordingly. The solution-space exploration is repeated until a termination criterion is met.

Simulated Annealing [22] is based on the concepts of temperature (T, a parameter chosen arbitrarily) and energy (E, the objective function to minimize). At each iteration, the algorithm explores (some of) the neighboring solutions and compares their energies with that of the current solution. The latter is updated if the energy of a neighbor is lower, or otherwise with a probability that is a function of T and the energies of the current and candidate solutions, usually \(e^{-\frac{\left| E_1 - E_2 \right| }{T}}\). The temperature drops over time, thus reducing the chance of updating the current solution with a higher-energy one. The algorithm ends when a termination criterion is met, which often relates to the energy or the temperature (e.g., energy below a threshold or \(T = 0\)).
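For illustration, the acceptance mechanism just described can be sketched as follows (a minimal sketch of ours; the function names and the linear cooling schedule are assumptions, not prescribed by [22]):

```python
import math
import random

def simulated_annealing(initial, energy, neighbors, max_iter=100):
    """Minimal simulated annealing sketch: minimizes the energy function."""
    current = best = initial
    for i in range(max_iter):
        t = max_iter - i  # temperature drops (linearly) over time
        candidate = random.choice(neighbors(current))
        delta = energy(candidate) - energy(current)
        # Accept improvements outright; accept a worse candidate with
        # probability e^(-|E1 - E2| / T), which shrinks as T drops.
        if delta <= 0 or random.random() < math.exp(-abs(delta) / t):
            current = candidate
        if energy(current) < energy(best):
            best = current
    return best
```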

Evolutionary (Genetic) Algorithms [11, 20] are inspired by Darwin's theory of evolution. Starting from a set of (random) solutions, a new solution is generated by mixing characteristics of two parents selected from the current set of solutions; this operation is known as crossover. Subsequently, mutations are applied to the new solutions to introduce randomness and avoid local optima. Finally, the solutions obtained are assessed and a subset is retained for the next iteration. The algorithm continues until a termination criterion is met.

Particle Swarm Optimization [21] starts from a set of (random) solutions, referred to as particles. Each particle is characterized by a position and a velocity. The position is a proxy for the particle's quality and embeds the characteristics of the solution, while the velocity is used to alter the position of the particle at each iteration. Furthermore, each particle remembers the best position it has encountered while roaming the search space, as well as the best position encountered by any other particle. At each iteration, the algorithm updates the particles' positions according to their velocities and updates the best positions found. When a termination condition is met, the algorithm returns the particle with the best position across the whole swarm.

Imperialist Competitive Algorithm [4] is inspired by the historical colonial period. It starts from a (random) set of solutions, called countries. Each country is assessed via an objective function, and a subset is selected as imperialist countries (the selection is based on their objective function scores). All the remaining countries (i.e., those with low objective function scores) are considered colonies of the closest (by characteristics) imperialist country. Then, each colony is altered to resemble its imperialist country, the objective function scores are re-computed, and the colonies that have become better than their imperialist country are promoted to imperialist countries, and vice versa. When a termination condition is met, the country with the highest objective function score is selected as the best solution.

2.3 Optimization metaheuristics in automated process discovery

Optimization metaheuristics have been considered in a few previous studies on automated process discovery. An early attempt to apply P-metaheuristics to automated process discovery was the Genetic Miner proposed by de Medeiros [16], subsequently superseded by the Evolutionary Tree Miner [13]. Other applications of P-metaheuristics include the contribution of Alizadeh and Norani [3], who showed that the fitness and precision of the discovered process models can be improved by using the imperialist competitive algorithm, outperforming some state-of-the-art APDAs (including \(\alpha ++\) [36], Flexible Heuristics Miner [35], and Fodina [34]); however, the implementation of the method designed by Alizadeh et al. is not publicly available, and the benchmark they used differs from the one suggested in the latest literature review [8]. Some research studies adapted the particle swarm optimization metaheuristic to solve the problem of automated process discovery from event logs [15, 17], but these studies are preliminary and lack a solid evaluation on real-life logs. One of the most recent studies combined evolutionary computation with particle swarm optimization [25] by extending the work of Buijs et al. [13], but also in this case the authors did not provide a working implementation of their method, nor did they assess it on public datasets, making it difficult to estimate the real benefits of their proposed improvements. In our context, the main limitation of P-metaheuristics is that they are computationally heavy due to the cost of constructing a solution (i.e., a process model) and evaluating its accuracy. This leads to execution times in the order of hours to converge to a solution that is ultimately comparable to those obtained by state-of-the-art APDAs that do not rely on optimization metaheuristics [8].

Finally, a handful of studies have considered the use of S-metaheuristics to automatically discover optimal process models, specifically simulated annealing [18, 29], but these proposals are preliminary and have not been compared against state-of-the-art approaches on real-life logs.

3 Metaheuristic optimization framework

This section outlines our framework for optimizing APDAs by means of S-metaheuristics (cf. Sect. 2). First, we give an overview of the framework and its core components. Next, we discuss the adaptation of the S-metaheuristics to the problem of process discovery. Finally, we describe the instantiations of our framework for Split Miner, Fodina, and Inductive Miner.

3.1 Preliminaries

In order to discover a process model, an APDA takes as input an event log and transforms it into an intermediate representation from which a process model is derived. Below, we define one of the most popular intermediate representations, namely the directly-follows graph (DFG). Although other intermediate representations are available in the literature (e.g., behavioral profiles [28]), our framework focuses only on DFGs for two main reasons: first, because they are adopted by many state-of-the-art automated process discovery approaches [7, 9, 24, 34, 35]; second, because they allow us to leverage the Markovian accuracy [5] to facilitate the application of metaheuristics and the navigation of the solution space, as we show later in this section.

Definition 1

(Event Log) Given a set of activities \({\mathscr {A}}\), an event log \({\mathscr {L}}\) is a multiset of traces where a trace \(t \in {\mathscr {L}}\) is a sequence of activities \(t= \langle a_1, a_2, \dots , a_n \rangle \), with \(a_i \in {\mathscr {A}}, 1 \le i \le n\).

Definition 2

(Directly-follows graph (DFG)) Given an event log \({\mathscr {L}}\), its directly-follows graph (DFG) is a directed graph \({\mathscr {G}}= (N, E)\), where N is the set of nodes, \(N = \{ a \in {\mathscr {A}}\mid \exists t \in {\mathscr {L}} : a \in t\}\), and E is the set of edges, \(E = \{(x, y) \in N \times N \mid \exists t = \langle a_1, a_2, \ldots , a_n \rangle \in {\mathscr {L}} \ \exists i \in [1, n-1] : a_i = x \wedge a_{i+1} = y \}\).

By definition, each node of the DFG represents an activity recorded in at least one trace of the event log, while each edge of the DFG represents a directly-follows relation between two activities (the source and target nodes of the edge). An APDA is said to be DFG-based if it first generates the DFG of the event log, then applies an algorithm to manipulate the DFG (e.g., removing edges), and finally converts the processed DFG into a process model. Such a processed DFG will no longer adhere to Definition 2; therefore, we redefine it as a Refined DFG.

Definition 3

(Refined DFG) Given an event log \({\mathscr {L}}\) and its DFG \({\mathscr {G}}_{{\mathscr {L}}} = (N, E)\), a Refined DFG is a directed graph \({\mathscr {G}}= (N', E')\), where: \(N' \subseteq N\) and \(E' \subseteq E\). If \(N' = N\) and \(E' = E\), the refined DFG is equivalent to the event log DFG.

Examples of DFG-based APDAs are Inductive Miner [24], Heuristics Miner [7, 35], Fodina [34], and Split Miner [9]. Different DFG-based APDAs may extract different Refined DFGs from the same log. Also, a DFG-based APDA may discover different Refined DFGs from the same log depending on its hyperparameter settings (e.g., a filtering threshold). The algorithm(s) used by a DFG-based APDA to discover the Refined DFG from the event log, and to convert it into a process model, can greatly affect the accuracy of the APDA. Accordingly, our framework focuses on optimizing the discovery of the Refined DFG rather than its conversion into a process model.

Given that a Refined DFG is a directed graph, it can be represented as a binary matrix, as follows.

Definition 4

(DFG-Matrix) Given a Refined DFG \({\mathscr {G}}= (N, E)\) and a function \(\theta : N \rightarrow [1, \left| N \right| ]\),Footnote 3 the DFG-Matrix is a square matrix \(X_{{\mathscr {G}}} \in \{0,1\}^{\left| N\right| \times \left| N\right| }\), where each cell \(x_{i,j} = 1 \Longleftrightarrow \exists (a_1,a_2)\in E \mid {\theta }(a_1)=i \wedge {\theta }(a_2)=j\), and \(x_{i,j}=0\) otherwise.

In the remainder of this paper, we refer to the Refined DFG simply as the DFG.
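To make Definitions 1–4 concrete, consider the following self-contained Python sketch (the toy log and all names are ours, chosen for illustration only):

```python
# A toy event log: a multiset of traces (Definition 1).
log = [["a", "b", "c"], ["a", "c", "b"], ["a", "b", "c"]]

# Nodes: activities occurring in at least one trace (Definition 2).
nodes = sorted({a for trace in log for a in trace})
# Edges: directly-follows relations between consecutive activities.
edges = {(t[i], t[i + 1]) for t in log for i in range(len(t) - 1)}

# DFG-Matrix (Definition 4): theta maps each node to a matrix index,
# and cell (i, j) is 1 iff the corresponding edge occurs in the DFG.
theta = {a: i for i, a in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for x, y in edges:
    matrix[theta[x]][theta[y]] = 1

print(sorted(edges))  # [('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'b')]
```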

3.2 Framework overview

As shown in Fig. 1, our framework takes three inputs (in addition to the log): (i) the optimization metaheuristic; (ii) the objective function to be optimized (e.g., F-score); and (iii) the DFG-based APDA to be used for discovering a process model.

Fig. 1 Overview of our optimization framework

Algorithm 1 describes how our framework operates, while Fig. 2 captures the control flow of Algorithm 1. First, the input event log is given to the APDA, which returns the discovered (refined) DFG and its corresponding process model (lines 1 and 2). This (refined) DFG becomes the current DFG, while the process model becomes the best process model (so far). This process model's objective function score (e.g., the F-score) is stored as the current score and the best score (lines 3 and 4). The current DFG is then given as input to the function GenerateNeighbors, which applies changes to the current DFG to generate a set of neighboring DFGs (line 6). The latter are given as input to the APDA, which returns the corresponding process models. The process models are assessed by the objective function evaluators (lines 9 to 13). When the metaheuristic receives the results from the evaluators (along with the current DFG and its score), it chooses the new current DFG and updates the current score (lines 14 and 15). If the new current score is higher than the best score (line 16), it updates the best process model and the best score (lines 17 and 18). After the update, a new iteration starts, unless a termination criterion is met (e.g., a timeout, a maximum number of iterations, or a minimum threshold for the objective function), in which case the framework outputs the best process model identified, i.e., the process model scoring the highest value for the objective function.
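To fix ideas, this control flow can be paraphrased in Python as follows (a sketch of ours: the method names mirror the interface functions named in the text, not the actual implementation, and the line numbers in the comments refer to Algorithm 1):

```python
def optimize(log, apda, metaheuristic, objective, terminated):
    """Sketch of Algorithm 1: metaheuristic optimization of a DFG-based APDA."""
    current_dfg = apda.discover_dfg(log)                           # line 1
    best_model = apda.convert_dfg_to_process_model(current_dfg)    # line 2
    current_score = best_score = objective(best_model, log)        # lines 3-4
    while not terminated():
        neighbors = metaheuristic.generate_neighbors(current_dfg)  # line 6
        scored = [(dfg, objective(apda.convert_dfg_to_process_model(dfg), log))
                  for dfg in neighbors]                            # lines 9-13
        current_dfg, current_score = metaheuristic.update_dfg(
            scored, current_dfg, current_score)                    # lines 14-15
        if current_score > best_score:                             # line 16
            best_model = apda.convert_dfg_to_process_model(current_dfg)
            best_score = current_score                             # lines 17-18
    return best_model
```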

Fig. 2 Algorithm 1—control flow sketch

3.3 Adaptation of the optimization metaheuristics

To adapt iterative local search (ILS), tabu search (TABU), and simulated annealing (SIMA) to the problem of automated process discovery, we need to define the following three concepts: (i) the problem solution space; (ii) a solution neighborhood; and (iii) the objective function. These design choices influence how each of the metaheuristics navigates the solution space and escapes local minima, i.e., how to design the functions GenerateNeighbors and UpdateDFG of Algorithm 1 (lines 6 and 14, respectively).

Solution space Since our goal is the optimization of APDAs, we must choose a solution space that fits our context regardless of the selected APDA. If we assume that the APDA is DFG-based (which is the case for the majority of the available APDAs), we can define the solution space as the set of all the DFGs discoverable from the event log. Indeed, any DFG-based APDA can deterministically generate a process model from a DFG.

Fig. 3 UDF when selecting ILS as optimization metaheuristic

Solution neighborhood Having defined the solution space as the set of all the DFGs discoverable from the event log, we can represent any element of this solution space as a DFG-Matrix. Given a DFG-Matrix, we define its neighborhood as the set of all the matrices differing from it in exactly one cell value (i.e., DFGs having one more/less edge). In the following, every time we refer to a DFG we assume it is represented as a DFG-Matrix.
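In Python terms, this neighborhood can be enumerated as follows (a sketch of ours over the DFG-Matrix encoding of Definition 4):

```python
def neighborhood(matrix):
    """Yield all DFG-Matrices differing from `matrix` in exactly one cell,
    i.e., all DFGs with exactly one edge added or removed."""
    n = len(matrix)
    for i in range(n):
        for j in range(n):
            neighbor = [row[:] for row in matrix]  # copy the matrix
            neighbor[i][j] = 1 - neighbor[i][j]    # flip one cell
            yield neighbor
```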

[Algorithm 1 listing]

Objective function It is possible to define the objective function as any function assessing one of the four quality dimensions for discovered process models (introduced in Sect. 2). However, since we are interested in optimizing the APDAs to discover the most accurate process model, in our framework instantiations we use as objective function the F-score of fitness and precision. Furthermore, we remark that our framework could also operate with objective functions that take into account multiple quality dimensions striving for a trade-off, e.g., F-score and model complexity, provided that the multiple quality dimensions can be combined into a single objective function.

Having defined the solution space, a solution neighborhood, and the objective function, we can turn our attention to how ILS, TABU, and SIMA navigate the solution space. These three S-metaheuristics share similar traits in solving an optimization problem, especially when it comes to the navigation of the solution space. Given a problem and its solution space, any of them starts from a (random) solution, discovers one or more neighboring solutions, and assesses them with the objective function to find a solution that is better than the current one. If a better solution is found, it is chosen as the new current solution, and the metaheuristic performs a new neighborhood exploration. If a better solution is not found, e.g., because the current solution is locally optimal, the three metaheuristics follow different approaches to escape the local optimum and continue the solution-space exploration. Algorithm 1 orchestrates and facilitates the parts of this procedure shared by the three metaheuristics. However, we must still define the functions GenerateNeighbors (GNF) and UpdateDFG (UDF).

The GNF receives as input a solution of the solution space, i.e., a DFG, and generates a set of neighboring DFGs. By definition, the GNF is independent of the metaheuristic, and it can be as simple or as elaborate as we demand. A simple GNF, for example, randomly selects neighboring DFGs by turning one cell of the input DFG-Matrix to 0 or to 1. An elaborate GNF, instead, carefully selects neighboring DFGs by relying on the feedback received from the objective function assessing the input DFG, as we show in Sect. 3.4.

Fig. 4 UDF when selecting TABU as optimization metaheuristic

Fig. 5 UDF when selecting SIMA as optimization metaheuristic

The UDF (captured in Algorithm 2) is the core of our optimization framework, and it implements the metaheuristic itself. The UDF receives as input the selected metaheuristic (\(\omega \)), the neighboring DFGs and their corresponding objective function scores (S), the current DFG (\({\mathscr {G}}_c\)), the current score (\(s_c\)), the APDA (\(\alpha \)), and the event log (\({\mathscr {L}}\)). We can then differentiate two cases: (i) among the input neighboring DFGs, there is at least one having a higher objective function score than the current one; (ii) none of the input neighboring DFGs has a higher objective function score than the current one. In the first case, the UDF always outputs the DFG having the highest score, regardless of the selected metaheuristic (see Algorithm 2, lines 4, 11, and 33—respectively for ILS, TABU, and SIMA). In the second case, the current DFG may be a local optimum, and each metaheuristic escapes it with a different strategy. Figures 3, 4, and 5 show the high-level control flow of how ILS, TABU, and SIMA update the current DFG (that is, the UDF—Algorithm 2).

[Algorithm 2 listing]

Iterative Local Search applies the simplest strategy: it perturbs the current DFG (Algorithm 2, line 7). The perturbation is meant to alter the DFG in such a way as to escape the local optimum, e.g., by randomly adding and removing multiple edges of the current DFG. The perturbed DFG is the output of the UDF.
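A hypothetical perturbation of this kind could look as follows (the actual PFs are APDA-specific, as discussed in Sect. 3.4; the flip count is an arbitrary choice of ours):

```python
import random

def perturb(matrix, strength=5):
    """ILS-style perturbation sketch: randomly flip several cells of the
    DFG-Matrix, i.e., add and remove several edges at once."""
    n = len(matrix)
    perturbed = [row[:] for row in matrix]
    for _ in range(strength):
        i, j = random.randrange(n), random.randrange(n)
        perturbed[i][j] = 1 - perturbed[i][j]  # toggle the edge (i, j)
    return perturbed
```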

Tabu Search relies on its three memories to escape a local optimum (Algorithm 2, lines 25 to 30): the short-term memory (a.k.a. the Tabu-list), which contains DFGs that must not be explored further; the intermediate-term memory, which contains DFGs that should lead to better results and, therefore, should be explored in the near future; and the long-term memory, which contains DFGs (with characteristics) that have been seen multiple times and, therefore, should not be explored in the near future. TABU updates the three memories each time the UDF is executed. Given the set of neighboring DFGs and their respective objective function scores (see Algorithm 1, map S), TABU adds each DFG to a different memory. DFGs worsening the objective function score are added to the Tabu-list. DFGs improving the objective function score, yet less than another neighboring DFG, are added to the intermediate-term memory. DFGs that do not improve the objective function score are added to the long-term memory. Also, the current DFG is added to the Tabu-list, since it has already been explored. When TABU does not find a better DFG in the neighborhood of the current DFG, it returns the latest DFG added to the intermediate-term memory. If the intermediate-term memory is empty, TABU returns the latest DFG added to the long-term memory. If both these memories are empty, TABU requests a new (random) DFG from the APDA and outputs it.
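The memory handling just described can be sketched as follows (our simplification; returning None signals that the caller should request a fresh DFG from the APDA):

```python
def tabu_update(scored_neighbors, current_dfg, current_score,
                tabu_list, intermediate, long_term):
    """Tabu search escape sketch: sort the neighbors into the three
    memories, then select the next DFG to explore."""
    tabu_list.append(current_dfg)  # the current DFG has now been explored
    improving = [(d, s) for d, s in scored_neighbors if s > current_score]
    for dfg, score in scored_neighbors:
        if score < current_score:
            tabu_list.append(dfg)        # worsening: do not revisit
        elif score == current_score:
            long_term.append(dfg)        # non-improving: deprioritize
    if improving:
        best_dfg, best_score = max(improving, key=lambda pair: pair[1])
        intermediate.extend(d for d, s in improving if d is not best_dfg)
        return best_dfg, best_score      # a better neighbor was found
    if intermediate:
        return intermediate.pop(), current_score  # latest promising DFG
    if long_term:
        return long_term.pop(), current_score
    return None, current_score  # both memories empty: ask the APDA anew
```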

Simulated Annealing avoids getting stuck in a local optimum by allowing the selection of DFGs that worsen the objective function score (Algorithm 2, lines 36 to 40). In doing so, SIMA explores areas of the solution space that other S-metaheuristics do not. When a better DFG is not found in the neighborhood of the current DFG, SIMA analyzes one neighboring DFG at a time. If this neighbor does not worsen the objective function score, SIMA outputs it. If, instead, the neighboring DFG worsens the objective function score, SIMA outputs it with a probability of \(e^{-\frac{\left| s_n - s_c \right| }{T}}\), where \(s_n\) and \(s_c\) are the objective function scores of the neighboring DFG and the current DFG, respectively, and the temperature T is an integer that converges to zero as a linear function of the maximum number of iterations. The temperature is fundamental to avoid updating the current DFG with a worse one when there would be no time to recover from the worsening (i.e., too few iterations left to continue the exploration of the solution space from the worse DFG).
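The corresponding escape step can be sketched as follows (ours; the linear temperature decay follows the description above):

```python
import math
import random

def sima_escape(scored_neighbors, current_dfg, s_c, iteration, max_iter):
    """SIMA escape sketch: possibly accept a worsening neighboring DFG."""
    t = max_iter - iteration  # temperature decays linearly to zero
    for dfg, s_n in scored_neighbors:
        if s_n >= s_c:
            return dfg, s_n  # non-worsening neighbor: accept it
        # Worsening neighbor: accept with probability e^(-|s_n - s_c| / T).
        if t > 0 and random.random() < math.exp(-abs(s_n - s_c) / t):
            return dfg, s_n
    return current_dfg, s_c  # no neighbor accepted: keep the current DFG
```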

3.4 Framework instantiation

To assess our framework, we instantiated it for three APDAs: Split Miner [9], Fodina [34], and Inductive Miner [24]. These three APDAs are all DFG-based, and they are representative of the state of the art. In fact, the latest literature review and benchmark of APDAs [8] showed that Fodina, Split Miner, and Inductive Miner outperformed other APDAs when their hyperparameters were optimized via a brute-force approach. Therefore, we decided to focus on those DFG-based APDAs that would benefit the most from the application of our optimization framework.

To complete the instantiation of our framework for any concrete DFG-based APDA, it is necessary to implement an interface that allows the metaheuristics to interact with the APDA (as discussed above). Such an interface should provide four functions: DiscoverDFG and ConvertDFGtoProcessModel (see Algorithm 1), the Restart Function (RF) for TABU, and the Perturbation Function (PF) for ILS.

The first two functions, DiscoverDFG and ConvertDFGtoProcessModel, are inherited from the DFG-based APDA, in our case Split Miner, Fodina, and Inductive Miner. We note that Split Miner and Fodina receive as input parameter settings that can vary the output of the DiscoverDFG function. Specifically, Split Miner has two parameters: the noise filtering threshold, used to drop infrequent edges in the DFG, and the parallelism threshold, used to determine which potential parallel relations between activities are used when discovering the process model from the DFG. Fodina, in turn, has three parameters: the noise filtering threshold, similar to the one of Split Miner, and two thresholds to detect self-loops and short-loops in the DFG, respectively. In contrast, the DFG-based variant of Inductive Miner [24] that we integrated in our optimization framework does not receive any input parameters.

To discover the initial DFG (Algorithm 1, line 1) with Split Miner, default parameters are used.Footnote 4 We removed the randomness from the discovery of the initial DFG because, most of the time, the DFG discovered by Split Miner with default parameters is already a good solution [9], and starting the solution-space exploration from it can reduce the total exploration time.

Similarly, if Fodina is the selected APDA, the initial DFG (Algorithm 1, line 1) is discovered using the default parameters of Fodina,Footnote 5 even though there is no guarantee that the default parameters allow Fodina to discover a good starting solution [8]. Still, this design choice is less risky than randomly choosing the values of the input parameters to discover the initial DFG, because Fodina, which does not guarantee soundness, would likely discover unsound models when randomly tuned.

On the other hand, Inductive Miner [24] does not apply any manipulation to the discovered initial DFG. In this case, we pseudorandomly generate an initial DFG starting from a given seed, to ensure determinism. Unlike the case of Fodina, this is a suitable design choice for Inductive Miner, because it always guarantees block-structured sound process models, regardless of the DFG.

Function RF is very similar to DiscoverDFG, since it requires the APDA to output a DFG. The only difference is that RF must output a different DFG every time it is executed. We adapted the DiscoverDFG function of Split Miner and Fodina to output the DFG discovered with default parameters the first time it is executed, and a DFG discovered with pseudorandom parameters for the following executions. The case of Inductive Miner is simpler, because the DiscoverDFG function always returns a pseudorandom DFG. Consequently, we mapped RF to the DiscoverDFG function.

Finally, the function PF can be provided either by the APDA (through the interface) or by the metaheuristic. However, the PF can be more effective when it is not generalized by the metaheuristic, since this allows the APDA to apply different perturbations to the DFGs, taking into account how the APDA converts the DFG into a process model. We chose a different PF for each of the three APDAs, as listed below.

  • Split Miner PF We invoke Split Miner's concurrency oracle to extract the possible parallelism relations in the log using a randomly chosen parallelism threshold. For each new parallel relation discovered that is not present in the current solution, two edges are removed from the DFG, while, for each deprecated parallel relation, two edges are added to the DFG.

  • Fodina PF Given the current DFG, we analyze its self-loops and short-loops relations using random loop thresholds. As a result, a new DFG is generated where a different set of edges is retained as self-loops and short-loops.

  • Inductive Miner PF Since Inductive Miner does not perform any manipulation on the DFG, we could not determine an efficient way to perturb the DFG. Thus, we set PF = RF, so that instead of perturbing the current DFG, a new random DFG is generated. This variant of ILS is called Repeated Local Search (RLS). In the evaluation reported in Sect. 4, we use only RLS for Inductive Miner, and both ILS and RLS for Fodina and Split Miner.

Fig. 6 Algorithm 3—control flow sketch

To complete the instantiation of our framework, we need to set an objective function. With the goal of optimizing the accuracy of the APDAs, we chose as objective function the F-score of fitness and precision. Among the existing measures of fitness and precision, we selected the Markovian fitness and precision presented in [5, 6].Footnote 6 The rationale for this choice is that these measures of fitness and precision are the fastest to compute among state-of-the-art measures [5, 6]. Furthermore, these measures indicate what edges could be added to or removed from the DFG to improve the fitness or precision of the model. This feedback allows us to design an effective GNF.

In the instantiation of our framework, the objective function's output is a data structure composed of: the Markovian fitness and precision of the model, the F-score, and the mismatches between the model and the event log identified during the computation of the Markovian fitness and precision, i.e., the sets of edges that could be added to improve fitness or removed to improve precision. Algorithm 3 illustrates how we build this data structure; its high-level control flow is sketched in Fig. 6.

Given an event log and a process model, we generate their respective Markovian abstractions by applying the method described in [5] (lines 1 and 2). We recall that the Markovian abstraction of the log/model is a graph, where each edge represents a subtraceFootnote 7 observed in the log/model. Next, we collect all the edges of the Markovian abstractions of the log and of the model into two different sets: \(E_l\) and \(E_m\) (lines 3 and 4). These two sets are used to determine the Markovian fitness and precision of the process model [5], by applying the formulas in lines 4 and 10. We note that the edges in \(E_l\) that cannot be found in \(E_m\) (set \(E_{df}\), line 6) represent subtraces of the log that cannot be found in the process model. Vice versa, the edges in \(E_m\) that cannot be found in \(E_l\) (set \(E_{dp}\), line 11) represent subtraces of the process model that cannot be found in the log. We analyze these subtraces to detect directly-follows relations, i.e., DFG edges (lines 9 and 14), that can be added to or removed from the DFG that generated the process model in order to improve either fitness or precision. Specifically, the DFG edges that can be added to improve fitness are those captured by the directly-follows relations that we can find in the Markovian abstraction edges in set \(E_{df}\). On the other hand, the edges that can be removed to improve precision are those captured by the directly-follows relations that we can find in the Markovian abstraction edges in set \(E_{dp}\). Once these edges to be added or removed are identified (sets \(E_f\) and \(E_p\)), we can output the final data structure, which comprises the Markovian fitness and precision, their F-score, and the two sets \(E_f\) and \(E_p\).
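The gist of this computation can be sketched as follows (our simplification: we approximate the Markovian fitness and precision as fractions of matched abstraction edges, whereas the actual measures of [5] rely on a graph comparison algorithm):

```python
def evaluate(e_log, e_model):
    """Sketch of Algorithm 3's output. e_log / e_model are the edge sets
    of the Markovian abstractions of the log and of the model."""
    matched = e_log & e_model
    fitness = len(matched) / len(e_log)      # share of log behavior captured
    precision = len(matched) / len(e_model)  # share of model behavior observed
    f_score = (2 * fitness * precision / (fitness + precision)
               if fitness + precision else 0.0)
    e_df = e_log - e_model   # subtraces missing from the model (fitness feedback)
    e_dp = e_model - e_log   # extra subtraces in the model (precision feedback)
    # Algorithm 3 further maps e_df and e_dp to the DFG edges that could be
    # added (E_f) or removed (E_p) to improve fitness or precision.
    return fitness, precision, f_score, e_df, e_dp
```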

[Algorithm 3 listing]
Fig. 7 Algorithm 4—control flow sketch

Given the above objective function's output, our GNF is described in Algorithm 4, while Fig. 7 sketches its high-level control flow.

[Algorithm 4 listing]

This function receives as input the current DFG (\({\mathscr {G}}_c\)), its objective function score (the data structure \(s_c\)), and the number of neighbors to generate (\(\textit{size}_n\)). If fitness is greater than precision, we retrieve from \(s_c\) the set of edges (\(E_m\)) that could be removed from \({\mathscr {G}}_c\) to improve its precision (line 2). Conversely, if precision is greater than fitness, we retrieve from \(s_c\) the set of edges (\(E_m\)) that could be added to \({\mathscr {G}}_c\) to improve its fitness (line 4). The reasoning behind this design choice is that, given that our objective function is the F-score, it is preferable to increase the lower of the two measures: if fitness is lower, we increase fitness, and conversely, if precision is lower, we increase precision. Once we have \(E_m\), we randomly select one edge from it, generate a copy of the current DFG (\({\mathscr {G}}_n\)), and either remove or add the randomly selected edge according to the accuracy measure we want to improve (see lines 7 to 13). If the removal of an edge generates a disconnected \({\mathscr {G}}_n\), we do not add the latter to the neighbors set N (line 10). We keep iterating over \(E_m\) until the set is empty (i.e., no mismatching edges are left) or N reaches its maximum size (i.e., \(\textit{size}_n\)), and we then return N. The algorithm ends when the maximum execution time or the maximum number of iterations is reached.
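In Python terms, this neighbor generation can be sketched as follows (ours; `is_connected` is an assumed helper checking that a DFG remains connected):

```python
import random

def generate_neighbors(dfg_edges, score, size_n, is_connected):
    """Sketch of Algorithm 4 (GNF): feedback-guided neighbor generation.
    `dfg_edges` is the current DFG as a set of edges; `score` bundles
    fitness, precision, and the mismatch edge sets E_f / E_p."""
    if score["fitness"] > score["precision"]:
        candidates, add = set(score["E_p"]), False  # remove edges: precision up
    else:
        candidates, add = set(score["E_f"]), True   # add edges: fitness up
    neighbors = []
    while candidates and len(neighbors) < size_n:
        edge = random.choice(tuple(candidates))     # pick a mismatching edge
        candidates.remove(edge)
        neighbor = dfg_edges | {edge} if add else dfg_edges - {edge}
        if is_connected(neighbor):  # drop neighbors that disconnect the DFG
            neighbors.append(neighbor)
    return neighbors
```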

4 Evaluation

We implemented the proposed optimization framework as a Java command-line application.Footnote 8 This tool uses Split Miner, Fodina, and Inductive Miner as the underlying APDAs, and the Markovian accuracy F-score as the objective function (cf. Sect. 3.4). Using this implementation, we undertook to empirically evaluate the magnitude of improvements in accuracy delivered by different instantiations of the framework.

4.1 Dataset, quality measures, and experimental setup

For our evaluation, we used the dataset of the benchmark of automated process discovery approaches in [8], which to the best of our knowledge is the most recent benchmark on this topic. This dataset includes twelve public logs and eight private logs. The public logs originate from the 4TU Centre for Research Data and include the BPI Challenge (BPIC) logs (2012–17), the Road Traffic Fines Management Process (RTFMP) log and the SEPSIS log. These logs record executions of business processes from a variety of domains, e.g., healthcare, finance, government, and IT service management. The eight proprietary logs are sourced from several companies in the education, insurance, IT service management, and IP management domains.

Table 1 reports the characteristics of the logs. The dataset comprises simple logs (e.g., BPIC13\(_{\mathrm{cp}}\)) and very complex ones (e.g., SEPSIS, PRT2) in terms of percentage of distinct traces, and both small logs (e.g., BPIC13\(_{\mathrm{cp}}\) and SEPSIS) and large ones (e.g., BPIC17 and PRT9) in terms of total number of events.

Table 1 Descriptive statistics of the real-life logs (public and proprietary)

From each of these logs, we discovered 16 process models by applying the following techniques:

  • Split Miner with default parameters (SM);

  • Split Miner with hyper-parameter optimization\(^9\) (HPO\(_\mathrm{sm}\));

  • Split Miner optimized with our framework using the following optimization metaheuristics: RLS\(_\mathrm{sm}\), ILS\(_\mathrm{sm}\), TABU\(_\mathrm{sm}\), SIMA\(_\mathrm{sm}\);

  • Fodina with default parameters (FO);

  • Fodina with hyper-parameter optimization\(^9\) (HPO\(_{\mathrm{fo}}\));

  • Fodina optimized with our framework using the following optimization metaheuristics: RLS\(_{\mathrm{fo}}\), ILS\(_{\mathrm{fo}}\), TABU\(_{\mathrm{fo}}\), SIMA\(_{\mathrm{fo}}\);

  • Inductive Miner IM\(_{\mathrm{d}}\);

  • Inductive Miner optimized with our framework using the following optimization metaheuristics: RLS\(_{\mathrm{imd}}\), TABU\(_{\mathrm{imd}}\), SIMA\(_{\mathrm{imd}}\).Footnote 9

For each of the above metaheuristics, we set the maximum execution time to five minutes and the maximum number of iterations to 50. The same timeout was also applied to the hyper-parameter optimizations.

For each of the discovered models, we measured fitness, precision, complexity, and execution time. For measuring fitness and precision, we adopted two different sets of measures. The first set of measures is based on alignments, computing fitness and precision with the approaches proposed by Adriansyah et al. [1, 2] (alignment-based accuracy). Alignment-based fitness selects, for each trace in the log, the closest trace recognized by the process model, and measures the minimal number of error-corrections required to align these two traces (a.k.a. the minimal alignment cost). The final fitness score is equal to one minus the normalized sum of the minimal alignment costs between each trace in the log and the closest corresponding trace recognized by the model. Alignment-based precision builds a prefix automaton from the event log, then replays the process model behavior on top of the log prefix automaton (with the aid of alignments) and counts the number of times that the model can perform a move that the prefix automaton cannot. Each of these mismatching moves is called an escaping edge. The final value of precision is a function of the number of detected escaping edges. For more details regarding the alignment-based fitness and precision, we refer to the corresponding studies [1, 2].

The second set of measures is based on Markovian abstractions, computing fitness and precision with the approaches in [5]. The Markovian fitness generates a Markovian abstraction from the behavior recorded in the event log and a Markovian abstraction from the behavior allowed by the process model. As mentioned in the previous section, a Markovian abstraction is a graph where each node represents a subtrace of a fixed length. The Markovian fitness relies on a graph comparison algorithm [23] to identify the edges of the Markovian abstraction generated from the log that do not appear in the Markovian abstraction generated from the process model. Similarly, the Markovian precision is calculated by identifying (via the same graph comparison algorithm [23]) the edges of the Markovian abstraction of the process model that do not appear in the Markovian abstraction of the log. For more details regarding the Markovian fitness and precision, we refer to the corresponding study [5].

For assessing the complexity of the models, we relied on size, control-flow complexity (CFC), and structuredness. Size is the total number of nodes of a process model; CFC is the amount of branching induced by the split gateways in a process model; structuredness is the percentage of nodes located inside a single-entry single-exit fragment of a process model.
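For instance, size and CFC can be computed from a model's nodes and split gateways as sketched below (our sketch, assuming Cardoso's CFC metric as the concrete variant: an XOR-split contributes its fan-out, an OR-split \(2^n - 1\), and an AND-split 1):

```python
def size(model_nodes):
    """Size: total number of nodes of the process model."""
    return len(model_nodes)

def cfc(splits):
    """Control-flow complexity over a list of (gateway_type, fanout) pairs."""
    total = 0
    for kind, fanout in splits:
        if kind == "XOR":
            total += fanout           # one option per outgoing edge
        elif kind == "OR":
            total += 2 ** fanout - 1  # any non-empty subset of edges
        elif kind == "AND":
            total += 1                # a single parallel continuation
    return total
```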

Note that we did not measure the generalization of the discovered process models, because available generalization measures assess the capability of an APDA to generalize the behavior recorded in the event log during the discovery of a process model; they do not assess the generalization of the process model itself [32]. However, this should not be seen as a limitation of this study, since our objective is to analyze the benefits yielded by our optimization framework in terms of F-score (through fitness and precision).

We used the results of these measurements to compare the quality of the models discovered by each baseline APDA (SM, FO, IM\(_{\mathrm{d}}\)) against the quality of the models discovered by the respective optimized approaches.

All the experiments were performed on an Intel Core i5-6200U@2.30 GHz with 16 GB RAM running Windows 10 Pro (64-bit) and JVM 8 with 14 GB RAM (10 GB Stack and 4 GB Heap). The framework implementation, the batch tests, the results, and all the (public) models discovered during the experiments are available for reproducibility purposes at https://doi.org/10.6084/m9.figshare.11413794.

4.2 Split Miner

Tables 2 and 3 show the results of our comparative evaluation for Split Miner. Each row reports the quality of each discovered process model in terms of accuracy (both alignment-based and Markovian), complexity, and discovery time. We held out from the tables four logs: BPIC13\(_{\mathrm{cp}}\), BPIC13\(_{\mathrm{inc}}\), BPIC17, and PRT9. For these logs, none of the metaheuristics could improve the accuracy of the model already discovered by SM. This is due to the high fitness score achieved by SM in these logs. By design, our metaheuristics try to improve precision by removing edges, but in these four cases, no edge could be removed without compromising the structure of the model (i.e., the model would become disconnected).

Table 2 Comparative evaluation results for the public logs—Split Miner
Table 3 Comparative evaluation results for the proprietary logs—Split Miner
Fig. 8 BPIC14\(_\mathrm{f}\) model discovered with SIMA\(_\mathrm{sm}\) (above) and with SM (below)

Fig. 9 RTFMP model discovered with SIMA\(_\mathrm{sm}\) (above) and with SM (below)

For the remaining 16 logs, all the metaheuristics consistently improved the Markovian F-score over that achieved by SM. Also, all the metaheuristics performed better than HPO\(_\mathrm{sm}\), except in two cases (BPIC12 and PRT1). Overall, the most effective optimization metaheuristic was ILS\(_\mathrm{sm}\), which delivered the highest Markovian F-score nine times out of 16, followed by SIMA\(_\mathrm{sm}\) (eight times), and RLS\(_\mathrm{sm}\) and TABU\(_\mathrm{sm}\) (six times each). We note, however, that the F-score difference between the four metaheuristics is small (in the order of one to two percentage points).

Even though the objective function of the metaheuristics was the Markovian F-score, all four metaheuristics also improved the alignment-based F-score in half of the cases. This is due to the fact that any improvement in the Markovian fitness translates into an improvement in the alignment-based fitness, although the same does not hold for precision. This result highlights the partial correlation between alignment-based and Markovian measures, already discussed in the previous section.

A closer inspection of the complexity of the models shows that, most of the time (nine cases out of 16), the F-score improvement achieved by the metaheuristics comes at the cost of size and CFC. This is expected, since SM tends to discover models with higher precision than fitness [9]. To improve the F-score, new behavior is added to the model in the form of new edges (note that new nodes are never added); this leads to new gateways and consequently to a higher size and CFC. On the other hand, when precision is lower than fitness, and thus the metaheuristic aims to increase the value of precision to improve the overall F-score, the result is the opposite: the model complexity reduces as edges are removed. This is the case for the RTFMP and PRT10 logs. Figures 8 and 9 illustrate these two scenarios. Figure 8 shows the models discovered by SIMA\(_\mathrm{sm}\) and SM from the BPIC14\(_\mathrm{f}\) log, where the model discovered by SIMA\(_\mathrm{sm}\) is more complex than that obtained with SM because it was necessary to improve its fitness (adding edges). Figure 9, instead, shows the models discovered by SIMA\(_\mathrm{sm}\) and SM from the RTFMP log, where the model discovered by SIMA\(_\mathrm{sm}\) is simpler than that obtained with SM because it was necessary to improve the precision (removing edges).

Comparing the results obtained by the metaheuristics with those of HPO\(_\mathrm{sm}\), we can see that our approach allows us to discover models that cannot be discovered simply by tuning the parameters of SM. This relates to the solution-space exploration. Indeed, HPO\(_\mathrm{sm}\) can only explore a limited number of solutions (DFGs), i.e., those that can be generated by the underlying APDA (SM in this case) by varying its parameters. In contrast, the metaheuristics go beyond the solution space of HPO\(_\mathrm{sm}\) by exploring new DFGs in a pseudorandom manner.

In terms of execution times, the four metaheuristics perform similarly, with an average discovery time close to 150 s. While this is considerably higher than the execution time of SM (\(\sim 1\) second on average), it is much lower than that of HPO\(_\mathrm{sm}\), while consistently achieving higher accuracy.

4.3 Fodina

Tables 4 and 5 report the results of our comparative evaluation for Fodina. In these tables, we used “–” to report that a given accuracy measurement could not be reliably obtained due to the unsoundness of the discovered process model. We held out from the tables two logs: BPIC12 and SEPSIS, because none of the six approaches (the base APDA, the hyper-parameter optimized variant, and the four metaheuristics) was able to discover a sound process model. This is due to Fodina's design, which does not guarantee soundness.

Table 4 Comparative evaluation results for the public logs—Fodina

Considering the remaining 18 logs, in eleven of them all the metaheuristics improved the Markovian F-score w.r.t. HPO\(_{\mathrm{fo}}\) (and consequently FO), while in 16 of them at least one metaheuristic outperformed both FO and HPO\(_{\mathrm{fo}}\). The only two cases where none of the metaheuristics was able to discover a more accurate process model than HPO\(_{\mathrm{fo}}\) were PRT2 and BPIC14\(_\mathrm{f}\). In the former log, this is because none of the metaheuristics discovered a sound process model within the given timeout of five minutes; we note, however, that HPO\(_{\mathrm{fo}}\) took almost four hours to discover a sound process model from the PRT2 log. In the latter log, this is because all the metaheuristics discovered the same model as HPO\(_{\mathrm{fo}}\).

Among the optimization metaheuristics, TABU\(_{\mathrm{fo}}\) performed the best, achieving the highest Markovian F-score 14 times out of 18, followed by ILS\(_{\mathrm{fo}}\) (ten times). However, as for Split Miner, the differences in the achieved F-score between the four metaheuristics are small: there is a difference of only 1–2 percentage points between the metaheuristic with the highest F-score and the one with the lowest.

Table 5 Comparative evaluation results for the proprietary logs—Fodina

In the case of Fodina, the results achieved by the metaheuristics on the alignment-based F-score are more remarkable than in the case of Split Miner, and in line with the results obtained on the Markovian F-score. Indeed, 50% of the time, all the metaheuristics were able to outperform both FO and HPO\(_{\mathrm{fo}}\) on the alignment-based F-score, and more than 80% of the time, at least one metaheuristic scored a higher alignment-based F-score than FO and HPO\(_{\mathrm{fo}}\). Such a result is remarkable considering that the objective function of the metaheuristics was the Markovian F-score.

Regarding the complexity of the models discovered by the metaheuristics, in more than 50% of the cases it is lower than the complexity of the models discovered by FO and HPO\(_{\mathrm{fo}}\), and in line with the two baselines in the remaining cases. This difference with the results we obtained for SM relates to the following two factors: (i) Split Miner discovers much simpler models than Fodina, so any further improvement is difficult to achieve; and (ii) Fodina natively discovers more fitting models than Split Miner, and hence the metaheuristics aim at improving precision, ultimately removing model edges and thus reducing complexity.

In terms of execution times, the four metaheuristics perform similarly, with an execution time between 150 and 300 s, slightly higher than the case of Split Miner.

4.4 Inductive Miner

Table 6 displays the results of our comparative evaluation for Inductive Miner. We held out from the table the five BPIC15 logs, because none of the three metaheuristics could discover a model within the five-minute timeout. This was due to scalability issues in computing the Markovian accuracy, already known for the case of IM [6].

Table 6 Comparative evaluation results for the public and proprietary logs—Inductive Miner

In the remaining 15 logs, in 13 cases all the metaheuristics improved the Markovian F-score w.r.t. IM\(_{\mathrm{d}}\), and only for the BPIC17\(_\mathrm{f}\) log could none of the metaheuristics outperform IM\(_{\mathrm{d}}\). The best-performing metaheuristic was SIMA\(_{\mathrm{imd}}\), achieving the highest Markovian F-score eight times, followed by TABU\(_{\mathrm{imd}}\) and RLS\(_{\mathrm{imd}}\), which achieved the highest Markovian F-score seven and six times, respectively. Again, we note that the differences in the achieved F-score across the three metaheuristics are small. There are several cases in which multiple metaheuristics achieve the same F-score, and a difference of only 1–2 percentage points between the best-performing and the worst-performing metaheuristic.

The results of the metaheuristics on the alignment-based F-score are similar to the case of Fodina, and broadly in line with the results achieved on the Markovian F-score. Indeed, all the metaheuristics outperformed IM\(_{\mathrm{d}}\) in all but two of the 15 logs.

Regarding the complexity of the models discovered by the metaheuristics, we recorded little variation w.r.t. the complexity of the models discovered by IM\(_{\mathrm{d}}\). Size and CFC neither notably improved nor worsened, except for the PRT9 and BPIC14\(_\mathrm{f}\) logs, where both size and CFC were reduced by about 30%.

In terms of execution times, the three metaheuristics perform similarly, with an average execution time close to 300 s, meaning that in the majority of cases the solution-space exploration was interrupted by the timeout.

4.5 Discussion

The results of the evaluation show that metaheuristic optimization improves accuracy over the baseline discovery approaches in 80% of the cases. Furthermore, it consistently produces a higher alignment-based F-score, even though this measure was not used as the objective function (it was excluded owing to the low scalability of alignment-based precision).

The drawback of metaheuristic optimization is the longer execution time: several minutes, versus at most a few seconds for the baselines.

In a small number of cases, the optimization framework did not yield any F-score improvement with respect to the corresponding unoptimized approach, due to: (i) a small solution space (i.e., the baseline already discovers the best process model); or (ii) scalability issues (i.e., the Markovian accuracy could not be computed within the timeout). While the former scenario is beyond our control and strictly relates to the complexity of the input event log, the latter reminds us of the limitations of state-of-the-art accuracy measures (especially precision) in the context of automated process discovery. It also justifies our design choice of a modular optimization framework, which allows new accuracy measures to be used as objective functions in the future, potentially overcoming such scalability issues.
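To make this modularity concrete, the following Python sketch (all names are hypothetical; the paper does not include an implementation) illustrates how an accuracy measure can be passed as a plug-in objective function, so that a more scalable measure could replace the Markovian F-score without changing the search procedure:

```python
from typing import Any, Callable

# Hypothetical placeholder types: Model and Log stand for whatever
# representations the underlying discovery approach and accuracy
# measure operate on.
Model = Any
Log = Any
Objective = Callable[[Model, Log], float]

def search(candidates: list[Model], log: Log, objective: Objective) -> Model:
    """Return the candidate model with the highest objective value.
    Because the objective is a parameter, swapping in a new accuracy
    measure requires no change to the search procedure itself."""
    return max(candidates, key=lambda m: objective(m, log))

# Usage sketch (markovian_f_score stands for the measure used in this
# paper; a future, more scalable measure could be dropped in instead):
# best = search(models, log, objective=markovian_f_score)
```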

Another remarkable finding is that the metaheuristically optimized versions of Split Miner and Fodina consistently outperform their hyper-parameter-optimized counterparts. This means that the space of process models that can be explored by tweaking the input parameters (e.g., the noise filter threshold) is not as rich as the space of process models that can be generated by repeatedly perturbing the DFG.
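To illustrate what perturbing the DFG means here, the sketch below (hypothetical, not the paper's actual implementation) randomly adds or removes one edge of a directly-follows graph represented as a set of node pairs:

```python
import random

def perturb_dfg(edges: set[tuple[str, str]],
                nodes: list[str]) -> set[tuple[str, str]]:
    """Return a copy of the DFG's edge set with one edge randomly added
    or removed. Repeated perturbations of this kind explore a richer
    space of models than tuning a single input parameter. Note: the
    actual framework additionally constrains perturbations so that the
    underlying APDA can still discover a correct model; this sketch
    omits those constraints."""
    perturbed = set(edges)
    if perturbed and random.random() < 0.5:
        perturbed.discard(random.choice(sorted(perturbed)))  # remove an edge
    else:
        perturbed.add((random.choice(nodes), random.choice(nodes)))  # add an edge
    return perturbed
```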

Finally, we found that all four metaheuristics considered in the evaluation led to similar F-scores: the differences between the best-performing and the worst-performing metaheuristic are generally negligible, in the order of 1–2 percentage points. For Inductive Miner, the metaheuristics end up exploring the search space in a similar manner, leading to the same results in several cases. This may be explained by the fact that the set of possible models that Inductive Miner can generate is narrower than that of Fodina or Split Miner, because Inductive Miner can only generate block-structured models.

5 Conclusion

This paper showed that S-metaheuristics are a promising means of enhancing the accuracy of DFG-based automated process discovery approaches. The outlined approach takes advantage of the DFG's simplicity to define efficient perturbation functions that improve fitness or precision while preserving the structural properties required to ensure model correctness.

The evaluation showed that the metaheuristically optimized approaches consistently achieve higher accuracy than the corresponding unoptimized baselines (Split Miner, Fodina, and Inductive Miner (directly-follows)). This observation holds both when accuracy is measured via the Markovian F-score and via the alignment-based F-score. As expected, these accuracy gains come at the expense of higher execution times: the metaheuristics need to execute the baseline approach several hundred times and to measure the accuracy of each of the resulting process models. The evaluation also showed that the choice of optimization metaheuristic (among those considered in the paper) does not have a substantial effect on the accuracy of the resulting process models, nor on their size or complexity.

In its current form, the framework focuses on improving the F-score. In principle, the framework could also be used to optimize other objective functions, such as model complexity, measured for example by the number of edges or by control-flow complexity. Relatedly, the framework could be extended to optimize multiple dimensions simultaneously, using multi-objective (Pareto-front) optimization techniques instead of single-objective ones. Along the same lines, it may also be possible to adapt the framework to optimize one measure (e.g., F-score) subject to one or more constraints on other measures (e.g., that the number of edges in the discovered model must be below a given threshold), as sketched below. Lastly, another opportunity for future work is the automated selection of the best metaheuristic, without compromising the time performance of the framework.
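As a purely illustrative sketch of the constrained variant (hypothetical names, not part of the evaluated framework), an objective can rank infeasible candidates below every feasible one:

```python
def constrained_objective(f_score: float, num_edges: int, max_edges: int) -> float:
    """Optimize F-score subject to a bound on model size: candidates
    violating the edge-count constraint rank below all feasible ones."""
    return f_score if num_edges <= max_edges else float("-inf")
```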

Another limitation of the proposed approach is that the DFG perturbations employed in the optimization phase do not use the frequencies of the directly-follows relations (i.e., arc frequencies in the DFG are not used). In other words, the proposed approach makes two design choices: (i) the perturbations either add or remove an arc, but do not alter the frequency of an arc; and (ii) the decision as to which arcs to add or remove is not based on arc frequencies. The rationale for the first design choice is that modifying the arc frequencies would require a criterion for deciding by how much the frequencies should be altered, and such a criterion would depend on the way the underlying automated process discovery algorithm uses the arc frequencies. We opted not to do so in order to keep the optimization method independent of the underlying base algorithm. The rationale for the second choice is that the perturbations should have an element of randomness, so as to allow the metaheuristics to explore a wider subset of the search space. Poor perturbations are likely to lead to solutions with lower F-scores, which are eventually discarded by the metaheuristics, but a perturbation that leads to solutions with lower F-scores may later give rise to solutions with higher F-scores as the search unfolds. That said, it is possible that perturbations that remove arcs based on arc frequency might help the metaheuristics focus on areas of the search space with higher F-scores. A direction for future work is to explore other perturbation heuristics, including frequency-aware ones, as sketched below.
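A frequency-aware removal perturbation of the kind suggested above could, for instance, bias arc removal toward infrequent directly-follows relations. The following sketch assumes the DFG maps each arc to its frequency (hypothetical names; not part of the evaluated framework):

```python
import random

def remove_infrequent_arc(dfg: dict[tuple[str, str], int]) -> dict[tuple[str, str], int]:
    """Remove one arc, chosen with probability inversely proportional
    to its frequency: rare directly-follows relations tend to be removed
    first, while an element of randomness is preserved."""
    if not dfg:
        return dict(dfg)  # nothing to remove
    arcs = list(dfg)
    weights = [1.0 / dfg[a] for a in arcs]  # rarer arcs weigh more
    victim = random.choices(arcs, weights=weights, k=1)[0]
    return {a: f for a, f in dfg.items() if a != victim}
```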

A third limitation of the framework is that it only considers four S-metaheuristics. There is room for investigating further metaheuristics, such as variants of simulated annealing using different cooling schedules (see the sketch below). In a similar direction, this study could be extended to investigate the trade-offs between S-metaheuristics and P-metaheuristics in this setting.
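For instance, variants of simulated annealing differ mainly in their cooling schedule. The sketch below (illustrative only; not the exact implementation of the evaluated metaheuristics) contrasts geometric and linear cooling under the standard Metropolis acceptance rule:

```python
import math
import random

def accept(delta: float, temperature: float) -> bool:
    """Metropolis criterion: always accept an improving move; accept a
    worsening move (delta < 0) with probability exp(delta / T)."""
    return delta >= 0 or random.random() < math.exp(delta / temperature)

def geometric_cooling(t0: float, alpha: float, step: int) -> float:
    """Classic schedule: T decays by a constant factor (e.g., alpha = 0.95)."""
    return t0 * alpha ** step

def linear_cooling(t0: float, rate: float, step: int) -> float:
    """Alternative schedule: T decreases by a fixed amount per step."""
    return max(t0 - rate * step, 1e-9)  # keep T > 0 for accept()
```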

Finally, the evaluation highlighted scalability limitations of the Markovian precision measure on some datasets. These limitations are not specific to this precision measure: they also apply, sometimes to a larger extent, to other precision measures, including ETC precision and entropy-based precision [26]. More scalable precision measures are needed to make metaheuristic optimization more broadly applicable in the context of automated process discovery.