
1 Introduction

Process mining can be seen as the missing link between model-based process analysis (e.g., simulation and verification) and data-oriented analysis techniques such as machine learning and data mining [1]. It seeks the “confrontation” between real event data and process models (automatically discovered or hand-made). As process mining techniques mature, more ambitious types of analysis come into reach. Whereas classical process mining techniques focus on a single process, this paper focuses on comparing different processes using event data.

In [3], we proposed the notion of process cubes where events and process models are organized using different dimensions. Each cell in the process cube corresponds to a set of events that can be used to discover a process model, to check conformance, or to discover bottlenecks. The idea is related to the well-known OLAP (Online Analytical Processing) data cubes and associated operations such as slice, dice, roll-up, and drill-down [20]. However, there are also significant differences because of the process-related nature of event data. For example, process discovery based on events is incomparable to computing the average or sum over a set of numerical values. Moreover, dimensions related to process instances (e.g. male versus female students), subprocesses (e.g. group assignments versus individual assignments), organizational entities (e.g. students versus lecturers), and time (e.g. years or semesters) are semantically different and it is challenging to slice, dice, roll-up, and drill-down process mining results efficiently.

This paper focuses on comparative process mining using process cubes. We discuss the main challenges related to comparative process mining. To do this, we use a data set describing behavior of students taking the “Business Information Systems” (2II05) course given at Eindhoven University of Technology. The data set contains two types of events: (1) events generated by students watching recorded video lectures and (2) events generated by students making exams. To understand differences in behavior among different student groups we apply comparative process mining and sketch the possible dimensions of the process cube.

The remainder is organized as follows. In Sect. 2 we briefly introduce the process mining spectrum. Section 3 defines the process cube notion as a means to view event data from different angles. Then we discuss a concrete case study analyzing the way students watch recorded lectures and correlate this with exam results (Sect. 4). Section 5 lists the main requirements and open challenges. Related work is briefly discussed in Sect. 6. Section 7 concludes the paper.

Fig. 1.

Dotted chart showing all events related to the course Business Information Systems (2II05). The dotted chart was created using one of the over 600 ProM plug-ins (cf. www.processmining.org).

2 Process Mining

Process mining provides a powerful way to analyze operational processes based on event data. Unlike classical purely model-based approaches, process mining is driven by “raw” observed behavior instead of assumptions or aggregate data. Unlike classical data-driven approaches, it is truly process-oriented and relates events to high-level end-to-end process models.

Normally, event logs serve as the starting point for process mining. An event log can be viewed as a multiset of traces. Each trace describes the life-cycle of a particular case (i.e., a process instance) in terms of the activities executed. Often event logs store additional information about events, e.g., the resource (i.e., person or device) executing or initiating the activity, the timestamp of the event, or data elements recorded with the event.

Process mining has been applied in a wide variety of organizations (e.g., banks, hospitals, municipalities, governmental agencies, webshops, and high-tech system manufacturers). Moreover, there are dozens of process mining techniques answering a wide variety of questions [1]. Due to space restrictions we can only illustrate a fraction of the available process mining techniques. To do so, we use a concrete data set involving two types of events recorded for students that took the Business Information Systems (2II05) course at Eindhoven University of Technology from 2009 until 2014. A view event refers to a student watching a particular lecture. An exam attempt event refers to a student taking an exam. Since the course is quite challenging, it is not uncommon that students need to resit the 2II05 exam multiple times. There are at least two exams per year. The initial log contains 6744 events generated by 287 students.

Figure 1 shows a so-called dotted chart taking the viewpoint that each student corresponds to a case (i.e., process instance). The dotted chart has 287 horizontal lines each corresponding to a student. The dots along such a line define the corresponding trace. The color of the dot refers to the corresponding activity, e.g., viewing a particular lecture. The red dots in Fig. 1 refer to exam attempts. It can be seen that some students need to take multiple exams and that students tend to watch the video lectures irregularly. Note that the video lectures are an additional service to the students (i.e., next to regular lectures).

Let us zoom in on the group of 47 students taking the course in the period January 2011–August 2011. These students took the exam on 21-6-2011 and/or the retake exam on 16-8-2011. There were 24 lectures (two lectures per week) in the period from January until May 2011. Students could watch the lectures via an internet connection soon after recording. For example, activity “2II05 College 11b” refers to the second lecture in the 11th week. Figure 2 shows a process model discovered for this event log. Indeed students tend to watch the videos in chronological order. However, Fig. 2 only considers the most frequent paths. The actual process is more “Spaghetti-like”.

Fig. 2.

Process model created using Disco while abstracting from infrequent paths. There are 24 activities corresponding to the video lectures and one activity corresponding to an exam attempt. The coloring indicates the frequency of the activities.

We have made a process model having 24 viewing activities in sequence followed by an exam. Hence, the model only allows for the trace \(\langle \)“2II05 College 1a”, “2II05 College 1b”, “2II05 College 2a”, “2II05 College 2b”, \(\ldots \), “Exam”\(\rangle \). Using ProM we can check the conformance of such a model in various ways. For example, we can compute alignments that map the traces in the event log to valid paths in the model [5, 10]. Figure 3 shows four such alignments. Overall, the fitness is low (\(0.33\)) showing that despite the tendency to watch lectures in the expected order, few students do so consistently. Also the times at which students watch the videos show a lot of variation. For example, it can be noted that, other than just before the exam, students rarely watch video lectures in the second half of the course.

Fig. 3.

Alignments for four students showing that few students actually watch the lectures sequentially. The “synchronous moves” show where model and log agree. The “moves on model” show where the student skipped a video lecture. The “moves on log” show where the student watched a video lecture that was not next in line.
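To make the fitness notion concrete: against a strictly sequential model, an optimal alignment keeps a longest common subsequence of trace and model as synchronous moves. The sketch below computes a fitness value under these simplifying assumptions (unit costs for moves on log and on model); it is an illustration, not the general alignment algorithm of [5, 10].

```python
def lcs_length(trace, model):
    """Classic dynamic program for the longest common subsequence."""
    m = [[0] * (len(model) + 1) for _ in range(len(trace) + 1)]
    for i, a in enumerate(trace, 1):
        for j, b in enumerate(model, 1):
            m[i][j] = m[i - 1][j - 1] + 1 if a == b else max(m[i - 1][j], m[i][j - 1])
    return m[len(trace)][len(model)]

def trace_fitness(trace, model):
    """1 minus the alignment cost relative to the worst case (empty alignment)."""
    sync = lcs_length(trace, model)                    # synchronous moves
    cost = (len(trace) - sync) + (len(model) - sync)   # moves on log + moves on model
    worst = len(trace) + len(model)
    return 1 - cost / worst

# A student who watches lecture 1a, skips the rest, and takes the exam:
model = ["College 1a", "College 1b", "College 2a", "Exam"]
print(trace_fitness(["College 1a", "Exam"], model))  # ≈ 0.67 (2 sync moves, 2 moves on model)
```

A perfectly conforming student (trace equal to the model) obtains fitness 1; the more lectures are skipped or watched out of order, the lower the value.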

3 Process Cubes

Figures 1, 2 and 3 show some example results computed for a given event log, i.e., a collection of events. However, often the goal is to compare different variants of the same process or to zoom in on particular parts or aspects of the process. OLAP (Online Analytical Processing) tools aim to support this by organizing data in a cube having multiple dimensions [20]. However, classical OLAP is limited to numerical comparisons, e.g., the average transaction amount in different shops on different days of the week. In a process cube [3] we organize events using different dimensions and compute a (process) model per sublog associated to a cell. This way we can slice, dice, roll-up, and drill-down process mining results easily.

Throughout the paper we assume the following universes [3].

Definition 1

(Universes).  \(\mathcal{U}_V\) is the universe of possible attribute values (e.g., strings, numbers, etc.). \(\bot \in \mathcal{U}_V\) denotes a missing (“null”) value. \(\mathcal{U}_S = \mathcal {P}(\mathcal{U}_V)\) is the universe of value sets. \(\mathcal{U}_H = \mathcal {P}(\mathcal{U}_S)\) is the universe of value set collections (set of sets).

Note that \(v\in \mathcal{U}_V\) is a single value (e.g., \(v = 300\)), \(V\in \mathcal{U}_S\) is a set of values (e.g., \(V = \{ male , female \}\)), and \(H \in \mathcal{U}_H\) is a collection of sets. For example, \(H = \{ \{x\in \mathbb {N} \mid x < 50\}, \{x\in \mathbb {N} \mid 40 \le x < 80\}, \{x\in \mathbb {N} \mid x \ge 70\}\}\).

An event base is a “raw” collection of events having properties.

Definition 2

(Event Base, [3]). An event base \( EB = (E,P,\pi )\) defines a set of events \(E\), a set of event properties \(P\), and a function \(\pi \in P \rightarrow (E \rightarrow \mathcal{U}_V)\). For any property \(p\in P\), \(\pi (p)\) (denoted \(\pi _p\)) is a function mapping an event onto a value for property \(p\). If \(\pi _p(e)=v\), then event \(e \in E\) has a property \(p \in P\) and the value of this property is \(v \in \mathcal{U}_V\). We write \(\pi _p(e)=\bot \) for missing values.
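Definition 2 can be sketched in a few lines of Python. The encoding below is hypothetical (events as identifiers, `None` playing the role of the missing value \(\bot \)); it only serves to make the structure \( EB = (E,P,\pi )\) tangible.

```python
class EventBase:
    """A minimal sketch of an event base (E, P, pi)."""

    def __init__(self, events, properties, pi):
        self.E = set(events)       # set of events
        self.P = set(properties)   # set of event properties
        self._pi = pi              # property -> (event -> value)

    def pi(self, p, e):
        """Value of property p for event e; None encodes the missing value."""
        return self._pi.get(p, {}).get(e)

# Two example events with case/activity/time properties (values invented).
eb = EventBase(
    events={"e1", "e2"},
    properties={"case", "activity", "time"},
    pi={
        "case": {"e1": "student-1", "e2": "student-1"},
        "activity": {"e1": "2II05 College 1a", "e2": "Exam"},
        "time": {"e1": "2011-02-01", "e2": "2011-06-21"},
    },
)
print(eb.pi("activity", "e1"))  # 2II05 College 1a
```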

Independent of the event base \( EB \) we define the structure of the process cube. The structure is fully characterized by the dimensions of the cube.

Definition 3

(Process Cube Structure, [3]). A process cube structure is a triplet \( PCS = (D, type , hier )\) where:

  • \(D\) is a set of dimensions,

  • \( type \in D \rightarrow \mathcal{U}_S\) is a function defining the possible set of values for each dimension, e.g., \( type ( age ) = \{0,1,2, \ldots , 120\}\) for \( age \in D\), and

  • \( hier \in D \rightarrow \mathcal{U}_H\) defines a hierarchy for each dimension such that for any \(d\in D\): \( type (d) = \bigcup hier (d)\).

Note that a hierarchy is merely a collection of sets of values. To relate an event base and a process cube structure, both need to be compatible, i.e., dimensions should correspond to properties (\(D \subseteq P\)) and concrete event property values need to be of the right type (\(\pi _d(e) \in type (d)\) for any \(d\in D\) and \(e \in E\)). Moreover, for process mining we often assume that \(\{ case , activity , time , resource \} \subseteq D \subseteq P\), i.e., each event refers to a case and an activity, occurred at a particular time, and was executed by a particular resource. These properties do not need to be decided upfront and can be changed during analysis. For example, the notion of \( case \) may be changed to create another viewpoint (see Chap. 4 in [1]).
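The compatibility requirements can be stated as a small check. The sketch below assumes plain-dict encodings (`pi` maps property to `{event: value}`, `type_` maps dimension to its value set) and, for simplicity, that every event has a value for every dimension; all names are hypothetical.

```python
def compatible(E, P, pi, D, type_):
    """Check that dimensions are properties and values have the right type."""
    if not set(D) <= set(P):          # D must be a subset of P
        return False
    # every event's value for each dimension must lie in type(d)
    return all(pi[d].get(e) in type_[d] for d in D for e in E)

E = {"e1"}
P = {"case", "activity", "gender"}
pi = {"case": {"e1": "s1"}, "activity": {"e1": "Exam"}, "gender": {"e1": "male"}}
type_ = {"gender": {"male", "female"}}
print(compatible(E, P, pi, {"gender"}, type_))  # True
```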

Fig. 4.

Two example hierarchies, both having three levels. The left-hand-side hierarchy can be used to group events according to the patient’s age. The right-hand-side hierarchy can be used to group events according to the role of the resource performing the activity.

To clarify the notion of a process cube structure, let us consider an example \( PCS = (D, type , hier )\).

  • \(D = \{ patient\_id , type , age , activity , staff , time , \ldots \}\) defines the set of dimensions.

  • \( type \) is a function mapping each dimension onto a set of possible values:

    • \( type ( patient\_id ) = \{ 99000,99001, \ldots 99999\}\) is the set of patient identifiers (this value can be used as a case identifier),

    • \( type ( type ) =\{ gold , silver \}\) is the set of patient types (patients of type \( gold \) have a better insurance allowing for extra privileges),

    • \( type ( age ) =\{0,1,2, \ldots , 140, ? \}\) is the set of possible ages (value “\(?\)” denotes that the age of the patient is unknown),

    • \( type ( activity ) =\{ blood\_test , doctor\_visit , X\_ray , handle\_payment , \ldots \}\) is the set of activities,

    • \( type ( staff ) =\{ Peter , Sue , Ellen , Tom , \ldots \}\) is the set of resources (doctors, nurses, etc.), and

    • \( type ( time )\) is the set of possible timestamps.

  • \( hier \) is a function defining a hierarchy for each dimension:

    • \( hier ( patient\_id ) = \{\{ 99000, 99001, \ldots , 99999\}, \{99000\}, \{99001\}, \{99002\}, \ldots ,\{99999\}\}\). The element \(\{ 99000, 99001, \ldots ,99999\}\) can be seen as the root of the hierarchy, i.e., all possible values. The other singleton elements can be seen as the leaves of the hierarchy, e.g., the set \(\{99023\}\) represents one individual patient.

    • \( hier ( type ) = \{ \{ gold , silver \}, \{ gold \},\{ silver \}\}\). Set \(\{ gold , silver \}\) can be seen as the root of the hierarchy, i.e., all patient types. The singleton \(\{ gold \}\) refers to patients having special privileges and the singleton \(\{ silver \}\) refers to “normal” patients.

    • \( hier ( age ) = \{ \{0,1,2, \ldots , 140, ?\}, \{0,1,2, \ldots , 18\}, \{19,20, \ldots , 140\}, \{0\}, \{1\}, \{2\},\ldots , \{140\}, \{?\} \}\) defines an age hierarchy with three levels (see Fig. 4), e.g., the element \(\{0,1,2, \ldots , 18\}\) represents the subset “young”.

    • \( hier ( activity ) = \{ \{ blood\_test , doctor\_visit , X\_ray , handle\_payment , \ldots \},\) \(\{ blood\_test \}, \{ doctor\_visit \}, \ldots \}\) groups all activities in a simple hierarchy,

    • \( hier ( staff ) = \{ \{ Peter , Sue , Ellen , Tom , \ldots \}, \{ Peter , Sue , \ldots \}, \{ Ellen , Tom , \ldots \}, \{ Ellen , \ldots \}, \{ Peter \}, \{ Sue \}, \{ Ellen \}, \{ Tom \}, \ldots \}\) defines a role hierarchy as shown in Fig. 4. Staff member \( Ellen \) is both a doctor and a surgeon and appears in four subsets of \( hier ( staff )\).

    • \( hier ( time )\) can be used to group timestamps into years, months, weekdays, days, etc.

Figure 4 shows two hierarchies. The arcs are based on inclusion and can be interpreted as “is a”. In this paper we use a rather simplistic, but also general, notion of hierarchy. Only the possible elements of a dimension are mentioned and no further constraints are given. We do not specify which elements can be used at the same time (see also [3]). Normally, one will make the different levels explicit and only select elements of a given level. We also do not specify navigation rules and do not explicitly name the sets in a hierarchy. The names used in Fig. 4 (“all”, “young”, and “nurse”) are not formalized, but should of course be supported by software tools.
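The three-level age hierarchy from the example can be written down directly, together with the invariant of Definition 3 that \( type (d) = \bigcup hier (d)\). The encoding below (frozensets as value sets; labels such as “young” remain informal) is a hypothetical sketch.

```python
# Root, two intermediate groups, and 142 leaves of the age hierarchy.
all_ages = frozenset(range(0, 141)) | frozenset({"?"})
young = frozenset(range(0, 19))      # {0, ..., 18}
adult = frozenset(range(19, 141))    # {19, ..., 140}
leaves = {frozenset({a}) for a in all_ages}

type_ = {"age": set(all_ages)}
hier = {"age": {all_ages, young, adult} | leaves}

# Invariant from Definition 3: type(d) must equal the union of hier(d).
for d in type_:
    assert type_[d] == set().union(*hier[d])

print(len(hier["age"]))  # 145: root + two groups + 142 leaves
```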

Fig. 5.

Process cube view having three selected dimensions: \(D_{sel} = \{ type , age , staff \}\).

A process cube view defines which dimensions are visible and which events are selected.

Definition 4

(Process Cube View, [3]). Let \( PCS = (D, type , hier )\) be a process cube structure. A process cube view is a pair \( PCV = (D_{sel}, sel )\) such that:

  • \(D_{sel} \subseteq D\) are the selected dimensions,

  • \( sel \in D \rightarrow \mathcal{U}_H\) is a function selecting the part of the hierarchy considered per dimension. Function \( sel \) is such that for any \(d \in D\): \( sel (d) \subseteq hier (d)\).

Figure 5 shows an example of a process cube view. \(D_{sel} = \{ type , age , staff \}\) are the selected dimensions. \( sel ( type ) = \{\{ gold \},\{ silver \}\}\), \( sel ( age ) = \{ \{0,1,2, \ldots , 18\}, \{19,20, \ldots , 140\}, \{?\} \}\), and \( sel ( staff ) = \{ \{ Peter , Sue , \ldots \}, \{ Ellen , Tom , \ldots \}, \{ Ellen , \ldots \} \}\). Note that function \( sel \) returns a set of sets for each dimension. For the non-selected dimensions \(D\setminus D_{sel}\) we cannot see the value of \( sel \) in Fig. 5, but still these dimensions may have been used for filtering. For example, it may be that \( sel ( activity ) = \{ \{ blood\_test \} \}\), i.e., only events related to blood tests are considered. Moreover, it could be that \( sel ( time )\) is used to select events that happened in 2013 and 2014.

By removing elements from \(D_{sel}\), the number of dimensions is reduced. This is orthogonal to function \( sel \) which decides on the granularity and filtering of each dimension. Given an event base and a process cube view, we can compute an event log for every cell in the process cube.

Definition 5

(Materialized Process Cube View). Let process cube structure \( PCS = (D, type , hier )\) and event base \( EB = (E,P,\pi )\) be compatible. The materialized process cube for some view \( PCV = (D_{sel}, sel )\) of \( PCS \) is \(M_{ EB , PCV } = \{ (c, events (c)) \mid c \in cells \}\) with \( cells = \{ c \in D_{sel} \rightarrow \mathcal{U}_S \mid \forall _{d\in D_{sel}} \ c(d) \in sel (d)\}\) being the cells of the cube and \( events (c) = \{e \in E \mid \forall _{d\in D_{sel}} \ \pi _d(e) \in c(d) \ \wedge \ \forall _{d\in D} \ \pi _d(e) \in \bigcup sel (d)\}\) the set of events per cell.

The term “materialized” may be misleading; it is just used to express that, in order to apply standard process mining techniques, we need to create an event log (e.g., in XES format [28]) for every cell in the process cube view. Figure 5 highlights one of the 18 cells: this cell contains all events that relate to a “normal” (silver) patient of unknown age and were performed by a nurse. The events in this cell can be transformed into an event log and analyzed using process mining techniques. Moreover, results for different cells can be compared.
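Definition 5 can be sketched as follows, assuming each event is a dict from dimension name to value and \( sel \) maps every dimension to a set of frozensets (a hypothetical encoding chosen for brevity).

```python
from itertools import product

def materialize(events, d_sel, sel):
    """Return a sublog (list of events) for every cell of the view (d_sel, sel)."""
    result = {}
    for combo in product(*(sel[d] for d in d_sel)):
        cell = dict(zip(d_sel, combo))
        result[frozenset(cell.items())] = [
            e for e in events
            # the event must fall in the cell on every selected dimension...
            if all(e[d] in cell[d] for d in d_sel)
            # ...and pass the filter of every dimension, selected or not.
            and all(any(e[d] in V for V in sel[d]) for d in sel)
        ]
    return result

events = [{"type": "gold", "age": 10}, {"type": "silver", "age": 30}]
sel = {"type": {frozenset({"gold"}), frozenset({"silver"})},
       "age": {frozenset(range(0, 19))}}   # non-selected dimension used as filter
cells = materialize(events, ["type"], sel)
print(cells[frozenset({("type", frozenset({"gold"}))})])  # [{'type': 'gold', 'age': 10}]
```

Note that the silver cell ends up empty: its only candidate event has age 30 and is removed by the filter on the non-selected age dimension.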

Using the above definition we can formalize notions such as slice, dice, roll-up, and drill-down for process cubes.

Definition 6

(Slice, [3]). Let \( PCS = (D, type , hier )\) be a process cube structure and \( PCV = (D_{sel}, sel )\) a view of \( PCS \). For any \(d \in D_{sel}\) and \(V \in sel (d)\): \( slice _{d,V}( PCV ) = (D_{sel}', sel ')\) with \(D_{sel}' = D_{sel} \setminus \{d\}\), \( sel '(d) = \{ V \}\), and \( sel '(d') = sel (d')\) for \(d' \in D \setminus \{d\}\).

Through slicing, a dimension \(d\) is removed. At the same time one value set \(V\) is chosen for the removed dimension.

Fig. 6.

The process cube view after slicing based on dimension \( staff \).

For example, Fig. 6 shows \( slice _{ staff , \{ Peter , Sue , \ldots \}}( PCV )\), starting from the view in Fig. 5. This new process cube view is the result of slicing using dimension \( staff \) and value set \(\{ Peter , Sue , \ldots \}\), i.e., the \( staff \) dimension is no longer selected and only events performed by nurses are considered in the resulting view.
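The slice operator can be sketched as below, assuming a view is encoded as a pair of a list of selected dimension names and a dict mapping every dimension to a set of frozensets (all names hypothetical).

```python
def slice_view(d, V, d_sel, sel):
    """Hide dimension d, keeping only the events whose d-value lies in V."""
    assert d in d_sel and V in sel[d]
    new_sel = dict(sel)
    new_sel[d] = {V}   # d still filters events, but is no longer visible
    return [x for x in d_sel if x != d], new_sel

nurses = frozenset({"Peter", "Sue"})
d_sel = ["type", "age", "staff"]
sel = {"type": {frozenset({"gold"}), frozenset({"silver"})},
       "age": {frozenset(range(0, 19)), frozenset(range(19, 141))},
       "staff": {nurses, frozenset({"Ellen", "Tom"})}}

# Slice on the staff dimension, keeping only events performed by nurses.
d_sel2, sel2 = slice_view("staff", nurses, d_sel, sel)
print(d_sel2)  # ['type', 'age']
```

The original view is left untouched; the operator returns a new view, so users can navigate back and forth between views.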

Definition 7

(Dice, [3]). Let \( PCS = (D, type , hier )\) be a process cube structure and \( PCV = (D_{sel}, sel )\) a view of \( PCS \). Let \( res \in D_{sel} \not \rightarrow \mathcal{U}_H\) be a restriction such that for any \(d\in dom ( res )\): \( res (d) \subseteq sel (d)\). \( dice _{ res }( PCV ) = (D_{sel}, sel ')\) with \( sel '(d) = res (d)\) for \(d \in dom ( res )\) and \( sel '(d) = sel (d)\) for \(d \in D \setminus dom ( res )\).

Dicing does not remove a dimension, but restricts the value sets for one or more dimensions.

Fig. 7.

The result after dicing based on dimensions \( staff \) and \( age \).

Consider Fig. 5 again. Suppose that we would like to remove events referring to patients whose age is unknown and that we only consider events performed by nurses and doctors. Hence, \( dom ( res ) = \{ staff , age \}\) because these are the two dimensions we would like to dice. Moreover, \( res ( age ) = \{ \{0,1,2, \ldots , 18\},\{19, 20, \ldots , 140\} \}\) (note that \(\{?\}\) was removed) and \( res ( staff ) = \{ \{ Peter , Sue , \ldots \}, \{ Ellen , Tom , \ldots \} \}\) (note that the surgeon role was removed). The result is shown in Fig. 7.
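Dicing can be sketched in the same hypothetical encoding (a view as a list of selected dimensions plus a dict from dimension to a set of frozensets): the restriction \( res \) keeps the dimensions but shrinks their selections.

```python
def dice(res, d_sel, sel):
    """Restrict the value sets of the dimensions in dom(res)."""
    assert all(d in d_sel and res[d] <= sel[d] for d in res)
    new_sel = dict(sel)
    new_sel.update(res)          # restricted dimensions keep fewer value sets
    return list(d_sel), new_sel  # the dimensions themselves are unchanged

young = frozenset(range(0, 19))
adult = frozenset(range(19, 141))
unknown = frozenset({"?"})
sel = {"age": {young, adult, unknown}}

# Dice on the age dimension: drop the events of unknown age.
d_sel2, sel2 = dice({"age": {young, adult}}, ["age"], sel)
print(unknown in sel2["age"])  # False
```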

Definition 8

(Change Granularity, [3]). Let \( PCS = (D, type , hier )\) be a process cube structure and \( PCV = (D_{sel}, sel )\) a view of \( PCS \). Let \(d \in D_{sel}\) and \(H \in \mathcal{U}_H\) such that: \(H \subseteq hier (d)\) and \(\bigcup H = \bigcup sel (d)\). \( chgr _{d,H}( PCV ) = (D_{sel}, sel ')\) with \( sel '(d) = H\), and \( sel '(d') = sel (d')\) for \(d' \in D \setminus \{d\}\).

Drilling down is an example of an operator for changing the granularity of the cube, e.g., refining a year into months. However, it is also possible to make the view more coarse (roll-up).

Fig. 8.

The process cube view after changing the granularity of dimensions \( type \) and \( staff \). The \( type \) dimension was rolled up (made coarser) and the \( staff \) dimension was refined (drill-down to the level of individual staff members).

Figure 8 shows a process cube view that was obtained by changing the granularity of the original view depicted in Fig. 5. Function \( chgr _{d,H}\) was applied twice: the \( type \) dimension was coarsened and the \( staff \) dimension was refined. Events related to silver and gold patients have been merged, but the \( type \) dimension is still visible. Moreover, events are now related to individual staff members rather than roles. Figure 8 now uses the leaves of the \( staff \) hierarchy.
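Changing the granularity can be sketched as follows (again assuming a hypothetical encoding with frozensets as value sets): the new selection \(H\) must come from the hierarchy of \(d\) and must cover exactly the same values, so no events are gained or lost.

```python
def chgr(d, H, hier, d_sel, sel):
    """Replace the selection for dimension d by H (roll-up or drill-down)."""
    assert H <= hier[d]                                  # H taken from hier(d)
    assert set().union(*H) == set().union(*sel[d])       # same values covered
    new_sel = dict(sel)
    new_sel[d] = H
    return list(d_sel), new_sel

gold, silver = frozenset({"gold"}), frozenset({"silver"})
both = frozenset({"gold", "silver"})
hier = {"type": {both, gold, silver}}
sel = {"type": {gold, silver}}

# Roll up the type dimension: one coarse cell instead of two fine ones.
_, sel2 = chgr("type", {both}, hier, ["type"], sel)
print(len(sel2["type"]))  # 1
```

Drilling down is the same operation in the other direction, e.g., `chgr("type", {gold, silver}, hier, ["type"], sel2)` restores the original selection.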

Through slicing (cf. Fig. 6), dicing (cf. Fig. 7), and changing the granularity (cf. Fig. 8), we can change the process cube view in Fig. 5. At any point in time we can generate an event log per cell and compare the process mining results. To be able to apply process mining per cell, the classical requirements need to be satisfied, i.e., events need to be (partially) ordered (e.g., based on some timestamp), one needs to select a case identifier to correlate events and an event classifier to determine the activities. See [1, 3, 28] for more information on process mining, process cubes, and event logs.

Based on the definitions in this section we have developed an initial prototype (called ProCube) using the process mining framework ProM and the Palo OLAP toolset [31]. ProCube runs as a plug-in in ProM. The plug-in creates sublogs per cell on-the-fly and visualizes process models discovered using the fuzzy miner [26] and the heuristic miner [45], social networks derived using ProM’s social network miner [7], and dotted charts [42] computed per cell. The prototype has many limitations (too slow for high-dimensional process cubes, poor visualization of the results, and limited support for hierarchies), but it nicely illustrates the concepts. Currently, we are working on more mature software support for process cubes.

4 Video Lectures: A Case Study

The primary goal of the process cube notion is to facilitate comparison. To illustrate this, we return to the data set described in Sect. 2. At Eindhoven University of Technology various lectures are recorded and made available online. The Lecture Capturing System (LCS) used is Mediasite developed by SonicFoundry. This system is able to provide an audit trail, but does not provide any form of process mining. We also used a database with exam results to relate student performance to viewing behavior. Student names were replaced by anonymous identifiers before analysis. Our analysis builds on the PhD research of Pierre Gorissen who analyzed the viewing behavior of students using Mediasite by means of more traditional methods rather than process mining [25].

4.1 Data Available on Video Lectures and Exams

To understand the data available, consider the class diagram shown in Fig. 9. A course has a unique code and a name. The same course may be given multiple times, e.g., once or twice per year. Such a course instance has a start date and an end date; in between these dates, video lectures are recorded. Per course there are exams. Each exam refers to the last course instance given. Per course instance there are one or more corresponding exams (typically a regular exam and an additional exam for the students that failed). A student may use multiple exam attempts. Such an attempt refers to an exam on a particular day and a student. Students may view lectures, including those of earlier course instances. An atomic view refers to one student and one lecture. Per atomic view the interval watched is recorded, e.g., a student watches the first 14 min of “2II05 College 11b”. Also the time and date at which the student watches the fragment are recorded.

Fig. 9.

Overview of the raw data available.

4.2 Identifying Events

Students can view a lecture in smaller chunks, e.g., a student can fast forward 20 min, then watch 5 min, go back to the beginning and watch 10 min, etc. Therefore, the same student may generate hundreds of atomic views of the same lecture on the same day. Since we would like to have a model at the level of lectures, we add a new class named “View (derived)” in Fig. 10. Entities of this class correspond to compositions of atomic views. Attribute “nof_views” in Fig. 10 refers to the number of atomic views and “total_duration” is the sum of the durations of the corresponding atomic views. We also record the start time of the first and last atomic view of the lecture by that student on that day.
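The aggregation into “View (derived)” entities can be sketched as a grouping step: all atomic views of one student, one lecture, and one day collapse into a single derived view. The field names below are hypothetical, loosely following Fig. 9.

```python
from collections import defaultdict

def derive_views(atomic_views):
    """Aggregate atomic views per (student, lecture, date) into derived views."""
    groups = defaultdict(list)
    for v in atomic_views:
        groups[(v["student"], v["lecture"], v["date"])].append(v)
    return [{"student": s, "lecture": l, "date": d,
             "nof_views": len(vs),                               # number of atomic views
             "total_duration": sum(v["duration"] for v in vs),   # summed minutes watched
             "first_start": min(v["start"] for v in vs)}         # start of first fragment
            for (s, l, d), vs in groups.items()]

atomic = [
    {"student": "s1", "lecture": "11b", "date": "2011-05-01", "start": "10:00", "duration": 14},
    {"student": "s1", "lecture": "11b", "date": "2011-05-01", "start": "10:20", "duration": 5},
]
print(derive_views(atomic))
# one derived view with nof_views=2 and total_duration=19
```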

Fig. 10.

The class diagram with the two types of events considered: views and exam attempts.

Based on the available data we propose two types of events: exam attempts (entities of class “Exam_Attempt”) and views (entities of class “View (derived)”). Figure 10 shows all the properties of these events. These properties follow directly from the class model. For example, a view refers to a student and her properties (id, name, gender, and nationality) and to a video lecture and its properties (name and recording date). Because a video lecture belongs to a course instance and therefore also to a course, additional properties can be derived.

4.3 Defining the Process Cube Structure and Selecting Views

The two event types in Fig. 10 have many properties. When merging both event types in an event base \( EB = (E,P,\pi )\) we take the union of these properties. Each of the properties may serve as a dimension in the process cube structure \( PCS = (D, type , hier )\). If a course is composed of different parts, function \( hier \) can be used to reflect this. There is also a natural hierarchy for the time domain (years, months, weeks, days, etc.). It is also possible to group exams in a hierarchical manner: a course instance or course can be viewed as a collection of exams. This way we obtain a hierarchy consisting of three levels: individual exams, exams belonging to a course instance, and exams for a course. The video lectures can be grouped in a hierarchical manner as well.

Fig. 11.

Given a process cube view, the cells can be materialized. Subsequently, process mining techniques can be applied to the cell sublogs.

The process cube view \( PCV = (D_{sel}, sel )\) defines the cells that we would like to consider. We can slice the cube to focus on a particular course. For example we can select only events related to course “2II05” and remove the dimensions “Course.course_code” and “Course.course_name” from \(D_{sel}\). The process cube operators defined in Sect. 3 can be used to create the desired view \( PCV = (D_{sel}, sel )\). At any time the view \( PCV \) can be materialized resulting in an event log per cell (sublogs). Figure 11 illustrates the overall process. Once we have an event log per cell, we can apply any process mining technique to all cells and compare the results. This facilitates comparative process mining.

Recall that for process mining we often assume that \(\{ case , activity , time , resource \} \subseteq D\). \(\pi _{ case }(e)\) is the case associated to event \(e\). There are two obvious choices: a case is an exam attempt or a case is a student. If a student used three exam attempts and we assume the first notion, then the same student will generate three cases. Also other choices are possible, e.g., cases may also correspond to video lectures or course instances. To simplify the interpretation of the results in this paper, we assume that \(\pi _{ case }(e)\) equals the “Student.Student_id” event attribute in Fig. 10. For exam events we choose \(\pi _{ activity }(e)\) to be the string “Exam”. For view events we choose \(\pi _{ activity }(e)\) to be the title of the lecture, e.g., “2II05 College 11b”. For exam events we choose \(\pi _{ time }(e)\) to be the date and time of the exam. For view events we choose \(\pi _{ time }(e)\) to be the date and time of viewing. We actually have start and complete events for all activities and can thus measure the duration of activities. Resources seem less relevant here because the case is a student and hence \(\pi _{ resource }(e)\) refers to the case itself.
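The classifier choices above can be sketched as a small mapping from raw events to the \( case \)/\( activity \)/\( time \) view used in this paper. The field names (`kind`, `student_id`, `lecture_title`, `timestamp`) are hypothetical stand-ins for the properties in Fig. 10.

```python
def classify(e):
    """Map a raw event onto the case/activity/time attributes used here."""
    return {
        "case": e["student_id"],    # a case is a student
        "activity": "Exam" if e["kind"] == "exam_attempt" else e["lecture_title"],
        "time": e["timestamp"],
    }

print(classify({"kind": "exam_attempt", "student_id": "s1",
                "timestamp": "2011-06-21"})["activity"])   # Exam
print(classify({"kind": "view", "student_id": "s1",
                "lecture_title": "2II05 College 11b",
                "timestamp": "2011-05-01"})["activity"])   # 2II05 College 11b
```

Switching to another viewpoint (e.g., a case per exam attempt or per video lecture) only requires changing the `"case"` entry of this mapping.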

4.4 Analyzing Process Cube Views: Some Examples

Assume we have “sliced and diced” the process cube in such a way that we only consider the course instance of 2II05 running from January 2011 until August 2011. Moreover, \(D_{sel}\) contains only the dimensions gender (male or female) and nationality (Dutch or international). This results in four cells. Table 1 shows some results for this process cube view. Dutch students tend to watch the video lectures more frequently and are more likely to pass. Students that pass tend to watch the video lectures more frequently than the ones that do not pass. Note that Table 1 is based on just the 47 students that followed this particular course instance. Hence, based on these findings we cannot (and should not) generalize.

Table 1. Some numerical results based on a simple process cube view having only two selected dimensions.

Table 1 also shows the average trace fitness for each of the four cells. Conformance checking based on alignments with the idealized sequential process model (see Sect. 2) is used to compute these numbers. The average trace fitness is 1 if all the students watch all video lectures in the default order and conclude with an exam. The Dutch students that passed have the highest average trace fitness (0.39). International students and students that failed have a lower average trace fitness. The average trace fitness for all students that passed is 0.37. This is significantly higher than the average trace fitness for all students that did not pass (which is 0.28).

Any process mining algorithm can be applied to the materialized cells of a process cube view. Figure 12 shows dotted charts for each of the four cells also used in Table 1. It shows that the Dutch students that passed often took the first exam and passed immediately. They also watched the lectures right from the start. The students that did not pass often skipped the first exam or even made two unsuccessful attempts. It can also be seen that some students systematically watched the videos to prepare for the exam.

Fig. 12.

Four dotted charts based on the simple process cube view having two dimensions.

For each cell we can also discover process models using process mining techniques. Here, we compare the students that passed (Fig. 13) with the students that did not pass (Fig. 14), i.e., we only use one dimension for comparison. Figure 13 is based on 22 cases and Fig. 14 is based on 25 cases. Hence, it is not easy to generalize the results. However, there are obvious differences between both process models that confirm our earlier findings. Students that pass, tend to watch the lectures more regularly. For example, students that pass the course tend to start by watching the first lecture (see connection from the start node to “2II05 College 1a”), whereas students that fail tend to start by making the exam (see connection from the start node to “Exam”) rather than watching any video lectures.

Although the small number of students in the selected course instance does not allow for reliable generalizations, the case study nicely illustrates the applicability of process cubes for comparative process mining. Due to space restrictions, we could only present a fraction of the results and touched only a few of the available dimensions. For example, we also compared different course instances, investigated differences in study behavior between male and female students, etc.

5 Requirements and Challenges

The case study just presented nicely illustrates the usefulness of process cubes for comparative process mining. However, the case study also reveals severe limitations of the current approach and implementation.

5.1 Performance

The process cube notion is most useful if users can interactively “play with the cube” (slice, dice, drill-down, roll-up, etc.). However, this requires good performance. There are two potential performance problems: (1) it takes too long to materialize the cells selected for analysis (i.e., to create the event logs) and (2) it takes too long to compute the process mining results for all selected cells. Most process mining algorithms are linear in the number of cases and exponential in the number of activities or the average trace length. Hence, it may be worthwhile to precompute results. However, if there are many dimensions each having many possible values, then this is infeasible. In the latter case one also needs to deal with the sparsity problem [36]. Suppose that there are just 10 dimensions, each having just 10 possible values. Then there are \(10^{10}\) cells at the lowest level of granularity. This means that even if we have one million events, at least 99.99 % of the cells will be empty. This creates an enormous overhead if sparsity is not handled well by the implementation [31].
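The sparsity argument is simple arithmetic; the following back-of-the-envelope check makes it explicit:

```python
# Sparsity check from the text: 10 dimensions with 10 possible values each
# give 10**10 cells at the finest granularity.
n_cells = 10 ** 10
n_events = 10 ** 6  # one million events

# Even if every event landed in its own cell, this is the largest
# possible fraction of non-empty cells.
max_nonempty = n_events / n_cells
print(f"at most {max_nonempty:.4%} of the cells can be non-empty")
print(f"hence at least {1 - max_nonempty:.2%} of the cells are empty")
```

A naive dense array over all cells would thus waste essentially all of its storage, which is why OLAP implementations use sparse representations.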

5.2 Interpreting the Results: Comparing Graphs

Another difficulty is the interpretation of the results. The goal of showing multiple process mining results is to facilitate comparison. However, this is far from trivial, as illustrated by the four dotted charts in Fig. 12. How can differences and commonalities among multiple dotted charts be highlighted?

Process mining results are often presented as graphs. In earlier sections we presented a few discovered process models showing only the control-flow. Most discovery algorithms return process models that are graphs (Petri nets, BPMN models, transition systems, Markov chains, etc.). Moreover, many other types of mining algorithms create graphs, e.g., organizational graphs, social networks, or richer process maps also showing data flow and work distribution [1]. The layout of such graphs is typically not tailored for comparison. See for example Fig. 13, where the “Exam” activity appears in the upper half of the diagram, and Fig. 14, where the same activity appears at the bottom of the process map. To visualize such graphs for comparison, the nodes of the different graphs need to be aligned. Assuming that there is a mapping between the nodes of the graphs, e.g., based on activity labels, there are different ways of visualizing such graph alignments. One can use a “side-by-side” approach where the individual graphs are shown next to each other. To highlight the aligned nodes, one can try to give related nodes the same relative position. One can also create an “all-in-one” graph that is the union of all individual graphs and use coloring or annotations to facilitate the reconstruction of the individual graphs. Brasch et al. [15] propose a mixture of these two approaches (2.5D layouts). Here, the correspondence of aligned nodes is implied by drawing all 2D layouts simultaneously; the third dimension is used to place corresponding nodes on top of each other. The all-in-one approach seems most promising for a binary comparison of two models. Here, one can take one of the two graphs as a reference and then highlight the differences in a so-called comparison graph. For example, if the arcs in the individual graphs are annotated with durations, then the comparison graph can show the differences, highlighting parts that are faster or slower than the reference process.
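To make the comparison-graph idea concrete, the following sketch builds an “all-in-one” graph from two duration-annotated graphs: each arc is marked as occurring in both graphs (with the duration difference relative to the reference) or in only one of them. The graphs, activity names, and durations are invented, not taken from the case study.

```python
# Hedged sketch of an "all-in-one" comparison graph. Each input graph is a
# dict mapping (source, target) arcs to an average duration (in hours).

def comparison_graph(reference, other):
    """Union of the arcs of both graphs. Arcs present in both are annotated
    with the duration difference w.r.t. the reference (> 0 means slower);
    arcs present in only one graph are marked accordingly."""
    result = {}
    for arc in set(reference) | set(other):
        ref_d = reference.get(arc)
        oth_d = other.get(arc)
        if ref_d is not None and oth_d is not None:
            result[arc] = ("both", oth_d - ref_d)
        elif ref_d is not None:
            result[arc] = ("reference-only", None)
        else:
            result[arc] = ("other-only", None)
    return result

# Illustrative models for "passed" (reference) and "failed" students.
passed = {("start", "L1a"): 2.0, ("L1a", "Exam"): 100.0}
failed = {("start", "Exam"): 1.0, ("L1a", "Exam"): 150.0}

for arc, (status, delta) in sorted(comparison_graph(passed, failed).items()):
    print(arc, status, delta)
```

In a visualization, the `both` arcs could be colored by the sign of the delta, while the single-graph arcs are dashed, giving exactly the kind of reference-based highlighting described above.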

Fig. 13. Process model for the students that passed.

Fig. 14. Process model for the students that did not pass.

5.3 Refinements

Next to the challenges regarding performance and visualization, conceptual refinements of the process cube notion are also needed. We use the data set presented in Sect. 4 to illustrate these.

As Fig. 10 indicates, we have two very distinct types of events (exam attempts and views) having different properties. By taking the union over these properties, we get many missing values (i.e., \(\pi _p(e)=\bot \) for property \(p\) and event \(e\)). For example, an exam attempt event does not have a “nof_views” property and a view event does not have a “mark”. Moreover, one should make sure that the properties having the same name across different event types actually have the same semantics. For process mining, it makes no sense to only focus on events of one particular type. End-to-end processes may refer to a wide variety of activities. To address this problem one could extend the process cube notion into an array of process cubes (one for each type of event) with explicit relations between them. Alternatively, one can also define dimensions to be irrelevant for certain event types. For example, when slicing the process cube using the “mark” dimension (e.g. \( slice _{ mark ,\{8\}}( PCV )\)) one would like to retain all view events and only remove exam attempts having a different mark (e.g. not equal to 8). In the current formalization this can be mimicked by including events having a \(\bot \) value when doing a slice or dice action.
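The proposed slice semantics can be sketched as follows, with \(\bot\) modelled as an absent property. The event structure and property names are a simplified rendering of Fig. 10, not the actual implementation.

```python
# Sketch of the type-aware slice discussed above: slicing on "mark" removes
# exam-attempt events with a different mark but retains view events, for
# which "mark" is undefined (the bottom value, modelled as a missing key).
BOTTOM = object()  # stands for the undefined property value

def slice_events(events, dimension, values):
    """Keep events whose value for `dimension` is in `values`, as well as
    events for which the dimension is undefined."""
    kept = []
    for e in events:
        v = e.get(dimension, BOTTOM)
        if v is BOTTOM or v in values:
            kept.append(e)
    return kept

events = [
    {"type": "exam_attempt", "mark": 8},
    {"type": "exam_attempt", "mark": 5},
    {"type": "view", "nof_views": 3},  # no "mark" property
]
print(slice_events(events, "mark", {8}))
# keeps the mark-8 attempt and the view event, drops the mark-5 attempt
```

This mimics the behavior of \( slice _{ mark ,\{8\}}( PCV )\) described in the text: view events survive the slice because the dimension is irrelevant for their event type.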

When applying process mining, events need to refer to a case and an activity, and have a timestamp. Hence, \(\{ case , activity , time \} \subseteq D\). However, these are not ordinary dimensions. For example, we may alter the notion of a case during analysis. For instance, using the event data described in Fig. 10, we can first consider students to be the cases and later change the case notion into exam attempts. We can investigate the study progress of a group of students across different courses (i.e., a case should refer to a student). We can also investigate the results for a particular exam (i.e., a case refers to an exam attempt). Similarly, we can change the activity notion. For example, in our analysis we did not distinguish between different exams when naming activities (\(\pi _{ activity }(e) =\) “Exam”). We could also have opted for different activity names depending on the exam (date and/or course). The process cube should support various case and activity notions and not consider these to be fixed upfront.
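A configurable case notion can be sketched as a grouping function parameterized by both the property serving as case identifier and the activity naming. The property names (`student`, `lecture`) are illustrative, in the spirit of Fig. 10.

```python
# Sketch of a configurable case/activity notion: the same raw events are
# grouped into traces by whichever property is chosen as case identifier,
# and activity names are derived by a pluggable function.
from collections import defaultdict

def build_log(events, case_property, activity_fn):
    """Group events into cases by `case_property`; map each event to an
    activity name via `activity_fn`. Events are assumed sorted by time."""
    log = defaultdict(list)
    for e in events:
        log[e[case_property]].append(activity_fn(e))
    return dict(log)

events = [
    {"student": "s1", "type": "view", "lecture": "1a"},
    {"student": "s1", "type": "exam_attempt"},
    {"student": "s2", "type": "exam_attempt"},
]

# Case = student; coarse activity naming (all exam attempts become "Exam",
# as in the paper's analysis).
per_student = build_log(
    events, "student",
    lambda e: "Exam" if e["type"] == "exam_attempt" else e["lecture"])
print(per_student)  # {'s1': ['1a', 'Exam'], 's2': ['Exam']}
```

Changing `case_property` to, say, a hypothetical exam identifier, or refining `activity_fn` to include the exam date, would yield the alternative case and activity notions discussed above without touching the raw events.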

The same event may appear in multiple cells, e.g., organizational units may partially overlap and subprocesses may have common interface activities [3]. The notion of hierarchy should be as general as described in Sect. 3 to allow for this.

A case is defined as a collection of events. Hence, a case cannot have zero events: it would simply not appear in the process cube. This may lead to incorrect or misleading interpretations. On the one hand, some process mining algorithms cannot handle empty traces. On the other hand, empty traces also contain information; see, for example, the importance of empty traces in decomposed process mining [2].

Currently, all dimensions in \( PCV \) are event properties. However, some properties clearly reside at the case level. For example, a view event in Fig. 10 has as a property the gender of the student that viewed the lecture. However, this is a property of the case (i.e., the student) rather than of the event (i.e., the view). As a result, information is replicated across events. Hence, it may be useful to distinguish between case dimensions and event dimensions in a process cube.

Moreover, it is often useful to exploit derived case information, e.g., the number of failed exam attempts or the flow time. Generally speaking, processes are executed within a particular context [6], and events and cases may need to be enriched with derived context information. For an event \(e\), we may want to know the utilization level of the resources involved in the execution of \(e\) and insert this utilization level as another dimension. For a case, we may want to know its duration relative to other cases and insert this as a dimension at the case level.
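Enriching events with derived case-level information can be sketched as a preprocessing step: a case-level value is computed once per case and replicated onto its events so that it can serve as an extra dimension. The data and property names below are illustrative.

```python
# Sketch of deriving case-level dimensions (number of failed exam attempts,
# flow time) and replicating them onto the events of each case, so they can
# be used for slicing and dicing like ordinary event properties.

def enrich(cases):
    """cases: dict case_id -> list of event dicts with a numeric 'time' and,
    for exam attempts, a boolean 'passed'. Adds derived dimensions in place."""
    for events in cases.values():
        failed = sum(1 for e in events
                     if e.get("type") == "exam_attempt" and not e["passed"])
        times = [e["time"] for e in events]
        flow_time = max(times) - min(times)  # duration of the whole case
        for e in events:
            e["nof_failed_attempts"] = failed  # case-level dimension
            e["flow_time"] = flow_time         # case-level dimension
    return cases

cases = {
    "s1": [{"type": "view", "time": 0},
           {"type": "exam_attempt", "time": 10, "passed": False},
           {"type": "exam_attempt", "time": 40, "passed": True}],
}
enrich(cases)
print(cases["s1"][0]["nof_failed_attempts"], cases["s1"][0]["flow_time"])
```

The replication makes the redundancy discussed above explicit: a native notion of case dimensions would store each such value once per case instead of once per event.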

6 Related Work

See [1] for an introduction to process mining and the Process Mining Manifesto [27] for the main challenges in process mining. For example, dozens of process discovery [1, 8, 9, 13, 14, 18, 19, 21, 23, 24, 29, 30, 33, 41, 45, 46] and conformance checking [5, 10–12, 16, 22, 24, 34, 35, 39, 44] approaches have been proposed in the literature. This paper is not about new process mining techniques, but builds on the notion of process cubes introduced in [3]. Process cubes are related to the well-known OLAP (Online Analytical Processing) cubes [20] and large process model repositories [38].

7 Conclusion

As process mining techniques are maturing and more event data become available, we no longer want to restrict analysis to a single process. We would like to compare different variants of a process or different groups of cases. Organizations are interested in comparative process mining to understand how processes can be improved. We propose process cubes as a way to organize event data in a multi-dimensional data structure tailored towards process mining.

To illustrate the process cube concept, we used a data set containing partial information about the study behavior of students. When do they view video lectures? What is the effect of viewing these lectures on the exam results? Learning analytics is the broader field that aims to answer such questions. It is defined as “the gathering and analysis of and reporting on data relating to students and their environment for the purpose of gaining a better understanding of and improving education and the environment in which it is provided” [40]. The terms Learning Analytics (LA) and Educational Data Mining (EDM) are used interchangeably and both advocate the intelligent use of event data [17]. Although Pechenizkiy et al. explored the use of process mining in the context of LA and EDM, most of the existing analysis approaches are not process-centric [17, 37, 43]. We consider the comparison of the learning processes inside and between courses as an essential ingredient for a better understanding of the study behavior of students.

The results presented in this paper are only the starting point for a more comprehensive analysis of learning behavior. First of all, we would like to address the foundational challenges described in Sect. 5 (i.e., refining the process cube concept and improving performance for higher-dimensional cubes). Currently, major implementation efforts are ongoing to provide better support for process cubes. We are developing a solution embedded in ProM (work of Shengnan Guo) and a solution based on database technology calling ProM plug-ins from outside ProM (work of Alfredo Bolt). The latter solution is related to calling ProM plug-ins from other tools like RapidMiner [32] and KNIME. Next, we would like to apply the process cube notion to all video lectures recorded at Eindhoven University of Technology and offer such an in-depth analysis as a service for teachers and students. Finally, we would like to apply comparative process mining based on process cubes to new data sets. For example, we will use process cubes to analyze the Massive Open Online Course (MOOC) on “Process Mining: Data science in Action” [4]. Moreover, we envision that more and more data on learning behavior will become available from very different sources. This supports our quest to store event data in such a way that interactive analysis becomes possible. Of course, there are also serious privacy concerns. Students should be aware of what is recorded and be able to use it to their advantage. For teachers there is often no need to know the progress of individuals; often it is sufficient to understand the effectiveness of lectures and teaching material at a group level. Only when students give their consent should information be used at the level of individual students.