Discovery of Business Process Models from Incomplete Logs

Wang, Lili; Fang, Xianwen; Shao, Chifeng

doi:10.3390/electronics11193179


by Lili Wang ^1,2, Xianwen Fang ^1,2,* and Chifeng Shao

¹ College of Mathematics and Big Data, Anhui University of Science and Technology, Huainan 232001, China

² Anhui Province Engineering Laboratory for Big Data Analysis and Early Warning Technology of Coal Mine Safety, Huainan 232001, China

* Author to whom correspondence should be addressed.

Electronics 2022, 11(19), 3179; https://doi.org/10.3390/electronics11193179

Submission received: 10 September 2022 / Revised: 27 September 2022 / Accepted: 27 September 2022 / Published: 3 October 2022

(This article belongs to the Section Computer Science & Engineering)


Abstract:

The completeness of event logs and long-distance dependencies are two major challenges for process mining. Until now, most process mining methods have not been able to discover long-distance dependency and assume that the directly-follows relationship in the log is complete. However, due to the existence of high concurrency and the cycle, it is difficult to guarantee that the real-life log is complete regarding the directly-follows relationship. Therefore, process mining needs to be able to deal with incompleteness. In this paper, we propose a method for discovering process models including sequential, exclusive, concurrent, and cyclic structures from incomplete event logs. The method analyzes the co-occurrence class of the log and the model and then uses the technology of combining the behavior profile and co-occurrence class to obtain the communication behavior profile of the co-occurrence class. Furthermore, a method of constructing a substructure from the event log using the co-occurrence class is presented. Finally, the whole process model is built by combining those substructures. The experimental results show that the proposed method can discover process models with complex structures involving cycles from incomplete event logs and also can deal with long-distance dependency in the event log. Meanwhile, the discovered process model has a good degree of consistency with the original model.

Keywords:

Petri net; incomplete event log; long-distance dependency; behavior profile; degree of consistency

1. Introduction

The operation of an information system will generate a large number of event logs. These event logs record what activities will occur and in what order they occur in the business process. In addition, there is much other related information, such as executors and execution resources [1]. Process mining extracts valuable insights about processes from these event logs to analyze, monitor, and improve actual business processes. According to van der Aalst [2], process mining techniques can be divided into three directions: (1) process discovery, which uses event logs to generate process models; (2) conformance checking, which compares the behaviors generated by the model with the behaviors in the event log. The aim is to detect, locate, and explain deviations and to measure the severity of these deviations; and (3) process enhancement, which aims to improve or extend an existing process model using information about the actual process recorded in some event log. To avoid “spaghetti-like” process models, process discovery usually consists of three steps: first, preprocessing the noise behavior in the event log, then discovering the process model from the preprocessed event log, and finally evaluating the quality of the process model [3]. This paper does not consider log preprocessing steps and only focuses on how to discover process models from event logs.

In addition to noise, the real-life event log is often incomplete, that is, it is impossible to show all possible behaviors [4]. Most of the existing process mining algorithms assume that the log is weakly complete, i.e., they assume that the direct successor relationship is complete, the causal dependency is complete, or that the log is complete [5]. Obviously, the assumption that the log is complete is impossible in reality. One of the factors leading to incomplete logs is high concurrency or a cycle. In [6], the authors proposed the concept of a conjoint occurrence class, which solved the incompleteness caused by high concurrency and illustrated how to discover process models from incomplete logs. However, this method has difficulty coping with cycles and does not consider the relationship between the model-based conjoint occurrence class and the log-based conjoint occurrence class.

Based on [6], we introduce the novel concept of a model-based co-occurrence class to capture activities that always occur together in any executable trace of the model. We also analyze the relationship between the model-based co-occurrence class and the log-based co-occurrence class. We then elaborate on how to construct, besides sequential, exclusive, and concurrent structures, a cyclic structure containing multiple types of substructures. Finally, we present a method for discovering process models from incomplete event logs that combines the co-occurrence class with the behavior profile. The proposed method addresses long-distance dependencies and the incompleteness caused by cycles or high concurrency. The experimental results show that, when the log is incomplete, the proposed method can still build a process model that has a high degree of consistency with the original model.

In the following, Section 2 discusses related work. The problem statement is given in Section 3. Section 4 explains some basic knowledge about the co-occurrence classes of the log and the model. Section 5 describes the method of constructing a substructure using the co-occurrence class. Section 6 reports the experimental results. Conclusions and future work are presented in Section 7.

2. Related Work

Until now, a series of process mining methods have been proposed, mainly including methods based on the directly-follows relationship between activities [7], domain-based theory [8,9,10,11], genetic algorithms [12,13], frequency-based technology [14,15,16], cut-based technology [17], abstraction-based technology [18], and so on. Of these, the methods based on the directly-follows relationship between activities, such as the α algorithm [19] and its derivatives [20,21,22], can only find some subclasses of Petri nets and cannot guarantee soundness and fitness. The technique of domain-based theory mainly includes language-based domain miners [8], state-based domain miners [9], and integer linear programming (ILP) miners [10,11]. Although these methods guarantee the fitness of the model, they cannot guarantee the soundness of the model. The mining method based on the genetic algorithm iterates the process model until an acceptable quality is obtained. It selects the fittest individuals and generates new individuals using two genetic operators, crossover and mutation [12,13]. Although the process model obtained by this method has good fitness, it consumes a great deal of running time and cannot guarantee the soundness of the model.

The frequency-based method takes frequencies of events and sequences into account when constructing a process model, such as the Heuristic Mining (HM) algorithm [14], flexible heuristics miner (FHM) [15], or Fodina [16]. It calculates the dependency measure between activities and filters noise behavior by removing causal dependencies below a given threshold. This technique sometimes generates a process model with incorrect behaviors; meanwhile, soundness and fitness are not guaranteed. Cut-based technology, such as the Inductive Miner (IM) [17] and Inductive Miner-infrequent (IMi) algorithms [22], adopts the idea of divide and conquer and uses four cuts to recursively split the log into several sub-logs until it cannot find a cut. Subsequently, it constructs the sub-model from each sub-log. Finally, a block-structured process model can be built by merging all sub-models. Although the soundness of the process model is guaranteed, its precision is very low, that is, the discovered process model is over-generalized. Process discovery algorithms based on abstract techniques, such as fuzzy miner [18], produce models without executable semantics, so neither soundness nor fitness is guaranteed. In [23], state-of-the-art process discovery methods based on the split gateway discovery were presented. Split Miner combines a novel method to filter directly-follows graphs induced by event logs, with a method for identifying combinations of split gateways that accurately capture concurrency, conflict, and causal relations between neighbors in directly-follows graphs. Split Miner is the first method to guarantee that the discovered BPMN model is deadlock-free and is not restricted to producing a block-structured model. However, when the number of activities in the log is large, the time complexity of the algorithm is high.

At present, most process discovery methods can deal with noise [24,25,26,27,28]. Apart from noise, another problem of real event logs is incompleteness [29]. Since the event log only records the behaviors of the business process observed in the past, it cannot capture the behaviors that will occur in the future. Hence, the event log is usually incomplete. However, few algorithms are capable of handling the incompleteness of event logs. In [17], based on the IM algorithm, a process discovery algorithm handling incomplete event logs was proposed, which introduced probabilistic behavioral relations that are less sensitive to incompleteness. It adopted the divide-and-conquer approach in the same way as IM and searched for an optimal partition of activities by estimating probabilities of the relations between activities. However, the method proposed in [17] has difficulty in handling long-distance dependency and cannot handle incompleteness caused by high concurrency well. In [6], the authors presented conjoint occurrence classes to handle the incompleteness caused by the concurrency and cycles. This method can infer the causal and concurrent behavior not exhibited in the log. However, the method in [6] cannot handle loop structures and the long-distance dependency hidden in the log. In this work, log-based and model-based co-occurrence classes are provided to address incompleteness and long-distance dependency; we then propose a method for discovering a process model with multiple different structures, including cycles.

3. Problem Statement

At present, most process discovery methods cannot discover indirect dependencies between activities, such as long-distance dependencies [15] (for example, S1 and S2 in Figure 1). The behavior profile uses the weak order relationship between activities to describe the behavior dependency relationship among activities, involving strict order relation, exclusiveness relation, and interleaving order relation [30]. Among them, the exclusiveness relation is the strictest relationship, indicating that two activities cannot appear in any trace at the same time. However, the behavior profile only describes the behavior dependency between activities and cannot capture the actual occurrence dependency and causal dependency of activities in the actual business process, i.e., whether the occurrence of one activity will affect the occurrence of another activity and vice versa.

The strict order relationship only indicates that there is a sequence constraint between activities, but it cannot indicate whether an occurrence dependency between activities exists. As shown in Figure 2, activity A and activity B are in a strict order relationship. From Figure 2 and Table 1, we can see that the strict order relationship only indicates that if activities A and B both occur, then activity A must occur before activity B. Hence, the sequence constraint of activity A and activity B is captured, while the actual occurrence dependency between activity A and activity B cannot be captured.

The interleaving order relationship only indicates that there are no occurrence order constraints between activities, but it does not mandate that activities must be enabled at the same time. For example, activity A and activity B are in an interleaving order relationship in Figure 3. However, only activity A and activity B in Figure 3a are concurrent, which is also the most typical structural feature corresponding to the interleaving order relationship. Table 2 shows the occurrence order and occurrence dependency of activities A and B as shown in Figure 3. Table 2 indicates that the occurrence dependency constraint between activities is difficult to capture only according to the interleaving order relationship.

Another problem is that the event log represents only the behavior that has been observed during the actual operation of the system and cannot contain all possible behaviors of the system. Thus, the log is often incomplete, i.e., it may not contain enough information to discover the process model [29]. This might cause the discovered process model to exclude actual behaviors or to contain additional behaviors that are not in the real process. Existing process mining methods usually assume that either the activities are complete or the directly-follows relationships between activities are complete. However, under highly concurrent behavior, the second assumption is often unrealistic. For example, suppose a process consists of 10 concurrent activities and the log consists of about 10,000 cases. There are 10! = 3,628,800 possible sequences caused by concurrent behavior. Obviously, the number of cases in the log is far smaller than the number of possible sequences, so the log cannot be expected to sample all possible directly-follows relationships between activities. Many factors can make a log incomplete, but cycles and high concurrency are important ones. When the number of activities on concurrent branches is large, it is impossible for the event log to sample all possible sequences of activities. Figure 4a shows an initial process model $M_0$ in which each concurrent branch contains two activities. An incomplete event log $L_1$ is generated by the model $M_0$, where $L_1 = \{\langle x\,a\,b\,c\,d\,y \rangle, \langle x\,a\,c\,d\,b\,y \rangle, \langle x\,c\,a\,b\,d\,y \rangle, \langle x\,c\,a\,b\,d\,y \rangle\}$. The process models obtained by applying the IM algorithm and the HM algorithm to the event log are shown in Figure 4b,c. The precision of the model $M_1$ is 1, and the fitness is low. The fitness and precision of the model $M_2$ are both 1, but this model is overfitting. Therefore, incomplete logs caused by concurrency will lead to low-quality process models or even erroneous results.
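The incompleteness of $L_1$ can be made concrete with a small Python sketch. It assumes, based on the traces above and on the description of Figure 4a, that $M_0$ runs x, then the two branches a→b and c→d concurrently, then y; the helper names are illustrative and not part of the paper.

```python
from math import factorial

def directly_follows(traces):
    """Collect every pair (u, v) where v immediately follows u in some trace."""
    pairs = set()
    for t in traces:
        pairs.update(zip(t, t[1:]))
    return pairs

def interleavings(left, right):
    """All order-preserving merges of two sequences given as strings."""
    if not left or not right:
        return [left + right]
    return ([left[0] + rest for rest in interleavings(left[1:], right)] +
            [right[0] + rest for rest in interleavings(left, right[1:])])

# Log L_1 from the text (the duplicate trace is kept as given).
log_l1 = ["xabcdy", "xacdby", "xcabdy", "xcabdy"]

# Assumed shape of M_0: x, then (a -> b) in parallel with (c -> d), then y.
model_traces = ["x" + m + "y" for m in interleavings("ab", "cd")]

observed = directly_follows(log_l1)
possible = directly_follows(model_traces)
missing = sorted(possible - observed)

print(factorial(10))                  # orderings of 10 fully concurrent activities
print(len(possible), len(observed))   # directly-follows pairs: model vs. log
print(missing)                        # pairs the model allows but L_1 never shows
```

Under these assumptions, the log misses some directly-follows pairs that the model allows, which is the kind of gap that misleads miners relying on directly-follows completeness.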

To solve the above problems, a method for discovering the process model from incomplete logs is presented. We adopt the technology of combining behavioral profiles and co-occurrence classes to analyze the structural characteristics and communication behavior relationships of co-occurrence classes, then merge several co-occurrence classes into a larger substructure to find the process model. The research framework of the proposed method is shown in Figure 5. The proposed method solves the problem that behavior profiles cannot capture the occurrence dependency between activities and the incompleteness caused by high concurrency. The experimental results show that the proposed method can find long-distance dependency, and the discovered process model has a good consistency with the original model.

4. Construction Method of Co-Occurrence Class

In the actual business process, some activities may be mandatory nodes of the whole process, and the occurrence of some activities will lead to the subsequent occurrence of other activities, or the occurrence of some activities can be accompanied by the occurrence of some activities that precede it. Obviously, in addition to the constraints on the order of occurrence, there also exist occurrence dependencies between activities. Therefore, this section analyzes the invariant predecessor set and invariant successor set of activities to capture the co-occurrence invariant set of activities in the log or model. Activities are divided into several subsets with co-occurrence dependency in the log or model through the co-occurrence classes. Section 4.1 presents the method of discovering co-occurrence classes from the event log and analyzes the characteristics of co-occurrence classes. Section 4.2 provides the method of discovering co-occurrence classes from the model.

4.1. Discovery of Co-Occurrence Classes from Logs

The event log shows the actual execution of activities in the business process. By analyzing the occurrence dependencies between activities in each trace, the occurrence dependency between activities in the log can be found. In the following, the relevant definitions of trace handling operations are given.

Definition 1

(Trace handling operations [6]). Let $L$ be an event log and $A_L$ the activity set over $L$. For $\forall a \in A_L$, if there exists a trace $\sigma \in L$ such that $a \in \sigma$, where $|\sigma| = n$:

(1) $\sigma[k] = a$ indicates that the $k$-th element in the trace $\sigma$ is the activity $a$;

(2) $Pos(a, \sigma) = \{p, q, r, \dots, s\}$ such that $\sigma[p] = a, \dots, \sigma[s] = a$;

(3) The direct predecessor activity of the $i$-th activity in the trace $\sigma$ is defined as $Pred(i, \sigma) = \sigma[i - 1]$, $i \in \{2, 3, \dots, n\}$;

(4) The direct successor activity of the $i$-th activity in the trace $\sigma$ is defined as $Succ(i, \sigma) = \sigma[i + 1]$, $i \in \{1, 2, \dots, n - 1\}$;

(5) The invariant set of activities that appear before activity $a$ in trace $\sigma$ is denoted as $Predset(a, \sigma)$,

$$Predset(a, \sigma) = \begin{cases} \emptyset & \text{if } Pos(a, \sigma) = 1 \\ \bigcup_{i=2}^{p} Pred(i, \sigma) \cap \dots \cap \bigcup_{i=q+2}^{r} Pred(i, \sigma) \cap \bigcup_{i=r+2}^{s} Pred(i, \sigma) & \text{if } Pos(a, \sigma) = \{p, \dots, q, r, s\}, \text{ where } p < \dots < q < r < s \end{cases} \tag{1}$$

(6) The invariant set of activities that appear after activity $a$ in trace $\sigma$ is denoted as $Succset(a, \sigma)$,

$$Succset(a, \sigma) = \begin{cases} \emptyset & \text{if } Pos(a, \sigma) = n \\ \bigcup_{i=p}^{q-2} Succ(i, \sigma) \cap \bigcup_{i=q}^{r-2} Succ(i, \sigma) \cap \dots \cap \bigcup_{i=s}^{n-1} Succ(i, \sigma) & \text{if } Pos(a, \sigma) = \{p, q, r, \dots, s\}, \text{ where } p < q < r < \dots < s \end{cases} \tag{2}$$

For example, considering a trace $\sigma = \langle a\,b\,c\,d\,b\,c\,d\,e\,f \rangle$, using Definition 1, we obtain $\sigma[3] = c$, $Pos(b, \sigma) = \{2, 5\}$, $Pred(4, \sigma) = \sigma[3] = c$, $Succ(4, \sigma) = \sigma[5] = b$, $Predset(a, \sigma) = \emptyset$, $Predset(b, \sigma) = \{a\} \cap \{c, d\} = \emptyset$, $Succset(a, \sigma) = \{b, c, d, e, f\}$, and $Succset(b, \sigma) = \{c, d, e, f\} \cap \{c, d\} = \{c, d\}$.

$Predset(a, \sigma)$ consists of the activities that appear both before the first occurrence of the activity $a$ and between every two consecutive occurrences of $a$. $Succset(a, \sigma)$ consists of the activities that appear both between every two consecutive occurrences of $a$ and after the last occurrence of $a$. Therefore, $Predset(a, \sigma)$ and $Succset(a, \sigma)$ denote the sets of activities that must occur before the activity $a$ and after the activity $a$, respectively, in any trace $\sigma$. To capture the predecessor and successor activities that have co-occurrence dependencies with an activity in the log, Definition 2 gives the invariant predecessor set and the invariant successor set of the activity in the log.
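The trace handling operations of Definition 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: traces are represented as strings of single-character activities, positions are 1-based as in the definition, and the activity is assumed to occur in the trace.

```python
def pos(a, trace):
    """Pos(a, sigma): 1-based positions at which activity a occurs."""
    return [i for i, x in enumerate(trace, start=1) if x == a]

def pred(i, trace):
    """Pred(i, sigma) = sigma[i - 1], with 1-based i in {2, ..., n}."""
    return trace[i - 2]

def succ(i, trace):
    """Succ(i, sigma) = sigma[i + 1], with 1-based i in {1, ..., n - 1}."""
    return trace[i]

def predset(a, trace):
    """Intersect the activities before the first a with those between
    each pair of consecutive occurrences of a (Equation (1))."""
    result, start = None, 2
    for p in pos(a, trace):
        block = {pred(i, trace) for i in range(start, p + 1)}
        result = block if result is None else result & block
        start = p + 2
    return result or set()

def succset(a, trace):
    """Intersect the activities between each pair of consecutive a's
    with those after the last a (Equation (2))."""
    positions = pos(a, trace)
    bounds = positions[1:] + [len(trace) + 1]
    result = None
    for p, nxt in zip(positions, bounds):
        block = {succ(i, trace) for i in range(p, nxt - 1)}
        result = block if result is None else result & block
    return result or set()

sigma = "abcdbcdef"
print(pos("b", sigma), pred(4, sigma), succ(4, sigma))
print(predset("b", sigma), succset("a", sigma), succset("b", sigma))
```

Running it on the example trace reproduces the values listed in the text, for instance an empty `predset("b", sigma)` because the activities before the first b and those between the two b's do not overlap.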

Definition 2

((Log) Invariant predecessor set, Invariant successor set). Let $L$ be an event log and $A_L$ the activity set over $L$. For $\forall a \in A_L$, the invariant predecessor set of activity $a$ is defined as $InvPredS(a, L) = \bigcap_{\sigma \in L} Predset(a, \sigma)$, and the invariant successor set of activity $a$ is defined as $InvSuccS(a, L) = \bigcap_{\sigma \in L} Succset(a, \sigma)$.

The invariant predecessor set in Definition 2 represents the set of activities that always occur before the activity a in any trace of the log. Similarly, the invariant successor set represents the activity set that always occurs after the activity a in any trace of the log. The activities in the invariant predecessor set or the invariant successor set not only have a strict order relationship with the activity a but also have a co-occurrence relationship with the activity a. Definition 3 further gives the concept of a co-occurrence invariant set based on Definition 2.
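A possible way to compute the invariant sets of Definition 2 is sketched below in Python. Two points are assumptions of this sketch rather than statements from the paper: the intersection is taken only over traces that contain the activity (Definition 1 defines $Predset$ and $Succset$ only for such traces), and the per-trace sets are obtained by splitting the trace at every occurrence of the activity, which yields the same blocks as Equations (1) and (2). The log is $L_1$ from Section 3 with its duplicate trace dropped.

```python
def segments(a, trace):
    """Split the trace at every occurrence of a; return the pieces as sets."""
    pieces, current = [], set()
    for x in trace:
        if x == a:
            pieces.append(current)
            current = set()
        else:
            current.add(x)
    pieces.append(current)
    return pieces

def predset(a, trace):
    # Every piece except the last one lies before some occurrence of a.
    return set.intersection(*segments(a, trace)[:-1])

def succset(a, trace):
    # Every piece except the first one lies after some occurrence of a.
    return set.intersection(*segments(a, trace)[1:])

def inv_pred(a, log):
    """InvPredS(a, L), restricted to traces that contain a (assumption)."""
    sets = [predset(a, t) for t in log if a in t]
    return set.intersection(*sets) if sets else set()

def inv_succ(a, log):
    """InvSuccS(a, L), restricted to traces that contain a (assumption)."""
    sets = [succset(a, t) for t in log if a in t]
    return set.intersection(*sets) if sets else set()

log_l1 = ["xabcdy", "xacdby", "xcabdy"]
print(inv_pred("b", log_l1), inv_succ("b", log_l1))
print(inv_pred("c", log_l1), inv_succ("c", log_l1))
```

For activity b, only x and a precede it in every trace and only y follows it in every trace, even though c and d sometimes appear before or after it.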

Definition 3

((Log) Co-occurrence invariant set). Let $L$ be an event log and $A_L$ the activity set over $L$. For $\forall a \in A_L$, the co-occurrence invariant set of activity $a$ is denoted as $Oi(a, L) = InvPredS(a, L) \cup InvSuccS(a, L) \cup \{a\}$.

In Definition 3, $Oi(a, L)$ represents the set of activities that occur together with the activity $a$ in any trace of the log $L$. There is not only a behavior dependency but also a co-occurrence dependency between them.

For example, given an event log $L_2$ as shown in Table 3, the process model shown in Figure 6 is the original model of the log $L_2$. This paper takes the event log $L_2$ as an example to illustrate the solution process of our proposed method.

Table 4 presents the invariant predecessor set, invariant successor set, and co-occurrence invariant set for each activity in the event log $L_2$.

Since the co-occurrence invariant set represents the set of activities that appear together with the current activity in all traces of the log, this paper adopts the co-occurrence invariant set to capture the occurrence dependency between activities. To facilitate the discovery of a co-occurrence dependency among several activities, the activities in the log can be divided into several subsets with equivalent behavior by using the co-occurrence invariant set. Therefore, Definition 5 introduces the concept of co-occurrence classes, building on the co-occurrence matrix of Definition 4.

Definition 4

(Co-occurrence matrix). Let $L$ be an event log and $A_L$ the activity set over $L$, with $|A_L| = n$. The matrix $A_{n \times n}$ is called the co-occurrence matrix, where $A_{ij}$ represents the co-occurrence relation between $a_i, a_j \in A_L$: if $a_j \in Oi(a_i, L)$, then $A_{ij} = 1$; otherwise, $A_{ij} = 0$.

Definition 5

(Co-occurrence class). Let $L$ be an event log, $A_L$ the activity set over $L$, and $A_{n \times n}$ the co-occurrence matrix of the log $L$. For $a_p, a_q, a_r, \dots, a_s \in A_L$, an activity set $COC = \{a_p, a_q, a_r, \dots, a_s\}$ is called a co-occurrence class if and only if $A_{ij} = 1$, where $i = p, q, r, \dots, s$; $j = p, q, r, \dots, s$.

Definition 5 indicates that there exists a co-occurrence dependency between all activities in the co-occurrence class; that is, they always appear in a trace of the log together. Table 5 shows the co-occurrence relationship matrix of the event log.

Using Definition 5, we can get the co-occurrence classes from Table 5: $COCSet = \{COC_1, COC_2, COC_3, COC_4, COC_5, COC_6, COC_7, COC_8\}$, where $COC_1 = \{b, c\}$, $COC_2 = \{d, e\}$, $COC_3 = \{a, m, f, g, i, k\}$, $COC_4 = \{h, j\}$, $COC_5 = \{n, o\}$, $COC_6 = \{p\}$, $COC_7 = \{q\}$, $COC_8 = \{r\}$.

The co-occurrence class has some interesting properties. The relevant conclusions and proofs are given below.

Property 1

(Co-occurrence). Let $L$ be an event log and $COC_k$ a co-occurrence class obtained from the event log $L$. For $\forall a_i \in COC_k$, if $\exists \sigma \in L$ such that $a_i \in \sigma$, then for $\forall a_j \in COC_k \setminus \{a_i\}$, $a_j \in \sigma$ holds.

Proof.

If $\exists a_j \in COC_k \setminus \{a_i\}$ such that $a_j \notin \sigma$, then $a_j \notin Predset(a_i, \sigma)$ and $a_j \notin Succset(a_i, \sigma)$. So $a_j \notin InvPredS(a_i, L)$ and $a_j \notin InvSuccS(a_i, L)$. Obviously, $a_j \notin Oi(a_i, L)$. Therefore, $a_i, a_j$ cannot both belong to $COC_k$, which is a contradiction. □

Property 2

(Invariant order). Let $L$ be an event log and $COC_k = \{a_p, a_q, a_r, \dots, a_s\}$ a co-occurrence class obtained from the event log $L$. If $\exists \sigma_i \in L$ in which the activities $a_p, a_q, a_r, \dots, a_s$ satisfy the partial order relation $a_p \prec a_q \prec a_r \prec \dots \prec a_s$, then for $\forall a_i \in COC_k, \exists \sigma_j \in L$, if $a_i \in \sigma_j$, the activities $a_p, a_q, a_r, \dots, a_s$ also satisfy the partial order relation $a_p \prec a_q \prec a_r \prec \dots \prec a_s$ in the trace $\sigma_j$, i.e., if the activities in $COC_k$ occur in a trace, they occur in the same order.

Proof.

Assume there exist two activities that appear in two different traces in different orders, and let the two activities be $a_q, a_r$: $a_p \prec a_q \prec a_r \prec a_l \prec \dots \prec a_s$ holds in trace $\sigma_i$, but $a_p \prec a_r \prec a_q \prec a_l \prec \dots \prec a_s$ holds in trace $\sigma_j$. Thus, $InvPredS(a_q, L) = \{\dots, a_p\} \cap \{\dots, a_p, a_r\} \cap \dots$ and $InvSuccS(a_q, L) = \{a_r, a_l, \dots, a_s, \dots\} \cap \{a_l, \dots, a_s, \dots\} \cap \dots$. Clearly, $a_r \notin Oi(a_q, L)$. Therefore, it is impossible for $a_q, a_r$ to belong to the same co-occurrence class at the same time, which is a contradiction. □
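Properties 1 and 2 can be checked mechanically on a given log. The Python sketch below is illustrative only: the function names are hypothetical, traces are strings of single-character activities, and, as a simplification of this sketch, the order of a class inside a trace is taken from the first occurrence of each activity so that repeated activities do not produce spurious differences.

```python
def holds_co_occurrence(coc, log):
    """Property 1: every trace contains either all activities of the class or none."""
    return all(len(coc & set(t)) in (0, len(coc)) for t in log)

def first_occurrence_order(coc, trace):
    """Class activities in the order in which they first appear in the trace."""
    return tuple(dict.fromkeys(x for x in trace if x in coc))

def holds_invariant_order(coc, log):
    """Property 2: the class appears in one and the same order wherever it occurs."""
    orders = {first_occurrence_order(coc, t) for t in log if coc & set(t)}
    return len(orders) <= 1

log_l1 = ["xabcdy", "xacdby", "xcabdy"]
for coc in ({"x", "a", "b", "y"}, {"c", "d"}, {"a", "c"}):
    print(sorted(coc), holds_co_occurrence(coc, log_l1), holds_invariant_order(coc, log_l1))
```

On $L_1$, the set {a, c} passes the first check but fails the second, because a and c appear in both orders; this matches the argument in the proof of Property 2 that such a pair cannot form a co-occurrence class.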

Algorithm 1, Algorithm 2, and Algorithm 3, respectively, show how to obtain the invariant predecessor set, invariant successor set, and co-occurrence class of any activity from the log.

Algorithm 1: An invariant predecessor set of any activity

Algorithm 2: An invariant successor set of any activity

Algorithm 3: Getting co-occurrence classes from the event log

In Algorithm 1, Step 2 initializes the invariant predecessor set to an empty set, Steps 3–9 solve the invariant predecessor set of the activity in each trace, Steps 10–15 get the invariant predecessor set of the activity in the log, and Steps 11–14 indicate that once the invariant predecessor set becomes an empty set, the algorithm is stopped.

In Algorithm 2, Step 2 initializes the invariant successor set to an empty set, Steps 3–9 solve the invariant successor set of the activity in each trace, Steps 10–15 obtain the invariant successor set of the activity in the log, and Steps 11–14 indicate that once the invariant successor activity set becomes an empty set, the algorithm is stopped.

Steps 1–5 in Algorithm 3 are used to find the invariant set of any activity, Steps 7–11 initialize each element in the co-occurrence relationship matrix to 0, Steps 12–20 update the value of the co-occurrence relationship matrix according to the invariant activity set, Steps 22–25 set each activity to an initial co-occurrence class, and Steps 21–37 construct all possible co-occurrence classes by merging the initial co-occurrence classes according to the co-occurrence relationship matrix.
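Since the listing of Algorithm 3 is not reproduced here, the following Python sketch only follows the prose description above: compute $Oi$ for every activity, fill the co-occurrence matrix, start from singleton classes, and merge. The merge strategy (greedy first-fit in order of first appearance, requiring $A_{ij} = 1$ in both directions), the restriction to traces that contain the activity, and all function names are assumptions of this sketch, not the authors' algorithm.

```python
def segments(a, trace):
    """Split the trace at every occurrence of a; return the pieces as sets."""
    pieces, current = [], set()
    for x in trace:
        if x == a:
            pieces.append(current)
            current = set()
        else:
            current.add(x)
    pieces.append(current)
    return pieces

def co_occurrence_set(a, log):
    """Oi(a, L): a plus its invariant predecessors and successors."""
    preds, succs = [], []
    for t in log:
        if a not in t:          # assumption: only traces containing a count
            continue
        pieces = segments(a, t)
        preds.append(set.intersection(*pieces[:-1]))
        succs.append(set.intersection(*pieces[1:]))
    return set.intersection(*preds) | set.intersection(*succs) | {a}

def co_occurrence_matrix(log):
    """Return the activities and a dict with matrix[(a, b)] = 1 iff b is in Oi(a, L)."""
    acts = list(dict.fromkeys(x for t in log for x in t))  # first-appearance order
    oi = {a: co_occurrence_set(a, log) for a in acts}
    return acts, {(a, b): int(b in oi[a]) for a in acts for b in acts}

def co_occurrence_classes(log):
    """Greedy first-fit merge of singleton classes (assumed strategy)."""
    acts, m = co_occurrence_matrix(log)
    classes = []
    for a in acts:
        for coc in classes:
            if all(m[a, b] and m[b, a] for b in coc):
                coc.append(a)
                break
        else:
            classes.append([a])
    return classes

log_l1 = ["xabcdy", "xacdby", "xcabdy"]
print(co_occurrence_classes(log_l1))
```

Because the merge is greedy, the result can depend on the order in which activities are visited; a different tie-breaking rule could assign the shared activities x and y to the other branch.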

4.2. Discovering Co-Occurrence Classes from Models

The co-occurrence invariant relationship of activities in the log is closely related to their structural relationship in the model. To facilitate the discovery of their structural relationship in the model according to the co-occurrence invariant relationship of activities in the log, we present some definitions related to the invariance of activity occurring in the model.

Definition 6

(Execution sequence). An execution sequence $\varepsilon_P$ of the process model $N = (A, a_i, a_o, C, F, T)$ is defined as an executable sequence of the form $\{a_i\} \cdot A^{*} \cdot \{a_o\}$ from the start activity to the end activity of the process model. All execution sequences of the process model $N$ are denoted as $T(N)$.

Definition 7

((Model) Invariant predecessor set, Invariant successor set). Let $N$ be a process model, $A_N$ be the activity set of the model, and $T(N)$ be the set of executable sequences of the model. For $\forall a \in A_N$, the invariant predecessor set of activity $a$ is defined as $InvPredS(a, N) = \bigcap_{\sigma \in T(N)} Predset(a, \sigma)$, and the invariant successor set of activity $a$ is defined as $InvSuccS(a, N) = \bigcap_{\sigma \in T(N)} Succset(a, \sigma)$.

The set of invariant predecessors of activity a in model N represents the set of activities that occur before a in all executable sequences of the model N. Similarly, the set of invariant successors of activity a in the model represents the set of activities that occur after a in all executable sequences of the model N.

Definition 8

((Model) co-occurrence invariant set). Let $N$ be a process model and $A_N$ be the activity set of the model. For $\forall a \in A_N$, the co-occurrence invariant set of activity $a$ is defined as $Oi(a, N) = InvPredS(a, N) \cup InvSuccS(a, N) \cup \{a\}$.

In Definition 8, $Oi(a, N)$ represents the set of activities that always occur together with the activity $a$ in any execution sequence of the model.

Definition 9

((Model) co-occurrence class). Let $N$ be a process model and $A_N$ be the activity set of the model, where $|A_N| = n$, and let $A_{n \times n}$ be the co-occurrence relationship matrix of the model $N$. For $a_p, a_q, a_r, \dots, a_s \in A_N$, a set of activities $COC = \{a_p, a_q, a_r, \dots, a_s\}$ is called a co-occurrence class if and only if $A_{ij} = 1$, where $i = p, q, r, \dots, s$; $j = p, q, r, \dots, s$.

Clearly, Definition 9 implies that for $\forall a_i, a_j \in COC$, if $a_i \in Oi(a_j, N)$, then $a_j \in Oi(a_i, N)$, i.e., the activities in a co-occurrence class always occur together in an execution sequence.

According to Definition 9, Table 6 gives the invariant predecessor set, invariant successor set, and co-occurrence invariant set of each activity in the four basic structures (as shown in Figure 7). Table 7 presents co-occurrence classes of the four basic structures obtained from Table 6.

Table 7 shows that the exclusive structure and the cyclic structure have the same set of co-occurrence classes, the sequential structure has only one co-occurrence class that contains all activities, and the concurrent structure is decomposed into two co-occurrence classes according to the concurrent branch, one of which contains the and-gateway activities.

5. Building Substructures Using Log Co-Occurrence Classes

Section 4 presents the solution method for the co-occurrence invariant set of any activity in the log or model; then, the co-occurrence class is provided. This section further illustrates how to leverage the co-occurrence classes in the event log to build a model substructure that includes the sequence, conflict, concurrency, and cycle. Section 5.1 presents the method of constructing the basic substructure from a single co-occurrence class, and Section 5.2 addresses the communication behavior relationship between the co-occurrence classes and how to combine multiple co-occurrence classes into larger substructures.

5.1. Building the Basic Substructure from a Single Co-Occurrence Class

Properties 1 and 2 indicate that although activities in the co-occurrence class always occur simultaneously in the same order, the structural features of activities in the co-occurrence class are unknown. Theorems 1, 2, and 3, respectively, give their corresponding substructures according to the behavioral relationship of activities in the co-occurrence class in the log.

Theorem 1

(A substructure corresponding to the co-occurrence class with a directly-follows relationship). Let $L$ be an event log and $COC_k = \{a_{k_1}, a_{k_2}, \dots, a_{k_p}\}$. For $\forall \sigma \in L$, if $\sigma[i] = a_{k_1}, \sigma[i + 1] = a_{k_2}, \dots, \sigma[i + p - 1] = a_{k_p}$ holds, then $a_{k_1}, a_{k_2}, \dots, a_{k_p}$ form a sequential substructure.

Proof. Assume that the activities $a_{k_{1}}, a_{k_{2}}, \dots, a_{k_{p}}$ constitute a non-sequential substructure, i.e., a cycle, concurrency, or conflict substructure. From the characteristics of cycle, concurrency, and conflict substructures, it follows that the activities $a_{k_{1}}, a_{k_{2}}, \dots, a_{k_{p}}$ cannot belong to the same co-occurrence class. This contradicts the premise that they form the co-occurrence class $COC_{k}$. □

If activities in the co-occurrence class occur continuously in any trace, it can be known from Theorem 1 that there exists a causal dependence between these activities.
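The premise of Theorem 1 can be checked mechanically on a log. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the class is assumed to be given in its order of occurrence, and traces that contain no activity of the class are skipped (an assumption, since a class on a conflict branch does not occur in every trace). The sample traces are σ1, σ2, σ3, and σ7 from Table 3.

```python
from typing import List, Sequence


def occurs_consecutively(trace: Sequence[str], ordered_class: Sequence[str]) -> bool:
    """True if the ordered class appears as one contiguous block in the trace."""
    p = len(ordered_class)
    target = list(ordered_class)
    return any(list(trace[i:i + p]) == target for i in range(len(trace) - p + 1))


def is_sequential_class(log: List[Sequence[str]], ordered_class: Sequence[str]) -> bool:
    """Premise of Theorem 1: the class occurs consecutively in every relevant trace.

    Assumption: traces without any activity of the class are ignored.
    """
    relevant = [t for t in log if any(a in t for a in ordered_class)]
    return all(occurs_consecutively(t, ordered_class) for t in relevant)


# Four traces taken from Table 3 (sigma_1, sigma_2, sigma_3, sigma_7).
LOG = [
    "a b c f g h i j k m q".split(),
    "a b c f h g i j k m q".split(),
    "a d e f g h j i k m q".split(),
    "a f g h i j k b c m q".split(),
]

B_C_SEQUENTIAL = is_sequential_class(LOG, ["b", "c"])  # b is always directly followed by c
G_I_SEQUENTIAL = is_sequential_class(LOG, ["g", "i"])  # h can occur between g and i
```

In this sample, b and c satisfy the premise, whereas g and i do not, because h is interleaved between them in σ1.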

For the concurrent structure, according to the behavioral characteristics of the concurrent structure, different concurrent branches will produce different co-occurrence classes. For instance, the concurrent structure shown in Figure 7c produces two co-occurrence classes {x,a,b,y} and {c,d}. Since the order of activities on different concurrent branches does not have restrictions, the activities in a co-occurrence class may not necessarily occur consecutively in an executable trace. In Figure 7c, the activities in the co-occurrence class {x,a,b,y} do not occur consecutively in the executable sequence {xacbdy}. However, causal occurrence dependencies still exist between the activities from each concurrent branch. In the following, Theorem 2 describes the substructure characteristics corresponding to co-occurrence classes with concurrent relationships.

Theorem 2 (A substructure corresponding to the co-occurrence class with a concurrent relationship). Let L be an event log, $COC_{i} = \{a_{i_{1}}, a_{i_{2}}, \dots, a_{i_{p}}\}$, and $COC_{j} = \{a_{j_{1}}, a_{j_{2}}, \dots, a_{j_{q}}\}$. If $\exists a_{i_{k}} \in COC_{i}, \exists a_{j_{l}} \in COC_{j}$, $1 \leq k \leq p, 1 \leq l \leq q$, such that $a_{i_{k}} \parallel a_{j_{l}}$ holds, then the activities from $COC_{i}$ and the activities from $COC_{j}$ each form a sequential substructure.

Proof. As $\exists a_{i_{k}} \in COC_{i}, \exists a_{j_{l}} \in COC_{j}$, and $a_{i_{k}} \parallel a_{j_{l}}$ holds, $COC_{i}$ and $COC_{j}$ contain activities from different concurrent branches, respectively, and one of the branches should include the and-split and and-join activities. From the interleaving behavior relationship, we know that there is no ordering constraint between activities from two concurrent branches in a trace. According to Property 2, $a_{i_{1}} ≺ a_{i_{2}} ≺ \dots ≺ a_{i_{p}}$, $a_{j_{1}} ≺ a_{j_{2}} ≺ \dots ≺ a_{j_{q}}$ holds. Therefore, even if the activities in $COC_{i}$ or $COC_{j}$ do not occur consecutively in the trace, there still exists a causal dependency between the activities from each co-occurrence class. As a result, the activities from each concurrent branch constitute a sequential substructure, respectively. □
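The idea behind Theorem 2 can be illustrated by projecting each trace onto a co-occurrence class: although the branches interleave, the projection onto each class keeps one fixed order. The sketch below is only an illustration under stated assumptions. The helper names are hypothetical, each activity is assumed to occur at most once per trace, and apart from xacbdy (given in the text for Figure 7c) the traces are assumed executions of the same concurrent structure.

```python
from typing import List, Optional, Set, Tuple


def project(trace: str, cls: Set[str]) -> Tuple[str, ...]:
    """Keep only the activities of the class, preserving their order in the trace."""
    return tuple(a for a in trace if a in cls)


def class_order(log: List[str], cls: Set[str]) -> Optional[Tuple[str, ...]]:
    """Return the common order of the class over all traces, or None if orders differ."""
    projections = {project(t, cls) for t in log if any(a in cls for a in t)}
    return next(iter(projections)) if len(projections) == 1 else None


def interleaved(log: List[str], a: str, b: str) -> bool:
    """True if a precedes b in some trace and b precedes a in some other trace."""
    a_first = any(a in t and b in t and t.index(a) < t.index(b) for t in log)
    b_first = any(a in t and b in t and t.index(b) < t.index(a) for t in log)
    return a_first and b_first


# Assumed executions of the concurrent structure of Figure 7c.
LOG = ["xacbdy", "xcadby", "xabcdy", "xcdaby"]

ORDER_I = class_order(LOG, {"x", "a", "b", "y"})
ORDER_J = class_order(LOG, {"c", "d"})
A_C_INTERLEAVED = interleaved(LOG, "a", "c")
X_C_INTERLEAVED = interleaved(LOG, "x", "c")
```

Here a and c are interleaved, yet both projections are stable, which is the situation in which each class is treated as a sequential substructure.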

Theorem 2 shows that activities in concurrent branches constitute sequential substructures even if they do not occur consecutively in the trace. Theorem 3 addresses the substructure characteristics formed by activities in other co-occurrence classes.

Theorem 3 (A substructure corresponding to the co-occurrence class with a strict order relationship). Let L be an event log, $A_{L}$ be the activity set over L, and $COC_{k} = \{a_{k_{1}}, \dots, a_{k_{i}}, a_{k_{i + 1}}, \dots, a_{k_{p}}\}$. If $\exists σ \in L$ with $σ[k] = a_{k_{i}}, σ[l] = a_{k_{i + 1}}$ (where $l > k + 1$) and, for $\forall a_{j} \in A_{L} \setminus COC_{k}$, $a_{k_{i}}$ and $a_{j}$ are not in a concurrent relationship, then $a_{k_{1}}, \dots, a_{k_{i}}$ form a sequential substructure and $a_{k_{i + 1}}, \dots, a_{k_{p}}$ form a sequential substructure. Meanwhile, there exists a non-sequential substructure containing at least one visible activity between $a_{k_{i}}$ and $a_{k_{i + 1}}$.

Proof. As $σ[k] = a_{k_{i}}, σ[l] = a_{k_{i + 1}}$ with $l > k + 1$, there exists a trace $σ$ in which $a_{k_{i}}$ and $a_{k_{i + 1}}$ do not occur consecutively. For $\forall a_{j} \in A_{L} \setminus COC_{k}$, as $a_{k_{i}}$ and $a_{j}$ are not in a concurrent relationship, the activities in $COC_{k}$ do not belong to any concurrent branch. Thus, when the activities $a_{k_{i}}, a_{k_{i + 1}}$ do not occur consecutively, there exists at least one activity on the path from $a_{k_{i}}$ to $a_{k_{i + 1}}$, and this activity belongs to a co-occurrence class other than $COC_{k}$. This activity cannot be in a purely sequential relationship with $a_{k_{i}}$ and $a_{k_{i + 1}}$; otherwise, it would belong to the same co-occurrence class as $a_{k_{i}}$ and $a_{k_{i + 1}}$. Therefore, there is a non-sequential substructure containing at least one activity between $a_{k_{i}}$ and $a_{k_{i + 1}}$. According to Theorems 1 and 2, the remaining activities of $COC_{k}$ form sequential substructures, respectively. The activities in $COC_{k}$ form the substructure shown in Figure 8. □
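Locating the position at which a class must be split, as Theorem 3 describes, can be sketched as follows. This is a hedged illustration rather than the paper's procedure: the function names and the two-trace log are hypothetical, each activity is assumed to occur at most once per trace, and the non-concurrency condition of the theorem is assumed to have been verified separately.

```python
from typing import List, Sequence, Tuple


def split_points(log: List[str], ordered_class: Sequence[str]) -> List[int]:
    """Indices i where ordered_class[i] and ordered_class[i + 1] are separated
    by at least one other activity in some trace."""
    gaps = set()
    for trace in log:
        for i in range(len(ordered_class) - 1):
            left, right = ordered_class[i], ordered_class[i + 1]
            if left in trace and right in trace:
                if trace.index(right) - trace.index(left) > 1:
                    gaps.add(i)
    return sorted(gaps)


def split_class(ordered_class: Sequence[str], gap: int) -> Tuple[List[str], List[str]]:
    """Split the class into the two sequential parts before and after the gap."""
    return list(ordered_class[:gap + 1]), list(ordered_class[gap + 1:])


# Hypothetical log: s and x always occur first, y and e always occur last,
# and an exclusive choice between "ab" and "cd" sits between x and y.
LOG = ["sxabye", "sxcdye"]
COC = ["s", "x", "y", "e"]

GAPS = split_points(LOG, COC)
SEGMENTS = split_class(COC, GAPS[0])
```

The single gap between x and y marks where a non-sequential substructure (here the assumed choice between ab and cd) has to be placed.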

5.2. Merging Structures of Co-Occurrence Classes

Section 5.1 illustrates a method for discovering a substructure from a single co-occurrence class. This section presents the method for merging the structure of co-occurrence classes according to the communication behavior relationship between them to construct more complex substructures. Section 5.2.1 addresses the communication behavior relationship between co-occurrence classes, and Section 5.2.2 explains how to merge multiple structures of co-occurrence classes to discover more complex substructures.

5.2.1. Communication Behavior Relationship between Co-Occurrence Classes

According to the behavior dependencies of the activities from different co-occurrence classes in the log, the communication behavior dependencies between the co-occurrence classes can be obtained. These communication behavior relationships include a sequential communication relationship, exclusive communication relationship, concurrent communication relationship, and cycle communication relationship. The analysis of the communication behavior relationship between the co-occurrence classes provides a theoretical basis for constructing a larger substructure by combining the co-occurrence classes.

Property 3 (Sequential communication relationship). Let L be an event log, and let $COCSets = \{COC_{1}, COC_{2}, \dots, COC_{n}\}$ be a set of co-occurrence classes from event log L. If, for $\forall a \in COC_{i}$, $\forall b \in COC_{j}$, $a \to_{L} b$, $b ↛_{L} a$ holds, then $COC_{i} ⇝ COC_{j}$.

Proof. Assume $COC_{i} = \{a_{i_{1}}, a_{i_{2}}, \dots, a_{i_{p}}\}$ and $COC_{j} = \{a_{j_{1}}, a_{j_{2}}, \dots, a_{j_{q}}\}$. Then $a_{i_{1}} ≻ a_{i_{2}} ≻ \dots ≻ a_{i_{p}}$ and $a_{j_{1}} ≻ a_{j_{2}} ≻ \dots ≻ a_{j_{q}}$ hold; meanwhile, $a_{i_{p}} \to_{L} a_{j_{1}}$, $a_{j_{1}} ↛_{L} a_{i_{p}}$ also holds. Therefore, $COC_{i}$ and $COC_{j}$ are in a sequential relationship. □

Property 4 (Concurrent communication relationship). Let L be an event log, and let $COCSets = \{COC_{1}, COC_{2}, \dots, COC_{n}\}$ be a set of co-occurrence classes from the event log L. If $\exists a \in COC_{i}$, $\exists b \in COC_{j}$ such that $a \parallel b \land a \nparallel a \land b \nparallel b$ holds, then $COC_{i} \parallel COC_{j}$.

Proof. Assume $COC_{i} = \{a_{i_{1}}, a_{i_{2}}, \dots, a_{i_{p}}\}$, $COC_{j} = \{a_{j_{1}}, a_{j_{2}}, \dots, a_{j_{q}}\}$, and $a_{i_{1}} ≻ a_{i_{2}} ≻ \dots ≻ a_{i_{p}}$, $a_{j_{1}} ≻ a_{j_{2}} ≻ \dots ≻ a_{j_{q}}$ holds. If $\exists a \in COC_{i}$, $\exists b \in COC_{j}$ such that $a \parallel b$ holds, then there must exist an activity (called the and-split activity) whose firing puts a token on each branch that contains activity a or activity b. If and-split and and-join are visible activities, then, according to the characteristics of the concurrent structure, they must appear in the co-occurrence invariant set of activity a or b. Therefore, they must belong to the co-occurrence class of activity a or b. If and-split and and-join are invisible activities, then they do not appear in the co-occurrence class of activity a or b. Assume $a_{i_{1}}, a_{i_{p}}$ are the visible and-split and and-join activities, respectively. As $a \parallel b$, and $a_{i_{2}} ≻ \dots ≻ a ≻ \dots ≻ a_{i_{p - 1}}$, $a_{j_{1}} ≻ \dots ≻ b ≻ \dots ≻ a_{j_{q}}$ holds, $COC_{i} \setminus \{a_{i_{1}}, a_{i_{p}}\} \parallel COC_{j}$, which we denote as $COC_{i} \parallel COC_{j}$. □

Property 5 (Exclusive communication relationship). Let L be an event log, and let $COCSets = \{COC_{1}, COC_{2}, \dots, COC_{n}\}$ be a set of co-occurrence classes from the event log L. If $\exists a \in COC_{i}$, $\exists b \in COC_{j}$ such that $a + b$ holds, then $COC_{i} + COC_{j}$.

Proof. Obviously, it is true. □

Property 6 (Cycle communication relationship). Let L be an event log, and let $COCSets = \{COC_{1}, COC_{2}, \dots, COC_{n}\}$ be a set of co-occurrence classes from the event log L. If $\exists a \in COC_{i}$, $\exists b \in COC_{j}$ such that $a \parallel b \land a \parallel a \land b \parallel b$ holds, then $COC_{i} ⥀ COC_{j}$.

Proof. As $a \parallel b$, $a ≻ b$ and $b ≻ a$ hold. Meanwhile, as $a \parallel a, b \parallel b$, $a ≻ a, b ≻ b$ hold. Therefore, the activities from $COC_{i}$ and $COC_{j}$ are in a cycle relationship. □
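Properties 3 to 6 can be read as a decision procedure over pairs of activities drawn from two co-occurrence classes. The sketch below is an illustration under explicit assumptions, not the authors' algorithm: the relations are derived from a weak order (a ≻ b if a occurs somewhere before b in at least one trace), strict order stands in for $\to_{L}$, the properties are tested in the fixed order cycle, concurrent, exclusive, sequential, and all function names and sample logs are hypothetical.

```python
from itertools import product
from typing import Iterable, List, Set, Tuple


def weak_order(log: List[str]) -> Set[Tuple[str, str]]:
    """Pairs (a, b) such that a occurs before b in at least one trace."""
    rel = set()
    for trace in log:
        for i, a in enumerate(trace):
            for b in trace[i + 1:]:
                rel.add((a, b))
    return rel


def classify(log: List[str], coc_i: Iterable[str], coc_j: Iterable[str]) -> str:
    """Communication relationship between two classes, following Properties 3-6."""
    w = weak_order(log)

    def inter(a: str, b: str) -> bool:
        # Interleaving: both orders are observed (a == b means a repeats in a trace).
        return (a, b) in w and (b, a) in w

    pairs = list(product(coc_i, coc_j))
    if any(inter(a, b) and inter(a, a) and inter(b, b) for a, b in pairs):
        return "cycle"        # Property 6
    if any(inter(a, b) and not inter(a, a) and not inter(b, b) for a, b in pairs):
        return "concurrent"   # Property 4
    if any((a, b) not in w and (b, a) not in w for a, b in pairs):
        return "exclusive"    # Property 5
    if all((a, b) in w and (b, a) not in w for a, b in pairs):
        return "sequential"   # Property 3
    return "unknown"


SEQ = classify(["abcd"], "ab", "cd")
CONC = classify(["xacbdy", "xcadby"], "xaby", "cd")
EXCL = classify(["xaby", "xcdy"], "ab", "cd")
CYC = classify(["xaby", "xabcdabcdaby"], "ab", "cd")
```

Testing the cycle condition first matters, because activities inside a loop also satisfy $a \parallel b$ and would otherwise be reported as concurrent.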

5.2.2. Combining Multiple Co-Occurrence Classes to Find More Complex Substructures

According to the communication behavior relationship between co-occurrence classes, the method of constructing multiple substructures from co-occurrence classes is provided below. These substructures include concurrency, conflict, sequence, and loop.

If $\exists b \in A_{L}, \exists COC_{i} \in COCSets$ such that, for $\forall a \in COC_{i}$, $b \to a$ holds, then this is abbreviated as $b \to COC_{i}$. If, for $\forall a \in COC_{i}$, $a \to b$ holds, then this is abbreviated as $COC_{i} \to b$. Similarly, for $\parallel$ and +, the abbreviations are $b \parallel COC_{i}$ and $b + COC_{i}$.

(1) Concurrent substructure: Suppose $COC_{i} \parallel COC_{j}$. If there exist activities $x_{1}, x_{2}, \dots, x_{i}, \dots, y_{1}, \dots, y_{k} \in COC_{i}$, where $x_{1} > x_{2} > \dots > x_{i}$, $y_{1} > y_{2} > \dots > y_{k}$, $x_{i} ≻ y_{1}$, and $x_{i} \to COC_{j}, COC_{j} \to y_{1}$ holds, then $x_{i}$ and $y_{1}$ correspond to the and-split and and-join activities, respectively. If there is no $x_{i}$ or $y_{1}$ such that $x_{i} \to COC_{j}$ or $COC_{j} \to y_{1}$, then the and-split or and-join is a hidden transition. Meanwhile, if $a \parallel b$ holds for $\forall a \in COC_{i} \setminus \{x_{1}, x_{2}, \dots, x_{i}, y_{1}, \dots, y_{k}\}$, $\forall b \in COC_{j}$, then the typical substructure of $COC_{i}$ and $COC_{j}$ is as shown in Figure 9.

(2) Conflict substructure: If $COC_{i} + COC_{j}$, then $COC_{i}$ and $COC_{j}$ each constitute one branch of the conflict substructure.

(3) Loop substructure: For loop structures, the key is to distinguish the do and redo part in a loop structure. Firstly, the characteristics of three different loop substructures are analyzed, and the invariant predecessor set, invariant successor set, co-occurrence invariant set, and co-occurrence class of each activity in different loop substructures are obtained. Then, different loop substructures are obtained according to the characteristics of the invariant predecessor set, the invariant successor set, the co-occurrence invariant set, and the co-occurrence class of the activities in the log.

When the loop body is a sequential structure, a typical structure is shown in Figure 10, and Table 8 shows the invariant predecessor (or successor) set and co-occurrence invariant set of each activity in the loop substructure.

Co-occurrence classes obtained from Table 8 are $COCSets = \{\{x, y\}, \{a, b\}, \{c, d\}\}$.

Corollary 1. Let L be an event log, let $COC_{i}, COC_{j}$ be co-occurrence classes from the event log L, and let $COC_{i} ⥀ COC_{j}$. If, for $\forall a \in COC_{i}$, $InvPredS(a) \cap InvSuccS(a) = COC_{j}$ holds, then $COC_{i}$ is the redo substructure of the loop structure, while $COC_{j}$ is the do substructure of the loop structure.

When the loop body is an exclusive structure, a typical structure is shown in Figure 11, and Table 9 shows the invariant predecessor (or successor) set and co-occurrence invariant set of each activity in the loop substructure.

Co-occurrence classes obtained from Table 9 are as follows: $COCSets = \{\{x, y\}, \{a, b\}, \{c, d\}, \{e, f\}\}$.

Corollary 2. Let L be an event log, let $COC_{i}, COC_{j}, COC_{k}$ be co-occurrence classes from the event log L, and let $COC_{i} ⥀ COC_{j} ⥀ COC_{k}$. If the following two conditions hold:

(1) for $\forall σ \in L$, $Occur(COC_{i}, σ) + Occur(COC_{j}, σ) = Occur(COC_{k}, σ) + 1$;

(2) $\exists σ \in L$ such that $Occur(COC_{i}, σ) = 0, Occur(COC_{j}, σ) > 0$ or $Occur(COC_{j}, σ) = 0, Occur(COC_{i}, σ) > 0$,

where $Occur(COC_{k}, σ)$ represents the number of times the sequence formed by $COC_{k}$ appears in the trace σ, then $COC_{i}, COC_{j}$ are the do part of the loop structure, with $COC_{i}$ and $COC_{j}$ in an exclusive substructure, and $COC_{k}$ is the redo part of the loop structure.
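The two conditions of Corollary 2 can be evaluated directly once $Occur$ is computable. The following minimal sketch makes several assumptions that are not stated in the paper: $Occur$ is implemented as the number of non-overlapping contiguous occurrences of the ordered class in a trace, the function names are hypothetical, and the log is an invented example of a loop whose do part is a choice between ab and cd and whose redo part is ef.

```python
from typing import List, Sequence


def occur(ordered_class: Sequence[str], trace: Sequence[str]) -> int:
    """Number of non-overlapping contiguous occurrences of the class in the trace."""
    seq, events = list(ordered_class), list(trace)
    n, count, i = len(seq), 0, 0
    while i <= len(events) - n:
        if events[i:i + n] == seq:
            count += 1
            i += n          # jump past the matched occurrence
        else:
            i += 1
    return count


def is_exclusive_loop_body(log: List[str], coc_i: str, coc_j: str, coc_k: str) -> bool:
    """Check conditions (1) and (2) of Corollary 2 for do parts coc_i, coc_j and redo coc_k."""
    cond1 = all(occur(coc_i, t) + occur(coc_j, t) == occur(coc_k, t) + 1 for t in log)
    cond2 = any(
        (occur(coc_i, t) == 0 and occur(coc_j, t) > 0)
        or (occur(coc_j, t) == 0 and occur(coc_i, t) > 0)
        for t in log
    )
    return cond1 and cond2


# Hypothetical executions: x, then (ab | cd), optionally ef and another round, then y.
LOG = ["xaby", "xcdy", "xabefcdy", "xcdefcdefaby"]

DO_AB_CD_REDO_EF = is_exclusive_loop_body(LOG, "ab", "cd", "ef")
DO_AB_EF_REDO_CD = is_exclusive_loop_body(LOG, "ab", "ef", "cd")
```

Condition (1) captures that every redo is followed by exactly one more execution of the do part, which is why swapping the roles of cd and ef makes the check fail on the trace xcdy.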

When the loop body is a concurrent structure, a typical structure is shown in Figure 12, and Table 10 shows the invariant predecessor (or successor) set and co-occurrence invariant set of each activity in the loop substructure.

Co-occurrence classes obtained from Table 10 are $COCSets = \{\{x, a, b, y\}, \{c, d\}, \{e\}\}$.

Corollary 3. Let L be an event log, let $COC_{i}, COC_{j}, COC_{k}$ be co-occurrence classes from the event log L, and let $COC_{i} ⥀ COC_{j} ⥀ COC_{k}$. If the following two conditions hold:

(1) for $\forall σ \in L$, $\forall a \in COC_{k}$, $InvPredS(a) \cap InvSuccS(a) = COC_{i} \cup COC_{j}$;

(2) for $\forall σ \in L$, $Occur(COC_{i}, σ) = Occur(COC_{j}, σ)$,

then $COC_{i}, COC_{j}$ are the do part of the loop structure, with $COC_{i}$ and $COC_{j}$ in a concurrent substructure, while $COC_{k}$ is the redo part of the loop structure.

Definition 10. [Merge activities in relation → [6]] When the left activities in two → relations are the same ($a \to b$, $a \to c$), two possible substructures can be created.

(a) The places of the → relations are merged into a single one if and only if $(b ↛ c \lor c ↛ b)$. This is denoted as $[a, b + c]$.

(b) The places of the → relations are not merged if and only if $(b \to c \lor c \to b)$. This is denoted as $[a, b | c]$.

Similarly, for → relations when the right activity is common, i.e., $(b \to a \lor c \to a)$, the substructure created will be either $[b | c, a]$ or $[b + c, a]$ accordingly. In general, a set of → relations in the form $(a \to b, a \to c, \dots, a \to d)$ may produce either $[a, b + c + d]$ or $[a, b | c | d]$. Consequently, the merging can be applied to composed relations, i.e., $[b + c, a]$ and $[b + c, d]$ lead to $[b + c, a + d]$.

Algorithm 4 presents the construction steps for discovering process models from event logs using the methods in Section 5.1 and Section 5.2. Step 1 constructs the co-occurrence class from the event log, Steps 2–4 use the method in Section 5.1 to construct the corresponding substructure for each co-occurrence class, Steps 5–7 use the method in Section 5.2.1 to obtain the communication behavior relationship between any two co-occurrence classes, Steps 8–13 use the method proposed in Section 5.2.2 to merge several co-occurrence classes into a larger sub-module, and Step 14 further merges the remaining sub-modules into a complete process model through the directly-follows relationship between activities.

Algorithm 4: Construct Petri net from the event log

6. Experiment Evaluation

Experiment purpose. In this section, we experimentally evaluate whether the method proposed in this paper can discover long-distance dependencies from event logs and can rediscover process models with complex cycle structures from small event logs. Several event logs were used to compare our method with other process discovery methods, such as Inductive Miner (IM) [17], the Heuristics Miner (HM) algorithm [14], Inductive Miner-Incompleteness (IMin) [17], and Conjoint Occurrence Miner (CoM) [6]. The experimental evaluation in this section mainly answers two questions. The first question is whether the proposed method can discover long-distance dependencies. The second question is whether the proposed method can correctly discover the original model with a complex structure from incomplete event logs caused by concurrency. In addition to concurrent, sequential, and exclusive structures, the model also may contain complex loop structures.

Experiment design. For problem 1, we generate event logs containing 1000 traces by simulating the model M1 with long-distance dependencies shown in Section 3. Then, we try to use these logs to rediscover the process model using different methods. We report on the experimental comparison results of the quality of the process model and the consistency between the original model and the discovered model [31]. Meanwhile, we report the discovery of long-distance dependencies in the original model. For problem 2, event logs with different completeness levels of directly-follows relationships are generated by simulating model M2 shown in Figure 6. These logs are obtained by removing some of the directly-follows relationships derived from the concurrent structure. Different algorithms are used to reconstruct the original process model M2 from these event logs of different sizes, and the differences between the discovered process model and the original model are reported. The method proposed in [31] is used to measure the Degree of Trace Consistency (DTC) and Degree of Profile Consistency (DPC) of alignment between the models. All miners use default parameters, and h in IMin is set to 0.
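The notion of a log being incomplete with respect to the directly-follows relationship can be made concrete with a small measurement. The paper does not state how its completeness degree is computed, so the sketch below is only one plausible reading: the ratio of directly-follows pairs observed in a sample log to those observed in a reference log. The function names and the traces are hypothetical.

```python
from typing import List, Set, Tuple


def directly_follows(log: List[str]) -> Set[Tuple[str, str]]:
    """All pairs (a, b) such that b immediately follows a in some trace."""
    return {(t[i], t[i + 1]) for t in log for i in range(len(t) - 1)}


def df_completeness(sample_log: List[str], reference_log: List[str]) -> float:
    """Share of the reference directly-follows pairs that the sample log contains.

    Assumption: completeness is measured against a reference log that is taken
    to exhibit every directly-follows pair of the underlying model.
    """
    reference = directly_follows(reference_log)
    if not reference:
        return 1.0
    return len(directly_follows(sample_log) & reference) / len(reference)


# Hypothetical reference log for two concurrent branches a->b and c->d between x and y.
REFERENCE = ["xacbdy", "xcadby", "xabcdy", "xcdaby"]
SAMPLE = REFERENCE[:2]  # a smaller log that misses some interleavings

RATIO = df_completeness(SAMPLE, REFERENCE)
```

With only two of the four traces, some interleavings of the concurrent branches are never observed, so the ratio falls below 1 even though every activity still appears in the sample.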

Experiment Result. From Table 11, it can be seen that the proposed method and the CoM method can discover long-distance dependencies, but compared with the CoM method, the proposed method can better discover the cyclic structure. IMin, IM, and HM cannot rediscover the original process model due to their inability to discover long-distance dependencies. The CoM method also cannot discover the original process model due to its limitations in the cyclic structure. When different methods are applied to the event log generated by M1, the consistency between the discovered process model and the original model is measured from two aspects by using the method given in [31]. The experimental results show that the method proposed in this paper has good results regarding the consistency of the discovered process model and the original model in terms of both the degree of trace consistency and degree of profile consistency. Since long-distance dependencies do not affect the behavior profile relationship between activities, the process model obtained by the IMin, IM, and HM algorithms has a good degree of profile consistency with the original model.

Figure 13 shows that the IM algorithm and the HM algorithm require the event log to be complete with respect to the directly-follows relationship, whereas IMin, CoM, and the proposed method can handle an incomplete event log. Although the IMin method can handle the incompleteness of the event log, it cannot discover the long-distance dependency; thus, the value of DTC is lower than that of DPC between the model discovered by IMin and the original model. IMin mainly retains the behavior relationships with high probability based on the directly-follows relationship between activities to determine the behavior between the activities. Hence, the completeness requirement of IMin is lower than that of the IM and HM algorithms but higher than that of the proposed method and the CoM method. The proposed method and the CoM method require the completeness degree of the log to reach 85%. However, because the CoM method has some limitations in discovering the cycle structure, the consistency degree between the discovered model and the original model cannot reach 1, even if the log completeness degree is more than 85%. IMin requires the completeness ratio of the log to be about 90%. However, since the process model discovered by IMin is block-structured, some hidden transitions can be added when building the model. These hidden transitions may cause the model to generate additional behaviors, which may lead to a certain deviation between the discovered model and the original model. Thus, for the IMin method, even if the log completeness ratio reaches 90%, the consistency degree between the discovered model and the original model is still less than 1.

As the results of our experiments show, the proposed method has two advantages compared to the others. One advantage is that our method can handle incomplete event logs better than the IMin algorithm, whereas the IM and HM approaches depend heavily on log completeness. Although both the proposed method and the CoM method can deal with the incompleteness caused by high concurrency, the CoM method has limitations in discovering the loop structure, whereas the proposed method can also discover complex loop structures in addition to sequential, exclusive, and concurrent structures. The other advantage is that the proposed method can discover long-distance dependencies from the event logs. As the experiment shows, IMin, IM, and HM are unable to discover long-distance dependencies.

7. Conclusions and Future Work

Based on [6], a method for discovering a process model with multiple structure types from incomplete event logs is presented that can discover not only sequential, concurrent, and conflicting structures but also complex cyclic structures. This paper analyzes the relationship between co-occurrence classes from log and co-occurrence classes from the model and addresses communication behavior profiles between co-occurrence classes from a log using the technology of the behavior profile. A series of propositions are presented to point out how to construct the substructure through the co-occurrence class. Subsequently, the whole process model is constructed from the event log. The proposed method can still discover the original process model from an incomplete event log, even if the model contains a complex cyclic structure. In addition, it also can correctly discover a long-distance dependency implied in the log. For large event logs, it will be time-consuming to construct co-occurrence classes and analyze their communication behavior profiles. In the future, we will consider using co-occurrence classes to construct frequent patterns from event logs to analyze local behaviors in business processes.

Author Contributions

Methodology, L.W.; formal analysis, L.W.; writing—original draft preparation, L.W.; writing—review and editing, X.F.; visualization, C.S.; project administration, X.F.; supervision, X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation, China (No. 61572035, 61402011), the Leading Backbone Talent Project in Anhui Province, China (No. 2020-1-12), the Natural Science Foundation of Anhui Province, China (No. 2008085QD178), the Anhui Province Academic and Technical Leader Foundation (No. 2019H239), the Anhui Province College Excellent Young Talents Fund Project of China (No. gxyqZD2020020), and the Open Research Fund of Anhui Province Engineering Laboratory for Big Data Analysis and Early Warning Technology of Coal Mine Safety (No. CSBD2022-ZD03).

Data Availability Statement

All the data in this paper were obtained by simulating the model. Experimental data and code related to this paper can be obtained by contacting the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

Van Der Aalst, W. Process Mining: Discovery, Conformance and Enhancement of Business Processes; Springer: Berlin/Heidelberg, Germany, 2011.
Van Der Aalst, W. Process Mining: Overview and Opportunities. ACM Trans. Manag. Inf. Syst. (TMIS) 2012, 3, 1–17.
Van Der Aalst, W. Process Mining: A 360 Degree Overview; Springer: Cham, Switzerland, 2022; pp. 3–34.
Farkhady, R.Z.; Aali, S.H. A probabilistic approach for process mining in incomplete and noisy logs. Lect. Notes Eng. Comput. Sci. 2011, 2188, 415–420.
ZarehFarkhady, R.; Aali, S.H. A two phase approach for process mining in incomplete and noisy Logs. Int. J. Comput. Sci. Issues (IJCSI) 2012, 9, 160–165.
Tapia-Flores, T.; Rodríguez-Pérez, E.; López-Mellado, E. Discovering Process Models from Incomplete Event Logs using Conjoint Occurrence Classes. In Proceedings of the ATAED@ Petri nets/ACSD, CEUR Workshop Proceedings 1592, Torun, Poland, 20–21 June 2016; pp. 31–46.
Leemans, S.J.; Poppe, E.; Wynn, M.T. Directly follows-based process mining: Exploration & a case study. In Proceedings of the 2019 International Conference on Process Mining (ICPM), Aachen, Germany, 24–26 June 2019; pp. 25–32.
Bergenthum, R.; Desel, J.; Lorenz, R.; Mauser, S. Process mining based on regions of languages. In Proceedings of the International Conference on Business Process Management, Brisbane, Australia, 24–28 September 2007; pp. 375–383.
Solé, M.; Carmona, J. Process mining from a basis of state regions. In Proceedings of the International Conference on Applications and Theory of Petri Nets, Braga, Portugal, 21–25 June 2010; pp. 226–245.
van der Werf, J.M.E.; van Dongen, B.F.; Hurkens, C.A.; Serebrenik, A. Process discovery using integer linear programming. Fundam. Inform. 2009, 94, 387–412.
van Zelst, S.J.; van Dongen, B.F.; van der Aalst, W.M.; Verbeek, H. Discovering workflow nets using integer linear programming. Computing 2018, 100, 529–556.
de Medeiros, A.K.A.; Weijters, A.J.; van der Aalst, W.M. Genetic process mining: An experimental evaluation. Data Min. Knowl. Discov. 2007, 14, 245–304.
Van der Aalst, W.M. Decomposing Petri nets for process mining: A generic approach. Distrib. Parallel Databases 2013, 31, 471–507.
Weijters, A.; van Der Aalst, W.M.; De Medeiros, A.A. Process mining with the heuristics miner-algorithm. Tech. Univ. Eindhoven Tech. Rep. WP 2006, 166, 1–34.
Weijters, A.; Ribeiro, J.T.S. Flexible heuristics miner (FHM). In Proceedings of the 2011 IEEE symposium on computational intelligence and data mining (CIDM), Paris, France, 11–15 April 2011; pp. 310–317.
vanden Broucke, S.K.; De Weerdt, J. Fodina: A robust and flexible heuristic process discovery technique. Decis. Support Syst. 2017, 100, 109–118.
Leemans, S.J.; Fahland, D.; van der Aalst, W.M. Discovering block-structured process models from incomplete event logs. In Proceedings of the International Conference on Applications and Theory of Petri Nets and Concurrency, Tunis, Tunisia, 23–27 June 2014; pp. 91–110.
Günther, C.W.; Van Der Aalst, W.M. Fuzzy mining–Adaptive process simplification based on multi-perspective metrics. In Proceedings of the International Conference on Business Process Management, Brisbane, Australia, 24–28 September 2007; pp. 328–343.
Van der Aalst, W.; Weijters, T.; Maruster, L. Workflow mining: Discovering process models from event logs. IEEE Trans. Knowl. Data Eng. 2004, 16, 1128–1142.
De Medeiros, A.A.; Van Dongen, B.F.; Van der Aalst, W.M.; Weijters, A. Process mining: Extending the α-algorithm to mine short loops. Eindh. Univ. Technol. 2004, 10, 145–170.
Wen, L.; Van Der Aalst, W.M.; Wang, J.; Sun, J. Mining process models with non-free-choice constructs. Data Min. Knowl. Discov. 2007, 15, 145–180.
Guo, Q.; Wen, L.; Wang, J.; Yan, Z.; Yu, P.S. Mining invisible tasks in non-free-choice constructs. In Proceedings of the International Conference on Business Process Management, Rio de Janeiro, Brazil, 18–22 September 2016; pp. 109–125.
Augusto, A.; Conforti, R.; Dumas, M.; La Rosa, M.; Polyvyanyy, A. Split miner: Automated discovery of accurate and simple business process models from event logs. Knowl. Inf. Syst. 2019, 59, 251–284.
Li, W.; Zhu, H.; Liu, W.; Chen, D.; Jiang, J.; Jin, Q. An anti-noise process mining algorithm based on minimum spanning tree clustering. IEEE Access 2018, 6, 48756–48764.
Sun, X.; Hou, W.; Yu, D.; Wang, J.; Pan, J. Filtering out noise logs for process modelling based on event dependency. In Proceedings of the 2019 IEEE International Conference on Web Services (ICWS), Milan, Italy, 8–13 July 2019; pp. 388–392.
Nolle, T.; Luettgen, S.; Seeliger, A.; Mühlhäuser, M. Analyzing business process anomalies using autoencoders. Mach. Learn. 2018, 107, 1875–1893.
Zhang, Z.; Hildebrant, R.; Asgarinejad, F.; Venkatasubramanian, N.; Ren, S. Improving Process Discovery Results by Filtering Out Outliers from Event Logs with Hidden Markov Models. In Proceedings of the 2021 IEEE 23rd Conference on Business Informatics (CBI), Bolzano, Italy, 1–3 September 2021; Volume 1, pp. 171–180.
Tax, N.; Sidorova, N.; van der Aalst, W.M. Discovering more precise process models from event logs by filtering out chaotic activities. J. Intell. Inf. Syst. 2019, 52, 107–139.
Munoz-Gama, J. Handling Noise and Incompleteness; Springer: Cham, Switzerland, 2016; pp. 61–73.
Weidlich, M.; Polyvyanyy, A.; Desai, N.; Mendling, J.; Weske, M. Process compliance analysis based on behavioural profiles. Inf. Syst. 2011, 36, 1009–1025.
Weidlich, M.; Mendling, J.; Weske, M. Efficient consistency measurement based on behavioral profiles of process models. IEEE Trans. Softw. Eng. 2010, 37, 410–429.

Figure 1. A Petri net with long-distance dependencies.

Figure 2. An example of the strict order relation.

Figure 3. An example of the interleaving order.

Figure 4. Process mining results of incomplete logs caused by high concurrency.

Figure 5. Framework of the proposed method.

Figure 6. The original model of log $L_{2}$.

Figure 7. Four basic structures.

Figure 8. A substructure corresponding to the co-occurrence class with a strict order relationship.

Figure 9. Typical concurrent substructure composed of $COC_{i}$ and $COC_{j}$.

Figure 10. Loop substructure with a sequential structure.

Figure 11. Loop substructure with an exclusive structure.

Figure 12. Loop substructure with a concurrent structure.

Figure 13. The degree of consistency comparison.

Table 1. The relationship between activity A and activity B shown in Figure 2.

Behavior Profile	Figure No.	Executable Sequence	Order of Occurrence	Occurrence Dependency
strict order $A ⇝ B$	Figure 2a	{A,B,AB}	A occurs before B	none
	Figure 2b	{AC,AB}	A occurs before B	Once activity B occurs, activity A must occur, but not vice versa
	Figure 2c	{AB,CB}	A occurs before B	Once activity A occurs, activity B must occur, but not vice versa

Table 2. The relationship between activity A and activity B shown in Figure 3.

Behavior Profile	Figure No.	Executable Sequence	Order of Occurrence	Occurrence Dependency
Interleaving order	Figure 3a	{CABD,CBAD}	No ordering constraint	A and B occur at most once and are in a co-occurrence relationship
	Figure 3b	{CAE,DBF,CABF,DBAE}	No ordering constraint	A and B occur at most once, but do not have any occurrence dependencies
	Figure 3c	{CAD,CBD,CAAD,CBBD,CABD,CBAD,CABAABBAD,……}	No ordering constraint	A and B may occur any number of times and do not have any occurrence dependencies
	Figure 3d	{ABC,ABABC,ABABABC,……}	No ordering constraint	A and B may occur any number of times, but with an occurrence dependency: A and B occur the same number of times, and activity A must occur before B

Table 3. An event log $L_{2}$.

Trace	Numbers	Sequence of Activities
$σ_{1}$	5	$a b c f g h i j k m q$
$σ_{2}$	4	$a b c f h g i j k m q$
$σ_{3}$	5	$a d e f g h j i k m q$
$σ_{4}$	6	$a d e f g i h j k m n o r$
$σ_{5}$	7	$a d e f h j g i k m n o p n o p q r$
$σ_{6}$	8	$a b c f g h i j k m n o p n o p n o r$
$σ_{7}$	6	$a f g h i j k b c m q$
$σ_{8}$	5	$a f h g i j k d e m q$
$σ_{9}$	4	$a f g h j i k d e m n o p n o r$
$σ_{10}$	3	$a f g i h j k b c m n o p n o r$
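
The invariant predecessor and successor sets reported in Table 4 can be approximated directly from the traces of Table 3. The following is a minimal sketch under an assumed reading of the definition: for each activity, intersect, over all traces that contain it, the activities seen before (InvPredS) and after (InvSuccS) its first occurrence. The names `LOG` and `invariant_sets` are hypothetical. Under this reading the sketch agrees with the Table 4 rows for activities outside the n, o, p loop (for example b and g); Table 4 lists narrower sets for n, o, and p, so the paper's definition presumably treats repeated activities differently.

```python
# Traces of log L2 as listed in Table 3. Frequencies are omitted because they
# do not change which relations hold in every trace.
LOG = [
    "a b c f g h i j k m q",
    "a b c f h g i j k m q",
    "a d e f g h j i k m q",
    "a d e f g i h j k m n o r",
    "a d e f h j g i k m n o p n o p q r",
    "a b c f g h i j k m n o p n o p n o r",
    "a f g h i j k b c m q",
    "a f h g i j k d e m q",
    "a f g h j i k d e m n o p n o r",
    "a f g i h j k b c m n o p n o r",
]


def invariant_sets(traces):
    """Return {activity: (InvPredS, InvSuccS)} under the assumed definition.

    Assumption (not taken from the paper): the sets are measured relative to
    the first occurrence of the activity in each trace that contains it.
    """
    result = {}
    for trace in traces:
        events = trace.split()
        for activity in set(events):
            first = events.index(activity)
            before = set(events[:first])
            after = set(events[first + 1:]) - {activity}
            if activity in result:
                pred, succ = result[activity]
                # Keep only activities that hold in every trace seen so far.
                result[activity] = (pred & before, succ & after)
            else:
                result[activity] = (before, after)
    return result


INV = invariant_sets(LOG)
print(sorted(INV["b"][0]), sorted(INV["b"][1]))  # ['a'] ['c', 'm']
print(sorted(INV["g"][0]), sorted(INV["g"][1]))  # ['a', 'f'] ['i', 'k', 'm']
```

Intersecting per trace keeps the computation linear in the size of the log and makes it clear why an incomplete log can only shrink or preserve these sets as more traces are observed.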

Table 4. Invariant set of all activities in the log $L_{2}$.

Activity	InvPredS	InvSuccS	Oi
a	Ø	{f,g,i,h,j,k,m}	{a,f,g,i,h,j,k,m}
b	{a}	{c,m}	{a,b,c,m}
c	{a,b}	{m}	{a,b,c,m}
d	{a}	{e,m}	{a,d,e,m}
e	{a,d}	{m}	{a,d,e,m}
f	{a}	{g,i,h,j,k,m}	{a,f,g,i,h,j,k,m}
g	{a,f}	{i,k,m}	{a,f,g,i,k,m}
h	{a,f}	{j,k,m}	{a,f,h,j,k,m}
i	{a,f,g}	{k,m}	{a,f,g,i,k,m}
j	{a,f,h}	{k,m}	{a,f,h,j,k,m}
k	{a,f,g,i,h,j}	{m}	{a,f,g,i,h,j,k,m}
m	{a,f,g,i,h,j,k}	Ø	{a,f,g,i,h,j,k,m}
n	Ø	{o}	{n,o}
o	{n}	Ø	{n,o}
p	{n,o}	Ø	{n,o,p}
q	{a,f,g,h,i,j,k,m}	Ø	{a,f,g,h,i,j,k,m,q}
r	{a,f,g,h,i,j,k,m,n,o}	Ø	{a,f,g,h,i,j,k,m,n,o,r}

Table 5. Co-occurrence matrix for log $L_{2}$.

	a	b	c	d	e	f	g	h	i	j	k	m	n	o	p	q	r
a	1					1	1	1	1	1	1	1
b	1	1	1									1
c	1	1	1									1
d	1			1	1							1
e	1			1	1							1
f	1					1	1	1	1	1	1	1
g	1					1	1		1		1	1
h	1					1		1		1	1	1
i	1					1	1		1		1	1
j	1					1		1		1	1	1
k	1					1	1	1	1	1	1	1
m	1					1	1	1	1	1	1	1
n													1	1
o													1	1
p													1	1	1
q	1					1	1	1	1	1	1	1				1
r	1					1	1	1	1	1	1	1	1	1			1
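
Each row of Table 5 marks with a 1 the activities that belong to the Oi set of the row's activity in Table 4. The sketch below illustrates that construction for a handful of activities only (b, c, n, o, p), with the columns restricted to the activities those rows touch. The names `ACTIVITIES`, `OI`, and `cooccurrence_matrix` are hypothetical, and the sketch writes 0 where the table leaves a cell blank.

```python
# Oi sets copied from Table 4 for a few activities (illustrative subset).
OI = {
    "b": {"a", "b", "c", "m"},
    "c": {"a", "b", "c", "m"},
    "n": {"n", "o"},
    "o": {"n", "o"},
    "p": {"n", "o", "p"},
}

# Column order for the reduced matrix (a subset of the Table 5 columns).
ACTIVITIES = ["a", "b", "c", "m", "n", "o", "p"]


def cooccurrence_matrix(oi, columns):
    """Map each activity to a 0/1 row: 1 where the column activity is in Oi."""
    return {
        activity: [1 if col in members else 0 for col in columns]
        for activity, members in oi.items()
    }


MATRIX = cooccurrence_matrix(OI, ACTIVITIES)
for activity, row in MATRIX.items():
    print(activity, row)
```

Activities with identical rows, such as b and c here, are the ones that always occur together in this representation.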

Table 6. Invariant set of activities in four basic structures.

Activity	Sequential Structure			Exclusive Structure
Activity	InvPredS	InvSuccS	Oi	InvPredS	InvSuccS	Oi
x	⌀	{a,b,c,d,y}	{x,a,b,c,d,y}	⌀	{y}	{x,y}
a	{x}	{b,c,d,y}	{x,a,b,c,d,y}	{x}	{b,y}	{x,a,b,y}
b	{x,a}	{c,d,y}	{x,a,b,c,d,y}	{x,a}	{y}	{x,a,b,y}
c	{x,a,b}	{d,y}	{x,a,b,c,d,y}	{x}	{d,y}	{x,c,d,y}
d	{x,a,b,c}	{y}	{x,a,b,c,d,y}	{x,c}	{y}	{x,c,d,y}
y	{x,a,b,c,d}	⌀	{x,a,b,c,d,y}	{x}	⌀	{x,y}
Activity	Concurrent Structure			Cycle Structure
Activity	InvPredS	InvSuccS	Oi	InvPredS	InvSuccS	Oi
x	⌀	{a,b,c,d,y}	{x,a,b,c,d,y}	⌀	{a,b,y}	{x,a,b,y}
a	{x}	{b,y}	{x,a,b,y}	⌀	{b}	{a,b}
b	{x,a}	{y}	{x,a,b,y}	{a}	⌀	{a,b}
c	{x}	{d,y}	{x,c,d,y}	{a,b}	{a,b,d}	{a,b,c,d}
d	{x,c}	{y}	{x,c,d,y}	{a,b,c}	{a,b}	{a,b,c,d}
y	{x,a,b,c,d}	⌀	{x,a,b,c,d,y}	{x,a,b}	⌀	{x,a,b,y}

Table 7. Co-occurrence classes of the four basic structures.

Structure	Sequential	Exclusive	Concurrent	Cycle
co-occurrence class	{x,a,b,c,d,y}	{x,y},{a,b},{c,d}	{x,a,b,y},{c,d}	{x,y},{a,b},{c,d}
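
The Exclusive column of Table 7 can be recovered from the Oi sets of Table 6 by grouping activities whose Oi sets coincide. This grouping rule is an assumption read off the two tables, not a definition quoted from the text, and the names `OI_EXCLUSIVE` and `cooccurrence_classes` are hypothetical. A minimal sketch:

```python
from collections import defaultdict

# Oi sets of the exclusive structure, copied from Table 6.
OI_EXCLUSIVE = {
    "x": {"x", "y"},
    "a": {"x", "a", "b", "y"},
    "b": {"x", "a", "b", "y"},
    "c": {"x", "c", "d", "y"},
    "d": {"x", "c", "d", "y"},
    "y": {"x", "y"},
}


def cooccurrence_classes(oi):
    """Group activities that share exactly the same Oi set (assumed rule)."""
    groups = defaultdict(list)
    for activity, members in oi.items():
        # frozenset makes the Oi set hashable so it can serve as a dict key.
        groups[frozenset(members)].append(activity)
    return [sorted(group) for group in groups.values()]


CLASSES = cooccurrence_classes(OI_EXCLUSIVE)
print(CLASSES)  # [['x', 'y'], ['a', 'b'], ['c', 'd']]
```

The result matches the three classes {x,y}, {a,b}, {c,d} listed for the exclusive structure.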

Table 8. Invariant set of activities in Figure 10.

The Loop Body Is a Sequential Structure
Activity	InvPredS	InvSuccS	Oi
x	⌀	{a,b,y}	{x,a,b,y}
a	⌀	{b}	{a,b}
b	{a}	⌀	{a,b}
c	{a,b}	{d,a,b}	{a,b,c,d}
d	{c,a,b}	{a,b}	{a,b,c,d}
y	{x,a,b}	⌀	{x,a,b,y}

Table 9. Invariant set of activities in Figure 11.

The Loop Body Is an Exclusive Structure
Activity	InvPredS	InvSuccS	Oi
x	⌀	{y}	{x, y}
a	⌀	{b}	{a, b}
b	{a}	⌀	{a, b}
c	⌀	{d}	{c, d}
d	{c}	⌀	{c, d}
e	⌀	{f}	{e, f}
f	{e}	⌀	{e, f}
y	{x}	⌀	{x, y}

Table 10. Invariant set of activities in Figure 12.

The Loop Body Is a Concurrent Structure
Activity	InvPredS	InvSuccS	Oi
x	⌀	{a, b, c, d, y}	{x, a, b, c, d, y}
a	{x}	{b, y}	{x, a, b, y}
b	{x, a}	{y}	{x, a, b, y}
c	{x}	{d, y}	{x, c, d, y}
d	{x, c}	{y}	{x, c, d, y}
e	{x, a, b, c, d, y}	{x, a, b, c, d, y}	{e, x, a, b, c, d, y}
y	{x, a, b, c, d}	⌀	{x, a, b, c, d, y}

Table 11. Experimental results obtained by different methods.

Item	Imin	IM	HM	CoM	The Proposed Method
Long-distance dependency	No	No	No	Yes	Yes
Cycle structure	Yes	Yes	Yes	No	Yes
Rediscovery	No	No	No	No	Yes
DTC	93.66%	93.66%	95.45%	96.71%	100%
DPC	100%	100%	100%	98.70%	100%

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


