Discovery of Business Process Models from Incomplete Logs

Abstract: The completeness of event logs and long-distance dependencies are two major challenges for process mining. Until now, most process mining methods have not been able to discover long-distance dependencies and have assumed that the directly-follows relationship in the log is complete. However, due to high concurrency and cycles, it is difficult to guarantee that a real-life log is complete regarding the directly-follows relationship. Therefore, process mining needs to be able to deal with incompleteness. In this paper, we propose a method for discovering process models including sequential, exclusive, concurrent, and cyclic structures from incomplete event logs. The method analyzes the co-occurrence classes of the log and the model and then combines the behavior profile with the co-occurrence classes to obtain the communication behavior profile of the co-occurrence classes. Furthermore, a method of constructing a substructure from the event log using the co-occurrence class is presented. Finally, the whole process model is built by combining those substructures. The experimental results show that the proposed method can discover process models with complex structures involving cycles from incomplete event logs and can also deal with long-distance dependencies in the event log. Meanwhile, the discovered process model has a good degree of consistency with the original model.


Introduction
The operation of an information system will generate a large number of event logs. These event logs record what activities will occur and in what order they occur in the business process. In addition, there is much other related information, such as executors and execution resources [1]. Process mining extracts valuable insights about processes from these event logs to analyze, monitor, and improve actual business processes. According to van der Aalst [2], process mining techniques can be divided into three directions: (1) process discovery, which uses event logs to generate process models; (2) conformance checking, which compares the behaviors generated by the model with the behaviors in the event log. The aim is to detect, locate, and explain deviations and to measure the severity of these deviations; and (3) process enhancement, which aims to improve or extend an existing process model using information about the actual process recorded in some event log. To avoid "spaghetti-like" process models, process discovery usually consists of three steps: first, preprocessing the noise behavior in the event log, then discovering the process model from the preprocessed event log, and finally evaluating the quality of the process model [3]. This paper does not consider log preprocessing steps and only focuses on how to discover process models from event logs.
In addition to noise, the real-life event log is often incomplete, that is, it is impossible to show all possible behaviors [4]. Most of the existing process mining algorithms assume that the log is weakly complete, i.e., they assume that the direct successor relationship is complete, the causal dependency is complete, or that the log is complete [5]. Obviously, the assumption that the log is complete is impossible in reality. One of the factors leading to incomplete logs is high concurrency or a cycle. In [6], the authors proposed the concept of a conjoint occurrence class, which solved the incompleteness caused by high concurrency and illustrated how to discover process models from incomplete logs. However, this method has difficulty coping with cycles and does not consider the relationship between the model-based conjoint occurrence class and the log-based conjoint occurrence class.
Based on [6], we further provide a novel concept of a model-based co-occurrence class to capture activities that always occur together in any executable trace of the model. Meanwhile, the relationship between the model-based co-occurrence class and the log-based co-occurrence class is analyzed. Then, in addition to sequential, exclusive, and concurrent structures, we elaborate how to construct a cyclic structure containing multiple types of substructures. Finally, a method of discovering process models from incomplete event logs by combining the co-occurrence class and the behavior profile is presented. The proposed method addresses long-distance dependencies and the incompleteness caused by cycles or high concurrency. The experimental results show that when the log is incomplete, the proposed method can still build a process model that has a high degree of consistency with the original model.
In the following, Section 2 first discusses related work. The problem statement is given in Section 3. Section 4 explains some basic knowledge about the co-occurrence class of the log and model. In Section 5, the method of constructing a substructure using the co-occurrence class is described. In Section 6, we report on experimental results. Conclusions and future work are presented in Section 7.

Related Work
Until now, a series of process mining methods have been proposed, mainly based on the directly-follows relationship between activities [7], domain-based theory [8][9][10][11], genetic algorithms [12,13], frequency-based technology [14][15][16], cut-based technology [17], abstraction-based technology [18], and so on. Of these, methods based on the directly-follows relationship between activities, such as the α algorithm [19] and its derivatives [20][21][22], can only find some subclasses of Petri nets and cannot guarantee soundness and fitness. The technique of domain-based theory mainly includes language-based domain miners [8], state-based domain miners [9], and integer linear programming (ILP) miners [10,11]. Although these methods guarantee the fitness of the model, they cannot guarantee its soundness. The mining method based on the genetic algorithm iterates over candidate process models until an acceptable quality is obtained. It selects the fittest individuals and generates new individuals using two genetic operators, crossover and mutation [12,13]. Although the process model obtained by this method has good fitness, the method consumes a great deal of running time and cannot guarantee the soundness of the model.
The frequency-based method takes frequencies of events and sequences into account when constructing a process model, such as the Heuristic Mining (HM) algorithm [14], flexible heuristics miner (FHM) [15], or Fodina [16]. It calculates the dependency measure between activities and filters noise behavior by removing causal dependencies below a given threshold. This technique sometimes generates a process model with incorrect behaviors; meanwhile, soundness and fitness are not guaranteed. Cut-based technology, such as the Inductive Miner (IM) [17] and Inductive Miner-infrequent (IMi) algorithms [22], adopts the idea of divide and conquer and uses four cuts to recursively split the log into several sub-logs until it cannot find a cut. Subsequently, it constructs the sub-model from each sub-log. Finally, a block-structured process model can be built by merging all submodels. Although the soundness of the process model is guaranteed, its precision is very low, that is, the discovered process model is over-generalized. Process discovery algorithms based on abstract techniques, such as fuzzy miner [18], produce models without executable semantics, so neither soundness nor fitness is guaranteed. In [23], state-of-the-art process discovery methods based on the split gateway discovery were presented. Split Miner combines a novel method to filter directly-follows graphs induced by event logs, with a method for identifying combinations of split gateways that accurately capture concurrency, conflict, and causal relations between neighbors in directly-follows graphs. Split Miner is the first method to guarantee that the discovered BPMN model is deadlock-free and is not restricted to producing a block-structured model. However, when the number of activities in the log is large, the time complexity of the algorithm is high.
At present, most process discovery methods can deal with noise [24][25][26][27][28]. Apart from noise, another problem of real event logs is incompleteness [29]. Since the event log only records the behaviors of the business process observed in the past, it cannot capture the behaviors that will occur in the future. Hence, the event log is usually incomplete. However, few algorithms are capable of handling the incompleteness of event logs. In [17], based on the IM algorithm, a process discovery algorithm handling incomplete event logs was proposed, which introduced probabilistic behavioral relations that are less sensitive to incompleteness. It adopted the divide-and-conquer approach in the same way as IM and searched for an optimal partition of activities by estimating the probabilities of the relations between activities. However, the method proposed in [17] has difficulty in handling long-distance dependencies and cannot handle incompleteness caused by high concurrency well. In [6], the authors presented conjoint occurrence classes to handle the incompleteness caused by concurrency and cycles. This method can infer causal and concurrent behavior not exhibited in the log. However, the method in [6] cannot handle loop structures and the long-distance dependencies hidden in the log. In this work, log-based and model-based co-occurrence classes are provided to address incompleteness and long-distance dependencies; then, we propose a method of discovering a process model with multiple different structures, including cycles.

Problem Statement
At present, most process discovery methods cannot discover indirect dependencies between activities, such as long-distance dependencies [15] (for example, S1 and S2 in Figure 1). The behavior profile uses the weak order relationship between activities to describe the behavior dependency among activities, involving the strict order relation, the exclusiveness relation, and the interleaving order relation [30]. Among them, the exclusiveness relation is the strictest relationship, indicating that two activities cannot appear together in any trace. However, the behavior profile only describes the behavior dependency between activities and cannot capture the actual occurrence dependency and causal dependency of activities in the actual business process, i.e., whether the occurrence of one activity will affect the occurrence of another activity and vice versa.
The strict order relationship only indicates that there is a sequence constraint between activities, but it cannot show whether an occurrence dependency between activities exists. As shown in Figure 2, activity A and activity B are in a strict order relationship in each model. From Figure 2 and Table 1, the strict order relationship only indicates that if activities A and B both occur in the same trace, then activity A must occur before activity B. Hence, the sequence constraint of activity A and activity B is captured, while the actual occurrence dependency between activity A and activity B cannot be captured.
Table 1. Occurrence order and occurrence dependency of activities A and B in Figure 2.
Model | Traces | Occurrence order | Occurrence dependency
 | {AC, AB} | A occurs before B | Once activity B occurs, activity A must occur, but not vice versa
Figure 2c | {AB, CB} | A occurs before B | Once activity A occurs, activity B must occur, but not vice versa
The interleaving order relationship only indicates that there are no occurrence order constraints between activities, but it does not mandate that activities must be enabled at the same time. For example, activity A and activity B are in an interleaving order relationship in Figure 3.
However, only activity A and activity B in Figure 3a are concurrent, which is also the most typical structural feature corresponding to the interleaving order relationship. Table 2 shows the occurrence order and occurrence dependency of activities A and B in Figure 3. Table 2 indicates that the occurrence dependency constraint between activities is difficult to capture from the interleaving order relationship alone.
Another problem is that the event log represents only the behavior that has been observed during the actual operation of the system and cannot contain all possible behaviors of the system. Thus, the log is often incomplete, i.e., it may not contain enough information to discover the process model [29]. This might cause the discovered process model to exclude actual behaviors or to contain additional behaviors that are not in the real process. Existing process mining methods usually assume that either the activities are complete or the directly-follows relationships between activities are complete. However, when highly concurrent behavior exists, the second assumption is often unrealistic. For example, suppose a process consists of 10 concurrent activities and the log consists of about 10,000 cases. There are 10! = 3,628,800 possible sequences caused by concurrent behavior. Obviously, the number of cases in the log is far smaller than the number of possible sequences, so it is impossible for the log to sample all possible sequences, and it cannot be guaranteed to exhibit all directly-follows relationships between activities.
Many factors can lead to incomplete logs, among which cycles and high concurrency are important ones. When the number of activities on concurrent branches is large, it is impossible for the event log to sample all possible sequences of activities. Figure 4a shows an initial process model M_0 in which each concurrent branch contains two activities.
An incomplete event log L_1 is generated by the model M_0, where L_1 = {<xabcdy>, <xacdby>, <xcabdy>, <xcabdy>}. The process models obtained by applying the IM algorithm and the HM algorithm to this event log are shown in Figure 4b,c. The precision of the model M_1 is 1, and the fitness is low. The fitness and precision of the model M_2 are both 1, but this model is overfitting. Therefore, incomplete logs caused by concurrency will lead to low-quality process models or even erroneous results.
To solve the above problems, a method for discovering the process model from incomplete logs is presented. We combine behavioral profiles and co-occurrence classes to analyze the structural characteristics and communication behavior relationships of co-occurrence classes, and then merge several co-occurrence classes into a larger substructure to find the process model. The research framework of the proposed method is shown in Figure 5. The proposed method addresses two problems: behavior profiles cannot capture the occurrence dependency between activities, and high concurrency causes incompleteness. The experimental results show that the proposed method can find long-distance dependencies, and the discovered process model has a good consistency with the original model.
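The sampling argument and the incompleteness of L_1 can be checked numerically. The following Python sketch is illustrative only: it assumes that M_0 has the shape "x, then the branches a→b and c→d in parallel, then y" (which is consistent with the traces of L_1 but is not stated explicitly in the text), and the helper names `directly_follows` and `interleavings` are hypothetical.

```python
from math import factorial

def directly_follows(traces):
    """Set of pairs (u, v) such that v directly follows u in some trace."""
    return {(t[i], t[i + 1]) for t in traces for i in range(len(t) - 1)}

def interleavings(u, v):
    """All order-preserving merges of the sequences u and v."""
    if not u:
        return [v]
    if not v:
        return [u]
    return ([u[0] + rest for rest in interleavings(u[1:], v)]
            + [v[0] + rest for rest in interleavings(u, v[1:])])

# Sampling argument: 10 concurrent activities versus about 10,000 cases.
orderings = factorial(10)
coverage_upper_bound = 10_000 / orderings  # largest fraction of orderings a log can show

# Assumed shape of M_0: x, then (a -> b) in parallel with (c -> d), then y.
model_traces = ["x" + m + "y" for m in interleavings("ab", "cd")]
log_l1 = ["xabcdy", "xacdby", "xcabdy", "xcabdy"]

# Directly-follows pairs allowed by the assumed model but never observed in L_1.
missing = directly_follows(model_traces) - directly_follows(log_l1)
print(orderings, round(coverage_upper_bound, 4), sorted(missing))
```

Under these assumptions, even 10,000 distinct cases would cover less than 0.3% of the possible orderings, and L_1 lacks the directly-follows pairs (a, d), (c, b), and (d, a) that the assumed model permits.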

Construction Method of Co-Occurrence Class
In the actual business process, some activities may be mandatory nodes of the whole process, and the occurrence of some activities will lead to the subsequent occurrence of other activities, or the occurrence of some activities can be accompanied by the occurrence of some activities that precede it. Obviously, in addition to the constraints on the order of occurrence, there also exist occurrence dependencies between activities. Therefore, this section analyzes the invariant predecessor set and invariant successor set of activities to capture the co-occurrence invariant set of activities in the log or model. Activities are divided into several subsets with co-occurrence dependency in the log or model through the co-occurrence classes. Section 4.1 presents the method of discovering co-occurrence classes from the event log and analyzes the characteristics of co-occurrence classes. Section 4.2 provides the method of discovering co-occurrence classes from the model.

Discovery of Co-Occurrence Classes from Logs
The event log shows the actual execution of activities in the business process. By analyzing the occurrence dependencies between activities in each trace, the occurrence dependency between activities in the log can be found. In the following, the relevant definitions of trace handling operations are given.
Definition 1 (Trace handling operations [6]). Let L be an event log and A_L be the activity set over L. For ∀a ∈ A_L, if there exists a trace σ ∈ L such that a ∈ σ, where |σ| = n:
(1) σ[k] = a indicates that the k-th element in the trace σ is the activity a;
(2) Pos(a, σ) = {p, q, r, ···, s} such that σ[p] = a, ···, σ[s] = a;
(3) The direct predecessor activity of the i-th activity in the trace σ is defined as Pred(i, σ) = σ[i − 1], i ∈ {2, 3, ···, n};
(4) The direct successor activity of the i-th activity in the trace σ is defined as Succ(i, σ) = σ[i + 1], i ∈ {1, 2, ···, n − 1};
(5) The invariant set of activities that appear before activity a in trace σ is denoted as Predset(a, σ), and the invariant set of activities that appear after activity a in trace σ is denoted as Succset(a, σ).
For example, considering a trace σ = <abcdbcdef>, Definition 1 gives σ[3] = c. Predset(a, σ) consists of activities that appear between two occurrences of the activity a as well as before the first occurrence of a. Succset(a, σ) consists of activities that appear between two occurrences of the activity a as well as after the last occurrence of a. Therefore, Predset(a, σ) and Succset(a, σ) denote the sets of activities that must occur before the activity a and after the activity a in the trace σ, respectively. To capture the predecessor and successor activities that have co-occurrence dependencies with an activity in the log, Definition 2 gives the invariant predecessor set and the invariant successor set of the activity in the log.
Definition 2 ((Log) Invariant predecessor set, Invariant successor set). Let L be an event log and A_L be the activity set over L. For ∀a ∈ A_L, the invariant predecessor set of activity a is defined as InvPredS(a, L) = ⋂_{σ∈L} Predset(a, σ), and the invariant successor set of activity a is defined as InvSuccS(a, L) = ⋂_{σ∈L} Succset(a, σ).
The invariant predecessor set in Definition 2 represents the set of activities that always occur before the activity a in any trace of the log. Similarly, the invariant successor set represents the set of activities that always occur after the activity a in any trace of the log. The activities in the invariant predecessor set or the invariant successor set not only have a strict order relationship with the activity a but also have a co-occurrence relationship with the activity a. Definition 3 further gives the concept of a co-occurrence invariant set based on Definition 2.
Definition 3 ((Log) Co-occurrence invariant set). Let L be an event log and A_L be the activity set over L. For ∀a ∈ A_L, the co-occurrence invariant set of activity a is denoted as Oi(a, L) = InvPredS(a, L) ∪ InvSuccS(a, L) ∪ {a}.
In Definition 3, Oi(a, L) represents the set of activities that occur together with the activity a in any trace of the log L. There is not only a behavior dependency but also a co-occurrence dependency between them.
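Definitions 1 to 3 can be sketched in Python as follows. This is a minimal illustration under two loudly stated assumptions that are not confirmed by the text: Predset and Succset are approximated by the activities before the first and after the last occurrence of a, and the intersection in Definition 2 is taken over the traces that contain a. All function names are hypothetical, and traces are written as strings of single-letter activities.

```python
def positions(a, trace):
    """Pos(a, trace): 1-based positions at which activity a occurs."""
    return {i + 1 for i, x in enumerate(trace) if x == a}

def predset(a, trace):
    """Assumed simplification: activities before the first occurrence of a."""
    first = min(positions(a, trace))
    return set(trace[:first - 1])

def succset(a, trace):
    """Assumed simplification: activities after the last occurrence of a."""
    last = max(positions(a, trace))
    return set(trace[last:])

def invariant_sets(a, log):
    """Return (InvPredS, InvSuccS, Oi) for an activity a that occurs in the log.

    Assumption: only traces containing a take part in the intersection.
    """
    containing = [t for t in log if a in t]
    inv_pred = set.intersection(*(predset(a, t) for t in containing))
    inv_succ = set.intersection(*(succset(a, t) for t in containing))
    return inv_pred, inv_succ, inv_pred | inv_succ | {a}

# A small exclusive choice between a and b, framed by x and y.
choice_log = ["xay", "xby"]
print(positions("b", "abcdbcdef"))
print(invariant_sets("a", choice_log))
print(invariant_sets("x", choice_log))
```

In this toy log, x belongs to Oi(a) but a does not belong to Oi(x), because a is absent from the trace xby; the relation is therefore not symmetric in general.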
For example, consider the event log L_2 shown in Table 3; the process model shown in Figure 6 is the original model of the log L_2. This paper takes the event log L_2 as a running example to illustrate the proposed method.

Table 3. Event log L_2.
Trace | Numbers | Sequence of Activities
σ_1 | 5 | abcfghijkmq
σ_2 | 4 | abcfhgijkmq
σ_3 | 5 | adefghjikmq
σ_4 | 6 | adefgihjkmnor
σ_5 | 7 | adefhjgikmnopnopqr
σ_6 | 8 | abcfghijkmnopnopnor
σ_7 | 6 | afghijkbcmq
σ_8 | 5 | afhgijkdemq
σ_9 | 4 | afghjikdemnopnor
σ_10 | 3 |
Table 4 presents the invariant predecessor set, invariant successor set, and co-occurrence invariant set for each activity in the event log L_2. Since the co-occurrence invariant set represents the set of activities that appear together with the current activity in all traces of the log, this paper adopts the co-occurrence invariant set to capture the occurrence dependency between activities. To facilitate the discovery of a co-occurrence dependency among several activities, the activities in the log can be divided into several subsets with equivalent behavior by using the co-occurrence invariant set. Therefore, Definitions 4 and 5 introduce the co-occurrence matrix and the concept of co-occurrence classes.
Definition 4 (Co-occurrence matrix). Let L be an event log, A_L be the activity set over L, and |A_L| = n. The matrix A_{n×n} is called the co-occurrence matrix, where A_{ij} represents the co-occurrence relation between a_i, a_j ∈ A_L: if a_j ∈ Oi(a_i, L), then A_{ij} = 1; otherwise, A_{ij} = 0.
Definition 5 (Co-occurrence class). Let L be an event log, A_L be the activity set over L, and A_{n×n} be the co-occurrence matrix of the log L. For a_p, a_q, a_r, ···, a_s ∈ A_L, the activity set COC = {a_p, a_q, a_r, ···, a_s} is called a co-occurrence class if and only if A_{ij} = 1 for all i, j ∈ {p, q, r, ···, s}.
Definition 5 indicates that there exists a co-occurrence dependency between all activities in the co-occurrence class; that is, they always appear together in a trace of the log. Table 5 shows the co-occurrence relationship matrix of the event log.
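To make Definitions 4 and 5 concrete, the sketch below builds the co-occurrence matrix for the incomplete log L_1 from Section 3 and groups activities into classes. It is a hedged illustration, not the authors' Algorithm 3: Oi is computed with the simplifying assumptions that only first and last occurrences matter and that the intersection runs over traces containing the activity, and the greedy grouping strategy and all function names are hypothetical.

```python
def co_occurrence_sets(log):
    """Oi(a, L) for every activity, under the simplifying assumptions in the lead-in."""
    activities = list(dict.fromkeys("".join(log)))  # order of first appearance
    oi = {}
    for a in activities:
        containing = [t for t in log if a in t]
        inv_pred = set.intersection(*(set(t[:t.index(a)]) for t in containing))
        inv_succ = set.intersection(*(set(t[t.rindex(a) + 1:]) for t in containing))
        oi[a] = inv_pred | inv_succ | {a}
    return activities, oi

def co_occurrence_matrix(activities, oi):
    """A[i][j] = 1 if activities[j] is in Oi(activities[i]), else 0."""
    return [[1 if b in oi[a] else 0 for b in activities] for a in activities]

def co_occurrence_classes(activities, matrix):
    """Greedy grouping: an activity joins the first class whose members
    all have a 1 with it in both directions of the matrix."""
    idx = {a: i for i, a in enumerate(activities)}
    classes = []
    for a in activities:
        for cls in classes:
            if all(matrix[idx[a]][idx[b]] and matrix[idx[b]][idx[a]] for b in cls):
                cls.append(a)
                break
        else:
            classes.append([a])
    return classes

log_l1 = ["xabcdy", "xacdby", "xcabdy", "xcabdy"]
activities, oi = co_occurrence_sets(log_l1)
matrix = co_occurrence_matrix(activities, oi)
classes = co_occurrence_classes(activities, matrix)
print(classes)
```

With these assumptions the incomplete log already yields the classes {x, a, b, y} and {c, d}, even though several directly-follows pairs are never observed. The greedy order (first appearance in the log) is a design choice of this sketch; a different order could assign x and y to the other branch.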
Using Definition 5, the co-occurrence classes can be obtained from Table 5. The co-occurrence class has some interesting properties. The relevant conclusions and proofs are given below.
Property 1 (Co-occurrence). Let L be an event log and COC_k be a co-occurrence class obtained from the event log L. For ∀a_i ∈ COC_k, if ∃σ ∈ L such that a_i ∈ σ, then for ∀a_j ∈ COC_k \ {a_i}, a_j ∈ σ holds.
Proof. If ∃a_j ∈ COC_k \ {a_i} such that a_j ∉ σ, then a_j ∉ Predset(a_i, σ) and a_j ∉ Succset(a_i, σ). So a_j ∉ InvPredS(a_i, L) and a_j ∉ InvSuccS(a_i, L). Obviously, a_j ∉ Oi(a_i, L). Therefore, a_i and a_j cannot both belong to COC_k, which is a contradiction.
Property 2 (Invariant order). Let L be an event log and COC_k = {a_p, a_q, a_r, ···, a_s} be a co-occurrence class obtained from the event log L. If ∃σ_i ∈ L in which the activities a_p, a_q, a_r, ···, a_s satisfy the partial order relation a_p ≺ a_q ≺ a_r ≺ ··· ≺ a_s, then for ∀a_i ∈ COC_k and every σ_j ∈ L with a_i ∈ σ_j, the activities a_p, a_q, a_r, ···, a_s also satisfy the partial order relation a_p ≺ a_q ≺ a_r ≺ ··· ≺ a_s in the trace σ_j; i.e., if the activities in COC_k occur in a trace, they occur in the same order.
Proof. Assume there exist two activities of COC_k that appear in different orders in two different traces; let these activities be a_q and a_r, such that a_p ≺ a_q ≺ a_r ≺ a_l ≺ ··· ≺ a_s holds in trace σ_i but a_p ≺ a_r ≺ a_q ≺ a_l ≺ ··· ≺ a_s holds in trace σ_j. Thus, InvPredS(a_q, L) = {···, a_p} ∩ {···, a_p, a_r} ∩ ···, and InvSuccS(a_q, L) = {a_r, a_l, ···, a_s, ···} ∩ {a_l, ···, a_s, ···} ∩ ···. Clearly, a_r ∉ Oi(a_q, L). Therefore, a_q and a_r cannot belong to the same co-occurrence class, which is a contradiction.
Algorithm 1, Algorithm 2, and Algorithm 3, respectively, show how to obtain the invariant predecessor set, invariant successor set, and co-occurrence class of any activity from the log.
In Algorithm 1, Step 2 initializes the invariant predecessor set to an empty set, Steps 3-9 solve the invariant predecessor set of the activity in each trace, Steps 10-15 get the invariant predecessor set of the activity in the log, and Steps 11-14 indicate that once the invariant predecessor set becomes an empty set, the algorithm is stopped.
In Algorithm 2, Step 2 initializes the invariant successor set to an empty set, Steps 3-9 solve the invariant successor set of the activity in each trace, Steps 10-15 obtain the invariant successor set of the activity in the log, and Steps 11-14 indicate that once the invariant successor activity set becomes an empty set, the algorithm is stopped.
Steps 1-5 in Algorithm 3 are used to find the invariant set of any activity, Steps 7-11 initialize each element in the co-occurrence relationship matrix to 0, Steps 12-20 update the value of the co-occurrence relationship matrix according to the invariant activity set, Steps 22-25 set each activity to an initial co-occurrence class, and Steps 21-37 construct all possible co-occurrence classes by merging the initial co-occurrence classes according to the co-occurrence relationship matrix.
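Properties 1 and 2 give simple sanity checks for any candidate co-occurrence class. The following Python sketch is an illustrative check rather than part of the authors' algorithms; the function names are hypothetical, and for Property 2 it assumes that comparing first occurrences is sufficient, which ignores repeated activities inside cycles.

```python
def satisfies_co_occurrence(cls, log):
    """Property 1: each trace contains either all activities of the class or none."""
    members = set(cls)
    return all(len(members & set(t)) in (0, len(members)) for t in log)

def satisfies_invariant_order(cls, log):
    """Property 2 (first occurrences only): the class members appear in the
    same relative order in every trace that contains all of them."""
    orders = set()
    for t in log:
        if set(cls) <= set(t):
            orders.add(tuple(sorted(cls, key=t.index)))
    return len(orders) <= 1

log_l1 = ["xabcdy", "xacdby", "xcabdy", "xcabdy"]
for candidate in (["x", "a", "b", "y"], ["c", "d"], ["a", "c"]):
    print(candidate,
          satisfies_co_occurrence(candidate, log_l1),
          satisfies_invariant_order(candidate, log_l1))
```

For the log L_1 from Section 3, the candidates {x, a, b, y} and {c, d} pass both checks, whereas {a, c} passes the co-occurrence check but fails the order check because a and c appear in both orders.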

Discovering Co-Occurrence Classes from Models
The co-occurrence invariant relationship of activities in the log is closely related to their structural relationship in the model. To facilitate the discovery of their structural relationship in the model according to the co-occurrence invariant relationship of activities in the log, we present some definitions related to the invariance of activity occurring in the model. The set of invariant predecessors of activity a in model N represents the set of activities that occur before a in all executable sequences of the model N. Similarly, the set of invariant successors of activity a in the model represents the set of activities that occur after a in all executable sequences of the model N.

Definition 8 ((Model) co-occurrence invariant set). Let N be a process model and A_N be the activity set of the model. For ∀a ∈ A_N, the co-occurrence invariant set of activity a is defined as Oi(a, N) = InvPredS(a, N) ∪ InvSuccS(a, N) ∪ {a}.
In Definition 8, Oi(a, N) represents the set of activities that always occur together with the activity a in any execution sequence of the model.
Definition 9 ((Model) co-occurrence class). Let N be a process model and A_N be the activity set of the model, where |A_N| = n, and let A_{n×n} be the co-occurrence relationship matrix of the model N. For a_p, a_q, a_r, ···, a_s ∈ A_N, the set of activities COC = {a_p, a_q, a_r, ···, a_s} is called a co-occurrence class if and only if A_{ij} = 1 for all i, j ∈ {p, q, r, ···, s}.
Clearly, Definition 9 implies that for ∀a_i, a_j ∈ COC, if a_i ∈ Oi(a_j, N), then a_j ∈ Oi(a_i, N); i.e., the activities in a co-occurrence class always occur together in an execution sequence.
According to Definition 9, Table 6 gives the invariant predecessor set, invariant successor set, and co-occurrence invariant set of each activity in the four basic structures (as shown in Figure 7). Table 7 presents co-occurrence classes of the four basic structures obtained from Table 6. Table 7 shows that the exclusive structure and the cyclic structure have the same set of co-occurrence classes, the sequential structure has only one co-occurrence class that contains all activities, and the concurrent structure is decomposed into two co-occurrence classes according to the concurrent branch, one of which contains the and-gateway activities.

Building Substructures Using Log Co-Occurrence Classes
Section 4 presents the solution method for the co-occurrence invariant set of any activity in the log or model; then, the co-occurrence class is provided. This section further illustrates how to leverage the co-occurrence classes in the event log to build a model substructure that includes the sequence, conflict, concurrency, and cycle. Section 5.1 presents the method of constructing the basic substructure from a single co-occurrence class, and Section 5.2 addresses the communication behavior relationship between the co-occurrence classes and how to combine multiple co-occurrence classes into larger substructures.

Building the Basic Substructure from a Single Co-Occurrence Class
Properties 1 and 2 indicate that although the activities in a co-occurrence class always occur together and in the same order, the structural features of the activities in the co-occurrence class are unknown. Theorems 1, 2, and 3, respectively, give the corresponding substructures according to the behavioral relationship of the activities in the co-occurrence class in the log.
Proof. Assume that the activities a_{k_1}, a_{k_2}, ···, a_{k_p} constitute a non-sequential substructure, i.e., a cycle, concurrency, or conflict substructure. From the characteristics of cycle, concurrency, and conflict substructures, it is known that the activities a_{k_1}, a_{k_2}, ···, a_{k_p} cannot belong to the same co-occurrence class. This contradicts the assumption.
If activities in the co-occurrence class occur continuously in any trace, it can be known from Theorem 1 that there exists a causal dependence between these activities.
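The condition "occur continuously in any trace" can be tested mechanically. The sketch below is a hypothetical helper, not taken from the paper: it checks whether the members of a class form one contiguous block in every trace that contains them, which is how the condition is read here since the statement of Theorem 1 itself is not reproduced in this text.

```python
def occurs_consecutively(cls, log):
    """True if, in every trace containing class members, they form one contiguous block."""
    members = set(cls)
    for t in log:
        idx = [i for i, x in enumerate(t) if x in members]
        # A contiguous block spans exactly as many positions as it has elements.
        if idx and idx[-1] - idx[0] + 1 != len(idx):
            return False
    return True

# Toy log: a, b, c always occur back to back, followed by a choice of x or y.
print(occurs_consecutively(["a", "b", "c"], ["abcx", "abcy"]))
# In the log L_1 of Section 3, c and d are separated in the trace xcabdy.
print(occurs_consecutively(["c", "d"], ["xabcdy", "xacdby", "xcabdy"]))
```

A class that passes this check is a candidate for a purely sequential substructure; a class that fails it needs the additional analysis of Theorems 2 and 3.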
For the concurrent structure, according to its behavioral characteristics, different concurrent branches will produce different co-occurrence classes. For instance, the concurrent structure shown in Figure 7c produces two co-occurrence classes, {x, a, b, y} and {c, d}. Since the order of activities on different concurrent branches is not restricted, the activities in a co-occurrence class do not necessarily occur consecutively in an executable trace. In Figure 7c, the activities in the co-occurrence class {x, a, b, y} do not occur consecutively in the executable sequence <xacbdy>. However, causal occurrence dependencies still exist between the activities from each concurrent branch. In the following, Theorem 2 describes the substructure characteristics corresponding to co-occurrence classes with concurrent relationships.
Theorem 2 (A substructure corresponding to the co-occurrence class with a concurrent relationship). Let L be an event log, COC_i = {a_{i_1}, a_{i_2}, ···, a_{i_p}}, and COC_j = {a_{j_1}, a_{j_2}, ···, a_{j_q}}. If ∃a_{i_k} ∈ COC_i, ∃a_{j_l} ∈ COC_j, 1 ≤ k ≤ p, 1 ≤ l ≤ q, such that a_{i_k} || a_{j_l} holds, then the activities from COC_i and the activities from COC_j each form a sequential substructure.
Proof. As ∃a_{i_k} ∈ COC_i, ∃a_{j_l} ∈ COC_j such that a_{i_k} || a_{j_l} holds, COC_i and COC_j contain activities from different concurrent branches, and one of the branches should include the and-split and and-join activities. From the interleaving behavior relationship, we know that there is no ordering constraint between activities from two concurrent branches in a trace. According to Property 2, a_{i_1} ≺ a_{i_2} ≺ ··· ≺ a_{i_p} and a_{j_1} ≺ a_{j_2} ≺ ··· ≺ a_{j_q} hold. Therefore, even if the activities in COC_i or COC_j do not occur consecutively in the trace, there still exists a causal dependency between the activities from each co-occurrence class. As a result, the activities from each concurrent branch constitute a sequential substructure.
Theorem 2 shows that activities in concurrent branches constitute sequential substructures even if they do not occur consecutively in the trace. Theorem 3 addresses the substructure characteristics formed by activities in other co-occurrence classes.
Theorem 3 (A substructure corresponding to the co-occurrence class with strict order relationship). Let L be an event log, A_L be the activity set over L, and COC_k = {a_{k_1}, ···, a_{k_i}, a_{k_{i+1}}, ···, a_{k_p}}. If ∃σ ∈ L with σ[k] = a_{k_i} and σ[l] = a_{k_{i+1}} (where l > k + 1), and for ∀a_j ∈ A_L \ COC_k, a_{k_i} and a_j are not in a concurrent relationship, then a_{k_1}, ···, a_{k_i} form a sequence substructure and a_{k_{i+1}}, ···, a_{k_p} form another sequence substructure. Meanwhile, there exists a non-sequential substructure containing at least one visible activity between a_{k_i} and a_{k_{i+1}}.
Proof. As σ[k] = a_{k_i} and σ[l] = a_{k_{i+1}} with l > k + 1, there exists a trace σ in which a_{k_i} and a_{k_{i+1}} do not occur consecutively. For ∀a_j ∈ A_L \ COC_k, as a_{k_i} and a_j are not in a concurrent relationship, the activities in COC_k do not belong to any concurrent branch. Thus, when the activities a_{k_i} and a_{k_{i+1}} do not occur consecutively, there exists at least one activity on the path from a_{k_i} to a_{k_{i+1}}, and this activity belongs to a co-occurrence class other than COC_k. Obviously, this activity is not in a sequential relationship with a_{k_i} and a_{k_{i+1}}; otherwise, it would belong to the same co-occurrence class as a_{k_i} and a_{k_{i+1}}. Therefore, there is a non-sequential substructure containing at least one activity between a_{k_i} and a_{k_{i+1}}. According to Theorems 1 and 2, the remaining activities of COC_k form sequential substructures. The activities in COC_k form the substructure shown in Figure 8.
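The splitting step behind Theorem 3 can be sketched as follows. This is an illustrative helper with a hypothetical name, under explicit assumptions: the class is given in its invariant order (Property 2), no activity repeats within a trace, and the precondition that no class member is concurrent with an outside activity has already been verified by the caller.

```python
def split_class(ordered_cls, log):
    """Split an ordered class wherever two consecutive members are
    not directly adjacent in some trace that contains both."""
    segments, current = [], [ordered_cls[0]]
    for prev, nxt in zip(ordered_cls, ordered_cls[1:]):
        adjacent_everywhere = all(
            t.index(nxt) == t.index(prev) + 1
            for t in log if prev in t and nxt in t
        )
        if adjacent_everywhere:
            current.append(nxt)
        else:
            # A gap signals a non-sequential substructure between prev and nxt.
            segments.append(current)
            current = [nxt]
    segments.append(current)
    return segments

# Toy log: p and x, then a choice between a and b, then y and q.
toy_log = ["pxayq", "pxbyq"]
print(split_class(["p", "x", "y", "q"], toy_log))
```

In the toy log, the class {p, x, y, q} is split into the two sequences p, x and y, q, with the exclusive choice between a and b lying in the gap.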

Merging Structures of Co-Occurrence Classes
Section 5.1 illustrates a method for discovering a substructure from a single co-occurrence class. This section presents a method for merging the structures of co-occurrence classes, according to the communication behavior relationship between them, to construct more complex substructures. Section 5.2.1 addresses the communication behavior relationship between co-occurrence classes, and Section 5.2.2 explains how to merge the structures of multiple co-occurrence classes to discover more complex substructures.

Communication Behavior Relationship between Co-Occurrence Classes
According to the behavior dependencies of the activities from different co-occurrence classes in the log, the communication behavior dependencies between the co-occurrence classes can be obtained. These communication behavior relationships include a sequential communication relationship, exclusive communication relationship, concurrent communication relationship, and cycle communication relationship. The analysis of the communication behavior relationship between the co-occurrence classes provides a theoretical basis for constructing a larger substructure by combining the co-occurrence classes.

Property 3 (Sequential communication relationship).
Let $L$ be an event log, and let $COCSets = \{COC_1, COC_2, \cdots, COC_n\}$ be the set of co-occurrence classes from $L$. If, for $\forall a \in COC_i$, $\forall b \in COC_j$, $a \rightarrow_L b$ and $b \nrightarrow_L a$ hold, then $COC_i$ and $COC_j$ are in a sequential communication relationship, with $COC_i$ preceding $COC_j$.
Proof. Assume $COC_i = \{a^i_1, a^i_2, \cdots, a^i_p\}$ and $COC_j = \{a^j_1, a^j_2, \cdots, a^j_q\}$. Then $a^i_1 \prec a^i_2 \prec \cdots \prec a^i_p$ and $a^j_1 \prec a^j_2 \prec \cdots \prec a^j_q$ hold; meanwhile, $a^i_p \rightarrow_L a^j_1$ and $a^j_1 \nrightarrow_L a^i_p$ also hold. Therefore, $COC_i$ and $COC_j$ are in a sequential relationship.

Property 4 (Concurrent communication relationship).
Let $L$ be an event log, and let $COCSets = \{COC_1, COC_2, \cdots, COC_n\}$ be the set of co-occurrence classes from $L$. If $\exists a \in COC_i$, $\exists b \in COC_j$ such that $a \parallel b \wedge a \nparallel a \wedge b \nparallel b$ holds, then $COC_i \parallel COC_j$.
Proof. Assume $COC_i = \{a^i_1, a^i_2, \cdots, a^i_p\}$ and $COC_j = \{a^j_1, a^j_2, \cdots, a^j_q\}$, where $a^i_1 \prec a^i_2 \prec \cdots \prec a^i_p$ and $a^j_1 \prec a^j_2 \prec \cdots \prec a^j_q$ hold. If $\exists a \in COC_i$, $\exists b \in COC_j$ such that $a \parallel b$ holds, then there must exist an activity (called the and-split activity) whose firing puts a token on each branch that contains activity $a$ or activity $b$. If the and-split and and-join are visible activities, then, by the characteristics of the concurrent structure, they must appear in the co-occurrence invariant set of activity $a$ or $b$. Therefore, they must belong to the co-occurrence class of activity $a$ or $b$. If the and-split and and-join are invisible activities, then they do not appear in the co-occurrence class of activity $a$ or $b$. Assume $a^i_1$ and $a^i_p$ are the visible and-split and and-join activities, respectively. As $a \parallel b$, and $a^i_2 \prec \cdots \prec a \prec \cdots \prec a^i_{p-1}$ and $a^j_1 \prec \cdots \prec b \prec \cdots \prec a^j_q$ hold, we have $COC_i \setminus \{a^i_1, a^i_p\} \parallel COC_j$, which we denote as $COC_i \parallel COC_j$.

Property 5 (Exclusive communication relationship).
Let $L$ be an event log, and let $COCSets = \{COC_1, COC_2, \cdots, COC_n\}$ be the set of co-occurrence classes from $L$. If $\exists a \in COC_i$, $\exists b \in COC_j$ such that $a + b$ holds, then $COC_i + COC_j$.
Proof. This follows immediately.

Property 6 (Cycle communication relationship).
Let $L$ be an event log, and let $COCSets = \{COC_1, COC_2, \cdots, COC_n\}$ be the set of co-occurrence classes from $L$. If $\exists a \in COC_i$, $\exists b \in COC_j$ such that $a \parallel b \wedge a \parallel a \wedge b \parallel b$ holds, then $COC_i$ and $COC_j$ are in a cycle communication relationship.
Proof. As $a \parallel b$, $a$ can be followed by $b$ and $b$ can be followed by $a$. Meanwhile, as $a \parallel a$ and $b \parallel b$, each of $a$ and $b$ can be followed by itself. Therefore, the activities from $COC_i$ and $COC_j$ are in a cycle relationship.
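Properties 3 to 6 can be read as a classification rule over pairs of co-occurrence classes. The following sketch illustrates that reading under stated assumptions: the behavior relations between activities are assumed to be available as a dictionary `rel` mapping ordered pairs to one of the labels `"->"`, `"||"`, or `"+"`, and both the function name and the order in which the properties are tested are illustrative choices rather than part of the paper.

```python
from typing import Dict, Iterable, Tuple

Relation = Dict[Tuple[str, str], str]  # (a, b) -> "->", "||" or "+"  (assumed encoding)


def communication_relation(coc_i: Iterable[str], coc_j: Iterable[str], rel: Relation) -> str:
    """Classify two non-empty co-occurrence classes following Properties 3-6."""
    pairs = [(a, b) for a in coc_i for b in coc_j]

    def par(x: str, y: str) -> bool:
        return rel.get((x, y)) == "||"

    # Property 6: a || b, and both a and b may repeat (a || a, b || b).
    if any(par(a, b) and par(a, a) and par(b, b) for a, b in pairs):
        return "cycle"
    # Property 4: a || b, and neither a nor b interleaves with itself.
    if any(par(a, b) and not par(a, a) and not par(b, b) for a, b in pairs):
        return "concurrent"
    # Property 5: some pair is exclusive.
    if any(rel.get((a, b)) == "+" for a, b in pairs):
        return "exclusive"
    # Property 3: every a precedes every b, and never the other way round.
    if all(rel.get((a, b)) == "->" and rel.get((b, a)) != "->" for a, b in pairs):
        return "sequential"
    return "unknown"


concurrent_rel = {("a", "c"): "||", ("c", "a"): "||"}
cycle_rel = {**concurrent_rel, ("a", "a"): "||", ("c", "c"): "||"}
sequential_rel = {("a", "c"): "->", ("b", "c"): "->"}
```

With these toy relations, the classes {a, b} and {c} are classified as concurrent, cyclic, and sequential, respectively.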

Combining Multiple Co-Occurrence Classes to Find More Complex Substructures
According to the communication behavior relationship between co-occurrence classes, the method of constructing multiple substructures from co-occurrence classes is provided below. These substructures include concurrency, conflict, sequence, and loop.
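If $\exists b \in A_L$, $\exists COC_i \in COCSets$ such that $b \rightarrow a$ holds for $\forall a \in COC_i$, then this is abbreviated as $b \rightarrow COC_i$. If $a \rightarrow b$ holds for $\forall a \in COC_i$, then this is abbreviated as $COC_i \rightarrow b$. Similarly, for $\parallel$ and $+$, the abbreviations are $b \parallel COC_i$ and $b + COC_i$.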
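The abbreviation lifts a relation between activities to a relation between an activity and a whole class. A minimal sketch, assuming the → relation is stored as a set of ordered pairs named `arrow` (both function names are hypothetical):

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def activity_to_class(b: str, coc: Iterable[str], arrow: Set[Pair]) -> bool:
    """b -> COC_i: b is in relation -> with every activity of the class."""
    return all((b, a) in arrow for a in coc)


def class_to_activity(coc: Iterable[str], b: str, arrow: Set[Pair]) -> bool:
    """COC_i -> b: every activity of the class is in relation -> with b."""
    return all((a, b) in arrow for a in coc)


# Toy relation: s precedes both c and d, and both precede j.
arrow = {("s", "c"), ("s", "d"), ("c", "j"), ("d", "j")}
coc_j = {"c", "d"}
split_found = activity_to_class("s", coc_j, arrow)
join_found = class_to_activity(coc_j, "j", arrow)
```

The same pattern applies to the concurrent and exclusive relations by swapping in the corresponding set of pairs.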
(1) Concurrent substructure: Suppose $COC_i \parallel COC_j$. If there exist activities $x_1, x_2, \cdots, x_i, y_1, \cdots, y_k \in COC_i$ with $x_1 > x_2 > \cdots > x_i$, $y_1 > y_2 > \cdots > y_k$, $x_i \ngtr y_1$, and $x_i \rightarrow COC_j$, $COC_j \rightarrow y_1$, then $x_i$ and $y_1$ correspond to the and-split and and-join activities, respectively. If there is no $x_i$ or $y_1$ such that $x_i \rightarrow COC_j$ or $COC_j \rightarrow y_1$, then the and-split or and-join is a hidden transition. Meanwhile, if $a \parallel b$ holds for $\forall a \in COC_i \setminus \{x_1, x_2, \cdots, x_i, y_1, \cdots, y_k\}$, $\forall b \in COC_j$, then the typical substructure of $COC_i$ and $COC_j$ is as shown in Figure 9.
(2) Conflict substructure: If $COC_i + COC_j$, then $COC_i$ and $COC_j$ each constitute one branch of the conflict substructure.
(3) Loop substructure: For loop structures, the key is to distinguish the do part from the redo part of the loop. First, the characteristics of three different loop substructures are analyzed, and the invariant predecessor set, invariant successor set, co-occurrence invariant set, and co-occurrence class of each activity in each loop substructure are obtained. Then, the loop substructures are identified in the log according to the characteristics of these sets for the activities in the log.
When the loop body is a sequential structure, a typical structure is shown in Figure 10, and Table 8 shows the invariant predecessor (or successor) set and co-occurrence invariant set of each activity in the loop substructure.
Figure 10. Loop substructure with a sequential structure.
Table 8. Invariant set of activities in Figure 10.
Co-occurrence classes obtained from Table 8 are $COCSets = \{\{x, y\}, \{a, b\}, \{c, d\}\}$.
Corollary 1. Let $L$ be an event log, and let $COC_i$, $COC_j$ be co-occurrence classes from $L$ that are in a cycle communication relationship. If, for $\forall a \in COC_i$, $InvPreds(a) \cap InvSuccs(a) = COC_j$ holds, then $COC_i$ is the redo substructure of the loop structure, while $COC_j$ is the do substructure of the loop structure.
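The test in Corollary 1 can be illustrated on a toy log. The sketch below assumes one particular reading of the invariant sets, which are defined earlier in the paper: `invariant_preds(log, a)` collects the activities seen before every occurrence of `a`, and `invariant_succs(log, a)` those seen after every occurrence. All function names are hypothetical, and the log is a hand-written stand-in for Figure 10 with do part a, b and redo part c, d.

```python
from typing import Iterable, Optional, Sequence, Set

Trace = Sequence[str]


def _invariant(log: Sequence[Trace], a: str, before: bool) -> Set[str]:
    """Intersect the activities before (or after) every occurrence of `a`."""
    common: Optional[Set[str]] = None
    for trace in log:
        for p, act in enumerate(trace):
            if act != a:
                continue
            seen = set(trace[:p]) if before else set(trace[p + 1:])
            common = seen if common is None else common & seen
    return common if common is not None else set()


def invariant_preds(log: Sequence[Trace], a: str) -> Set[str]:
    return _invariant(log, a, before=True)


def invariant_succs(log: Sequence[Trace], a: str) -> Set[str]:
    return _invariant(log, a, before=False)


def is_redo_of(log: Sequence[Trace], coc_i: Iterable[str], coc_j: Iterable[str]) -> bool:
    """Corollary 1 check: is coc_i the redo part of a loop whose do part is coc_j?"""
    do_part = set(coc_j)
    return all((invariant_preds(log, a) & invariant_succs(log, a)) == do_part for a in coc_i)


# Zero, one, and two iterations of the redo part.
log = [list("xaby"), list("xabcdaby"), list("xabcdabcdaby")]
redo_is_cd = is_redo_of(log, {"c", "d"}, {"a", "b"})
redo_is_ab = is_redo_of(log, {"a", "b"}, {"c", "d"})
```

Under this reading, {c, d} is recognized as the redo part and {a, b} as the do part, while the reverse assignment is rejected.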

The Loop Body Is a Sequential
When the loop body is an exclusive structure, a typical structure is shown in Figure 11, and Table 9 shows the invariant predecessor (or successor) set and co-occurrence invariant set of each activity in the loop substructure.
Table 9. Invariant set of activities in Figure 11.
Co-occurrence classes obtained from Table 9 are as follows:

Corollary 2. Let $L$ be an event log, and let $COC_i$, $COC_j$, $COC_k$ be co-occurrence classes from $L$ that are in a cycle communication relationship. If the following two conditions hold:
(1) for $\forall \sigma \in L$, $Occur(COC_i, \sigma) + Occur(COC_j, \sigma) = Occur(COC_k, \sigma) + 1$;
(2) $\exists \sigma \in L$ such that $Occur(COC_i, \sigma) = 0$ and $Occur(COC_j, \sigma) > 0$, or $Occur(COC_j, \sigma) = 0$ and $Occur(COC_i, \sigma) > 0$,
where $Occur(COC_k, \sigma)$ represents the number of times the sequence formed by $COC_k$ appears in the trace $\sigma$, then $COC_i$ and $COC_j$ form the do part of the loop structure, with $COC_i$ and $COC_j$ in an exclusive substructure, and $COC_k$ is the redo part of the loop structure.
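The two counting conditions of Corollary 2 can be checked directly on a log. In the sketch below, `occur` is an assumed simplification of $Occur$: it counts complete runs of a class as the smallest per-activity count in the trace, which agrees with the definition above only when the activities of a class always occur together. The function names are hypothetical.

```python
from typing import Iterable, Sequence

Trace = Sequence[str]


def occur(coc: Iterable[str], trace: Trace) -> int:
    """Assumed reading of Occur: smallest per-activity count of the class in the trace."""
    return min(list(trace).count(a) for a in coc)


def exclusive_loop_body(log: Sequence[Trace], coc_i, coc_j, coc_k) -> bool:
    """Check conditions (1) and (2) of Corollary 2."""
    # (1) every pass through the redo part is followed by exactly one do branch.
    balanced = all(
        occur(coc_i, t) + occur(coc_j, t) == occur(coc_k, t) + 1 for t in log
    )
    # (2) some trace uses only one of the two do branches.
    one_sided = any(
        (occur(coc_i, t) == 0 and occur(coc_j, t) > 0)
        or (occur(coc_j, t) == 0 and occur(coc_i, t) > 0)
        for t in log
    )
    return balanced and one_sided


# Toy loop: do part is a choice between a and b, redo part is c.
log = [list("a"), list("b"), list("acb"), list("bcaca")]
do_is_a_or_b = exclusive_loop_body(log, {"a"}, {"b"}, {"c"})
```

For the toy log, {a} and {b} are accepted as exclusive do branches with {c} as the redo part; swapping the roles of {b} and {c} violates condition (1).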
When the loop body is a concurrent structure, a typical structure is shown in Figure 12, and Table 10 shows the invariant predecessor (or successor) set and co-occurrence invariant set of each activity in the loop substructure.
Figure 12. Loop substructure with a concurrent structure.
Table 10. Invariant set of activities in Figure 12.
Corollary 3. Let $L$ be an event log, and let $COC_i$, $COC_j$, $COC_k$ be co-occurrence classes from $L$ that are in a cycle communication relationship. If the following two conditions hold:
(1) for $\forall \sigma \in L$, $\forall a \in COC_k$, $InvPreds(a) \cap InvSuccs(a) = COC_i \cup COC_j$;
(2) for $\forall \sigma \in L$, $Occur(COC_i, \sigma) = Occur(COC_j, \sigma)$,
then $COC_i$ and $COC_j$ form the do part of the loop structure, with $COC_i$ and $COC_j$ in a concurrent substructure, while $COC_k$ is the redo part of the loop structure.

Definition 10 (Merge activities in relation → [6]). When the left activities in two → relations are the same (a → b, a → c), two possible substructures can be created. Similarly, for → relations in which the right activity is common, i.e., (b → a ∨ c → a), the substructure created will be either [b|c, a]
Algorithm 4 presents the construction steps for discovering process models from event logs using the methods in Sections 5.1 and 5.2.
Step 1 constructs the co-occurrence class from the event log, Steps 2-4 use the method in Section 5.1 to construct the corresponding substructure for each co-occurrence class, Steps 5-7 use the method in Section 5.2.1 to obtain the communication behavior relationship between any two co-occurrence classes, Steps 8-13 use the method proposed in Section 5.2.2 to merge several co-occurrence classes into a larger sub-module, and Step 14 further merges the remaining sub-modules into a complete process model through the directly-follows relationship between activities.

Algorithm 4: Construct Petri net from the event log
Input: the event log L
Output: Petri net
1   cocset = GetCOCSet(L, A_L);
2   foreach coc in cocset do
3       substructset = Getsubstruct(coc);
4   end
5   for coc_i, coc_j in cocset do
6       GetbehaviorRelation(coc_i, coc_j);
7   end
8   for coc_i, coc_j in cocset do
9       if coc_i and coc_j are in a concurrent relationship then merge(coc_i, coc_j, parallel);
10      if coc_i + coc_j then merge(coc_i, coc_j, choice);
11      if coc_i and coc_j are in a cycle relationship then merge(coc_i, coc_j, loop);
12      if coc_i and coc_j are in a sequential relationship then merge(coc_i, coc_j, sequence);
13  end
14  find the missing → relations;
15  build the final Petri net as stated in Definition 10
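Steps 8-13 of Algorithm 4 amount to choosing a merge operator for each pair of co-occurrence classes. The following sketch shows only this dispatch; it is not the authors' implementation. The callable `relation_of` is a hypothetical stand-in for the result of GetbehaviorRelation, and the relation labels are assumed names.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Optional, Sequence, Tuple

Coc = FrozenSet[str]

# Communication relationship -> merge operator used in steps 9-12.
OPERATOR = {
    "concurrent": "parallel",
    "exclusive": "choice",
    "cycle": "loop",
    "sequential": "sequence",
}


def plan_merges(
    cocset: Sequence[Coc],
    relation_of: Callable[[Coc, Coc], Optional[str]],
) -> List[Tuple[str, Coc, Coc]]:
    """Steps 8-13: pick a merge operator for every pair of classes that has one."""
    plan = []
    for coc_i, coc_j in combinations(cocset, 2):
        relation = relation_of(coc_i, coc_j)
        if relation in OPERATOR:
            plan.append((OPERATOR[relation], coc_i, coc_j))
    return plan


# Toy classes from the sequential-loop example, with hand-written relations.
X, A, C = frozenset("xy"), frozenset("ab"), frozenset("cd")
table = {(X, A): "sequential", (A, C): "cycle"}
plan = plan_merges([X, A, C], lambda i, j: table.get((i, j)))
```

Because the sequential relationship is directional, a fuller version would also query `relation_of(coc_j, coc_i)`; the sketch visits each unordered pair once for brevity.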

Experiment Evaluation
Experiment purpose. In this section, we experimentally evaluate whether the proposed method can discover long-distance dependencies from event logs and rediscover process models with complex cycle structures from small event logs. Several event logs were used to compare our method with other process discovery methods: Inductive Miner (IM) [17], Heuristics Miner (HM) [14], Inductive Miner-Incompleteness (IMin) [17], and Conjoint Occurrence Miner (CoM) [6]. The evaluation answers two questions. The first is whether the proposed method can discover long-distance dependencies. The second is whether the proposed method can correctly rediscover an original model with a complex structure from event logs that are incomplete because of concurrency. In addition to concurrent, sequential, and exclusive structures, the model may also contain complex loop structures.
Experiment design. For question 1, we generate event logs containing 1000 traces by simulating the model M1 with long-distance dependencies shown in Section 3. We then use these logs to rediscover the process model with the different methods. We report the quality of the discovered process models and the consistency between the original model and each discovered model [31], as well as whether the long-distance dependencies of the original model are discovered. For question 2, event logs with different completeness levels of the directly-follows relationship are generated by simulating model M2 shown in Figure 6. These logs are obtained by removing some of the directly-follows relationships derived from the concurrent structure. The different algorithms are used to reconstruct the original process model M2 from these event logs of different sizes, and the differences between the discovered process models and the original model are reported.
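One plausible way to quantify the completeness level of a log with respect to the directly-follows relationship is the share of reference directly-follows pairs that the log still contains. The paper does not state its exact formula in this section, so the ratio below and the function names are assumptions made for illustration.

```python
from typing import Sequence, Set, Tuple

Trace = Sequence[str]


def directly_follows(log: Sequence[Trace]) -> Set[Tuple[str, str]]:
    """All pairs (x, y) such that y immediately follows x in some trace."""
    return {(t[p], t[p + 1]) for t in log for p in range(len(t) - 1)}


def completeness(log: Sequence[Trace], reference_log: Sequence[Trace]) -> float:
    """Assumed metric: fraction of reference directly-follows pairs present in `log`."""
    full = directly_follows(reference_log)
    return len(directly_follows(log) & full) / len(full)


# b and c are concurrent in the reference behavior; the partial log shows one order only.
reference_log = [list("abcd"), list("acbd")]
partial_log = [list("abcd")]
ratio = completeness(partial_log, reference_log)
```

In the toy example the partial log contains three of the six reference pairs, which illustrates how a single missing interleaving of a concurrent block lowers completeness.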
The method proposed in [31] is used to measure the Degree of Trace Consistency (DTC) and the Degree of Profile Consistency (DPC) of the alignment between the models. All miners use default parameters, and h in IMin is set to 0.
Experiment result. From Table 11, it can be seen that both the proposed method and the CoM method can discover long-distance dependencies, but the proposed method is better at discovering the cyclic structure. IMin, IM, and HM cannot rediscover the original process model because they are unable to discover long-distance dependencies. The CoM method also cannot rediscover the original process model because of its limitations with cyclic structures. For the models discovered from the event log generated by M1, the consistency between each discovered process model and the original model is measured from two aspects using the method given in [31]. The results show that the proposed method achieves good consistency between the discovered process model and the original model in terms of both the degree of trace consistency and the degree of profile consistency. Since long-distance dependencies do not affect the behavior profile relationship between activities, the process models obtained by the IMin, IM, and HM algorithms have a good degree of profile consistency with the original model.
Figure 13 shows that the IM algorithm and the HM algorithm require the event log to be complete with respect to the directly-follows relationship, whereas IMin, CoM, and the proposed method can handle an incomplete event log. Although the IMin method can handle the incompleteness of the event log, it cannot discover the long-distance dependency; thus, the value of DTC is lower than that of DPC between the model discovered by IMin and the original model. IMin determines the behavior between activities mainly by retaining the behavior relationships with high probability based on the directly-follows relationship. Hence, the completeness requirement of IMin is lower than that of the IM and HM algorithms but higher than that of the proposed method and the CoM method. The proposed method and the CoM method require the completeness degree of the log to reach 85%. However, because the CoM method has some limitations in discovering the cycle structure, the consistency degree between its discovered model and the original model cannot reach 1, even if the log completeness degree is more than 85%.
IMin requires the completeness ratio of the log to be about 90%. However, since the process model discovered by IMin is block-structured, some hidden transitions may be added when building the model. These hidden transitions may cause the model to generate additional behaviors, which may lead to a certain deviation between the discovered model and the original model. Thus, for the IMin method, even if the log completeness ratio reaches 90%, the consistency degree between the discovered model and the original model is still less than 1.
As the results of our experiments show, the proposed method has two advantages over the other methods. The first is that it handles incomplete event logs better than the IMin algorithm, whereas the IM and HM approaches depend heavily on log completeness. Although both the proposed method and the CoM method can deal with the incompleteness caused by high concurrency, the CoM method has limitations in discovering the loop structure, whereas the proposed method can also discover complex loop structures in addition to sequential, exclusive, and concurrent structures. The second advantage is that the proposed method can discover long-distance dependencies from the event logs. As the experiment shows, IMin, IM, and HM are unable to discover long-distance dependencies.

Conclusions and Future Work
Based on [6], this paper presents a method for discovering a process model with multiple structure types from incomplete event logs. The method can discover not only sequential, concurrent, and conflicting structures but also complex cyclic structures. The paper analyzes the relationship between the co-occurrence classes of the log and the co-occurrence classes of the model, and it derives the communication behavior profiles between the co-occurrence classes of a log using the behavior profile technique. A series of propositions is presented to show how to construct a substructure from a co-occurrence class. Subsequently, the whole process model is constructed from the event log. The proposed method can discover the original process model from an incomplete event log even if the model contains a complex cyclic structure. In addition, it can correctly discover a long-distance dependency implied in the log. For large event logs, however, constructing the co-occurrence classes and analyzing their communication behavior profiles is time-consuming. In the future, we will consider using co-occurrence classes to construct frequent patterns from event logs to analyze local behaviors in business processes.