Extraction of Missing Tendency Using Decision Tree Learning in Business Process Event Log

Abstract: In recent years, process mining has attracted attention as an effective method for improving business operations by analyzing event logs that record what is done in business processes. An event log may contain missing data due to technical or human error, and missing data make the analysis results inadequate. Traditional methods mainly complete missing values by prediction, but accurate completion is not always possible. In this paper, we propose a method for understanding the tendency of missing values in the event log using decision tree learning, without imputing the missing values. We conducted experiments using data from an incident management system and confirmed the effectiveness of our method.


Introduction
Information systems encourage the efficient execution and management of business processes. If the results of business process execution are recorded as an event log, we can use it effectively to improve the business process. The analysis of event logs is called process mining [1] and has attracted much attention in recent years. For example, process discovery [1] is a technique that takes as input an event log recording who executed which activity at what time and automatically generates a business process model consistent with the recorded behavior. Visualizing business processes with such models helps in understanding the current situation. In addition, process mining techniques can be used to check whether a process conforms to the organization's rules, analyze process performance, and suggest process improvements. The availability of event logs allows for evidence-based analysis and increases opportunities for business process improvement [2].
Many process mining algorithms assume that the input event log is of high quality, that is, that it has no missing values. However, the event log may contain missing values [4] for technical reasons (deviations can occur even in automatic logging systems due to machine breakdowns, system bugs, and resource constraints [3]) or because of human error. A certain level of errors in event logs is often unavoidable, particularly when event logs are built by integrating several heterogeneous data sources or where manual logging is involved [5]. In addition, data failures can arise because behavior is allowed to adapt during the execution of a process instance [6].
In data analysis, the principle of "garbage in, garbage out" [7] applies: analyzing poor-quality data yields only meaningless results [8,9]. Mans et al. have also shown that the quality of the event log is an important success factor for process mining projects [5]. Therefore, it is necessary to pre-process the event log before analyzing the data.
In the area of process mining, research on pre-processing of event logs has been conducted. Sim et al. proposed a likelihood-based multiple imputation method to complete the missing values of events in the event log [10]. Conforti et al. proposed a method for repairing the timestamps of events in the event log [11]. With these methods, missing values in the event log can be filled in with predicted values. However, even when such a repair method is used, there is no guarantee that the missing parts are restored to what was actually executed, which may lead to erroneous repairs. Therefore, it is desirable to prevent missing values at the time of data acquisition. References [12,13] state that it is important to know how to systematically identify the root causes of data quality problems in event logs. The quality of data can be improved by (i) improving the way in which data are captured while they are being generated and (ii) improving the data after they have been acquired [8]. The above studies [10,11] correspond to (ii), while our study corresponds to (i). By combining these two approaches, data quality can be expected to improve further.
In this paper, we do not repair the missing values, but we propose a method to understand the tendency of the missing events in the event log. By using decision tree learning to learn the information around the missing points in the event log, we can identify the tendency of the missing points from the branching of the constructed tree. Our proposed method is superior in that it is simple and easy for users to understand, considering the purpose of user support for process mining. Furthermore, in the absence of our method, a human would need to look closely at the data to see where the missing values are occurring, and it would be ad hoc. On the other hand, since our method can express the tendency of the occurrence of missing values in the event log by using a decision tree, we think that our method can analyze the cause of the missing values efficiently. We conducted experiments using real data of business processes published by Volvo IT Belgium and confirmed the effectiveness of our method. This paper is organized as follows. Section 2 explains the preliminary knowledge of this paper. Section 3 explains the proposed method. Section 4 evaluates the proposed method. Section 5 describes the related works. Section 6 summarizes this paper.

Background Knowledge
Section 2.1 describes event logs. Section 2.2 describes data quality in process mining.

Event Logs
An event log is recorded by information systems as a result of events executed in business processes. Executed events are recorded per business process instance (from the start to the end of a case). A trace is the sequence of executed events belonging to one business process instance, and a trace $T_i$ of length $n$ is represented as an ordered sequence of events $e$ as follows, where $i$ is the identifier of the trace in the event log.
$$T_i = \langle e_{i1}, e_{i2}, \ldots, e_{in} \rangle$$
This indicates that the events were executed in sequence from left to right. Each event can also have attributes such as the person who executed the event and the time it was executed, but we do not use this information in this paper. Each event can be represented by an activity (for example, "register request" in the handling of requests for compensation) that describes the content of the event in an easy-to-understand manner. Table 1 is an example of an event log. Trace 1 consists of four events, $\langle e_{11}, e_{12}, e_{13}, e_{14} \rangle$, and these events have the activity labels <A, B, C, D>. In addition, the person who executed the event (the resource) and the time of execution (timestamp) are recorded. In this paper, we extract missing tendencies with a particular focus on activities.
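As an illustration of this representation, the following Python sketch models a trace as an ordered list of events and extracts the activity sequence used in the rest of the paper. The `Event` class, the `activity_sequence` helper, and the resource names and timestamps are hypothetical; only the activity labels A to D come from the Trace 1 example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Event:
    """One executed event; `activity` is None when the activity is missing."""
    activity: Optional[str]
    resource: str          # person who executed the event (hypothetical values below)
    timestamp: datetime    # time of execution (hypothetical values below)


def activity_sequence(trace: List[Event]) -> List[Optional[str]]:
    """Return the activities of a trace in execution order (left to right)."""
    return [event.activity for event in trace]


# Trace 1 from the running example: four events labeled A, B, C, D.
trace_1 = [
    Event("A", "resource-1", datetime(2020, 1, 1, 9, 0)),
    Event("B", "resource-2", datetime(2020, 1, 1, 9, 30)),
    Event("C", "resource-1", datetime(2020, 1, 1, 10, 0)),
    Event("D", "resource-3", datetime(2020, 1, 1, 11, 0)),
]

print(activity_sequence(trace_1))
```

Because only activities are used later, a trace can be reduced to this plain sequence before any further processing.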

Data Quality
Ensuring data quality is a key challenge for successful process mining [1]. Reference [14] defines five levels of maturity for event log data quality, from excellent quality (★★★★★) to low quality (★). Process mining techniques are considered to be applicable to ★★★★★, ★★★★, and ★★★ event logs, and in principle they can be applied to ★★ and ★ event logs as well. However, analysis using such low-quality event logs is problematic and the results cannot be trusted. Therefore, for situations where only ★ or ★★ level event logs are available, this study proposes a method for finding the problems of the current low-quality event logs at the stage of acquiring them, in order to obtain high-quality event logs in the future.
There are various dimensions of data quality [15]. In particular, Reference [4] lists the following four dimensions related to data quality in process mining.
(1) Missing Data: This corresponds to the situation where data that should be recorded are missing. For example, events and attributes in the log and their relationships are lost. Missing values are caused by problems with the logging system or human error.
(2) Incorrect Data: This corresponds to the situation where the log differs from what should have been recorded. For example, an entity or value recorded in the log is incorrect.
(3) Imprecise Data: This corresponds to the case where the recorded data are too coarse to support an analysis that requires more precise data. For example, if the timestamp unit is days, it is difficult to analyze in hours or minutes.
(4) Irrelevant Data: This corresponds to the situation where the log contains more data than needed, including data that are not of current interest. Filtering irrelevant data is difficult and is an important issue in process mining.
This paper deals with (1) Missing Data, in particular situations where events and activities are missing. Missing events and activities are those that should have been recorded but were not. For example, a trace that should be recorded as <A, B, D, E> is recorded as <A, B, -, E>, where the hyphen (-) indicates that an event and its activity are missing. If the missing activity D leaves no mark and the trace becomes <A, B, E>, the missing value may not be detected at all. As in Reference [16], this paper focuses on the case where the missing value can be detected, as in <A, B, -, E>.

Proposed Method
In this section, we describe a method for extracting the common tendency of traces that contain missing values in an activity using decision tree learning. Figure 1 is an overview of the proposed method. In general, the results of business process execution are recorded in an event log either by the information system or manually. Since the event log may contain missing values due to technical or human causes, we use a decision tree to extract the missing tendency. The proposed method consists of two phases: (1) vectorization of traces containing missing values of activities, and (2) construction and interpretation of decision trees.

Vectorization of Traces with Missing Activity
Event logs are usually stored in XES (eXtensible Event Stream), an XML-based format, which is difficult to use directly for training a decision tree. In existing research using machine learning in process mining, several methods have been proposed to extract features from the event log for training [17]. However, no method has been established for learning from event logs that contain missing values. In this paper, we focus on the activities performed before and after the missing values in a trace, and we vectorize the traces containing missing values into a form that can be used for decision tree learning.
Algorithm 1 generates a set of features from an event log that contains missing values. All the activities in all the traces (from the trace with ID 0 to the trace with ID n) are checked in sequence, and whenever a missing value is found, two feature values are generated: 200 is added to the ID of the activity executed immediately before the missing value, and 100 is added to the ID of the activity executed immediately after it. The numbers 100 and 200 are added in order to distinguish these features from the original activity IDs. Here, we assume that there are at most 100 types of activities (IDs 0 to 99), because otherwise we could not distinguish, for example, the feature obtained by adding 100 to ID 1 from an originally existing ID 101. The numbers to be added may be different, such as 1000 and 2000, and a log with more than 100 activities can be handled by increasing them. For example, if the number to be added is 200, our method can accommodate up to 200 activities. After generating the set of features, each trace is checked and a one-hot vector is generated, whose element is 1 if the corresponding feature exists in the trace and 0 if not. Table 2 shows an example of generating a vector from traces. In TraceID T2, the activities 4 and 3 are executed immediately before and after the missing value represented by "-". Therefore, the features 204 (= 4 + 200) and 103 (= 3 + 100) are set to 1 for T2.

Algorithm 1
FeatureSet = ∅
for i = 0 to n do
    for j = 0 to m do
        if (activity_ij = φ) then
            if ((activity_ij−1 + 200) ∉ FeatureSet) then
                FeatureSet += (activity_ij−1 + 200)
            end if
            if ((activity_ij+1 + 100) ∉ FeatureSet) then
                FeatureSet += (activity_ij+1 + 100)
            end if
        end if
    end for
end for
return FeatureSet
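The feature generation and the one-hot vectorization can be sketched in Python as follows. This is a minimal sketch under several assumptions that the text does not state: a trace is a list of integer activity IDs, a missing activity is marked by `None`, a missing value at the start or end of a trace (or next to another missing value) simply contributes no feature on that side, and only the offset features of Algorithm 1 are used as columns. The function names and the example traces are hypothetical.

```python
MISSING = None        # assumed marker for a missing activity ("-" in the text)
BEFORE_OFFSET = 200   # added to the ID of the activity just before a missing value
AFTER_OFFSET = 100    # added to the ID of the activity just after a missing value


def build_feature_set(traces):
    """Collect the offset features around every missing value (cf. Algorithm 1)."""
    features = set()
    for trace in traces:
        for j, activity in enumerate(trace):
            if activity is not MISSING:
                continue
            # Assumption: skip neighbors that do not exist or are missing themselves.
            if j > 0 and trace[j - 1] is not MISSING:
                features.add(trace[j - 1] + BEFORE_OFFSET)
            if j + 1 < len(trace) and trace[j + 1] is not MISSING:
                features.add(trace[j + 1] + AFTER_OFFSET)
    return sorted(features)


def vectorize(traces):
    """Return (columns, one-hot rows, labels); label 1 means the trace has a missing value."""
    columns = build_feature_set(traces)
    rows, labels = [], []
    for trace in traces:
        present = set(build_feature_set([trace]))
        rows.append([1 if feature in present else 0 for feature in columns])
        labels.append(1 if MISSING in trace else 0)
    return columns, rows, labels


# Hypothetical traces: the second one has activities 4 and 3 around the missing value.
example_traces = [[0, 1, 2, 4], [0, 4, MISSING, 3, 2]]
columns, rows, labels = vectorize(example_traces)
print(columns, rows, labels)
```

Using a set makes the membership checks of Algorithm 1 implicit, and sorting the features gives a stable column order for the vectors.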

Construction and Interpretation of Decision Trees
The event log vectorized by the method described in Section 3.1 is used to build a decision tree with the decision tree learning algorithm CART [18]. Decision tree learning is a method that, given an objective variable and explanatory variables, generates a tree-structured classifier by dividing the data top-down. In this paper, we learn a decision tree in which the objective variable indicates whether the trace contains missing values, and the explanatory variables describe the activities taking place before and after the missing values. Figure 2 shows an example of a decision tree constructed by the proposed method. The "200.0 <= 0.5" in the root node is the branching condition. Because each feature in the one-hot vector is 0 or 1, a trace branches to the left when the value of the feature 200.0 is 0 and to the right when it is 1. The threshold 0.5 results from the specification of scikit-learn, the machine learning library we used; any threshold strictly between 0 and 1 would give the same split, since we only need to know whether the feature of interest is 0 or 1. "samples" is the number of traces at the node, and in "value" the number on the left is the number of traces with no missing values and the number on the right is the number of traces with missing values. In this example, 554 traces do not contain any missing values and 387 traces contain them. Entropy is an information-theoretic measure of uncertainty in a multi-set of elements. If the multi-set contains many different elements and each element is unique, then variation is maximal and it takes many "bits" to encode the individual elements [1].
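To make the entropy value shown in a node concrete, the following sketch computes the entropy of the class counts of the example root node (554 traces without and 387 traces with missing values). The helper `entropy` is illustrative and not part of scikit-learn; it assumes the base-2 logarithm, which matches the reading of entropy in bits.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Class counts of the example root node: [no missing values, missing values].
root_entropy = entropy([554, 387])
print(round(root_entropy, 3))  # approximately 0.977
```

A node that contains traces of only one class has entropy 0, and a node with equally many traces of both classes has entropy 1, so the value of about 0.977 at the root indicates a nearly balanced mix.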
From this decision tree, we can derive the rule of missing values that "there is always an activity with an activity ID of 0 immediately before the activity with missing values in the trace containing the missing values".
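The construction step can be sketched with scikit-learn as follows. This is a small illustration on hypothetical toy data, not the data behind Figure 2: the two columns stand for the features 100 and 200, traces without missing values are all-zero vectors, and every trace with a missing value has the feature 200 set. The choice of `criterion="entropy"` is an assumption based on the entropy values described for the tree nodes.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical one-hot vectors; columns are the features 100 and 200.
feature_names = ["100", "200"]
X = [[0, 0]] * 5 + [[0, 1]] * 3 + [[1, 1]] * 2
y = [0] * 5 + [1] * 5   # 1 = trace contains a missing value

# scikit-learn implements an optimized version of CART.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Text rendering of the learned rules; the root should test "200 <= 0.50".
rules = export_text(clf, feature_names=feature_names)
print(rules)
```

Reading the printed rules from the root downward gives the same kind of statement as the rule above: in this toy data, a trace contains a missing value exactly when the feature 200 is 1, that is, when activity 0 occurs immediately before the missing value.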

Evaluation
In this section, we describe the experiments to confirm the effectiveness of the proposed method and the results of the experiments.

Details of the Experiment
To evaluate the proposed method, we apply it to an event log containing missing values and check whether the tendency of the missing values can be extracted by decision tree learning. We chose this form of evaluation because, to the best of our knowledge, no existing study allows a quantitative comparison with this study.
The data set used was the event log of the incident management system published by Volvo IT Belgium, which was the subject of the BPI Challenge 2013 (https://www.win.tue.nl/bpi/doku.php?id=2013:challenge). The expected workflow of this system represents the process of restoring normal service after an incident occurs. The event log contains 6042 traces and 13 different types of activities. Table 3 lists the activities. Each activity is assigned an ID, and from here on each activity is referred to by its ID. Because this event log does not contain any missing values, the following three patterns of missing values were generated separately. The three patterns were determined by focusing on control flow, which is an important factor to consider in business processes. Which IDs were subject to deletion was decided randomly.

Table 3. Activity name and ID.

Activity ID    Activity Name
0              Accepted in-progress
1              Queued-Awaiting-assignment
2              Completed-resolve
3              Accepted assigned
4              Completed closed
5              Accepted wait-user
6              Accepted wait-implementation
7              Accepted wait
8              Completed In Call
9              Accepted wait-vendor
10             Accepted wait-customer
11             Unmatched Unmatched
12             Completed cancelled

pattern 1: The activity immediately before or after an activity with ID 0 is deleted and a missing value is generated.
pattern 2: When an activity with ID 0 or 2 occurs during a trace, the next activity is deleted and a missing value is generated.
pattern 3: When an activity with ID 5, 6, 7, 9, or 10 is performed during a trace, a random number from 0 to 100 is generated. If the generated random number is greater than 40, that activity is deleted and a missing value is generated.
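The deletion rules of patterns 2 and 3 can be sketched as follows. Several details are assumptions, since the description leaves them open: a trace is a list of integer activity IDs, a deleted activity is replaced by `None`, pattern 2 scans from left to right and an already deleted activity does not trigger a further deletion, and the random number in pattern 3 is an integer drawn uniformly from 0 to 100 inclusive. Pattern 1 is not sketched because the text does not say how the choice between the preceding and the following activity is made. The function names are hypothetical.

```python
import random

MISSING = None                  # assumed marker for a deleted activity
PATTERN_3_IDS = {5, 6, 7, 9, 10}


def apply_pattern_2(trace):
    """Delete the activity that follows each remaining activity with ID 0 or 2."""
    result = list(trace)
    for j in range(len(result) - 1):
        # Assumption: an activity that was already deleted triggers nothing.
        if result[j] in (0, 2):
            result[j + 1] = MISSING
    return result


def apply_pattern_3(trace, rng):
    """Delete activities with ID 5, 6, 7, 9, or 10 when a random draw exceeds 40."""
    result = []
    for activity in trace:
        # Assumption: integer draw from 0 to 100 inclusive.
        if activity in PATTERN_3_IDS and rng.randint(0, 100) > 40:
            result.append(MISSING)
        else:
            result.append(activity)
    return result


print(apply_pattern_2([1, 0, 3, 2, 4]))
print(apply_pattern_3([0, 5, 1, 7], random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the probabilistic pattern reproducible between runs.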
After generating the missing values using the method described in Section 4.1, the event log was vectorized and the decision tree was constructed using CART [18], a decision tree learning algorithm of the machine learning library scikit-learn [19].
Our experiments were conducted on a MacBook Air with a 1.8 GHz Intel Core i5 processor and 8 GB of main memory, running macOS 10.13.6.

Results
We constructed three decision trees corresponding to each of patterns 1, 2, and 3 in Section 4.1 using the proposed method. The results are explained below.

Result of Pattern 1
In pattern 1, the decision tree obtained by applying the proposed method is shown in Figure 3. The value of the root node shows that the number of traces containing the missing values is 3692 in the event log. There are 2350 traces that do not contain missing values. A common rule for traces with missing values is that most of the traces (3589) with missing values have an activity with an activity ID of 0 just before the missing value. Then, it can be seen from the classification rules of the decision tree that in other traces that contain missing values, there is an activity with an activity ID of 0 that is performed immediately after the missing value. From these two rules in the resulting decision tree, we were able to extract a rule for the missing values contained in this event log, which states that an activity with an activity ID of 0 is always performed immediately before or after the missing value. This is equivalent to the missing rules of artificially generated pattern 1, so we can see that we have correctly extracted the rules common to each of the traces containing the missing values.

Result of Pattern 2
In pattern 2, the decision tree obtained by applying the proposed method is shown in Figure 4. The value of the root node shows that the number of traces containing the missing values is 1870 in the event log. There are 4172 traces that do not contain missing values. It can be seen that the rule "an activity with an activity ID of 0 or 2 is executed just before the missing value" can be extracted using the decision tree shown in Figure 4. This is equivalent to the missing rules of artificially generated pattern 2, so we can see that we have correctly extracted the rules common to the traces containing the missing values.

Result of Pattern 3
In pattern 3, the decision tree obtained by applying the proposed method is shown in Figure 5. The value of the root node shows that the number of traces containing the missing values is 2813 in the event log. There are 3229 traces that do not contain missing values. A common rule for traces containing missing values is that an activity with an activity ID of 0 takes place before the missing values. Furthermore, the fact that an activity with an activity ID of 2 is performed after the missing value is extracted as a rule. This result shows that the activity with an activity ID of 0 was performed before the missing value, and the activity with an activity ID of 2 was often performed after the missing value.

Summary and Discussion of the Results
In summary, for patterns 1 and 2, where activities were deleted by a deterministic rule, we were able to classify the traces with 100% accuracy. For pattern 3, where activities were deleted probabilistically, 25 out of 6042 traces (about 0.4%) were misclassified.
If our method is not used, analysts need to look closely at the data, without any assistance, to see where the missing values occur. In contrast, since our method expresses the tendency of missing values in the event log as a decision tree, we believe that it allows the cause of the missing values to be analyzed efficiently when the missing values follow a pattern, as in patterns 1 and 2. In the case of pattern 3, misclassifications occurred. To deal with this problem, we can modify the features and extend the interpretation of the constructed decision tree. This is a subject for future work.

Related Works
Various studies have been done on the quality of event logs. Many of these studies have proposed algorithms for repairing defects in the event log, such as the presence of missing values in activities and timestamps [3,10,11,16,20-31]. By using these methods, we can obtain higher quality data that are close to the actual business process from the data with missing values. However, the completion of missing values is not always accurate. There are cases in which the completion of a missing value is different from the actual business process.
While existing methods of missing value completion improve data quality immediately, this study proposed a method to understand the tendency of missing values by focusing on the activity of the business process to support long-term accurate data recording.
The importance of obtaining high-quality data is not limited to the field of process mining. Various research papers on data repair (e.g., [32-37]) and survey papers (e.g., [38,39]) exist. These are not concerned with the data quality of the event log in process mining. Since there are various formats of data, it is necessary to have a method that suits the analysis target. In our research, we focus on control flow, which is important in business processes, to extract the missing tendency of activities.
There are also studies that have detected problems related to the quality of event logs [40,41] and proposed evaluation methods [42,43]. These are similar to our study, but our study differs in its target because it specializes in extracting the tendency of missing values.
Andrews et al. proposed a quality-aware, semi-automated approach to extracting event logs from relational data [44]. Fani Sani et al. showed that many process mining algorithms have difficulty handling very large and noisy event data, and proposed a method for selecting traces with pre-processing functions that reduce the complexity of the event log [45]. Whereas these studies are concerned with pre-processing to generate high-quality event logs, this study is not about pre-processing but is intended to provide continuous support for obtaining high-quality event logs.
Several domain-specific data quality frameworks [46-48] have been proposed. On the other hand, our method is considered to be applicable to a wide range of domains without domain knowledge.

Conclusions
In this paper, we proposed a method for identifying the tendency of missing activity values in a low-quality event log of business processes by using decision tree learning, with information on the activities executed before and after the missing values as features. Through evaluation experiments, we showed that it is possible to extract rules for classifying traces with and without missing values with high accuracy.
Future work is to improve the expressiveness of the extracted rules. In this paper, we focused on the activities performed before and after the missing values and learned them as features. By adding features such as the person who performed the activity and the execution time, we expect to be able to extract the tendencies of a wider variety of missing values.