Event Log Preprocessing for Process Mining: A Review

Process Mining allows organizations to obtain actual business process models from event logs (discovery), to compare the event log or the resulting process model of the discovery task with an existing reference model of the same process (conformance), and to detect issues in the executed process in order to improve it (enhancement). An essential element in the three tasks of process mining (discovery, conformance, and enhancement) is data cleaning, used to reduce the complexity inherent in real-world event data so that they can be easily interpreted, manipulated, and processed in process mining tasks. Thus, new techniques and algorithms for event data preprocessing have been of interest to the business process research community. In this paper, we conduct a systematic literature review and provide, for the first time, a survey of relevant approaches to event data preprocessing for business process mining tasks. The aim of this work is to construct a categorization of techniques and methods related to event data preprocessing and to identify relevant challenges around these techniques. We present a quantitative and qualitative analysis of the most popular techniques for event log preprocessing. We also study and present findings about how a preprocessing technique can improve a process mining task. We also discuss the emerging future challenges in the domain of data preprocessing in the context of process mining. The results of this study reveal that preprocessing techniques in process mining have demonstrated a high impact on the performance of process mining tasks. The data cleaning requirements depend on the characteristics of the event logs (volume, high variability in trace length, changes in the duration of activities, etc.). In this scenario, most of the surveyed works use more than a single preprocessing technique to improve the quality of the event log. Trace clustering and trace/event level filtering turned out to be the most commonly used preprocessing techniques due to their ease of implementation, and they adequately manage noise and incompleteness in the event logs.


Introduction
Process mining is a relatively new study area that has gained significant attention among computer science and business process modeling communities [1]. It is a powerful tool for organizations to obtain actual models for better understanding of the real operation of their business processes and for better decision making. Process mining techniques allow automatic discovery, conformance, and improvement of process models implemented by organizations through the extraction of knowledge from event logs as well as from the available documentation of the process model [2]. In this context, an event log is a collection of time-stamped event records produced by the execution of a business process.
Considering that the event log is the main input for process mining techniques, the quality of this information has a great impact on the resulting model. An event log with low quality (missing, erroneous, or noisy values, duplicates, etc.) can lead to a complex, unstructured (spaghetti-type), and difficult-to-interpret model (as shown in Figure 1a), or to a model that does not reflect the real behavior of the business process. Therefore, event log data preprocessing is considered a task that can substantially improve the performance of process mining techniques. Mans et al. [4] describe event log quality as a two-dimensional spectrum, where the first dimension is concerned with the abstraction level (or granularity) of the activities that are part of the process model. The second one is concerned with the accuracy of the timestamp (in terms of its granularity, directness of registration, and correctness) of the events registered in the log. Emamjome et al. [5] present the data quality problem from a theoretical approach, addressing the root causes and the social, material, and individual factors that contribute to low-quality event data, which would be overlooked by existing data cleaning methods. Other works [6,7] describe data quality as a multi-dimensional concept associated with accuracy/correctness, completeness, unambiguity/understandability, and timeliness. In this paper, we relate event log quality to the identification, visualization, and correction or elimination of incorrect (noisy), missing, duplicate, and irrelevant events.

In real scenarios, some process mining tasks work under the assumption that the behavior related to the execution of the running process is stored correctly within the event log, and that each instance of the process stored in the event log is already finished. However, real-world event logs contain noisy or corrupt data records, which can be generated by different factors: some traces are duplicated, incomplete, inconsistent, or reflect some other incorrect behavior. These problems can be caused by several factors, including errors in data transmission, errors during storage, technology limitations, or transcription errors when events arrive in the wrong order. Sometimes, the noise may be associated with the presence of rare events due to the handling of exceptional cases, the incorrect recording of selected tasks during the execution of the process, or even the incorrect assignment of timestamps. Frequently, it is difficult to distinguish between noise and infrequent correct behavior in an event log, resulting in a mined model with less fidelity to the real model. Different noise types in the event log negatively impact data quality, causing process mining algorithms to return complex, incomprehensible, or even inaccurate results. Therefore, to reduce these negative effects, event log preprocessing is a necessary task in the majority of process mining algorithms. The preprocessing of the event log attempts to detect and remove noisy events, traces, or activities that contain such undesired behavior. This work provides a comprehensive review of the most representative techniques for preprocessing event logs, which is crucial for the performance of process mining tasks.
For the first time, we present a state-of-the-art review of the approaches for event log preprocessing that include techniques based on heuristics, pattern-based methods, trace clustering, and hybrid methods. We present an extensive analysis of the surveyed works, qualitatively and quantitatively.
Our review provides answers, in a broad sense, to three main questions: (1) How can the different techniques of event log preprocessing be grouped? (2) What problems exist around achieving data quality in the event log? (3) How does one determine whether a preprocessing technique could substantially improve a process mining task? For example, the grouping of preprocessing techniques and the identification of their associated challenges can help process mining implementers to know the diverse types of available techniques and provide them with more elements for selecting the most appropriate technique based on the underlying algorithms, the type of quality problems addressed, or particular issues in the application domain.
This research work has three main contributions:

1. We present, for the first time, a review of preprocessing techniques for event logs, also called data cleaning or data preparation techniques in the context of process mining.

2. We provide a grouping of the preprocessing and repairing techniques of event logs required to build more robust process models.

3. We present a study of relevant characteristics associated with preprocessing techniques, useful when making decisions about the use of a specific technique.
The remainder of this paper is organized as follows: Section 2 introduces the basic concepts related to event log preprocessing and process mining. Section 3 presents the research methodology followed in this work to build this survey. Sections 3.2.1 and 3.2.2 present a proposal to group event log preprocessing techniques according to the approaches reported in the state-of-the-art. Section 3.3 outlines the tools used in each surveyed proposal. Section 3.4 shows the representation schemes used for the manipulation and transformation of the event log. Section 3.5 presents the different problems identified in the event logs. Section 3.6 describes the tasks closely related to preprocessing. Section 3.7 identifies the attribute types used to improve the quality of the event log. Section 4 provides insights on lessons learned and open problems. Finally, Section 5 concludes this work.

Preliminary Concepts
Process mining algorithms act over an event log, i.e., an event collection containing historical records of each business process instance. The sequence of events produced during the execution of a business process instance (a case) corresponds to a trace. The set of all traces constitutes the event log. This section presents some useful concepts for understanding the basis of event log preprocessing in the context of process mining.

Definition 1.
An event refers to a case, an activity, and a point in time. The event is characterized by a set of attributes such as ID, timestamp, cost, and resource, among others [8].

Definition 2.
A trace can be seen as a case, i.e., a finite sequence of events σ ∈ E*, such that each event appears only once [8].
Definition 3. An event log consists of a set of cases, and cases consist of events, such that each event appears, at most, once in the entire log [8].
The events for a case are represented in the form of a trace, i.e., a sequence of unique events. Moreover, cases, like events, can have attributes. The structure of an event log is made up of the following elements:

Definition 4.
A business process model is a graphical and analytic representation used to capture the behavior of an organization's business processes. A business process model is usually expressed through different graphical methods or notation languages, such as flowcharts, UML, workflows, Petri nets, and BPMN, among others.
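Definitions 1-3 can be illustrated with a minimal, purely illustrative data structure; the attribute names and values in the sketch below are hypothetical and not taken from any of the surveyed logs.

```python
from datetime import datetime

# Illustrative event log following Definitions 1-3: an event is a set of
# attributes, a trace is the ordered event sequence of one case, and the
# log is the collection of all traces.
event_log = [
    [  # trace of case "c1"
        {"case": "c1", "activity": "register", "timestamp": datetime(2020, 1, 6, 9, 0),  "resource": "Ana"},
        {"case": "c1", "activity": "check",    "timestamp": datetime(2020, 1, 6, 9, 30), "resource": "Bob"},
        {"case": "c1", "activity": "decide",   "timestamp": datetime(2020, 1, 6, 11, 0), "resource": "Ana"},
    ],
    [  # trace of case "c2"
        {"case": "c2", "activity": "register", "timestamp": datetime(2020, 1, 7, 8, 15), "resource": "Ana"},
        {"case": "c2", "activity": "decide",   "timestamp": datetime(2020, 1, 7, 8, 40), "resource": "Cai"},
    ],
]

# The control-flow view of a trace is its sequence of activity labels.
variants = [tuple(e["activity"] for e in trace) for trace in event_log]
print(variants)  # [('register', 'check', 'decide'), ('register', 'decide')]
```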
In the context of event log preprocessing, it is essential to identify the issues closely related to the quality of the data recorded in the event log. Therefore, some of the data quality issues that commonly occur in event logs are described below.
Noise/anomalous data: this problem corresponds to the scenario where the data in the event log contain errors or meaningless values that deviate from the expected behavior. Incorrect or noisy data may result from inconsistencies or discrepancies in the naming conventions or data codes used, or from inconsistent formats for input fields, such as timestamps. Hence, it is necessary to use techniques to eliminate or replace the noisy data.
Missing data: this problem occurs when information that should be registered on a mandatory basis is missing from the event log. This happens, for example, when an attribute of an event is missing due to problems related to the sending, registration, or storage of events from an information system.
Irrelevant data: in this scenario, there may be event records that are irrelevant for the analysis of the model under study, but from which it is possible to derive relevant event records through transformation and filtering processes.
Duplicated data: this problem is present when the same event is recorded in the event log more than once, by the same resource and with the same timestamp (a simple detection sketch is shown below). Likewise, the problem may arise when an activity is registered more than once by the information system, sometimes causing the process model to become overly complex.
Data diversity: this situation is present when the information system is very general and allows events to be registered at different levels of granularity, which makes process models incomprehensible and difficult to represent.
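As a simple illustration of the duplicated data issue described above, the following sketch (reusing the illustrative event representation with activity, resource, and timestamp attributes introduced earlier; it is not taken from any surveyed tool) keeps only the first occurrence of events that are recorded more than once by the same resource with the same timestamp.

```python
from datetime import datetime

def remove_exact_duplicates(trace):
    """Keep the first occurrence of events that share the same activity,
    resource, and timestamp (the duplication scenario described above)."""
    seen, kept = set(), []
    for event in trace:
        key = (event["activity"], event["resource"], event["timestamp"])
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept

trace = [
    {"activity": "check",  "resource": "Bob", "timestamp": datetime(2020, 1, 6, 9, 30)},
    {"activity": "check",  "resource": "Bob", "timestamp": datetime(2020, 1, 6, 9, 30)},  # duplicate
    {"activity": "decide", "resource": "Ana", "timestamp": datetime(2020, 1, 6, 11, 0)},
]
print(len(remove_exact_duplicates(trace)))  # 2
```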
Many of the issues of data quality previously mentioned have been addressed in the surveyed preprocessing techniques in this work.

Research Methodology
This section describes the methodology for the literature review presented in this work and the inclusion and exclusion criteria established for the selection of the surveyed works.

Systematic Review Process
The search and selection strategy of research works for this review was conducted in two stages. The first stage consisted of retrieving the related works from three popular electronic libraries covering various disciplines, including process mining: IEEE Xplore, Springer Link, and Science Direct. Specifically, we collected papers published since 2005 (the period in which automatic process mining algorithms, such as the alpha algorithm, began to be proposed) using the following terms identified in their title or abstract: "refining, repairing, cleaning, refinement, filtering, clustering, preprocessing, ordered, aligning, abstraction, anomalous detection, infrequent behavior, noisy, imperfection, traces, event log, process mining". These terms were combined to form search strings that served as input queries to the three digital libraries (see Table 1). Given the generality of the articles retrieved from the selected digital libraries, in the second stage a selection strategy based on the inclusion/exclusion criteria was applied to decide which articles would be included in the final review.

Inclusion Criteria
Below are the inclusion criteria that were used to select the works analyzed and discussed in this review.

1. Research works written in English.

2. Research works published in journals, conferences, or theses.

3. Works published in 2005 or later.

Exclusion Criteria
The following are the criteria used to discard research works that are not of interest for this review.

1. Works that are not related to process mining.

2. Works that only address specific domains in the field of process mining (e.g., industry, manufacturing); that is, ad hoc techniques for a given domain.

3. Works that do not include evaluation and experimental results.
After filtering by the queried topics and removing duplicate retrieved papers, a total of 95 papers were obtained and analyzed, considering the inclusion criteria. After applying the exclusion criteria to this set, 70 papers remained, all of which are included in our qualitative analysis. Figure 2 shows the distribution of the selected works (in %) from 2006 to 2020, based on the year of publication and grouped into three-year periods. Although Figure 2 reveals a slowly growing tendency, it is difficult to determine whether that tendency will continue.
As shown in Figure 3, 25% of the selected works were published in journals, 70% were reported at international conferences, and the rest are theses. Early works (2006-2008) have had great influence on the community; as Figure 4 reveals, they account for a total of around 2400 citations. Moreover, the most recent works (2017-2020) have received more than 350 citations. Figure 5 shows a network of closely related terms that describe the different addressed topics of event log preprocessing. This network was formed from the relationships identified between concepts included in the abstracts of the surveyed works. The color intensity in the network nodes refers to more general or abstract terms that contain other more specific terms, all related to describing the handling of preprocessing in process mining. This network of terms could be useful for scientists to understand and organize the diversity of existing preprocessing techniques.

During the literature review, a content study was performed. In this study, we identified and classified the common and relevant characteristics found in the surveyed papers. Table 2 outlines a general view and a summary of the most significant characteristics (C1-techniques, C2-tools, C3-representation schemes, C4-imperfection types, C5-related tasks, and C6-types of information), which are described in greater detail in the next sections.

C1. Techniques
Is there a way of grouping event log preprocessing techniques? Different criteria might lead to different taxonomies of data preprocessing techniques in the context of process mining. From the surveyed works, we organize the existing event log preprocessing techniques into two main groups: transformation techniques and detection-visualization techniques. The main classification criterion is the approach followed by the preprocessing techniques to clean the data, which includes the identification, isolation, and reparation of errors. Figure 6 schematically shows a possible taxonomy for the surveyed works. The proposed taxonomy organizes the diversity of existing preprocessing techniques and helps identify characteristics that they may have in common. Our grouping also serves to identify the data quality issues for which certain types of techniques are more suitable. The first category consists of techniques that perform transformations in the event log in order to correct imperfect behaviors (missing, irrelevant, duplicate data, etc.) before applying a process mining algorithm. The second category is comprised of techniques to detect or diagnose imperfections in an event log. While the techniques in the second category only detect potential problems related to data quality in the event log, those in the first category directly correct the imperfections found in the log.

Transformation Techniques
Transformation techniques carry out operations and actions that make changes to the original structure of the raw event log in order to improve the quality of the log. Within this group, there are two main approaches: filtering and time-based techniques. On the one hand, filtering techniques aim to determine the likelihood of the occurrence of events or traces based on their surrounding behavior. The events or traces with a lower frequency of occurrence are removed from the original event log. Filtering techniques are focused on removing logging mistakes to prevent them from spreading to the process models. On the other hand, the objective of time-based techniques is to maintain and correct the order of the events recorded in the log based on the timestamp information.
Filtering techniques fundamentally address the search and elimination of noise/anomalous events or traces with missing values. Their main characteristics involve the filtering of atypical behavior identified in the event log that may affect the performance of future process mining tasks. These techniques model the frequently occurring contexts of activities and filter out the contexts of events that occur infrequently in the log.
There are several works [9][10][11][12][13][14][15] reported in the literature that propose the development of filtering techniques. Conforti et al. [10] presented a technique that relies on the identification of anomalies in a log automaton. First, the technique builds an abstraction of the process behavior recorded in the log as an automaton (a directed graph). This automaton captures the direct follow dependencies between events in the log. Infrequent transitions are subsequently removed using an alignment-based replay technique while minimizing the number of events removed from the log.
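The following is a much-simplified sketch of the general idea behind such automaton-based filtering, not Conforti et al.'s actual algorithm (which additionally relies on alignment-based replay to minimize the number of removed events): it builds the directly-follows relation of the log and flags transitions whose frequency is low relative to other transitions leaving the same activity. The threshold value and the toy log are assumptions.

```python
from collections import Counter

def directly_follows(log):
    """Count directly-follows pairs (a, b) over all traces.
    Each trace is assumed to be a list/tuple of activity labels."""
    pairs = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            pairs[(a, b)] += 1
    return pairs

def infrequent_transitions(log, threshold=0.05):
    """Flag arcs whose frequency, relative to the most frequent arc
    leaving the same activity, falls below the threshold."""
    pairs = directly_follows(log)
    out_max = Counter()
    for (a, _), n in pairs.items():
        out_max[a] = max(out_max[a], n)
    return {arc for arc, n in pairs.items() if n < threshold * out_max[arc[0]]}

log = [("a", "b", "c")] * 40 + [("a", "c")]
print(infrequent_transitions(log))  # {('a', 'c')} flagged as potentially spurious
```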
van Zelst et al. [11] proposed an online/real-time event stream filter designed to detect and remove spurious events from event streams. The main idea of this approach is that dominant behavior attains higher occurrence probabilities within the automaton compared to spurious behavior. This filter was implemented as an open-source plugin for both ProM [16] and RapidProM [17] tools.
Wang et al. [9] presented the study of techniques for recovering missing events, thus providing a set of candidates of more complete provenance. The authors used a backtracking idea to reduce the redundant sequences associated with parallel events. A branching framework was then introduced, where each branch could apply the backtracking directly. The authors constructed a branching index and developed reachability checking and lower bounds of recovery distances to further accelerate the computation.
Tax et al. [15] proposed four novel techniques for filtering out chaotic activities, which are defined as activities that do not have a clear position in the event sequence of the process model and whose probability of occurrence does not change (or changes little) as an effect of the occurrence of other activities; i.e., chaotic activities are not part of the process flow.
Within preprocessing approaches based on event-level filtering, [12][13][14] used trace sequences as a structure for managing the event log. This structure allows, in most of these works, the ordering and calculation of the frequency of occurrence of events for the identification of noise/anomalous behavior in the event log.
Other works, such as in [18][19][20][21], present algorithms for detection and removal of anomalous traces of process-aware systems, where an anomalous trace can be defined as a trace in the event log that has a conformance value below a threshold provided as input for the algorithm. That is, anomalous traces, once discovered, must be analyzed to find out if they are incorrect executions or if they are acceptable but uncommon executions.
Cheng and Kumar [22] aimed to build a classifier on a subset of the log, and apply the classifier rules to remove noisy traces from the log. They presented two proposals; the first one to generate noisy logs from reference process models, and to mine process models by applying process mining algorithms to both the noisy log and the sanitized version of the same log, then comparing the discovered models with the original reference model. The second proposal consisted of comparing the models obtained before and after sanitizing the log using structural and behavior metrics.
Fani Sani et al. [23] proposed a filtering approach based on conditional probabilities between sequences of activities. Their approach estimates the conditional probability of occurrence of an activity given its preceding activities. If this probability is lower than a given threshold, the activity is considered an outlier. The authors considered both noise and infrequent behavior as outliers. Furthermore, they used a conditional occurrence probability matrix (COP-Matrix) for storing dependencies between current activities and previously occurred activities at larger distances, i.e., subsequences of increasing length. Other techniques to filter anomalous events or traces are presented in [19,20,22,24-27].
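The sketch below illustrates the flavour of this conditional-probability filtering, restricted to a context of length one (the actual COP-Matrix of [23] also conditions on longer preceding subsequences); the threshold value is an assumption.

```python
from collections import Counter, defaultdict

def conditional_probabilities(log):
    """P(next activity | previous activity), estimated from directly-follows counts."""
    follows = defaultdict(Counter)
    for trace in log:
        for prev, curr in zip(trace, trace[1:]):
            follows[prev][curr] += 1
    return {p: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for p, cnt in follows.items()}

def filter_outlier_events(log, threshold=0.05):
    """Drop events whose occurrence after their (original) predecessor is very unlikely."""
    cop = conditional_probabilities(log)
    cleaned = []
    for trace in log:
        kept = [trace[0]]
        for prev, curr in zip(trace, trace[1:]):
            if cop[prev].get(curr, 0.0) >= threshold:
                kept.append(curr)
        cleaned.append(kept)
    return cleaned

log = [["a", "b", "c"]] * 50 + [["a", "x", "b", "c"]]
print(filter_outlier_events(log)[-1])  # ['a', 'b', 'c'] -- the rare 'x' is removed
```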
Time-based techniques are another type of transformation technique for data preprocessing in event logs. A wide variety of research works on event log preprocessing have focused on data quality issues related to timestamp information and their impact on process mining [12,28]. Incorrect ordering of events can have adverse effects on the outcomes of process mining analysis. According to the surveyed works, time-based techniques have shown promising results in data preprocessing. In [12,29], the authors established that one of the most persistent and frequent problems in the event log is that associated with anomalies related to data diversity (level of granularity) and the order in which events are recorded in the logs. Therefore, strategies based on timestamp information are of great interest in the state-of-the-art.
Dixit et al. [12] presented an iterative approach to address event order imperfection by interactively injecting domain knowledge directly into the event log as well as by analyzing the impact of the repaired log. This approach is based on the identification of three classes of timestamp-based indicators to detect ordering related problems in an event log to pinpoint those activities that might be incorrectly ordered, and an approach for repairing identified issues using domain knowledge.
Hsu et al. [30] proposed a k-nearest neighbor method for systematically detecting irregular process instances using a set of activity-level durations, namely execution, transmission, queue, and procrastination durations. An activity-level duration is the amount of time required to complete an activity; the method also uses contextual information, such as employee information and customer transactions, extracted from the ERP system of a medium-sized logistics company. The distances between instances were calculated using the differences between adjusted durations.
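A rough sketch of this duration-based detection, reduced to one adjusted duration per activity and a plain Euclidean k-nearest-neighbor score (the activity set, the value of k, and the example durations are assumptions, not the authors' actual setting):

```python
import math

def knn_outlier_scores(duration_vectors, k=3):
    """Score each case by its mean Euclidean distance to its k nearest neighbors."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    scores = []
    for i, u in enumerate(duration_vectors):
        d = sorted(dist(u, v) for j, v in enumerate(duration_vectors) if j != i)
        scores.append(sum(d[:k]) / k)
    return scores

# Each vector holds the adjusted duration, in hours, of activities A, B, C for one case.
cases = [[1.0, 2.1, 0.9], [1.1, 2.0, 1.0], [0.9, 2.2, 1.1], [1.0, 1.9, 0.8],
         [9.0, 2.0, 1.0]]  # the last case spends an unusually long time on A
scores = knn_outlier_scores(cases, k=3)
print(scores.index(max(scores)))  # 4 -> flagged as an irregular instance
```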
Tax et al. [31] proposed a framework for the automated generation of label refinements based on the time attribute of events, making it possible to distinguish behaviorally different instances of the same event type based on their time attributes. The events generated by one sensor were clustered using a mixture model consisting of components of the von Mises distribution, which is the circular equivalent of the normal distribution. Four strategies were applied for multiple label refinements on three event logs from the human behavior domain.
Song et al. [28] proposed an approach based on the minimum change principle to repair timestamps that do not conform to temporal constraints, i.e., to find a repair that is as close as possible to the original observation. The problem is tackled by identifying a concise set of promising candidates, using an algorithm for computing the optimal repair from the generated candidates, and a heuristic approximation that selects repairs from the candidates.
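A minimal heuristic in the spirit of the minimum change principle (not the optimal-repair algorithm of [28]) is to move an out-of-order timestamp forward just enough to satisfy the ordering constraint:

```python
from datetime import datetime

def repair_order_by_clamping(trace):
    """Heuristic repair in the spirit of the minimum change principle: if an
    event's timestamp precedes that of the previous event (violating the
    expected ordering constraint), move it forward just enough to restore
    order. Returns the repaired trace and the indices of repaired events."""
    repaired, changed = [], []
    for i, (activity, ts) in enumerate(trace):
        if repaired and ts < repaired[-1][1]:
            ts = repaired[-1][1]          # smallest change that satisfies the constraint
            changed.append(i)
        repaired.append((activity, ts))
    return repaired, changed

trace = [("register", datetime(2020, 5, 4, 9, 0)),
         ("check",    datetime(2020, 5, 4, 8, 55)),   # out of order, e.g., clock skew
         ("decide",   datetime(2020, 5, 4, 10, 30))]
fixed, changed = repair_order_by_clamping(trace)
print(changed)       # [1]
print(fixed[1][1])   # 2020-05-04 09:00:00
```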
Rogge-Solti et al. [32] presented a method to repair the timed event logs by combining stochastic Petri nets, alignments, and Bayesian networks. The method decomposes the problem into two sub-problems: (a) repairing the time and (b) repairing the structure for each trace. This work takes all of the observed data into account and gets efficient estimations for the activity durations and path probabilities.
Fischer et al. [33] proposed an approach for detecting and quantifying timestamp imperfections in event logs based on 15 quality metrics structured along four data quality dimensions and log levels.

Detection-Visualization Techniques
Detection-visualization techniques aim to identify, group, and isolate those events or traces that can generate problems in the quality of the event log. Within this group, two approaches are identified: clustering and pattern-based techniques. Clustering techniques divide the event log into several subsets, facilitating the understanding and analysis of each member of the subsets. The next step is then the identification of noisy/anomalous elements within the analyzed subsets. Clustering is one of the most widely used techniques for data preprocessing in process mining, applied mainly to the identification of quality issues associated with noisy values as well as data diversity. From the formation of similar clusters, it is possible to identify imperfection patterns related to noisy data in the different attributes of the event logs.
Several techniques have been proposed in the last decade for trace clustering. They can be divided into three approaches: vector space approaches [34][35][36][37], context-aware approaches [38][39][40][41][42], and model-based approaches [43][44][45][46][47][48]. Most of the aforementioned clustering algorithms consider only the event log as input and use different internal representations for producing the clusters. Traditionally, these algorithms have been applied without taking into consideration the availability of a process model. In contrast, recent works [49,50] present a different view on trace clustering of an event log. The authors assume that a process model exists and use it to build simpler groupings of homogeneous traces. Table 3 summarizes the most relevant characteristics of the surveyed works on clustering techniques.

Within the detection-visualization techniques, some perform the preparation of event logs through pattern identification based on the definition and application of heuristic rules. These rules are identified from behaviors observed, or experience acquired, by expert process mining analysts in the study of different event logs in different domains. Many of the pattern-based techniques state that the event log is not completely correct if a given pattern is not detected in the log [29]. These techniques usually work in conjunction with clustering and abstraction or alignment techniques, thus allowing the identification of patterns related to noisy data or data diversity.
Suriadi et al. [29] propose determining event log quality through the description of a collection of eleven log imperfection patterns obtained from their experience in preparing event logs. A pattern is defined as an abstraction from a concrete form that keeps recurring in specific, non-arbitrary contexts.
Ghionna et al. [52] describe an approach that combines the discovery of frequent execution patterns with a cluster-based anomaly detection procedure. Specialized algorithms are used for decreasing the count of spurious activities and for coding a method that simultaneously clusters a log and its associated S-patterns (patterns and clustering, respectively).
WoMine-i [53] extracts infrequent components (task sequences, selections, parallels, loops, etc.) from the logs based on the model specification. WoMine-i performs an a priori search starting with the minimal patterns and reduces the search space by pruning the infrequent patterns.
Jagadeesh et al. [54] propose an iterative method for transforming traces that identifies looping constructs and sub-processes and replaces their repeated occurrences with an abstracted entity. Other pattern-based approaches are presented in [14,35,39,48,54]. Additionally, some process mining algorithms [55][56][57][58][59][60][61] incorporate mechanisms of event log preprocessing (embedded techniques) as part of their approach. These algorithms implicitly attempt to detect noisy traces, hidden tasks, and duplicate activities in the event log, which can sometimes be attributed to event ordering imperfections. However, the decisions and detections made during the execution of some process mining tasks (discovery, conformance, or enhancement) are implicitly incorporated in the discovered process model. In this case, embedded preprocessing techniques are able to exploit their coupling to the discovery technique, allowing each step or iteration to verify and validate whether the built process model is a solid model. This is revealed by some works [60,62], which show that, through the identification of noisy data, as well as through the flexible configuration of parameters in the preprocessing techniques, it is possible to build more solid and robust models.

Table 4 presents a general summary of some of the most popular event log preprocessing techniques previously discussed. In that table, we provide a notation to refer to a particular technique being used: A1 (event/trace level filtering), A2 (clustering), A3 (pattern-based techniques), A4 (embedded techniques), A5 (time-based techniques), B1 (alignment), B2 (abstraction). Table 4 also shows the particular task (discovery-D, conformance-C, or enhancement-E) that is intended to be improved by including a preprocessing technique. The table also shows the main problems identified in the event logs, such as missing data (mis), noisy data (noi), data diversity (div), irrelevant data (irr), and duplicate data (dup). From Table 4, we can conclude that trace clustering and event/trace-level filtering are the two most frequently used techniques for the preprocessing task in process mining. Time-based preprocessing techniques have recently shown promising results in data preprocessing through the study, correction, and elimination of data associated with the timestamp attribute. Moreover, the table reveals that the vast majority of preprocessing techniques have been designed to improve process model discovery, raising the quality of the discovered models by reducing their complexity through the management of clean data registered in the event log. Furthermore, about 60% of the studied techniques are available in process mining tools, such as the ProM tool, and a small percentage correspond to individual applications that incorporate preprocessing techniques independently. Finally, Table 4 shows the two most frequent problems in event logs: the presence of noise and data diversity (granularity level).

C2. Tools
What tools are available for the event log preprocessing task?
Tools for event log preprocessing are generally included as part of tools for process mining. However, many commercial software tools for process mining (Perceptive Process Mining by Lexmark [82], Interstage Business Process Manager Analytics by Fujitsu Ltd. [83], Minit by Gradient ECM [84], myInvenio by Cognitive Technology [85], etc.) do not support event log preprocessing tasks that help improve the quality of event logs. There are particular tools, applications, or frameworks developed for specific preprocessing tasks of event logs [26,36,37,48,52,53,72]. Most of these tools are limited to a single process modeling language and use some type of data deployment or transformation. Moreover, there are specialized tools such as ProM [16], Apromore [86], Celonis [87], and RapidProm [88] that include different filters, routines, and algorithms for preprocessing the event log to support process mining tasks.
According to van der Aalst [8], there are three categories of process mining tools that contain event log preprocessing. Type-1 process mining tools are mainly built for answering ad hoc questions about event log preprocessing. An example of this tool type is Disco [89], which allows the user to interactively filter the data and project those data immediately onto a newly learned process model. In Type-2 process mining tools, the analytic workflow is made explicit; that is, the user can visualize and decide what elements to isolate or eliminate from the event log. An example of this tool type is RapidProM. Finally, tools of Type-3 are tailored towards answering predefined questions repeatedly in a known setting. These tools are typically used to create "process dashboards" that provide standard views of process models. For example, the tool called Celonis Process Mining supports the creation of such process-centric dashboards.
Next, we describe some tools that include preprocessing or event log repair strategies as part of their functioning. Among the criteria considered to select these tools are their popularity in the process mining area (as they are reported in several papers) and the inclusion of preprocessing techniques.
The ProM framework [16] provides different event log filters (Filter event log based on choice, Filter events based on attribute value, Filter log using simple heuristics, Filter in high-frequency trace, among others) for cleaning event logs. These filters are especially useful when handling real-life logs, and they not only allow for projecting data in the log, but also for adding data to the log, removing process instances (cases), and removing and modifying events. There are several other filter plug-ins in ProM for the removal or repairing of activities, attributes, and events (Remove activities that never have utility, Remove all attributes with value-empty, Remove events without timestamps, Refine labels globally, etc.). ProM is the most popular process mining tool and the one that offers the most preprocessing techniques, since many of the research proposals are made available as ProM plug-ins. However, most of the available preprocessing techniques are focused on event filtering and trace clustering. ProM handles multiple formats and multiple languages, e.g., Petri nets, BPMN, EPCs, social networks, etc. Through import plug-ins, a wide variety of models can be loaded, ranging from Petri nets to LTL formulas.
The ProM framework allows for interaction between a large number of plug-ins, i.e., implementations of algorithms and formal methods for analysis of business process, process mining, social network analysis, organizational mining, clustering, decision mining, prediction, and recommendation.
Apromore [86] is an open-source platform for the advanced management of business process models. It allows applying a variety of filtering techniques to slice and dice an event log in different ways. There are two main filter types supported by Apromore: case filters and event filters. Both filter types allow creating a filter based on particular conditions on the cases or events. A case filter allows slicing a log, i.e., retaining a subset of the process cases. An event filter allows dicing a log, i.e., retaining a fragment of the process across multiple cases. There are other filters, such as the timeframe filter, which allows retaining or removing those cases that are active in, contained in, started in, or ended in a particular period of time. Another filtering technique is the rework repetition filter, which can be used to remove process sequences containing certain repetitions.
RapidProM [17] is an extension of RapidMiner, where the process mining framework ProM is integrated within RapidMiner to combine the best of both. In RapidProM, complex process mining workflows can be modeled, executed, and subsequently reused for other data sets. This tool includes data cleaning and filtering methods to filter cases based on their throughput time, with the possibility of choosing a different performance annotation. The RapidProM operators focus on the analysis of event data and process models. These operators consider that events are related to process instances, and they should be handled as such.
Disco [89] is a process mining commercial tool that provides non-destructive filtering capabilities for explorative drill-down, and for focusing the analysis. These filters are accessible from any view and are easy to configure. They allow drilling down by case performance, time frame, variation, attributes, event relationships, or endpoints.
Celonis [87] uses machine learning to determine the specific root causes of deviations from a business process. This tool focuses on identifying inefficiencies or problems related to noisy events or missing values through clustering and filtering algorithms. Table 5 summarizes the main characteristics of the previously discussed tools. These tools are the most popular and widely known; they include the event log preprocessing algorithms studied in this survey and allow process mining tasks (discovery, conformance, and enhancement) to be carried out in combination with preprocessing algorithms within the same tool.

There are some other non-commercial, automatic, or user-driven tools for repairing event logs. TimeCleanser [90] provides consolidated detection and repairing mechanisms to deal with data quality-related ordering issues in event logs. Li and van der Aalst [91] present a framework for detecting deviations in complex event logs from a control-flow perspective. This framework is based on an approach whose basic principle is that a case from a log is a deviation if it is not similar to the collection of mainstream cases in the log. In that work, the authors propose the creation of profiles, rather than models or clusters, to detect deviations and iteratively improve performance. One can edit the profiling function to detect deviations from specific perspectives. However, in some situations, the approach does not properly handle the presence of loops in the models.
Interactive filtering [92] is an open-source toolkit that enables the discovery of process models in which users can iteratively filter infrequent behavior using different outlier filtering techniques (variant filtering, probabilistic methods, and a sequential-mining-based method) and then apply a discovery algorithm, such as the inductive visual miner or the interactive data-aware heuristics miner. This tool shows the effect of each change in thresholds or methods on the discovered process model and allows user interaction.
Although there is an extensive list of commercial and free process mining tools that incorporate techniques for the preprocessing of event logs, so far there is no tool that exclusively contains preprocessing strategies and is capable of working with large event logs with different characteristics in a reasonable time. Many of the tools that contain preprocessing techniques are limited to interacting with the user so that better decisions can be made when including, isolating, or eliminating an event or trace.

C3. Representation Schemes of Event Logs Used in Preprocessing Techniques
What structures are more appropriate to represent and manipulate event logs in preprocessing techniques?
For years, the representation of information has been a basic need in almost every domain, including process mining. Even though the total amount of storage space is not an important issue nowadays, since external memory (i.e., disk) can store huge amounts of events and is very cheap, the time required to access the event logs is an important bottleneck in many algorithms. An appropriate structure or representation scheme for the event logs will provide efficient management of large event logs, supporting algorithms that process the events directly from the representation. One of the most common event log representations used in preprocessing techniques is the vector space model (or bag-of-events) [43], where each trace is represented as a vector and each dimension corresponds to an event type. In this type of representation, the similarity between traces is measured using typical measures, such as Euclidean distance or cosine similarity. Some proposed approaches for event log preprocessing use traces or event sequences as data structures for the representation and manipulation of event logs, since it is simpler to filter, aggregate, or remove events or traces on this structure. However, other structures, such as automatons, directed graphs, and trace arrays, among others, have also been studied.
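As a sketch of this vector space (bag-of-events) representation, the code below maps each trace to an activity-frequency vector, measures similarity with the cosine measure, and groups traces with a simple greedy threshold scheme; real trace clustering proposals use richer profiles and clustering algorithms, and the threshold here is an assumption.

```python
import math
from collections import Counter

def to_vector(trace, vocab):
    """Bag-of-events profile: how often each activity occurs in the trace."""
    c = Counter(trace)
    return [c[a] for a in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_trace_clustering(log, threshold=0.8):
    """Assign each trace to the first cluster whose representative (its first
    trace) is similar enough; otherwise open a new cluster."""
    vocab = sorted({a for trace in log for a in trace})
    vectors = [to_vector(t, vocab) for t in log]
    clusters = []  # each cluster is a list of trace indices
    for i, v in enumerate(vectors):
        for cluster in clusters:
            if cosine(v, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

log = [["a", "b", "c"], ["a", "b", "b", "c"], ["x", "y"], ["x", "y", "y"]]
print(greedy_trace_clustering(log))  # [[0, 1], [2, 3]]
```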
In [93], a graph repairing approach for detecting unsound structures and repairing inconsistent event names is proposed. This approach repairs event data with inconsistent labeling but sound structure, using the minimum change principle to preserve the original information as much as possible. An algorithm then conducts the detection and repairing of dirty event data simultaneously, so that it either reports an unsound structure or gives the minimum reparation of inconsistent event names. Moreover, a polynomial-time (PTIME) approximation algorithm is presented in [93] that repairs one transition at a time and is repeatedly invoked until all violations are eliminated or no further repair can be conducted.
Mueller-Wickop and Schultz [94] present an approach comprising four preprocessing steps for the reconstruction of process instance graphs into an event log with a sequentially ordered list of activities, by adding a directed sequence flow between the activities of the instance graphs. In this approach, instance graphs can be decomposed into independent parts, which can be mapped to a sequential event log. The first step is to mine the source data with the financial process mining (FPM) algorithm to obtain process instances represented as graphs. The second step consists of transforming these graphs into directed activity graphs. The third step is to enumerate all possible paths representing sub-sequences of the mined process instances. Finally, the fourth step is to store all paths in an event log representing the logical order of events.
According to the surveyed works, the most appropriate structure to represent, manipulate, and transform raw event logs is the trace or event sequence, which can be described as a vector of values whose implementation and use are simple. This structure also facilitates the insertion or deletion of events or traces, mainly when working with cases or instances of large length in the event log.

C4. Imperfection Types in the Event Log
What kind of imperfections are commonly identified in the event logs? Chandra Bose et al. [3] establish that most real-life event logs tend to be fine-granular, heterogeneous, voluminous, incomplete, and noisy, which can cause major problems in the different process mining tasks. On the one hand, fine granularity is related to the level of detail at which events are recorded, which can vary widely without considering the desired levels of analysis. This type of imperfection is closely related to the data diversity quality issue (Section 2), where there are diverse types of records with different granularity levels due to a lack of good standards and guidelines for logging.
Often, the models produced from the discovery task are spaghetti-like and hard to comprehend due to fine granularity in the event log. Heterogeneity in the event log means that many of the real processes take place in diverse and unstructured environments, causing the generated event log to contain a heterogeneous mixture of these environments. Heterogeneity stems from operational processes that change over time to adapt to changing circumstances. The trace clustering techniques have been shown to be an effective way of dealing with heterogeneity.
Within the fine granularity issue, there are also imperfections in the event logs related to coarse-granular timestamps, which imply that the ordering of events within the log may not conform to the actual ordering in which the events occurred in reality; mixed-granular timestamps, where there are events for which the level of granularity of their timestamps differs; and incorrect timestamps, scenarios where the recorded timestamp of (some or all) events in the log does not correspond to the real time at which the event occurred.
Moreover, there are other common imperfections present in the event log associated with data quality issues, such as missing attribute values or events missing anywhere within the trace, although they occurred in reality. There may also be problems in the event log associated with ambiguity between events, where multiple events have the same activity name. Another issue is activity overlap, where an instance of an activity is started and, before it is completed, another instance of the same activity is started. Moreover, the presence of noisy data/outliers is common; that is, rare, exceptional, or anomalous execution behavior.
The authors in [29] identified a series of imperfection patterns commonly encountered when preprocessing raw source logs. These patterns are defined from the problems identified when transforming raw data source logs into an event log that is 'clean' and usable for process mining analysis. Some of the imperfection patterns presented in [29] are directly related to the issue of missing event values or noisy values, for example, when the events in a log are not explicitly linked to their respective case identifiers, or when key process steps are missing in the event log being analyzed but are recorded elsewhere. To address this type of imperfection pattern, most of the event- or trace-level filtering techniques studied in Section 3.2.1 attempt to identify missing events and correct them, or eliminate outlier events from the event log.
Another type of imperfection pattern, also presented in [29], is related to problems in the timestamp attribute. This imperfection occurs when there are recording errors in the timestamp, when the timestamp values are recorded in a format different from the expected one (data diversity), or when events from electronic forms are recorded in an order different from the one in which they were executed. Techniques that address timestamp issues are mainly based on determining the impact that the timestamp information has on the quality of the event log.
In addition to the aforementioned patterns, there are also patterns related to problems in the labels associated with the events, such as the presence of a group of values (of certain attributes in an event log) that are syntactically different but semantically similar, or the existence of two or more values of an event attribute that do not have an exact match with each other but have strong syntactic and semantic similarities. To address this type of imperfection pattern, abstraction and clustering techniques turn out to be the most suitable, since they transform event labels to a higher level of granularity, bridging the gap between an original low-level event log and a desired high-level perspective on the log.
Other authors [12] have identified indicators associated with time that can be used to detect imperfections in the order of the events of a log. Among the identified indicators are: (1) the existence of either coarse timestamp granularity or mixed timestamp value granularity from multiple systems, where each system records timestamps differently; an example of this is when an event x is recorded at day-level granularity while, within the same case, another event y has second-level granularity, so the ordering of these two events may be incorrect; (2) identifying events exhibiting unusual temporal ordering (e.g., duplicate entry of exactly the same event); (3) learning the temporal position of a particular activity in the context of other activities, or the distribution of timestamp values of all events in a log, which may indicate the existence of timestamp-related problems. For example, when a log is comprised of events from multiple systems, there may be more than one way in which timestamps are formatted, which may lead to the 'misfielded' or 'unanchored' timestamp problem.
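Indicator (1) can be illustrated with a small sketch that inspects which time components of each timestamp are zero to estimate the coarsest granularity at which it could have been recorded; mixed results within one case hint at ordering problems. The granularity labels and the example case are an illustrative simplification.

```python
from datetime import datetime

def timestamp_granularity(ts):
    """Coarsest level at which the timestamp could have been recorded."""
    if ts.hour == ts.minute == ts.second == ts.microsecond == 0:
        return "day"
    if ts.second == ts.microsecond == 0:
        return "minute"
    return "second or finer"

def granularity_report(trace):
    """Report the granularity of each event; mixed values hint at ordering problems."""
    return {activity: timestamp_granularity(ts) for activity, ts in trace}

trace = [("register", datetime(2020, 3, 2)),              # day-level, from system X
         ("check",    datetime(2020, 3, 2, 10, 15, 42))]  # second-level, from system Y
print(granularity_report(trace))
# {'register': 'day', 'check': 'second or finer'} -> mixed granularity within one case
```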
Despite the diversity of imperfections that may be present in the event log, and according to the review of the state-of-the-art, two of the most common problems are those related to the presence of noisy data, as well as the data diversity in the event log that deviates from the expected behavior.

C5. Related Tasks
What are the tasks closely related to event log preprocessing?
From the state-of-the-art works discussed so far, we identified two tasks strongly related to the data preprocessing in process mining: (1) event abstraction and (2) alignment. Both tasks allow improving the quality of the event log or the process model and the performance of some process mining techniques.

Event Abstraction
The majority of available process mining techniques assume that event data are captured at the same level of granularity. However, information systems in the real world record events at different granularity levels [95]. In many cases, the events recorded in an event log are presented at a fine-grained level, causing process mining techniques, and particularly process discovery algorithms, to produce incomprehensible process models or models that are not representative of the event log. In these cases, event abstraction techniques transform the event log to a higher level of granularity, bridging the gap between an original low-level event log and a desired high-level perspective on the log, such that more comprehensible process models can be discovered.
Some techniques proposed for event abstraction make use of supervised learning when annotations with high-level interpretations of the low-level events are available for a subset of the sequences (i.e., traces). These annotations provide guidance on how to label higher-level events and on the target level of abstraction. A general approach to supervised abstraction of events takes two inputs: (1) a set of annotated traces, that is, traces where the high-level event to which a low-level event belongs (the label attribute of the low-level event) is known for all low-level events in the trace; and (2) a set of unannotated traces, that is, traces where the low-level events are not mapped to high-level events.
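To make this two-input setup concrete, the sketch below replaces the learning step used in the actual proposals (e.g., the conditional random fields of [77]) with a much simpler majority-vote mapping from low-level to high-level labels; unlike a CRF, it ignores the surrounding context of each event, and all labels are illustrative.

```python
from collections import Counter, defaultdict

def learn_mapping(annotated_traces):
    """annotated_traces: lists of (low_level_event, high_level_label) pairs.
    Learn, per low-level event, the most frequent high-level label."""
    votes = defaultdict(Counter)
    for trace in annotated_traces:
        for low, high in trace:
            votes[low][high] += 1
    return {low: cnt.most_common(1)[0][0] for low, cnt in votes.items()}

def abstract(unannotated_trace, mapping, unknown="UNKNOWN"):
    """Map each low-level event of an unannotated trace to a high-level label."""
    return [mapping.get(low, unknown) for low in unannotated_trace]

annotated = [[("open_form", "Register"), ("fill_form", "Register"), ("sign", "Approve")],
             [("open_form", "Register"), ("sign", "Approve"), ("stamp", "Approve")]]
mapping = learn_mapping(annotated)
print(abstract(["open_form", "fill_form", "stamp"], mapping))
# ['Register', 'Register', 'Approve']
```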
Tax et al. [77] propose a method to abstract events in an XES event log that is too low-level, based on supervised learning and a conditional random field learning step. A high-level interpretation of a low-level event log is achieved by training a supervised learning model on the set of traces where high-level target labels are available; by applying the model to other low-level traces, it is possible to classify them. The recognition of high-level event labels is viewed as a sequence labeling task in which each event is classified as one of the higher-level events from a high-level event alphabet. That work proposes a sequence-focused metric to evaluate supervised event abstraction results that fits closely to the tasks of process discovery and conformance checking. Conditional random fields are trained on the annotated traces to create a probabilistic mapping from low-level events to high-level events. This mapping, once obtained, can be applied to the unannotated traces in order to estimate the corresponding high-level event for each low-level event.
Sun and Bauer [73] propose a process model abstraction technique that optimizes the quality of the potential high-level model while considering the quality of the generated sub-models, where each sub-model is employed to show the details of its corresponding high-level activity in the high-level model.
There are some other methods explored within the process mining field that address the challenge of abstracting low-level events to higher-level events [64,65,69,73,74]. Existing event abstraction methods rely on unsupervised learning techniques [76,78] for clustering low-level events into one high-level event. Current techniques require the user/process analyst to provide high-level event labels themselves based on domain knowledge, or they generate long labels by concatenating the labels of all low-level events incorporated in the cluster. Many existing unsupervised event abstraction methods contain one or more parameters to control the degree to which events are clustered into higher-level events.
Finding the right level of abstraction that provides meaningful results is often a matter of trial-and-error.

Alignment
Alignment requires that the events in the log be correlated with the activities in the process model so that the discrepancies and conformance degrees between a log and a model can be determined. Alignment is not only relevant for the conformance checking task, but also for event log repairing and model repairing. Event log repairing concentrates on repairing event sequences according to a predefined process model, while model repairing focuses on repairing the necessary parts of the process model such that it can replay the event sequences in the log [75].
A sequential alignment between a trace and a process model is defined as a sequence of moves, each relating an event in the trace to an activity in the model. A cost function assigns a cost to each possible move. A sequential alignment with the lowest cost according to the cost function is an optimal alignment. Unfortunately, the problem of finding the optimal alignment is NP-hard [96]. This indicates that, when the process model is large, it is intractable to determine the optimal alignment. The optimal alignment occurs when a trace in the log and an occurrence sequence in the model have the shortest edit distance.
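A sketch of this cost notion, simplified so that the process model is represented by a single allowed run rather than a full model with choices and concurrency; under that assumption, computing an optimal alignment reduces to an edit-distance-style dynamic program, and the unit move costs are assumptions.

```python
def optimal_alignment_cost(trace, model_run, move_cost=1):
    """Minimum total cost of aligning a trace against a single model run.
    Synchronous moves (same activity in trace and run) cost 0; a move on the
    log only, or on the model only, costs `move_cost`."""
    n, m = len(trace), len(model_run)
    # dp[i][j] = optimal cost of aligning trace[:i] with model_run[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * move_cost
    for j in range(1, m + 1):
        dp[0][j] = j * move_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sync = dp[i - 1][j - 1] if trace[i - 1] == model_run[j - 1] else float("inf")
            dp[i][j] = min(sync,
                           dp[i - 1][j] + move_cost,   # move on log only
                           dp[i][j - 1] + move_cost)   # move on model only
    return dp[n][m]

print(optimal_alignment_cost(["a", "x", "c"], ["a", "b", "c"]))  # 2: skip 'x', insert 'b'
```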
Rogge-Solti et al. [68] present a cost-based alignment approach to repair missing entries in the logs. The authors employ probabilistic filtering models to derive the most likely timestamp of missing events using path probabilities and stochastically enriched process models. Bayesian networks are used to capture the dependencies between the random durations by a stochastic Petri net.
Song et al. [71] propose an approach for recovering missing events in process logs by decomposing the process into different sub-processes and using heuristics to prune unqualified sub-processes that fail to generate the minimum recovery. In order to reduce the redundant recoveries with regard to parallel routings, the authors use an algorithm that leverages trace replaying to efficiently find a minimum recovery.
De Leoni et al. [67] propose a conformance checking approach based on the principle of finding an alignment of a given event log and its corresponding process model. The A* algorithm is used to find, for each trace in the event log, an optimal alignment, i.e., an alignment that minimizes the cost of the deviations. The authors adapt alignment-based approaches to be able to deal with the large search spaces induced by the inherent flexibility of declarative models. Based on such alignments, they provide diagnostics, at the trace level, showing why events need to be inserted/removed in a trace, and at the model level, coloring constraints, and activities in the model based on their degree of conformance.
In [75], the authors use the structural and behavioral features of process models, effective heuristics based on process decomposition and trace replaying to reduce the search space for the optimal alignment. They use a divide-and-conquer strategy. The alignment requires that the events in the log are correlated with the activities in the process model. A general framework is developed to seek the optimal alignment between process models and event logs with missing, redundant, and dislocated events and activities, respectively. This framework is not only used to align event logs with process models, but is also utilized for repairing event logs. Their approach is realized in the tool Effa, which acts as a plugin of ProM. Some other works on alignment are presented in [63,66,70].

C6. Information Type
What type of attributes or information do the preprocessing techniques use to work?
According to the grouping of event log preprocessing techniques presented in Section 3.2.1, many of these techniques draw on various information resources to improve the quality of event logs and, consequently, of the mined or already built models. The event log is the base element to be exploited by process mining tasks, particularly the automatic discovery of process models, and it is also the reference source when checking conformance against an already built model. Many data preprocessing strategies in process mining therefore work on the set of instances or traces obtained from the collection of events registered for the executions of the process in question, and different approaches have been proposed to address imperfections located in this source. Some works focus on specific attributes of the event collection, such as the timestamp, which defines the ordering of events within a trace; an incorrect ordering of events can have adverse effects on the outcome of a process mining analysis. In some scenarios, the timestamp attribute has also been used to identify irregular process instances using activity-level durations and contextual information. Among the attributes used during preprocessing are the case ID, event label, timestamp, cost, resource, contextual information, and additional event payload (see Figure 7). If the information in these attributes is missing, incomplete, out of order, noisy, or inconsistent, the preprocessing techniques clean and repair it. Tax et al. [15] use the identification of chaotic activities to improve the quality of the event log; these are activities that do not have a clear position in the process model and can occur spontaneously at any point in the process execution, making the discovery of the rest of the process more complex. Some other event log preprocessing techniques [32,49,50,71] make use of the running process model previously built by the organization's specialists.
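As a rough, hypothetical sketch of how such an attribute-driven check could look, the following code scores each activity by the entropy of its directly-follows context, so that activities that can be followed by almost anything stand out as candidates for chaotic behavior. The entropy definition and the threshold are illustrative and do not reproduce the exact formulation of Tax et al. [15].

```python
import math
from collections import Counter, defaultdict

# Illustrative sketch: an activity followed by many different activities with
# similar frequencies gets a high directly-follows entropy and is reported as
# a candidate chaotic activity. Threshold and entropy definition are assumptions.

def chaotic_candidates(log, threshold=1.5):
    followers = defaultdict(Counter)
    for trace in log:
        padded = trace + ["<END>"]
        for a, b in zip(padded, padded[1:]):
            followers[a][b] += 1
    scores = {}
    for act, counts in followers.items():
        total = sum(counts.values())
        probs = [c / total for c in counts.values()]
        scores[act] = -sum(p * math.log2(p) for p in probs)
    return {a: s for a, s in scores.items() if s > threshold}

log = [["a", "q", "b", "c"], ["a", "b", "q", "c"],
       ["q", "a", "b", "c"], ["a", "b", "c", "q"]]
print(chaotic_candidates(log, threshold=1.0))  # 'q' occurs in many contexts
```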
Furthermore, most of the preprocessing techniques require certain parameter values or tuning for filtering or accepting events. This allows decision thresholds to be established in order to determine whether an event, trace, or activity should be considered inconsistent or noisy.
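A minimal sketch of such a threshold-driven filter is given below: events whose activity occurs in fewer than a user-defined fraction of traces are treated as potential noise and removed. The parameter name and the support definition are assumptions for illustration; concrete filtering plugins expose comparable thresholds at the event, trace, or variant level.

```python
from collections import Counter

# Illustrative event-level frequency filter: activities that appear in fewer
# than `min_support` of the traces are dropped as potential noise.

def filter_rare_activities(log, min_support=0.05):
    n_traces = len(log)
    # Number of traces in which each activity occurs at least once.
    trace_counts = Counter(act for trace in log for act in set(trace))
    keep = {a for a, c in trace_counts.items() if c / n_traces >= min_support}
    return [[act for act in trace if act in keep] for trace in log]

log = [["a", "b", "c"]] * 19 + [["a", "z", "c"]]          # 'z' in 1 of 20 traces
print(filter_rare_activities(log, min_support=0.10)[19])  # -> ['a', 'c']
```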
Many of the noise detection and visualization techniques, particularly the clustering techniques (Section 3.2.2), take a set of traces as input, which facilitates the division of the original event log. From this set, distances between traces are calculated, so that traces with high similarity are grouped in the same cluster and instances with low similarity fall in different groups. Some clustering strategies exploit information associated with the events, such as resources and contextual information, to improve the partitioning of the event log. Pattern-based preprocessing techniques mainly use the raw event log to identify concrete forms that keep recurring in non-arbitrary contexts, with the timestamp being the attribute most used by these techniques. Within the transformation (filtering) techniques, it is common to use a set of traces to identify problems associated with the missing or noisy values contained in the different attributes of the event log.
Table 6 presents the relationships among the different characteristics (C1-techniques, C2-tools, C3-representation schemes, C4-imperfection types, C5-related tasks, and C6-types of information) of the preprocessing techniques surveyed in this work. As can be seen in Table 6, filtering-based techniques are available in most of the process mining tools, whereas pattern-based techniques are only available through the ProM tool. Most of the preprocessing techniques in the different classes handle sequences of traces/events as their representation scheme for event logs, which makes it easy to apply transformations on the records; in this sense, traces are the information resource most exploited in the preprocessing task. In addition, all preprocessing techniques consider the identification, isolation, and elimination of noisy data and, to a lesser extent, the solution of problems related to missing, duplicate, and irrelevant data.
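To illustrate the distance-based grouping described above, the following sketch maps each trace to an activity-frequency profile and greedily groups traces whose profiles lie within a distance threshold. This is an illustrative simplification under assumed parameters, not a specific published trace-clustering algorithm; real approaches typically use richer profiles and standard clustering algorithms.

```python
import math
from collections import Counter

# Illustrative profile-based trace clustering: each trace becomes an
# activity-frequency profile; traces are greedily attached to the first
# cluster whose representative profile is within `max_dist`.

def profile(trace):
    return Counter(trace)

def distance(p, q):
    acts = set(p) | set(q)
    return math.sqrt(sum((p[a] - q[a]) ** 2 for a in acts))

def cluster_traces(log, max_dist=1.5):
    clusters = []  # list of (representative_profile, [member traces])
    for trace in log:
        p = profile(trace)
        for rep, members in clusters:
            if distance(p, rep) <= max_dist:
                members.append(trace)
                break
        else:
            clusters.append((p, [trace]))
    return [members for _, members in clusters]

log = [["a", "b", "c"], ["a", "b", "b", "c"], ["x", "y"], ["x", "y", "y"]]
print(len(cluster_traces(log)))  # -> 2 sub-logs: the 'abc'-like and 'xy'-like traces
```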

Lessons Learned and Future Work
Based on the literature review, some important outcomes and guidelines can be inferred. There is increasing interest in the study of preprocessing techniques for process mining across various domains (health, manufacturing, industry, etc.). These techniques have demonstrated great success in building process models that are simpler to interpret and manipulate, which has made many organizations interested in them. This is even more evident with the arrival of big data, where business processes produce huge event logs that may contain a large number of imperfections and errors, such as missing values, duplicate events, evolutionary changes, fine-granular events, heterogeneity, noisy data, outliers, and scoping issues. In this sense, preprocessing techniques in process mining represent a fundamental basis for improving the execution and performance of the process mining tasks required by experts in process models.
In practice, process mining requires more than one type of preprocessing technique to improve the quality of the event log (as shown in column 2 of Table 4), because an event log can have different data cleaning requirements and a single technique cannot address all possible issues. For example, if the event log is voluminous, that is, it contains a large number of events or cases, trace clustering is a suitable technique: it divides the original log into small sub-logs, reducing the complexity of its handling and storage. If the event log is of average size, but there is high variability in the sizes of the traces formed from the log, filtering techniques at the event/trace level are likely more suitable. On the other hand, in event logs where the durations of the activities are suspected to be too long or too short, preprocessing techniques based on the study of the timestamp are suggested.
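A minimal sketch of such a timestamp-based check is shown below: activity durations are grouped by activity label, and events whose duration deviates strongly from the activity's median are flagged. The use of the median absolute deviation, the event representation, and the threshold are illustrative assumptions, not a method taken from the surveyed works.

```python
import statistics
from collections import defaultdict

# Illustrative duration check: flag events whose duration deviates from the
# activity's median by more than `k` median absolute deviations.
# Events are assumed to be (case_id, activity, duration_in_minutes) tuples.

def flag_duration_outliers(events, k=3.0):
    durations = defaultdict(list)
    for _, act, dur in events:
        durations[act].append(dur)
    flagged = []
    for case, act, dur in events:
        med = statistics.median(durations[act])
        mad = statistics.median(abs(d - med) for d in durations[act]) or 1.0
        if abs(dur - med) / mad > k:
            flagged.append((case, act, dur))
    return flagged

events = [("c1", "approve", 10), ("c2", "approve", 12), ("c3", "approve", 11),
          ("c4", "approve", 400), ("c1", "pay", 5), ("c2", "pay", 6)]
print(flag_duration_outliers(events))  # -> [('c4', 'approve', 400)]
```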
From the review presented in this work, it is observed that the most commonly used preprocessing techniques are trace clustering and trace/event level filtering (see Figure 8), mainly because they are easy to implement, adequately manage noise and incompleteness in the event logs, and also allow models to be identified from less-structured processes. On the one hand, trace clustering is more suitable when the complexity of the discovered models must be reduced. This technique is generally applied together with pattern identification or event abstraction techniques, since both are strongly linked to identifying associations or rules from observed behaviors or acquired experiences in the event log. On the other hand, trace/event filtering techniques are sometimes applied in conjunction with timestamp-based techniques to identify and correct missing or noisy values in the event log. Several works on data preprocessing in process mining focus on the identification of specific noise patterns associated with the quality of the event log. For example, with the method proposed by Hsu et al. [30], 21 irregular process instances were identified from a set of 2169. The results were presented to a group of domain knowledge experts, who confirmed that 81% of the identified process instances were abnormal; by contrast, only 9% of the outlier process instances identified by the proposed method were confirmed as outliers in the same environment setting. This and other works have considered event logs available in the literature or with common characteristics. However, the study of several event logs in different scenarios considering different characteristics (log size, number of attributes, resources, organizations, among others) could lead to the identification of new noise patterns that have not been previously identified in the studied event logs.
Today, there is no popular or widely known tool fully dedicated to event log preprocessing that can work with repositories and event logs of different characteristics, independently of the process mining task that will consume the preprocessed log. Therefore, the design and implementation of new tools dedicated to data preprocessing for process mining is required. These tools could incorporate a kind of "intelligence" and interact with the user to decide which events to correct or not. ProM is the process mining tool most commonly used to incorporate new preprocessing plugins.
According to the surveyed works, there is a greater need for the application of preprocessing techniques in the process mining tasks of discovering process models [56,57,59,60,61,62] and checking conformance. On the one hand, in the discovery of process models, preprocessing can reduce the complexity of the mined models through the identification, correction, and elimination of errors in the event log, supporting the correct identification of the model gateways and, therefore, the discovery of more structured models. This facilitates the interpretation of the discovered models while trying to maintain the original behavior of the event log. On the other hand, preprocessing techniques have been used for the conformance checking task between the event log and the discovered model; a correct mapping requires an event log that is clean and free of events, activities, or traces that are missing, noisy, or inconsistent with the running model. In addition, the conformance task between the event log and the model can take considerable time, especially for large event logs. Enhancement tasks, which focus on extending or improving an existing process model using information about the actual process recorded in the event log, make use of preprocessing techniques to a lesser degree. Some surveyed works report measures related to the lack of quality in the event logs, such as the number of missing traces, the ratio of identified irregularities, and the presence or absence of imperfection patterns. However, the vast majority of works report measures related to the quality of the discovered models (fitness, recall, precision, and f-measure) on the raw event log and the preprocessed event log. Few works report any study or result on the computational complexity of their proposals; these works mainly report the execution time of their algorithms, which can vary widely depending on factors such as the size of the log, the search algorithm, the size of the traces, and the types of attributes in the log.

Conclusions
In this survey, we presented, for the first time, a literature review of the main approaches used in data preprocessing for process mining. The review included a description of techniques and algorithms, tools, frequently posed questions, perspectives, and data types. Representative works were systematically revised to determine the key aspects of the preprocessing techniques that lead to improving the quality of a process model. As a result, this paper provided, for the first time, a grouping of the different existing preprocessing techniques, organized into transformation techniques and detection-visualization techniques. Transformation techniques carry out actions that change the original structure of the raw event log in order to improve its quality, while detection-visualization techniques identify, group, and isolate those events or traces that can cause quality problems in the event log.
We also presented the challenges that must be addressed by these techniques. Furthermore, this survey presents some of the key elements to consider for data preprocessing in process mining: (1) a grouping of existing techniques for the preprocessing of event logs; (2) the preprocessing tools available in the literature in the context of process mining; (3) the most appropriate data structures to represent and manipulate event logs in preprocessing techniques; (4) the problems and imperfections most often found in event logs; (5) the tasks most often related to event log preprocessing; and (6) the types of attributes or information that the preprocessing techniques use.
This review can serve as a reference guide to identify the different types of preprocessing techniques and their qualities. Moreover, it seeks to highlight the foundations for those aspects that should be taken into account to obtain process models that are simple and easy to interpret.
Concluding this work, we can affirm that data preprocessing in the context of process mining will continue to be a topic of great interest. In the coming years, with the arrival of big data and the Internet of Things and the resulting creation of huge event logs, it may be necessary to design new preprocessing algorithms that deal with challenges that have not, so far, been identified or solved. Great progress has been made with clustering and data filtering techniques in process mining; however, other techniques, such as those based on the identification of imperfection patterns, have not yet fully addressed the automatic identification of such patterns.
As part of future work, a suite of metrics describing the existence of noise patterns in event logs should be developed. Noise patterns can affect an event log at different levels, including attributes, events, and cases; thus, the various pervasiveness metrics would determine the number of attributes, events, and cases affected by individual noise patterns and by the noise pattern collection overall. Few works [29,33] have tried to identify and quantify the level of imperfection that exists in an event log, and until now there are no specific metrics that determine this level. Another topic of possible interest, as a continuation of this line of research, consists of frameworks that evaluate the impact of the preprocessing being used not only by computing the accuracy or fitness obtained by the model, but also the computational cost, memory, and time complexity associated with data cleaning.
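As a purely hypothetical sketch of what such pervasiveness metrics could look like, the following code takes noise-pattern detectors as predicates over single events and reports the share of events and cases each pattern affects (an attribute-level share could be added analogously). The pattern definitions and the log representation are placeholders for illustration.

```python
# Hypothetical pervasiveness metrics: for each noise pattern (a predicate over
# a single event), report the fraction of events and of cases it affects.

def pervasiveness(log, patterns):
    n_events = sum(len(trace) for trace in log.values())
    n_cases = len(log)
    report = {}
    for name, hits in patterns.items():
        affected_events = sum(hits(e) for trace in log.values() for e in trace)
        affected_cases = sum(any(hits(e) for e in trace) for trace in log.values())
        report[name] = {"events": affected_events / n_events,
                        "cases": affected_cases / n_cases}
    return report

# Toy log: case id -> list of events, each event a dict of attributes.
log = {"c1": [{"activity": "a", "timestamp": None, "resource": "r1"},
              {"activity": "b", "timestamp": "2021-01-01T10:00", "resource": "r1"}],
       "c2": [{"activity": "a", "timestamp": "2021-01-02T09:00", "resource": None}]}

patterns = {"missing timestamp": lambda e: e["timestamp"] is None,
            "missing resource": lambda e: e["resource"] is None}
print(pervasiveness(log, patterns))
# -> {'missing timestamp': {'events': 0.33.., 'cases': 0.5},
#     'missing resource':  {'events': 0.33.., 'cases': 0.5}}
```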