Article
Peer-Review Record

Efficient Time and Space Representation of Uncertain Event Data

Algorithms 2020, 13(11), 285; https://doi.org/10.3390/a13110285
by Marco Pegoraro *, Merih Seran Uysal and Wil M. P. van der Aalst
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 30 September 2020 / Revised: 30 October 2020 / Accepted: 6 November 2020 / Published: 9 November 2020
(This article belongs to the Special Issue Process Mining and Emerging Applications)

Round 1

Reviewer 1 Report

In this paper, the authors improve the performance of uncertainty analysis by proposing a novel algorithm that allows for the construction of behavior graphs in quadratic time in the number of events in the trace. They prove the correctness of this novel algorithm, show asymptotic upper and lower bounds for its time complexity, and implement performance experiments that effectively show the gain in computing speed it entails in real-world scenarios. Although the idea and research questions of this paper are timely and clear, some major questions need to be addressed.

Since details on the experimental scenario and setup are not included, the results are not a sufficient basis for drawing conclusions. Moreover, the authors did not compare the proposed method with the state-of-the-art methods which they cite and discuss in the related work. The description of the results is limited.

The authors should clearly state their contribution in the Introduction.


Author Response

Dear reviewer, thank you for your report. We have proceeded in improving the paper following your recommendations.

For the problem of building a behavior graph, the state-of-the-art technique employed the transitive reduction operation.

Section 5 analyzes the difference in complexity with this state-of-the-art technique, proving that the novel method is quadratic in time, as opposed to cubic; we have now highlighted this difference.
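For readers unfamiliar with the construction under discussion, the cubic baseline can be sketched as follows. This is an illustration under assumed names and representation, not the paper's actual algorithm: events carry an interval of possible timestamps, "certainly precedes" induces a partial order, and the behavior graph is its transitive reduction (which the naive baseline computes in cubic time; the paper's contribution is avoiding that step).

```python
def behavior_graph_baseline(events):
    """Baseline behavior-graph construction for one uncertain trace.

    `events` is a list of (min_time, max_time) intervals; event i
    "certainly precedes" event j iff i's latest possible time is
    before j's earliest. The behavior graph is the transitive
    reduction of that partial order, computed here naively in O(n^3).
    """
    n = len(events)
    # Step 1: the "certainly precedes" partial order (quadratic).
    precedes = {(i, j) for i in range(n) for j in range(n)
                if events[i][1] < events[j][0]}
    # Step 2: naive transitive reduction (cubic): drop (i, j)
    # whenever some event k lies strictly between i and j.
    return {(i, j) for (i, j) in precedes
            if not any((i, k) in precedes and (k, j) in precedes
                       for k in range(n))}

# Two certain events followed by two overlapping (uncertain) ones:
# events 2 and 3 are incomparable, so both get an edge from event 1.
trace = [(1, 1), (2, 2), (3, 5), (4, 6)]
print(sorted(behavior_graph_baseline(trace)))  # [(0, 1), (1, 2), (1, 3)]
```

The branching at event 1 is exactly the uncertainty the behavior graph preserves: both orderings of the overlapping events remain representable.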

Furthermore, we added a description in Section 6 (specifically in Section 6.1, the experiment related to Q4) noting that the refinement presented in this paper enables much lower memory consumption when storing the event log.

We also added a paragraph to the Introduction that states more clearly the goal of the paper and the contribution it aims to bring to the field of process mining (lines 95 to 107).

Thank you again for your service and the recommendations to improve our paper. Best regards, the authors

Reviewer 2 Report

In this paper, the authors propose a method to discover uncertain directly-follows graphs from event logs containing uncertain data, in particular, with events associated with a time interval rather than a specific time. They provide a clear implementation of the algorithms alongside intuitive working examples and a deep analysis of the time complexity of the algorithms. The authors propose an evaluation based on artificial data: although some real-life event logs are used, those do not contain uncertain data, and the authors were forced to inject it artificially. The evaluation results show that the method proposed in this paper is faster than its predecessor, which was presented by the authors at ICPM 2019. This improvement in performance is highlighted as the major contribution of the study.

The field of process mining is in the midst of a momentum boost driven by several factors, both internal (from the academic world) and external (from industry vendors and partners). Research in this area has always been cutting-edge; more precisely, it has always been way ahead of what the industry is interested in or can actually use. In this light, this study may be considered very interesting from an academic perspective, but my greatest doubts concern its significance from a practical perspective. What is the use of an uncertain directly-follows graph? Yes, of course, we can discover a process model, compute alignments, and apply conformance checking algorithms, but all these results would be uncertain if based on uncertain data, even more than they usually are when derived from reliable data. Supporting this argument, there are plenty of research studies in process mining (and beyond) that focus on designing and applying algorithms to mitigate and remove uncertain data. Why do the authors feel that, at this point in time, we should embrace uncertainty and work with it? And why is the proposed algorithm a good method to deal with uncertain data?

As an example, let's assume I generate an uncertain directly-follows graph by approximating each time interval with the average of its minimum and maximum. It would be much quicker to discover an uncertain directly-follows graph, O(n log n). Would that be worse or better than what the authors are proposing? Both uncertain directly-follows graphs would be uncertain.
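The midpoint heuristic the reviewer describes can be sketched in a few lines; this is a hypothetical illustration of the reviewer's point (names and the triple representation are assumptions), not code from the paper.

```python
from collections import Counter

def midpoint_dfg(traces):
    """Approximate each uncertain timestamp by the midpoint of its
    interval, then count directly-follows pairs after sorting each
    trace: O(n log n) per trace.

    Each trace is a list of (activity, min_time, max_time) triples.
    """
    dfg = Counter()
    for trace in traces:
        ordered = sorted(trace, key=lambda e: (e[1] + e[2]) / 2)
        for (a, *_), (b, *_) in zip(ordered, ordered[1:]):
            dfg[(a, b)] += 1
    return dfg

# "b" and "c" overlap in time; the midpoint heuristic commits to one
# arbitrary order and silently discards the alternative a -> b -> c.
log = [[("a", 1, 1), ("b", 2, 6), ("c", 3, 4)]]
print(midpoint_dfg(log))  # Counter({('a', 'c'): 1, ('c', 'b'): 1})
```

The sketch makes the trade-off concrete: the result is cheap to compute but commits to a single linearization of each trace, whereas a behavior-graph-based approach retains every ordering compatible with the intervals.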

Looking at the related work, the authors do not provide any example of similar studies (in terms of concept, i.e., using uncertain data). The authors state that this research direction is new; although it is new in the context of process mining, its total novelty in the field of data mining (a much older field than process mining) would be puzzling. Luckily, there have been other studies in data mining in this direction; see "A Survey of Uncertain Data Algorithms and Applications" as a reference.

From a technical perspective, the paper is extremely well crafted and written. It is easy to read and contains all the elements of a high-quality research study, from introduction to conclusion, including a robust evaluation.


My suggestions for improvement:

The fact that I was not able to identify the significance of this study may relate to the following factors:
1. no reference to similar studies (i.e., retaining and using uncertain data), not even in areas beyond process mining
2. lack of a strong motivating example
3. lack of a discussion specifying the usefulness of the output that goes beyond the simple application of other process mining algorithms; I could generate a random DFG and still be able to apply process mining algorithms to it, but what is the intrinsic value of doing that?

About 1. The authors should have a look at the literature, perhaps starting from "A Survey of Uncertain Data Algorithms and Applications" (Aggarwal and Yu, 2009). The authors should be able to find work related to what they are trying to do in process mining, which could serve as a supporting driver for the research direction they are taking.

About 2. I understand the authors want to use an intuitive working example, but I found the case of a hospital with broken thermometers and unreliable nurses a bit unrealistic. Sure, that can happen, but it is (or at least should be) an exceptional case in modern healthcare environments. If the context of the study is significant, it should not be hard to identify a more realistic, yet simple, scenario to use as a working example.

About 3. The authors should discuss what a real use of the output of their algorithm would be; for example, what would be the purpose of using an uncertain process model or having lower and upper bounds on conformance checking results? Most importantly, what would be the benefits of doing that as opposed to approximating or removing the uncertain data completely? As mentioned in 2, the working example they use is a very rare occasion, which could reasonably be removed completely from an event log, leaving the other (reliable) data to work with. I assume the authors may find other studies (in data mining at least) that have considered these matters, which may help them add a discussion section in this regard.


Minor flaws:

- the introduction section is extremely lengthy and its content goes beyond a mere introduction; consider moving some content to a new section (e.g., background) and adding more information to improve the significance of the study
- the related work section is skewed towards works on conformance checking, which do not really relate to the approach proposed in this paper, focused as it is on working with uncertain data and time efficiency. Conformance checking is not even used in the experiments. I suggest removing lines 591-598 and adding related work on uncertain data from the data mining field.
- line 208, uncertainty information > uncertain information
- line 257, remove or move elsewhere the remark about the keyword break; it sounds out of scope in a paper submitted to the Algorithms journal

Author Response

Dear reviewer, first, thank you kindly for your detailed report. We have proceeded to work on the draft in order to improve it by incorporating your recommendations.

About 1. It is indeed true that, in the domain of data mining, the concept of uncertain data (or, at least, a formalization of uncertainty close to the one proposed in our paper) has a longer history. We have added a paragraph to the Related Work section, citing the survey you suggested along with other references on the important concept of probabilistic databases and on the problem of mining frequent itemsets under uncertainty, which we hope will be interesting and useful for the reader (lines 623 to 629).

About 2. Indeed, in the effort to design a simple instance of uncertainty, the running example ended up being too simplistic and, as a consequence, not very realistic. We have now included a different example which, while resulting in an identically shaped behavior graph and behavior net, should be more realistic: it is rooted in reality, being a simplification of real data we encountered in an event log related to healthcare.

About 3. In the conformance checking example (now Section 2), we added a paragraph clarifying the insights that can be obtained from applying alignments to uncertain event data, namely detailed case diagnostics, which can indicate specific deviations on a case-by-case basis and thus allow analysts to estimate criticalities (lines 153 to 165).

In regard to the guiding motivation behind developing new analysis techniques to deal with uncertain data, rather than removing it via filtering, the answer is twofold. Firstly, filtering cases out would not allow tasks such as the specific case diagnostics we mention as an application example; this is important in process mining, and highly desirable in industrial settings. Secondly, in some processes imprecisions, anomalies, and data quality problems affect a portion of the cases so large that the loss of information caused by filtering would make it impossible to obtain insights from the remaining well-formed data.

We argue in favor of this and offer some references on the matter in lines 72 to 79. Regarding the other remarks: we separated the applications of conformance checking and process discovery over uncertain data, which now reside in their own sections (2 and 3); this makes the paper more structured and the introduction slimmer. We added, as mentioned, references to past work on the topic of uncertainty in data mining. Conformance checking is indeed utilized in the experiments (albeit evaluated only on time performance); on these grounds, we kept the section regarding conformance checking in the Related Work. The "uncertainty information" mistake has been fixed (the intended meaning was "information about uncertainty").

Lastly, we removed the explanation of the "break" keyword.

We would like to thank you again for your scientific service and for your insightful recommendations. In our opinion, the suggestions in your review helped to lift the quality of our submission. Best regards, the authors

Round 2

Reviewer 2 Report

The authors' revision of the article is satisfactory.
