#### 3.1. Experimental Results

In the first set of experiments, to explore which of the group centrality factors have the most impact on cycle time, we created a network that visually illustrates the effect of high and low values of the grouped centrality measures on case cycle times (see Figure 3a–d below). Each node represents a case, with cases that share common performers connected and the edges weighted by the number of shared performers. The size of each node is proportional to the relevant grouped centrality measure. The network appears to show an inverse relationship between cycle time and the group degree and eigenvector centralities, while the opposite holds for the group closeness and betweenness centrality measures.
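
To make the construction concrete, the following is a minimal sketch of how such a case network could be assembled with NetworkX. The event log here is a hypothetical pandas DataFrame with `case_id` and `performer` columns, not the paper's actual data pipeline.

```python
import itertools

import networkx as nx
import pandas as pd

def build_case_network(log: pd.DataFrame) -> nx.Graph:
    """Connect cases that share performers; edge weights count shared performers."""
    performers_by_case = log.groupby("case_id")["performer"].agg(set)
    g = nx.Graph()
    g.add_nodes_from(performers_by_case.index)
    for (c1, p1), (c2, p2) in itertools.combinations(performers_by_case.items(), 2):
        shared = len(p1 & p2)
        if shared:
            g.add_edge(c1, c2, weight=shared)
    return g

# Toy event log: one row per event (illustrative values only)
log = pd.DataFrame({
    "case_id":   ["A", "A", "B", "B", "C"],
    "performer": ["ann", "bob", "bob", "cid", "cid"],
})
g = build_case_network(log)
print(g.edges(data=True))  # [('A', 'B', {'weight': 1}), ('B', 'C', {'weight': 1})]
```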

We explore these apparent relationships between the group centrality measures and trace cycle times further.

Table 2 shows the Spearman rank correlation between each group centrality measure and the trace cycle time (with all values statistically significant at the 95% confidence level shown in bold font). This test was selected to determine the strength and direction of the monotonic relationship between these measures. The group closeness centrality is the measure most strongly correlated with the trace cycle time, followed by the group eigenvector centrality. The group betweenness and closeness centralities were generally positively correlated, while the group eigenvector centrality was generally negatively correlated.
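
As a sketch, the test itself amounts to one SciPy call per measure; the per-case table below is synthetic stand-in data, not the study's values.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Synthetic stand-in for the per-case table; in the study each row would hold
# a case's measured group centralities and its observed cycle time.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group_closeness":   rng.random(100),
    "group_betweenness": rng.random(100),
    "group_eigenvector": rng.random(100),
    "cycle_time":        rng.random(100),
})

for measure in ["group_closeness", "group_betweenness", "group_eigenvector"]:
    rho, p = spearmanr(df[measure], df["cycle_time"])
    sig = " *" if p < 0.05 else ""  # significant at the 95% confidence level
    print(f"{measure:20s} rho = {rho:+.3f}, p = {p:.4f}{sig}")
```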

We delve into the team effectiveness literature to shed some light on these results. As [21] posits, “maintaining strong ties with people outside the team is an important determinant of team success”. We argue that these results may have implications for team setup as they shed light on the nature of these “ties”. For example, many organisations create specialised cells or SWAT teams to handle certain types of cases, e.g., complex cases. As a result, these teams can become isolated from other process performers, which increases their probability of “failing” [32].

32]. The results would seem to imply that connecting the teams to other influential performers in the organisation (high eigenvector centrality) will result in shorter cycle times, perhaps as a result of the ability of these performers to resolve issues relatively quickly. This would suggest that where such cells exist, it would be desirable to work cases with performers outside their cell periodically. Intuitively, this will increase the sharing of knowledge and experience across the organisation.

On the other hand, lower group betweenness centrality appears linked to lower cycle times. This measure is a proxy for the extent to which the group is becoming a bottleneck across the organisation, perhaps because its performers are perceived as possessing certain desirable traits, e.g., being viewed as experts or dependable. Lower group closeness centrality (a measure of the distance of the group to other performers) is also correlated with lower processing times, as it indicates greater connectedness between performers.

Progressing to the second set of experiments, Table 3 details the Global MAE and standard deviation (SD) for each dataset/algorithm pair. The performance of the algorithms is visualised in Figure 4, which displays the average ranking of each algorithm over the datasets with associated error bars. Over the five datasets, the survival analysis approach outperforms all the other approaches.
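
The average ranking underlying a figure like Figure 4 can be computed by ranking the algorithms within each dataset and then averaging across datasets. In the sketch below, the MAE matrix and algorithm names are illustrative placeholders, not the paper's results.

```python
import numpy as np
from scipy.stats import rankdata

# Illustrative Global MAE values (rows = datasets, columns = algorithms);
# the algorithm names are placeholders, not the paper's full set.
algorithms = ["survival", "gbm", "rf", "lasso"]
mae = np.array([
    [0.40, 0.45, 0.55, 0.60],
    [0.35, 0.38, 0.50, 0.52],
    [0.42, 0.41, 0.48, 0.58],
    [0.30, 0.36, 0.44, 0.49],
    [0.37, 0.39, 0.47, 0.51],
])

ranks = np.apply_along_axis(rankdata, 1, mae)  # rank 1 = lowest MAE per dataset
for name, avg in zip(algorithms, ranks.mean(axis=0)):
    print(f"{name:10s} average rank = {avg:.1f}")
```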

Figure 5 and Figure 6 show the aggregated error values obtained by dividing the Global MAE and SD by the average throughput time for each event log. Normalising these values enables them to be directly compared [14]. The survival approach has the lowest normalised mean and median MAE (0.659 and 0.463, respectively), providing further confirmation of its superior performance.
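
The normalisation itself is a straightforward division; the MAE and throughput figures below are invented purely to show the shape of the calculation.

```python
# Illustrative values only: Global MAE and average throughput time must be
# expressed in the same unit (e.g., days) before dividing.
global_mae     = {"BPIC 12": 4.2, "BPIC 14": 3.3}
avg_throughput = {"BPIC 12": 8.6, "BPIC 14": 5.1}

# Dividing by the average throughput time yields a dimensionless error that
# can be compared directly across event logs.
normalised_mae = {log: global_mae[log] / avg_throughput[log] for log in global_mae}
print(normalised_mae)
```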

As recommended by [33], the non-parametric Friedman test was performed on the ranked data to determine whether there was a significant difference between the algorithms. The test results indicate a statistically significant difference between the various algorithms at the 95% confidence level (p = 0.008687). To determine which algorithms differ from the others, we utilise the Quade post-hoc test to perform a pairwise comparison between the various algorithms.
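
In Python, this testing pipeline might look as follows, assuming the scikit-posthocs package for the Quade post-hoc step; the MAE matrix is again an illustrative placeholder.

```python
import numpy as np
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare

# Illustrative Global MAE matrix: rows = datasets (blocks), columns = algorithms.
mae = np.array([
    [0.40, 0.45, 0.55, 0.60],
    [0.35, 0.38, 0.50, 0.52],
    [0.42, 0.41, 0.48, 0.58],
    [0.30, 0.36, 0.44, 0.49],
    [0.37, 0.39, 0.47, 0.51],
])

stat, p = friedmanchisquare(*mae.T)  # one sample of per-dataset MAEs per algorithm
print(f"Friedman test: chi-squared = {stat:.3f}, p = {p:.4f}")
if p < 0.05:
    # Pairwise Quade post-hoc p-values between algorithms
    print(sp.posthoc_quade(mae))
```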

Table 4 shows the results of the pairwise comparisons (with all values statistically significant at the 95% confidence level shown in bold font). The results indicate that the survival methods significantly outperformed all the existing methods except for gbm (see results in bold). To explain this, we observe that event logs typically contain a portion of incomplete traces, which are filtered out by existing approaches as they do not contribute any information towards accurately predicting the remaining time of a trace. Intuition supports this approach, as we cannot determine whether an incomplete trace will finish in the next hour, day or year.

Ref. [14] provides a detailed discussion of generative and discriminative approaches for process monitoring. Discriminative approaches infer a conditional probability $P(Y|X)$ from the training data set, where $X = \{\sigma_{1}, \sigma_{2}, \ldots, \sigma_{n}\}$ denotes the set of feature variables and $Y = \{\tau_{rem\_pred_{1}}, \tau_{rem\_pred_{2}}, \ldots, \tau_{rem\_pred_{n}}\}$ represents the prediction target. The resulting probability distribution is used to make predictions for the test set. However, when there is a significant proportion of incomplete traces in the training data, this approach is not useful, as the target ($Y$), i.e., the remaining time for the trace, is unknown. This is why these traces are typically removed from the training set. Generative approaches, such as the proposed survival analysis approach, instead calculate a joint distribution $P(X,Y)$, which is then utilised to derive the conditional probability $P(Y|X)$. Such an approach can generate synthetic values of $X$ by sampling from the joint distribution. As a result, it performs better when an event log has a significant proportion of incomplete traces.

In our experimental data, the percentage of incomplete traces ranged from 39% (BPIC 14) to 69% (BPIC 12). However, the survival analysis approach enables us to “account for (incomplete traces (i.e., censored data)) in the analysis”, as this approach is able to extract information from them [34]. This is the main advantage of the approach we propose, as it delivers better accuracy for event logs with a significant proportion of incomplete traces.
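
A minimal sketch of how censored traces can enter a survival estimate is shown below, using the lifelines package; the durations and completion flags are illustrative, and the paper's actual model may differ.

```python
import pandas as pd
from lifelines import KaplanMeierFitter

# Illustrative traces: complete traces carry their full cycle time, while
# incomplete traces carry the time observed so far (right-censored).
traces = pd.DataFrame({
    "duration":  [12.0, 7.5, 30.0, 4.2, 18.9, 25.0],
    "completed": [1,    1,   0,    1,   0,    0],  # 0 = censored
})

kmf = KaplanMeierFitter()
kmf.fit(traces["duration"], event_observed=traces["completed"])
# Censored traces still shift the estimate: the model knows each ran at
# least this long, information a discriminative learner would simply discard.
print(kmf.median_survival_time_)
```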

To explore the effect of the proportion of incomplete traces on performance, we perform an additional set of experiments utilising subsets of data from two event logs (BPIC 12 and BPIC 18), selected as they lie at opposite ends of the spectrum of event log complexity (see [1]). Keeping the size of the event log constant, we incrementally increase the percentage of incomplete traces in the log in steps of 20%, starting from 0% (the baseline) through to 100%. We subsequently calculate the normalised MAE for each log using the proposed survival approach.
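
One way to build such logs is sketched below, under the assumption of a per-trace table with hypothetical `case_id` and `completed` columns: draw a fixed number of cases while varying the incomplete share.

```python
import pandas as pd

def sample_cases(cases: pd.DataFrame, pct_incomplete: float, n: int,
                 seed: int = 0) -> pd.DataFrame:
    """Draw n cases containing the requested share of incomplete traces."""
    n_inc = round(n * pct_incomplete)
    inc = cases[cases["completed"] == 0].sample(n_inc, random_state=seed)
    com = cases[cases["completed"] == 1].sample(n - n_inc, random_state=seed)
    return pd.concat([inc, com])

# e.g., one log per 20% step, keeping the log size constant throughout:
# logs = {p: sample_cases(cases, p / 100, n=1000) for p in range(0, 101, 20)}
```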

Figure 7 and Figure 8 display plots of the normalised MAE against the proportion of incomplete traces in the event log. As expected, both plots indicate a dramatic improvement in performance as the proportion of complete traces in the log increases. However, we observe that this improvement begins to level off once the proportion of complete traces exceeds c. 60%, after which the gains are less significant.

To test this effect, we utilise the non-parametric Kruskal–Wallis test to determine whether there is a significant difference in the MAE for each log. As expected, there is a significant difference in the MAE for both logs (for BPIC 12, p = 3.282 × 10^{−9}; for BPIC 18, p < 2.2 × 10^{−16}). We subsequently run pairwise comparisons using the Wilcoxon rank-sum test to determine which proportions differ significantly from the baseline (i.e., the log with 100% complete traces).
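
This testing step is sketched below with SciPy, using synthetic MAE samples per proportion; `ranksums` is SciPy's Wilcoxon rank-sum implementation.

```python
import numpy as np
from scipy.stats import kruskal, ranksums

rng = np.random.default_rng(1)
# Synthetic MAE samples for each incomplete-trace proportion (placeholders)
mae_by_pct = {p: rng.normal(1.0 + p / 100, 0.1, size=20) for p in range(0, 101, 20)}

h, p_val = kruskal(*mae_by_pct.values())
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p_val:.3g}")

baseline = mae_by_pct[0]  # 0% incomplete, i.e., 100% complete traces
for pct, sample in mae_by_pct.items():
    if pct:
        _, p_pair = ranksums(sample, baseline)  # Wilcoxon rank-sum test
        print(f"{pct}% incomplete vs. baseline: p = {p_pair:.3g}")
```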

Table 5 shows the results of the pairwise comparisons against the baseline.

For BPIC 12, we notice that there is a significant difference up to the point at which there are 40% incomplete traces (see results in bold). However, with BPIC 18, we notice a significant difference between the MAE for all logs with incomplete traces and the baseline. To understand these results, we consider the event log metrics (see Table 1). We observe that, despite having roughly the same number of events, BPIC 18 is more complex than BPIC 12, particularly in terms of mean trace length (×4) and the number of distinct activities (×6). We postulate that for complex event logs, our approach delivers a significant difference compared to the baseline even when there is a high proportion of complete traces. However, for simpler logs, the difference is less pronounced, levelling out when the proportion of complete cases approaches c. 40%.