Fibers of Failure: Classifying Errors in Predictive Processes

Predictive models are used in many different fields of science and engineering and are always prone to make faulty predictions. These faulty predictions can be more or less malignant depending on the model application. We describe fibers of failure (FIFA), a method to classify failure modes of predictive processes. Our method uses MAPPER, an algorithm from topological data analysis (TDA), to build a graphical model of input data stratified by prediction errors. We demonstrate two ways to use the failure mode groupings: either to produce a correction layer that adjusts predictions by similarity to the failure modes; or to inspect members of the failure modes to illustrate and investigate what characterizes each failure mode. We demonstrate FIFA on two scenarios: a convolutional neural network (CNN) predicting MNIST images with added noise, and an artificial neural network (ANN) predicting the electrical energy consumption of an electric arc furnace (EAF). The correction layer on the CNN model improved its prediction accuracy significantly while the inspection of failure modes for the EAF model provided guiding insights into the domain-specific reasons behind several high-error regions.


Introduction
All models are wrong; some models are useful, as Box famously wrote [1]. To improve a model, to make the model less wrong, is a central process in the development and practical use of statistical models. When working with a predictive model, a user of the model may accumulate ground truth observations connected to the model inputs and model predictions. Even so, the model may fail in different ways: a model improvement computed on global error measures often performs worse than a model improvement that handles different error types separately.
In this article we describe fibers of failure (FIFA): a method that uses the MAPPER algorithm from topological data analysis (TDA) to classify error types based on observed errors paired with corresponding input data. Our method uses the observed errors as a part of the MAPPER process in order to construct a MAPPER model of the space of possible inputs to the predictive model that separates distinct error types from each other: each error type forms a distinct connected component in the fibers of the map from inputs to error measurements. We also suggest two types of methods, qualitative and quantitative, that use the error types to improve the predictions.
We demonstrate our method on two examples: first on a convolutional neural network (CNN) trained on MNIST digits and then used on a noise-distorted version of the MNIST digits, and next on data from a neural network process that predicts the electrical energy (EE) consumption of an electric arc furnace (EAF). In the CNN case, the classification of error types allows us to construct a prediction correction layer that produces a dramatic improvement in model performance, forming an example of where our quantitative approach performs well. For the EAF case, the quantitative approach is far less impressive. Instead, the qualitative approach, which inspects the error types for information to feed into further model refinements, uncovers actionable characteristics of the material processed by the furnace. Specific materials cause mispredictions, and the metallurgical modeling processes can be improved by using information about the materials that are particularly prone to mispredictions.
The FIFA method builds on MAPPER, an algorithm from topological data analysis that constructs a graphical (or simplicial complex) model of arbitrary data. MAPPER has been shown to be successful in a wide range of application areas, from medical research on cancer, diabetes, asthma, and many other topics [2][3][4][5], through genetics and phenotype studies [6][7][8][9][10], to hyperspectral imaging, material science, sports, and politics [11][12][13][14]. Of note for our approach are, in particular, the contributions to cancer, diabetes, and fragile X syndrome [2,3,6] where MAPPER was used to extract new subgroups from a segmentation of the input space.
More closely related to our work, ref. [15] used MAPPER to analyze the weights learned by CNN models. In their work, they identify meaningful structures in the topology of the space of learned weight vectors internal to the neural network architectures. This differs from our work in that FIFA provides a model for the input space to a predictive model, not a model for the parameter space of the model.
A couple of works [16,17] have looked at using the output from a classifier as a component in constructing filter functions for MAPPER. One study [16] used MAPPER for explainable modeling, with a highly customized method for creating a cover for the filter functions they use. Another study [17] estimated a filter function from input data and used the result to construct a variation on the MAPPER algorithm.
Shapley additive explanations (SHAP), a recent development in the field of interpretable machine learning [52], has previously been used to uncover the effects of each input variable on the prediction by a model predicting the EE of an EAF [53]. However, SHAP does not reveal subsets of the prediction domain where the underlying model predicts values far off from the true values. All models, regardless of type, are susceptible to errors of one or more distinct types. Locating and analyzing these distinct error types furthers the understanding of the model's adaptation to the training data. Thus, FIFA could help to make the statistical models predicting the EE consumption more transparent by presenting the most significant variables demarcating a distinct error type from the rest of the data. The use of FIFA can also help in adjusting the consistent model error biases that are prevalent in the non-linear statistical model, thereby possibly reducing the error of the model in the following step.

Topological Data Analysis (TDA)
TDA stems from topology and displays three important properties: coordinate invariance, deformation invariance, and compression [54]. These three properties differentiate TDA from geometry-based data analysis methods.
Coordinate invariance: TDA only considers the distances between data points as a notion of similarity (or dissimilarity). This means that a topological model can be rotated freely in space in order to enhance the visual analysis of the data. Compare this property with a common data analysis tool, such as principal component analysis (PCA), which ultimately decides the visual outcome of the data due to its projection of the data into a maximum-variance space of two or three dimensions. Figure 1a illustrates the coordinate invariance property for an arbitrary dataset.
Deformation invariance: Topology explains shapes in a different way compared to geometry. For example, a sphere and a cube are identical (homeomorphic) according to topology. Likewise, a circle and an ellipse are identical. White noise is inherent to any dataset and can be considered as a deformation of the underlying distribution of the dataset. Due to the deformation invariance property, TDA is a suitable method for analyzing noisy datasets and thus presents a more accurate visualization of the underlying dataset. An example of deformation invariance is the figure-8-shaped dataset in Figure 1b.
Compression: This property enables TDA to represent large datasets in a simple manner. Imagine having a dataset of millions of data points that have the shape of the letter Y; see Figure 1c. The compression property enables TDA to approximate the dataset using 4 nodes, which contain the data points, and 3 edges, which express the relations between the data points. This property makes TDA highly scalable.
MAPPER [55] is an algorithm that constructs a graphical (more generally a simplicial complex) model for a point cloud dataset. The graph is constructed systematically from some well-defined input data. The algorithm has been shown to have great utility in the study of various kinds of datasets, as described previously. It can be viewed as a method of unsupervised analysis of data, in the same way as principal component analysis, multidimensional scaling, and projection pursuit, but it is more flexible than any of these methods. Comparisons of the method with standard methods in the context of hyperspectral imaging have been documented in [11,12]. An illustration of MAPPER is shown in Figure 2. Let X be a finite metric space. The following steps construct the MAPPER complex:

1. Choose a collection of maps $f_1, \dots, f_k : X \to \mathbb{R}$, or equivalently some $f : X \to \mathbb{R}^k$. These are usually chosen to be statistically meaningful quantities such as variables in the dataset, density or centrality estimates, or outputs from a dimensionality reduction algorithm such as PCA or MDS. These are usually referred to as lenses or filters.

2. Choose a covering $U = \{U_1, \dots\}$ of $\mathbb{R}^k$: an overlapping partition of possible filter value combinations.

3. Pull the covering back to a covering $V$ of $X$, where $V_i = f^{-1}(U_i)$.

4. Refine the covering $V$ to a covering $\tilde{V}$ by clustering each $V_i$.

5. Create the nerve complex of the covering $\tilde{V}$: as vertices of the complex we choose the indexing set of $\tilde{V}$, and a simplex $[i_0, \dots, i_j]$ is included if $\tilde{V}_{i_0} \cap \dots \cap \tilde{V}_{i_j} \neq \emptyset$.

If we are only interested in the underlying MAPPER graph, it suffices to add an edge connecting any two vertices whose corresponding sets of data points share some data point.
One fundamental inspiration for the MAPPER algorithm is the Nerve lemma.
Nerve Lemma: If $X$ is some arbitrary topological space and $U = \{U_i\}$ is a good cover of $X$ with index $i$, then $X \simeq N(U)$, where the nerve $N(U)$ has a simplex $[i_0, \dots, i_d]$ if and only if $\bigcap_{k=0}^{d} U_{i_k} \neq \emptyset$. Here, a good cover is a cover such that $\bigcap_{k=0}^{d} U_{i_k}$ is either contractible or empty for all $[i_0, \dots, i_d]$.
If the function $f$ and the covering $U$ are chosen well enough, the covering $\tilde{V}$ may well be a good cover, in which case the topology of the MAPPER complex reflects the topology of $X$ itself.
The filters act as measures of enforced separation: data points with sufficiently different values for the filter function are guaranteed to be separated to distinct vertices in the MAPPER complex, while the nerve complex construction ensures that connectivity information is not lost in the process.
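The five steps above can be sketched for the graph-only case with a single real-valued filter. This is a minimal illustration and not the Ayasdi implementation: the function names `cluster` and `mapper_graph`, the choice of single-linkage clustering with a distance threshold `eps`, and the hand-picked intervals are all assumptions made for the example.

```python
import math
from itertools import combinations

def cluster(points, indices, eps):
    """Single-linkage clustering: chain together points at distance <= eps."""
    unseen, clusters = set(indices), []
    while unseen:
        seed = unseen.pop()
        members, stack = {seed}, [seed]
        while stack:
            p = stack.pop()
            near = {q for q in unseen if math.dist(points[p], points[q]) <= eps}
            unseen -= near
            members |= near
            stack.extend(near)
        clusters.append(frozenset(members))
    return clusters

def mapper_graph(points, filter_values, intervals, eps):
    """Return (nodes, edges); each node is a frozenset of point indices."""
    nodes = []
    for lo, hi in intervals:  # step 2: overlapping cover of the filter range
        # step 3: pull the interval back to the points whose filter value lies in it
        pulled_back = [i for i, v in enumerate(filter_values) if lo <= v <= hi]
        # step 4: refine the pulled-back set by clustering
        nodes.extend(cluster(points, pulled_back, eps))
    # step 5 (graph only): connect two nodes if they share a data point
    edges = {(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]}
    return nodes, edges

# Ten points on a line, filtered by their first coordinate.
points = [(float(x), 0.0) for x in range(10)]
filter_values = [p[0] for p in points]
nodes, edges = mapper_graph(points, filter_values, [(0.0, 4.5), (3.5, 9.0)], eps=1.0)
```

In this toy case the two overlapping intervals each yield one cluster, and the shared point with filter value 4 produces the single edge between them.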
In practice, one particular covering construction has gained widespread use. It creates axis-aligned overlapping hyperrectangles by the following process:

1. For each filter $f_i$, choose a number of intervals $N_i$ and an amount of overlap between neighboring intervals.

2. For each filter $f_i$, where $i = 1, \dots, k$, let $\min_i$ and $\max_i$ denote the minimum and maximum values taken by $f_i$, and construct the unique covering of the interval $[\min_i, \max_i]$ by $N_i$ subintervals of equal length. For the interior intervals in this covering, enlarge them by moving the right and left hand endpoints to the right and the left, respectively. For the leftmost (respectively rightmost) interval, perform the same enlargements on the right (respectively left) hand endpoints. Denote the intervals we have created by $J^i_1, \dots, J^i_{N_i}$, from left to right.

3. Construct the covering $U$ of $X$ by all "cubes" of the form $(f_1 \times \dots \times f_k)^{-1}(J^1_{s_1} \times \dots \times J^k_{s_k})$, where $1 \le s_i \le N_i$. Note that this is a covering of $X$ by overlapping sets.
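The covering construction can be sketched as follows. The helper names `interval_cover` and `hyperrectangles` are hypothetical, and the `overlap` parameter (the fraction of the base interval length by which each enlarged endpoint is moved) is an assumption, since the amount of enlargement is not specified in the text above.

```python
from itertools import product

def interval_cover(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] by n_intervals equal-length intervals, then enlarge them.

    overlap is an assumed parameter: each enlarged endpoint moves outward by
    overlap * (base interval length).
    """
    length = (hi - lo) / n_intervals
    pad = overlap * length
    cover = []
    for j in range(n_intervals):
        left = lo + j * length
        right = left + length
        if j > 0:                 # the leftmost interval keeps its left endpoint
            left -= pad
        if j < n_intervals - 1:   # the rightmost interval keeps its right endpoint
            right += pad
        cover.append((left, right))
    return cover

def hyperrectangles(covers):
    """All axis-aligned 'cubes' J^1_{s_1} x ... x J^k_{s_k}, one interval per filter."""
    return list(product(*covers))

cover_1 = interval_cover(0.0, 10.0, 2, 0.1)   # [(0.0, 5.5), (4.5, 10.0)]
cover_2 = interval_cover(0.0, 1.0, 3, 0.2)
cubes = hyperrectangles([cover_1, cover_2])   # 2 * 3 = 6 cubes
```

The pullback of each cube to $X$ is then the set of data points whose filter values fall inside every interval of that cube.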
For our work we are using the Ayasdi implementation of MAPPER.

FIFA: The General Case
In summary, the FIFA method is based on the following steps:

1. Create a MAPPER model that uses a measure of prediction failure as a filter.

2. Classify hotspots of prediction failure in the MAPPER model as distinct failure modes.

3. Use the identified failure modes to construct a model correction layer or to provide guidance for model refinement.

The process can be seen as classifying failure modes by analyzing the fibers of the map that goes from the input space to the observed prediction failure, that is, analyzing the fibers of failure.

MAPPER on Prediction Failure
The filters in the MAPPER function have the effect of ensuring a separation of features in the data that are separated by the filter functions themselves. In the setting of prediction failures, we leverage this feature to create MAPPER models that enforce a separation on prediction errors, allowing the subsequent analysis to identify contiguous regions of input space with consistent and large prediction errors.
We name the process of using MAPPER with prediction error as a filter in order to classify prediction failures as the fibers of failure method, and the resulting MAPPER model we name a FIFA model.

Extracting Subgroups
Subgroups of the FIFA model with tight connectivity in the graph structure and with homogeneous and large average prediction failure per component cluster provide a classification of failure modes. These can be selected either manually or using a community detection algorithm. When extracting subgroups manually, the intent is always to extract groups wherein the prediction error is as close to constant as possible. For community detection, most existing work in this area should be applicable. The MAPPER implementation we are using uses a grouping method based on agglomerative hierarchical clustering (AHCL) [60,61] and Louvain modularity [62]. This grouping algorithm used by Ayasdi is patented [63].
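As an illustration of automated subgroup extraction, the sketch below replaces the patented AHCL and Louvain grouping with a much simpler stand-in: connected components of the MAPPER graph, kept only when they are large enough and have a large mean absolute error. The function names and the two thresholds are assumptions for the example, and the homogeneity of the error within a group is not checked here.

```python
def connected_components(n_nodes, edges):
    """Connected components of an undirected graph on nodes 0..n_nodes-1."""
    adjacency = {i: set() for i in range(n_nodes)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in range(n_nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacency[v] - component)
        seen |= component
        components.append(component)
    return components

def failure_modes(nodes, edges, errors, min_points, min_mean_error):
    """Keep components with enough data points and a large mean absolute error.

    nodes: list of sets of data point indices (MAPPER vertices).
    errors: prediction error per data point.
    """
    modes = []
    for component in connected_components(len(nodes), edges):
        members = set().union(*(nodes[v] for v in component))
        mean_error = sum(abs(errors[i]) for i in members) / len(members)
        if len(members) >= min_points and mean_error >= min_mean_error:
            modes.append(sorted(members))
    return modes

nodes = [{0, 1}, {1, 2}, {3, 4}]
edges = {(0, 1)}
errors = [0.1, 0.2, 0.1, 5.0, 6.0]
modes = failure_modes(nodes, edges, errors, min_points=2, min_mean_error=1.0)
```

Here only the component containing data points 3 and 4 is reported as a failure mode; a community detection method would be needed to split a large connected component into several groups of near-constant error.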

Quantitative: Model Correction Layer
Once failure modes have been identified, one way to use the identification is to add a correction layer to the predictive process. This is done by using a classifier to recognize input data that is similar to a known failure mode, and by adjusting the predictive process output according to the behavior of the failure mode in available training data.
Train classifiers. For our illustrative examples, we demonstrate several "one versus rest" binary classifier ensembles where each classifier is trained to recognize one of the failure modes (extracted subgroups) from the MAPPER graph.
Evaluate bias. A classifier trained on a failure mode may well capture larger parts of test data than expected. As long as the space identified as a failure mode has consistent bias, it remains useful for model correction: by evaluating the bias in data captured by a failure mode classifier we can calibrate the correction layer.
Adjust model. The actual correction on new data is a type of ensemble model, and has flexibility on how to reconcile the bias prediction with the original model prediction-or even how to reconcile several bias predictions with each other. In the example cases used in this paper, we showcase two different methods for adjusting the model: on the one hand by replacing a classifier prediction with the most common class in the observed failure mode, and on the other hand by using the mean error as an offset.
Note on Type S and Type M errors. The authors of [64] argue that for model evaluation, the distinction between Type I and Type II errors is less useful than a distinction between Type S (sign) and Type M (magnitude) errors. Drawing on these error types, we will structure our quantitative adjustments in the continuous case with careful attention paid to Type S errors.
To elaborate, if a failure mode is found to have consistently overly-high predictions, adjusting with the observed bias of the failure mode is likely to produce a better prediction for all points in the failure mode. However, a failure mode that has errors in both directions will exacerbate some errors when adjusting for bias in that failure mode.
We apply this philosophy primarily in our design of model correction layers: by restricting bias corrections to cases that have large (handling Type M) and consistent (handling Type S) errors, we can rely on a bias correction to improve prediction for all the observations it corrects.
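The effect described above can be checked numerically with a small sketch. It assumes that errors are signed (true value minus predicted value) and that the correction subtracts the mean error of the failure mode; the function names are hypothetical.

```python
def mean_offset_correction(errors):
    """Residual errors after removing the mean error (bias) of a failure mode."""
    bias = sum(errors) / len(errors)
    return [e - bias for e in errors]

def n_worsened(errors):
    """Number of observations whose absolute error grows after the correction."""
    corrected = mean_offset_correction(errors)
    return sum(abs(c) > abs(e) for e, c in zip(errors, corrected))

consistent = [8.0, 10.0, 12.0]   # large errors, all of the same sign
mixed = [-1.0, 1.0, 9.0]         # errors in both directions

n_worsened(consistent)   # 0: every observation improves or stays equal
n_worsened(mixed)        # 2: the two small errors are made worse
```

Sign agreement alone is not sufficient in general: a same-sign group with widely spread magnitudes can still overshoot its smallest errors, which is consistent with requiring the errors to be both large and consistent.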

Statistical Modeling
We have chosen specific statistical measures to evaluate prediction errors for the two examples, described here in Section 2.4.1. For the subsequent qualitative analysis (as described in Section 2.3.4 above), we use the Kolmogorov-Smirnov statistic, described in Section 2.4.2, to determine the pairwise degree of dissimilarity between distributions.

Performance Metrics
The metrics used to evaluate the improved performance using the FIFA corrective layer for the EAF EE prediction model will be the coefficient of determination, $R^2$, and the regular error function, $\mathrm{Err}_{\mathrm{Reg}}$.
The regular error function is preferable over the absolute error function, since an overestimation differs from an underestimation in a practical EAF process context.
The regular error is defined as:
$$\mathrm{Err}_{\mathrm{Reg}} = y_i - \hat{y}_i$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $i \in \{1, 2, \dots, n\}$.
The coefficient of determination is a measure of how well the statistical model approximates the true data points. It is a function of the total sum of squares, $S_t$, and the residual sum of squares, $S_r$:
$$R^2 = 1 - \frac{S_r}{S_t}, \qquad S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad S_r = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where $\bar{y}$ is the mean value of $y_i$.
For the MNIST prediction model, which is a classifier-type statistical model, we used the prediction accuracy as a performance metric. MNIST contains 10 classes, and the prediction accuracy is defined as the fraction of correctly predicted images in the set of images:
$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[s_{\mathrm{pred},i} = s_{\mathrm{true},i}\right]$$
where $N$ is the number of images in the image set, $i \in \{1, 2, \dots, N\}$, $s_{\mathrm{pred}}$ is the predicted class, and $s_{\mathrm{true}}$ is the true class.
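The three metrics can be written out directly. The sketch below assumes the usual conventions, namely a signed error of true value minus predicted value and $R^2 = 1 - S_r / S_t$; the function names are hypothetical.

```python
def regular_errors(y_true, y_pred):
    """Signed error per observation (assumed convention: true minus predicted)."""
    return [t - p for t, p in zip(y_true, y_pred)]

def r_squared(y_true, y_pred):
    """Coefficient of determination from the total and residual sums of squares."""
    mean = sum(y_true) / len(y_true)
    s_t = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    s_r = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    return 1 - s_r / s_t

def accuracy(s_true, s_pred):
    """Fraction of correctly predicted classes."""
    return sum(t == p for t, p in zip(s_true, s_pred)) / len(s_true)

r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # 1.0: perfect prediction
r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])   # 0.0: no better than the mean
accuracy([5, 3, 5, 0], [5, 3, 3, 0])          # 0.75
```

Keeping the sign in `regular_errors` is what allows overestimation and underestimation to be told apart in the later analysis.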

Kolmogorov-Smirnov (KS) Statistic
The KS statistic can be used to measure dissimilarity between the cumulative distribution functions (CDF) of two samples. This is specifically known as the two-sample KS test and gives the maximum difference between the two distributions [65]. The KS-test is a non-parametric statistical test which is favorable, since many of the parameters governing the EAF process are of varying classes of distributions [66].
To perform a KS test, the KS-value is calculated under the null hypothesis, $H_0$; i.e., that the two samples have the same distribution. The p-value is the probability, under $H_0$, of observing a KS-value at least as large as the one calculated. The KS-value lies in the range [0, 1], where 0 indicates that the distributions of the two samples are identical and 1 indicates that the distributions are totally different. Hence, a low p-value in tandem with a high KS-value is a strong indicator that the two samples are different.
The two-sample KS test calculation proceeds as follows:
$$D_{n_1,n_2} = \sup_x \left| U_{n_1}(x) - V_{n_2}(x) \right| \tag{5}$$
where $U_{n_1}$ and $V_{n_2}$ are the two distribution functions, $n_1$ and $n_2$ are the numbers of instances in each sample from the two distributions, respectively, $x$ ranges over the total sample space, and $\sup$ is the supremum function. $D_{n_1,n_2}$ is illustrated in Figure 3. $H_0$ is rejected if the following condition is satisfied:
$$D_{n_1,n_2} > c(\alpha) \sqrt{\frac{n_1 + n_2}{n_1 n_2}}$$
where $\alpha$ is a pre-determined significance level and $c(\alpha)$ is the threshold value calculated using $\alpha$ and the cumulative KS distribution [67].
The KS-test is used by FIFA to present the variables that separate a specific error group from the rest of the data. Thus, one of the samples will be from the specific error group and the other sample will be from the rest of the data. Using the KS-test on each variable will show the variables whose distributions are the most different between the two samples.
The main drawback of the KS-test is that it reduces the difference between the two distributions to the point of maximum difference of the CDF. Hence, the maximum difference may not be representative over the complete distribution space. To combat this shortcoming in the analysis, the plotted distributions of the two samples will be provided as a complementary tool.

Figure 3. Illustration of $D_{n_1,n_2}$ for two distributions X and Y [66]. Left: The cumulative distribution functions (CDF) of X and Y. $D_{n_1,n_2}$, calculated using Equation (5), is shown as the difference between the upper and lower dashed lines; 100 samples were drawn from each distribution. Right: The probability density functions of X and Y.
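A minimal sketch of the two-sample KS computation on empirical CDFs follows. The function names are hypothetical, and the threshold uses the common large-sample approximation $c(\alpha) = \sqrt{-\tfrac{1}{2}\ln(\alpha/2)}$, which is an assumption here and may differ from the exact procedure of [67].

```python
import math
from bisect import bisect_right

def ks_statistic(sample_u, sample_v):
    """Maximum absolute difference between the two empirical CDFs."""
    u, v = sorted(sample_u), sorted(sample_v)
    d = 0.0
    for x in u + v:  # the supremum is attained at an observed value
        cdf_u = bisect_right(u, x) / len(u)
        cdf_v = bisect_right(v, x) / len(v)
        d = max(d, abs(cdf_u - cdf_v))
    return d

def ks_reject(sample_u, sample_v, alpha):
    """True if H0 (same distribution) is rejected at significance level alpha."""
    n1, n2 = len(sample_u), len(sample_v)
    c = math.sqrt(-0.5 * math.log(alpha / 2))  # assumed large-sample approximation
    return ks_statistic(sample_u, sample_v) > c * math.sqrt((n1 + n2) / (n1 * n2))

group = list(range(100, 150))   # values of one variable inside an error group
rest = list(range(50))          # values of the same variable in the rest of the data
ks_statistic(group, rest)       # 1.0: the two samples do not overlap at all
ks_reject(group, rest, 0.01)    # True
```

Applying `ks_statistic` to every variable and sorting by the result gives the ranking of variables used to characterize an error group.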

MNIST Data with Added Noise
As a first example, we have chosen to work with the MNIST [68] database of handwritten digits and used the methods from the MNIST-C [69] test set. In order to provoke prediction failures to analyze, we have used the MNIST database as-is for training purposes, but have created noisy samples for evaluations, using the impulse noise method used in MNIST-C: The noise corruption was created by introducing random binary flips on 25% of the pixels of each of the images in the test portion of the database. The predictive model was trained exclusively on clean MNIST images, but then evaluated on its ability to generalize to the impulse noise corrupted images we created. Figure 5 illustrates how the different datasets are related in the experiments.
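The corruption step can be sketched as below. This is not the supplementary code: the function name `flip_pixels` is hypothetical, the image is assumed to be a flat list of 8-bit intensities, and a "binary flip" is interpreted here as inverting the intensity of a pixel (255 minus the value), which may differ from the exact operation used in the experiments.

```python
import random

def flip_pixels(image, fraction=0.25, seed=None):
    """Return a copy of the image with a random fraction of pixels inverted.

    image: flat list of intensities in 0..255 (assumed representation).
    """
    rng = random.Random(seed)
    n_flips = round(fraction * len(image))
    chosen = rng.sample(range(len(image)), n_flips)  # distinct pixel positions
    noisy = list(image)
    for i in chosen:
        noisy[i] = 255 - noisy[i]  # assumed meaning of a "binary flip"
    return noisy

clean = [0] * (28 * 28)              # an all-black 28 x 28 image
noisy = flip_pixels(clean, seed=0)   # 196 of the 784 pixels are inverted
```

Sampling positions without replacement guarantees that exactly 25% of the pixels change, rather than 25% on average.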

Electric Arc Furnace
The EAF process accounted for 28% of the total world production of steel, on average, between 2008 and 2017, and is therefore the second most common melting furnace in steelmaking [70]. It is a resource and energy intensive process, where electrical energy (EE) can account for up to 66% of the total energy input. See Table 1. The amount of EE consumed by the EAF sheds light on the potential to optimize the consumption of EE. The gain is twofold. First, the cost for producing steel will be reduced in tandem with a reduced EE consumption. Second, the environmental impact will be reduced since less energy is needed for each batch of produced steel. Numerous studies utilizing statistical models (machine learning) to optimize the EE of the EAF have previously been conducted. A review on the subject concluded that non-linear statistical models have significantly better performance over linear statistical models when predicting the EE of the EAF [76]. The main reason for this is that the EAF process itself is subjected to numerous non-linear impositions governed by its physicochemical nature, and delays imposed by downstream and upstream processes. Delays are also imposed by unpredictable events, such as equipment failures. This has been discussed in depth in previous research [66,76]. Hence, the choice of a statistical modeling framework has to be adapted to the non-linearity of the process.
However, non-linear statistical models are almost impossible to analyze due to their complex mapping of the input data to the output data. This hampers the process engineers' understanding of, and therefore trust in, the model. It is always of paramount importance that the process operators trust the tools that are used to guide the process towards minimum operating costs and environmental impacts. Previous tools that have been applied to these types of statistical models are permutation feature importance and SHAP [53,66]. However, these tools only provide the relative importance of each feature for each prediction, while FIFA provides subgroups where the error of the model is unusually large. Hence, FIFA could prove to be a valuable tool in the arsenal of interpretable machine learning methods for the steel process engineers.

Selected Models for Analysis
In order to produce examples of FIFA in action, we have produced two distinct predictive models-one for each dataset described in preceding sections. Specifications about the software used in the experiments can be viewed in Table A3.

CNN Model Predicting Handwritten Digits
We created a CNN model with the topology shown in Figure 4. The topology and parameters were chosen arbitrarily, with the only condition being that the resulting model performed well on the original MNIST test dataset. The activation function was "softmax" for the classification layer and "ReLU" for all other layers. The optimizer was Adadelta with learning rate $lr = 1.0$, $\rho = 0.95$, and $\epsilon = 10^{-7}$ [77]. We trained the model on 60,000 clean MNIST training images (MNIST-train) through 12 epochs and tested it on 10,000 clean MNIST images (MNIST-test). The accuracy on the test set of 10,000 clean MNIST images was 99.05%. We created 10,000 corrupt MNIST images (C-MNIST-test) using 25% random binary flips on the clean test images. The code is available in the Supplementary material [78]. The accuracy on the corrupt MNIST images was 40.45%. The datasets used to train, test, and evaluate the CNN model are illustrated in Figure 5.
Figure 4. The abbreviations, such as Conv2D, describe the specific transformations performed between layers in the model. The activation function for the classification layer was "softmax", and for the other layers it was "ReLU". The optimizer used was "Adadelta" [77].

ANN Model Predicting the EE Consumption of an EAF
The chosen model framework was the artificial neural network (ANN), which is commonly used for modeling non-linear problems and has previously been used to model the EE consumption of the EAF [66,76]. A grid-search was conducted to find the best numbers of hidden layers and hidden nodes, and the best delay variable from a set of 5 variables representing the delays imposed on the process. See Tables A1 and A2 in Appendix A for details regarding the grid-search. The variables were chosen based on their respective contributions to the increase or reduction in EE from a physicochemical perspective. In order to investigate the stability of each set of parameters, a total of 10 model iterations were conducted for each grid-search parameter setup. The strategy employed to select the best model was to pick the model with the highest mean $R^2$ among those with a difference between maximum $R^2$ and minimum $R^2$ of less than 0.05. See Appendix A for the software used to create the models. The model selected by the grid-search had 1 hidden layer with 20 hidden nodes. The delay variable was "all delays". The rest of the variables are shown in Table 2.

Quantitative
To create the MAPPER graph we used the following approach:
• Filters: Principal component 1, probability of predicted digit, probability of ground truth digit, and ground truth digit. Our measure of predictive error is the probability of ground truth digit. By including the ground truth digit itself, we separate the model on ground truth, guaranteeing that any one failure mode has a consistent ground truth that can be used for corrections. We purposely omitted the activations from the Dense-10 layer as input variables because of the direct reference to the probabilities for both the ground truth digit and the predicted digit.
The following variables were used in filter functions or in the subsequent analysis, but were not used to create the FIFA model:
• Ten activations from the Dense-10 layer, which consist of the probabilities for each digit, 0-9.
Hence, the total number of variables in our analysis was 10,272.
From partitioned groups in the MAPPER graph, we retain as failure modes those groups that have at least 15 data points and have less than 99.05% correct predictions, which is the accuracy of the CNN model on the original MNIST test data (MNIST-test). We then trained logistic regression classifiers in a one versus rest scheme on each group using the same 16,000 data points (5-fold-training) used to create the MAPPER graph. (The one versus rest scheme is an ensemble method for converting a binary classifier to a classifier able to work on more than two groups. For each group, a separate classifier is trained to distinguish that group from the rest of the data. To use the classifier ensemble, the individual classifier results are combined to form a classification of the data.) We used logistic regression (LR) models with the following parameters. Penalty function: $\ell_2$. Regularization parameter: $C \in \{0.001, 0.01, 0.1, 1, 10, 100, 1000\}$. The regularization parameter that corrects the largest number of MNIST images on the 5-fold-test dataset is used.
Using the best performing model ensemble, we evaluated each model on a second dataset, called C-MNIST-eval, which consisted of 10,000 new corrupt images using 25% binary flips on the original MNIST test dataset (C-MNIST-test). The same impulse noise method was used both to produce test images to evaluate the performance of the CNN as input to the MAPPER process, and for the final evaluation of the performance of the combined CNN + correction layer model. Hence, we used the same noise pattern as the corrupt images used for testing the CNN model. See Section 2.7.1 for details regarding the noising methodology.
As we trained the classifiers on groups containing many wrong predictions, it was expected that the classifiers would classify member points with wrong predictions on the test datasets. Hence, we offset the predicted digits in the 5-fold-test and the C-MNIST-eval datasets with the ground truth digit of the group each classifier was trained on. We attempted to exploit the consistent bias of the classifiers to improve the accuracy of the now combined CNN and classifier ensemble.
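The replacement step can be sketched as follows. The function name `apply_correction_layer`, the per-sample score dictionaries, the decision threshold of 0.5, and the rule of letting the highest-scoring classifier win when several fire are all assumptions for illustration; the text does not fix how several classifier outputs are reconciled.

```python
def apply_correction_layer(cnn_digits, group_scores, group_truth_digit, threshold=0.5):
    """Replace a CNN prediction with the ground truth digit of a matched failure mode.

    cnn_digits: predicted digit per sample.
    group_scores: per sample, a dict {group id: one-versus-rest membership score}.
    group_truth_digit: dict {group id: ground truth digit of that failure mode}.
    """
    corrected = []
    for digit, scores in zip(cnn_digits, group_scores):
        fired = {g: s for g, s in scores.items() if s >= threshold}
        if fired:
            best_group = max(fired, key=fired.get)  # assumed tie-breaking rule
            digit = group_truth_digit[best_group]
        corrected.append(digit)
    return corrected

cnn_digits = [3, 8, 5]
group_scores = [{"A": 0.9, "B": 0.2}, {"A": 0.1, "B": 0.4}, {"A": 0.6, "B": 0.7}]
group_truth_digit = {"A": 5, "B": 7}
apply_correction_layer(cnn_digits, group_scores, group_truth_digit)  # [5, 8, 7]
```

Samples that no classifier recognizes keep the original CNN prediction, so the layer only acts on inputs that resemble a known failure mode.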
The FIFA procedure was repeated using five different splits of the training and test datasets in order to mitigate the effects of selection bias when creating the MAPPER graph. This procedure is equivalent to the 5-fold cross-validation methodology used in the field of machine learning to mitigate the effects of selection bias. Hence, each of the 5 MAPPER graphs was created using a different selection of 16,000 (5-fold-training) of the available 20,000 data points, which consist of 10,000 test MNIST images (MNIST-test) and 10,000 corrupt MNIST test images (C-MNIST-test) using 25% binary flips. See Section 2.7.1 for details regarding the noising methodology. The remaining 4000 data points (5-fold-test) were used to evaluate the ensemble correction classifier. See Figure 5 for a detailed illustration of how the datasets were created and how they are related.
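The split sizes can be reproduced with a short sketch. The function name `five_fold_splits` and the fixed shuffling seed are assumptions; indices 0 to 19,999 stand in for the 20,000 pooled clean and corrupt test images.

```python
import random

def five_fold_splits(n_points, n_folds=5, seed=0):
    """Yield (training indices, test indices) for each fold of a shuffled index set."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

for train, test in five_fold_splits(20000):
    # 16,000 points build the MAPPER graph (5-fold-training);
    # 4000 points evaluate the correction layer (5-fold-test)
    assert len(train) == 16000 and len(test) == 4000
```

Each data point appears in exactly one test fold, so every image contributes once to the evaluation of a correction layer that was built without it.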

Qualitative
We chose to study the digit 5. From the MAPPER graph (in Figure 6) the part corresponding to a ground truth digit 5 decomposes into two connected components. We split out four groups of approximately locally constant prediction error: groups 30, 40, 47, and 50 in the numbering scheme generated by the community finding algorithm used.
For these four groups of observations, we then generated on the one hand the distributions of classifications from the CNN classifier (seen in Figure 8) and on the other hand a collection of saliency maps [79] to allow us to inspect the different responses of individual neural network activations to digits in the various groups. We chose activations to inspect by looking for the highest KS-score when comparing each group to the correctly classified group 50.

Quantitative
The following parameters were used to create the MAPPER graph:
We trained an ensemble of logistic regression classifiers in a one versus rest scheme to identify membership in each of the groups with at least 15 data points in the training data. To make the corrected model more attractive to the model users, we restricted the resulting classifiers to ensure that each classifier would produce explainable adjustments. A classifier within the ensemble is qualified to adjust the test data if the following two conditions are satisfied:

1. The average error value of the data points predicted by the classifier on the training data, $\Delta E^{Tr}_{El}$, must be of the same sign (type S error) as the average group error value, $\Delta E^{Gr}_{El}$. This is to ensure that the errors of the data points predicted by each classifier are consistent with the errors of the group it has been trained to predict.

2. The error after adjustment of the group data cannot be worse than the group error, $|\Delta E^{Gr}_{El} - \Delta E^{Tr}_{El}| < |\Delta E^{Gr}_{El}|$. This is to verify that the classifier can identify data points that have, on average, somewhat similar error values to the group it is trained to identify.
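The two qualification conditions translate into a short check. The function name `classifier_qualifies` and its argument names are hypothetical; the inputs are the signed errors of the group members and of the training points captured by the classifier.

```python
from statistics import mean

def classifier_qualifies(group_errors, captured_train_errors):
    """Check the two conditions for a classifier to adjust the test data."""
    d_gr = mean(group_errors)            # average group error
    d_tr = mean(captured_train_errors)   # average error of captured training points
    same_sign = d_gr * d_tr > 0                  # condition 1 (type S)
    not_worse = abs(d_gr - d_tr) < abs(d_gr)     # condition 2
    return same_sign and not_worse

classifier_qualifies([8.0, 12.0], [5.0, 7.0])     # True: means 10 and 6
classifier_qualifies([8.0, 12.0], [-3.0, -1.0])   # False: opposite signs
classifier_qualifies([8.0, 12.0], [20.0, 30.0])   # False: |10 - 25| is not below 10
```

For a non-zero group error, the second inequality already forces the two averages to share a sign, so the first check is kept here only to mirror the two conditions as stated.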
The mean error, standard error, max/min errors, and R 2 are recorded prior to the adjustment and after the adjustment of the test data. The number of test data points is 2384. The ensemble with the highest decrease in standard error and highest increase in R 2 for the test data is chosen. The following values for the logistic regularization parameters, C, were used to determine the best performing classifier ensemble: [0.001, 0.01, 0.1, 1.0, 10, 100, 1000].
Unlike the CNN model case, the FIFA procedure was not repeated using five different splits of the training and test datasets. A model that is used in practice predicts on data generated at a later point in time than the data used to train it. The test data must therefore follow the training data chronologically, which means that K-fold cross-validation does not make sense in this case.
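A chronological split of this kind can be sketched as follows. The record format, the `timestamp_key` argument, and the default `test_fraction` are hypothetical choices made for illustration only.

```python
def chronological_split(records, timestamp_key, test_fraction=0.2):
    """Split records so that every test record is at least as late as every training record.

    records: list of dicts, each holding a sortable timestamp under
        timestamp_key (hypothetical format).
    test_fraction: share of the most recent records held out for testing
        (illustrative default, not a value from the text).
    """
    ordered = sorted(records, key=lambda r: r[timestamp_key])
    n_test = int(round(len(ordered) * test_fraction))
    split = len(ordered) - n_test
    return ordered[:split], ordered[split:]
```

Because the held-out records are always the most recent ones, there is only one such split for a given test fraction, which is why repeated random folds do not apply here.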

Qualitative
The qualitative inspection for each of the two MAPPER graphs was conducted by selecting four of the largest groups: two with the highest $\Delta E_{El}$-values and two with the lowest $\Delta E_{El}$-values. Each group was then compared to the rest of the data points. The 5 variables with the highest KS-values and the 5 variables with the lowest KS-values were analyzed further using EAF process expertise to determine the reasons behind the model prediction error. Variables were selected only if the p-value was lower than the significance level α = 0.01, in order to reduce the probability of selecting a variable whose KS-value was due to randomness. In addition, the variables were prone to contain extreme outliers. Hence, the distribution plots for each variable are shown for all values that are not part of the 1000-quantile.
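The ranking of variables by KS-value can be illustrated with a small two-sample sketch. Because the text refers to both the highest and the lowest KS-values, this sketch returns a signed statistic; the sign convention (group ECDF minus rest ECDF), the function names, and the row format are all assumptions, and the p-value filter is omitted.

```python
from bisect import bisect_right


def signed_ks(group, rest):
    """Signed two-sample KS statistic: the largest ECDF gap, with its sign.

    Assumed convention: positive when the group's ECDF lies above the
    ECDF of the rest, i.e. the group tends toward smaller values.
    """
    g, r = sorted(group), sorted(rest)
    best = 0.0
    for x in sorted(set(g) | set(r)):
        # ECDF values of both samples at x.
        diff = bisect_right(g, x) / len(g) - bisect_right(r, x) / len(r)
        if abs(diff) > abs(best):
            best = diff
    return best


def rank_variables(group_rows, rest_rows, variables):
    """Return (variable, signed KS) pairs sorted from highest to lowest."""
    scores = {
        v: signed_ks([row[v] for row in group_rows], [row[v] for row in rest_rows])
        for v in variables
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With this ordering, the first five and last five entries of the ranking would be the candidates passed on for expert inspection.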

Quantitative
The logistic regression model with the best adjustment on the corrupt data had a C-value of 1000. The average number of data points in all failure mode groups in the five folds was 4937 of the total 16,000 (5-fold-training). The average number of clean data points in all groups in the five folds was 10.4, accounting for a fraction of 0.21% of the 4937 data points. This also means that the failure mode groups encompass roughly 62% of all corrupt data points, on average, in each training set of the five folds. The numbers of failure modes (extracted subgroups) in each fold were 41, 41, 41, 41, and 37, respectively. Table 3 shows the accuracy on the two test datasets using the CNN with and without FIFA. The correction layer improved the accuracy of the CNN model by 6.05 percentage points on the 5-fold-test dataset and by 18.43 percentage points on the C-MNIST-eval dataset. This is considered a significant improvement over the original CNN model.

Table 3. Performance of the CNN as compared to CNN with FIFA-driven improvements both on the average of the five folds of test data (5-fold-test) and on entirely corrupted test data (C-MNIST-eval). The improvements by the classifier ensemble are for the best performing parameters. The FIFA-driven improvement produces an 18.43 percentage point increase in accuracy on the C-MNIST-eval dataset, which consists of only corrupt MNIST images. In addition, the percentage of clean images of the adjusted predictions by the correction layer was only 0.21%.
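The reported shares can be checked with a few lines of arithmetic. All inputs below are the figures quoted in the text; the only derived quantity that goes beyond them is `implied_corrupt_total`, which is simply what the stated 62% would imply and is not a number given by the authors.

```python
group_points = 4937      # average points in failure mode groups per fold
train_points = 16000     # training images per fold (5-fold-training)
clean_in_groups = 10.4   # average clean points inside the groups
modes_per_fold = [41, 41, 41, 41, 37]

# Share of clean images among the grouped points, as a percentage.
clean_pct = 100 * clean_in_groups / group_points

# Share of the whole training set captured by failure mode groups.
group_pct = 100 * group_points / train_points

# Corrupt points inside the groups, and the corrupt total implied by 62%.
corrupt_in_groups = group_points - clean_in_groups
implied_corrupt_total = corrupt_in_groups / 0.62

# Average number of failure modes across the five folds.
mean_modes = sum(modes_per_fold) / len(modes_per_fold)
```

The clean share rounds to the quoted 0.21%, and the 62% figure corresponds to a little under 8000 corrupt images per training set.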

Qualitative
For the qualitative analysis, we chose to focus on four groups with digit five as the ground truth digit: group 50, which is not one of the failure mode groups, and groups 30, 40, and 47, all part of the total 39 failure mode groups. The location of each group in the MAPPER graph is shown in Figure 6.
The distributions of predicted probabilities for each digit in the three failure mode groups are shown in Figure 7: group 30 assigns the highest probability to digit five among the three, while groups 40 and 47 are more focused on eight, two, and three. All three groups favor digit eight, with mean probabilities between 0.5 and 0.9.

Figure 6. One of the five MAPPER graphs created by the activations from the CNN model on four of the five folds, i.e., 16,000 images (5-fold-training), as explained in Section 2.8.1. The graph is colored with the probability of predicting the ground truth digit. The colorbar is for interpreting the values of the coloring. The circled nodes and edges are groups 30, 40, 47, and 50 (labeled G30, G40, G47, and G50). The other four MAPPER graphs are shown in the Supplementary Material [78].

We compared these three failure modes with the non-failure group 50 and extracted the five activations with the highest KS-values from the Dense-128 layer; see Figure 4. To illustrate the differences between the three failure modes regarding the activations, we have provided a selection of saliency maps [79] for all images considered as true members of each of the three failure mode groups. These were all produced using the keras-vis Python package. Figure 8 shows a selection of noisy images and their saliency maps for some of the activations' highest KS-values within the Dense-128 layer. The two leftmost image pairs were selected based on visually clear saliency maps with respect to digits. The two rightmost were selected based on the most unclear/noisy saliency maps. The full collection of saliency maps for these groups can be found in our Supplementary Material [78].

Figure 8. Examples of noisy images and saliency maps for activations in the penultimate dense layer for the three main failure modes identified for noisy 5s. The two leftmost images were chosen as the most clear saliency maps with respect to digits. The two rightmost were selected based on unclear/noisy saliency maps. All saliency maps are from images classified as members of the respective failure mode group. All saliency maps can be found in our Supplementary Material [78].
The activations 24 and 81, present in all three groups, display activity that is consistent with an activation detecting features of the digit five. The activations 89 and 99 correspond more closely to an activation for the digit three, and 119, 122, and 124 correspond to activations for the digit eight. In particular, in the last three groups, noise that closes loops in a written five tends to have high saliency.
In Table 4 we show the percentage of blank saliency maps, indicating how often an activation by the neuron is missing completely for the group's classified membership images. A blank saliency map for a particular image means that the neuron does not "send" its contribution further down the neural network, because the activation is equal to zero. For example, consider a neuron particularly prone to recognize digit five that also has a large percentage of blank saliency maps for a certain set of images. The resulting predictions will have a lower probability for digit five than another set of images with a lower percentage of blank saliency maps.

Table 4. The percentage of blank saliency maps for each of the five neurons with the highest absolute KS-values (compared to group 50) in the Dense-128 layer. The percentages only include saliency maps from the images classified as members of the respective failure mode group. Blank saliency maps do not contribute to the subsequent layers in the network because the activation is zero. The neuron numbers with bold font are the neurons qualitatively identified as encoding digit five. We observe that the neurons encoding digit five have predominantly larger percentages of blank saliency maps. This means that digit five receives lower probability for the number of images equating to the percentage of blank saliency maps.
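Since a blank saliency map corresponds to a zero activation, percentages of the kind reported in Table 4 could be computed along the following lines. This is a sketch under stated assumptions: the function name `blank_percentage` and the input format (a mapping from neuron index to per-image activation values for one group) are hypothetical, and an activation counts as blank only when it is exactly zero.

```python
def blank_percentage(activations_by_neuron):
    """Map each neuron index to the percentage of images with zero activation.

    activations_by_neuron: dict from neuron index to a list with one
        activation value per image in the group (hypothetical format).
    """
    result = {}
    for neuron, values in activations_by_neuron.items():
        blanks = sum(1 for v in values if v == 0.0)  # exactly-zero activations
        result[neuron] = 100.0 * blanks / len(values)
    return result
```

Running this once per failure mode group, on the Dense-128 activations of the images assigned to that group, would give one column of percentages per group.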

Quantitative
In total, 7806 data points were captured by 88 groups containing 15 or more data points. This corresponds to 82% of the data, since the total number of data points used to create the MAPPER graph was 9533. The adjustment results from the classifier ensembles are shown in Table 5. A reduction in the standard deviation of $\Delta E_{El}$ and an increase in the $R^2$-value can be seen after the adjustment by the ensemble classifier. While this is indeed an improvement, it is hard to justify from a practical EAF process perspective. The improvement in the standard deviation of $\Delta E_{El}$ is only 121 kWh/heat, which from a process perspective is within the white noise of the physico-chemical phenomena underpinning the EAF process. For example, the 121 kWh improvement is approximately equivalent to adding 300 kg less scrap to the heat. Such a small amount could easily be attributed to current scrap weighing sensitivity levels and is therefore comparable to the variations in the data due to white noise.

Table 5. Results before and after the adjustments on the test data using logistic regression with C = 1. The chosen logistic regression classifier ensemble was the one that, to the highest extent, reduced the standard deviation of error and increased $R^2$. See the appendix for the results from the logistic regression classifier ensembles with the other C-values.

Qualitative
The MAPPER graph is shown in Figure 9. MAPPER is able to separate subgroups with high positive error from subgroups with high negative error. Figure 10. However, the differences in KS-values, which take values from 0 to 1, range from 0.331 to 0.792. A higher KS-value is stronger evidence that the two samples do not come from the same distribution, as discussed in Section 2.4.2.

The use of FIFA made it possible to pinpoint raw material types C and D as two of the most predominant variables in separating each of the four groups from the rest of the data. From process experience it is known that raw material types have a large effect on the energy dynamics within the furnace and thus also on the EE consumption. Since raw material types are variables in the prediction model, it is clear that the model misjudges the impact of these raw material types on the true EE consumption.
The effective energy is a proprietary variable estimating the amount of energy that is effectively utilized during the EAF process. It is closely related to the energy dynamics of the process, but was not used as an input variable for the prediction model. Nevertheless, groups 1247 and 1250 have lower effective energy than the rest of the data, while group 1252 has a higher effective energy than the rest of the data. The prediction model overestimates the EE for the heats in groups 1247 and 1250, while it underestimates the EE for the heats in group 1252.

Conclusions
This study demonstrated the use of fibers of failure, FIFA, to analyze distinct error modes from two predictive processes; namely, the CNN model predicting handwritten digits from the MNIST dataset and the ANN model predicting the EE consumption of an EAF. This was accomplished by using both a quantitative and a qualitative approach on two diverse cases.
For the CNN model predicting corrupt MNIST images (C-MNIST-eval), an accuracy increase of 18.43 percentage points was achieved using the quantitative path. The qualitative approach showed that the CNN model was prone to misclassifying "5" as "8" for several selected failure modes. This is not surprising, since "5" and "8" have similar "traces" when written by hand. The results from the saliency maps provided further support for this conclusion by highlighting areas of higher and lower activations in the penultimate dense layer of the CNN model.

For the ANN model predicting the EE consumption of an EAF, the qualitative approach provided some interesting insights. Two distinct raw material types out of the seven available were among the five variables with the highest KS-values for the four selected subgroups. This provides guidance for improving future models with regard to these two raw material types. However, the quantitative approach did not significantly improve the EE predictions. The performance improvement was comparable in magnitude to the noise added by measurement errors in the EAF process.

Abbreviations
The following abbreviations are used in this manuscript:
Scikit-learn: Python package providing basic machine learning models.
Keras: Python package providing deep learning modeling; CNN.
Keras-vis: Python package providing visualisation tools for Keras.
Pandas: Python package for handling tabular data.
Matplotlib: Python package for plotting and drawing of MAPPER graphs.
Ayasdi SDK: Python package used to retrieve results from MAPPER provided by Ayasdi, Inc. [82].

Table A4. All variables used in the EAF case. The "Count" column shows the number of variables representing the parameter. The ANN/FIFA column indicates whether the variables were used as input variables in their respective models: "x" indicates the variable(s) is part of the model while "-" indicates the variable is absent. Completely absent variables in both ANN and FIFA were still part of the qualitative analysis using the KS-statistic to identify groups in the MAPPER graph.

Production indices
To keep order of the heats relative to the production supply chain -6 -/-