One-Dimensional Convolutional Auto-Encoder for Predicting Furnace Blowback Events from Multivariate Time Series Process Data—A Case Study

Abstract: Modern industrial mining and mineral processing applications are characterized by large volumes of historical process data. Hazardous events occurring in these processes compromise process safety and therefore overall viability. These events are recorded in historical data and are often preceded by characteristic patterns. Reconstruction-based data-driven models are trained to reconstruct the characteristic patterns of hazardous event-preceding process data with minimal residuals, facilitating effective event prediction based on reconstruction residuals. This investigation evaluated one-dimensional convolutional auto-encoders as reconstruction-based data-driven models for predicting positive pressure events in industrial furnaces. A simple furnace model was used to generate dynamic multivariate process data with simulated positive pressure events to use as a case study. A one-dimensional convolutional auto-encoder was trained as a reconstruction-based model to recognize the data preceding the hazardous events, and its performance was evaluated by comparing it to a fully connected auto-encoder as well as a principal component analysis reconstruction model. This investigation found that one-dimensional convolutional auto-encoders recognized event-preceding patterns with lower detection delays, higher specificities, and lower missed alarm rates, suggesting that the one-dimensional convolutional auto-encoder layout is superior to the fully connected auto-encoder layout for use as a reconstruction-based event prediction model. This investigation also found that the nonlinear auto-encoder models outperformed the linear principal component model investigated. While the one-dimensional auto-encoder was evaluated comparatively on a simulated furnace case study, the methodology used in this evaluation can be applied to industrial furnaces and other mineral processing applications.
Further investigation using industrial data will allow for a view of the convolutional auto-encoder’s absolute performance as a reconstruction-based hazardous event prediction model.


Introduction
South Africa hosts the majority of the world's platinum group metal (PGM) reserves in the Bushveld Igneous Complex [1]. These PGMs are extracted from nickel-copper ores contained in the Bushveld Complex through a series of process steps. Mined ore undergoes comminution, liberating sulphides to create a sulphide concentrate that is concentrated through flotation. Flotation concentrates are smelted and converted, yielding a copper-nickel matte rich in PGMs. Precious metals within the matte are separated from base metals through hydrometallurgical treatments before being refined into their pure forms [2].
Each of the aforementioned processing steps reduces the bulk of the concentrate or separates gangue from precious metals, increasing the PGM concentration. The smelting step is crucial to the overall PGM extraction process; during smelting, submerged electrode arc furnaces melt dried concentrate into a sulphide matte that acts as a PGM collector, increasing the concentration of PGMs tenfold [2]. The overall viability of the PGM extraction process is therefore reliant on the submerged arc furnaces being operated safely, effectively, and efficiently.
Desulphurization and electrode oxidation reactions within the furnace release sulphur dioxide and carbon monoxide into the furnace freeboard at high temperatures [3], resulting in a freeboard filled with hot, hazardous gases. Freeboard gases are extracted continuously to maintain a negative gauge pressure, preventing these gases from escaping into the surrounding area and jeopardizing operator safety [4]. However, atmospheric air drawn in by the negative gauge pressure cools the furnace contents; consequently, furnace efficiency is promoted by maintaining the pressure as close to zero as possible.
Despite continuous gas extraction, freeboard pressures can routinely exceed atmospheric pressure, causing positive pressure events, also known as blowbacks, in which hazardous gases escape from the furnace; the causes of these events are not well understood. A monitoring model that predicts these events would therefore promote the safety of the smelting operation by warning operators of impending blowbacks, and would allow freeboard pressures to be raised when blowbacks are not imminent, promoting efficiency.
Similar to comminution and flotation processes, furnaces are subject to disturbances in the grade and supply of concentrate. These similarities extend to complex process interactions: furnaces are subject to interactions between various furnace zones just as particle interactions, recycle streams, and slurry-air interactions are present and challenging in comminution and flotation process units. Fault conditions in furnaces (i.e., blowbacks), comminution (e.g., mill trips from mill overloading), and flotation (e.g., sliming incidents) can cause sub-optimal operation with potentially rapid and extreme consequences.
Furnaces, like the physical mineral processing operations upstream, generate large volumes of historical data. The large data volume recorded from these processes promotes the use of statistical process monitoring for predicting hazardous events such as blowbacks, mill trips, and sliming incidents [5]. This paper evaluates one-dimensional convolutional neural networks as reconstruction-based process monitoring models for predicting blowbacks in industrial submerged arc furnaces. This evaluation will yield insights into the suitability of reconstruction-based monitoring models for predicting hazardous events across the mineral processing chain.
Ideally, historical data recorded from submerged arc furnaces used to develop blowback-prediction models would be completely characterized, i.e., all observations in the historical dataset would be labelled correctly. Models could be trained to predict all possible events using such a dataset by separating all historical observations into distinct classes [6]. Unfortunately, furnace data, like most real-world datasets, are poorly characterized, and event-prediction models must be trained using datasets where only a few observations are labelled properly. This constraint has spurred the development of reconstruction-based one-class classifiers as event prediction models [7].
Reconstruction-based one-class classifiers are data-driven models trained to find effective, compressed representations of specific process patterns [8]. If a model is trained to reconstruct the process patterns preceding specific events, then it will reconstruct the specific event-preceding patterns with minimal error. Process patterns that do not precede the target event will be reconstructed inaccurately. This facilitates event prediction based on reconstruction error [9]; lower reconstruction errors suggest that the specific event is imminent, while large reconstruction errors suggest that the event is not imminent. Reconstruction-based event-prediction models are distinguished by how they find compressed representations of process faults.
Principal component analysis (PCA) is the most common approach to feature learning [5,10], and recognizes linear correlations in event-preceding processes [9]. The efficacy of PCA in recognizing specific process patterns has been demonstrated for detecting faults on the Tennessee Eastman simulated process [11], for modelling the normal conditions of batch and continuous chemical processes [12], and for detecting faults in industrial boiler data [13]. Unfortunately, the performance of PCA deteriorates when applied to the nonlinear correlations typically found in industrial data [8], leading to the increasing prominence of neural network-based one-class classifiers.
Auto-encoders (AEs) are neural networks that find the low-dimensional subspace that accurately represents network inputs, then reconstruct these inputs as the network outputs. When trained to reconstruct specific event-preceding process patterns, they make ideal candidates for reconstruction-based one-class classifiers [10]. Their ability to learn nonlinear representations of industrial process data was demonstrated on the Tennessee Eastman case study [14], and their ability to recognize specific process patterns was demonstrated on a simulated coal mill system [15].
Convolutional neural networks (CNNs) were developed for image processing, where they completely outclass traditional fully connected networks [16]. CNNs extract simple, localized features from network inputs before moving on to more complicated features. This allows for more effective representations of network inputs across convolutional layers [17]. Their adoption for monitoring industrial processes has been slow due to the intrinsic differences between images and multivariate time series, but the localized feature extraction of CNNs can lead to better representations of multivariate time series. Recently, convolutional auto-encoders (CAEs) have been developed for compressing univariate electrocardiogram signals [18] and for fault detection using multivariate time series in the context of process monitoring [19].
This study compares the performance of different reconstruction-based event prediction models using a simulated furnace as case study. The furnace model was developed to specifically account for the complex dynamic interactions in a submerged arc furnace while maintaining a lumped parameter approach to ensure feasible computational costs. Further details on the current study are provided in [20].

Reconstruction-Based One-Class Classifiers
Data-driven models are used to predict events by applying a model function to monitored process variables. Supervised learning aims to optimally parameterize the model function by minimizing a pre-defined loss function [21,22], but requires labelled observations containing the characteristic patterns preceding the event of interest [23]. Historical datasets are rarely this well-defined, and semi-supervised learning approaches are used to train models using only the historical observations that are known to contain the characteristic patterns. Reconstruction-based one-class classifiers are semi-supervised prediction models that seek to address the problem of ill-defined historical datasets. Three semi-supervised models are considered in this work: principal component analysis, auto-encoders, and convolutional auto-encoders.
In general, model parameters are found by training a model to compress monitored variables x_i to a lower-dimensional subspace and to reconstruct the observations accurately [24]. The output of the reconstruction model, x̂_i, is the reconstructed input, shown in Equation (1):

x̂_i = f(x_i; θ) (1)
In Equation (1), a model function, f, with model parameters, θ, is applied to an observation of multiple variables, x_i. The model is trained on historical observations that are known to contain the event-preceding patterns and, if properly trained, will reconstruct all observations with the characteristic patterns accurately while reconstructing those without them inaccurately. The reconstruction error, ε_{R,i}, quantifies how accurately an observation x_i is reconstructed (Equation (2)):

ε_{R,i} = ‖x_i − x̂_i‖² (2)

The inverse of ε_{R,i}, u_i = 1/ε_{R,i}, can be used as a discriminant. Higher discriminant values suggest that the reconstructed observation is similar to the observations used to train the reconstruction model. The reconstruction model will therefore generate higher discriminant values on observations that precede specific events if it was trained on such observations. The reconstruction model is semi-supervised because it does not require negative samples, i.e., observations outside the target event, during training [21].
Equation (3) formally states the training algorithm used to obtain the reconstruction model parameters, θ, from observations in the historical data, X_t, where X_t is the subset of data points preceding the event to be predicted:

θ = argmin_θ Σ_{x_i ∈ X_t} ‖x_i − f(x_i; θ)‖² (3)
The generated discriminant values are compared to a recognition threshold before making a prediction. However, a theoretical basis for the reconstruction recognition threshold does not exist, and has to be obtained empirically [8].
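Because no theoretical recognition threshold exists, it is typically set from the empirical distribution of discriminant values on data known to precede the target event. A minimal numpy sketch of this procedure follows; the synthetic discriminant values and the 5% quantile level are illustrative assumptions, not values from this study:

```python
import numpy as np

def reconstruction_discriminant(x, x_hat):
    """Reconstruction error (Eq. 2) and its inverse, used as discriminant."""
    eps = np.sum((x - x_hat) ** 2, axis=-1)
    return 1.0 / eps

# Hypothetical discriminant values computed on validation observations
# that are known to precede the target event.
rng = np.random.default_rng(0)
u_val = rng.gamma(shape=5.0, scale=2.0, size=1000)

# Empirical threshold: a low quantile of the discriminants on known
# event-preceding data, so most genuine precursors score above it.
threshold = np.quantile(u_val, 0.05)

def predict_event(u, threshold):
    """Flag an observation as event-preceding when its discriminant is high."""
    return u > threshold
```

An observation whose discriminant exceeds the threshold is flagged as resembling the event-preceding training data.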
The reconstruction-based model is trained by minimizing Equation (2) over all training samples, and is therefore sensitive to the units of the variables in each observation [8]. Each observation is therefore standardized (rescaled to zero mean and unit variance) before the model is trained. Inputs can be corrupted during training by adding normally distributed noise with zero mean and variance σ_C² to each observation, then training the model to reconstruct the original, uncorrupted input, improving model generalizability [15]. This is illustrated in Equation (4), where variable j of a standardized observation z_i is corrupted with normally distributed noise:

z̃_{i,j} = z_{i,j} + e_{i,j},  e_{i,j} ~ N(0, σ_C²) (4)

The variance σ_C² represents an additional design parameter that must be specified before model parameters can be derived.
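The standardization and corruption steps can be sketched in numpy as follows; the noise level σ_C = 0.1 and the synthetic two-variable data are illustrative assumptions:

```python
import numpy as np

def standardize(X):
    """Rescale each variable (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

def corrupt(Z, sigma_c, rng):
    """Add zero-mean Gaussian noise with variance sigma_c**2 (Eq. 4).
    A denoising model is then trained to map corrupt(Z) back to Z."""
    return Z + rng.normal(0.0, sigma_c, size=Z.shape)

rng = np.random.default_rng(42)
X = rng.normal(loc=[3.0, -1.0], scale=[2.0, 0.5], size=(1000, 2))
Z, mu, sigma = standardize(X)
Z_noisy = corrupt(Z, sigma_c=0.1, rng=rng)
```

The stored mean and standard deviation are reused to standardize test observations before they are passed to the trained model.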

Principal Component Analysis
Principal component analysis (PCA) is a prominent data-driven model applied in process monitoring. Using PCA, the directions of significant linearly uncorrelated variance are identified using recorded data of specific process conditions [25]. These directions constitute a linear subspace of target process conditions and are called the principal components of the modelled data. Observations with similar correlation structures to the target process conditions are well represented in this subspace and can be reconstructed accurately; therefore, PCA is an ideal model to recognize process conditions characterized by distinct linear correlation structures [9]. Figure 1 provides an illustration of PCA-based reconstruction.
Figure 1. Illustration of PCA-based reconstruction [20]. A new data point (green star) is projected onto the first principal component (blue arrow), yielding the projected data point (dark crimson star). The difference between the new data point and the projected data point is the reconstruction error.

Linear correlations between variables are well-approximated in the PCA subspace, but this subspace excludes autocorrelations between observations. PCA is therefore best suited to static processes [11,24]. Dynamic PCA (dPCA) is a simple modification of PCA that addresses this limitation. Using dynamic PCA, observations are lagged, incorporating previous values in each observation as shown in Equation (5), allowing PCA to include autocorrelations in its subspace [11,26]:

x̃_i = [x_i^T, x_{i−1}^T, …, x_{i−l}^T]^T (5)
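The lagging of Equation (5) and PCA-based reconstruction can be sketched in numpy as follows; the lag count and number of retained components are design choices for illustration, not values prescribed by the text:

```python
import numpy as np

def lag_observations(X, lags):
    """Augment each observation with the `lags` previous observations (Eq. 5),
    so that PCA can capture autocorrelation as well as cross-correlation."""
    n, m = X.shape
    cols = [X[lags - k : n - k] for k in range(lags + 1)]
    return np.hstack(cols)  # shape (n - lags, m * (lags + 1))

def pca_reconstruct(X, n_components):
    """Project observations onto the leading principal components and map
    them back; the residual is the PCA reconstruction error."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Right singular vectors of the centred data are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T            # loading matrix, shape (m, n_components)
    return (Xc @ P) @ P.T + mu
```

Applying `lag_observations` before fitting PCA is exactly the dPCA modification: the principal components then span correlations across consecutive time steps.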

Auto-Encoders
Auto-encoders (AEs) are feedforward neural networks that find effective representations of inputs and reconstruct them accurately [8]. Like neural networks, AEs have network architectures consisting of layers of neurons with weighted connections. What distinguishes AEs is their equally sized input and output layers and the existence of a bottleneck layer. The bottleneck layer has fewer neurons than the input and represents the nonlinear subspace of modelled data [8]. Figure 2 illustrates a typical AE network architecture. The neurons in an AE function similarly to those in standard neural networks, where each neuron accepts weighted inputs and biases to produce an output dependent on the selected (often nonlinear) activation function [27].

Gradient-based optimization routines are used to determine the parameters that satisfy Equation (3). Early nonlinear activation functions used in neural networks, like the sigmoid and hyperbolic tangent functions, struggled with vanishing gradients, posing serious problems for gradient-based optimization [28]. The Rectified Linear Unit (ReLU) activation function (Equation (6)) is frequently used to overcome this problem:

ReLU(x) = max(0, x) (6)

To avoid convergence to local minima, the gradient descent with momentum algorithm (Equation (7)) is often used to train neural networks [22]:

w_{k+1} = w_k − η ∂E/∂w|_{w_k} + γ (w_k − w_{k−1}) (7)

At each iteration, k, a weight w ∈ θ is updated based on how much it contributed to the overall loss function, E, according to the learning parameter η. The third term introduces momentum using the parameter γ to increase the likelihood that the model will find globally optimized weights and biases.
Lastly, regularization is used to modify the error function [22]. Using L2-regularization (Equation (8)), over-fitting can be avoided by adjusting the penalty parameter λ controlling the degree of regularization; larger values of λ result in more regularized model parameters:

E_reg = E + λ Σ_{w ∈ θ} w² (8)

Auto-encoders are able to identify nonlinear relationships between variables in each observation but, like PCA, are unable to identify dynamic relationships between observations. As with PCA, observations can be lagged to incorporate previous values in each observation using Equation (5).
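The training elements above (ReLU activations, gradient descent with momentum, and an L2 penalty) can be combined in a minimal numpy auto-encoder. The single-hidden-layer architecture and all hyperparameters below are illustrative assumptions, not the configuration used in this study:

```python
import numpy as np

def relu(x):                     # Eq. (6)
    return np.maximum(0.0, x)

class TinyAutoEncoder:
    """Minimal fully connected AE: one ReLU bottleneck layer, linear output,
    trained with gradient descent with momentum (Eq. 7) and L2 penalty (Eq. 8)."""

    def __init__(self, m, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (m, bottleneck))
        self.b1 = np.zeros(bottleneck)
        self.W2 = rng.normal(0.0, 0.1, (bottleneck, m))
        self.b2 = np.zeros(m)
        self.vel = [np.zeros_like(p) for p in (self.W1, self.b1, self.W2, self.b2)]

    def forward(self, Z):
        self.H = relu(Z @ self.W1 + self.b1)   # encoder -> bottleneck
        return self.H @ self.W2 + self.b2      # decoder -> reconstruction

    def step(self, Z, eta=0.01, gamma=0.9, lam=1e-4):
        n = len(Z)
        Z_hat = self.forward(Z)
        dOut = 2.0 * (Z_hat - Z) / n                 # gradient of squared error
        gW2 = self.H.T @ dOut + lam * self.W2        # L2-regularized gradients
        gb2 = dOut.sum(axis=0)
        dH = (dOut @ self.W2.T) * (self.H > 0)       # ReLU subgradient
        gW1 = Z.T @ dH + lam * self.W1
        gb1 = dH.sum(axis=0)
        params = (self.W1, self.b1, self.W2, self.b2)
        for p, v, g in zip(params, self.vel, (gW1, gb1, gW2, gb2)):
            v *= gamma                               # momentum term (Eq. 7)
            v -= eta * g
            p += v                                   # in-place weight update
        return np.mean((Z_hat - Z) ** 2)
```

Training on standardized, correlated data drives the reconstruction error down on event-preceding patterns, exactly the behaviour exploited by the discriminant of Equation (2).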

Convolutional Auto-Encoders
Convolutional auto-encoders (CAEs) are not fundamentally different from typical AEs. In fact, CAEs can be seen as a special case of fully connected AEs [22]. They use the same activation functions, and both can be trained using backpropagation combined with gradient descent algorithms. Convolutional auto-encoders are distinguished by the use of convolutional layers. Figure 3 provides an illustration of a simple CAE architecture typically used in applications with two-dimensional datasets (e.g., images). The neurons in convolutional layers are connected to a subset of the neurons in the preceding layer. These subsets (shaded grey in Figure 3) are simpler than the set of all outputs from the preceding layer, and are connected by far fewer weighted connections [27].
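The localized, shared-weight connectivity of a convolutional layer can be sketched for the one-dimensional case used in this study: a single kernel slides along the time axis of a multivariate window, producing one feature map. The kernel values and window below are hypothetical:

```python
import numpy as np

def conv1d_valid(window, kernel, bias=0.0):
    """Slide a kernel of shape (k, m) along the time axis of a (T, m)
    multivariate window ('valid' mode, stride 1), producing one feature map
    of length T - k + 1. Each output combines only k consecutive
    observations, using the same weights at every position."""
    T, m = window.shape
    k = kernel.shape[0]
    out = np.empty(T - k + 1)
    for t in range(T - k + 1):
        out[t] = np.sum(window[t : t + k] * kernel) + bias
    return np.maximum(0.0, out)   # ReLU activation, as in the AE layers

# Hypothetical example: a 3-step moving-average kernel over one variable.
window = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
kernel = np.full((3, 1), 1.0 / 3.0)
features = conv1d_valid(window, kernel)   # -> [1., 2., 3.]
```

A convolutional layer applies many such kernels in parallel; because each kernel has only k × m weights regardless of window length, the layer needs far fewer parameters than a fully connected layer over the same input.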

Figure 3. Illustration of a simple CAE architecture [20]. Shaded areas represent the subsets of each layer output used as the receptive field for subsequent convolutional filters. In the example, the first convolutional filter maps nine input features to a single feature in the first convolved layer (red blocks), the second convolutional filter maps four features to a single feature in the second convolved layer (green blocks), and the deconvolutional filter maps a single feature into eight separate features in the reconstructed pattern (blue blocks).

Case Study
The proposed event prediction methods were evaluated using a simulated submerged arc furnace with dynamic characteristics as a case study. The hazardous events simulated by the furnace model are positive pressure events (PPEs). The freeboards of submerged arc furnaces contain hazardous gases such as carbon monoxide at high temperatures [29,30]. A negative freeboard gauge pressure is maintained to prevent these gases from escaping.
PPEs occur when the gauge pressure of the furnace freeboard becomes positive, releasing hazardous gases into the surroundings. Figure 4 shows the layout of the furnace model simulation. The full model derivation and implementation is presented in [20].

Figure 4. Submerged arc furnace model layout, with distinct bulk and smelting concentrate, liquid slag and matte, trapped reaction gas, cooling water, and freeboard zones. Each zone is modelled as a separate lumped parameter system.

The dynamic furnace model is derived by performing mass and energy balances over the distinct zones of the furnace interior (Figure 4) to obtain a set of ordinary differential equations. This set of ordinary differential equations is used to generate datasets on which to evaluate the PPE prediction models [20]. These datasets are created by sampling the furnace model variables that can be monitored in a submerged arc furnace. The list of monitored variables is given in Table 1.

Table 1. Monitored variables in the simulated dataset.

3  Slag zone temperature, T_S (K)
4  Matte zone temperature, T_M (K)
5  Bulk concentrate temperature, T_C(B) (K)
6  Freeboard temperature, T_G (K)
7  Cooling water temperature, T_W (K)
8  Freeboard pressure, P_G (Pa)
9  Reaction gas concentration in freeboard, C_G,R (mol/m³)

The generated datasets correspond to 12 weeks of simulated operation [20]. Each monitored variable is sampled once every ten seconds; the resulting dataset contains n ≈ 726,000 observations and m = 9 features (Table 1). The simulation switched between two modes of operation: one where the furnace is operated in a way that does not cause PPEs, and one where it is operated in a way that causes PPEs.
PPEs are caused in the furnace model by increasing the concentrate bed thickness. During normal operation, the concentrate feed rate to the furnace is manipulated so that the bed thickness is maintained between 0.4 m and 0.6 m. PPEs occur by manipulating the feed rate so that the thickness varies between 0.7 m and 1.0 m. This causes reaction gases to build up in the concentrate, causing the bed to rupture and reaction gases to release rapidly into the freeboard. The effect of concentrate bed thickness is shown in Figure 5.
Figure 5. Illustration of how increasing the concentrate bed thickness (blue) causes PPEs, where the freeboard gauge pressure (red) becomes positive [20]. A PPE-causing fault is introduced at 2 days of simulated operation; during this time the bed thickness is maintained at levels where PPEs occur when the bed ruptures.

Performance Evaluation
Evaluating the reconstruction-based event prediction models presented in this work requires that prediction performance metrics be defined. Table 2 defines, in a confusion matrix, the four possible outcomes when a predictive model is applied to an observation [31].

Table 2. Possible outcomes when a predictive model is applied to an observation.

                 Pattern Present    Pattern Absent
Recognition      True positives     False positives
No recognition   False negatives    True negatives

The outcomes given in Table 2 are converted into metrics that express predictive performance from different perspectives. Typically, a classifier should have good sensitivity, φ (Equation (9)), as well as specificity, ψ (Equation (10)) [31]:

φ = TP / (TP + FN) (9)

ψ = TN / (TN + FP) (10)
Specificity indicates how well a model flags negative samples as such, while sensitivity shows how well a model flags positive samples. While specificity and sensitivity express model performance from different perspectives, they can give misleading impressions of model performance on unbalanced datasets. Precision (Equation (11)) is a useful performance metric for datasets with few positive samples and many negative samples, as it indicates the probability that a prediction made by a model is correct:

Precision = TP / (TP + FP) (11)
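The three metrics can be computed directly from the confusion-matrix counts of Table 2. The counts below are hypothetical and chosen to show how an unbalanced dataset can make sensitivity and specificity look strong while precision stays low:

```python
def classification_metrics(tp, fp, fn, tn):
    """Sensitivity (Eq. 9), specificity (Eq. 10), and precision (Eq. 11)
    from the confusion-matrix counts of Table 2."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return sensitivity, specificity, precision

# Hypothetical unbalanced example: few positives, many negatives.
sens, spec, prec = classification_metrics(tp=8, fp=20, fn=2, tn=970)
# Sensitivity 0.8 and specificity ~0.98 look strong, yet precision is
# only 8/28, i.e. most alarms raised are false alarms.
```

This is why precision is reported alongside sensitivity and specificity when hazardous events are rare.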

While a high precision shows that a model makes predictions with very few false alarms, it does not show how quickly the model makes those predictions, or whether those predictions precede events with enough time to be useful. The time-to-event, Δt_TE (Equation (12)), expresses how early a model recognizes an event-preceding pattern before that event occurs:

Δt_TE = t_event − t_detection (12)

Data Partitioning
A fair evaluation of the performance of a data-driven predictive model requires that the model be tested on data other than the training data. The simulated data with n observations and m features generated by the furnace model (X ∈ ℝ^{n×m}) is partitioned into training (X_0 ∈ ℝ^{n_0×m}) and testing (X_1 ∈ ℝ^{n_1×m}) datasets. This partitioning is shown in Figure 6, where 12 weeks of simulated data is partitioned into training (gold) and testing (red) datasets.
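The partitioning can be sketched as a chronological split, so that test observations follow training observations in time rather than being sampled at random; the split fraction below is an assumption, not the fraction used in the study:

```python
import numpy as np

def partition(X, train_fraction=0.5):
    """Chronological split of the simulated record X (n x m) into a
    training set X0 (first part) and a testing set X1 (remainder)."""
    n0 = int(len(X) * train_fraction)
    return X[:n0], X[n0:]

# Hypothetical record with n = 10 observations and m = 2 features.
X = np.arange(20, dtype=float).reshape(10, 2)
X0, X1 = partition(X)
```

A chronological split prevents information from future operation leaking into the training set through autocorrelated observations.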

Reconstruction-based event prediction models should only be trained on data where the characteristic patterns that precede the target event are present. Therefore, a target dataset (X_t ∈ ℝ^{n_t×m}) is constructed from a subset of observations in X_0. The ground truth regarding the presence of faulty conditions in the simulated data is known and is illustrated in Figure 7. Areas shaded in red show where the PPE-causing fault is present, and gold-shaded areas indicate its absence. Note that the PPE-causing fault is also present in unshaded areas in Figure 7, but detection at that point in time would not provide operators with sufficient response time before the target event occurs for the prediction to be useful.
Unfortunately, the ground truth in industrial datasets is rarely known. However, the hazardous event is easily identified, and training samples for a reconstruction-based event prediction model can be selected from a window preceding the target event [31]. Therefore, a prediction is assumed to be valid for a time (∆t_prediction) preceding the event. The prediction is correct if the event occurs within this period, and if enough time (∆t_warning) is available to take corrective measures. These two metrics allow a window of training samples that precede each event to be defined, as illustrated in Figure 8.

Figure 8. Illustration of online event prediction. After an event is predicted, an event is assumed to occur within the prediction period (red arrow). The prediction is valid if an event occurs within this period. The prediction should provide a minimum warning period (blue arrow) for plant operators to prepare for the event. Only warnings given in the gold-shaded area will both be valid and provide plant operators with sufficient time to prepare for the event. (1) Invalid prediction, as no fault occurs within the prediction period; (2) valid prediction; (3) invalid prediction, as the minimum warning period is exceeded.

Specifying ∆t_prediction and ∆t_warning defines a window preceding each event in the training dataset where predictions would be valid. Training samples can then be selected from these windows in the training dataset. Figure 9 illustrates observations in X_0, highlighted in gold, that are selected as training samples for ∆t_prediction = 1.5 h and ∆t_warning = 0.5 h. These training samples are used to construct a new target dataset, X_t.

Figure 9. Illustration of observations selected for the target dataset, X_t. X_t is constructed from observations in the gold-shaded region, but event-preceding patterns may still be present outside this window (red-shaded area). The dashed blue line indicates zero gauge pressure.
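The window-selection logic can be sketched as below, assuming evenly sampled data with a known sample period; the event list and function name are illustrative, not from the original study:

```python
import numpy as np

def select_training_windows(n_obs, event_indices, dt_prediction, dt_warning, sample_period):
    """Return a boolean mask over n_obs observations marking valid training samples.

    For each event, samples between (t_event - dt_prediction) and
    (t_event - dt_warning) are selected: late enough for the prediction to
    be valid, early enough to give operators the minimum warning period.
    """
    mask = np.zeros(n_obs, dtype=bool)
    start_offset = int(dt_prediction / sample_period)
    end_offset = int(dt_warning / sample_period)
    for e in event_indices:
        lo = max(0, e - start_offset)
        hi = max(0, e - end_offset)
        mask[lo:hi] = True
    return mask

# Example: 1-minute samples, dt_prediction = 1.5 h (90 min), dt_warning = 0.5 h (30 min)
mask = select_training_windows(
    n_obs=600, event_indices=[300], dt_prediction=90.0, dt_warning=30.0, sample_period=1.0
)
```

Observations flagged by the mask are then gathered into the target dataset X_t.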


Note that X_t does not contain all observations in X_0 where the event-preceding patterns are present; the above approach is simply a way of selecting observations with patterns that, if recognized, will flag the observations that precede each event. Recognitions immediately succeeding these observations will not reduce model specificity, nor will they increase sensitivity, despite the presence of the characteristic patterns that precede the PPEs; they do not provide sufficient warning time before the PPEs.

Model Development
Model parameters for the dPCA, AE, and CAE models are derived from the synthetically generated dataset. Table 3 shows the model derivation algorithm for the dPCA model, while Table 4 shows how the AE and CAE models are derived. Finally, Table 5 shows how the derived models are used to calculate the reconstruction error for new observations. Table 3. Model derivation algorithm for dPCA.
Step | Description | Output | Equation
1 | Standardize the target dataset X_t | Z_t | -
2 | Lag the standardized dataset Z_t | Z_t^L | 5
3 | Optimize model parameters to reconstruct Z_t^L from Ẑ_t^L | V | 3

Table 4. Model derivation algorithm for both AE and CAE.

Step | Description | Output | Equation
1 | Standardize the target dataset X_t | Z_t | -
2 | Lag the standardized dataset Z_t | Z_t^L | 5
3 | Optimize model parameters to reconstruct Z_t^L from Ẑ_t^L | θ | 3

Table 5. dPCA, AE, and CAE application algorithm.

Step | Description | Output | Equation
1 | Standardize the new observation | z_i | -
2 | Lag z_i with l preceding observations | z_i^L | 5
3 | Reconstruct z_i^L | ẑ_i^L | 3
4 | Calculate the reconstruction error | ε, R_i | 2

2.5.1. Dynamic Principal Component Analysis

The principal components of a dataset, X_t ∈ ℝ^{n×m}, are computed through eigenvalue decomposition of the covariance matrix of the dataset. This is shown in Equation (13) below:

cov(X) v_j = λ_j v_j    (13)

where v_j ∈ ℝ^{m×1} is a principal component of X. The corresponding eigenvalue, λ_j, is the total variance captured on this principal component. A PCA subspace is constructed using the most significant principal components, i.e., the components with the most variance. The significance of v_j is expressed by the fraction of total variance captured [32]. This fraction is calculated using Equation (14):

λ_j / σ_X²    (14)

where σ_X² is the total variance in X. The PCA subspace, V ∈ ℝ^{m×v}, contains the v most significant components. Retaining insignificant components causes noise to be retained in the PCA subspace. Selecting the optimal number of retained components, v, is therefore crucial to dPCA modelling [32]. In this investigation, v is selected so that 99.9% of the variance in the training set is retained. The reconstruction model for dPCA, f_PCA, is given by Equation (15) below:

x̂ = f_PCA(x) = x V V^⊤    (15)
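The derivation and application steps of Tables 3 and 5 can be sketched for the PCA case as below. The 99.9% variance criterion follows the text; the data, lag handling, and squared-residual form of the reconstruction error are illustrative assumptions:

```python
import numpy as np

def fit_pca_subspace(Z, variance_retained=0.999):
    """Return subspace V whose components capture the given variance fraction."""
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # Equation (13); ascending order
    order = np.argsort(eigvals)[::-1]               # most significant first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    fractions = np.cumsum(eigvals) / eigvals.sum()  # cumulative form of Equation (14)
    v = int(np.searchsorted(fractions, variance_retained) + 1)
    return eigvecs[:, :v]

def reconstruction_error(Z, V):
    """Squared residual per observation after projecting onto the subspace."""
    Z_hat = Z @ V @ V.T                             # Equation (15), row vectors
    return np.sum((Z - Z_hat) ** 2, axis=1)         # assumed form of Equation (2)

rng = np.random.default_rng(0)
# Two strong latent directions plus small noise across 9 variables
latent = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 9))
Z = latent + 0.01 * rng.normal(size=(500, 9))
V = fit_pca_subspace(Z)
R = reconstruction_error(Z, V)
```

Observations whose residual R_i exceeds the recognition threshold would be flagged as event-preceding.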

Auto-Encoder
The auto-encoder network architecture used in this investigation follows the template shown in Figure 10. Each observation with m features is lagged l times, requiring m(l + 1) input and output neurons. The encoding and decoding layers contain twice as many neurons as the input and output layers. This investigation considers an auto-encoder with three neurons in the hidden bottleneck layer.
Equation (16) is used to determine the number of network parameters (weights and biases) to be learnt iteratively using the gradient descent with momentum algorithm. This equation uses a dataset with m = 9 features lagged l = 4 times. Using Equation (16), it is demonstrated that the auto-encoder used in this investigation has 8958 learnable parameters. Table 6 shows the design parameters specified for the auto-encoder before model parameters are derived through training.
N_θ = 4(m(l + 1))² + 4N_hidden m(l + 1) + 4m(l + 1) + N_hidden    (16)

Figure 10. Illustration of a lagged AE network architecture. A lagged input, with m(l + 1) variables, is projected to a high-dimensional encoding layer. The hidden layer extracts representative features from this encoding layer, yielding the nonlinear AE subspace (3 features in this example). The decoding and output layers are used to reconstruct the lagged input from the subspace.

Figure 11 shows the convolutional auto-encoder network architecture used in this investigation. Each input is a 5 × 9 matrix; m = 9 features, each lagged l = 4 times. The first two convolutional filters have a 3 × 1 dimension and will therefore only convolve across the time dimension. The first two convolutions eliminate the time dimension, yielding a 1 × 9 convolved feature. The third convolutional layer convolves across the variables, yielding the model subspace of single values. It is from this model subspace that the original input is reconstructed using a 5 × 9 deconvolutional filter. Table 7 shows how many convolutional filters are used at each layer, as well as how many learnable parameters exist for each filter. The table quantifies the complexity of the investigated convolutional auto-encoder architecture. It shows that the architecture only has 707 learnable parameters.

Figure 11. Illustration of the CAE architecture evaluated in this project [20]. Convolutions that are applied vertically convolve a feature in the time dimension. Horizontal convolutions convolve across the variables in a feature. In the example, the first convolutional filter maps three input features to a single feature in the first feature layer (orange blocks), the second convolutional filter maps three features to a single feature in the second feature layer (green blocks), the third convolutional filter maps nine features into a single feature (red blocks), and the final deconvolutional filter maps single features into a reconstructed output with 45 features (blue blocks). See Table 7 for further details.
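The shape bookkeeping described above can be verified with a small sketch using valid (no-padding) convolutions; only the shapes are reproduced here, not the filter counts or weights of Table 7:

```python
import numpy as np

def conv_valid_1d(x, kernel_len, axis):
    """Return a zero array with the shape produced by a valid 1D convolution
    of length kernel_len along the given axis (shape arithmetic only)."""
    shape = list(x.shape)
    shape[axis] = shape[axis] - kernel_len + 1
    return np.zeros(shape)

# Input: l + 1 = 5 lagged time steps of m = 9 features
x = np.zeros((5, 9))
h1 = conv_valid_1d(x, 3, axis=0)   # 3x1 filter along time: 5 -> 3
h2 = conv_valid_1d(h1, 3, axis=0)  # 3x1 filter along time: 3 -> 1, time eliminated
h3 = conv_valid_1d(h2, 9, axis=1)  # filter across the 9 variables: 9 -> 1
# h3 is a single value per filter: the model subspace.
# A 5x9 deconvolutional filter maps each subspace value back to a 5x9 output.
```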

Results and Discussions

The performances of the evaluated reconstruction-based models are closely linked to the recognition thresholds at which they are evaluated. However, these thresholds can be selected arbitrarily. Therefore, each model is evaluated at the three thresholds shown in Table 8, which also gives the motivation for using each threshold.

Table 8. Recognition thresholds used for model evaluation.

Threshold | Motivation
95% precision | Threshold at which 95% of recognitions are correct
No missed alarms | Threshold at which every event is predicted
Perfect specificity | Minimum threshold where 100% specificity is achieved; no false alarms

Figure 12 shows the discriminant values generated by each of the investigated models over 9 days of simulated operation. Note that this evaluation is performed over 42 days of simulated operation; these figures are only for illustrative purposes. These figures also show the recognition thresholds given in Table 8.

Figure 12 shows that the dPCA, AE, and CAE models are all unable to achieve both zero missed predictions and perfect specificity: the recognition threshold for no false alarms is greater than that for no missed blowbacks for each model. Table 9 shows the event prediction performance metrics at 95% precision, no missed alarms, and perfect specificity for each investigated model, respectively.

Note that sensitivity refers to the fraction of all observations preceding events that is recognized by the predictive models. The number of failed predictions is the number of events for which a predictive model failed to recognize a single observation in the preceding window. Therefore, a model can have a sensitivity lower than 100% while still succeeding in predicting each event.
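The distinction between sensitivity and failed predictions can be sketched as below; the alarm/window encoding is an illustrative assumption:

```python
import numpy as np

def event_prediction_metrics(alarms, window_id):
    """Compute specificity, sensitivity, and the number of failed predictions.

    alarms    : boolean array, True where the model raises an alarm.
    window_id : int array, 0 outside event-preceding windows, otherwise the
                1-based index of the event the window precedes.
    """
    in_window = window_id > 0
    tn = np.sum(~alarms & ~in_window)
    fp = np.sum(alarms & ~in_window)
    specificity = tn / (tn + fp)
    sensitivity = np.sum(alarms & in_window) / np.sum(in_window)
    # An event prediction fails if no alarm is raised anywhere in its window.
    events = np.unique(window_id[in_window])
    failed = sum(1 for e in events if not alarms[window_id == e].any())
    return specificity, sensitivity, failed

# Two events: window 1 gets one alarm (predicted despite sensitivity < 100%),
# window 2 gets none (a failed prediction); one false alarm outside any window.
alarms    = np.array([0, 1, 1, 0, 0, 0, 0, 0], dtype=bool)
window_id = np.array([0, 0, 1, 1, 0, 2, 2, 0])
spec, sens, failed = event_prediction_metrics(alarms, window_id)
```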
The results presented in Table 9 suggest that the performance of the CAE model, relative to the AE and dPCA models, is superior in the case study evaluated in this work.
Entry 1 in Table 9 shows that the CAE model correctly recognized event-preceding conditions more quickly than the dPCA and AE models when the recognition threshold is set so that the precision of each model in recognizing event-preceding conditions is 95%. Furthermore, entry 4 shows that the CAE model managed to predict each event, while the AE and dPCA models failed to predict 20 and 40 blowbacks out of 63, respectively.
While inferior to the CAE model at the 95% precision threshold, the nonlinear AE model did manage to outperform the linear dPCA model. The dPCA model did show lower average detection delays than the AE model (as seen in entry 1 in Table 9) but failed to predict events twice as often. This suggests that predictions based solely on a process's linear characteristics will struggle to compete with predictions that utilize nonlinear characteristics.
The CAE model's superior performance was maintained when the recognition threshold was set so that no prediction fails. While the AE and dPCA models did achieve significantly lower detection delays at this recognition threshold, they did so at far lower specificities (86.70% for the dPCA model and 90.14% for the AE model). The CAE model successfully predicted all events at the highest specificity (99.86%) over all investigated recognition thresholds.
Finally, when the recognition threshold was set so that perfect specificity was achieved, none of the evaluated models managed to predict each event. However, both the AE and CAE models failed to predict fewer than half of the events (24 and 16 out of 63, respectively). The dPCA model trailed significantly, failing to predict more than two-thirds of the events (42 out of 63). This further suggests that modelling nonlinear characteristics is a crucial part of an event prediction model.

Conclusions
While the dPCA model showed inferior performance at each evaluated recognition threshold due to its limitations as a linear model, it should be noted that the computational requirements for developing and applying dPCA models are far lower than for AEs and CAEs. Kernel PCA is a nonlinear alternative to PCA that performs eigenvalue decomposition of the outer product of the modelled data, but this is computationally infeasible on the larger datasets typically recorded on industrial furnaces. The AE and CAE models evaluated in this project were not limited by computing requirements, but scaling them in complexity may not always be feasible. dPCA may be more suitable for applications where time-consuming optimization algorithms are undesirable.
The superior performance observed for the CAE model compared to the AE model suggests that using one-dimensional convolutional neural networks allows for more effective representations of the simulated furnace's multivariate time series data. As a reconstruction-based classifier, CAEs extract features using fewer parameters than AEs, representing inputs with fewer, more informative features. This suggests that the superior performance of convolutional networks is not limited to image data.
Overall, the results obtained in this investigation suggest that one-dimensional CAEs are promising models for extracting features from multivariate time series data recorded from submerged arc furnaces, and that they can be applied as reconstruction-based event prediction models for online process monitoring to improve the safety, and therefore the viability, of mineral processing applications. However, this investigation only provided a comparative evaluation of PCA models, auto-encoders, and convolutional auto-encoders on a single dataset obtained from a furnace model as a case study. Further evaluations on datasets obtained from industrial furnaces and other mineral processing applications will provide crucial insights, unobtainable from a modelled system such as the one used in this study, into the performance of convolutional auto-encoders as event prediction models for promoting safe operation of various mineral processing applications.