Dynamic Monitoring of Grinding Circuits by Use of Global Recurrence Plots and Convolutional Neural Networks

Reliable control of grinding circuits is critical to more efficient operation of concentrator plants. In many cases, operators still play a key role in the supervisory control of grinding circuits, but are not always able to act in a timely manner to deal with disturbances, such as changes in the mill feed. Reliable process monitoring can play a major role in assisting operators to take more timely and reliable action. These monitoring systems need to be able to deal with the potentially complex nonlinear dynamic behavior of comminution circuits. To this end, a dynamic process monitoring approach is proposed based on the use of convolutional neural networks. To take advantage of the availability of pretrained neural networks, the grinding circuit variables are treated as time series which can be converted into images. Features extracted from these networks are subsequently analyzed in a multivariate process monitoring framework with an underlying principal component model. Two variants of the approach based on convolutional neural networks are compared with dynamic principal component analysis on simulated and real-world case studies. In the first variant, the pretrained neural network is used as a feature extractor without any further training. In the second variant, features are extracted following further training of the network on a synthetic binary classification problem designed to enhance the extracted features. The second approach yielded nominally better results than what could be obtained with dynamic principal component analysis and the approach using features extracted by transfer learning.


Introduction
The energy-efficient operation of grinding circuits remains paramount in the processing of minerals, as these circuits account for a major proportion of the energy consumed in concentrator operations. Optimal control of grinding circuits is key to achieving these goals. Generally, the objective is to maintain a specific level of product quality (particle size) and to maximize throughput [1].
Although single-loop proportional-integral-derivative (PID) controllers are still the norm for industrial milling circuit control, this approach is not always effective, as strong interactions between the loops are not accounted for [1]. As a consequence, advanced control systems have been developed, e.g., model predictive controllers [2][3][4], fuzzy logic [5], expert system controllers [6,7] and other nonlinear control systems, such as the one recently proposed by Inapakurthi et al. [8].
Despite these advances, the behavior of grinding circuits can be highly nonlinear, and fully automated control may not be feasible in practice. This is particularly the case with autogenous grinding (AG) and semi-autogenous grinding (SAG) circuits, where the behavior of the mill is strongly affected by the properties of the ore being ground.
Under these circumstances, human operators play a key role in the control of milling circuits, and a sensible strategy would be to support operator decisions, while pursuing the longer-term goal of developing fully automatic control systems. Operator support can include visualization of the mill behavior [9], rule-based or expert systems support [10,11] and multivariate statistical process control [12][13][14][15].
Process monitoring methods in particular are reinforced by the development of novel sensors to better interpret the behavior of the mill. These include acoustic sensors [16,17], accelerometric measurement of the milling equipment [18] and the development of image processing to characterize particulate mill feed and product [19]. These sensors often have the same predictive power with regard to the performance of the grinding circuit as sets of operational variables, although interpretation of physical phenomena from these signals may be more challenging.
The use of dynamic monitoring methods exploiting key signals in mill operation has proliferated over the last few decades. One of the earliest approaches, based on the use of principal component analysis, was introduced by Ku et al. [20], who augmented each process variable with a number of lagged samples before building the PCA model. While this approach can explicitly deal with dependencies in the samples in time series data, it is adversely affected by a number of issues. These include determination of a proper lag structure for the variables, where dependencies are not well captured by linear methods, such as the autocovariance of the time series. In addition, there is the challenge of dealing with a considerable increase in the dimensionality of the system (number of variables) if the lag parameter is large. In particular, this could pose a problem in highly nonlinear systems, such as grinding circuits. Different ways of extending dynamic PCA have been investigated to address some of these problems [21]. These include extension to nonlinear methods, such as dynamic kernel PCA [22], dynamic independent component analysis [23,24], dissimilarity methods [25,26] and support vector data description [27]. Another class of approaches is focused on distributed or decentralized models, to deal with nonlinear dynamic process behavior that arises owing to the existence of operating modes in the process [28].
Despite these advances, dynamic monitoring of nonlinear process systems remains an open research issue, and, in this paper, a dynamic monitoring methodology based on the use of convolutional neural networks is proposed. In this approach, measured signals are first converted into images, before features are extracted that can be interpreted in a classical multivariate process monitoring framework. This addresses a number of the major challenges facing dynamic process monitoring, as briefly outlined above, as it globally captures potentially highly nonlinear process dynamics within a sliding window, without suffering from the drawbacks of having to deal with high-dimensional systems in the usual way. To contextualize the performance of these approaches, they are compared with a dynamic principal component model.
In Section 2, the basic operation of convolutional neural networks is outlined, followed by Section 3, where the analytical methodology is explained. In Section 4, two illustrative case studies are considered, and the results are discussed in Section 5. The conclusions of the paper are summarized in Section 6.

Convolutional Neural Networks
Convolutional neural networks (CNNs) are deep neural networks that have been designed to learn features from images that can be used in a variety of pattern recognition tasks. Their architectures emulate the animal visual cortex and its powerful visual processing capabilities.
A basic CNN architecture consists of convolutional, pooling and fully connected layers. The convolutional layers generate feature maps by sliding automatically learned filters over input images. These feature maps are down-sampled by pooling layers, as indicated in Figure 1. Max-pooling is often used, in which the maximum value of the features in a local neighborhood is retained. In this way, the image is downscaled to reduce the number of weights with each layer. The final layers of the network are typically fully connected layers that perform classification based on the extracted features.

Figure 1. VGG19 network architecture adapted for transfer learning. Layers include convolutional (white), pooling (blue) and an average pooling layer (red). Classification layers were replaced with an average pooling layer to output a flat vector of extracted features.

Training of the networks is accomplished by means of stochastic gradient descent and backpropagation. CNNs have fewer parameters to learn than equivalent fully connected neural networks and process data more efficiently. As a consequence, CNNs have become entrenched as state-of-the-art methods in a growing number of applications involving image processing [29].
Nonetheless, one of the major drawbacks of CNNs is that their initial construction often requires massive amounts of data and high-end computational resources. This problem can be circumvented by making use of CNNs that have been pretrained on data from different domains, such as the ImageNet database [30]. This process of deploying pretrained CNNs on data from domains other than those on which they were trained is known as transfer learning.
Transfer learning is facilitated by the hierarchical representation of image features at different network depths. Visual inspection of such learned features revealed that the first layers respond to general image features, such as edges and different colors [31]. Deeper layers combine these features into the objects they are trained to identify. In mineral processing, transfer learning has been successfully applied to the monitoring of flotation froths [32,33], grinding circuits [19,34], ore texture analysis [35] and particle size distributions [36].
The architecture of VGG19 [37], the pretrained model used in this study, is shown in Figure 1. The shapes of the feature maps output by each section of convolutional filters are shown. Images are presented to the model in RGB format. For a thorough discussion of CNNs and the various layers contained in the models, the reader is referred to Goodfellow et al. (2016) [29].

Dynamic Process Monitoring Methodology
This section first describes the overall dynamic process monitoring framework, before a detailed description of each methodology based on the use of convolutional neural networks is given. Two variants of the methodology are considered, which are referred to as Methods I and II. In Method I, features are extracted through transfer learning only. In Method II, feature extraction is enhanced by use of an artificial classification problem that forces the CNN to extract features more descriptive of the data distributions. These approaches are compared with monitoring based on dynamic principal component analysis, which is referred to as Method III, as described in more detail below.

Dynamic Process Monitoring Framework
Time series data, X ∈ ℝ^{N×M}, with N observations in M dimensions are divided into reference and test datasets, denoted X(NOC) and X(TEST), respectively. In practice, X(NOC) represents normal operating condition (NOC) data on which the monitoring algorithm is trained. X(TEST) represents new, unseen data which may or may not contain faults or process deviations. In contrast to X(NOC), X(TEST) is not available during the training stage of the model. As shown in Figure 2, X(NOC) is further subdivided into a validation set, X(VAL), to perform validation of the monitoring performance on data similar to X(NOC).

As mentioned above, both proposed monitoring approaches (Methods I and II) involve the extraction of features from segments of time series data. In both variants, the segmented time series data are transformed to images, and features are extracted from these images by use of a CNN. The feature matrices shown in Figure 2, F, are presented to a classical process monitoring scheme based on principal component analysis (PCA). The three methods considered in this investigation are summarized in Figure 3. Methods I-III are described in more detail in Sections 3.2-3.4, respectively.

After feature extraction, PCA is applied to F(NOC) to obtain the principal component loadings matrix, P_K. All three feature datasets are projected to the principal component space according to Equation (1), where K refers to the number of principal components retained to account for 90% of the variance in F(NOC):

T_K = F P_K (1)

The principal component scores, T_K, calculated in Equation (1) are used to calculate Hotelling's T²-statistic, as defined in Equation (2):

T² = Σ_{i=1}^{K} t_i² / s_{t_i} (2)
where t_i refers to column i of matrix T_K and s_{t_i} is the eigenvalue of the corresponding column in the PCA model. The data are projected back to the original feature space according to Equation (3):

F̂ = T_K P_K^T (3)

allowing for the calculation of the Q-statistic, representing the reconstruction or squared prediction error (SPE) of the model:

Q_r = (f_r − f̂_r)(f_r − f̂_r)^T, r = 1, 2, …, R
where R refers to the number of samples in the feature matrix F. The monitoring performance of the PCA models is validated using the 95% confidence limits of the T² and Q statistics of the NOC data. These confidence limits are used as control limits on control charts, and an alarm is raised once these limits are exceeded, indicating a fault. The performance of the algorithms was quantified by the true alarm rate (TAR), false alarm rate (FAR) and detection delay (DD). TAR is the fraction of detections flagged for known fault samples. FAR is the fraction of detections flagged on the validation dataset, which is known to be NOC data, i.e., the fraction of false positives. DD is defined as the number of consecutive known fault samples missed by the monitoring algorithm before a detection is logged. For calculation of DD, detection was defined as three consecutive alarms.
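As an illustration, the PCA-based monitoring scheme described above can be sketched in NumPy. This is a minimal example rather than the authors' implementation; the random matrix `F_noc` is a stand-in for the CNN-extracted NOC feature matrix, and empirical percentiles are used for the 95% control limits:

```python
import numpy as np

def fit_pca_monitor(F_noc, var_explained=0.90):
    """Fit the PCA monitoring model on NOC features: retain the K components
    accounting for 90% of the variance, as described in the text."""
    mu = F_noc.mean(axis=0)
    cov = np.cov(F_noc - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]            # sort eigenpairs, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    K = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_explained)) + 1
    return mu, eigvecs[:, :K], eigvals[:K]       # mean, loadings P_K, eigenvalues

def monitor(F, mu, P_K, eigvals):
    """Per-sample Hotelling's T^2 and Q (SPE) statistics."""
    Fc = F - mu
    T_K = Fc @ P_K                               # scores, Equation (1)
    T2 = np.sum(T_K**2 / eigvals, axis=1)        # Equation (2)
    F_hat = T_K @ P_K.T                          # reconstruction, Equation (3)
    Q = np.sum((Fc - F_hat)**2, axis=1)          # squared prediction error
    return T2, Q

rng = np.random.default_rng(0)
F_noc = rng.normal(size=(500, 20))               # stand-in for NOC CNN features
mu, P_K, lam = fit_pca_monitor(F_noc)
T2, Q = monitor(F_noc, mu, P_K, lam)
T2_lim, Q_lim = np.percentile(T2, 95), np.percentile(Q, 95)   # control limits
```

New feature matrices (validation or test) would be passed through `monitor` with the same fitted parameters, and samples exceeding `T2_lim` or `Q_lim` would raise alarms.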

Method I: CNN Feature Extraction Based on Transfer Learning
With Method I, sections of the time series were transformed into images, and features were extracted from these images with a CNN making use of transfer learning. Any of a number of CNNs pretrained on the ImageNet database can be used to accomplish this, but, in this investigation, the VGG19 CNN shown in Figure 1 was used.
More formally, the zero mean, unit variance normalized multivariate time series data, X ∈ ℝ^{N×M}, containing N measurements of M variables, were segmented by a moving window of length b, moving s time steps at a time along the time series.

After segmentation of the time series, a distance matrix was constructed for each of the R time series segments. The elements of these distance matrices (D_r ∈ ℝ^{b×b}, r = 1, 2, …, R) were the pairwise Euclidean distances d(x_i, x_j) between the points in the time series segment.
Figure 5 gives a diagrammatic representation of the calculation of a distance matrix in a window of length 5 sliding across a one-dimensional signal or time series. This gives a 5 × 5 distance matrix. Multivariate signals would be processed identically, with the distances computed in the corresponding multivariate Euclidean space. These distance matrices were presented directly to the convolutional neural networks, as equivalent to images, with the distance values equivalent to pixel intensities. These images can be considered a modified version of recurrence plots, as first proposed in [38]. Recurrence plots are a two-dimensional representation of the recurrences of state vectors in a dynamical system to local neighborhoods and are often used in the analysis of nonlinear or chaotic systems [39,40].
However, instead of confining the recurrence plot to returning trajectories in local neighborhoods, the thresholding function is removed to provide a global view of these trajectories at all neighborhood sizes. In this case, the different neighborhood sizes of the global recurrence plot (GRP) are represented as different colors in the image.
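The window segmentation and unthresholded recurrence plot construction described above reduce to a few lines of array arithmetic. A minimal NumPy sketch, with the window length b and step s as in the text (the helper names are illustrative):

```python
import numpy as np

def segment(X, b, s):
    """Segment an (N x M) multivariate time series with a moving window of
    length b advancing s steps at a time; returns an (R, b, M) array."""
    starts = range(0, X.shape[0] - b + 1, s)
    return np.stack([X[i:i + b] for i in starts])

def global_recurrence_plot(window):
    """Unthresholded (global) recurrence plot of one window: the b x b matrix
    of pairwise Euclidean distances between its points."""
    diff = window[:, None, :] - window[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.random.default_rng(1).normal(size=(100, 3))   # toy 3-variable series
segments = segment(X, b=5, s=1)                      # R = 96 windows
D = global_recurrence_plot(segments[0])              # one 5 x 5 distance "image"
```

The resulting matrix is symmetric with a zero diagonal, exactly as in the Figure 5 illustration; mapping its values through a color map yields the GRP image presented to the CNN.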
The feature extraction process is summarized in Figure 6.The distance matrices were presented to the CNN in the form of RGB color maps, instead of greyscale mappings.The final classification layers of the VGG19 network were removed, since these layers are trained to classify images into the 1000 classes of common objects of the ImageNet dataset.As shown in Figure 1, the output of the final pooling block was downsampled and flattened using an average pooling layer.The layer reduces the output size to a 512-dimensional feature vector.Note that the output sizes in Figure 1 are given for the default image size of 224-by-224 pixels on which the network The final classification layers of the VGG19 network were removed, since these layers are trained to classify images into the 1000 classes of common objects of the ImageNet dataset.As shown in Figure 1, the output of the final pooling block was downsampled and flattened using an average pooling layer.The layer reduces the output size to a 512-dimensional feature vector.Note that the output sizes in Figure 1 are given for the default image size of 224-by-224 pixels on which the network was trained.For a fixed-size convolutional filter, these output sizes change with the image size.In this work, the image size was dependent on the window length, b, and thus different intermediate output sizes were encountered; however, the number of feature maps remain constant, and the average pooling layer always results in a 512-dimensional vector.Implementation of the pretrained CNN models was facilitated by the Keras [41] and TensorFlow [42] libraries in Python.
In this first variant of the methodology, the CNNs pretrained on the ImageNet database were not trained any further, but simply used as feature extractors. These features, F ∈ ℝ^{R×n_F}, with n_F = 512 for the VGG19 CNN, were extracted for each of the NOC, validation and test datasets.
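As a sketch of this feature extraction step, the snippet below builds a headless VGG19 in Keras with an average pooling output, as described above. For brevity, `weights=None` avoids downloading the ImageNet weights here; in the actual transfer learning setting, `weights="imagenet"` would be used. Mapping the GRP to RGB by channel repetition is a simple stand-in for the color mapping used in the paper:

```python
import numpy as np
import tensorflow as tf

# Headless VGG19: classification layers removed, global average pooling
# flattens the final 512 feature maps into a 512-dimensional feature vector.
# weights=None avoids the ImageNet download in this sketch; use
# weights="imagenet" for the transfer learning setting described in the text.
extractor = tf.keras.applications.VGG19(weights=None, include_top=False, pooling="avg")

grp = np.random.rand(1, 64, 64, 1).astype("float32")   # one toy GRP "image"
rgb = np.repeat(grp, 3, axis=-1) * 255.0               # simple 3-channel mapping
rgb = tf.keras.applications.vgg19.preprocess_input(rgb)
features = extractor.predict(rgb, verbose=0)           # 512-dim feature vector
```

Regardless of the GRP image size (here 64 × 64, set by the window length b), the average pooling output is always 512-dimensional, as noted above.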

Method II: CNN Feature Extraction Based on Contrast Learning
Method II differs from Method I in that the pretrained networks were further trained to recognize the NOC data. An artificial training problem was set up to accomplish this. That is, in addition to the original time series, X(NOC), a surrogate time series was generated, X(NOC)* ∈ ℝ^{N×M}, with the same marginal distribution as the original time series, as well as the same autocorrelation function. This was accomplished by use of the iterated amplitude adjusted Fourier transform (IAAFT) algorithm [43].
The CNN was subsequently trained to distinguish between the real and surrogate data in a binary classification problem. Essentially, the original time series is masked or contrasted with the surrogate data, which forces the neural network to extract features that are descriptive of the distribution of the original NOC data, as it tries to discriminate between the two time series.
More formally, the surrogate time series was generated to be consistent with a linear Gaussian (autoregressive moving average) model of unspecified order. The IAAFT algorithm [43,44] used for this purpose can be summarized as follows:

(a) Generate a sorted copy and a random permutation of the time series X, called X_sorted and S^(0), respectively.
(b) Calculate the Fourier amplitudes |A_k| of X by the discrete Fourier transform (DFT).
(c) The following steps are iterated until convergence:
(i) Transform S^(i) to the Fourier domain using the DFT.
(ii) Replace the Fourier amplitudes of S^(i) with |A_k|, while keeping the complex phases. This step ensures the original power spectrum is reproduced at each iteration.
(iii) Take the inverse DFT back to the time domain to obtain S̄^(i).
(iv) Generate S^(i+1) by ranking the values of S̄^(i) in ascending order and replacing them by the values of X having the same ranking. Here, the distribution of the original series is retained.
(v) Convergence is achieved once the ranked order of S̄^(i) no longer changes between iterations, and S^(i+1) is returned as the surrogate time series.

The original time series was labeled as 0 and the surrogate time series as 1. Both time series were segmented, and the segments were converted to GRPs, as was done with Method I. The labels associated with the time series measurements were also mapped to the segments or corresponding images derived from the specific time series. These images and their associated labels were then concatenated into a single training database.
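The IAAFT steps (a)-(v) above can be sketched as follows for a one-dimensional series; this is a standard implementation, with the convergence check taken as the ranking no longer changing between iterations:

```python
import numpy as np

def iaaft(x, max_iter=200, seed=None):
    """IAAFT surrogate of a 1-D series: preserves the amplitude distribution
    exactly and the power spectrum approximately."""
    rng = np.random.default_rng(seed)
    x_sorted = np.sort(x)                       # step (a): sorted copy
    amplitudes = np.abs(np.fft.rfft(x))         # step (b): Fourier amplitudes
    s = rng.permutation(x)                      # step (a): random permutation S(0)
    prev_ranks = None
    for _ in range(max_iter):
        # steps (i)-(iii): impose the original amplitudes, keep the phases
        phases = np.angle(np.fft.rfft(s))
        s_bar = np.fft.irfft(amplitudes * np.exp(1j * phases), n=len(x))
        # step (iv): rank-order replacement restores the original distribution
        ranks = np.argsort(np.argsort(s_bar))
        s = x_sorted[ranks]
        # step (v): converged once the ranking no longer changes
        if prev_ranks is not None and np.array_equal(ranks, prev_ranks):
            break
        prev_ranks = ranks
    return s

x = np.sin(np.linspace(0, 20 * np.pi, 512)) + 0.1 * np.random.default_rng(2).normal(size=512)
x_surrogate = iaaft(x, seed=3)    # same values as x, reordered consistently
```

By construction, the surrogate contains exactly the same set of values as the original series (identical marginal distribution), while its spectrum matches that of the original to within the rank-ordering adjustment.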
The classification section of the VGG19 network was modified to enable training to classify the real and surrogate data. Details of these modifications are summarized in Table 1 and Figure 7. The output shapes given in the second column in Table 1 pertain to a single sample, corresponding to one segment of time series data of length b. The output of the sigmoidal nonlinearity in the final node distinguishes between the two classes.
Since the goal is improved feature extraction, the final section of convolutional filters was retrained, as indicated in Figure 7. Prior to retraining, this section outputs 512 feature maps, each of size 14 × 14, trained to successfully characterize and predict the classes in the ImageNet dataset. After retraining, the convolutional filters generating these feature maps will have been repurposed to better identify the differences between the original and surrogate datasets considered here. Once training is complete and the network can successfully discriminate between the original and surrogate data, the fully connected layers are removed. Features are again extracted for the X(NOC), X(VAL) and X(TEST) datasets, similarly to the previous variant.
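A sketch of this training setup in Keras is given below: the final convolutional block of VGG19 is left trainable, the remaining blocks are frozen, and a new binary head with a sigmoid output discriminates real from surrogate GRP images. The head layer sizes are illustrative (the paper's exact head is given in Table 1), and `weights=None` is used here only to avoid the ImageNet download:

```python
import tensorflow as tf

# VGG19 convolutional base; weights=None only to avoid the ImageNet download
# in this sketch (contrast learning would start from weights="imagenet").
base = tf.keras.applications.VGG19(weights=None, include_top=False, input_shape=(64, 64, 3))

# Freeze all blocks except the final convolutional block ("block5" in Keras),
# so that only those filters are repurposed during retraining.
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),        # 512-dim feature vector
    tf.keras.layers.Dense(64, activation="relu"),    # illustrative head size
    tf.keras.layers.Dense(1, activation="sigmoid"),  # real (0) vs. surrogate (1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

After training on the labeled GRP images, the head would be discarded and features extracted from the global average pooling output, as in Method I.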
The difference between the features generated by Method II and those generated by Method I is that, in the latter case, the features are solely derived from transfer learning from the ImageNet database. In Method II, the features are derived from contrast learning of the original time series data. In essence, the CNN is attempting to differentiate between the original time series and its surrogate mask. It accomplishes this by learning the decision boundary enclosing the features corresponding to the original NOC time series data. This decision boundary should enable distinguishing not only between original NOC data and surrogate data, but also between NOC data and data from any other distribution, such as fault data.
For comparative purposes, process monitoring based on the use of dynamic principal component analysis is also considered, referred to as Method III, as discussed in the next section.

Method III: Dynamic Principal Component Analysis
Ku et al. [20] proposed a DPCA method using a time lag shift method in the model-building procedure. Since then, different variants of DPCA have been proposed to address different problems in industry. For example, Li and Rong [45] proposed the use of partial DPCA to obtain structured residuals and to improve the capability for fault isolation. Russell et al. [46] applied canonical variate analysis (CVA) statistics and DPCA for more sensitive fault detection. Dynamic kernel PCA was proposed by Choi and Lee [47] and Jia et al. [48], with further improvements proposed by Zhang et al. [22].
Principal component analysis can be extended to deal with dynamic process behavior by expanding the data matrix with time-lagged copies of the variables [20], the expansion being governed by the embedding lag and the embedding dimension of the time series. In dynamic PCA models, the analyst needs to decide on the number of lags to add to each variable, in addition to the number of principal components to retain in the model. Different approaches have been proposed for this, e.g., by taking into account the autocorrelational behavior of the time series [49], or the methods proposed by Rato and Reis [50], where a single lag structure is determined for all the variables. In this investigation, the objective was to detect change in the process system as rapidly as possible, so the product of the embedding lag and the embedding dimension was minimized. For example, in a steady state system, with a lag of zero and an embedding dimension of one, change can be detected as soon as a single sample of all the measurements has been collected.
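As an illustration, the lag-trajectory expansion can be sketched as follows (a minimal sketch; the function and parameter names are ours, not the authors'):

```python
import numpy as np

def lag_embed(x, dim, lag):
    """Embed a univariate series into a lag-trajectory matrix with
    `dim` columns spaced `lag` steps apart: row i is
    [x[i], x[i+lag], ..., x[i+(dim-1)*lag]]."""
    x = np.asarray(x)
    n = len(x) - (dim - 1) * lag
    return np.column_stack([x[i * lag : i * lag + n] for i in range(dim)])

X = lag_embed(np.arange(10), dim=3, lag=2)
# X has shape (6, 3); first row is [0, 2, 4]
```

For multivariate data, each variable would be embedded in this way and the resulting blocks concatenated column-wise before fitting the PCA model.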
Unlike the other two methods, DPCA does not require analysis of segments of the time series. Instead, the time series, embedded into a lag-trajectory matrix, is input directly into the classical process monitoring scheme based on PCA, as shown in Figure 3.
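The classical PCA-based monitoring statistics applied to such a matrix can be sketched as follows (a generic implementation of Hotelling's T² and the Q statistic; control limits and scaling conventions may differ from those used in the paper):

```python
import numpy as np

def pca_monitoring_stats(X_noc, X_new, n_components):
    """Fit a PCA model on NOC data and return Hotelling's T^2 and
    Q (squared prediction error) statistics for new observations.
    A minimal sketch; confidence limits are omitted."""
    mu, sd = X_noc.mean(0), X_noc.std(0)
    Z = (X_noc - mu) / sd                          # standardize on NOC data
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                        # retained loadings
    lam = (s[:n_components] ** 2) / (len(Z) - 1)   # retained eigenvalues
    Zn = (X_new - mu) / sd
    T = Zn @ P                                     # scores
    t2 = np.sum(T ** 2 / lam, axis=1)              # Hotelling's T^2
    resid = Zn - T @ P.T                           # residual space
    q = np.sum(resid ** 2, axis=1)                 # Q statistic
    return t2, q
```

An observation is flagged when T² or Q exceeds the confidence limit estimated from the NOC data.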

Dataset Description
The analytical methodologies are demonstrated on a composite simulated signal generated by the combination of three sinusoidal signals and additive Gaussian noise, N(0, 0.01), with zero mean and a standard deviation of 0.01, as indicated in Equation (10):
y(t) = sin(0.11t) + sin(t) + 0.1 sin(10t) + N(0, 0.01), t < 1000
y(t) = sin(0.12t) + sin(t) + 0.1 sin(10t) + N(0, 0.01), t ≥ 1000    (10)
The signal was simulated for 2000 time steps, with the frequency of one of the sinusoidal components changing after 1000 time steps to mimic a fault. The simulated signal is depicted in Figure 8. The first 1000 time steps were divided into NOC and validation data in a 70/30 ratio, with NOC data shown in blue and validation data in green. Test data, in which the whole dataset contains the fault, are indicated in red.
Figure 8 demonstrates that a visual inspection would not suffice to detect the parameter change. In the following sections, the process monitoring strategies described are applied to detect the parameter change in the time series.

Method I: CNN Feature Extraction Based on Transfer Learning
After feature extraction, the principal component monitoring scheme was applied. The results in Figure 10 were used to determine the number of principal components to retain for monitoring calculations. The figure contains the cumulative sum of the normalized eigenvalues extracted from F(NOC), also for a window length of 150 time steps. For a window length of 150 time steps, 84 principal components were required to capture 90% of the variance in the F(NOC) ∈ R^(K×512) feature matrix, as indicated in Figure 10.
The results for the simulations are summarized in Table 2. The TAR of both statistics increases with increasing window length. The FAR in both cases approximates 0.05 (or 5%), as expected, since these data are similar to the NOC data from which the confidence limit was calculated. No clear patterns are observed in the detection delays, but the calculated values are short compared to the window lengths. According to the definition of detection delay used, a minimum value of three time steps can be achieved. Overall, the best results were obtained with a window length of 200.
As an example, the T²- and Q-statistic monitoring charts, for a window length of 150 time steps, are shown in Figure 11, with logarithmically scaled T²- and Q-scores for better visualization. The 95% and 99% confidence limits are shown as broken lines. Note that the first value is plotted after 150 time steps, since the sliding window has to buffer until the window is fully populated before feature extraction can continue.
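A sketch of the simulated signal of Equation (10), assuming unit-spaced time steps and interpreting N(0, 0.01) as zero-mean Gaussian noise with a standard deviation of 0.01:

```python
import numpy as np

def simulated_signal(n=2000, change_at=1000, noise_std=0.01, seed=0):
    """Composite test signal of Equation (10): three sinusoids plus
    Gaussian noise; the frequency of the slowest component changes at
    t = change_at to mimic a fault."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    f_slow = np.where(t < change_at, 0.11, 0.12)   # frequency shift = fault
    y = np.sin(f_slow * t) + np.sin(t) + 0.1 * np.sin(10 * t)
    return y + rng.normal(0.0, noise_std, size=n)

y = simulated_signal()
```

Sliding windows taken from this series would then be imaged and passed to the feature extractor.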

Method II: CNN Feature Extraction Based on Contrast-Based Learning
To enable contrast-based learning, an IAAFT surrogate of the simulated NOC data was generated, as shown in Figure 12. Figure 13 shows the distributions of the data in the two time series, as well as their autocorrelation functions. Examples of distance plots used during contrast-based training of the pretrained CNN are shown in Figure 14.
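A standard IAAFT implementation is sketched below for reference; the authors' exact surrogate-generation settings (iteration count, convergence criterion) are not specified here, so these are illustrative choices:

```python
import numpy as np

def iaaft_surrogate(x, n_iter=100, seed=0):
    """Iterative amplitude-adjusted Fourier transform (IAAFT) surrogate:
    preserves the amplitude distribution and (approximately) the power
    spectrum of x while destroying other temporal structure."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    target_amp = np.abs(np.fft.rfft(x))   # power spectrum to preserve
    sorted_x = np.sort(x)                 # amplitude distribution to preserve
    s = rng.permutation(x)                # random starting point
    for _ in range(n_iter):
        # impose the target power spectrum, keeping the current phases
        phases = np.angle(np.fft.rfft(s))
        s = np.fft.irfft(target_amp * np.exp(1j * phases), n=len(x))
        # impose the target amplitude distribution by rank remapping
        ranks = np.argsort(np.argsort(s))
        s = sorted_x[ranks]
    return s
```

Because the final step is a rank remapping, the surrogate contains exactly the same sample values as the original series.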
A CNN, with the architecture shown in Figure 7, was trained to classify the images using the parameters in Table 3. The contrast-based learning procedure was carried out in two stages. Firstly, only the classification layers were trained, with no alterations being made to the convolutional filters. After 20 epochs through the training data, training of the final section of convolutional filters commenced. The two-step procedure is required since the added classification layers were randomly initialized and led to high loss function values in the first few epochs. These large losses cause high-magnitude gradient updates to propagate throughout the network, resulting in large changes to filter weights, so that the benefit of using pretrained networks is lost. Thus, the initial training is necessary to reduce the loss function before changes to the convolutional filters are allowed. Further, a smaller learning rate is applied to the filter weights to prevent large weight adjustments resulting from slightly larger loss values. After training, the classification layers were removed, and features for the NOC, validation and test data were extracted.
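The two-stage schedule (train the randomly initialized head first with the backbone frozen, then unfreeze the final section with a smaller learning rate) can be illustrated schematically on a toy linear model; this is not the VGG19 setup, and all names and hyperparameters here are illustrative:

```python
import numpy as np

def staged_finetune(X, y, W1, epochs_head=20, epochs_full=20,
                    lr_head=0.1, lr_backbone=0.01, seed=0):
    """Schematic two-stage training: a 'backbone' W1 (stands in for
    pretrained filters) and a randomly initialized logistic 'head' w2.
    Stage 1 trains only the head; stage 2 also updates the backbone,
    with a smaller learning rate."""
    rng = np.random.default_rng(seed)
    w2 = rng.normal(0, 0.1, W1.shape[1])           # random head weights
    def step(train_backbone):
        nonlocal W1, w2
        h = X @ W1                                 # 'features'
        p = 1.0 / (1.0 + np.exp(-(h @ w2)))        # head output
        g = (p - y) / len(y)                       # logistic-loss gradient
        grad_w2 = h.T @ g
        if train_backbone:                         # smaller LR for backbone
            W1 = W1 - lr_backbone * np.outer(X.T @ g, w2)
        w2 = w2 - lr_head * grad_w2
        return float(-np.mean(y * np.log(p + 1e-9)
                              + (1 - y) * np.log(1 - p + 1e-9)))
    for _ in range(epochs_head):                   # stage 1: head only
        loss = step(train_backbone=False)
    for _ in range(epochs_full):                   # stage 2: unfreeze
        loss = step(train_backbone=True)
    return W1, w2, loss
```

The same staging logic applies when fine-tuning the pretrained CNN: freeze the convolutional base until the new classifier's loss has stabilized, then allow small updates to the final filter block.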
After feature extraction, the results in Figure 15 were used to determine the number of principal components to retain for monitoring purposes. Unlike the case with the first variant, only 17 principal components were required to capture 90% of the variance in the contrast-based learning features. Retraining the final section of convolutional filters adapted these filters to concentrate on discriminating between NOC and surrogate images, instead of properties of images in the ImageNet dataset. Consequently, the network was able to extract a more compact feature set representing the variation in the NOC data.
The results of the monitoring simulations with different window lengths are given in Table 4. For similar window lengths, the contrast-based learning features improved monitoring performance over the transfer learning features, with the exception of the largest window length. The best result was found at a window length of 150 time steps, with the highest TARs and low FARs; the detection delay was, however, higher at this window length. Again, the monitoring performance improved with window length, except for the largest window, and the detection delay remained uncorrelated with the window length. The T²- and Q-statistic monitoring charts, for a window length of 150 time steps, are shown in Figure 16. The T²-values span a much lower range, since only 17 principal components are used in the calculation, while the Q-scores are calculated on the remaining 495 principal components.
At the window length of 200, it is likely that the decreased performance is not a result of the increased length, but that training halted with the filter weights stuck in a suboptimal local minimum on the decision surface. Better results may have been found through longer training times, but overall it is clear that improved monitoring results can be obtained using contrast-based learning features rather than transfer learning features.

Method III: Dynamic Principal Component Analysis
DPCA results are dependent on the embedding lag and embedding dimension, according to Equation (9). For the simulations, an embedding dimension was first determined and results were generated for a range of embedding lag values. Since the simulated signal consists of three components with added noise, an embedding dimension of four was chosen. The results are summarized in Table 5.
Table 5. Monitoring results (FAR, TAR and DD) on the simulated dataset for a range of embedding lags, using Method III with an embedding dimension of four.

The sensitivity to the embedding parameters is clear from the results in Table 5. The only meaningful detection results were found at an embedding lag of 30. Monitoring charts for these parameters are depicted in Figure 17. On the simulated dataset, DPCA was outperformed by the transfer learning features, while even better results were achieved using features derived from contrast-based learning.

Monitoring the Power Draw of an Autogenous Milling Circuit
The process monitoring strategies encapsulated by Methods I-III were applied to an industrial autogenous grinding (AG) circuit on a platinum group metals concentrator. In this instance, the power draw of the AG mill was monitored, sampled at 5-s intervals. A change in the power draw generally implied a change in the mill load due to changing feed ores. Additionally, a fault was simulated in the data by gradually transforming the power draw data stream into white noise, as described by Equation (11).

y_f(t) = (1 − α(t)) y(t) + α(t) y*(t)    (11)
where y(t) are the raw power draw data, y*(t) is a random permutation of y(t), y_f(t) is the faulty signal, and α is a vector of evenly spaced values between 0 and 1. The power draw with the simulated fault is shown in Figure 18. Since this is in part a real dataset, the time series does not remain steady for long periods of time, and performance evaluation on a validation set necessarily leads to high FARs. Thus, the data are classified as "fault" or "non-fault" data, instead of NOC and test data. Only the true alarm rate and detection delay are used to quantify monitoring performance.
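A sketch of this fault injection, assuming the linear blend implied by the description of Equation (11):

```python
import numpy as np

def inject_permutation_fault(y, seed=0):
    """Gradually blend a signal with a random permutation of itself:
    alpha ramps linearly from 0 to 1, so the series is progressively
    turned into distribution-preserving noise. The linear blend form
    is our reading of the description around Equation (11)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    y_perm = rng.permutation(y)                  # y*(t)
    alpha = np.linspace(0.0, 1.0, len(y))        # evenly spaced on [0, 1]
    return (1.0 - alpha) * y + alpha * y_perm
```

Permuting the samples preserves the marginal distribution of the power draw while destroying its autocorrelation, which is what makes the late part of the faulty series resemble white noise.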

Method I: CNN Feature Extraction Based on Transfer Learning
Process monitoring results for detection of the fault in the mill power draw signal using CNN features extracted via transfer learning are shown in Table 6. As before, the monitoring performance improves with increasing window length, in terms of both the TAR and the detection delay (DD), and the best performance is achieved at the largest window length of 200 time steps. The monitoring charts for this window length are shown in Figure 19.

Method II: CNN Feature Extraction Based on Contrast-Based Learning
Since the power draw signal was not steady over its duration, the unseen non-fault data are also flagged as faults. This could lead to the conclusion that the contrast-based learning procedure was simply overfitting the training data. However, this is countered by the low FARs achieved in Table 4 on the simulated data and the fact that the time series does not remain stationary during this period. Rather, the contrast-based learning features are correctly detecting that the time series is moving away from the training data. These results emphasize the sensitivity of the method to the training data. It is critical that a large dataset is used during training to capture the normal variation occurring during NOC periods.

Method III: Dynamic Principal Component Analysis
Process monitoring results for detection of the fault in the mill power draw signal using DPCA are shown in Table 8. In the absence of knowledge of the true dimension of the underlying data-generating distribution, the method of false nearest neighbors [52] was used to estimate the embedding dimension. The optimal embedding dimension was calculated to be four, where the number of false nearest neighbors decreased to zero. The monitoring simulations were run over a range of embedding lags, as seen in Table 8. Monitoring charts at an embedding lag of 10 time steps are shown in Figure 21. The results show that DPCA was unsuccessful in detecting the simulated fault, or the movement of the time series away from the NOC data.


Monitoring the Operational State of an Autogenous Milling Circuit
Multivariate data from an industrial grinding circuit were monitored for a change in the mill operational state. The grinding circuit consisted of a fully autogenous mill closed with a recirculating load originating from the oversize material of a screen. The grinding circuit is depicted in Figure 22.
The monitored variables are the coarse ore feed rate, fine ore feed rate, mill power draw, mill load and the inlet water addition rate. Additionally, the milling circuit expert controller logged the operational state of the circuit at each sample, from a set of predefined states. The normalized samples of data collected at approximately 10 s intervals are shown in Figure 23.

For the samples considered, the mill gradually drifts from NOC to an overloaded state. The color change in Figure 23 corresponds to the exact time at which the expert controller identified the change in operational state. Such a change would generally result from a change in the feed ore competency or particle size distribution. The expert controller takes delayed corrective action by dropping the feed rates to return the circuit to a NOC state.
In the following sections, the monitoring methodologies are applied to the dataset to detect this change of operational state.

Method I: CNN Feature Extraction Based on Transfer Learning
The results of the monitoring simulations are summarized in Table 9. Again, the monitoring performance is only validated on the TAR and DD, since the circuit is not completely steady during the NOC period. Table 9 shows good detection results, with high TARs and short detection delays. Monitoring charts at a window length of 200 time steps are shown in Figure 25. From the charts, it appears the algorithm would identify the change in operational state slightly more quickly than the predefined control limits given to the expert controller.

Method II: CNN Feature Extraction Based on Contrast-Based Learning
For the multivariate case, a surrogate matrix was generated by first creating a surrogate for each individual time series. These surrogates were concatenated to form a surrogate matrix with the same shape as the original multivariate series. While the distribution and autocorrelation of each individual variable could be retained, any cross-correlation between variables is destroyed during surrogate creation.
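The column-wise surrogate construction can be sketched as follows; a simple shuffle surrogate stands in here for the IAAFT routine used in the paper:

```python
import numpy as np

def multivariate_surrogate(X, surrogate_fn, seed=0):
    """Column-wise surrogate matrix: a surrogate is generated
    independently for each variable (column) and the results are
    stacked. Per-variable properties are preserved by surrogate_fn,
    but cross-correlations between variables are destroyed."""
    return np.column_stack(
        [surrogate_fn(X[:, j], seed=seed + j) for j in range(X.shape[1])]
    )

# A simple shuffle-based surrogate (preserves the distribution only),
# standing in for an IAAFT routine:
def shuffle_surrogate(x, seed=0):
    return np.random.default_rng(seed).permutation(x)
```

In the paper's setting, `surrogate_fn` would be the IAAFT generator, so each column's distribution and autocorrelation are both retained.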
The results of the monitoring simulations are summarized in Table 10, and monitoring charts at a window length of 100 time steps are depicted in Figure 26. Again, NOC data not used for contrast-based training are displayed in green. Better results were generally achieved using contrast-based learning features than could be achieved using transfer learning features, except at the window length of 150. At this window length, it is likely that training halted with the network weights stuck in a suboptimal local minimum. Additionally, the trend towards the overloaded state is detected earlier and is more pronounced than with transfer learning features.


Method III: Dynamic Principal Component Analysis
For DPCA, the lag-trajectory embedding of the multivariate time series was determined using the method of false nearest neighbors and the average mutual information criterion [53]. For each variable, the dimension at which the number of false nearest neighbors tends to zero, as well as the lag value at which the average mutual information criterion reaches its first minimum, was calculated. The maximum of the calculated embedding dimensions, as well as the maximum time lag value, was used to embed all of the variables in the time series. This maximum dimension was two, with a maximum time lag of four time steps. Thus, the embedding resulted in a 10-dimensional matrix, with two dimensions per variable and time lags of four time steps between dimensions of the same variable.
The monitoring results are summarized in Table 11 and displayed in monitoring charts in Figure 27. The simulation was only run with one set of embedding parameters. In this case, the DPCA model successfully detects the change in operational state, performing similarly to the CNN feature models. Visual inspection of the time series data in Figure 23 confirms that this detection is a more trivial task than in the previous case studies. However, it still serves as a demonstration of the application of the methods to a multivariate problem.


Discussion
Convolutional neural networks were used to capture the structure of dynamic process variables in the form of sets of features that are representative of the time series behavior or process dynamics. The time series were encoded as a series of images from which features can be extracted via these networks. It is not technically necessary to convert the time series into images, as CNNs can take input of any dimension, including the time series directly. However, conversion of the time series to images allows the user to exploit pretrained CNNs that are available in the public domain, as has been done in a growing number of examples in a variety of applications [54,55]. Imaging of time series is advantageous for two reasons. First, it saves on the development cost (training time) of the CNNs, which can be considerable, and, second, it allows the use of these models in the absence of the large datasets typically required in the development of CNN models. Mathematical transformation of the data might also reveal temporal correlations or interactions between variables that would otherwise be more difficult to discern through analysis of the raw time series data. The method is easily extended to the analysis of images generated from time-frequency or multiscale transforms commonly used to analyze high-frequency sensor data.
More specifically, the CNNs used in this investigation, and in a large number of other applications discussed in the literature, were trained on the ImageNet database of common objects. As a consequence, transfer learning can be used to generate features from other image datasets, as was done in Method I as applied to the case studies considered in this paper.
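In the transfer learning setting, the classification layers of the pretrained network are replaced by global average pooling (cf. Figure 1), so each image yields a flat feature vector. The pooling step itself amounts to the following sketch; in a framework such as Keras, this corresponds to loading VGG19 without its top layers and with average pooling:

```python
import numpy as np

def extract_feature_vector(feature_maps):
    """Global average pooling: collapse an (H, W, C) stack of feature
    maps to a flat C-dimensional vector by averaging each map."""
    return np.asarray(feature_maps).mean(axis=(0, 1))

# e.g. VGG19's final convolutional block outputs 14 x 14 x 512 maps,
# which pool to a 512-dimensional feature vector per image.
```

The resulting 512-dimensional vectors form the rows of the feature matrices that are then passed to the principal component monitoring model.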
As informative as these features are, in principle at least, they are not necessarily optimal, especially if the images from the training and application domains are very different. Additional training could therefore go a long way towards further optimization of the features. With Method II, this is accomplished by setting up an artificial binary classification problem by generating synthetic data similar to the original data. Further training of the CNN to discriminate between these data in effect forces it to develop an internal set of features that are more specifically representative of the data in the application domain.
The efficacy of this approach is epitomized by the more compact set of features required to represent the data, as well as the comparatively favorable results that were obtained within the dynamic monitoring framework.
Both Methods I and II can be optimized in a number of ways.For example, imaging of the data based on Euclidean distance matrices may not be effective when the data lie on the convoluted low-dimensional manifold of a nonlinear dynamic system, but this could in principle be accounted for by use of more appropriate distance measures or different approaches to imaging of the data [56].
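For a univariate window, an unthresholded recurrence (distance) plot of the kind referred to here can be sketched as follows (Euclidean distances between scalar samples; for multivariate windows, distances between row vectors would be used instead):

```python
import numpy as np

def distance_plot(window):
    """Unthresholded (global) recurrence plot of a signal window:
    D[i, j] = |x_i - x_j|, rescaled to 0..255 so the matrix can be
    treated as a grayscale image for the CNN."""
    x = np.asarray(window, dtype=float)
    D = np.abs(x[:, None] - x[None, :])
    if D.max() > 0:
        D = 255.0 * D / D.max()
    return D.astype(np.uint8)
```

Swapping in a different distance measure, as suggested above, would only change the pairwise-distance line.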

Conclusions
As shown in this investigation, convolutional neural networks can be used in dynamic process monitoring schemes where the time series measurements are converted to images. Two variants were proposed, based on the use of a pretrained CNN (VGG19). In the first (Method I), the network was used to extract features, via transfer learning, from data contained in a sliding window; these features were further analyzed by a principal component model in a classical multivariate statistical process monitoring framework.
The second approach (Method II) was similar, except that the network was trained via an artificial classification problem to derive features that were more descriptive of the data. Both models performed better than a dynamic principal component model (Method III), although Method I did so only marginally, while Method II appeared to do so by a more significant margin.
Methods I and II would need to be validated further against additional case studies in different systems, but, as indicated in the previous section, there would be considerable scope for further optimization in, among others, the CNN model used, the generation of images from the data, and the extension of the overall scheme to accommodate multiscale approaches.

Figure 1 .
Figure 1. VGG19 network architecture adapted for transfer learning. Layers include convolutional (white), pooling (blue) and an average pooling layer (red). Classification layers were replaced with an average pooling layer to output a flat vector of extracted features.

Figure 2 .
Figure 2. General approach to feature extraction from time series data with CNNs, showing normal operating condition (NOC) and validation (VAL) data used for training, as well as independent test data (TEST) not used in the construction of the model.

Figure 3 .
Figure 3. Three approaches to dynamic process monitoring: Method I is based on the use of a principal component model derived from the features extracted from the time series with convolutional neural networks and transfer learning; Method II is the same as Method I, except for extraction of enhanced features; and Method III is based on dynamic principal component analysis.

Figure 4 .
Figure 4. Time series segmentation using a sliding window algorithm. In total, R = floor((N − b + 1)/s) such segments, each in R^(b×M), were constructed.
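The segmentation shown in Figure 4 can be sketched as follows, where b is the window length and s the stride; the function name `segment` is illustrative.

```python
import numpy as np

def segment(X, b, s):
    """Split an (N x M) multivariate series into R = floor((N - b + 1) / s)
    overlapping (b x M) segments: window length b, stride s."""
    N = len(X)
    R = (N - b + 1) // s
    return [X[i * s : i * s + b] for i in range(R)]
```

Each (b × M) segment is then imaged (Figure 5) before feature extraction.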

Figure 5 .
Figure 5. Generation of a distance matrix associated with a window sliding across a univariate process signal.

Figure 6 .
Figure 6. CNN feature extraction from global recurrence plots.

Figure 8 .
Figure 8. Simulated sinusoidal signal with parameter change occurring in the red region.

Figure 9 .
Figure 9. GRPs for CNN feature extraction through transfer learning: (a) NOC data (occurring before t = 1000); and (b) fault data (occurring after t = 1000 in Figure 8). The color scale defines the Euclidean distances represented by different RGB pixel values.

Figure 10 .
Figure 10. Cumulative fraction variance explained by each principal component from CNN feature extraction with transfer learning, for a window length of 150 time steps.

Figure 11 .
Figure 11. T2 (top) and Q (bottom) process monitoring charts of simulated dataset for a window length of 150 time steps using Method I. NOC data are shown in blue (up to index 749), validation data in green (indices 750-999) and test (fault) data in red (indices 1000-2000).

Figure 12 .
Figure 12. Simulated data (top) and IAAFT surrogate of the simulated data (bottom).

Figure 13 .
Figure 13. Distribution (a) and autocorrelation function (b) of the simulated data and the distribution (c) and autocorrelation function (d) of the IAAFT surrogate data.

Figure 14 .
Figure 14. Distance plots for training the CNN in Method II, with a window length of 150 time steps: (a) original time series; and (b) surrogate time series.

Figure 15 .
Figure 15. Cumulative fraction variance explained by each principal component from CNN feature extraction with contrast-based learning (Method II), for a window length of 150 time steps.

Figure 16 .
Figure 16. T2 (top) and Q (bottom) process monitoring charts of simulated dataset for a window length of 150 time steps using Method II. NOC data are shown in blue (up to index 749), validation data in green (indices 750-999) and test (fault) data in red (indices 1000-2000).

Figure 17 .
Figure 17. T2 (top) and Q (bottom) process monitoring charts of simulated dataset using Method III, with an embedding dimension of four and an embedding lag of 30. NOC data are shown in blue (indices 0-600), validation data in green (indices 601-800) and test (fault) data in red (indices 800-1720).

Figure 18 .
Figure 18. AG mill power draw gradually transforming from non-fault data (blue) into white noise through the fault region (red).

Figure 19 .
Figure 19. T2 (top) and Q (bottom) process monitoring charts of AG mill power draw for a window length of 200 time steps using Method I. NOC data are shown in blue and test (fault) data in red.

4.2.2. Method II: CNN Feature Extraction Based on Contrast-Based Learning

Process monitoring results for detection of the fault in the mill power draw signal using CNN features extracted through contrast-based learning are shown in Table 7. The monitoring performance generally increases with the window length, with the best results obtained at a window length of 200 time steps. At a window length of 50, the contrast-based training did not result in successful discrimination between the real and surrogate time series, leading to poor detection results. Monitoring charts for a window length of 200 are shown in Figure 20.

Figure 20 .
Figure 20. T2 (top) and Q (bottom) process monitoring charts of AG mill power draw for a window length of 200 time steps using Method II. Training non-fault data are shown in blue, untrained non-fault data in green and test (fault) data in red.

Figure 20 splits the non-fault data according to whether the data were seen by the model during contrast-based training. The data indicated by blue markers were seen during contrast-based learning, while the data indicated by green markers remained unseen. Since the power draw signal was not steady over its duration, the unseen non-fault data are also flagged as faults. This could lead to the conclusion that the contrast-based learning procedure was simply overfitting the training data. However, this is countered by the low FARs achieved in Table 4 on the simulated data and by the fact that the time series does not remain stationary during this period. Rather, the contrast-based learning features correctly detect that the time series is moving away from the training data. These results emphasize the sensitivity of the method to the training data: it is critical that a large dataset be used during training to capture the normal variation occurring during NOC periods.

Figure 21 .
Figure 21. T2 (top) and Q (bottom) process monitoring charts of simulated dataset using Method III, with embedding dimension of four and embedding lag of 10. NOC data are shown in blue and test (fault) data in red. The results show that DPCA was unsuccessful in detecting the simulated fault, or the movement of the unsteady time series data. Detection only occurs once the time series consists almost entirely of white noise. This case study thus produces results similar to the first, with both Methods I and II performing better than Method III. Feature extraction based on contrast-based learning again yields superior monitoring performance relative to feature extraction based on transfer learning.

Figure 23 .
Figure 23. Scaled multivariate grinding circuit data. Circuit operational state changes from NOC (blue) to overloaded mill (red), as logged by the expert mill controller.

Figure 24 .
Figure 24. Distance plots for CNN feature extraction, with a window length of 100 time steps: (a) NOC state; and (b) overloaded mill state.

Figure 25 .
Figure 25. T2 (top) and Q (bottom) process monitoring charts of multivariate grinding circuit data for a window length of 200 time steps using Method I. NOC (trained) data are shown in blue and test (fault) data in red.

4.3.2. Method II: CNN Feature Extraction Based on Contrast-Based Learning

Figure 26 .
Figure 26. T2 (top) and Q (bottom) process monitoring charts of multivariate grinding circuit data for a window length of 100 time steps using Method II. Training non-fault data are shown in blue, untrained non-fault data in green and test (fault) data in red.

Figure 27 .
Figure 27. T2 (top) and Q (bottom) process monitoring charts of multivariate grinding circuit data using Method III. NOC data are shown in blue and test (fault) data in red.

Table 1 .
Modified CNN architecture for contrast-based learning (Method II).

Table 2 .
Monitoring results on simulated dataset using Method I.

Table 3 .
CNN training parameters during contrast-based learning for Method II.

Table 4 .
Monitoring results on simulated dataset using Method II.

Table 6 .
Monitoring results on AG mill power draw using Method I.

Table 7 .
Monitoring results on AG mill power draw using Method II.

Table 8 .
Monitoring results on AG mill power draw using Method III with embedding dimension of four.

Table 9 .
Monitoring AG mill operational state using Method I.

Table 10 .
Monitoring AG mill operational state using Method II.

Table 11 .
Monitoring AG mill operational state using Method III.
