Comparison of Semirigorous and Empirical Models Derived Using Data Quality Assessment Methods

Abstract: With the increase in available data and the stricter control requirements for mineral processes, the development of automated methods for data processing and model creation is becoming increasingly important. In this paper, the application of data quality assessment methods to the development of semirigorous and empirical models of a primary milling circuit in a platinum concentrator plant is investigated to determine their validity and how best to handle multivariate input data. The data set used consists of both routine operating data and planned step tests. When the data quality assessment method was applied to this data set, selecting the appropriate subset of variables for multivariate assessment proved difficult. However, it was shown that it was possible to identify regions of sufficient value for modeling. Using the identified data, it was possible to fit empirical linear models and a semirigorous nonlinear model. As expected, the models obtained from the routine operating data were, in general, worse than those obtained from the planned step tests. Nevertheless, the models obtained from routine operating data would be extremely helpful as initial seed models for automated advanced process control methods. Therefore, it can be concluded that the data quality assessment method was able to extract and identify regions sufficient and acceptable for modeling.


Introduction
As the importance of data increases and increasingly large amounts of data are collected by industrial processes [1,2], there is a growing need to develop methods that can automate the processing of the data. However, much of the data are used directly without any consideration of their quality. The famous computing adage, attributed originally to George Fuechsel, a programming instructor at IBM who used it as a teaching device in the late 1950s [3], goes "garbage in, garbage out". In the context of modeling, this means that if the data provided are useless, then the resulting analysis will also be useless.
In process control, one of the most important applications of data is for understanding the process, that is, developing models that describe the behaviour of the system for different situations [4]. In process modeling, the key metric for determining the quality of the data is persistent excitation, which measures how much of the system has been excited [5,6]. Traditionally, persistent excitation is used to design experiments so that the maximal amount of information can be extracted from the experiment [6]. Operating companies that are involved in MPC projects often ask the question "Why can we not use all the historical data that we have gathered for our process to derive the models?" This is a valid question, and the standard answers are listed below.

1.
The modes of the PID loops in the past are not known.

2.
The data are correlated and cannot be used.

3.
There is too much data to look through to find regions where model identification can be used.
This paper aims to address these concerns using a case study with analysis of real plant data. The short answers to the concerns are given below.

1.
Either store the modes, or calculate them from the setpoint, process value, and output behaviour.

2.
Modern identification techniques can handle a certain amount of cross-correlation in the data.

3.
In this era of big data, can we not write a routine that partitions the data for us?
With the growth of data-driven methods, it is now necessary to test the available data to determine if it satisfies this condition. This approach is often called data quality assessment (DQA) [7][8][9]. DQA consists of three main steps: preprocessing, assessment, and postprocessing. In the preprocessing step, any known process changes are used to partition the data set into known regions, while the assessment step takes each of the known regions and seeks to determine which segments are sufficiently exciting. Finally, the segments are analyzed to determine if any of them are similar in nature and could be considered to come from a single underlying region. The need for postprocessing arises from two constraints: First, the assessment method can be too conservative and split adjacent data sets into multiple segments even if they come from the same underlying model. Second, within the original data set, there may be multiple instances of the same model. Combining these regions has the advantage of increasing the amount of available data so that better models can be obtained. Another major issue in DQA is how to handle multivariate data sets, where some, but not necessarily all, of the input variables are needed for modeling the process [10]. Clearly, if all the redundant data are used, then the data quality will automatically be assessed as being bad, as the information matrix will be uninvertible. However, there may well be a subset of inputs that can in fact be used to model the data set. Therefore, there is a need to determine which subset can or should be used to model the data set.
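The three stages can be sketched as a minimal pipeline. All function names and the toy data below are illustrative, not taken from any particular DQA implementation, and the acceptance test here is a simple variance proxy rather than the full assessment described later in the paper.

```python
# Illustrative three-stage DQA pipeline skeleton (all names are hypothetical).
def preprocess(data, mode_changes):
    """Split the record at known process-mode changes."""
    bounds = [0] + sorted(mode_changes) + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]

def assess(region, min_var=1e-7):
    """Accept a region only if it shows enough excitation (variance proxy)."""
    mean = sum(region) / len(region)
    var = sum((x - mean) ** 2 for x in region) / len(region)
    return var >= min_var

def postprocess(regions):
    """Merge adjacent accepted regions that plausibly share one model."""
    merged, current = [], []
    for region, ok in regions:
        if ok:
            current.extend(region)
        elif current:
            merged.append(current)
            current = []
    if current:
        merged.append(current)
    return merged

# Toy record: flat - varying - flat, with mode changes at samples 50 and 150.
data = [0.0] * 50 + [float(i % 7) for i in range(100)] + [1.0] * 50
segments = preprocess(data, mode_changes=[50, 150])
good = postprocess([(r, assess(r)) for r in segments])
```

Only the middle, sufficiently varying segment survives; the two flat segments are rejected as carrying no information for modeling.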
Another critical aspect of DQA that has not been well studied is the development of methods that can be applied to the creation of different types of models. Most DQA methods focus on data-driven, black-box models. However, other types of models can also be created, such as semirigorous, gray-box models, where the overall form of the model is developed from a theoretical understanding of the process, but whose parameter values are determined from the available data. A comparison of the applicability of DQA for such models would also be helpful. Therefore, this paper will examine the problem of DQA for multivariate data sets and its application to both data-driven and semirigorous models. The results will be shown using data taken from the primary milling circuit in a platinum concentrator plant.

Data Quality Assessment Method
DQA seeks to extract from a historical data set those regions that are of interest to the engineer or user. In such cases, the important factor is identifiability, that is, whether the particular data set can be used for modeling the process. Clearly, a variable whose value does not change is not useful for modeling the process. In general, identifiability is linked with the invertibility of the Fisher information matrix, which measures the amount of information present in the input signals about the parameters of interest [11]. However, in practical terms, other factors, such as the variance of the signals and missing data, also need to be considered [12].
Furthermore, extending the results to the multivariate case requires that additional factors be considered. From the perspective of the invertibility of the Fisher information matrix, collinearity between the variables must be avoided. Within a given historical data set, some of the variables may be correlated with each other. Placing two such variables into the Fisher information matrix will cause the matrix to be uninvertible, and thus the data region will be considered useless. One easy solution to this problem is to examine the correlations between the variables and only select those that are not correlated with each other. A further area of concern is that such variables may also be correlated as a function of a delay. This complicates the problem even further and makes searching for an uncorrelated set more difficult.
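A minimal sketch of such a screening step is shown below, assuming a greedy selection rule and checking lagged Pearson correlations in both directions. All names, the lag window, and the toy data are illustrative; the 0.90 threshold matches the value used later in the paper.

```python
import numpy as np

def max_lagged_corr(u, v, max_lag=5):
    """Largest |Pearson correlation| between two signals over lags
    0..max_lag, checked in both shift directions."""
    best = 0.0
    for x, y in ((u, v), (v, u)):
        for lag in range(max_lag + 1):
            a, b = x[lag:], y[:len(y) - lag]
            best = max(best, abs(float(np.corrcoef(a, b)[0, 1])))
    return best

def select_uncorrelated(X, threshold=0.90, max_lag=5):
    """Greedily keep columns that are not (lag-)correlated above the
    threshold with any already-kept column."""
    kept = []
    for j in range(X.shape[1]):
        if all(max_lagged_corr(X[:, j], X[:, k], max_lag) < threshold
               for k in kept):
            kept.append(j)
    return kept

# Toy data: column 1 is a 2-sample delayed copy of column 0 (its zero-lag
# correlation is below 0.90, so only the lagged check catches it);
# column 2 is unrelated.
t = np.arange(200)
c0 = np.sin(0.3 * t)
c1 = np.empty_like(c0)
c1[2:] = c0[:-2]
c1[:2] = c0[0]
c2 = np.linspace(0.0, 1.0, 200)
X = np.column_stack([c0, c1, c2])
kept = select_uncorrelated(X)
```

The delayed copy is rejected even though its instantaneous correlation with column 0 is below the threshold, illustrating why the lagged check matters.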

Multivariate Data Segmentation Algorithm
In order to consider multivariate data segmentation, the following changes need to be made to the overall data segmentation algorithm:

1.
Preprocessing: The data set is first loaded and preprocessed. Often preprocessing consists of scaling and centering the data.

2.
Mode Changes: Any known mode changes should be given to the algorithm. This will allow for better segmentation of the data set. The modes could be based on the state of the process control system, for example, whether certain PID loops are in cascade, automatic, or manual mode, or the mode may refer to an operating mode of the plant, such as low or high throughput mode.

3.
Correlation Detection: For each mode, determine which of the variables are correlated with each other. This problem can quickly become intractable if time delays are considered. From each group of correlated variables, only a subset can be selected for analysis. Otherwise, the methods will all return the result that there is no viable model, as the regression matrix will not be invertible [6].

4.
Segmentation: For each mode, perform the following steps: (a) Initialization: Initialize the region counter to the current data point, that is, set k_init = k. (b) Computation: Compute the variances of the signals and the condition number of the information matrix. (c) Comparison: Compare the variances, the condition number, and the significance of the parameters against the thresholds. (d) Failure: Should the thresholds fail to be met, then go to the next data point, that is, set k = k + 1, and go to Step 4.b. (e) Success: Otherwise, set k = k + 1, and go to Step 4.c. The "good" data region is then [k_init, k].

5.
Termination: The procedure ends when k equals N, that is, the total number of data points for the given mode.

6.
Simplification: Compare adjacent regions, and determine if they belong to the same model using an appropriate metric, such as an entropy-based metric. In practice, the method may be too eager to split adjacent regions on the slightest of changes [13].
In Laguerre-based data segmentation, the segmentation is performed using orthogonal Laguerre models that can handle unknown time delays. The general, ith-order Laguerre model can be written as

L_i(z^{-1}) = \frac{\sqrt{1 - \alpha^2}}{1 - \alpha z^{-1}} \left( \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} \right)^{i-1}   (1)

where L_i is the ith-order Laguerre basis function, α is a time constant, and z^{-1} is the backshift operator. The resulting model can then be written as

y(t) = \sum_{i=1}^{N_g} \theta_i L_i(z^{-1}) u(t) + e(t)   (2)

where y(t) is the output signal, u(t) is the input signal, e(t) is the error, θ_i is the to-be-determined coefficient, and N_g is the Laguerre order of the process. The parameters for the model given by (2) can be obtained using standard regression analysis. In order to simplify the computation of the required variances, they are computed recursively using a tunable forgetting factor [7]. The Laguerre model parameters, α and N_g, are set based on the guidance in [7], which relates them to the continuous time delay θ and the sampling time τ_s. For this paper, α has been set to 0.95 and N_g to 6. The thresholds have been set based on the guidelines presented in [12]: the variance threshold has a value of 10^{-7}, the regression threshold for the parameter variance has been set to 10^{-3}, and the threshold on the condition number has been set to 1000.
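A minimal sketch of the assessment step is given below, assuming the standard discrete-time Laguerre filter recursions and the parameter values quoted above (α = 0.95, N_g = 6, variance threshold 10⁻⁷, condition-number threshold 1000). This is an illustration, not the authors' implementation; in particular, the variances here are computed in batch rather than recursively with a forgetting factor.

```python
import numpy as np

def laguerre_regressors(u, alpha=0.95, order=6):
    """Discrete-time Laguerre filter bank using the standard recursions:
    l_1(t)     = a*l_1(t-1) + sqrt(1-a^2)*u(t)
    l_{i+1}(t) = a*l_{i+1}(t-1) + l_i(t-1) - a*l_i(t)"""
    n, g = len(u), np.sqrt(1.0 - alpha ** 2)
    L = np.zeros((n, order))
    for t in range(n):
        prev = L[t - 1] if t > 0 else np.zeros(order)
        L[t, 0] = alpha * prev[0] + g * u[t]
        for i in range(1, order):
            L[t, i] = alpha * prev[i] + prev[i - 1] - alpha * L[t, i - 1]
    return L

def sufficiently_exciting(u, y, alpha=0.95, order=6,
                          var_min=1e-7, cond_max=1000.0):
    """Check one segment against the variance and condition-number
    thresholds on the Laguerre information matrix."""
    if np.var(u) < var_min or np.var(y) < var_min:
        return False
    Phi = laguerre_regressors(u, alpha, order)
    return bool(np.linalg.cond(Phi.T @ Phi) <= cond_max)

# Toy segment: white-noise input through a first-order response.
rng = np.random.default_rng(0)
u = rng.standard_normal(3000)
y = np.zeros_like(u)
for t in range(1, len(u)):
    y[t] = 0.9 * y[t - 1] + 0.1 * u[t - 1]
```

For a well-excited input the Laguerre regressors are nearly orthogonal, so the information matrix is well conditioned; a flat segment fails the variance check immediately.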

Process Description
Data from a concentrator circuit are used as a case study to evaluate the effectiveness of the DQA method in identifying regions of data from which to construct control system models. The circuit in question is located in the Limpopo province of South Africa and is a single-stream platinum-group-metals concentrator with a nameplate capacity of 600 kton/month. Recent de-bottlenecking activities have seen the throughput of the plant increase to more than 700 kton/month. The primary mill, shown in Figure 1 and the focus of this study, is supplied with crushed feed at a top size of 10 mm by a dry crushing section which comprises a primary gyratory crusher, three cone crushers, and a high-pressure grinding roll. The wet beneficiation circuit operates in a mill-float-mill-float configuration that includes two 17 MW ball mills for primary and secondary grinding duty as well as 14 primary and 16 secondary flotation cells. Secondary grinding is augmented by 5 IsaMills, increasing the installed power for comminution to approximately 50 MW. The circuit variables are listed in Table 1. The primary milling circuit comprises a 26 radius × 28 length ball mill with a wrap-around ABB motor and a variable speed drive (VSD). P mill represents the power consumed by the mill, SPEED the variable rotational rate of the mill, and LOAD the total mill load as measured by load cells. As shown in Figure 3, fresh feed into the mill (MFO) is controlled by a proportional, integral, derivative (PID) controller that cascades to a VSD feeder under the silo. Water into the mill (MIW) is controlled as a ratio to the fresh feed (r MIW), cascading from a ratio controller to an inlet water PID controller. Both feed and water ratio set points are supplied by either the operator, via the Supervisory Control and Data Acquisition (SCADA) system, or the advanced process control (APC) system. The mill discharges through a partial overflow mechanism into the discharge sump.
Two cyclone packs with 10 cyclones each in a duty/standby operating philosophy close the circuit loop. The coarse underflow is recycled back to the mill with the finer overflow reporting to the rougher feed surge tank and onto primary flotation. The APC also controls the discharge sump level (h sump ), discharge density (ρ so ), and cyclone cluster pressure (P c1 and P c2 ) by manipulating the discharge sump dilution water flow rate (SFW), cyclone slurry flow rate (CFF), and the number of open cyclones (N c1 and N c2 ) in a cluster. The final product particle size passing 75 µm (PSI) at the cyclone overflow is measured by a BlueCube analyzer. The advanced control strategy includes a layered approach consisting of PID, fuzzy logic-based and model predictive control, as described by Steyn et al. [14].

Process Data
The site uses a Siemens PCS7 programmable logic controller with a WinCC human-machine interface that connects to an OSI PI historian using Open Platform Communication. Analogue instrument data are logged to the database at a maximum frequency of 1 Hz. The primary mill data used during this project were sampled at a period of 2 min (0.008 Hz) over a duration of 23 weeks from 1 September 2019 to 10 February 2020.
The period was selected as it recorded an average process usage exceeding 90%. The period also incorporates data from before and after the mill online grind analyzer commissioning date of the 31 October 2019. From a control perspective, the data contain periods where the mill was under manual control as well as periods where the APC was active.
Step tests to improve the dynamic step response models used by the model predictive control (MPC) algorithm of the APC were performed between 1 and 3 February 2020. These data were collected at a one-minute frequency and are used in this study to derive models for comparison purposes.

Linear Empirical Models
The majority of industrial MPC systems use linear empirical models in their formulation [15]. Standard practice has been to develop these models by performing planned tests on the plant. These step tests are designed to generate the data needed to produce linear models of the required fidelity. Performing a manual step test on a reasonably sized multivariable controller represents a significant time investment, particularly as engineer supervision is often required for the duration of the test. As an example, the controller employed on this plant has eight manipulated variables. Assuming a time to steady state of one hour and that eight steps per variable are required, the total step test time may be calculated as 8 MVs × 8 steps × 1 h = 64 h. If stepping can occur for eight hours per day, this is eight days of effort. The software vendors have introduced automated step testing software to assist in this area [16,17]. While these tools are definitely a boon to the MPC engineer, they still require that an initial model matrix ("seed matrix") be derived for the system before the automated process can commence.
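The arithmetic behind this estimate can be written out explicitly, using only the figures quoted in the text:

```python
# Back-of-envelope stepping time from the figures quoted in the text.
n_mvs = 8            # manipulated variables on this controller
steps_per_mv = 8     # steps required per variable
ttss_h = 1           # time to steady state per step, in hours

total_h = n_mvs * steps_per_mv * ttss_h   # total stepping time in hours
days = total_h / 8                        # at eight stepping hours per day
print(total_h, days)  # 64 hours, 8.0 days
```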

Linear Model Identification
Given a data set consisting of a set of inputs to a plant and the resulting outputs, several routines are available to derive the linear time-invariant models used in linear MPCs. Methods used include the following:
• finite impulse response (FIR) modeling,
• autoregressive with exogenous inputs (ARX) modeling, and
• subspace modeling (SS).
Further details of these algorithms are available in [15,18,19]. The FIR and ARX methods have the advantage of simplicity. Their disadvantage is that they are multi-input, single-output (MISO) methods and, therefore, do not take full advantage of the relationships in the data. In addition, these methods are not good at managing closed-loop data where the inputs are correlated [19]. Subspace models are less intuitive for the practitioner, but are multi-input, multi-output (MIMO) models by nature, and manage input correlation much better than the MISO methods.
For this study, FIR and SS methods are used. The particular SS method used is called canonical variate analysis (CVA) [20]. These methods are implemented in a commercially available package, allowing for fast identification of multiple cases. For FIR, the user must specify a time to steady state (TTSS), the number of coefficients, and optionally a smoothing factor. A TTSS is also supplied for the SS fits, where it is used for filtering purposes only. In addition, the maximum number of states per CV group and the maximum order per I/O pair can be specified. For this study, these parameters were left at their default values. If a CV is to be modeled as an integrating variable, this must be specified.
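As an illustration of the FIR idea only (the study itself uses a commercial identification package, and smoothing is omitted here), an FIR model can be fitted by ordinary least squares on a toy data set:

```python
import numpy as np

def fit_fir(u, y, n_coef):
    """Least-squares FIR fit, y(t) ~ sum_{k=0}^{K-1} h[k] * u(t-k).
    A plain sketch, not the commercial identification package."""
    n = len(u)
    # Row t of Phi holds [u(t), u(t-1), ..., u(t-K+1)] for t >= K-1.
    Phi = np.column_stack([u[n_coef - 1 - k: n - k] for k in range(n_coef)])
    h, *_ = np.linalg.lstsq(Phi, y[n_coef - 1:], rcond=None)
    return h

# Toy check: a PRBS-like input through a known 4-tap impulse response.
rng = np.random.default_rng(1)
u = rng.choice([-1.0, 1.0], size=2000)
h_true = np.array([0.5, 0.3, 0.15, 0.05])
y = np.convolve(u, h_true)[: len(u)]

h = fit_fir(u, y, 4)
step_response = np.cumsum(h)   # unit step response; final value is the gain
```

With noiseless data and a persistently exciting input, the impulse response is recovered essentially exactly, and its cumulative sum gives the step-response curves of the kind plotted later in Figure 8.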

Semirigorous Mill Circuit Model
Apart from using a linear empirical model, the circuit in Figure 3 can also be modeled using the semirigorous nonlinear population balance model of Le Roux et al. [21]. A brief overview of the model is given below. Table 1 shows the circuit variables, and Table 2 lists the model parameters. For the purposes of this model, ore fragments too large to pass through the partial overflow mechanism are referred to as rocks and must be broken further. All ore fragments small enough to pass to the sump are referred to as solids. The broken ore below the specification size is referred to as fines. Note that whereas solids refer to all ore small enough to discharge from the mill, fines refer to the portion of solids smaller than the specification size. Therefore, solids are a combination of fine ore and coarse ore, where coarse ore refers to the portion of solids larger than the specification size.

Mill Model
The population volume balance model of the mill describes four states: water (x_mw), solids (x_ms), rocks (x_mr), and fines (x_mf):

\dot{x}_{mw} = MIW + Q_{cwu} - \phi d_q \frac{x_{mw}^2}{x_{ms} + x_{mw}}   (5)

\dot{x}_{ms} = (1 - \alpha_r) \frac{MFO}{\rho_o} + Q_{csu} - \phi d_q \frac{x_{mw} x_{ms}}{x_{ms} + x_{mw}} + Q_{RC}   (6)

\dot{x}_{mr} = \alpha_r \frac{MFO}{\rho_o} - Q_{RC}   (7)

\dot{x}_{mf} = \alpha_f \frac{MFO}{\rho_o} + Q_{cfu} - \phi d_q \frac{x_{mw} x_{mf}}{x_{ms} + x_{mw}} + Q_{FP}   (8)

where α_f and α_r represent the fraction of fines and rocks in MFO, respectively; ρ_o is the ore density; d_q is the discharge rate; Q_cwu, Q_csu, and Q_cfu are the cyclone underflow of water, solids, and fines, respectively; Q_RC is a rock consumption term that indicates the volumetric rate of rocks broken into solids; and Q_FP is a fines production term that indicates the volumetric rate of ore broken into fines. The mill inlet water is given by MIW = r_MIW · MFO. The rheology factor φ is an empirically defined function that incorporates the effect of the fluidity and density of the slurry on the performance of the mill [21],

\phi = \left[ \max\left( 0,\; 1 - \left( \frac{1}{\varepsilon_{sv}} - 1 \right) \frac{x_{ms}}{x_{mw}} \right) \right]^{1/2}   (9)

where ε_sv is the maximum fraction of solids by volume in the slurry. The fraction of the mill filled with charge (J_T) is given by

J_T = \frac{x_{mw} + x_{ms} + x_{mr} + x_{mb}}{v_{mill}}   (10)

where v_mill (m³) is the inside volume of the mill and x_mb is the volume of balls in the mill, assumed constant. Since the mill charge is measured by means of load cells, there is a linear relationship between J_T and LOAD [22],

LOAD = a_{JT} J_T + b_{JT}   (11)

where a_JT and b_JT are calibration constants. The mill power draw (P_mill) is modeled as

P_{mill} = P_{max}(\phi_c) \left[ 1 - \delta_v \left( \frac{J_T}{J_T^{Pmax}} - 1 \right)^2 - \delta_s \left( \frac{\phi}{\phi_N} - 1 \right)^2 \right]   (12)

where δ_v is the power change parameter for the volume of mill filled, δ_s is the power change parameter for the volume fraction of solids in the slurry, φ_N is a normalization factor, and J_T^Pmax is the fraction of the mill filled at maximum power draw. The fraction of critical mill speed φ_c is calculated as [23]

\phi_c = \frac{SPEED \cdot 2\pi/60}{\sqrt{2g/D}}   (13)

where D is the internal mill diameter and g is the gravitational acceleration. The maximum mill power draw P_max(φ_c) is parameterized as a linear function of φ_c [24],

P_{max}(\phi_c) = p_{max,m} \phi_c + p_{max,c}   (14)

where p_max,(m,c) are function constants. The rock consumption (Q_RC) and fines production (Q_FP) in (6)-(8) are defined as

Q_{RC} = \frac{P_{mill}\, \phi}{\rho_o K_{rc}} \cdot \frac{x_{mr}}{x_{mr} + x_{ms}}   (15)

Q_{FP} = \frac{P_{mill}}{\rho_o K_{fp} \left[ 1 + K_{fpJT} \left( J_T - J_T^{Pmax} \right) \right]}   (16)

where K_rc is the rock consumption factor, indicating the energy required per tonne of rocks broken, and K_fp is the fines production factor, indicating the energy required per tonne of fines produced.
The fines production factor is modified by the fractional change in power per fines produced per change in fractional filling of the mill K f pJT .

Sump Model
The population volume balance of the sump hold-ups, water (x_sw), solids (x_ss), and fines (x_sf), is defined as

\dot{x}_{sw} = \phi d_q \frac{x_{mw}^2}{x_{ms} + x_{mw}} + SFW - Q_{swo}   (17)

\dot{x}_{ss} = \phi d_q \frac{x_{mw} x_{ms}}{x_{ms} + x_{mw}} - Q_{sso}   (18)

\dot{x}_{sf} = \phi d_q \frac{x_{mw} x_{mf}}{x_{ms} + x_{mw}} - Q_{sfo}   (19)

where the discharge flow rates are

Q_{swo} = CFF \frac{x_{sw}}{x_{sw} + x_{ss}}   (20)

Q_{sso} = CFF \frac{x_{ss}}{x_{sw} + x_{ss}}   (21)

Q_{sfo} = CFF \frac{x_{sf}}{x_{sw} + x_{ss}}   (22)

The percentage of the sump filled with slurry (h_sump) and the sump outflow density (ρ_so) are defined as

h_{sump} = 100 \cdot \frac{x_{sw} + x_{ss}}{v_{sump}}   (23)

\rho_{so} = \frac{\rho_w x_{sw} + \rho_o x_{ss}}{x_{sw} + x_{ss}}   (24)

where v_sump is the total sump volume and ρ_w is the density of water.
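Assuming the sump balances described above, the sump subsystem can be simulated with a simple explicit Euler scheme. All flow rates, volumes, and densities below are made-up illustrative values, not plant data.

```python
def sump_step(x, q_mw, q_ms, q_mf, sfw, cff, dt):
    """One Euler step of the sump balances (17)-(19) with outflows (20)-(22).
    The inflow arguments stand in for the mill discharge components."""
    x_sw, x_ss, x_sf = x
    slurry = x_sw + x_ss
    q_swo = cff * x_sw / slurry
    q_sso = cff * x_ss / slurry
    q_sfo = cff * x_sf / slurry
    return (x_sw + dt * (q_mw + sfw - q_swo),
            x_ss + dt * (q_ms - q_sso),
            x_sf + dt * (q_mf - q_sfo))

def sump_outputs(x, v_sump, rho_w=1.0, rho_o=3.2):
    """Sump level (23) and discharge density (24); rho_o = 3.2 t/m^3 is an
    assumed ore density."""
    x_sw, x_ss, _ = x
    h_sump = 100.0 * (x_sw + x_ss) / v_sump
    rho_so = (rho_w * x_sw + rho_o * x_ss) / (x_sw + x_ss)
    return h_sump, rho_so

# Simulate to steady state with consistent (but invented) flows in m^3/h.
# Total water + solids in (40 + 10 + 20 = 70) equals CFF, so the slurry
# volume stays constant while the composition settles.
x = (4.0, 2.0, 0.6)
for _ in range(1000):
    x = sump_step(x, q_mw=40.0, q_ms=20.0, q_mf=6.0,
                  sfw=10.0, cff=70.0, dt=0.01)
h_sump, rho_so = sump_outputs(x, v_sump=8.0)
```

Because the cyclone feed flow equals the total slurry inflow here, the level holds steady while the water and solids hold-ups converge to their steady-state split.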

Cyclone Cluster Model
The cluster of cyclones is modeled as a single classifier. If needed, the model can be expanded into separate smaller cyclones as in Botha et al. [25]. However, the goal here is simply to calculate the total water, solids, and fines split at the cluster.
The underflow of coarse material (Q_ccu) is modeled as a function of the cyclone feed flow rates and composition [21]. To determine the amount of water and fines accompanying the coarse underflow, the fraction of solids in the underflow (F_u) must be determined. This is modeled as a function of Q_ccu and the feed conditions, where the parameter α_su relates to the fraction of solids in the underflow [21]. The cyclone underflow flow rates Q_cwu, Q_csu, and Q_cfu in (5)-(8) are then computed from Q_ccu and F_u. The product particle size is defined as the fraction of fines to solids in the cyclone overflow,

PSI = \frac{Q_{cfo}}{Q_{cso}}

where Q_cfo and Q_cso are the cyclone overflow of fines and solids, respectively.

Data Quality Assessment Results
The DQA will be performed on two data sets available from the mill process. The first data set, which will be referred to as the "complete data set", contains data for approximately one year. The second data set, which will be referred to as the "step data set", contains a series of step tests on the same mill for a period in February 2020.
For the complete data set, there are sixteen variables, of which SFW was selected as the output. The other variables were candidates for inclusion as inputs. For the step data set, there were 30 variables, of which the grinding rate and the combined pressure of the two cyclones were selected as the key output variables. As can be seen from the mill circuit in Figure 3, the two cyclone packs are installed in parallel but do not operate simultaneously: if one is on, then the other is off. Therefore, the combined pressure is simply the value of the active cyclone. The choice of variables is informed more by process knowledge than by any fundamental statistics; this is why different sets are chosen. How to choose the best inputs and outputs for the algorithm is an area for further research.

Analysis of the Complete Data Set
The analysis of the complete data set can be split into two parts: correlation analysis and data segmentation. Correlation refers to the absolute value of the standard Pearson correlation. A signal is assumed to be correlated if the Pearson correlation is greater than 0.90. Figure 4 shows the cross-correlation plot for all the variables. The x-axis has the same labeling as the y-axis. The color scale ranges from deepest blue, which corresponds to a correlation of zero, to deepest red, which corresponds to a perfect correlation of 1. Variables shown in shades of dark red are correlated with one another. From Figure 4, it can be seen that the first three variables, as well as the ninth variable, are strongly correlated with each other. Likewise, the eighth and ninth variables are strongly correlated with each other. Finally, the fourteenth and fifteenth variables are also strongly correlated with one another.

Correlation Analysis of the Complete Data Set
This means that only one variable out of each correlated group should be used in the data segmentation task.
Figure 4. Cross-correlation plot for the complete data set.

Data Segmentation for the Complete Data Set
In order to investigate the effects of correlation on the data segmentation results, the following cases will be considered:

1.
Case 1: Here the partitioning will be performed using MFO, MIW, SPEED, and SFW (that is, the first four variables).

2.
Case 2: Here the partitioning will be performed using only MFO.

3.
Case 3: Here the partitioning will be performed using only MFO, r MIW , and SFW.

4.
Case 4: Here the partitioning will be performed using only h sump, which is a quasi-integrating variable. Figure 5 shows the partition results for all four cases. The x-axis is the time in units of samples from the initial time period, while the y-axis shows the partition number. Adjacent times with the same partition number can be assumed to belong to the "same" model. A y = x line implies that the data are not sufficiently excited or good for modeling. Regions with the same or similar partition number separated by a small region can also be assumed to belong to the same overall model. From Figure 5 (top left), which shows the results for Case 1, it can be seen that using the first four variables provides no regions that are suitable for system identification. This is a direct consequence of the fact that the selected variables are correlated with each other, forcing the information matrix to be uninvertible. This clearly shows the issues involved with multivariate data segmentation. Figure 5 (top right) presents the partition results for Case 2. It can be seen that using only a single variable provides results that suggest there is a relatively large chunk of data that could potentially be considered not only to come from a single model, but also to be sufficiently excited for modeling. These data will then be used in the next section for modeling to examine the effect of the segmented data on the overall models obtained for both linear and nonlinear situations. Figure 5 (bottom left) shows the partition results for Case 3. It can be noted that although there is some correlation in the data set, it is perhaps possible to obtain something useful. This will test the effect of correlation on the data segmentation.
From Figure 5 (bottom left), it can be seen that using three variables for partitioning gives similar results to using a single variable. With the increase in variables, there are now more individual partitions. Nevertheless, it can be seen that this case is very similar to that with only one partitioning variable (cf. Figure 5 (top right)). Further, it would seem that the third variable is the troublesome one that is correlated with the other variables. Furthermore, it should be noted that, strictly speaking, any linear combination of variables is problematic. However, such a circumstance will not be found using a simple correlation plot. In addition, there is a need to consider time-delayed variables and their lagged cross-correlation. Finally, Figure 5 (bottom right) shows the partition results for Case 4. It can be noted that with an integrating input, it is necessary to preprocess the signal by integrating it. The integrating nature of the signal is quite clear from the autocorrelation plot shown in Figure 6. Here, it can be noted that the autocorrelation plot shows a typical pattern associated with an integrating or autoregressive process, where the values decrease steadily but there is no cut-off point [6]. For a truly integrating process, the values would be expected to remain more or less constant at 1 [6]. In practice, due to sampling issues, such behaviour may not be visible; instead, the values decrease, as seen here. From Figure 5 (bottom right), it can be seen that the partitioning results are similar to those obtained before, showing that the exact choice of partitioning variables is not critical: as long as appropriately uncorrelated variables are selected, relatively similar results are obtained.

Data Quality Assessment for the Step Data Set
Using the step data set for DQA shows similar results to the previous case. Note that as a step test can often provide sufficient excitation for modeling a process, much of the data set is selected as being valid. However, the results depend strongly on the variables selected since, as before, some of the variables are strongly correlated with each other. As the overall results are the same as before, they are not provided here.

Results
The models discussed above were used with the two sets of data to generate two linear and two nonlinear models.

Results from the Routine Operating Data
DQA indicates that a section of data consisting of a little over 19 days of data was suitable for identification. The modeled system MVs and CVs for this period are shown in Figure 7.
The light gray vertical shading in the figures indicates data that have been excluded from the fitting for all variables. The dark gray shading indicates data that have been excluded for that variable only.

Results Using the Linear Empirical Models
The fitting routine defines cases consisting of a set of MVs and CVs. If any variable in the set is excluded or has bad data, the fit is not performed for the whole set. For CVs, this effect can be minimized by identifying them one by one. This is not recommended for the SS algorithm, which takes advantage of the correlation structure of the CVs. The DQA model includes the number of cyclones open in the pack as an MV.
It would be tempting to include the flow out of the sump as an independent variable for modeling. Analysis shows that this total flow out of the sump is not an independent variable, as it is used in closed-loop control of the sump level.
The calculated models for the data shown in Figure 7 are shown in Figure 8. The CVA method was used to generate a matrix of SS models for the system. The linear time-invariant models are represented as unit step responses, i.e., the response of the output variable (columns of the matrix) to a step of one engineering unit of the input (rows of the matrix) at time zero. The timescales are in minutes, and the response curves are in engineering units. The number in the top right corner of each plot is the gain of that particular model, or the steady-state value. More analysis of these models will be presented below, where they are compared with models derived from planned step test data. Suffice it to say that some key linear models, such as feed and speed to load, speed to power, feed, inlet and sump water to density, and feed and number of cyclones to cyclone pressure, have been identified successfully.
A further point to note is that not all models calculated are implemented in the seed models used to bootstrap an automated stepping campaign. The reason for this is that if the models are not well known, then weak directions need to be avoided. The models that would be deleted are shown with red boxes around them in Figure 8. The load and power to sump feed water and number of cyclones to inlet water ratio models would be removed on physical grounds. The discharge density and cyclone pressure to speed model would be removed on the basis of small scaled gains. The resulting matrix would be robust enough for seed model purposes.

Results Using the Semirigorous Mill Circuit Model
The model parameters for the model presented in Section 3.4 can be estimated from the routine operating process data representing a steady-state operating condition. In other words, it is necessary to estimate parameters for data where the MVs and CVs remain constant such that ẋ = 0 in (5)-(8) and (17)-(19). The estimation procedure follows an algebraic routine to sequentially determine the parameters in Table 2. Although it is advantageous that the model be fitted to a single steady-state operating condition, the procedure is dependent on accurate process measurements. Any unmeasured spillage water added at the sump has a significant impact on the simulation accuracy of the model. If this additional water is not accounted for, the model may predict that the sump will run dry very quickly. The model accuracy is also dependent on a good characterization of P mill. Specifically, it is important to have a reasonably accurate parameterization of P max in (14) and to know if the plant is operating either before or past the peak in P mill [22,24]. Once the model states and parameters are determined for the specific steady-state operating condition, pure simulation is used to produce the model outputs.

Results from the Planned Step Tests
As discussed above, planned step tests were made on the process. The key system MVs and CVs for this period are shown in Figure 9.

Results Using the Linear Empirical Model
The planned step data are particularly suitable for identifying linear dynamic models, as the inputs tend to be uncorrelated. The CVA method was used to generate a matrix of SS models for the system. The results are shown in Figure 10, where they are overlaid with the models calculated from the planned steps. Note that no steps were made on the number of cyclones in operation, so this variable is excluded from the models. It would be usual to delete some of these curves based on either physical intuition or small gains. This has not been done here, to retain all the curves for comparison with those calculated from the DQA-determined data. Visually, it is clear that there is reasonable agreement between these models in some cases, but there are some that agree poorly. To quantify this, the ratio of the gains of the DQA model to the planned step test model is formed for each submodel. The results are shown in Table 3.

Table 3. Ratio of gains between the DQA and planned step test models. Red background: negative; orange background: greater than two; yellow background: positive, less than 0.5.

Assuming the planned step test model is accurate, a negative gain ratio means that the DQA model could potentially cause instability if used for process control. From experience, gains that are up to twice as large or as small can be tolerated, particularly if used as seed models. In this case, of the twelve models, six have acceptable gain ratios, three have negative gain ratios, one has a ratio of less than 0.5, and one has a ratio a little higher than two.
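The gain-ratio screening described above can be sketched as follows (a minimal illustration of the Table 3 thresholds, not the authors' implementation):

```python
def classify_gain_ratio(g_dqa, g_step):
    """Classify the ratio of a DQA-model gain to the corresponding
    planned-step-test-model gain, using the Table 3 colour categories."""
    r = g_dqa / g_step
    if r < 0:
        return r, "negative: potential instability if used for control"
    if r > 2:
        return r, "greater than two"
    if r < 0.5:
        return r, "positive, less than 0.5"
    return r, "acceptable (between 0.5 and 2)"
```

Applied to each of the twelve submodel pairs, this reproduces the count of acceptable versus suspect models reported above.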

Results Using the Semirigorous Mill Circuit Model
Similarly to Section 5.1.2, the parameter fitting procedure used to fit the semirigorous model to the data identified by DQA was applied to the planned step test data. The only change necessary was to reparameterize P max in (14).

Comparison of the Linear and Nonlinear Models
Although not the primary objective of this work, the availability of both a linear, data-driven model and a semirigorous, fundamentally derived dynamic model of the plant invites a comparison of the accuracy and utility of the two approaches.

Figure 11. Comparison of the CV predictions using the linear and nonlinear models for a short section of DQA data.
The results from modeling the planned step test data are used to compare the two models. A visual comparison of the predictions of the two models is shown in Figure 12.

Figure 12. Comparison of the CV predictions using the linear and nonlinear models for the planned step test data.
The goodness-of-fit statistics for the two models are given in Table 4. All the models have small mean squared errors (MSEs). The correlation coefficients for the power (P mill ) and discharge density (ρ so ) are very good; those for the load (LOAD) and, in particular, the grind (PSI) are poorer. The nonlinear model gives a superior prediction of the discharge density and load. Further investigation would be needed to determine why the grind predictions are poor, as this could be due to either an unmodeled effect or a fault in the instrument.

Conclusions
With the increasing need for automated data processing methods, this paper examined the application of a DQA method, comparing linear and nonlinear models of the system fitted to the identified data with those obtained from planned step tests. First, it was shown that one of the challenges of multivariate DQA is selecting the appropriate set of variables for partitioning. Nevertheless, the DQA method was able to uncover a large region of data suitable for modeling. Second, it was shown that both the linear and nonlinear models of the primary mill could be fitted using the identified region. In addition, planned step test data were used for comparison. The closed-loop data identified by the DQA method provided poorer estimates than the planned step tests, which can partly be explained by noting that such closed-loop data present a number of challenges for identification. Nevertheless, the models obtained would have been sufficient as initial seed models for automated APC methods. Future work will consider extending the applicability of the DQA method to more complex situations, with the eventual goal of better identifying appropriate regions in complex industrial data sets and applying the method to various different types of models.

Data Availability Statement:
The industrial data presented in this study is not publicly available.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: