3.1. Principal Component Analysis (PCA)
The process variables collected during the performance test and included in the PCA are summarized here in two tables, with the first being the NRU stream variables (Table 1) and the second being the NRU operating variables (Table 2). The units in these two tables reflect the units associated with the automatic sensors. An overall mass balance was initially performed around the NRU to verify that the sensors being used were behaving consistently throughout the test period. Results showed that variations in total input and total output mass flows were being tracked consistently.
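As a rough illustration of this kind of sensor consistency check, the following minimal sketch computes a relative mass balance closure error for each time sample. The function name, the stream values and the 1% tolerance used below are hypothetical and are not taken from the performance test.

```python
def mass_balance_closure(inflows, outflows):
    """Return the relative closure error (total out - total in) / total in."""
    total_in = sum(inflows)
    total_out = sum(outflows)
    return (total_out - total_in) / total_in


# Hypothetical samples: (inlet stream flows, outlet stream flows) in the same mass units.
samples = [
    ([950.0, 50.0], [900.0, 98.0]),
    ([1000.0, 60.0], [955.0, 108.0]),
    ([1100.0, 55.0], [1040.0, 112.0]),
]

errors = [mass_balance_closure(inflow, outflow) for inflow, outflow in samples]
for k, err in enumerate(errors):
    print(f"sample {k}: closure error = {err:+.2%}")
```

Small closure errors that stay within a narrow band as the total flow varies would suggest that the input and output sensors are tracking each other consistently.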
PCA was used to help determine if the data set was representative of one or more operating modes. This is similar to the classical discrimination problem presented in the tutorial of [5] and to the analysis of historical process data sets suggested in the tutorial of [6]. The analysis was carried out in Minitab using the correlation matrix, so that each variable was mean-centred and scaled by its standard deviation. Data points with missing values were removed from the data set manually. Outliers were removed through a combination of visual inspection and the PCA itself [5]. Missing data and outliers represented less than 10% of the original data set.
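The preprocessing and PCA steps described above can be sketched in Python with NumPy. This is only an illustrative sketch, not the Minitab procedure used in the study: the function name `pca_correlation` and the synthetic two-mode data are assumptions introduced here.

```python
import numpy as np


def pca_correlation(X, n_components=2):
    """PCA on the correlation matrix: drop incomplete rows, standardize, project."""
    X = np.asarray(X, dtype=float)
    X = X[~np.isnan(X).any(axis=1)]                    # remove rows with missing values
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # mean-centre and scale each variable
    R = np.corrcoef(Z, rowvar=False)                   # correlation matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(R)               # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # indices of the largest eigenvalues
    loadings = eigvecs[:, order]
    scores = Z @ loadings                              # coordinates in the reduced space
    explained = eigvals[order] / eigvals.sum()         # fraction of variance per component
    return scores, loadings, explained


# Synthetic data with two hypothetical operating modes and three variables.
rng = np.random.default_rng(0)
mode1 = rng.normal(loc=[0.0, 0.0, 0.0], scale=0.3, size=(50, 3))
mode2 = rng.normal(loc=[3.0, 3.0, -3.0], scale=0.3, size=(50, 3))
X = np.vstack([mode1, mode2])
X[5, 1] = np.nan                                       # one missing value, dropped above

scores, loadings, explained = pca_correlation(X)
print("variance explained by PC1 and PC2:", explained)
```

The sign of each principal component is arbitrary, so which cluster falls on the left or right of a score plot depends on the software and has no physical meaning.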
Figure 3b shows a plot of the first two principal components. The data set in this reduced dimension clearly clusters into two distinct regions. By comparing the data points in each cluster to the time series plot of the steam injection schedule during the performance test (Figure 3a), it was determined that all the data points in the left-hand cluster came from data collected during the performance test up to August 4th (referred to as OP1) and all the data points found in the right-hand cluster came from the data collected after August 4th (referred to as OP2). This would seem to indicate that a distinct shift in process behaviour occurred on August 4th, coinciding with the large increase in steam flowrate that occurred at that time. Therefore, if statistical models were going to be built for prediction purposes, it would probably be best to identify separate models for the process in OP1 and OP2 [7].
3.2. Linear Regression Models
In this part of our analysis, our goal was to generate an equation that describes the statistical relationship (model) between one or more predictors (regressors) and the primary response variable of interest in this study, naphtha recovery (NR), as defined in Equation (1). Separate models were developed using Minitab for each operating mode (OP1 and OP2) based on our PCA. Each data set was divided randomly in half, and one half of the data was used for model development (training) while the other half was used for model validation (testing). Given that we have assumed the data is representative of the system at steady-state, partitioning the data in this way is valid.
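The random half split and the fitting of a linear model can be sketched as follows. This is a sketch under stated assumptions rather than the Minitab workflow used here: the helper names, the synthetic data and the coefficients (70, 3 and -1.5) are hypothetical and do not correspond to the fitted models.

```python
import numpy as np


def split_half(n_samples, seed=0):
    """Randomly split sample indices into two halves (training, testing)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    half = n_samples // 2
    return idx[:half], idx[half:]


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + np.asarray(X) @ coef[1:]


def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic data for one operating mode (all values are made up for illustration).
rng = np.random.default_rng(1)
n = 200
NF = rng.uniform(1.0, 5.0, n)
SF = rng.uniform(2.0, 8.0, n)
NR = 70.0 + 3.0 * NF - 1.5 * SF + rng.normal(0.0, 0.5, n)
X = np.column_stack([NF, SF])

train, test = split_half(n)
coef = fit_ols(X[train], NR[train])                     # model development
r2_test = r_squared(NR[test], predict(coef, X[test]))   # model validation
print("coefficients:", coef, "test R^2:", r2_test)
```

A purely random split of this kind ignores the time ordering of the samples, which is only reasonable under the steady-state assumption stated above.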
Several dimensionless variables were examined as possible predictors based on our engineering knowledge of the NRU. A sequence of models was constructed iteratively by starting with a large number of predictors and then reducing the model size based on the p-value of each predictor, one predictor at a time. In the end, only the predictors that made a significant and therefore meaningful contribution to the model were retained.
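One way to sketch this one-at-a-time reduction is a backward elimination loop. The details below are assumptions: the significance level of 0.05, the candidate names `X3` and `X4`, and the synthetic data are hypothetical. The p-values use a large-sample normal approximation in place of the t-distribution that a statistics package would normally use.

```python
import math

import numpy as np


def ols_p_values(A, y):
    """Fit OLS and return coefficients with approximate two-sided p-values."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = A.shape[0] - A.shape[1]
    sigma2 = resid @ resid / dof                     # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)            # covariance of the coefficients
    t = coef / np.sqrt(np.diag(cov))
    # Normal approximation: P(|Z| > |t|) = erfc(|t| / sqrt(2)).
    p = np.array([math.erfc(abs(ti) / math.sqrt(2.0)) for ti in t])
    return coef, p


def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the least significant predictor until all remaining p-values <= alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        A = np.column_stack([np.ones(len(y)), X[:, keep]])
        _, p = ols_p_values(A, y)
        p_pred = p[1:]                               # ignore the intercept
        worst = int(np.argmax(p_pred))
        if p_pred[worst] <= alpha:
            break                                    # every remaining predictor is significant
        del keep[worst]                              # remove one predictor per iteration
    return [names[j] for j in keep]


# Synthetic example: only the first two candidates actually influence the response.
rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 4))
y = 70.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
names = ["NF", "SF", "X3", "X4"]

selected = backward_eliminate(X, y, names)
print("retained predictors:", selected)
```

Because an irrelevant predictor can appear significant by chance at any fixed significance level, the retained set should still be checked against engineering knowledge of the process.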
The predictors that were determined to make a significant contribution in both OP1 and OP2 to NR are given here:
Composition of naphtha in the NRU feed (NF):
Normalized steam injection rate in the NRU feed (SF):
Here, we have made use of Equation (3) to normalize the steam injection rate by the feed mass flowrate in Equation (5). The 2.2 factor is needed for unit conversion and the multiplier of 100 at the right end of Equation (5) is used to bring SF to the same order of magnitude as NF and NR.
The fact that these variables have appeared in these models makes physical sense because, from an input-output point of view, NR is the primary output variable of interest, SF is a key manipulated input variable and NF is an important disturbance input variable. The modelling results are summarized in Table 3 and Table 4 and plots related to these models may be found in Figure 4 and Figure 5.
Based on these linear regression models, Figure 6 and Figure 7 were generated to examine the predicted individual effects of NF and SF on NR, respectively. The positive slope in Figure 6 associated with the NF to NR relationship may at first appear counterintuitive, i.e., one might expect based on mass and energy balance considerations that an increase in the mass fraction of naphtha in the feed would cause a higher naphtha loss and therefore a drop in naphtha recovery. However, given the way that naphtha recovery is defined in Equation (1), this is not necessarily the case. For example, say there are 100 units of naphtha coming in initially. If the naphtha recovery is 0.8, the naphtha loss would be 20 units. Now, let us assume the naphtha inflow increases from 100 to 130 units. The expected naphtha loss would increase as well when using the same steam flowrate; let us say it increases from 20 to 24 units. In this case, the recovery actually increases from 0.8 to approximately 0.82 (106/130), because the loss grows proportionally less than the inflow.
The negative slope in Figure 7 associated with the SF to NR relationship is also counterintuitive based on mass and energy balance considerations and is contrary to the general belief of the operators. However, it is important to point out that the NRU does not consist of only a vacuum stripping column but also includes an overhead heat exchanger system and two separators connected in series with material recycled from both separators flowing back to the column. In addition, these modelling results seem to reinforce the suspicions of the process engineers that more steam does not necessarily improve recovery because of the interaction between the column and the overhead heat exchange system. Altogether, these interesting findings are what encouraged us to dig more deeply into the problem and turn our attention away from a data-based modelling approach towards a first-principles modelling approach, as described in the following section.
Before moving to the next section, we would like to close this section by illustrating one possible application of these linear regression models. A measurement of the mass flowrate of naphtha in the tailings is given by Equations (2) and (3) written for the tailings:
Note that this measurement of naphtha in the tailings requires a measurement of the volumetric flowrate of the tailings and a composition analysis of the tailings (wt.% of each component i in the tailings).
Figure 8 and Figure 9 show comparisons of the measured and predicted mass flowrates of naphtha in the tailings for OP1 and OP2, respectively. In these plots, the training and testing data have been included to illustrate the overall fit obtained using these regression models to generate a prediction for the mass flowrate of naphtha in the tailings. The correlation between the measured mass flowrate and the predicted value is approximately 0.8 for both OP1 and OP2.
This represents a simple soft-sensor application for these linear regression models: they could be used to predict naphtha in the tailings based solely on the measured mass flowrate and composition analysis of the feed and a predicted value for naphtha recovery, without requiring measurements of the flowrate and composition of the tailings. This application highlights the need for multiple models and the ability to detect when a system has shifted from one operating mode to another. PCA could be used in an on-line manner for this purpose [7], as could a Bayesian approach [8].
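A minimal sketch of such a soft sensor is given below. It assumes that recovery is expressed as a fraction and that the naphtha lost to the tailings equals the naphtha in the feed multiplied by one minus the recovery. The model coefficients, the `MODELS` dictionary and the function names are hypothetical placeholders, not the fitted values reported in Table 3 and Table 4.

```python
# Hypothetical regression coefficients (intercept, NF slope, SF slope) per operating mode.
MODELS = {
    "OP1": (0.70, 0.03, -0.015),
    "OP2": (0.65, 0.025, -0.010),
}


def predict_recovery(mode, nf, sf):
    """Predicted naphtha recovery (fraction) from the linear model of the given mode."""
    b0, b_nf, b_sf = MODELS[mode]
    return b0 + b_nf * nf + b_sf * sf


def predict_tailings_naphtha(feed_naphtha_mass_flow, recovery_fraction):
    """Naphtha mass flow in the tailings implied by the feed and predicted recovery."""
    return feed_naphtha_mass_flow * (1.0 - recovery_fraction)


# Example: 100 mass units of naphtha in the feed while operating in mode OP1.
nr_hat = predict_recovery("OP1", nf=3.0, sf=4.0)
tailings_hat = predict_tailings_naphtha(100.0, nr_hat)
print(f"predicted recovery: {nr_hat:.3f}, naphtha in tailings: {tailings_hat:.1f}")
```

Selecting the model by an explicit mode label keeps the operating mode decision separate from the prediction itself, so an on-line mode detector could be substituted without changing the soft sensor.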