1. Introduction
Fault detection in processes is a topic of ongoing interest within the scientific field. For a long time, this task depended only on the operator's experience; with the increasing complexity of modern processes, the large number of variables to be monitored now exceeds the operator's experience, generating higher costs for plants [1].
The evolution of process automation has transformed the industry, allowing the systematic conversion of natural resources into final products without constant human supervision.
A key method that can help ensure that processes operate safely is the efficient and continuous monitoring of variables, which, in conjunction with a fault detection system, can increase reliability and ensure proper operation in the different stages and process units [2,3,4].
Fault detection approaches can be classified as model-based or data-driven. Model-based approaches describe the physical and chemical background of the process, and their models typically focus on describing ideal steady states [5]. However, mathematical models are computationally expensive and require precise knowledge of the model parameters, which can be difficult to obtain. Data-driven models, on the other hand, use data measured within the plant under normal and abnormal conditions and therefore describe the actual process conditions in the presence of faults; they are based on the idea that patterns and behaviors observed in the past can be used to establish a baseline for normal system operation [6].
In data-based fault detection, statistical techniques are used as monitoring methods [7,8]. Univariate and multivariate methods are useful for monitoring key measurements that help define the final quality of the process; the main difference between them lies in the number of variables analyzed to monitor a process [9].
Multivariate techniques are invaluable in industries with intricate and interdependent processes, such as chemical and petrochemical manufacturing, where a local failure can propagate through multiple variables, leading to consequences such as degradation, decreased productivity, and, in extreme cases, latent danger to the operator [10].
Conventional methods for monitoring multivariate processes include independent component analysis (ICA) [11], Partial Least Squares (PLS) [12], and principal component analysis (PCA) [13]; owing to their ability to handle complex data and correlations, they are powerful tools for this purpose.
ICA focuses on separating independent signals within correlated data; it is useful in processes where variables are mixed and independent sources of variation must be identified, such as in the analysis of signals or noise in complex systems.
In [14], the authors propose a novel scheme based on the estimation of ICA model parameters at each decomposition depth, where the effectiveness of the proposed FD (fault detection) strategy, based on multi-scale independent component analysis (MSICA), is illustrated through three case studies: a multivariate dynamic process, a quadruple tank process, and a distillation column process. The results indicate that the proposed MSICA FD strategy can improve performance for higher levels of noise in the data.
However, ICA assumes that the data sources are statistically independent, which may not be true in industrial systems where variables are correlated. In contrast, PCA only requires linearly correlated data [15].
On the other hand, PLS models relationships between predictor variables X and response variables Y. It is used to monitor processes where the input and output variables are highly correlated, allowing the prediction of the behavior of the process and the detection of deviations.
The features of PCA include the transformation of multivariate data into uncorrelated variables called principal components that encapsulate systematic data variations. During normal operation, data points cluster tightly in the transformed space. Faults are assessed by monitoring these components and evaluating the relationships between the variables to determine when the behavior deviates from the norm [16,17].
In industrial environments, PCA stands out for its simplicity and its ability to identify deviations, even in the presence of noise [18]. Additionally, PCA is excellent for reducing data dimensionality while preserving as much variability as possible, a feature that is particularly valuable in industrial systems, which handle enormous amounts of data. Moreover, PCA allows the observation space to be separated into two subspaces: one that captures the systematic trend of the process and another related to random noise [19,20].
Statistical measures such as Hotelling's $T^2$ and the squared prediction error (SPE) statistic are useful for defining decision thresholds that indicate the occurrence of a failure.
The application of PCA in fault detection systems is the subject of active research. In the literature, applications related to the predictive maintenance of industrial induced-draft fans can be found, considering faults such as high vibration in the internal diameter of fans and complete failure of the fan-motor system [21]. An application example that focuses on successfully detecting, isolating, and estimating incipient failures in sensors can be found in [22].
Processes such as the Tennessee Eastman (TE) process, in conjunction with a stacked autoencoder (SAE), consider the linear and nonlinear relationships between the variables [23]. In [24], PCA is used in combination with a moving-window wavelet transform; through a performance analysis based on $T^2$ and squared prediction error statistics and contribution graphs, it is possible to detect sensor bias-type faults and process faults in a stirred tank reactor.
Although numerous applications related to PCA are reported in the literature, most of these investigations use simulated data and, in addition, mainly consider failures in the steady state of the process, ignoring its transient state [25,26,27].
Detecting faults during the initialization phase of a distillation column is critical to ensuring operational efficiency, safety, and product quality.
The startup period is inherently unstable, with dynamic transitions in temperature, pressure, and composition, making the system highly susceptible to deviations that can escalate into significant operational failures if not promptly addressed.
For instance, improper handling of vapor flow during startup can lead to flooding, weeping, or foaming, which severely compromise separation efficiency and may necessitate costly shutdowns. Early fault detection allows for corrective actions before these issues propagate, minimizing energy waste, reducing downtime, and preventing off-spec products.
Moreover, startups are energy-intensive; optimizing this phase through fault monitoring can lead to substantial energy savings, as highlighted by studies on hybrid model-based control systems that reduce boiler heat duty during transient states.
In this work, a fault detection scheme for a binary ethanol–water batch column is developed. PCA is used, taking advantage of its two-dimensional reduction capability, to monitor process performance over time and verify that the process remains within a normal operating control state. The Hotelling $T^2$ statistic is used as the process monitoring method. The resulting models account for the variation in their variables under normal conditions, which affects the process and is essentially unavoidable within the current process.
2. Methodology
2.1. Case Study: Distillation Column
A distillation column consists of a condenser, a boiler, and the column body, which contains n-2 perforated plates. The actuator in the boiler provides the heat necessary to evaporate the liquid mixture it contains. As the vapor rises through the plates of the column body, it is enriched with the light component (the component with the lowest boiling point in the mixture). The vapor that reaches the condenser condenses and, depending on the state of the reflux actuator, is either extracted as distilled product or re-enters the column. The liquid that re-enters through the reflux descends by gravity within the column body, becoming enriched with the heavy component (the component with the highest boiling point).
Figure 1 shows a simplified diagram as well as a photograph of the distillation pilot plant considered in this case study, composed of 11 perforated plates, 7 of which have RTD (Pt100) temperature sensors: the plate located in the condenser (plate 1); plates 2, 4, 6, 8, and 10; and the boiler (plate 11). It also has 2 actuators, 1 in the boiler and 1 in the condenser. The actuator in the boiler is an adjustable direct-current voltage source that feeds the heating resistance inside the boiler tank. The actuator in the condenser is an open/close valve.
Variable Selection Strategy for Training Data
An ethanol–water mixture was used in equal proportions (1 L), within an operating range of 158 to 160 watts of electrical power in the heating resistance. The process data were acquired through a local interface and stored as CSV files; each file contained data covering the transient state and part of the steady state of the process.
Every file had 5000 samples, and the contained variables were as follows: condenser temperature, plate 2 temperature, plate 4 temperature, plate 6 temperature, plate 8 temperature, plate 10 temperature, boiler temperature, room temperature, DC voltage, electrical current, and heating power.
Data corresponding to the normal operation of the process, such as those indicated in Figure 2, were considered.
The transient-state analysis was categorized into two phases, low- and high-transient, as illustrated in Figure 3. This classification was based on the observed thermal behavior of the system, where temperature trends correlate with vapor accumulation in the plate. Specifically, as the vapor content increases, the plate temperature exhibits a proportional increase.
The terms “low-transient” and “high-transient” were assigned to reflect these dynamics: the low-transient phase corresponds to a state of minimal vapor input, resulting in a gradual temperature increment, while the high-transient phase aligns with elevated vapor levels, driving a more pronounced thermal increase. This division ensured a systematic evaluation of thermal responses under transient conditions.
The data for the training set were obtained by selecting samples within the variables of interest, named according to the following nomenclature: the initial transient temperature; the low-transient sample temperatures 1, 2, ..., n; the high-transient sample temperatures 0, 1, ..., n; the final transient temperature; and the heating power, where n indicates the number of samples considered in this analysis.
Figure 4 shows an example of sample selection within the low and high transients for the temperature of plate 4.
The data were pre-processed to remove null and outlier values from the dataset and standardized using Python 3.11.4.
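To make this step concrete, the following minimal Python sketch removes null values, applies a simple 3-sigma outlier rule, and standardizes the remaining samples; the column names, the synthetic stand-in data, and the 3-sigma threshold are assumptions of the example rather than details taken from the plant code.

```python
import numpy as np
import pandas as pd

# In the plant workflow each run is read from a CSV file exported by the local
# interface, e.g. df = pd.read_csv("run_normal_01.csv") (hypothetical file name).
# A small synthetic frame with a missing value and an outlier stands in here.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(78.0, 0.5, size=(1000, 3)),
                  columns=["T_plate2", "T_plate4", "heating_power"])  # assumed names
df.iloc[10, 0] = np.nan      # simulated dropped reading
df.iloc[20, 1] = 150.0       # simulated outlier

# Remove null values, then discard samples farther than 3 standard deviations
# from the column mean (the 3-sigma rule is an assumption of this sketch).
df = df.dropna()
z = (df - df.mean()) / df.std(ddof=1)
df = df[(z.abs() < 3).all(axis=1)]

# Standardization (Equation (1)): zero mean, unit variance per variable.
X_std = (df - df.mean()) / df.std(ddof=1)
print(X_std.shape, X_std.mean().round(3).tolist())
```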
2.2. PCA
PCA was proposed by Pearson in 1901 to reduce data dimensionality while preserving variance [28]. Hotelling [29] extended Pearson's work and formalized PCA in more rigorous mathematical terms, introducing the use of covariance and correlation matrices to calculate the principal components.
By projecting process variables onto a lower-dimensional subspace, PCA reveals the inherent cross-correlation between process variables. PCA uses an orthogonal transformation to convert a sample set of possibly correlated variables into a set of linearly and statistically uncorrelated variable values called principal components [30].
In this sense, the PCA latent variables, or principal components (PCs) (also called scores), are the directions in which the data have the largest variances; they capture most of the information content of the data, as shown in Figure 5. Mathematically, they correspond to the eigenvectors associated with the largest eigenvalues of the autocorrelation matrix of the data vectors [31].
PCA-based methods are frequently applied in data compression [32], pattern recognition, data smoothing, classification [33], and fault detection [34].
PCA starts with a matrix of observations X of dimension n × m. The data are normalized using Equation (1) to avoid the effect of different magnitudes; the purpose of the usual scaling is to make the variances the same (i.e., give standard units) [35]:

$$\tilde{X} = \frac{X - \bar{X}}{\sigma} \quad (1)$$

In Equation (1), $\tilde{X}$ represents the normalized data, $\bar{X}$ the mean of the data X, and $\sigma$ the standard deviation of the data X.
To understand the relationship between variables, that is, the values that represent how two variables change together, an analysis of covariance was performed. To obtain all possible covariance values between the different dimensions, the covariance matrix S (Equation (2)) was obtained:

$$S = \begin{pmatrix} s_1^2 & s_{12} & \cdots & s_{1m} \\ s_{21} & s_2^2 & \cdots & s_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ s_{m1} & s_{m2} & \cdots & s_m^2 \end{pmatrix} \quad (2)$$

In the matrix S, $s_i^2$ is the variance of the $i$-th variable $x_i$, and $s_{ij}$ is the covariance between the $i$-th and $j$-th variables. If the covariance is not equal to zero, there is a linear relationship between these two variables, and the strength of that relationship is represented by the correlation coefficient (Equation (3)):

$$r_{ij} = \frac{s_{ij}}{s_i s_j} \quad (3)$$

The covariance $s_{ij}$ is calculated using Equation (4):

$$s_{ij} = \frac{1}{n-1}\sum_{k=1}^{n}\left(x_{ki} - \bar{x}_i\right)\left(x_{kj} - \bar{x}_j\right) \quad (4)$$
If the covariance is positive, both variables tend to increase or decrease together; if it is negative, one increases while the other decreases.
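A brief NumPy sketch of how the covariance matrix S of Equations (2) and (4) and the correlation coefficients of Equation (3) can be computed is given below; the random data block is purely illustrative.

```python
import numpy as np

# Synthetic data block (illustrative only): n samples x m variables.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))

n = X.shape[0]
X_centered = X - X.mean(axis=0)

# Covariance matrix S (Equations (2) and (4)), using the 1/(n-1) estimator.
S = X_centered.T @ X_centered / (n - 1)

# Correlation coefficients (Equation (3)): r_ij = s_ij / (s_i * s_j).
std = np.sqrt(np.diag(S))
R = S / np.outer(std, std)

print(S.round(3))
print(R.round(3))
```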
PCA is based on a key result of matrix algebra: a symmetric, non-singular $p \times p$ matrix A, such as the covariance matrix S, can be reduced to a diagonal matrix L by premultiplying and postmultiplying it by a particular orthonormal matrix U, such that Equation (5) is obtained:

$$U^{\top} A U = L \quad (5)$$

The diagonal elements of L ($\lambda_1, \lambda_2, \ldots, \lambda_p$), called characteristic roots, latent roots, or eigenvalues of S, indicate the amount of variance in the data that each principal component captures; thus, L is a matrix comprising the eigenvalues of S arranged diagonally in decreasing magnitude. A high eigenvalue means that the corresponding principal component explains more of the variability in the original data.
The columns of U ($u_1, u_2, \ldots, u_p$), called characteristic vectors or eigenvectors of S, represent the directions of the new axes in the data space. These vectors indicate how to combine the original variables to obtain the principal components. Each principal component is a linear combination of the original variables; the eigenvectors define these coefficients. Important points about eigenvalues and eigenvectors are discussed in [36].
The characteristic roots can be obtained as the solution of the determinantal Equation (6), called the characteristic equation:

$$\left|A - \lambda I\right| = 0 \quad (6)$$

where I is the identity matrix. This equation produces a polynomial of degree p from which the values $\lambda_1, \lambda_2, \ldots, \lambda_p$ are obtained. Then, the characteristic vectors can be obtained by solving Equations (7) and (8):

$$\left[A - \lambda_i I\right] t_i = 0 \quad (7)$$

$$u_i = \frac{t_i}{\sqrt{t_i^{\top} t_i}} \quad (8)$$

The characteristic vectors form the matrix U shown in Equation (9), which is orthonormal:

$$U = \left[\,u_1 \;\; u_2 \;\; \cdots \;\; u_p\,\right], \qquad U^{\top} U = U U^{\top} = I \quad (9)$$
The principal axis transformation transforms the p correlated variables $x_1, x_2, \ldots, x_p$ into p new uncorrelated variables $z_1, z_2, \ldots, z_p$. The coordinate axes of these new variables are described by the characteristic vectors $u_i$, which form the direction cosine matrix U used in the transformation given by Equation (10):

$$z = U^{\top}\left[x - \bar{x}\right] \quad (10)$$

where $x$ and $\bar{x}$ are $p \times 1$ vectors of observations on the original variables and their means. The transformed variables are the principal components of $x$, or PCs; the $i$-th principal component is $z_i = u_i^{\top}\left[x - \bar{x}\right]$. The classical PCA algorithm can be seen in Figure 6.
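The classical algorithm can be sketched in a few lines of Python, following the steps described above (standardize, build the covariance matrix, extract the eigenpairs, and project with Equation (10)); this is a generic illustration rather than the exact code used for the plant data.

```python
import numpy as np

def pca_fit(X):
    """Classical PCA via eigendecomposition of the covariance matrix."""
    x_mean = X.mean(axis=0)
    x_std = X.std(axis=0, ddof=1)
    X_norm = (X - x_mean) / x_std                # standardization, Equation (1)
    S = np.cov(X_norm, rowvar=False)             # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)         # S is symmetric
    order = np.argsort(eigvals)[::-1]            # decreasing eigenvalue order
    return x_mean, x_std, eigvals[order], eigvecs[:, order]

def pca_transform(X, x_mean, x_std, U, k):
    """Project observations onto the first k principal components (cf. Equation (10))."""
    X_norm = (X - x_mean) / x_std
    return X_norm @ U[:, :k]

# Illustrative usage with random data.
X = np.random.default_rng(1).standard_normal((200, 6))
mean, std, lam, U = pca_fit(X)
scores = pca_transform(X, mean, std, U, k=2)
print(scores.shape)  # (200, 2)
```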
The first principal components are sufficient to preserve the relevant information in the original data, according to the parameter k that determines the dimension of the extracted features; this parameter must satisfy $k < p$.
In the PCA algorithm, the eigenvectors corresponding to the first k eigenvalues of the sample covariance matrix are the orthogonal bases of the feature space. Generally, the variance contribution rate, or explained variance, refers to the proportion of the total variance in the original data captured by each principal component (Equation (11)):

$$\eta_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j} \quad (11)$$

The parameter k should make the cumulative contribution rate greater than a threshold (usually 80–90%) [37], as shown in Equation (12):

$$\sum_{i=1}^{k} \eta_i = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{p} \lambda_j} \geq \text{threshold} \quad (12)$$

This value also represents the cumulative sum of the variance explained by the first k principal components; it indicates how much of the total information from the original data is retained when a specific set of principal components is considered.
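A possible implementation of this selection rule, assuming the eigenvalues are already sorted in decreasing order, is sketched below; the 80% threshold is simply the lower end of the range cited above.

```python
import numpy as np

def choose_k(eigvals, threshold=0.80):
    """Return the smallest k whose cumulative explained variance meets the threshold."""
    ratios = eigvals / eigvals.sum()          # Equation (11)
    cumulative = np.cumsum(ratios)            # Equation (12)
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return k, ratios, cumulative

# Illustrative eigenvalues (decreasing), e.g. from the covariance of plate data.
eigvals = np.array([4.7, 2.3, 1.1, 0.6, 0.3])
k, ratios, cumulative = choose_k(eigvals)
print(k, cumulative.round(3))
```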
2.3. Control Charts Based on Principal Components
The $T^2$ control chart combines information on the mean and dispersion of more than one variable [29].
Given a $p \times 1$ vector $x$ of measurements on $p$ normally distributed variables with covariance matrix $\Sigma$, we can test whether the vector of the means of these variables is at its desired target by computing the Hotelling statistic $T^2$.
For notational purposes, $x_i = \left(x_{i1}, x_{i2}, \ldots, x_{ip}\right)^{\top}$ represents the $i$-th individual observation of the p characteristics of the reference sample. The estimated mean vector, whose components are the means of each feature, is $\bar{x} = \left(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p\right)^{\top}$, where $\bar{x}_j$ is obtained by Equation (13),

$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij} \quad (13)$$

and the estimated covariance matrix is obtained by Equation (14):

$$S = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^{\top} \quad (14)$$
To construct a multivariate control chart based on Hotelling's $T^2$ statistic, for observation $x_i$, the chart statistic given by Equation (15) is used:

$$T_i^2 = \left(x_i - \bar{x}\right)^{\top} S^{-1} \left(x_i - \bar{x}\right) \quad (15)$$

This statistic will be distributed as a central Chi-square distribution with q degrees of freedom if the process mean vector is at its target value. A multivariate Chi-square control chart can be constructed by plotting $T^2$ against time with the upper control limit given in Equation (17),

$$\mathrm{UCL} = \chi^2_{\alpha;\,q} \quad (17)$$

and, when the mean vector and covariance matrix are estimated from the data, with the F-based upper control limit given in Equation (18) [38],

$$\mathrm{UCL} = \frac{p(n-1)(n+1)}{n(n-p)}\, F_{\alpha;\,p,\,n-p} \quad (18)$$

where $F_{\alpha;\,p,\,n-p}$ is the $\alpha$ percentile of the F distribution with p and $n-p$ degrees of freedom.
The relationship between the UCL and the data is determined by calculating the $T^2$ statistic. Each sample generates a $T^2$ value, which measures the distance of the data from the multivariate mean of the process; this calculation uses the covariance matrix to account for the correlation between the variables.
It can also be established that the distribution of the $T^2$ values under normal conditions follows a Fisher F distribution, because the statistic is based on the relationship between the variability in the samples and the expected variability in the process, which allows a control limit to be established for a predefined confidence level.
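Assuming that each observation is summarized by its retained principal-component scores, the $T^2$ statistic of Equation (15) and an F-based UCL can be computed as in the following sketch; the 99% confidence level and the exact form of the limit are assumptions of the example.

```python
import numpy as np
from scipy.stats import f

def hotelling_t2(scores, mean, cov_inv):
    """T^2 of Equation (15) for each row of the score matrix."""
    diff = scores - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

def ucl_f(n, p, alpha=0.01):
    """F-based upper control limit for individual observations (assumed form)."""
    return p * (n - 1) * (n + 1) / (n * (n - p)) * f.ppf(1 - alpha, p, n - p)

# Illustrative normal-operation scores (n samples, p = 2 retained components).
rng = np.random.default_rng(2)
scores = rng.standard_normal((150, 2))
mean = scores.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(scores, rowvar=False))

t2 = hotelling_t2(scores, mean, cov_inv)
print(ucl_f(n=scores.shape[0], p=scores.shape[1]), t2[:5].round(2))
```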
2.4. Fault Detection with PCA-$T^2$
The main idea behind this method is to select control limits whose function is to determine whether the monitored process is under statistical control. The UCL value defines the threshold at which multivariate observations are considered normal, i.e., within the expected limits of process variation. If an observation has a $T^2$ value that exceeds the UCL, the process is outside the control limits.
Figure 7 shows the methodology of the proposed PCA-$T^2$ FD scheme.
3. Results
This section presents the PCA-$T^2$ FD method, aimed at detecting faults in the transient state of a chemical process, specifically a distillation column. The matrix $X_2$, used for the plate 2 analysis, contains the process variables and has dimensions 12 × 63. For reasons of space, only a portion of the resulting analysis matrices is shown.
The subscript p in the matrix $X_p$ indicates the plate under analysis. For any plate p in the distillation column, this matrix is formed of the initial transient temperature, the low- and high-transient sample temperatures, and the heating power, where n indicates the number of samples considered for the low- and high-transient sections corresponding to each temperature value.
The vector of sample means for the analysis of plate 2 is denoted by $\bar{x}_2$. The covariance matrix $S_2$ resulting from the analysis data of plate 2 is of size 63 × 63, and the eigenvector matrix $U_2$ for plate 2 is obtained from its eigendecomposition.
The importance of each principal component resulting from the analysis was examined; the result is shown in Figure 8, where the components are listed in order from the highest explained variance (47.88%) to the lowest (6.82%).
Figure 9 shows the variance explained according to the number of components selected for the analysis. The graph indicates that, using only the first and second components, 71.31% of the variance is explained.
Principal components that explain an acceptable cumulative percentage of variance are typically selected; a commonly used value is between 70% and 90% of the explained variance. Too many components can overload the model with irrelevant details, while too few can cause the loss of important information; this range provides a balance between these extremes. In this work, only the first two principal components were used for the analysis, accounting for 71.31% of the explained variance.
The selection is validated using the loading matrix, which is useful for obtaining the variance in each variable explained by each component; from the loading matrix, the squared cosine values are obtained.
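One common way to obtain squared cosines is from the loadings, taken here as the eigenvectors scaled by the square roots of the eigenvalues; the sketch below follows that convention, which is an assumption, since the exact formulation used is not reproduced here.

```python
import numpy as np

def squared_cosines(eigvals, eigvecs, k=2):
    """Squared cosines of each original variable on the first k components.

    Loadings are taken as eigenvectors scaled by sqrt(eigenvalues); the squared
    cosine of variable i on component j is its squared loading divided by the
    sum of squared loadings of variable i over all components.
    """
    loadings = eigvecs * np.sqrt(eigvals)          # columns = components
    cos2 = loadings**2 / (loadings**2).sum(axis=1, keepdims=True)
    return cos2[:, :k]

# Illustrative eigenpairs (e.g., from a plate covariance matrix).
eigvals = np.array([3.2, 1.5, 0.8, 0.5])
eigvecs = np.linalg.qr(np.random.default_rng(3).standard_normal((4, 4)))[0]
print(squared_cosines(eigvals, eigvecs).round(3))
```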
In Figure 10, the direction and importance of the original features given by the first and second components are shown; within the graph, the magnitudes of the vectors are scaled by a factor of 5 for visualization purposes.
The first component describes the analysis variables of the low-transient state, while the second component explains the analysis variables of the high-transient state, as reflected in the loading matrix.
To define the reference threshold, statistical theory is taken into account. The idea is to map the transformed data to a univariate set using the Hotelling $T^2$ statistic and, from this, to establish a normal-state threshold based on the variability in the data.
This parameter represents the magnitude of each transformed data point, resulting in a univariate data set; thus, it is possible to define the UCL with a univariate-process approach [38]. Assuming an F-type distribution for the statistic in Equation (15), with the selected confidence level and the corresponding values of p and n, the UCL threshold or upper control limit is obtained.
A fixed UCL provides a constant criterion for fault detection, avoiding dynamic adjustments that could cause false alarms or computational overload. If the UCL changes dynamically, it might respond to minor variations that do not represent actual faults, leading to unnecessary alerts. Since this research considered an operational range, a fixed limit allows for a more reliable assessment of significant deviations [39].
Figure 11 shows the transformed data plotted on a control chart based on the $T^2$ statistic; the indicated region represents the normal state of the system within the UCL value.
For the classification process, a new sample must first be standardized, then mapped onto the principal components obtained from the normal-state model, and finally its Hotelling $T^2$ statistic is calculated based on the covariance. If this value falls outside the UCL threshold, an abnormal condition is detected, which indicates a failure.
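A self-contained sketch of this classification step is shown below; the random stand-in data, variable count, and 99% confidence level are assumptions of the example.

```python
import numpy as np
from scipy.stats import f

def classify_sample(x_new, x_mean, x_std, U, k, score_mean, score_cov_inv, ucl):
    """Standardize a new sample, project it, compute T^2, and flag a fault."""
    x_norm = (x_new - x_mean) / x_std            # standardization, Equation (1)
    score = x_norm @ U[:, :k]                    # projection onto the PCs
    diff = score - score_mean
    t2 = float(diff @ score_cov_inv @ diff)      # Equation (15)
    return t2, t2 > ucl                          # True -> abnormal condition

# Illustrative normal-operation model (random data as a stand-in for plant data).
rng = np.random.default_rng(4)
X = rng.standard_normal((150, 6))
x_mean, x_std = X.mean(axis=0), X.std(axis=0, ddof=1)
X_norm = (X - x_mean) / x_std
_, U = np.linalg.eigh(np.cov(X_norm, rowvar=False))
U = U[:, ::-1]                                   # decreasing eigenvalue order
k = 2
scores = X_norm @ U[:, :k]
score_mean = scores.mean(axis=0)
score_cov_inv = np.linalg.inv(np.cov(scores, rowvar=False))
n, p = scores.shape
ucl = p * (n - 1) * (n + 1) / (n * (n - p)) * f.ppf(0.99, p, n - p)  # assumed form

t2, is_fault = classify_sample(rng.standard_normal(6), x_mean, x_std, U, k,
                               score_mean, score_cov_inv, ucl)
print(round(t2, 2), is_fault)
```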
Fault Scenarios
This study consisted of causing faults during the start-up of the process, such as pressure leaks and changes in the heating power (for this experiment, the start-up is considered the transient state of the process).
Figure 12 shows a failure due to a heating power increase during the transient state, starting at sample 1500 and finishing at sample 2000.
Under this failure, the temperature initially remains constant and does not reflect immediate changes. The effects of this fault appear in the range of samples 2010 to 2800; this delay changes depending on the magnitude of the fault.
Figure 13 shows a failure due to a pressure leak during the start-up of the process; the failures occurred between samples 1000 and 1500 and between samples 2500 and 3000.
To validate the FD-PCA-$T^2$ model, files containing fault conditions were used, from which a vector with the analysis data was generated.
The data of this vector were standardized and projected onto the reduced basis of the principal components using Equation (20), in which the normalized vector data are projected onto the matrix formed by the first two principal components (eigenvectors).
Figure 14 shows an example of the data used to validate the model under a failure due to a change in the heating power. The effect of this fault extends to sample 3196 in the boiler. This behavior is also visible in the other plate temperatures, but with a different time extension; for example, for plate 2, the effect extends to sample 3850.
Figure 15 presents the result of FD-PCA-$T^2$; the $T^2$ parameter is greater than the normal-operation UCL for the temperature trend of plate 2.
The results of FD-PCA-$T^2$ for plates 2 and 6 and the condenser are shown in Figure 16, in that order.
The blue, green, and violet points indicate normal behavior in the process; the faults in these plates are represented by the red, orange, and yellow points.
It can be observed that for plate 6 (orange), the statistic does not detect the fault because its value remains below the UCL threshold.
To validate the model under a pressure leakage fault, the data shown in Figure 17 were used.
Unlike faults due to heating power, the effect of a pressure leak fault is proportional to the magnitude of the leak, which has a greater impact on the plate where the leak occurs.
Figure 18 shows the result of FD-PCA-$T^2$ evaluated with a pressure leak fault. The graph indicates that the $T^2$ parameter is greater than the normal-operation UCL for the temperature trend of plate 2.
The results of FD-PCA-$T^2$ for plates 2 and 6 and the condenser are shown in Figure 19, in that order. Blue, green, and violet points indicate normal behavior; the faults are represented by red, orange, and yellow points. In the case of the condenser (yellow), the statistic does not detect the fault because its value remains below the UCL threshold.
The variance explained by each model applied to each plate with a sensor in the distillation column, as well as the Hotelling $T^2$ values of the models used in the control chart, are given in Table 1.
According to Table 1, the variance explained by only two components for the boiler covers 97.68% of the information in the data used, compared to plate 2, which reaches 71.31%. Within the analysis, plate 2 retains the least information when only two components are considered.
Finally, to validate the models, the Accuracy, Precision, Sensitivity/Recall, and Specificity metrics are used. These results are presented in Table 2.
These results suggest that the models perform well using only two principal components. Overall, the average accuracy is 0.8386, indicating that normal cases and element faults are correctly detected in 83.86% of cases.
Regarding precision, an average value of 0.9523 indicates correct fault detection in 95.23% of the evaluated cases where a fault was indicated. An average specificity of 0.9642 suggests that the models responded adequately when no fault occurred; in other words, the fault detection models were correct in 96.42% of the normal operating cases. On average, a sensitivity of 0.8666 indicates that the models correctly detected faults in 86.66% of the actual fault cases. In Table 2, it can be observed that the best model was that for Plate 4, with 100% in every metric.
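For reference, the four metrics can be computed from confusion-matrix counts as in the short sketch below; the counts used are placeholders, not the values behind Table 2.

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, Precision, Sensitivity/Recall, and Specificity from counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)     # recall
    specificity = tn / (tn + fp)
    return accuracy, precision, sensitivity, specificity

# Placeholder counts for illustration only.
print(detection_metrics(tp=130, tn=145, fp=7, fn=20))
```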
5. Conclusions
This paper presents a fault detection strategy that uses principal component analysis together with Hotelling's $T^2$ statistic to construct a multivariate control chart. Fault detection is performed by observing a UCL value that defines the threshold at which multivariate observations are considered normal; if an observation has a $T^2$ value that exceeds the UCL, this indicates a fault.
Real data from a distillation column are used to train and validate the resulting models. The results indicate good performance of the applied method, achieving correct fault detection in 95.23% of the evaluated cases where a fault was indicated (precision). However, considering all actual fault cases, the model achieves reliable detection in 86.66% of them (sensitivity), suggesting that there is room for improvement in detecting some real faults that could be masked in the training data. This research demonstrates that the applied method can detect faults in the transient state of the process, and the proposed strategy meets the essential criteria for a reliable fault monitoring scheme.
The PCA-$T^2$ fault detection system can be applied in real time: once the process data under normal operating conditions are obtained, the X matrix is generated, the principal component analysis is performed, and the statistical control thresholds or limits are calculated, so no recalculation is needed during fault detection. With the PCA model available, process monitoring and fault detection can be performed; for each new reading from the monitored process, the $T^2$ statistic is calculated, and if the value exceeds the threshold, the alarm is triggered.
However, the results can be improved. For this experiment, a limited dataset was used, which was divided into two subsets (training and validation data). In addition, the data were obtained from a specific operating range because of the characteristics of the distillation pilot plant; considering more variables in the analysis, such as pressure, temperature, and concentration, would improve the results obtained, as correlations may exist between them. By including these relationships, more complex patterns can be identified and the precision of the diagnosis can be improved. Moreover, when correlated variables are considered, the system can better differentiate between normal variations and real failures, which helps reduce false alarms and avoids unnecessary interventions in the process. However, the redundancy of information should be taken into account.