Article

Fault Diagnosis and Identification of Abnormal Variables Based on Center Nearest Neighbor Reconstruction Theory

1 School of Cable Engineering, Henan Institute of Technology, Xinxiang 453003, China
2 Henan Key Laboratory of Advanced Cable Materials and Intelligent Manufacturing, Xinxiang 453003, China
3 School of Electrical and Information Engineering, Anhui University of Technology, Ma’anshan 243032, China
4 College of Control Science and Engineering, Zhejiang University, Hangzhou 310027, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(12), 2035; https://doi.org/10.3390/math13122035
Submission received: 7 May 2025 / Revised: 15 June 2025 / Accepted: 18 June 2025 / Published: 19 June 2025

Abstract

Fault diagnosis and identification are important goals in ensuring safe production in industrial processes. This article proposes a data reconstruction method based on Center Nearest Neighbor (CNN) theory for fault diagnosis and abnormal variable identification. Firstly, the k-nearest neighbor (k-NN) method is used to monitor the process and determine whether a fault has occurred. Secondly, when a fault is present, a high-precision CNN reconstruction algorithm is used to reconstruct each variable and calculate the reconstructed control index. The variable that reduces the control index the most is replaced with its reconstructed value, and this procedure is iterated until the control index falls within the control range, at which point all abnormal variables have been determined. The accuracy of the CNN reconstruction method was verified through a numerical example. Additionally, it was confirmed that the method is suitable not only for the fault diagnosis of a single sensor but also for sensor faults that occur simultaneously or propagate due to variable correlations. Finally, the effectiveness and applicability of the proposed method were validated through the penicillin fermentation process.

1. Introduction

The existence of faults not only affects the safe operation of a system but can also degrade product quality and cause casualties in industrial processes. Therefore, factories and enterprises are placing increasing emphasis on the safety and reliability of control processes, and fault detection and diagnosis technology has become a popular research direction in the international automation and control field [1,2,3,4]. Truong et al. proposed a novel observer-based neural-network finite-time output control strategy for general high-order nonlinear systems, which reduces the number of estimated parameters; an observer-based finite-time output feedback controller was also established to achieve output tracking with semi-global practically finite-time stability [5]. Phan et al. proposed an adaptive neural compensation disturbance observer for estimating composite disturbances and oil leakage faults [6]. With the increasing complexity of automation control systems and the rapid development of computer and sensor technology, more and more process data can be stored and applied, which has become a prerequisite for the development of data-driven fault diagnosis technology [7,8,9,10]. In recent years, Multivariate Statistical Process Control (MSPC) methods have been studied extensively and have achieved promising results. PCA, PLS, similarity analysis, and k-NN, as well as their extensions, have been widely applied in practical industrial processes [11,12,13,14,15,16].
When a fault occurs in an industrial process, it is crucial to diagnose the fault and identify the abnormal variables so that operators can eliminate the fault at its root. The contribution plot has been widely applied as a classical fault diagnosis method. Kourti and MacGregor [17] used contribution plots of quality and process variables to identify abnormal variables in high-pressure, low-density polyethylene reactors; experimental verification showed that this method cannot always accurately reveal the root cause of a fault. Dunia and Qin [18] proposed a reconstruction-based fault diagnosis method that isolates abnormal variables in the fault subspace, and applied it to the reconstruction of fault variables. Yue and Qin applied a fault detection method based on a combined statistical index and conducted in-depth research on an abnormal variable identification method based on Reconstruction-Based Contribution (RBC); they also demonstrated that this method suffers, to some extent, from the same fault smearing (tailing) effect as traditional contribution plot methods [19].
In recent years, fault detection methods based on k-NN have been proposed and successfully used to monitor continuous and batch processes. Due to its large computational complexity and the need to store multiple intermediate values, the k-NN method places high demands on the computing speed and storage space of the computer. To address these issues, He and Wang [20,21] proposed the PC-k-NN method, which uses the principal components of the original samples as modeling samples. This method reduces the computational complexity of the distance calculations and saves storage space, but ignores abnormal information that occurs in the residual space. Subsequently, Guo Xiaoping and Li Yuan [22] proposed a batch process monitoring method based on feature space k-nearest neighbors (FS-k-NN), which combines the principal component part and residual part of the feature space to represent the useful information of the original data. This method has achieved positive results. Although the k-NN method has achieved satisfactory results in fault detection, there is little research on fault diagnosis and abnormal variable identification methods [23].
Based on the above discussion, this article proposes a fault diagnosis and abnormal variable identification method based on Center Nearest Neighbor (CNN) theory. The basic principles of the mean k-NN, k-1NN, weighted k-NN, and CNN reconstruction methods are analyzed, the four methods are compared in terms of data reconstruction accuracy, and the CNN reconstruction method is shown to achieve the highest accuracy. The fault diagnosis procedure based on CNN reconstruction theory is as follows: firstly, the k-NN fault detection method is used to monitor the process and determine whether a fault has occurred; secondly, when a fault is present, the high-precision CNN reconstruction algorithm is used to reconstruct each variable and calculate the reconstructed control index. The variable that reduces the control index the most is replaced in sequence, and the iteration continues until the control index falls within the control range. The purpose of this procedure is to find all variables that cause the process fault.
The fault diagnosis method based on CNN reconstruction theory has the following three advantages: (1) it imposes no restrictions on the data characteristics, meaning that it can be applied effectively to non-Gaussian, nonlinear, and multi-modal data; (2) the CNN reconstruction method has high data reconstruction accuracy, and k is determined by a reasonable adaptive calculation; (3) it is applicable not only to situations where a single variable is abnormal but also to situations where multiple variables are abnormal. This method can also provide a preliminary relationship between a fault and the variables.
The rest of this paper is organized as follows: the fault detection method based on k-NN rules is presented in Section 2. The proposed data reconstruction and abnormal variable identification methods are described in Section 3, including the mean k-NN, k-1NN, weighted k-NN, and CNN reconstruction methods and the abnormal variable identification method based on CNN. In Section 4, a numerical example and the penicillin fermentation process are used to illustrate the effectiveness of the proposed method from the perspectives of data reconstruction accuracy and abnormal variable identification. Concluding remarks are provided in the final section.

2. Fault Detection Based on k-NN Rules

The k-NN method is a non-parametric supervised classification method that can predict unknown classes or labels. It can achieve high classification accuracy for datasets with unknown or non-normal distributions and has strong robustness. Fault detection based on k-NN rules directly utilizes process data collected under normal operating conditions: it selects the first k neighboring data samples of each sample to establish a statistical process model, extracts the nearest-neighbor distance features between normal samples, and calculates the statistical control threshold of the model. The guiding principle is that abnormal samples and normal samples belong to different categories, so the nearest-neighbor distance between an abnormal sample and the normal samples in the modeling data will be greater than the nearest-neighbor distance between normal samples; this can be expressed as Equation (1):
$\min d_f > \min d_n$  (1)
where $\min d_f$ is the nearest-neighbor distance between an abnormal sample and the training samples, and $\min d_n$ is the nearest-neighbor distance between a normal sample and the training samples.
The detailed steps of k-NN modeling and fault detection are shown in Figure 1:
(1)
Collect and standardize the normal-process training data $X_{n \times m}$, converting it into a matrix $x_{n \times m}$ with zero mean and unit standard deviation, where $n$ and $m$ denote the numbers of samples and variables, respectively.
(2)
For each sample $x_i$, its $k$ nearest neighbors are found in the training data set using the Euclidean distance. The sum of squares $D_k^2(x_i)$ of the $k$ nearest distances between each training sample and the other training samples is calculated using Equation (2), where $N_j(x_i)$ represents the $j$th nearest neighbor of sample $x_i$:
$D_k^2(x_i) = \sum_{j=1}^{k} \left\| x_i - N_j(x_i) \right\|^2$  (2)
(3)
Determine the control threshold $\delta_{Limit}$ at a 99% confidence level using the non-central $\chi^2$ distribution method [20,22].
(4)
Standardize the real-time data $x_{new}$ using the mean and variance of the training data and find its $k$ nearest neighbors in the training data set.
(5)
Calculate the distance statistical index $D_k^2(x_{new})$.
(6)
Compare $D_k^2(x_{new})$ with the control threshold $\delta_{Limit}$. If the statistical index is less than the control threshold, i.e., $D_k^2(x_{new}) < \delta_{Limit}$, the sample is normal; otherwise, a fault has occurred in the process. (A minimal sketch of these detection steps is given below.)
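For concreteness, the following is a minimal sketch of the detection steps above, written in Python with NumPy only; the function and variable names are illustrative. As a simplification, it takes the empirical 99th percentile of the training $D_k^2$ values as the control threshold rather than the non-central $\chi^2$ fit used in step (3), and $k$ is a fixed user-chosen value.

```python
# Minimal sketch of the k-NN detection steps (NumPy only); names are illustrative.
import numpy as np

def knn_fit(X_train, k=5):
    """Standardize the training data and compute D_k^2 (Equation (2)) for each sample."""
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    Z = (X_train - mean) / std
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                             # exclude each sample itself
    D2_train = np.sort(d2, axis=1)[:, :k].sum(axis=1)        # Equation (2)
    limit = np.percentile(D2_train, 99)  # simplified 99% threshold (paper: non-central chi-square)
    return {"mean": mean, "std": std, "Z": Z, "k": k, "limit": limit}

def knn_detect(model, x_new):
    """Return (D_k^2, fault flag) for one new raw sample."""
    z = (x_new - model["mean"]) / model["std"]
    d2 = ((model["Z"] - z) ** 2).sum(axis=1)
    D2 = np.sort(d2)[: model["k"]].sum()
    return D2, D2 > model["limit"]

# usage with stand-in data
rng = np.random.default_rng(0)
model = knn_fit(rng.normal(size=(500, 7)), k=5)
print(knn_detect(model, rng.normal(size=7)))        # normal-like sample
print(knn_detect(model, rng.normal(size=7) + 4.0))  # shifted, likely flagged as a fault
```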

3. Data Reconstruction and Abnormal Variable Identification Methods Based on CNN

3.1. Data Reconstruction Method

When the distance statistical index of the detection system is greater than the control threshold, a fault is indicated. In order to restore the equipment to its normal state, it is then necessary to identify the root cause of the fault. On the basis of k-NN fault detection, this section introduces the data reconstruction methods based on the mean k-NN, k-1NN, and weighted k-NN rules, as well as the CNN data reconstruction method, and compares their reconstruction accuracy.
For ease of understanding, several symbols and weight parameters are defined as follows:
(1) $x_{test}$ represents the detected fault sample; the reduced sample, composed of the remaining variables after removing the variable to be reconstructed, is used for the nearest-neighbor search, as shown in Figure 2.
(2) $n_j(x_{test})$ represents the label of the $j$th nearest neighbor of $x_{test}$ in $x_{n \times (m-1)}$, where $j = 1, 2, \ldots, k$.
(3) $w$ is the weight, representing the similarity between the test sample and a modeling sample; a higher weight indicates a greater degree of similarity and a lower weight a lower degree of similarity.
(4) $\hat{v}_1$ represents the reconstructed value of variable $v_1$.
(5) Taking the reconstruction of variable $v_1$ as an example, the first $k$ nearest neighbors of $x_{test}$ in $x_{n \times (m-1)}$ are found and labeled $n_j(x_{test})$, $j = 1, 2, \ldots, k$; the Euclidean distances between $x_{test}$ and these neighbors are $d_1, d_2, \ldots, d_k$ with $d_1 < d_2 < \cdots < d_k$.
The reconstruction methods for the mean k-NN, k-1NN, weighted k-NN, and CNN are described as follows:
(1) Mean k-NN reconstruction method
The calculation formula when using the mean k-NN reconstruction method to reconstruct variable $v_1$ is given by Equation (3):
$\hat{v}_1 = \frac{1}{k} \sum_{l=1}^{k} \left[ x_{n_l(x_{test})} \right]_1$  (3)
where $[x_{n_l(x_{test})}]_1$ represents the value of the first variable of the training sample with the $l$th nearest-neighbor label of $x_{test}$ in $x_{n \times (m-1)}$.
(2) k-1NN reconstruction method
When $d_1$ and $d_2$ satisfy $d_1 / d_2 < 0.3$, the k-1NN reconstruction method is selected to reconstruct the variable, as shown in Equation (4):
$\hat{v}_1 = \left[ x_{n_1(x_{test})} \right]_1$  (4)
(3) Weighted k-NN reconstruction method
When using the weighted reconstruction method, the first step is to determine the weights $w_j$ according to Equation (5); variable $v_1$ is then reconstructed according to Equation (6):
$w_j = \dfrac{1/d_j}{\sum_{l=1}^{k} 1/d_l}$  (5)
$\hat{v}_1 = \sum_{l=1}^{k} w_l \left[ x_{n_l(x_{test})} \right]_1$  (6)
(4) CNN reconstruction method
When using the CNN method to reconstruct variable $v_1$, the first step is to determine the value of $k$ according to Equation (7), after which the variable is reconstructed according to Equation (3):
$f(k) = \min_{k} \left\| x_{test} - \frac{1}{k} \sum_{l=1}^{k} x_{n_l(x_{test})} \right\|^2$  (7)
The above analysis allows the differences between the four methods, and their impact on reconstruction accuracy, to be explained from the perspective of the geometric distribution of the data. For the four sample points A, B, C, and D shown in Figure 3, the reconstruction effect differs significantly for different values of k. When using the mean k-NN reconstruction method for point A, which has three highly similar nearest neighbors, k = 2 or k = 3 is reasonable, whereas other values produce relatively large errors. For point B, an analysis of its neighboring sample points shows that the mean k-NN method is not applicable; in this case, the k-1NN reconstruction method gives higher accuracy. For points C and D, the mean k-NN method can be used for reconstruction, but the appropriate value of k is relatively low, and the reconstruction accuracy may vary with different k values. The weighted k-NN reconstruction method gives relatively good accuracy, but the size of k must still be determined in a fuzzy manner. With the CNN reconstruction method, the precise k value determined through calculation ensures that the most accurate reconstruction result is obtained. Data distributed in this way are very common in actual production processes, so the size of k and the choice of reconstruction algorithm are worth careful consideration: blindly using a fixed k value affects the accuracy of data reconstruction. Clearly, the mean k-NN method is not suitable for mechanically reconstructing all data with a single unified k value; this issue is addressed by the CNN method. The k-1NN method is only applicable when the first nearest neighbor is much closer to the test sample than the remaining neighbors, which limits its use, and although the weighted k-NN method has high reconstruction accuracy, it still cannot avoid the fuzzy selection of k. For the CNN reconstruction method, k is obtained by a reasonable adaptive calculation, ensuring that the reconstruction of data points A, B, C, and D is better than that obtained with the other methods. (A sketch of the four reconstruction rules is given below.)
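The following sketch implements the four reconstruction rules of Equations (3)–(7) under some assumptions: `Z` is the standardized training matrix, `x_test` the standardized faulty sample, `idx` the index of the variable being reconstructed, and `k_max` an assumed upper bound on the neighborhood size searched by the CNN rule; the 0.3 ratio in the k-1NN rule follows the text above.

```python
# Sketch of the reconstruction rules of Equations (3)-(7); Z is the standardized
# training matrix, x_test the standardized faulty sample, idx the variable to
# reconstruct, and k_max an assumed search range for the CNN rule.
import numpy as np

def _neighbours(Z, x_test, idx):
    """Sorted distances and sample order in the reduced (m-1)-variable space."""
    keep = np.arange(Z.shape[1]) != idx
    d = np.sqrt(((Z[:, keep] - x_test[keep]) ** 2).sum(axis=1))
    order = np.argsort(d)
    return d[order], order

def mean_knn(Z, x_test, idx, k):                     # Equation (3)
    _, order = _neighbours(Z, x_test, idx)
    return Z[order[:k], idx].mean()

def k1nn(Z, x_test, idx):                            # Equation (4)
    d, order = _neighbours(Z, x_test, idx)
    return Z[order[0], idx] if d[0] / d[1] < 0.3 else None  # rule only applies if d1/d2 < 0.3

def weighted_knn(Z, x_test, idx, k):                 # Equations (5)-(6)
    d, order = _neighbours(Z, x_test, idx)
    w = (1.0 / d[:k]) / (1.0 / d[:k]).sum()
    return (w * Z[order[:k], idx]).sum()

def cnn(Z, x_test, idx, k_max=20):                   # Equation (7), then Equation (3)
    _, order = _neighbours(Z, x_test, idx)
    keep = np.arange(Z.shape[1]) != idx
    best_k, best_f = 1, np.inf
    for k in range(1, k_max + 1):
        centre = Z[order[:k]][:, keep].mean(axis=0)  # centre of the first k neighbours
        f = ((x_test[keep] - centre) ** 2).sum()
        if f < best_f:
            best_k, best_f = k, f
    return mean_knn(Z, x_test, idx, best_k), best_k
```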

3.2. Abnormal Variable Identification Methods Based on CNN

The detailed procedures of process modeling, fault detection, data reconstruction, and abnormal variable identification are shown in Figure 4 (assuming that only two variables, $v_1$ and $v_2$, are abnormal).
  • Modeling phase
    (a)
    Collect and standardize the normal training data set $X_{n \times m}$;
    (b)
    For each standardized sample $x_i$, find its $k$ nearest neighbors in the training set and calculate the squared distance between them using Equation (2);
    (c)
    Determine the control threshold $\delta_{Limit}$.
  • Fault detection phase
    (a)
    Collect a real-time sample $X_{1 \times m}$ and standardize it using the mean and standard deviation of the modeling data;
    (b)
    For the standardized sample $x_{test}$, find its $k$ nearest neighbors in the training set and calculate the squared distance index $D^2(x_{test})$ according to Equation (8), where $v_i$ represents the $i$th variable of $x_{test}$, $n_j(x_{test})$ represents the $j$th nearest neighbor of the test sample in the training set, and $[n_j(x_{test})]_i$ is the $i$th element of sample $n_j(x_{test})$:
    $D^2(x_{test}) = \sum_{j=1}^{k} \left\| x_{test} - n_j(x_{test}) \right\|^2 = \sum_{j=1}^{k} \sum_{i=1}^{m} \left( v_i - [n_j(x_{test})]_i \right)^2$  (8)
    (c)
    Compare the distance index $D^2(x_{test})$ with the control threshold $\delta_{Limit}$ to determine whether there is a fault in the real-time sample.
  • Data reconstruction and identification of abnormal variables
    When the distance index $D^2(x_{test})$ is greater than the control threshold $\delta_{Limit}$, there is a fault in the process, and it is necessary to further identify the abnormal variables, as shown in Figure 4:
    (a)
    Calculate the statistical index $S_1$ after the CNN method reconstructs the first variable.
    (b)
    Reconstruct all variables in $x_{test}$ in this way and sequentially calculate the statistical indices $S_1, S_2, \ldots, S_m$.
    (c)
    Determine the difference between the statistical index and the control limit after reconstructing each variable. Assuming the sorted result is as shown in Figure 4, $d_1 < d_2 < \cdots < d_m$, where $d_i = S_i - \delta_{Limit}$, find the variable that reduces the control index the most: a smaller $S_i$ gives a smaller $d_i$, indicating that the index has decreased more relative to when the fault occurred and, hence, that the probability of the corresponding variable being abnormal is higher.
    (d)
    In Figure 4, replace variable $v_1$ in $x_{test}$ with the reconstructed value $\hat{v}_1$ and calculate the control index $D_1$ to determine whether it is below the control threshold. If it is still greater than the control threshold, also replace $v_2$ in $x_{test}$ with the reconstructed value $\hat{v}_2$ and calculate the control index $D_{12}$, iterating from the smallest $d_i$ to the largest until the control index falls within the control threshold. The calculation methods for $D_1$ and $D_{12}$ are given in Equations (9) and (10):
    $D_1 = \sum_{j=1}^{k} \left\| x_{t1} - N_j(x_{t1}) \right\|^2$  (9)
    $D_{12} = \sum_{j=1}^{k} \left\| x_{t12} - N_j(x_{t12}) \right\|^2$  (10)
    where $x_{t1}$ represents the data sample obtained by replacing $v_1$ with $\hat{v}_1$ in $x_{test}$, and $x_{t12}$ represents the data sample obtained by replacing $v_1$ and $v_2$ in $x_{test}$ with $\hat{v}_1$ and $\hat{v}_2$;
    (e)
    Assuming that only two variables ($v_1$ and $v_2$) are abnormal in the system, when $D_1 \le \delta_{Limit}$, only variable $v_1$ is abnormal; when $D_1 > \delta_{Limit}$ and $D_{12} \le \delta_{Limit}$, variables $v_1$ and $v_2$ are both abnormal, as shown in Figure 4. (A sketch of this identification loop follows the list.)
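A minimal sketch of this identification loop is given below. It assumes the `model` dictionary produced by `knn_fit` and the `cnn` reconstruction function from the earlier sketches; the helper names and the `k_max` parameter are illustrative, not from the paper.

```python
# Sketch of the identification loop of Figure 4, reusing the `model` dictionary
# from knn_fit and the `cnn` reconstruction sketched above.
import numpy as np

def _index(model, z):
    """k-NN distance index of Equation (8) for an already standardized sample."""
    d2 = ((model["Z"] - z) ** 2).sum(axis=1)
    return np.sort(d2)[: model["k"]].sum()

def identify_abnormal_variables(model, x_new, k_max=20):
    """Return the indices of the variables identified as abnormal."""
    z = (x_new - model["mean"]) / model["std"]
    Z, limit = model["Z"], model["limit"]
    m = Z.shape[1]
    # reconstruct every variable once with the CNN rule and record the index S_i
    recon = np.array([cnn(Z, z, i, k_max)[0] for i in range(m)])
    S = np.empty(m)
    for i in range(m):
        z_i = z.copy()
        z_i[i] = recon[i]
        S[i] = _index(model, z_i)
    order = np.argsort(S)          # variables whose replacement lowers the index most come first
    abnormal, z_rep = [], z.copy()
    for i in order:                # replace one variable at a time until the index is in control
        if _index(model, z_rep) <= limit:
            break
        z_rep[i] = recon[i]
        abnormal.append(i)
    return abnormal
```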

4. Simulation Experiment Analysis

This section provides a numerical simulation and an engineering example to verify the effectiveness of the proposed method. Firstly, a seven-variable numerical simulation [23] is used to compare the accuracy of the mean k-NN, k-1NN, weighted k-NN, and CNN reconstruction methods, and the effectiveness of the CNN reconstruction method is verified by identifying multiple abnormal variables. Finally, the method is used to identify abnormal variables in the penicillin fermentation process.

4.1. Numerical Simulation

The numerical simulation consists of seven variables constructed from two latent variables, $s_a$ and $s_b$, as shown in Equation (11):
$x_1 = 0.3217\, s_a + 0.4821\, s_b + e_1$
$x_2 = 0.2468\, s_a + 0.1766\, s_b + e_2$
$x_3 = 0.8921\, s_a + 0.4009\, s_b^2 + e_3$
$x_4 = 0.7382\, s_a^2 + 0.0566\, s_b + e_4$
$x_5 = 0.3972\, s_a^2 + 0.8045\, s_b^2 + e_5$
$x_6 = 0.6519\, s_a s_b + 0.2071\, s_b + e_6$
$x_7 = 0.4871\, s_a + 0.4508\, s_a s_b + e_7$  (11)
where $e_1 \sim e_7$ are noise terms with zero mean and a standard deviation of 0.01, $s_a$ is a random number between −10 and −7, and $s_b$ follows the normal distribution $N(15, 1)$. According to Equation (11), 500 training samples and 500 test samples are generated. Assuming that variable $x_1$ is missing from samples 151 to 200 of the test set, the four reconstruction methods are used to reconstruct the missing data. In order to quantitatively compare their reconstruction accuracy, this section uses the root mean square error (RMSE) as the measurement index, as shown in Equation (12):
$\mathrm{RMSE} = \sqrt{\dfrac{1}{N} \sum_{i=1}^{N} \left( x_{i1} - \bar{x}_{i1} \right)^2}$  (12)
where $x_{i1}$ is the original (standardized) value, $\bar{x}_{i1}$ is the reconstructed value of the missing data from samples 151–200, and $N$ is the number of samples to be reconstructed. Figure 5 shows a comparison between the actual values and the values reconstructed by the four methods. Through calculation, the average errors over the 50 missing values were determined to be 0.049, 0.051, 0.041, and 0.02 for the mean k-NN, k-1NN, weighted k-NN, and CNN methods, respectively. The CNN reconstruction gave the smallest error, indicating the superiority of this method for data reconstruction. (A short RMSE computation sketch is given below.)
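A short sketch of the RMSE index of Equation (12), assuming `x_true` and `x_hat` hold the standardized original and reconstructed values of the missing segment:

```python
# RMSE of Equation (12) over the reconstructed segment.
import numpy as np

def rmse(x_true, x_hat):
    x_true, x_hat = np.asarray(x_true), np.asarray(x_hat)
    return np.sqrt(np.mean((x_true - x_hat) ** 2))
```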
In order to verify the effectiveness of the CNN reconstruction method in identifying multiple abnormal variables, 500 test samples were generated according to Equation (11). Faults were added as follows (a hedged data-generation sketch is given after the list):
(1) Variable $x_1$: an 8% step fault is added from time 101 to 150;
(2) Variable $x_2$: a 10% step fault is added from time 401 to 450;
(3) Variable $x_7$: a 5% step fault is added from time 401 to 450.
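A hedged sketch of this numerical example follows: the data are generated from Equation (11) and the step faults are added as listed above. The random seed, the uniform distribution assumed for $s_a$ (the text only says "a random number between −10 and −7"), and applying each step fault as a percentage of the signal's own value are illustrative assumptions.

```python
# Sketch of the numerical example: data from Equation (11) plus the step faults above.
import numpy as np

def generate_data(n, rng):
    sa = rng.uniform(-10.0, -7.0, size=n)       # "random number between -10 and -7" (uniform assumed)
    sb = rng.normal(15.0, 1.0, size=n)          # s_b ~ N(15, 1)
    e = rng.normal(0.0, 0.01, size=(7, n))      # noise terms e_1..e_7
    return np.column_stack([
        0.3217 * sa + 0.4821 * sb + e[0],
        0.2468 * sa + 0.1766 * sb + e[1],
        0.8921 * sa + 0.4009 * sb**2 + e[2],
        0.7382 * sa**2 + 0.0566 * sb + e[3],
        0.3972 * sa**2 + 0.8045 * sb**2 + e[4],
        0.6519 * sa * sb + 0.2071 * sb + e[5],
        0.4871 * sa + 0.4508 * sa * sb + e[6],
    ])

rng = np.random.default_rng(1)
X_train = generate_data(500, rng)
X_test = generate_data(500, rng)
# step faults, applied here as a percentage of each signal's own value (assumption)
X_test[100:150, 0] *= 1.08   # x1: 8% step, samples 101-150
X_test[400:450, 1] *= 1.10   # x2: 10% step, samples 401-450
X_test[400:450, 6] *= 1.05   # x7: 5% step, samples 401-450
```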
Figure 6 shows the k-NN fault detection results for the test data. The control index exceeds the control threshold during the periods 101–150 and 401–450, indicating the presence of faults in the system, which is consistent with the preset fault times. Further identification of the abnormal variables is then required: the data during the fault period are treated as missing, and the CNN reconstruction method is used to reconstruct each variable in turn. For time 101 to 150, fault detection was performed on the reconstructed data of each variable in sequence; the detailed results are shown in Figure 7. After reconstructing variable $x_1$, the fault is eliminated, so it can be determined that only variable $x_1$ is abnormal during this period. The detection results for the other variables after reconstruction, also shown in Figure 7, are essentially consistent with the control index in Figure 6, indicating that reconstructing these variables does not significantly affect the control index; that is, these variables are normal. Therefore, the only abnormal variable from time 101 to 150 is $x_1$. Similarly, the variables were sequentially reconstructed from time 401 to 450, and the results are shown in Figure 8. Comparing the detection results for reconstructed variables $x_1$, $x_3$, $x_4$, $x_5$, and $x_6$ with the control index in Figure 6 shows that the changes in the control index are minimal, whereas after reconstructing variables $x_2$ and $x_7$ the control index decreases significantly: after reconstructing $x_2$, the index drops to between 3.5 and 4.5, and after reconstructing $x_7$, to between 6 and 8. Therefore, it can be preliminarily determined that the abnormal variables are $x_2$ and $x_7$. Figure 9 shows the detection results after reconstructing both $x_2$ and $x_7$: the detection results return to normal. That is, the abnormal process variables during the 401–450 period are $x_2$ and $x_7$, and the fault diagnosis and identification of abnormal variables are complete.

4.2. Application Research on the Penicillin Fermentation Process

Penicillin is an antibiotic with high clinical and medicinal value. Its production process has dynamic, nonlinear, and multi-stage characteristics. The simulation platform for this process, PenSim v2.0, was developed at the Illinois Institute of Technology. The platform can simulate the production process under different operating conditions and has been widely used in academic research on industrial process monitoring, fault detection, and pattern recognition [20,24,25]. Figure 10 shows the simulation flowchart. The simulation program code and operating instructions can be downloaded from the following website: https://www.iteye.com/resource/vvvvvv12345-6866041 (accessed on 6 May 2025). The penicillin fermentation process can be divided into roughly three phases: biomass growth, penicillin synthesis, and cell autolysis. Firstly, after the fermentation medium is inoculated, the production bacteria undergo a short period of adaptation in a suitable environment and then develop, grow, and reproduce until the critical cell concentration is reached. Secondly, once the culture reaches the penicillin production conditions, penicillin synthesis gradually begins, and the synthesis rate increases until it reaches its maximum value, which is maintained until the synthesis ability begins to decline. Thirdly, the cells begin to age and autolyze, the ability to synthesize penicillin declines, the penicillin production rate decreases, and the pH value increases in the later stage of fermentation. In order to achieve a high penicillin synthesis rate and ensure product quality, each stage of the process needs to be kept in its own optimal environmental state. Therefore, process parameters such as the reaction temperature, pH value, and stirring rate need to be adjusted continuously during operation, for example through the temperature and pH controllers shown in Figure 10.
The PenSim v2.0 simulation platform can effectively simulate the dynamic characteristics of 16 variables in the penicillin process under different operating conditions. The simulation time for the entire batch is 400 h, with one sample per hour on average. Table 1 and Table 2 list the initial conditions and monitoring variables for the process, respectively. Different types of process disturbances (see Table 3) can be introduced to verify the effectiveness of the proposed abnormal variable identification method. Simulation experiments were conducted using the two fault types of fault 1 (abnormal aeration rate) as examples: test data 1 contains a 5% step fault between sample points 201 and 400, and test data 2 contains a slope fault with a slope of 0.1, introduced at time 251 and lasting until the end of system operation.
To test the effectiveness of the proposed abnormal variable identification algorithm, a k-NN fault detection model was established. The detection results for test data 1 and test data 2 are shown in Figure 11. Because test data 1 contains a step fault, the detection result is quite clear: from 201 to 400, the control index exceeds the normal control threshold, which is consistent with the preset fault, as shown in Figure 11a. In Figure 11b, because test data 2 contains a slope fault with a small gradient, a certain amount of time is needed after the fault occurs for the accumulated control index to reach the alarm limit, resulting in a brief lag in detection. Figure 12 shows the variable contribution indicators for the two fault scenarios, which give an approximate indication of the main variables causing the fault. Figure 12a,c show the variable contributions of the test data at all time points; there is no significant difference in the contribution indicators of the variables in the range 1–200, whereas after the fault is introduced at time 201, the contribution of variable 1 (aeration rate) increases significantly, so it can be preliminarily judged that this variable is the main cause of the fault. Furthermore, Figure 12b,d show the variable contribution values at 261 h under the two fault conditions, with the aeration rate contributing the most, which is consistent with the preset fault. At this point, although a fault has been detected and the fault interval and main variable have been determined, it has not yet been established whether other variables in the system are abnormal. This section also presents the reconstruction results for the main responsible variable (aeration rate), obtained using the CNN method for test data 1, as shown in Figure 13; the trend of the reconstructed data is essentially consistent with that of the normal process data.
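The contribution indicators of Figure 12 can be read as a per-variable decomposition of the distance index in Equation (8). The sketch below illustrates one such decomposition; the exact contribution index used in ref. [23] may differ, so this is only an illustration, reusing the `model` dictionary from the earlier detection sketch.

```python
# Per-variable decomposition of the k-NN distance index in Equation (8);
# a stand-in for the contribution index of ref. [23], for illustration only.
import numpy as np

def knn_variable_contributions(model, x_new):
    z = (x_new - model["mean"]) / model["std"]
    d2 = ((model["Z"] - z) ** 2).sum(axis=1)
    nbr = np.argsort(d2)[: model["k"]]                # k nearest training samples
    return ((model["Z"][nbr] - z) ** 2).sum(axis=0)   # contribution of each variable
```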
According to the reconstruction results in Figure 13, the aeration rate in test data 1 can be replaced sample by sample with the reconstructed data, and k-NN fault detection can be performed again; the results are shown in Figure 14. The occurrence of this fault is mainly caused by the responsible variable (aeration rate), which has little correlation with the other variables; that is, the abnormality of the responsible variable has little impact on the other variables. Therefore, after reconstructing the responsible variable, the system returns to a normal state, and the root cause of the fault is identified as the aeration rate.

5. Conclusions

The k-NN method can be applied effectively in the field of fault detection, and when a fault occurs, it is very important to identify all abnormal variables. This paper proposes a data reconstruction method based on CNN theory. As shown by the numerical simulation, compared with the mean k-NN, k-1NN, and weighted k-NN data reconstruction methods, the CNN method has higher data reconstruction accuracy, and k is obtained by a reasonable adaptive calculation, so there is no need to choose the value of k based on experience. The effectiveness and applicability of the proposed method for abnormal variable identification were validated using the penicillin fermentation process. Building on this work, the next step is to study the correlations between variables, obtain a timely understanding of the direction of fault propagation, and accurately locate all abnormal variables during faults.

Author Contributions

Conceptualization, G.W.; methodology, G.W., R.Z., F.L. and X.L.; software, R.Z. and F.L.; validation, G.W., R.Z. and F.L.; formal analysis, R.Z.; investigation, G.W., R.Z. and F.L.; resources, G.W., X.L. and X.Z.; writing—review and editing, X.L. and X.Z.; visualization, R.Z. and F.L.; supervision, X.L. and X.Z.; project administration, G.W.; funding acquisition, G.W. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Open Project of State Key Laboratory of Industrial Control Technology, Zhejiang University (No. ICT2025B35), Joint Fund of Henan Science and Technology Research and Development Plan in 2023 (No. 232103810028), and Henan Institute of Technology High Level Talent Fund Project (No. KQ1806).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sun, H.; Zhang, S.M.; Zhao, C.H.; Gao, F.R. A sparse reconstruction strategy for online fault diagnosis in nonstationary processes with no a priori fault information. Ind. Eng. Chem. Res. 2017, 56, 6993–7008. [Google Scholar] [CrossRef]
  2. Kanno, Y.; Kaneko, H. Deep convolutional neural network with deconvolution and a deep auto encoder for fault detection and diagnosis. ACS Omega 2022, 7, 2458–2466. [Google Scholar] [CrossRef] [PubMed]
  3. Zhao, C.H.; Huang, B. A full-condition monitoring method for nonstationary dynamic chemical processes with cointegration and slow feature analysis. AIChE J. 2018, 64, 1662–1681. [Google Scholar] [CrossRef]
  4. Guo, J.Y.; Zhao, W.J.; Li, Y. Fault detection in multi-modal processes based on the local entropy double subspace. Trans. Inst. Meas. Control 2022, 45, 1323–1336. [Google Scholar] [CrossRef]
  5. Truong, H.V.A.; Phan, V.D.; Tran, D.T.; Ahn, K.K. A novel observer-based neural-network finite-time output control for high-order uncertain nonlinear systems. Appl. Math. Comput. 2024, 475, 128699. [Google Scholar] [CrossRef]
  6. Phan, V.D.; Truong, H.V.A.; Le, V.C.; Ho, S.P.; Ahn, K.K. Adaptive neural observer-based output feedback anti-actuator fault control of a nonlinear electro-hydraulic system with full state constraints. Sci. Rep. 2025, 15, 3044. [Google Scholar] [CrossRef]
  7. Liu, Q.; Zhuo, J.; Lang, Z.Q.; Qin, S.J. Perspectives on data-driven operation monitoring and self-optimization of industrial processes. Acta Autom. Sin. 2018, 44, 1944–1956. [Google Scholar]
  8. Ji, H.Q.; He, X.; Zhou, D.H. Fault detection techniques based on multivariate statistical analysis. J. Shanghai Jiaotong Univ. 2015, 49, 842–848. [Google Scholar]
  9. Ji, H.Q. Data-driven sensor fault diagnosis under closed-loop control with slow feature analysis. IEEE Sens. J. 2022, 22, 24299–24308. [Google Scholar] [CrossRef]
  10. Shang, L.L.; Lu, Z.L.; Wen, C.B. Canonical residual based incipient fault detection and diagnosis for chemical process. Control Theory Appl. 2021, 38, 1247–1256. [Google Scholar]
  11. Dong, Y.N.; Qin, S.J. A novel dynamic PCA algorithm for dynamic data modeling and process monitoring. J. Process Control 2018, 67, 1–11. [Google Scholar] [CrossRef]
  12. Chen, Z.W.; Liu, C.; Ding, S.; Peng, T.; Shardt, Y.R. A Just-in-time-learning aided canonical correlation analysis method for multimode process monitoring and fault detection. IEEE Trans. Ind. Electron. 2021, 68, 5259–5270. [Google Scholar] [CrossRef]
  13. Zhang, C.; Gao, X.W.; Li, Y. Fault detection strategy based on principal component score difference of k nearest neighbors. Acta Autom. Sin. 2020, 46, 2229–2238. [Google Scholar]
  14. Zhang, Y.; Li, S.; Teng, Y. Dynamic processes monitoring using recursive kernel principal component analysis. Chem. Eng. Sci. 2012, 72, 78–86. [Google Scholar] [CrossRef]
  15. Zhang, Y.; Hu, Z. Multivariate process monitoring and analysis based on multi-scale KPLS. Chem. Eng. Res. Des. 2011, 89, 2667–2678. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Hu, Z. On-line batch process monitoring using hierarchical kernel partial least squares. Chem. Eng. Res. Des. 2011, 89, 2078–2084. [Google Scholar] [CrossRef]
  17. Kourti, T.; MacGregor, J.F. Multivariate SPC methods for process and product monitoring. J. Qual. Technol. 1996, 28, 409–428. [Google Scholar] [CrossRef]
  18. Dunia, R.; Qin, S.J. Subspace approach to multidimensional fault identification and reconstruction. AIChE J. 1998, 44, 1813–1831. [Google Scholar] [CrossRef]
  19. Yue, H.H.; Qin, S.J. Reconstruction-based fault identification using a combined index. Ind. Eng. Chem. Res. 2001, 40, 4403–4414. [Google Scholar] [CrossRef]
  20. He, Q.P.; Wang, J. Fault detection using the k-nearest neighbor rule for semiconductor manufacturing processes. IEEE Trans. Semicond. Manuf. 2007, 20, 345–354. [Google Scholar] [CrossRef]
  21. He, Q.P.; Wang, J. Large-scale semiconductor process fault detection using a fast pattern recognition-based method. IEEE Trans. Semicond. Manuf. 2010, 23, 194–200. [Google Scholar] [CrossRef]
  22. Guo, X.P.; Yuan, J.; Li, Y. Feature space k nearest neighbor based batch process monitoring. Acta Autom. Sin. 2014, 40, 135–142. [Google Scholar]
  23. Wang, G.Z.; Li, J.; Hu, Y.T. Fault identification of chemical processes based on k-NN variable contribution and CNN data reconstruction methods. Sensors 2019, 19, 929. [Google Scholar] [CrossRef] [PubMed]
  24. Yoo, C.K.; Lee, J.M.; Vanrolleghem, P.A. On-line monitoring of batch processes using multi-way independent component analysis. Chemom. Intell. Lab. Syst. 2004, 71, 151–163. [Google Scholar] [CrossRef]
  25. Jia, Z.Y.; Wang, P.; Gao, X.J. Process monitoring and fault diagnosis of penicillin fermentation based on improved MICA. Adv. Mater. Res. 2012, 591, 1783–1788. [Google Scholar] [CrossRef]
Figure 1. Process flowchart of k-NN modeling and fault detection.
Figure 2. Detailed diagram of variable structure.
Figure 3. Data distribution chart. ○ (A, B, C, D): sample points to be reconstructed; ●: normal sample points.
Figure 4. Process flowchart of k-NN modeling, fault detection, and fault variable reconstruction.
Figure 5. Comparison between the actual and the reconstructed values.
Figure 6. Fault detection results of test data.
Figure 7. Fault detection results after reconstructing each variable (time 101–150).
Figure 8. Fault detection results after reconstructing each variable (time 401–450).
Figure 9. Fault detection results after reconstructing variables $x_2$ and $x_7$.
Figure 10. Process flowchart of penicillin fermentation.
Figure 11. Fault detection results: (a) test data 1; (b) test data 2.
Figure 12. Contribution values of all variables: (a,b) test data 1; (c,d) test data 2.
Figure 13. Comparison between reconstructed data of test data 1 and normal data.
Figure 14. Fault detection result after reconstructing the variable aeration rate.
Table 1. Initial Conditions for Normal Process.
Variable (Unit): Initial Value
Concentration of culture medium (g/L): 15
Reactor liquid level (L): 100
CO2 concentration (mmol/L): 0.5
Hydrogen ion concentration (mol/L): 10^(−5.1)
Temperature (K): 297
Dissolved oxygen concentration (g/L): 1.16
Biomass concentration (g/L): 0.1
Penicillin concentration (g/L): 0
Table 2. Monitoring Variables for Penicillin Process.
No.: Measured Variable
1: Aeration rate
2: Agitator power
3: Substrate feed rate
4: Substrate concentration
5: Dissolved oxygen concentration
6: Biomass concentration
7: Penicillin concentration
8: Culture volume
9: CO2 concentration
10: pH
11: Temperature
12: Generated heat
13: Acid flow rate
14: Base flow rate
15: Cold water flow rate
16: Hot water flow rate
Table 3. Description of penicillin fermentation process malfunctions.
Fault No.: Fault Type; Corresponding Process Variable
1: Step/Slope; Aeration rate
2: Step/Slope; Agitator power
3: Step/Slope; Substrate feed rate
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
