Article

A Matrix Completion Method for Imputing Missing Values of Process Data

Institute of Process Systems Engineering, Qingdao University of Science and Technology, Qingdao 266042, China
*
Authors to whom correspondence should be addressed.
Processes 2024, 12(4), 659; https://doi.org/10.3390/pr12040659
Submission received: 27 February 2024 / Revised: 21 March 2024 / Accepted: 23 March 2024 / Published: 26 March 2024
(This article belongs to the Section Process Control and Monitoring)

Abstract

Real-time process data are the foundation for the successful implementation of intelligent manufacturing in the chemical industry. However, in the actual production process, process data may be randomly missing for various reasons, affecting the practical application of intelligent manufacturing technology. This paper therefore proposes applying appropriate matrix completion algorithms to impute the missing values of real-time process data. Considering the characteristics of the online missing value imputation problem, we propose an improvement to matrix completion algorithms that makes them suitable for real-time missing data imputation. Using real device data, we study the impact of algorithm parameters on imputation accuracy and compare the approach with several classical missing value imputation methods. The results show that the introduced method achieves higher imputation accuracy than the baseline methods, and the proposed enhancement significantly improves algorithm speed.

1. Introduction

As intelligent manufacturing in the chemical industry gains momentum, the significance of real-time process data escalates. However, transmission interruptions, sensor failures, or database issues may cause random missing data in real-time process data during the actual production process [1], thereby reducing the representativeness of real-time data and affecting the implementation of intelligent manufacturing in the chemical industry [2]. Therefore, the imputation of missing values is critically essential for the successful execution of intelligent manufacturing within the chemical industry.
We commonly use three methods for the analysis of data with missing values: the deletion method, the imputation method, and the model-based method [3]. Commonly used deletion methods for handling missing data include listwise deletion and pairwise deletion. Listwise deletion removes all data associated with time points that exhibit anomalies. This may discard significant amounts of data, diminishing statistical efficiency and potentially introducing more uncertainty and bias into parameter estimates [4]. Pairwise deletion, on the other hand, may produce inconsistent data lengths because the deleted data occur at varying time points, making it impossible to reconstruct and represent the complete dataset as a matrix. This inconsistency hampers the application of many algorithms, complicating subsequent process modeling and statistical inference for monitoring.
The data processed using deletion methods are illustrated in Figure 1.
Thus, when a significant amount of data is missing and the correlation between variables is affected, it is not advisable to use deletion methods. To maximize the number of samples and preserve statistical characteristics for subsequent data mining, it is advisable to employ the strategy of estimating and imputing missing data. While manually selecting values for imputation is possible, automated methods become indispensable for processing the real-time data involved in the deployment of intelligent manufacturing [5]. Common strategies for imputation involve using statistical values, such as the mean or median values of variables [6], or employing a similar pattern from complete data, such as hot-deck imputation [2]. Model-based methods often involve algorithms, such as the Maximum Likelihood (ML) algorithm [7] and the Expectation-Maximization (EM) algorithm [8].
To address the issue of missing values in chemical process data, this paper employs a method from machine learning: the Matrix Completion Method (MCM). Due to its remarkable ability to discover and leverage latent patterns or structures, the MCM has found applications in recommendation systems, image restoration, analytical chemistry, thermodynamics, and quantum chemistry. In chemical engineering, the MCM has been used to predict the thermodynamic properties and the models of mixtures. For example, it has been incorporated into the UNIQUAC model to predict activity coefficients at any temperature and composition, surpassing the performance of the best physical models for predicting activity coefficients [9]. Additionally, the MCM has been used to predict Henry’s law coefficients based on the sparsity of experimental data matrices, outperforming traditional state equations [10]. It has also found application in calculations involving quantum and variational effects [11].
This paper will introduce existing methods for imputing missing data in Section 2. Section 3 briefly introduces the principle of matrix completion and improves the existing Matrix Completion Method according to the characteristics of the online data imputing problem. Section 4 will evaluate the efficacy of the MCM in processing missing data, thus providing a comparative discussion with other methods. Finally, Section 5 concludes this study.

2. Established Methods for Data Imputation

In handling chemical process data, routine methods for imputing missing values involve utilizing existing data for imputation or employing model-based methods. The former includes mean imputation, median imputation, and hot-deck imputation. These methods aim to recover or estimate missing data values based on available information, thus enhancing the accuracy of portraying the genuine characteristics of the data for further analysis and exploration. The latter encompasses various model-based methods, including the ML and EM algorithms. These methods impute missing values by establishing models and generating data that adhere to specific distributions for imputation. The selection of each method depends on the particular data characteristics and analytical requirements.

2.1. Mean and Median Replacement

Imputing missing data using the mean of data from different time points on the same sensor offers fast processing speed. Let $\bar{x}_{\mathrm{obs}}$ represent the mean of the observed data. The missing data $x_{\mathrm{mis}}$ are formulated as follows:
$$x_{\mathrm{mis}} = \bar{x}_{\mathrm{obs}} \tag{1}$$
Imputing missing data using the median of data from different time points on the same sensor provides a processing speed similar to mean imputation. Let $Me_{\mathrm{obs}}$ represent the median of the observed data. The missing data $x_{\mathrm{mis}}$ are given by:
$$x_{\mathrm{mis}} = Me_{\mathrm{obs}} \tag{2}$$
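As a minimal sketch (illustrative code, not the paper's implementation), both replacement rules above can be written as a single helper that fills a sensor's missing readings with the chosen statistic of its observed values:

```python
import numpy as np

def impute_statistic(x, strategy="mean"):
    """Fill NaNs in one sensor's series with the mean or median of
    its observed values, matching the replacement rules above."""
    x = np.asarray(x, dtype=float)
    obs = x[~np.isnan(x)]                       # observed data only
    fill = np.mean(obs) if strategy == "mean" else np.median(obs)
    return np.where(np.isnan(x), fill, x)
```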

2.2. Hot-Deck Replacement

The hot-deck imputation method for processing missing data involves identifying an object in the existing data that is most similar to the missing data and then imputing the missing values with the values from this similar object. In defining similarity, this experiment treats the data obtained from each time point as individual entities. Comparing the average absolute differences in the observed data $x_{\mathrm{obs}}$ among different entities serves as the measure for determining similarity. Assuming there are $n$ sensors of observed data at time point 1 and time point 2, we can express their level of similarity $X_s$ as follows:
$$X_s = \frac{\sum_{i=1}^{n} \left| x_{\mathrm{obs1},i} - x_{\mathrm{obs2},i} \right|}{n} \tag{3}$$
When $X_s$ is small, indicating a higher level of similarity, the strategy uses $x_{\mathrm{obs2}}$ from time point 2 to impute the missing data $x_{\mathrm{mis1}}$ for time point 1:
$$x_{\mathrm{mis1}} = x_{\mathrm{obs2}} \tag{4}$$
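A sketch of this similarity-based replacement (a hypothetical helper of my own, assuming data arranged as sensors × time points with NaN marking missing values):

```python
import numpy as np

def hot_deck_impute(data):
    """For each time point (column) with missing sensors, borrow values
    from the most similar fully observed time point; similarity is the
    mean absolute difference over the jointly observed sensors."""
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    complete = [k for k in range(data.shape[1])
                if not np.isnan(data[:, k]).any()]
    for j in range(data.shape[1]):
        miss = np.isnan(data[:, j])
        if not miss.any() or not complete:
            continue
        obs = ~miss
        # X_s for each candidate donor column
        dist = [(np.mean(np.abs(data[obs, j] - data[obs, k])), k)
                for k in complete]
        _, best = min(dist)
        filled[miss, j] = data[miss, best]
    return filled
```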

2.3. Model-Based Methods

Model-based methods include the ML (Maximum Likelihood) and EM (Expectation-Maximization) algorithms. The ML algorithm assumes a dataset $X$ whose distribution is described by parameters $\Theta$; $\Theta$ is dictated by the statistical distribution of the data. Given observed data $x_{\mathrm{obs}}$ and missing data $x_{\mathrm{mis}}$, the marginal probability density of $x_{\mathrm{obs}}$ is defined as follows:
$$P(x_{\mathrm{obs}} \mid \Theta) = \int P(x_{\mathrm{obs}}, x_{\mathrm{mis}} \mid \Theta)\, \mathrm{d}x_{\mathrm{mis}} \tag{5}$$
The likelihood of $\Theta$ based on $x_{\mathrm{obs}}$ is defined as follows:
$$L(\Theta \mid x_{\mathrm{obs}}) \propto P(x_{\mathrm{obs}} \mid \Theta) \tag{6}$$
If the likelihood function is differentiable, the estimate of $\Theta$ can be obtained by solving the problem illustrated in Equation (7), after which $x_{\mathrm{mis}}$ can be estimated from the fitted distribution:
$$\frac{\partial \log L(\Theta \mid x_{\mathrm{obs}})}{\partial \Theta} = 0 \tag{7}$$
The EM algorithm consists of two steps: an E-step and an M-step. In the E-step, the expectation of the log-likelihood is computed over the conditional distribution of $x_{\mathrm{mis}}$ given $x_{\mathrm{obs}}$ under the current estimate $\Theta^{(m)}$:
$$E(\Theta \mid \Theta^{(m)}) = E_{x_{\mathrm{mis}} \mid x_{\mathrm{obs}},\, \Theta^{(m)}}\!\left[ \log L(\Theta \mid x_{\mathrm{obs}}, x_{\mathrm{mis}}) \right] \tag{8}$$
During the M-step, the goal is to determine the parameters $\Theta^{(m+1)}$ that maximize the expectation $E$:
$$\Theta^{(m+1)} = \arg\max_{\Theta} E(\Theta \mid \Theta^{(m)}) \tag{9}$$
The EM algorithm iterates between these two steps until the estimated values converge.
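To make the E/M alternation concrete, the toy sketch below (synthetic data and simplifications of my own, not the paper's setup) fits a bivariate Gaussian in which the second coordinate is partly missing: the E-step replaces missing values with their conditional expectation given the observed coordinate, and the M-step refits the moments from the completed data. The conditional-variance correction term in the covariance update is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0.0, 1.0, n)
x2 = 2.0 * x1 + rng.normal(0.0, 0.5, n)   # true relation: x2 ~ 2*x1
mask = rng.random(n) < 0.3                # True where x2 is missing
x2_obs = np.where(mask, np.nan, x2)

mu = np.zeros(2)
cov = np.eye(2)
x2_hat = np.where(mask, 0.0, x2_obs)
for _ in range(50):
    # E-step: E[x2 | x1] = mu2 + (cov12 / cov11) * (x1 - mu1)
    cond = mu[1] + cov[0, 1] / cov[0, 0] * (x1 - mu[0])
    x2_hat = np.where(mask, cond, x2_obs)
    # M-step: re-estimate mean and covariance from the completed data
    completed = np.column_stack([x1, x2_hat])
    mu = completed.mean(axis=0)
    cov = np.cov(completed, rowvar=False)
```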
The methods previously discussed, such as mean and median imputation, are simple to compute but lack accuracy; they use no information beyond the data collected by the current sensor, so the available information is used inefficiently. Hot-deck imputation draws on information beyond the current sensor, but it still imputes with existing values, so a single observed value may be reused across multiple time points. For real-time data, the distribution is difficult to predict in advance, and without a definite data distribution the likelihood function of the ML algorithm becomes complex to derive. While the EM algorithm does not require an explicit likelihood expression, it still assumes a specific data distribution and is slow to compute.
The MCM can effectively address the issues with the previously mentioned algorithms. The MCM leverages the hidden relationships within the data to calculate new values for imputing missing ones, thus eliminating the need to derive expressions or make assumptions about data distribution.

3. Data Imputation with Improved Matrix Completion Methods

3.1. Principles of Matrix Completion

Matrix completion problems can be considered a type of matrix recovery problem. Specifically, they involve restoring the missing elements in the matrix based on a limited number of known elements [12,13]. Typical applications include estimating missing data, generating recommendations, uncovering hidden structures, conducting image restoration, and classification [14,15,16].
Matrix completion aims to recover the entire matrix using a limited amount of observed data. We record the positions of the observed data in the set $\Omega$ and search for a matrix $X$ that agrees with the input data matrix $M$ at the known positions and has the minimum rank. Low-rank matrices carry a substantial amount of redundant information, which can be used to recover the missing entries. The operator $P_{\Omega}$ sets all elements outside $\Omega$ to zero and leaves the elements in $\Omega$ unchanged. The problem can therefore be formulated as follows:
$$\min_{X} \ \operatorname{rank}(X), \quad \text{s.t.} \ P_{\Omega}(X) = P_{\Omega}(M) \tag{10}$$
However, the previously discussed problem is an NP-hard problem [17], which is computationally complex and difficult to solve directly. Consequently, it requires replacement with a problem that is easier to solve.
There are various models for matrix completion. These include, for example, models based on nuclear norm relaxation, in which Ma [18] and Toh [19] relaxed the standard problem into a matrix LASSO model. Additionally, the SVT algorithm proposed by Cai et al. [20] enhances the stability of solving matrix completion problems. Another model is based on matrix factorization. For example, the SOR algorithm proposed by Wen et al. [21] can handle large-scale matrix completion problems faster than traditional nuclear norm minimization algorithms. However, this approach requires an initial rank estimate, and due to the non-convex nature of the model, it cannot assure global convergence. Finally, models based on non-convex function relaxation are also an option. These include, for example, the algorithm proposed by Nie et al. [22] using Schatten p-norm and Lp-norm, and the FGSR algorithm proposed by Fan et al. [23]. Moreover, the FGSR algorithm avoids SVD decomposition, resulting in higher computational efficiency, and is insensitive to the choice of initial rank.
Due to the challenge of estimating the initial rank of the target data, this study selects the SVT (Singular Value Thresholding) algorithm, which does not require initial rank estimation, and the FGSR (Factor Group-Sparse Regularization) algorithm, which is not sensitive to the initial rank. We improve these algorithms by incorporating features for online use to impute missing values.
The SVT algorithm applies convex relaxation to problem (10), transforming it into the following nuclear norm minimization:
$$\min_{X} \ \|X\|_{*}, \quad \text{s.t.} \ P_{\Omega}(X) = P_{\Omega}(M) \tag{11}$$
A Lagrangian function is constructed for this regularized optimization problem, which is then solved by alternating iterations of the following form:
$$\begin{cases} X^{k} = D_{\tau}\left(Y^{k-1}\right) \\ Y^{k} = Y^{k-1} + \delta_{k} P_{\Omega}\left(M - X^{k}\right) \end{cases} \tag{12}$$
Here, $Y$ is the Lagrange multiplier, $\delta_{k}$ is the step size, and the shrinkage operation $D_{\tau}(Y^{k-1})$ is given by:
$$D_{\tau}\left(Y^{k-1}\right): \begin{cases} [U, S, V] = \operatorname{SVD}\left(Y^{k-1}\right) \\ \hat{S} = \operatorname{sgn}(S) \max\left(|S| - \tau,\, 0\right) \\ X^{k} = U \hat{S} V^{T} \end{cases} \tag{13}$$
The iteration terminates when the following condition holds, where $\varepsilon$ is the convergence error:
$$\frac{\left\| P_{\Omega}\left(M - X^{k}\right) \right\|_{F}}{\left\| P_{\Omega}(M) \right\|_{F}} < \varepsilon \tag{14}$$
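The iteration above translates directly into code. The following is a minimal sketch of the SVT loop (the function name and defaults are mine, not the paper's implementation):

```python
import numpy as np

def svt_complete(M, omega, tau, delta=1.0, tol=1e-6, max_iter=300):
    """Impute missing entries of M by singular value thresholding.

    M     -- matrix with arbitrary values at missing positions
    omega -- boolean mask, True where M is observed
    """
    Y = np.zeros_like(M, dtype=float)
    X = np.zeros_like(M, dtype=float)
    norm_obs = np.linalg.norm(M[omega])
    for _ in range(max_iter):
        # shrinkage step: soft-threshold the singular values of Y
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
        # dual update on the observed entries only
        residual = np.where(omega, M - X, 0.0)
        Y = Y + delta * residual
        if np.linalg.norm(residual[omega]) / norm_obs < tol:
            break
    return X
```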
The FGSR algorithm uses other quantities as proxies for the rank, transforming problem (10) into the following form:
$$\min_{X} \ \operatorname{FGSR}(X), \quad \text{s.t.} \ P_{\Omega}(X) = P_{\Omega}(M) \tag{15}$$
FGSR(X) can be expressed as follows:
$$\operatorname{FGSR}(X) = \tfrac{2}{3} \alpha^{1/3} \min_{AB = X} \ \|A\|_{2,1} + \tfrac{\alpha}{2} \|B\|_{F}^{2} \tag{16}$$
The factors $A$ and $B$ can be represented as follows:
$$\begin{cases} [U_X, S_X, V_X] = \operatorname{SVD}(X) \\ A = \alpha^{-1/3}\, U_X S_X^{2/3} \\ B = \alpha^{1/3}\, S_X^{1/3} V_X^{T} \end{cases} \tag{17}$$
The problem can then be rewritten as:
$$\min_{A, B} \ \|A\|_{2,1} + \tfrac{\alpha}{2} \|B\|_{F}^{2}, \quad \text{s.t.} \ X = AB, \ P_{\Omega}(X) = P_{\Omega}(M) \tag{18}$$
We can solve the matrix completion problem by solving problem (18).

3.2. Data Imputation Based on Matrix Completion

The MCM is used to impute missing values within chemical processes. The fundamental principle is that data from chemical processes include control variable data $X_{\mathrm{con}}$ and display variable data $X_{\mathrm{var}}$. The data are in matrix form, with one dimension representing time and the other representing sensor identification. The display variable data $X_{\mathrm{var}}$ are returned after being collected by sensors at various positions in the production device and include observed data $x_{\mathrm{obsv}}$ and missing data $x_{\mathrm{mis}}$ caused by transmission or sensor failure. Control variable data $X_{\mathrm{con}}$ are set manually and contain observed data $x_{\mathrm{obsc}}$. Consequently, the raw data can be viewed as a matrix $M$ with missing elements composed of $x_{\mathrm{obsv}}$, $x_{\mathrm{obsc}}$, and $x_{\mathrm{mis}}$. The form of matrix $M$ is shown in Figure 2.
From this perspective, the problem of filling in missing values in chemical process data can be transformed into a matrix completion problem, as follows:
$$X = f_{\mathrm{MCM}}(M) \tag{19}$$
Here, X represents the dataset after filling, and the specific process of data processing is shown in Figure 3:
Given that the equipment in chemical plants operates continuously, generating a constant stream of new data, we have adopted a moving window strategy for online data cleansing in this study. A moving window can reduce the size of the input matrix, thereby accelerating the speed of matrix completion for data processing.
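The moving window strategy can be sketched as follows (illustrative code; `complete_fn` stands in for any matrix completion routine such as SVT or FGSR, and all names here are assumptions of mine):

```python
import numpy as np
from collections import deque

def online_impute(stream, window_size, complete_fn):
    """Sliding-window online imputation sketch.

    stream      -- iterable of 1-D arrays: one column of sensor readings
                   per time point, with NaN where values are missing
    complete_fn -- any matrix completion routine taking (M, omega),
                   where omega is True at observed positions
    Yields the imputed column for each incoming time point.
    """
    window = deque(maxlen=window_size)
    for col in stream:
        window.append(np.asarray(col, dtype=float))
        M = np.column_stack(window)   # sensors x (current window length)
        omega = ~np.isnan(M)
        X = complete_fn(np.nan_to_num(M), omega)
        # keep observed values; impute only the missing ones
        yield np.where(omega[:, -1], M[:, -1], X[:, -1])
```

Because consecutive windows overlap heavily, the completion routine can also be warm-started from the previous window's result, which is the idea behind the improvements in Section 3.3.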
In terms of the selection of parameters used in the algorithm, considering the balance between accuracy and computation time, the convergence error ε of the algorithm is set to ε < 10−6, and the maximum number of iterations is set to 300.
The parameters of the SVT algorithm include the step size δ and the parameter τ. The conventional range of the step size δ is 0~2, and this study uses the middle value δ = 1. The parameter τ is selected empirically, following reference [20], which sets τ for an n × n matrix to 5n. In this study, for a matrix of size n1 × n2, the parameter τ is determined as 5 × sqrt(n1 × n2).
For the FGSR algorithm, the parameters include λ, d, and α. The step size λ is set to 0.3. Parameter d represents the rank of the factor matrix after matrix decomposition, with selection range d ≤ min(n1, n2), where n1 and n2 are the row and column numbers of the matrix; in this study, d = min(n1, n2). Parameter α, which adjusts the scale of the two factor matrices after decomposition, is chosen from the range α > 0; in this experiment, we set α = 2. The rationale for these parameter choices is discussed in Section 4.3.

3.3. Improved Matrix Completion for Online Computing

In the application of the matrix completion algorithm, there is a need to initialize some data involved in the iterative computation. When processing data using a sliding window, most of the data in two adjacent windows are the same. Consequently, we consider using the data from the last iteration of the previous window’s computation as the initial data for the next window to reduce the number of iterative computations for the new window.
We posit that the data window currently being processed is the $i$-th window. The $D_{\tau}$ operation from Equation (13) of the SVT algorithm is improved, leading to the $D^{n}_{\tau}$ operation shown in Equation (20):
$$D^{n}_{\tau}\left(Y^{k-1}\right): \begin{cases} [U, S, V] = \operatorname{SVD}\left(Y^{k-1}\right), & \text{if } i = 1, 3, 5, \ldots \\ [U, S, V] = \left[U_{i-1}, S_{i-1}, V_{i-1}\right], & \text{if } i = 2, 4, 6, \ldots \\ \hat{S} = \operatorname{sgn}(S) \max\left(|S| - \tau,\, 0\right) \\ X^{k} = U \hat{S} V^{T} \end{cases} \tag{20}$$
$U_{i-1}$, $S_{i-1}$, and $V_{i-1}$ are taken from the final iteration of the prior window. By implementing this improvement and capitalizing on the advantages of online processing, we minimize the number of SVD decompositions, lowering the overall time expenditure while maintaining precision. In the following text, we refer to the SVT algorithm incorporating these changes as the ISVT (Improved Singular Value Thresholding) algorithm.
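A sketch of the warm-started shrinkage step (a hypothetical helper of my own illustrating the ISVT idea of reusing the previous window's SVD factors):

```python
import numpy as np

def isvt_step(Y, tau, cached_svd=None):
    """One D_tau shrinkage step with optional warm start.

    If cached_svd = (U, s, Vt) from the previous window is supplied,
    the SVD is skipped, trading a little accuracy for speed."""
    if cached_svd is None:
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    else:
        U, s, Vt = cached_svd
    X = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
    return X, (U, s, Vt)
```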
For the FGSR algorithm, we begin iterative computation upon each window transition and, according to Equation (17), use $A_{i-1}$ and $B_{i-1}$ from the concluding iteration of the prior window when initializing the matrices $A_i$ and $B_i$ for the $i$-th window's computation ($i \neq 1$). The modified initialization of $A$ and $B$ is denoted by Equation (21):
$$\begin{cases} [A, B] = \left[A_{i-1}, B_{i-1}\right], & \text{if } i \neq 1 \\ [U_X, S_X, V_X] = \operatorname{SVD}(X), & \text{if } i = 1 \\ A = \alpha^{-1/3}\, U_X S_X^{2/3} \\ B = \alpha^{1/3}\, S_X^{1/3} V_X^{T} \end{cases} \tag{21}$$
Such a practice can effectively decrease the frequency of SVD decompositions and accelerate the computational speed. In the subsequent sections, we will refer to the modified FGSR algorithm as the IFGSR (Improved Factor Group-Sparse Regularization) algorithm.

4. Performance Evaluations

4.1. Studied Case

The experimental data are derived from the DCS system of an atmospheric–vacuum distillation unit in a particular refinery. The atmospheric–vacuum distillation unit, an essential device that finds wide application in areas including petroleum and chemical engineering, plays a significant role in improving oil refining optimization and economic efficiency.
The DCS system records assorted data from the apparatus, including the feed temperature, tower top pressure, and side draw flow rate of the distillation column within the unit. The data referenced in this paper are logged every five minutes, with the initial data being complete. The initial data are divided into two groups: control variable data $X_{\mathrm{con}}$ and display variable data $X_{\mathrm{var}}$. The temperature (TI), pressure (PI), and flow rate (FI) data gathered by the DCS serve as display variable data $X_{\mathrm{var}}$, while the input data PIC, TIC, and FIC for the control system function as control variable data $X_{\mathrm{con}}$.
All data preprocessing and subsequent processing tasks are accomplished using MatlabR2021b. The program is operated on a laptop with a 3.20 GHz CPU and 16 GB of memory.

4.2. Evaluation Procedures

This study selects three common methods for processing missing values: mean imputation, median imputation, and hot-deck imputation. Additionally, four matrix completion algorithms, SVT, ISVT, FGSR, and IFGSR, are used to impute the missing values.
In the experiment, we selected data from 600 time points, which formed an original data matrix of size ns × 600, where ns represents the number of sensors. The window size for processing the data is determined by the matrix’s row count, which corresponds to the sensor number ns. The window size ranges from [0.8ns] to [2.0ns], where [∙] represents rounding up. The missing rate for variable data ranges from 10% to 80%, and the positions of the missing data are randomly determined.
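The random corruption step of this setup can be sketched as follows (illustrative helper; the names are mine):

```python
import numpy as np

def make_missing(data, missing_rate, seed=0):
    """Randomly delete a fraction of entries, returning the corrupted
    matrix (NaN at missing positions) and the observed-entry mask."""
    rng = np.random.default_rng(seed)
    missing = rng.random(data.shape) < missing_rate
    return np.where(missing, np.nan, data), ~missing
```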
The number of sensors displaying variable data for temperature, pressure, and flow rate differs. The specific details are shown in Table 1.
As different variables have data of different scales and the range of data values varies, we need to perform experiments on each variable independently to evaluate the performance of algorithms.
The detailed steps of the experiment are as follows:
  • First, we convert the complete dataset into a matrix form suitable for algorithm processing. Each column of the data represents the measured values of the variables collected at the current time, and each row represents the sensor number transmitting these variable data.
  • The experiment determines the relevant parameters and the size of the moving window.
  • We set the missing rate and randomly generate missing data in the complete display variable data X var . After preprocessing, the data we obtain will serve as experimental data.
  • We use the MCM to fill in the experimental data.
  • We compare the output with the original data to evaluate the effects of different methods.
Concerning the selection of normalization methods, this experiment opted for quantile normalization. The formula for this approach is as follows:
$$X_{\mathrm{scaled}} = \frac{X - Q_{1}(X)}{Q_{3}(X) - Q_{1}(X)} \tag{22}$$
where $X_{\mathrm{scaled}}$ represents the normalized data, $X$ denotes the original data, and $Q_{1}$ and $Q_{3}$ are the first and third quartiles. Quantile normalization scales data using quartiles, which effectively reduces the impact of outliers.
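A direct transcription of the quartile scaling above (a sketch; `np.nanpercentile` is used so that missing values do not distort the quartiles):

```python
import numpy as np

def quantile_scale(x):
    """Scale data by (x - Q1) / (Q3 - Q1); robust to outliers because
    quartiles ignore extreme values."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.nanpercentile(x, [25, 75])
    return (x - q1) / (q3 - q1)
```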
The evaluation criterion employed is MAPE (Mean Absolute Percentage Error), which measures the relative error between imputed values and actual values. MAPE results are expressed as percentages, with smaller values indicating higher prediction accuracy; a MAPE of 0 signifies perfectly accurate imputed values. The average MAPE over all data within a single time point is the MAPE for that time point, denoted $\mathit{MAPE}_t$. Its formula is:
$$\mathit{MAPE}_{t} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{X_{i} - X_{p,i}}{X_{i}} \right| \tag{23}$$
where $X_{i}$ represents the actual values of the data, $X_{p,i}$ represents the imputed values after MCM processing, and $n$ is the total count of data points. Subsequently, the MAPE over all time points, denoted $\mathit{MAPE}_a$, is computed as:
$$\mathit{MAPE}_{a} = \frac{\sum_{t=1}^{n_{t}} \mathit{MAPE}_{t}}{n_{t}} \tag{24}$$
where $n_{t}$ represents the number of time points after matrix completion processing.
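The two error measures translate directly to code (sketch):

```python
import numpy as np

def mape_t(actual, imputed):
    """MAPE for a single time point, in percent."""
    actual = np.asarray(actual, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return 100.0 * np.mean(np.abs((actual - imputed) / actual))

def mape_a(per_point_mapes):
    """Average MAPE over all processed time points."""
    return float(np.mean(per_point_mapes))
```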
Finally, using MAPE as the standard for evaluating the accuracy of missing value imputation, the MCM is compared with the other methods.

4.3. Results and Discussion

In this section, we group the algorithms into the MCM, the statistical value imputation methods, and the hot-deck method. The initial step is a sensitivity analysis of the parameters within the four algorithms implemented in the MCM.
For the SVT and ISVT algorithms, parameter testing is conducted using the SVT algorithm and by applying the same parameter settings to both in subsequent experiments. We use the standard parameter setting range from reference [20] to adjust and test the value of parameter τ and the size of step length δ. Figure 4 illustrates the results of the test.
As depicted in Figure 4, the errors in pressure and temperature data tend to stabilize when δ > 0.25. Similarly, the error in flow rate data stabilizes after δ > 0.75, reaching a lower value around δ = 1. Therefore, we choose δ = 1 as the step size for the SVT and ISVT algorithms for subsequent testing.
For the FGSR and IFGSR algorithms, parameter testing is conducted using the FGSR algorithm, and the same parameter settings are applied to both in subsequent experiments. The parameters of the FGSR algorithm include d, α, and the step size λ. The initial settings are α = 1 and λ = 0.03. The parameter d is related to the initial rank estimate. According to reference [23], the initial rank estimate has only a slight effect on the FGSR algorithm, so we directly choose its maximum selectable value, min(n1, n2), the smaller of the number of rows and columns within the window of the data matrix.
In the selection of parameters, this study chooses parameters that exhibit stable performance in processing three types of variable data.
When λ = 0.03, we evaluate the influence of changing the parameter α on the data processing results. The results are illustrated in Figure 5.
As shown in Figure 5, the MAPE of the three variables after processing does not change significantly for α > 0.5. In this study, we choose the midpoint value of 2 from the range 0~4 as the value of α for the following experiments.
Setting α = 2, we evaluate the influence of the step increment λ on the results. The results are illustrated in Figure 6.
As illustrated in Figure 6, once λ > 0.2, the MAPE of the three variables does not change significantly. The MAPE of the pressure data slightly decreases as λ increases, and the MAPE of the flow rate data fluctuates slightly and reaches a lower level near λ = 0.3. In this study, we chose λ = 0.3 for subsequent experiments.
The experimental results reveal that the MAPE in the imputation of flow rate data is substantially higher than that of temperature and pressure data. This is a consequence of the data's characteristics: the flow rate data used in the experiment show a larger range of variation than the other two variables.
After determining the parameters, we then test the size of the data processing window and use a fixed window size in subsequent tests. The subsequent experimental data will be presented in Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6 in Appendix A, corresponding, respectively, to Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12.
We fix the missing rate of the variable data at 40%, generating missing data randomly. The range for the window size is from 0.8n to 2.0n (where n represents the number of sensors for the current variable, as detailed in Table 1, and the chosen window size is rounded up). We first conduct tests on the temperature data, and Figure 7 presents the results.
As shown in Figure 7a, when we test the temperature data using MAPE as the evaluation standard, the statistical value imputation methods and the hot-deck algorithm exhibit an increase in MAPE with growing window size. The MAPE of the ISVT algorithm gradually decreases as the window size increases, while the FGSR, IFGSR, and SVT algorithms have the smallest MAPE when the window size reaches 1.8n. As seen in Figure 7b, the time consumption of the SVT, FGSR, and IFGSR algorithms increases significantly once the window size exceeds 1.8n, and the time consumption of the ISVT algorithm changes little between window sizes of 1.8n and 2n. Considering these factors, to achieve higher accuracy and reduce time consumption, we choose 1.8n as the window size for subsequent experiments on temperature data.
As illustrated in Figure 8a, when we test the pressure data using MAPE as the evaluation standard, the MAPE of the ISVT algorithm and the hot-deck algorithm remains relatively stable, while the MAPE values of the FGSR, SVT, and IFGSR algorithms all decrease with increasing window size. As seen in Figure 8b, as the window grows, the time consumption of the hot-deck algorithm and the MCM gradually increases. When the window size is 2n, the time consumption of the IFGSR algorithm is half that of the FGSR algorithm and is close to that of the hot-deck algorithm. For better accuracy, in subsequent experiments, we use 2n as the window size for pressure data.
As shown in Figure 9a, the MAPE of the MCMs fluctuates with increasing window size across the different algorithms. The MAPE of ISVT, FGSR, and IFGSR remains low when the window size is 1.6n, and the precision of the SVT algorithm stabilizes once the window size exceeds 1.6n. As seen in Figure 9b, the time consumption of the MCM changes little after the window size reaches 1.4n. Therefore, based on these results, we select a window size of 1.6n for subsequent experiments on flow rate data.
In summary, through testing on three variables, we have chosen 1.8n as the window size for temperature data, 2n as the window size for pressure data, and 1.6n as the window size for flow rate data for subsequent experiments.
After determining the parameters and the size of the test window, we use the same experimental data to test the performance of various methods under different data loss rates. The evaluation metrics include accuracy and time consumption. The results are as follows.
As shown in Figure 10, we first test the temperature data. With a fixed window size, we test the results with a data loss rate of 10% to 80%.
As shown in Figure 10a, with MAPE as the evaluation standard, the MCMs have the smallest MAPE at any missing rate; ISVT's MAPE is slightly larger than SVT's, and the accuracy of IFGSR and FGSR is almost identical. As shown in Figure 10b, the SVT and FGSR algorithms have the highest time consumption among the MCMs, while both ISVT and IFGSR show improvements in time consumption.
In summary, the MCM consistently exhibits the smallest MAPE when processing temperature data. Among the algorithms, SVT has the lowest MAPE but the highest time consumption; ISVT's accuracy is slightly lower than SVT's, with lower time consumption. The accuracy of FGSR and IFGSR is almost identical, with both slightly underperforming the SVT algorithm. IFGSR consumes less time than FGSR, and both significantly reduce time consumption compared to SVT and ISVT.
For pressure data, Figure 11 shows the results of testing data with missing rates ranging from 10% to 80% at a fixed window size.
As shown in Figure 11a, processing pressure data with MAPE as the evaluation metric reveals that FGSR and IFGSR consistently exhibit the smallest MAPE, and the difference in MAPE between IFGSR and FGSR is insignificant across all missing rates. Figure 11b illustrates that the IFGSR algorithm consumes less time than the FGSR algorithm; at high missing rates, its time consumption exceeds only that of the two statistical value imputation methods.
In conclusion, when processing pressure data, the IFGSR algorithm achieves a slight reduction in accuracy compared to the FGSR algorithm but significantly reduces the processing time. Compared to other methods, it also demonstrates higher accuracy and advantages in time consumption.
For flow rate data, Figure 12 shows the results of testing data with missing rates ranging from 10% to 80% at a fixed window size.
Figure 12a illustrates that the MCM exhibits the highest accuracy in processing missing values for flow rate data. The overall accuracy ranking is SVT > FGSR > IFGSR ≈ ISVT. As shown in Figure 12b, considering time consumption, ISVT and IFGSR save a substantial amount of time compared to SVT and FGSR.
In conclusion, when the MCM is used to impute missing temperature, pressure, and flow rate data, the SVT, FGSR, and IFGSR algorithms consistently achieve higher accuracy than the traditional methods under all tested conditions, validating the applicability of matrix completion to real-time missing data imputation. Furthermore, comparing IFGSR with FGSR reveals only a minor difference in accuracy but a significant reduction in computation time, demonstrating the effectiveness of the improvement proposed in this paper for matrix completion algorithms used in real-time missing data imputation.
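The sliding-window scheme underlying these real-time tests can be sketched as follows. This is a simplified stand-in rather than the paper's implementation: gaps in a window of the most recent w samples are filled by a basic iterative truncated-SVD completion (not SVT or FGSR themselves), and the window size, rank, and helper names are illustrative assumptions:

```python
import numpy as np
from collections import deque

def lowrank_impute(W, mask, rank=1, n_iter=50):
    """Fill missing entries of window W (mask=True where observed)
    by iterative truncated SVD -- a simple stand-in for the paper's
    matrix completion step."""
    X = W.copy()
    if mask.all():
        return X                                  # nothing missing
    col_means = np.nanmean(np.where(mask, W, np.nan), axis=0)
    X[~mask] = col_means[np.nonzero(~mask)[1]]    # warm start: column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # best rank-r approximation
        X[~mask] = L[~mask]                       # refine only missing entries
    return X

# Hypothetical streaming loop: window of the last w samples, NaN marks missing.
w = 20
window = deque(maxlen=w)

def on_new_sample(x):
    window.append(np.asarray(x, dtype=float))
    W = np.vstack(window)
    mask = ~np.isnan(W)
    filled = lowrank_impute(np.nan_to_num(W), mask)
    return filled[-1]                             # imputed current sample
```

Observed entries are never modified, so only the gaps in the newest sample are estimated from the short-term correlation structure captured by the window.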

5. Conclusions

This study describes a method that uses the MCM to fill in missing values in chemical process data. The method combines matrix completion with sliding windows to process missing entries in real-time chemical process data; its fundamental principle is to use the MCM to restore the matrices within the process data that contain missing values. In addition, an improvement is proposed that enhances the performance of matrix completion algorithms used for real-time imputation. The results show that the MCM performs well on the real-time missing value imputation problem in chemical processes, and the proposed improvement significantly increases the computational speed of the matrix completion algorithms.
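For reference, the core iteration of the singular value thresholding (SVT) algorithm that the method builds on can be sketched as below. This is a generic textbook version under assumed parameter choices; the τ heuristic, step size δ, and stopping rule are common defaults, not the paper's tuned settings:

```python
import numpy as np

def svt_complete(M, mask, tau=None, delta=1.2, max_iter=500, tol=1e-4):
    """Singular value thresholding for matrix completion.
    M: data matrix (arbitrary values at unobserved entries);
    mask: boolean array, True where an entry is observed."""
    if tau is None:
        tau = 5.0 * np.mean(M.shape)        # common heuristic, not tuned
    Y = np.zeros_like(M, dtype=float)       # dual variable
    X = np.zeros_like(M, dtype=float)
    norm_obs = np.linalg.norm(M[mask])
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        s = np.maximum(s - tau, 0.0)        # soft-threshold singular values
        X = (U * s) @ Vt
        resid = np.where(mask, M - X, 0.0)  # mismatch on observed entries
        if np.linalg.norm(resid) <= tol * norm_obs:
            break
        Y += delta * resid                  # dual ascent step
    return X
```

With a sufficiently low-rank matrix and enough observed entries, the iterate reproduces the observations and fills the gaps with a low-nuclear-norm estimate; δ is typically kept below 2 for convergence.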
The MCM proposed in this paper does not rely on long-term historical data: it requires neither detailed device-specific modeling from historical records nor model-based reconstruction of the missing data, and short-term data suffice to recover missing values with reasonable accuracy. The method is therefore versatile and applicable to different production devices. The MCM can also predict future data of production devices and, when combined with appropriate constraints, can identify and replace outliers.

Author Contributions

Conceptualization, X.Z. and S.T.; methodology, X.Z. and S.T.; software, X.Z.; validation, X.Z.; formal analysis, X.Z.; investigation, S.T.; resources, X.S.; data curation, X.Z.; writing—original draft preparation, X.Z.; writing—review and editing, S.T.; visualization, X.Z.; supervision, S.X.; project administration, L.X.; funding acquisition, S.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant No. 22178190.

Data Availability Statement

The data presented in this study are not publicly available due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. The influence of window size on the processing of temperature data.

| Method | Metric | 0.8ns | 1.0ns | 1.2ns | 1.4ns | 1.6ns | 1.8ns | 2.0ns |
|---|---|---|---|---|---|---|---|---|
| Mean imputation | MAPE/% | 0.429689 | 0.436732 | 0.435803 | 0.436275 | 0.439029 | 0.443191 | 0.448638 |
| | Time/s | 0.170502 | 0.12856 | 0.130309 | 0.206899 | 0.090882 | 0.109137 | 0.100491 |
| Median imputation | MAPE/% | 0.429191 | 0.435274 | 0.432243 | 0.431291 | 0.433602 | 0.436931 | 0.44047 |
| | Time/s | 0.201265 | 0.096462 | 0.075046 | 0.235175 | 0.113331 | 0.0966 | 0.060941 |
| Hot-deck | MAPE/% | 0.368026 | 0.370244 | 0.371674 | 0.369868 | 0.372333 | 0.372959 | 0.372206 |
| | Time/s | 0.321434 | 0.418652 | 0.58801 | 0.895145 | 0.929394 | 1.342582 | 1.954099 |
| SVT | MAPE/% | 0.330692 | 0.330586 | 0.328675 | 0.327193 | 0.325432 | 0.324194 | 0.327919 |
| | Time/s | 2.759206 | 3.91747 | 4.31461 | 4.770563 | 5.054205 | 4.797185 | 6.812804 |
| ISVT | MAPE/% | 0.374091 | 0.368559 | 0.361274 | 0.356427 | 0.350504 | 0.342927 | 0.33737 |
| | Time/s | 1.743603 | 2.402995 | 2.642217 | 2.927075 | 2.322131 | 4.203153 | 4.477385 |
| FGSR | MAPE/% | 0.347781 | 0.346603 | 0.345447 | 0.343482 | 0.341073 | 0.33999 | 0.34509 |
| | Time/s | 1.269798 | 1.605816 | 1.98593 | 2.384596 | 2.812689 | 2.657019 | 4.498306 |
| IFGSR | MAPE/% | 0.343915 | 0.342529 | 0.342109 | 0.339652 | 0.338637 | 0.338905 | 0.344688 |
| | Time/s | 1.013055 | 1.208337 | 1.51925 | 1.903469 | 2.269553 | 2.185626 | 3.625262 |
Table A2. The influence of window size on the processing of pressure data.

| Method | Metric | 0.8ns | 1.0ns | 1.2ns | 1.4ns | 1.6ns | 1.8ns | 2.0ns |
|---|---|---|---|---|---|---|---|---|
| Mean imputation | MAPE/% | 0.46116 | 0.468882 | 0.478266 | 0.481818 | 0.487873 | 0.493975 | 0.498029 |
| | Time/s | 0.006626 | 0.008727 | 0.012539 | 0.018682 | 0.016906 | 0.026218 | 0.024772 |
| Median imputation | MAPE/% | 0.461145 | 0.468343 | 0.473304 | 0.47683 | 0.484212 | 0.489664 | 0.495333 |
| | Time/s | 0.012414 | 0.010906 | 0.018815 | 0.005472 | 0.020775 | 0.01809 | 0.023827 |
| Hot-deck | MAPE/% | 0.306717 | 0.303902 | 0.303436 | 0.302407 | 0.302331 | 0.303755 | 0.303662 |
| | Time/s | 0.006375 | 0.009966 | 0.018106 | 0.020586 | 0.028261 | 0.043914 | 0.046188 |
| SVT | MAPE/% | 0.237668 | 0.228656 | 0.228244 | 0.226121 | 0.226222 | 0.224198 | 0.22248 |
| | Time/s | 0.145216 | 0.143519 | 0.153234 | 0.161981 | 0.24699 | 0.375339 | 0.382238 |
| ISVT | MAPE/% | 0.378499 | 0.37748 | 0.379664 | 0.377244 | 0.37791 | 0.37764 | 0.37548 |
| | Time/s | 0.089936 | 0.09314 | 0.091471 | 0.098333 | 0.145666 | 0.215231 | 0.190195 |
| FGSR | MAPE/% | 0.231582 | 0.225196 | 0.221528 | 0.216968 | 0.217246 | 0.213067 | 0.209774 |
| | Time/s | 0.028133 | 0.031491 | 0.036807 | 0.054531 | 0.057873 | 0.067387 | 0.095115 |
| IFGSR | MAPE/% | 0.254744 | 0.255588 | 0.233928 | 0.231192 | 0.230023 | 0.228048 | 0.226329 |
| | Time/s | 0.020926 | 0.021267 | 0.025544 | 0.038162 | 0.039157 | 0.033097 | 0.045742 |
Table A3. The influence of window size on the processing of flow rate data.

| Method | Metric | 0.8ns | 1.0ns | 1.2ns | 1.4ns | 1.6ns | 1.8ns | 2.0ns |
|---|---|---|---|---|---|---|---|---|
| Mean imputation | MAPE/% | 0.952342 | 0.971453 | 0.986896 | 0.99851 | 1.012467 | 1.029958 | 1.044433 |
| | Time/s | 0.027833 | 0.022236 | 0.058512 | 0.039126 | 0.044706 | 0.04536 | 0.051778 |
| Median imputation | MAPE/% | 0.960008 | 0.981177 | 0.997188 | 1.008103 | 1.024871 | 1.043059 | 1.060312 |
| | Time/s | 0.035634 | 0.029653 | 0.038005 | 0.037985 | 0.029285 | 0.029558 | 0.038445 |
| Hot-deck | MAPE/% | 0.986663 | 0.981268 | 0.984567 | 0.980568 | 0.979232 | 0.94581 | 0.943907 |
| | Time/s | 0.022614 | 0.032845 | 0.058058 | 0.075891 | 0.083756 | 0.101958 | 0.124636 |
| SVT | MAPE/% | 0.831948 | 0.831894 | 0.829771 | 0.830461 | 0.821872 | 0.821781 | 0.821165 |
| | Time/s | 0.350121 | 0.512551 | 0.589402 | 0.64564 | 0.65025 | 0.663364 | 0.666758 |
| ISVT | MAPE/% | 0.929614 | 0.929827 | 0.921077 | 0.926509 | 0.919068 | 0.928503 | 0.944799 |
| | Time/s | 0.215114 | 0.308748 | 0.348895 | 0.401168 | 0.392368 | 0.406407 | 0.414813 |
| FGSR | MAPE/% | 0.854404 | 0.852655 | 0.856774 | 0.861344 | 0.85159 | 0.853464 | 0.85124 |
| | Time/s | 0.160431 | 0.220477 | 0.23072 | 0.299942 | 0.264435 | 0.285232 | 0.30906 |
| IFGSR | MAPE/% | 0.887159 | 0.902684 | 0.885102 | 0.90192 | 0.884778 | 0.904824 | 0.893551 |
| | Time/s | 0.114031 | 0.175019 | 0.194166 | 0.242649 | 0.220073 | 0.223618 | 0.217471 |
Table A4. The influence of miss rate on the processing of temperature data.

| Method | Metric | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 |
|---|---|---|---|---|---|---|---|---|---|
| Mean imputation | MAPE/% | 0.111415 | 0.221567 | 0.330508 | 0.442765 | 0.555371 | 0.669687 | 0.776217 | 0.890122 |
| | Time/s | 0.064355 | 0.089104 | 0.100775 | 0.092744 | 0.124961 | 0.128273 | 0.083667 | 0.127447 |
| Median imputation | MAPE/% | 0.111443 | 0.218174 | 0.32526 | 0.437772 | 0.550006 | 0.66239 | 0.766547 | 0.879003 |
| | Time/s | 0.0853 | 0.069159 | 0.095377 | 0.078869 | 0.096246 | 0.080675 | 0.078036 | 0.069624 |
| Hot-deck | MAPE/% | 0.089245 | 0.18221 | 0.275035 | 0.367849 | 0.464774 | 0.561281 | 0.653008 | 0.757046 |
| | Time/s | 1.045313 | 1.420777 | 1.724894 | 1.318843 | 1.309674 | 1.30266 | 1.397718 | 1.169198 |
| SVT | MAPE/% | 0.075853 | 0.154787 | 0.237423 | 0.323773 | 0.415213 | 0.507319 | 0.602957 | 0.713683 |
| | Time/s | 4.344451 | 5.987962 | 5.880589 | 5.619251 | 5.582199 | 5.668959 | 6.547173 | 5.714972 |
| ISVT | MAPE/% | 0.079767 | 0.163287 | 0.250239 | 0.34209 | 0.437198 | 0.534976 | 0.635493 | 0.747522 |
| | Time/s | 2.774571 | 3.986273 | 3.783541 | 3.644063 | 3.635185 | 4.167742 | 4.274185 | 3.599238 |
| FGSR | MAPE/% | 0.082696 | 0.167274 | 0.253534 | 0.340791 | 0.428555 | 0.5173 | 0.607727 | 0.705739 |
| | Time/s | 3.593545 | 4.609731 | 4.085753 | 3.306966 | 2.857307 | 2.472233 | 2.248525 | 1.738816 |
| IFGSR | MAPE/% | 0.08226 | 0.166069 | 0.252763 | 0.338917 | 0.426536 | 0.513926 | 0.606955 | 0.706588 |
| | Time/s | 3.248033 | 4.216341 | 3.613562 | 2.67005 | 2.195556 | 1.792113 | 1.629194 | 1.265752 |
Table A5. The influence of miss rate on the processing of pressure data.

| Method | Metric | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 |
|---|---|---|---|---|---|---|---|---|---|
| Mean imputation | MAPE/% | 0.125369 | 0.25891 | 0.386006 | 0.50069 | 0.635019 | 0.763016 | 0.883618 | 1.040332 |
| | Time/s | 0.031878 | 0.030032 | 0.031171 | 0.024268 | 0.03043 | 0.031451 | 0.028735 | 0.02203 |
| Median imputation | MAPE/% | 0.1232 | 0.258929 | 0.386007 | 0.50005 | 0.628795 | 0.759014 | 0.880959 | 1.041563 |
| | Time/s | 0.026002 | 0.023958 | 0.019762 | 0.014444 | 0.016118 | 0.016942 | 0.018383 | 0.020368 |
| Hot-deck | MAPE/% | 0.070259 | 0.146991 | 0.22211 | 0.299001 | 0.390197 | 0.491641 | 0.626109 | 0.786883 |
| | Time/s | 0.042708 | 0.042116 | 0.04238 | 0.04279 | 0.0541 | 0.051913 | 0.051356 | 0.051745 |
| SVT | MAPE/% | 0.046817 | 0.100516 | 0.157676 | 0.226004 | 0.310759 | 0.420588 | 0.550662 | 0.748993 |
| | Time/s | 0.297998 | 0.299693 | 0.299347 | 0.298905 | 0.356411 | 0.358202 | 0.354957 | 0.354273 |
| ISVT | MAPE/% | 0.087438 | 0.176265 | 0.283108 | 0.380976 | 0.505853 | 0.645433 | 0.787063 | 1.007307 |
| | Time/s | 0.180578 | 0.180712 | 0.180289 | 0.182614 | 0.214508 | 0.213438 | 0.214417 | 0.213874 |
| FGSR | MAPE/% | 0.045819 | 0.096758 | 0.153544 | 0.212478 | 0.288276 | 0.375787 | 0.475559 | 0.640863 |
| | Time/s | 0.077745 | 0.075804 | 0.07524 | 0.073885 | 0.088261 | 0.076434 | 0.065284 | 0.0532 |
| IFGSR | MAPE/% | 0.051439 | 0.107375 | 0.167297 | 0.229239 | 0.303 | 0.392865 | 0.499357 | 0.666309 |
| | Time/s | 0.039649 | 0.040591 | 0.038836 | 0.033879 | 0.034644 | 0.038104 | 0.039615 | 0.035125 |
Table A6. The influence of miss rate on the processing of flow rate data.

| Method | Metric | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 |
|---|---|---|---|---|---|---|---|---|---|
| Mean imputation | MAPE/% | 0.239366 | 0.479715 | 0.5937 | 1.020246 | 1.498838 | 1.665591 | 1.82332 | 2.043651 |
| | Time/s | 0.036863 | 0.032585 | 0.037025 | 0.033544 | 0.040757 | 0.03932 | 0.032436 | 0.049352 |
| Median imputation | MAPE/% | 0.240111 | 0.484099 | 0.597288 | 1.027179 | 1.51114 | 1.68408 | 1.840259 | 2.071771 |
| | Time/s | 0.02477 | 0.049942 | 0.056303 | 0.046061 | 0.051352 | 0.017915 | 0.031472 | 0.042276 |
| Hot-deck | MAPE/% | 0.251902 | 0.463079 | 0.642777 | 0.89669 | 1.407212 | 1.681455 | 1.648385 | 2.049117 |
| | Time/s | 0.027612 | 0.028242 | 0.028666 | 0.028087 | 0.025683 | 0.02345 | 0.023301 | 0.026054 |
| SVT | MAPE/% | 0.215222 | 0.415987 | 0.559695 | 0.951196 | 1.382286 | 1.537685 | 1.727861 | 1.922707 |
| | Time/s | 0.422097 | 0.418632 | 0.413529 | 0.415509 | 0.40398 | 0.431114 | 0.355944 | 0.3587 |
| ISVT | MAPE/% | 0.229068 | 0.464872 | 0.594999 | 1.024654 | 1.513273 | 1.722914 | 1.924457 | 2.27328 |
| | Time/s | 0.257354 | 0.255716 | 0.253206 | 0.256194 | 0.262943 | 0.251303 | 0.214986 | 0.258442 |
| FGSR | MAPE/% | 0.231652 | 0.438877 | 0.610154 | 0.988806 | 1.385793 | 1.575441 | 1.748605 | 1.937884 |
| | Time/s | 0.207726 | 0.207422 | 0.198418 | 0.193925 | 0.179496 | 0.169473 | 0.163923 | 0.118877 |
| IFGSR | MAPE/% | 0.246612 | 0.463556 | 0.640926 | 1.03057 | 1.425174 | 1.60768 | 1.801965 | 1.964265 |
| | Time/s | 0.170172 | 0.184834 | 0.15303 | 0.140324 | 0.135162 | 0.113417 | 0.105556 | 0.08579 |

Figure 1. Listwise deletion and pairwise deletion.
Figure 2. The form of input matrix data M, composed of x_obsv, x_obsc, and x_mis.
Figure 3. Flowchart of the Matrix Completion Method for imputing missing data.
Figure 4. The impact of step size δ on the effectiveness of the SVT algorithm.
Figure 5. The impact of parameter α on the effectiveness of the FGSR algorithm.
Figure 6. The impact of step size λ on the effectiveness of the FGSR algorithm.
Figure 7. Influence of window size on the processing of temperature data. Subfigure (a) represents the MAPE after data imputation; subfigure (b) represents the time consumption.
Figure 8. Influence of window size on the processing of pressure data. Subfigure (a) represents the MAPE after data imputation; subfigure (b) represents the time consumption.
Figure 9. Influence of window size on the processing of flow rate data. Subfigure (a) represents the MAPE after data imputation; subfigure (b) represents the time consumption.
Figure 10. Influence of miss rate on the processing of temperature data. Subfigure (a) represents the MAPE after data imputation; subfigure (b) represents the time consumption.
Figure 11. Influence of miss rate on the processing of pressure data. Subfigure (a) represents the MAPE after data imputation; subfigure (b) represents the time consumption.
Figure 12. Influence of miss rate on the processing of flow rate data. Subfigure (a) represents the MAPE after data imputation; subfigure (b) represents the time consumption.
Table 1. Variable types and number of sensors.

| Variable Type | Variable Data | Control Data | Total |
|---|---|---|---|
| Temperature | 245 | 25 | 270 |
| Pressure | 38 | 17 | 55 |
| Flow rate | 33 | 60 | 93 |

Zhang, X.; Sun, X.; Xia, L.; Tao, S.; Xiang, S. A Matrix Completion Method for Imputing Missing Values of Process Data. Processes 2024, 12, 659. https://doi.org/10.3390/pr12040659