Missing and Corrupted Data Recovery in Wireless Sensor Networks Based on Weighted Robust Principal Component Analysis

Although wireless sensor networks (WSNs) have been widely used, the existence of data loss and corruption caused by poor network conditions, sensor bandwidth, and node failure during transmission greatly affects the credibility of monitoring data. To solve this problem, this paper proposes a weighted robust principal component analysis method to recover the corrupted and missing data in WSNs. By decomposing the original data into a low-rank normal data matrix and a sparse abnormal matrix, the proposed method can identify the abnormal data and avoid the influence of corruption on the reconstruction of normal data. In addition, the low-rankness is constrained by weighted nuclear norm minimization instead of the nuclear norm minimization to preserve the major data components and ensure credible reconstruction data. An alternating direction method of multipliers algorithm is further developed to solve the resultant optimization problem. Experimental results demonstrate that the proposed method outperforms many state-of-the-art methods in terms of recovery accuracy in real WSNs.


Introduction
Wireless sensor networks (WSNs) contain a group of spatially distributed sensor nodes that are capable of communicating wirelessly and collecting data from the surrounding environments [1,2]. Recently, WSNs have been widely applied in different domains, such as environmental monitoring [3], military management [4], and health care [5]. Typically, the main task of WSNs is to collect sensing data from all sensor nodes to a certain sink and then perform further analysis based on the monitoring data, and the collected data are usually composed of readings sensed by multiple nodes in consecutive time slots. However, due to the poor environments and energy constraints in WSNs, data loss and corruption are inevitable in practical applications. Therefore, it is important to reconstruct the real data from partially collected data with corruption.
Recently, various reconstruction methods have been proposed for data recovery in WSNs. Based on data interpolation techniques, a K nearest neighbor (KNN)-based method [6] was proposed to simply utilize the values of the nearest neighbors to estimate the missing values. The Delaunay triangulation (DT) [7] utilizes the vertices as their global errors to reconstruct virtual triangles for data interpolation. Based on compressed sensing (CS) [8], the distributed compressed sensing (DCS) method [9,10] was proposed to exploit the sparsity of the data under various transform domains.
Since signals in many applications are naturally organized as two-dimensional data (i.e., in matrix form) and exhibit second-order sparsity (i.e., low-rankness), matrix completion (MC) [11] has emerged as a novel technology and has been applied to many fields, such as image inpainting [12], magnetic resonance imaging [13], and recommendation systems [14]. Matrix completion aims at recovering the missing entries of a low-rank matrix from incomplete observations, which can be formulated as a rank minimization problem. In general, solving this problem is NP-hard, since the rank function is non-convex. Fortunately, the nuclear norm, the sum of all singular values of the matrix, is a convex approximation of the rank function and can be used as an alternative [11].
Since the readings collected from $N$ nodes during $M$ time slots in WSNs can also be arranged into a matrix exhibiting low-rankness, matrix completion-based methods have been proposed to utilize the correlation of WSNs data. An efficient data collection approach (EDCA) [15] and a spatiotemporal compressive data collection (STCDG) method [16] were first proposed to recover the WSNs data by exploiting the spatiotemporal correlation in the form of low-rankness. Recently, several methods jointly utilizing low-rank and spatiotemporal sparsity features [17,18] were proposed. Considering that a missing row of the data matrix caused by a broken node greatly degrades the recovery accuracy, the matrix completion method in [19] utilizes an interpolation technique for WSNs data recovery. In addition, to address the need for real-time reconstruction in practical applications, sliding window-based reconstruction approaches [20,21] were proposed to achieve real-time data recovery.
However, the reconstruction performance of these methods degrades greatly when corruption exists in the sampled data. Directly constraining the low-rankness cannot avoid the impact of corruption on the reconstruction of normal data. A two-phase MC-based data recovery scheme (MC-Two-Phase) [22] was proposed to recover the normal data without the influence of corruption by detecting the corruption with principal component analysis (PCA) [23] before reconstruction. Although PCA can be utilized to detect faults corrupted by small noise, it suffers from poor robustness. To overcome the limitations of PCA, the robust principal component analysis (RPCA) method [24-26] has been proposed in recent years. The RPCA method improves the robustness since it only requires that the noise be sparse, regardless of the strength of the noise. However, it is unreasonable to treat all singular values equally as in the traditional RPCA algorithm, since different singular values may carry signal information of different levels of importance.
To solve the above problem, we propose a weighted robust principal component analysis (WRPCA) method for the reconstruction of WSN data with corruption. The main contributions of this paper are the following: Firstly, based on RPCA, the original data with outliers are decomposed into a sum of a low-rank normal data matrix and a sparse abnormal matrix to avoid the influence of outliers during reconstruction.
Secondly, the low-rankness of WSNs data is revealed by the variation of singular values of two real datasets collected from the Intel Berkeley Research Lab and GreenOrbs.
Thirdly, the weighted nuclear norm is introduced to constrain the low-rankness and preserve the principal components of WSNs data.
The rest of this paper is organized as follows. Section 2 presents the basics of RPCA. Section 3 describes the proposed method and the reconstruction method. Section 4 shows the result of computer experiments and analysis, which is followed by the conclusion of the paper in Section 5.

Basics of RPCA
Although PCA can be used to detect corruptions, it is sensitive to gross noise and outliers. The performance and applicability of PCA are limited due to the lack of robustness to gross corruptions in real-life scenarios. As an improvement of PCA, RPCA can handle grossly corrupted data well. Suppose that the data matrix $X$ can be viewed as consisting of two components, a low-rank matrix $L$ and a sparse matrix $S$:
$$X = L + S. \tag{1}$$
The low-rank matrix $L$ and sparse matrix $S$ can be obtained by solving the following problem:
$$\min_{L,S} \ \operatorname{rank}(L) + \lambda \|S\|_0 \quad \text{s.t.} \quad X = L + S, \tag{2}$$
where $\operatorname{rank}(\cdot)$ denotes the rank of the matrix, $\|\cdot\|_0$ is the $\ell_0$ norm, and $\lambda$ is the balance parameter. Equation (2) is non-convex and NP-hard, which is difficult to solve. Typically, the matrix nuclear norm, the convex approximation of the rank function, can be used as an alternative. Therefore, the above problem can be cast as the following convex optimization problem:
$$\min_{L,S} \ \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad X = L + S, \tag{3}$$
where $\|L\|_* = \sum_i \sigma_i(L)$ is the nuclear norm of $L$, $\sigma_i(L)$ is the $i$-th singular value of matrix $L$, and $\|\cdot\|_1$ is the $\ell_1$ norm. The main goal of (3) is to reconstruct the low-rank normal data $L$ from the corrupted observation data $X$. RPCA has been successfully applied in different domains, including image processing [27], multimedia [28], document analysis [29], etc. The nuclear norm minimization utilized in (3) shrinks all the singular values equally [30], ignoring that different singular values may have different importance.
Actually, the real data sensed in the monitoring area always exhibit low-rankness, and the unavoidable corrupted data are sparsely distributed in the sensed data matrix. Based on RPCA, we propose a weighted robust principal component analysis method to recover the missing data in WSNs with the data corruption.
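As a concrete illustration of the decomposition and of the convex objective in (3), the following minimal Python sketch builds a synthetic matrix as the sum of a rank-2 component and a sparse component and evaluates the objective. The function name `rpca_objective`, the synthetic sizes, and the choice $\lambda = 1/\sqrt{\max(N, M)}$ (a default commonly used in the RPCA literature) are assumptions made for illustration and are not taken from the experiments of this paper.

```python
import numpy as np

def rpca_objective(L, S, lam):
    """Convex RPCA objective: nuclear norm of L plus lam times the l1 norm of S."""
    nuclear = np.linalg.svd(L, compute_uv=False).sum()
    l1 = np.abs(S).sum()
    return nuclear + lam * l1

# Synthetic example (hypothetical sizes): rank-2 "normal" data plus sparse outliers.
rng = np.random.default_rng(0)
N, M, r = 20, 30, 2
L_true = rng.standard_normal((N, r)) @ rng.standard_normal((r, M))
S_true = np.zeros((N, M))
idx = rng.choice(N * M, size=30, replace=False)      # 5% of the entries are outliers
S_true.flat[idx] = rng.normal(0.0, 5.0, size=30)
X = L_true + S_true                                   # observed matrix, X = L + S

lam = 1.0 / np.sqrt(max(N, M))                        # assumed default, not the paper's value
print("rank(L) =", np.linalg.matrix_rank(L_true))
print("nonzeros in S =", np.count_nonzero(S_true))
print("objective at (L_true, S_true) =", rpca_objective(L_true, S_true, lam))
```

The sketch only evaluates the objective at a known decomposition; it does not solve (3).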

Problem Formulation and Signal Feature
Consider a WSN consisting of one sink and $N$ sensor nodes, where the sensor nodes sense the environmental information and send the signal to the sink in each time slot. During $M$ time slots, $N \times M$ readings are gathered in the sink and can be organized into a matrix $X \in \mathbb{R}^{N \times M}$.
However, due to hardware and network conditions, data loss and corruption may occur in the network. Mathematically, only partial data $d = \Omega(X)$ can be successfully collected in the sink, and the original data $X$ contain corrupted data. Here, $\Omega(\cdot)$ is the random sampling operator. That is, under the sampling ratio $\rho_s$, for a matrix $X \in \mathbb{R}^{N \times M}$, $D$ entries are sampled randomly from the whole data to form $d = \Omega(X) \in \mathbb{R}^{D \times 1}$, where $D = \lfloor \rho_s N M + \tfrac{1}{2} \rfloor$. It is worth noting that the sampled partial data $d$ also contain sampled corrupted data. It is necessary to reconstruct the uncorrupted whole data from the sampled partial data under the sampling ratio $\rho_s$.
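The sampling operator described above can be sketched in Python as follows. This is a minimal illustration, assuming that $D$ is $\rho_s N M$ rounded to the nearest integer and that entries are drawn uniformly without replacement; the function name `sample_entries` and the row-major flattening of $X$ are choices made here, not details given in the paper.

```python
import numpy as np

def sample_entries(X, rho_s, rng):
    """Randomly sample D entries of X, with D = floor(rho_s * N * M + 1/2).

    Returns the sampled vector d and the flat (row-major) indices of the samples.
    """
    N, M = X.shape
    D = int(np.floor(rho_s * N * M + 0.5))
    idx = np.sort(rng.choice(N * M, size=D, replace=False))  # no entry sampled twice
    d = X.ravel()[idx]
    return d, idx

# Usage on a small hypothetical matrix.
rng = np.random.default_rng(0)
X = np.arange(12, dtype=float).reshape(3, 4)
d, idx = sample_entries(X, 0.5, rng)
print(d.shape)   # 6 of the 12 entries are kept
```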
The data sensed in a certain area during a consecutive time are always redundant and highly correlated and can be distributed into a matrix (uncorrupted matrix L) exhibiting low-rankness. Since the outliers in real WSN are uniformly and randomly distributed, the sparsely distributed corrupted data can be denoted by the matrix S. Therefore, the whole data X can be regarded as a combination of the uncorrupted data matrix L and the corrupted matrix S.
In order to verify that the uncorrupted data $L$ in WSNs are low-rank, two datasets from the Intel Berkeley Research Lab [31] and GreenOrbs [32] were used as testing data. Since data loss and corruption exist in both datasets, two small but complete subsets without corruption were selected as the ground truth for our verification experiment. Specifically, the selected Intel Berkeley Research Lab subset, including temperature and humidity data, was measured by 49 sensor nodes during 138 time slots, and the selected GreenOrbs subset was measured by 130 sensor nodes during 129 time slots. As shown in Figure 1, the singular values of the attribute data matrices illustrate the low-rankness of both datasets.
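A low-rankness check of this kind can be sketched as follows. Since the datasets themselves are not reproduced here, the sketch uses a synthetic 49 × 138 matrix (rank 3 plus small noise) as a stand-in; the function name `energy_rank` and the 99% energy threshold are assumptions for illustration only.

```python
import numpy as np

def energy_rank(X, energy=0.95):
    """Smallest k such that the top-k singular values hold `energy` of the squared spectrum."""
    s = np.linalg.svd(X, compute_uv=False)       # singular values in descending order
    cum = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum, energy)) + 1    # first index where cum >= energy
    return min(k, s.size)

# Synthetic stand-in for a 49-node, 138-slot data matrix (hypothetical data).
rng = np.random.default_rng(0)
X = rng.standard_normal((49, 3)) @ rng.standard_normal((3, 138))
X += 0.01 * rng.standard_normal(X.shape)
print("singular values needed for 99% energy:", energy_rank(X, 0.99))
```

A matrix is approximately low-rank when this number is much smaller than $\min(N, M)$.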

Proposed Method
Since the original data matrix $X$ can be decomposed into a low-rank matrix $L$ and a sparse matrix $S$, the WSNs data recovery problem can be expressed by (3). However, the nuclear norm minimization (NNM) in (3) applies the same threshold to every singular value, which is not appropriate because the larger singular values usually represent the major components of the data and contain more signal information. The larger singular values should therefore be shrunk less to preserve the major data components.
In order to improve the practicality and flexibility of the nuclear norm, the weighted nuclear norm is utilized in the recovery of WSNs data. The weighted nuclear norm of matrix $L$ is defined as:
$$\|L\|_{w,*} = \sum_{i} w_i \, \sigma_i(L), \tag{4}$$
where $w_i$ is the weight coefficient, and $\sigma_i(L)$ is the $i$-th singular value of $L$. It is clear that the weighted nuclear norm becomes the conventional nuclear norm, up to a constant scale factor, when $w_1 = w_2 = \cdots = w_n$. The weighted nuclear norm minimization (WNNM) based low-rank matrix completion problem can be described as:
$$\min_{L} \ \frac{1}{2}\|X - L\|_F^2 + \|L\|_{w,*}. \tag{5}$$
Gu et al. [30] proved that the problem can be solved by the following singular value thresholding formula:
$$\hat{L} = U \, S_w(\Sigma) \, V^T, \tag{6}$$
where $X = U \Sigma V^T$ is the singular value decomposition of $X$ and $S_w(\cdot)$ is the weighted singular value thresholding operator with weights $w_i$. The larger singular values should be given smaller weights to achieve less shrinkage, and the smaller ones should be given greater weights to achieve more shrinkage. The weights should be inversely proportional to the singular values. Therefore, in this paper, we set the weight as:
$$w_i = \frac{c \sqrt{M} \, \sigma^2}{\sigma_i(L) + \varepsilon}, \tag{7}$$
where $c > 0$ is a constant, $M$ is the number of columns in $L$, $\sigma^2$ is the variance of noise, and $\varepsilon$ only needs to be a very small number to avoid dividing by zero. By introducing the weighted approach, different singular values are shrunk differently with weight $w_i$, which further preserves the major components of data. Then, a WSNs data reconstruction method is proposed by applying WNNM in traditional RPCA to recover $L$ from the partial measurement $d$. It can be described as:
$$\min_{X,L,S} \ \|L\|_{w,*} + \lambda \|S\|_1 \quad \text{s.t.} \quad X = L + S, \ \ \Omega(X) = d. \tag{8}$$
Only the partial measurement $d$ is known a priori in (8). The original data $X$ and the uncorrupted $L$ can be reconstructed from the partial data $d$. By introducing a quadratic penalty term, (8) can be converted to the following formulation:
$$\min_{X,L,S} \ \frac{1}{2}\|\Omega(X) - d\|_2^2 + \mu \|L\|_{w,*} + \lambda \|S\|_1 \quad \text{s.t.} \quad X = L + S, \tag{9}$$
where $\mu$ and $\lambda$ are the regularization parameters. The proposed method incorporates both RPCA and WNNM in a single formulation to further preserve the major data components. The recovered $\hat{L}$ can be obtained as the uncorrupted complete data in WSNs.
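The effect of the weights can be illustrated with a minimal Python sketch of weighted singular value thresholding. It assumes the thresholding rule $\max(\Sigma_{ii} - w_i, 0)$ and weights of the form $c\sqrt{M}\sigma^2 / (\sigma_i + \varepsilon)$; the function names `wnnm_weights` and `weighted_svt` and the toy matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def wnnm_weights(sigma, c, M, noise_var, eps=1e-6):
    """Weights inversely proportional to the singular values: c*sqrt(M)*noise_var/(sigma+eps)."""
    return c * np.sqrt(M) * noise_var / (sigma + eps)

def weighted_svt(Y, weights):
    """Shrink the i-th singular value of Y by weights[i] and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - weights, 0.0)
    return (U * s_shrunk) @ Vt          # same as U @ diag(s_shrunk) @ Vt

# Toy example: one dominant and one weak singular value.
Y = np.diag([10.0, 1.0])
sigma = np.linalg.svd(Y, compute_uv=False)
w = wnnm_weights(sigma, c=1.0, M=Y.shape[1], noise_var=1.0)
L_hat = weighted_svt(Y, w)
print("weights:", w)                                     # small weight for the large value
print("shrunk singular values:", np.linalg.svd(L_hat, compute_uv=False))
```

In this toy case the dominant singular value is barely shrunk, while the weak one is removed entirely. With equal weights, both values would be reduced by the same amount.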

Model Optimization
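To solve (9), a reconstruction algorithm based on the alternating direction method of multipliers (ADMM) [33,34] is introduced. The augmented Lagrangian function of (9) can be written as:
$$\mathcal{L}_\alpha(X, L, S, A) = \frac{1}{2}\|\Omega(X) - d\|_2^2 + \mu \|L\|_{w,*} + \lambda \|S\|_1 + \langle A, X - L - S \rangle + \frac{\alpha}{2}\|X - L - S\|_F^2, \tag{10}$$
where $A$ is the Lagrangian multiplier, and $\alpha$ is the penalty parameter. More details of the proposed algorithm are given as follows.
For the X-subproblem, we update $X^{k+1}$ as follows:
$$X^{k+1} = \arg\min_{X} \ \frac{1}{2}\|\Omega(X) - d\|_2^2 + \frac{\alpha}{2}\left\|X - L^k - S^k + \frac{A^k}{\alpha}\right\|_F^2. \tag{11}$$
Here, the preconditioned conjugate gradient (PCG) algorithm is applied to solve this problem in this paper.
For the L-subproblem, we update $L^{k+1}$ as follows:
$$L^{k+1} = \arg\min_{L} \ \mu \|L\|_{w,*} + \frac{\alpha}{2}\left\|X^{k+1} - L - S^k + \frac{A^k}{\alpha}\right\|_F^2. \tag{12}$$
In general, the WNNM problem is non-convex. Gu et al. [30] proved that the problem has a fixed point and can be solved by the singular value thresholding formula:
$$L^{k+1} = U \, S_w(\Sigma) \, V^T, \tag{13}$$
where $U \Sigma V^T$ is the singular value decomposition of $X^{k+1} - S^k + A^k/\alpha$, and $S_w(\Sigma)$ denotes applying singular value thresholding to the diagonal matrix $\Sigma$. Since the threshold is affected by both $w_i$ and $\mu$, here, we set $w_i = \frac{\mu \sqrt{M} \, \sigma^2}{\sigma_i(L) + \varepsilon}$ to simplify the solution; then, $S_w(\Sigma)_{ii} = \max(\Sigma_{ii} - w_i, 0)$. Following [30], the initial $\sigma_i(L^{k+1})$ can be estimated as:
$$\sigma_i(L^{k+1}) = \sqrt{\max\left(\Sigma_{ii}^2 - M \sigma^2, \ 0\right)}. \tag{14}$$
For the S-subproblem, we update $S^{k+1}$ as follows:
$$S^{k+1} = \arg\min_{S} \ \lambda \|S\|_1 + \frac{\alpha}{2}\left\|X^{k+1} - L^{k+1} - S + \frac{A^k}{\alpha}\right\|_F^2. \tag{15}$$
We can find the solution via the well-known soft thresholding formula:
$$S^{k+1} = \operatorname{sign}(T) \odot \max\left(|T| - \frac{\lambda}{\alpha}, \ 0\right), \qquad T = X^{k+1} - L^{k+1} + \frac{A^k}{\alpha}. \tag{16}$$
For the A-subproblem, we update $A^{k+1}$ as follows:
$$A^{k+1} = A^k + \alpha \left(X^{k+1} - L^{k+1} - S^{k+1}\right). \tag{17}$$
In practical implementation, we initialize $X^0$, $L^0$, $S^0$, and $A^0$ as zero matrices. Then, (9) can be solved by repeating the above steps until $\|L^{k+1} - L^k\|_F / \|L^k\|_F$ is smaller than a predefined tolerance parameter or the number of iterations reaches the predefined maximum.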
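The iteration above can be sketched in Python as follows. This is a minimal sketch under several assumptions rather than the implementation used in the experiments: the augmented Lagrangian is taken in the standard form with penalty $\alpha$ and multiplier $A$; the X-subproblem is solved in closed form entry by entry (possible because entry-wise sampling makes the quadratic diagonal) instead of by PCG; and the initial singular values of $L$ are estimated as $\sqrt{\max(\Sigma_{ii}^2 - M\sigma^2, 0)}$ after Gu et al. [30]. The names `wrpca_admm`, `weighted_svt`, `soft_threshold`, and `mask` are hypothetical.

```python
import numpy as np

def soft_threshold(T, tau):
    """Entry-wise soft thresholding: sign(T) * max(|T| - tau, 0)."""
    return np.sign(T) * np.maximum(np.abs(T) - tau, 0.0)

def weighted_svt(Y, mu, noise_var, eps=1e-6):
    """Weighted SVT with w_i = mu * sqrt(M) * noise_var / (sigma_i(L) + eps)."""
    M = Y.shape[1]
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    # Assumed initial estimate of the singular values of L (after Gu et al.).
    sigma_L = np.sqrt(np.maximum(s**2 - M * noise_var, 0.0))
    w = mu * np.sqrt(M) * noise_var / (sigma_L + eps)
    return (U * np.maximum(s - w, 0.0)) @ Vt

def wrpca_admm(d, mask, lam=0.05, mu=3.3, alpha=0.05, noise_var=1.0,
               max_iter=100, tol=1e-4):
    """ADMM sketch. `mask` is a boolean N x M array marking sampled entries;
    `d` holds the sampled values in row-major order of the True entries."""
    N, M = mask.shape
    D_obs = np.zeros((N, M))
    D_obs[mask] = d
    X = np.zeros((N, M))
    L = np.zeros((N, M))
    S = np.zeros((N, M))
    A = np.zeros((N, M))
    for _ in range(max_iter):
        # X-update: closed form per entry (swapped in for PCG).
        B = L + S - A / alpha
        X = np.where(mask, (D_obs + alpha * B) / (1.0 + alpha), B)
        # L-update: weighted singular value thresholding.
        L_new = weighted_svt(X - S + A / alpha, mu, noise_var)
        # S-update: soft thresholding with threshold lam / alpha.
        S = soft_threshold(X - L_new + A / alpha, lam / alpha)
        # A-update: dual ascent on the constraint X = L + S.
        A = A + alpha * (X - L_new - S)
        # Stopping rule on the relative change of L.
        denom = np.linalg.norm(L)
        converged = denom > 0 and np.linalg.norm(L_new - L) / denom < tol
        L = L_new
        if converged:
            break
    return X, L, S
```

The default values of `lam`, `mu`, and `alpha` mirror the parameter settings reported in the experiments, while `noise_var`, `max_iter`, and `tol` are placeholders that would need tuning for real data.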
The main computational cost of solving (9) lies in the update of $L^{k+1}$, which requires computing the SVD of an $N \times M$ matrix in each iteration. The computational complexity per iteration is $O(\min(N M^2, N^2 M))$.

Experiments and Analysis
Most existing WSNs data reconstruction methods (e.g., KNN [6], CS [9], EDCA [15], and methods utilizing both low-rank and sparsity feature [17,18]) have achieved satisfying recovery performances, but they do not consider the case that the WSNs data have outliers. Therefore, the performance of our proposed method is compared with the RPCA method [24] and MC-Two-Phase method [22], which consider outliers during reconstruction.

Experimental Environments
The two datasets adopted to verify the low-rank property of normal WSNs data were also utilized for the reconstruction experiments. The normal data without corruption are denoted by $L_{nor}$, and $L_{nor\_B}$ and $L_{nor\_G}$ denote the normal matrices for the Berkeley data and the GreenOrbs data, respectively. The normal data can be regarded as the ground truth WSNs data. In real WSNs, due to data loss and corruption, the measurement $d$ is partially sampled and contains corruption. To obtain the partial measurement $d$ from the normal matrix $L_{nor}$ in the experiment, the following steps were performed. Firstly, the partially sampled normal data $d_{nor} \in \mathbb{R}^{D \times 1}$ were obtained by $d_{nor} = \Omega(L_{nor})$ according to the sampling ratio $\rho_s$. Then, $\rho_c \times D$ entries in $d_{nor}$ were randomly selected as the corrupted data by adding random Gaussian noise with zero mean and variance $\sigma^2 = 20$, where $\rho_c$ is the corruption ratio.
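The corruption step can be sketched in Python as follows. This is a minimal illustration, assuming that $\rho_c \times D$ is rounded to the nearest integer and that corrupted positions are drawn uniformly without replacement; the function name `corrupt_samples` and the toy input are hypothetical.

```python
import numpy as np

def corrupt_samples(d_nor, rho_c, noise_var, rng):
    """Add zero-mean Gaussian noise of variance noise_var to a fraction rho_c of the samples.

    Returns the corrupted measurement d and the indices of the corrupted entries.
    """
    D = d_nor.size
    n_cor = int(round(rho_c * D))                       # assumed rounding rule
    cor_idx = rng.choice(D, size=n_cor, replace=False)  # positions to corrupt
    d = d_nor.astype(float).copy()
    d[cor_idx] += rng.normal(0.0, np.sqrt(noise_var), size=n_cor)
    return d, cor_idx

# Usage on hypothetical samples: 20% of 100 entries are corrupted.
rng = np.random.default_rng(0)
d_nor = np.full(100, 25.0)
d, cor_idx = corrupt_samples(d_nor, rho_c=0.2, noise_var=20.0, rng=rng)
print(len(cor_idx))
```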
The parameters $\lambda$, $\mu$, and $\alpha$ can be chosen according to the characteristics of the signal collected by the sink. In this paper, $\varepsilon$, $\lambda$, $\mu$, and $\alpha$ are set to $10^{-6}$, 0.05, 3.3, and 0.05, respectively.
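The Normalized Mean Absolute Error (NMAE) is used to measure the recovery performance of different methods on missing data and corrupted data:
$$\mathrm{NMAE}_{loss} = \frac{\sum_{(i,j) \in \Pi_m} \left|L_{nor}(i,j) - \hat{L}(i,j)\right|}{\sum_{(i,j) \in \Pi_m} \left|L_{nor}(i,j)\right|}, \tag{18}$$
$$\mathrm{NMAE}_{cor} = \frac{\sum_{(i,j) \in \Pi_c} \left|L_{nor}(i,j) - \hat{L}(i,j)\right|}{\sum_{(i,j) \in \Pi_c} \left|L_{nor}(i,j)\right|}, \tag{19}$$
where $L_{nor}$ is the normal (ground truth) data, $\hat{L}$ is the recovered data, $\Pi_m$ denotes the missing data set, and $\Pi_c$ is the corrupted data set. The experimental result is an average of 50 repeated experiments.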
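The metric can be computed with a few lines of Python. The sketch assumes that NMAE over an index set is the sum of absolute errors on that set divided by the sum of absolute ground-truth values on the same set; the function name `nmae` and the boolean-mask representation of $\Pi_m$ and $\Pi_c$ are choices made here for illustration.

```python
import numpy as np

def nmae(L_true, L_hat, index_mask):
    """Normalized mean absolute error restricted to the entries where index_mask is True."""
    num = np.abs(L_true[index_mask] - L_hat[index_mask]).sum()
    den = np.abs(L_true[index_mask]).sum()
    return num / den

# Toy example: a single wrong entry in a 2 x 2 matrix.
L_true = np.array([[1.0, 2.0], [3.0, 4.0]])
L_hat = np.array([[1.0, 2.0], [3.0, 5.0]])
all_entries = np.ones_like(L_true, dtype=bool)
print(nmae(L_true, L_hat, all_entries))   # 1 / 10 = 0.1
```

Passing the mask of missing entries gives the error on missing data, and passing the mask of corrupted entries gives the error on corrupted data.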

Recovery Performance Comparisons
To compare the proposed method with the existing methods, temperature and humidity data from the Intel Berkeley Research Lab and GreenOrbs were utilized for the recovery performance comparisons. With the sampling ratio $\rho_s$ = 0.1, 0.2, 0.3, and 0.4 and the corruption ratio $\rho_c$ = 0.2, 0.3, 0.4, 0.5, and 0.6, Figures 2 and 3 show the recovery performance of each method for the Berkeley temperature and humidity data, while Figures 4 and 5 show the recovery performance of each method for the GreenOrbs temperature and humidity data, respectively. As can be seen, the proposed method has a lower NMAE than the comparison methods on both missing and corrupted data in the two datasets, especially at low sampling ratios and high corruption ratios.
As shown in Figure 2b, the $\mathrm{NMAE}_{cor}$ of the proposed method, RPCA, and MC-Two-Phase is 0.030, 0.037, and 0.099, respectively, when $\rho_s = 0.2$ and $\rho_c = 0.2$, while the values are 0.048, 0.097, and 0.140 when $\rho_c$ rises to 0.6. The results show that the proposed method not only improves the recovery accuracy of WSNs data but also has strong robustness to gross noise.
In particular, Figure 4 shows that as $\rho_c$ increases from 0.2 to 0.6, the $\mathrm{NMAE}_{loss}$ and $\mathrm{NMAE}_{cor}$ of MC-Two-Phase and RPCA increase dramatically, while those of the proposed method change very little: the variation over the entire range is less than 0.01.
As shown in Figure 3, the $\mathrm{NMAE}_{loss}$ and $\mathrm{NMAE}_{cor}$ of the proposed method decrease rapidly as $\rho_s$ increases, but there is little change for MC-Two-Phase. Specifically, in Figure 3b, the $\mathrm{NMAE}_{cor}$ of the proposed method, RPCA, and MC-Two-Phase is 0.056, 0.069, and 0.076, respectively, when $\rho_s = 0.1$ and $\rho_c = 0.4$, while the values are 0.029, 0.040, and 0.069 when $\rho_s$ rises to 0.4.

Conclusions
In this paper, we propose a WRPCA method to increase the recovery accuracy of WSNs data with loss and corruption. The original data matrix is treated as a sum of a low-rank normal data matrix and a sparse abnormal matrix to avoid the influence of corruption. In addition, weighted nuclear norm minimization is utilized to further constrain the low-rankness of the normal data and overcome the problem that nuclear norm minimization treats all singular values equally. The experimental results show that the proposed method achieves better recovery performance on both missing and corrupted data. In future work, the higher-order low-rankness of multi-attribute data in WSNs can be explored for multi-attribute data reconstruction.