Outlier Detection Using Improved Support Vector Data Description in Wireless Sensor Networks

Wireless sensor networks (WSNs) are susceptible to faults in sensor data. Outlier detection is crucial for ensuring the quality of data analysis in WSNs. This paper proposes a novel improved support vector data description method (ID-SVDD) to effectively detect outliers of sensor data. ID-SVDD utilizes the density distribution of data to compensate SVDD. The Parzen-window algorithm is applied to calculate the relative density for each data point in a data set. Meanwhile, we use Mahalanobis distance (MD) to improve the Gaussian function in Parzen-window density estimation. Through combining new relative density weight with SVDD, this approach can efficiently map the data points from sparse space to high-density space. In order to assess the outlier detection performance, the ID-SVDD algorithm was implemented on several datasets. The experimental results demonstrated that ID-SVDD achieved high performance, and could be applied in real water quality monitoring.


Introduction
Wireless sensor networks (WSNs) have been widely used in various fields, such as industrial (e.g., industrial surveillance), military (e.g., military reconnaissance), medical (e.g., medical diagnosis), agricultural (e.g., agriculture production detection), mechanical engineering, and aerospace engineering applications [1][2][3][4][5][6][7][8]. The reliability of sensor data has attracted increasing attention from both academia and industry. Outlier detection can recognize noise, errors, events, and hostile attacks, which helps to reduce network risk and ensure data quality [9][10][11][12]. Outliers are generally far fewer than normal data in the monitoring process, and they can indicate changes in the monitored objects and environments, so they carry considerable potential value. In aquaculture, sensors are deployed underwater, where they are susceptible to corrosion caused by germs and are prone to faults [13]. Moreover, water quality monitoring produces large volumes of data, so outlier detection must be fast.
Four families of outlier detection methods are commonly used: statistical-based [14,15], nearest neighbor [16,17], clustering-based [18,19], and classification-based [20][21][22]. However, these methods still have limitations in practice. Statistical-based methods construct models from prior knowledge [23], but no mathematical model matches the real application problems of WSNs perfectly. The nearest neighbor method is a classic detection algorithm [24], but it is time-consuming and scales poorly to high-dimensional data. Clustering-based methods are limited by the choice of clustering width [25]. Meanwhile, the calculation of data distance

Table 1. Key notations.

Notation      Description
R             Radius of sphere
o             Center of sphere
C             The trade-off between sphere volume and the number of target data outside the sphere
ξ_i           Slack variable
α             Lagrange multiplier
r             The distance between an observation datum in the feature space and center o
θ             The mean of Parzen-window density Par(x_i)
d             The feature dimension of input data
ω             Weighting factor
n             The number of target data
ρ_i           Relative density weight of x_i
Par(x_i)      Parzen-window density of x_i
md_ij         Mahalanobis distance between vectors
MS            Covariance matrix

Given target data $\{x_i,\ i = 1, 2, \ldots, n\}$, SVDD maps the target data from the input space into a feature space F via a nonlinear mapping function ϕ and finds the smallest sphere Ω = (o, R) in F. The objective function of SVDD is as follows:

$$\min_{R,\,o,\,\xi}\; R^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad \|\phi(x_i) - o\|^2 \le R^2 + \xi_i,\quad \xi_i \ge 0,\quad i = 1, 2, \ldots, n \tag{1}$$

where C is a parameter that denotes the trade-off between sphere volume and the number of target data outside the sphere. The slack variable $\xi_i$ accounts for data that are not included in the data description, so that some target points are allowed to fall outside the sphere.
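To solve the objective function (1), we introduce the Lagrange multiplier α. By replacing the inner products with the kernel function, we obtain the rearranged (dual) problem:

$$\max_{\alpha}\; \sum_{i=1}^{n}\alpha_i K(x_i, x_i) - \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j K(x_i, x_j) \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{n}\alpha_i = 1 \tag{2}$$

where $K(x_i, x_j)$ is a kernel function that satisfies Mercer's theorem [36]. The radius R of the sphere and the distance r between an observation datum x in the feature space and the center o are given by:

$$R^2 = K(x_k, x_k) - 2\sum_{i=1}^{n}\alpha_i K(x_i, x_k) + \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j K(x_i, x_j) \tag{3}$$

$$r^2 = K(x, x) - 2\sum_{i=1}^{n}\alpha_i K(x_i, x) + \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j K(x_i, x_j) \tag{4}$$

where $x_k$ is any support vector lying on the boundary of the sphere ($0 < \alpha_k < C$). Figure 1 shows the SVDD detection principle: the description boundary is used as the detection boundary. A test datum x is regarded as a target datum inside the sphere if r ≤ R, which indicates that x is normal. Otherwise, it is treated as an outlier, which indicates that x is abnormal.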
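The decision rule r ≤ R can be illustrated with a short sketch. It assumes the standard kernel expansion of the squared distance to the sphere center, a Gaussian kernel with an assumed 2δ² scaling, and hand-picked multipliers α that sum to one; the support points and the function names are hypothetical and are not taken from the paper's MATLAB implementation.

```python
import math


def gaussian_kernel(x, y, delta=1.0):
    """Gaussian kernel; the 2*delta^2 scaling is an assumed parameterization."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * delta ** 2))


def squared_distance_to_center(x, support, alpha, kernel=gaussian_kernel):
    """r^2 = K(x,x) - 2*sum_i a_i K(x_i,x) + sum_ij a_i a_j K(x_i,x_j)."""
    cross = sum(a * kernel(s, x) for a, s in zip(alpha, support))
    const = sum(ai * aj * kernel(si, sj)
                for ai, si in zip(alpha, support)
                for aj, sj in zip(alpha, support))
    return kernel(x, x) - 2.0 * cross + const


def is_outlier(x, support, alpha, radius_sq, kernel=gaussian_kernel):
    """A datum is flagged as an outlier when r^2 exceeds R^2."""
    return squared_distance_to_center(x, support, alpha, kernel) > radius_sq


# Hypothetical support vectors with hand-picked multipliers (not a solved dual).
support = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
alpha = [0.25, 0.25, 0.25, 0.25]

# R^2 is evaluated at one support vector, treated here as lying on the boundary.
radius_sq = squared_distance_to_center(support[0], support, alpha)

print(is_outlier((0.5, 0.5), support, alpha, radius_sq))  # point between the support vectors
print(is_outlier((5.0, 5.0), support, alpha, radius_sq))  # point far from all of them
```

In a real run the multipliers would come from solving the dual problem; only the distance computation and the comparison against R are shown here.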

Density-Compensated Support Vector Data Description
The traditional SVDD algorithm often ignores the impact of data density distribution on classification [33], which means the sphere cannot reflect all features of the target data, and reduces the classification accuracy. To account for the distribution information of data, we introduce the notion of relative density weight to compensate SVDD, which reflects how dense the region of target data is compared to other regions. This approach makes the training data in high-density areas more likely to fall into the sphere than those in low-density areas.
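In this paper, the Parzen-window algorithm [37] is applied to calculate the relative density weight of sample data. Given the target data $X = \{x_i,\ i = 1, 2, \ldots, n\}$, the relative density weight of point $x_i$ in a dataset is determined as:

$$\rho_i = \left(\frac{\mathrm{Par}(x_i)}{\theta}\right)^{\omega} \tag{5}$$

$$\mathrm{Par}(x_i) = \frac{1}{n}\sum_{j=1}^{n}\frac{1}{(2\pi)^{d/2}s^{d}}\exp\left(-\frac{\|x_i - x_j\|^2}{2s^2}\right) \tag{6}$$

where $\rho_i$ is the relative density weight, $\theta = \frac{1}{n}\sum_{i=1}^{n}\mathrm{Par}(x_i)$ is the mean of the Parzen-window density $\mathrm{Par}(x_i)$, d represents the feature dimension of input data, ω (0 ≤ ω ≤ 1) is the weighting factor, s denotes the smoothing parameter of the Parzen-window density, and n is the number of target data. We use relative density to reflect the data distribution in real space. In the process of searching for an appropriate description, we calculate the relative density weight according to Equation (5). After importing the relative density weight into SVDD, we obtain the redefined objective function as follows:

$$\min_{R,\,o,\,\xi}\; R^2 + C\sum_{i=1}^{n}\rho_i\,\xi_i \quad \text{s.t.}\quad \|\phi(x_i) - o\|^2 \le R^2 + \xi_i,\quad \xi_i \ge 0,\quad i = 1, 2, \ldots, n \tag{7}$$

Here the relative density weight multiplies the slack variable. Each datum in a high-density region receives a high relative density weight, so leaving it outside the sphere is penalized more heavily. In searching for the optimal description of the target data, D-SVDD therefore shifts the description boundary toward the dense areas. By introducing the Lagrange multiplier to solve Equation (7), we get the optimization problem of Equation (8):

$$\max_{\alpha}\; \sum_{i=1}^{n}\alpha_i K(x_i, x_i) - \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j K(x_i, x_j) \quad \text{s.t.}\quad 0 \le \alpha_i \le \rho_i C,\quad \sum_{i=1}^{n}\alpha_i = 1 \tag{8}$$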
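As a rough illustration of the relative density weight, the sketch below assumes a Gaussian Parzen window with smoothing parameter s and the weight ρ_i = (Par(x_i)/θ)^ω, where θ is the mean density and ω lies in [0, 1]. This functional form is inferred from the symbol definitions in the text rather than copied from the paper, and the function names and sample points are hypothetical.

```python
import math


def parzen_density(points, s=1.0):
    """Gaussian Parzen-window density of every point (assumed window form)."""
    n = len(points)
    d = len(points[0])
    norm = (2.0 * math.pi) ** (d / 2.0) * s ** d
    densities = []
    for xi in points:
        total = 0.0
        for xj in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
            total += math.exp(-sq_dist / (2.0 * s * s))
        densities.append(total / (n * norm))
    return densities


def relative_density_weights(points, s=1.0, omega=0.5):
    """rho_i = (Par(x_i) / theta) ** omega, with theta the mean density."""
    densities = parzen_density(points, s)
    theta = sum(densities) / len(densities)
    return [(p / theta) ** omega for p in densities]


# Four clustered points and one isolated point (made-up data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
for point, weight in zip(points, relative_density_weights(points)):
    print(point, round(weight, 3))
```

Points in the dense cluster receive weights above one and the isolated point a weight below one, so the penalty ρ_i C on its slack variable is smaller and the sphere is less inclined to stretch toward it. Setting ω = 0 makes every weight equal to one, which reduces the weighted objective to plain SVDD.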

Outlier Detection Using Improved Density-Compensated SVDD
In D-SVDD, we choose the Gaussian function as the window function of the Parzen-window algorithm and use the Euclidean distance to measure the distance in the Gaussian function. However, Euclidean distance does not take into account the correlation between sample points [38], which affects the precision of D-SVDD. Mahalanobis distance (MD) is scale-invariant [39], which overcomes this shortcoming of Euclidean distance. Thus, it avoids the calculation error caused by measurement units or by differences in the magnitude of eigenvector values [40]. MD can be viewed as a Euclidean distance normalized by the covariance of the data, and it is invariant under all nonsingular linear transformations. The formula of MD is given as follows:

$$md_{ij} = \sqrt{(x_i - x_j)^{T}\, MS^{-1}\, (x_i - x_j)} \tag{9}$$

where $md_{ij}$ is the Mahalanobis distance between vectors $x_i$ and $x_j$, MS is the covariance matrix, and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ denotes the mean of $x_i$.
In this paper, we introduce the MD to replace Euclidean distance in the Gaussian function. By calculating the MD between data points, the improved Parzen-window density for $x_i$ is redefined as

$$\mathrm{Par}(x_i) = \frac{1}{n}\sum_{j=1}^{n}\frac{1}{(2\pi)^{d/2}s^{d}}\exp\left(-\frac{md_{ij}^{2}}{2s^2}\right) \tag{11}$$

The Parzen-window relative density weight is denoted as

$$\rho(x_i) = \left(\frac{\mathrm{Par}(x_i)}{\theta}\right)^{\omega} \tag{12}$$

The pseudocode of the ID-SVDD algorithm is shown in Algorithm 1. The outputs of the ID-SVDD algorithm are the Lagrange multipliers $\alpha_i$, the radius R of the sphere, and the distance r. A given $x_i$ is classified as an outlier if its distance $r_i$ is greater than R. If not, $x_i$ is classified as a normal datum.

Algorithm 1. ID-SVDD.
Input: Target dataset X = {x_i, i = 1, 2, ..., n}, kernel function K(.)
Output: α_i, R, and r.
Begin
    Define an array P to store the relative density weight of each point.
    for (k = 1; k ≤ n; k++) do
        calculate P_k = ρ(x_k) according to Equation (12)
    End
    Solve the optimization problem of (8).
    Determine a sample whose α_i is between 0 and ρ(x_i)C.
    Calculate the radius R of the sphere and the distance r according to Equations (3) and (4).
End
Return α_i, R, and r.
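The weighting loop at the start of Algorithm 1 can be sketched as follows. The sketch assumes that the squared Mahalanobis distance simply replaces the squared Euclidean distance inside a Gaussian Parzen window, and that MS is the sample covariance with a 1/(n − 1) normalization; both choices, along with all function names and the sample data, are assumptions made for illustration. Solving the dual problem (8) is not shown.

```python
import math


def covariance(points):
    """Sample covariance matrix (the 1/(n-1) normalization is an assumption)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    return [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / (n - 1)
             for b in range(d)] for a in range(d)]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    d = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(d)]
           for i, row in enumerate(matrix)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(d):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]


def mahalanobis_sq(x, y, inv_cov):
    """Squared Mahalanobis distance (x - y)^T MS^-1 (x - y)."""
    diff = [a - b for a, b in zip(x, y)]
    size = len(diff)
    return sum(diff[a] * inv_cov[a][b] * diff[b]
               for a in range(size) for b in range(size))


def md_relative_density_weights(points, s=1.0, omega=0.5):
    """Relative density weights from an MD-based Gaussian Parzen window."""
    inv_cov = invert(covariance(points))
    n, d = len(points), len(points[0])
    norm = (2.0 * math.pi) ** (d / 2.0) * s ** d
    densities = [sum(math.exp(-mahalanobis_sq(xi, xj, inv_cov) / (2.0 * s * s))
                     for xj in points) / (n * norm) for xi in points]
    theta = sum(densities) / n
    return [(p / theta) ** omega for p in densities]


# Made-up two-attribute data; the second copy rescales one attribute by 1000.
points = [(0.0, 0.0), (1.0, 0.2), (0.2, 1.0), (1.1, 0.9), (4.0, 5.0)]
rescaled = [(x, 1000.0 * y) for x, y in points]
print(md_relative_density_weights(points))
print(md_relative_density_weights(rescaled))
```

Rescaling one attribute leaves the weights unchanged up to rounding, which is the scale-invariance property that motivates the use of MD when attributes are measured in different units.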

Experiment Design
To evaluate the performance of ID-SVDD, we compared it with the traditional SVDD, D-SVDD, and DW-SVDD provided in [33]. Cha proposed the DW-SVDD with a weight coefficient calculated by k-NN distance. We chose DW-SVDD for comparison because it also attempts to apply the relative density to traditional SVDD. All four methods were implemented in MATLAB and run on a PC with a 2.9-GHz Core™ processor, 16.0 GB of memory, and the Microsoft Windows 10 operating system.
Considering the completeness and continuity of data, we chose the data of nodes 12 and 17 in the SensorScope system dataset [41] for the simulation experiments. The SensorScope system is deployed at Grand-St-Bernard Mountain, which lies between Switzerland and Italy. The datasets have two attributes: the external temperature and the surface temperature. In addition, we carried out experiments on a real water quality dataset with three attributes: dissolved oxygen (DO), pH, and dissolved oxygen relative saturation. More detailed information about the experimental datasets is presented in Table 2. We used several indexes to evaluate the performance of ID-SVDD: true positive rate (TPR), true negative rate (TNR), accuracy, and run time [42,43]. The first three indexes are calculated as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{13}$$

$$\mathrm{TNR} = \frac{TN}{TN + FP} \tag{14}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$

where TP is the number of true positive results, TN is the number of true negative results, FP is the number of false positive results, and FN is the number of false negative results. Together with run time, these indicators measure the performance of outlier detection methods effectively.
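The three rates follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions of TPR, TNR, and accuracy; the counts in the example are made up for illustration and are not results from the paper.

```python
def detection_metrics(tp, tn, fp, fn):
    """Return (TPR, TNR, accuracy) from confusion-matrix counts."""
    tpr = tp / (tp + fn)                        # share of positives recovered
    tnr = tn / (tn + fp)                        # share of negatives recovered
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all data classified correctly
    return tpr, tnr, accuracy


# Hypothetical counts, chosen only to exercise the formulas.
tpr, tnr, accuracy = detection_metrics(tp=90, tn=8, fp=2, fn=10)
print(f"TPR={tpr:.3f} TNR={tnr:.3f} accuracy={accuracy:.3f}")
```

Because outliers are rare, accuracy alone can look high even when one class is detected poorly, which is why TPR and TNR are reported alongside it.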

Comparison among Different Kernel Functions
The kernel function of SVDD can map nonlinear relations to a higher-dimensional space and construct linear regression for processing [44]. In ID-SVDD, the kernel function also plays a key role. The common kernel functions include the linear kernel function (16), polynomial kernel function (17), Gaussian kernel function (18), and Sigmoid kernel function (19) [45].
Here, we set the parameter C, which controls the trade-off between volume and errors, within the range $2^{-8}$ to $2^{8}$. For the variable δ in the Gaussian kernel function, we also used the range $2^{-8}$ to $2^{8}$. To obtain better outlier detection results, we conducted experiments to choose the optimal kernel function, using fivefold cross validation (CV) to find adequate parameters for each kernel function. After parameter selection, we obtained the optimal results of ID-SVDD with the SensorScope dataset. The detailed results are provided in Table 3.
Table 3 clearly indicates that the TPR, TNR, and accuracy of the Gaussian kernel function were superior to those of the other three kernel functions in nodes 12 and 17 of SensorScope. Based on these experimental results, we adopted the Gaussian function as the kernel function of ID-SVDD in the outlier detection of water quality data. Figure 2a,b show the distribution of the testing results of the ID-SVDD detection algorithm on the SensorScope node 12 and node 17 datasets, respectively. In Figure 2, the support vectors construct the boundary that separates the normal data from the outliers. The decision boundaries of the two datasets are both irregular in shape. The blue points outside the sphere represent the outliers, whereas the red points inside the sphere are normal data. The detection model describes the data edges accurately, which indicates that ID-SVDD is an effective detection model.
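The parameter search described above can be outlined as follows. The kernel parameterizations (degree, offset, and the 2δ² scaling), the step of one in the exponent of the power-of-two grid, and the interleaved fold assignment are all assumptions made for illustration; the paper's exact kernel forms and CV procedure are not reproduced here.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (dot(x, y) + coef0) ** degree


def gaussian_kernel(x, y, delta=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * delta ** 2))


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)


def power_of_two_grid(low=-8, high=8):
    """Candidate values 2^low, ..., 2^high (exponent step of 1 is assumed)."""
    return [2.0 ** k for k in range(low, high + 1)]


def kfold_indices(n, k=5):
    """Split indices 0..n-1 into k interleaved folds (one possible scheme)."""
    return [list(range(i, n, k)) for i in range(k)]


# Enumerate (C, delta) candidates for the Gaussian kernel.
candidates = [(c, delta) for c in power_of_two_grid() for delta in power_of_two_grid()]
folds = kfold_indices(23)
print(len(candidates), [len(fold) for fold in folds])
```

Each (C, δ) pair would be scored by training on four folds and evaluating on the fifth, then averaging over the five rotations; the training step itself is omitted.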

Comparison Results of Different Datasets
We conducted experiments to compare ID-SVDD with traditional SVDD, D-SVDD, and DW-SVDD on a set of standard datasets from SensorScope. The detection results are displayed in Table 4.

Table 4. Detection results of SensorScope datasets.

We can see from Table 4 that the TPR and accuracy values of ID-SVDD in nodes 12 and 17 were both superior to those of D-SVDD, DW-SVDD, and SVDD. However, the TNR of ID-SVDD was lower than the TNR of D-SVDD and SVDD. These results indicate that the MD-improved Parzen-window relative density weight can eliminate the interference of correlation between variables and is appropriate for measuring the distance between target data. Meanwhile, the TPR, TNR, and accuracy of D-SVDD in nodes 12 and 17 were superior to those of DW-SVDD and SVDD, which indicates that the Parzen-window relative density weight is appropriate for compensating SVDD. In terms of run time, the four algorithms were close. Overall, ID-SVDD provided an acceptable improvement in outlier detection on the SensorScope datasets of nodes 12 and 17.

Experimental Results on Water Quality Datasets
This experiment evaluated the ID-SVDD algorithm on a real water quality dataset. All data were collected from the Internet of Things (IoT) monitoring system running in the Nanquan breeding base located in Wuxi city, Jiangsu province [46]. This system uses various types of sensors to collect water quality data (e.g., DO, pH, and dissolved oxygen relative saturation). These data were transmitted from the sensors to a server via the IoT monitoring system. The water quality dataset in this experiment included 1756 samples (one every 10 min) from 20 May to 2 June 2017. We chose the first 1052 samples as the training dataset, and the remaining 704 samples as the testing dataset. The distribution of training data is illustrated in Figure 3.
Figure 3 shows the distribution of the training results of the ID-SVDD detection algorithm on the water quality dataset. In Figure 3, the green points represent all normal data in the training process. The three coordinate axes represent the DO content, the pH, and the DO relative saturation, respectively. Most normal data are aggregated and distributed in an irregular shape, while a small amount of data is scattered. The detection result on the testing dataset is shown in Figure 4.
Figure 4 uses the same three-dimensional coordinates as Figure 3. After ID-SVDD outlier detection, the error points are shown in Figure 4 as black dots. Error points appear in both the normal dataset and the outlier dataset. The outlier data are distributed around the normal data. To evaluate the performance of the ID-SVDD algorithm, we made a comparison with D-SVDD, DW-SVDD, and traditional SVDD. The precision comparison results are shown in Table 5. Figure 5 presents the run-time comparison of the four algorithms.

Table 5. Detection results for the water quality dataset. D-SVDD: density-compensated SVDD; DW-SVDD: density-weighted SVDD; ID-SVDD: improved density-compensated SVDD.

It can be seen from Table 5 that ID-SVDD had the highest values of TPR and TNR, with 91.335% detection accuracy. The TPR of ID-SVDD was 2.322%, 28.542%, and 35.307% higher than those of D-SVDD, DW-SVDD, and traditional SVDD, respectively. The TNR of ID-SVDD was 4% greater than that of DW-SVDD, and equal to that of SVDD. ID-SVDD was successful in detecting the outliers of water quality data, increasing the accuracy by 2.064%, 27.327%, and 33.403% when compared to D-SVDD, DW-SVDD, and SVDD, respectively. There are correlations among pH, DO, and DO relative saturation. The MD-improved Parzen-window relative density weight can eliminate the interference of these correlations, thus improving the detection performance. Meanwhile, the TPR, TNR, and accuracy of D-SVDD were superior to those of DW-SVDD and SVDD. That is because the Parzen-window relative density weight can obtain a characterized description of the dataset in high-dimensional feature space and help search for an optimal SVDD, which makes it suitable for calculating the relative density weight. Introducing the improved relative density into SVDD therefore enhances outlier detection performance.

[Figure 5: run-time comparison of the four algorithms (axis label: pond13; series: ID-SVDD, D-SVDD, DW-SVDD, SVDD)]
As Figure 5 indicates, ID-SVDD had an advantage over D-SVDD and DW-SVDD in terms of run time. It consumed 0.5381 s for outlier detection. Single SVDD provided the shortest time (0.4832 s), but its TPR and accuracy were the lowest among the four algorithms. Therefore, ID-SVDD provided satisfactory outlier detection accuracy and efficiency, and it is suitable for detecting outliers in real water quality monitoring.

Conclusions
This paper presents a new outlier detection algorithm (ID-SVDD) that incorporates a relative density weight into SVDD. This approach captures the density distribution of the data, thus improving the performance of SVDD. To measure the relative density weight, we used the Parzen-window method. The Mahalanobis distance was applied to improve the Gaussian function in the calculation of relative density. ID-SVDD can realize data mapping from relatively sparse space to high-density space.
We conducted experiments to evaluate the performance of ID-SVDD on SensorScope datasets and a water quality dataset, and compared it with the D-SVDD, DW-SVDD, and SVDD algorithms. The experimental results showed that ID-SVDD achieved the highest TPR and accuracy on all datasets. On the water quality dataset it also matched or exceeded the TNR of the other methods and ran faster than D-SVDD and DW-SVDD, although plain SVDD remained the fastest and the TNR of ID-SVDD on the SensorScope nodes was lower than that of D-SVDD and SVDD. Therefore, it is efficient and useful to introduce relative density into SVDD. ID-SVDD provides a new idea for outlier detection, and it can be used in real-world applications.