A Robust Multi-Sensor Data Fusion Clustering Algorithm Based on Density Peaks

In this paper, a novel multi-sensor clustering algorithm, based on the density peaks clustering (DPC) algorithm, is proposed to address the multi-sensor data fusion (MSDF) problem. The MSDF problem is raised in the multi-sensor target detection (MSTD) context and corresponds to clustering observations of multiple sensors, without prior information on clutter. During the clustering process, the data points from the same sensor cannot be grouped into the same cluster, which is called the cannot link (CL) constraint; the size of each cluster should be within a certain range; and overlapping clusters (if any) must be divided into multiple clusters to satisfy the CL constraint. The simulation results confirm the validity and reliability of the proposed algorithm.


Introduction
As a powerful tool, clustering analysis is usually used in machine learning [1], image analysis [2], information retrieval [3] and data mining [4] to eliminate noise data-points and find hidden groups or patterns in a dataset. Due to the diversity/variability of the dataset to be processed, many clustering algorithms, such as density-based clustering [5,6], hierarchical clustering [7], and k-means clustering [8], have been developed to solve specific problems. It can be seen that, although there are many clustering algorithms, none of them can be applied in all cases.
Clustering is often used as an unsupervised learning technique in pre-processing, as no prior information is provided. Nevertheless, for many problems, including the MSDF clustering problem addressed in this paper, some prior information can be obtained from additional data features [9,10] and employed to obtain better clustering results; this is known as semi-supervised clustering.
Constraining the dataset during the clustering process to obtain specific clustering results is an active topic in clustering research. In constrained clustering, "must-link" (ML) and "cannot-link" (CL) constraints are the two basic rules. An ML constraint specifies that two instances should be assigned to the same cluster, whereas a CL constraint specifies that two instances should be assigned to different clusters; together they allow users to specify constraint rules to obtain the desired clustering results. Typical constrained clustering algorithms include constrained k-means [11], pairwise constrained k-means [12], complete link [13], and constrained hierarchical clustering [14].
In clustering research, the number of clusters and the cluster center initialization have a great impact on the convergence speed and the clustering result. Research on the number of clusters mainly focuses on running the clustering algorithm multiple times with different values of k and choosing the estimate of k according to a specific criterion, such as the Bayesian information criterion [15], rate distortion theory [16], or the Akaike information criterion [17].

Problem Formulation
The above MSDF problem can be formulated as a CL-constrained clustering problem. Consider a dataset Z that consists of observations from multiple sensors, where each observation $z_i$ is an element of Z:

$$Z := \{z_1, z_2, \cdots, z_N\}, \quad z_i \in P \tag{1}$$

where N is the number of data points and P is the parameter space. In this paper, $z_i$ is a point in a two-dimensional Cartesian coordinate system. The dataset Z can be written as a union of multi-sensor observations. We define the observations of the sth sensor as $S_s = \{z_1^s, z_2^s, \cdots, z_{m_s}^s\}$, where $m_s$ is the number of data points of sensor s, so that all the data points in Z can be written as:

$$Z := \{S_1, S_2, \cdots, S_n\} = \{z_1^1, \cdots, z_{m_1}^1, z_1^2, \cdots, z_{m_2}^2, \cdots, z_1^n, \cdots, z_{m_n}^n\} \tag{2}$$

where n is the number of sensors. The MSDF problem requires that the dataset Z be divided into k clusters, namely, $C_1, C_2, \cdots, C_k$, and the CL constraint requires that:

$$c(z_i^s, z_j^s), \quad i \neq j \tag{3}$$

where $c(z_i^s, z_j^s)$ means that $z_i^s$ and $z_j^s$ cannot be within the same cluster. We define the set of noisy data points in dataset Z as $C_0$ and the observations of targets as $C_T$. The dataset Z can be defined as:

$$Z = C_0 \cup C_T \tag{4}$$

At the same time, each cluster cannot have any intersection with the rest of the subsets:

$$C_i \cap C_j = \emptyset, \quad \forall i \neq j \tag{5}$$
As mentioned above, the MSDF clustering problem can be described as: A dataset Z is divided into k clusters, the size of each cluster must satisfy the CL constraint (3), and each cluster cannot have any intersection with the others (5).
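The two requirements of the problem statement can be checked mechanically. The following is a minimal sketch, assuming each observation is stored as a hypothetical tuple (sensor_id, x, y) and each cluster as a list of indices into the dataset; neither layout comes from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple

# Assumed layout for this sketch: (sensor_id, x, y)
Observation = Tuple[Hashable, float, float]


def satisfies_cl(cluster: Sequence[int], data: Sequence[Observation]) -> bool:
    """True if no two points of the cluster come from the same sensor."""
    sensors = [data[i][0] for i in cluster]
    return len(sensors) == len(set(sensors))


def is_valid_partition(clusters: Sequence[Sequence[int]],
                       data: Sequence[Observation]) -> bool:
    """True if the clusters are pairwise disjoint and each satisfies the CL constraint."""
    counts = Counter(i for cluster in clusters for i in cluster)
    disjoint = all(count == 1 for count in counts.values())
    return disjoint and all(satisfies_cl(cluster, data) for cluster in clusters)


# Two sensors, each observing two targets.
example = [("s1", 0.0, 0.0), ("s2", 0.1, 0.0), ("s1", 5.0, 5.0), ("s2", 5.1, 5.0)]
print(is_valid_partition([[0, 1], [2, 3]], example))  # one point per sensor in each cluster
print(is_valid_partition([[0, 2], [1, 3]], example))  # points 0 and 2 share sensor s1
```

Because a cluster holds at most one point per sensor, its size can never exceed the number of sensors.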

CL Constraint and the Size of Clusters
The CL constraint (3) limits the size of each cluster, which must be smaller than or equal to the number of sensors n:

$$|C_i| \le n, \quad \forall i \in \{1, 2, \cdots, k\} \tag{6}$$

where $|C_i|$ is the number of data points in cluster $C_i$. Denote the detection probability of sensor s on target i as $p_D^s(i) \le 1$. To simplify the calculation, we treat $p_D^s(i)$ as a constant $p_D$, so that the expected size of a cluster is:

$$E[|C_i|] = \sum_{s=1}^{n} p_D^s(i) = n \times p_D \tag{7}$$

Given $p_D$ and the number of sensors n, $E[|C_i|]$ can be considered as a constant:

$$r = E[|C_i|] = n \times p_D \tag{8}$$

The number of sub-clusters (targets) $k_i$ in each cluster $C_i$ is given by Equation (9).
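Under the constant detection probability simplification, and assuming the sensors detect a target independently of each other, the size of a single-target cluster is binomial with mean n × p_D. The sketch below checks this by simulation; the values n = 20 and p_D = 0.9 are illustrative assumptions, not parameters taken from the paper.

```python
import random


def simulate_mean_cluster_size(n_sensors: int, p_detect: float,
                               trials: int = 10000, seed: int = 0) -> float:
    """Average number of sensors that detect one target over many simulated scans."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Each sensor contributes at most one observation, so a scan yields <= n_sensors points.
        total += sum(1 for _sensor in range(n_sensors) if rng.random() < p_detect)
    return total / trials


n, p_d = 20, 0.9  # illustrative values
print("expected size n * p_D:", n * p_d)
print("simulated mean size:  ", simulate_mean_cluster_size(n, p_d))
```

The simulated mean stays close to n × p_D and can never exceed n, which is the size bound implied by the CL constraint.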

Density Peaks Clustering Algorithm
In this paper, we use the DPC algorithm to calculate the local density. For each data point, we compute two quantities: its local density $\rho_i$ and its distance $\delta_i$ from points of higher density. Both quantities depend only on the distances $d_{ij}$ between data points [5]. The local density $\rho_i$ is defined as:

$$\rho_i = \sum_{j} \chi(d_{ij} - d_c) \tag{10}$$

where $d_{ij}$ is the distance between data points i and j, $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise, and $d_c$ is a cutoff distance. $\rho_i$ is therefore equal to the number of data points within the cutoff distance of point i. The larger $\rho_i$ is, the higher the density around data point i, and the more likely it is that the point is an observation of a target. $\delta_i$ is the minimum distance between point i and any other point with a higher density:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij} \tag{11}$$

The original DPC algorithm defines the data points with $\rho_i \ge 0.8 \times r$ and $\delta_i > 2 d_c$ as cluster centers. Figure 2 shows the clustering results of the DPC algorithm for 50 i.i.d. sensors. Clusters of different colors represent observations of different targets. The red "+" represents the true position of the targets, and the red "o" represents the clustering results. It can be seen from Figure 2 that, for non-overlapping clusters, the real and estimated positions of the targets are very close; for overlapping clusters, the clustering result deviates considerably from the real target position, and the target number is incorrect. In subsequent calculations, we need to re-cluster the overlapping clusters to obtain correct estimates.
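The two DPC quantities can be computed directly from pairwise distances. The following is a minimal sketch with plain Python lists; the function names are hypothetical. For the point of highest density, which has no denser neighbor, it uses the convention of the original DPC paper [5] and sets δ to the largest distance from that point.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def local_density(points: Sequence[Point], d_c: float) -> List[int]:
    """rho_i: number of other points closer than the cutoff distance d_c."""
    rho = [0] * len(points)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < d_c:
                rho[i] += 1
                rho[j] += 1
    return rho


def delta_distance(points: Sequence[Point], rho: Sequence[int]) -> List[float]:
    """delta_i: distance to the nearest point of strictly higher density."""
    delta = []
    for i, p in enumerate(points):
        higher = [math.dist(p, q) for j, q in enumerate(points) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))
        else:
            # No denser point exists: use the largest distance from p (DPC convention).
            delta.append(max(math.dist(p, q) for q in points))
    return delta


pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
rho = local_density(pts, d_c=0.5)
print(rho, delta_distance(pts, rho))
```

Both loops visit every pair of points, so the cost grows quadratically with the number of data points.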

From Figure 2, we can draw a conclusion: the cutoff distance in clusters with $k_i \ge 2$ (overlapping clusters) is larger than that in clusters with $k_i = 1$ (non-overlapping clusters). We define the cutoff distance in the non-overlapping clusters as $d_c$; the cutoff distance in overlapping clusters is larger. Assuming the cluster center of $C_i$ is data point i, $i \in \{1, 2, \cdots, k\}$, the number of data points within the cutoff distance of point i is approximately equal to the size of cluster $C_i$, that is, $\rho_i \approx |C_i|$. The data points of cluster $C_i$ are defined by Equation (12), and the number of targets in cluster $C_i$ by Equation (13). The difference between Equations (9) and (13) is that Equation (13) can determine whether cluster $C_i$ is an overlapping cluster using the $\rho_i$ of cluster center i. The calculation of $k_i$ is also an important step in the subsequent re-clustering process.

Target Observations Set and Target Number
The multi-source n-points algorithm searches for the number of data points within the cutoff distance of data point i to determine whether the union of point i and the data points within the cutoff distance is a cluster formed by observations of targets. The position of data point i and the detection probability $p_D$ have a great impact on the performance of the multi-source n-points algorithm. Filtering out noise and obtaining the target observations quickly and efficiently is the key to designing MSDF clustering algorithms. Using the DPC algorithm, we find that every data point i in $C_T$ obeys a prior rule: $\rho_i$ must be larger than a threshold. The data points in $C_T$ can be defined as:

$$C_T = \{z_i \in Z : \rho_i \ge l \times n\} \tag{14}$$

where l = 0.4 is a reference value that can be chosen roughly between 0.3 and 0.45. The condition $\rho_i \ge l \times n$ means that the number of data points within the cutoff distance of data point i must be larger than or equal to $l \times n$; a data point that satisfies it is considered to be a target observation.
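The threshold rule amounts to a single filter over the local densities. A minimal sketch, assuming the densities have already been computed and using a hypothetical function name:

```python
from typing import List, Sequence


def target_observations(rho: Sequence[int], n_sensors: int, l: float = 0.4) -> List[int]:
    """Indices of the points whose local density reaches l * n (candidate members of C_T)."""
    threshold = l * n_sensors
    return [i for i, density in enumerate(rho) if density >= threshold]


# With n = 20 sensors and l = 0.4 the threshold is 8 neighbors.
print(target_observations([12, 9, 7, 1, 0], n_sensors=20))
```

A smaller l keeps more low-density points and so admits more clutter, while a larger l risks discarding observations of targets that few sensors detected.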
Sensors 2020, 20, 238
Based on the same dataset shown in Figure 2, the data points in $C_T$ are circled with a red "o" in Figure 3. As shown in Figure 3, almost all of the target observations (colored data points) are circled with a red "o", and only a few data points are not. Even considering the impact that noisy data points may have on the clusters of $C_T$, Equation (14) is still very reliable.
The CL constraint requires that $C_T$ be divided into multiple clusters of roughly the same size, and the number of clusters (targets) in dataset Z is given by Equation (15). The number of targets in cluster $C_i$ can be calculated using the $\rho_i$ of the cluster center i through Equation (9), while the total number of targets in dataset Z can be calculated through Equation (15).
Given the number of clusters (targets) $\sum_i k_i$ and the dataset $C_T$, the preferred choice is to cluster with the k-means algorithm, as this avoids computing $\delta_i$; however, the k-means algorithm has difficulty handling cases where the local density differs greatly between clusters. In our experiments, we found that if the clusters are roughly equal in size (no overlapping clusters), the k-means algorithm obtains correct clustering results. Conversely, if the dataset contains at least one overlapping cluster, the clustering result obtained by the k-means algorithm does not satisfy the CL constraint. In order to cluster $C_T$ correctly using the k-means algorithm, we must therefore first determine whether there are overlapping clusters in dataset Z.
The key to determining whether there are overlapping clusters in dataset Z is to compare $\max(\rho)$ with $1.1 \times r$:

$$\max(\rho) < 1.1 \times r \tag{16}$$

If (16) holds, there are no overlapping clusters in dataset Z, and the number of targets in each cluster is 1, that is, $k_i = 1$, $i \in \{1, 2, \cdots, k\}$; otherwise, at least one overlapping cluster is contained in dataset Z. The reason we choose $1.1 \times r$, instead of the number of sensors n, is that noisy data points may otherwise fall into cluster $C_i$.
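The overlap test reduces to one comparison. The sketch below takes r = n × p_D as the expected cluster size; the function name and the example densities are illustrative assumptions.

```python
from typing import Sequence


def has_overlapping_cluster(rho: Sequence[int], n_sensors: int,
                            p_detect: float, factor: float = 1.1) -> bool:
    """True if the largest local density reaches factor * r, with r = n * p_D."""
    r = n_sensors * p_detect
    return max(rho) >= factor * r


# n = 20, p_D = 0.9 gives r = 18 and a threshold of 19.8.
print(has_overlapping_cluster([17, 18, 16, 2], n_sensors=20, p_detect=0.9))
print(has_overlapping_cluster([35, 18, 16, 2], n_sensors=20, p_detect=0.9))
```

A density of 35 exceeds what a single target can produce with 20 sensors, so the second call reports an overlapping cluster.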

Proposed Clustering Method
The original DPC algorithm needs to calculate $\rho_i$ and $\delta_i$ to find the cluster centers, which is an unnecessarily heavy computational burden when there are no overlapping clusters in dataset Z. In this case, we can obtain the correct clustering results by using the k-means algorithm to cluster dataset $C_T$ (a total of $\sum_i k_i$ targets), where $C_T$ is obtained by a threshold rule.
The DPC algorithm cannot correctly cluster overlapping clusters, whereas a faster and more efficient solution exists for non-overlapping clusters. For these reasons, we divide the processing of dataset Z into two cases: (1) no overlapping clusters in dataset Z (Algorithm 1); and (2) at least one overlapping cluster in dataset Z (Algorithm 2). The main difference between Algorithm 1 and Algorithm 2 is that Algorithm 2 requires an additional calculation of the parameter $\delta_i$ and re-clustering of the cluster centers of overlapping clusters.
Algorithm 1 consists of three steps: (1) calculate $\rho_i$ for each data point and determine whether there is an overlapping cluster in dataset Z according to (16); (2) filter out clutter and obtain dataset $C_T$ and $k_i$ for k-means clustering; and (3) revisit each cluster to make sure that it satisfies the CL constraint.

Algorithm 1 Clustering without any overlapping cluster in dataset Z
Input: dataset Z.
Output: cluster $C_i$ and its cluster center $z_i$, $i \in \{1, 2, \cdots, k\}$.
1.1: Calculate $\rho_i$ according to (10) and determine whether there is any overlapping cluster in dataset Z according to (16). If there is no overlapping cluster, go to step 1.2; otherwise, see Algorithm 2.
1.2: Calculate $C_T$ and $k_i$ according to (14) and (15), then cluster $C_T$ using the k-means algorithm.
1.3: Revisit each cluster $C_i$ to make sure that the CL constraint is satisfied, then calculate the cluster center $z_i$ of each cluster.
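Steps 1.2 and 1.3 can be sketched as follows. This is an illustration under several assumptions rather than the authors' implementation: observations are hypothetical tuples (sensor_id, x, y), the local densities and the number of clusters k are passed in by the caller, k-means is a plain Lloyd iteration with farthest-point initialisation, and a CL violation raises an error instead of being repaired.

```python
import math
from typing import List, Sequence, Tuple

Obs = Tuple[int, float, float]  # assumed layout: (sensor_id, x, y)
Point = Tuple[float, float]


def mean_point(pts: Sequence[Point]) -> Point:
    return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))


def kmeans(points: Sequence[Point], k: int, iters: int = 50) -> List[int]:
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels: List[int] = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster becomes empty
                centers[c] = mean_point(members)
    return labels


def cluster_without_overlap(data: Sequence[Obs], rho: Sequence[int], n_sensors: int,
                            k: int, l: float = 0.4):
    # Step 1.2: keep target observations (rho_i >= l * n) and cluster them with k-means.
    keep = [i for i, density in enumerate(rho) if density >= l * n_sensors]
    labels = kmeans([(data[i][1], data[i][2]) for i in keep], k)
    clusters = [[i for i, lab in zip(keep, labels) if lab == c] for c in range(k)]
    clusters = [members for members in clusters if members]
    # Step 1.3: verify the CL constraint, then compute the cluster centers.
    for members in clusters:
        sensors = [data[i][0] for i in members]
        if len(sensors) != len(set(sensors)):
            raise ValueError("CL constraint violated; re-clustering is needed")
    centers = [mean_point([(data[i][1], data[i][2]) for i in members]) for members in clusters]
    return clusters, centers
```

For example, three sensors observing two well-separated targets plus one clutter point yield two clusters of three points each, and the clutter point is dropped by the density threshold.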

Algorithm 2 Clustering with at least one overlapping cluster in dataset Z
Input: dataset Z.
Output: cluster $C_i$ and its cluster center $z_i$, $i \in \{1, 2, \cdots, k\}$.
2.2: According to (16), for cluster centers $z_i$ with $\max(\rho_i) < 1.1 \times r$, the cluster center is $z_i$; for cluster centers $z_i$ with $\max(\rho_i) \ge 1.1 \times r$, calculate $C_i$ and $k_i$ according to (12) and (13), then cluster $C_i$ with the k-means algorithm ($k_i$ clusters).
2.3: Repeat step 2.2 until all the overlapping clusters are divided into sub-clusters.
2.4: Revisit each cluster $C_i$ to make sure that the CL constraint is satisfied, then calculate the cluster center $z_i$ of each cluster.

Remark 1.
The cutoff distance is the key to both the proposed and the existing MSDF clustering algorithms. The C4F algorithm [24] selects two times the standard deviation of the observation noise as the cutoff distance, the multi-source n-points algorithm [26] calculates the cutoff distance using an online learning algorithm, and the proposed algorithm selects the value at the 2% position of the pairwise distances $d_{ij}$, sorted from small to large, as the cutoff distance. The multi-source n-points algorithm and the algorithm proposed in this paper can deal with unknown observation noise in the proposed clustering problem, whereas C4F can only deal with the case of known observation noise.
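The 2% rule can be sketched as follows. The exact index convention (rounding, and whether the position is counted from zero or one) is an assumption of this sketch, since the text only states that 2% of the sorted distances is used.

```python
import itertools
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def cutoff_distance(points: Sequence[Point], fraction: float = 0.02) -> float:
    """Distance found at the given fraction of the ascending list of pairwise distances."""
    dists = sorted(math.dist(p, q) for p, q in itertools.combinations(points, 2))
    if not dists:
        raise ValueError("at least two points are required")
    # Assumed convention: 1-based position round(fraction * M), clamped to a valid index.
    index = min(len(dists) - 1, max(0, round(fraction * len(dists)) - 1))
    return dists[index]


line = [(float(x), 0.0) for x in range(11)]  # 11 collinear points, 55 pairwise distances
print(cutoff_distance(line))
```

Because the rule depends only on the data, it needs no knowledge of the observation noise; the price is that all pairwise distances must be computed and sorted.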

Remark 2.
An indispensable step in the existing multi-sensor data fusion clustering algorithms is calculating the point-to-point distances, which is also the most time-consuming part of the algorithm. The runtime complexity and storage requirement are $O(N^2)$ and $(N^2 - N)/2$ for the proposed algorithm, versus $O(N \log N)$ and $O(N)$ for the multi-source n-points algorithm. The proposed algorithm therefore runs more slowly and requires more storage space than the multi-source n-points algorithm.

Remark 3.
For the multi-source n-points algorithm, the selection of sensor s is critical. If sensor s misses a target, that target will not be detected during the subsequent clustering process, whereas the proposed algorithm copes well with some targets going undetected. This advantage of the proposed algorithm becomes more pronounced as the sensor detection probability decreases.

Simulation Results
In this section, we compare the proposed algorithm with the k-means algorithm [8], the multi-source n-points algorithm [26], and the typical DBSCAN algorithm [6] to evaluate their performance.

Given Cutoff Distance
The k-means algorithm needs one parameter, k (the number of clusters), and the DBSCAN algorithm needs two parameters, ε (neighborhood radius) and m (minimum number of points). Both the multi-source n-points algorithm and the proposed algorithm need one parameter: the cutoff distance $d_c$. All the parameters used in the four algorithms are provided in Table 1. The number of sensors is set to n = {20, 50}, and the corresponding experimental results are given in Figures 4 and 5, respectively. In each Monte Carlo simulation, the color of the circles is assigned randomly, and circles of the same color represent the same cluster. The clustering results show that both the proposed method and the multi-source n-points algorithm can solve the MSDF clustering problem, but the proposed method has a smaller variance. The k-means algorithm is unable to deal with clutter, and its clustering result is incorrect. The DBSCAN algorithm can detect observations of targets, but its clustering result for overlapping clusters is incorrect.
Table 2

Unknown Cutoff Distance
Based on the same dataset as that given in Figure 2, we assume the cutoff distance $d_c$ is unknown and must be calculated from the dataset using an algorithm, such as the DPC algorithm. The cutoff distances of the multi-source n-points algorithm and the proposed method are shown in Table 3. The clustering results of the proposed algorithm are given in Figure 6. Compared with the clustering results shown in Figures 4 and 5, the clustering results shown in Figure 6 are also good. This demonstrates that the cutoff distance calculation used in the DPC algorithm is effective.

Table 3. Cutoff distances of the multi-source n-points algorithm and the proposed method.
Algorithms    Multi-Source n-Points    Proposed Method
20 sensors    6.7815                   8.0932
50 sensors    5.9779                   9.1440

Figure 6. Clustering results with {20, 50} sensors (different color "O") and the cluster centers estimated using the proposed method (black "O") and multi-source n-points (red "+").

Clustering-Based Model
In this simulation, we compare our algorithm with the C4F and multi-source n-points algorithms for multiple target trajectories, provided in the excellent sample MATLAB code in [26]. Information such as the clutter and the target dynamic model is unknown, and the only information that can be used is contained in the observation dataset (the data points in the two-dimensional Cartesian coordinate system) of multiple sensors. The surveillance area is [−100, 100] × [−100, 100] (m), and the start/end time and the initial position (green "□") of each target are recorded near the target trajectory, as shown in Figure 7. The average clutter rate per scan is 10, and the observation noise obeys a zero-mean Gaussian distribution with a variance of 4.
To test the clustering accuracy, we use the optimal sub-pattern assignment (OSPA) metric [29] to compare the proposed algorithm with the C4F and multi-source n-points algorithms. We set the cutoff parameter c = 100 and the order parameter p = 2.
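For reference, the OSPA distance between a set of estimated positions and a set of true positions can be sketched as follows. This illustration uses the standard OSPA definition with a brute-force search over assignments, so it is only practical for small sets; the function name is hypothetical and the defaults c = 100 and p = 2 follow the settings above.

```python
import itertools
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def ospa(estimates: Sequence[Point], truths: Sequence[Point],
         c: float = 100.0, p: float = 2.0) -> float:
    """OSPA distance with cutoff c and order p (brute-force optimal assignment)."""
    small, large = (estimates, truths) if len(estimates) <= len(truths) else (truths, estimates)
    m, n = len(small), len(large)
    if n == 0:
        return 0.0  # both sets are empty
    # Best assignment of the smaller set to distinct elements of the larger set.
    best = min(
        sum(min(c, math.dist(x, large[j])) ** p for x, j in zip(small, perm))
        for perm in itertools.permutations(range(n), m)
    )
    # Every unassigned element of the larger set is penalised with the cutoff c.
    return ((best + (c ** p) * (n - m)) / n) ** (1.0 / p)


print(ospa([(0.0, 0.0)], [(3.0, 4.0)]))              # localisation error only
print(ospa([(0.0, 0.0)], [(0.0, 0.0), (5.0, 5.0)]))  # one missed target
```

The metric combines localisation error and cardinality error in a single number, which is why an incorrect target count is penalised heavily when c is large.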

First, we use 20 sensors. The clustering results of different algorithms for t = 16 are given in Figure 8. The average clustering target numbers and the average OSPA versus time over 100 Monte Carlo trials of different algorithms are given in Figure 9. The average OSPA of the proposed method is 3.7979, which is better than that of the C4F (5.9163) and the multi-source n-points (12.24) algorithms.
With more sensors, the corresponding results are given in Figure 11. The average OSPA of the proposed method is 1.6717, which is better than that of the C4F (5.5241) and multi-source n-points (7.8788) algorithms. Compared with Figure 8, the clustering accuracy of the two algorithms increases with the increase of the sensor number.
Figure 11. Mean estimated number of targets and mean OSPA of different algorithms over 100 MC trials.
Table 4.
Figure 12 gives the average runtime and average OSPA comparison of different algorithms versus different numbers of sensors over 100 Monte Carlo trials. It can be seen that the proposed method outperforms the C4F and multi-source n-points algorithms in the average OSPA. The clustering accuracy with 20 sensors using the proposed algorithm exceeds the clustering result with 100 sensors using the C4F and the multi-source n-points algorithms. As for the computing speed, the proposed method is slower than the C4F and multi-source n-points algorithms.

Conclusions
We propose a robust multi-sensor clustering algorithm to solve the MSDF problem. The MSDF problem corresponds to clustering the observations (containing a large amount of noise) of multiple sensors into k clusters, each of which must satisfy the CL constraint. Unlike other model-based multi-sensor data fusion algorithms, the proposed algorithm needs no prior information, such as the noise and the motion model of a target. Compared with the existing multi-sensor data fusion clustering algorithms, the proposed algorithm is more robust, and its advantage grows as the detection probability of the sensors decreases.
