Noises Cutting and Natural Neighbors Spectral Clustering Based on Coupling P System

Abstract: Clustering analysis, a key step for many data mining problems, can be applied to various fields. However, whatever the clustering method, noise points are always an important factor affecting the clustering effect. In addition, in spectral clustering, the construction of the affinity matrix affects the formation of new samples, which in turn affects the final clustering results. Therefore, this study proposes a noise cutting and natural neighbors spectral clustering method based on coupling P system (NCNNSC-CP) to solve the above problems. The whole algorithm process is carried out in the coupled P system. We propose a natural neighbors searching method without parameters, which can quickly determine the natural neighbors and natural characteristic value of data points. Then, based on it, the critical density and reverse density are obtained, and noise identification and cutting are performed. The affinity matrix constructed using core natural neighbors greatly improves the similarity between data points. Experimental results on nine synthetic data sets and six UCI datasets demonstrate that the proposed algorithm is better than the other comparison algorithms.


Introduction
With the rapid development of information technology, many fields have accumulated massive data. How to mine significant information and useful knowledge is a huge challenge. Clustering analysis, a key step for many data mining problems, can be applied to a variety of data types. The purpose of clustering is to divide a set of unlabeled data into different clusters based on the similarity between the data [1]. Therefore, the data with the most similar characteristics will be in the same cluster, while the data with the most dissimilar characteristics will be in different clusters [2]. Predecessors have proposed many clustering methods, such as partitioning methods [3], hierarchical methods [4], density methods [5,6], grid methods [7], and prototype-based methods [8]. Over the past few decades, clustering analysis has been effectively applied in image segmentation [9], text clustering [10], community division [11], pattern recognition [12], etc.
In recent years, spectral clustering, an effective clustering algorithm based on graph theory, has attracted attention in academia because of its high performance and simple implementation [13]. Spectral clustering can identify samples with arbitrary shapes while converging to the global optimal solution. Its main idea is to treat all data as nodes in space that can be connected by edges. The weight of the edge between two nodes is determined by their distance: the closer two nodes are, the higher their similarity and the higher the edge weight, and vice versa. Then, according to the graph partition method, the graph is divided into several disconnected sub-graphs such that the sum of edge weights between different sub-graphs is as small as possible and the sum of edge weights within a sub-graph is as high as possible. The set of nodes contained in each sub-graph is the final clustering result [14]. Spectral clustering can be understood as mapping data from a high-dimensional space to a low-dimensional space and then clustering in the low-dimensional space using other clustering algorithms (such as K-means).
In the spectral clustering algorithm, an important problem is constructing the affinity matrix. Yessica and Miin-Shen [2] merged a power parameter into the Gaussian kernel similarity function for the construction of the affinity matrix. This power parameter can separate points that lie in different clusters even though the distance between them is small. Then, it uses the maximum of all the minimum distances between the data nodes to obtain better clustering results. The maximum value between the estimated power parameter and the minimum distance can effectively improve the effect of spectral clustering. Huang et al. [15] proposed two novel algorithms, ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC), for ultra-large-scale data under limited resources. In U-SPEC, first, in order to construct a sparse affinity sub-matrix, a hybrid representative selection strategy and a fast approximation method of k-nearest representatives are proposed. Then, the sparse sub-matrix is interpreted as a bipartite graph. Finally, transfer cut is used to partition the graph efficiently and obtain the clustering result. In U-SENC, by integrating multiple U-SPEC clusterers, a new bipartite graph is constructed between the nodes and the base clusters, and it is then partitioned efficiently to obtain consistent clustering results. Bian et al. [16] combined the spectral clustering structure with a data fuzzy similarity matrix learning method (FSCM) to enhance the clustering performance. FSCM adopts the dual-index fuzzy c-means clustering algorithm to determine the fuzzy similarity between any pair of data points. Meanwhile, it generates the fuzzy similarity matrix of the data by adaptively assigning the fuzzy neighborhood of the data points. In this way, the spectral clustering structure of the data is found and the clustering stability of the FSCM algorithm is ensured.
Lin and Guo [17] put forward a new affinity matrix generation method based on the principle of neighbor relationship propagation and gave a neighbor relationship propagation algorithm. The generated affinity matrix can effectively promote the similarity of point pairs in the same cluster and can better identify the structure of the data. Aiming at the similarity measurement of complex data, Xiucai and Tetsuya [18] proposed a novel spectral clustering method based on the similarity measurement of data points in an adaptive neighborhood in the kernel space. In the kernel space, adaptive and optimal neighbors are assigned to each data point according to the local structure, and a sparse matrix is learned as the similarity matrix for spectral clustering. Wu et al. [19] presented a scalable spectral clustering method based on Random Binning features (RB), which accelerates the construction and the eigendecomposition of the similarity graph at the same time. In detail, it implicitly approximates the graph similarity (kernel) matrix by the inner product of a large sparse feature matrix obtained through RB.
Membrane computing [20] (also known as P system) is a system with the characteristics of distributed parallel computing proposed by Professor Păun. Its purpose is to learn from and simulate the way cells, tissues, organs, or other biological structures process chemical substances and to establish a distributed parallel computing model with outstanding computing performance [21]. The P system is a novel branch of natural computing, which provides an abundant computing framework for biomolecular computing. The P system has been proved to have computational power equivalent to that of a Turing machine and can effectively solve computationally hard problems [22]. Nowadays, P systems are mainly divided into cell-like P systems, tissue-like P systems, and neural-like P systems [23]. In recent years, research on the P system has mainly comprised theoretical research and application research. In terms of theoretical research, some new variants of the P system have been proposed that improve the computing power with the minimum number of cells or spikes [24,25]. For application research, the P system can solve practical problems [26] and can be used to implement clustering processes [27,28].
Although the above algorithms can achieve better clustering performance to a certain extent, noise points still have a great influence on the clustering effect. At the same time, the determination of the parameters of natural neighbors is also an important issue when constructing the affinity matrix. To address these problems, we propose a spectral clustering method with noises cutting and natural neighbors based on the coupling P system (NCNNSC-CP) and verify its clustering performance. The main contributions of this paper are as follows:
(1) A new coupling P system is proposed, which integrates natural neighbors and spectral clustering into the coupling membrane system to perform clustering tasks.
(2) Aiming at the noise points in the data set, we utilize the characteristics of the natural neighbors to identify and cut the noise points.
(3) In the stage of spectral clustering, we propose a parameter-free search of natural neighbors, which can quickly determine the natural characteristic value, thereby further constructing an affinity matrix with high similarity within the cluster.
(4) Nine classical synthetic data sets and six UCI data sets are used to simulate and verify the clustering performances of NCNNSC-CP.
The rest of the paper is organized as follows. Section 2 introduces the related concepts of P system and natural neighbors, and the basic algorithm of spectral clustering. In Section 3, a spectral clustering method with noises cutting and natural neighbors based on the coupling P system is proposed. Section 4 shows the performance of the algorithm through experimental analysis. Conclusions and future research work are given in Section 5.

Spectral Clustering
The spectral clustering algorithm is based on graph theory. Compared with traditional clustering algorithms, it can cluster data points with arbitrary shapes and converge to the global optimal solution efficiently. First, it constructs an undirected weighted graph $G(V, E)$ based on similarity. Each vertex $v_i$ of the graph corresponds to a data point, and the weight $w_{ij}$ is the similarity of the edge between the data points $x_i$ and $x_j$. Generally, there are three ways to construct an affinity matrix:
(1) ε-neighborhood graph. Set the distance threshold ε,
(2) K-nearest neighbor graph. The KNN algorithm is used to obtain the k neighbors of each data point; $w_{ij} > 0$ only when $x_j$ is one of the k neighbors of $x_i$, and $w_{ij} = \exp(-\|x_i - x_j\|_2^2 / (2\sigma^2))$.
(3) Fully connected graph. In general spectral clustering, it is the most commonly used method of constructing the affinity matrix. Different kernel functions can be selected to define the weight of the edges. When using the Gaussian kernel function, the similarity matrix and the affinity matrix are the same, $w_{ij} = s_{ij} = \exp(-\|x_i - x_j\|_2^2 / (2\sigma^2))$.
Then, according to the affinity matrix, we construct the degree matrix D. For any point $x_i$ in the graph, its degree $d_i$ is defined as the sum of the weights of all edges connected to it, $d_i = \sum_{j=1}^{n} w_{ij}$. The most important process in spectral clustering is the construction of the Laplacian matrix L:
(1) The unnormalized Laplacian matrix $L = D - W$.
(3) Symmetric normalized Laplacian matrix $L = I - D^{-1/2} W D^{-1/2}$.
Next, the eigenvectors $u_1, u_2, \cdots, u_k$ corresponding to the first k eigenvalues of L are calculated and collected as $U = \{u_1, u_2, \cdots, u_k\}$, $U \in \mathbb{R}^{n \times k}$. In addition, U is normalized by row to generate $Y = \{y_1, y_2, \cdots, y_n\}$, and each row of Y represents a sample. At last, a clustering algorithm (such as k-means) is applied to cluster the new samples into clusters $C_1, C_2, \cdots, C_k$.

Algorithm 1 Spectral clustering (NJW)
Input: The dataset D
Output: C (the clustering results)
1: Construct the affinity matrix W.
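The remaining steps of Algorithm 1 are not reproduced in this text, so the following is only a minimal sketch of the pipeline described in the paragraph above (fully connected Gaussian affinity, symmetric normalized Laplacian, row normalization, k-means). It assumes NumPy is available. The function names, the removal of self-loops, and the deterministic farthest-first initialization of k-means are assumptions made for illustration and are not taken from the paper. For $L = I - D^{-1/2} W D^{-1/2}$, the "first k eigenvalues" are taken to be the k smallest ones.

```python
import numpy as np

def gaussian_affinity(X, sigma):
    # w_ij = exp(-||x_i - x_j||_2^2 / (2 sigma^2)) for a fully connected graph
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # assumption: no self-loops
    return W

def spectral_embedding(W, k):
    d = W.sum(axis=1)                                  # degrees d_i
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                        # eigenvalues in ascending order
    U = vecs[:, :k]                                    # eigenvectors of the k smallest eigenvalues
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    return U / np.maximum(norms, 1e-12)                # row-normalized matrix Y

def kmeans(Y, k, n_iter=100):
    # assumption: farthest-first initialization instead of a random one
    centers = [Y[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(Y - c, axis=1) for c in centers], axis=0)
        centers.append(Y[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def njw_spectral_clustering(X, k, sigma=1.0):
    X = np.asarray(X, dtype=float)
    return kmeans(spectral_embedding(gaussian_affinity(X, sigma), k), k)
```

Calling `njw_spectral_clustering(X, k=2, sigma=1.0)` on a small array of points returns one integer label per row of `X`; the label values themselves are arbitrary, only the grouping is meaningful.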

Natural Neighbors
Zhu [30] systematically expounded the concept of natural neighbors through induction and summary based on previous studies, which is a reflection of the friendship between people in human society. Compared with the traditional nearest neighbor method, the natural neighbor method is scale-free. The relevant definitions of natural neighbors are as follows.
Definition 1: (The Natural Neighbor Stable Structure). The natural neighbor stable structure is, generally speaking, that A is a Natural Neighbor of B if A regards B as a neighbor and B regards A as a neighbor at the same time.
where NNk(x i ) is the kth nearest neighbor of point x i .

Definition 2: (k-Nearest Neighbors).
Given a data set D, for any point $x_i$, its k nearest neighbors are the points $x$ in D with $d(x_i, x) \le d(x_i, kn)$, where $d(x_i, kn)$ is the distance between $x_i$ and its kth nearest neighbor.

Definition 3: (Reverse Neighbors).
The reverse neighbors of $x_i$ are the data points $x$ in D that have $x_i$ among their k nearest neighbors.

Definition 4: (The Natural Characteristic Value Sup [31]).
Sup is the search range in the natural neighbor method.
where the initial value of k is 1, and $Nb_k(x_i)$ is the number of reverse neighbors of $x_i$ in the kth iteration. In addition, $f(x) = 1$ if $x = 0$, and $f(x) = 0$ otherwise.
The cell-like P system is the first proposed P system, and its membrane structure is shown in Figure 1. The outermost membrane 1 is the skin membrane, which separates the entire P system from the external environment. If a membrane does not contain a sub-membrane inside, it is a basic membrane (such as membranes 2, 3, 5, 8, 9, and 7). Otherwise, it is called a non-basic membrane (such as membranes 1, 4, and 6).

The formal definition of the cell-like P system consists of the following components:
1. O is the alphabet, where the elements represent objects;
2. H is a collection of membrane labels;
3. µ represents the membrane structure;
4. $\omega_i$ ($1 \le i \le m$) refers to the multiset of objects contained in region i of the membrane structure;
5. R contains all the rules;
6. $i_0 \in H \cup \{e\}$ represents the input/output area of the system, where e is a reserved character not included in H.
A P system is thus specified by its membrane structure, the multisets of objects in each membrane region, and the corresponding rules. In each step of the P system, the rules are executed in a non-deterministic and maximally parallel manner. After each time step, the system enters a new configuration.
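To make the notion of a non-deterministic, maximally parallel step more concrete, the following is a minimal sketch for a single membrane region with multiset rewriting rules. The representation of a rule as a pair of `Counter` objects (consumed multiset, produced multiset) and the function names are assumptions chosen for illustration; membrane nesting, target indications, and communication are deliberately left out.

```python
import random
from collections import Counter

def applicable(rule, objects):
    """A rule (lhs, rhs) is applicable if the region still holds its whole left-hand side."""
    lhs, _ = rule
    return all(objects[obj] >= count for obj, count in lhs.items())

def maximally_parallel_step(objects, rules, rng=random):
    """Perform one step in one region.

    Rules are chosen non-deterministically and applied until no rule can be
    applied to the objects that are still unused (maximal parallelism).
    Products only become available in the next step.
    """
    remaining = Counter(objects)   # objects not yet consumed in this step
    produced = Counter()           # objects created in this step
    while True:
        # rules with an empty left-hand side are skipped to guarantee termination
        candidates = [r for r in rules if r[0] and applicable(r, remaining)]
        if not candidates:
            break
        lhs, rhs = rng.choice(candidates)   # non-deterministic choice
        remaining.subtract(lhs)
        produced.update(rhs)
    return remaining + produced    # Counter addition drops zero counts
```

For example, with the single rule a → bb and the multiset {a, a, a, c}, one step consumes all three copies of a and yields six copies of b, while c is left untouched.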

Tissue-Like P System
The tissue-like P system regards cells as the vertices of the graph in the system. The cells in the P system have different states, and the rules can be executed only when the required states are met. The basic membrane structure of the tissue-like P system is shown in Figure 2. Cell 0 is the input cell, which contains the initial object. The initial object uses rules and communication mechanisms to communicate in cell 1 to cell n. Cell n + 1 is the output cell, used to store the obtained results.
The formal definition of the traditional tissue-like P system consists of the following components:
1. O is the alphabet, which contains all objects in the system;
3. $i_{out} \in \{1, 2, \ldots, n\}$ indicates the output cells of the system;
4. $\sigma_1, \cdots, \sigma_n$ represent the n cells in the system, whose detailed definitions are as follows:
(1) $Q_i$ is the collection of all states;
(2) $s_{i,0} \in Q_i$ refers to the initial state;
(3) $w_{i,0} \in O^*$ indicates the initial multiset of objects; when $w_{i,0} = \lambda$, there is no object in cell i;
(4) $P_i$ stands for the rules of the entire system.

Noises Cutting and Natural Neighbors Spectral Clustering Based on Coupling P System
In this section, the spectral clustering method with noises cutting and the natural neighbors based on coupling P system is proposed. First, we explain the general framework of the coupling P system. Then, the different evolution rules and operations in the subsystems such as searching the natural neighbors, noises cutting, constructing affinity matrix, and clustering are introduced, respectively. Meanwhile, the communication rules between different membranes are elaborated. The flow chart of the proposed NCNNSC-CP algorithm is shown in Figure 3.

The General Framework of the Coupling P System
The proposed coupled P system (NCNNSC-CP) is the coupling of the cell-like P system and the tissue-like P system. According to the related concepts introduced in Section 2.3, the basic structure of the coupled P system is shown in Figure 4.

The formal definition of the coupled P system consists of the following components:
1. $O = \{x_i, d_{ij}, NaN(x), Nb(x), Sup, Noises, c, w_{ij}, D_{ii}, L, \sigma\}$. $x_i$ represents each data point. $d_{ij}$ represents the distance between two arbitrary points. The natural neighbors of x are denoted as $NaN(x)$. $Nb(x)$ refers to the number of reverse neighbors of x. Sup is the natural characteristic value. Noises stands for the noisy points in the dataset. c is the number of clusters. The similarity between data points $x_i$ and $x_j$ is represented by $w_{ij}$. $D_{ii}$ indicates the degree of data point $x_i$. L represents the Laplacian matrix. σ is the tuning parameter.
2. $\eta = \{x_1, x_2, \cdots, x_n, \sigma, c\} \in O$ represents the initial objects in the system.
3. µ stands for the structure of the membrane.
4. in is cell 0, the input membrane.
5. out is cell 5, the output membrane.
6. $\sigma_0, \cdots, \sigma_m$ refers to the cells in the system. The value of m is determined according to the number of clusters and the number of data points.
7. R is the collection of rules, including communication rules and evolution rules.
Evolution rules are used to modify objects in the cluster, and communication rules are used to transfer objects from one cell to another.

The Evolution Rules
The rule $R_0$ of the input cell transfers the raw dataset and the parameters to cell 1 for the subsequent clustering operations. At the same time, the original data are transmitted to cell 2, where noise cutting is performed on them after noise recognition. The specific $R_0$ rules can be described as
Processes 2021, 9, 439 8 of 22
The output cell 5 is mainly used to store the clustering results, so $R_5 = \emptyset$.

The Evolution Rules of Searching the Natural Neighbors in Cell 1
The construction of affinity matrix can directly affect the clustering results of spectral clustering. In traditional algorithms, most of the parameters are determined based on artificial experience and manually input. According to the related concepts of natural neighbors and membrane system in Section 2, this paper uses the rules of searching natural neighbors without parameters in the membrane system to determine the natural characteristic value Sup and natural neighbors NaN(x).
In summary, the detailed evolution rules of the Natural Neighbors Searching (NaN-searching) in cell 1 are shown in rules $R_1$.
• $R_{11}$ (Sorting rule): Create a KD-tree T from the dataset D, which calculates the Euclidean distance of all points in the dataset D, and then sorts them in ascending order.
• $R_{12}$ (Searching rule): For each point $x_i$ in D, we use the KD-tree T to find its rth neighbor $x_j$. Then, Nb(x j ) = Nb(
Apparently, the search of natural neighbors is different from the traditional k-nearest method. The k nearest neighbors of each point $x_i$ can be found without any parameters in the whole algorithm process in cell 1.
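The rule list above is incomplete in this text, so the following is only a sketch of how a parameter-free natural neighbor search could look. It rests on several assumptions that are not confirmed by the source: the counter update is taken to be an increment of `Nb(x_j)` by one, the search stops when no point is left without reverse neighbors or when the number of such points no longer changes between two rounds (in the spirit of Definition 4), and a brute-force distance sort replaces the KD-tree for brevity. All function and variable names are hypothetical.

```python
import math

def natural_neighbor_search(points):
    """Return (sup, nan, nb) for a list of coordinate tuples."""
    n = len(points)
    # analogue of the sorting rule: order all other points by Euclidean distance
    order = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        order.append(others)

    nb = [0] * n                      # Nb(x): number of reverse neighbors found so far
    knn = [set() for _ in range(n)]   # neighbors of x found so far
    rnn = [set() for _ in range(n)]   # points that regard x as a neighbor
    prev_zero = None
    r = 0
    while r < n - 1:
        r += 1
        for i in range(n):
            j = order[i][r - 1]       # r-th nearest neighbor of x_i
            nb[j] += 1                # assumed update: Nb(x_j) grows by one
            knn[i].add(j)
            rnn[j].add(i)
        zero = sum(1 for count in nb if count == 0)
        # assumed stopping rule: nobody is left out, or the count has stabilized
        if zero == 0 or zero == prev_zero:
            break
        prev_zero = zero

    # mutual neighbors: x_j is a natural neighbor of x_i if each regards the other as a neighbor
    nan = [knn[i] & rnn[i] for i in range(n)]
    return r, nan, nb
```

On the four points (0,0), (1,0), (2,0), (10,0), the search stops after two rounds: the isolated point (10,0) never receives a reverse neighbor and therefore ends up with an empty natural neighbor set, which is what the noise recognition step exploits.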

The Evolution Rules of Noises Recognition and Cutting in Cell 2
In cell 2, we execute the evolution rules of noises recognition and cutting. Noise points refer to data with errors or anomalies (deviations from expected values), which are neither core points nor boundary points in the data set. These noises cause great interference to data analysis and preprocessing, especially to clustering, which empirical evidence shows to be extremely sensitive to them. Therefore, it is particularly significant to identify and eliminate noise points. Jokinen et al. [6] deal with noise on the basis of hierarchical density-based spatial clustering and propose a density-based clusterability measure. We propose a noise recognition and cutting method based on the reverse density and critical reverse density in natural neighbors. The reverse density and critical reverse density are specifically defined as follows.
Definition 6: (Reverse density). Based on the natural neighbors $NaN(x_i)$ of the data object $x_i$, we define the reverse density as the average distance between $x_i$ and all its natural neighbors:
$$Rd(x_i) = \frac{1}{|NaN(x_i)|} \sum_{x_j \in NaN(x_i)} d(x_i, x_j) \quad (6)$$

Definition 7: (Critical reverse density). The critical reverse density of point $x_i$ is calculated from the mean and the standard deviation of the reverse densities of all objects in the data set D:
$$CRd(x_i) = \mathrm{mean}(Rd(x)) + \alpha \cdot \mathrm{std}(Rd(x)) \quad (\forall x \in D) \quad (7)$$
where α is a tuning coefficient, and experiments show that α = 1 is suitable for most data sets.

Definition 8: (Noises).
If the reverse density of $x_i$ is larger than the critical reverse density, $x_i$ is a noise point.
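Definitions 6 to 8 can be sketched in a few lines. The following is an illustration under stated assumptions rather than the authors' implementation: the natural neighbor sets are passed in as a list of index sets, the population standard deviation is used because the text does not say which variant is meant, and points without any natural neighbor are given an infinite reverse density so that they are always flagged (the text does not cover this case). All names are hypothetical.

```python
import math
import statistics

def reverse_density(points, nan):
    """Rd(x_i): average distance between x_i and its natural neighbors."""
    rd = []
    for i, neighbors in enumerate(nan):
        if neighbors:
            total = sum(math.dist(points[i], points[j]) for j in neighbors)
            rd.append(total / len(neighbors))
        else:
            rd.append(math.inf)   # assumption: no natural neighbors means maximally sparse
    return rd

def detect_noise(points, nan, alpha=1.0):
    """Return (rd, crd, noise_indices) following Definitions 6 to 8."""
    rd = reverse_density(points, nan)
    # assumption: mean and std are taken over the finite reverse densities only
    finite = [value for value in rd if math.isfinite(value)]
    crd = statistics.fmean(finite) + alpha * statistics.pstdev(finite)
    noise = [i for i, value in enumerate(rd) if value > crd]
    return rd, crd, noise
```

With four points on a unit square and a fifth point far away that is connected only to one corner, the far point has by far the largest average neighbor distance and is the only one exceeding the critical reverse density for α = 1.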
In accordance with the natural neighbors of point x obtained in cell 1 and the above concepts, we simultaneously conduct noise recognition for all points in the n sub-cells of cell 2. This step is executed in parallel to improve computational efficiency. When x is judged to be a noise point, it is transported to the environment outside cell 2 and discarded. The detailed evolution rules of Noises Recognition and Cutting (Noises-rc) in cell 2 are shown in rules $R_2$.
• $R_{21}$ (Noise recognition rule): For each point $x_i$ in D, Equation (6) is used to get $Rd(x)$ and then Equation (7) is used to get $CRd(x)$ in the sub-cells of cell 2.
In spectral clustering, the construction of the affinity matrix plays an important role in the clustering result. Generally, it is obtained by calculating the Gaussian kernel distance between the data point $x_i$ and its natural neighbors. However, due to the influence of noise points, the natural neighbors of the data points are mixed with noises, which is not conducive to the construction of the affinity matrix. Noise identification and screening based on reverse density and critical reverse density are effective against this. The core data set D', i.e., the data set without noise points, is obtained through Section 3.2.2. Therefore, we perform the natural neighbor search again to acquire the core natural neighbors of each data object. Although this increases the complexity of the algorithm to a certain extent, it is worthwhile given the greatly improved accuracy. The core natural neighbor is defined as follows.
Definition 9: (The Core Natural Neighbors). For each object x in dataset D, its core natural neighbors are k nearest neighbors without noises, denoted as CNaN (x).
At last, we perform spectral clustering (NJW [28]). On the basis of the affinity matrix W, we calculate the degree matrix D:
$$D_{ii} = \sum_{j=1}^{n} W_{ij} \quad (8)$$
As for the Laplacian matrix L, we utilize the symmetric normalized Laplacian matrix:
$$L = I - D^{-1/2} W D^{-1/2} \quad (9)$$
Next, we choose the eigenvectors $u_1, u_2, \cdots, u_c$ corresponding to the first c eigenvalues of L to comprise $U = \{u_1, u_2, \cdots, u_c\}$, $U \in \mathbb{R}^{n \times c}$, and standardize it by row to get Y:
$$Y_{ij} = U_{ij} \Big/ \Big(\sum_{l=1}^{c} U_{il}^2\Big)^{1/2} \quad (10)$$
The detailed evolution rules of constructing the affinity matrix, degree matrix, and Laplacian matrix in cell 3 are shown in rules $R_3$.
• $R_{31}$ (Constructing the affinity matrix rule): Based on the core natural neighbors $CNaN(x)$ and the input parameters, the affinity matrix W is calculated in cell 3. For the jth natural neighbor in $CNaN(x_i)$ of $x_i$, $W_{ij} = \exp(-d(i, j)^2 / (2\sigma^2))$. Moreover, if $W_{ij}$ is nonzero while $W_{ji}$ is equal to 0, the value of $W_{ij}$ is assigned to $W_{ji}$ by the principle of symmetry.
• $R_{32}$ (Constructing the degree matrix rule): According to the affinity matrix and Equation (8), the degree matrix is obtained in cell 3.
• $R_{33}$ (Constructing the Laplacian matrix rule): As for the Laplacian matrix L, we utilize the symmetric normalized Laplacian matrix using Equation (9) in cell 3.
• $R_{34}$ (Constructing novel cluster sample): Based on the above concepts and Equation (10), we construct Y in cell 3 for the next step of clustering. Each row of Y represents a sample. Simultaneously, it is transmitted into cell 4.
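Rules $R_{31}$ to $R_{33}$ can be sketched without any external library. The example below is an illustration under assumptions: core natural neighbors are passed in as a list of index sets, the negative exponent of the Gaussian kernel follows the definition in the spectral clustering section, and isolated points (degree zero) simply get a zero off-diagonal row. The eigendecomposition and row normalization of rule $R_{34}$ are omitted here. All function names are hypothetical.

```python
import math

def affinity_from_neighbors(points, cnan, sigma):
    """Affinity matrix W restricted to core natural neighbors."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i, neighbors in enumerate(cnan):
        for j in neighbors:
            d = math.dist(points[i], points[j])
            w = math.exp(-d ** 2 / (2.0 * sigma ** 2))
            W[i][j] = w
            W[j][i] = w   # symmetry principle: copy W_ij to W_ji
    return W

def degree_vector(W):
    """Diagonal of the degree matrix: D_ii is the sum of row i of W."""
    return [sum(row) for row in W]

def sym_normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2}."""
    d = degree_vector(W)
    n = len(W)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            norm = math.sqrt(d[i] * d[j])
            scaled = W[i][j] / norm if norm > 0 else 0.0   # assumption: isolated points
            L[i][j] = (1.0 if i == j else 0.0) - scaled
    return L
```

Restricting the affinity to core natural neighbors makes W sparse: entries between points that are not neighbors stay exactly zero, which is what separates this construction from the fully connected graph.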

The Evolution Rules of K-Means (Clustering Method)
In the last step of NCNNSC-CP, K-means is applied to cluster the new samples into clusters $C_1, C_2, \cdots, C_c$. In cell 4, there are c sub-cells running simultaneously.
The detailed evolution rules of K-means are shown in rules $R_4$.
• $R_{41}$ (Random selection of cluster center rule): Randomly select c points from the dataset as the initial cluster centers and store them in the c sub-cells.
• $R_{42}$ (Clustering rule): The distance between each sample point and each cluster center is calculated in the sub-cells and transmitted to cell 4. Then, the data points are clustered according to the principle of nearest distance in cell 4.
• $R_{43}$ (Redefine the cluster center rule): For each cluster formed by rule $R_{42}$, the mean of its points is calculated to update the cluster center. If a cluster center changes, the clustering process is repeated. Otherwise, the cluster result $C_1, C_2, \cdots, C_c$ is output to cell 5.
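A sequential sketch of rules $R_{41}$ to $R_{43}$ is given below. It ignores the parallel sub-cells and simply loops over the clusters. The seed parameter, the iteration cap, and the choice to keep the old center of an empty cluster are assumptions added to make the example deterministic and safe; they are not part of the rules above.

```python
import math
import random

def k_means(samples, c, seed=0, max_iter=100):
    """Cluster `samples` (a list of coordinate tuples) into c clusters."""
    rng = random.Random(seed)
    # R41: randomly select c distinct samples as the initial cluster centers
    centers = [list(samples[i]) for i in rng.sample(range(len(samples)), c)]
    labels = []
    for _ in range(max_iter):
        # R42: assign every sample to its nearest cluster center
        labels = [min(range(c), key=lambda j: math.dist(s, centers[j]))
                  for s in samples]
        # R43: recompute each center as the mean of its members
        new_centers = []
        for j in range(c):
            members = [s for s, label in zip(samples, labels) if label == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])   # assumption: empty cluster keeps its center
        if new_centers == centers:               # no center changed: output the result
            break
        centers = new_centers
    return labels, centers
```

In NCNNSC-CP the samples would be the rows of Y produced in cell 3, so `k_means(rows_of_Y, c)` would yield the final labels; `rows_of_Y` is a placeholder name.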

The Communication Rules between Different Cells
Communication between cells in the CP system can only be achieved when there is a synapse between different cells. In order to ensure the effectiveness of the system and improve the efficiency of the system, this paper constructs a CP system with directional communication rules. In the CP system, some membranes are responsible for initializing objects and outputting results, and some membranes are responsible for algorithm execution. Orderly communication between different membranes makes the whole algorithm more effective.
There are three communication rules in the CP system: one-way transmission and two-way transmission between cells, and one-way transmission between cells and the environment.
• Rule3 : (1, u/v, 2) It can transfer the string u of the natural neighbors and the natural characteristic value in cell 2 to cell 3, and transfer the dataset without noises v in cell 3 to cell 2.
(3) One-way transmission between cell and the environment is Rule4.
• Rule4 : (2, u/λ, e) The noise data u generated in cell 2 is transferred to environment e and discarded.
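A communication rule of the form (i, u/v, j) can be illustrated with multisets. The sketch below is a simplified model under assumptions: cells and the environment are entries of one dictionary, a rule fires only if both sides hold the objects to be exchanged, λ is represented by an empty multiset, and synapses are not checked. The object names are placeholders.

```python
from collections import Counter

def communicate(cells, i, u, v, j):
    """Apply rule (i, u/v, j): move multiset u from i to j and multiset v from j to i.

    Returns True if the rule was applied, False if the required objects are missing.
    """
    u, v = Counter(u), Counter(v)
    has_u = all(cells[i][obj] >= count for obj, count in u.items())
    has_v = all(cells[j][obj] >= count for obj, count in v.items())
    if not (has_u and has_v):
        return False
    cells[i] = cells[i] - u + v
    cells[j] = cells[j] - v + u
    return True
```

A two-way rule exchanges objects between two cells in a single step, while a rule such as (2, u/λ, e) with an empty v only moves u out of cell 2 into the environment, which models discarding the noise points.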

Computational Complexity
We assume that n is the total number of points in the dataset. The time complexity of NCNNSC-CP algorithm can be calculated as follows. (1)

Experimental Setting
All experiments were conducted in Matlab 2016a on a PC with an Intel Core i5-940M CPU, 4GB RAM, and the Windows 7 64-bit operating system. In this section, we conduct experiments on synthetic data sets and real data sets to evaluate the performance of the proposed NCNNSC-CP method. At the same time, we compare it with state-of-the-art clustering methods, including K-means [32], DBSCAN [33], DPC [34], Cut-PC [35], SC [29], U-SPEC (Ultra-Scalable Spectral Clustering) [15], U-SENC (Ultra-Scalable Ensemble Clustering) [15], and NNSC.

Evaluation Metrics
In order to measure the quality of the clustering results, external indicators are commonly used as clustering validity indexes. This paper uses four commonly used clustering indicators, namely accuracy (Acc), F1-measure, Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI), to evaluate clustering performance; these indicators have been advocated and discussed in [36].
(1) Acc: Accuracy indicates the ratio of the number of samples with correct clustering to the total number of samples, where P is the predicted label and T is the true label.
(2) ARI: Rand Index represents the degree of similarity between the predicted value and the actual value of the sample, and its range is [0, 1]. However, for random results, RI does not guarantee that the score is close to 0, so the Adjusted Rand Index with higher discrimination is proposed. The value range is [−1, 1]. The larger the value, the more consistent the clustering result is with the real situation.
where TP, FP, TN, and FN represent true positive, false positive, true negative, and false negative decisions, respectively.
(3) F1-measure: The F1-measure is the harmonic mean of precision and recall, and it is a commonly used comprehensive evaluation index in clustering. Precision indicates the proportion of samples classified as positive that are actually positive. Recall refers to the proportion of actually positive instances that are classified as positive.
(4) NMI: Normalized Mutual Information utilizes information theory to measure the differences between the clustering partitions, and its range is [0, 1].
where I(X;Y) is the mutual information between X and Y, and H(X), H(Y) are the entropy of random variables. Table 1 shows the basic information of the nine synthetic data sets. The original data of the nine synthetic data sets are shown in Figure 5.  Figure 5. The original data of the nine synthetic data sets. Figure 5. The original data of the nine synthetic data sets. Figure 6 shows that all clustering methods have good clustering performance on the D1 dataset. It can be seen from Figure 7 that in addition to the USENC and USPEC algorithms, other clustering methods work well on the D2 dataset. This may be because USENC and USPEC are more focused on clustering of very Ultra-scale data sets with limited resources. As shown in Figure 8, only the DPC, K-means, and USENC methods are not effective for the D3 dataset. In fact, DPC and K-means cannot detect non-spherical clusters.       Figure 6 shows that all clustering methods have good clustering performance on the D1 dataset. It can be seen from Figure 7 that in addition to the USENC and USPEC algorithms, other clustering methods work well on the D2 dataset. This may be because The clustering results in Figure 9 demonstrate that DBSCAN, Cut-PC, USENC, and NCNNSC algorithms can cluster D4 well. Although k-means, DPC, SC, and NNSC perform well on the spherical cluster dataset, they cannot handle circular clusters. USPEC algorithms performed the worst. It can be seen from Figure 10 that most of the algorithms can perform clustering excellently on the D5 dataset. However, DPC, K-means, and USPEC are not well recognized, which again proves that DPC and K-means cannot handle circular clusters well. USENC and USPEC are more focused on clustering of very Ultra-scale data sets with limited resources. As shown in Figure 8, only the DPC, K-means, and USENC methods are not effective for the D3 dataset. In fact, DPC and K-means cannot detect non-spherical clusters. Figure 9. The results of 9 clustering algorithms in D4.    
Figure 11. The results of 9 clustering algorithms in D6.
Figure 11 illustrates that DBSCAN, Cut-PC, and NCNNSC-CP handle spiral clusters well. What these three algorithms have in common is that they all process noise. Generally, SC can handle spiral clusters well, but part of the data in this dataset is too scattered, which degrades its final clustering performance. As shown in Figure 12, only DBSCAN, USENC, and NCNNSC-CP achieve good clustering results on the D7 data set. SC and Cut-PC achieve similar results, and the other algorithms perform less well. Figure 13 shows that Cut-PC, USENC, and NCNNSC-CP can identify the clusters in the D8 dataset, while the other algorithms cannot. For the last data set, D9, which is displayed in Figure 14, only Cut-PC and NCNNSC-CP cluster it correctly. The USENC clustering results are almost correct, but there are some deviations. The clustering results of SC and USPEC are similar, and most of the data are classified into four categories.
For the other algorithms, the clustering results are very poor. Therefore, the NCNNSC-CP algorithm can effectively detect irregular data sets.
In conclusion, Figures 6-14 show that the NCNNSC-CP algorithm performs better than the other algorithms and can properly recognize different types of clusters. It also achieves effective clustering results on data sets of different scales and can be applied to more complicated situations.

Experiments on Real Datasets
The UCI data sets are standard benchmark data sets that are often used in clustering experiments. In this paper, we conduct experiments on six real UCI data sets, whose details are shown in Table 2. Regarding parameter settings, we run 20 iterations of the experiment for each algorithm. The ACC, ARI, F, and NMI results of the nine algorithms on the six UCI data sets are given in Tables 3-6, respectively. The best results are shown in bold, and the second-best results are marked with an asterisk (*). To make the ACC, ARI, F-measure, and NMI of the nine algorithms on the six real UCI data sets easier to compare, they are also plotted as histograms in Figures 15-18.
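As an illustration of how a score such as NMI is obtained from a predicted partition and a reference labeling, the following is a minimal Python sketch. It assumes the geometric-mean normalization I(X;Y)/sqrt(H(X)H(Y)), which may differ from the variant used for the reported tables, and the function names are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information between two label sequences of equal length."""
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    # p(a,b) * log( p(a,b) / (p(a) p(b)) ), written with raw counts.
    return sum(
        (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


def nmi(x, y):
    """NMI with geometric-mean normalization (an assumed variant)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    hx, hy = entropy(x), entropy(y)
    if hx == 0.0 or hy == 0.0:
        # Convention chosen for this sketch: two trivial partitions agree fully,
        # a trivial partition against a non-trivial one shares no information.
        return 1.0 if hx == hy else 0.0
    return mutual_information(x, y) / math.sqrt(hx * hy)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    pred = [1, 1, 0, 0, 2, 2]  # same partition, different label names
    print(nmi(truth, pred))
```

Because the score depends only on how the points are grouped, relabeling the clusters leaves it unchanged, which is why no label matching step is needed here.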

Regarding accuracy, as shown in Figure 15, the NCNNSC-CP algorithm has the best results on five of the six UCI data sets; on the remaining Banknote data set, its result is the second best. In terms of ARI, Figure 16 demonstrates that although the proposed algorithm has only average results on the Thyroid and Ionosphere data sets, it achieves excellent results on the Iris, Seeds, and Banknote data sets, and it is second only to the USENC algorithm on the Breastcancer data set. As for F-measure, Figure 17 shows that the NCNNSC-CP algorithm performs best on all data sets. As for NMI, as displayed in Figure 18, apart from an average performance on the Ionosphere data set, the proposed algorithm achieves the best results on the Iris, Seeds, and Banknote data sets and the second-best results on the Breastcancer and Thyroid data sets.
Based on the above analysis, compared with other clustering algorithms, the NCNNSC-CP algorithm proposed in this paper has excellent performance in clustering.

Result Analysis
As discussed above, the experimental results on both the synthetic and the real data sets show that the proposed Noises Cutting and Natural Neighbors Spectral Clustering Based on Coupling P System (NCNNSC-CP) performs better than the comparison algorithms. It correctly identifies the shapes of different types of clusters. Moreover, because the data sets differ in size, the results indicate that the NCNNSC-CP method obtains effective clustering results on data sets of different scales and can be applied to more complex situations. In this paper, all procedures of the algorithm are carried out within the structure of the coupled P system. By taking advantage of the extreme parallelism of the coupled P system in membrane computing, the computing efficiency is improved in theory. In the noise recognition and cutting stage, all data points in the data set are evaluated in parallel rather than sequentially. Once noise points are identified, they can be transported to the environment in real time and discarded. When the processed data are clustered by the K-means algorithm, the c randomly selected cluster centers are operated on simultaneously in the c sub-cells of the coupled P system.
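The idea of handling the c cluster centers simultaneously can be sketched in Python as follows. This is only a conventional simulation under stated assumptions: a thread pool with one worker per center stands in for the c sub-cells, the function names are hypothetical, and nothing here models the membrane rules of the coupled P system. Python threads also do not provide true hardware parallelism for this kind of computation, so the sketch shows the structure of the work rather than any speed-up.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def distances_to_center(points, center):
    """Work of one hypothetical sub-cell: distance from every point to its center."""
    return [math.dist(p, center) for p in points]


def assign_parallel(points, centers):
    """Assign each point to its nearest center, one worker per center."""
    with ThreadPoolExecutor(max_workers=len(centers)) as pool:
        columns = list(pool.map(lambda c: distances_to_center(points, c), centers))
    return [
        min(range(len(centers)), key=lambda j: columns[j][i])
        for i in range(len(points))
    ]


def update_centers(points, labels, centers):
    """Move each center to the mean of its members; keep it if it has none."""
    updated = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
        else:
            updated.append(old)
    return updated


def kmeans(points, centers, max_iter=100):
    """Plain K-means loop that stops when the centers no longer move."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        labels = assign_parallel(points, centers)
        updated = update_centers(points, labels, centers)
        if updated == centers:
            break
        centers = updated
    return labels, centers


if __name__ == "__main__":
    import random

    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    initial = random.sample(data, 2)  # c = 2 randomly selected centers
    print(kmeans(data, initial))
```

The distance computation for one center does not depend on any other center, which is what makes it a natural unit of work to place in a separate sub-cell.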

Conclusions
In this paper, we propose a noise cutting and natural neighbors spectral clustering method based on the coupled P system (NCNNSC-CP). The concept of natural neighbors and the method of spectral clustering are integrated into the coupled P system. The entire algorithm runs within the framework of the coupled P system, which detects and cuts noise using the critical density and reverse density of natural neighbors. The core natural neighbors of each data object are then determined automatically from the noise-processed dataset, and the affinity matrix for spectral clustering is constructed from these core natural neighbors. The resulting samples are then clustered further to obtain the final clustering result. Experimental results indicate that the proposed algorithm performs better than the other algorithms on both the synthetic data sets and the real UCI data sets.
In future work, we will further extend the coupled P system so that it can solve more optimization and clustering problems. Moreover, we will further improve the performance of the algorithm, for example by determining the number of clusters automatically and by improving the final clustering step.