One-Step Clustering with Adaptively Local Kernels and a Neighborhood Kernel

Among multiple kernel clustering (MKC) methods, some adopt a neighborhood kernel as the optimal kernel, while others use local base kernels to generate an optimal kernel. However, these two strategies have not been combined to leverage their complementary advantages, which limits the quality of the optimal kernel. Furthermore, most existing MKC methods require a two-step strategy: first learn an indicator matrix, then execute clustering on it. This does not guarantee the optimality of the final results. To overcome these drawbacks, a one-step clustering with adaptively local kernels and a neighborhood kernel (OSC-ALK-ONK) is proposed in this paper, where the two strategies are combined to produce an optimal kernel. In particular, the neighborhood kernel improves the expression capability of the optimal kernel and enlarges its search range, while local base kernels avoid the redundancy of base kernels and promote their variety. Accordingly, the quality of the optimal kernel is enhanced. Further, a soft block diagonal (BD) regularizer is utilized to encourage the product of the indicator matrix and its transpose to be BD, which helps obtain explicit clustering results directly, achieves one-step clustering, and thus overcomes the disadvantage of the two-step strategy. In addition, extensive experiments on eight data sets and comparisons with six clustering methods show that OSC-ALK-ONK is effective.


Introduction
The data in real problems usually contain nonlinear structures. When clustering such data, it is necessary to use a clustering method that can capture the nonlinear structure. Multiple kernel clustering (MKC) has the advantage of not only processing nonlinear data but also fusing the information of multiple given kernels to yield an optimal kernel. Therefore, it has attracted extensive attention. Recently, many MKC methods for generating an optimal kernel have been proposed.
One strategy is to use a linear combination of the given kernels to form an optimal kernel. The kernel weights in [1,2] are learned with an ℓ1-regular term, while those in [3,4] are obtained from an ℓ2-regular term. More generally, an ℓp-regular term [5,6] is used to optimize the kernel weights and learn an optimal kernel, which makes the selection of the regular term more flexible. In addition, many studies adopt the linear-combination strategy to learn the optimal kernel [7][8][9][10][11]. In particular, a min-max model is utilized in a simple MKC method (SimpleMKKM) to learn the kernel coefficients and update the indicator matrix [12]. It is worth noting that this strategy is based on the assumption that the optimal kernel lies in the linear span of the given kernels. This assumption may not hold in practice, because it restricts the search scope of the optimal kernel and degrades its quality.
In order to expand the search scope of the optimal kernel, a neighborhood kernel is used in [13][14][15]. The optimal kernel in [13,14] is learned from a neighborhood of the consensus kernel, where a low-rank constraint [14] is applied to the neighborhood kernel to reveal the clustering structure between samples. In particular, base neighbor kernels with a block diagonal structure [15] are produced by defining the neighbor kernel of each base kernel, and an optimal kernel is then obtained by linearly combining these neighbor kernels. However, the neighborhood kernels in the above literature are generated from all the base kernels. The disadvantage is that this leads to redundancy among the base kernels, because the correlation between the given kernels is not taken into account.
Based on the consideration of the correlation between given kernels, methods that select local base kernels to generate an optimal kernel have emerged. This can avoid the redundancy of the given kernels and promote diversity. On the basis of SimpleMKKM [12], by considering the similarity of the k-nearest neighbors between samples, a local SimpleMKKM is proposed [16]. By selecting subsets from a predefined kernel pool to determine local kernels, an MKC method using representative kernels (MKKM-RK) to learn an optimal kernel is presented [17]. In [18], a matrix-induced regularization is applied in an MKC method (MKKM-MR) to measure the correlation between each pair of kernels, where kernels with strong correlation are assigned smaller coefficients and those with weak correlation are assigned larger ones. By constructing an index set of samples to select local base kernels, the optimal kernel is relaxed into a neighborhood of the combination of local base kernels [19].
In recent years, various kernel evaluation methods for model selection have emerged, such as kernel alignment [20], kernel polarization [21], and kernel class separability [22]. Among them, kernel alignment is one of the most commonly used on account of its simplicity, efficiency, and theoretical support. For example, centered kernel alignment is merged into an MKC method in [23]. In [24], a local kernel alignment strategy is proposed by requiring each sample to align only with its k-nearest neighbors. Further, the global and local structure alignment, i.e., the internal structure of the data, is preserved in [25].
The research mentioned above fully shows that MKC has been widely studied. However, most methods adopt either a neighborhood kernel or local base kernels and do not combine the two. Thus, they cannot simultaneously broaden the search area of the optimal kernel and promote the variety of the given kernels, and therefore cannot ensure the quality of the optimal kernel. In addition, most of the above methods require two steps: first obtain the indicator matrix, then perform clustering on it. This two-step strategy does not guarantee the reliability and optimality of the final results, because errors propagate and accumulate from each step.
In the ideal state, there is only one nonzero element in each row of the indicator matrix, and the column in which the nonzero element resides corresponds to the cluster to which the sample belongs. That is, the indicator matrix in the ideal state directly displays the clustering results. In this state, multiplying the indicator matrix by its transpose yields a block diagonal (BD) matrix [23]. However, in the actual clustering process, the indicator matrix is usually not ideal. As a result, clustering results can only be obtained after a further clustering step is performed on the indicator matrix; this is why most MKC methods adopt the two-step operation, whose shortcomings have been mentioned above. In this case, the product of the indicator matrix and its transpose is not BD. Nevertheless, we can think in reverse: if the product is encouraged to be BD, the indicator matrix is guided towards the ideal state, and clustering results are obtained directly.
Inspired by the above idea, we impose a BD constraint on the product of the indicator matrix and its transpose to guide it to be BD, which aims to obtain clustering results directly from the indicator matrix, i.e., one-step clustering. We thus propose a one-step clustering with adaptively local kernels and a neighborhood kernel (OSC-ALK-ONK). This method not only merges the advantages of local base kernels and the neighborhood kernel but also achieves one-step clustering. The process of generating a neighborhood kernel is illustrated in Figure 1.
Here are the main contributions of this paper.
• By considering the correlation between base kernels, a simple strategy for selecting local base kernels is used to produce a consensus kernel, which adjusts adaptively to avoid the redundancy of the given kernels and promote their variety.
• By selecting a neighborhood kernel of the consensus kernel as the optimal kernel, the expression capability of the optimal kernel is improved and its search scope is expanded.
• A soft BD regularizer is used to encourage the product of the indicator matrix and its transpose to be BD, so that the clustering results are obtained from the indicator matrix directly. Therefore, one-step clustering is realized, which helps ensure the quality of the final clustering results.
• A four-step iterative algorithm, including the Riemannian conjugate gradient method in [26], is used to overcome the difficulty of solving the model.
• Extensive experimental results on eight benchmark data sets and comparisons with six clustering methods indicate that OSC-ALK-ONK is effective.
The remaining sections of the paper are organized as follows. Section 2 presents the notations used and the background of MKKC. In Section 3, the proposed OSC-ALK-ONK method and its optimization are introduced in detail. Section 4 presents the experimental results and discussions. The conclusions are stated in Section 5.

Related Work

Notations
The details of notations used in this paper are listed in Table 1.
The kernel k-means clustering problem is

min_Z ∑_{i=1}^{n} ∑_{c=1}^{k} z_{ic} ‖φ(x_i) − μ_c‖²,  (1)

where Z ∈ {0, 1}^{n×k} is an assignment matrix, k is the number of clusters, and μ_c and n_c are the centroid and the number of samples of the c-th (1 ≤ c ≤ k) cluster, respectively. Denoting the design matrix as Φ = [φ(x_1), φ(x_2), . . ., φ(x_n)] ∈ R^{d×n} and the centroid matrix as U = [μ_1, μ_2, . . ., μ_k] ∈ R^{d×k}, problem (1) can be rewritten as

min_{Z,U} ‖Φ − UZ^T‖_F².  (4)

According to the matrix decomposition, problem (4) is equivalent to

min_Z Tr(K) − Tr(L^{1/2} Z^T K Z L^{1/2}),  (5)

where K = Φ^T Φ is the kernel matrix and L = (Z^T Z)^{−1}. The difficulty of solving (5) comes from the discreteness of Z. To overcome this difficulty, the discrete Z is usually relaxed to arbitrary real values, and the approximate values are treated as the solution of (5). Specifically, denoting H = ZL^{1/2}, the following relaxed form of (5) is derived:

min_H Tr(K(I_n − HH^T))  s.t.  H^T H = I_k,  (6)

where H ∈ R^{n×k} and I_k is the k-order identity matrix. The optimal H for (6) is made up of the k eigenvectors corresponding to the k largest eigenvalues of K.
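As a concrete illustration, the spectral solution of the relaxed problem (6) can be sketched in a few lines of NumPy; the function name and the toy RBF kernel below are ours, not from the paper.

```python
import numpy as np

def relaxed_kernel_kmeans(K, k):
    """Solve min_H Tr(K(I - H H^T)) s.t. H^T H = I_k: the optimal H
    stacks the k eigenvectors of K with the largest eigenvalues."""
    _, vecs = np.linalg.eigh(K)   # eigenvalues in ascending order
    return vecs[:, -k:]           # k leading eigenvectors

# Toy data: two well-separated blobs under an RBF kernel.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
H = relaxed_kernel_kmeans(K, 2)
assert np.allclose(H.T @ H, np.eye(2), atol=1e-8)
```

The columns of H are orthonormal by construction, matching the constraint H^T H = I_k.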

Multiple Kernel k-Means Clustering (MKKC)
In MKC, a consensus kernel is computed by

K_w = ∑_{p=1}^{m} w_p K_p,  (7)

where K_p is the p-th base kernel, w_p is the p-th component of the weight vector w = [w_1, w_2, · · · , w_m]^T, and m is the number of base kernels.
Replacing K in (6) with K_w, the model of MKKC is

min_{H, w} Tr(K_w(I_n − HH^T))  s.t.  H^T H = I_k,  w^T 1_m = 1,  w_p ≥ 0.  (8)

Problem (8) can be solved by updating H and w alternately: (i) update H with fixed w, i.e., solve a problem similar to (6); (ii) update w with fixed H, i.e., solve a quadratic programming problem.

Localized Kernel Selection
For a series of base kernels K_1, K_2, . . ., K_m, considering the relationship between base kernel pairs, we define G_{pq} = ‖K_p − K_q‖_F (p, q = 1, 2, . . ., m) and, for a given positive parameter δ,

y_{pq} = 0 if ‖K_p − K_q‖_F < δ, and y_{pq} = 1 otherwise.  (10)

On one hand, if ‖K_p − K_q‖_F < δ, i.e., K_q lies in the δ-neighborhood of K_p in the sense of G, the two kernels have large similarity. In this case, we set y_{pq} = 0, which aims to discard the base kernels with high similarity. On the other hand, if ‖K_p − K_q‖_F < δ does not hold, we set y_{pq} = 1, which means we select the base kernels with low similarity to yield an optimal kernel. In summary, (10) can effectively avoid the redundancy of base kernels while maintaining their variety.
Evidently, y_{pq} in (10) reflects the dissimilarity between K_p and K_q; then ∑_{q=1}^{m} y_{pq} represents the dissimilarity between K_p and all the K_q (q = 1, 2, . . ., m), and Tr(Y^T 1_M) = ∑_{p=1}^{m} ∑_{q=1}^{m} y_{pq} is the total over all kernel pairs, where 1_M is the m×m all-ones matrix. Let

s_p = ∑_{q=1}^{m} y_{pq}  (11)

and

w_p = s_p / Tr(Y^T 1_M).  (12)

Such a w_p balances the contributions of the different given kernels in generating an optimal kernel.
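Under the reading above, the selection rule and weights can be sketched as follows; `localized_weights` is a hypothetical helper, not the paper's implementation.

```python
import numpy as np

def localized_weights(kernels, delta):
    """y_pq = 0 when ||K_p - K_q||_F < delta (a redundant, highly similar
    pair) and 1 otherwise; w_p is the normalized row sum of Y."""
    m = len(kernels)
    Y = np.zeros((m, m))
    for p in range(m):
        for q in range(m):
            if np.linalg.norm(kernels[p] - kernels[q]) >= delta:
                Y[p, q] = 1.0
    s = Y.sum(axis=1)
    if s.sum() == 0:                    # all kernels nearly identical
        return np.full(m, 1.0 / m)
    return s / s.sum()

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
K1 = A @ A.T
kernels = [K1, K1 + 1e-6, 5 * K1]       # first two are near-duplicates
w = localized_weights(kernels, delta=1.0)
assert abs(w.sum() - 1.0) < 1e-12
```

The near-duplicate pair contributes nothing to each other's row sums, so the distinct third kernel receives the largest weight.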

Block Diagonal Regularizer
The clustering indicator matrix H in (6) and (8) is not a square matrix. In the ideal case, its elements are

H_{ij} = 1/√(n_j) if x_i ∈ C_j, and H_{ij} = 0 otherwise,  (13)

where x_i denotes the i-th sample, C_j denotes the j-th cluster, and n_j represents the number of samples in C_j. From (13), only one element in each row of H is nonzero, which means the corresponding sample belongs to one and only one cluster. Further, if the samples are arranged from C_1 to C_k by the cluster they belong to, then HH^T is a BD matrix:

HH^T = blkdiag((1/n_1) 1_{n_1} 1_{n_1}^T, . . ., (1/n_k) 1_{n_k} 1_{n_k}^T).  (14)
Equation (14) prompts the following idea: if HH^T itself has the property of (14), it will in turn induce H to have the elements in (13), which means explicit clustering results are obtained directly from H. Inspired by this idea, we hope that HH^T possesses the BD property.
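The ideal indicator of Equation (13) and the resulting block diagonal structure of HH^T can be verified directly; this is a small sketch, and `ideal_indicator` is our name.

```python
import numpy as np

def ideal_indicator(labels, k):
    """Ideal H of Equation (13): H[i, j] = 1/sqrt(n_j) if sample i
    belongs to cluster j, else 0."""
    n = len(labels)
    H = np.zeros((n, k))
    for j in range(k):
        idx = np.where(labels == j)[0]
        H[idx, j] = 1.0 / np.sqrt(len(idx))
    return H

labels = np.array([0, 0, 0, 1, 1])      # samples sorted by cluster
H = ideal_indicator(labels, 2)
M = H @ H.T
# The off-diagonal blocks vanish, so HH^T is block diagonal.
assert np.allclose(M[:3, 3:], 0) and np.allclose(M[3:, :3], 0)
```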
Since HH^T is a square matrix, we view it as an adjacency matrix. According to graph theory, its degree matrix D = Diag(HH^T · 1_n) is a diagonal matrix with D_{ii} = ∑_{j=1}^{n} (HH^T)_{ij}, and the corresponding Laplacian matrix is L_{HH^T} = D − HH^T. There is an important conclusion relating a matrix to its Laplacian matrix.

Theorem 1 ([27]). For any A ∈ R^{n×n} with A ⪰ 0, the number of connected components (blocks) in A equals the multiplicity k of the eigenvalue 0 of the corresponding Laplacian matrix L_A.

Then, A has k connected components if and only if

∑_{i=n−k+1}^{n} λ_i(L_A) = 0,  (16)

where λ_i(L_A) (i = 1, . . ., n) are the eigenvalues of L_A in decreasing order. Hence, the k-BD representation of HH^T can be given as follows.
Definition 1 ([27]). For HH^T ∈ R^{n×n}, the k-BD representation is defined as the sum of the k smallest eigenvalues of L_{HH^T}, i.e.,

‖HH^T‖_k = ∑_{i=n−k+1}^{n} λ_i(L_{HH^T}).  (17)

From Theorem 1, (16) and (17), ‖HH^T‖_k = 0 means that HH^T is k-BD. Then, minimizing ‖HH^T‖_k encourages HH^T to be BD, so it is natural to view ‖HH^T‖_k as a BD regularizer. Its advantages, such as controlling the number of blocks, being softer than the hard BD constraint in [28], and being better than the alternatives Rank(L_{HH^T}) or the convex relaxation ‖L_{HH^T}‖_*, are stated in detail in [27].
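Definition 1 is easy to check numerically; in the sketch below (`bd_norm` is our helper name), the k-BD representation of HH^T built from an ideal indicator with k blocks vanishes up to floating-point error.

```python
import numpy as np

def bd_norm(S, k):
    """k-BD representation: the sum of the k smallest eigenvalues of
    the Laplacian L_S = Diag(S 1_n) - S."""
    L = np.diag(S.sum(axis=1)) - S
    vals = np.linalg.eigvalsh(L)        # ascending order
    return vals[:k].sum()

# Ideal two-cluster indicator: HH^T has two connected components.
H = np.zeros((5, 2))
H[:3, 0] = 1 / np.sqrt(3)
H[3:, 1] = 1 / np.sqrt(2)
val = bd_norm(H @ H.T, 2)
assert abs(val) < 1e-10
```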

Objective Function
Combining localized kernel selection, the block diagonal regularizer, and the choice of a neighborhood kernel as the optimal kernel, we formulate the final model as follows:

min_{H, G} Tr(G(I_n − HH^T)) + (α/2)‖G − K_w‖_F² + β‖HH^T‖_k
s.t.  H^T H = I_k,  G ⪰ 0,  K_w = ∑_{p=1}^{m} w_p K_p,  (18)

where G is the neighborhood (optimal) kernel and w_p is computed according to (10) and (11).
In the objective function, the loss term is used to execute multiple kernel clustering, the consensus term is used to choose a neighbor kernel, and the block diagonal term is used to encourage HH^T to be block diagonal, the aim of which is to obtain an expected H as in Equation (13) and to implement one-step clustering.

Optimization
The regularizer ‖HH^T‖_k in problem (18) is non-convex, which makes the problem difficult to solve. For this, a theorem is introduced to reformulate ‖HH^T‖_k.
Theorem 2 ([29], p. 515). Let L ∈ R^{n×n} and L ⪰ 0. Then

∑_{i=n−k+1}^{n} λ_i(L) = min_W Tr(LW)  s.t.  0 ⪯ W ⪯ I, Tr(W) = k.  (19)

By (17) and (19), problem (18) becomes

min_{W, G, K_w, H} Tr(G(I_n − HH^T)) + (α/2)‖G − K_w‖_F² + β Tr((Diag(HH^T · 1_n) − HH^T)W)
s.t.  H^T H = I_k,  G ⪰ 0,  0 ⪯ W ⪯ I,  Tr(W) = k,  K_w = ∑_{p=1}^{m} w_p K_p.  (20)

Although problem (20) is not jointly convex in W, G, K_w and H, it is convex in each variable with the remaining variables fixed. Thus, we optimize each variable alternately to solve (20).
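Theorem 2 can be verified numerically (variable names below are ours): the minimizer W = UU^T, built from the eigenvectors of the k smallest eigenvalues, attains exactly the sum of those eigenvalues.

```python
import numpy as np

# Ky Fan duality: the sum of the k smallest eigenvalues of a PSD matrix L
# equals min Tr(LW) over 0 <= W <= I, Tr(W) = k, attained at W = U U^T.
rng = np.random.default_rng(2)
A = rng.normal(size=(6, 6))
L = A @ A.T                             # PSD stand-in for L_{HH^T}
k = 3
vals, vecs = np.linalg.eigh(L)          # ascending eigenvalues
U = vecs[:, :k]
W = U @ U.T
assert np.isclose(np.trace(L @ W), vals[:k].sum())
```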
Problem (22) can be expressed as

min_{G ⪰ 0} ‖G − B‖_F²,  (23)

where B = K_w − (1/α)(I_n − HH^T). The optimal solution of problem (23) is G = U_B Σ_B^+ U_B^T, where B = U_B Σ_B U_B^T is the eigendecomposition of B and Σ_B^+ is a diagonal matrix whose diagonal elements are the positive elements of Σ_B and zeros elsewhere [13].
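A minimal sketch of this G-update, assuming B = K_w − (1/α)(I_n − HH^T) as obtained by completing the square in the G-subproblem (`update_G` is our name):

```python
import numpy as np

def update_G(Kw, H, alpha):
    """Project B = Kw - (1/alpha)(I - H H^T) onto the PSD cone by
    keeping only the positive eigenvalues of B."""
    n = Kw.shape[0]
    B = Kw - (np.eye(n) - H @ H.T) / alpha
    vals, vecs = np.linalg.eigh(B)
    vals = np.clip(vals, 0.0, None)     # zero out negative eigenvalues
    return (vecs * vals) @ vecs.T

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
Kw = A @ A.T
H, _ = np.linalg.qr(rng.normal(size=(5, 2)))
G = update_G(Kw, H, alpha=1.0)
assert np.linalg.eigvalsh(G).min() >= -1e-10   # G is PSD
```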
By introducing a penalty parameter γ, problem (24) can be turned into

min_{K_w} (α/2)‖G − K_w‖_F² + (γ/2)‖K_w − ∑_{p=1}^{m} w_p K_p‖_F².  (25)

The closed-form solution of (25) is obtained by setting its derivative with respect to K_w to zero:

K_w = (αG + γ ∑_{p=1}^{m} w_p K_p)/(α + γ),  (26)

where w_p is updated according to the newly generated Y that is learned from the new G.
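Reading (25) as a quadratic penalty (our assumption, not stated explicitly in the extracted text), the update is a weighted average, which the hypothetical `update_Kw` below sketches:

```python
import numpy as np

def update_Kw(G, kernels, w, alpha, gamma):
    """Hypothetical K_w update for
    min_Kw (alpha/2)||G - Kw||_F^2 + (gamma/2)||Kw - sum_p w_p K_p||_F^2,
    whose stationary point is a weighted average of G and the combination."""
    Kc = sum(wp * Kp for wp, Kp in zip(w, kernels))
    return (alpha * G + gamma * Kc) / (alpha + gamma)

# With alpha = gamma, the update is the midpoint of G and the combination.
G = np.eye(3)
kernels = [np.eye(3), 2 * np.eye(3)]
Kw = update_Kw(G, kernels, w=[0.5, 0.5], alpha=1.0, gamma=1.0)
assert np.allclose(Kw, 1.25 * np.eye(3))
```

As γ grows, the update collapses onto the linear combination ∑_p w_p K_p, recovering the hard constraint in (24).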

Update H While Fixing G, K_w and W
Here, problem (20) reduces to

min_H Tr(G(I_n − HH^T)) + β Tr((Diag(HH^T · 1_n) − HH^T)W)  s.t.  H^T H = I_k.  (27)

The term β Tr((Diag(HH^T · 1_n) − HH^T)W) makes it difficult to solve (27) directly. By means of matrix operations and the properties of the trace, Tr(Diag(HH^T · 1_n)W) = Tr(H^T 1_M Diag(W) H), where 1_M is the n×n all-ones matrix, so (27) can be rewritten as

min_H Tr(H^T (β(1_M Diag(W) − W) − G) H)  s.t.  H^T H = I_k.  (28)

Because 1_M Diag(W) in (28) is not symmetric, the solution of kernel k-means clustering is not applicable to (28). Notably, (28) is similar to a problem on the Stiefel manifold in [26]; thus, the Riemannian conjugate gradient method in [26] can be used to solve it.
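For intuition, a single Riemannian step on the Stiefel manifold can be sketched as follows. This is a plain gradient step with a QR retraction, standing in for the Riemannian conjugate gradient method of [26]; function and variable names are ours.

```python
import numpy as np

def stiefel_gradient_step(H, M, step=0.01):
    """One Riemannian gradient step for min_H Tr(H^T M H), H^T H = I_k."""
    egrad = (M + M.T) @ H                     # Euclidean gradient
    sym = (H.T @ egrad + egrad.T @ H) / 2
    rgrad = egrad - H @ sym                   # project onto the tangent space
    Q, R = np.linalg.qr(H - step * rgrad)     # QR retraction back onto manifold
    return Q * np.sign(np.diag(R))            # fix column signs

rng = np.random.default_rng(4)
M = rng.normal(size=(6, 6))                   # generally non-symmetric, as in (28)
H, _ = np.linalg.qr(rng.normal(size=(6, 2)))
H1 = stiefel_gradient_step(H, M)
assert np.allclose(H1.T @ H1, np.eye(2), atol=1e-8)   # stays on the manifold
```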
These are the main steps of our proposed algorithm.

Experiments

Data Sets
Eight real data sets are used in our experiments, and their sizes and classes are summarized in Table 2.

Comparison Methods
To demonstrate the clustering performance, we compare OSC-ALK-ONK with six clustering methods. Among them, KKM is a single kernel clustering method, MKKM and RMKKM are two classic MKC methods, and MKKM-MR, SimpleMKKM, and MKKM-RK are three recently proposed MKC methods.
• KKM integrates integral operator kernel functions in principal component analysis to deal with nonlinear data [30].
• MKKM combines fuzzy k-means clustering with multiple kernel learning, where the weights of base kernels are automatically updated to produce the optimal kernel [31].
• RMKKM is an extension of MKKM whose robustness is ensured by an ℓ2,1-norm in kernel space [7].
• MKKM-MR uses a matrix-induced regularization to measure the correlation between all kernel pairs and implements MKC [18].
• SimpleMKKM adopts a min-max model that minimizes kernel alignment over the kernel coefficients and maximizes it over the clustering matrix, and is a simple MKC method [12].
• MKKM-RK is an MKC method that selects representative kernels from the base kernel pool to generate the optimal kernel [17].

Multiple Kernels' Construction
In this paper, we construct a kernel pool of twelve base kernels (i.e., m = 12): seven radial basis function kernels ker(x_i, x_j) = exp(−‖x_i − x_j‖²/(2τσ²)), where τ is selected from {0.01, 0.05, 0.1, 1, 10, 50, 100} and σ is the maximum distance between samples; four polynomial kernels ker(x_i, x_j) = (a + x_i^T x_j)^b, where a and b are chosen from {0, 1} and {2, 4}, respectively; and a cosine kernel ker(x_i, x_j) = (x_i^T x_j)/(‖x_i‖ · ‖x_j‖). All the kernels {K_p}_{p=1}^m are normalized to the range [0, 1].
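The kernel pool can be sketched as follows; min-max scaling is one reading of the normalization to [0, 1], and `build_kernel_pool` is our name.

```python
import numpy as np

def build_kernel_pool(X):
    """Seven RBF kernels (tau in {0.01, 0.05, 0.1, 1, 10, 50, 100}, sigma the
    maximum pairwise distance), four polynomial kernels ((a + x_i^T x_j)^b,
    a in {0, 1}, b in {2, 4}), and one cosine kernel."""
    sq = ((X[:, None] - X[None]) ** 2).sum(-1)
    sigma = np.sqrt(sq.max())
    pool = [np.exp(-sq / (2 * tau * sigma ** 2))
            for tau in (0.01, 0.05, 0.1, 1, 10, 50, 100)]
    Gram = X @ X.T
    pool += [(a + Gram) ** b for a in (0, 1) for b in (2, 4)]
    norms = np.linalg.norm(X, axis=1)
    pool.append(Gram / np.outer(norms, norms))
    # Normalize each kernel to [0, 1] (min-max scaling, our assumption).
    return [(K - K.min()) / (K.max() - K.min()) for K in pool]

X = np.random.default_rng(5).normal(size=(8, 3))
pool = build_kernel_pool(X)
assert len(pool) == 12
```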

Experimental Results and Analysis
To obtain better and more stable clustering performance, we utilize ten-fold cross-validation with five-fold cross-validation embedded in OSC-ALK-ONK. At first, we randomly partition all the samples into ten disjoint subsets, where nine subsets are used for training and the remaining one for testing. Further, the nine training subsets are partitioned into five disjoint subsets, where four are used for training and the remaining one is the validation set. Without loss of generality, the values of the two parameters α and β are varied over [10^{−2}, 10^{−1}, · · · , 10^{1}, 10^{2}]. The five-fold cross-validation selects the optimal combination of α and β, and the obtained optimal combination is used on the test set to produce the final clustering results. The number of clusters k in each data set is set to the true number of classes.
For each method used for comparison, we set the parameters according to the corresponding literature.
The final experimental results of each method on each data set, namely the average ACC, NMI, and Purity of 15 experiments, are reported in Table 3. The best ACC, NMI, and Purity on each data set are highlighted in boldface. The last three rows in Table 3 are the mean ACC, NMI, and Purity of each method on all the data sets. Evidently, the proposed OSC-ALK-ONK performs best. The detailed analyses are as follows.
(2) OSC-ALK-ONK exceeds MKKM and RMKKM by 58.62%, 67.01%, 60.53% and 31.29%, 34.36%, 32.89% in terms of ACC, NMI, and Purity, respectively. The reason should be that the combination of the local kernel method and the neighborhood kernel method avoids the redundancy of the base kernels and expands the search range of the optimal kernel. The clustering results of OSC-ALK-ONK are also better than those of SimpleMKKM, which should likewise be credited to the combination of the two methods. (3) Although MKKM-MR and MKKM-RK exceed KKM, MKKM, and RMKKM, they are inferior to OSC-ALK-ONK. The reason should be that the localized kernel strategy in OSC-ALK-ONK ensures the sparsity of base kernels and successfully avoids their redundancy. In a word, OSC-ALK-ONK improves the quality of the optimal kernel and promotes the clustering performance by combining local kernels and a neighborhood kernel. In addition, the BD representation ensures the reliability of clustering results and further promotes the clustering performance of OSC-ALK-ONK.
Overall, the experimental results show that OSC-ALK-ONK is an effective clustering method. To further substantiate its effectiveness, we present the visualization of clustering results for all methods on ISOLET (for convenience, only a fifth of the samples in ISOLET are chosen). As can be seen from Figure 2, OSC-ALK-ONK achieves a good clustering effect.

Ablation Study
In OSC-ALK-ONK, the weights of base kernels are adjusted adaptively, which aims to choose base kernels with small correlation and discard those with large correlation. These weights are automatically updated during the optimization of the model. To verify the effectiveness of the localized kernel selection strategy, we adopt a uniform weight strategy as a contrast, i.e., w_p = 1/m, p = 1, 2, . . ., m, to perform an ablation study. For convenience, this model is denoted as OSC-ONK-UW; that is, all the base kernels are selected in OSC-ONK-UW. In addition, the BD regularization term is used in our OSC-ALK-ONK. To validate its effect, we also conduct an ablation study on the model without this term, i.e., we only consider the following model (ALK-ONK-NoBD):

min_{H, G} Tr(G(I_n − HH^T)) + (α/2)‖G − K_w‖_F²  s.t.  H^T H = I_k,  G ⪰ 0,  K_w = ∑_{p=1}^{m} w_p K_p,

where w_p is computed according to (10) and (11).
The results of the ablation studies on the eight data sets, namely OSC-ONK-UW, ALK-ONK-NoBD, and OSC-ALK-ONK, are shown in Figure 3, which indicates that OSC-ALK-ONK outperforms OSC-ONK-UW and ALK-ONK-NoBD. Accordingly, OSC-ALK-ONK improves the clustering performance through the strategy of localized kernel selection and the BD regularizer.

Parameters' Sensitivity
The model of OSC-ALK-ONK involves the parameters α, β, and a penalty parameter γ. We set γ to 0.1 in the experiments. To verify the sensitivity of OSC-ALK-ONK to α and β, they are tuned in the range [10^{−2}, 10^{−1}, · · · , 10^{1}, 10^{2}] via a grid search. Figure 4 shows the clustering performance of OSC-ALK-ONK for varying α and β, which indicates that the suitable parameter combination is data-dependent.

Convergence
In this section, we first prove the convergence of the objective function of (20). For convenience, we express the objective function of problem (20) as J(W, G, K_w, H). When updating W with fixed G, K_w, H, problem (21) is a convex programming problem [27], so it converges to its global optimal solution, denoted W^{t+1}; then

J(W^{t+1}, G^t, K_w^t, H^t) ≤ J(W^t, G^t, K_w^t, H^t).  (31)

When updating G with fixed W, K_w, H, problem (22) is convex and its global optimal solution G^{t+1} can be obtained; then

J(W^{t+1}, G^{t+1}, K_w^t, H^t) ≤ J(W^{t+1}, G^t, K_w^t, H^t).  (32)

When updating K_w with fixed W, G, H, problem (25) is convex, and its global optimal solution K_w^{t+1} can be obtained; then

J(W^{t+1}, G^{t+1}, K_w^{t+1}, H^t) ≤ J(W^{t+1}, G^{t+1}, K_w^t, H^t).  (33)

When updating H with fixed W, G, K_w, since 1_M Diag(W) is not symmetric, it is difficult to prove that problem (28) for H is convex. Nevertheless, the global convergence of the conjugate gradient method after finitely many iterations has been proved in [26], i.e., the conjugate gradient method ensures that problem (28) converges when updating H. Denoting the obtained solution by H^{t+1}, then

J(W^{t+1}, G^{t+1}, K_w^{t+1}, H^{t+1}) ≤ J(W^{t+1}, G^{t+1}, K_w^{t+1}, H^t).  (34)

Combining (31)–(34), it is concluded that

J(W^{t+1}, G^{t+1}, K_w^{t+1}, H^{t+1}) ≤ J(W^t, G^t, K_w^t, H^t).  (35)

Therefore, J(W^t, G^t, K_w^t, H^t) monotonically decreases at each iteration and, being bounded below, converges.
The above proof shows that Algorithm 1 monotonically reduces the value of the objective function at each iteration. The convergence graphs of OSC-ALK-ONK on all the data sets are shown in Figure 5, where the stopping criterion of the algorithm is |obj(t+1) − obj(t)|/|obj(t)| ≤ 10^{−3}, and obj(t) denotes the objective function value at the t-th iteration. Evidently, the curves of the objective function value with respect to the iteration number in Figure 5 show a monotone descent. Further, the algorithm converges within 10 iterations on all the data sets, which demonstrates the rapid convergence of OSC-ALK-ONK.
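The stopping rule can be wrapped in a generic driver loop; this is a sketch in which `update_step` is a hypothetical callable standing in for the four alternating updates of Algorithm 1.

```python
def run_until_converged(update_step, obj, state, tol=1e-3, max_iter=100):
    """Iterate until |obj(t+1) - obj(t)| / |obj(t)| <= tol."""
    history = [obj(state)]
    for _ in range(max_iter):
        state = update_step(state)
        history.append(obj(state))
        if abs(history[-1] - history[-2]) / abs(history[-2]) <= tol:
            break
    return state, history

# Toy check: a geometrically decaying objective triggers the criterion fast.
state, hist = run_until_converged(lambda s: s / 2, lambda s: s + 1.0, 8.0)
assert hist[-1] <= hist[0]
```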

Conclusions
In this paper, we proposed a novel MKC method called OSC-ALK-ONK. It adaptively selects local base kernels to generate a consensus kernel and uses a neighborhood kernel of this consensus kernel as the optimal one. The combination of these two strategies promotes the quality of the optimal kernel by enlarging its search area while avoiding the redundancy of base kernels. Furthermore, a BD regularizer is imposed on the product of the indicator matrix and its transpose to execute one-step clustering and avoid the two-step operation. In addition, extensive experimental results indicate the effectiveness of OSC-ALK-ONK.
In real applications, much data are multi-view data, which may be incomplete for objective reasons. Given the effectiveness of the local kernel selection strategy in this paper, combining it with the neighborhood kernel to obtain a high-quality optimal kernel for multi-view data is worth considering in the future. In addition, on account of the advantages of the BD regular term, applying it to multi-view data, even incomplete multi-view data, is also worth studying.

Figure 1. The process of generating a neighborhood kernel.

Update W While Fixing G, K_w and H

While G, K_w and H are fixed, problem (20) reduces to

min_W Tr((Diag(HH^T · 1_n) − HH^T)^T W)  s.t.  0 ⪯ W ⪯ I, Tr(W) = k.  (21)

For (21), W^{t+1} = UU^T, where U ∈ R^{n×k} is composed of the k eigenvectors associated with the k smallest eigenvalues of Diag(HH^T · 1_n) − HH^T [27].

Update G While Fixing W, K_w and H

While W, K_w and H are fixed, problem (20) takes the form

min_{G ⪰ 0} Tr(G(I_n − HH^T)) + (α/2)‖G − K_w‖_F².  (22)

Update K_w While Fixing W, G and H

With W, G and H fixed, problem (20) reduces to

min_{K_w} (α/2)‖G − K_w‖_F²  s.t.  K_w = ∑_{p=1}^{m} w_p K_p.  (24)

Figure 2. The visualization of clustering results for OSC-ALK-ONK and comparison methods on ISOLET.

Table 1. Details of notations.

Table 2. Summaries of data sets.

Table 3. Clustering results of different methods.